Monday, March 18, 2013

Increase Google Page Rank


One of the reasons why Google™ is such an effective search engine is the PageRank™ algorithm developed by Google’s founders, Larry Page and Sergey Brin, when they were graduate students at Stanford University. PageRank is determined entirely by the link structure of the World Wide Web. It is recomputed about once a month and does not involve the actual content of any Web pages or individual queries. Then, for any particular query, Google finds the pages on the Web that match that query and lists those pages in the order of their PageRank.

Imagine surfing the Web, going from page to page by randomly choosing an outgoing link from one page to get to the next. This can lead to dead ends at pages with no outgoing links, or cycles around cliques of interconnected pages. So, a certain fraction of the time, simply choose a random page from the Web. This theoretical random walk is known as a Markov chain or Markov process. The limiting probability that an infinitely dedicated random surfer visits any particular page is its PageRank. A page has high rank if other pages with high rank link to it.

Let W be the set of Web pages that can be reached by following a chain of hyperlinks starting at some root page, and let n be the number of pages in W. For Google, the set W actually varies with time, but by June 2004, n was over 4 billion. Let G be the n-by-n connectivity matrix of a portion of the Web, that is, g_ij = 1 if there is a hyperlink to page i from page j and g_ij = 0 otherwise. The matrix G can be huge, but it is very sparse. Its jth column shows the links on the jth page.
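The random-surfer description above can be turned into a short computation. What follows is a minimal sketch in Python, not Google's actual implementation. The function name pagerank, the link-following probability p = 0.85, and the stopping tolerance are all assumptions made for illustration; the text only says that "a certain fraction of the time" the surfer jumps to a random page. G is stored sparsely as a list of (i, j) pairs, one per nonzero g_ij.

```python
def pagerank(links, n, p=0.85, tol=1e-10, max_iter=1000):
    """Approximate PageRank by repeatedly applying the random-surfer step.

    links: iterable of (i, j) pairs meaning page j links to page i (g_ij = 1).
    n:     number of pages, numbered 0 .. n-1.
    p:     probability of following a link (assumed value, not from the text).
    """
    # out[j] lists the pages that page j links to, i.e. column j of G.
    out = [[] for _ in range(n)]
    for i, j in set(links):
        out[j].append(i)

    x = [1.0 / n] * n  # start with the surfer equally likely to be anywhere
    for _ in range(max_iter):
        # Probability currently sitting on dead ends (pages with no out-links).
        dangling = sum(x[j] for j in range(n) if not out[j])
        # Random jumps, plus everything on a dead end, spread over all pages.
        base = (1.0 - p) / n + p * dangling / n
        new = [base] * n
        # Each page with out-links passes its rank evenly along those links.
        for j in range(n):
            if out[j]:
                share = p * x[j] / len(out[j])
                for i in out[j]:
                    new[i] += share
        delta = sum(abs(a - b) for a, b in zip(new, x))
        x = new
        if delta < tol:
            break
    return x


# A made-up four-page web: pages 1 and 2 link to page 0, page 0 links back
# to both, and page 2 also links to page 3, which is a dead end.
links = [(0, 1), (0, 2), (1, 0), (2, 0), (3, 2)]
ranks = pagerank(links, 4)
```

The ranks always sum to 1, because each step only redistributes the surfer's probability. Treating a dead end as a page that links to every page is one common convention; other treatments are possible.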
The number of nonzeros in G is the total number of hyperlinks in W.

The function surfer starts at a specified URL and tries to surf the Web until it has visited n pages. If successful, it returns an n-by-1 cell array of URLs and an n-by-n sparse connectivity matrix. The function uses urlread, which was introduced in Matlab 6.5, along with underlying Java utilities to access the Web. Surfing the Web automatically is a dangerous undertaking and this function must be used with care. Some URLs contain typographical errors and illegal characters. There is a list of URLs to avoid that includes .gif files and Web sites known to cause difficulties. Most importantly, surfer can get completely bogged down trying to read a page from a site that appears to be responding, but that never delivers the complete page.
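To make the crawling procedure concrete, here is a rough Python sketch of the same idea. It is not the surfer function itself. The names crawl, fetch_page and SKIP_SUFFIXES, the regular expression used to find links, the five-second timeout, and the one-megabyte read cap are all hypothetical choices made for this example. Only the overall behavior follows the description above: start at a root URL, stop adding pages after n, avoid a list of troublesome URLs, and record which page links to which.

```python
import re
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

# Assumed skip list; the text mentions only .gif files explicitly.
SKIP_SUFFIXES = (".gif", ".jpg", ".pdf")
# Crude link extractor, good enough for a sketch (not a full HTML parser).
HREF = re.compile(r'href\s*=\s*["\']([^"\'#\s]+)', re.IGNORECASE)


def fetch_page(url, timeout=5.0):
    """Download a page. The timeout bounds each blocking socket operation,
    and the read cap bounds the page size; a server that trickles bytes
    slowly can still take a long time."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read(1_000_000).decode("utf-8", errors="replace")


def crawl(root, n, fetch=fetch_page):
    """Visit pages breadth-first from root, keeping at most n of them.

    Returns (urls, links): urls[k] is page k, and each (i, j) in links
    means page j links to page i, matching g_ij = 1.
    """
    urls = [root]
    index = {root: 0}
    links = set()
    j = 0
    while j < len(urls):
        try:
            html = fetch(urls[j])
        except Exception:
            html = ""  # unreachable or unreadable page: treat as a dead end
        for href in HREF.findall(html):
            try:
                target = urldefrag(urljoin(urls[j], href)).url
            except ValueError:
                continue  # malformed URL
            if not target.startswith(("http://", "https://")):
                continue
            if target.lower().endswith(SKIP_SUFFIXES):
                continue
            if target not in index:
                if len(urls) >= n:
                    continue  # page limit reached: ignore new pages
                index[target] = len(urls)
                urls.append(target)
            links.add((index[target], j))
        j += 1
    return urls, links
```

Passing the download function in as the fetch argument means the crawl logic can be tried on a small in-memory set of pages before it is pointed at the real Web.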
When this happens, it may be necessary to have the computer’s operating system ruthlessly terminate Matlab. With these precautions in mind, you can use surfer to generate your own PageRank examples.

The URL where the search began, www.4sharedsoft.edu, dominates. Like most universities, Harvard is organized into various colleges, including Harvard Medical School, the Harvard Business School, and the Radcliffe Institute. You can see that the home pages of these schools have high PageRank. With a different sample, such as the one generated by Google itself, the ranks would be different.
