Sci7’s website optimisation services are informed by a scientific approach, which includes an awareness of both current and historical trends.
Google’s emergence onto the search engine scene was an important occurrence in the history of search on the web. The founders of Google, Sergey Brin and Lawrence Page, wrote a paper describing the first incarnation of their search engine when it was still a prototype located at Stanford University. Here, this paper is discussed by Sci7 Ltd.
Session Initiation
In this paper they noted that people were likely to surf the web, following links from one page to another, starting their journey at either a directory or a search engine. Search engines have since become the most popular starting point for a “web session”. Other starting points have emerged, such as RSS newsreaders, which alert users to fresh content on sites they are interested in. There are also a variety of “push” technologies which have a similar effect on the initiation of the browsing experience as RSS. Institutional intranet pages and email clients (web based or not) are also significant current starting points, though many of these will themselves be portals including feeds (content supplied by push, or push-like, technology), search boxes, and directories. Novel ways of starting surfing sessions are also emerging: in many European cities, when you enter one your phone will receive a text advertising what’s available locally, and these messages can include web links. When wireless users connect to networks they are often first presented with a splash registration page from the network provider. Future initiation events might include bringing a certain product into proximity of a computer, where the product could be detected by its internal RFID tags or by image recognition via a webcam.
PageRank
Brin and Page’s paper introduced the basis of PageRank, in which information about which pages and sites link to each other is extracted and interpreted. It is worth noting that this paper was not the first to describe using what the paper describes as ” “, for determining the relative relevance of pages. Several groups of people who had previously described such techniques are cited:
- Ellen Spertus. ParaSite: Mining Structural Information on the Web. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
- Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search Engines. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
- Jon Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proc. ACM-SIAM Symposium on Discrete Algorithms, 1998.
- Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Manprempre, Peter Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. Proceedings of the 7th ACM Conference on Hypertext. New York, 1996.
There is also a reference in the paper to the way in which the relevance and importance of academic papers has been assessed by the number of references to them in other academic papers. In the academic world this approach is known to have a number of major flaws. One is that a particular type of paper tends to get a disproportionate number of citations: a paper describing how to do a particular type of experiment for the first time, because from then on all scientists using that technique cite the original paper. Google’s PageRank takes into account both the number of links coming to a page and the PageRank of the pages on which those inbound links were found.
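As a rough illustration of a rank that depends on both the number of inbound links and the rank of the linking pages, the sketch below iterates a simplified PageRank over a tiny link graph. The function name `pagerank`, the damping value of 0.85, the handling of pages with no outbound links, and the example pages are all assumptions made for illustration. The normalised form used here (ranks summing to one) is a common variant and is not necessarily the exact formulation used by the prototype.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate ranks for a graph given as {page: [pages it links to]}."""
    # Every page that appears as a source or a target gets a rank.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Each page starts with a small baseline share (assumed damping model).
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outbound links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no outbound links spreads its rank evenly.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical four-page site used only to exercise the function.
example_links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],
}
example_ranks = pagerank(example_links)
```

In this hypothetical graph "home" ends up with the highest rank because it has the most inbound links, while "orphan", which nothing links to, keeps only the baseline share.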
Link Text
The second technique used by the prototype Google, in addition to PageRank, is analysis of the link text, the text enclosed by the anchor tags. The paper describes the association of such link text with both the page that it is on and the page it links to. This makes sense as link text is usually “important”, in that it is highlighted, like titles, URLs, capitalised text, headings, and other specially tagged text. The use of such indicators is also mentioned in the paper. The Google founders also note that the link text is likely to offer a better description of the target page content than can be obtained from the target page itself, and that content such as images, which contains no inherent descriptive text, can be associated with the link text of inbound links.
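The association of link text with both the linking page and the target page can be sketched with Python's standard `html.parser` module. The class `AnchorTextParser`, the function `index_link_text`, the simple word-to-URLs index, and the example URLs are hypothetical names chosen for this illustration, not structures taken from the paper.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, link text) pairs from the anchor tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text that sits inside an open anchor counts as link text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_link_text(source_url, html, index):
    """Record each link-text word against the source page and the target."""
    parser = AnchorTextParser()
    parser.feed(html)
    for href, text in parser.links:
        for word in text.lower().split():
            index[word].add(source_url)
            index[word].add(href)


# Hypothetical page linking to an image that has no text of its own.
index = defaultdict(set)
index_link_text(
    "http://example.com/gallery",
    '<p>See the <a href="http://example.com/cat.jpg">sleeping cat photo</a>.</p>',
    index,
)
```

After this runs, a lookup for "cat" finds both the gallery page and the image URL, even though the image itself contains no words.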
Updates
The article reveals the reasons behind the three month update cycle at Google, and one purpose of the “see more results” link. One is that indexes need to be rebuilt, and this takes time. Another is that a first pass of the results takes place on pre-computed collections of data, and only if that is insufficient is the whole dataset queried.
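The two-pass lookup can be pictured with a minimal sketch like the one below. The function `search`, the dictionaries `small_index` and `full_index`, the `wanted` threshold, and the document names are all hypothetical. The sketch only shows the fall-back idea and says nothing about how the prototype actually stored its data.

```python
def search(term, small_index, full_index, wanted=3):
    """Return up to `wanted` results, querying the full index only if needed."""
    # First pass: the small, pre-computed collection.
    results = list(small_index.get(term, []))[:wanted]
    if len(results) < wanted:
        # First pass was insufficient, so fall back to the whole dataset.
        for doc in full_index.get(term, []):
            if doc not in results:
                results.append(doc)
            if len(results) == wanted:
                break
    return results


# Hypothetical indexes mapping a search term to an ordered list of documents.
small_index = {"python": ["doc1", "doc2", "doc3"], "zlib": ["doc4"]}
full_index = {
    "python": ["doc1", "doc2", "doc3", "doc7"],
    "zlib": ["doc4", "doc8", "doc9", "doc10"],
}

python_results = search("python", small_index, full_index)
zlib_results = search("zlib", small_index, full_index)
```

Here the query for "python" is answered entirely from the small index, while the query for "zlib" has to be topped up from the full one.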
Google Local
Even this first incarnation of Google is revealed to make use of proximity in ordering the search results, though it is almost certain that at this stage this was at the country level, rather than the city / street level that is emerging with Google Local and other services.
The prototype Google described had a databank of 24 million web pages (147GB), which appears decidedly tiny compared to current data storage requirements for even highly specialised crawling options.
Feedback
The paper discusses the fact that the amount of traffic a page receives should ideally be taken into account in producing the search results. At the time, though, Google did not have access to such information. As of 2005 Google tracks which sites people click to from the Google results (and tracks Google Toolbar users to a greater degree). The tracking information can be used to infer the traffic to various sites, as well as to provide a way of determining the relevance of search results. This feedback loop, where commonly clicked links are promoted, is known to be used for advert positioning by Google - its use in the search results has not yet been confirmed, though Sci7’s internal (ongoing) research reveals indications that it is occurring.
Some key points about Google’s data management revealed:
- Whenever Google comes across a new URL in a page a unique ID number is assigned to that URL.
- Crawled documents are stored in a compressed form (zlib) on Google’s servers.
- Each document is split into chunks, which are ranked depending on how far into the document they occur and how they are tagged.
- These chunks are then grouped.
- Links are extracted from the crawled document and placed in another file (index), along with the link text, source, and destination.
- Relative URLs are corrected into absolute URLs.
- A database of pairs of ID numbers (linked pages) is created; this is an element of the raw data from which PageRank is calculated.
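Several of the steps listed above (assigning an ID to each new URL, storing documents compressed with zlib, correcting relative URLs, and recording pairs of linked IDs) can be sketched in a few lines of Python. The class `DocumentStore`, its method names, and the example URLs are hypothetical. Sequential integers are used for the IDs purely as an assumption, and the links are passed in already extracted to keep the example short.

```python
import zlib
from urllib.parse import urljoin


class DocumentStore:
    """Toy store: URL IDs, compressed pages, and (source, destination) pairs."""

    def __init__(self):
        self.url_ids = {}     # URL -> ID number
        self.pages = {}       # ID number -> zlib-compressed page
        self.link_pairs = []  # (source ID, destination ID)

    def doc_id(self, url):
        # A URL seen for the first time is given the next unused ID number.
        if url not in self.url_ids:
            self.url_ids[url] = len(self.url_ids)
        return self.url_ids[url]

    def add_page(self, url, html, hrefs):
        source = self.doc_id(url)
        self.pages[source] = zlib.compress(html.encode("utf-8"))
        for href in hrefs:
            # Relative URLs are resolved against the page they were found on.
            absolute = urljoin(url, href)
            self.link_pairs.append((source, self.doc_id(absolute)))

    def get_page(self, url):
        return zlib.decompress(self.pages[self.url_ids[url]]).decode("utf-8")


# Hypothetical crawled page with two relative links.
store = DocumentStore()
store.add_page(
    "http://example.com/a/index.html",
    "<html><body>Example page</body></html>",
    ["../b.html", "c.html"],
)
```

The resulting `link_pairs` list is the kind of raw data a link-based ranking calculation could start from.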
Metadata is described as a poor source of information for determining relevance, as information not visible to the user is often used intentionally to attempt to manipulate search results.
Challenges for a Search Engine
The paper gives much insight into the challenges faced by a search engine: “One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user’s ability to look at documents has not. People are still only willing to look at the first few tens of results.” This problem can be viewed from the point of view of an organisation or individual running a website - if you’re not in the top thirty or so results on the popular search engines for relevant search terms then you won’t get much traffic from them. This problem might be lessened with localisation technology: many searches are naturally localised, and localisation can reduce the number of documents from which your search results are to be selected.
Googlebot
The Google web crawler, now known as the Googlebot, is discussed.
- It is made of a number of machines, URL servers, and URL grabbers
- The machines which get the data hold DNS caches
- The code for this element of the operation is written in Python
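The arrangement in the list above, with a URL server handing out work and fetching machines that hold DNS caches, might look something like the following sketch. The classes `UrlServer` and `DnsCache`, the `fake_resolver` function, and the example URLs are hypothetical and are not taken from the crawler itself. The default resolver is the standard library's `socket.gethostbyname`, but the example swaps in a stand-in so that no network access is needed.

```python
import socket
from collections import deque
from urllib.parse import urlparse


class DnsCache:
    """Remember hostname lookups so each host is resolved only once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver
        self._cache = {}
        self.lookups = 0  # number of lookups that actually hit the resolver

    def resolve(self, hostname):
        if hostname not in self._cache:
            self.lookups += 1
            self._cache[hostname] = self._resolver(hostname)
        return self._cache[hostname]


class UrlServer:
    """Hand out each URL to the fetchers once, in the order it was added."""

    def __init__(self, seeds=()):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        return self._queue.popleft() if self._queue else None


def fake_resolver(hostname):
    # Stand-in for a real DNS lookup; 192.0.2.1 is a documentation address.
    return "192.0.2.1"


cache = DnsCache(resolver=fake_resolver)
server = UrlServer([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # duplicate, handed out only once
])

resolved = []
url = server.next_url()
while url is not None:
    resolved.append((url, cache.resolve(urlparse(url).hostname)))
    url = server.next_url()
```

Both URLs share a host, so the resolver is consulted only once and the second lookup is served from the cache.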
Crawling problems are also discussed: crawling an online game resulted in problems, as the Googlebot presumably behaved like a crazed player following all sorts of links. Sci7 has been made aware of a re-emergence of this kind of problem when companies run the Google search appliance within their company networks, or from IP addresses which are logged into online applications.
Even this early prototype reveals the use of “trusted feedback”, in this case just for evaluation of the results, presumably for improving the algorithm. However, Sci7 now believes Google uses a large number of individuals to manually check results, primarily for the purpose of removing search engine spam from the system. Sci7 customers get the benefit of our knowing the criteria used by these individuals to assess sites.
Google Rank
The paper uses as an example throughout the name of the then current US president, highlighting the fact that the top result was whitehouse.gov as evidence that the search engine is working well. While this is also the case with the current president, the other results are interesting, with bushorchimp.com coming in at number four, and much has been made of “google bombs” such as:
http://www.google.com/search?q=failure
These effects are created by many people online creating links to the White House official biography of George W Bush using the word failure as the link text, as has been done in this sentence.
Forward Looking
Much is also made in this early paper of the scalability of the techniques used, including steps such as designing their own operating system capable of dealing with very large files, and use in a failure-tolerant distributed environment. Thinking about the future is clearly important for all organisations involved in online technologies, where the future has a tendency to arrive very quickly.
Sci7
Sci7 provides website optimisation consultancy, tools, and software, among its portfolio of services and information products.
This entry was posted on Tuesday, December 6th, 2005 at 8:15 pm and is filed under Resources.