This is Sci7 Ltd’s review of an article by Amanda Spink, Minsoo Park, Bernard J. Jansen and Jan Pedersen published in volume 42 of the journal “Information Processing and Management” (pages 264–275) with an expected publication date of early 2006.

Data from Altavista

The research discussed is based on analysis of users of the Altavista search engine. The paper states that data was obtained from Altavista “transaction logs” from 2002. The fact that this data is not recent is one weakness of the research, though the paper justifies the use of this dataset on the grounds that Altavista had a larger fraction of search users at the time than it can now boast. The logs cover a 24-hour period on a Sunday. It is possible that weekend search queries differ from weekday ones, an issue this research does not address. Interestingly, the analysis was affected by the presence of automated searches being submitted to Altavista. Sci7 believes the method used for removing such ‘robot’-derived hits was dubious, in that all sessions of more than one hundred queries were removed. Sci7 believes smarter techniques for identifying robots, such as looking for very short intervals between queries and inspecting the “User Agent”, may have been more appropriate. The apparent assumption that users have unique IP addresses could well have resulted in significant chunks of the user base, such as those in institutions using web caching and those using certain ISPs, being identified as robots by the techniques applied and so excluded from the dataset.
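As an illustration of the alternative suggested above, the following is a minimal sketch of a robot filter that looks at query timing and the “User Agent” rather than at raw query counts. It is not the method used in the paper. The record layout (`Query`), the two-second threshold and the list of agent markers are all assumptions made for the example, and would need tuning against real logs.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median


@dataclass
class Query:
    client_id: str     # hypothetical grouping key from the log (e.g. IP address or cookie)
    timestamp: float   # time of the query in seconds
    user_agent: str    # "User Agent" string sent with the request


# Assumed values for illustration only; not taken from the paper.
MIN_MEDIAN_GAP_SECONDS = 2.0
BOT_AGENT_MARKERS = ("bot", "crawler", "spider", "curl", "wget")


def looks_like_robot(queries: list[Query]) -> bool:
    """Flag one client's queries as automated using agent string and timing."""
    # Signal 1: the user agent openly identifies itself as a robot.
    for q in queries:
        agent = q.user_agent.lower()
        if any(marker in agent for marker in BOT_AGENT_MARKERS):
            return True

    # Signal 2: queries arrive faster than a person could plausibly type them.
    times = sorted(q.timestamp for q in queries)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Require at least three gaps so that one quick re-query is not penalised.
    return len(gaps) >= 3 and median(gaps) < MIN_MEDIAN_GAP_SECONDS
```

Under these assumptions, a client that submits 150 queries a minute apart (for example, many people behind one institutional proxy) is kept, whereas a client that submits a handful of queries half a second apart is flagged.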

Conflict of Interest?

It is perhaps likely that Altavista would make such outdated information available for academic research. However, the fact that one of the authors, Pedersen, works for Yahoo, which provides the search function on Altavista, is presumably what enabled the data to be obtained for the study. The potential for a conflict of interest should be borne in mind when considering the results of the paper discussed here.

User Behaviour

The research focuses on how users interact with search engines, looking at areas such as multitasking - what else users are doing while they’re involved in a search session, and the type of queries that they are using.

Key findings presented included:

  • Sessions ranged in duration from less than a minute to a few hours.
  • In only 19% of two-query sessions were both queries on the same topic.
  • 91.3% of three or more query sessions included multiple topics.
  • There is a broad variety of topics in multitasking search sessions.

A session is defined as: “…the entire series of queries submitted by a user during one interaction with the Web search engine. Session length varied from less than a minute to a few hours.” Sci7 would have been given more confidence by a rigorous definition based on the method used to identify queries from the same session in the raw transaction logs.
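To show what a more rigorous, reproducible definition might look like, here is a minimal sketch that groups log records by client and starts a new session whenever the gap between consecutive queries exceeds an idle timeout. This is a common convention in log analysis, not the definition used in the paper. The 30-minute timeout, the record layout and the function name are assumptions made for the example.

```python
from collections import defaultdict

# Assumed idle timeout for illustration; the paper does not state one.
SESSION_TIMEOUT_SECONDS = 30 * 60


def split_into_sessions(records, timeout=SESSION_TIMEOUT_SECONDS):
    """Group (client_id, timestamp_seconds, query_text) records into sessions.

    Returns a list of (client_id, [(timestamp, query_text), ...]) pairs.
    A new session starts when a client is idle for longer than `timeout`.
    """
    by_client = defaultdict(list)
    for client_id, timestamp, text in records:
        by_client[client_id].append((timestamp, text))

    sessions = []
    for client_id, events in by_client.items():
        events.sort()  # order each client's queries by time
        current = [events[0]]
        for previous, event in zip(events, events[1:]):
            if event[0] - previous[0] > timeout:
                # Idle gap exceeded: close the current session, start another.
                sessions.append((client_id, current))
                current = []
            current.append(event)
        sessions.append((client_id, current))
    return sessions
```

With a rule of this kind stated explicitly, a reader could check how sensitive the reported session counts are to the chosen timeout. It would also make visible the dependence on `client_id`: if that key is an IP address, several users behind one proxy are merged into a single “session”.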

A key question is whether the multiple search terms are devised to try to find the same information for the user. The suggestion is made, on the basis of other referenced research, that the multiple topics are due to users seeking a range of different information within the session. This referenced work includes studies carried out on data from other search engines, including Excite and Alltheweb. These sources suggest that an average “multitasking” search session involves two topics. This is a bit of an odd statistic: if single-query sessions are the most common, two-query sessions the next most common and sessions with more queries less common still, it is clear that the most popular “multitasking” session will be one with two queries.

A sample of “multitasking” is given in the paper:

In Example 1, the user ranged over more than three topics during a period of 11 minutes from crystal meth to ephedrine then catwoman and finally plastic surgery.

Sample Size

A random set of sessions was taken from the dataset obtained from Altavista and split according to the number of queries per session. The sets then obtained were tiny: for example, only 254 two-query sessions, 206 of which were determined to be “multitasking”. In the view of Sci7, 254 sessions is far too small a dataset from which to draw conclusions of any sort when compared to the 250 million searches per day in 2003 on Google and partners reported by Danny Sullivan of Search Engine Watch (Searchenginewatch.com), citing Google itself as the source of that number.

Importance

An apparently important omission from this paper is the relative number of single-query sessions. Without this information it is difficult to judge the importance of the research presented here on multi-query sessions. The only evidence given on this crucial question is derived from an Excite user survey, which is a rather different approach to data collection than log analysis. The Excite survey revealed that 96.2% of sessions were single-query sessions, which brings into question the impact of research, such as that reviewed here, focused on the remaining 3.8% (11 out of 287).

Authors’ Recommendations

A number of recommendations are made; these are commented on below:

  • Provide users with the ability to access, refine, and use results from previous searches within the confines of a session across multiple topics.
    What comes out of the research is in fact that, where this is done, it should be done with care, as the previous results within the session are unlikely to be related.
  • Assist users in coordinating multiple topics into effective queries (i.e., search histories, various thesauri or keyword generation tools).
    There are two options here: one is to bring multiple topics together; the other is to keep them separate and to ensure that sequential searches are generally treated independently, or at least that the opportunity to treat them independently is maintained.
  • Provide searchers the ability to create multiple sets of working notes related to different or related search topics (i.e., sketching, hyper-linking, and note creation tools).
    To allow multiple distinct sets of any such notes is I think the key point here, and one not well made by the recommendations.
  • Enable the submission and tracking of multiple queries concurrently on different or related topics.
    There is a basis for this, but research on the fraction of sessions which involve multiple searches should be reviewed in order to determine the prominence such features should be given.
  • Allow for searching multiple search engines or collections concurrently on multiple topics.
    No data on users using multiple search engines or collections concurrently is introduced within this paper.
  • Enable the reformulation of multiple queries on different or related topics.
    What search engine doesn’t provide this functionality?
  • Provide windowing facilities to allow Web users to generate and track separate topic or related topic queries and facilitate topic switching.
    Again, this appears a sensible suggestion, but the evidence provided suggests that this will only be required by the minority of users engaged in multi-query search sessions, so it should perhaps not be present in a “default” or “simple” search interface.
  • Enable the generation and comparison of relevance judgments from different or related searches.
    Allowing for either different or related searches appears sensible.
  • Enable the tracking, storing and manipulating retrieved results and printouts related to different topics over multiple searches.
    Printouts?! No evidence is presented that users print their search results, or that people store search results. This recommendation appears well beyond the scope of the research either conducted or referenced.
  • Provide the ability to create clusters of retrieved information related to different topics.
    This is absolutely indicated, but really only becomes important when the search interface is such that searches are “remembered” and past searches influence future results. If retrieved information is to be stored, then it should be clustered into various different topics.

Sci7’s Summary

While many of the above suggestions appear sensible, many do not appear to follow directly from the research results described in the paper, and are therefore simply opinions and suggestions. The fact that they appear in a “research paper” should not give them undue credibility.

The key finding of this research can be summarised as:
When users enter multiple queries within a session, the queries are highly likely (81.1% for two queries, 91.3% for three or more) to be unrelated and not on the same topic.

In Sci7’s view the key recommendation coming out of this work is simply that this pattern of behaviour should be recognised. User interfaces of search engines should not make the apparently unfounded assumption that sequential queries in the same session will be on the same topic. In practice this means that options such as “search within results” and “conduct this same search on another database” should be given lesser prominence than options to simply start a new search from scratch.

Sci7’s advice and recommendations are based on a scientific, evidence-based approach to the optimisation of online services, derived from both proprietary and published research.

Cambridge UK