Mildrid Ljosland
Norwegian University of Science and Technology
For evaluating such systems, recall (the number of relevant retrieved documents divided by the number of relevant documents) and precision (the number of relevant retrieved documents divided by the number of retrieved documents) are the most used measures. One main use is in the TREC (Text retrieval conference, http://trec.nist.gov), where many research groups get their system tested against a common database of documents.
There are some difficulties in using these measures, however. They are both based on assigning a binary value relevant/not relevant to each document. Researchers like Su [Su93, Su98], Green [Green95] and Borland and Ingwersen [BorIng97] argue that relevance is a complex item with many topical aspects, one of them being who is assigning the relevance, another being which relevance is considered. Others, like Leighton and Srivastava [LeiSri97] and Spink, Greisdorf, and Bateman [SpGrBa98] use a non-binary relevance measure. Spink et al. advocate the use of partly relevant documents as a source for new viewpoints on the query topic.
Another difficulty is that as more and more documents are gathered in the database, it will be impossible to evaluate all of them for relevance. This has led the TREC community to use pooled results. Only documents retrieved by one or more groups are evaluated for relevance, and the collection of such retrieved, relevant documents is considered to be the total collection of relevant documents. That is, relevant documents found by no groups, will not count in the recall computation.
In the Web environment this difficulty is even more predominant. Here, the database of documents is increasing each day, and is estimated to reach a billion in 2000 [BrinPage98]. A complete search for relevant ones among all these documents is impossible, and even checking only the retrieved ones is unrealistic when queries may give millions of hits.
Therefore, researchers publishing works on comparison of Web search engines use precision as their main evaluation measure [Winship95, DingMarch96, ChuRos96, ClaWil97, LeiSri97], evaluating only the highest ranked hits. If the n first documents are evaluated, the precision found from these documents is called P@n (precision at n).
Some of the researchers also use pooled recall (computed from the n highest ranked), and some evaluate qualitative properties like user interface and ease of use. All these studies use main search engines like Alta Vista, Excite, Lycos, and Infoseek.
I used a three point relevance scale, relevant (1), partly relevant (0.5) and not relevant (0). To be deemed relevant, a page would have to closely match the query, and give some information about it. A partly relevant page was defined as either a page giving information on where to get information (a hypertext link, an address or a reference to a book or article about the subject), or some information connected to the query, but not quite to the point. Pages not containing any information about the query topic, were deemed not relevant, even if they contained information connected to some of the query terms (when the query was "the name of the king of Norway in 1990", information about ancient Norwegian kings were assigned the 0 score).
Mirror pages, which contain the same information, but have a different URL were counted as ordinary pages, whereas copy pages (same contents and same URL) got 0 score, since the user is certainly not interested in such links. Also error reports, such as not existing page or no response got the 0 score (no response pages after some additional trials). The exception was Google, which has a cache facility making me able to evaluate even such pages.
From the results, the mean precision (P@10) of each engine was estimated. When counting only the relevant, not the partly relevant documents, I got precision rates 0.4 for Alta Vista, 0.7 for Google, and 0.4 for AllTheWeb. Counting also the partly relevant documents, the mean precisions were 0.5 (Alta Vista), 0.9 (Google) and 0.5 (AllTheWeb). An analysis of variance showed no statistical significant difference between the precisions at the 0.05 level when counting only the relevant documents, whereas taking also the partly relevant documents into account gave a statistical significant difference (F=4.5 compared to F0.05,2,31= 3.3) between the engines.
The total sum of relevant pages was defined as the sum of the relevant, retrieved documents over all the three engines, counting equal documents found by two or more engines only once. Using this number as the denominator, the mean pooled recall was estimated to 0.3 (Alta Vista), 0.5 (Google) and 0.3 (AllTheWeb). Analysis of variance showed no significant differences.
Counting how many times the highest ranked document is deemed relevant for the twelve queries, the probability of getting a relevant document as the first, was estimated to 0.3 (Alta Vista), 0.8 (Google) and 0.5 (AllTheWeb). An approximate test based on the X2 statistics, showed no significant differences in these probabilities. However, taking also the partly relevant documents into account, the probability of getting a (partly) relevant document as the first, was estimated to 0.5 (Alta Vista), 1.0 (Google) and 0.7 (AllTheWeb). This gave a significant difference in the X2 test, but since the numbers of observations are small, this test is questionable.
In the test I used 50 queries from the TREC query collection (no. 351-400). The title-only version of the queries were sent to the two engines (after adding a plus sign to each word in order to get the AND version rather than the default OR version of the queries), and the top twenty documents were evaluated for relevance.
The relevance judgements were done against the full topic descriptions, as either relevant or not relevant. Only the documents, not the links they contained, were evaluated, and duplicate documents were treated as ordinary ones. Documents that could not be evaluated due to "no response", "file not found" etc., were simply passed over and a new document was selected instead. A similar relevance judgement procedure was used by Hawking et al. [HaCrThHa99] in a comparison between some TREC IR systems and some well-known Web search engines.
Using the described procedure, the P@20 values were estimated to 0.33 for the official version of the engine, and to 0.36 for the new version. (Hawking et al. [HaCrThHa99] found a precision range of 0.23 to 0.38 for the Web search engines they tested.)
Is an observed precision difference of 0.03 significant? A standard paired data two-tailed test with the null hypothesis that the difference is zero, is not rejected at the 0.05 level of significance (z=1.79 against $z_{0.025}=1.96$). So this test has not shown any statistical significant difference between the two versions, but at least it seems reasonable to conclude that the increase in indexed documents has not made the precision to decrease.
Savoy [Sav97] showed that in his tests, the precision differences could not be considered to come
from a normally distributed population. If the population is not (approximately) normal, the standard
tests are not applicable. Therefore, I tested this for my data, using the statistics X2=\sum_{i=1}^{c}
\frac{(O_i-E_i)^2}{E_i}$ where
One of the goals of this study is to test whether the differences in precision that I found in the my study, also is present in this study. Some variations are to be expected, partly because the users themselves do the relevance assignments in this study, not a "neutral" observer.
The effect of using young and/or inexperienced people is not (to my knowledge) investigated in previous studies, so another goal is to get an evaluation of the usefulness of the search engines for such users. Other studies have used adults in the searching and evaluation process, often search experts for the searching, and/or topic experts for the evaluation.
Some work has been reported in this area. Here I will mention a few studies.
Li [Li98] has reported a study on using the hyperlinks in the indexing process. He used the text in the hyperlinks as a representation for the documents that is pointed to by the hyperlinks. In this way, the document is described with the words that others use about it, not with the words that is used in the document itself. This seems to be a good idea in the Web environment, where many Web document creators try to fool the retrieval engines into giving their documents high ranking.
In the creation of the Google search engine, this was used together with another interesting technique [BrinPage98]. Here "significant" hyperlinks to a document enlarged the rating of this document, thereby trying to heighten the precision level. They used a "pagerank", which is a number calculated for each document based on how many hyperlinks points to this document, and the pagerank of the documents containing the hyperlinks.
Another way of trying to improve the results, is using data fusing. Ng and Kantor [NgKant98] found in their experiments with TREC-4 runs that combining two runs can improve the precision beyond the precision of each of the runs. Their conclusion is that in order to achieve this, one should have two runs that are not too similar, and both runs should perform reasonably well.
Using the TREC-8 Web track data, I want to evaluate the ranking and resource effects of various strategies: