Evaluation of Web search engines and the search for better ranking algorithms.

(A paper given at the SIGIR99 Workshop on Evaluation of Web Retrieval, August 19, 1999)

Mildrid Ljosland
Norwegian University of Science and Technology

Abstract

In this paper I will discuss two topics: the evaluation of Web search engines and the search for better ranking algorithms. In the first section I show the results from two relevance experiments. One is a comparison between three Web search engines, and the other is a comparison between two versions of a search engine that differ in the number of indexed documents. In the second section I discuss some ideas for enhancing the ranking algorithms in a Web search engine.

Evaluation of Web search engines

Background

Web search engines have their roots in the information retrieval (IR) systems developed during the last fifty years. IR methods include (among others) Boolean search methods, vector space methods, probabilistic methods, and clustering methods [BelCroft87]. All these methods aim at finding the relevant documents for a given query.

For evaluating such systems, recall (the number of relevant retrieved documents divided by the total number of relevant documents) and precision (the number of relevant retrieved documents divided by the number of retrieved documents) are the most used measures. One main use is in TREC (the Text REtrieval Conference, http://trec.nist.gov), where many research groups have their systems tested against a common database of documents.
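
Expressed as formulas, with $R$ denoting the set of relevant documents in the collection and $A$ the set of retrieved documents:

$\mathrm{recall} = \frac{|R \cap A|}{|R|}, \qquad \mathrm{precision} = \frac{|R \cap A|}{|A|}$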

There are some difficulties in using these measures, however. They are both based on assigning a binary relevant/not relevant value to each document. Researchers like Su [Su93, Su98], Green [Green95], and Borlund and Ingwersen [BorIng97] argue that relevance is a complex concept with many aspects, one of them being who assigns the relevance, another being which kind of relevance is considered. Others, like Leighton and Srivastava [LeiSri97] and Spink, Greisdorf, and Bateman [SpGrBa98], use a non-binary relevance measure. Spink et al. advocate the use of partly relevant documents as a source of new viewpoints on the query topic.

Another difficulty is that as more and more documents are gathered in the database, it becomes impossible to evaluate all of them for relevance. This has led the TREC community to use pooled results: only documents retrieved by one or more groups are evaluated for relevance, and the collection of such retrieved, relevant documents is considered to be the total collection of relevant documents. That is, relevant documents found by no group will not count in the recall computation.

In the Web environment this difficulty is even more pronounced. Here, the database of documents grows every day, and is estimated to reach a billion documents in 2000 [BrinPage98]. A complete search for relevant ones among all these documents is impossible, and even checking only the retrieved ones is unrealistic when queries may return millions of hits.

Therefore, researchers publishing comparisons of Web search engines use precision as their main evaluation measure [Winship95, DingMarch96, ChuRos96, ClaWil97, LeiSri97], evaluating only the highest-ranked hits. If the first n documents are evaluated, the precision computed from these documents is called P@n (precision at n).
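
In formula form, if $r_n$ denotes the number of relevant documents among the $n$ highest ranked, then

$P@n = \frac{r_n}{n}$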

Some of the researchers also use pooled recall (computed from the n highest-ranked documents), and some evaluate qualitative properties like user interface and ease of use. All these studies use major search engines like Alta Vista, Excite, Lycos, and Infoseek.

Relevance experiments

A comparison between three Web search engines

In spring 1999, I made a small comparison study of three search engines: the well-known Alta Vista (http://www.altavista.com) and two newcomers, AllTheWeb (http://www.alltheweb.com) and Google (http://www.google.com). Twelve queries were sent to the engines, and the first 10 retrieved documents were evaluated. Ten of the queries were taken from a similar study by Clarke and Willett [ClaWil97]; the last two I formulated myself. All the relevance assignments were done by me.

I used a three-point relevance scale: relevant (1), partly relevant (0.5), and not relevant (0). To be deemed relevant, a page had to closely match the query and give some information about it. A partly relevant page was defined as either a page giving information on where to get information (a hypertext link, an address, or a reference to a book or article about the subject), or a page giving some information connected to the query, but not quite to the point. Pages not containing any information about the query topic were deemed not relevant, even if they contained information connected to some of the query terms (when the query was "the name of the king of Norway in 1990", information about ancient Norwegian kings was assigned the 0 score).

Mirror pages, which contain the same information but have a different URL, were counted as ordinary pages, whereas copy pages (same contents and same URL) got a 0 score, since the user is certainly not interested in such links. Error reports, such as a non-existing page or no response, also got the 0 score ("no response" pages only after some additional retrieval attempts). The exception was Google, whose cache facility made it possible to evaluate even such pages.

From the results, the mean precision (P@10) of each engine was estimated. When counting only the relevant, not the partly relevant, documents, I got precision rates of 0.4 for Alta Vista, 0.7 for Google, and 0.4 for AllTheWeb. Counting also the partly relevant documents, the mean precisions were 0.5 (Alta Vista), 0.9 (Google), and 0.5 (AllTheWeb). An analysis of variance showed no statistically significant difference between the precisions at the 0.05 level when counting only the relevant documents, whereas also taking the partly relevant documents into account gave a statistically significant difference ($F=4.5$ compared to $F_{0.05,2,31}=3.3$) between the engines.
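
As an illustration, the computation can be sketched as below. The sketch assumes a one-way analysis of variance over per-query P@10 values, computed with the scipy library; the exact ANOVA design used in the experiment may differ, and the scores in the sketch are hypothetical, not the experimental data.

  # Sketch only: the per-query P@10 values are hypothetical, not the data from the experiment.
  from scipy.stats import f_oneway

  def p_at_n(relevance_scores, n=10):
      """Mean relevance score (1, 0.5 or 0) of the n highest ranked documents."""
      return sum(relevance_scores[:n]) / float(n)

  # Hypothetical per-query P@10 values for three engines, twelve queries each.
  alta_vista = [0.5, 0.4, 0.6, 0.5, 0.3, 0.6, 0.5, 0.4, 0.6, 0.5, 0.4, 0.7]
  google     = [0.9, 0.8, 1.0, 0.9, 0.8, 0.9, 1.0, 0.9, 0.8, 0.9, 1.0, 0.9]
  alltheweb  = [0.5, 0.6, 0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5, 0.4, 0.5]

  f_statistic, p_value = f_oneway(alta_vista, google, alltheweb)
  print(f_statistic, p_value)   # difference is significant at the 0.05 level if p_value < 0.05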

The total number of relevant pages was defined as the number of relevant, retrieved documents over all three engines, counting identical documents found by two or more engines only once. Using this number as the denominator, the mean pooled recall was estimated at 0.3 (Alta Vista), 0.5 (Google), and 0.3 (AllTheWeb). Analysis of variance showed no significant differences.
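
With $R_e$ denoting the set of relevant documents retrieved (among the top 10) by engine $e$, the pooled recall of an engine is thus

$\mathrm{pooled\ recall}(e) = \frac{|R_e|}{|R_{AltaVista} \cup R_{Google} \cup R_{AllTheWeb}|}$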

Counting how many times the highest-ranked document was deemed relevant over the twelve queries, the probability of getting a relevant document as the first one was estimated at 0.3 (Alta Vista), 0.8 (Google), and 0.5 (AllTheWeb). An approximate test based on the $\chi^2$ statistic showed no significant differences in these probabilities. However, also taking the partly relevant documents into account, the probability of getting a (partly) relevant document as the first one was estimated at 0.5 (Alta Vista), 1.0 (Google), and 0.7 (AllTheWeb). This gave a significant difference in the $\chi^2$ test, but since the number of observations is small, this test is questionable.
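
One way to carry out such an approximate $\chi^2$ test is sketched below, using the scipy library. The counts are illustrative only (roughly the estimated probabilities multiplied by the twelve queries), not the observed data, and, as noted above, the expected cell counts are so small that the approximation is dubious.

  # Sketch only: the counts are illustrative, not the observed data.
  from scipy.stats import chi2_contingency

  #                   Alta Vista  Google  AllTheWeb
  relevant_first   = [4,          10,      6]
  irrelevant_first = [8,           2,      6]

  chi2_stat, p_value, dof, expected = chi2_contingency([relevant_first, irrelevant_first])
  print(chi2_stat, p_value, dof)   # small expected counts make this test questionable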

A size experiment

In another experiment, I wanted to test whether the number of indexed documents has a significant impact on the precision. I was able to try a new, unofficial version of the search engine AllTheWeb, and compared it with the official one. The official engine has indexed approximately 80 million URLs, whereas the new one has indexed more than double that number. The ranking algorithm has not been changed, so if a shift in precision could be observed, the increased number of indexed documents would be the only explanation.

In the test I used 50 queries from the TREC query collection (nos. 351-400). The title-only versions of the queries were sent to the two engines (after adding a plus sign to each word in order to get the AND version rather than the default OR version of the queries), and the top twenty documents were evaluated for relevance.
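
The query preparation amounts to a simple transformation of each title; a minimal sketch is given below (the example query is taken from the first experiment above, not from the TREC topics).

  # Prefix each word of a title-only query with '+' to force an AND of all words
  # instead of the engine's default OR interpretation.
  def to_and_query(title):
      return " ".join("+" + word for word in title.split())

  print(to_and_query("king of Norway"))   # -> "+king +of +Norway"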

The relevance judgements were made against the full topic descriptions, as either relevant or not relevant. Only the documents themselves, not the links they contained, were evaluated, and duplicate documents were treated as ordinary ones. Documents that could not be evaluated due to "no response", "file not found", etc. were simply passed over and a new document was selected instead. A similar relevance judgement procedure was used by Hawking et al. [HaCrThHa99] in a comparison between some TREC IR systems and some well-known Web search engines.

Using the described procedure, the P@20 values were estimated at 0.33 for the official version of the engine and at 0.36 for the new version. (Hawking et al. [HaCrThHa99] found a precision range of 0.23 to 0.38 for the Web search engines they tested.)

Is an observed precision difference of 0.03 significant? A standard two-tailed paired test of the null hypothesis that the difference is zero does not reject it at the 0.05 level of significance ($z=1.79$ against $z_{0.025}=1.96$). So this test has not shown any statistically significant difference between the two versions, but at least it seems reasonable to conclude that the increase in indexed documents has not caused the precision to decrease.
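
A sketch of the paired test is given below. It assumes per-query differences $d_i = P@20_{new,i} - P@20_{official,i}$ for the 50 queries; the placeholder values in the sketch are not the experimental data.

  import math

  def paired_z(differences):
      """z statistic for the null hypothesis that the mean difference is zero."""
      n = len(differences)
      mean = sum(differences) / n
      variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
      standard_error = math.sqrt(variance / n)
      return mean / standard_error

  differences = [0.05, 0.0, -0.05, 0.10, 0.05]   # placeholder values only
  z = paired_z(differences)
  print(abs(z) > 1.96)   # True would mean rejection at the 0.05 level (two-tailed)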

Savoy [Sav97] showed that in his tests, the precision differences could not be considered to come from a normally distributed population. If the population is not (approximately) normal, the standard tests are not applicable. Therefore, I tested this for my data, using the statistic $\chi^2 = \sum_{i=1}^{c} \frac{(O_i - E_i)^2}{E_i}$, where $c$ is the number of classes (9 is used here), $O_i$ is the number of observations in class $i$, and $E_i$ is the expected number of observations if the population is normal. The test gave no indication that the normality hypothesis should be rejected ($\chi^2=5.4$ against the critical value 12.6 at the 0.05 level), so the standard paired test can be considered appropriate in this case. (Of course, failing to reject the null hypothesis does not mean that it is proved, but in this case, where the computed statistic is far from the critical value, normality seems like a reasonable assumption.)
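
The goodness-of-fit test can be sketched as follows, assuming that the nine classes are chosen with equal expected probability under a normal distribution fitted to the data (the class definition actually used may differ); the 50 per-query precision differences would be passed in as the data.

  import numpy as np
  from scipy.stats import norm, chi2

  def normality_chi2(data, classes=9):
      """Chi-square goodness-of-fit statistic against a fitted normal distribution."""
      data = np.asarray(data, dtype=float)
      n = len(data)
      mean, std = data.mean(), data.std(ddof=1)
      # The classes - 1 inner boundaries give equal probability 1/classes under the fitted normal.
      probabilities = np.linspace(0, 1, classes + 1)[1:-1]
      boundaries = norm.ppf(probabilities, loc=mean, scale=std)
      observed = np.bincount(np.searchsorted(boundaries, data), minlength=classes)
      expected = n / classes
      statistic = float(((observed - expected) ** 2 / expected).sum())
      critical = float(chi2.ppf(0.95, df=classes - 1 - 2))   # two parameters estimated from the data
      return statistic, critical

  # statistic, critical = normality_chi2(per_query_precision_differences)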

A planned experiment with real users

In a future experiment, I want to have high-school students do searches on the same Web search engines as were used in the first experiment. They will be asked to formulate a search and evaluate the first ten responses from each of the engines. In addition, they will be asked to evaluate the search engines themselves (layout, help features, satisfaction with the responses, etc.).

One of the goals of this study is to test whether the differences in precision that I found in my own study are also present here. Some variation is to be expected, partly because in this study the users themselves do the relevance assignments, not a "neutral" observer.

The effect of using young and/or inexperienced people has not (to my knowledge) been investigated in previous studies, so another goal is to evaluate the usefulness of the search engines for such users. Other studies have used adults in the searching and evaluation process, often search experts for the searching and/or topic experts for the evaluation.

Search for better ranking algorithms

Background

Ranking algorithms were first developed in IR systems, partly as an enhancement to the Boolean search technique, partly as new retrieval techniques. For Web search engines, where the number of relevant documents is often very large, the ranking of the documents is more important than in traditional systems. In addition, new features such as hyperlinks and metatags are integral parts of Web documents and may give rise to better ranking algorithms.

Some work has been reported in this area. Here I will mention a few studies.

Li [Li98] has reported a study on using the hyperlinks in the indexing process. He used the text of the hyperlinks (the anchor text) as a representation of the documents they point to. In this way, a document is described with the words that others use about it, not with the words used in the document itself. This seems to be a good idea in the Web environment, where many Web document creators try to fool the retrieval engines into giving their documents a high ranking.
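
A minimal sketch of this idea of indexing a document by the anchor text of the hyperlinks pointing to it (rather than by its own words) is given below; the link triples are invented for illustration and do not come from Li's study.

  from collections import defaultdict

  links = [
      # (source page, anchor text, target page) -- invented examples
      ("http://a.example/", "Norwegian royal family", "http://target.example/kings"),
      ("http://b.example/", "king of Norway",         "http://target.example/kings"),
  ]

  # Inverted index: term -> set of target URLs whose incoming anchor text contains the term.
  anchor_index = defaultdict(set)
  for source, anchor_text, target in links:
      for term in anchor_text.lower().split():
          anchor_index[term].add(target)

  print(sorted(anchor_index["norway"]))   # pages that others describe as being about Norway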

In the creation of the Google search engine, this was used together with another interesting technique [BrinPage98]. Here, "significant" hyperlinks to a document increase the rating of that document, with the aim of raising the precision level. They used a "PageRank", a number calculated for each document based on how many hyperlinks point to it and on the PageRank of the documents containing those hyperlinks.
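
The idea can be illustrated with a small iterative computation of PageRank-like scores; the damping factor of 0.85 and the tiny link graph below are illustrative choices, not the values used by Google.

  def pagerank(out_links, damping=0.85, iterations=50):
      """Iteratively compute a PageRank-style score for each page in a link graph."""
      pages = list(out_links)
      n = len(pages)
      rank = {page: 1.0 / n for page in pages}
      for _ in range(iterations):
          new_rank = {page: (1.0 - damping) / n for page in pages}
          for page, targets in out_links.items():
              if targets:
                  share = damping * rank[page] / len(targets)
                  for target in targets:
                      new_rank[target] += share
              else:
                  # A page without out-links spreads its rank evenly over all pages.
                  for target in pages:
                      new_rank[target] += damping * rank[page] / n
          rank = new_rank
      return rank

  print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))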

Another way of trying to improve the results is data fusion. Ng and Kantor [NgKant98] found in their experiments with TREC-4 runs that combining two runs can improve the precision beyond that of each individual run. Their conclusion is that in order to achieve this, the two runs should not be too similar, and both should perform reasonably well.
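
A simple score-based fusion of two runs can be sketched as below (a CombSUM-like combination; Ng and Kantor's own method may differ). The sketch assumes that the scores of each run have already been normalised to a common scale, and the runs themselves are invented.

  def fuse(run_a, run_b):
      """Rank documents by the sum of their (normalised) scores in the two runs."""
      combined = {}
      for run in (run_a, run_b):
          for doc_id, score in run.items():
              combined[doc_id] = combined.get(doc_id, 0.0) + score
      return sorted(combined, key=combined.get, reverse=True)

  run_a = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7}
  run_b = {"doc2": 0.8, "doc3": 0.6, "doc4": 0.5}
  print(fuse(run_a, run_b))   # -> ['doc3', 'doc2', 'doc1', 'doc4']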

Some possible future experiments

A Web search engine should deliver adequate documents, and do so fast, but it has limited resources. Therefore both relevance and resource consumption must be considered when evaluating ranking algorithms. The users' response time should be kept low without demanding too much computing power, and the time and space used in the indexing process should be kept under strict control.

Using the TREC-8 Web track data, I want to evaluate the ranking and resource effects of various strategies, such as those outlined above (using hyperlink text in indexing, link-based ranking, and data fusion).

In the TREC-8 Web track data, the documents are judged relevant or not relevant. A more precise evaluation could be done if they were judged in a non-binary fashion, since we want the highly relevant documents at the top of the result list. Perhaps the TREC community will decide on a broader evaluation in a future TREC [HaCrThHa99].

References