Search Engines and Their Public Interfaces

By Frank McCown

June 1, 2007

Abstract
Google, Yahoo and MSN all provide both web user interfaces (WUIs) and application programming interfaces (APIs) to their collections. Whether building collections of resources or studying the search engines themselves, the search engines request that researchers use their APIs and not “scrape” the WUIs. However, anecdotal evidence suggests the interfaces produce different results. We provide the first in-depth quantitative analysis of the results produced by the API and WUI interfaces of Google, MSN and Yahoo. We have queried both interfaces for five months and found significant discrepancies between the interfaces in several categories. In general, we found MSN to produce the most consistent results between its two interfaces. Our findings suggest that the API indexes are not older, but they are probably smaller for Google and Yahoo. We also examined how search results decay over time and built predictive models based on the observed decay rates. Based on our findings, it can take over a year for half of the top 10 results for a popular query to be replaced in Google and Yahoo; for MSN it may take only 2-3 months.

1. Introduction

Commercial search engines have long been used in academic studies. Sometimes the search engines themselves are being studied, and sometimes they are used to study the Web or web phenomena. In the past, researchers have either manually submitted queries to the search engine’s web user interface (WUI), or they have created programs that automate the task of submitting queries. The returned results have been processed manually or by programs that rely on brittle screen-scraping techniques.

But data collection mechanisms for search engines have changed in the past few years. Three of the most popular commercial search engines (Google, MSN and Yahoo) have developed freely available APIs for accessing their indexes, and researchers can now use these APIs in their automated data collection processes. Unfortunately, the APIs do not always give the same results as the WUI. The listservs and newsgroups that cater to the API communities are full of questions regarding the perceived differences in results between the two. None of the search engines publicly disclose the inner workings of their APIs, so users are left wondering if the APIs are giving second-rate data. This anecdotal evidence has led some researchers to question the validity of their findings. For example, Bar-Yossef and Gurevich [2] state that “due to legal limitations on automatic queries, we used the Google, MSN, and Yahoo! web search APIs, which are, reportedly, served from older and smaller indices than the indices used to serve human users.” Other researchers may be totally unaware of the differences. When writing about a 2004 experiment using the Google search engine, Thelwall [6] stated that “the Google API... could have been used to automate the initial collection of the results pages, which should have given the same outcome” as using the WUI.

The main purpose of this study is to examine the differences between the results reported by the WUI and the API for a variety of queries that researchers and users frequently submit. Our findings allow us to address the question of whether the APIs are pulling from older and smaller indexes. A secondary purpose is to examine how search results decay over time and to provide predictive models for determining the half-lives of search results.

2. Background and Related Work

2.1 Search Engine APIs

At this point you are probably getting really excited about this paper. If you will just continue to hang on, it’s going to get even better [5].

2.2 Research Use of the APIs

Researchers use the APIs to do many things. That’s all you need to know.

2.2.1 Sub-subsection

This is a sub-subsection.

2.2.2 Another Sub-subsection

And this is another sub-subsection.

2.2.3 And the Final Sub-subsection

This is the final sub-subsection.

2.3 Comparing Search Engine Results

There are many ways to compare search results; Table 1 lists just a few of these ways.

Table 1: Three methods for comparing search engine results.

  Overlap: takes into account only shared results.

  Kendall tau distance for top k results [4]: penalizes movements in the results.

  Bar-Ilan et al.’s M measure [1]: penalizes movements at the bottom of the results less heavily than movements at the top.
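Of these measures, the Kendall tau distance is the simplest to sketch in code. The following is a minimal illustration assuming both lists rank the same k items; Fagin et al. [4] generalize the measure to top-k lists that only partially overlap. The function names here are ours, not from any of the cited works.

```python
from itertools import combinations

def kendall_tau_distance(a, b):
    """Count pairwise disagreements between two rankings of the same k items."""
    pos = {item: i for i, item in enumerate(b)}
    discordant = 0
    for x, y in combinations(a, 2):
        # x precedes y in list a; penalize if list b reverses that order
        if pos[x] > pos[y]:
            discordant += 1
    return discordant

def normalized_k_distance(a, b):
    """Scale to [0, 1] by the maximum k*(k-1)/2 possible disagreements."""
    k = len(a)
    return kendall_tau_distance(a, b) / (k * (k - 1) / 2)

print(kendall_tau_distance(["x", "y", "z"], ["z", "y", "x"]))  # prints 3 (fully reversed)
print(normalized_k_distance(["x", "y", "z"], ["z", "y", "x"]))  # prints 1.0
```

A distance of 0 means the two rankings agree perfectly; 1.0 (normalized) means one is the exact reverse of the other.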

3. Experiment Setup

Here’s where we explain how we set up our experiment.

4. Experiment Results

Here’s where we dazzle you with the results. Figure 1 shows the top 100 search results compared over the 150 days of the experiment. Notice that Yahoo shows some very large dips in both the popular and CS term results.

Figure 1: K distance between top 100 search results when comparing day n to day n − 1.

5. Conclusions

Our five-month experiment has uncovered a variety of disagreements between the interfaces of Google, MSN and Yahoo. Our findings suggest that the API indexes are not older, but they are probably smaller. Although the indexes used by the WUI and API appear to be updated at the same rate for all three search engines, the top 100 WUI and API results are rarely identical for CS and popular term queries. When examining just the top 10 results, Google’s API produces results that are 20% different from the WUI results; Yahoo’s are 14% different, and MSN’s are 8% different. In general, we found MSN to produce the most consistent results between its two interfaces.
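The text does not spell out the exact difference measure behind these percentages, but figures of this kind can be produced by a simple set-overlap calculation: the fraction of one interface's top results that do not appear in the other's. A sketch with hypothetical result lists (the URLs are invented for illustration):

```python
def percent_different(wui_results, api_results):
    """Fraction of WUI results missing from the API results (order ignored)."""
    wui, api = set(wui_results), set(api_results)
    return len(wui - api) / len(wui)

# Hypothetical top-3 lists for illustration
wui = ["a.example.com", "b.example.com", "c.example.com"]
api = ["a.example.com", "c.example.com", "d.example.com"]
print(percent_different(wui, api))  # 1 of the 3 WUI results is absent from the API
```

An order-sensitive measure such as the Kendall tau distance of Table 1 would penalize reorderings as well, so it generally reports larger differences than this overlap-only view.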

We have also examined how search results decay over time. We have built predictive models based on the observed decay rates of the popular and CS search term results used in our experiment. In general, we have found that the decay rates for popular results differ significantly for Google and MSN, and the top 10 results decay at a much slower rate than do the top 100 results.
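As an illustration of the half-life idea (not a reproduction of the paper's actual models), suppose the overlap with the day-0 results decayed geometrically with a constant daily survival rate s; the half-life would then be log 0.5 / log s. The survival rates below are invented for illustration only:

```python
import math

def half_life_days(daily_survival):
    """Days until half the original results are replaced, if overlap decays as s**t."""
    # Solve daily_survival ** t == 0.5 for t
    return math.log(0.5) / math.log(daily_survival)

# Hypothetical rates: at 99.8% daily survival, half the list turns over
# in roughly a year; at 99%, in a little over two months.
print(round(half_life_days(0.998)))  # prints 346
print(round(half_life_days(0.99)))   # prints 69
```

A constant-rate model like this is the simplest possible choice; the observed decay curves in the experiment need not follow it.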

6. References

  1. J. Bar-Ilan, M. Mat-Hassan, and M. Levene. Methods for comparing rankings of search engine results. Computer Networks, 50(10):1448–1463, 2006.

  2. Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine’s index. In Proceedings of the 15th International Conference on World Wide Web, pages 367–376, 2006.

  3. M. Cutts. GoogleGuy’s posts, June 2005.

  4. R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. SIAM Journal on Discrete Mathematics, 17(1):134–160, 2003.

  5. The Lycos 50.

  6. M. Thelwall. Can the Web give useful information about commercial uses of scientific research? Online Information Review, 28:120–130, 2004.
