Extracting Accurate and Complete Results from Search Engines: Case Study Windows Live

Mike Thelwall

School of Computing and Information Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK. E-mail: m.thelwall@wlv.ac.uk

Tel: +44 1902 321470 Fax: +44 1902 321478
Although designed for general web searching, commercial search engines are also used in webometrics and related research to produce estimated hit counts or lists of URLs matching a query. However, they do not return all matching URLs for a search, and their hit count estimates are unreliable. In this paper, we assess whether it is possible to obtain complete lists of matching URLs from Windows Live, and whether any of its hit count estimates are robust. As part of this, we introduce two new methods to extract extra URLs from search engines: automated query splitting and automated domain and TLD searching. Both methods successfully identify additional matching URLs, but the findings suggest that there is no way to obtain complete lists of matching URLs or accurate hit counts from Windows Live, although some estimation suggestions are provided.


Commercial search engines like Google, Yahoo! and Windows Live constantly crawl the web and maintain huge searchable databases of the pages that they have found. Search engine results are now widely used for measurement purposes, not only by information researchers in webometrics (Almind & Ingwersen, 1997; Bar-Ilan, 2004) and related fields (Foot, Schneider, Dougherty, Xenos, & Larsen, 2003; Park, 2003; Pennock, Flake, Lawrence, Glover, & Giles, 2002), but also in commercial activities such as web analytics and search engine optimisation. Hence, there is a need for research into the reliability of the results that search engines deliver, and two relevant issues are discussed here.

First, search engine hit count estimates (e.g., 119,000 in ‘Results 1-10 of about 119,000’) are often used in webometrics research, for example to determine how many pages in one country link to another (Ingwersen, 1998). These hit count estimates are normally reported on each results page and can vary between results pages (e.g., the second results page might state: ‘Results 11-20 of about 116,000’). Hence it is logical to ask which estimate is the most reliable: the one on the first page of results, or the one on a subsequent or final page? Nevertheless, despite the continued use of search engines in webometrics research, there has been no systematic study of how hit count estimates vary between results pages. Such a study could shed light on the reasons for differences and any systematic biases, as well as providing simple best practice advice.
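The comparison at issue here can be sketched in a few lines. The fragment below pages through a simulated engine and records the estimate reported on each results page; the `fake_engine` function and its figures are illustrative stand-ins invented for this sketch, not the Windows Live interface.

```python
# Sketch: collecting the hit count estimate reported on each results page.
# The engine interface below is a hypothetical stand-in; a real engine
# reports these figures on its results pages.

def collect_estimates(engine, query, max_pages=10):
    """Return the hit count estimate reported on each results page."""
    estimates = []
    for page in range(1, max_pages + 1):
        result = engine(query, page)
        if result is None:  # no further results pages
            break
        estimates.append(result["estimate"])
    return estimates

# Illustrative fake engine: the estimate drops on later pages,
# mirroring the 119,000 vs. 116,000 example in the text.
def fake_engine(query, page):
    page_estimates = {1: 119000, 2: 116000, 3: 116000}
    if page not in page_estimates:
        return None
    return {"estimate": page_estimates[page]}

first, *_, last = collect_estimates(fake_engine, "example query")
```

With such a list in hand, the first-page estimate can be compared against those on subsequent and final pages, which is the comparison investigated in this paper.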

Second, instead of hit count estimates, some webometrics research requires lists of URLs matching a query, for example if the individual URLs need to be visited or their country of origin determined (Thelwall, Vann, & Fairclough, 2006). This is often problematic because search engines normally stop at about the 1000th result, with all other matching URLs remaining hidden from the user (Jepsen, Seiden, Ingwersen et al., 2004). It is currently not known whether it is possible to use other methods to extract all of the remaining URLs in such cases. Moreover, search engines employ unreported methods to select which URLs they return, such as their page ranking algorithms (Chakrabarti, 2003), and so it is unclear whether their results are representative of their databases. Of course, since search engines do not index the whole web, it is not possible to get a complete list of all pages matching a query.

In order to address the two issues above, this paper introduces new methods to obtain extended lists of URLs for a search, including the initially hidden URLs, and to evaluate the hit counts reported by a search engine for queries with multiple pages of results. Previous research (reviewed below) has already developed several methods to assess various aspects of search engine performance but, surprisingly, none has fully investigated whether the hit count estimates are reliable reflections of the number of matching URLs in a search engine’s database. We apply these methods to a case study of Windows Live, via its search service, and also present similar results for Google and Yahoo!. No previous study has evaluated Windows Live for webometrics, and this is an important omission because it is currently the best suited for some types of investigation, as described below. Note that this paper is concerned with extracting results from a single search engine; it does not address methods to obtain more complete URL lists or more comprehensive hit count estimates, such as through the use of multiple search engines (cf. Lawrence & Giles, 1998).
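The query splitting idea named above can be sketched as follows. The fragment works over a small in-memory corpus rather than a live engine: when a query matches more URLs than the engine will display, it is split into two narrower sub-queries, one requiring and one excluding an extra term, and the two result sets are merged. The toy corpus, the cap of 3 results, and all function names are assumptions made for illustration, not the implementation used in the paper.

```python
# Sketch of automated query splitting over an in-memory corpus.
# A query here is a frozenset of required words plus a set of excluded
# words; a real implementation would instead issue "q word" and
# "q -word" queries to the search engine.

RESULT_CAP = 3  # stands in for the ~1000-URL display limit of a real engine

def search(corpus, required, excluded):
    """Return URLs of documents containing all required and no excluded words."""
    return [url for url, words in corpus.items()
            if required <= words and not (excluded & words)]

def split_search(corpus, required, excluded=frozenset()):
    """Recursively split a query until each sub-query fits under the cap."""
    matches = search(corpus, required, excluded)
    if len(matches) <= RESULT_CAP:
        return set(matches)
    # Pick a splitting word that occurs in some, but not all, matches.
    seen = set().union(*(corpus[u] for u in matches)) - required - excluded
    word = next(w for w in sorted(seen)
                if 0 < sum(w in corpus[u] for u in matches) < len(matches))
    return (split_search(corpus, required | {word}, excluded)
            | split_search(corpus, required, excluded | {word}))

corpus = {"u1": {"a", "b"}, "u2": {"a", "b"},
          "u3": {"a", "c"}, "u4": {"a", "c"}, "u5": {"a", "d"}}
# The plain query {"a"} matches five URLs, over the cap, so it is split.
all_urls = split_search(corpus, frozenset({"a"}))
```

On this toy corpus the splitting recovers every matching URL; against a live engine the split sub-queries may themselves return incomplete results, which is one reason complete lists cannot be guaranteed in practice, as the findings below show.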

Webometric methods for search engine evaluation

This section briefly reviews research evaluating search engines to set the background for the current study. An important issue in the early years of the web was to discover the percentage of the web in the databases of major commercial search engines. A method has been developed to assess this: submitting a set of queries to search engines and comparing the results (lists of URLs from each search engine) to discover their degree of overlap. This method was also used to make inferences about the percentage of the whole web (however defined) that each one indexed (Lawrence & Giles, 1998, 1999). The research showed that the search engines of the day covered up to 16% of the “indexable web”: i.e. the pages that search engines could retrieve, in theory, by finding all web site home pages and following their links recursively. From the Lawrence and Giles research we can be confident that no search engine today indexes the whole web, and it also seems that any two unrelated search engines are likely to overlap by less than 50%.

Although there is no perfect method to evaluate search engine coverage, related research has continued. For example a web site sampling method has been used to show that search engine coverage has an almost inevitable international bias against web newcomers, caused by the link structure of the web (Vaughan & Thelwall, 2004). Others have focussed on the ranking of search engine results, in an attempt to propose an alternative ranking system that is not too biased towards popularity and against page quality (Cho & Roy, 2004; Cho, Roy, & Adams, 2005).

A separate research strand has focussed on the consistency of the results reported by search engines. Even though search engines do not cover the whole web, the numbers that they report as hit count estimates for any query are interesting for at least two reasons. First, webometric research has used these hit counts as the raw data for many studies of web information (e.g., Aguillo, Granadino, Ortega, & Prieto, 2006; Ingwersen, 1998). Second, from an information retrieval perspective, it is useful to know how reliable the estimates reported by search engines are. In response, several researchers set out to systematically analyse variations in the results reported by commercial search engines. First, a comparison of results for the same query over short periods of time showed that fluctuations of several orders of magnitude could occur, and also that sets of related queries could give inconsistent results (Snyder & Rosenbaum, 1999). Second, Rousseau (1999) tracked the variation over time of specific queries in NorthernLight and AltaVista, showing that the results tended to be quite stable but were subject to occasional large fluctuations, presumably due to software or hardware upgrades. Bar-Ilan (1999) investigated the results of six search engines in more detail, discovering that they forgot information, in the sense that URLs were occasionally not reported in results, and that these URLs pointed to information that was not available elsewhere in the returned results. Subsequent research encompassed Google and tracked the coverage of a large set of web sites, finding a pattern of stability with occasional sudden changes (Thelwall, 2001). Mettrop and Nieuwenhuysen (2001) also used a time series approach, but with a set of controlled seed URLs in order to obtain more detailed information on search engine performance. They confirmed that search engines sometimes did not report a page even when it matched a query and was in their index (Mettrop & Nieuwenhuysen, 2001). Bar-Ilan describes search engines as ‘concealing’ a page when they fail to report it despite the page matching the query and being in their database (Bar-Ilan, 2002).

In conclusion, search engines should be viewed as engineering products, designed to produce fit-for-purpose results, but not as mathematical “black boxes” that deliver logically correct results. Search engines may take shortcuts when estimating and returning results in order to improve their speed or efficiency of operation. For example they may only search a fraction of their index for a query, stopping when they run out of time or have found enough results. See also (Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001; Bar-Ilan, 2004; Brin & Page, 1998) for technical issues that may impact on search engine results.
