Combining Multiple Sources of Evidence to Enhance Web Search Performance
- Author: Yang Kiduk
- Organization: Yang Kiduk
- Publish: Journal of Korean Library and Information Science Society Volume 45, Issue3, p5~36, 30 Sep 2014
The Web is rich with various sources of information that go beyond the contents of documents, such as hyperlinks and manually classified directories of Web documents such as Yahoo. This research extends past fusion IR studies, which have repeatedly shown that combining multiple sources of evidence (i.e. fusion) can improve retrieval performance, by investigating the effects of combining three distinct retrieval approaches for Web IR: the text-based approach that leverages document texts, the link-based approach that leverages hyperlinks, and the classification-based approach that leverages Yahoo categories. Retrieval results of text-, link-, and classification-based methods were combined using variations of the linear combination formula to produce fusion results, which were compared to individual retrieval results using traditional retrieval evaluation metrics. Fusion results were also examined to ascertain the significance of overlap (i.e. the number of systems that retrieve a document) in fusion. The analysis of results suggests that the solution spaces of text-, link-, and classification-based retrieval methods are diverse enough for fusion to be beneficial while revealing important characteristics of the fusion environment, such as effects of system parameters and relationship between overlap, document ranking and relevance.
웹은 하이퍼링크 및 야후와 같이 수동으로 분류된 웹 디렉토리 처럼 문서의 콘텐츠를 넘어선 다양한 정보의 소스가 풍부하다. 이 연구는 웹문서 내용을 활용한 텍스트기반의 검색 방식, 하이퍼 링크를 활용한 링크 기반의 검색 방식, 그리고 야후의 카테고리를 활용한 분류 기반의 검색 방식을 융합하므로서 여러 정보소스를 결합하면 검색 성능을 향상시킬 수 있다는 기존 융합검색연구들을 확장시켰다. 텍스트, 링크, 분류 기반 검색 결과를 여러가지 선형조합식으로 생성한 융합결과를 기존의 검색 평가 지표를 사용하여 각각의 검색 결과와 비교 한 후, 검색결과 오버랩의 중요성 또한 조사 하였다. 본 연구는 텍스트, 링크, 분류 기반 검색의 솔루션 스패이스들의 다양성이 융합검색의 적합성을 제시한다는 결론과 더불어 시스템 파라미터의 영향, 그리고 오버랩, 문서순위, 관련성들의 상호 관계 같은 융합 환경의 중요한 특성들을 분석하였다.
Fusion , Web search , Information retrieval
The Web document collection is rich in sources of evidence(e.g., text, hyperlinks, Web directories) and thus offers an opportunity to employ a diverse set of retrieval approaches that can leverage a wide range of information. Furthermore, findings from fusion information retrieval (IR) research suggest that multiple sources of evidence(MSE) and multiple methods that leverage them could be combined to enrich the retrieval process on the Web. The nature of the Web search environment is such that retrieval approaches based on single sources of evidence suffer from weaknesses that can diminish the retrieval performance in certain situations. For example, text-based IR approaches have difficulty dealing with the diversity in vocabulary and quality of web documents, while link-based approaches can suffer from incomplete or noisy link topology. The inadequacies of singular Web IR approaches coupled with the fusion hypothesis of “fusion is good for IR” make a strong argument for combining MSE as a potentially advantageous retrieval strategy for Web search.
The first step in fusion IR is to identify and examine the sources of evidence to combine. MSE on the Web that can be leveraged for IR are: contents of Web documents, characteristics of Web documents, hyperlinks, Web directories such as Yahoo, and user statistics. The main hypothesis of the study, namely that combining text-, link-, and classification-based
1)methods can enhance Web search performance, is based on three observations. First, each of text-, link-, and classification-based approaches suffers from individual weaknesses that hinder optimum retrieval performance. Second, all three approaches have complementary strengths that can boost retrieval performance when combined. Third, there is ample evidence in literature that suggests fusion to be beneficial for IR.
Text-based approaches have difficulties dealing with the vocabulary problem(e.g., different expressions of the same concept in Web documents and queries), the diversity of document quality and content, fragmented documents, and documents with little textual contents, such as “hubs”, index pages, and bookmarks. Link-based approaches do not fare well when faced with a variety of link types(e.g., citation links, navigational links, commercial links, spam links), and “emerging” communities with incomplete link topologies. Web directories, in addition to classification and vocabulary problems(e.g., different categorizations of the same Web document and different labeling of the same category), contain only a fraction of the documents on the Web.
The most obvious of the complementary strengths is found in the combination of text- and link-based approaches. Ranking text-based retrieval results by a measure “link importance” can help differentiate documents with similar textual contents but varying degrees of importance, popularity, or quality. Pruning documents based on their textual contents before applying link analysis techniques can help alleviate the problems introduced by spurious links. On an abstract level, web directories, which embody explicit human judgments about the quality and topicality of documents, can augment ranking algorithms based on counting of words or links. Specifically, Web directories can be used to train document classifiers, to find related documents, and to help refine queries by finding subcategories, searching within a category, and suggesting related categories, all of which employ text- and/or link-based methods to exploit the information contained in Web directories.
Combining text-, link-, and classification-based methods can be viewed as attempting to maximize the combined strengths of leveraging author’s knowledge, peer’s knowledge, and cataloger’s knowledge about Web documents while minimizing their individual weaknesses. Although the idea of fusion is intuitively appealing, there is a shortage of techniques that utilize the considerable body of explicit human judgment(e.g. Web directories
2)) in combination with hyperlinks and textual contents of Web documents. IR research dealing with knowledge organization focus mostly on automatic clustering and classification of documents, and there is little investigation on how existing hierarchical knowledge bases like Yahoo can be brought into the fusion of text retrieval and link analysis techniques in Web IR. Consequently, this work explores the question of combining link analysis, content analysis, and classification-based techniques to improve retrieval performance.
Two best-known link analysis methods, which are based on the notion that hyperlinks contain implicit recommendations about the documents to which they point, are Page Rank and HITS (Yang 2005).
Page Rank is a method for assigning a universal rank to Web pages based on a recursive weight-propagation algorithm(Page et al. 1998). Page et al. start with the notion of counting backlinks(i.e., indegree) to assess the importance of a Web page, but point out that simple indegree does not always correspond to importance; thus they arrive at propagation of importance through links, where a page is important if the sum of the importance of its backlinks is high.
Kleinberg’s(1999) HITS(Hyperlink Induced Topic Search) algorithm considers both inlinks and outlinks to identify mutually reinforcing communities of “authority” and “hub” pages. Though HITS embraces the link analysis maxim that says a hyperlink is an annotation of human judgment conferring authority to pointed pages, it differs from other link-based approaches in several regards. Instead of simply counting the number of links, HITS calculates the value of page
pbased on the aggregate values of pages that point to por are pointed to by p, much in the same fashion as Page Rank. HITS, however, differs from Page Rank in three major points. First, it takes into account the contributions from both inlinks and outlinks to compute two separate measures of a page’s value, namely authority and hub scores, instead of the single measure of importance like Page Rank. Second, HITS measures pages’ values dynamically for each query, rather than assigning their global scores once and for all regardless of any query. Third, HITS scores are computed from a relatively small subset of the Web instead of the totality of the Web.
The bulk of IR fusion research, which investigates the various ways of combining different retrieval strategies, have found fusion to have a positive effect on retrieval performance regardless of what strategies were combined. The potential of fusion to leverage the strengths of its components while minimizing their weaknesses is not only promising in its own right, but offers a novel perspective of IR that relaxes the research goal of discovering the one best retrieval strategy.
Earlier studies on combining different document representations discovered that combined representations achieved better retrieval outcome than single representations(e.g., title and keyword vs. title or keyword) despite the fact that the difference in retrieval performance among single representations were small(Keen 1973; Spark Jones 1974). To explain this phenomenon, subsequent studies examined the overlap between different document representations(i.e., common terms) and found overlap to be small(Williams 1977; Smith 1979). A more systematic study of different document representations' relative effectiveness was later conducted by Katzer et al.(1982), who executed 84 queries on seven representations of 12,000 INSPEC documents and found but higher overlap among relevant documents than nonrelevant documents. The relationship between overlap and relevance was further studied from the perspective of the searcher by Saracevic and Kantor(1988), who found that the odds of a document being relevant increased monotonically with the number of retrieved sets in which it appeared. Fox and Shaw(1994; 1995), in combining different retrieval methods as well as query representations, discovered that combining dissimilar sources of evidence was better than combining similar ones.
Lee(1996) experimented with the fusion of multiple relevance feedback methods and found that different relevance feedback methods retrieve different sets of documents even though they achieve a similar level of retrieval effectiveness. By examining the overlap of relevant and non relevant documents retrieved by different retrieval systems separately, Lee(1997) observed what he calls the “unequal overlap property”, which is the phenomenon that different systems tend to retrieve similar relevant documents but dissimilar nonrelevant documents. The unequal overlap property challenges the fusion assumption by Belkin et al.(1993), which suggests that retrieval performance improvement by fusion may be partially due to combining of different relevant documents retrieved by different sources of evidence. In fact, Lee hypothesized that fusion is warranted when systems retrieve similar sets of relevant documents but different sets of nonrelevant documents. Lee’s hypothesis about the condition for effective fusion was more formally validated by Vogt and Cottrell(1998). Based on the performance estimation of a linearly combined system based on measurable characteristics of the component systems, they concluded that to achieve effective fusion by linear combination, one should maximize both the individual system’s performance and the overlap of relevant documents between systems, while minimizing the overlap of nonrelevant documents. They also suggested that component systems should distribute scores similarly but not rank relevant documents similarly.
We should note at this point that the art of fusion lies in discovering how different experts or sources of evidence can be combined to exploit their strengths while at the same time remaining unaffected by their weaknesses. As the study by Bartell et al.(1994) demonstrates, the fusion solution space is defined not only by the fusion algorithm but also by how that fusion algorithm is applied. In other words, the optimal fusion approach should consider the context of fusion in determining its overall fusion strategy.
This study combines methods that leverage text, hyperlink, and Web directory information to improve retrieval performance. Retrieval results of methods that leverage Web page text, hyperlink, and Yahoo Web directory information, both combined and individual, were examined to determine whether such fusion approaches are beneficial for Web search. The following sections describe the experiment in more detail.
The data used for the study is the WT10g test collection from Text REtreival Conference (http://trec.nist.gov/), which consists of 1.7 million Web documents, 100 queries(topics 451-550), and relevance judgments. The WT10g collection also includes the connectivity data, which provides lists of inlinks and outlinks of all documents in the collection. The connectivity data, however, is restricted to the WT10g universe, meaning that inlinks and outlinks that are not part of the collection themselves are omitted, which is likely to make the link topology of the collection incomplete. The document set of WT10g is comprised of 1,692,096 html pages, arrived at by sampling the original Internet Archive data in such a way to include a balanced representation of the real Web characteristics such as hyperlink structure and content types. Duplicates, non-English, and binary documents were excluded from the collection. Queries, or topics as they are called in TREC, consist of a title field, containing actual queries as they were submitted to Web search engines, a description field, which is typically a one sentence description of the topic area, and a narrative field, containing descriptions of what makes documents relevant. The description and narrative fields were created by TREC assessors to fit the intent of the real Web search queries represented in the title.
The characteristics of a Web directory, such as breadth of coverage, consistency of classification, and granularity of categories, are important factors to consider in determining the data source for classification-based retrieval methods. An ideal Web directory would be one that has classified all the documents of the test collection into fine-grained categories in a consistent manner. Lacking the ideal Web directory, Yahoo(http://yahoo.com) was used to leverage Web directory information for its size and popularity. Instead of using the actual Web documents associated with Yahoo categories, the classification-based method uses document titles and Yahoo’s descriptions of the categorized pages to represent each categorized document. This not only speeds up processing but also reduces noise in representation, which is a preferred practice in text categorization. In addition, the annotated description of a Yahoo site, along with the category to which it belongs, represent the cataloger’s knowledge about the classified document, which complements the author’s knowledge embodied in the document text and peers’ knowledge embodied in hyperlinks that point to the document.
2.1 Text-based Retrieval Method
The text-based retrieval component is based on a Vector Space Model(VSM) using the SMART length-normalized term weights (Singhal et al. 1996). Documents are processed by first removing markup tags and punctuation and then excluding stopwords, low frequency terms, non-alphabetical words(exception: embedded hyphen), words consisting of more than 25 or less than 3 characters, and words that contain 3 or more repeated characters. After punctuation and stopword removal, each word was conflated by applying the simple plural remover(Frakesand Baeza-Yates 1992). The simple plural remover was chosen to speed up indexing time and to minimize the overstemming effect of more aggressive stemmers. TREC topics were stopped and stemmed in the same manner as the document texts.
In addition to body text terms(i.e. terms between < BODY> and < /BODY> tags), header text terms were extracted from document titles, meta-keyword and description texts, and heading texts (i.e. texts between
andtags). A combination of body and header text terms was also created, where the header text terms were emphasized by multiplying the term frequencies by 10.In each of the three term sources(i.e. body, header, body and header), noun phrases were identified to construct noun phrase indexes. A noun phrase is defined as up to three adjacent nouns or capitalized words within a phrase window.
The VSM method used SMART
Lnuweights for document terms(Buckley et al. 1996; 1997) and SMART ltcweights (Buckley et al. 1995) for query terms. Lnuweights attempt to match the probability of retrieval given a document length with the probability of relevance given that length (Singhal et al. 1996).The formula for Lnudocument term weight is:
fikis the number of times term kappears in document i(i.e. in-document frequency), avgfiis the average in-document frequency for document i, Tis the number of unique terms in the collection, piis the average number of unique terms in a document i, and slopeis the normalization parameter that can be adjusted for a given document collection.The formula for ltcquery term weight is:
fkis the number of times term kappears in the query, and idfkis the inverse document frequency of term k. The denominator is a document length normalization factor, which compensates for the length variation in queries.
In addition to initial retrieval results, top ten positive and top two negative weighted terms from the feedback vector, created by the adaptive linear model using the top three ranked documents of the initial retrieval result as “pseudo-relevant”, were used as a query to generate the pseudo-feedback retrieval results. The basic approach of the adaptive linear model, which is based on the concept of preference relations from decision theory(Fishburn 1970), is to find a
solution vectorthat will rank a more-preferred document before a less-preferred one(Wong et al. 1988). The solution vector is arrived at via an error-correction procedure, which begins with a starting vector q(0) and repeats the cycle of “error-correction” until a vector is found that ranks documents according to the preference order estimation based on relevance feedback(Wong et al. 1991). The error-correction cycle iis defined by
wherea is a constant, and
bis the difference vectorresulting from subtracting a less-preferred document vector from a more preferred one(Sumner et al. 1998).
Table 1 enumerates the text-based method parameters for VSM systems, which are query length, term source, use of phrase terms, and use of pseudo-feedback. Query length ranges from short(topic title) and medium(topic title and description) to long(topic title, description, and narrative). Term sources are body text, header text, and body plus header text. The use of noun phrase indicates whether the term index for each term source contains both single and phrase terms or single terms only. The combination of parameters(3 query lengths, 3 term sources, 2 for phrase use, 2 for feedback use) resulted in 36 VSM systems.
2.2 Link-based Retrieval Method
For the study, the authority score of documents computed by the HITS algorithm (Kleinberg 1999) is used to generate a ranked list of documents with respect to a given query. HITS was chosen over PageRank scores, whose effective computation requires a link propagation over much larger set of linked documents than the WT10g corpus(Brinand Page 1998), and the Clever algorithm(Chakrabarti et al. 1998), which makes it difficult to isolate the contributions and behaviors of individual methods by implicitly combining link- and text-based methods to extend HITS.
HITS defines “authority” as a page that is pointed to by many good hubs and defines “hub” as a page that points to many good authorities. Mathematically, these circular definitions can be expressed as follows:
The above equations define the authority weight
a(p)and the hub weight h(p)for each page p, where p→ qdenotes “page phas a hyperlink to page q”. HITS starts with a root set Sof text-based search engine results in response to a query about some topic, expands Sto a base set Twith the inlinks and outlinks of S, eliminates links between pages with the same domain name in Tto define the graph G, runs an iterative algorithm on Guntil convergence, and returns a set of documents with high h(p)weights(i.e. hubs) and another set with high a(p)weights(i.e. authorities).
The original HITS algorithm was modified by adopting a couple of improvements from other HITS-based approaches. As implemented in the ARC algorithm (Chakrabarti et al. 1998), the root set was expanded by 2 links instead of 1 link(i.e. expand
Sby all pages that are 2 link distance away from S). Also, the edge weights by Bharat and Henzinger(1998), which essentially normalize the contribution of authorship by dividing the contribution of each page by the number of pages created by the same author, was used to modify the HITS formulas as follows:
In above equations,
auth_wt(q,p)is 1/ mfor page q, whose host has m documents pointing to p, and hub_wt(p,q)is 1/ nfor page q, which is pointed by ndocuments from the host of p. To establish a definition of a host, we opted for a simplistic method based on URL lengths. Short host form was arrived at by truncating the document URL at the first occurrence of a slash mark (i.e. ‘/’), and long host form from the last occurrence.
Among the 36 text-based system results, we chose the best performing system with all variations of query lengths as the seed sets. The combination of host definition and seed set parameters, as seen in Table 2, resulted in 6 HITS systems.
2.3 Classification-based Retrieval Method
The first step of the classification-based method is to find the categories that best match the query, which can be thought of as a query-category matching problem. Since Web directories classify only a fraction of the Web and do not rank documents that are classified, a second step that classifies and ranks documents with respect to the best matching category is needed, which can be thought of as a category-document matching problem.
To leverage the classification information, we employed the
Term Match(TM) method that selects the best Yahoo categories for a query based on matching of query-category terms. The first step of the TM method matches query terms to terms in the Yahoo categories, which we reextracted from category labels, site titles, descriptions and URLs, to generate a ranked list of categories. In the second step, an expanded query vector is constructed from the best matching categories to produce a ranked list of the WT10g documents. The expanded query vector consists of terms in the original query, the label of the best matching category, and the titles and descriptions of the top three sites 3)in the best matching categories. The term weight of the expanded query vector, which is computed by multiplying the term-category association weight by term frequency and dividing by the vector length, can be expressed as:
cdkcis the association weight of term kto category c, f ’kis the total number of times term kappears in the query, the label of category c, and the titles and descriptions of the top three sites of category c, and the denominator is the length-normalization factor. The term association weight, which estimates the probability of association between terms and categories (Plauntand Norgard 1998),is computed by the formula below 4):
logL(p,k,n) = k log p + (n − k)log(1 − p),, , , k1 = AB, n1 = AB+-AB, k2 = A-B, andn2 = A-B+-A-B.
The TM method finds the best category for a query based on the number of matching query terms, and expands the original query with selected category terms that are weighted by term association weights to rank documents. The way the TM method leverages the classification information is twofold. First, it uses manually assigned category terms(i.e., category labels, site titles and descriptions) to find the best matching categories and to expand the query. Second, it uses term association weights, which are based on term-category co-occurrence, to compute the term weights of the expanded query vector. In other words, the importance of category labels as well as multi-term concepts are underscored in ranking categories by the number of unique query terms in category labels, while the importance of term and category co-occurrence in the classification hierarchy is emphasized in the term weights of the expanded query vector.
The parameters tested for the TM systems are number of top (i.e., best matching) categories used, WT10g term index, and use of pseudo-feedback. The combination of the parameters (3 top categories, 4 WT10g term index, 2 for feedback use) resulted in 24TM systems (Table 3).
2.4 Fusion Method
How to combine or integrate fusion components is the key question in fusion IR. One of the most common fusion methods is the
Weighted Sum(WS) formula Bartell et al. 1994; Modhaand Spangler 2000), which computes the fusion score by the weighted sum of individual retrieval scores. When fusion component systems are highly distinct from one another, normalization of retrieved documents cores across systems may not be able to compensate for differences in document ranking functions. This is the case with fusion of text-, link-, and classification-based retrieval systems, whose document scores are computed in different manners to measure different values of a document with respect to a query. More specifically, the document score of VSM systems measures the textual similarity between a query and a document, the document score of HITS systems represents the hyperlink-conferred authority of a document for the topic of a query, and the document score of TM systems measures the likelihood of a document belonging to the same category as the query. In such scenarios, it might be useful to merge the ranks of a document instead of the scores of documents.
To compensate for the differences among fusion component systems, the
Weighted Rank Sum(WRS) formula, which uses rank-based scores(e.g., 1/rank) in place of document scores of the WS formula, was tested:
FS= fusion score of a document, wi = weight of system i, RSi = rank-based score of a document by system I.
Although the WRS formula aims to weight the contributions of individual fusion components to the retrieval outcome by their relative strength, it does not explicitly differentiate between overlapped and non-overlapped instances. In other words, the absolute contribution of a document retrieved by a system remains the same whether the document in question is retrieved by other systems or not. What WRS formula neglects is the possibility that the contribution of a document by a system to the retrieval outcome could be different across overlap partitions(e.g., documents retrieved by system 1 only, by system 1 and 2 only, etc.).
To leverage the retrieval overlap and rank information, we devised two additional fusion formulas.
Overlap Weighted Rank Sum(OWRS), attempts to leverage overlap while compensating for the differences among fusion component systems by weighting rank-based scores by overlap partitions(Equation 9). Rank-Overlap Weighted Rank Sum(ROWRS) is a variation of the OWRS formula that considers not only the overlap partition but also the rank at which a document is retrieved in its computation of weights(Equation 10).
FS= fusion score of a document, wik = weight of system i in overlap partition k, RSi = rank-based score of a document by system i.
FS= fusion score of a document, wikj = weight of system i in overlap partition k at rank j, RSi = rank-based score of a document by system i.
In all the formulas of Weighted Sum variation, topics 451 to 500 were used as training data to determine the weights. Overall average precision, which is a single-value measure that reflects the overall performance over all relevant documents, was used to determine the weight in the WRS formula. In the OWRS formula, overall average precision was multiplied by overlap average precision, which is the average precision computed for each overlap partition. In a 3-system fusion, for example, average precision is computed for each of the 4 overlap partitions for each system. In other words, the result set of a system is partitioned into overlap partitions (e.g., for system A: documents retrieved by system A, by system A and B, by system A and C, by system A, B, and C), and average precision is computed in each partition of each system.
For the ROWRS, which needs a performance estimate at a given rank, overall average precision is not appropriate. Instead, three rank-based measures, namely
Precision(P), Effectiveness(F), and Success/Failure(sf)at each rank, were used to compute the weights in three versions of the ROWRS formula. Since weights based on rank-based measures can be overly sensitive to the exact rank of a document, they were applied in “rank blocks”(e.g., ranks 1 to 10, 11 to 20, etc.). In other words, fusion component scores( RSiin Equation 10) in a given rank block had the same weights, which was determined by averaging rank-based measures over rank blocks.
Pand Fmeasures are based on performance up to a given rank k(e.g., number of relevant documents in the top kresults), sfmeasure is based on the retrieval success/failure at each rank(e.g., 1/ kif the document at rank kis relevant, 0 otherwise). The sfmeasure estimates the system performance at a given rank interval without regard to its performances in prior rank intervals in an attempt to boost the results of systems that retrieve relevant documents at lower ranks. For example, a non-relevant document at rank 101 with 100 relevant documents in rank 1 to 100(doc-A) will have much higher Pand Fthan a relevant document at rank 101 with 0 relevant documents in rank 1 to 100(doc-B), but the sfof doc-B will be higher than the sfof doc-A. When fusion components include systems that retrieve relevant documents in lower-ranked clusters, such an approach might be beneficial. Since HITS systems can retrieve relevant documents in the non-principal communities of hubs and authorities, and the best matching category of TM systems is not always the top-ranking category, it would be interesting to see how sf-based weighting would perform in overlap rank-based fusion formulas.
For both WRS and OWRS formulas, three variations that amplify the contribution of the top performing system were investigated. These variations, in an increasing order of emphasis for the top system, are
Top System Pivot 1(pivot1), Top System Pivot 2(pivot2), and Overlap Boost(olpboost). The basic idea here is to su