Discovering and Using Meta-Terms

W. Bruce Croft, Principal Investigator

Many queries, particularly “content-based” queries, contain terms that are difficult to match directly with documents. Transforming these queries using replacements or expansions for these terms can make a substantial difference to performance. One of the interesting phenomena that can be observed in the TREC Terabyte track (using Web data) is that adding the narrative description, which is often verbose and usually contains irrelevant terms, to the short keyword query results in a significant average increase in effectiveness. Our hypothesis is that this happens because the narrative often contains a small number of key terms that make a significant difference to the ability of the system to match relevant text. We believe that many of these important terms are in fact instances, examples, or more specific forms of more general query terms, which we call “meta-terms”. Some examples of meta-terms found in recent TREC queries, together with related words that can readily be found on the Web and that make a difference to retrieval performance, include “marine mammals” (“whales”), “candy makers” (“hershey”), “volcanic activity” (“lava flow”), “civil war battle” (“antietam”), “massachusetts textile mills” (“lowell”), “mass transit” (“rail”), and “alternative fuel cars” (“hydrogen”). We propose to develop techniques to mine replacements or related words from the Web for these meta-terms and to show that effectiveness can be significantly improved by incorporating them in queries.

Although this proposed technique is related to query expansion, current query expansion techniques are either too coarse, resulting in many unrelated terms (pseudo-relevance feedback), or too generic, resulting in too few expansions or the wrong expansions (e.g., using WordNet). We do not expect simply to make the query longer by adding terms; we will also investigate approaches that would be better described as reformulating or transforming the query.

In this project we are using both the Microsoft query logs and the TREC GOV2 collection to develop techniques to discover meta-terms in queries and then mine related words from the Web. More specifically, we are performing a relatively simple analysis of queries in the log (or the TREC queries) to determine candidate meta-terms. Those terms are then used in queries that look for the occurrence of specific patterns in Web documents. If a significant number of pattern matches are found, the associated text will be analyzed, using both simple linguistic techniques and frequency counts, to extract the instances or related words. An obvious (and effective) example of a pattern for finding potential meta-terms and their instances is <meta term> “such as” <instances>. We manually check the results of this analysis for quality and improve the techniques as appropriate. We may also use bootstrapping techniques to learn new patterns. Pattern-based extraction techniques like this have been used successfully in recent question-answering research, and our approach, although related, uses simpler patterns to improve more general queries. The result of this stage of the research will be information about the frequency of meta-terms in queries, the type of meta-terms that are found, and a dictionary or thesaurus of meta-terms and their related words that can be used for retrieval experiments.
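As an illustration of the pattern-plus-frequency idea, the following is a minimal Python sketch, not the project's actual implementation. The function name `extract_instances`, the `min_count` threshold, and the splitting rules are all assumptions made for the example; the input is assumed to be a list of text snippets already retrieved from the Web.

```python
import re
from collections import Counter

def extract_instances(meta_term, snippets, min_count=2):
    """Count candidate instances that follow '<meta_term> such as' in snippets.

    Hypothetical helper: the threshold and the splitting rules are
    illustrative assumptions, not the project's method.
    """
    # Capture the text after "such as" up to the next sentence-like boundary.
    pattern = re.compile(
        re.escape(meta_term) + r",?\s+such\s+as\s+([^.;:!?]+)",
        re.IGNORECASE,
    )
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            # Split the enumeration on commas and on the words "and" / "or".
            for part in re.split(r",|\band\b|\bor\b", match.group(1)):
                candidate = part.strip().lower()
                if candidate:
                    counts[candidate] += 1
    # Keep only candidates seen often enough; rare fragments are mostly noise.
    return [(term, n) for term, n in counts.most_common() if n >= min_count]
```

The frequency threshold matters because this crude splitting also produces fragments such as trailing clause text; fragments of that kind rarely repeat across documents, whereas genuine instances tend to.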

The next step of the research will be to use the meta-term dictionary to carry out retrieval experiments and to test various approaches to query reformulation or transformation. This will include techniques such as simple expansion (unlikely to be effective), reranking the initial retrieved set, and generating multiple queries and combining results. We will investigate using the query log with the click-through data to evaluate searches, and the TREC data will provide some solid baseline performance figures. By clustering the query log using both text and click-through overlap, we hope to find examples of meta-term replacement and click-through data that can be used as relevance judgments. We also expect to have to make some manual relevance judgments of retrieved Web pages.
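One way to picture "generating multiple queries and combining results" is rank fusion over the result lists of several reformulated queries. The sketch below uses reciprocal rank fusion, which is chosen here purely for illustration; the text does not say which combination method will be used. The function name, the constant `k=60`, and the document identifiers in the usage example are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each list is assumed to come from one reformulated query, e.g. the
    original query with a meta-term replaced by one of its instances.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in many lists accumulate more score.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy usage with made-up document ids.
merged = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],   # results for the original query
    ["d2", "d4"],         # results with one instance substituted
    ["d2", "d1"],         # results with another instance substituted
])
```

A document that several reformulations agree on rises to the top even if no single query ranks it first, which is the behavior one would hope for when each instance captures only part of the meta-term's meaning.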

Apart from evaluation, another issue we will need to address is the potential ambiguity of meta-terms. For example, if we simply look for instances of “textile mills” on the Web, we will find many mills that are not located in Massachusetts. The general approach we intend to use is to treat the other terms in the query (in this case, “Massachusetts”) as background that disambiguates the pattern results without being part of the pattern itself.
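The background-term idea can be sketched as a filter applied after pattern matching. This is a simplified illustration under stated assumptions: the function name `filter_by_background` is hypothetical, pattern matches are assumed to arrive as (instance, context) pairs, and "background" is reduced to a plain substring test on the surrounding text, which a real system would likely replace with something softer.

```python
from collections import Counter

def filter_by_background(matches, background_terms):
    """Count instances whose surrounding context mentions every background term.

    matches: iterable of (instance, context_text) pairs produced by a
    pattern such as '<meta term> such as <instances>'.
    background_terms: the remaining query terms, which are NOT part of
    the pattern itself.
    """
    wanted = [term.lower() for term in background_terms]
    kept = Counter()
    for instance, context in matches:
        text = context.lower()
        # Require all background terms somewhere in the surrounding text.
        if all(term in text for term in wanted):
            kept[instance.lower()] += 1
    return kept
```

With this arrangement the pattern stays general ("textile mills such as ..."), so it still matches frequently, while the background term decides which of those matches count toward the query at hand.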

This project is supported by Microsoft Live Labs.