Principal Investigators:
W. Bruce Croft, PI
James Allan, Co-PI
Center for Intelligent Information Retrieval (CIIR)
Computer Science Department
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264
Project Summary
This project is part of the Advanced Question Answering for Intelligence (AQUAINT) program. Current question answering (QA) systems are based on a combination of heuristic techniques, require considerable knowledge engineering, are unpredictable in their responses to very similar questions, and are restricted in the types of questions they can answer. To extend the QA paradigm to questions that require more complex answers, a more solid basis for the design of QA systems is needed. In particular, a formal framework for QA could support more principled development of algorithms for dealing with structured data, answer granularity, and answer updating. In this project, we describe how the statistical language modeling framework used for information retrieval can be adapted to QA, and suggest how this framework could serve as the basis for a more general QA system that is capable of learning what constitutes an appropriate answer to a question.
The language model approach to information retrieval was introduced recently and has had considerable success. Retrieval algorithms based on this approach are simple, do not make use of heuristic weights, and are very effective. The language model framework has been successfully used to describe a number of important processes in information retrieval, such as query expansion and cross-lingual retrieval, which were difficult to incorporate in previous probabilistic retrieval models.
The phrase "language model" is used by the speech recognition community to refer to a probability distribution that captures the statistical regularities of language generation. Generally speaking, language models for speech attempt to predict the probability of the next word in an ordered sequence. The typical language modeling approach to information retrieval is to infer a language model for each document and to estimate the probability that each of these models generates the query; documents are then ranked by these probabilities. As a framework for question answering, however, this approach is too limited: questions cannot be regarded simply as samples from document models.
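For reference, the query-likelihood approach just described ranks a document D by the probability that its language model M_D generates the query Q = q_1 ... q_n,

    P(Q | M_D) = \prod_{i=1}^{n} P(q_i | M_D),

where P(q_i | M_D) is typically smoothed against a background collection model, for example P(q_i | M_D) = \lambda P_{ml}(q_i | D) + (1 - \lambda) P(q_i | C) for some interpolation weight \lambda.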
Earlier work on probabilistic models of information retrieval took a conceptually different approach: it attempted to model word occurrences in the relevant and non-relevant classes of documents, and used those models to classify each document into the more likely class. From a language modeling standpoint, classical probabilistic models of retrieval can be viewed as estimating the relevance model, that is, the language model of the class of relevant documents. For the retrieval task, documents were ranked by the probability that they belong to the relevant class. The primary obstacle to constructing effective models of relevance is the lack of training data: in a typical retrieval environment we are given a query, a large collection of documents, and no indication of which documents might be relevant. Recent advances in techniques for estimating models from limited data have been used successfully to address this problem.
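One estimation technique of this kind, used here for illustration, approximates the relevance model by averaging document language models, weighted by how strongly each document is associated with the query:

    P(w | R) \approx \sum_{D} P(w | M_D) P(M_D | Q),  where  P(M_D | Q) \propto P(Q | M_D) P(M_D).

The essential point is that a usable relevance model can be estimated from the query alone, without any explicitly judged relevant documents.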
The relevance model is a description of an information need or, alternatively, of the topic area associated with that need. The relevance model approach does not, however, address the problem that query texts often do not resemble document texts. Topic words in a question can reasonably be viewed as being generated by a relevance model, but structure words are very unlikely to occur in the same form in relevant text. Rather than regarding a question as being generated by a simple relevance model, we could model the question words as being generated by a mixture of a topic (relevance) model and a question-type model. The question model would be a probabilistic description of the typical words, phrases, and word orders associated with questions of a particular type.
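One simple form such a mixture could take is a linear interpolation, in which each question word q_i is generated as

    P(q_i) = \lambda P(q_i | M_topic) + (1 - \lambda) P(q_i | M_type),

where M_topic is the relevance (topic) model, M_type is the question-type model, and the mixing weight \lambda would be estimated from training questions of that type.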
In the relevance model approach, documents are retrieved by computing the probability that the relevance model estimated from the query could generate them. This "document-likelihood" approach produces good results for document retrieval, but it is inadequate for question answering. In current QA systems, valid answers are determined by contexts much smaller than a whole document, such as individual sentences or short text passages. Instead of document likelihood, we therefore need to compute "answer likelihood" over a variety of possible contexts. In addition to the problem of answer contexts, the issue of answer "granularity" becomes important as we generalize the QA approach to handle more complex questions. In our approach, we associate an "answer model" with every question model; the answer model represents expectations about the form and granularity of the answer.
We propose to learn, for each question type, the word sequences and contexts associated with answers, based on training data. As with question models, we expect that many of the features in these models will be syntactic or structural rather than specific words. We also expect to address the granularity issue, at least partially, through training data. In other words, we need to capture a description of the typical form of an answer for different question types; to do this, the answer model will need to include syntactic and structural features such as the presence of lists or the size of the answer text. These types of features are different from the word-based language models that have been used primarily in IR, but frameworks for dealing with them have been developed in the speech recognition community.
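As a purely illustrative sketch of this estimation step, the fragment below builds a smoothed unigram answer model from a handful of training answer strings; the structural features (a list indicator and a coarse length bucket) are hypothetical stand-ins for the richer syntactic and structural features we intend to use.

    from collections import Counter

    def tokenize(text):
        # Crude whitespace/lowercase tokenizer; a real system would use NLP tools.
        return text.lower().split()

    def structural_features(text):
        # Hypothetical structural cues that sit alongside word features.
        feats = []
        if "\n-" in text or "\n*" in text:
            feats.append("HAS_LIST")
        feats.append("LEN_%d" % min(len(tokenize(text)) // 10, 5))  # coarse length bucket
        return feats

    def estimate_answer_model(training_answers, alpha=0.5):
        # Counts over words plus structural features, with additive smoothing
        # so that unseen events keep non-zero probability.
        counts = Counter()
        for ans in training_answers:
            counts.update(tokenize(ans))
            counts.update(structural_features(ans))
        total = sum(counts.values())
        vocab = len(counts)
        def prob(event):
            return (counts[event] + alpha) / (total + alpha * (vocab + 1))
        return prob

    # Toy usage: a model built from two training answers for a "when/where born" type.
    model = estimate_answer_model(["born in 1879 in Ulm", "born on March 14, 1879"])
    print(model("born"), model("unseen-word"))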
Given the answer model associated with the question model, the retrieval process will be a variation of that used in the relevance model approach. A document database will be considered to be a collection of answer contexts of different granularity. We will then rank the answer contexts by the probability that they could be generated by a mixture of the answer model and the relevance (topic) model. In addition, we aim for a system that, analogously to an information extraction system, tags the answer in the text.
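A minimal sketch of this ranking step, under the same assumptions, is shown below: candidate answer contexts (here, sentences) are scored by the log-probability that a two-component mixture of an answer model and a topic (relevance) model generates them. The toy model functions and the mixing weight are placeholders, not the models we will actually estimate.

    import math

    def score_context(tokens, p_answer, p_topic, lam=0.5):
        # Log-likelihood of the context under the two-component mixture.
        score = 0.0
        for w in tokens:
            score += math.log(lam * p_answer(w) + (1.0 - lam) * p_topic(w))
        return score

    def rank_contexts(contexts, p_answer, p_topic, lam=0.5):
        # Returns (score, context) pairs sorted from most to least likely.
        scored = [(score_context(c.lower().split(), p_answer, p_topic, lam), c)
                  for c in contexts]
        return sorted(scored, reverse=True)

    # Toy placeholder models; in practice these come from training data and the query.
    def toy_answer_model(w):
        return 0.05 if w in {"born", "1879"} else 0.001

    def toy_topic_model(w):
        return 0.02 if w == "einstein" else 0.001

    print(rank_contexts(["Einstein was born in 1879", "The sky was blue"],
                        toy_answer_model, toy_topic_model))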
One of the goals of this project is to extend the QA paradigm to include databases of structured data, tables, and metadata. We are particularly interested in Web data, where the structure is indicated mainly through HTML tags, metadata fields, and, occasionally, XML markup; such pages contain many facts and figures, as well as textual explanations, on a variety of topics and in a variety of formats. Although from one perspective the answers are more clearly delineated than in unstructured text, there is a significant challenge in identifying and ranking the possible answer contexts. The crucial steps in integrating structured data into the general QA language model framework will be generating appropriate candidate answer contexts and including metadata information, such as HTML tags, in the answer models.
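The fragment below is one illustration, using only the Python standard library, of how candidate answer contexts might be generated from HTML: each table cell, list item, or paragraph becomes a separate context labeled with the markup element that produced it, so the tag itself can later be used as a feature in an answer model. It is a sketch of the idea, not the planned implementation.

    from html.parser import HTMLParser

    class ContextExtractor(HTMLParser):
        CONTEXT_TAGS = {"td", "th", "li", "p", "title"}

        def __init__(self):
            super().__init__()
            self.stack = []
            self.contexts = []   # list of (tag, text) candidate answer contexts

        def handle_starttag(self, tag, attrs):
            self.stack.append(tag)

        def handle_endtag(self, tag):
            # Pop back to the matching open tag (tolerant of mildly malformed HTML).
            while self.stack:
                if self.stack.pop() == tag:
                    break

        def handle_data(self, data):
            text = data.strip()
            if text and self.stack and self.stack[-1] in self.CONTEXT_TAGS:
                self.contexts.append((self.stack[-1], text))

    # Toy usage on a small fragment of markup.
    extractor = ContextExtractor()
    extractor.feed("<table><tr><td>Capital</td><td>Boston</td></tr></table>"
                   "<ul><li>Founded: 1630</li></ul>")
    print(extractor.contexts)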
Question answering has a dynamic aspect that is mostly ignored by current systems. As new information becomes available or as new information resources are searched, answers may change or be modified. If we view a stream of documents as simply a larger database that is being examined sequentially, there is in fact little difference between updating over time and dealing with multiple possible answers derived from a large database. Our proposal, then, is to develop techniques for dealing with questions with multiple answers and adapt these techniques to dynamic environments. We also intend to integrate these techniques with the research on detection of novelty in event tracking.
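As a purely illustrative example of the kind of novelty test involved, the sketch below accepts a new candidate answer only if its smoothed word distribution is sufficiently far, in KL divergence, from every answer already accepted; the threshold and smoothing constant are hypothetical, and a real system would use richer answer models.

    import math
    from collections import Counter

    def distribution(text, vocab, eps=1e-6):
        # Smoothed unigram distribution of the text over a shared vocabulary.
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

    def kl(p, q):
        # KL divergence between two distributions over the same vocabulary.
        return sum(p[w] * math.log(p[w] / q[w]) for w in p)

    def is_novel(candidate, accepted, threshold=1.0):
        # Flag the candidate as a new answer if it diverges from all accepted ones.
        vocab = set(candidate.lower().split())
        for a in accepted:
            vocab |= set(a.lower().split())
        p = distribution(candidate, vocab)
        return all(kl(p, distribution(a, vocab)) > threshold for a in accepted)

    accepted = ["The merger was announced in March 2001"]
    print(is_novel("The merger was announced in March 2001", accepted))   # expected: False
    print(is_novel("Regulators blocked the merger in June", accepted))    # expected: True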
In summary, we are developing and evaluating a language modeling framework for question answering. The main data dimensions to be explored are TREC-style English databases and semi-structured HTML and XML data from the Web. The algorithms developed for this research will be implemented and evaluated with the new LEMUR toolkit being developed in a joint UMass/CMU project.
This work was supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the Advanced Research and Development Activity (ARDA) under contract number MDA904-01-C-0984.