Interactive Construction of Complex Query Models

Principal Investigator:
James Allan, PI
allan@cs.umass.edu

Center for Intelligent Information Retrieval (CIIR)
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts Amherst
Amherst, MA 01003-9264

Project Abstract

This research program will investigate and implement SearchIE, a search-based approach to information "extraction." SearchIE will allow rapid, personalized, situational identification of types of objects or actions in text, where those types are likely to be useful for a complex search task. Modern search engines often provide some mechanism to indicate that a query keyword matches a document only if it occurs in the name of a person or in a location. To make that possible, annotators find and mark a large number of person names (for example) in text, a machine learning algorithm is applied to learn which low-level features are indicative of the name type, and the resulting classifier for that type is run across the collection of documents. It is then possible to write a query that means "paris used as a person's name rather than a location." Unfortunately, the existing approaches do not serve searchers interested in novel, unanticipated types, for example, names of whaling ships, officers in Queen Victoria's navy, or local watering holes. Such types cannot be handled currently because the classifiers need to be trained and run ahead of time, which requires an expensive data-labeling process that is too daunting for many search tasks. Since online information gathering almost always starts with search and frequently involves identifying items of interest in the found text, bringing the two together has the potential to change both substantially. The SearchIE approach makes it possible for someone to build personalized extractors contextualized by their topical interest. The research will surface new challenges that were not obvious in the traditional annotate-learn-tag approach to extraction. It will also sidestep the problem of training globally for local tasks, since ambiguity in features is removed when only locally relevant information is used.

It does not appear that the information extraction task has ever been approached directly as a search task. SearchIE is unique in bringing an information retrieval (search) mindset to the extraction problem, providing new capabilities that are either impossible or extremely difficult in the traditional "annotate then detect" model of the problem. This project will investigate the fundamental issues raised by the SearchIE approach. What models can best integrate extraction and search in a new setting where they can truly happen simultaneously? How can a searcher describe and edit a model for the types of interest? Can an interactively developed model be a springboard into a machine-learned model, and when is there enough information to do that? Does using topical context to limit the scope of extraction provide the expected accuracy gains under SearchIE's approach? What data structure modifications are needed to fully implement SearchIE so that it is efficient as well as effective? How well does this approach fare on additional standard test collections? These systems and algorithmic issues are fundamental problems, and addressing them has the potential to greatly impact both search and extraction.

Publications:

IR-1054: Foley, J., O'Connor, B. and Allan, J., "Improving Entity Ranking for Keyword Queries," in the Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, IN, Oct. 24-28, 2016, pp. 2061-2064.

IR-1092: Foley, J., Sarwar, S. and Allan, J., "Named Entity Recognition with Extremely Limited Data," in the ACM SIGIR 2018 Workshop on Learning from Limited or Noisy Data (LND4IR ’18), Ann Arbor, Michigan, USA, July 12, 2018.

IR-1124: Sarwar, S., Foley, J. and Allan, J., "Term Relevance Feedback for Contextual Named Entity Retrieval," in the Proceedings of the 2018 ACM SIGIR Conference on Human Information Interaction & Retrieval (CHIIR ’18), New Brunswick, NJ, March 11-15, 2018, pp. 301-304.

IR-1139: Cohen, D., Foley, J., Zamani, H., Allan, J. and Croft, W. B., "Universal Approximation Functions for Fast Learning to Rank," in the Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’18), Ann Arbor, MI, Jul. 8-12, 2018, pp. 1017-1020.

IR-1173: Montazeralghaem, A., Zamani, H. and Allan, J., "A Reinforcement Learning Framework for Relevance Feedback," in the Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), Xi'an, China, July 25-30, 2020, pp. 59-68.

IR-1174: Dalton, J., Naseri, S., Dietz, L. and Allan, J., "Local and Global Query Expansion for Hierarchical Complex Topics," in the Proceedings of the European Conference on Information Retrieval (ECIR 2019), Cologne, Germany, April 14-18, 2019, pp. 290-303.

IR-1176: Sarwar, S., Foley, J., Yang, L. and Allan, J., "Sentence Retrieval for Entity List Extraction with a Seed, Context and Topic," in the Proceedings of ICTIR '19: International Conference on the Theory of Information Retrieval, October 02-05, 2019, Santa Clara, California, pp. 209-212.

IR-1182: Foley, J., "Poetry: Identification, Entity Recognition, and Retrieval," Ph.D. Thesis, University of Massachusetts Amherst, May 2019.

IR-1186: Sarwar, S. and Allan, J., "SearchIE: A Retrieval Approach for Information Extraction," in the Proceedings of ICTIR '19: International Conference on the Theory of Information Retrieval, October 02-05, 2019, Santa Clara, California, pp. 249-252.

IR-1188: Naseri, S., Sarwar, S. and Allan, J., "Semantic Driven Fielded Entity Retrieval," presented at the Entity Retrieval (EYRE) Workshop, co-located with CIKM 2018, Turin, Italy, October 22-26, 2018.

IR-1189: Naseri, S., Foley, J., Allan, J. and O'Connor, B., "Exploring Summary-Expanded Entity Embeddings for Entity Retrieval," presented at the Entity Retrieval (EYRE) Workshop, co-located with CIKM 2018, Turin, Italy, October 22-26, 2018.

IR-1191: Montazeralghaem, A., Rahimi, N. and Allan, J., "Term Discrimination Value for Cross-Language Information Retrieval," in the Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR 2019), Santa Clara, California, October 2-5, 2019, pp. 137-140.

IR-1211: Yu, P., Rahimi, N., Huang, Z. and Allan, J., "Learning to Rank Entities for Set Expansion from Unstructured Data," in the Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR 2020), Stavanger, Norway, September 14-18, 2020, pp. 21-28.

IR-1212: Montazeralghaem, A., Rahimi, N. and Allan, J., "Relevance Ranking based on Query-Aware Context Analysis," in the Proceedings of the 42nd European Conference on Information Retrieval (ECIR 2020), Lisbon, Portugal, April 14-17, 2020, pp. 446-460.

This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF IIS-1617408). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.