Recently completed:
EAGER: Dynamic Contextual Explanation of Search Results (Defuddle)
This NSF-funded research project aims to investigate and develop Defuddle, an approach and a system that analyzes documents at the top of a search engine’s ranked list to find human-readable explanations for why the documents were retrieved for a given query and, unlike existing technology, for how the documents relate to each other. The resulting advances in result explanation will make it easier for people to make sense of what happens when they search the web or any other collection of text documents.
Searching for Answers Through Iterative Feedback
In this NSF-funded research project, we will work on four research tasks: (a) develop and evaluate iterative relevance feedback models for answers; (b) develop and evaluate interactive summarization techniques for answers; (c) develop and evaluate finer-grained feedback approaches for answers; (d) develop and evaluate a conversation-based model for answer retrieval. This project will be the first to study methods and models for interacting with ranked lists of answers.
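As a concrete (if greatly simplified) illustration of the kind of iterative feedback involved, the sketch below applies classic Rocchio-style feedback to term vectors of answers a user has judged. It is not the project's model; the terms, weights, and judged answers are invented for illustration.

    from collections import Counter

    def rocchio_update(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """One round of Rocchio-style feedback: move the query vector toward
        answers judged relevant and away from answers judged non-relevant."""
        updated = Counter({t: alpha * w for t, w in query_vec.items()})
        for vec in relevant:
            for t, w in vec.items():
                updated[t] += beta * w / max(len(relevant), 1)
        for vec in nonrelevant:
            for t, w in vec.items():
                updated[t] -= gamma * w / max(len(nonrelevant), 1)
        return Counter({t: w for t, w in updated.items() if w > 0})

    # Illustrative term-frequency vectors for a query and two judged answers.
    query = Counter({"treat": 1, "migraine": 1})
    relevant = [Counter({"migraine": 2, "triptan": 1, "dose": 1})]
    nonrelevant = [Counter({"migraine": 1, "history": 2})]
    print(rocchio_update(query, relevant, nonrelevant).most_common(5))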
NSF - Connecting the Ephemeral and Archival Information Networks
This NSF-funded project is a collaboration with the CIIR, Carnegie Mellon University, and RMIT University. The team will use the explicit and implicit links between the ephemeral and archival networks to improve the effectiveness of search that is targeted at social data, web data, or both. We will demonstrate the validity of this hypothesis using a range of existing TREC tasks focused on either social media search or web search. In addition, we will explore two new tasks, conversation search and aggregated social search, which can exploit the integrated network of ephemeral and archival information.
NSF - Interactive Construction of Complex Query Models
This NSF-funded research program will investigate and implement SearchIE, a search-based approach to information "extraction." SearchIE will allow rapid, personalized, situational identification of types of objects or actions in text, where those types are likely to be useful for a complex search task. The result is that the technology can radically improve online searching for lay persons as well as professionals by significantly reducing the time needed to narrow queries down to relevant information.
NSF - Topical Positioning System (TPS) for Informed Reading of Web Pages
This NSF-funded project addresses the challenge of increasing the critical literacy of people looking for information on the Web, including information regarding healthcare, policy, or any other broadly discussed topic. The research on the Topical Positioning System (TPS) drives the vision of developing a browser tool that shows a person whether the web page in front of them discusses a provocative topic, whether the material is presented in a heavily biased way, whether it represents an outlier (fringe) idea, and how its discussion of issues relates to the broader context and to information presented in "familiar" sources.
NSF - Understanding the Relevance of Text Passages
Developing effective passage retrieval would have a major effect on search tools by greatly extending the range of queries that could be answered directly using text passages retrieved from the web. This is particularly important for mobile search applications with limited output bandwidth, whether that limit comes from a small screen or from speech output. In this case, the ability to use passages to reduce the amount of output while maintaining high relevance will be critical. In this NSF-funded project, we study research issues that have either been ignored, or only partially addressed, in prior research, such as showing whether passages can be better answers than documents for some queries, predicting which queries have good answers at the passage level, ranking passages to retrieve the best answers, and evaluating the effectiveness of passages as answers.
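To make the passage-ranking task concrete, here is a minimal sketch (not the project's method) that slides a fixed-size window over a document and scores each window against the query with a simple tf-idf weight; the window size, document, and collection statistics are all illustrative.

    import math
    import re
    from collections import Counter

    def passages(text, size=50, stride=25):
        """Split a document into overlapping fixed-length word windows."""
        words = re.findall(r"\w+", text.lower())
        for start in range(0, max(len(words) - size, 0) + 1, stride):
            yield words[start:start + size]

    def passage_score(query, passage_words, doc_freq, num_docs):
        """Simple tf-idf score of one passage against the query terms."""
        tf = Counter(passage_words)
        return sum(tf[t] * math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1))
                   for t in query.lower().split())

    def best_passage(query, document, doc_freq, num_docs):
        """Return the highest-scoring passage as a candidate direct answer."""
        return max(passages(document), key=lambda p: passage_score(query, p, doc_freq, num_docs))

    # Illustrative usage with a toy document and toy collection statistics.
    doc = "Caffeine can trigger a migraine in some patients. Regular sleep and hydration also help."
    print(" ".join(best_passage("migraine caffeine", doc, {"caffeine": 12, "migraine": 4}, 1000)))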
Archived Projects:
AQUAINT Project
This research project, Relevance Models and Answer Granularity for Question Answering, is an ARDA initiative under the Advanced Question and Answering for Intelligence (AQUAINT) program.
Automated Diagnosis of Usability Problems Using Statistical Computational Methods
The effects of poor usability range from mere inconvenience to disaster. Human factors specialists employ usability analysis to reduce the likelihood or impact of such failures. However, good usability analysis requires usability reports that are rarely collected, rarely complete, and difficult to analyze. The CIIR and Aptima have partnered on this AFOSR STTR project to develop a usability analysis system that addresses these problems.
Broad Operational Language Technology (BOLT)
The Broad Operational Language Technology (BOLT) Program has a goal of creating technology capable of translating multiple foreign languages in all genres, retrieving information from the translated material, and enabling bilingual communication via speech or text. The CIIR at UMass Amherst is part of the IBM team. The CIIR will be focusing on developing a cross-lingual information retrieval system for informal document genres (e.g., forums).
CALO Project
As part of DARPA’s Personalized Assistant that Learns (PAL) program, SRI and team members including the CIIR are working on developing a next-generation "Cognitive Assistant that Learns and Organizes" (CALO).
Confidence Measures for Information Extraction of Entities, Relations and Object Correspondence
In this NSF KDD project, UMass Amherst intends to improve the state-of-the-art in the ability to associate confidence measures with information extracted from unstructured text. The team will build on its previously successful research in probabilistic models for confidence assessment of individual extracted text segments, and will provide new capabilities for confidence assessment of object correspondence, and relations between entities.
Discovering and Using Meta-Terms (Microsoft Live Labs)
Sponsored by Microsoft Live Labs' Accelerating Search in Academic Research Initiative, project researchers will use Microsoft query logs and another Web-based collection to develop techniques to discover meta-terms in queries and then mine related words from the Web in an effort to test various approaches to query reformulation or transformation.
Flexible Acquisition and Understanding System for Text (FAUST)
This DARPA project is developing an automated machine reading system that makes the information in natural language texts accessible to formal reasoning systems. UMass Amherst is part of the SRI International team that also includes Columbia, Stanford, University of Illinois, University of Washington, University of Wisconsin, and Wake Forest University.
Machine Learning for Sequences and Structured Data: Tools for Non-Experts
In this NSF-funded ITR collaborative research project between UMass Amherst, UPenn, and CMU, the team is researching ways to dramatically improve the ability of people who are not experts in machine learning to design and automatically train models for analyzing and transforming sequences and other structured data such as text, signals, handwriting, and biological sequences.
Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR
This NSF-funded Data Intensive Computing project is a collaborative project with Tufts University and the Internet Archive. It aims for transformative advances over current technology, providing improved automatic support for search and analysis through data-intensive processing of large corpora. This research is being carried out using a collection of over a million scanned books gathered by the Internet Archive. The collection includes 8.5 terabytes of text and half a petabyte of scanned images. CIIR researchers will develop new approaches to processing the large collection. The resulting improved corpus will be indexed at the Internet Archive, allowing more accurate and powerful search. Researchers at Tufts University will develop approaches for exploratory data analysis on the processed collection.
Nightingale
The CIIR embarked on a five-year DARPA project under the Global Autonomous Language Exploitation (GALE) program. The goal of GALE is to make foreign language (Arabic and Chinese) speech and text accessible to monolingual English speakers, particularly in military settings. The Nightingale research team includes UMass Amherst, Columbia University, International Computer Science Institute (ICSI), IDIAP Research Institute, HNC/Fair Isaac Corporation, New York University, National Research Council (NRC) Canada, Purdue University, RWTH Aachen University, University of California San Diego, University of Washington, Systran Software, and SRI International. The UMass Amherst team focuses on highly accurate retrieval, dynamic topic models, social network discovery, and statistical machine translation.
NSF - Constructing Knowledge Bases by Extracting Entity-Relations and Meanings from Natural Language via "Universal Schema"
This NSF-funded project addresses relation extraction with a "universal schema": the researchers learn a generalizing model of the union of all input schemas, including multiple available pre-structured knowledge bases as well as all observed natural language surface forms. The approach thus embraces the diversity and ambiguity of original language surface forms (rather than trying to force relations into pre-defined boxes), yet also generalizes by learning non-symmetric implicature among explicit and implicit relations, using new extensions to successful probabilistic matrix factorization and vector embedding methods.
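The core modeling idea can be illustrated with a toy logistic matrix factorization: rows are entity pairs, columns mix knowledge-base relations and textual surface patterns, and learned embeddings score unobserved cells. This is a heavily simplified sketch, not the project's models; the entities, relations, observed cells, and hyperparameters below are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Columns mix KB relations and textual surface patterns (the "universal schema").
    relations = ["employee_of", "X works for Y", "X joined Y", "born_in"]
    entity_pairs = [("Smith", "UMass"), ("Jones", "MIT"), ("Lee", "Paris")]
    # (pair index, relation index) cells observed in a KB or in text.
    observed = {(0, 0), (0, 1), (0, 2), (1, 0), (2, 3)}
    held_out = (1, 1)  # the cell we would like the model to infer
    negatives = [c for c in ((i, j) for i in range(len(entity_pairs)) for j in range(len(relations)))
                 if c not in observed and c != held_out]

    dim = 8
    P = rng.normal(scale=0.1, size=(len(entity_pairs), dim))   # entity-pair embeddings
    R = rng.normal(scale=0.1, size=(len(relations), dim))      # relation embeddings

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Stochastic logistic matrix factorization over observed cells and sampled negatives.
    positives = sorted(observed)
    for step in range(4000):
        pos = positives[rng.integers(len(positives))]
        neg = negatives[rng.integers(len(negatives))]
        for (i, j), label in ((pos, 1.0), (neg, 0.0)):
            grad = sigmoid(P[i] @ R[j]) - label
            gP, gR = grad * R[j], grad * P[i]
            P[i] -= 0.05 * gP
            R[j] -= 0.05 * gR

    # The held-out cell (Jones, "X works for Y") should score higher than a
    # trained-negative cell such as (Lee, "X works for Y"), because Jones shares
    # the employee_of column with Smith, who also appears in the works-for column.
    print(round(float(sigmoid(P[1] @ R[1])), 3), round(float(sigmoid(P[2] @ R[1])), 3))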
NSF Digital Government Project
This research project, A Language-Modeling Approach to Metadata for Cross-Database Linkage and Search, is a National Science Foundation sponsored initiative. The CIIR is working in collaboration with Carnegie Mellon University, the Library of Congress, Department of Commerce, U.S. Geological Survey, and R.I.S.C. on this project.
NSF - Flexible Machine Learning for Natural Language in the MALLET Toolkit
This NSF-funded project aims to enhance the MALLET (MAchine Learning for LanguagE) and FACTORIE (Factor graphs, Imperative, Extensible) open-source software toolkits. These provide many modern, state-of-the-art machine learning methods, specially tuned to be scalable for the idiosyncrasies of natural language data while also applying well to many other discrete non-language tasks. The research team is broadening these toolkits' applicability to new data and tasks (with better end-user interfaces for labeling, training, and diagnostics), enhancing their research-support capabilities (with infrastructure for flexibly specifying model structures), and improving their understandability and support (with new documentation, examples, and online community support).
NSF IDM Mongrel Project
"Supporting Effective Access through User- and Topic-Based Language Models" is a research project in collaboration with Rutgers University and sponsored by the NSF IDM program.
NSF Learning Word Relationships Using TupleFlow
This NSF-funded Cluster Exploratory (CluE) project explores how semantic relationships between words, and the different ways the same content can be expressed, can be used to improve ranking effectiveness. We will find those relationships using the Google/IBM cluster and TupleFlow, a new distributed computational framework developed at UMass Amherst for the type of indexing and analysis operations required to study word relationships on a large scale. TupleFlow is an extension of MapReduce, with advantages in terms of flexibility, scalability, disk abstraction, and low abstraction penalties.
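To illustrate the style of computation such a framework distributes, here is a minimal in-memory map/reduce sketch that counts word co-occurrences within a small window. It is a generic illustration of the MapReduce pattern, not TupleFlow's actual API, and the window size and documents are illustrative.

    from collections import defaultdict

    def map_phase(text, window=5):
        """Emit ((word, word), 1) tuples for terms co-occurring within a small window."""
        words = text.lower().split()
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:
                yield (tuple(sorted((w, v))), 1)

    def reduce_phase(pairs):
        """Sum counts per key, as a reducer would after the shuffle phase."""
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return counts

    docs = ["query expansion improves ranking", "stemming and expansion improve ranking"]
    intermediate = [kv for text in docs for kv in map_phase(text)]
    cooccurrence = reduce_phase(intermediate)
    print(sorted(cooccurrence.items(), key=lambda kv: -kv[1])[:3])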
NSF - New Methods to Enhance Our Understanding of the Diversity of Science
This NSF-funded project focuses on the development of analytical tools that capture the diversity of science. The work moves beyond traditional "citation-counting" methods that focus only on the rate of scientific innovation. The project's primary goal is to develop and implement new methods, grounded in the computer science literature (specifically statistical topic modeling and social network analysis), for analyzing the impact of science policy interventions on the diversity of science.
NSF NSDL - Search and Browsing Support for NSDL
On an NSF National Science, Mathematics, Engineering, and Technology Education Digital Library (NSDL) project, the CIIR worked with a team of institutions developing the technical capabilities and executing the organizational responsibilities of the core integration of the NSDL Program.
NSF NSDL - Question Triage for Experts and Documents: Expanding the Information Retrieval Function of the NSDL
On an NSF National Science, Mathematics, Engineering, and Technology Education Digital Library (NSDL) project, the CIIR is partnering with the Information Institute of Syracuse (IIS) and the Wondir Foundation to enhance the NSDL by merging the information retrieval (IR) and digital reference components. By combining these functions, users can find answers to their questions regardless of whether those answers come from documents in NSDL collections or from experts accessible through the NSDL's virtual reference desk.
NSF - Searching Archives of Community Knowledge
In this NSF-funded project, we are studying the task of finding good answers in Collaborative Question Answering (CQA) archives by investigating techniques for question retrieval and comparing them to alternatives such as direct answer retrieval. The techniques that we are developing to search CQA archives could also have a significant impact on all types of search engines. The large CQA archives can be used as training data for models of text transformation. In other words, by developing models that learn how to recognize questions using these resources, we will also be learning how concepts or topics can be expressed in different ways. These transformation models could then be used to improve the robustness of the topic models used in search engines, which would in turn substantially improve the effectiveness of the system.
NSF SGER: Breaking the keyword bottleneck: Towards more effective access of government information
In this NSF project, we are carrying out initial experiments with retrieval models for complex queries that go beyond the typical “bag-of-words” approach. There are two major issues that we explore in the development of new retrieval models. First, in order to improve system robustness, we need to develop models that more reliably capture topical relevance than our current models. This means we need to have models that are better at recognizing different ways that topics can be described in text. Second, in order to improve the system accuracy in the top ranked documents, we need to develop models that more precisely capture topical relevance. This means retrieval models need to be better at recognizing and incorporating the specific concepts and relationships that are required by the query.
NSF - Text Reuse and Information Flow
In this NSF-funded project, we are studying a range of approaches to detecting reuse at the sentence level, and a range of approaches for combining sentence-level evidence into document-level evidence. We are also developing algorithms for inferring information flow from timelines, sources, and reuse measures. Given the importance of the Web as a source for detecting reuse, we also focus on techniques that can make efficient use of this huge but unwieldy resource. The research is being evaluated using a range of corpora, such as news, Web crawls, and blogs, in order to explore the dimensions of reuse and information flow in different situations.
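As a minimal illustration of sentence-level reuse detection (a far simpler measure than the approaches studied in this project), the sketch below fingerprints sentences with word n-gram shingles and scores pairs by Jaccard overlap; the example sentences are invented.

    import re

    def shingles(sentence, n=3):
        """Word n-gram 'shingles' used as a cheap fingerprint of a sentence."""
        words = re.findall(r"\w+", sentence.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def reuse_score(s1, s2, n=3):
        """Jaccard overlap of shingles: 1.0 for verbatim copies, near 0 for unrelated text."""
        a, b = shingles(s1, n), shingles(s2, n)
        return len(a & b) / len(a | b) if a | b else 0.0

    original = "The senator announced a new infrastructure bill on Tuesday."
    rewrite = "On Tuesday the senator announced a new infrastructure bill."
    print(reuse_score(original, rewrite))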
NSF - The Synthesis Genome: Data Mining for Synthesis of New Materials
This NSF-funded project is a collaboration with UMass IESL and MIT. The project's research will develop the framework to do for materials synthesis what modern computational methods have done for materials properties: build predictive tools for synthesis so that targeted compounds can be synthesized in a matter of days, rather than months or years. Researchers will pursue an innovative approach leveraging documentation of compound synthesis compiled over decades of scientific work by using natural language processing (NLP) techniques to automatically extract synthesis methods from hundreds of thousands of peer-reviewed papers.
NSF - Transforming Long Queries
The focus of this NSF-funded project is on developing retrieval algorithms and query processing techniques that will significantly improve the effectiveness of long queries. A specific emphasis is on techniques for transforming long queries into semantically equivalent queries that produce better search results. In contrast to purely linguistic approaches to paraphrasing, query transformation is done in the context of, and guided by, retrieval models. Query transformation steps such as stemming, segmentation, and expansion have been studied for many years, and we are both extending and integrating this work in a common framework.
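The sketch below shows the three transformation steps (segmentation, stemming, expansion) in their simplest form; it is not the project's retrieval-guided method, and the phrase and expansion dictionaries are toy stand-ins for the retrieval-model-driven resources the project actually studies.

    # Toy stand-ins for resources that, in the project, are driven by retrieval models.
    SEGMENTS = {("new", "york"): "new york"}                         # hypothetical phrase dictionary
    EXPANSIONS = {"attorney": ["lawyer"], "cheap": ["inexpensive"]}  # hypothetical expansion terms

    def crude_stem(word):
        """A deliberately crude suffix-stripping stemmer, for illustration only."""
        for suffix in ("ing", "ers", "er", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def transform(query):
        words = query.lower().split()
        # Segmentation: join adjacent words that form a known phrase.
        segmented, i = [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) in SEGMENTS:
                segmented.append(SEGMENTS[(words[i], words[i + 1])])
                i += 2
            else:
                segmented.append(words[i])
                i += 1
        # Stemming, then expansion of each stemmed term.
        terms = []
        for term in segmented:
            stem = crude_stem(term)
            terms.append(stem)
            terms.extend(EXPANSIONS.get(stem, []))
        return terms

    print(transform("cheap attorneys in New York"))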
OCRing Early Modern Text
This Mellon Foundation grant is part of a larger grant to Texas A&M. The teams will recognize the text of 18th-century English books using optical character recognition (OCR) systems, and they will use their technology to automatically estimate OCR errors and correct the output of multiple OCR engines. This will be done using fast alignment algorithms.
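As a rough illustration of aligning and combining multiple OCR hypotheses, the toy sketch below aligns each hypothesis to the first one word by word and takes a majority vote per aligned position. It uses Python's difflib rather than the project's fast alignment algorithms, and the OCR outputs are invented.

    from difflib import SequenceMatcher
    from collections import Counter

    def align_and_vote(outputs):
        """Toy consensus over OCR hypotheses: align each hypothesis to the first
        one word-by-word, then take a majority vote at each aligned position."""
        base = outputs[0].split()
        votes = [Counter([w]) for w in base]
        for hyp in outputs[1:]:
            words = hyp.split()
            matcher = SequenceMatcher(a=base, b=words)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op in ("equal", "replace"):
                    for offset in range(min(i2 - i1, j2 - j1)):
                        votes[i1 + offset][words[j1 + offset]] += 1
        return " ".join(counter.most_common(1)[0][0] for counter in votes)

    ocr_outputs = [
        "the hiftory of tom jones",
        "the history of tom jones",
        "the history of torn jones",
    ]
    print(align_and_vote(ocr_outputs))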
PCORI - Patient Experience Recommender System
The focus of this PCORI pilot project is to maximize patient perspective and effectively support lifestyle choices by developing the "Patient Experience Recommender System for Persuasive Communication Tailoring" (PERSPeCT), an adaptive computer system that assesses a patient's individual perspective, understands the patient's preference for health messages, and provides personalized, persuasive health communication relevant to the individual patient. The project is a collaboration with UMass Medical Division of Health Informatics and Implementation Science and the CIIR.
Proteus Infrastructure: Work Aggregation and Entity Extraction
This Mellon Foundation grant supports development of software and techniques for scholars in the humanities to use in processing large corpora of digitized books. Specifically, this is a pilot project to build and evaluate research infrastructure for scanned books. While there are several large scanned book collections (for example, the Internet Archive), much of this material is unstructured and not easily used by scholars in the humanities. The grant will support building the Proteus infrastructure, which will help scholars navigate and use such collections more easily. Components of the infrastructure include automatically identifying a book’s language, linking multiple editions of canonical works, finding quotations in canonical works, and entity detection.
Proteus - Supporting Scholarly Information Seeking Through Text-Reuse Analysis and Interactive Corpus Construction
This Andrew W. Mellon Foundation grant is a collaborative project with Northeastern University's NULab for Texts, Maps, and Networks to develop the Proteus toolset for researchers in the digital humanities to explore the contents of large, unstructured collections of historical books (two million out-of-copyright books), newspapers and other documents. Users of the Proteus system will be able to interactively and incrementally build up collections by analyzing networks of text reuse among books, passages, authors, and journals; provide feedback on terms, phrases, named entities, and metadata; and explore these growing collections during search, while browsing, and with an interactive full-text visualization tool.
Statistical Models for Information Extraction for REFLEX
In this project, UMass Amherst is a subcontractor to BBN Technologies on a DARPA-sponsored project to develop statistical models for information extraction that combine many sources of information in novel, integrated ways.
TIDES (DARPA)
The research project, Tools for Rapidly Adaptable Translingual Information Retrieval and Organization, is a DARPA-sponsored initiative under the Translingual Information Detection, Extraction, and Summarization (TIDES) program on fast machine translation and information access. Another project, Formal Frameworks and Empirical Evaluations for Information Organization, is a continuation of the DARPA-sponsored TIDES initiative.
Topic Detection and Tracking (TDT)
This DARPA-sponsored initiative investigated the state of the art in finding and following new events in a stream of broadcast news stories.
Unified Graphical Models
"Unified Graphical Models of Information Extraction and Data Mining with Application to Social Network Analysis" is an NSF ITR research project that aims to improve the ability to data mine information previously locked in unstructured natural language text. The research focuses on developing novel statistical models for information extraction and data mining that have such tight integration that the boundaries between them disappear, resulting in a powerful unified framework for extraction and mining.
USAR MURI - Situation Understanding Bot Through Language and Environment (SUBTLE)
This USAR MURI project is a collaboration with Cornell, George Mason, Stanford, University of Pennsylvania, University of Massachusetts Amherst, and University of Massachusetts Lowell. The researchers are developing methods for constructing a computationally tractable end-to-end system for a habitable subset of English, including both a formal representation of the implicit meaning of utterances and the generation of control programs for a robot platform. They are also developing a virtual simulation of the USAR environment to enable inexpensive large-scale corpus collection to proceed during many stages of system development. The UMass Amherst team is working on the natural language processing and machine learning research on the project, specifically the command interface and command logic.
Word Spotting: Indexing Handwritten Manuscripts
Word Spotting is sponsored by the National Science Foundation Digital Libraries II program. This project researches and develops innovative techniques for indexing handwritten historical manuscripts written by a single author.