Introduction
To succeed in today’s business environment, every enterprise must be able to efficiently find information on the web. Although the web is a rich source of information, there are many challenges associated with finding the right information in a timely manner. Web search engines typically retrieve a large number of web pages and overload business analysts with irrelevant information (Chung, Chen, & Nunamaker Jr., 2005).
Ultraseek reported that the average employee spends 3.5 hours a week on unsuccessful searches (Ultraseek, 2006). KMWorld reported that middle managers spend approximately 25% of their time searching for information that they need to do their jobs, that the information they find is often wrong, and that 86% of enterprise searchers are dissatisfied with their firms’ search capabilities (KMWorld, 2008). What is required are more fine-grained technologies that can understand Business Intelligence tasks and present their results in comprehensible formats.
One approach that has been proposed for overcoming some of these challenges is automated Question Answering (QA). The objective of a QA system is to locate, extract, and present the answer to a specific user question that has been expressed in natural language (Roussinov, Fan, & Robles-Flores, 2008). QA systems enable the searcher to pose queries as questions using natural language, and enable the computer to retrieve answers to questions that require the fusion of information from multiple sources. The ability to fuse information from multiple sources allows QA systems to take as input a question like “What are the countries in Central America?” and produce as output a list such as “Guatemala, Belize, El Salvador, Honduras, Nicaragua, Costa Rica, and Panama are countries in Central America.” This is an example of a list question, so called because the answer is a list of items of information.
When dealing with list questions, it is important to differentiate between questions where constructing the answer list requires the fusion of information from multiple web sites (fusion questions), and questions where the answer list can be found on a single web page (non-fusion questions). An example of a non-fusion question is “What are the names of all the teams in the National Football League?” The complete answer to this question is available in many locations, and simply entering “names of all NFL teams” in the Google search bar will provide links to several sites that contain the desired list.
An example of a fusion question is “Which companies manufacture home appliances in the U.S.?” Entering “names of home appliance manufacturers located in the U.S.” in the Google search bar will not return a link to a single site that contains the desired list, because no such site exists. Answering a fusion question requires a search engine or service that can query the web for information, parse the returned web pages for the relevant information, and fuse that information into an aggregated answer list. Fusion questions are very common in the business intelligence arena.
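The final fusion step can be illustrated with a minimal Python sketch. It assumes the querying and parsing steps have already produced one candidate list per web page; the function name `fuse_answer_lists`, the `min_support` parameter, and the per-page lists are hypothetical and are not taken from any system described in this article. The sample data reuses the Central America question from above, split across three imaginary pages.

```python
from collections import Counter


def fuse_answer_lists(candidate_lists, min_support=1):
    """Merge per-page candidate lists into one de-duplicated answer list.

    Candidates are ordered by the number of pages that mention them
    (most pages first), with ties broken alphabetically.
    """
    support = Counter()  # normalized name -> number of pages mentioning it
    display = {}         # normalized name -> first spelling encountered
    for page_candidates in candidate_lists:
        seen_on_page = set()
        for name in page_candidates:
            # Normalize case and whitespace so variants count as one item.
            key = " ".join(name.lower().split())
            if key in seen_on_page:
                continue  # count each page at most once per candidate
            seen_on_page.add(key)
            support[key] += 1
            display.setdefault(key, name.strip())
    ranked = sorted(support, key=lambda k: (-support[k], k))
    return [display[k] for k in ranked if support[k] >= min_support]


# Hypothetical extracts: no single page holds the complete list.
pages = [
    ["Guatemala", "Belize", "Honduras"],
    ["honduras", "El Salvador", "Nicaragua", "Guatemala"],
    ["Costa Rica", "Panama", "Nicaragua ", "Belize"],
]

fused = fuse_answer_lists(pages)
print(fused)
```

In this sketch, no individual page lists all seven countries, but the fused list does. Raising `min_support` would keep only candidates that appear on several pages, which is one simple way to trade completeness for confidence.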
Search engines such as Google, Yahoo, and MSN use many tools to identify relevant snippets for keyword searches, including PageRank, term frequency, term proximity, and inverse document frequency. However, these tools are not designed to handle fusion list questions. They treat a question as a “bag of words”. Entering “Who is the largest producer of software?” in the Google search bar, for example, will yield nearly the same results as entering “largest producer software”. Both queries produce unexpected snippets that identify the largest producers of carbon steel, pork, ethanol, and sugar, but do not identify Microsoft, which is the answer the user would expect (see Figure 1). Moreover, even if the correct answer is among the search results, the user still needs to review the snippets in order to locate it.
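The “bag of words” behavior can be made concrete with a small Python sketch. It is a deliberate simplification and not a description of how any actual search engine works: the stopword list and the `bag_of_words` helper are assumptions introduced only for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; real systems use their own, larger lists.
STOPWORDS = frozenset({"who", "what", "is", "are", "the", "of", "a", "an", "in"})


def bag_of_words(query):
    """Reduce a query to an unordered multiset of its content words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


question = bag_of_words("Who is the largest producer of software?")
keywords = bag_of_words("largest producer software")
scrambled = bag_of_words("software producer largest")

# Word order and question structure are discarded, so all three match.
print(question == keywords == scrambled)
```

Once the question is reduced this way, nothing records that “software” is the thing being produced or that the user is asking for a company, which is consistent with snippets about other kinds of producers matching the query equally well.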
Figure 1. Results for question: “Who is the largest producer of software?”