Semantic Models in Information Retrieval


Edmond Lassalle (Orange Labs, France) and Emmanuel Lassalle (Université Paris 7, France)
DOI: 10.4018/978-1-4666-0330-1.ch007


Robertson and Spärck Jones pioneered experimental probabilistic models (the Binary Independence Model), combining a typology that generalizes the Boolean model, frequency counts used to compute elementary weights, and the combination of these weights into a global probabilistic estimate. However, this model did not consider dependencies between indexing terms. An extension to mixture models (e.g., using a 2-Poisson law) made it possible to take these dependencies into account from a macroscopic point of view (BM25), together with shallow linguistic processing of co-references. Newer approaches (language models such as “bag of words” models, probabilistic dependencies between queries and documents, and consequently Bayesian inference using conjugate Dirichlet priors) provided new solutions for document structuring (categorization) and for index smoothing. So far, the main issues in these probabilistic models have been addressed from a formal point of view only, so linguistic properties are neglected in the indexing language. The authors examine how linguistic and semantic modeling can be integrated into indexing languages and set up a hybrid model that makes it possible to deal with different information retrieval problems in a unified way.
Chapter Preview

Introduction: Using Semantics In Information Retrieval

Several tasks in IR are based on the common principle of content matching. This is apparent not only in search engines, which compare a query with the documents of their database, but also in other major tasks such as document categorization (or clustering) and question answering (QA). For example, in document categorization the content matching is performed between a document and the descriptive sections of a category system, and the document is then assigned to the best-matching category. Another example is producing an abridgment of a document: this can be done by splitting the text into sentences, comparing them to the entire text, and keeping the few that produce the best matches. However, these tasks differ in complexity:

  • Document categorization is simpler to achieve than clustering.

  • Document retrieval is easier than question answering (QA).

  • Within the QA task, the simplest systems deal with factual information (the height of a monument, the capital of a country), whereas the most complex state-of-the-art systems try to answer "why" or "how" questions.

  • Extracting the answer in a QA system requires a filtering mechanism that selects sentences from the documents, much as in automatic abridgment. The difficulty of QA, however, lies in matching a sentence to the question, as opposed to matching a sentence to the document as in automatic abridgment. Since a question provides far less information than a document, this matching is considerably harder.
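The content-matching principle behind categorization and abridgment can be sketched in a few lines. The following is only a minimal illustration, assuming a plain bag-of-words representation and cosine similarity as the matching score; the function names (`bag_of_words`, `cosine`, `categorize`, `abridge`) and the choice of score are hypothetical and are not taken from the chapter.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed matching score)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(document, categories):
    """Assign the document to the category whose description matches it best."""
    doc_vec = bag_of_words(document)
    return max(
        categories,
        key=lambda name: cosine(doc_vec, bag_of_words(categories[name])),
    )


def abridge(text, keep=1):
    """Keep the `keep` sentences that best match the whole text, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    text_vec = bag_of_words(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(bag_of_words(sentences[i]), text_vec),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:keep])]


# Hypothetical category descriptions and text, for illustration only.
categories = {
    "sports": "football match team score goal",
    "finance": "bank market stock price investment",
}
label = categorize("The team scored a late goal in the match", categories)
summary = abridge(
    "Search engines match queries with documents. Cats sleep. "
    "Search engines rank documents.",
    keep=1,
)
```

Both functions reuse the same matching primitive; only the pair of texts being compared changes (document versus category description, sentence versus whole text). A purely lexical score of this kind captures none of the semantic analysis discussed in the text.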

These different tasks are processed by specific implementations that are not reusable from one task to another. If we want to unify the processes for all those tasks in a reusable mechanism, we must define a generic model that allows a wide coverage of all the tasks. The difference in complexity between two similar tasks lies in the fact that one requires a deeper “semantic analysis” than the other, as we can see below:

  • If we compare clustering to categorization, the latter starts from a precomputed classification nomenclature and predefined comparison criteria for each section. Categorizing a document consists in extracting from the document the elements that allow it to be compared against those criteria. Clustering, on the other hand, must first build the classification nomenclature and compute the comparison characteristics for each section. This is done by analyzing a reference corpus of documents, which amounts to a prior semantic processing step.

  • While the QA task has to supply one or several concise answers to a composed question, the “search engine” task returns a list of documents in response to a query, leaving it to the user to view each document and estimate its relevance to her/his query. A QA system may be seen as a search engine coupled with a post-processing system: the question is initially treated as a query by the search engine component, and the contents of the returned documents are then parsed to extract salient elements that may be possible answers to the question. The complexity in this case is due to this a posteriori semantic processing.
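The view of a QA system as a search engine followed by a post-processing step can be sketched as follows. This is a minimal illustration under strong assumptions: the relevance score is a crude count of shared terms, and the "answer" is simply the best-matching sentence of the retrieved documents. The names `tokens`, `overlap`, `search`, and `answer` are hypothetical, and no real QA system is this simple.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(query_terms, text):
    """Count occurrences of the query terms in the text (a deliberately crude score)."""
    counts = Counter(tokens(text))
    return sum(counts[term] for term in set(query_terms))


def search(query, documents, top_k=2):
    """Search-engine step: return up to top_k documents that share terms with the query."""
    q = tokens(query)
    ranked = sorted(documents, key=lambda d: overlap(q, d), reverse=True)
    return [d for d in ranked[:top_k] if overlap(q, d) > 0]


def answer(question, documents, top_k=2):
    """Post-processing step: pick the retrieved sentence closest to the question."""
    q = tokens(question)
    candidates = []
    for doc in search(question, documents, top_k):
        candidates.extend(
            s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()
        )
    if not candidates:
        return None
    return max(candidates, key=lambda s: overlap(q, s))


# Hypothetical two-document collection, for illustration only.
documents = [
    "The Eiffel Tower is in Paris. The Eiffel Tower is 330 metres tall.",
    "Bananas are rich in potassium.",
]
best = answer("How tall is the Eiffel Tower?", documents)
```

The sketch makes the asymmetry visible: the retrieval step compares the question with whole documents, while the extraction step must match it against single sentences, where far fewer terms are available to discriminate between candidates.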

If we were able to address the deepest level of semantic analysis, we would obtain a generic model and could then deal with IR problems in a unified way. The difficulty of semantic approaches is due to the high cost of manually constructing large knowledge databases. Sometimes, for simple semantic tasks, resorting to manual analyses (such as editorial functions) or semi-automatic processing (such as statistical analysis of query logs followed by manual revision) will improve the quality of a search engine. Thus, fully automatic semantic processing is of real interest only if we manage to build large knowledge databases automatically at a reasonable cost.

Exploiting the collaborative knowledge databases available on the Internet is a fairly good alternative, insofar as it has been validated by experimental implementations. However, the problem remains when moving to full-scale applications: harmonizing or completing collaborative databases becomes essential to cover the needs of real applications. Ad hoc methods have been developed to merge databases with heterogeneous formats and contents. Yet, because of the rapid growth in the number of such resources, automatic methods are needed, which leads inevitably to a classic machine learning problem.
