Probabilistic Topic Discovery and Automatic Document Tagging

Davide Magatti (Università degli Studi di Milano-Bicocca, Italy) and Fabio Stella (Università degli Studi di Milano-Bicocca, Italy)
DOI: 10.4018/978-1-60960-881-1.ch002

A software system for topic discovery and document tagging is described. The system discovers the topics hidden in a given document collection, labels them according to a user-supplied taxonomy, and tags new documents. It implements an information processing pipeline consisting of document preprocessing, topic extraction, automatic labeling of topics, and multi-label document classification. The preprocessing module imports several kinds of documents and offers three document representations: binary, term frequency, and term frequency-inverse document frequency. The topic extraction module is implemented through a proprietary version of the Latent Dirichlet Allocation model. The optimal number of topics is selected through hierarchical clustering. The topic labeling module optimizes a set of similarity measures defined over the user-supplied taxonomy. It is implemented through an algorithm over a topic tree. The document tagging module solves a multi-label classification problem through multi-net Naïve Bayes without the need to perform any learning tasks.
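The three document representations named above can be illustrated with a minimal Python sketch. It is not the system's preprocessing module: the function name `build_representations`, the toy documents, and the particular tf-idf weighting (raw term count times log(N/df)) are assumptions made here for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def build_representations(docs):
    """Return (vocab, binary, tf, tfidf) for a list of tokenized documents.

    Hypothetical helper: the weighting tf * log(N / df) is an assumed,
    common tf-idf variant, not necessarily the one used by the system.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    binary, tf, tfidf = [], [], []
    for doc in docs:
        counts = Counter(doc)
        # Binary: 1 if the term occurs in the document, 0 otherwise.
        binary.append([1 if counts[t] > 0 else 0 for t in vocab])
        # Term frequency: raw occurrence counts.
        tf.append([counts[t] for t in vocab])
        # tf-idf: counts down-weighted for terms found in many documents.
        tfidf.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, binary, tf, tfidf


# Toy corpus (made up for this example).
docs = [
    "topic model topic".split(),
    "document tagging model".split(),
]
vocab, binary, tf, tfidf = build_representations(docs)
```

In this toy corpus the term "model" appears in both documents, so its tf-idf weight is zero under the assumed weighting, while "topic" keeps a positive weight in the first document.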
Chapter Preview

1. Introduction

From 1990 to 2005 more than one billion people worldwide entered the middle class, became richer and more literate, and thus fueled the information market (The Economist, 2010). The effect of this economic and social revolution, together with the improvements achieved by information and communication technologies, is called the information explosion. Indeed, in the last five years the amount of information created has started to diverge from the available storage capacity, as reported by the International Data Corporation (2010). Data and information have gone from scarce to superabundant. While it is common opinion that this setting brings huge benefits, it is also clear to everyone that it brings big headaches (The Economist, 2010). In the next ten years the data available on the Web will amount to forty times the current size (International Data Corporation, 2010). The knowledge hidden in such a huge amount of data will heavily influence social behavior, political decisions, medicine and health care, company business models and strategies, as well as financial investment opportunities. The overwhelming amount of available unstructured data has transformed information from useful to troublesome. Indeed, it is becoming increasingly clear that our recording and processing capabilities are growing much more slowly than the amount of generated data and information. Search engines have exacerbated this problem, and although new paradigms of web search are now being explored (Baeza-Yates & Raghavan, 2010), they normally provide users with huge amounts of unstructured results, which need to be pruned and organized to become useful and valuable.
