Latent Dirichlet Allocation Approach for Analyzing Text Documents

Parvathi Chundi (University of Nebraska at Omaha, USA) and Susannah Go (University of Nebraska, USA)
DOI: 10.4018/978-1-4666-5888-2.ch175

Chapter Preview



Latent Dirichlet allocation (LDA) addresses the following problem: given a corpus, find short descriptions of its documents that allow the corpus to be processed efficiently while preserving the statistical relationships between documents and words that other kinds of processing may need. LDA assumes that each document can be represented as a mixture of latent (unobserved) topics, and it uses statistical inference to find groups of co-occurring words, which form the topics. LDA is a generative model: it specifies how each document in a collection is generated from a given set of topics. The words in a document are generated in two steps:

  • Step 1: Randomly choose a distribution over topics.

  • Step 2: Each word in the document is generated as follows.

    • a. Select a random topic from the distribution of topics chosen in Step 1.

    • b. Select a word randomly from the word distribution corresponding to that topic.
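
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: the toy topics, the parameter values, and the function names `sample_dirichlet` and `generate_document` are all hypothetical. It assumes the topic distribution in Step 1 is drawn from a Dirichlet distribution, as is standard in LDA.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet with parameters alpha."""
    # Normalizing independent Gamma(alpha_i, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, num_words, rng):
    """Generate one document (a list of words) following the two steps."""
    # Step 1: randomly choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(num_words):
        # Step 2a: select a topic from the distribution chosen in Step 1.
        topic_index = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 2b: select a word from that topic's word distribution.
        vocab = list(topics[topic_index])
        probs = [topics[topic_index][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics: each topic is a probability distribution over words.
TOPICS = [
    {"gene": 0.5, "dna": 0.3, "cell": 0.2},
    {"ball": 0.6, "team": 0.3, "score": 0.1},
]

rng = random.Random(0)  # fixed seed so repeated runs give the same document
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], num_words=8, rng=rng)
print("topic proportions:", theta)
print("document:", doc)
```

Because each word is drawn independently given the topic proportions, reordering the generated words does not change the document's probability under the model.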

LDA ignores the order of words, which makes it a bag-of-words model. It also addresses limitations of earlier topic models: the unigrams model did not allow multiple topics in one document, and the probabilistic latent semantic indexing (pLSI) model was prone to overfitting.
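
As a small illustration of the bag-of-words assumption, the sketch below reduces a document to word counts. The helper `bag_of_words` and the two example sentences are hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to its word counts, discarding word order."""
    return Counter(text.lower().split())

# Two hypothetical documents with the same words in a different order.
doc_a = "the cell copies the dna"
doc_b = "the dna copies the cell"

# Both documents reduce to the same bag, so a bag-of-words model
# cannot tell them apart.
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
```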

Key Terms in this Chapter

Word: A word is a basic processing unit obtained from the corpus.

Response Variable: A response variable is a variable associated with each document in the corpus and can be used for supervised learning. Examples of response variables include the category of a document and the number of stars or votes assigned by readers that capture its interestingness.

Topic: A topic is a probability distribution over words.

Bag of Words Model: The bag-of-words model represents a document as a collection of words and ignores the sequence in which the words appear in the document.

Hyperparameter: A hyperparameter is a parameter of the prior distribution of a variable. The prior distribution is the probability distribution that expresses the uncertainty about the variable before the data are taken into account.

Topic Model: A topic model is a formal statistical relationship between a group of observed and latent random variables. Typically, observed variables correspond to words occurring in a document and latent variables correspond to the topic structure.

Corpus: A corpus is a collection consisting of two or more documents.

Generative Model: A generative model specifies a simple probabilistic procedure by which words in a document can be generated based on latent variables.

Dirichlet Distribution: A Dirichlet distribution is a probability distribution over probability mass functions of length k.
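
For reference, the standard density of a Dirichlet distribution can be written as follows. The symbols are the conventional ones and are assumptions here, not notation taken from the chapter preview: θ = (θ_1, ..., θ_k) is a probability mass function of length k, and α_1, ..., α_k > 0 are the parameters of the distribution.

```latex
p(\theta \mid \alpha_1, \dots, \alpha_k)
  = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}
    \prod_{i=1}^{k} \theta_i^{\alpha_i - 1},
\qquad \theta_i \ge 0, \quad \sum_{i=1}^{k} \theta_i = 1 .
```

When all α_i are less than 1, the density favors mass functions that concentrate on a few of the k outcomes. When all α_i are greater than 1, it favors mass functions that spread weight more evenly.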
