
Parvathi Chundi (University of Nebraska at Omaha, USA) and Susannah Go (University of Nebraska, USA)

Copyright: © 2015
|Pages: 6

DOI: 10.4018/978-1-4666-5888-2.ch175

## Background

The problem that latent Dirichlet allocation (LDA) seeks to solve is as follows: given a corpus of documents, uncover the hidden topic structure that could have produced them. LDA assumes a generative process in which the words in a document are generated in two steps:

**Step 1:** Randomly choose a distribution over topics.

**Step 2:** Generate each word in the document as follows:

*a.* Select a random topic from the distribution over topics chosen in Step 1.

*b.* Select a word randomly from the word distribution corresponding to that topic.
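The two-step process above can be sketched in plain Python with a toy vocabulary. The topic names, the hand-picked probabilities, and the `generate_document` helper are illustrative assumptions (the real model draws these distributions from Dirichlet priors, as described later), not code from the chapter:

```python
import random

# Toy word distributions for two topics (illustrative values, not fitted).
topics = {
    "sports": {"ball": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"stock": 0.6, "market": 0.3, "score": 0.1},
}

def generate_document(topic_dist, num_words, rng=random.Random(0)):
    """Generate words by LDA's two-step process for a single document."""
    words = []
    for _ in range(num_words):
        # Step 2a: select a random topic from the document's topic distribution.
        topic = rng.choices(list(topic_dist), weights=list(topic_dist.values()))[0]
        # Step 2b: select a word from that topic's word distribution.
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist), weights=list(word_dist.values()))[0])
    return words

# Step 1: a randomly chosen distribution over topics for one document
# (here fixed by hand for reproducibility).
doc = generate_document({"sports": 0.7, "finance": 0.3}, num_words=5)
```

Note that the word order plays no role in this process, which is exactly the bag-of-words assumption discussed next.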

The order of words is ignored by LDA, making it a bag-of-words model. LDA resolves some of the issues present in previous topic models: the unigram model did not allow for multiple topics in one document, and the probabilistic latent semantic indexing (pLSI) model was prone to overfitting.

## Key Terms and Definitions

**Word:** A word is a basic processing unit obtained from the corpus.

**Response Variable:** A response variable is a variable associated with each document in the corpus that can be used for supervised learning. Examples include the category of a document and the number of stars or votes assigned by readers, which capture its interestingness.

**Topic:** A topic is a probability distribution over words.

**Bag-of-Words Model:** The bag-of-words model represents a document as a collection of words and ignores the sequence in which the words appear in the document.

**Hyperparameter:** A hyperparameter is a parameter of a prior distribution of a variable, where the prior distribution is the probability distribution that expresses the uncertainty about the variable before the data is taken into account.

**Topic Model:** A topic model is a formal statistical relationship between a group of observed and latent random variables. Typically, the observed variables correspond to the words occurring in a document and the latent variables correspond to the topic structure.

**Corpus:** A corpus is a collection of two or more documents.

**Generative Model:** A generative model specifies a simple probabilistic procedure by which the words in a document can be generated based on latent variables.

**Dirichlet Distribution:** A Dirichlet distribution is a probability distribution over probability mass functions of length *k*.
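As a concrete illustration of the Dirichlet distribution, a draw from a symmetric Dirichlet can be built from Gamma samples, a standard construction. The `sample_dirichlet` helper and its parameters are illustrative assumptions, not from the chapter:

```python
import random

def sample_dirichlet(alpha, k, rng=random.Random(0)):
    """Draw one length-k probability vector from a symmetric Dirichlet(alpha).

    Each component is a Gamma(alpha, 1) draw; normalizing the draws to sum
    to 1 yields a Dirichlet sample.
    """
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

# e.g. a document's distribution over 3 topics; small alpha favors
# vectors concentrated on a few topics.
theta = sample_dirichlet(alpha=0.5, k=3)
```

In LDA, such a draw serves as the per-document distribution over topics chosen in Step 1 of the generative process.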
