Identifying Emerging Topics and Content Change from Evolving Document Sets

Parvathi Chundi (University of Nebraska-Omaha, Department of Computer Science, Omaha, NE, USA)
Copyright: © 2017 |Pages: 18
DOI: 10.4018/IJKBO.2017100101

Abstract

Document sets with evolving content occur often in organizations. It is common for organizations to update policy documents periodically and for a news story to evolve over a period of time. When a document set evolves, some of the old content may remain unchanged while new content is added. Depending on the amount of change, users may need to read and/or analyze the new content once again. Evolving content can make it hard for users to track the changes and form a global view of them. In this paper, we consider document sets consisting of documents published at two different points of time and develop a measure to capture the change in content between the two time points. We divide a document set into two subsets – one containing documents published at an earlier date and another containing documents published at a later date. We use Latent Dirichlet Allocation to extract topic and word distributions for each of the two subsets. We then compute the similarity of the sets of topics extracted for the two subsets to measure the amount of content change. We study the effectiveness of the method on two data sets – a set of privacy policy documents and a set of Reuters news articles extracted from the TDT-Pilot Corpus – and present the experimental results.
Article Preview

Introduction

All organizations must deal with evolving content. Organizational policies may be revised at least once a year, and other operational documents may be updated several times a year. As an example, Google updated its privacy policies twice during 2016 and four times during 2015, and Netflix’s privacy policy was also updated twice in 2016, whereas Twitter seems to update its privacy policy once a year. Evolving content can pose a problem for consumers of the content who are trying to keep abreast of the changes. In fact, most internet users have little knowledge about how their online information can be exploited (Turow, et al., 2005). A study by McDonald and Cranor (2008) estimates that it would cost around 365 billion dollars per year in lost productivity if users were to read the privacy policies of all websites they visit, and the cost only worsens as these documents evolve quickly. Content evolution is also common in news organizations, where news articles about an event are published over days or months. The earlier content in a news story may evolve to include new facts as they become available.

Many organizations provide tools that let users search and extract relevant content from their intranets. However, identifying specific changes in content can be tedious. Temporal analysis of documents (Han, 2000 and Allan, 2002) based on the date of publication is a common way to analyze and track the changing content in time-stamped document sets. Temporal analyses of document sets have proven useful in tracking topics over time, finding hot topics and bursty topics, finding temporal correlations among document sets, etc. All these systems are aimed at finding content or time periods that are interesting to the consumers of the time-stamped document sets.

The temporal analysis methods may not be applicable in cases where the contents of a document were changed to create the next revision or version, and where users are interested in identifying how much content changed and which parts of the content underwent the change. In this paper, we propose a novel approach for identifying and measuring evolutionary change in time-stamped document sets. Our approach partitions the document set D into two subsets – Dp, containing documents published before some time point (date) T, and Dc, containing documents published after T – and measures the evolution of content between the two.
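The partitioning step can be sketched as follows. This is a minimal illustration, assuming documents are represented as (text, publication date) pairs; the paper does not prescribe a particular data structure.

```python
from datetime import date

# Hypothetical sketch: split a time-stamped document set D into
# Dp (published on or before split date T) and Dc (published after T).
# The (text, date) tuple representation is an assumption for illustration.
def partition_by_date(docs, split_date):
    d_p = [text for text, d in docs if d <= split_date]
    d_c = [text for text, d in docs if d > split_date]
    return d_p, d_c

docs = [
    ("policy v1 text", date(2015, 3, 1)),
    ("policy v2 text", date(2016, 6, 1)),
    ("policy v3 text", date(2016, 9, 15)),
]
d_p, d_c = partition_by_date(docs, date(2016, 1, 1))
```

Each subset is then modeled separately, so that the comparison operates on per-subset topic summaries rather than on raw document text.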

To identify content evolution, we first extract the topics latent in the two document subsets and compute how many topics occurring in Dp also occur in Dc. We use two similarity metrics – vector similarity and context similarity – to identify topics that are similar in both subsets. We then use topic similarity to identify next and previous versions of topics, dormant topics, and emerging topics, as well as to quantify evolutionary change.
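Since each topic is a probability distribution over words, one natural instantiation of vector similarity is the cosine between two topics' word-probability vectors over a shared vocabulary. The sketch below is an assumption for illustration; the paper's exact definitions of vector and context similarity may differ.

```python
import math

# Hedged sketch: cosine similarity between two topics, each represented
# as a dict mapping word -> probability. Words absent from a topic are
# treated as having probability 0.
def cosine(p, q):
    vocab = set(p) | set(q)
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Hypothetical topics from an earlier (Dp) and a later (Dc) subset.
topic_old = {"privacy": 0.5, "data": 0.3, "user": 0.2}
topic_new = {"privacy": 0.4, "data": 0.4, "tracking": 0.2}
sim = cosine(topic_old, topic_new)
```

A topic in Dp whose best match in Dc exceeds a similarity threshold can be treated as having a next version; a Dc topic with no sufficiently similar Dp counterpart is a candidate emerging topic.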

The proposed approach is unique in that it first computes a higher-level (or summarized) representation of the two document subsets and then compares these higher-level representations to compute similarities and differences. Latent Dirichlet Allocation (Blei, et al., 2003, Steyvers and Griffiths, 2007, Blei, 2012) is an ideal method for generating a higher-level representation of a document set since it finds latent information in the document set. LDA represents each document in the given document set as a mixture of topics and each topic as a mixture of words. Given a document set, LDA extracts a probability distribution over words for each topic, and a probability distribution over topics for each document. The sets of topics obtained from the two document subsets are then compared to identify evolutionary change.
