Introduction
All organizations must deal with evolving content. Organizational policies may be revised at least once a year, and other operational documents may be updated several times a year. For example, Google updated its privacy policies twice during 2016 and four times during 2015; Netflix's privacy policy was also updated twice in 2016, whereas Twitter appears to update its privacy policy once a year. Evolving content poses a problem for consumers of that content who are trying to keep abreast of the changes. In fact, most internet users have little knowledge of how their online information can be exploited (Turow et al., 2005). A study by McDonald and Cranor (2008) estimates that it would cost around 365 billion dollars per year in lost productivity if users were to read the privacy policies of every website they visit, and the cost only grows as these documents evolve. Content evolution is also common in news organizations, where articles about an event are published over days or months; the earlier content in a news story may evolve to include new facts as they become available.
Many organizations provide tools that let users search and extract relevant content from their intranets. However, identifying specific changes in content can be tedious. Temporal analysis of documents (Han, 2000; Allan, 2002) based on the date of publication is a common way to analyze and track changing content in time-stamped document sets. Temporal analyses of document sets have proven useful for tracking topics over time, finding hot and bursty topics, finding temporal correlations among document sets, and similar tasks. All of these systems aim to find content or time periods that are interesting to the consumers of the time-stamped document sets.
These temporal analysis methods may not be applicable when the contents of a document are changed to create the next revision or version, and users are interested in identifying how much content changed and which parts of the content underwent the change. In this paper, we propose a novel approach for identifying and measuring evolutionary change in time-stamped document sets. Our approach partitions the document set D into two subsets, Dp (published before some time point, or date, T) and Dc (published after T), and measures the evolution of content between the two.
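The partitioning step can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `(date, text)` pair representation of a time-stamped document is an assumption made for the example.

```python
from datetime import date

def partition_by_date(docs, T):
    """Split time-stamped documents into Dp (published on or before T)
    and Dc (published after T)."""
    # docs: list of (publication_date, text) pairs -- an assumed format
    D_p = [text for d, text in docs if d <= T]
    D_c = [text for d, text in docs if d > T]
    return D_p, D_c

docs = [
    (date(2015, 3, 1), "old policy text"),
    (date(2016, 6, 1), "revised policy text"),
    (date(2017, 1, 1), "latest policy text"),
]
D_p, D_c = partition_by_date(docs, date(2016, 12, 31))
# D_p holds the two earlier documents; D_c holds the one published after T
```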
To identify content evolution, we first extract the topics latent in the document sets and compute how many topics occurring in Dp also occur in Dc. We use two similarity metrics, vector similarity and context similarity, to identify topics that are similar in both subsets. We then use topic similarity to identify next and previous versions of topics, dormant topics, and emerging topics, as well as evolutionary change.
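As a rough sketch of the topic-matching idea, the snippet below pairs each topic in Dp with its most similar topic in Dc using cosine similarity over topic-word probability vectors. Note the assumptions: the paper's vector- and context-similarity metrics are not defined in this excerpt, so cosine similarity stands in for vector similarity here, and the 0.8 match threshold is an illustrative parameter, not a value from the paper.

```python
import math

def cosine_similarity(p, q):
    # p, q: topic-word probability vectors over a shared vocabulary
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def match_topics(topics_p, topics_c, threshold=0.8):
    """Pair each topic in Dp with its most similar topic in Dc.
    Topics in Dp with no match above the threshold would be treated as
    dormant; unmatched topics in Dc would be treated as emerging."""
    matches = {}
    for i, tp in enumerate(topics_p):
        sims = [cosine_similarity(tp, tc) for tc in topics_c]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            matches[i] = j
    return matches

# Toy topic-word distributions over a 3-word vocabulary
topics_p = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
topics_c = [[0.58, 0.32, 0.10], [0.2, 0.7, 0.1]]
matches = match_topics(topics_p, topics_c)
# Topic 0 of Dp matches topic 0 of Dc; topic 1 of Dp has no close match
```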
The proposed approach is unique in that it first computes a higher-level (summarized) representation of the two document sets and then compares those higher-level representations to compute similarities and differences. Latent Dirichlet allocation (Blei et al., 2003; Steyvers and Griffiths, 2007; Blei, 2012) is well suited to generating such a representation because it uncovers latent structure in a document set. LDA represents each document in a given document set as a mixture of topics and each topic as a mixture of words: given a document set, LDA extracts a probability distribution over words for each topic and a probability distribution over topics for each document. The sets of topics obtained from each document subset are then compared to identify evolutionary change.
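The two LDA outputs described above (per-topic word distributions and per-document topic distributions) can be obtained with an off-the-shelf LDA implementation. The sketch below uses scikit-learn's `LatentDirichletAllocation`; the toy corpus and the choice of two topics are illustrative assumptions, not the paper's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for one of the document subsets (Dp or Dc)
corpus = [
    "privacy policy data collection user information",
    "user data shared with third party advertisers",
    "news story event report update facts",
    "breaking news event coverage new facts emerge",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Per-document distribution over topics (one row per document)
doc_topics = lda.fit_transform(X)

# Normalize components_ rows to get a probability distribution
# over words for each topic
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

Running this once per subset yields two sets of topic-word distributions, which can then be compared with a topic-similarity metric to identify evolutionary change.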