Topic and Cluster Evolution Over Noisy Document Streams
Sascha Schulz (Humboldt-University Berlin, Germany), Myra Spiliopoulou (Otto-von-Guericke-University Magdeburg, Germany) and Rene Schult (Otto-von-Guericke-University Magdeburg, Germany)
Copyright: © 2008
We study the problem of discovering and tracing thematic topics in a stream of documents. This problem, often studied under the label “topic evolution”, is of interest in many applications where thematic trends should be identified and monitored, including environmental modelling for marketing and strategic management applications, information filtering over streams of news, and enrichment of classification schemes with emerging new classes. We concentrate on the last of these areas and describe an example application from the automotive industry: the discovery of emerging topics in repair & maintenance reports. We first discuss relevant literature on (a) the discovery and monitoring of topics over document streams and (b) the monitoring of evolving clusters over arbitrary data streams. Then, we propose our own method for topic evolution over a stream of small, noisy documents: we combine hierarchical clustering, performed at different time periods, with cluster comparison over adjacent time periods, taking into account that the feature space itself may change from one period to the next. We elaborate on the behaviour of this method and show how human experts can be assisted in identifying class candidates among the topics thus identified.
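To make the general idea concrete, the following is a minimal sketch, not the method of this chapter: it clusters each period's documents separately with a naive average-link agglomerative procedure, labels each cluster by its most frequent terms, and compares labels across adjacent periods. Because the comparison works on term sets rather than on vectors, it does not require the two periods to share a feature space. All function names, the Jaccard similarity, the thresholds, and the toy repair reports are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(docs, min_sim=0.2):
    """Naive average-link agglomerative clustering of tokenised documents.

    Merging stops once no pair of clusters reaches `min_sim` (assumed value).
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(jaccard(docs[i], docs[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), link(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_sim:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


def label(docs, cluster, k=3):
    """Label a cluster by its k most frequent terms (ties broken alphabetically)."""
    counts = Counter(t for i in cluster for t in set(docs[i]))
    return set(sorted(counts, key=lambda t: (-counts[t], t))[:k])


def match_topics(prev_labels, cur_labels, min_overlap=0.5):
    """Map each current cluster to its best predecessor, or None if it is new."""
    matches = []
    for c, cur in enumerate(cur_labels):
        scores = [jaccard(cur, prev) for prev in prev_labels]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        survived = best is not None and scores[best] >= min_overlap
        matches.append((c, best if survived else None))
    return matches


# Toy repair reports for two adjacent periods (invented data).
period_1 = [["brake", "noise", "front"], ["brake", "noise", "rear"],
            ["engine", "oil", "leak"], ["engine", "oil", "loss"]]
period_2 = [["brake", "noise", "squeal"], ["brake", "noise", "front"],
            ["battery", "drain", "sensor"], ["battery", "drain", "cold"]]

labels_1 = [label(period_1, c) for c in agglomerate(period_1)]
labels_2 = [label(period_2, c) for c in agglomerate(period_2)]
matches = match_topics(labels_1, labels_2)
```

In this toy run, a current cluster mapped to `None` has no sufficiently similar predecessor; such clusters are the kind of emerging topic a human expert might inspect as a candidate for a new class.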