# MapReduce Style Algorithms for Extracting Hot Spots of Topics from Timestamped Corpus

Ashwathy Ashokan, Parvathi Chundi
DOI: 10.4018/978-1-4666-5888-2.ch407

## Background

### Hotspot Extraction Problem

A hot spot of a topic in a given data set of timestamped documents is a subinterval of the corpus's time period that contains significantly more documents discussing the topic than the rest of the time period. Identifying a hot spot shows when the topic was unusually prominent in the corpus. To identify hot spots, a discrepancy score is assigned to each of the O(n²) intervals in the time period of the corpus, where n is the number of time points. The discrepancy score of an interval is a numerical value that captures the difference between the presence of the topic in the documents inside the interval and its presence in the documents outside the interval. There are many ways to compute discrepancy scores; the notion of a temporal scan statistic (Scan Statistics Website) is typically used.

We define the hotspot extraction problem as follows: given a timestamped corpus and a topic, identify a time interval with the maximum discrepancy score. There may be more than one such interval, in which case one of them is chosen arbitrarily as the hot spot. Extracting a hot spot requires calculating the discrepancy score of every interval in the time period of the corpus. A naive implementation runs in O(n³) time, where n is the number of time points of the corpus: there are O(n²) intervals, and scoring each one from scratch can take up to O(n) time. A topic can be a single keyword, a list of keywords, or keywords connected with the logical operators AND, OR, and NOT (Chen, Chundi 2011). In this article, a topic is assumed to be a single keyword. However, the algorithms can be extended to more general notions of a topic.
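The naive search described above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: the `discrepancy` function is an assumed stand-in (the topic rate inside the interval minus the rate outside it) rather than a temporal scan statistic, and the names `naive_hot_spot`, `topic_counts`, and `doc_counts` are hypothetical. Per-time-point counts are assumed to be available already.

```python
from typing import List, Optional, Tuple


def discrepancy(inside_topic: int, inside_total: int,
                all_topic: int, all_total: int) -> float:
    """Assumed illustrative score: topic rate inside minus topic rate outside."""
    outside_topic = all_topic - inside_topic
    outside_total = all_total - inside_total
    rate_in = inside_topic / inside_total if inside_total else 0.0
    rate_out = outside_topic / outside_total if outside_total else 0.0
    return rate_in - rate_out


def naive_hot_spot(topic_counts: List[int],
                   doc_counts: List[int]) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, score) of a maximum-score interval, or None if empty.

    topic_counts[i] is the number of documents at time point i that mention
    the topic; doc_counts[i] is the total number of documents at time point i.
    """
    n = len(doc_counts)
    all_topic, all_total = sum(topic_counts), sum(doc_counts)
    best: Optional[Tuple[int, int, float]] = None
    for start in range(n):                # O(n) choices of start
        for end in range(start, n):       # O(n) choices of end
            # Re-summing the interval from scratch costs O(n),
            # which gives the O(n^3) total running time.
            inside_topic = sum(topic_counts[start:end + 1])
            inside_total = sum(doc_counts[start:end + 1])
            score = discrepancy(inside_topic, inside_total, all_topic, all_total)
            # Strict comparison keeps the first of several tied intervals,
            # which is one way to make the arbitrary choice.
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


if __name__ == "__main__":
    print(naive_hot_spot([1, 0, 8, 9, 1], [10, 10, 10, 10, 10]))
```

Replacing the two inner `sum` calls with prefix sums would bring the cost per interval down to O(1); the sketch keeps the from-scratch summation only to mirror the naive bound.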

## Key Terms in this Chapter

Shuffle and Sort: The mechanism in the MapReduce framework that takes the output of the Mapper, groups the key-value pairs into buckets by key, and sorts the values in each bucket.

Discrepancy Score: A numerical value assigned to a time interval that captures the difference between the presence of the topic inside the interval and its presence outside the interval.

Mapper: The part of the MapReduce program that implements the map function.

Time Interval: A sequence of one or more consecutive time points.

Corpus: A collection consisting of two or more documents.

Topic: A keyword or a list of keywords, typically given by the user.

Hot spot: A time interval with the maximum discrepancy score.

Reducer: The part of the MapReduce program that implements the reduce function.

Time Period: A list of consecutive time points.

Time point: An instant of time at a given base granularity, such as a second, minute, day, month, or year.
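To make the Mapper, Shuffle and Sort, and Reducer terms concrete, the following is a small in-memory sketch of how the three stages fit together. It is not the chapter's algorithm and uses no MapReduce framework: it only counts, per time point, how many documents mention a single-keyword topic. The function names, the `(time_point, text)` document shape, and the whitespace-based keyword match are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Document = Tuple[int, str]  # assumed shape: (time point, document text)


def mapper(doc: Document, topic: str) -> Iterator[Tuple[int, int]]:
    """Emit (time point, 1) if the document mentions the topic, else (time point, 0)."""
    time_point, text = doc
    mentions = topic.lower() in text.lower().split()
    yield time_point, 1 if mentions else 0


def shuffle_and_sort(pairs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Group values into buckets by key and sort the values in each bucket."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return {key: sorted(buckets[key]) for key in sorted(buckets)}


def reducer(key: int, values: List[int]) -> Tuple[int, int, int]:
    """Return (time point, documents mentioning the topic, total documents)."""
    return key, sum(values), len(values)


def run_job(corpus: Iterable[Document], topic: str) -> List[Tuple[int, int, int]]:
    """Run map, then shuffle and sort, then reduce, all in memory."""
    pairs = [pair for doc in corpus for pair in mapper(doc, topic)]
    return [reducer(key, values) for key, values in shuffle_and_sort(pairs).items()]


if __name__ == "__main__":
    corpus = [
        (2, "storm warning issued"),
        (1, "quiet day"),
        (2, "no news"),
        (1, "Storm hits coast"),
        (3, "storm storm"),
    ]
    print(run_job(corpus, "storm"))
```

The per-time-point counts produced here are the kind of input an interval-scoring step would need; how the chapter's algorithms distribute that scoring is not shown in this sketch.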
