Time series data is usually generated by measuring and monitoring applications, and accounts for a large fraction of the data available for analysis purposes. A time series is typically a sequence of values that represent the state of a variable over time. Each value of the variable might be a simple value, or might have a composite structure, such as a vector of values. Time series data can be collected about natural phenomena, such as the amount of rainfall in a geographical region, or about a human activity, such as the number of shares of GoogleTM stock sold each day. Time series data is typically used for predicting future behavior from historical performance. However, a time series often needs further processing to discover the structure and properties of the recorded variable, thereby facilitating the understanding of past behavior and prediction of future behavior. Segmentation of a given time series is often used to compactly represent the time series (Gionis & Mannila, 2005), to reduce noise, and to serve as a high-level representation of the data (Das, Lin, Mannila, Renganathan & Smyth, 1998; Keogh & Kasetty, 2003). Data mining of a segmentation of a time series, rather than the original time series itself, has been used to facilitate discovering structure in the data, and finding various kinds of information, such as abrupt changes in the model underlying the time series (Duncan & Bryant, 1996; Keogh & Kasetty, 2003), event detection (Guralnik & Srivastava, 1999), etc. The rest of this chapter is organized as follows. The section on Background gives an overview of the time series segmentation problem and solutions. This section is followed by a Main Focus section where details of the tasks involved in segmenting a given time series and a few sample applications are discussed. Then, the Future Trends section presents some of the current research trends in time series segmentation and the Conclusion section concludes the chapter. Several important terms and their definitions are also included at the end of the chapter.
A time series is simply a sequence of data points. In a segmentation of a given time series, one or more consecutive data points are combined into a single segment and represented by a single data point or a model for the data points in the segment. Given a time series T of n data points, segmentation of T results in a sequence of m segments, where each segment represents one or more consecutive data points in T. The number of segments m is typically much less than n. The input to a given instance of the segmentation problem is usually a time series and an upper bound on the number of segments.
Since the data points in a segment are represented by a single point or model, there is usually some error in the representation. Formally, the segmentation problem can be defined as follows. Given time series T and integer m, find a minimum error segmentation of T consisting of at most m segments. A specific version of this problem depends on the form of the data and how the segmentation error is defined. There are several approaches in the literature for addressing the segmentation problem. An optimum solution for the segmentation problem can be found by a dynamic programming based approach. (Duncan & Bryant, 1996; Gionis & Mannila, 2005; Himberg, Korpiaho, Mannila, Tikanmaki & Toivonen, 2001). A dynamic programming algorithm for solving this optimization problem runs in O(n2m) time, and so may not be practical for long time series with thousands of data points. There are more efficient heuristics that can be used to construct a segmentation of a given time series. However, these heuristics generally produce a suboptimal segmentation (Himberg, Korpiaho, Mannila, Tikanmaki & Toivonen, 2001; Keogh, Chu, Hart & Pazzani, 2004). There are also Bayesian based, fuzzy clustering, and genetic algorithm approaches to the segmentation problem (Oliver & Forbes, 1997; Abonyi, Feil, Nemeth, & Arva, 2005; Tseng, Chen, Chen & Hong, 2006). Methods have also been developed where a time series is segmented by converting it into a sequence of discrete symbols (Chung, Fu, Ng & Luk, 2004).