1. Introduction
With the rapid development of technology, the data that need to be processed are becoming increasingly varied and complicated. Moreover, the scale of the data is huge, their forms are more diverse, and processing speed cannot keep up with demand. How to obtain valuable information and meaningful knowledge quickly from large and complex big data has therefore become a key challenge.
Data mining (Lin & Chen, 2011) extracts potentially useful information that people are interested in and that is unknown in advance. The extracted knowledge can be represented in a variety of forms, such as concepts, rules, regularities and patterns. Time series (Box, Jenkins et al. 2013), which reflect how attribute values change over time, are characterized by large data scale, high dimensionality and frequent updates.
As an important branch of data mining research, time series data mining is the process of discovering patterns and extracting knowledge from time sequences. Its basic tasks include similarity search, periodic pattern mining, analysis and prediction, clustering and classification, visualization, and abnormal data mining. For a long time, the aim has been to find data patterns that appear frequently and to summarize rules from them. Currently, almost all abnormal data are treated as noise and ignored. Yet although abnormal data occur less frequently than normal data, they may hide important information. Finding these abnormal data and the information hidden in them may therefore yield valuable and more enlightening knowledge.
Intuitively, abnormal data are those which are associated with the data model or the data objects but do not conform to the general distribution. There is currently no generally accepted definition of abnormal data; the definition changes with the specific application. Research on abnormal time sequence mining started relatively late, but it has recently been attracting more interest. For instance, this technology can be used to monitor crime hidden in electronic commerce, and to detect possible intrusions by hackers in the daily management of the Internet. In these applications, abnormal data often carry more useful meaning than normal data. Although the probability of abnormal data is very small, their importance cannot be ignored. Capturing these low-probability events is therefore important for some applications.
Hydrological data are discrete records of hydrological processes, such as flow, rainfall and water level (Ping, 2003). They are huge in volume, noisy, unstable and poorly correlated. Hydrological modernization, which covers information collection, extraction, analysis and so on, is now widespread. In hydrological work, it is indispensable to acquire valuable information and meaningful knowledge quickly from large amounts of data. The rapid development of data mining provides a new approach for water resource management, hydrology and hydroinformatics research (Kozerski, 2010).
Hydrological time series data mining (Kozerski, 2010) can be used to extract previously unknown processes that contain important information from hydrological data, which is valuable for hydrological forecasting and hydrological data analysis. Abnormal hydrological time series are data objects that are obviously inconsistent with the general rules of hydrological phenomena. The field of hydrological data mining has three main branches: similarity search, sequence pattern mining and cycle analysis. Research on abnormal hydrological time series data mining is still at an early stage.
There are many approaches to abnormal time series data mining, but most of them have clear shortcomings. For example, the approach based on immunology (Chaovalit, Gangopadhyay et al. 2011) cannot be applied to diverse data. The computational efficiency of the Support Vector Machine (SVM) (Verdejo, García et al. 2011) is high, but its theory and modeling process are so complex that it can only be adopted by experts. The accuracy of the TSA-tree (Zhang, Meratnia et al. 2010) is low. To solve these problems, this paper puts forward a new approach based on Extended Symbolic Aggregate Approximation (ESAX) (Lkhagva, Suzuki et al. 2006) and Dynamic Time Warping (DTW) (Müller, 2007).
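For readers unfamiliar with DTW, the following is a minimal sketch of the textbook dynamic-programming formulation, not the approach proposed in this paper. The function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions made for illustration only; the ESAX step is not shown.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW distance between two numeric sequences.

    Illustrative sketch only: the local cost is assumed to be the
    absolute difference, and no warping-window constraint is applied.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again with the current b element
                cost[i][j - 1],      # b[j-1] matched again with the current a element
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

Because DTW allows one point to be matched with several points of the other sequence, two series that differ only by local stretching, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have distance zero, whereas a point-by-point comparison could not even be applied to sequences of different lengths.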