Concept Drift Detection in Data Stream Clustering and its Application on Weather Data

This article presents a stream mining framework to cluster the data stream and monitor its evolution. Even though concept drift is expected to be present in data streams, explicit drift detection is rarely done in stream clustering algorithms. The proposed framework is capable of explicit concept drift detection and cluster evolution analysis. Concept drift is caused by changes in the data distribution over time. The relationship between concept drift and the occurrence of physical events has been studied by applying the framework to a weather data stream. Experiments led to the conclusion that concept drift accompanied by a change in the number of clusters indicates a significant weather event. This kind of online monitoring and its results can be utilized in weather forecasting systems in various ways. Weather data streams produced by automatic weather stations (AWS) are used to conduct this study.

The utility of this framework is studied by applying it to weather data. Concept drifts, changes in the best value of k and the cluster evolutions are monitored to understand their relationship with the physical weather phenomena. Nowadays most weather monitoring equipment produces high-speed data with a large number of variables. The weather stream produced by the Automatic Weather Station (AWS) at the Advanced Centre for Atmospheric Radar Research (ACARR) of Cochin University of Science and Technology, Kerala, India is used for this study. Data is collected at one-minute intervals and contains weather parameters such as temperature, relative humidity, wind speed, wind direction, radiation, pressure, rainfall, etc. Monsoons are the most important weather phenomenon as far as the Indian region is concerned. Hence the framework is used to study the interrelationship between the changes in the clustering structure and the evolution of the south-west monsoon.
In supervised learning problems, concept drift can be defined as a change in the joint probability distribution P(X, Y), where X denotes a random variable over vectors of attribute values and Y denotes a random variable over class labels (Webb, 2016).

Stream Clustering
While incorporating an explicit concept drift detection module into an existing stream clustering algorithm, the major concern is how to blend the drift detection methodology with online learning. Before going to those details, a brief discussion on the stream clustering algorithm used in this framework is given. Overview of the above-mentioned process is shown in Figure 2. The following three processes will be running continuously throughout the stream:
The flexibility provided by the CluStream algorithm to choose the time horizon of macro-clustering is utilised to get the macro-clustering over the warning period.
The Page-Hinkley test continuously monitors a parameter that can represent the change. In clustering, the average distance between the data records and their closest cluster centres is the parameter being monitored. Silva et al. (2017
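The monitoring step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the standard textbook form of the Page-Hinkley test (cumulative deviation from the running mean, compared against its minimum), and the names `PageHinkley`, `nearest_centre_distance` and `detect_drifts` are hypothetical. The threshold and tolerance are set as multiples of an average cluster radius, following the settings quoted in the figure captions; the fixed cluster centres and the synthetic points are assumptions made only so that the example is self-contained.

```python
import math


class PageHinkley:
    """Textbook Page-Hinkley test for an increase in the mean of a stream."""

    def __init__(self, delta, threshold):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (lambda)
        self.reset()

    def reset(self):
        self.n = 0          # number of observations seen
        self.mean = 0.0     # running mean of the monitored value
        self.cum = 0.0      # cumulative deviation m_t
        self.cum_min = 0.0  # minimum of m_t seen so far

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold


def nearest_centre_distance(point, centres):
    """Distance from a record to its closest cluster centre."""
    return min(math.dist(point, c) for c in centres)


def detect_drifts(points, centres, avg_radius,
                  lam_factor=1.5, delta_factor=0.1):
    """Return the stream positions at which the test raises an alarm.

    Assumption: the centres stay fixed here. A real framework would
    re-cluster after each detected drift.
    """
    lam = lam_factor * avg_radius
    detector = PageHinkley(delta=delta_factor * lam, threshold=lam)
    drifts = []
    for i, p in enumerate(points):
        if detector.update(nearest_centre_distance(p, centres)):
            drifts.append(i)
            detector.reset()  # start monitoring the new concept
    return drifts


# Synthetic stream: 200 records at distance 1 from a centre,
# followed by 50 records at distance 3 (a shift in the data).
centres = [(0.0, 0.0), (10.0, 10.0)]
stream = [(1.0, 0.0)] * 200 + [(3.0, 0.0)] * 50
drifts = detect_drifts(stream, centres, avg_radius=1.0)
print(drifts)
```

A smaller `delta_factor` makes the cumulative statistic grow faster for the same deviation, so alarms tend to be raised earlier and more often; a larger `lam_factor` has the opposite effect.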
The proposed framework does not require the historical data to be stored in main memory; instead, it just keeps snapshots of data in secondary memory, as discussed in the CluStream algorithm (Aggarwal et al. 2003). This reduces memory-related problems considerably and makes the change detection process faster.
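The way stored snapshots allow clustering over a chosen time horizon can be sketched as follows. This is a simplified illustration of the subtractive property of cluster feature vectors used by CluStream, not the framework's code: each snapshot is assumed to map a micro-cluster id to a pair (record count, per-dimension linear sum), and the function name `horizon_summary` is hypothetical. CluStream's squared sums, timestamps, and its handling of merged or deleted micro-clusters are left out.

```python
def horizon_summary(current, past):
    """Summarise the records that arrived between two snapshots.

    current, past: dict mapping micro-cluster id -> (n, linear_sum),
    where linear_sum is a list with one entry per dimension.
    Returns id -> (n, centroid) for micro-clusters that received
    records within the horizon.
    """
    result = {}
    for cid, (n_now, ls_now) in current.items():
        # A micro-cluster absent from the older snapshot is treated as new.
        n_old, ls_old = past.get(cid, (0, [0.0] * len(ls_now)))
        n = n_now - n_old
        if n > 0:
            ls = [a - b for a, b in zip(ls_now, ls_old)]
            result[cid] = (n, [v / n for v in ls])
    return result


# Snapshot stored at the start of the horizon, and the current state.
past = {1: (6, [6.0, 12.0])}
current = {1: (10, [20.0, 30.0]), 2: (4, [4.0, 8.0])}
summary = horizon_summary(current, past)
print(summary)
```

The resulting centroids and counts could then be passed to a weighted macro-clustering step such as k-means, so that only the snapshot at the start of the period of interest has to be read back from secondary memory.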

Figure 1. Overview of the proposed framework

Figure 3. (a) Rainfall during the period 8th April to 20th June. The week of 14th May denotes the start of a pre-monsoon shower. (b) Concept drift points and the corresponding number of clusters estimated for the period 8th April to 20th June. It can be noted that the number of clusters reduced from 4 to 2 from 14th May to 19th May, where the temperature also shows a sudden drop.

Figure 4. (a) Rainfall during the period August to November; (b) Number of clusters computed for the period August to November

Figure 6. Concept drift points when λ_A = 1.75 times the average cluster radius and δ = 0.1*λ_A

Figure 7. Concept drift points and the corresponding number of clusters estimated when λ_A = 1.5 times the average cluster radius and δ = 0.01*λ_A. Since the value of δ is reduced, concept changes are reported more readily.

Figure 8. Cluster evolution analysis: Percentage of clusters that have been absorbed, survived and split at consecutive concept change points. This plot corresponds to the concept change points shown in Figure 7, i.e., when λ_A = 1.5 times the average cluster radius and δ = 0.01*λ_A.