A New Similarity Metric for Sequential Data

Pradeep Kumar (Indian Institute of Management, India), Bapi S. Raju (Infosys Technologies Limited, India) and P. Radha Krishna (University of Hyderabad, India)
DOI: 10.4018/978-1-61350-474-1.ch014


In many data mining applications, both classification and clustering algorithms require a distance or similarity measure. The central problem in similarity-based clustering and classification of sequential data is choosing an appropriate similarity metric. Existing metrics such as Euclidean, Jaccard, and Cosine do not explicitly exploit the sequential nature of the data. In this chapter, the authors propose a similarity-preserving function called the Sequence and Set Similarity Measure (S3M), which captures both the order of occurrence of items in sequences and the constituent items of sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on two benchmark datasets, DARPA’98 and msnbc, for a classification task in intrusion detection and a clustering task in web mining, respectively. Results show the usefulness of the proposed measure.
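To illustrate why a purely set-based metric can miss sequential structure, the following Python sketch compares a sequence with its reverse. The Jaccard score treats them as identical, whereas a score that also includes an order-sensitive term does not. The combined function here is only an assumed illustration of the idea of capturing both order and content: the longest-common-subsequence term, the weights p and q, and all function names are hypothetical choices made for this sketch and are not taken from the chapter's definition of S3M.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Set-based similarity: ignores the order of items entirely."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def order_and_set_similarity(a, b, p=0.5, q=0.5):
    """Hypothetical weighted mix of an order term and a content term.

    Assumption for illustration only: the order term is the LCS length
    normalised by the longer sequence, and p + q is expected to be 1.
    """
    if not a and not b:
        return 1.0
    order_part = lcs_length(a, b) / max(len(a), len(b))
    return p * order_part + q * jaccard(a, b)


forward = ["login", "search", "cart", "pay"]
backward = list(reversed(forward))

print(jaccard(forward, backward))                   # same items, so order is invisible
print(order_and_set_similarity(forward, backward))  # order term lowers the score
print(order_and_set_similarity(forward, forward))   # identical sequences
```

In this sketch, raising p makes the score more sensitive to ordering and raising q makes it more sensitive to shared items, so the balance would need to be chosen for the application at hand.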
Chapter Preview

Sequence Similarity

A sequence is made up of a set of items that occur in time or one after another, that is, in position but not necessarily in relation to time. In other words, a sequence is an ordered set of items. A sequence is denoted as follows:

S = <a1, a2, …, an>

where a1, a2, …, an are the item sets in sequence S. Sequence S contains n elements, or ordered item sets. The sequence length is the number of item sets contained in the sequence; it is denoted |S|, and here |S| = n.

Formally, similarity is a nonnegative real-valued function S defined on the Cartesian product X × X of a set X. It is called a metric on X if, for every x, y ∈ X, the following properties are satisfied by S.
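As a concrete illustration of the sequence notation S = <a1, a2, …, an>, the sketch below represents a sequence as a Python tuple of item sets. The choice of tuple and frozenset, the example items, and the helper name seq_length are assumptions made for this illustration only.

```python
# A sequence is an ordered collection of item sets.
# The tuple preserves position; each frozenset is one item set a_i.
S = (
    frozenset({"a"}),       # a1
    frozenset({"b", "c"}),  # a2
    frozenset({"d"}),       # a3
)


def seq_length(sequence):
    """Sequence length |S|: the number of item sets in the sequence."""
    return len(sequence)


n = seq_length(S)
print(n)

# Reordering the same item sets yields a different sequence.
reversed_S = S[::-1]
print(reversed_S == S)
```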
