Approaches for Pattern Discovery Using Sequential Data Mining

Approaches for Pattern Discovery Using Sequential Data Mining

Manish Gupta (University of Illinois at Urbana-Champaign, USA) and Jiawei Han (University of Illinois at Urbana-Champaign, USA)
DOI: 10.4018/978-1-61350-056-9.ch008
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

In this chapter we first introduce sequence data. We then discuss different approaches for mining of patterns from sequence data, studied in literature. Apriori based methods and the pattern growth methods are the earliest and the most influential methods for sequential pattern mining. There is also a vertical format based method which works on a dual representation of the sequence database. Work has also been done for mining patterns with constraints, mining closed patterns, mining patterns from multi-dimensional databases, mining closed repetitive gapped subsequences, and other forms of sequential pattern mining. Some works also focus on mining incremental patterns and mining from stream data. We present at least one method of each of these types and discuss their advantages and disadvantages. We conclude with a summary of the work.
Chapter Preview
Top

Introduction

What is Sequence Data?

Sequence data is omnipresent. Customer shopping sequences, medical treatment data, and data related to natural disasters, science and engineering processes data, stocks and markets data, telephone calling patterns, weblog click streams, program execution sequences, DNA sequences and gene expression and structures data are some examples of sequence data.

Notations and Terminology

Let I = {i1, i2, i3 … in} be a set of items. An item-set X is a subset of items i.e. X ⊆ I. A sequence is an ordered list of item-sets (also called elements or events). Items within an element are unordered and we would list them alphabetically. An item can occur at most once in an element of a sequence, but can occur multiple times in different elements of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence. E.g., s=<a(ce)(bd)(bcde)f(dg)> is a sequence which consists of 7 distinct items and 6 elements. Length of the sequence is 12.

A group of sequences stored with their identifiers is called a sequence database. We say that a sequence s is a subsequence of t, if s is a “projection” of t, derived by deleting elements and/or items from t. E.g. <a(c)(bd)f> is a subsequence of s. Further, sequence s is a δ-distance subsequence of t if there exist integers j1 < j2 < … < jn such that s1 ⊆ tj1, s2 ⊆ tj2 … sn ⊆ tjn and jk-jk-1 ≤ δ for each k = 2, 3 ... n. That is, occurrences of adjacent elements of s within t are not separated by more than δ elements.

What is Sequential Pattern Mining?

Given a pattern p, support of the sequence pattern p is the number of sequences in the database containing the pattern p. A pattern with support greater than the support threshold min_sup is called a frequent pattern or a frequent sequential pattern. A sequential pattern of length l is called an l-pattern. Sequential pattern mining is the task of finding the complete set of frequent subsequences given a set of sequences. A huge number of possible sequential patterns are hidden in databases.

A sequential pattern mining algorithm should:

  • A.

    find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold,

  • B.

    be highly efficient, scalable, involving only a small number of database scans

  • C.

    be able to incorporate various kinds of user-specific constraints.

Complete Chapter List

Search this Book:
Reset