Receive a 20% Discount on All Purchases Directly Through IGI Global's Online Bookstore

Manish Gupta (University of Illinois at Urbana-Champaign, USA) and Jiawei Han (University of Illinois at Urbana-Champaign, USA)

Copyright: © 2012
|Pages: 18

DOI: 10.4018/978-1-61350-056-9.ch008

Chapter Preview

TopSequence data is omnipresent. Customer shopping sequences, medical treatment data, and data related to natural disasters, science and engineering processes data, stocks and markets data, telephone calling patterns, weblog click streams, program execution sequences, DNA sequences and gene expression and structures data are some examples of sequence data.

Let I = {i_{1}, i_{2}, i_{3} … i_{n}} be a set of items. An item-set X is a subset of items i.e. X ⊆ I. A sequence is an ordered list of item-sets (also called elements or events). Items within an element are unordered and we would list them alphabetically. An item can occur at most once in an element of a sequence, but can occur multiple times in different elements of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence. E.g., s=<a(ce)(bd)(bcde)f(dg)> is a sequence which consists of 7 distinct items and 6 elements. Length of the sequence is 12.

A group of sequences stored with their identifiers is called a sequence database. We say that a sequence s is a subsequence of t, if s is a “projection” of t, derived by deleting elements and/or items from t. E.g. <a(c)(bd)f> is a subsequence of s. Further, sequence s is a δ-distance subsequence of t if there exist integers j_{1} < j_{2} < … < j_{n} such that s_{1} ⊆ t_{j1}, s_{2} ⊆ t_{j2} … s_{n} ⊆ t_{jn} and j_{k}-j_{k-1} ≤ δ for each k = 2, 3 ... n. That is, occurrences of adjacent elements of s within t are not separated by more than δ elements.

Given a pattern p, support of the sequence pattern p is the number of sequences in the database containing the pattern p. A pattern with support greater than the support threshold min_sup is called a frequent pattern or a frequent sequential pattern. A sequential pattern of length l is called an l-pattern. Sequential pattern mining is the task of finding the complete set of frequent subsequences given a set of sequences. A huge number of possible sequential patterns are hidden in databases.

A sequential pattern mining algorithm should:

*A.*find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold,

*B.*be highly efficient, scalable, involving only a small number of database scans

*C.*be able to incorporate various kinds of user-specific constraints.

Search this Book:

Reset