MLA

Gupta, Manish, and Jiawei Han. "Approaches for Pattern Discovery Using Sequential Data Mining." Data Mining: Concepts, Methodologies, Tools, and Applications, edited by Information Resources Management Association, IGI Global Scientific Publishing, 2013, pp. 1835-1851. https://doi.org/10.4018/978-1-4666-2455-9.ch095

APA

Gupta, M. & Han, J. (2013). Approaches for Pattern Discovery Using Sequential Data Mining. In I. Management Association (Ed.), Data Mining: Concepts, Methodologies, Tools, and Applications (pp. 1835-1851). IGI Global Scientific Publishing. https://doi.org/10.4018/978-1-4666-2455-9.ch095

Chicago

Gupta, Manish, and Jiawei Han. "Approaches for Pattern Discovery Using Sequential Data Mining." In Data Mining: Concepts, Methodologies, Tools, and Applications, edited by Information Resources Management Association, 1835-1851. Hershey, PA: IGI Global Scientific Publishing, 2013. https://doi.org/10.4018/978-1-4666-2455-9.ch095

Approaches for Pattern Discovery Using Sequential Data Mining

Manish Gupta (University of Illinois at Urbana-Champaign, USA) and Jiawei Han (University of Illinois at Urbana-Champaign, USA)

Source Title: Data Mining: Concepts, Methodologies, Tools, and Applications

DOI: 10.4018/978-1-4666-2455-9.ch095

Abstract

In this chapter we first introduce sequence data. We then discuss the approaches to mining patterns from sequence data that have been studied in the literature. Apriori-based methods and pattern-growth methods are the earliest and most influential methods for sequential pattern mining. There is also a vertical-format-based method, which works on a dual representation of the sequence database. Work has also been done on mining patterns with constraints, mining closed patterns, mining patterns from multi-dimensional databases, mining closed repetitive gapped subsequences, and other forms of sequential pattern mining. Some works also focus on mining incremental patterns and mining from stream data. We present at least one method of each type and discuss its advantages and disadvantages. We conclude with a summary of the work.

Chapter Preview

Introduction

What is Sequence Data?

Sequence data is omnipresent. Examples include customer shopping sequences, medical treatment data, data related to natural disasters, science and engineering process data, stock and market data, telephone calling patterns, weblog click streams, program execution sequences, DNA sequences, and gene expression and structure data.

Notations and Terminology

Let I = {i₁, i₂, i₃, …, i_n} be a set of items. An item-set X is a subset of items, i.e., X ⊆ I. A sequence is an ordered list of item-sets (also called elements or events). Items within an element are unordered, and we list them alphabetically. An item can occur at most once in an element of a sequence, but can occur multiple times in different elements of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence. E.g., s = <a(ce)(bd)(bcde)f(dg)> is a sequence that consists of 7 distinct items and 6 elements (parentheses are omitted around single-item elements); its length is 12.
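The notation above can be made concrete with a small sketch. It assumes a representation that is not part of the chapter: a sequence is a Python list of `frozenset` elements, and items are single characters. The helper names (`seq_length`, `distinct_items`, `format_seq`) are hypothetical and chosen only for illustration.

```python
# Assumed representation (not from the chapter): a sequence is a list of
# elements, and each element is a frozenset of single-character items.
s = [
    frozenset("a"),
    frozenset("ce"),
    frozenset("bd"),
    frozenset("bcde"),
    frozenset("f"),
    frozenset("dg"),
]


def seq_length(sequence):
    """Length of a sequence: the number of item instances over all elements."""
    return sum(len(element) for element in sequence)


def distinct_items(sequence):
    """Set of distinct items that occur anywhere in the sequence."""
    return set().union(*sequence)


def format_seq(sequence):
    """Render a sequence in the <a(ce)(bd)...> notation used in the text.

    Items inside an element are listed alphabetically; parentheses are
    dropped for single-item elements.
    """
    parts = []
    for element in sequence:
        items = "".join(sorted(element))
        parts.append(items if len(element) == 1 else f"({items})")
    return "<" + "".join(parts) + ">"


print(format_seq(s))           # notation for the example sequence
print(len(s))                  # number of elements
print(len(distinct_items(s)))  # number of distinct items
print(seq_length(s))           # length of the sequence
```

Using `frozenset` for elements mirrors the definition directly: items within an element are unordered and occur at most once, while the enclosing list keeps the order of elements.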

A group of sequences stored with their identifiers is called a sequence database. We say that a sequence s is a subsequence of t if s is a “projection” of t, derived by deleting elements and/or items from t. E.g., <a(c)(bd)f> is a subsequence of the example sequence <a(ce)(bd)(bcde)f(dg)>. Further, a sequence s = <s₁ s₂ … s_n> is a δ-distance subsequence of t if there exist integers j₁ < j₂ < … < j_n such that s₁ ⊆ t_{j₁}, s₂ ⊆ t_{j₂}, …, s_n ⊆ t_{j_n} and j_k − j_{k−1} ≤ δ for each k = 2, 3, …, n. That is, occurrences of adjacent elements of s within t are at most δ positions apart.
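The two containment tests can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's method: sequences are lists of `frozenset` elements, and the helper `seq` and the function names are hypothetical. The plain subsequence test uses a greedy earliest-match scan. The δ-distance test instead tracks every position where the prefix matched so far can end, because greedy matching can miss a valid embedding once a gap limit applies.

```python
def seq(*elements):
    """Hypothetical helper: seq("a", "ce") -> [frozenset("a"), frozenset("ce")]."""
    return [frozenset(e) for e in elements]


def is_subsequence(s, t):
    """True if s can be derived from t by deleting elements and/or items."""
    pos = 0
    for element in s:
        # Greedy: advance to the earliest element of t that contains this one.
        while pos < len(t) and not element <= t[pos]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1  # the next element of s must match strictly later in t
    return True


def is_delta_subsequence(s, t, delta):
    """True if s embeds in t with consecutive match positions <= delta apart."""
    if not s:
        return True
    # ends: positions j such that the prefix of s seen so far can be matched
    # with its last element placed at t[j].
    ends = {j for j, te in enumerate(t) if s[0] <= te}
    for element in s[1:]:
        ends = {
            j
            for j, te in enumerate(t)
            if element <= te and any(0 < j - i <= delta for i in ends)
        }
        if not ends:
            return False
    return bool(ends)


t = seq("a", "ce", "bd", "bcde", "f", "dg")  # <a(ce)(bd)(bcde)f(dg)>
p = seq("a", "c", "bd", "f")                 # <a(c)(bd)f>

print(is_subsequence(p, t))
print(is_delta_subsequence(p, t, 2))
print(is_delta_subsequence(p, t, 1))
```

In this example the only way to place f is at position 5 of t, which is two positions after the nearest (bd), so the pattern is a 2-distance subsequence but not a 1-distance subsequence.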

What is Sequential Pattern Mining?

Given a pattern p, the support of p is the number of sequences in the database that contain p, i.e., that have p as a subsequence. A pattern with support greater than the support threshold min_sup is called a frequent pattern or a frequent sequential pattern. A sequential pattern of length l is called an l-pattern. Sequential pattern mining is the task of finding the complete set of frequent subsequences in a given set of sequences. A huge number of possible sequential patterns are hidden in databases.
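Support counting can be illustrated with a short sketch. The four-sequence database below is toy data made up for illustration, not data from the chapter, and the function names are hypothetical. The frequency test follows the wording above (support strictly greater than min_sup); this is an assumption about the intended comparison, and some formulations use "at least" instead.

```python
def seq(*elements):
    """Hypothetical helper: build a sequence as a list of frozenset elements."""
    return [frozenset(e) for e in elements]


def contains(t, p):
    """True if pattern p is a subsequence of sequence t (greedy earliest match)."""
    pos = 0
    for element in p:
        while pos < len(t) and not element <= t[pos]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True


def support(db, p):
    """Number of sequences in the database that contain pattern p."""
    return sum(1 for t in db.values() if contains(t, p))


def is_frequent(db, p, min_sup):
    # Follows the text's wording: support strictly greater than min_sup.
    return support(db, p) > min_sup


def frequent_items(db, min_sup):
    """Frequent 1-patterns: single items whose support exceeds min_sup."""
    items = set().union(*(e for t in db.values() for e in t))
    return sorted(i for i in items if is_frequent(db, seq(i), min_sup))


# Toy sequence database (illustrative only): sequence id -> sequence.
db = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

print(support(db, seq("a", "c")))   # support of <ac>
print(support(db, seq("ab", "c")))  # support of <(ab)c>
print(frequent_items(db, 2))        # frequent 1-patterns for min_sup = 2
```

Note that support counts sequences, not occurrences: a pattern that appears several times inside one sequence still contributes 1 for that sequence.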

A sequential pattern mining algorithm should:

A. find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold;
B. be highly efficient and scalable, involving only a small number of database scans; and
C. be able to incorporate various kinds of user-specific constraints.
