Mining Unexpected Sequential Patterns and Implication Rules

Mining Unexpected Sequential Patterns and Implication Rules

Dong (Haoyuan) Li (LGI2P, École des Mines d’Alès, France), Anne Laurent (LIRMM, Université Montpellier II, France) and Pascal Poncelet (LIRMM, Université Montpellier II, France)
DOI: 10.4018/978-1-60566-754-6.ch010


As common criteria in data mining methods, the frequency-based interestingness measures provide a statistical view of the correlation in the data, such as sequential patterns. However, when the authors consider domain knowledge within the mining process, the unexpected information that contradicts existing knowledge on the data has never less importance than the regularly frequent information. For this purpose, the authors present the approach USER for mining unexpected sequential rules in sequence databases. They propose a belief-driven formalization of the unexpectedness contained in sequential data, with which we propose 3 forms of unexpected sequences. They further propose the notion of unexpected sequential patterns and implication rules for determining the structures and implications of the unexpectedness. The experimental results on various types of data sets show the usefulness and effectiveness of our approach.
Chapter Preview


Most real world applications process the data stored in sequence format, where the elements in data are sequentially ordered with temporal or spatial relation. For instances, in a customer retail database, a sequence can be all purchases of a customer ordered by the time of transaction; in a Web access log file, a sequence can be all of those resources accessed during a user session ordered by the time of request; in a telecommunication network monitoring database, a sequence can be all events during a period ordered by the time of occurrence; in a DNA segment, a sequence is a succession of nucleotide subunits with spatial order, etc. In order to discover the knowledge hidden in such sequential data, sequence data mining techniques (Dong & Pei, 2007; Han & Kamber, 2006) have been highly developed and widely applied in many application domains.

As one of the most important models of sequence data mining, the sequential pattern proposed by Agrawal and Srikant (1995) provides a statistical frequency based view of the correlations between the elements in sequential data. The problem of mining sequential patterns can be formally described as follows.

Given a set of binary-valued attributes R = {i1,i2,…,in}, an attribute is an item. An itemset, denoted as I = (i1i2im), is an unordered collection of items. A sequence is an ordered list of itemsets, denoted as s = 〈I1I2Ik〉. A sequence database, denoted as D, is a large set of sequences. Given two sequences s = 〈I1I2Im〉 and s' = 〈I'1I'2I'n〉, if there exist integers 1 ≤ i1i2 ≤ … ≤ imn such that I1I'i1, I2I'i2, …, ImI'im, then the sequence s is a subsequence of the sequence s' and the sequence s' is a super sequence of the sequence s, denoted as s ô s', and we say that the sequence s is included in the sequence s', or the sequence s' supports the sequence s. If a sequence s is not included in any other sequences, then the sequence s is a maximal sequence. The support (or the frequency) of a sequence s in a sequence database D, denoted as σ (s, D), is the fraction of the total number of sequences in the database D that support s. Given a minimal frequency threshold minimum support specified by user, denoted as σmin, a sequence s is frequent if σ (s, D) ≥ σmin. A sequential pattern is a frequent maximal sequence, so that the problem of mining sequential patterns is to find all frequent maximal sequences in a sequence database.

Example 1. Let D be a customer retail database, with the minimum support σmin = 0.5, we may find the sequential pattern s = 〈(Sci-Fi-Novel)(Action-Film Sci-Fi-Film)(Rock-Music)〉 where σ (s, D) = 0.6, which can be interpreted as “60% of customers purchase a Sci-Fi novel, then purchase action and Sci-Fi films later, and then purchase a rock music CD”. +

Example 2. Let D be a Web access log database, with the minimum support σmin = 0.5, we may find the sequential pattern s = 〈(login)(msglist)(msgread)(msgread)(logout)〉 where σ (s, D) = 0.8, which can then be interpreted as “80% of users visit the login page, then visit the message list page, then read messages, and at last logout”. +

Complete Chapter List

Search this Book: