Definitions related to the sequential pattern extraction will first be given. They will help understanding the various problems and methods presented hereafter.
Definitions
The item is the basic value for numerous data mining problems. It can be considered as the object bought by a customer, or the page requested by the user of a website, etc. An itemset is the set of items that are grouped by timestamp (e.g. all the pages requested by the user on June 04, 2004). A data sequence is a sequence of itemsets associated to a customer. In table 1, the data sequence of C2 is the following: “(Camcorder, MiniDV) (DVD Rec, DVD-R) (Video Soft)” which means that the customer bought a camcorder and miniDV the same day, followed by a DVD recorder and DVD-R the day after, and finally a video software a few days later.
Table 1.
Data sequences of four customers over four days
| Cust |
June 04, 2004 |
June 05, 2004 |
June 06, 2004 |
June 07, 2004 |
| C1 |
Camcorder, MiniDV |
Digital Camera |
MemCard |
USB Key |
| C2 |
Camcorder, MiniDV |
DVD Rec, DVD-R |
|
Video Soft |
| C3 |
DVD Rec, DVD-R |
MemCard |
Video Soft |
USB Key |
| C4 |
|
Camcorder, MiniDV |
Laptop |
DVD Rec, DVD-R |
A sequential pattern is included in a data sequence (for instance “(MiniDV) (Video Soft)” is included in the data sequence of C2, whereas “(DVD Rec) (Camcorder)” is not included according to the order of the timestamps). The minimum support is specified by the user and stands for the minimum number of occurrences of a sequential pattern to be considered as frequent. A maximal frequent sequential pattern is included in at least “minimum support” data sequences and is not included in any other frequent sequential pattern. Table 1 gives a simple example of 4 customers and their activity over 4 days in a shop. With a minimum support of “50%” a sequential pattern can be considered as frequent if it occurs at least in the data sequences of 2 customers (2/4). In this case a maximal sequential pattern mining process will find three patterns:
-
•
S1: “(Camcorder, MiniDV) (DVD Rec, DVD-R)”
-
•
S2: “(DVD Rec, DVD-R) (Video Soft)”
-
•
S3: “(Memory Card) (USB Key)”
One can observe that S1 is included in the data sequences of C2 and C4, S2 is included in those of C2 and C3, and S3 in those of C1 and C2. Furthermore the sequences do not have the same length (S1 has length 4, S2 has length 3 and S3 has length 2).