With the huge number of information sources available on the Internet and the high dynamics of their data, peer-to-peer (P2P) systems propose a communication model in which each party has the same capabilities and can initiate a communication session. These networks allow a group of computer users with the same networking program to connect with each other and directly access resources from one another. P2P architectures also provide a good infrastructure for data and computer intensive operations such as data mining. In this article we consider a new data mining approach for improving resource searching in a dynamic and distributed database such as an unstructured P2P system, that is, in Masseglia, Poncelet, and Teisseire (2006) we call this problem P2P usage analysis. More precisely we aim at discovering frequent behaviors among users of such a system. We will focus on the sequential order between actions performed on each node (requests or downloads) and show how this order has to be taken into account for extracting useful knowledge. For instance, it may be discovered, in a P2P file sharing network that for 77% of nodes from which a request is sent for “Mandriva Linux,” the file “Mandriva Linux 2005 CD1 i585-Limited- Edition-Mini.iso” is chosen and downloaded; then a new request is performed with the possible name of the remaining iso images (i.e., “Mandriva Linux 2005 Limited Edition”), and in the large number of returned results the image corresponding to “Mandriva Linux 2005 CD2 i585-Limited-Edition-Mini.iso” is chosen and downloaded. Such knowledge is very useful for proposing the user with often downloaded or requested files according to a majority of behaviors. It could also be useful in order to avoid extra bandwidth consumption, which is the main cost of P2P queries (Ng, Chu, Rao, Sripanidkulchai, & Zhang, 2003).
Mining either association rules or sequential patterns in very large distributed databases as unstructured P2P systems is far away from trivial. For instance, in Wolff and Schuster (2003), authors propose to mine association rules, that is, sets of objects that tend to associate with one another, in a P2P system. The proposed algorithm combines, at each node, association rule mining algorithm with a majority voting protocol to discover all of the association rules that exist in the distributed database.
By nature P2P systems are dynamic, that is, nodes act independently of one another, and intermediate results may be overturned as new data arrives. Furthermore, whenever a node departs, the sequence of that node also disappears, and the global database has to be reconsidered. Traditional approaches for mining sequential patterns (Agrawal & Srikant, 1995; Pei et al., 2001) are irrelevant in such a dynamic context because they consider that the whole database is available. In Masseglia, Teisseire, and Poncelet (2003), we proposed to discover relationships and global patterns that exist between connecting users (Web usage mining). We proposed a new “client/server/engine” architecture for taking advantage of the computing power available on the machine a user navigates with. In this article our goal is different since it takes into account the dynamic nature of the considered system. We consider that the connected nodes can act with a special peer (called meter peer) in order to provide the end user with a good approximation of patterns embedded in this very large distributed database.
Key Terms in this Chapter
Peer-to-Peer (P2P) Systems: “Peer-to-peer” (or P2P) refers to a class of systems and applications that employ distributed resources to perform a critical function in a decentralized manner. The resources encompass computing power, data (storage and content), network bandwidth, and presence (computers, human, and other resources). Among the critical functions that have to be performed by P2P systems, we can cite distributed computing, data/content sharing, communication and collaboration, or platform services.
Data Sequence: The sequence of itemsets representing the behavior of a client over a specific period. The database involved in a sequential pattern mining process is a (usually large) set of data sequences. Data sequence could be purchase transaction data collected by retail stores or requested URLs. See also Web Usage Mining.
Sequential Pattern: A sequence included in a data sequence such that each item in the sequential pattern appears in this data sequence with respect to the order between the itemsets in both sequences. See also Data Sequence.
Client/Server: Client/server describes the relationship between two computer programs in which one program, called the client, makes a service request from another program, called the server, which fulfills the request.
Itemset: Set of items that occur together.
Peer-to-Peer (P2P) Usage Mining: With the huge number of information sources available on the Internet and the high dynamics of their data, (P2P) systems allow a group of computer users with the same networking program to connect with each other and directly access resources from one another. P2P Usage Analysis aims at discovering frequent behaviors among users of such a system.
Web Usage Mining: With the growing popularity of the World Wide Web, large volumes of data such as the user’s address or requested URLs are automatically gathered by Web servers and stored in access log files (access log, error log …). Analysis of such files in order to understand user behavior is called Web usage mining and can provide useful information to improve a network’s performance, to rebuild a Web site dynamically, or even to target customers in an e-business environment.
Items: Items are merely characteristics of individuals or observations of individual behaviors.
Maximal Frequent Sequential Pattern: A sequential pattern included in at least n data sequences (with n the minimum support specified by the user). A sequential pattern is maximal when it is not included in another frequent sequential pattern. A frequent sequential pattern may represent, for instance, a frequent behavior of a set of customers on a P2P system.