Article Preview
Top1. Introduction
Standard data mining algorithms have been relying on the implicit assumption that data are IID (independent and identically distributed). They consider datasets as collections of independent instances of single relations. While this restriction appears to be consistent with the classical statistical inference problem, it ignores inherent dependencies and correlations involved in numerous real-world phenomena that frequently emerge from complex interactions between entities. For instance in traditional data mining processes, it is common to search for purchase models of individuals by focusing on their own attributes (age, city, school, center of interest, etc). Nevertheless, it is obvious that social relationships between individuals, such as friendship or professional relationships, maintain their environment and determine their behavior and decisions.
The last decade social links have become the subject of a novel active research area, the “Science of Network” (Barabasi, 2002, Watts, 2004, Borner et al., 2007), a new way of studying relationships maintained between entities with new techniques regarding older traditional approaches followed in sociology and discrete mathematics. While previous works provided most popular results among which Milgram's experiment (Milgram, 1967) on small world phenomenon or Bott's work (Bott, 1957) on urban families, a great emphasis has been put on network structures and gave important findings. Network structures have been analysed according to recently defined measures such as the degree, density, diameter, clustering coefficient or shortest-path, etc.
The so-called and very novel social network mining area, also called link mining, (Getoor & Diehl, 2005) has attempted to apply the concepts of data mining on networks. Standard data mining techniques (classification, prediction, clustering, frequent pattern discovery), are used to explore the topological structure of the network without considering the nodes attributes. For instance, this is the case of methods for extracting frequent patterns from social networks, which are mostly approaches that only exploit the topological structure to extract sub-networks found frequently in a set of networks or a single much larger network.
However, both sources of information (structure and nodes features) seem to be relevant to take full advantage of the knowledge hidden into the network. While the links describe the nature of the relationships between individuals into the network, the attributes may provide relevant information on their role, position or influence. In the case of the spread of diseases for example, although the social link between two individuals is the main vector of the transmission process, properties such as the age or the origins, can influence quite significantly the probability that an individual contracts the disease, and therefore can influence the whole diffusion process.
Given the increasing number of available data sets, in which, in addition to social links, some information on nodes is also available, recent methods have integrated the complementary dimension. In this field, new algorithms or adaptations of existing algorithms have been proposed in visualization (Snasel et al., 2009), classification (Kuznetsov & Ignatov, 2009) or identification of communities (Gaume et al., 2010).
In this paper, we address the problem of the search for frequent patterns in social networks and we focus particularly on a new approach that combines both the network structure and the attributes of nodes for discovering regular patterns among the links that connect nodes with common characteristics. Such patterns are called “frequent links” and provide knowledge on the groups of node the most connected in the network.
For instance let us assume a social network built on mobile phone contacts, i.e in which two individuals (nodes) are connected if we observe at least one call from one of them to the other one. Each individual is characterized by its home country, the age and the professional status. A frequent link in this kind of network would provide for instance the following knowledge: “20% of calls are between European business men and Chinese people”.