Social networks provide a way to anticipate, build, and make use of links. They represent relationships between pairs of entities, and the propagation of phenomena along those relationships, in a form that extends to large-scale dynamical systems. In its most general form, a social network can capture individuals, communities, or other organizations, together with the propagation of anything from information (documents, memes, rumors) to infectious pathogens. This representation facilitates the study of patterns in the formation, persistence, evolution, and decay of relationships, which itself forms a type of dynamical system, and it also supports modeling the temporal dynamics of events that propagate across a network.
In this first section, we survey the goals of predictive analytics using a social network, outline the specific tasks that motivate the use of graph-based models of social networks, and discuss the general state of the field in data science as applied to prediction.
1.1 Overview: Goals of Prediction
In general, time series prediction aims to generate estimates for variables of interest that are associated with future states of some domain. These variables frequently represent a continuation of the input data, modeled under some assumptions about how the future data are distributed as a function of the history of past input, plus exogenous factors such as noise. The term forecasting refers to this specific type of predictive task (Gershenfeld & Weigend, 1994). Acquiring the information to support this operation is known as modeling and frequently involves the application of machine learning and statistical inference. A further goal of the analytical process that informs this model is understanding how a generative process changes over time; in some scenarios, this means estimating high-level parameters or, especially, structural elements of the time series model.
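As a minimal sketch of forecasting in this sense, the following assumes one particular model that the text above does not prescribe: a first-order linear autoregression, in which the next value is approximated as a * x[t] + b plus noise. The function names `fit_ar1` and `forecast` and the sample series are hypothetical, chosen only for illustration.

```python
def fit_ar1(series):
    """Least-squares estimate of (a, b) in x[t+1] ~ a * x[t] + b.

    Assumption: a first-order linear autoregressive model; this is only
    one of many possible assumptions about how future data depend on
    the history of past input.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # Constant history: fall back to predicting the mean of the targets.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def forecast(series, steps):
    """Continue the series for `steps` points by iterating the fitted model."""
    a, b = fit_ar1(series)
    last = series[-1]
    predictions = []
    for _ in range(steps):
        last = a * last + b
        predictions.append(last)
    return predictions


if __name__ == "__main__":
    history = [1, 2, 4, 8, 16]  # hypothetical observations
    print(fit_ar1(history))
    print(forecast(history, steps=2))
```

Here the modeling step is the call to `fit_ar1`, and the forecasting step is the iteration in `forecast`. Estimating structural elements, such as the order of the model, would be a separate step that this sketch does not attempt.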
Getoor (2003) introduces the term link mining to describe a specialized form of data mining: analyzing a network structure to discover novel, useful, and comprehensible relationships that are often latent, i.e., not explicitly described. Prototypical link mining tasks, as typified by the three domains that Getoor surveys, include modeling collections of web pages, bibliographies, and the spread of diseases. Each member of such a collection represents one entity. In the case of web page networks, links can be outlinks directed from a member page to another page, inlinks directed from another page to a member page, or co-citation links, each indicating that some third page contains outlinks to both of the link's endpoints. Bibliography or citation networks model paper-to-paper citations, co-author sets, author-to-institution links, and paper-to-publication relationships. Epidemiological domains are often represented using contact networks, in which nodes represent individual organisms (especially humans or other animals) and links represent habitual or incidental contact. Spread models extend this graphical representation by adding information about incubation and other rates, as well as time-dependent events.
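The three kinds of web page link described above can be made concrete with a small sketch. It assumes that the collection is given as a dictionary mapping each page to the set of pages it links to; the function names `inlinks` and `cocitation_pairs` and the page names are hypothetical and do not come from Getoor (2003).

```python
from collections import defaultdict
from itertools import combinations


def inlinks(outlinks):
    """Invert an outlink map: for each page, the set of pages linking to it."""
    incoming = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            incoming[target].add(source)
    return dict(incoming)


def cocitation_pairs(outlinks):
    """Map each unordered page pair (u, v) to the pages that link to both.

    A co-citation link between u and v exists when at least one page
    contains outlinks to both u and v.
    """
    pairs = defaultdict(set)
    for source, targets in outlinks.items():
        # Sorting gives every pair a single canonical orientation.
        for u, v in combinations(sorted(targets), 2):
            pairs[(u, v)].add(source)
    return dict(pairs)


if __name__ == "__main__":
    # Hypothetical collection of pages and their outlinks.
    web = {
        "a": {"b", "c"},
        "d": {"b", "c", "e"},
        "b": {"c"},
    }
    print(inlinks(web))
    print(cocitation_pairs(web))
```

Storing the citing pages with each pair, rather than only a flag, keeps a count available as a simple co-citation strength.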
Getoor and Diehl (2005) further survey the task of link mining, taxonomizing tasks into abstract categories such as object-based, link-based, and graph-based. Object-based tasks, often used in information retrieval and visualization, include ranking, classification, group detection (one instance of which is community detection), and identification (including disambiguation and deduplication). Link-based tasks, which we discuss in depth in this article, include the modeling task of link prediction – deducing or calculating the likelihood of a future link between two candidate entities, based on their individual attributes and mutual associations. Graph-based tasks include modeling tasks such as discovering subgraphs, as well as characterization or understanding tasks such as classifying an entire graph as a small-world network or as being governed by a random generative model – e.g., some type of Erdős–Rényi graph (Erdős & Rényi, 1960).
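To illustrate link prediction from mutual associations alone, the following sketch scores each pair of entities that is not yet linked by the Jaccard coefficient of their neighbor sets. This is a common baseline heuristic substituted here for illustration, not a method attributed to Getoor and Diehl (2005), and it ignores individual attributes entirely. The names `jaccard_score` and `rank_candidate_links` and the sample network are hypothetical; the graph is assumed to be undirected and stored as a dictionary of neighbor sets.

```python
def jaccard_score(adj, u, v):
    """Jaccard coefficient of the neighbor sets of u and v (0.0 if both are empty)."""
    neighbors_u = adj.get(u, set())
    neighbors_v = adj.get(v, set())
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


def rank_candidate_links(adj):
    """Score every currently unlinked pair, highest score first.

    Returns a list of (score, u, v) tuples with u < v; ties are broken
    alphabetically so that the ordering is deterministic.
    """
    nodes = sorted(adj)
    scored = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v in adj[u]:
                continue  # already linked, so not a candidate
            scored.append((jaccard_score(adj, u, v), u, v))
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored


if __name__ == "__main__":
    # Hypothetical undirected friendship network.
    friends = {
        "ann": {"bob", "cat"},
        "bob": {"ann", "cat", "dan"},
        "cat": {"ann", "bob"},
        "dan": {"bob"},
    }
    for score, u, v in rank_candidate_links(friends):
        print(f"{u} - {v}: {score:.2f}")
```

The score is a ranking signal rather than a calibrated likelihood; turning it into a probability of a future link would require fitting a model to observed link formation.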
Social media have proliferated and gained in user population, bandwidth consumed, and volume of content produced since the early 2000s. A brief history and broad survey of social network sites is given by boyd and Ellison (2007), documenting the different mechanisms by which online social identity is maintained and computer-mediated communication is practiced. Their article also introduces contemporary work on characterization and visualization of network structure, modeling offline and online social networks using a combined model, and preservation of privacy on social network sites (SNSs). Many of the modeling tools referenced in that survey admit direct application or extension to predictive analytics tasks for SNSs (Yu, Han, & Faloutsos, 2010).