Issues and Methods for Access, Storage, and Analysis of Data From Online Social Communities

Christopher John Quinn (Purdue University, USA), Matthew James Quinn (Williams College, USA), Alan Olinsky (Bryant University, USA) and John Thomas Quinn (Bryant University, USA)
DOI: 10.4018/978-1-5225-3142-5.ch015

Abstract

This chapter provides an overview of several important issues in studying user interactions in an online social network. It details the social network analysis approach along with basic concepts for network models. It describes ways of indicating influence within a network through measures such as degree centrality, betweenness centrality, and closeness centrality. It also covers network structure, as represented by cliques and components, and measures of connectedness defined by clustering and reciprocity. Given the large volume of data associated with social networks, the significance of data storage and sampling is discussed. Because verbal communication is significant within networks, textual analysis is reviewed with respect to classification techniques such as sentiment analysis and to topic modeling, specifically latent semantic analysis, probabilistic latent semantic analysis, latent Dirichlet allocation, and alternatives. Information diffusion, another important area, is also covered in detail.

Introduction

Online social networks have become increasingly popular as a medium for information exchange and dialog among diverse communities. Twitter, a microblogging platform, has grown rapidly and currently processes hundreds of millions of new messages each day, with a similar number of active users. Given the volume of activity and the availability of data, a large body of computational social science research investigates social networks and interactions on Twitter.

Some questions aim to characterize the structural properties of the social graph, or “who follows whom.” The follower model employed by Twitter facilitates celebrity-fan communities. For instance, as of June 2014, U.S. President Barack Obama and musician Lady Gaga each have over forty million followers yet follow fewer than one million accounts. Professional basketball player LeBron James has over twelve million followers and follows about 250 accounts. However, the number of followers is only one statistic of the network. There is also interest in exploring how the connections are structured. For instance, do the users who follow LeBron James follow each other, or at least follow some other player in common? Could they be considered a community, or are they too separated? Are the followers of Barack Obama or Lady Gaga better interconnected? These questions matter not only for understanding the common interests of users but also for understanding how easily information might be shared among them.
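As a rough illustration of the question of whether an account's followers also follow each other, the following minimal sketch computes two simple quantities on a tiny directed "who follows whom" graph. The user names, the `follows` data, and the helper functions are all hypothetical and chosen only for illustration; they do not come from Twitter data or from the chapter.

```python
from itertools import permutations

# Hypothetical toy data: each user maps to the set of accounts they follow.
follows = {
    "ann": {"star", "bob"},
    "bob": {"star", "ann"},
    "cat": {"star"},
    "dan": {"star", "cat"},
    "star": set(),
}


def followers_of(target, follows):
    """Return the set of users who follow `target`."""
    return {user for user, followed in follows.items() if target in followed}


def follower_density(target, follows):
    """Fraction of ordered pairs of `target`'s followers joined by a follow edge."""
    fans = followers_of(target, follows)
    pairs = list(permutations(fans, 2))
    if not pairs:
        return 0.0
    linked = sum(1 for u, v in pairs if v in follows.get(u, set()))
    return linked / len(pairs)


def reciprocity(follows):
    """Fraction of follow edges that are returned by the followed account."""
    edges = [(u, v) for u, followed in follows.items() for v in followed]
    if not edges:
        return 0.0
    mutual = sum(1 for u, v in edges if u in follows.get(v, set()))
    return mutual / len(edges)


density = follower_density("star", follows)
mutual_share = reciprocity(follows)
```

A density near zero would suggest that the followers are largely separated from one another, while a value near one would suggest a tightly knit community. On real data the follower sets are far too large to enumerate all pairs, which is one reason sampling matters.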

A significant question is “who is influential” in the social graph. With online social networks, influence is most easily measured by tangible phenomena such as reposting content deemed of interest (such as links to a video or article) or adopting a common hashtag. While the number of followers is an important statistic, research has shown that follower count alone does not imply influence. The network topology needs to be taken into account, as do the followers themselves: some users are more likely to spread content than others.

There has also been significant research exploring how users communicate. Microblogging platforms such as Twitter in particular promote novel linguistic features due to brevity constraints (each message is limited to 140 characters). Abbreviations, acronyms, hashtags (such as “#fail”), and excessive punctuation are all common. This presents challenges for automatic text processing, both for supervised classification tasks, such as identifying the gender, age, or political orientation of a user, and for unsupervised learning of the topics in a set of messages. Since the purpose of Twitter is to enable users to communicate, it is important for researchers to develop methods to process and analyze that communication.

This chapter begins with a review of important techniques in social graph analysis, followed by a discussion of data access and sampling, textual analysis, and finally information diffusion. Software programs that assist in processing social network data are referenced in an appendix.

Key Terms in this Chapter

Social Network Analysis: The study of social interactions from a network perspective, using tools from graph theory (Newman, 2010).

Sentiment Analysis: A branch of textual analysis that investigates the emotional state of a text, such as happy versus sad, based on linguistic features (Pang and Lee, 2008).

Centrality Measures: Quantitative measures aimed at assessing each node's importance in the network based on the network topology (Newman, 2010).

Homophily: A phenomenon in which users with similar backgrounds and interests associate with each other (Newman, 2010).

Clique: A network in which each node is directly connected with every other node (Newman, 2010).

Textual Analysis: Natural language processing and machine learning techniques applied to investigate linguistic features of user interactions in a social network (Roberts, 1997).

Diffusion: The phenomenon of something spreading out spatially; in the context of social networks, a picture, a website URL, or an expression can spread from user to user, analogous to a contagious disease (Newman, 2010).

Influence Maximization: The algorithmic challenge of finding a set of users in the network who would be the best set to spread certain content (Kempe, Kleinberg & Tardos, 2003).

Latent Dirichlet Allocation (LDA): A machine learning technique used for document clustering to infer topics (Blei, Ng & Jordan, 2003).
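
To make the centrality and clique terms above more concrete, here is a minimal sketch on a small undirected graph. It assumes commonly used normalized definitions (degree divided by n - 1, and closeness as n - 1 divided by the sum of shortest-path distances) and a connected graph; the graph itself and all function names are hypothetical illustrations, not the chapter's own method.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph stored as an adjacency dictionary.
graph = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c"},
}


def degree_centrality(graph, node):
    """Number of neighbors divided by the maximum possible, n - 1."""
    return len(graph[node]) / (len(graph) - 1)


def bfs_distances(graph, source):
    """Shortest-path (hop) distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness_centrality(graph, node):
    """(n - 1) divided by the total distance to all other nodes.

    Assumes the graph is connected, so every node is reachable.
    """
    total = sum(bfs_distances(graph, node).values())
    return (len(graph) - 1) / total if total else 0.0


def is_clique(graph, nodes):
    """True if every pair of the given nodes is directly connected."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))


degree_b = degree_centrality(graph, "b")
closeness_a = closeness_centrality(graph, "a")
triangle = is_clique(graph, ["b", "c", "d"])
```

In this toy graph, node "b" is adjacent to every other node, so it scores highest on both measures, and nodes "b", "c", and "d" form a clique. Betweenness centrality is omitted here because it requires counting shortest paths through each node, which takes more code than a short sketch allows.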
