Applying Machine Learning to Online Data?: Beware! Computational Social Science Requires Care

Ulya Bayram
DOI: 10.4018/978-1-7998-8553-5.ch005


The immense impact of social media on contemporary cultural evolution is undeniable, which makes these platforms an essential data source for computational social science studies. Alongside advances in natural language processing and machine learning, computational social science researchers continuously adapt new techniques to data collected from social media. Although these developments are imperative for studying sociological transformations in many communities, they bring some inconspicuous problems. This chapter addresses issues that may arise from the use of social media data, such as biased models. It also discusses various obstacles associated with machine learning methods and provides possible solutions and recommendations for overcoming them from an interdisciplinary perspective. The chapter aims to guide computational social science researchers in their future studies, from pitfalls in data collection to assembling a sound experimental design.
Chapter Preview


In the current digital age, the availability of online data is growing rapidly. Consequently, interest is rising in computational social science (CSS) research, which spans many disciplines (social psychology, anthropology, economics, political science, and sociology) and various levels of analysis (Oboler, Welsh, & Cruz, 2012). Studies of past and recent legislative records from many countries, including the Congressional records of the United States (Bayram, Pestian, Santel, & Minai, 2019; Diermeier, Godbout, Yu, & Kaufmann, 2012; Gentzkow, Shapiro, & Taddy, 2016; Iyyer, Enns, Boyd-Graber, & Resnik, 2014; Jensen et al., 2012; Lauderdale & Herzog, 2016; Thomas, Pang, & Lee, 2006; Yu, Kaufmann, & Diermeier, 2008), political records from the British Parliament (Peterson & Spirling, 2018), and the Irish Dáil debates (Lauderdale & Herzog, 2016), are among the many studies made possible by the online availability of these collections. Similarly, old and new newspaper articles (Burley et al., 2020; Neresini, 2017), speeches of political figures, including those from past centuries (Jackson, Watts, List, Drabble, & Lindquist, 2021; Savoy, 2010), digital books (Brooke, Hammond, & Hirst, 2015), and many other online textual data have been a major focus of CSS research. While all these online data sources contribute richly to the broad range of CSS research areas, one domain requires special attention: social media data.

Data collected from social media platforms open up many possibilities, such as answering serious social science questions and finding insights into both individual-level and anthropological phenomena (Harford, 2014; Lazer et al., 2009; Olteanu, Castillo, Diaz, & Kiciman, 2019). There is also a growing consensus that data collected from these domains can provide more than simple observations (Olteanu et al., 2019); social media domains are among the most valuable data sources for CSS research. The wide public use of social media, the absence of data use restrictions, the simplicity of data acquisition through application programming interfaces (APIs), and the valuable content of the data make them attractive to researchers. These social media and social network platforms (e.g., Twitter, Facebook, Reddit, Wikipedia, and other forums) can readily capture the evolution of sociological norms and rapid cultural and global changes. For example, recent movements such as “Me Too” and “Black Lives Matter” could not have expanded globally without these platforms, which makes them one of the principal sources of information for CSS research on such events. Recent studies use social media data to evaluate the effects of crises such as the mass killings in the United States (Burley et al., 2020).

Key Terms in this Chapter

Noise: In the context of machine learning, noise refers to data or features that contain no meaningful patterns related to the problem of interest and that can disrupt and harm the learning process.

Underfitting: Occurs when a machine learning model does not properly learn the patterns present in the training set, for reasons such as incorrect parameter selection or, in the case of neural networks, too few training epochs. A machine learning model that suffers from underfitting fails to return acceptable results in both within-dataset experiments and generalization experiments.

Overfitting: Occurs when a machine learning model memorizes the training set instead of learning the patterns within it that allow accurate generalization. An overfit model can return high within-dataset performance while failing to generalize to other data.
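
The contrast between underfitting and overfitting can be illustrated with a small sketch. It is not taken from the chapter: the synthetic sine data, the noise level, and the polynomial degrees are all assumptions chosen for illustration. A constant model (degree 0) is too simple to capture the pattern, while a degree-9 polynomial fitted to 10 points can pass through every training point and still perform poorly on unseen data.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical synthetic data: a noisy sine wave (an assumption for illustration).
rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(10)    # small training set
x_test, y_test = make_data(200)     # unseen data used to check generalization

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (0, 3, 9):
    # Least-squares polynomial fit of the given degree on the training set only.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4g}, "
          f"test MSE={results[degree][1]:.4g}")
```

In this setup, a high training error at degree 0 is the signature of underfitting, and a large gap between training and test error at degree 9 is the signature of overfitting. Comparing both errors, rather than the training error alone, is what distinguishes the two cases.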

Bias: Prejudice towards or against a person, a group, or a class. In a machine learning context, there are various types of biases. Each bias can affect a model differently.

Language Generation: The process of automatically generating text with artificial intelligence models. The goal is to produce text that appears to have been written by humans.

Tuning: The set of operations applied to machine learning algorithms to improve classification and prediction performance. Some tuning procedures happen during the learning process, while a model can also be tuned after the initial training run.
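
One common form of tuning after an initial training run is to choose a hyperparameter using a held-out validation set. The sketch below is an illustrative assumption rather than the chapter's procedure: it uses synthetic data, a closed-form ridge regression, and a hypothetical list of candidate regularization strengths. The test split is touched only once, after the choice has been made.

```python
import numpy as np

# Hypothetical synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(1)
n_samples, n_features = 120, 8
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.5, size=n_samples)

# Three disjoint splits: train for fitting, validation for tuning, test for reporting.
X_train, y_train = X[:60], y[:60]
X_val, y_val = X[60:90], y[60:90]
X_test, y_test = X[90:], y[90:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Candidate regularization strengths (hypothetical values).
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {a: mse(fit_ridge(X_train, y_train, a), X_val, y_val)
              for a in candidates}

# Tuning step: keep the candidate with the lowest validation error.
best_alpha = min(val_errors, key=val_errors.get)
final_w = fit_ridge(X_train, y_train, best_alpha)
test_error = mse(final_w, X_test, y_test)
print(f"best alpha={best_alpha}, validation MSE={val_errors[best_alpha]:.4g}, "
      f"test MSE={test_error:.4g}")
```

Keeping the test split out of the selection loop matters: if the same data were used both to pick the hyperparameter and to report performance, the reported score would tend to be optimistic.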

Interdisciplinary Research: It is a type of research that employs knowledge, data, techniques, and theories from multiple disciplines. The main goal is to analyze or solve a specific problem of interest using the strengths of these different disciplines.

Machine Learning: A subfield of artificial intelligence where models can learn patterns from the data in a supervised or unsupervised fashion and tune themselves during the learning process.

Deep Learning: A subfield of machine learning that works with artificial neural networks containing many hidden layers and complex structures.
