Querying Google Books Ngram Viewer's Big Data Text Corpuses to Complement Research

Shalin Hai-Jew (Kansas State University, USA)
Copyright: © 2015 | Pages: 42
DOI: 10.4018/978-1-4666-6493-7.ch020

Abstract

If qualitative and mixed methods researchers have a tradition of gleaning information from all possible sources, they may well find the Google Books Ngram Viewer and its repository of tens of millions of digitized books yet another promising data stream. This free cloud service enables easy access to big data by querying the word frequency counts of a range of terms and numerical sequences (across several languages) from 1500 to 2000, a 500-year span of book publishing, with new books added continually. The questions that may be answered with this tool would be virtually unanswerable otherwise. The word frequency counts provide a lagging indicator of both instances and trends related to language usage, cultural phenomena, popularity, technological innovations, and a wide range of other insights. The text corpuses contain de-contextualized words used by the educated literati of the day sharing their knowledge in formalized texts. The capabilities of the Google Books Ngram Viewer provide complementary information sourcing for designed research questions as well as free-form discovery. The tool allows downloading of the “shadowed” (masked or de-identified) extracted data for further analyses and visualizations. This chapter provides both a basic and an advanced look at how to extract information from the Google Books Ngram Viewer for light research.
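As a concrete illustration of the kind of query the chapter describes, the sketch below pulls yearly relative frequencies for a few terms from the JSON endpoint used behind the Ngram Viewer's web interface. This is a minimal sketch, not an official or documented API: the endpoint URL, the parameter names (content, year_start, year_end, corpus, smoothing), the corpus label, and the shape of the returned JSON are assumptions based on the public web tool and may change without notice.

# Minimal sketch (assumed, unofficial endpoint): query the JSON service behind
# the Google Books Ngram Viewer and print yearly relative frequencies.
import requests

NGRAM_URL = "https://books.google.com/ngrams/json"  # assumed endpoint

def ngram_frequencies(phrases, year_start=1800, year_end=2000,
                      corpus="en-2019", smoothing=0):
    """Return {ngram: [relative frequency per year]} for the given phrases."""
    params = {
        "content": ",".join(phrases),   # comma-separated query terms
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,               # assumed corpus label
        "smoothing": smoothing,         # 0 = raw yearly values
    }
    response = requests.get(NGRAM_URL, params=params, timeout=30)
    response.raise_for_status()
    # Assumed response shape: a list of {"ngram": ..., "timeseries": [...]}
    return {item["ngram"]: item["timeseries"] for item in response.json()}

if __name__ == "__main__":
    series = ngram_frequencies(["telegraph", "telephone"], 1850, 1950)
    for term, values in series.items():
        print(term, values[:5], "...")  # first few yearly relative frequencies

Because the underlying counts are a lagging indicator tied to publication dates, results from such a pull are best treated as exploratory input alongside other sources.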
Chapter Preview

Introduction

People have arrived in the era of Big Data, with massive amounts of digital information available for research and learning. “Big data” has been defined as having an N of all, in which every available data point for a data set can be captured and analyzed; another definition treats “big data” as datasets of millions to billions of records analyzed through high-performance computing. This recent development in modern life has come about due to a variety of factors. One involves the proliferation of electronic information from the Web and Internet (such as through websites, social media platforms, and the hidden web), mobile devices, wearable cameras, and devices such as Google Glass. Much of what was analog is now digitized (transcoded into digital format) and datafied (turned into machine-readable and machine-analyzable information). This move to mass digitization is part of a global cultural paradigm of cyber-ization. The Open Data movement has swept through governments and even commercial entities, which see the “democratization of knowledge” as part of human and consumer rights. The movements for the “Internet of Things” and the “quantified self” will only further add to the outpouring and availability of digital information. Then, too, there have been advancements in cloud computing, which enable easier analysis of large datasets. Cheap data storage has made it possible to keep data in practical perpetuity.

Going to big data is not without controversy, particularly among researchers. In the Petabyte Age of big data, the “scientific method” and research-based hypothesizing may be irrelevant, suggests one leading thinker. With so many streams of electronic information from online social networks, sensors, digital exhaust (from commercial data streams), and communications channels, many assert that anything people might want to know may be ascertainable by querying various databases. In a controversial article, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (2008), Chris Anderson wrote: “Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” The tight controls for random sampling and statistical validity in a “small data” world no longer apply in a world of big data (Mayer-Schönberger & Cukier, 2013). The authors describe a tradeoff: big data entails less sampling error but requires accepting more measurement error. In a dataset with N = all, the sheer mass of data changes the rules, even as big data may surface spurious correlations where no real relationship exists because of the increased “noise” in the data. Such large datasets may enable unexpected discovery, so researchers may not have to start with a hypothesis, a question, or a hunch; it may be enough for them to peruse the data to see what they can find through big data “prospecting” (Gibbs & Cohen, 2011). Mayer-Schönberger and Cukier suggest that the analysis of big data may be a necessary skill set for many in certain domains of research. In particular, big data enables researchers not only to achieve a high-level view of information but also to probe in depth to a very granular level (down to an individual record, even when that record is one among tens or hundreds of millions). In the vernacular, researchers are able to zoom in and zoom out of the data. Big data may not inform researchers directly about causation (because there is no obvious way to mathematically show causality); acquiring that level of understanding may require considerably more analytical and inferential work and additional research.
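The caution about spurious correlations can be made concrete with a small simulation: when many unrelated noise series are compared against a single target, a few will correlate noticeably purely by chance. The following sketch is illustrative only and not drawn from the chapter; the series length, variable count, and threshold are arbitrary choices.

# Illustrative sketch: with enough unrelated variables, some will correlate
# strongly with a target purely by chance ("noise" masquerading as signal).
import numpy as np

rng = np.random.default_rng(42)
n_years = 50          # length of each time series
n_variables = 10_000  # many candidate predictors, all pure noise

target = rng.normal(size=n_years)
candidates = rng.normal(size=(n_variables, n_years))

# Pearson correlation of each noise candidate with the target.
correlations = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])

print(f"max |r| among {n_variables} unrelated series: {np.abs(correlations).max():.2f}")
print(f"series with |r| > 0.4: {(np.abs(correlations) > 0.4).sum()}")

With 10,000 pure-noise series of length 50, a few dozen typically clear the 0.4 threshold, which is exactly the kind of “pattern” that demands further analytical and inferential work before any causal reading.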

Data researchers suggest that it is critical to know the data intimately in order to make optimal use of them. This includes basic information about how the data were acquired, where they came from, how they were managed, and what may be extrapolated and understood from them.
