Massive Digital Libraries (MDLs) and the Impact of Mass-Digitized Book Collections

Andrew Philip Weiss (California State University, Northridge, USA)
Copyright: © 2021 | Pages: 14
DOI: 10.4018/978-1-7998-3479-3.ch123

Abstract

This chapter describes the characteristics of massive digital libraries (MDLs) and outlines their impact upon current information science issues, especially digital collection metadata, copyright and fair use, the diversity of source collections, and user privacy. MDLs rival physical libraries’ print holdings in size, breadth, and depth, often approaching a scale previously found only among library consortia or national libraries. The concept further intersects the digital library with the wider development of ‘big data.’ Examples include Google Books, HathiTrust, Internet Archive, Digital Public Library of America (DPLA), California Digital Library, Texas Digital Library, Gallica, and Europeana.

Introduction

To better analyze the growth of digital libraries, Weiss and James have proposed adopting the term Massive Digital Libraries (MDLs), a concept based on the increased size, scope and scalability of mass-digitized book collections. MDLs rival physical libraries’ print holdings in size, breadth, and depth, often approaching a scale previously found only among library consortia or national libraries. (Weiss and James, 2013a, 2013b, 2014, 2015; Weiss, 2016) The concept further intersects the digital library with the wider development of ‘big data,’ which is driven to an ever-larger scale by the ‘5Vs’ of Volume, Variety, Velocity, Variability, and Veracity. (Weiss, 2018)

The concept has its roots in late 2004, when Google publicly announced its intention to digitize the world’s books, including works still under copyright protection, and to place them all (roughly 130 million) online. Jean-Noël Jeanneney, head of the Bibliothèque nationale de France at the time, interpreted Google’s planned project as a wake-up call for European countries. Failure to catch up to the American company, he argued, would result in significant problems for non-American organizations, especially if they were unable to check the company’s outsized influence. (Jeanneney, 2005)

Fifteen years on, it is hard to imagine that Google’s desire to create an online digital library should have come as such a shock. Yet Google spurred significant hand-wringing and soul-searching among institutions traditionally charged with producing or preserving cultural artifacts. (Jeanneney; Venkatraman, 2009) In retrospect, the controversy seems quaint in comparison to today’s issues, especially the “disruptions” of established economic models by Uber/Lyft, Facebook, Twitter, Spotify, and the like; the looming unknowns surrounding artificial intelligence; and the encroachments on civil rights via electronic surveillance and other intrusions on privacy.

A number of similar mass-digitization projects have developed and matured since Google’s announcement, including the HathiTrust, Internet Archive, Digital Public Library of America (DPLA), California Digital Library, Texas Digital Library, Gallica, and Europeana. Each of these projects has transcended its roots as a localized digital library while simultaneously adapting to and altering the digital landscape. These various MDLs have allowed for and contributed to the ascendancy of our current mass-digitization online culture.

This chapter describes the characteristics of Massive Digital Libraries (MDLs) and outlines their impact upon current information science issues, especially with regard to digital collection metadata, copyright, the diversity of source collections, and user privacy in an age of so-called “surveillance capitalism.” (Zuboff, 2015) Traditionally, libraries have been created to serve particular communities defined as well as bound by geography, intellectual disciplines, or specific end users. MDLs in their current trajectories, however, promise to transcend such limits in ways that are simultaneously constructive and destructive.

Key Terms in this Chapter

Big Data: The large-scale aggregation of data from software and wired devices, used for the purposes of analyzing systems or user behavior; it is characterized by its large volume, wide variety, velocity (or speed of transfer), its increased variability, and undetermined veracity.

Digital Humanities (DH): A branch of the humanities incorporating digital search, digitized texts, encoding, and other strategies to bring previously print-bound materials and scholarship into computer-science-related analysis.

Mass-Digitization: The practice of quickly and thoroughly digitizing items on a large scale. In the case of MDLs these efforts involve scanning millions of books with multiple institutional partners across national boundaries.

Surveillance Capitalism: A term coined by Shoshana Zuboff to describe the practice of monetizing data that companies derive from tracking online users of various services and devices.

Digital Corpus: The full text of millions of books digitized by the MDLs. These provide opportunities for scholars to examine the frequency of terms as they appear in the print corpus.

Massive Digital Libraries (MDLs): Term adopted to describe the mass-digitization of printed books and the mass-aggregation of their metadata into online, full-text-searchable digital collections; some component of open access and use of public domain works defines their collections; diffuse and diverse target end-user groups are also characteristic of MDLs.

Fair Use: Described by some as the breathing room that allows for freedom of expression, it is an important limitation on the reach of copyright law. It is determined by examining four factors: 1) the purpose and character of the use; 2) the nature of the copyrighted work; 3) the amount and substantiality of the portion taken; and 4) the effect of the use upon the potential market.

Ngram Viewer: A tool developed by Google to aid in the visualization of the digital corpus of books. It plots frequencies of terms across a graph, helping users understand how common a word or concept was in the corpus at a specific moment in history.

Public Domain: Works that are no longer under copyright protection. In the United States these works tend to be those published before January 1, 1924 and unpublished works created prior to 1899. U.S. Government publications are in the public domain.
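
As a rough illustration of the Digital Corpus and Ngram Viewer entries above, the following minimal Python sketch computes how common a single term is in each publication year of a corpus. It is not Google's implementation: the function name, the simple tokenizer, and the three-sentence toy corpus are all assumptions made for illustration only.

```python
from collections import Counter
import re


def yearly_relative_frequency(corpus, term):
    """Return {year: occurrences of term / total tokens in that year}.

    `corpus` is an iterable of (year, text) pairs. This is a hypothetical
    stand-in for a digitized book corpus, not any MDL's actual data format.
    """
    term = term.lower()
    hits = Counter()    # occurrences of the term, per year
    totals = Counter()  # all tokens seen, per year
    for year, text in corpus:
        # Naive tokenizer (assumption): lowercase runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    # Skip years with no tokens to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical toy corpus: (publication year, text) pairs.
toy_corpus = [
    (1900, "the library holds the printed book"),
    (1900, "a book is a book"),
    (2000, "the digital library holds the scanned book"),
]

freqs = yearly_relative_frequency(toy_corpus, "book")
for year, share in freqs.items():
    print(year, round(share, 3))
```

Plotting these per-year shares against the year would give the kind of frequency curve the Ngram Viewer entry describes, although a real corpus would also require handling of OCR errors and multi-word phrases.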
