Digital Archives and Data Science: Building Programs and Partnerships for Health Sciences Research

Kate Tasker, Rachel Taketa, Charles Macquarie, Ariel Deardorff
DOI: 10.4018/978-1-7998-9702-6.ch007

Abstract

This chapter describes work by the UCSF Industry Documents Library to develop resources, programs, and initiatives to support data science work with a diverse audience in the fields of health sciences, history of medicine, public health policy, and tobacco control. The Industry Documents Library (IDL) is a digital archive of over 15 million documents created by industries impacting public health, hosted by the University of California, San Francisco (UCSF) Library. The chapter describes the public health impact of industry documents research, highlights several examples of computational projects conducted by IDL scholars, outlines the IDL's developing plans for using data science techniques to assist with large-scale digital collection appraisal and metadata enhancement, and discusses how the IDL is expanding its collaborations with the UCSF Library's Data Science Initiative and Archives and Special Collections departments to further develop impactful data science programs across the university.
Chapter Preview

Introduction

When the UCSF Industry Documents Library first launched in 2002, it was one of the earliest digital libraries to offer access to its collections on the web. From the beginning, librarians, archivists, technologists, and the public health research community at the University of California, San Francisco (UCSF) grappled with how best to manage, index, search, and analyze millions of digitized documents which had been released from historic litigation against the tobacco industry. As the interrelated fields of information science, digital archives, and data science evolved, the Industry Documents Library (IDL) developed its own software tools and approaches to meet researchers’ “big data” needs, at a time when few out-of-the-box solutions existed and none fit the unique requirements of a project which combined elements of digital archives, law libraries, and corporate accountability. For two decades the IDL focused on stewarding its data, building custom open-source tools for search and access, and supporting new public health research methods for studying and responding to the spread of tobacco-related diseases. During this period the IDL collection grew from a few thousand pages to a few million pages, and now contains more than 94 million pages in over 15 million documents. As the volume of available material continues to increase – the IDL anticipates adding more than 5 million new documents in the next year – the need to build partnerships and design collaborative strategies for managing data at scale is more imperative than ever. The expanding data science landscape now provides significant opportunities for the IDL to move beyond custom isolated solutions, and to embrace an approach which engages with an advancing library and data science knowledge base, innovative new ideas developed by other libraries and archives, and an expanding network of potential collaborators and partners.

This chapter begins with an overview of the UCSF Industry Documents Library, and highlights examples of how IDL scholars have applied data science methods to investigate specific questions about the tobacco industry’s impact on public health. It describes the application of emerging computational tools for digital archival appraisal and description, and the potential for these tools to dramatically improve the IDL’s capacity for curating its data. The chapter then describes the IDL’s current and potential collaboration opportunities with the UCSF Library’s Data Science Initiative (DSI), which serves as a campus hub for education and support in data science. This is followed by a description of another UCSF archival initiative which serves as a model for the IDL’s data science engagement efforts: the UCSF Archives and Special Collections project “No More Silence: Opening the Data of the HIV/AIDS Epidemic using Natural Language Processing Techniques.” This project offers rich instruction in how to prepare data for computational analysis, build a data science research community, develop training workshops, and connect with other campus partners. The chapter concludes with recommendations identified by the IDL, UCSF partners, and others in the library and archives profession, on how to encourage staff development and skill building, build cross-campus partnerships, provide opportunities for mutual learning through data science student internships, integrate data science tools into existing workflows, and plan for the long-term sustainability of data science activities.

Key Terms in this Chapter

Natural Language Processing (NLP): A form of artificial intelligence focused on designing and using computer software, systems, and code that allow a computer to process and “understand” text and spoken words in a way that approximates how humans do.

Discovery Documents: In the United States, parties in a lawsuit may request documents and other information from the opposing party which may be relevant to the case. This pre-trial stage of litigation is called discovery. Documents produced in discovery may be used as evidence if the case goes to a trial.

Solr Server: An open-source full-text search platform developed by the Apache Software Foundation. Its functions include highlighting search terms in results, faceted searching, real-time indexing, database integration, and handling of Word and PDF documents.
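
As an illustration of the faceting and highlighting functions named above, the following minimal sketch builds a Solr query URL using Solr's standard request parameters (`q`, `rows`, `facet`, `facet.field`, `hl`, `hl.fl`, `wt`). The host, the core name `documents`, and the field names `industry` and `ocr_text` are hypothetical placeholders, not details of the IDL's actual configuration. The sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, facet_field, highlight_field, rows=10):
    """Build a Solr /select URL that requests facets and highlighting."""
    params = [
        ("q", text),                  # full-text query string
        ("rows", rows),               # number of results to return
        ("facet", "true"),            # turn on faceted search
        ("facet.field", facet_field), # field whose values are counted as facets
        ("hl", "true"),               # turn on search-term highlighting
        ("hl.fl", highlight_field),   # field in which matches are highlighted
        ("wt", "json"),               # ask for a JSON response
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core and field names, for illustration only.
url = build_solr_query(
    "http://localhost:8983/solr/documents",
    "nicotine addiction",
    facet_field="industry",
    highlight_field="ocr_text",
)
print(url)
```

A researcher could paste a URL of this shape into a browser or an HTTP client to retrieve matching documents together with facet counts and highlighted snippets, assuming a server with a comparable schema.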

Application Programming Interface (API): A type of software interface which enables the exchange of data between systems. APIs generally follow a set of standards documented in an API specification, which provides a user with instructions on how to build or use the API connection. APIs are frequently used with web applications to share large amounts of data for public use.

Optical Character Recognition (OCR): The electronic conversion of an image of text into machine-readable text. OCR text is generated from scanned documents using software. The text of a document can then be used for text mining or other computational analysis, or to provide full-text search capability.

Library Carpentry: A non-profit community organization which helps to build software and data skills in a library and information science context and to empower people to use software and data in their work.

Master Settlement Agreement (MSA): A 1998 legal agreement entered into by the four major tobacco companies and 46 U.S. States, which resolved the States’ lawsuits against the tobacco industry for recovery of tobacco-related health-care costs. The MSA required the companies to end certain marketing practices, to pay over $206 billion to the States, and to make their internal documents available to the public.

Topic Modeling: A text mining technique used to analyze words in a document (or collection of documents) to discover frequently used terms and to group them into clusters. These clusters can provide insight into the topics of a document or collection, allowing a user to better understand its content without needing to read through thousands or millions of pages.
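
The first half of this definition, discovering frequently used terms, can be sketched with a simple term-frequency count. This is not a topic model: grouping terms into clusters requires a statistical method, which is omitted here. The sample texts, the short stopword list, and the function name `top_terms` are all made up for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; real projects use much longer ones.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on"}


def top_terms(documents, n=3):
    """Return the n most frequent non-stopword terms across all documents."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and keep runs of letters as tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


# Invented sample texts standing in for OCR output from scanned documents.
sample_docs = [
    "Nicotine research memo on filter design",
    "Marketing plan for filter cigarettes",
    "Memo on nicotine delivery and filter performance",
]
print(top_terms(sample_docs))
```

A topic model goes one step further by looking at which terms tend to appear together in the same documents, so that related terms end up in the same cluster.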
