Semi-Automating the Transformation of Chinese Historical Records Into Structured Biographical Data


Lik Hang Tsui (Harvard University, USA) and Hongsu Wang (Harvard University, USA)
DOI: 10.4018/978-1-5225-7195-7.ch011


This chapter explores and analyzes the new methods that the China Biographical Database (CBDB) project team has developed and adopted to digitize reference works about Chinese history, which is part of the important process of turning them into structured biographical data. This workflow focuses on the Tang Dynasty (618-907) and has implications for the continued improvement in the technologies for digitization and research into historical biographies in the Chinese language. These explorations and outcomes also demonstrate attempts in the Chinese studies field to transform large amounts of texts in non-Latin script into structured biographical data in a semi-automated fashion, and are expected to benefit digital humanities research, especially initiatives focusing on the Asia-Pacific region.

Introduction And Literature Review

The China Biographical Database Project

The approach studied in this chapter for transforming historical records into biographical data has been attested by a large-scale digital humanities project, one of the best known in the Chinese studies discipline. An effort to organize the numerous biographies in the Chinese historical record, CBDB is a relational database with biographical information about approximately 422,600 individuals (as of September 2018) that is meant to be useful for statistical, social network, spatial, and other kinds of analyses. The database has been jointly developed by Harvard University, Academia Sinica, and Peking University since 2005 (Harvard University, 2018), and the team members include humanities and computer experts mostly based in Cambridge, Massachusetts and Beijing. The project employs computational techniques to locate and encode biographical records in Chinese texts and makes such data available to researchers. The data can be accessed online and as a standalone offline database in Microsoft Access format. An Application Programming Interface (API) for system interoperability is also available (Xu, 2018). Collaborators of the project have also transformed the database entries into a file for the iOS built-in dictionary (Harvard University, 2018). CBDB project team members systematically work through texts of a given format to mine vast amounts of data about large numbers of historical persons, rather than researching each individual separately. Because CBDB records information about where people lived, where they studied, where they served in office, what offices they held, who their parents were, whom they married, whom they knew, and so on, all these aspects of life can be correlated for very large groups of people in history.

CBDB is by nature a relational database. In other words, in contrast to a single table into which all data is loaded, it consists of many different tables that are linked together, allowing categorization and coding of many different aspects of the life histories of people in China’s past. As opposed to an unstructured text-only database, a relational database for historical people helps the end-user discover and locate information about individual historical figures, and also analyze them systematically at an aggregate level. Through the wide range of data it collects, CBDB offers many ways to examine the lives of past individuals and groups. Its approach to generating and using data is transforming the historical research method of prosopography (Stone, 1972). As a collective biography, prosopography inquires into the common characteristics of a group of people. Through collecting data on phenomena that involve common aspects of people’s lives, it sheds light on questions about social groups in history and allows scholars to make much better sense of relationships between individuals and groups. Effective deployment of the prosopographical method supports discovery in ancient texts and the relationships that they record (Verboven et al., 2007), especially with computer-aided analysis. When used to populate a biographical database such as CBDB, the method provides scholars not only with larger and more comprehensive data, but also with data that is smarter (Zeng, 2017).
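The contrast between a single table and linked tables can be illustrated with a minimal sketch. It assumes a toy schema: the table names (`person`, `posting`, `kinship`), the column names, and all rows are invented placeholders for illustration and are not CBDB's actual schema or data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with three linked tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL
        );
        CREATE TABLE posting (
            person_id INTEGER REFERENCES person(person_id),
            office    TEXT NOT NULL,
            place     TEXT NOT NULL
        );
        CREATE TABLE kinship (
            person_id INTEGER REFERENCES person(person_id),
            kin_id    INTEGER REFERENCES person(person_id),
            relation  TEXT NOT NULL
        );
    """)
    # Placeholder rows only; none of these are real historical records.
    conn.executemany("INSERT INTO person VALUES (?, ?)",
                     [(1, "Person A"), (2, "Person B"), (3, "Person C")])
    conn.executemany("INSERT INTO posting VALUES (?, ?, ?)",
                     [(1, "Office X", "Place P"),
                      (2, "Office X", "Place Q"),
                      (3, "Office Y", "Place P")])
    # Row (2, 1, 'father') reads: the father of Person B is Person A.
    conn.executemany("INSERT INTO kinship VALUES (?, ?, ?)",
                     [(2, 1, "father")])
    return conn


def holders_per_office(conn):
    """Aggregate view: how many distinct people held each office."""
    rows = conn.execute("""
        SELECT office, COUNT(DISTINCT person_id)
        FROM posting
        GROUP BY office
        ORDER BY office
    """).fetchall()
    return dict(rows)


def officeholders_with_kin_in_office(conn):
    """Correlate two aspects of life: kinship ties where both people held office."""
    return conn.execute("""
        SELECT DISTINCT p.name, k.relation, q.name
        FROM kinship k
        JOIN person  p ON p.person_id = k.person_id
        JOIN person  q ON q.person_id = k.kin_id
        JOIN posting a ON a.person_id = k.person_id
        JOIN posting b ON b.person_id = k.kin_id
        ORDER BY p.name
    """).fetchall()
```

Because offices and kinship live in separate tables keyed by `person_id`, the second query can combine them without duplicating any person's details, which is the kind of aggregate-level question a single flat table makes awkward.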

There have been similar efforts to build and analyze datasets and databases not only for studying imperial Chinese history, but also for Australia (Biographical Database of Australia, 2018), Anglo-Saxon England (Prosopography of Anglo-Saxon England, 2010), the Byzantine Empire (Martindale, 2014), the Egyptian Middle Kingdom (Persons and Names of the Middle Kingdom, 2018), Islamic history (Romanov, 2017), Jaina history (Flügel, 2017), Japan (Japanese Biographical Database, 2018; Born, 2018), and Taiwanese history (Sie et al., 2017), to list some examples. These projects on the history of other regions, especially those dealing with non-Latin scripts, often face similar challenges and opportunities in processing and analyzing hard-copy historical sources.

Key Terms in this Chapter

Biographical Databases: Digital resources that contain systematic information or sources about historical figures. Most of these are organized in the form of relational databases. Beyond the China Biographical Database, there are also a good number of biographical database projects for studying the history of other regions and civilizations.

Prosopography: The investigation of the common characteristics of a group of actors in history through a collective study of their lives. Doing prosopography requires creating and curating lists of biographical notes that support historical analyses. These usually record the offices, ranks, honors, property, and other features held by individuals, as well as the relationships between people. In recent decades this has often involved the creation of a biographical database so that the information can be queried through a computerized system.

China Biographical Database: The China Biographical Database (CBDB) is a relational database with biographical information about approximately 422,600 individuals as of September 2018, primarily from the 7th through 19th centuries. With both online and offline versions, its data is meant to be useful for statistical, social network, and spatial analysis as well as serving as a kind of biographical reference. Jointly managed by Harvard, Academia Sinica, and Peking University, CBDB is one of the best-known digital humanities projects in Chinese studies.

Datafication: The use of digital technologies to release the information associated with physical objects and turn it into computerized data that can be used in more powerful ways. Datafication is a phase beyond digitization. In this study we deal specifically with the datafication of historical records from the Tang dynasty in medieval China.

Proofreading: Due to technical constraints, the precision rates of the optical character recognition (OCR) of Chinese characters in historical records can be far from ideal. Human effort is therefore needed to identify errors in the OCR output and to correct them. However, these proofreading tasks can be laborious, expensive, time-consuming, and error-prone even for professionals. In the digitization workflow of CBDB, proofreading is done on a computerized online system.

Name Disambiguation: Chinese historical figures often share names, which makes it difficult to distinguish them from each other. To integrate the extracted data into the existing CBDB system, we need to identify and link records of the same person, a process usually referred to as disambiguation. For CBDB’s biographical data, disambiguation is done with the aid of digital tools.

Semi-Automated Digitization: In converting historical documents into digital data, the CBDB project team follows a specially designed workflow in order to enhance efficiency and reduce manual human input, yet maintain data quality at the same time. Compared to manual digitization, many more records can be digitized and processed using the same amount of resources. In our view, this is a more intelligent approach for populating the data in biographical databases.
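The linking step described under Name Disambiguation can be sketched as a simple triage. This is not the CBDB team's tool: the `PersonRecord` fields, the year-overlap rule, and the three outcome labels are assumptions made for illustration, and every record shown is an invented placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PersonRecord:
    """Hypothetical minimal record: a name plus an optional span of active years."""
    record_id: str
    name: str
    year_from: Optional[int]
    year_to: Optional[int]


def years_compatible(a: PersonRecord, b: PersonRecord) -> bool:
    """Two records are compatible if their year spans overlap or either is unknown."""
    if None in (a.year_from, a.year_to, b.year_from, b.year_to):
        return True  # missing dates cannot rule out a match
    return a.year_from <= b.year_to and b.year_from <= a.year_to


def triage(new: PersonRecord,
           existing: List[PersonRecord]) -> Tuple[str, List[str]]:
    """Suggest what to do with a newly extracted record.

    Returns ("new", [])        when no existing record shares the name and dates,
            ("link", [id])     when exactly one candidate remains,
            ("review", [ids])  when several same-name candidates remain.
    """
    candidates = [e.record_id for e in existing
                  if e.name == new.name and years_compatible(new, e)]
    if not candidates:
        return ("new", [])
    if len(candidates) == 1:
        return ("link", candidates)
    return ("review", candidates)


# Placeholder records; names and years are invented for the example.
existing_records = [
    PersonRecord("e1", "Name A", 650, 700),
    PersonRecord("e2", "Name A", 820, 870),
    PersonRecord("e3", "Name B", None, None),
]
```

In this sketch a "link" result is only a suggestion. Since shared names are common, a semi-automated workflow would plausibly route both "link" and "review" cases to a person for confirmation, with the tool narrowing the list of candidates.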
