Introduction
With the advance of digital libraries and online publishing, the number of online scholarly documents and born-digital documents is increasing significantly, and the digital transition away from print continues (Ware & Mabe, 2015). Open-access initiatives and platforms such as publisher-owned websites and arXiv.org have also made personal digital libraries not only possible but prevalent among researchers and scientists (Laakso & Björk, 2013). Portable Document Format (PDF) has become the de facto standard for producing, delivering, exchanging, and archiving scholarly documents because it keeps the visual presentation independent of the source-file structure. Automatic extraction of metadata from such PDF documents is therefore fundamental to digital preservation, bibliometrics, and scientific competitiveness analysis and evaluation (Suh & Lee, 2001; Zhao, 2010; Fiori et al., 2014).
Metadata is often defined as data about data and is used by both humans and computers. Digital libraries must ensure that computer systems can both read and “understand” metadata (Lee, Kim, & Kim, 2001). This requires formal syntax and defined semantics: humans can overcome inconsistencies and vagueness, but computers cannot (Jeffery & Koskela, 2015). This chapter deals predominantly with scholarly documents from the scientific literature rather than journalistic magazines, and restricts the metadata of interest to the title, author names, affiliations, and author-affiliation matching. We focus on these fields because their formatting styles are varied and change frequently and significantly from one scholarly paper to another. The remaining metadata, such as publishing source, journal name, publication date, volume, and issue number, are outside the scope of this chapter, although they can be extracted in a similar way with the approaches proposed here.
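To make the scope concrete, the four metadata fields named above can be written down as a small record type. The following is a minimal sketch only: the class names `Author` and `ScholarlyMetadata`, the field names, and the choice to represent author-affiliation matching as index lists are illustrative assumptions, not a schema defined in this chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Author:
    """One author name plus links to affiliations (hypothetical layout)."""
    name: str
    # Indices into ScholarlyMetadata.affiliations. This index list is one
    # assumed way to encode author-affiliation matching.
    affiliation_ids: List[int] = field(default_factory=list)


@dataclass
class ScholarlyMetadata:
    """The metadata fields in scope: title, authors, affiliations, matching."""
    title: str
    authors: List[Author] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)

    def affiliations_of(self, author_name: str) -> List[str]:
        """Return the affiliation strings matched to the named author."""
        for author in self.authors:
            if author.name == author_name:
                return [self.affiliations[i] for i in author.affiliation_ids]
        return []


# Usage with placeholder values (not taken from any real paper).
record = ScholarlyMetadata(
    title="An Example Title",
    authors=[Author("A. Author", [0]), Author("B. Author", [0, 1])],
    affiliations=["Example University", "Example Institute"],
)
print(record.affiliations_of("B. Author"))
```

Storing the matching explicitly, rather than as free text, is one way to give the record the formal syntax and defined semantics that machine processing requires.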
Unfortunately, the PDF specification defines only a basic logical structure for describing text, paragraphs, and other layout objects. It is optimized for content presentation but lacks structural information about the content, especially its reading order. The absence of explicit tags or discernible labels for many document elements is the main obstacle to automatic identification of metadata by machines. Moreover, the absence of uniform formatting and layout standards makes it very hard, sometimes even impossible, to extract metadata from scholarly documents that come from different publishing sources. The accuracy and efficiency of metadata extraction are affected mainly by variations in how different computer programs implement visual formatting in PDF documents; individual style differences among authors; the way PDF documents are compiled from their sources; and errors in the PDF document itself. The paper’s title is an example. A title typically has obvious visual formatting features, such as a position on the first page, the largest font size, or centered text. Although the title, with its simple formatting semantics, is traditionally believed to be easy to extract, extraction accuracy is still affected by the following factors:
- The title is not located at the beginning of the first page;
- The title spans multiple lines;
- The title has a subtitle in addition to the main title;
- The title text uses multiple font types and sizes;
- The title contains special characters, digit strings, or mathematical or physics equations;
- A page header appears before the title;
- Documents from different sources use different typesetting templates for the title.
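The visual cues mentioned above (first page, largest font size) can be turned into a simple baseline heuristic, which also shows where the listed factors cause trouble. The sketch below is an illustration under stated assumptions, not the approach proposed in this chapter: it assumes the text lines have already been obtained from the PDF by some extraction tool, and the `TextLine` type, the `guess_title` function, and the `size_tolerance` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """A text line already extracted from a PDF (assumed input format)."""
    text: str
    page: int         # 1-based page number
    top: float        # distance from the top of the page; smaller is higher
    font_size: float  # dominant font size of the line


def guess_title(lines: List[TextLine], size_tolerance: float = 0.5) -> str:
    """Guess the title as the first run of largest-font lines on page 1."""
    # Keep non-empty first-page lines, ordered from top to bottom.
    first_page = sorted(
        (ln for ln in lines if ln.page == 1 and ln.text.strip()),
        key=lambda ln: ln.top,
    )
    if not first_page:
        return ""
    largest = max(ln.font_size for ln in first_page)

    parts: List[str] = []
    for ln in first_page:
        if abs(ln.font_size - largest) <= size_tolerance:
            parts.append(ln.text.strip())  # collects multi-line titles
        elif parts:
            break  # the run of title-sized lines has ended
    return " ".join(parts)


# Usage with placeholder lines: a page header precedes a two-line title.
sample = [
    TextLine("Example Journal, Vol. X", page=1, top=20.0, font_size=8.0),
    TextLine("Metadata Extraction from", page=1, top=80.0, font_size=18.0),
    TextLine("Scholarly PDF Documents", page=1, top=102.0, font_size=18.0),
    TextLine("A. Author, B. Author", page=1, top=130.0, font_size=11.0),
]
print(guess_title(sample))
```

This baseline tolerates a smaller-font page header and a multi-line title, but it fails on several of the factors listed above: a subtitle set in a smaller size is dropped, a title that mixes font sizes is truncated, and any first-page element set larger than the title is returned instead.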