Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents

Congfeng Jiang (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China), Junming Liu (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China), Dongyang Ou (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China), Yumei Wang (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China) and Lifeng Yu (Hithink RoyalFlush Information Network Co., Ltd., Hangzhou, China)
Copyright: © 2018 | Pages: 22
DOI: 10.4018/JDM.2018040101

Abstract

The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. The authors use implicit formatting semantics, such as changes of formatting, formatting templates and their implications, and explicit formatting layouts, together with a predefined database of frequently occurring keywords, to increase extraction accuracy. Unlike OCR-based approaches, the authors use the open source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their approach to metadata extraction. A total of 10177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources were tested. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.

Introduction

With advances in digital libraries and online publishing, the quantity of online scholarly documents and born-digital documents is increasing significantly, and the digital transition away from print continues (Ware & Mabe, 2015). Open-access initiatives and platforms such as publisher-owned websites and arXiv.org also make the personal digital library not only possible, but prevalent among researchers and scientists (Laakso & Björk, 2013). Portable Document Format (PDF) has become the de facto standard for producing, delivering, exchanging, and archiving scholarly documents because of its independence of visual information and source-file structure. Therefore, automatic extraction of metadata from such PDF documents is fundamental to digital preservation, bibliometrics, and the analysis and evaluation of scientific competitiveness (Suh & Lee, 2001; Zhao, 2010; Fiori et al., 2014).

Metadata is defined by some as data about data and is used by both humans and computers. Digital libraries must ensure that computer systems can both read and “understand” metadata (Lee, Kim, & Kim, 2001). This requires formal syntax and defined semantics: humans can overcome inconsistencies and vagueness, but computers cannot (Jeffery & Koskela, 2015). This chapter predominantly refers to scholarly documents from scientific literature rather than journalistic magazines, and restricts the metadata to title, author names, affiliations, and author-affiliation matching. We focus on these fields because their formatting styles vary frequently and significantly across scholarly papers. The remaining metadata, such as publishing source, journal name, publication date, volume, and issue number, are outside the scope of this chapter, although they can be extracted similarly by the approaches proposed here.

Unfortunately, the PDF specification only defines the basic logical structure to describe the texts, paragraphs, and other layout objects. The PDF specification is optimized for content presentation, but lacks structural information on the content, especially the structure in reading order. The absence of explicit tags or discernible labels for many elements in documents is the main obstacle to machines automatically identifying the metadata. Moreover, the absence of uniform formatting and layout standards makes it very hard, sometimes even impossible, to extract metadata from scholarly documents that come from different publishing sources. The accuracy and efficiency of metadata extraction are affected mainly by implementation variations of visual formatting in PDF documents produced by different computer programs; individual style differences among authors; source compilation of PDF documents; and errors in the PDF document itself. The paper’s title is an example. The title has obvious visual formatting features, such as location on the first page, largest font size, or centered text. Although it is traditionally believed that the title, with its simple formatting semantics, is easy to extract, the extraction accuracy is still affected by the following factors:

  • The title is not located at the beginning of the first page;

  • The title has multiple lines;

  • The title has a subtitle besides the main title;

  • The title text has multiple font types and sizes;

  • The title has special characters, digit strings, or mathematical or physics equations;

  • There is a page header before the title;

  • Different documents from different sources have different typesetting templates for the title.
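
The interaction between the "largest font size" cue and some of the factors above (a page header before the title, a multi-line title) can be illustrated with a minimal sketch. It is not the authors' PAXAT implementation: the `Line` record, the `guess_title` function, and the `size_tolerance` and `header_margin` parameters are hypothetical names chosen for illustration, and the sketch assumes that text lines with their font size and vertical position have already been extracted from the first page by some preprocessing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical record for one extracted first-page text line."""
    text: str
    font_size: float  # in points
    y: float          # distance from the top of the page, in points


def guess_title(lines: List[Line],
                size_tolerance: float = 0.5,
                header_margin: float = 40.0) -> str:
    """Guess the title as the first run of largest-font lines below the header.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Ignore empty lines and anything inside the assumed page-header band.
    body = [ln for ln in lines if ln.text.strip() and ln.y >= header_margin]
    if not body:
        return ""

    max_size = max(ln.font_size for ln in body)
    body.sort(key=lambda ln: ln.y)  # approximate reading order, top to bottom

    title_parts: List[str] = []
    for ln in body:
        if abs(ln.font_size - max_size) <= size_tolerance:
            # Consecutive largest-font lines are merged into one title.
            title_parts.append(ln.text.strip())
        elif title_parts:
            # The first change of formatting after the title ends the run.
            break
    return " ".join(title_parts)


first_page = [
    Line("Journal of Examples, Vol. 1", 8.0, 20.0),      # page header
    Line("Implicit Semantics Based Metadata", 18.0, 80.0),
    Line("Extraction and Matching", 18.0, 102.0),        # second title line
    Line("A. Author", 11.0, 130.0),
    Line("Abstract", 18.0, 200.0),                        # same size, later on
]
print(guess_title(first_page))
```

Even this simple heuristic depends on two thresholds, and it would still fail on titles that mix font sizes or use a subtitle in a smaller font, which is consistent with the point that a single formatting cue is not sufficient.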
