Encoding Models for Scholarly Literature: Does the TEI Have a Word to Say?

Martin Holmes (University of Victoria, Canada) and Laurent Romary (INRIA-Gemo & Humboldt Universität Berlin, Germany)
DOI: 10.4018/978-1-61692-834-6.ch005
In this chapter, the authors examine the issue of digital formats for document encoding, archiving and publishing, through the specific example of “born-digital” scholarly journal articles. This small area of electronic publishing represents a microcosm of the state of the art, and provides a good basis for this discussion. The authors begin by looking at the traditional workflow of journal editing and publication, and how these practices have made the transition into the online domain. They examine the range of different file formats in which electronic articles are currently stored and published. They argue strongly that, despite the prevalence of binary and proprietary formats such as PDF and MS Word, XML is a far superior encoding choice for journal articles. Next, the authors look at the range of XML document structures (DTDs, schemas) which are in common use for encoding journal articles, and consider some of their strengths and weaknesses. The authors suggest that, despite the existence of specialized schemas intended specifically for journal articles (such as NLM), and more broadly used publication-oriented schemas such as DocBook, there are strong arguments in favour of developing a subset or customization of the Text Encoding Initiative (TEI) schema for the purpose of journal-article encoding; TEI is already in use in a number of journal publication projects, and the scale and precision of the TEI tagset make it particularly appropriate for encoding scholarly articles. They outline the document structure of a TEI-encoded journal article, and look in detail at suggested markup patterns for specific features of journal articles. Finally, they look briefly at how XML-based publication systems work, and what advantages they bring over electronic publication methods based on other digital formats.
This book chapter provides an overview of issues related to the definition of a standard framework for the editing of scientific content. It mainly takes its examples from the specific case of journal papers, while attempting to cover the core features of similar documents (conference papers, scientific books, ISO standards, etc.). The focus on scholarly papers results from a series of converging factors indicating that a reference model for the representation of such textual objects has become central to the capacity of scholarly publishing to go digital.

These various factors may be summarised as follows:

  • The editorial workflow is now carried out almost entirely in electronic form: authors and reviewers exchange only digital texts with publishers;

  • In the scientific world itself, the increasing role of publication repositories, in conjunction with the open access movement, has raised questions, as well as expectations, with regard to the long-term accessibility of the corresponding data;

  • Specific repositories such as PubMed Central have even taken strong positions with regard to the kind of formats they will offer for long-term accessibility;

  • XML technology is now mature enough to be considered the natural syntactic framework for the representation of semi-structured data in general, and of text-based documents in particular;

  • Even when taking XML technology for granted, one can observe that so far no specific XML application has emerged as either a de facto or a de jure standard and, even worse, no coordinated vision seems to guide the development of ongoing initiatives.

This chapter will approach the issue from the point of view of the actual use cases and needs of an editing workflow. It will identify how three things may affect the definition of a reference model and/or format: the various types of workflows (author - publisher (reviewer) - reader), the issues and constraints related to scholarly publishing (what is specific to journal papers as opposed to any kind of semi-structured document), and style guides for scientific publications. In this context, we will try to demonstrate that the representation of scholarly papers has to be considered in the wider context of text representation, both to provide a wide and sound basis for standardization and to ensure a long-term convergence between specific and generic document types through the reuse of shared components. This will lead us to suggest that the Text Encoding Initiative can be a good candidate for moving away from proprietary endeavours, and we will try to characterize a TEI subset for journal editing that covers most of the features identified in this chapter.

