Four-Layer Grapheme Model for Computational Paleography

Raymond E.I. Pardede (Department of Electron Devices, Budapest University of Technology and Economics, Budapest, Hungary), Loránd L. Tóth (Department of Electron Devices, Budapest University of Technology and Economics, Budapest, Hungary), György A. Jeney (Department of Electron Devices, Budapest University of Technology and Economics, Budapest, Hungary), Ferenc Kovács (Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary) and Gábor Hosszú (Department of Electron Devices, Budapest University of Technology and Economics, Budapest, Hungary)
Copyright © 2016 | Pages: 19
DOI: 10.4018/JITR.2016100105

Abstract

This article proposes a novel mathematical model of the logical relationships among glyphs belonging to the same grapheme. The research belongs to computational paleography, a field of applied computer science. The proposed grapheme model is organized into four logical layers, from bottom to top: the Topology, Visual Identity, Phonetic, and Semantic Layers. In the Topology Layer, each glyph is defined by a set of topological properties. To describe the logical relations among various glyphs, their topological properties are examined in a higher-layer framework called the Visual Identity Layer, in which the glyphs of a single grapheme share certain topological attributes. These shared topological attributes form the core identity of a grapheme, called the Common Identity template, which is obtained by a supervised learning method. The Phonetic Layer gives the sound values associated with the grapheme, and the Semantic Layer describes the usage of the grapheme in texts. Some potential implementations of the grapheme model are also presented.
Article Preview

Introduction

A significant research field of human-computer interface development is Natural Language Processing (NLP), which deals with processing input given in oral or written form (Kovács 2012). To handle written content, various grapheme processing methods have been developed, e.g., Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR). These approaches require a deep analysis of writing systems, especially of graphemes. Our research focuses on the general modeling of graphemes.

A writing system, or in other words a script, can be associated with different orthographies. For example, the Latin script has several associated orthographies, such as French, German, Indonesian, English, and Hungarian. In a writing system, a grapheme is defined as the smallest semantically distinguishing fundamental unit, in other words a minimally distinctive unit of the writing system. Graphemes may take the form of alphabetic letters, ligatures, numerical digits, or punctuation marks. A glyph, in turn, refers to a unique shape (an image) that represents a single grapheme and carries topological information about the grapheme's shape. However, in several cases, different glyphs may represent exactly the same abstract grapheme (Hosszú 2014). A character, on the other hand, refers to the encoded extension of a grapheme. It is noteworthy that the use of the terms grapheme and character is not consistent in the scientific literature.

Historical script relics bear one or more inscriptions. An inscription is composed of symbols, which are the smallest individual units of an inscription from a visual perspective. Typically, a symbol is the materialization of a certain grapheme; in other words, the grapheme is the abstraction of a symbol and, vice versa, a symbol is the realization of a glyph of a certain grapheme. It is noteworthy that Kohrt (1986) and August (1986) use the term graph in essentially the same meaning as we use the term symbol.
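The grapheme–glyph–symbol terminology above can be captured as a minimal data model. The following sketch is purely illustrative; all class and attribute names are our own assumptions, not definitions taken from the article.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Glyph:
    """One concrete shape variant of a grapheme (hypothetical model)."""
    glyph_id: str
    topological_properties: frozenset  # e.g. {"closed_loop", "stem"}

@dataclass
class Grapheme:
    """Abstract unit of a writing system; may have many glyph variants."""
    name: str
    glyphs: list = field(default_factory=list)

    def add_glyph(self, glyph: Glyph) -> None:
        self.glyphs.append(glyph)

@dataclass
class Symbol:
    """A symbol in an inscription realizes one glyph of one grapheme."""
    grapheme: Grapheme
    glyph: Glyph
    position: tuple  # (x, y) location within the inscription

# Example: a grapheme with two glyph variants, one of which is realized
# as a symbol at a given position in an inscription.
a = Grapheme("latin_a")
v1 = Glyph("double_storey", frozenset({"closed_loop", "hook"}))
v2 = Glyph("single_storey", frozenset({"closed_loop", "stem"}))
a.add_glyph(v1)
a.add_glyph(v2)
s = Symbol(grapheme=a, glyph=v2, position=(12, 30))
```

In this reading, a grapheme is one-to-many with glyphs, and each symbol found in an inscription points back to exactly one glyph.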

Studies related to the glyphs of a particular script are a special and challenging subject for pattern recognition. This subject may include, but is not limited to, deciphering encrypted glyphs discovered through excavation, recognizing patterns in glyph transformations, and so on. Effective software may assist researchers by accelerating the research and providing more accurate results through automated processing. Producing such software requires the support of a solid mathematical model. Therefore, our main objective is to develop a descriptive mathematical model as a useful framework for building a tool that supports the deciphering of historical or hard-to-read inscriptions. It is noteworthy that an appropriate software-based solution for analyzing historical inscriptions needs normalized data models, powerful parallel-processing databases, and a parallel-computing approach (Willson 2011).

In this article, glyph relations and identification are modeled using a layer-based approach, which from bottom to top consists of the Topology Layer, the Visual Identity Layer, the Phonetic Layer, and the Semantic Layer. The relations among the topological, phonetic, and semantic components of our four-layer grapheme model are of special significance in ASR when the appropriate graphemes or words must be selected among homonyms; for this problem, Hidden Markov Models (HMMs) are typically employed (Segi et al. 2014). Kovács (2012) developed a morpheme analyzer for an NLP engine. The Phonetic Layer of the grapheme model is important, e.g., in script deciphering (Hosszú 2014) and in ASR. In such systems, the pronunciation or phonetic dictionary is a significant component that needs an appropriate phonetic model of the graphemes, which are the elementary units of any written content (Ali et al. 2009). The Semantic Layer of the grapheme model is related to several semantics-based research efforts. One of them concerns karaka relations in the Hindi language, which represent syntactico-semantic relationships between various elements of Hindi sentences. The computational identification of the sense of a word in a certain context, called Word Sense Disambiguation (WSD), was investigated as part of natural language processing, and two supervised WSD algorithms were developed (Singh & Siddiqui 2015). The developed four-layer grapheme model, including its principles and implementation examples, is described in the following sections.
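The four-layer structure described above can be sketched as a layered data model. This is a hypothetical illustration under our own naming assumptions; in particular, the Common Identity template is approximated here by a simple set intersection over glyph property sets, whereas the article obtains it by a supervised learning method.

```python
from dataclasses import dataclass, field

@dataclass
class TopologyLayer:
    """Layer 1: each glyph is defined by a set of topological properties."""
    glyphs: dict = field(default_factory=dict)  # glyph id -> set of properties

@dataclass
class VisualIdentityLayer:
    """Layer 2: topological attributes shared by all glyphs of a grapheme."""
    def common_identity(self, topology: TopologyLayer) -> set:
        # Crude stand-in for the Common Identity template: the attributes
        # common to every glyph variant (the article uses supervised learning).
        prop_sets = list(topology.glyphs.values())
        return set.intersection(*prop_sets) if prop_sets else set()

@dataclass
class PhoneticLayer:
    """Layer 3: sound values associated with the grapheme."""
    sound_values: list = field(default_factory=list)

@dataclass
class SemanticLayer:
    """Layer 4: usage of the grapheme in texts."""
    usage_notes: list = field(default_factory=list)

@dataclass
class GraphemeModel:
    """The four layers stacked from bottom to top."""
    topology: TopologyLayer
    visual_identity: VisualIdentityLayer
    phonetic: PhoneticLayer
    semantic: SemanticLayer

# Example: two glyph variants of one grapheme sharing one attribute.
topo = TopologyLayer({"v1": {"closed_loop", "stem"},
                      "v2": {"closed_loop", "hook"}})
model = GraphemeModel(topo, VisualIdentityLayer(),
                      PhoneticLayer(sound_values=["a"]),
                      SemanticLayer(usage_notes=["word-initial"]))
shared = model.visual_identity.common_identity(topo)
```

In this sketch the Visual Identity Layer is computed from the Topology Layer below it, while the Phonetic and Semantic Layers attach linguistic information on top.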
