Introduction
A significant research field in human-computer interface development is Natural Language Processing (NLP), which deals with input given in oral or written form (Kovács 2012). To handle such content, various processing methods have been developed, e.g. Optical Character Recognition (OCR) for written input and Automatic Speech Recognition (ASR) for spoken input. These approaches require a deep analysis of writing systems, especially of their graphemes. Our research work focuses on the general modeling of graphemes.
A writing system, or in other words a script, can be associated with different orthographies. For example, the Latin script has several associated orthographies, such as French, German, Indonesian, English, and Hungarian. Within a writing system, a grapheme is defined as the smallest semantically distinguishing fundamental unit, i.e. a minimally distinctive unit of the writing system. Graphemes may take the form of alphabetic letters, ligatures, numerical digits, or punctuation marks. A glyph, in turn, generally refers to a unique shape (an image) that represents a single grapheme and carries topological information about the shape of that grapheme. In several cases, however, different glyphs may represent exactly the same abstract grapheme (Hosszú 2014). A character, on the other hand, refers to the encoded extension of a grapheme. It is noteworthy that the use of the terms grapheme and character is not consistent in the scientific literature.
Historical script relics carry one or more inscriptions. An inscription is composed of symbols, which are its smallest individual units from a visual perspective. Typically, a symbol is the materialization of a certain grapheme; in other words, the grapheme is the abstraction of a symbol and, vice versa, a symbol is the realization of a glyph of a certain grapheme. It is noteworthy that Kohrt (1986) and August (1986) use the term graph in essentially the same sense as we use the term symbol.
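The terminology above (grapheme, glyph, symbol) can be sketched as a simple data model. The following is only an illustrative sketch; the class and field names are our assumptions for exposition and are not taken from the cited works:

```python
from dataclasses import dataclass, field

@dataclass
class Grapheme:
    """Smallest semantically distinguishing unit of a writing system."""
    name: str                                   # illustrative identifier
    glyphs: list["Glyph"] = field(default_factory=list)

@dataclass
class Glyph:
    """A unique shape (image) representing a single grapheme."""
    grapheme: "Grapheme"
    outline: str                                # placeholder for topological shape data

@dataclass
class Symbol:
    """A concrete mark in an inscription: the realization of a glyph."""
    glyph: Glyph

    @property
    def grapheme(self) -> Grapheme:
        # A symbol materializes the grapheme behind its glyph.
        return self.glyph.grapheme

# Different glyphs may represent the same abstract grapheme:
a = Grapheme("a")
g1 = Glyph(a, outline="single-storey")
g2 = Glyph(a, outline="double-storey")
a.glyphs.extend([g1, g2])

s = Symbol(g1)                                  # one symbol in an inscription
```

Here `s.grapheme` resolves to the abstract grapheme `a`, mirroring the abstraction relation described above.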
Studies of the glyphs of a particular script are a special and challenging subject for pattern recognition. This subject may include, but is not limited to, deciphering encrypted glyphs discovered through excavation, recognizing patterns in glyph transformations, and so on. Effective software can help researchers shorten research time and obtain more accurate results through automated processing. Producing such software requires the support of a solid mathematical model. Therefore, our main objective is to develop a descriptive mathematical model as a framework for building a tool that supports the deciphering of historical or hard-to-read inscriptions. It is noteworthy that an appropriate software-based solution for analyzing historical inscriptions needs normalized data models, powerful parallel-processing databases, and a parallel-computing approach (Willson 2011).
In this article, glyph relations and identification are modeled using a layer-based approach, which from bottom to top consists of the Topology Layer, the Visual Identity Layer, the Phonetic Layer, and the Semantic Layer. The relations of the topological, phonetic, and semantic components of our four-layer grapheme model are especially significant in ASR when the appropriate graphemes or words must be selected among homonyms. For this problem, Hidden Markov Models (HMMs) are typically employed (Segi et al. 2014). Kovács (2012) developed a morpheme analyzer for an NLP engine. The phonetic layer of the grapheme model is important, e.g., in script deciphering (Hosszú 2014) and in ASR. In such systems, the pronunciation or phonetic dictionary is a significant component that needs an appropriate phonetic model of the graphemes, which are the elementary units of any written content (Ali et al. 2009). The semantic layer of the grapheme model is related to several semantics-based research works. One of them concerns the karaka relations in the Hindi language, which represent syntactico-semantic or semantico-syntactic relationships between the various elements of Hindi sentences. The computational identification of the sense of a word in a certain context, called Word Sense Disambiguation (WSD), was investigated as part of natural language processing, and two supervised WSD algorithms were developed (Singh & Siddiqui 2015). The developed four-layer grapheme model, including its principles and implementation examples, is described in the following sections.
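The bottom-to-top layer ordering can be outlined as a minimal data-structure sketch. The layer names follow the text above, but all field names and contents are illustrative assumptions rather than the model's actual specification, which the following sections describe:

```python
from dataclasses import dataclass, field

@dataclass
class TopologyLayer:
    """Bottom layer: topological (shape) descriptions of the glyphs."""
    outlines: list[str] = field(default_factory=list)

@dataclass
class VisualIdentityLayer:
    """Groups topologically different glyphs that share one visual identity."""
    variant_names: list[str] = field(default_factory=list)

@dataclass
class PhoneticLayer:
    """Sound values the grapheme may take, e.g. for a phonetic dictionary."""
    sound_values: list[str] = field(default_factory=list)

@dataclass
class SemanticLayer:
    """Top layer: meaning-level information, e.g. for homonym selection."""
    senses: list[str] = field(default_factory=list)

@dataclass
class GraphemeModel:
    """Four-layer grapheme model, listed bottom-up."""
    topology: TopologyLayer
    visual_identity: VisualIdentityLayer
    phonetics: PhoneticLayer
    semantics: SemanticLayer

# Illustrative instance for a hypothetical grapheme with two glyph variants:
model = GraphemeModel(
    topology=TopologyLayer(outlines=["variant-1-outline", "variant-2-outline"]),
    visual_identity=VisualIdentityLayer(variant_names=["variant 1", "variant 2"]),
    phonetics=PhoneticLayer(sound_values=["/a/"]),
    semantics=SemanticLayer(senses=[]),
)
```

Keeping the layers as separate components reflects the idea that topological, phonetic, and semantic information can be related but queried independently, e.g. by an ASR system consulting only the phonetic layer.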