Application of the Cluster Analysis in Computational Paleography

Loránd Lehel Tóth (Budapest University of Technology and Economics, Hungary), Raymond Eliza Ivan Pardede (Budapest University of Technology and Economics, Hungary), György András Jeney (Budapest University of Technology and Economics, Hungary), Ferenc Kovács (Budapest University of Technology and Economics, Hungary) and Gábor Hosszú (Budapest University of Technology and Economics, Hungary)
DOI: 10.4018/978-1-4666-9479-8.ch020


This chapter presents a method for determining the actual version of a script used to produce a script relic of unknown origin. Glyphs, which belong to graphemes as their models, are realized in the relics as symbols. Some groups of glyphs may change shape (shapeshifting) over time, producing various versions of a script that use different glyphs to express the same grapheme. These glyph variants can be identified from extant relics, mainly from historical abecedaries that are used as references. Our algorithm determines whether or not an abecedary is related to the symbols of a relic of unknown origin by means of the canonical decomposition of the glyphs and symbols. From this decomposition, an aggregated value called a fingerprint is created, which is unique for each relic. The fingerprints are then evaluated by a clustering technique using various metrics. Comparative evaluations show that the Minkowski metric provides the most interpretable clustering structure. The results of the evaluations, conclusions, and future work are also presented.
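For orientation, the Minkowski metric named above is the distance d(a, b) = (Σ |a_i − b_i|^p)^(1/p), which reduces to the Manhattan distance for p = 1 and the Euclidean distance for p = 2. The following is a minimal sketch only: it assumes that fingerprints can be represented as fixed-length numeric vectors, which this preview does not state, and the function name and sample values are hypothetical.

```python
from typing import Sequence


def minkowski_distance(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Hypothetical fingerprint vectors (placeholder values, not data from the chapter).
relic = [3.0, 0.0, 1.0]
abecedary = [0.0, 4.0, 1.0]

print(minkowski_distance(relic, abecedary, p=1))  # Manhattan distance: 7.0
print(minkowski_distance(relic, abecedary, p=2))  # Euclidean distance: 5.0
```

The order p is the parameter that a comparative evaluation of metrics would vary; the chapter's actual choice of p is not given in this preview.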
Chapter Preview

1. Introduction

A writing system is a symbolic representation of a language described in terms of linguistic units (Malatesha & Aaron 2006). A script is the graphic format in which a writing system is represented. Some examples of scripts are the following: Aramaic, Arabic, Batak, Brahmi, Carian, Carpathian Basin Rovash (pronounced “rove-ash”, other spelling: Rovas), Cyrillic, Glagolitic, Greek, Hebrew, Kannada, Latin, Lycian, Lydian, Middle Persian, Parthian, Szekely-Hungarian Rovash, Tamil, etc. A symbol is a distinct unit of an inscription, and inscriptions are composed of a series of symbols. A grapheme is a minimally distinctive unit in a writing system and is the abstraction of a symbol. A glyph is the shape of a grapheme with two-dimensional topological information. A grapheme usually has several glyphs (Hosszú 2014). A symbol is a realization of a glyph of a grapheme in an inscription. A grapheme can be a letter, ideogram, ligature, numerical digit, accent, punctuation mark, etc. A grapheme, like an object, has several properties:

  1. The script(s) to which it belongs,
  2. Associated glyphs (shapes),
  3. Name (a unique identifier),
  4. Transcription values,
  5. Sound values (depending on age and language of use),
  6. Periods of use,
  7. Usage areas, etc.
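
The properties listed above can be pictured as the fields of a record. The following is a minimal sketch, assuming a simple in-memory representation; the class name, field names, field types, and all sample values are hypothetical placeholders and are not taken from the chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Grapheme:
    """Illustrative record mirroring the grapheme properties listed above."""
    name: str                                                  # 3. unique identifier
    scripts: List[str] = field(default_factory=list)           # 1. script(s) it belongs to
    glyphs: List[str] = field(default_factory=list)            # 2. associated glyph shapes
    transcriptions: List[str] = field(default_factory=list)    # 4. transcription values
    sound_values: List[str] = field(default_factory=list)      # 5. depend on age and language
    periods_of_use: List[str] = field(default_factory=list)    # 6. periods of use
    usage_areas: List[str] = field(default_factory=list)       # 7. usage areas


# Placeholder instance: every value below is invented for illustration only.
example = Grapheme(
    name="EXAMPLE_GRAPHEME",
    scripts=["Szekely-Hungarian Rovash"],
    glyphs=["glyph-variant-1", "glyph-variant-2"],
)

# One grapheme usually has several glyphs, so `glyphs` is a list.
print(example.name, len(example.glyphs))
```

Keeping glyphs as a list on the grapheme reflects the one-to-many relation described in the text: a single grapheme can be expressed by different glyph variants in different versions of a script.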

A character is an extension of a grapheme with a computer code assigned to it. An orthography is a representation system for using a particular script to write a certain language. A script can have various versions depending on age and place; each version has its own collection of glyphs of graphemes, called an alphabet. An inscription of unknown origin is an extant inscription whose origin, date, and author are unknown. The inscription under test is the examined inscription, which usually has unknown origin.

Inscriptions written in ancient times are challenging for researchers to identify. One reason, apart from the loss of the writing support materials (wood, stone, brick, paper, etc.), is that the glyphs of the graphemes used in these scripts might have changed numerous times over the ages. Moreover, different ancient script relics may have been written in different handwritings or by people with varying writing skills. As a result, the topological properties and styles of the glyphs used in script relics vary throughout history.

This chapter introduces a new method for identifying the version of a certain script (an actual alphabet) used for making an examined inscription of unknown origin. We assume that the script used for writing the examined inscription has already been identified; however, the actual version of the script used for making that particular inscription is still undetermined. The method is based on comparing the symbols of the examined inscription to the glyphs of historical abecedaria and other deciphered script relics of a certain script. The method was verified by applying it to the Szekely-Hungarian Rovash (other spelling: Rovas, pronounced “rove-ash”) script, which is used for representing the Hungarian language (Hosszú 2013b). This script was selected for verification because

Key Terms in this Chapter

Mathematical Optimization: Minimization or maximization of some objective function in order to select a best element with regard to some criteria from some set of available alternatives.

Rovash Paleography: The study of ancient rovash (pronounced “rove-ash”) writings and inscriptions.

Grapheme: A minimally distinctive unit in a writing system. Grapheme is the abstraction of a symbol. Graphemes can be letters, ligatures, numerical digits, or punctuation marks.

Rovash Script Family: Closely related writing systems used in the Eurasian Steppe by several nations and tribes up to the 11th or 12th century AD, and in the Carpathian Basin mainly by Hungarians up to the present time.

Computational Paleography: Applying computational algorithms, such as optimization or mathematical statistical methods, in the study of ancient writings and inscriptions.

Applied Computer Science: Identifies computer science concepts that can be used directly in solving real world problems.

Cluster Analysis: Identifying groups of objects, called clusters, such that the objects within a group are similar to each other but different from the objects in other groups.

Alphabet: A set of graphemes of a script, which are used in a specific orthography. The term “alphabet” is used in a wide sense, not only for the true alphabetic scripts.

Glyph: The shape of the grapheme with topological information.

Symbol: The minimal individual visual unit of the inscription. Typically, the symbol is a realized grapheme.

Orthography: A representation system for using a particular script to write a particular language. E.g., the Latin script has several orthographies, including the French, German, English, and Hungarian orthographies.

Script: A writing system that includes different versions called alphabets. E.g., the Latin script has several alphabets, including the French, German, English, and Hungarian alphabets.
