Image Analysis for Historical Japanese Book Archives

Chulapong Panichkriangkrai (Graduate School of Science and Engineering, Ritsumeikan University, Kyoto, Japan), Liang Li (Ritsumeikan Global Innovation Research Organization, Ritsumeikan University, Kyoto, Japan), Ross Walker (College of Information Science and Engineering, Ritsumeikan University, Kyoto, Japan) and Kozaburo Hachimura (College of Information Science and Engineering, Ritsumeikan University, Kyoto, Japan)
DOI: 10.4018/ijabim.2014040101

Abstract

This paper describes image analysis methods for historical Japanese book archives, with a primary focus on character segmentation. The segmentation methodology includes stain and smear removal, binarization, character line extraction, and character extraction by region labeling with integration and separation techniques. The experimental results show that the proposed method segments all text lines correctly and extracts more than 79% of the characters from 16 pages of Chinsetsu Yumiharizuki, which contain 176 text lines and a total of 5181 highly complex characters.

Introduction

Freedom from paper, easy access and distribution, mobile availability, and interactive multimedia functions have made digital books a rapidly growing market. With internet distribution, the digital book market is already a global business. At the same time, there is increasing research on strategies for developing and introducing national cultural heritage worldwide (Leong, 2010). A large number of historical books have been digitized and made available, not only to researchers but also to the public, as digital media in many digital archives (Art Research Center, Ritsumeikan University, 2010; National Institute of Japanese Literature, 2013). However, only a limited number of items have been transcribed into text-based digital books; most of the rest are held as digital images without text information. Developing a learning and reading support system that analyses document images to discriminate figures, titles, and text areas in digital archives is challenging work. This article focuses on these topics. Although our method and results are still at a preliminary stage, the approach will be significant when the digital book business expands to include historical books.

The analysis of digitally archived historical book images in this paper focuses mainly on page segmentation and identification. Here, text region extraction means identifying the text part of a page image, while text-line extraction means identifying the individual text lines within the text region. Character extraction, in turn, refers to segmenting each character from a text line. In this paper we propose techniques for both text-line and character segmentation.
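The hierarchy described above (page, then text line, then character) can be illustrated with a minimal sketch. It is not the method proposed in this paper, which relies on region labeling with integration and separation; it swaps in plain projection profiles as a simpler stand-in. It assumes the page has already been binarized (1 = ink, 0 = background) and that the text is written vertically, so text lines appear as columns of ink. The names `runs`, `text_lines`, `characters`, and the tiny `page` array are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed input: binarized page, 1 = ink, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            spans.append((start, i))       # the run ends at a blank gap
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def text_lines(img: Image) -> List[Tuple[int, int]]:
    """Text-line extraction: column ranges of the page that contain ink."""
    width = len(img[0])
    col_profile = [sum(row[x] for row in img) for x in range(width)]
    return runs(col_profile)


def characters(img: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Character extraction: row ranges with ink inside one text line."""
    x0, x1 = line
    row_profile = [sum(row[x0:x1]) for row in img]
    return runs(row_profile)


# Toy 5x7 "page" with two vertical text lines (illustrative data only).
page = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
]
lines = text_lines(page)
boxes = {line: characters(page, line) for line in lines}
for line, chars in boxes.items():
    print("columns", line, "-> character rows", chars)
```

A profile-based split like this only works where blank gaps separate the characters. Characters that touch or are joined would come out as a single run, which is one reason a labeling-based approach with explicit integration and separation steps is needed for this kind of material.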

This paper focuses on Japanese books printed in the Edo period (1603-1867). Figure 1 shows examples of Japanese woodblock printed books from that period. The Edo period was a time of calm that provided an ideal environment for developing commercial art. While Europeans used movable type printing processes during that period, the Japanese developed and used a woodblock printing process. This is a relief printing process that uses wooden blocks engraved with reversed images of both the text and the illustrations. For printing, two consecutive pages were carved on one side of the woodblock. During the Edo period, over 110,000 book titles were published in Japan, with more than 10 million copies on the market (Hioki, 2009).

Figure 1. Example of Japanese woodblock printed historical books

Currently, a large number of the books published in the Edo period have been scanned and made available to the public as digital images. However, experts have transcribed only a small number of Edo period titles into modern editions. Furthermore, only a small number of people can recognize and read the characters used in Edo period books, because the old-style characters and running scripts differ from modern ones. Characters in this type of historical book are also difficult to segment, because ligature-like forms join two or more characters.

In this paper, we propose a character segmentation system to support character shape comparison, character image retrieval, and statistical analysis of character usage in single or multiple books. The proposed concept can be applied to other Asian historical digital archives to offer cultural and social support.
