Text Detection Model for Historical Documents Using CNN and MSER

Rankang Li (Southwest University, China), Shanxiong Chen (Southwest University, China), Fujia Zhao (Southwest University, China), and Xiaogang Qiu (Southwest University, China)

Source Title: Journal of Database Management (JDM) 34(1)

DOI: 10.4018/JDM.322086

Abstract

This article introduces a text detection model for historical document images. Handwritten characters in historical documents are often difficult to detect because of faded or missing ink, weathering, and stains, all of which can seriously reduce detection accuracy. To reduce these effects, the ATD model is proposed to detect the text-boxes of characters in historical document images. The ATD model comprises a CNN-based text-box generation network and an NMS-based MSER text-box generation model. As a post-processing step, a text merging algorithm is proposed to achieve higher detection accuracy. Test results on historical document datasets in Yi, English, Latin, and Italian show that the method achieves good accuracy and represents a solid step forward for text detection in historical documents.
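The abstract names non-maximum suppression (NMS) as part of the MSER branch but does not spell out the procedure. As a rough illustration only, the following is a minimal sketch of generic greedy NMS over candidate text-boxes. It is not the authors' implementation: the box format `(x1, y1, x2, y2)`, the per-box scores, the 0.5 threshold, and the names `iou` and `nms` are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first.

    The threshold value is a hypothetical default, not one taken from the article.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    confidences = [0.9, 0.8, 0.7]
    print(nms(candidates, confidences))
```

In this usage example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and is kept.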

Introduction

Handwritten text detection is an important research topic in computer vision and pattern recognition. It refers to the task of determining the exact position of all text or characters in an input image and marking each with a colored text-box. Differences in writing style, outline, and shape make handwritten text very difficult to detect accurately, so the task remains a serious challenge. Handwritten text detection has a wide range of applications, such as document recognition, translation of historical documents, and robot vision. Continued in-depth research on detection methods is therefore important for improving detection performance.

In handwritten character detection, images are often slanted and contain defects, ambiguities, and excessive background noise; historical document images have additional problems such as stains and breakage. Text detection for historical documents is a special case of handwritten text detection. As fewer and fewer experts and scholars devote attention to the translation and understanding of historical documents, the importance of an automatic recognition system for them becomes self-evident. Such a system has three advantages. First, all historical documents are kept as digital images, which avoids the gradual loss suffered by physical media such as fading paper or oracle bones. Second, it can detect and recognize the input image quickly and automatically, which is more efficient and accurate than manual work. Third, an efficient detection and recognition system makes it easier for researchers to study historical documents. Text detection plays an important role in such a system, because an effective historical document recognition system must accurately detect each text-box before the text in it can be recognized.

In recent years, especially with the popularity of deep learning, the field of text detection has attracted extensive attention from computer science researchers. However, most researchers focus only on popular areas such as scene text detection, document text detection, and handwritten text detection. Text detection in historical document images is a special task that is difficult because of complex backgrounds and incomplete, blurred text. Its immediate application value is also small, so it has received little attention from the academic community.

In the past 20 years, researchers have proposed many algorithms for detecting handwritten text. In work from the past 10 years in particular (Shin H C, Roth H R, & Gao M, 2017), text detection is defined as a two-step task: candidate text area extraction and text/non-text area classification. These algorithms can generally be divided into two categories: those based on traditional algorithms and those based on deep learning (Chen Shanxiong, Han Xu, & Mo Bofeng, 2017; Chen Shanxiong, Wang Xiaolong, & Wang Minggui, 2019). Because text detection for historical documents is a special case of handwritten text detection, the methods used for handwritten text are, in theory, suitable for historical documents as well. However, the additional degradation in historical document images makes accurate detection harder than in ordinary handwritten text. It is therefore important to improve the existing methods to achieve higher accuracy.
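The two-step structure described above, candidate extraction followed by a text/non-text decision, can be sketched in a few lines. This is a toy illustration under stated assumptions, not any cited method: connected components on a binary image stand in for a real region detector, a minimum-area rule stands in for a learned text/non-text classifier, and the names `extract_candidates`, `looks_like_text`, and `detect` are hypothetical.

```python
from collections import deque
from typing import Callable, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def extract_candidates(img: List[List[int]]) -> List[Box]:
    """Step 1 (stand-in): bounding boxes of 4-connected foreground components."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            # Breadth-first search over one component, tracking its extent.
            seen[y][x] = True
            queue = deque([(x, y)])
            x1 = x2 = x
            y1 = y2 = y
            while queue:
                cx, cy = queue.popleft()
                x1, x2 = min(x1, cx), max(x2, cx)
                y1, y2 = min(y1, cy), max(y2, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x1, y1, x2, y2))
    return boxes


def looks_like_text(box: Box, min_area: int = 4) -> bool:
    """Step 2 (stand-in): a hypothetical size rule instead of a trained classifier."""
    x1, y1, x2, y2 = box
    return (x2 - x1 + 1) * (y2 - y1 + 1) >= min_area


def detect(img: List[List[int]], classify: Callable[[Box], bool] = looks_like_text) -> List[Box]:
    """Run both steps: extract candidate boxes, then keep those judged to be text."""
    return [box for box in extract_candidates(img) if classify(box)]


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
    ]
    print(detect(page))
```

Keeping the classifier as a plain callable means the second step can be swapped for a stronger model without touching the candidate extraction.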
