A Unified Algorithm for Identification of Various Tabular Structures from Document Images

Sekhar Mandal (Bengal Engineering and Science University, Shibpur, India), Amit K. Das (Bengal Engineering and Science University, Shibpur, India), Partha Bhowmick (Indian Institute of Technology Kharagpur, India) and Bhabatosh Chanda (Indian Statistical Institute, Kolkata, India)
Copyright: © 2011 | Pages: 28
DOI: 10.4018/jdls.2011040103

Abstract

This paper presents a unified algorithm for segmentation and identification of various tabular structures from document page images. Such tabular structures include conventional tables and displayed math-zones, as well as Table of Contents (TOC) and Index pages. The algorithm first analyzes the page composition and classifies each input document page as tabular or non-tabular. A tabular page contains at least one of the tabular structures, whereas a non-tabular page contains none. The approach is unified in the sense that it identifies all tabular structures in a tabular page, which leads to a novel and considerable simplification of document image segmentation. Such unification also speeds up the segmentation process, because existing methodologies treat different tabular structures as separate physical entities and are therefore time-consuming. Distinguishing features of the different kinds of tabular structures are used in stages to keep the algorithm simple and efficient, as demonstrated by exhaustive experimental results.

Past Work

The approach of treating text as either tabular or non-tabular has not yet been proposed by others. Hence, no directly comparable work can be cited. However, various works are available in the literature on detection and segmentation of individual tabular components, e.g., tables, TOC, and displayed math. In this section we present a brief review of the past work under the following categories.

  1. Table: Table detection and segmentation have been done in several ways (Chandran et al., 1996; Mandal et al., 2006b; Tsuruoka et al., 2001; Watanabe et al., 1995). The algorithms may be classified broadly into two types: one based on the presence of rule lines in the table, and the other based on knowledge of the table layout. Watanabe et al. (1995) have proposed a tree representation for the structure of various kinds of tables. In Chandran et al. (1996), the horizontal and vertical lines of the table are used to recognize the structure of the tabulated data. Itonori (1993) describes a similar technique, which also uses row-column pairing and the relationship between cells and ruled lines. Zuyev (1997) has defined a table grid, and has described the simple and compound cells of any table in terms of that grid. A node property matrix has been used by Tanaka and Tsuruoka (1998) for processing irregular rule lines and generating HTML files. A method for analyzing unknown table structures has been proposed by Belaid, Panchevre, and Belaid (1998). Tersteegen and Wenzel (1998) have proposed a system for extraction of tabular structure (tables only) with the help of a predefined reference table. Tsuruoka et al. (2001) have presented a segmentation method for complex tables with or without rule lines. Das and Chanda (1998a) have described a technique to separate out tables and headings in document images. Ramel et al. (2003) have used a flexible representation scheme based on a clear distinction between the physical table and its logical structure.
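
To make the rule-line-based family of methods more concrete, the following minimal sketch finds long horizontal and vertical runs of foreground pixels in a binarized page image. It is an illustration only and does not reproduce the method of any of the cited papers. The function names `long_runs` and `rule_lines`, the list-of-lists image representation (1 for ink, 0 for background), and the `min_len` threshold are all assumptions introduced for this example.

```python
from typing import List, Tuple

# (line index, start offset, end offset exclusive); hypothetical type for this sketch
Run = Tuple[int, int, int]


def long_runs(rows: List[List[int]], min_len: int) -> List[Run]:
    """Return every run of 1-pixels that is at least min_len pixels long."""
    runs: List[Run] = []
    for r, row in enumerate(rows):
        start = None
        # A trailing 0 acts as a sentinel so a run touching the border is closed.
        for c, pixel in enumerate(list(row) + [0]):
            if pixel and start is None:
                start = c
            elif not pixel and start is not None:
                if c - start >= min_len:
                    runs.append((r, start, c))
                start = None
    return runs


def rule_lines(image: List[List[int]], min_len: int) -> Tuple[List[Run], List[Run]]:
    """Return candidate horizontal and vertical rule lines of a binary image.

    Vertical lines are found by transposing the image and reusing the
    horizontal scan, so their tuples read (column, top, bottom exclusive).
    """
    horizontal = long_runs(image, min_len)
    transposed = [list(col) for col in zip(*image)]
    vertical = long_runs(transposed, min_len)
    return horizontal, vertical


if __name__ == "__main__":
    # A tiny 2 x 2 grid of cells drawn with one-pixel rule lines.
    image = [
        [1, 1, 1, 1, 1],
        [1, 0, 1, 0, 1],
        [1, 1, 1, 1, 1],
        [1, 0, 1, 0, 1],
        [1, 1, 1, 1, 1],
    ]
    horizontal, vertical = rule_lines(image, min_len=5)
    print("horizontal:", horizontal)
    print("vertical:", vertical)
```

On a real scanned page, `min_len` would have to be chosen relative to the character size so that long strokes of text are not mistaken for rule lines, and skew or broken lines would need separate handling. Tables without rule lines, which the layout-based methods above address, are outside the scope of this sketch.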
