TB-WPRO: Title-Block Based Web Page Reorganization

Qihua Chen (Institute of Computing Technology, Chinese Academy of Sciences, China), Xiangdong Wang (Institute of Computing Technology, Chinese Academy of Sciences, China) and Yueliang Qian (Institute of Computing Technology, Chinese Academy of Sciences, China)
DOI: 10.4018/japuc.2011010107


For cell phone users and blind people using non-visual browsers, browsing the Web with common browsers is quite inefficient because of information overload. This paper presents the TB-WPRO (Title-Block based Web Page Re-Organization) method, which hierarchically segments web pages into blocks using visual and layout information that reflects the web designers’ intent. TB-WPRO segments web pages with the explicit goal of extracting self-described title blocks. To reorganize a web page, the segmentation result is transformed into a series of small web pages that can be easily accessed. Compared with current methods, the proposed approach obtains a promising segmentation result in which blocks are visually and semantically consistent with the original web pages.
Article Preview

A variety of methods have been proposed to segment web pages into small blocks. The earliest approaches rely on tag information in the DOM tree, such as <P> (paragraph), <TABLE> (table), <UL> (list), and <H1>-<H6> (heading). Lin and Ho (2002) partition a page into several content blocks according to the HTML tag <TABLE> and then use entropy to distinguish redundant blocks from informative blocks. Hattori et al. (2007) propose a hybrid method that uses content distances and layout information to segment the web page. However, HTML tags do not carry any semantic information and can be misused. Chen et al. (2001) propose the FOM model, which treats each object in a web page as either a basic or a composite FOM that describes the object's functionality, but the model does not specify a concrete segmentation method. The VIPS algorithm proposed by Cai et al. (2003) analyzes the web page using visual separators to segment the content into different areas. However, the user must supply a parameter, PDoC (Permitted Degree of Coherence), to obtain a segmentation result, and the paper does not explain how to choose a PDoC that yields blocks visually and semantically consistent with the original web page. Baluja (2006) uses a learning method to divide a web page into 9 parts, from which the user can select one to zoom in on. Chakrabarti et al. (2004) place the nodes of the DOM tree in a weighted graph and formulate an optimization problem on it, which is solved by a learning framework. However, such machine learning methods need a training process and are difficult to implement.
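
To make the earliest, tag-based style of segmentation concrete, the following is a minimal sketch in Python. It is not the method of any paper cited above, and it is not TB-WPRO. The class name `TagSegmenter`, the function `segment_by_tags`, and the choice of `BLOCK_TAGS` are illustrative assumptions, and the sketch assumes well-formed markup in which every block tag is explicitly closed.

```python
from html.parser import HTMLParser

# Tags named in the text as early segmentation cues (illustrative choice).
BLOCK_TAGS = {"p", "table", "ul", "h1", "h2", "h3", "h4", "h5", "h6"}


class TagSegmenter(HTMLParser):
    """Hypothetical helper: collects the text under each outermost
    block-level tag as one block. Assumes every block tag is closed."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished (tag, text) pairs, in document order
        self._stack = []    # block tags that are currently open
        self._buffer = []   # text pieces of the current outermost block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack and self._stack[-1] == tag:
            self._stack.pop()
            if not self._stack:
                # The outermost block just closed: emit it.
                text = " ".join(self._buffer).strip()
                if text:
                    self.blocks.append((tag, text))
                self._buffer = []

    def handle_data(self, data):
        # Text outside any block tag is ignored in this sketch.
        if self._stack and data.strip():
            self._buffer.append(data.strip())


def segment_by_tags(html):
    """Return a list of (tag, text) blocks found in the HTML string."""
    parser = TagSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    page = (
        "<h1>News</h1><p>Hello <b>world</b></p>"
        "<table><tr><td><p>cell</p></td></tr></table>"
    )
    for tag, text in segment_by_tags(page):
        print(tag, "->", text)
```

The sketch also shows the weakness noted above: a <TABLE> used purely for layout swallows everything nested inside it into a single block, because the tag alone says nothing about what the content means.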
