HTML Segmentation for Different Types of Web Pages

Evelin Carvalho Freire de Amorim (Departamento de Ciência da Computação (UFMG), Brazil)
DOI: 10.4018/978-1-4666-7262-8.ch005

Abstract

Search engines face several kinds of challenges daily. One of those challenges is locating relevant content in a Web page. However, the concept of relevance in information retrieval depends on the problem to be solved. For instance, the menu of a website does not affect the results of an algorithm that detects duplicate Web pages. An HTML segmentation algorithm partitions a Web page visually in such a way that parts within the same partition are semantically related. This chapter presents two strategies to segment different types of Web pages.

Introduction

Search engines handle redundant and unstructured content daily, and such data cause problems that affect their performance. For example, redundant data are not useful for answering a query; nevertheless, they can be exhibited in the results if they are not removed from the dataset. Partitioning a web page into cohesive visual pieces and selecting the most relevant piece can improve algorithms for detecting redundant data. The task of partitioning a web page into cohesive visual pieces is called HTML segmentation.
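
The idea of partitioning a page and selecting one piece can be illustrated with a minimal Python sketch. It is not the ETL or TPS strategy compared in this chapter, and it relies on the tag structure of the markup rather than on the rendered visual layout. The set of boundary tags (BLOCK_TAGS), the helper names, and the "longest text wins" relevance heuristic are all assumptions made purely for illustration.

```python
from html.parser import HTMLParser

# Assumption: these container tags are treated as segment boundaries.
BLOCK_TAGS = {"div", "nav", "header", "footer", "section", "article", "aside", "main"}


class NaiveSegmenter(HTMLParser):
    """Collects the text found directly inside each block-level container."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open containers as (tag, list of text parts)
        self.segments = []   # finished segments as (tag, text)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append((tag, []))

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack and self.stack[-1][0] == tag:
            name, parts = self.stack.pop()
            text = " ".join(parts)
            if text:
                self.segments.append((name, text))

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and self.stack:
            # Text is assigned to the innermost open container.
            self.stack[-1][1].append(text)


def segment(html):
    """Return a list of (tag, text) segments in the order they are closed."""
    parser = NaiveSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


def largest_segment(segments):
    """Assumed relevance heuristic: the segment with the most text."""
    return max(segments, key=lambda s: len(s[1]))


PAGE = (
    "<html><body>"
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    "<div><p>Search engines manage redundant content daily.</p>"
    "<p>Segmentation helps remove it.</p></div>"
    "<footer>Contact</footer>"
    "</body></html>"
)

if __name__ == "__main__":
    for tag, text in segment(PAGE):
        print(tag, "->", text)
    print("selected:", largest_segment(segment(PAGE))[0])
```

On this toy page the menu, the main text, and the footer end up in separate segments, so a duplicate-detection step could compare only the selected segment and ignore the menu.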

Web browsing on mobile devices is also enhanced by HTML segmentation (Yin & Lee, 2004). The web browser of a mobile device can partition a web page and exhibit its most relevant part in the center of the screen. This feature improves the user's experience on the mobile device.

HTML segmentation can also improve the ranking quality of standard web page search schemes (Fernandes, Moura, da Silva, Ribeiro-Neto, & Braga, 2011). Ranking web pages is an important task in Information Retrieval, and search engines are concerned with producing the best possible ranking.

There are two main types of HTML segmentation techniques: general and topical. Topical techniques segment only specific types of web pages, for instance blogs or news sites. Although topical techniques achieve robust results, they are inflexible for particular Information Retrieval tasks. General techniques face the challenge of finding a model that reconciles features from different kinds of web pages, such as personal web pages and e-commerce web pages.

General techniques for HTML segmentation are uncommon and still constitute a challenge for the data mining area, because web pages display relevant content in different ways. For instance, describing a news web page and an e-commerce web page in one model is not an intuitive task.

This chapter has the following goals:

  1. Describing general techniques for HTML segmentation;

  2. Comparing two general HTML segmentation techniques. The first strategy is called ETL HTML segmentation and the second strategy is called TPS segmentation.

The remainder of this chapter also reviews some topical techniques, the main results of HTML segmentation algorithms, and open issues in HTML segmentation.

Background

HTML segmentation draws on concepts from information retrieval and data structures. The following subsection defines the data structure concepts employed in HTML segmentation algorithms. The next subsection describes how HTML segmentation improves some tasks in the information retrieval area.
