Building the Multidimensional Semantic Index of Webpages for Facet Extraction

Xiao Wei (Shanghai Institute of Technology, Shanghai, China & Institute of Automation, Chinese Academy of Sciences, Beijing, China), Chenglei Qin (Shanghai Institute of Technology, Shanghai, China) and Zheng Xu (The Third Research Institute of Ministry of Public Security, Shanghai, China & Tsinghua University, Beijing, China)
DOI: 10.4018/IJCINI.2015040101

Abstract

Faceted search is an efficient way to exploit big data, and one of its key issues is extracting facets from unstructured webpages automatically. Extracting facets from massive unstructured webpages accurately and automatically remains an open problem. To address it, this paper first proposes a novel index structure for webpages, the Multidimensional Semantic Index (MDSI), which holds rich semantics and is helpful for extracting facets. In MDSI, the semantic indexes of different dimensions are bridged by mining the semantic mappings between them. Then, an automatic facet extraction method is proposed that analyses the semantic mapping relations in MDSI. Finally, to validate the proposed method, two datasets are constructed, and the experimental results show that the method is feasible and comparatively precise.

1. Introduction

In the big data era, faceted search has already changed the mechanism of search. For example, on Amazon, commodities are indexed by different facets such as style, colour, and size. With the support of a faceted index, buyers can specify one or more facets to narrow their search, which greatly improves search efficiency. Compared with traditional search, the efficiency of faceted search is attributable to its semantic index. A traditional search engine uses an inverted index, which lacks semantics and merely maps terms to webpages to provide fast access to resources. A faceted search engine indexes resources by several facets at the same time, and each facet has its own semantics.
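The contrast between the two kinds of index can be illustrated with a minimal sketch. It is not the index structure proposed in this paper; the toy catalogue, the facet names, and the function names (`build_inverted_index`, `build_faceted_index`, `faceted_search`) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy catalogue: ids, terms and facet values are invented.
ITEMS = {
    1: {"terms": ["cotton", "shirt"], "facets": {"colour": "red", "size": "M"}},
    2: {"terms": ["cotton", "dress"], "facets": {"colour": "red", "size": "S"}},
    3: {"terms": ["wool", "shirt"], "facets": {"colour": "blue", "size": "M"}},
}


def build_inverted_index(items):
    """Map each term to the set of item ids containing it (no facet semantics)."""
    index = defaultdict(set)
    for item_id, item in items.items():
        for term in item["terms"]:
            index[term].add(item_id)
    return index


def build_faceted_index(items):
    """Map each (facet, value) pair to the set of item ids carrying that value."""
    index = defaultdict(set)
    for item_id, item in items.items():
        for facet, value in item["facets"].items():
            index[(facet, value)].add(item_id)
    return index


def faceted_search(inverted, faceted, term, **selected):
    """Start from the term's postings, then narrow by each selected facet value."""
    result = set(inverted.get(term, set()))
    for facet, value in selected.items():
        result &= faceted.get((facet, value), set())
    return result


inverted = build_inverted_index(ITEMS)
faceted = build_faceted_index(ITEMS)
```

In this sketch, `faceted_search(inverted, faceted, "shirt")` returns every item matching the term, while adding `colour="red"` intersects that set with the postings of one facet value, which is the narrowing step described above.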

In most current faceted search systems, the facets of resources are specified by hand, which is easy to do in some applications such as e-commerce systems. The facets of commodities are easy to determine, and the values of each facet are also easy to specify. For example, most commodities, such as TVs, computers, food, and cars, have description parameters that are widely used by buyers and sellers. These parameters are exactly what e-commerce systems select as facets to provide faceted search.

However, there are massive numbers of unstructured webpages on the Web, such as news, papers, articles, and blogs. It is difficult to choose facets or facet values from massive webpages by hand, which explains the lack of faceted search systems for these kinds of resources. Take news services as an example. There are two kinds of news services. The first is the news browsing service, such as Yahoo, BBC, and Reuters. On these websites, news is simply divided into several broad classes, such as entertainment, sports, politics, and military. The second is the news search service, such as Google News and Baidu News. On these websites, news is indexed by terms from its content. Although advanced search allows news to be searched by time, classification, and origin, these attributes are only weakly related to the news content and are of little help for searching news efficiently. Therefore, providing faceted search over massive unstructured webpages is desirable but difficult, and the basic issues to be solved are how to automatically choose facets and how to determine the values of each facet.

This paper proposes a novel index structure for webpages, named the Multidimensional Semantic Index (MDSI), which holds rich semantics and can be used to extract facets. In this paper, MDSI is first defined and constructed by mining semantic relations among the terms in webpages. Then, the semantic indexes of different dimensions in MDSI are bridged by mining the semantic mappings between them. Finally, facets are extracted by analyzing the semantic mapping relations in MDSI. In the last section, the experimental results on two datasets both show that the proposed method is feasible and comparatively precise. The contributions of this paper include:

  1. Propose a novel index structure for webpages, the Multidimensional Semantic Index, which holds rich semantics and can be used to extract facets from unstructured webpages;

  2. Propose an automatic facet extraction method based on MDSI, which can extract facets from massive webpages automatically.

Faceted search has become a popular technique in commercial search applications, especially for online retailers and libraries (Marchionini, 2003; Tunkelang, 2006). It is also used in image search, music search, and other domains, and it provides an interactive search paradigm for users. The basic issues of faceted search are facet extraction, faceted annotation, faceted indexing, and faceted querying.

There are already several open source faceted search projects, such as Bobo-Browse and Xapian (http://xapian.org/docs/facets.html). Bobo-Browse is a faceted search implementation written purely in Java as an extension of Apache Lucene. It provides facets by computing grouping statistics over Lucene's search results. The documents themselves carry no facet information, so Bobo-Browse relies heavily on the results of Lucene. Xapian is an open source search project written in C++ and is similar to Lucene. Xapian supports faceted search: it can dynamically generate complete lists of the category values that occur in matching documents. However, Xapian requires the facets to be specified manually.
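The idea of deriving facets from grouping statistics over search results can be sketched as follows. This is a generic illustration only, not the Bobo-Browse or Xapian API; the function name `facet_counts`, the field names, and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def facet_counts(results, facet_fields):
    """Count how often each value of each facet field occurs in the results.

    `results` is a list of dicts standing in for matching documents;
    `facet_fields` must be chosen by hand, which mirrors the limitation
    that the facets themselves are not discovered automatically.
    """
    counts = defaultdict(Counter)
    for doc in results:
        for field in facet_fields:
            value = doc.get(field)
            if value is not None:  # skip documents lacking this field
                counts[field][value] += 1
    return {field: dict(counter) for field, counter in counts.items()}


# Hypothetical search results, invented for illustration.
results = [
    {"brand": "A", "colour": "red"},
    {"brand": "A", "colour": "blue"},
    {"brand": "B"},
]
summary = facet_counts(results, ["brand", "colour"])
```

Here `summary` groups the matching documents by each hand-picked field, so a user interface could show the value counts next to the result list. Note that the facet fields are supplied by the caller, which is the manual step the method proposed in this paper aims to automate.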
