Improving the Compression Efficiency for News Web Service Using Semantic Relations Among Webpages

Xiao Wei (Shanghai University, Shanghai, China & Shanghai Institute of Technology, Shanghai, China & City University of Hong Kong, Hong Kong), Xiangfeng Luo (Shanghai University, Shanghai, China) and Qing Li (City University of Hong Kong, Hong Kong)
DOI: 10.4018/ijcini.2013040104

Abstract

Both compression and decompression play important roles in a web service system. A high compression ratio saves storage, while fast decompression reduces the response time of the service. Focusing on news web services, this paper proposes a compression mechanism that improves the efficiency of compression and decompression simultaneously by exploiting the semantic relations among webpages. First, webpages are clustered into news topics according to the semantic similarity relation among them. Webpages belonging to the same topic share much duplicate content, which improves the compression ratio under delta-compression. Second, associated news topics are detected with the help of a multiple-semantics link network of news topics. Associated topics are compressed into the same zip file, which may reduce the number of decompression operations given how users typically read news on the Web. The authors apply the proposed mechanism to a practical news search engine, and the experimental results show that it achieves both a high compression ratio and fast decompression.

Introduction

With the rapid growth of web data, compressing webpages before storing them is an efficient way for a web service to save storage. As a result, the data must be decompressed when it is served to users. In the compression mechanism of a web service, the compression ratio and the compression/decompression speed affect the service in different ways. Both the compression ratio and the decompression speed play important roles: a high compression ratio saves storage effectively, and fast decompression reduces the response time of the service. The compression speed, by contrast, affects the response time only slightly.

In most situations it is not easy to achieve a high compression ratio and fast decompression simultaneously (Fusco, 2012). In a sense, a higher compression ratio means that compression and decompression need more running time, which increases the response time and lowers the quality of service. Much work has been done to achieve both a high compression ratio and fast decompression (Fusco, 2012; Yang, 2010; Martin, 2007; Lindstrom, 2006). These methods work well for specific compression tasks, but they may not give the best compression performance in a web service system.

Because the news web service is a typical web application on the Internet, we select it as the application scenario for discussing the efficiency of the compression mechanism. There are two types of news web services: news websites and news search engines. News search engines gather and store massive numbers of news webpages from news websites, and compressing these webpages is one of their key issues. Among the news webpages indexed by a news search engine, many are duplicates because news websites reprint one another's articles, and many others are similar because news articles reference one another. Delta-compression can be a very efficient method for compressing a collection of similar documents (Ouyang, 2002). Both duplicate and similar webpages have much in common, which indicates that if similar news webpages are clustered into a collection, the collection can be delta-compressed with a high compression ratio.
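
The benefit of compressing a page against a similar one can be illustrated with a minimal sketch. It is not the mechanism proposed in this paper: it approximates delta-compression with zlib's preset-dictionary feature, using one page as the dictionary for a near-duplicate. The function names and the sample pages are hypothetical, and zlib only exploits roughly the last 32 KB of the dictionary.

```python
import zlib


def delta_compress(reference: bytes, target: bytes) -> bytes:
    """Compress target, letting zlib reuse strings found in reference."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(target) + comp.flush()


def delta_decompress(reference: bytes, delta: bytes) -> bytes:
    """Rebuild target; the same reference page must be supplied again."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()


# Hypothetical news page and a reprint that adds one attribution line.
reference = (
    "<html><body><h1>Storm closes coastal highway</h1>"
    "<p>Officials closed the coastal highway on Monday after heavy rain "
    "flooded two tunnels.</p>"
    "<p>Ferry services were suspended and several schools sent pupils "
    "home early.</p>"
    "<p>Forecasters expect the wind to ease by Wednesday morning.</p>"
    "</body></html>"
).encode("utf-8")
reprint = reference.replace(
    b"</body>", b"<p>Source: reprinted from a partner site.</p></body>"
)

standalone = zlib.compress(reprint, 9)       # page compressed on its own
delta = delta_compress(reference, reprint)   # page compressed against reference

print("standalone bytes:", len(standalone))
print("delta bytes:", len(delta))
```

The reprint can only be restored when the reference page is available, which is one reason for keeping similar pages together in the same collection.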

Another question is how to improve the speed of decompression. Besides the decompression algorithm itself, this can be addressed by exploiting the characteristics of the service or the strategies of the system. For example, caching is an efficient way to decrease the response time: it holds data in memory or on disk and so reduces the number of times the data is actually processed (including decompressed) (Chan, 1999). Caching itself, however, does not change the decompression process.

The content organization strategy of the zip file affects the speed of decompression (Brisaboa, 2008; Büttcher, 2007). The data a user wants may be stored in a single zip file or spread across many zip files. More zip files mean more decompression operations and more running time. Therefore, organizing the content of each zip file properly is a feasible way to reduce the number of decompression operations.
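
The effect of archive layout can be sketched as follows. This is an illustration under stated assumptions, not the system described in this paper: the page names, archive names, and the `build_archive` and `read_pages` helpers are hypothetical, and the number of archives opened is used as a rough proxy for the number of decompression operations.

```python
import io
import zipfile


def build_archive(pages):
    """Pack a {filename: html} mapping into one in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, html in pages.items():
            zf.writestr(name, html)
    return buf.getvalue()


def read_pages(archives, index, wanted):
    """Fetch the wanted pages; also return how many archives were opened."""
    by_archive = {}
    for name in wanted:
        by_archive.setdefault(index[name], []).append(name)
    pages = {}
    for archive_id, names in by_archive.items():
        with zipfile.ZipFile(io.BytesIO(archives[archive_id])) as zf:
            for name in names:
                pages[name] = zf.read(name).decode("utf-8")
    return pages, len(by_archive)


# Hypothetical data: two associated topics with one page each.
quake = {"quake-1.html": "<p>Earthquake strikes the region.</p>"}
relief = {"relief-1.html": "<p>Relief teams arrive after the earthquake.</p>"}
wanted = ["quake-1.html", "relief-1.html"]

# Layout A: one archive per topic.
archives_a = {"quake.zip": build_archive(quake),
              "relief.zip": build_archive(relief)}
index_a = {"quake-1.html": "quake.zip", "relief-1.html": "relief.zip"}
pages_a, opens_a = read_pages(archives_a, index_a, wanted)

# Layout B: the associated topics share one archive.
archives_b = {"assoc.zip": build_archive({**quake, **relief})}
index_b = {name: "assoc.zip" for name in wanted}
pages_b, opens_b = read_pages(archives_b, index_b, wanted)

print("archives opened:", opens_a, "vs", opens_b)
```

For a reader who moves from the first topic to the associated one, layout A opens two archives while layout B opens one and returns the same pages.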
