The Evolution of the (Hidden) Web and Its Hidden Data

Manuel Álvarez Díaz, Víctor Manuel Prieto Álvarez, Fidel Cacheda Seijo
Copyright: © 2018 | Pages: 30
DOI: 10.4018/978-1-5225-3163-0.ch006

Abstract

This paper presents an analysis of the most important features of the Web, their evolution, and their implications for the tools that traverse the Web to index its content for later search. Notably, some of these features cause a fairly large subset of the Web to remain “hidden”. The analysis focuses on a snapshot of the Global Web for each of six years, 2009 to 2014. The results for each year are analyzed both independently and together, to facilitate the study of the features at any given time and of the changes between the analyzed years. The objective of the analysis is twofold: to characterize the Web and, more importantly, its evolution over time.

Introduction

Since its origins, the WWW has been the subject of numerous studies, and one constant among them has been, and continues to be, the analysis of its size. Although it is nearly impossible to compute the exact size of the Web, because it is constantly changing, there is broad agreement that its size is on the order of billions of documents or pages (Gulli & Signorini, 2005). In this sense, the WWW can be considered the largest repository of documents ever built.

Due to the large size of the Web, search engines are essential tools for users who want to access relevant information on a specific topic. Search engines are complex systems that, among other things, gather, store, and manage information and grant access to it. Crawling systems are the components that perform the task of gathering information. These programs traverse and analyse the Web in a certain order by following the links between pages.
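The link-following traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawling system used in this study: the breadth-first visiting order, the `crawl` and `LinkExtractor` names, the `fetch` callback, and the toy `PAGES` dictionary standing in for the Web are all hypothetical choices made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal starting at `seed`.

    `fetch` is a caller-supplied function (hypothetical interface) that
    maps a URL to its HTML text, or to None if the page is unavailable.
    Returns the URLs in the order they were visited.
    """
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.org/b": '<a href="/">home</a>',
    "http://example.org/c": "no links here",
}

order = crawl("http://example.org/", PAGES.get)
print(order)
```

A production crawler would also need politeness delays, robots.txt handling, and duplicate-content detection, none of which are modelled here.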

The task of a crawling system presents numerous challenges due to the quantity, variability and quality of the information that it needs to collect. Among these challenges, specific aspects can be highlighted, such as the technologies used in web pages to access data, both on the server side (Raghavan & Garcia-Molina, 2001) and on the client side (Bergman, 2001), or problems associated with web content, such as Web Spam (Gyongyi & Garcia-Molina, 2005) or repeated content (Kumar & Govindarajulu, 2009). A complete enumeration of these challenges requires a more detailed analysis of the Web.

This article presents an analysis of the most important features of the Web and its components, and of their evolution over a period of time. Particular emphasis is placed on the use of client-side and server-side technologies. It is important to note that the Hidden Web is “hidden” only because some of the technologies used in web documents hinder crawling systems from accessing it.
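To make the role of client-side technologies concrete, the sketch below separates links that a crawler can read directly from the HTML from links whose target depends on script execution. It is only a rough heuristic written for this illustration, not the detection method used in the study: the `LinkClassifier` class, the two rules it applies (a `javascript:` href or an `onclick` attribute), and the `SAMPLE` markup with its function names are all assumptions.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Splits <a> tags into statically readable and script-dependent links."""

    def __init__(self):
        super().__init__()
        self.static_links = []   # target is visible in the href attribute
        self.script_links = []   # target is only known after running script

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        onclick = attrs.get("onclick")
        if href.lower().startswith("javascript:") or onclick is not None:
            # Record the script fragment a crawler would have to execute.
            self.script_links.append(onclick or href)
        elif href:
            self.static_links.append(href)


# Hypothetical page fragment mixing both kinds of link.
SAMPLE = """
<a href="/about.html">About</a>
<a href="javascript:loadPage('products')">Products</a>
<a href="#" onclick="openCatalog(3)">Catalog</a>
"""

classifier = LinkClassifier()
classifier.feed(SAMPLE)
print("static:", classifier.static_links)
print("script-dependent:", classifier.script_links)
```

A crawler that only parses markup would follow the first link and miss the pages behind the other two, which is one way content ends up in the Hidden Web.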

The analysis focuses on a snapshot of the Global Web for each of six years, from 2009 to 2014. The results for each year are analysed both independently and together, to simplify the evaluation of the features at any given time and of the changes between the analysed years. The objectives of the analysis are twofold: to characterize the Web and, more importantly, its evolution over time, and to analyse how its changes affect tools such as crawlers and search engines. To this end, changing trends are presented and explained.

The structure of this paper is as follows. The Background section introduces work related to the study and characterization of the Web. The Methodology section describes the method followed to characterize the Web. The Dataset section explains the dataset used. The Analysis section discusses the results obtained for each year and their evolution over time. Finally, the Future Research Directions section outlines possible future work, and the Conclusions section summarises the results of the paper.
