Measuring and Mapping the World Wide Web through Web Hyperlinks

Mario A. Maggioni (Università Cattolica, Italy), Mike Thelwall (University of Wolverhampton, UK) and Teodora Erika Uberti (Università Cattolica, Italy)
DOI: 10.4018/978-1-60566-014-1.ch121
The Internet is one of the newest and most powerful media that enable the transmission of digital information and communication across the world, although there is still a digital divide between and within countries in its availability, access, and use. To a certain extent, the level and rate of Web diffusion reflect its nature as a complex structure subject to positive network externalities and to an exponential number of potential interactions among individuals using the Internet. In addition, the Web is a network that evolves dynamically over time, and hence it is important to define its nature, its main characteristics, and its potential.
The Internet And The Web

In order to investigate the nature of the Web, it is essential to distinguish between the physical infrastructure (which we will call the “Internet”) and the World Wide Web (commonly abbreviated as WWW). The Internet is a series of connected networks, each of which is composed of a set of Internet hosts and computers connected via cables, satellites, and so forth. The Web is one of the services hosted by the Internet; e-mail is another popular one. The Web is a collection of Web pages and Web sites, many interconnected through hyperlinks, enabling information and communication to “flow” from one computer to another. Therefore, the Internet is the physical infrastructure reflecting the technical capability of a given geographical area (i.e., a country, a region, or a city) to enable effective and efficient exchanges of digital information, while the Web is a virtual space reflecting the ability to create and export digital information. Of course, the latter would not exist without the former (Berners-Lee & Fischetti, 1999).

While the Internet has a relatively stable infrastructure (because the investments to implement and maintain it are large and costly and few key organizations are involved: mostly corporate, governmental, or nongovernmental organizations), the Web changes very rapidly over time (because it is relatively cheap and easy to create and maintain a Web site and the number of people involved is huge). It is therefore difficult to give a precise and up-to-date description of the Web. For technical reasons, it also defies precise description (Thelwall, 2002).

The most common indicator of Internet diffusion is the number of Internet domain names, which should indicate the ability of a given geographical area to create digital content and support the exchange of information. Unfortunately, this indicator is ambiguous and its measurement does not entirely capture the actual diffusion of the Web. First, most generic top level domains (gTLDs), which accounted for almost 67% of the total domains in January 2004 and 56% in January 2007, do not reflect any specific geographical location. Second, some country code top level domains (ccTLDs), even if nominally geo-located, display a mismatch between the official location of the TLD and the actual source of digital information. For example, the .tv domain (the country code for Tuvalu) is widely used by television companies internationally because it reads as an abbreviation of “television”, and hence most .tv Web sites are not related to the owning country. Similarly, .nu (the country code for Niue) is quite common among commercial sites playing on the phonetic similarity between “nu” and “new”, but not necessarily because Niue inhabitants create digital content for the Web. Third, even if considered jointly with other technological and economic indicators (e.g., the number of computers or telephone lines), the number of Internet domains may capture a large share of the Internet infrastructure, but it does not reveal digital information flows.
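The ambiguity of domain counts can be made concrete with a small sketch. The Python snippet below classifies a list of host names by top-level domain and reports the share of each class. The host names are invented for illustration, and treating every two-letter TLD as a ccTLD is a simplifying assumption rather than a lookup against an official registry.

```python
from collections import Counter

# gTLDs named in this chapter's key terms; none of them carries a location.
GENERIC_TLDS = {
    "aero", "biz", "com", "coop", "info", "int", "jobs", "mobi",
    "museum", "name", "net", "org", "pro", "tel", "travel",
    "edu", "mil", "gov", "asia", "cat",  # restricted gTLDs
}


def classify_tld(host: str) -> str:
    """Return 'gTLD', 'ccTLD', or 'other' for a host name."""
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in GENERIC_TLDS:
        return "gTLD"
    # Assumption: any two-letter alphabetic TLD is a country code.
    if len(tld) == 2 and tld.isalpha():
        return "ccTLD"
    return "other"


def tld_shares(hosts):
    """Return the fraction of hosts that falls into each TLD class."""
    counts = Counter(classify_tld(h) for h in hosts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: n / total for kind, n in counts.items()}


# Hypothetical sample, not real measurement data.
sample_hosts = [
    "example.com", "example.org", "example.net",
    "news.example.tv", "shop.example.nu", "www.example.it",
]
shares = tld_shares(sample_hosts)
print(shares)
```

In this sample the .tv and .nu hosts are counted as ccTLDs even though, as noted above, such sites are often unrelated to Tuvalu or Niue. A count of this kind therefore says little about where digital information actually originates.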

Hence, it is crucial to use suitable indicators to map the infrastructure of flows of digital information across the Web. The number of Web pages and sites reflects the amount of information available on the Web, but not the structure of digital information flows, the ability to create digital content and to attract e-attention, or the crucial issue of the quality of information.

Key Terms

gTLD: Generic Top Level Domains are TLDs reserved regardless of geography. At the present time, there are the following gTLDs: .aero, .biz, .com, .coop, .info, .int, .jobs, .mobi, .museum, .name, .net, .org, .pro, .tel, and .travel. There are also restricted gTLDs: .edu, .mil, and .gov are reserved for United States educational, military, and governmental institutions or organizations; .asia is restricted to the Pan-Asia and Asia Pacific community; and .cat is restricted to the Catalan linguistic and cultural community.

Hyperlink: An active link placed in a Web page that allows the net surfer to jump directly from this Web page to another and retrieve information. This dynamic and nonhierarchical idea of linking information was first introduced by Tim Berners-Lee to manage scientific information within a complex and continuously changing environment like CERN. Hyperlinks are directional: outgoing links leave a Web page and incoming links target a Web page.

ccTLD: Country Code Top Level Domain is the TLD associated with a country and corresponds to its ISO 3166 code. Unlike gTLDs, these domains are exclusive to countries.

Web Impact Factor: Similar to the impact factor calculated in bibliometrics, it is a measure of the influence of a site across the entire Web calculated according to the number of links from other sites.
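
As a rough illustration of this definition, the sketch below counts the links each site receives from other sites in a small, made-up list of directed links. Dividing that count by the number of pages in the target site is one common way to normalize the measure. That normalization, the site names, and the page counts are assumptions introduced here, not details given in this entry.

```python
from collections import defaultdict


def external_inlinks(links):
    """Count the links each site receives from *other* sites.

    `links` is an iterable of (source_site, target_site) pairs;
    the order of each pair encodes the direction of the hyperlink.
    """
    counts = defaultdict(int)
    for source, target in links:
        if source != target:  # links within one site are ignored
            counts[target] += 1
    return dict(counts)


def web_impact_factor(links, page_counts):
    """Inlinks from other sites divided by the site's page count.

    Assumption: this ratio is one common formulation, used here
    only as an example; sites with zero pages are skipped.
    """
    inlinks = external_inlinks(links)
    return {
        site: inlinks.get(site, 0) / pages
        for site, pages in page_counts.items()
        if pages > 0
    }


# Hypothetical link data: (source site, target site).
links = [
    ("a.example", "b.example"),
    ("c.example", "b.example"),
    ("b.example", "b.example"),  # self-link, not counted
    ("b.example", "a.example"),
]
page_counts = {"a.example": 10, "b.example": 4, "c.example": 5}

wif = web_impact_factor(links, page_counts)
print(wif)
```

Because hyperlinks are directional, the link from b.example to a.example counts toward a.example only, and c.example, which receives no links, scores zero.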

Digital Divide: This term was first introduced by the Clinton Administration in 1999, in an analysis of the diffusion of computers and the Internet among Americans. Some surveys emphasized the separation between information “haves” and “have nots” within ethnic groups and urban/rural populations. Later, this concept was extended worldwide, distinguishing between countries with extensive ICT and easy access to information and countries that have limited ICT facilities and difficult access conditions. Nowadays, the term digital divide refers to differences in the availability, access, and use of new technologies across and within countries.

Domain Name: Typically, the part of the address of a Web site after the http:// and before the final slash. There are three main types of top-level domain (TLD), which form the ending part of each Web address: generic top-level domains (gTLDs), country code top-level domains (ccTLDs), and infrastructure top-level domains.
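
The relation between a Web address, its domain name, and its TLD can be sketched with Python's standard `urllib.parse` module. The URL below is a placeholder chosen for illustration, and the helper name `domain_and_tld` is hypothetical.

```python
from urllib.parse import urlsplit


def domain_and_tld(url: str):
    """Return (domain name, top-level domain) for a full Web address.

    The domain name is taken as the host part of the URL, i.e. what
    follows "http://" and precedes the path; the TLD is the label
    after the last dot of that host.
    """
    host = urlsplit(url).hostname  # lower-cased, without port or path
    if not host:
        raise ValueError(f"no host found in {url!r}")
    tld = host.rsplit(".", 1)[-1]
    return host, tld


# Placeholder address, used only as an example.
host, tld = domain_and_tld("http://www.example.org/path/page.html")
print(host, tld)
```

A sketch like this only splits the address into its parts. Deciding whether the resulting TLD is generic, country code, or infrastructure still requires a reference list of TLDs.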

