During the past decade, the World Wide Web became the most popular network in the World. WWW grows with a very fast speed, thus the information that can be found through it is huge. In the early 90s, the first search engines for the WWW appeared. The user could give some keywords and the system returned a number of URLs (uniform resource locators) that contained the keywords. The order of the URLs in the return list was initially based on the number of the keyword occurrences in each URL. Some more sophisticated systems were taking into account the importance and the frequency of the keywords. As WWW was growing, a simple keyword search could match hundreds of thousands of pages. A human can only check the first twenty or even some more of the URLs that the search engine returns. Consequently, the ordering of the search results became very important. The most important URLs that are related with the search keywords should be ranked first. The link analysis ranking (LAR) is an objective way to sort search results. There are many advantages of the LAR over older methods. First of all the ranking is feasible without getting any feedback from the users. It is also not necessary to store the content of the URLs, but only the links. Another advantage is that it is difficult for the site developers to cheat by repeating keywords in the documents and moreover it may be pre-computed for all URLs. There are even more benefits using LAR to sort the search results that make it the best method used so far.
Key Terms in this Chapter
HTTP: hypertext transfer protocol. A network protocol used to fetch and/or transmit files over the network. This is the most commonly used protocol in the Web.
Web Crawler: Also known as “spider,” “robot,” “Webbot,” or “bot.” It is a program that fetches Web pages to a server in order to be indexed probably by another program. It usally “crawls” the Web by following links, but it is also an option to find the URLs that should be fetched in a database.
Citation Graph: A dirrected and usually unweighted graph G=(V,E) that corresponts to a set of publications. Each vertex of the graph corresponts to a publication. Each edge i?j of G means that publication i cites publication j.
Authority: A Web-page with “good” content. Following Kleinbergs definition, a Web page is an authority if there “good” HUBS pointing to it.
Adjacency Matrix: A matrix that corresponts to a directed graph G=(V,E). Each element of an adjacency matrix A[i,j] has the value of 1 if there is a link from node i to node j and zero otherwise.
Web graph: A dirrected and usually unweighted graph that corresponts to a part of the Web. Each vertex of the graph corresponts to a Web page. Each edge of the graph corresponts to a Web link.
Social Network: A social network is a set of people or groups of people with some pattern of contacts or interactions between them. Nowadays, the meaning of the term is very broad covering also other types of networks, like technological (Web, Internet, power grid), biological (gene, metabolic, food networks), whose nodes exhibit certain interactions between them.
URL: Uniform resource locator. A string representing the Web location of an object. It consists of tree parts. (a) the protocol--usually http or https, (b) the server, and (c) the full path of the file containnnign the object.
HTTPS: The hypertext tranfer protocol secure. The HTTP over a secure layer, usally the SSL (secure sockets layer).
Hub: A Web-page is a HUB if it has links to authorities.