Documents available on the Web are extremely heterogeneous in several respects: they are written in many languages and use different formats to represent their contents, and they also differ in external factors such as source reputation and refresh frequency (Page & Brin, 1998). Together, these factors increase the complexity of Web information retrieval systems. In essence, the traditional search engines available on the Web today retrieve documents that contain the keywords supplied by users. Nevertheless, many searches call for more sophisticated analysis; for example, the temporal or spatial context of a query might be taken into account. In a keyword-based search engine, for instance, a Web page containing the phrase “…due to the company arrival in London, a thousand java programming jobs will be open…” would not be found if the submitted search was “jobs programming England,” unless the word “England” appeared elsewhere on the page. The reason is that the term “London” is treated as just another word, without regard to its geographical meaning. A spatial search engine would be expected to return the page in this example, because the system would hold information indicating that the term “London” refers to a city located in the country referred to by the term “England.” A traditional search engine could produce this result only if the user repeatedly submitted searches for all possible sub-regions of England (e.g., its cities). As the example suggests, for many user searches the most interesting results are those related to particular geographical regions.
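The London/England example can be illustrated with a minimal sketch. It assumes a hypothetical toy gazetteer (`PARENT`) that records, for each place name, the region containing it, and it restricts place names to single words for simplicity. The function names are illustrative only and do not come from any particular search engine.

```python
import re

# Hypothetical toy gazetteer: each place maps to the region that contains it.
# Single-word names only; a real gazetteer would be far richer.
PARENT = {
    "london": "england",
    "manchester": "england",
    "england": None,
}

def ancestors(place):
    """Return the regions that contain `place`, nearest first."""
    chain = []
    parent = PARENT.get(place)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_match(query, page):
    """Plain keyword search: every query term must occur in the page."""
    page_terms = set(tokenize(page))
    return all(term in page_terms for term in tokenize(query))

def spatial_match(query, page):
    """Keyword search in which each place named in the page also
    counts as a mention of the regions that contain it."""
    page_terms = set(tokenize(page))
    expanded = set(page_terms)
    for term in page_terms:
        expanded.update(ancestors(term))
    return all(term in expanded for term in tokenize(query))

page = ("due to the company arrival in London, "
        "a thousand java programming jobs will be open")
query = "jobs programming England"

print(keyword_match(query, page))  # the page never mentions "England"
print(spatial_match(query, page))  # "London" is expanded to "England"
```

The sketch expands the page rather than the query, so the user does not have to enumerate every sub-region of England. It ignores ambiguity entirely, which a real system cannot do.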
A variety of feature extraction and automatic document classification techniques have been proposed. However, acquiring the geographical features of a Web page involves some particular difficulties, such as ambiguity (e.g., many places with the same name, several names for a single place, non-geographical things that carry place names). Moreover, a Web page can refer to a place that contains, or is contained by, the one given in the user query, which means the system must know the topology of the regions it handles. Many features related to geographical context can be incorporated into the relevance ranking of returned documents. For example, a document can be considered more relevant than another if its content refers to a place closer to the user’s location. Ranking in spatial search engines nonetheless raises more complex issues because of this added spatial dimension. Jones, Alani, and Tudhope (2001) propose combining the Euclidean distance between place centroids with hierarchical distances to produce a hybrid spatial distance that can be used in the relevance ranking of returned documents. Other important issues are indexing mechanisms and query processing. In general, solutions in this area try to combine well-known textual indexing techniques (e.g., inverted files) with spatial indexing mechanisms. The user interface of a spatial search engine is also more complex, because users need to choose regions of interest and possible spatial relationships in addition to keywords. To visualize the results, it is helpful to use digital maps alongside textual information.
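The idea of a hybrid spatial distance can be sketched as follows. This is not the formula of Jones, Alani, and Tudhope (2001), which is not reproduced here. It is only an assumed weighted sum of a normalized Euclidean distance between centroids and a normalized tree-path distance in a region hierarchy. The place data, weights, and normalization constants are all hypothetical.

```python
import math

# Hypothetical data: place -> (parent region, centroid in arbitrary planar units).
PLACES = {
    "uk": (None, (0.0, 0.0)),
    "england": ("uk", (0.0, -1.0)),
    "scotland": ("uk", (-1.0, 3.0)),
    "london": ("england", (1.0, -2.0)),
    "manchester": ("england", (-1.0, 0.0)),
    "glasgow": ("scotland", (-2.0, 3.0)),
}

def path_to_root(place):
    """Return the list [place, parent, grandparent, ..., root]."""
    path = [place]
    while PLACES[path[-1]][0] is not None:
        path.append(PLACES[path[-1]][0])
    return path

def hierarchical_distance(a, b):
    """Number of edges on the hierarchy path between a and b
    (an assumed definition, via their lowest common ancestor)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {place: i for i, place in enumerate(path_b)}
    for i, place in enumerate(path_a):
        if place in depth_in_b:
            return i + depth_in_b[place]
    raise ValueError("places do not share a common ancestor")

def euclidean_distance(a, b):
    """Straight-line distance between the two place centroids."""
    return math.dist(PLACES[a][1], PLACES[b][1])

def hybrid_distance(a, b, w_euclid=0.5, w_hier=0.5,
                    max_euclid=10.0, max_hier=4.0):
    """Weighted sum of both distances, each scaled into [0, 1].
    Weights and scaling constants are illustrative assumptions."""
    e = min(euclidean_distance(a, b) / max_euclid, 1.0)
    h = min(hierarchical_distance(a, b) / max_hier, 1.0)
    return w_euclid * e + w_hier * h

# Places in the same region score as closer than places in different regions.
print(hybrid_distance("london", "manchester"))
print(hybrid_distance("london", "glasgow"))
```

A ranking function could then prefer documents whose places have a smaller hybrid distance to the place named in the query.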
Numerous contributions have been made in the information retrieval (IR) area since the 1960s. Nevertheless, owing to the continuous growth of the Web, research in this field is still in its infancy.
Baeza and Ribeiro (1999) note that IR poses several challenges, such as how to determine the user’s real needs and how to meet them with a subset of relevant documents. IR systems must analyze both the semantics and the syntax of document contents, and this analysis may yield imprecise results. According to Kowalsky (1997), the aim of an IR system is to minimize the overhead of finding the needed information. Classical IR models consider that a document is described by a set of index terms. Some of these models also take into account that terms differ in importance within the same document. This importance is called the weight (w) and is represented by a numeric value. The best-known classical models are the Boolean, probabilistic, and vector models. The vector model, proposed by Gerard Salton, has gained the widest acceptance among researchers and is the most widely used in current IR applications.
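To make the notion of term weights concrete, the following sketch ranks documents under a vector model. The text does not say how the weight w is computed, so the sketch assumes the common tf-idf scheme (term frequency multiplied by the logarithm of the inverse document frequency) and cosine similarity between query and document vectors. The tiny document collection and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors ({term: weight})."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Rank documents (lists of tokens) against a query (list of tokens).
    Returns (score, document index) pairs, best match first."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    def weigh(tokens):
        # Assumed weighting: w = tf * log(n / df); unseen terms are skipped.
        tf = Counter(tokens)
        return {t: tf[t] * math.log(n / df[t]) for t in tf if df[t] > 0}

    q = weigh(query)
    scores = [(cosine(q, weigh(doc)), i) for i, doc in enumerate(docs)]
    return sorted(scores, reverse=True)

docs = [
    ["java", "programming", "jobs", "london"],
    ["java", "coffee", "beans"],
    ["london", "travel", "guide"],
]
query = ["programming", "jobs"]

for score, index in rank(query, docs):
    print(index, round(score, 3))
```

Under this weighting, a term that occurs in every document receives weight zero, which reflects the intuition that such a term does not help distinguish one document from another.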
Key Terms in this Chapter
Geographical Information System: Software for geoprocessing that is able to store and manipulate spatial data.
Gazetteer: A dictionary that translates a set of spatial coordinates to a place name and vice versa.
Indexing: A data structure technique used to speed up querying in large datasets.
Crawler: Also known as robot or spider, it is a module of a search engine that is responsible for visiting Web sites and extracting their content to be further indexed by the search engine.
Multimodal Interface: User-centered interface in which the computer may process more than one mode of communication.
Geocoding: A technique for assigning spatial coordinates to places.
Search Engine: Software that enables one to find documents on the Web according to a user query.