Metadata for Search Engines: What can be learned from e-Sciences?

Magali Roux (Laboratoire d’Informatique de Paris VI, France)
DOI: 10.4018/978-1-4666-0330-1.ch003

Abstract

E-sciences are data-intensive sciences that make extensive use of the Web to share, collect, and process data. In this context, primary scientific data raises a new challenge, as data must be extensively described (1) to account for the empirical conditions and results that allow interpretation and/or analysis and (2) to be understandable by the computers used for data storage and information retrieval. In this respect, metadata is a focal point, whether considered from the point of view of the user, who needs to visualize and exploit data, or from that of the search tools that find and retrieve information. Numerous disciplines are concerned with the issues of describing complex observations and addressing pertinent knowledge. In this paper, similarities and differences in data description and exploration strategies among disciplines in e-sciences are examined.

Introduction

The starting point of information retrieval is the definition of the types of data on which searches are conducted, as these features determine the architecture and strategies to be used.

Unstructured data consist of texts, pictures, videos, movies, etc., i.e., all documents that have no explicit organization and say nothing about their subdivisions. Such documents are difficult to exploit unless effective strategies overcome these limitations. Currently, keyword-based approaches are used to search unstructured data and to alleviate these difficulties. Document search engines aim at browsing the web by looking for specific terms in order to get at the structure and semantic content of documents. Crawler programs methodically traverse the web to recover relevant information. Index databases are then created so that results can be returned quickly when a query is submitted (Meng & He, 2009). Current search engines (such as Apple’s Spotlight and Google Desktop) rely upon such traditional indexing techniques, which require substantial resources.
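The indexing step described above can be illustrated with a minimal inverted index. This is only a sketch under stated assumptions: the documents, the function names (`tokenize`, `build_index`, `search`), and the rule that a document matches only if it contains every query term are all hypothetical choices made for illustration, not the method of any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # index.get avoids creating empty entries for unknown terms.
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical documents standing in for pages recovered by a crawler.
documents = {
    "doc1": "Metadata describes primary scientific data",
    "doc2": "Search engines index web documents",
    "doc3": "Scientific metadata helps search tools retrieve data",
}

index = build_index(documents)
print(search(index, "scientific metadata"))
print(search(index, "search"))
```

The index is built once and each query then costs only a few set lookups, which is why results come back quickly; the price is the storage and upkeep of the index itself.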

In contrast, structured data can be searched with sounder strategies. In databases, data is stored together with its structure and, consequently, retains its own semantics. For a given query, a database management system (DBMS) returns very specific data, whereas a search engine only provides links to data, ordered by estimated relevance. Nevertheless, such systems are not well suited to accessing large, complex data; in particular, relational database systems do not readily support multidimensional or hierarchical objects, and quite different data models are needed to facilitate data retrieval, modification, mathematical manipulation, and visualization.
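As a small illustration of how a DBMS returns specific data rather than ranked links, the sketch below uses Python's built-in `sqlite3` module. The `measurement` table, its columns, and its values are hypothetical and serve only as an example.

```python
import sqlite3

# In-memory database with a hypothetical table of measurements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurement (sample TEXT, temperature_c REAL, ph REAL)"
)
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [("s1", 25.0, 7.2), ("s2", 37.0, 6.8), ("s3", 37.0, 7.4)],
)

# The query names columns and conditions explicitly, so the answer is
# the exact set of matching rows, not a list ordered by relevance.
rows = conn.execute(
    "SELECT sample, ph FROM measurement "
    "WHERE temperature_c = ? AND ph > ? ORDER BY sample",
    (37.0, 7.0),
).fetchall()
print(rows)
conn.close()
```

Because the schema records what each column means, the conditions can be stated on the values themselves instead of on keywords.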

In practice, valuable information, and especially scientific information, is stored in multiple databases spread over multiple sites and is becoming more and more complex. Integrating large, heterogeneous data sources to gain new knowledge is quite a challenging problem. This is especially the case in scientific disciplines that produce and/or store data at many sites, as in multi-campus projects. Heterogeneity is both syntactic and semantic, and several approaches have been developed to provide the user with a unified view of the data (Hull, 1997). In the field of distributed databases, virtual integration is achieved by loosely coupling source data models into a federated schema that provides information sharing and exchange (Heimbigner & McLeod, 1985). To improve extensibility, a three-tier architecture was introduced that defines the concept of a mediator (Wiederhold, 1992). The upper layer represents the users and user interfaces; the middle layer contains the mediators, which provide uniform interfaces and query access to wrappers; and the lower layer contains the data sources and the wrappers, which translate both queries from the mediator level to the source databases and results returned from the source databases to the mediators. Examples of non-grid database federation systems include the Oracle 10g federated solution (Poggi & Ruzzi, 2004). Alternatively, data Grid technology allows the integration of distributed resources through standardized interfaces between infrastructures.
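The mediator and wrapper layers of the three-tier architecture can be sketched as follows. This is a deliberately simplified illustration: the class names (`DictWrapper`, `Mediator`), the two in-memory "sites", and their field names are hypothetical, and a real system would translate full query languages rather than single field/value lookups.

```python
class DictWrapper:
    """Wrapper for one source whose records use source-specific field names."""

    def __init__(self, records, field_map):
        self.records = records
        # Maps mediator-level field names to this source's field names.
        self.field_map = field_map
        self.reverse_map = {src: med for med, src in field_map.items()}

    def query(self, field, value):
        """Translate the query to the source, then translate results back."""
        source_field = self.field_map[field]
        return [
            {self.reverse_map[k]: v for k, v in rec.items() if k in self.reverse_map}
            for rec in self.records
            if rec.get(source_field) == value
        ]


class Mediator:
    """Middle layer: one uniform query interface over several wrappers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(field, value))
        return results


# Two hypothetical sources describing the same kind of data differently.
site_a = [
    {"organism": "E. coli", "gene": "lacZ"},
    {"organism": "S. cerevisiae", "gene": "GAL4"},
]
site_b = [{"species_name": "E. coli", "locus": "recA"}]

mediator = Mediator([
    DictWrapper(site_a, {"species": "organism", "gene": "gene"}),
    DictWrapper(site_b, {"species": "species_name", "gene": "locus"}),
])

# The user layer issues one query in the mediator's vocabulary.
results = mediator.query("species", "E. coli")
print(results)
```

Adding a new source only requires a new wrapper with its own field mapping; the mediator and the user-facing query stay unchanged, which is the extensibility argument for this design.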
