Article Preview
TopIntroduction
The need to uniformly access heterogeneous information resources triggered a huge amount of research in the area of information integration performed in last decades. Several industrial strength integration software tools are currently available. However, the query evaluation performance delivered by these tools is still far from practically acceptable for any complex queries and is often unpredictable. Common techniques for performance improvement, such as indexing, query optimization, and caching, are well understood and developed for traditional databases but should be revised and re-evaluated in the integrated distributed heterogeneous environment.
The focus of this research is on advanced semantic caching techniques which are expected to be especially good in a distributed heterogeneous environment. In contrast with usual caching techniques, in addition to the usage patterns, semantic caching takes into account relationships and inter dependencies between data items to improve cache effectiveness and extend the ability of local query processing of cached data instead of distributed query processing.
The goal of this research is to evaluate the potential impact of different approaches to semantic caching and obtain quantitative estimations of semantic caching suitability for XML query processing.
The caching is known as important technique for any system attempting to achieve high performance. The idea of any caching is to reduce the number of repetitive accesses to the data items located in relatively slow large storage. Instead, copies of data items are temporarily stored in a fast (and relatively small) memory.
In almost all implementations cache is completely transparent for the application or system that uses the cached data. That is, the presence of cache affects performance, but not the functionality of the application, and, moreover, the application cannot explicitly control the behavior of the cache.
In this research, the data items copied from the primary data source into a cache are called cache units. Depending on the nature of cache units, physical, logical, and semantic caching are distinguished (Figure 1).
Figure 1. Physical, logical, and semantic cache
In a common physical caching the cache units are pages, blocks, or other low-level data structures defined for a physical storage. Physical caching is hardly applicable for distributed environment because low level units are not related to the units transferred over a network, and page sizes may vary from site to site.
When a logical cache is used, high level objects, such as web pages, are cached instead of physical blocks. A potential disadvantage of a logical caching is that the cache unit is considered atomic and can be used only for exactly the same request.
The caching is called semantic caching if the cache units are meaningful items or sets in the high-level application data model. Thus, parts of a cache unit may be used to process queries other than original one, or data from several cache units may be used together to evaluate a new query.
The semantics of data in a semantic cache is usually represented with queries (Chidlovskii & Borgho, 2000; Dar, Franklin, Jonsson, Srivastava, & Tan, 1996; Qun Ren & Dunham, 2003). To find data in semantic cache one has to match queries instead of page addresses used in a physical cache.
The expectation is that the above mentioned should improve reuse of the cached data. On the other hand, the extraction of cached data becomes more complex and may result in certain performance penalties. Hence, our goal is to find circumstances and conditions under which the semantic caching is beneficial.
A significant amount of work in the area of semantic caching for different data models has been done by research community during the last two decades. It turns out that the major restricting factor for effective use of semantic caching is the query containment problem which is proven to be decidable only for limited class of queries and has polynomial complexity for much smaller class of queries.
To make the semantic caching efficient, one has to restrict the class of cached queries and use an approximate algorithm checking if a query can be evaluated based on the cache contents only. If an approximate algorithm can produce false positive answers, the use of cache may result in data loss in the query responses.