A Performance Analysis of Semantic Caching for XML Query Processing

A Performance Analysis of Semantic Caching for XML Query Processing

Boris Novikov (Department of Computer Science, Saint-Petersburg University, Saint-Petersburg, Russia), Alice Pigul (Department of Computer Science, Saint-Petersburg University, Saint-Petersburg, Russia) and Anna Yarygina (Department of Computer Science, Saint-Petersburg University, Saint-Petersburg, Russia)
Copyright: © 2013 |Pages: 21
DOI: 10.4018/ijkbo.2013100103


Caching is important for any system attempting to achieve high performance. The semantic caching is an approach trying to benefit from the certain knowledge of data semantics. The authors expect that this information might enable reuse of semantically close data rather than exactly equal to cached data in the traditional system. However, the major obstacle for extensive application of semantic caching for any data model or query language is the computational complexity of the query containment problem, which is, in general, undecidable. In this article the authors introduce and compare three approximate conservative query matching algorithms for semantic caching of semi-structured queries. The authors then analyze their applicability for distributed query processing. Based on this analysis, the authors outline few scenarios where semantic caching can be beneficial for query processing in a distributed system of heterogeneous semi-structured information resources.
Article Preview


The need to uniformly access heterogeneous information resources triggered a huge amount of research in the area of information integration performed in last decades. Several industrial strength integration software tools are currently available. However, the query evaluation performance delivered by these tools is still far from practically acceptable for any complex queries and is often unpredictable. Common techniques for performance improvement, such as indexing, query optimization, and caching, are well understood and developed for traditional databases but should be revised and re-evaluated in the integrated distributed heterogeneous environment.

The focus of this research is on advanced semantic caching techniques which are expected to be especially good in a distributed heterogeneous environment. In contrast with usual caching techniques, in addition to the usage patterns, semantic caching takes into account relationships and inter dependencies between data items to improve cache effectiveness and extend the ability of local query processing of cached data instead of distributed query processing.

The goal of this research is to evaluate the potential impact of different approaches to semantic caching and obtain quantitative estimations of semantic caching suitability for XML query processing.

The caching is known as important technique for any system attempting to achieve high performance. The idea of any caching is to reduce the number of repetitive accesses to the data items located in relatively slow large storage. Instead, copies of data items are temporarily stored in a fast (and relatively small) memory.

In almost all implementations cache is completely transparent for the application or system that uses the cached data. That is, the presence of cache affects performance, but not the functionality of the application, and, moreover, the application cannot explicitly control the behavior of the cache.

In this research, the data items copied from the primary data source into a cache are called cache units. Depending on the nature of cache units, physical, logical, and semantic caching are distinguished (Figure 1).

Figure 1.

Physical, logical, and semantic cache

In a common physical caching the cache units are pages, blocks, or other low-level data structures defined for a physical storage. Physical caching is hardly applicable for distributed environment because low level units are not related to the units transferred over a network, and page sizes may vary from site to site.

When a logical cache is used, high level objects, such as web pages, are cached instead of physical blocks. A potential disadvantage of a logical caching is that the cache unit is considered atomic and can be used only for exactly the same request.

The caching is called semantic caching if the cache units are meaningful items or sets in the high-level application data model. Thus, parts of a cache unit may be used to process queries other than original one, or data from several cache units may be used together to evaluate a new query.

The semantics of data in a semantic cache is usually represented with queries (Chidlovskii & Borgho, 2000; Dar, Franklin, Jonsson, Srivastava, & Tan, 1996; Qun Ren & Dunham, 2003). To find data in semantic cache one has to match queries instead of page addresses used in a physical cache.

The expectation is that the above mentioned should improve reuse of the cached data. On the other hand, the extraction of cached data becomes more complex and may result in certain performance penalties. Hence, our goal is to find circumstances and conditions under which the semantic caching is beneficial.

A significant amount of work in the area of semantic caching for different data models has been done by research community during the last two decades. It turns out that the major restricting factor for effective use of semantic caching is the query containment problem which is proven to be decidable only for limited class of queries and has polynomial complexity for much smaller class of queries.

To make the semantic caching efficient, one has to restrict the class of cached queries and use an approximate algorithm checking if a query can be evaluated based on the cache contents only. If an approximate algorithm can produce false positive answers, the use of cache may result in data loss in the query responses.

Complete Article List

Search this Journal:
Open Access Articles: Forthcoming
Volume 9: 4 Issues (2019): 1 Released, 3 Forthcoming
Volume 8: 4 Issues (2018)
Volume 7: 4 Issues (2017)
Volume 6: 4 Issues (2016)
Volume 5: 4 Issues (2015)
Volume 4: 4 Issues (2014)
Volume 3: 4 Issues (2013)
Volume 2: 4 Issues (2012)
Volume 1: 4 Issues (2011)
View Complete Journal Contents Listing