Temporal Join with Hilbert Curve Mapping and Adaptive Buffer Management

Temporal Join with Hilbert Curve Mapping and Adaptive Buffer Management

Jaime Raigoza (California State University, Chico, CA, USA) and Junping Sun (Graduate School of Computer and Information Sciences, Nova Southeastern University, Fort Lauderdale, FL, USA)
Copyright: © 2014 |Pages: 19
DOI: 10.4018/ijsi.2014040101

Abstract

Management of data with a time dimension increases the overhead of storage and query processing in large database applications especially with the join operation, which is a commonly used and expensive relational operator. The temporal join evaluation can be time consuming because temporal data are intrinsically multi-dimensional. Also, due to a limited buffer size, the long-lived data can be frequently swapped-in and swapped-out between disk and main memory thus resulting in a low cache hit ratio. The proposed index-based Hilbert-Temporal Join (Hilbert-TJ) join algorithm maps temporal data into Hilbert curve space that is inherently clustered, thus allowing for fast retrieval and storage. This paper also proposes the Adaptive Replacement Cache-Temporal Data (ARC-TD) buffer replacement policy which favors the cache retention of data pages in proportion to the average life span of the tuples in the buffer. By giving preference to tuples having long life spans, a higher cache hit ratio can be achieved. The caching priority is also balanced between recently and frequently accessed data. The comparison study consists of different join algorithms and buffer replacement policies. Additionally, the Hilbert-TJ algorithm offers support to both valid-time and transaction-time data.
Article Preview

Introduction

Management of temporal data is commonly performed in many scheduling, banking, medical, and scientific applications. Conventional or snapshot databases offer limited support in managing time-varying data since the data stored are considered valid at a given point in time, typically the present time. Data changes overwrite the current state thus; snapshot databases are insufficient in representing the time dimension (Tansel, Clifford, Gadia, Jajodia, Segev, & Snodgrass, 1993).

A temporal database keeps track of data with respect to time by implementing intrinsic time aspects. Specifically, in a temporal database the time interval data type, composed of a start time value and an end time value, is associated with data. The data can be time stamped with two concepts of time; a valid time interval represents when the data occurred in the modelled reality and a transaction time interval is the time period when data are stored in the database.

A common join operation executed on a temporal database is the time-equijoin (TE-join) that specifies that tuples join when values on the join attribute(s) from one table match the values from another table or the same table; also, their respective tuple time intervals intersect. Due to the additional time dimension, the time interval of a join operation in a temporal database can involve more complex query predicate(s) compared to a join in a snapshot database.

The GTE-join is defined as a generalized TE-join. A more formal definition is: suppose we are given temporal relations R, S, a key range kr and a time interval i, then a GTE-join finds all (r, s) where r ∈ R and s ∈ S such that: i) both, r.key ∈ kr and s.key ∈ kr; ii) r.key = s.key; iii) both, r.interval and s.interval intersect i, the time interval; and iv) r.interval intersects s.interval (Zhang, Tsotras, & Seeger, 2002). For example, given two temporal relations whose schemas are:R = (ID, Server_A_Login_Time, Server_A_Logout_Time)S = (ID, Server_B_Login_Time, Server_B_Logout_Time)such that each relation maintains a separate computer access log, a GTE-join operation can retrieve all users having an ID between 4000 and 5999 who accessed both computers last month. Due to the additional time dimension, the time interval of a join operation in a temporal database can involve more complex query predicate(s) compared to a join in a snapshot database. The need for efficient methods to store, retrieve, and process the data increases with the growth of past, present, and future state data (Enderle, Hampel, & Seidl, 2004).

Given the need to perform a GTE-join, the problem addressed by this proposed research is its efficient performance that depends on an indexed-based solution on clustered temporal data. Because temporal data are inherently multi-dimensional, building a temporal index that clusters temporal data with time intervals having a long duration is difficult. Temporal data is mapped on Hilbert curve space and managed using a B+ tree index.

The temporal join operation can also be improved by considering the buffer replacement policy. First, the Least Recently Used (LRU) buffer replacement policy, implemented with a double linked list, is commonly used because of its performance improvement and low overhead cost; the add and remove pointer operations incur an O(1) complexity (Silberschatz, Galvin, & Gagne, 2004). LRU supports the “recency” principle such that if a data location is retrieved by a process, the same or nearby data will probably need to be retrieved soon. However, a problem with the LRU policy occurs with a sequential scan: if the data size being read is larger than the buffer size, then the scan continually replaces everything in the buffer with data that are only needed once. This problem is known as the frequency problem and designs that overcome it are referred to as scan-resistant (Johnson & Shasha, 1994). An example of how the LRU policy could suffer drastically is stated by O’Neil, O’Neil, & Weikum (1993).

Complete Article List

Search this Journal:
Reset
Open Access Articles: Forthcoming
Volume 7: 4 Issues (2019): Forthcoming, Available for Pre-Order
Volume 6: 4 Issues (2018): 3 Released, 1 Forthcoming
Volume 5: 4 Issues (2017)
Volume 4: 4 Issues (2016)
Volume 3: 4 Issues (2015)
Volume 2: 4 Issues (2014)
Volume 1: 4 Issues (2013)
View Complete Journal Contents Listing