Introduction
In today’s world, real-time data availability for well-timed and well-informed decisions has become decisive for business success, while data volumes grow exponentially. The value of real-time business data diminishes as it ages. At the same time, traditional working hours are no longer relevant for global enterprises, which serve customers around the globe and around the clock (Golfarelli & Rizzi, 2009; Vassiliadis, 2009; Thomsen & Pedersen, 2005). For uninterrupted global services, continuous real-time data availability for timely business decisions and actions is indispensable. Traditional offline data refreshes at data warehouses (DWHs) via Extract-Transform-Load (ETL) processes in batch windows (Kimball & Caserta, 2011) are not sustainable in this scenario. Therefore, near-real-time data warehousing (NRT-DWH) is an evolving research area that plays a prominent role in supporting the cutting-edge business strategies and social requirements of the modern era. Modern warehousing techniques are transforming the traditional warehouse from a static data repository into an active business entity. This helps fulfill contemporary business needs, ranging from informing different stakeholders about the latest updates to making effective, timely, and accurate business decisions.
According to the demands of the DWH industry, there is a need for an efficient algorithm that performs the join operation on bursty, fast-arriving streaming data. In NRT-DWH, relational data generated by different data sources needs to be reflected in the DWH with minimal delay. Because data comes from numerous sources within the organization, it requires significant cleansing and transformation before being loaded into the DWH using SQL. The powerful features of SQL can thus be used to ensure consistency and ACID compliance for join queries over the relational schema (Irshad et al., 2019). ETL processes are used for this purpose (Kimball & Caserta, 2011; Bornea et al., 2011). Transformation of the extracted data (e.g., user sales data) from numerous sources is a crucial phase of ETL. In this phase, a stream of newly extracted data is joined with stored master data before being loaded into the DWH, as shown in Figure 1. Typically, a foreign key from the stream data is joined with the primary key in the master data (Naeem et al., 2012a; Mokbel et al., 2004; Dittrich et al., 2002). Since the join is between stream data and stored data, it is called a semi-stream join.
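The transformation-phase join described above can be sketched in a few lines of Python. This is a minimal illustration of the concept only: the record layout (sales records carrying a `product_id` foreign key) and the in-memory master table are illustrative assumptions, not part of any specific ETL tool or of the algorithms discussed in this paper.

```python
# Minimal sketch of a semi-stream join: each stream tuple carries a
# foreign key that is probed against the primary key of stored
# master data before the enriched record is loaded into the DWH.

# Stored master data, keyed by primary key (product_id) -- assumed layout.
master_data = {
    101: {"product": "laptop", "price": 1200},
    102: {"product": "phone", "price": 800},
}

def semi_stream_join(stream, master):
    """Join each incoming stream record with master data on its foreign key."""
    for record in stream:
        match = master.get(record["product_id"])  # FK -> PK lookup
        if match is not None:
            # Enrich the stream record with master attributes (transformation).
            yield {**record, **match}

# A small batch of incoming stream tuples (illustrative).
sales_stream = [
    {"sale_id": 1, "product_id": 101, "qty": 2},
    {"sale_id": 2, "product_id": 999, "qty": 1},  # no matching master record
]

joined = list(semi_stream_join(sales_stream, master_data))
```

In practice the master data is disk-resident and far too large to hold in memory, which is precisely what makes efficient semi-stream join algorithms such as MESHJOIN and its successors necessary.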
Figure 1. Illustration of the join during the transformation phase of ETL
The problem of joining streaming data with stored data was first introduced by Polyzotis et al. (2008), who presented a seminal algorithm called MESHJOIN (Mesh Join) as a solution. Later, various optimizations of MESHJOIN were proposed (Bornea et al., 2011; Naeem et al., 2012a; Naeem et al., 2010; Naeem et al., 2013; Du & Zou, 2013; Naeem et al., 2012b). Since the long-tail distribution is very common in sales data (Kleinberg, 2002), one of these algorithms, CACHEJOIN (Naeem et al., 2012a), was designed specifically for skewed streams: it caches the frequently matched records of the stored data. However, it executes its two phases, the stream-probing (SP) phase and the disk-probing (DP) phase, sequentially. Because of this sequential execution, stream records wait unnecessarily before being processed, so the algorithm cannot achieve optimal performance. Parallel execution of the SP and DP phases of CACHEJOIN could significantly speed up the joining process. Further details about the limitations of CACHEJOIN are presented later in the paper.