Introduction
The data warehouse (DW) continues to increase in importance as the core foundation of any Business Intelligence (BI) strategy. The DW and BI market reached $10.8 billion in 2011 and continues to be a top priority for CIOs (Demirkan & Delen, 2013). A data warehouse is a special type of centralized data storage facility in a distributed organizational information system which consolidates and integrates data from many different sources and presents it in an aggregate format to support decision making activities of middle or higher-level management personnel (Inmon & Hackathorn, 1994).
An Operational Data Store (ODS) is a crucial component of many DW architectures. It acts as an immediate staging area to store integrated data from different transaction systems prior to ETL (Extract, Transform and Load) processing on the centralized data warehouse (Sujitparapitaya et al., 2003). Data warehouses can be mission-critical enablers of organizational and inter-organizational strategic information systems such as Customer Relationship Management (CRM) (Cunningham et al., 2006). Other examples where a data warehouse can support the business strategy include Business Process Management and Supply Chain Management (Ariyachandra & Watson, 2010). The distributed nature of data warehousing architecture requires that any change in the source data at distributed locations in the network be propagated to the central DW via the ODS on a regular basis (Yang et al., 2011). The amount of traffic that update propagation adds to the network depends on the propagation method used. Propagation can be accomplished either in real time or after a time lag, which is typically the case with data warehousing (Doka et al., 2011; Inmon, 2000).
Though the delayed batch mode is likely to contribute less to overall network traffic, its usefulness is diminished because it can result in a temporary discrepancy of unknown size between the warehouse data and the data at the source nodes. This discrepancy may not be problematic, however, provided it remains within pre-specified and known limits. While the BI industry is moving towards real-time processing due to increased requirements for organizational speed and agility, the infrastructure required for real-time information using data streams and in-memory processing can be prohibitively expensive for many organizations. Hence, it is beneficial to look for ways to optimize the traditional delayed mode of data delivery.
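The trade-off between the two propagation modes can be illustrated with a minimal sketch. It is not taken from the article: the class name `BatchPropagator`, the parameter `max_pending`, and the choice to measure discrepancy as a count of unpropagated changes are all assumptions made for illustration, and the number of transfers is only a crude proxy for network traffic.

```python
from dataclasses import dataclass, field


@dataclass
class BatchPropagator:
    """Buffers source updates and sends them to the ODS in batches.

    Hypothetical illustration: `max_pending` plays the role of a
    pre-specified limit on how many source changes may be missing
    from the warehouse at any moment.
    """
    max_pending: int
    pending: list = field(default_factory=list)  # changes not yet propagated
    batches_sent: int = 0                        # transfers made so far

    def record_update(self, change):
        """Register one change at the source; flush when the limit is hit."""
        self.pending.append(change)
        if len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        """Send all pending changes as a single batch (one transfer)."""
        if self.pending:
            self.batches_sent += 1
            self.pending.clear()

    @property
    def discrepancy(self):
        """Number of source changes the warehouse has not yet received."""
        return len(self.pending)


# A limit of 1 approximates real-time propagation; 50 is a delayed batch.
real_time = BatchPropagator(max_pending=1)
delayed = BatchPropagator(max_pending=50)
for change_id in range(120):
    real_time.record_update(change_id)
    delayed.record_update(change_id)

print(real_time.batches_sent, real_time.discrepancy)  # 120 0
print(delayed.batches_sent, delayed.discrepancy)      # 2 20
```

Under these assumptions the delayed propagator makes far fewer transfers, and its discrepancy never exceeds the configured limit, which is the sense in which a bounded discrepancy may be acceptable.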
Most DW research tends to focus on optimizing server processing and storage once the data has already arrived in the DW (Cundius & Alt, 2013), but there seems to be a lack of research that accounts for network reliability and/or latency in the context of ODS and DW. Overall performance of a DW system can be impacted by overloaded nodes on the network that connects all the sources of DW data (Doka et al., 2011). Network reliability can be impacted by natural disasters such as the Great East Japan Earthquake of 2011. It can also be degraded by intentional actions such as a Denial-of-Service attack or by unintentional events such as faulty hardware, software, or configuration errors.
Network latency due to congestion on the Internet continues to be a problem. According to Cisco, the amount of data being transferred over the Internet (667 exabytes in 2013) is growing faster than the ability of the network infrastructure to carry that data (Demirkan & Delen, 2013). While newer networking technologies (like high-speed Metro Ethernet) can resolve many WAN congestion issues, high-bandwidth circuits are not available everywhere. Furthermore, there are very large differences in network reliability levels in developing countries (Chandra et al., 2012). It cannot be assumed that every ODS or data warehouse has data sources with high-speed network capabilities.