Developing a Data Lakehouse for a South African Government-Sector Training Authority: Implementing Quality Control for Incremental Extract-Load-Transform Pipelines in the Ingestion Layer

Priyanka Govender, Nalindren Naicker, Sulaiman Saleem Patel, Seena Joseph, Devraj Moonsamy, Ayotuyi Tosin Akinola, Lavanya Madamshetty, Thamotharan Prinavin Govender
DOI: 10.4018/978-1-6684-9716-6.ch006

Abstract

The Durban University of Technology is undertaking a project to develop a data lakehouse system for a South African government-sector training authority. This system is considered critical for enhancing the training authority's monitoring and evaluation capabilities and ensuring service delivery. Ensuring the quality of the data ingested into the lakehouse is essential, as poor data quality degrades the efficiency of the lakehouse solution. This chapter studies quality control for ingestion-layer pipelines in order to propose a data quality framework. The metrics considered for data quality were completeness, accuracy, integrity, correctness, and timeliness. The framework was evaluated by applying it to a sample semi-structured dataset to gauge its effectiveness. Recommendations for future work include expanded integration, such as incorporating data from more varied sources and implementing incremental data ingestion triggers.
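As an illustration of how such metrics might be scored over a semi-structured sample, the following is a minimal sketch using pandas; the learner records, field names, and validity rule are hypothetical and do not represent the chapter's actual framework or data.

```python
import pandas as pd

# Hypothetical semi-structured learner records (illustrative only).
records = [
    {"learner_id": "L001", "id_number": "9001015009087", "enrolled_at": "2024-01-15T08:30:00+00:00"},
    {"learner_id": "L002", "id_number": None, "enrolled_at": "2024-02-01T09:00:00+00:00"},
    {"learner_id": "L003", "id_number": "bad-id", "enrolled_at": None},
]
df = pd.json_normalize(records)

# Completeness: share of non-null values per field.
completeness = df.notna().mean()

# Accuracy proxy: id_number should be a 13-digit string (South African ID format).
valid_id = df["id_number"].astype("string").str.fullmatch(r"\d{13}")
accuracy = valid_id.fillna(False).mean()

# Timeliness proxy: share of records whose enrolment timestamp falls within the last year.
enrolled = pd.to_datetime(df["enrolled_at"], errors="coerce", utc=True)
timeliness = (enrolled > pd.Timestamp.now(tz="utc") - pd.Timedelta(days=365)).mean()

print("Completeness per field:\n", completeness)
print(f"Accuracy (valid ID numbers): {accuracy:.2%}")
print(f"Timeliness (enrolled within the last year): {timeliness:.2%}")
```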
Chapter Preview

Introduction

In the era of Big Data, where the abundance of information presents both opportunities and obstacles, proficiently managing and leveraging data has become crucial for organisational success. The extensive array of data types, encompassing structured, semi-structured, and unstructured formats, necessitates an advanced strategy for data manipulation (Azad et al., 2020). Central to this effort is the Extract, Load, Transform (ELT) framework, a flexible approach recognised for its effectiveness in navigating the intricacies of modern data environments (Singhal & Aggarwal, 2022). Nevertheless, as data pipelines grow in size and complexity, ensuring data quality (DQ) has become a central concern.
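As a minimal sketch of the ELT ordering described above, the example below loads raw records into a landing zone unchanged and applies the transformation afterwards, so the original payload remains available for reprocessing; the directory layout, file naming, and schema are assumptions made purely for illustration.

```python
import json
import pathlib

import pandas as pd

RAW_DIR = pathlib.Path("lake/raw/learners")  # hypothetical raw landing zone of the lakehouse

def extract_load(source_records: list[dict], batch_id: str) -> pathlib.Path:
    """Extract and Load: persist the source payload as-is, before any transformation."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    path = RAW_DIR / f"{batch_id}.json"
    path.write_text(json.dumps(source_records))
    return path

def transform(raw_path: pathlib.Path) -> pd.DataFrame:
    """Transform: runs after loading, so the raw file can be reprocessed at any time."""
    df = pd.json_normalize(json.loads(raw_path.read_text()))
    df["id_number"] = df["id_number"].astype("string").str.strip()
    return df

raw_path = extract_load(
    [{"learner_id": "L001", "id_number": " 9001015009087 "}], batch_id="2024-06-01"
)
print(transform(raw_path))
```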

Notably, the rise of big data created the need for the Data Lakehouse (DLH): Harby and Zulkernine (2022) noted that the big data era presented new challenges to Data Warehouses (DWs). The surge in heterogeneous data volumes resulting from digital transformation poses a challenge to conventional DW solutions within organisations (Čuš & Golec, 2022; Giebler et al., 2021). In addition, Barika et al. (2019) highlight that researchers are grappling with how to orchestrate, manage and execute big data workflows, given how distinct they are from traditional workflows. Moreover, once data has been transformed and loaded into the DW, the original, pre-transformation information is no longer preserved (Figueira, 2018). Nambiar and Mundra (2022) therefore highlight that conventional methods such as the Extract, Transform, Load (ETL) process are not efficient enough for these data management needs.

A Data Lake (DL), which functions as a unified storage and exploration system designed to handle vast quantities of diverse data, has gained prominence as the recommended approach for processing and storing heterogeneous data (Begoli et al., 2021). A separate study conducted by the Durban University of Technology (DUT) team highlights the importance of data governance, as it establishes control over data utilisation and decision-making by incorporating procedures, rules, organisational structures, and responsibilities (Mthembu et al., 2024). Without regular maintenance of a DL, managing data becomes a costly and challenging task. Inadequate implementation of data governance in DL systems may lead to a disorderly “Data Swamp,” compromising the value of information in knowledge management and analytical systems because the data becomes difficult to process and analyse.

Thus, the DLH emerged as a revolutionary idea combining the advantages of the DL with those of the DW, allowing organisations to manage, analyse and store massive amounts of data in a flexible and cost-effective manner. Harby and Zulkernine (2022) further mention that this architecture aims to accelerate effective knowledge extraction.

The aim of this study is to create a framework for data orchestration and ingestion in a DLH that ensures DQ. Aligned with this aim, the following objectives are outlined for the study:

Key Terms in this Chapter

Data Lakehouse: A unified data management architecture that enables organisations to store and analyse structured and unstructured data on a single platform, combining the benefits of Data Lakes and Data Warehouses.

Semi-Structured Data: Data that has an irregular, implicit structure and lacks a defined data model.

Data Ingestion: The process of collecting and importing information from a variety of sources into a storage repository, e.g. the Data Lakehouse, so that it can be made available for analysis.

Data Quality: The degree to which data are accurate, complete, timely, and consistent, providing reliable and meaningful information for decision making.

Data Orchestration: The coordination and management of different processing tasks, often facilitated through automated workflows, to ensure efficient data flows and integration into a system (see the sketch below).
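To make the Data Ingestion and Data Orchestration terms concrete, the following is a minimal, hypothetical sketch of an incremental ingestion step that keeps only records newer than a stored watermark and rejects batches failing a simple completeness gate; the paths, fields, threshold, and timestamp format (ISO-8601 with offsets) are assumptions and do not represent the chapter's implementation.

```python
import json
import pathlib
from datetime import datetime, timezone

WATERMARK_FILE = pathlib.Path("state/last_ingested_at.txt")  # hypothetical watermark store
LANDING_DIR = pathlib.Path("lake/raw/enrolments")            # hypothetical landing zone
REQUIRED_FIELDS = ("learner_id", "programme_code")           # illustrative completeness rule
COMPLETENESS_THRESHOLD = 0.95                                # illustrative quality gate

def read_watermark() -> datetime:
    """Timestamp of the most recently ingested record."""
    if WATERMARK_FILE.exists():
        return datetime.fromisoformat(WATERMARK_FILE.read_text().strip())
    return datetime.min.replace(tzinfo=timezone.utc)

def completeness(batch: list[dict]) -> float:
    """Fraction of records with every required field populated."""
    if not batch:
        return 1.0
    ok = sum(all(r.get(f) not in (None, "") for f in REQUIRED_FIELDS) for r in batch)
    return ok / len(batch)

def ingest_increment(source_records: list[dict]) -> int:
    """One orchestrated step: keep only new records, apply the quality gate, land the batch."""
    watermark = read_watermark()
    new = [r for r in source_records if datetime.fromisoformat(r["updated_at"]) > watermark]
    if not new:
        return 0
    score = completeness(new)
    if score < COMPLETENESS_THRESHOLD:
        raise ValueError(f"Batch rejected: completeness {score:.2%} is below the threshold")
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    batch_file = LANDING_DIR / f"batch_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json"
    batch_file.write_text(json.dumps(new))
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_FILE.write_text(max(r["updated_at"] for r in new))
    return len(new)

# Example: ingest a small batch of hypothetical records.
print(ingest_increment([
    {"learner_id": "L004", "programme_code": "P01", "updated_at": "2024-06-01T10:00:00+00:00"},
]))
```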
