Creating a Data Lakehouse for a South African Government-Sector Training Authority: Enforcing Quality Control for Incremental Extract-Load-Transform Pipelines

Dharmesh Dhabliya, Vivek Veeraiah, Sukhvinder Singh Dari, Jambi Ratna Raja Kumar, Ritika Dhabliya, Sabyasachi Pramanik, Ankur Gupta
Copyright: © 2024 | Pages: 22
DOI: 10.4018/979-8-3693-1582-8.ch004

Abstract

The Durban University of Technology is currently engaged in a project to create a data lakehouse system for a Training Authority in the South African Government sector. This system is crucial for improving the monitoring and evaluation capacities of the training authority and ensuring efficient service delivery. Ensuring the high quality of data fed into the lakehouse is essential, since low data quality negatively impacts the effectiveness of the lakehouse system. This chapter examines quality control methods for ingestion-layer pipelines in order to present a framework for ensuring data quality. The metrics considered for assessing data quality were completeness, accuracy, integrity, correctness, and timeliness. The efficiency of the framework was assessed by applying it to a sample semi-structured dataset. Suggestions for future development include integrating data from a wider range of sources and providing triggers for incremental data intake.
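To make the quality dimensions named above concrete, the following minimal Python sketch illustrates how completeness, accuracy (validity), and timeliness checks might be applied to a single semi-structured learner record. It is an illustration only, not the chapter's implementation; the field names, schema, and freshness threshold are assumptions.

from datetime import datetime, timezone

REQUIRED_FIELDS = ["learner_id", "programme", "enrolment_date"]  # assumed schema
MAX_STALENESS_DAYS = 30                                          # assumed freshness rule

def check_completeness(record: dict) -> bool:
    """Completeness: all required fields are present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)

def check_accuracy(record: dict) -> bool:
    """Accuracy/validity: the enrolment date parses and is not in the future."""
    try:
        enrolled = datetime.fromisoformat(record["enrolment_date"])
    except (KeyError, ValueError):
        return False
    return enrolled.date() <= datetime.now(timezone.utc).date()

def check_timeliness(record: dict) -> bool:
    """Timeliness: the record was updated within the assumed freshness window."""
    try:
        updated = datetime.fromisoformat(record["last_updated"])
    except (KeyError, ValueError):
        return False
    return (datetime.now(timezone.utc) - updated).days <= MAX_STALENESS_DAYS

# Hypothetical sample record used purely for demonstration.
sample = {
    "learner_id": "L-001",
    "programme": "Plumbing NQF4",
    "enrolment_date": "2024-02-01",
    "last_updated": "2024-02-15T08:30:00+00:00",
}
print(check_completeness(sample), check_accuracy(sample), check_timeliness(sample))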
Chapter Preview

Introduction

In South Africa, Sector Education and Training Authorities (SETAs) are government-established entities responsible for managing skills development and training in various sectors of the economy. These entities are referred to here as Government-Sector Training Authorities (GTAs), and they play a crucial role in the country's efforts to improve skills and training across many sectors. The Durban University of Technology (DUT) has partnered with a South African GTA to enhance the data management strategy the GTA uses for an ongoing project, while also assisting DUT students in developing supplementary skills. To strengthen the data management capabilities of the GTA, it was recognized that a comprehensive system was required to store data and generate reports automatically. Following this discussion, the DUT team suggested using Microsoft Azure services to establish a data warehousing solution. Additional context for the project is presented by Mthembu et al. (2024). This chapter focuses on establishing a Data Lakehouse for a Training Authority in the South African Government sector. One crucial aspect is investigating techniques to ensure the integrity of data as it moves through the system, particularly during the Incremental Extract-Load-Transform pipelines at the ingestion layer, which employ data orchestration. A simple sketch of such an incremental step follows.
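The sketch below uses a common "high-water mark" pattern for incremental ingestion: only rows modified since the last successful run are extracted and landed unchanged in a raw zone, with transformation deferred, in keeping with ELT. It is a hedged illustration, not the Azure-based pipeline described in the chapter; the table, column, database, and path names are hypothetical.

import json
import pathlib
import sqlite3

WATERMARK_FILE = pathlib.Path("watermark.json")   # stores the last successfully loaded timestamp
SOURCE_DB = "source.db"                           # hypothetical operational source database
RAW_ZONE = pathlib.Path("lakehouse/raw")          # hypothetical ingestion (raw) layer

def read_watermark() -> str:
    """Return the timestamp of the last loaded change, or an epoch default on first run."""
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())["last_modified"]
    return "1970-01-01T00:00:00"

def extract_incremental(last_modified: str) -> list[dict]:
    """Extract only the rows changed since the previous run (incremental extract)."""
    conn = sqlite3.connect(SOURCE_DB)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT * FROM learner_records WHERE modified_at > ? ORDER BY modified_at",
        (last_modified,),
    ).fetchall()
    conn.close()
    return [dict(row) for row in rows]

def load_raw(batch: list[dict]) -> None:
    """Land the batch unchanged in the raw zone (load before transform), then advance the watermark."""
    if not batch:
        return
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    newest = batch[-1]["modified_at"]
    out_file = RAW_ZONE / f"learners_{newest.replace(':', '-')}.json"
    out_file.write_text(json.dumps(batch, indent=2))
    WATERMARK_FILE.write_text(json.dumps({"last_modified": newest}))

if __name__ == "__main__":
    load_raw(extract_incremental(read_watermark()))

In a production orchestration tool the extract and load steps would run as scheduled, monitored tasks, and quality checks such as those shown earlier could gate the batch before it is landed.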

In the age of Big Data, where the vast amount of information poses both advantages and challenges, effectively handling and using data has become essential for organizational success. The wide range of data types, including structured, semi-structured, and unstructured forms, requires a sophisticated approach to data manipulation (Azad et al., 2020). The Extract, Load, Transform (ELT) framework is a versatile approach widely acknowledged for its efficacy in negotiating the complexities of contemporary data settings (Singhal & Aggarwal, 2022). However, as data pipelines grow larger and more complex, ensuring data quality (DQ) has become increasingly important.

The emergence of big data has necessitated the use of Data Lakehouses (DLHs). Harby and Zulkernine (2022) suggested that the big data age has brought new issues for traditional Data Warehouses (DWs). The increase in diverse data volumes caused by digital transformation presents a difficulty for traditional data warehouse solutions in businesses (Čuš & Golec, 2022; Giebler et al., 2021). Furthermore, Barika et al. (2019) highlight the challenges faced by researchers in organizing, controlling, and implementing big data workflows, which differ significantly from typical workflows. After undergoing transformation and being loaded into the DW, the original filtered information is no longer retained (Figueira, 2018). According to Nambiar and Mundra (2022), the conventional ETL procedure is deemed inadequate for fulfilling certain data management requirements.

A Data Lake (DL) is a comprehensive storage and exploration system specifically intended to manage large amounts of varied data. It has been widely recognized as the preferred method for processing and storing diverse data (Begoli et al., 2021). A further study undertaken by the DUT team emphasizes the significance of data governance in establishing authority over data utilization and decision-making via the implementation of processes, rules, organizational structures, and responsibilities (Mthembu et al., 2024). Failure to regularly maintain a data lake can result in expensive and difficult data management. Insufficient execution of data governance in DL systems may result in the creation of a chaotic “Data Swamp,” which undermines the worth of information in knowledge management and analytical systems owing to the presence of data that is difficult to process and analyze.

The DLH emerged as an innovative concept that integrates the benefits of the DL and the DW, enabling companies to efficiently handle, analyze, and store vast quantities of data in a versatile and cost-efficient way. According to Harby and Zulkernine (2022), this is intended to expedite the effective extraction of information.

The objective of this chapter is to provide a structured system for managing and incorporating data into a Data Lakehouse (DLH) to guarantee the accuracy and reliability of the data. The research outlines the following goals, which are aligned with the aim:

Key Terms in this Chapter

Data Ingestion: The process of collecting and importing information from a variety of sources into an archive, e.g. the Data Lakehouse, so that it can be made available for analysis.

Data Lakehouse: A unified data management architecture that enables organizations to store and analyze structured and unstructured data on a single platform, combining the benefits of Data Lakes and Data Warehouses.

Semi-Structured Data: Data that does not conform to a rigid, predefined data model but carries an irregular, implicit structure, for example through tags or key-value markers (as in JSON or XML).

Data Orchestration: The coordination and management of different data processing tasks, often through automated workflows, to ensure efficient data flows and integration into a system.

Data Quality: The degree to which data are accurate, complete, timely, and consistent, providing reliable and meaningful information for decision making.
