Creating a Data Lakehouse for a South African Government-Sector Training Authority: Enforcing Quality Control for Incremental Extract-Load-Transform Pipelines

Dharmesh Dhabliya, Vivek Veeraiah, Sukhvinder Singh Dari, Jambi Ratna Raja Kumar, Ritika Dhabliya, Sabyasachi Pramanik, Ankur Gupta
Copyright: © 2024 | Pages: 22
DOI: 10.4018/979-8-3693-1582-8.ch004

Abstract

The Durban University of Technology is currently engaged in a project to create a data lakehouse system for a Training Authority in the South African Government sector. This system is crucial for improving the monitoring and evaluation capacities of the training authority and ensuring efficient service delivery. Ensuring the high quality of data fed into the lakehouse is essential, since low data quality negatively impacts the effectiveness of the lakehouse system. This chapter examines quality control methods for ingestion-layer pipelines in order to present a framework for ensuring data quality. The metrics considered for assessing data quality were completeness, accuracy, integrity, correctness, and timeliness. The efficiency of the framework was assessed by applying it to a sample semi-structured dataset. Suggestions for future development include integrating data from a wider range of sources and providing triggers for incremental data intake.
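To make the quality dimensions named above concrete, the following minimal Python sketch illustrates how completeness, accuracy (validity), and timeliness checks might be applied to a single semi-structured learner record. It is an illustration only, not the chapter's implementation; the field names, schema, and freshness threshold are assumptions.

from datetime import datetime, timezone

REQUIRED_FIELDS = ["learner_id", "programme", "enrolment_date"]  # assumed schema
MAX_STALENESS_DAYS = 30                                          # assumed freshness rule

def check_completeness(record: dict) -> bool:
    """Completeness: all required fields are present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)

def check_accuracy(record: dict) -> bool:
    """Accuracy/validity: the enrolment date parses and is not in the future."""
    try:
        enrolled = datetime.fromisoformat(record["enrolment_date"])
    except (KeyError, ValueError):
        return False
    return enrolled.date() <= datetime.now(timezone.utc).date()

def check_timeliness(record: dict) -> bool:
    """Timeliness: the record was updated within the assumed freshness window."""
    try:
        updated = datetime.fromisoformat(record["last_updated"])
    except (KeyError, ValueError):
        return False
    return (datetime.now(timezone.utc) - updated).days <= MAX_STALENESS_DAYS

# Hypothetical sample record used purely for demonstration.
sample = {
    "learner_id": "L-001",
    "programme": "Plumbing NQF4",
    "enrolment_date": "2024-02-01",
    "last_updated": "2024-02-15T08:30:00+00:00",
}
print(check_completeness(sample), check_accuracy(sample), check_timeliness(sample))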
Chapter Preview

Introduction

In South Africa, Sector Education and Training Authorities (SETAs) are government-established entities responsible for managing skills development and training in various sectors of the economy. These entities are referred to here as Government-Sector Training Authorities (GTAs), and they play a crucial role in the country's efforts to improve skills and training across many sectors. The Durban University of Technology (DUT) has partnered with a South African GTA to enhance the data management strategy the GTA uses for an ongoing project, while also assisting DUT students in developing supplementary skills. To strengthen the data management capabilities of the GTA, it was recognized that a comprehensive system was required to store data and generate reports automatically. Following this discussion, the DUT team suggested using Microsoft Azure services to establish a data warehousing solution. Additional context for the project is presented by Mthembu et al. (2024). This chapter focuses on establishing a Data Lakehouse for a Training Authority in the South African Government sector. One crucial aspect is investigating techniques to ensure the integrity of data as it moves through the system, particularly during the Incremental Extract-Load-Transform pipelines at the ingestion layer, which employ data orchestration. A simple sketch of such an incremental step follows.
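The sketch below uses a common "high-water mark" pattern for incremental ingestion: only rows modified since the last successful run are extracted and landed unchanged in a raw zone, with transformation deferred, in keeping with ELT. It is a hedged illustration, not the Azure-based pipeline described in the chapter; the table, column, database, and path names are hypothetical.

import json
import pathlib
import sqlite3

WATERMARK_FILE = pathlib.Path("watermark.json")   # stores the last successfully loaded timestamp
SOURCE_DB = "source.db"                           # hypothetical operational source database
RAW_ZONE = pathlib.Path("lakehouse/raw")          # hypothetical ingestion (raw) layer

def read_watermark() -> str:
    """Return the timestamp of the last loaded change, or an epoch default on first run."""
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())["last_modified"]
    return "1970-01-01T00:00:00"

def extract_incremental(last_modified: str) -> list[dict]:
    """Extract only the rows changed since the previous run (incremental extract)."""
    conn = sqlite3.connect(SOURCE_DB)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT * FROM learner_records WHERE modified_at > ? ORDER BY modified_at",
        (last_modified,),
    ).fetchall()
    conn.close()
    return [dict(row) for row in rows]

def load_raw(batch: list[dict]) -> None:
    """Land the batch unchanged in the raw zone (load before transform), then advance the watermark."""
    if not batch:
        return
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    newest = batch[-1]["modified_at"]
    out_file = RAW_ZONE / f"learners_{newest.replace(':', '-')}.json"
    out_file.write_text(json.dumps(batch, indent=2))
    WATERMARK_FILE.write_text(json.dumps({"last_modified": newest}))

if __name__ == "__main__":
    load_raw(extract_incremental(read_watermark()))

In a production orchestration tool the extract and load steps would run as scheduled, monitored tasks, and quality checks such as those shown earlier could gate the batch before it is landed.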

In the age of Big Data, where the vast amount of information poses both advantages and challenges, effectively handling and using data has become essential for organizational success. The wide range of data types, including structured, semi-structured, and unstructured forms, requires a sophisticated approach to data manipulation (Azad et al., 2020). The Extract, Load, Transform (ELT) framework is a versatile approach widely acknowledged for its efficacy in negotiating the complexities of contemporary data settings (Singhal & Aggarwal, 2022). However, as data pipelines grow larger and more complex, ensuring data quality (DQ) has become increasingly important.

The emergence of big data has necessitated the use of Data Lakehouses (DLHs). Harby and Zulkernine (2022) suggested that the big data age has brought new issues for traditional Data Warehouses (DWs). The increase in diverse data volumes caused by digital transformation presents a difficulty for traditional data warehouse solutions in businesses (Čuš & Golec, 2022; Giebler et al., 2021). Furthermore, Barika et al. (2019) highlight the challenges faced by researchers in organizing, controlling, and implementing big data workflows, which differ significantly from typical workflows. After undergoing transformation and being loaded into the DW, the original filtered information is no longer retained (Figueira, 2018). According to Nambiar and Mundra (2022), the conventional ETL procedure is deemed inadequate for fulfilling certain data management requirements.

A Data Lake (DL) is a comprehensive storage and exploration system specifically intended to manage large amounts of varied data. It has been widely recognized as the preferred method for processing and storing diverse data (Begoli et al., 2021). A further study undertaken by the DUT team emphasizes the significance of data governance in establishing authority over data utilization and decision-making via the implementation of processes, rules, organizational structures, and responsibilities (Mthembu et al., 2024). Failure to regularly maintain a data lake can result in expensive and difficult data management. Insufficient execution of data governance in DL systems may result in the creation of a chaotic “Data Swamp,” which undermines the worth of information in knowledge management and analytical systems owing to the presence of data that is difficult to process and analyze.

The DLH emerged as an innovative concept that integrates the benefits of the DL and the DW, enabling companies to efficiently handle, analyze, and store vast quantities of data in a versatile and cost-efficient way. According to Harby and Zulkernine (2022), this is intended to expedite the effective extraction of information.

The objective of this chapter is to provide a structured system for managing and incorporating data into a Data Lakehouse (DLH) to guarantee the accuracy and reliability of the data. The research outlines the following goals, which are aligned with the aim:

Key Terms in this Chapter

Data Ingestion: The process of collecting and importing information from a variety of sources into an archive, e.g. the Data Lakehouse, so that it can be made available for analysis.

Data Lakehouse: A unified data management architecture that enables organizations to store and analyze structured and unstructured data on a single platform, combining the benefits of Data Lakes and Data Warehouses.

Semi-Structured Data: Data that does not conform to a rigid, predefined data model but carries an irregular, implicit structure, for example through tags or key-value markers (as in JSON or XML).

Data Orchestration: The coordination and management of different data processing tasks, often through automated workflows, to ensure efficient data flows and integration into a system.

Data Quality: The degree to which data are accurate, complete, timely, and consistent, providing reliable and meaningful information for decision making.
