Data Lakes

Anjani Kumar, Parvathi Chundi
Copyright: © 2023 | Pages: 15
DOI: 10.4018/978-1-7998-9220-5.ch025

Abstract

Data lake (DL) technology is popular for its flexibility to handle different raw data formats at ingestion time as well as at the time of retrieval from the data lake. A DL typically includes the following five layers: data ingestion, staging, processed data, storage and visualization, and analytics. Together, these five layers provide access to seemingly infinite computation and storage resources for democratizing data access and for supporting a wide variety of analytics tasks in an enterprise. This chapter explains a four-step approach to performing a data analysis task and describes the three pillars for building a DL. It then gives a brief history of the evolution from the Excel sheet to the DL and explains the five layers: data ingestion, staging, processed data, storage and visualization, and analytics. Finally, it briefly describes three DL systems, Snowflake, Databricks, and Redshift, and compares them on nine important metrics.
Chapter Preview

Introduction

With the advent of the digital world, data are generated and collected for every action -- while browsing a website, purchasing items on e-commerce websites, watching videos online, and so on. These data are generated in real time and can be in diverse formats: structured (relational tables or CSV files), unstructured (text files), and semi-structured (XML or log files). These ever-growing data sets create challenges in an organization where multiple departments may each generate a part of the organizational data. For an organization to generate value from the data collected by different departments, these data sources must be accessible across the entire organization, merged, and analyzed in different ways for various purposes. A Data Lake (DL) is a centralized, scalable storage location where organizational data can be stored and made widely available across the entire organization for analysis purposes.

Different departments of a company have different data requirements. The Business Intelligence team may need data arranged in a specific format to compute data cubes and create reports and visualizations that answer business questions. In contrast, the data science team may need data in raw format to explore future trends or build predictive models.

A data analysis task is the process of extracting meaningful information from a massive volume of data. It can be done in various ways, such as creating reports and visualizations to answer business questions or developing a predictive model using machine learning to find patterns. There are mainly two types of data analysis: quantitative and qualitative. Even though the two are conducted differently, both approaches attempt to tell a story from the data, and they share commonalities such as data reduction, answering research questions, and explaining variation (Hardy, 2004). A data analysis task has also been defined as the accurate evaluation and full exploitation of the data obtained (Brandt & Brandt, 1998).

There are usually four steps in any data analysis task (Gorelik, 2019); they are listed below.

  • 1. Find & Understand: An enterprise has vast amounts of data. This massive amount of data is saved in many databases, each containing many tables and each table containing many fields or attributes. A database is a collection of interrelated data that is stored in and managed by a database management system (DBMS) (Silberschatz, Korth, & Sudarshan, 2020). In general, a DBMS uses a relational or tabular format to store data and the relationships among data. Data are saved in a collection of tables. Each table has multiple columns, also known as attributes; the attribute names within a table are unique. Each row in the table stores data as a record.

With thousands of tables at an enterprise and each table containing hundreds of fields, it is difficult, if not impossible, to locate the right data sets needed for an analysis task. As a simple example, consider the data analysis task of building sales prediction models for the northeast region of the United States. The analyst should be able to locate the tables where such data are stored among the hundreds of databases in the enterprise. It is complicated for an analyst to find and understand the meanings of the numerous attributes of these tables. To find the tables with relevant attributes, an analyst may have to manually examine each table or enlist the help of others who might have used or created that table. Therefore, the analyst must first locate the correct fields needed for the data analysis and then understand the data/attributes in the existing databases, as sketched below.
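
As a minimal, hypothetical sketch of this "find and understand" step (not taken from the chapter), the Python snippet below uses the built-in sqlite3 module as a stand-in for one enterprise database and scans table metadata for attribute names that match a keyword; the table and column names are invented for illustration.

```python
import sqlite3

# Stand-in for one enterprise database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ne_sales (order_id INTEGER PRIMARY KEY, region TEXT,
                           sale_date TEXT, amount REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT,
                            state TEXT);
""")

def find_columns(conn, keyword):
    """Scan every table's metadata for column names containing `keyword`."""
    matches = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, default, pk) per column.
        for _, col_name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
            if keyword.lower() in col_name.lower():
                matches.append((table, col_name, col_type))
    return matches

# Locate candidate fields for a regional sales analysis.
print(find_columns(conn, "region"))   # [('ne_sales', 'region', 'TEXT')]
print(find_columns(conn, "amount"))   # [('ne_sales', 'amount', 'REAL')]
```

In a real enterprise, the same idea would be applied to a data catalog or to the metadata store of each DBMS rather than to a single in-memory database.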

Key Terms in this Chapter

Cluster: A cluster is a set of compute nodes that work in sync to process a job. The job could be data engineering, data science, analytics, or any machine learning workload.

ETL: ETL stands for Extract, Transform, and Load. It is a data integration process through which various data sources in different formats can be consolidated into a single location, such as a data warehouse.
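
As a hedged illustration of this definition, the sketch below extracts records from a CSV file, applies a simple transformation, and loads them into a SQLite table standing in for a warehouse; the file name, column names, and target table are assumptions made for the example.

```python
import csv
import sqlite3

def etl(csv_path, conn):
    """Minimal ETL sketch: extract rows from a CSV, transform them, load into a table."""
    # Extract: read raw records from the source file.
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalize the region name and cast the amount to a float.
    cleaned = [(r["order_id"], r["region"].strip().upper(), float(r["amount"]))
               for r in rows]

    # Load: write the consolidated records into the target (warehouse) table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
    conn.commit()

conn = sqlite3.connect("warehouse.db")   # hypothetical target database
etl("daily_sales.csv", conn)             # hypothetical source extract
```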

Cluster Pool: A cluster pool is a set of idle, ready-to-use instances. It reduces cluster start time because the instances are already waiting to be used as cluster nodes.

Schema: A schema defines how data are organized, such as the order of fields and the type of each field (string, integer, etc.). It also describes the relationships between the tables in a database by defining primary and foreign keys.
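
The following minimal sketch (an illustration, not the chapter's example) expresses such a schema as SQLite DDL executed from Python, with field types, a primary key, and a foreign key relating two hypothetical tables.

```python
import sqlite3

# Minimal schema sketch: field names, types, and a primary/foreign key
# relationship between two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        state       TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
```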

EDA: EDA stands for Exploratory Data Analysis, a technique used to analyze data sets by summarizing their main characteristics. Understanding data in this way is essential before applying various machine learning techniques.
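
A brief, hypothetical EDA sketch using pandas is shown below; the data set and column names are assumptions, and the calls simply summarize shape, types, statistics, missing values, and one categorical distribution.

```python
import pandas as pd

# Minimal EDA sketch on a hypothetical sales data set.
df = pd.read_csv("daily_sales.csv")        # hypothetical file

print(df.shape)                            # number of rows and columns
print(df.dtypes)                           # field types
print(df.describe(include="all"))          # summary statistics per column
print(df.isna().sum())                     # missing values per column
print(df["region"].value_counts())         # distribution of a categorical field
```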

ACID: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. In case of failures, these properties ensure the accuracy and integrity of the data in the database.
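
As an illustrative sketch (not from the chapter), the snippet below uses SQLite transactions from Python to show atomicity: a simulated failure in the middle of a two-step transfer rolls back both updates; the table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

# Atomicity: either both updates of the transfer are applied, or neither is.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        raise RuntimeError("simulated failure")  # forces a rollback
except RuntimeError:
    pass

# Both balances are unchanged because the failed transaction was rolled back.
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 100.0), (2, 50.0)]
```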

Compute Resources: Compute resources provide processing capability. More compute is required for compute-intensive workloads, such as complex analytical queries.

Cloud Agnostic: A cloud-agnostic system can be deployed on any cloud service provider. This helps businesses choose the cloud provider that offers the lowest cost and the best features.

Streaming: Data generated continuously in real time.

Storage Resources: Storage resources provide physical storage capacity. Suppose a DL grows in size but its analytical query requirements remain the same; in that case, more physical storage space is required to hold the larger DL, while the compute resources remain unchanged.
