Data Lakes

Anjani Kumar, Parvathi Chundi
Copyright: © 2023 | Pages: 15
DOI: 10.4018/978-1-7998-9220-5.ch025

Abstract

Data lake (DL) technology is popular for its flexibility to handle different raw data formats at ingestion time as well as at the time of retrieval from the data lake. A DL typically includes the following five layers: data ingestion, staging, processed data, storage and visualization, and analytics. Together, these five layers provide access to seemingly infinite computation and storage resources for democratizing data access and for supporting a wide variety of analytics tasks in an enterprise. This chapter explains a four-step approach to performing a data analysis task and describes the three pillars for building a DL. It then gives a brief history of the evolution from the Excel sheet to the DL and explains the five layers: data ingestion, staging, processed data, storage and visualization, and analytics. Finally, it briefly describes three DL systems, Snowflake, Databricks, and Redshift, and compares them on nine important metrics.
Chapter Preview

Introduction

With the advent of the digital world, data are generated and collected for every action -- while browsing a website, purchasing items on e-commerce websites, watching videos online, and so on. These data are generated in real time and can be in diverse formats: structured (relational tables or CSV files), unstructured (text files), and semi-structured (XML or log files). These ever-growing data sets create challenges in an organization where multiple departments may each generate a part of the organizational data. For an organization to generate value from the data collected by different departments, these data sources must be accessible across the entire organization, merged, and analyzed in different ways for various purposes. A Data Lake (DL) is a centralized, scalable storage location where organizational data can be stored and made widely available across the entire organization for analysis purposes.

Different departments of a company have different data requirements. The Business Intelligence team may need data arranged in a specific format to compute data cubes and create reports and visualizations that answer business questions. In contrast, the data science team may need data in raw format to explore future trends or build predictive models.

A data analysis task is the process of extracting meaningful information from a massive volume of data. It can be done in various ways, such as creating reports and visualizations to answer business questions or developing a predictive model using machine learning to find patterns. There are mainly two types of data analysis: quantitative and qualitative. Even though the two are conducted differently, both approaches attempt to tell a story from the data, and they share commonalities such as data reduction, answering research questions, and explaining variation (Hardy, 2004). A data analysis task has also been defined as the accurate evaluation and full exploitation of the data obtained (Brandt & Brandt, 1998).

There are usually four steps in any data analysis task (Gorelik, 2019); they are listed below.

  • 1. Find & Understand: An enterprise has vast amounts of data. This massive amount of data is saved in many databases, each containing many tables and each table containing many fields or attributes. A database is a collection of interrelated data that is stored in and managed by a database management system (DBMS) (Silberschatz, Korth, & Sudarshan, 2020). In general, a DBMS uses a relational or tabular format to store data and the relationships among data. Data are saved in a collection of tables. Each table has multiple columns, also known as attributes; the attribute names within a table are unique. Each row in the table stores data as a record.

With thousands of tables at an enterprise and each table containing hundreds of fields, it is difficult, if not impossible, to locate the right data sets needed for an analysis task. As a simple example, consider the data analysis task of building sales prediction models for the northeast region of the United States. The analyst should be able to locate the tables where such data are stored among the hundreds of databases in the enterprise. It is complicated for an analyst to find and understand the meanings of the numerous attributes of these tables. To find the tables with relevant attributes, an analyst may have to manually examine each table or enlist the help of others who might have used or created that table. Therefore, the analyst must first locate the correct fields needed for the data analysis and then understand the data/attributes in the existing databases, as sketched below.
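
As a minimal, hypothetical sketch of this "find and understand" step (not taken from the chapter), the Python snippet below uses the built-in sqlite3 module as a stand-in for one enterprise database and scans table metadata for attribute names that match a keyword; the table and column names are invented for illustration.

```python
import sqlite3

# Stand-in for one enterprise database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ne_sales (order_id INTEGER PRIMARY KEY, region TEXT,
                           sale_date TEXT, amount REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT,
                            state TEXT);
""")

def find_columns(conn, keyword):
    """Scan every table's metadata for column names containing `keyword`."""
    matches = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, default, pk) per column.
        for _, col_name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
            if keyword.lower() in col_name.lower():
                matches.append((table, col_name, col_type))
    return matches

# Locate candidate fields for a regional sales analysis.
print(find_columns(conn, "region"))   # [('ne_sales', 'region', 'TEXT')]
print(find_columns(conn, "amount"))   # [('ne_sales', 'amount', 'REAL')]
```

In a real enterprise, the same idea would be applied to a data catalog or to the metadata store of each DBMS rather than to a single in-memory database.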

Key Terms in this Chapter

Cluster: A cluster is a set of compute nodes that work in sync to process a job. The job could be data engineering, data science, analytics, or any machine learning workload.

ETL: ETL stands for Extract, Transform, and Load. It is a data integration process through which various data sources in different formats can be consolidated into a single location, such as a data warehouse.
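
As a hedged illustration of this definition, the sketch below extracts records from a CSV file, applies a simple transformation, and loads them into a SQLite table standing in for a warehouse; the file name, column names, and target table are assumptions made for the example.

```python
import csv
import sqlite3

def etl(csv_path, conn):
    """Minimal ETL sketch: extract rows from a CSV, transform them, load into a table."""
    # Extract: read raw records from the source file.
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalize the region name and cast the amount to a float.
    cleaned = [(r["order_id"], r["region"].strip().upper(), float(r["amount"]))
               for r in rows]

    # Load: write the consolidated records into the target (warehouse) table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
    conn.commit()

conn = sqlite3.connect("warehouse.db")   # hypothetical target database
etl("daily_sales.csv", conn)             # hypothetical source extract
```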

Cluster Pool: A cluster pool is a set of idle, ready-to-use instances. It reduces cluster start time because the instances are already waiting to be used as cluster nodes.

Schema: A schema defines how data are organized, such as the order of fields and the type of each field (string, integer, etc.). It also describes the relationships between the tables in a database by defining primary and foreign keys.
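
The following minimal sketch (an illustration, not the chapter's example) expresses such a schema as SQLite DDL executed from Python, with field types, a primary key, and a foreign key relating two hypothetical tables.

```python
import sqlite3

# Minimal schema sketch: field names, types, and a primary/foreign key
# relationship between two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        state       TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
```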

EDA: EDA stands for Exploratory Data Analysis, a technique used to analyze data sets by summarizing their main characteristics. Understanding data in this way is essential before applying various machine learning techniques.
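
A brief, hypothetical EDA sketch using pandas is shown below; the data set and column names are assumptions, and the calls simply summarize shape, types, statistics, missing values, and one categorical distribution.

```python
import pandas as pd

# Minimal EDA sketch on a hypothetical sales data set.
df = pd.read_csv("daily_sales.csv")        # hypothetical file

print(df.shape)                            # number of rows and columns
print(df.dtypes)                           # field types
print(df.describe(include="all"))          # summary statistics per column
print(df.isna().sum())                     # missing values per column
print(df["region"].value_counts())         # distribution of a categorical field
```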

ACID: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. In case of failures, these properties ensure the accuracy and integrity of the data in the database.
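
As an illustrative sketch (not from the chapter), the snippet below uses SQLite transactions from Python to show atomicity: a simulated failure in the middle of a two-step transfer rolls back both updates; the table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

# Atomicity: either both updates of the transfer are applied, or neither is.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        raise RuntimeError("simulated failure")  # forces a rollback
except RuntimeError:
    pass

# Both balances are unchanged because the failed transaction was rolled back.
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 100.0), (2, 50.0)]
```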

Compute Resources: Compute resources provide processing capability. More compute is required for compute-intensive workloads, such as complex analytical queries.

Cloud Agnostic: A cloud-agnostic system can be deployed on any cloud service provider. This helps businesses choose the cloud provider that offers the lowest cost and the best features.

Streaming: Data generated continuously in real time.

Storage Resources: Storage resources provide physical storage capacity. Suppose a DL grows in size but its analytical query requirements remain the same; in that case, more physical storage space is required to hold the larger DL, while the compute resources remain unchanged.
