Big Data Preprocessing, Techniques, Integration, Transformation, Normalisation, Cleaning, Discretization, and Binning


Copyright © 2024 | Pages: 24
DOI: 10.4018/979-8-3693-0413-6.ch006

Abstract

“Unleashing the Power of Big Data: Innovative Approaches to Preprocessing for Enhanced Analytics” is a groundbreaking chapter that explores the pivotal role of preprocessing in big data analytics. It introduces diverse techniques to transform raw, unstructured data into a clean, analyzable format, addressing the challenges posed by data volume, velocity, and variety. The chapter emphasizes the significance of preprocessing for accurate outcomes, covers advanced data cleaning, integration, and transformation techniques, and discusses real-time data preprocessing, emerging technologies, and future directions. This chapter is a comprehensive resource for researchers and practitioners, enabling them to enhance data analytics and derive valuable insights from big data.
Chapter Preview

1. Introduction to Big Data Preprocessing

Big data preprocessing plays a critical role in the data analysis process by converting raw and unprocessed data into a structured and clean format suitable for analysis. As the volume, velocity, and variety of data continue to grow exponentially, preprocessing becomes increasingly vital for extracting valuable insights and knowledge from large datasets.

The process of big data preprocessing involves employing various techniques and operations to enhance data quality, reduce noise and inconsistencies, handle missing values, and prepare the data for subsequent analysis tasks, as shown in Figure 1. It significantly contributes to improving the efficiency, accuracy, and effectiveness of data analysis (O. Çelik, 2019).

Figure 1. Objectives of big data preprocessing

The main objectives of big data preprocessing include:

Data Cleaning: Raw data often contains errors, outliers, duplicates, or inconsistencies. Data cleaning aims to identify and rectify these issues to ensure high data quality. By eliminating noise and irregularities, the resulting clean data provides a reliable foundation for analysis.
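
As a minimal illustration, the sketch below applies these ideas with pandas; the column names, the sample records, and the interquartile-range outlier rule are assumptions chosen for demonstration, not prescriptions from the chapter.

```python
import pandas as pd

# Hypothetical raw records with inconsistent casing, exact duplicates, and one injected outlier.
raw = pd.DataFrame({
    "customer": ["Alice", "alice", "Bob", "Bob", "Carol", "Dave", "Erin", "Frank"],
    "amount":   [120.0, 120.0, 95.0, 95.0, 110.0, 105.0, 130.0, 9000.0],
})

df = raw.copy()
df["customer"] = df["customer"].str.strip().str.title()   # normalise inconsistent casing
df = df.drop_duplicates()                                  # remove exact duplicate records

# Flag outliers with a simple interquartile-range (IQR) rule, one of many possible criteria.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
within_range = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[within_range]
print(clean)
```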

Data Integration: Big data originates from diverse sources such as databases, sensors, social media, or IoT devices. Data integration involves combining data from different sources and formats into a unified representation. This step ensures data consistency and compatibility for analysis (Z. Cai-Ming, 2020).
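
The hedged sketch below illustrates one way such integration might look in pandas: two hypothetical sources with different column names and units are renamed, converted, and concatenated into a single table.

```python
import pandas as pd

# Source A: an assumed relational export; Source B: an assumed sensor feed with a different schema.
source_a = pd.DataFrame({"device_id": [1, 2], "temp_c": [21.5, 19.0]})
source_b = pd.DataFrame({"sensor": [2, 3], "temperature_f": [66.2, 71.6]})

# Harmonise column names and units before combining the sources.
b_unified = source_b.rename(columns={"sensor": "device_id"})
b_unified["temp_c"] = (b_unified.pop("temperature_f") - 32) * 5 / 9

unified = pd.concat([source_a, b_unified], ignore_index=True)
print(unified)
```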

Data Transformation: Data transformation techniques are applied to convert data into a suitable format for analysis. This may involve scaling numerical data, normalizing values, encoding categorical variables, or deriving new features through mathematical or statistical operations. Transformation facilitates data standardization and simplifies subsequent analysis tasks.
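
A small scikit-learn sketch of these transformations is shown below; the feature names and the choice of z-score scaling plus one-hot encoding are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "income": [30_000, 52_000, 75_000, 41_000],   # hypothetical numeric feature
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],  # hypothetical categorical feature
})

transformer = ColumnTransformer([
    ("scale", StandardScaler(), ["income"]),   # z-score scaling of numeric data
    ("encode", OneHotEncoder(), ["city"]),     # one-hot encoding of categorical data
])
features = transformer.fit_transform(df)
print(features)
```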

Dimensionality Reduction: Dealing with high-dimensional data can pose computational challenges and introduce noise or overfitting problems. Dimensionality reduction techniques help decrease the number of variables or features while preserving crucial information. This simplifies the analysis process and improves computational efficiency (H. S. Obaid, 2019).
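
As one common example of this idea, the sketch below projects synthetic high-dimensional data onto two principal components with scikit-learn's PCA; other reduction methods follow the same pattern.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 synthetic samples with 4 features

pca = PCA(n_components=2)                # keep the two directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance retained by each component
```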

Handling Missing Values: Missing data is a common issue in large datasets. Preprocessing techniques include imputing missing values using statistical methods or leveraging imputation algorithms to fill in the gaps. Proper handling of missing data ensures that the analysis is not compromised by incomplete information (T. A. Alghamdi, 2022).
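
A minimal imputation sketch, assuming a toy table with one numeric and one categorical column, is given below; mean and most-frequent imputation are only two of many possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 33.0],              # numeric column with a gap
    "segment": ["gold", "silver", np.nan, "gold"],  # categorical column with a gap
})

# Fill numeric gaps with the column mean and categorical gaps with the most frequent value.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["segment"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["segment"]]).ravel()
print(df)
```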

Data Discretization: Discretization involves converting continuous data into categorical or discrete representations. This technique simplifies analysis by reducing the complexity associated with continuous variables. It allows for the application of methods specifically designed for categorical data (P. Gao, 2020).
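
The sketch below shows two common binning styles in pandas; the bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 81])   # hypothetical continuous variable

# Fixed-edge binning with human-readable labels (edges are illustrative assumptions).
age_group = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                   labels=["child", "young_adult", "adult", "senior"])
print(age_group)

# Equal-frequency (quantile) binning is a common alternative.
quartile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
print(quartile)
```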

Dealing with Imbalanced Data: Imbalanced data refers to situations where one class or category is significantly more prevalent than others. Preprocessing techniques address this imbalance by employing methods such as oversampling, undersampling, or generating synthetic samples to achieve a balanced representation of the data.
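
As a simple sketch of one such method, the code below randomly oversamples an assumed minority class with scikit-learn's resample utility; synthetic-sample approaches such as SMOTE (from the imbalanced-learn package) follow a similar workflow.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical dataset with an 8:2 class imbalance.
df = pd.DataFrame({
    "feature": range(10),
    "label": ["majority"] * 8 + ["minority"] * 2,
})

majority = df[df["label"] == "majority"]
minority = df[df["label"] == "minority"]

# Duplicate minority rows (sampling with replacement) until both classes have the same size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```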

Big data preprocessing is indispensable for extracting valuable insights from complex datasets. By effectively cleaning, transforming, and organizing the data, preprocessing ensures that subsequent analysis tasks are more accurate, efficient, and reliable. The specific techniques utilized may vary based on the data's nature, analysis objectives, and the challenges posed by the dataset at hand.

Key Terms in this Chapter

Data Cleaning: Data Cleaning involves the identification and correction of errors, outliers, duplicates, or inconsistencies in raw data to improve its quality, aiming to eliminate noise and irregularities and establish a reliable foundation for subsequent analysis.

Data Transformation: Data Transformation includes applying techniques to convert data into a suitable format for analysis, such as scaling numerical data, normalizing values, encoding categorical variables, or deriving new features through mathematical or statistical operations. This standardization simplifies subsequent analysis tasks.

Dealing with Imbalanced Data: Dealing with Imbalanced Data focuses on situations where one class or category is significantly more prevalent than others. Preprocessing techniques, such as oversampling, undersampling, or generating synthetic samples, aim to achieve a balanced representation of the data.

Handling Missing Values: Handling Missing Values is the process of addressing missing data in large datasets. Techniques involve imputing missing values using statistical methods or leveraging imputation algorithms to fill gaps, ensuring that analysis is not compromised by incomplete information.

Data Integration: Data Integration is the process of merging data from various sources and formats (databases, sensors, social media, IoT devices) into a unified representation, ensuring data consistency and compatibility for analysis.

Dimensionality Reduction: Dimensionality Reduction entails employing techniques to reduce the number of variables or features in high-dimensional data while preserving essential information. This simplifies analysis, addresses computational challenges, and enhances efficiency by minimizing noise and overfitting problems.

Data Discretization: Data Discretization involves converting continuous data into categorical or discrete representations, simplifying analysis by reducing complexity associated with continuous variables and enabling the application of methods designed for categorical data.
