Challenges in Big Data Analysis

M. Govindarajan (Annamalai University, India)
DOI: 10.4018/978-1-7998-3479-3.ch041

Abstract

Big data brings new opportunities to modern society and new challenges to data scientists. On one hand, big data holds great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of big data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. Prior to data analysis, data must be well constructed. However, given the variety of datasets in big data, the efficient representation, access, and analysis of unstructured or semi-structured data remain challenging. Understanding how data can be preprocessed is important for improving data quality and the analysis results. The purpose of this chapter is to highlight the big data challenges and to provide a brief description of each.

Background

David Lazer et al. (2009) discuss an emerging field that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behavior. Stadler et al. (2010) developed an efficient EM algorithm for numerical optimization with provable convergence properties. High dimensionality also gives rise to incidental endogeneity, a phenomenon in which many unrelated covariates may incidentally be correlated with the residual noise. This endogeneity creates statistical biases and causes model selection inconsistency that leads to wrong scientific discoveries (Liao and Jiang, 2011; Fan and Liao, 2012). Jianqing Fan et al. (2013) give an overview of the salient features of Big Data and of how these features drive a paradigm change in statistical and computational methods as well as in computing architectures. Nawsher Khan et al. (2014) comprehensively survey and classify the various attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security. Their study also proposes a data life cycle that uses the technologies and terminologies of Big Data, and it identifies future research directions based on opportunities and several open issues in the Big Data domain; these directions facilitate exploration of the domain and the development of optimal techniques to address Big Data. Lenka Venkata Satyanarayana (2015) provides an in-depth analysis of the different hardware platforms available for performing big data analytics and assesses their advantages and drawbacks. D. P. Acharjya et al. (2016) explore the potential impact of big data challenges, open research issues, and the various tools associated with them. Akhil et al. (2017) analyzed the potential effect of big data challenges, open research issues, and the different tools related to them.
Ripon Patgiri (2018) presents a study of the numerous research issues and challenges of Big Data that arise with very large datasets. Reihaneh H. Hariri et al. (2019) review previous work in big data analytics and discuss open challenges and future directions for recognizing and mitigating uncertainty in this domain.

Key Terms in this Chapter

Big Data: Big data holds great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data.

Scalability: The scalability issue of big data has led towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals into very large clusters.
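When a dataset exceeds available memory, one common response to the scalability and storage bottleneck is to process the data in chunks rather than all at once. A minimal sketch (using Welford's online algorithm, which is not drawn from this chapter) of computing a mean and variance in a single streaming pass:

```python
def streaming_mean_var(chunks):
    """Single-pass mean and sample variance over an iterable of chunks
    (Welford's online algorithm), so the full dataset never needs to
    fit in memory at once."""
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    return mean, m2 / (n - 1)

# Example: the data arrive in three separate chunks.
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, var = streaming_mean_var(chunks)
print(mean, var)  # 3.0 2.5
```

The same idea underlies the cloud-based, cluster-scale aggregation mentioned above: each worker summarizes its own chunk, and only the small summaries are combined.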

Incidental Endogeneity: A subtle issue raised by high dimensionality, in which many unrelated covariates are incidentally correlated with the residual noise, creating statistical biases and model selection inconsistency.

Big Data Problems: Problems such as heterogeneity, noise accumulation, spurious correlations, and incidental endogeneity, which must be addressed while balancing statistical accuracy against computational efficiency.

Heterogeneity: Big data are often created via aggregating many data sources corresponding to different sub-populations.

Noise Accumulation: Analyzing Big Data requires us to simultaneously estimate or test many parameters. These estimation errors accumulate when a decision or prediction rule depends on a large number of such parameters.
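The accumulation effect can be seen in a small simulation (illustrative only, not taken from this chapter): estimating a mean vector in which only the first three coordinates carry signal. As noise dimensions are added, the total squared estimation error grows roughly in proportion to the dimension.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100  # sample size stays fixed while the dimension grows

errors = {}
for d in (10, 100, 1000):
    mu = np.zeros(d)
    mu[:3] = [3.0, 2.0, 1.0]                 # the only true signal
    X = rng.standard_normal((n, d)) + mu     # n observations in d dims
    mu_hat = X.mean(axis=0)                  # estimate every coordinate
    errors[d] = float(np.sum((mu_hat - mu) ** 2))

for d, err in errors.items():
    print(d, round(err, 2))  # total error grows roughly like d / n
```

Each individual coordinate is estimated accurately, yet a rule that depends on all of them inherits the sum of the per-coordinate errors.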

Spurious Correlation: High dimensionality also brings spurious correlation, referring to the fact that many uncorrelated random variables may have high sample correlations in high dimensions.
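A short simulation (illustrative only, not taken from this chapter) makes the phenomenon concrete: when the number of variables far exceeds the sample size, some truly independent variable will, by chance alone, show a large sample correlation with the response.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5000                    # small sample, many variables
X = rng.standard_normal((n, p))    # columns are mutually independent
y = rng.standard_normal(n)         # independent of every column

# Sample correlation of each column with y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corrs = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

max_corr = float(np.abs(corrs).max())
print(round(max_corr, 2))  # well above zero despite zero true correlation
```

Variable selection based on such sample correlations would therefore pick up variables with no real relationship to the response.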
