Introduction
The concept of Big Data has been present within computer science since the earliest days of computing. “Big Data” originally meant the volume of data that could not be processed efficiently by traditional database methods and tools. Each time a new storage medium was invented, the amount of accessible data exploded, precisely because it could be reached so easily. The original definition focused on structured data, but many researchers and practitioners have come to appreciate that a significant and growing share of the world’s data and accumulated information resides in massive, unstructured data and information sinks, largely in the form of unstructured text, imagery, and video. The explosion of data has not been accompanied by new storage media that can both accumulate it and ensure ready access.
We define “Big Data” as the amount of data just beyond technology’s capability to store, manage, and process efficiently. These limitations are discovered only through robust analysis of the data itself, explicit statements of its processing needs, and assessment of the capabilities and deficiencies of the tools (hardware, software, and methods) used to analyze and administer it. As with any newly recognized problem, the conclusion may be a recommendation that new tools be forged to perform the new tasks.
As recently as five years ago, researchers and analysts were thinking of tens to hundreds of terabytes of storage for our personal computers; today, we think in tens to hundreds of petabytes. Big Data, like our universe, thus appears to be a constantly moving and expanding target. It is the growing amount of data (in volume, type, and perhaps new characteristics) that lies just beyond our immediate capabilities and grasp, i.e., data we must work hard to store, access, manage, and process.
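To make the jump in scale concrete, the following minimal sketch spells out the decimal (SI) storage units the discussion above moves through; the unit names and powers of ten are standard, and the snippet is purely illustrative:

```python
# Decimal (SI) storage units, in bytes.
TB = 10**12  # terabyte
PB = 10**15  # petabyte
EB = 10**18  # exabyte

# Each step up the scale is a factor of one thousand:
# a petabyte holds 1,000 terabytes, and an exabyte 1,000 petabytes.
print(PB // TB)  # 1000
print(EB // PB)  # 1000
```

Moving from “tens to hundreds of terabytes” to “tens to hundreds of petabytes” is therefore roughly a thousandfold increase, with the exabyte range another factor of a thousand beyond that.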
Our conclusion about growth rate is that the current growth in the volume of data collected remains staggering. Moreover, no researcher or manager in the literature surveyed projects a slowing of that growth rate, or a cap on the total volume of data collected in the future. A major challenge for IT researchers and practitioners is that this growth is rapidly exceeding our ability to (1) design appropriate systems to handle the data effectively, and (2) analyze it to extract meaning relevant to decision making. In this paper, we identify critical issues associated with data storage, management, processing, and security. To the best of our knowledge, the research literature has addressed some of these issues, but we believe new approaches, technologies, and processes will (and must) continue to emerge to address them as data volumes approach the exabyte range.