Introduction
With the advent of new technologies such as cloud computing, social networks, the Internet of Things, and numerous online applications, Internet usage has increased rapidly. The massive amounts of data generated by these technologies are shared over networks, giving rise to big data (Sivarajah et al., 2017). In the big data era, wireless sensor networks also generate enormous volumes of data (Smys, 2019). Storing and processing such data is a challenging task, and de-duplication is an apt solution to this data explosion. Volume, Variety, Velocity, Variability, Veracity, and Value are among the dimensions of big data (Al-Mekhlal & Ali Khwaja, 2019).
Recent technologies help to increase parallel processing capacity and storage space in data analysis. Privacy and data security are also major issues in processing big data (Abouelmehdi et al., 2017), and much research has been done to establish security and privacy in data analysis. In this information era, data analytics is a computerized process that is distinct from data science and data analysis. Data analysis focuses on finding solutions to specific queries; data analytics focuses on a particular field to attain a target; data science focuses on formulating new queries and uses predictive analytics and machine learning. Common data analytics tools are Excel, SQL databases, Power BI, SAS, R, and Python, while data science tools include Hadoop, Spark, Cassandra, MongoDB, Tableau, Python, R, Scala, and QlikView. The outcomes of big data analytics software are more efficient marketing, better customer service, and higher operational efficiency.
Various frameworks and algorithms have been developed to extract meaningful information from big data, but some of these methods take considerable time to gather the information. In order to select a suitable technique for handling big data, several parameters are studied, including flexibility, reliability, security, performance, cost, and adeptness. The complex nature of big data demands state-of-the-art technologies to improve efficiency. Many enterprises enhance their revenue and operations by refining big data, and many organizations make intelligent decisions with its help.
Big data refers to the grouping of voluminous, multi-format data from various sources; it involves processing data that outstrips the capacity of a traditional database. Tremendous amounts of structured, semi-structured, and unstructured data can be handled by big data systems (Suma V et al., 2019). Structured data follow a fixed format and can be handled by a structured query language. Semi-structured data, such as JSON and XML, have a self-describing structure. Unstructured data have no formal structure and include images, videos, audio, documents, etc. A small illustration of the three forms appears below.
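As a rough illustration of these three forms, the following Python sketch loads one record of each kind; the field names and sample values are hypothetical, not taken from any cited work.

```python
import csv
import io
import json

# Structured: fixed schema, directly queryable with SQL-like operations.
structured = io.StringIO("id,name,age\n1,Alice,30\n2,Bob,25\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing keys (JSON/XML), but no rigid schema.
semi = json.loads('{"id": 1, "tags": ["sensor", "log"], "meta": {"source": "web"}}')

# Unstructured: raw bytes with no formal structure (e.g., the start of a JPEG image).
unstructured = b"\xff\xd8\xff\xe0"

print(rows[0]["name"], semi["tags"], len(unstructured))
```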
Data de-duplication is a powerful approach to freeing storage space by eliminating superfluous data; it is the capability of storing only a unique copy of each piece of data. File chunking, hash code creation, and redundancy detection are the three components of hash-based chunking. The method divides large files into multiple pieces known as chunks and creates a fingerprint for each chunk using SHA-1 or MD5 (Rachmawati et al., 2018). In order to find redundant chunks, the de-duplication method compares each new fingerprint with the fingerprints stored in the database; matching fingerprints pinpoint duplicate chunks, which are then discarded from storage. A minimal sketch of this process follows.
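The sketch below is a minimal, in-memory illustration of the three components just described, assuming fixed-size chunking and SHA-1 fingerprints; the names chunk_file, deduplicate, and CHUNK_SIZE are illustrative, not part of any cited system.

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # fixed chunk size; real systems often use content-defined chunking


def chunk_file(path, chunk_size=CHUNK_SIZE):
    """Split a file into fixed-size chunks (the file-chunking step)."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def deduplicate(paths):
    """Store only unique chunks, keyed by their SHA-1 fingerprint."""
    store = {}    # fingerprint -> chunk (the unique-copy store)
    recipes = {}  # path -> list of fingerprints needed to rebuild the file
    for path in paths:
        fingerprints = []
        for chunk in chunk_file(path):
            fp = hashlib.sha1(chunk).hexdigest()  # hash code creation
            if fp not in store:                   # redundancy detection
                store[fp] = chunk                 # keep only the first copy
            fingerprints.append(fp)
        recipes[path] = fingerprints
    return store, recipes
```

Production systems typically use a persistent fingerprint index rather than a Python dictionary, but the compare-fingerprint-then-discard logic is the same.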
Hadoop is a distributed processing framework that can process large files on clusters of computers. It is cost effective, scalable, flexible, and resilient to failure (Kumar et al., 2018). The Hadoop Distributed File System (HDFS) and MapReduce are its two major components: HDFS provides high-throughput storage and access to data across the nodes of the cluster, while MapReduce is a two-step processing model well suited to the parallel processing required in de-duplication (Singhal et al., 2018). Hadoop not only processes tremendous quantities of data but can also handle mixed forms of data, including clickstream records, web server logs, mobile application logs, social media posts, customer emails, and sensor data. Node failure in Hadoop does not affect the execution or processing of applications. A MapReduce-style sketch of de-duplication is given below.
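As one hedged illustration of how record-level de-duplication maps onto the two MapReduce steps, the pair of Hadoop Streaming scripts below emits a SHA-1 fingerprint as the key and keeps the first record per key; mapper.py and reducer.py are hypothetical file names, not code from the cited works.

```python
#!/usr/bin/env python3
# mapper.py -- step 1 (map): emit (fingerprint, record) for every input line
import hashlib
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    fingerprint = hashlib.sha1(record.encode("utf-8")).hexdigest()
    print(f"{fingerprint}\t{record}")
```

```python
#!/usr/bin/env python3
# reducer.py -- step 2 (reduce): keys arrive sorted, so the first record seen
# for a fingerprint is kept and every later duplicate is dropped
import sys

previous = None
for line in sys.stdin:
    fingerprint, record = line.rstrip("\n").split("\t", 1)
    if fingerprint != previous:
        print(record)            # emit one unique copy
        previous = fingerprint
```

Such scripts would be submitted through Hadoop's streaming jar, with the shuffle phase between the two steps grouping identical fingerprints onto the same reducer.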
Hadoop overcomes the classical database in two ways, clearing up the two main issues faced by a conventional database.