Introduction
The human race is evolving at a very fast pace. Humans are producing, harvesting and feasting on unprecedented amounts of data. New technologies and analytics platforms are helping to derive previously unforeseen information from this huge volume of data. This evolution has accelerated, aided primarily by the introduction of computational machines. These machines not only help with the storage of data, but also aid further development. The dominance of these machines in the digital age has led to the generation of enormous amounts of data, which is increasing at a rapid pace. As surveyed in 2012, about 2.5 exabytes of data are generated every day, and this amount doubles every 40 months. More data is streamed across the internet every second than was stored on the entire internet only 20 years ago (Chen & Zhang, 2014). To keep up with such a flow of data in the form of images, videos, datasets, files, text, animations, etc., it is necessary for humans to harness it for their full benefit.
In such a scheme, big data analytics plays a crucial role due to its sheer capability to process and analyze large datasets. The approach of harnessing information and using it to predict an unknown event or to find patterns is known as data mining. Several techniques and methods have been developed by mathematicians and statisticians to extract information from these large datasets, and the combination of these techniques with computational processing forms the basis of data mining. The main purpose of data mining techniques (Witten, Frank, Hall & Pal, 2016) is to extricate high-level knowledge from raw data. Many algorithms and approaches were developed by mathematicians in the early 20th century. However, due to technological restraints at that time, these algorithms could not be put to practical use, and it was only recently, with the surge in processing power, that they again became an area of interest for many researchers. Among the many prevalent algorithms, a few of the more commonly used are SVM, kNN and gradient descent (Kotsiantis, Zaharakis & Pintelas, 2007). These algorithms can be broadly classified into four types based on their functionality, namely regression, classification, clustering and rule extraction.
Classification is a technique through which a model, or classifier, is created to predict the class of a particular query q. Within the scope of this paper we discuss two such classification techniques: kNN and ARSkNN. The kNN is a commonly used classifier which relies on distance as the parameter for concluding the class of a testing instance. There are 76 similarity measures in use, but all of them are based on distance (Choi, Cha & Tappert, 2010). Due to this approach, kNN is often computationally expensive, slow, and memory-intensive. To overcome this problem, another classifier known as ARSkNN was developed, which uses Massim as a similarity measure rather than distance (Kumar, Bhatnagar, & Srivastava, 2014). This approach not only reduces the computational power required, but also reduces the overall time taken for classification. The performance of both these classifiers is, however, variable and depends on the characteristics of the dataset, i.e. whether it is symmetric, binary, etc.
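To make the distance-based approach concrete, the following is a minimal sketch of a traditional kNN classifier using Euclidean distance, as evaluated in this paper. It is an illustrative implementation only (the function name and toy data are our own), and it does not reflect the Massim-based similarity used by ARSkNN.

```python
import numpy as np

def knn_predict(X_train, y_train, q, k=3):
    """Classify query q by majority vote among its k nearest
    training instances, measured by Euclidean distance."""
    # Distance from q to every training instance -- this full scan
    # is the source of kNN's computational cost.
    dists = np.linalg.norm(X_train - q, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]         # majority class among neighbors

# Toy example: two small clusters in 2-D
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3))  # -> 1
```

Note that every prediction scans the entire training set, which is precisely the overhead that similarity measures such as Massim aim to avoid.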
The paper makes the following contributions. It:
- Establishes the empirical and experimental foundation of ARSkNN.
- Empirically evaluates ARSkNN against traditional kNN. The results show the superiority of ARSkNN on two parameters: average accuracy percentage and average runtime.
In this paper, the authors compare and evaluate the performance of the kNN classifier, using Euclidean distance as the similarity measure, against ARSkNN. The Wine Quality and Yeast datasets, taken from the UCI repository (Blake & Merz, 1998), were used in this study. The paper is divided into seven sections, namely: Introduction, Literature Review, Experimental Setup, Datasets Used, Empirical Evaluation, Discussion and Conclusion.