Density and Distance Based KNN Approach to Classification

Density and Distance Based KNN Approach to Classification

Yixin Su (Department of Computer Science and Software Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China) and Sheng-Uei Guan (Department of Computer Science and Software Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China)
Copyright: © 2016 |Pages: 16
DOI: 10.4018/IJAEC.2016040103
OnDemand PDF Download:
List Price: $37.50


KNN algorithm is a simple and efficient algorithm developed to solve classification problems. However, it encounters problems when classifying datasets with non-uniform density distributions. The existing KNN voting mechanism may lose essential information by considering majority only and get degraded performance when a dataset has uneven distribution. The other drawback comes from the way that KNN treat all the participating candidates equally when judging upon one test datum. To overcome the weaknesses of KNN, a Region of Influence Based KNN (RI-KNN) is proposed. RI-KNN computes for each training datum region of influence information based on their nearby data (i.e. locality information) so that each training datum can encode some locality information from its region. Information coming from both training and testing stages will contribute to the formation of weighting formula. By solving these two problems, RI-KNN is shown to outperform KNN upon several artificial datasets and real datasets without sacrificing time cost much in nearly all tested datasets.
Article Preview

1. Introduction

Classification is one of the most important techniques in data mining that identifies a new observation into a known set of categories (Alpaydin, 2010). Classification problem, aiming to decide a set of input vectors as one part among N classes (Stephen, 2009), received increasing attention by people due to the development of web technology and the exponential growth of data scale nowadays. The most frequently used classification methods include KNN algorithm, Bayes algorithm, Decision Tree algorithm, Neural Network algorithm, SVM classification algorithm and so on (Mao, Liu and Cao, 2013).

Classification is one of supervised learning instances (Alpaydin, 2010), whose training data is always the most important evidence when making prediction in the test stage. That is to say, different distribution of patterns in datasets will exhibit a different property and no algorithm is able to perform well on all kinds of patterns.

KNN algorithm, considered as one of the most well-received algorithms in classification, enjoys the reputation of being simple, efficient and nonparametric (Kansheng, Lemin and Haitao, 2011). The core idea of KNN algorithm is that when a test data is added into the space distributed with training data, it automatically finds the nearest k training data and regard itself as the category with the majority number of data among the k data. However, there are still some disadvantages. First, it regards each training data as the same once it is selected to be one of the k judges, which overlooks any differences among them. Meanwhile, KNN only considers data distribution information in a small area. In most cases, each training datum contains information about its surrounding environment, but KNN ignores this. Second, KNN algorithm performance deteriorates when dealing with the situation that two groups of data with different categories stay close to each other. When one group has significantly higher density than the other group, misclassification always happens when the test datum is located in between these two groups (Li, Lu and Yun, 2004).

Due to the wide usage of KNN algorithm (Zhu, Luo and Yi, 2009) and its apparent disadvantages, improving KNN algorithm to be more stable and reliable is meaningful. Therefore, this project aims to develop an improved KNN algorithm that trains the training data to mine deeper information from the dataset and then enhance the performance of KNN algorithm.

A density and distance based KNN algorithm, namely Region of Influence Based KNN (RI-KNN), is developed to reduce the misclassification rate when classifying the data near the boundary regions. In addition, when two groups of data with different densities are co-located and a test datum sits in between them, RI-KNN algorithm can use information trained from training data to reduce the error rate. Unlike traditional KNN algorithm that does not have a training stage (lazy learning), the new method contains a training process that deduces from each training datum region of influence information. This information shows how far the training datum can influence. That is to say, it is a method that can classify the data sets with any-shape clustering conditions comparing to some methods that select a geometric shape to enclose data within it belonging to a specific class (Song and Guan, 2014). After the training stage, in the space, every training datum contains a region that represents how far this training datum can influence in space.

Complete Article List

Search this Journal:
Open Access Articles: Forthcoming
Volume 8: 4 Issues (2017): 3 Released, 1 Forthcoming
Volume 7: 4 Issues (2016)
Volume 6: 4 Issues (2015)
Volume 5: 4 Issues (2014)
Volume 4: 4 Issues (2013)
Volume 3: 4 Issues (2012)
Volume 2: 4 Issues (2011)
Volume 1: 4 Issues (2010)
View Complete Journal Contents Listing