A Study of Feature Selection and Dimensionality Reduction Methods for Classification-Based Phishing Detection System


Amit Singh, Abhishek Tiwari
Copyright: © 2021 |Pages: 35
DOI: 10.4018/IJIRR.2021010101

Abstract

Phishing was introduced in 1996, and it is now the biggest cybercrime challenge. Phishing is an abstract way of deceiving users over the internet. The purpose of phishers is to extract the user's sensitive information. Researchers have been working on solutions to the phishing problem, but the parallel evolution of cybercrime techniques has made it a tough nut to crack. Recently, machine learning-based solutions have been widely adopted to tackle the menace of phishing. This survey paper studies various feature selection and dimensionality reduction methods and examines how they perform with machine learning-based classifiers. The selection of features is vital for developing a well-performing machine learning model. This work compares three broad categories of feature selection methods, namely filter, wrapper, and embedded methods, for reducing the dimensionality of data. The effectiveness of these methods has been assessed on several machine learning classifiers using k-fold cross-validation score, accuracy, precision, recall, and time.

1. Introduction

In phishing, the phisher creates a fraudulent website to mislead web users and steal their sensitive personal information. Phishing works by deception: the attacker poses as a trusted entity in electronic communication. Phishing was first discovered in the 1980s. The Anti-Phishing Working Group (APWG) reported 51,401 unique phishing websites in June 2018 (Chiew, Tan, Wong, Yong, & Tiong, 2019; Phishing Activity Trends Report 2nd Quarter 2018, 2018). Another report by RSA estimated that global organizations lost $9 billion to phishing fraud in 2016 (Heidi Bleau, 2016). It is one of the biggest cybercrimes faced by internet users. Generally, phishing attacks are carried out using emails and website spoofing. Phishers start the attack by sending spoofed emails to victims, who believe the emails are authentic and secure and are thereby trapped. Figure 1 represents the workflow structure of phishing.

Figure 1. Phishing workflow

Apart from email, phishers lead users via advertisement links to websites that closely resemble authentic, secure, and well-known sites. There are many ways to detect and prevent phishing, such as authorized anti-phishing software, native browser extensions (Google and Mozilla Firefox use a blacklist warning system), and toolbars. A blacklist warning system queries a database of already known phishing URLs, so it cannot identify new phishing websites (Chiew et al., 2019). An intelligent phishing detection system based on a machine learning classification model can identify whether a website or web link is used for phishing. These ML-based classification systems are very effective. However, when building such prediction systems, feature selection and dimensionality reduction are very important steps. An investigation of state-of-the-art approaches reveals the need for a systematic study of feature selection and dimensionality reduction approaches in order to design an intelligent and capable system for detecting phishing websites.
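The limitation of blacklist lookups noted above can be illustrated with a minimal sketch. The host names and the `is_blacklisted` helper are hypothetical and invented for illustration; real browser warning systems are considerably more elaborate.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of already reported phishing hosts (illustrative values only).
KNOWN_PHISHING_HOSTS = {
    "login-secure-bank.example",
    "paypa1-verify.example",
}

def is_blacklisted(url, blacklist=KNOWN_PHISHING_HOSTS):
    """Return True only if the URL's host already appears in the blacklist."""
    host = urlsplit(url).hostname or ""
    return host in blacklist

reported = "http://paypa1-verify.example/signin"
brand_new = "http://paypa1-verify-2.example/signin"  # same lure, host not yet reported

print(is_blacklisted(reported))   # True
print(is_blacklisted(brand_new))  # False
```

The second URL passes the check only because nobody has reported it yet, which is the gap a learned classifier is meant to close.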

Any machine learning classifier needs useful and relevant features, and feature selection is paramount for choosing those relevant features from the dataset. Feature selection is even more useful when dealing with high-dimensional data. A high-dimensional dataset poses many problems, such as increased training time, and it may sometimes lead to overfitting of the machine learning model. The feature selection process selects relevant attributes from the data based on the method specified by the analyst (Ameen, Balogun, Usman, & Fashoto, 2016). The reduced feature set helps improve the accuracy of the classifier and decrease its computational cost. There are three main categories of feature selection techniques: filter methods, wrapper methods, and embedded methods. Each of these techniques has its own significance, and we discuss them in Section 3.
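As a rough illustration of the filter category only, the sketch below scores each feature independently of any classifier and keeps the top k. The scoring function (absolute Pearson correlation with the label), the feature names, and the toy data are assumptions made for this example and are not taken from the study; wrapper and embedded methods are not shown.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_top_k(rows, labels, names, k):
    """Filter-style selection: rank features by |correlation with label|, keep the top k."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = abs(pearson(column, labels))
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:k], scores

# Toy data invented for illustration: 1 = phishing, 0 = legitimate.
names = ["has_ip_in_url", "url_length", "page_has_favicon"]
rows = [
    [1, 75, 1],
    [1, 60, 1],
    [0, 20, 1],
    [0, 62, 1],
    [1, 58, 1],
    [0, 25, 1],
]
labels = [1, 1, 0, 0, 1, 0]

selected, scores = select_top_k(rows, labels, names, k=2)
print(selected)  # ['has_ip_in_url', 'url_length']
```

The constant `page_has_favicon` column carries no information about the label, so it scores 0.0 and is dropped.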

Dimensionality reduction is another feature preprocessing technique applied before the design of a classifier. It transforms the dataset into a lower-dimensional dataset while ensuring that the meaning of the data does not change. Reducing the dimensionality of a dataset improves the performance of the classifier compared with applying it to the original data. Dimensionality reduction can be linear or nonlinear, depending on the dataset.
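One common linear technique is principal component analysis (PCA); it is used here purely as an illustrative assumption, since this passage does not name a specific method. The sketch below finds the leading principal component by power iteration on the sample covariance matrix and projects toy three-feature rows onto it, giving one derived feature per row.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (mean, unit direction) of the leading principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting vector, assumed not orthogonal to the leading component
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Map each row to its coordinate along the given direction (one value per row)."""
    return [sum((row[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for row in rows]

# Toy data invented for illustration: the second feature is roughly twice the first.
rows = [[2.0, 4.1, 1.0], [3.0, 6.0, 1.1], [4.0, 7.9, 0.9], [5.0, 10.1, 1.0]]
mean, direction = first_principal_component(rows)
reduced = project(rows, mean, direction)
print(reduced)  # one derived value per row instead of three original features
```

Unlike feature selection, the resulting value is a weighted combination of all original columns rather than one of them.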

Feature selection and dimensionality reduction are both used in designing the best machine learning classification model. The difference is that feature selection aims to select features from the original dataset, whereas dimensionality reduction aims to transform the dimensionality of the original dataset.

Machine learning focuses on developing computational algorithms that find patterns, reasoning, and rules in data in order to build a machine learning model that can detect or predict forthcoming occurrences (Ali, 2017). Machine learning is supervised if outputs are given with the training data used to train the model; otherwise, it is unsupervised. Many supervised learning algorithms work successfully in real-life applications. Some popular machine learning classification techniques are the Support Vector Machine (SVM), Naïve Bayes classifier, K-Nearest Neighbor (KNN), decision trees, random forest, and ensemble methods. These classification models are used to classify new incoming data as either positive (one) or negative (zero).
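To make the supervised setting concrete, here is a minimal sketch of one of the listed classifiers, K-Nearest Neighbor, evaluated with accuracy, precision, and recall on a small hold-out set. The two-feature toy data, the choice of k = 3, and the helper names are assumptions for illustration; this is not the experimental setup of the study, and k-fold cross-validation is not shown.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def evaluate(y_true, y_pred):
    """Accuracy, precision, and recall, treating label 1 (phishing) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labelled data invented for illustration: 1 = phishing, 0 = legitimate.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.9, 0.8], [0.8, 0.9], [0.85, 0.7]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[0.2, 0.2], [0.9, 0.9], [0.3, 0.25], [0.7, 0.8]]
test_y = [0, 1, 1, 1]

y_pred = [knn_predict(train_X, train_y, x) for x in test_X]
metrics = evaluate(test_y, y_pred)
print(y_pred)   # [0, 1, 0, 1]
print(metrics)
```

The third test point is a phishing example that sits among legitimate ones, so it is missed: precision stays at 1.0 while recall drops to 2/3, which is why accuracy alone is not a sufficient measure.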

In summary, we make the following contributions in this survey paper:
