Detection Approaches for Categorization of Spam and Legitimate E-Mail

Rachnana Dubey (LNCT, India), Jay Prakash Maurya (LNCT, India) and R. S. Thakur (Maulana Azad National Institute of Technology, India)
DOI: 10.4018/978-1-5225-3870-7.ch016

Abstract

The internet has become very popular, and electronic mail has made it easy and cheap to communicate with many people. However, users also receive many undesired e-mails, and a high percentage of these are termed spam. The goal of spam classification is to distinguish between spam and legitimate e-mail messages. With the popularization of the internet, however, it is challenging to develop spam filters that can automatically and effectively eliminate the increasing volumes of unwanted e-mails before they enter a user's mailbox. The main objective of this chapter is to examine and identify the best detection approach for spam categorization. Different types of algorithms and data mining models are proposed, implemented, and evaluated on data sets. To improve spam filtering techniques, the authors analyze feature selection methods and give recommendations for their use. The chapter concludes that data mining models using a combination of supervised learning algorithms provide better results than single models.

Introduction

E-mail is the most powerful medium of communication today, but e-mail spam is one of the major problems for internet users. Every user faces this problem in day-to-day communication, and as e-mail communication grows, the volume of spam grows with it. Spamming is the use of electronic communication systems to send unsolicited bulk messages, often to promote merchandise or services, that are almost universally unwanted. Spam causes many problems; one of the major ones is that many companies suffer large financial losses (AnirudhRama, 2006). Another is that users need to spend time checking and deleting spam from their inboxes. In addition, because spam e-mails may contain malicious software (e.g., phishing software) and illegal advertising, such as image schemes and attractive information, spam has become a serious security issue on the internet. One of the best solutions to the spam problem is data mining with machine learning algorithms (Nema et al., 2016). Data mining is an approach for finding, through machine learning, the text patterns that indicate the message type (spam or legitimate) in large amounts of data (Yadav et al., 2016), and for discovering the similar patterns adopted by smart spammers, as shown in Figure 1.

Figure 1. Flow chart for finding spam

Five algorithms have been used to categorize e-mail as spam or legitimate. The results are based on supervised learning algorithms (Naïve Bayes, Random Forest, Random Tree, Bagging, and Boosting). A Support Vector Machine (SVM), which is also a supervised learning algorithm, can be used for spam categorization as well; an SVM looks for a linear separation of the classes in a feature space. In this proposed work, the machine learning algorithms are evaluated using WEKA, RapidMiner, and an SVM tool to determine the accuracy and efficiency of the classifiers and various types of errors. We have analyzed the most effective categorization methodology on a benchmark dataset. This comprises 9324 records and 500 instances (70% for training and 30% for testing) to build the model. We describe approaches and learning models for eliminating bulk commercial mail, malicious code, and fraudulent e-mails. The main aim is to find the unwanted keywords that are most often used in spam (Battista, 2011).
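As a rough illustration of the workflow described above, the following Python sketch splits a labeled message set 70/30 and trains a word-count Naïve Bayes classifier with add-one smoothing. It is only a minimal sketch under stated assumptions: the toy messages, the function names (`split_70_30`, `train_naive_bayes`, `classify`, `accuracy`), and the whitespace tokenization are hypothetical and are not taken from the chapter, which evaluates its models in WEKA and RapidMiner on a benchmark dataset.

```python
import math
import random
from collections import Counter

# Hypothetical toy messages for illustration only; this is NOT the
# chapter's benchmark dataset.
MESSAGES = [
    ("win cash prize now", "spam"),
    ("cheap loans win money", "spam"),
    ("free prize claim now", "spam"),
    ("limited offer free cash", "spam"),
    ("claim your free money", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("please review the report", "legitimate"),
    ("lunch with the team tomorrow", "legitimate"),
    ("project report due monday", "legitimate"),
    ("notes from the team meeting", "legitimate"),
]
LABELS = ("spam", "legitimate")


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and return (70% training, 30% testing)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(0.7 * len(rows)))
    return rows[:cut], rows[cut:]


def train_naive_bayes(rows):
    """Count words per class; returns (word_counts, class_counts, vocabulary)."""
    word_counts = {label: Counter() for label in LABELS}
    class_counts = Counter()
    for text, label in rows:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, class_counts, vocabulary


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in LABELS:
        # Smoothed class prior, so a class missing from training is not log(0).
        score = math.log((class_counts[label] + 1) / (total_messages + len(LABELS)))
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for text, label in rows if classify(model, text) == label)
    return correct / len(rows)


train_rows, test_rows = split_70_30(MESSAGES)
model = train_naive_bayes(train_rows)
print("held-out accuracy:", accuracy(model, test_rows))
```

With so few messages the held-out accuracy says little; the point is only the shape of the train/test procedure. The ensemble methods named above (Bagging, Boosting, Random Forest) would combine several such base classifiers rather than rely on one.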

Key Terms in this Chapter

Optimization: Optimization is the process of adjusting a trading system in an attempt to make it more effective.

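Relative Absolute Error: The absolute error is the magnitude of the difference between the exact value and the approximation; the relative absolute error expresses the total absolute error relative to that of a simple predictor, such as one that always predicts the mean of the actual values.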

Mean Absolute Error: The mean absolute error (MAE) is a quantity used to measure how close predictions are to the eventual outcomes.

Mean Squared Error: The average of the squared differences between the estimator and what is estimated.

Categorization: The process in which objects are understood, recognized, and differentiated.

Legitimate: According to law.
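The error measures listed among the key terms above can be made concrete with a short Python sketch. It assumes the common textbook definitions (the relative absolute error is taken relative to a predictor that always outputs the mean of the actual values), and the function names and the sample numbers are hypothetical; the chapter itself does not give formulas in this excerpt.

```python
def mean_absolute_error(actual, predicted):
    """Average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_absolute_error(actual, predicted):
    """Total absolute error divided by that of always predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    baseline = sum(abs(a - mean_actual) for a in actual)
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / baseline


# Hypothetical example: 1 = spam, 0 = legitimate, with predicted spam scores.
actual = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.4]
print("MAE:", mean_absolute_error(actual, predicted))
print("MSE:", mean_squared_error(actual, predicted))
print("RAE:", relative_absolute_error(actual, predicted))
```

A relative absolute error below 1 means the classifier's scores are closer to the true labels than the constant mean predictor's would be.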
