Effect of N-Grams Technique in Preprocessing of Email Spam Filtering

Aakanksha Sharaff (National Institute of Technology Raipur, Raipur, India) and Naresh Kumar Nagwani (National Institute of Technology Raipur, Raipur, India)
Copyright: © 2017 |Pages: 12
DOI: 10.4018/ijaec.2017010102

Abstract

This paper demonstrates a character-level, content-based approach to spam categorization. Categorization is performed using the N-gram technique. The usual practice of applying N-grams to words, which creates a “Bag of Words” representation of documents, is replaced by a “Bag of Characters”. The “Bag of Characters” is created by treating the whole email document as a single string and splitting it character-wise. In this approach, multiple N-grams, i.e. bi-grams, tri-grams and quad-grams, are used simultaneously. This results in a “Bag of Characters” representation of email documents containing N-grams of sizes 2, 3 and 4. It improves the results by avoiding the problems that occur with word N-grams. All experiments were performed on the Ling Spam Corpus.

Introduction

Electronic mail (e-mail) is one of the fastest, cheapest, most secure and most reliable means of communication available. It enables users to connect almost instantly, and sensitive information is well protected. The major issue with this means of communication is irrelevant email, also termed ‘spam’. Spam mails are mails that users do not want to receive. They are generally sent for the sake of advertisement or with the intent of breaching security. They waste enormous amounts of online space, internet bandwidth and valuable time. From a security standpoint, they are particularly dangerous, as they can distribute viruses or extract valuable information if the user isn’t careful. As the number of email users increases, the risks caused by spam also increase. This compels us to search for effective and efficient ‘spam-filters’. Spam-filters may be automated to detect and remove spam messages, or to notify users of their arrival and presence.

In earlier days, spam filters were based on recognizing known ‘spammers’. An email from an ID recognized as a spammer was labeled as a spam message. The list of spammers kept growing, as the email IDs associated with spam messages kept changing or were forged. This required a large amount of storage space and a very efficient search technique. It could also cause redundancy, as the same ‘spammer’ was stored separately by different ‘spam-detectors’.

Later, recognized phrases such as “Buy Now,” “Free Shopping,” “Talk with strangers,” etc., were used to categorize an email as either legitimate or spam. If such phrases appeared in abundance, the email was categorized as spam. These filters were easily fooled by writing the phrases in a form that only humans can read, e.g. “Buy Now” as “B-u-y_N-o-w.” Hence, filters needed to be set up differently, and the basis of the algorithm behind the filters needed to be variable.
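The weakness of phrase-based filtering can be illustrated with a minimal sketch. The phrase list comes from the examples above; the function names `phrase_hits` and `is_spam` and the `threshold` parameter are hypothetical, chosen only for illustration, and do not describe any particular historical filter.

```python
# Hypothetical phrase list, taken from the examples in the text.
SPAM_PHRASES = ("buy now", "free shopping", "talk with strangers")

def phrase_hits(message, phrases=SPAM_PHRASES):
    """Count case-insensitive occurrences of the known spam phrases."""
    text = message.lower()
    return sum(text.count(phrase) for phrase in phrases)

def is_spam(message, threshold=1):
    """Label a message as spam when enough phrases occur (threshold is an assumption)."""
    return phrase_hits(message) >= threshold

plain = "Buy Now and enjoy Free Shopping!"
obfuscated = "B-u-y_N-o-w today!"

plain_is_spam = is_spam(plain)            # both phrases are matched
obfuscated_is_spam = is_spam(obfuscated)  # the disguised phrase is missed
```

Exact substring matching finds the phrases in the plain message but none in the obfuscated one, even though a human reads both the same way.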

Various machine learning techniques and text categorization methods led to the development of content-based spam-detectors. To develop such detectors, a collection of both spam (unwanted) and non-spam (legitimate) messages is used by supervised learning algorithms (support vector machines, decision trees, etc.) to create a model that automatically classifies each incoming message into one of the two categories. This makes the task of developing filters easier at various levels, e.g. for a specific user or for large email moderators.

Spam detection is not ordinary text categorization, since it has some distinctive characteristics. Both legitimate and spam messages cover a wide variety of areas and topics; in other words, they are heterogeneous. The length of emails can vary by orders of magnitude. Emails can be in different languages and can contain abbreviations, spelling mistakes and grammatical errors. Sometimes the formatting of words is changed to fool the spam-filters. The learning model should be effective enough to handle all of these.

Apart from the body of the email, information can be extracted from the subject, attachments, addresses, etc., which can help improve the effectiveness of spam-filters. Also, spam categorization is a cost-sensitive process. Generally, in a completely automated spam filter, the cost of categorizing a legitimate message as spam is higher than the cost of categorizing a spam message as legitimate. This fact must be taken into consideration when evaluating the performance of spam-detectors.

All learning algorithms require a representation of the documents. The most common method is to form an attribute vector. Most machine learning approaches use a “Bag of Words” as the representation.

A ‘Bag of Words’ is a list of words with their word counts. Each row is a document, each column is a word, and each cell is a word count, e.g.:

  • Doc 1 – This is spam. Spam is unwanted.

  • Doc 2 – Spam should be removed.

  • All distinct terms are: {this, is, spam, unwanted, should, be, removed}

  • Representation of documents:

  • Doc 1 – {1, 2, 2, 1, 0, 0, 0}

  • Doc 2 – {0, 0, 1, 0, 1, 1, 1}
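The example above can be reproduced with a short sketch. The helper names `tokenize` and `bag_of_words` are hypothetical, and the tokenization rule (lowercasing and keeping only runs of the letters a to z) is an assumption made for illustration, not a rule stated in the paper.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, count vectors); the vocabulary keeps first-seen order."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocab:
                vocab.append(token)
    # One row per document, one column per vocabulary word.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["This is spam. Spam is unwanted.", "Spam should be removed."]
vocab, vectors = bag_of_words(docs)
```

Here `vocab` lists the distinct terms in the order they first appear, and each row of `vectors` holds the word counts of one document in that column order.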

The problem with word-based text representation is that it requires a ‘tokenizer’ (to split the message into separate tokens) and a ‘lemmatizer’ (to reduce the number of tokens). Lemmatizers are language dependent, and the available lemmatizers are not very effective. Also, variant spellings of the same word, e.g. “b.u.y.”, can be used to confuse the spam-filter. This has led us to the use of N-Grams.
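The character-level alternative described in the abstract, which treats the whole email as a single string and collects N-grams of sizes 2, 3 and 4 together, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `char_ngrams` is hypothetical, and the text is used as-is, since the preview does not say whether any normalization (such as lowercasing or whitespace handling) is applied.

```python
from collections import Counter

def char_ngrams(text, sizes=(2, 3, 4)):
    """Count every character N-gram of the given sizes in one string."""
    counts = Counter()
    for n in sizes:
        # Slide a window of width n over the whole string, one character at a time.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

bag = char_ngrams("buy now")
```

For the 7-character string "buy now" this yields 6 bi-grams, 5 tri-grams and 4 quad-grams in a single bag. No tokenizer or lemmatizer is involved, since the string is never split into words.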
