Effect of N-Grams Technique in Preprocessing of Email Spam Filtering

Aakanksha Sharaff, Naresh Kumar Nagwani
Copyright © 2017 | Pages: 12
DOI: 10.4018/ijaec.2017010102

Abstract

In this paper, a character-level, content-based approach to spam categorization is demonstrated. Spam categorization is performed using the N-gram technique. The common practice of applying N-grams to words, creating a “Bag of Words” representation of documents, is replaced by a “Bag of Characters”, created by treating the whole email document as a single string and splitting it character-wise. In this approach, multiple N-grams, i.e. bi-grams, tri-grams, and quad-grams, are used simultaneously, resulting in a bag-of-characters representation of email documents containing N-grams of sizes 2, 3, and 4. This enhances the results by addressing problems that occur with word N-grams. All experiments have been performed on the Ling-Spam corpus.

Introduction

Electronic mail (e-mail) is one of the fastest, cheapest, and most reliable means of communication available. It enables users to connect almost instantly while keeping sensitive information reasonably secure. The major issue with this medium is irrelevant email, also termed ‘spam’: unsolicited messages that users do not voluntarily choose to receive. Spam mails are generally sent for advertisement or with the intent of breaching security. They waste enormous amounts of online storage, internet bandwidth, and valuable time. From a security perspective, they are particularly dangerous, as they can distribute viruses or extract valuable information if the user is not careful. As the number of email users increases, the risks caused by spam also increase. This compels us to search for effective and efficient ‘spam filters’, which may be automated to detect and remove spam messages or to notify users of their arrival and presence.

In earlier days, spam filters were based on the recognition of various ‘spammers’. Emails from an id recognized as belonging to a spammer were labeled as spam. The list of spammers kept growing, as the email ids associated with spam messages kept changing or were forged. This required a large amount of storage and a very efficient searching technique. It also caused redundancy, as the same spammer was stored separately by different spam detectors.

Later, various recognized phrases such as “Buy Now,” “Free Shopping,” “Talk with strangers,” etc., were used to categorize an email as either legitimate or spam. If such phrases were abundant, the email was categorized as spam. These types of filters were easily fooled by representing the phrases in a form readable only by humans, e.g. “Buy Now” as “B-u-y_N-o-w.” Hence, filters needed to be set up differently, and the basis of the algorithm behind them needed to be variable.

Various machine learning techniques and text categorization methods led to the development of content-based spam detectors. To develop such detectors, a collection of both spam (unwanted) and non-spam (legitimate) messages is used by a supervised learning algorithm (support vector machines, decision trees, etc.) to create a model that automatically classifies an incoming message into one of the two categories. This makes the task of developing filters easier at various levels, e.g. for a specific user or for large email moderators. A minimal sketch of such a pipeline is given below.
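
The sketch below illustrates this content-based setup, assuming scikit-learn is available; the tiny labelled corpus and the choice of a linear support vector machine are illustrative assumptions, not the authors' exact configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled corpus: 1 = spam, 0 = legitimate (illustrative only).
emails = [
    "Buy now and win free shopping vouchers",
    "Meeting notes for the linguistics seminar",
    "Talk with strangers, click here now",
    "Please review the attached draft paper",
]
labels = [1, 0, 1, 0]

# Bag-of-words features feeding a supervised classifier.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(emails, labels)
print(model.predict(["Free vouchers, buy now"]))  # expected output: [1]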

Spam detection is not regular text categorization, since it has some distinctive characteristics. Both legitimate and spam messages deal with a wide variety of areas and topics; in other words, they are heterogeneous. The length of emails can vary by orders of magnitude. Emails can be in different languages and can contain abbreviations, spelling mistakes, and grammatical errors. Sometimes the formatting of words is deliberately changed to fool spam filters. The learning model should be robust enough to handle all of this.

Apart from the body of the email, additional information can be extracted from the subject, attachments, addresses, etc., which can help enhance the effectiveness of spam filters. Also, spam categorization is a cost-sensitive process: in a completely automated spam filter, misclassifying a legitimate message as spam is generally far more costly than letting a spam message through. This fact must be kept in mind when evaluating the performance of spam detectors.
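
One common way to encode this asymmetry when evaluating on corpora such as Ling-Spam is weighted accuracy, where a legitimate-to-spam error counts \(\lambda\) times more than a spam-to-legitimate one; the formula below is a standard convention from the spam-filtering literature, not taken from this paper:

\[
\mathrm{WAcc} = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S}
\]

Here \(n_{L \to L}\) and \(n_{S \to S}\) are the numbers of correctly classified legitimate and spam messages, and \(N_L\) and \(N_S\) are the total numbers of legitimate and spam messages.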

All learning algorithms require a representation of documents. The most common method is to form an attribute vector; most machine learning approaches use the “Bag of Words” representation.

‘Bag of words’ is a list of words with their word counts: each row is a document, each column is a word, and each cell is a word count. For example:

  • Doc 1 – This is spam. Spam is unwanted.

  • Doc 2 – Spam should be removed.

The distinct terms are {this, is, spam, unwanted, should, be, removed}, giving the following representation (a short code sketch follows the example):

  • Doc 1 – {1, 2, 2, 1, 0, 0, 0}

  • Doc 2 – {0, 0, 1, 0, 1, 1, 1}
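
The same construction in code, as a minimal sketch using only the Python standard library; the vocabulary order matches the list of distinct terms above.

import re
from collections import Counter

docs = ["This is spam. Spam is unwanted.", "Spam should be removed."]
vocab = ["this", "is", "spam", "unwanted", "should", "be", "removed"]

for i, doc in enumerate(docs, start=1):
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))  # crude tokenizer
    print(f"Doc {i} -", [counts[w] for w in vocab])

# Output:
# Doc 1 - [1, 2, 2, 1, 0, 0, 0]
# Doc 2 - [0, 0, 1, 0, 1, 1, 1]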

The problem with word-based text representation is that it requires a ‘tokenizer’ (to split the message into separate tokens) and a ‘lemmatizer’ (to reduce the number of distinct tokens). Lemmatizers are language-dependent, and the available ones are not very effective. Also, alternative renderings of the same word, e.g. ‘b.u.y’, can be used to confuse the spam filter. This has led us to the use of N-grams at the character level.
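
As a minimal sketch of the approach described in the abstract, the whole message can be treated as a single string and overlapping character N-grams of sizes 2, 3, and 4 collected simultaneously; the function below is illustrative, not the authors' implementation.

from collections import Counter

def char_ngrams(text, sizes=(2, 3, 4)):
    """Count all overlapping character N-grams of the given sizes."""
    grams = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

bag = char_ngrams("buy now")
print(bag["uy"], bag["buy"], bag["y no"])  # 1 1 1

Because no tokenizer or lemmatizer is involved, this representation is language-independent and avoids the word-level problems described above.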
