The Impact of the Mode of Data Representation for the Result Quality of the Detection and Filtering of Spam

Reda Mohamed Hamou, Abdelmalek Amine, Moulay Tahar
Copyright: © 2017 | Pages: 19
DOI: 10.4018/978-1-5225-2058-0.ch004

Abstract

Spam has reached phenomenal proportions: it accounts for a high percentage of all email exchanged on the Internet. To fight it, we use this article to develop a hybrid algorithm. First, a probabilistic model, in this case Naïve Bayes, weights the terms of the term-category matrix; second, an unsupervised learning algorithm (K-means) separates messages into two classes, spam and ham (legitimate email). To identify the parameters to which the classification is sensitive, we study message content using language-independent representations based on word n-grams and character n-grams (a message may be received in any language), and then decide which representation to use to obtain a good classification. We have chosen several evaluation metrics to validate our results.
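The two message representations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_ngrams` and `word_ngrams`, the lowercasing, and the whitespace tokenization are assumptions made here for illustration, and the abstract does not state which values of n were used.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams (no language-specific processing)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams.

    Assumption: words are lowercased and split on whitespace only.
    """
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


if __name__ == "__main__":
    print(char_ngrams("spam", 2))
    print(word_ngrams("Buy now buy now", 2))
```

Character n-grams need no tokenizer or stop-word list, which is one reason they suit messages that may arrive in any language; word n-grams keep more readable features but depend on how words are delimited.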
Chapter Preview

State Of The Art

The anti-spam techniques found in the literature fall into two groups: those based on machine learning and those not based on machine learning.

The Techniques Not Based on Machine Learning

Heuristic, or rule-based, analysis uses regular-expression rules to detect phrases or characteristics that are common in spam; the number and severity of the features identified determine the classification proposed for the message. The history and popularity of this technology have largely been driven by its simplicity, speed and accuracy. In addition, it has an advantage over many more advanced filtering and detection technologies in that it does not require a learning period.

Signature-based techniques generate a unique hash value (signature) for each message recognized as spam. Signature filters compare the hash value of every incoming message against the stored values, that is, the hashes of email previously identified as spam. This makes it statistically unlikely that a legitimate email will have the same hash as a spam message, which allows signature filters to achieve a very low level of false positives.

Blacklisting is a simple technique common to almost all filtering products. Also known as block lists, blacklists filter out email from specific senders. Whitelists, or authorization lists, perform the opposite function: they automatically classify email from specific senders as legitimate.

There is also a spam filtering technology based on traffic analysis. It characterizes spam traffic patterns using a number of per-email attributes that can identify the characteristics separating spam traffic from non-spam traffic.
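The rule-based, signature-based, and list-based techniques described above can be combined in a short Python sketch. Everything specific in it is an assumption for illustration rather than something taken from the chapter: the three rules and their weights, the `THRESHOLD` value, the choice of SHA-256 over whitespace-normalized lowercase text as the signature, and the order in which the checks run.

```python
import hashlib
import re

# Hypothetical rules: (compiled pattern, weight standing in for "severity").
RULES = [
    (re.compile(r"free money", re.IGNORECASE), 2.0),
    (re.compile(r"click here", re.IGNORECASE), 1.5),
    (re.compile(r"!{3,}"), 1.0),
]
THRESHOLD = 2.5  # assumed cut-off; a real filter would tune this


def heuristic_score(body):
    """Sum rule weights, counting every match of every rule."""
    return sum(weight * len(pattern.findall(body)) for pattern, weight in RULES)


def signature(body):
    """Hash the message body after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify(sender, body, whitelist, blacklist, known_spam_signatures):
    """Return "spam" or "ham" (assumed order: lists, signature, heuristics)."""
    if sender in whitelist:
        return "ham"
    if sender in blacklist:
        return "spam"
    if signature(body) in known_spam_signatures:
        return "spam"
    return "spam" if heuristic_score(body) >= THRESHOLD else "ham"


if __name__ == "__main__":
    known = {signature("Cheap pills TODAY")}
    print(classify("a@example.org", "cheap  pills today", set(), set(), known))
    print(classify("b@example.org", "Meeting at noon", set(), set(), known))
```

An exact hash matches only messages that are identical after normalization, which is consistent with the low false-positive rate noted above; it also means a spammer who changes a single word produces a different signature.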
