A Novel Bio-Inspired Approach for Multilingual Spam Filtering

Hadj Ahmed Bouarara (GeCode Laboratory, Department of Computer Science, Tahar Moulay University of Saida, Saida, Algeria), Reda Mohamed Hamou (GeCode Laboratory, Department of Computer Science, Tahar Moulay University of Saida, Saida, Algeria) and Abdelmalek Amine (GeCode Laboratory, Department of Computer Science, Tahar Moulay University of Saida, Saida, Algeria)
Copyright: © 2015 | Pages: 43
DOI: 10.4018/IJIIT.2015070104

Abstract

In today's digital world, the email service has revolutionized electronic communication and has become a veritable social phenomenon in our daily life. Unfortunately, this technology has also incontestably become a source of malicious activities, especially the plague of undesirable emails (spam), which has grown tremendously in the last few years. The battle against spam emails is extremely fierce. This paper presents an intelligent spam filtering system called the artificial heart-lungs system (AHLS), modelled on the biological phenomenon of general circulation and oxygenation of blood. It is composed of several steps: a selection step that automatically stops emails with an undesirable identifier; a multilingual pre-processing step that handles multilingual spam emails and vectorizes them; and a heart filter and a lungs filter that classify unwelcome email into the spam folder and welcome email into the ham folder for presentation to the recipient. The method automatically updates its learning basis and blacklist, and uses a ranking step to order the spam emails according to their spam relevancy. For their experiments, the authors constructed a new dataset, M.SPAM, composed of emails pre-classified as spam or ham in different languages (English, Spanish, French, and mixed), and used the following validation measures: recall, precision, f-measure, entropy, accuracy and error, false positive rate and false negative rate, ROC, and learning curve. The authors optimized the sensitive parameters (text representation technique, lungs filters, and the size of the initial learning basis). The results are positive compared with those of other bio-inspired techniques (artificial social bees, artificial social cockroaches), a supervised algorithm (decision tree C4.5), and an automatic algorithm (K-means).
Finally, a visual result-mining tool was developed to display the results in graphical form (3D cube and cobweb) with more realism, using zooming and rotation functionality. The authors' aims are to eliminate a large proportion of unwelcome email, handle multilingual emails, ensure automatic updating of their system, and pose a minimal risk of eliminating ham email.
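Several of the validation measures listed in the abstract follow directly from a confusion matrix. The sketch below is a minimal illustration and is not taken from the paper: it assumes spam is the positive class, and the names `Confusion` and `validation_measures` are hypothetical. Entropy, ROC, and the learning curve are not covered here.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary spam filter (assumption: spam is the positive class)."""
    tp: int  # spam correctly placed in the spam folder
    fp: int  # ham wrongly placed in the spam folder
    tn: int  # ham correctly placed in the ham folder
    fn: int  # spam wrongly placed in the ham folder


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def validation_measures(c: Confusion) -> dict:
    """Compute standard measures from the four confusion-matrix counts."""
    total = c.tp + c.fp + c.tn + c.fn
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    accuracy = _ratio(c.tp + c.tn, total)
    return {
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall.
        "f_measure": _ratio(2 * precision * recall, precision + recall),
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        # Share of ham that was wrongly filtered as spam.
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
        # Share of spam that slipped through as ham.
        "false_negative_rate": _ratio(c.fn, c.fn + c.tp),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in validation_measures(Confusion(tp=40, fp=10, tn=45, fn=5)).items():
        print(f"{name}: {value:.4f}")
```

For a spam filter, the false positive rate is the measure most closely tied to the stated aim of minimizing the risk of eliminating ham email.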
Article Preview

1. Introduction And Background

In today’s world of globalization and borderless technology, the appearance of the Internet and the rapid development of telecommunications have made the world a global village. In the last decade, the e-mail service has become enormously popular and the principal vector of communication because it is cheap, reliable, fast, and easily accessible. Moreover, it permits users with a mailbox (BAL) and an email address to exchange messages (pictures, files, and text documents) from anywhere in the world via the Internet. Regrettably, this technology has led to the emergence and further escalation of several problems. Among all the messages an individual receives in their mailbox, we recognize two forms:

  • HAM (Regular): Welcome email sent by friends or by websites the recipient has subscribed to, and meant for a specific person;

  • SPAM (Irregular): Unsolicited email (junk e-mail) sent in bulk, indiscriminately and disingenuously, to a large number of recipients, directly or indirectly, by malicious people (scammers). It contains a payload (obvious or hidden), generally intended to hijack the recipient or to serve commercial interests.

Spammers use numerous sending mechanisms, such as botnets, free email services, open proxies, and stolen netblocks. The major tactics they use to fool spam filters are HTML tricks, Bayesian poisoning, multilingual email, content morphing, image attachments, forcing secondary MX, circumventing IP reputation, and hiding the call to action (Hemalatha et al., 2015).

The nuisance caused by spam is not limited to the influx of undesired mail or the loss of legitimate mail; we can also identify different sorts of spam email, such as the Nigerian scam, FUD, hoaxes, telephony spam (Spim), and phishing, as illustrated in Figure 1. These forms of email are annoying, and users have many reasons not to want spam messages in their inbox: mailbox overload that makes email less practical; loss of time, which in a business means loss of money; consumption of network resources and bandwidth; loss of important emails; consumption of human resources and damage to the computer if the messages contain a virus; the risk of denial of service on the mail server; and disruption of network operation. It is a serious phenomenon in electronic life, which presents a major challenge and a security threat for mail server administrators and those responsible for information in organizations (Bouarara, 2015; Schieber & Hilbert, 2014). Generally, an email is divided into two parts: the header (containing the identifier and the name of the sender) and the body (the content of the email).
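The header/body split described above can be illustrated with Python's standard `email` package. This is a minimal sketch and is not the AHLS implementation: the sample message, the `BLACKLIST` set, and the helper names `split_email` and `is_blacklisted` are hypothetical, and the message is assumed to be a single-part plain-text email. The blacklist check only hints at how a selection step based on the sender identifier might work.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical single-part, plain-text message used only for illustration.
RAW_EMAIL = """\
From: Example Sender <offers@bulk-mailer.example>
To: recipient@mail.example
Subject: Limited offer

You have been selected for a special prize.
"""

# Hypothetical blacklist of sender identifiers (assumption, not from the paper).
BLACKLIST = {"offers@bulk-mailer.example"}


def split_email(raw: str) -> dict:
    """Split a raw email into sender name, sender identifier, and body."""
    msg = message_from_string(raw)
    # parseaddr separates the display name from the address in the From header.
    name, address = parseaddr(msg.get("From", ""))
    # For a non-multipart message, get_payload() returns the body as a string.
    body = msg.get_payload()
    return {"sender_name": name, "sender_id": address.lower(), "body": body}


def is_blacklisted(parts: dict, blacklist: set) -> bool:
    """Return True when the sender identifier appears in the blacklist."""
    return parts["sender_id"] in blacklist


if __name__ == "__main__":
    parts = split_email(RAW_EMAIL)
    print(parts["sender_name"], parts["sender_id"])
    print("blacklisted:", is_blacklisted(parts, BLACKLIST))
```

Only the header is needed for this check, so a message from a blacklisted sender could be stopped before any processing of the body.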

Figure 1. Several examples of spam email

According to the most recent report of the Radicati Group, released in 2013 (Radicati, 2012), which supplies quantitative and qualitative research on e-mail, security, and social networks, 70-80% of email traffic is composed of spam. More details from this report are given in Table 1 and Figure 2.

Table 1. Radicati Group email statistics

| Radicati Group statistic | Value |
|---|---|
| Active email accounts in the world | 2.9 trillion |
| People who use email regularly | 2.4 billion |
| Number of emails sent this year | 67 trillion |
| Average number of emails sent every day | 182.9 billion |
| Percentage of spam email | 81% |
| Cost of spam to all U.S. corporations | $9.4 billion |
| People who changed their email address due to spam | 16% |
| Multilingual spam emails (emails written in different languages) | 43% of all spam emails |
