Email Worm Detection Using Data Mining

Email Worm Detection Using Data Mining

Mohammad M. Masud (University of Texas at Dallas, USA), Latifur Khan (University of Texas at Dallas, USA) and Bhavani Thuraisingham (University of Texas at Dallas, USA)
DOI: 10.4018/978-1-60566-210-7.ch002
OnDemand PDF Download:


This chapter applies data mining techniques to detect email worms. Email messages contain a number of different features such as the total number of words in message body/subject, presence/absence of binary attachments, type of attachments, and so on. The goal is to obtain an efficient classification model based on these features. The solution consists of several steps. First, the number of features is reduced using two different approaches: feature-selection and dimension-reduction. This step is necessary to reduce noise and redundancy from the data. The feature-selection technique is called Two-phase Selection (TPS), which is a novel combination of decision tree and greedy selection algorithm. The dimensionreduction is performed by Principal Component Analysis. Second, the reduced data is used to train a classifier. Different classification techniques have been used, such as Support Vector Machine (SVM), Naïve Bayes and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms. These results have been compared with published results. It is found that the proposed TPS selection along with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.
Chapter Preview


Email worm spreads through infected email messages. The worm may be carried by attachment, or the email may contain links to an infected website. When the user opens the attachment, or clicks the link, the host gets infected immediately. The worm exploits the vulnerable email software in the host machine to send infected emails to addresses stored in address book. Thus, new machines get infected. Worms bring damage to computer and people in various ways. They may clog the network traffic, cause damage to the system and make the system unstable or even unusable.

The traditional way of worm detection is signature based. A signature is a unique pattern in the worm body that can identify it as a particular type of worm. Thus, a worm can be detected from its signature. But the problem with this approach is that it involves significant amount of human intervention and may take long time (from days to weeks) to discover the signature. Thus, this approach is not useful against “zero-day” attacks of computer worm. Besides, signature matching is not effective against polymorphism.

Thus, there is a growing need for a fast and effective detection mechanism that requires no manual intervention. Our work is directed towards automatic and efficient detection of email worms. We apply a feature-based approach for this purpose. A number of features of email messages have been identified in (Martin, Sewani, Nelson, Chen & Joseph, 2005a), and discussed in “Feature reduction & Classification” section. The total number of features is large, some of which may be redundant or noisy. So we apply two different feature-reduction techniques: a dimension-reduction technique called Principal Component Analysis (PCA), and our novel feature-selection technique called Two-phase Selection (TPS) that applies decision tree and greedy elimination. These features are used to train a classifier to obtain a classification model. We use three different classifiers for this task: Support Vector Machine (SVM), Naïve Bayes (NB), and a combination of SVM and NB, mentioned henceforth as the Series classifier. The Series approach was first proposed by Martin, Sewani, Nelson, Chen & Joseph (2005b).

We use the data set of (Martin et al., 2005a) for evaluation purpose. The original data distribution was unbalanced, so we balance it by rearranging. We divide the data set into two disjoint subsets: the known worms set or K-Set and the novel worms set or N-Set. The K-Set contains some clean emails and emails infected by five different types of worms. The K-Set contains emails infected by a sixth type worm, but no clean email. We run a three-fold cross validation on the K-Set. At each iteration of the cross validation, we test the accuracy of the trained classifiers on the N-Set. Thus, we obtain two different measures of accuracy, namely, the accuracy of the three-fold cross validation on K-Set, and the average accuracy of novel worm detection on N-Set.

Our contributions to this research work are as follows: First, we apply two special feature-reduction techniques to remove redundancy and noise from data. One technique is PCA, and the other is our novel TPS algorithm. PCA is commonly used to extract patterns from high dimensional data, especially when the data is noisy. Besides, it is a simple and nonparametric method. TPS applies decision tree C4.5 (Quinlan, 1993) for initial selection, and thereafter it applies greedy elimination technique (see section “Two-phase Feature Selection (TPS)”). Second, we create a balanced dataset as explained above. Finally, we compare the individual performances among NB, SVM, and Series and show empirically that the Series approach proposed by Martin et al. (2005b) performs worse than either NB or SVM.

The rest of this paper is organized as follows: section “Related Work” describes related work in automatic email worm detection; section “Feature Reduction and Classification Techniques” describes the feature-selection, dimension-reduction and classification techniques; section “Data Set” describes the distribution of the data set; section “Experimental setup” describes the experimental setup such as hardware, software and system parameters; section “Results” discusses the results; and section “Conclusions” concludes with future guidelines for research.

Complete Chapter List

Search this Book: