Unsolicited commercial email (UCE), also known as spam, is becoming a serious problem for Internet users and providers (Fawcett, 2003). Several researchers have applied machine learning techniques to improve the detection of spam messages. Naive Bayes models are the most popular (Androutsopoulos, 2000), but other authors have applied Support Vector Machines (SVM) (Drucker, 1999), boosting and decision trees (Carreras, 2001) with remarkable results. The SVM has proved particularly attractive in this application because it is robust against noise and is able to handle a large number of features (Vapnik, 1998).
Errors in anti-spam email filtering are strongly asymmetric: false positive errors, that is, valid messages that are blocked, are prohibitively expensive. Several authors have proposed new versions of the original SVM algorithm that help to reduce the false positive errors (Kolz, 2001, Valentini, 2004 & Kittler, 1998). In particular, it has been suggested that combining non-optimal classifiers can help to reduce the variance of the predictor (Valentini, 2004 & Kittler, 1998) and, consequently, the misclassification errors. To achieve this goal, different versions of the classifier are usually built by sampling the patterns or the features (Breiman, 1996). However, in our application it is expected that the aggregation of strong classifiers will reduce the false positive errors further (Provost, 2001 & Hershop, 2005).
In this paper, we address the problem of reducing the false positive errors by combining classifiers based on multiple dissimilarities. To this aim, a diverse set of classifiers is built by considering dissimilarities that reflect different features of the data. The dissimilarities are first embedded into a Euclidean space, where an SVM is adjusted for each measure. Next, the classifiers are aggregated using a voting strategy (Kittler, 1998).
The proposed method has been applied to the Spam UCI machine learning database (Hastie, 2001) with remarkable results.
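The aggregation step can be illustrated with a minimal sketch of majority voting over several base classifiers. This is not the authors' implementation: the three rule-based classifiers, the feature names (`exclamations`, `capital_ratio`, `mentions_free`) and the tie-breaking rule are hypothetical stand-ins for the per-dissimilarity SVMs, chosen only to keep the example self-contained.

```python
from collections import Counter


def majority_vote(labels, tie_label="ham"):
    """Return the most frequent label among the base classifiers' votes.

    Assumption: ties are resolved in favour of `tie_label` ("ham"), so that
    an undecided ensemble never blocks a valid message.
    """
    if not labels:
        raise ValueError("at least one vote is required")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


def combine(classifiers, message):
    """Ask every base classifier for a label and aggregate by majority vote."""
    return majority_vote([classify(message) for classify in classifiers])


# Hypothetical base classifiers; in the paper each one would be an SVM
# trained on the embedding of a different dissimilarity.
def by_exclamations(features):
    return "spam" if features["exclamations"] > 3 else "ham"


def by_capitals(features):
    return "spam" if features["capital_ratio"] > 0.5 else "ham"


def by_keyword(features):
    return "spam" if features["mentions_free"] else "ham"


toy_classifiers = [by_exclamations, by_capitals, by_keyword]

# Only one of the three rules fires, so the message is kept.
enthusiastic_note = {"exclamations": 5, "capital_ratio": 0.1, "mentions_free": False}
# Two of the three rules fire, so the message is flagged.
loud_advert = {"exclamations": 5, "capital_ratio": 0.8, "mentions_free": False}

print(combine(toy_classifiers, enthusiastic_note))
print(combine(toy_classifiers, loud_advert))
```

In this sketch a message is flagged only when most base classifiers agree, so a single classifier that misfires on a valid message is outvoted by the others.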
The Problem Of Dissimilarities Revisited
An important step in the design of a classifier is the choice of a dissimilarity that properly reflects the proximities among the objects. However, choosing a good dissimilarity for the problem at hand is not an easy task. Each measure reflects different features of the dataset, and no dissimilarity outperforms the others in a wide range of problems. In this section, we briefly discuss the main differences among several dissimilarities that can be applied to model the proximities among emails. For a more detailed description and definitions see, for instance, (Cox, 2001).
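As a small illustration of how two measures can disagree, the sketch below compares the Euclidean distance with a cosine-based dissimilarity on toy term-count vectors. The choice of these two measures and the three example vectors are assumptions made for the illustration; they are not taken from the dataset used in the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance: sensitive to the magnitude of the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_dissimilarity(a, b):
    """One minus the cosine similarity: sensitive only to direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention for empty vectors (an assumption of this sketch)
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, candidates, dissimilarity):
    """Name of the candidate vector closest to `query` under `dissimilarity`."""
    return min(candidates, key=lambda name: dissimilarity(query, candidates[name]))


# Hypothetical term counts over a three-word vocabulary.
short_ad = [2, 1, 0]
candidates = {
    "long_ad": [20, 10, 0],    # same word proportions, ten times longer
    "short_note": [0, 1, 2],   # similar length, different vocabulary
}

print(nearest(short_ad, candidates, euclidean))
print(nearest(short_ad, candidates, cosine_dissimilarity))
```

Under the Euclidean distance the short message is closest to the other short message, whereas the cosine dissimilarity ignores message length and pairs it with the longer message that uses the same words in the same proportions. Each measure therefore induces a different neighbourhood structure, which is what makes classifiers built on different dissimilarities diverse.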
Key Terms in this Chapter
UCE: Unsolicited Commercial Email, also known as Spam
K-NN: K-Nearest Neighbor algorithm for classification purposes.
Dissimilarity: A measure of proximity between objects that need not obey the triangle inequality.
SVD: Singular Value Decomposition. Linear algebra operation that is used by many optimization algorithms.
Kernel: Non-linear transformation to a high dimensional feature space.
Bootstrap: Resampling technique based on several random samples drawn with replacement.
SVM: Support Vector Machines classifier.
MDS: Multidimensional Scaling. Algorithm applied to the visualization of high-dimensional data.