Visual Feature-Based Image Spam Filters


Copyright: © 2017 |Pages: 24
DOI: 10.4018/978-1-68318-013-5.ch006


This chapter describes visual feature-based image spam filters, reviews the literature on them, and discusses their limitations. These methods are generally computationally efficient and, because they include no text recognition stage, more accurate in the presence of noise than OCR-based detection schemes (Lamia et al., 2012). The near-duplicate spam detection methods discussed previously are likely to perform well at abstracting base templates when given enough examples of the spam templates in use (Mehta et al., 2008); however, their generalization ability is limited. Visual feature-based spam detection methods are generally built on high-level and/or low-level image features (see Chapter 3 of this book) related to the color, shape, and texture characteristics of spam images; hence they generalize better (Lamia et al., 2012). Most of these techniques exploit the text-intensive and noisy nature of spam images.

6.1. Previous Work

This section reviews notable prior work that uses image feature-based techniques for spam detection.

Aradhye et al. (2005) exploit the text-intensive nature of image spam by calculating three features: 1) extent of text, the fraction of the total image area that falls in text regions; 2) color saturation, the fraction of the pixels in the image for which the difference max(R,G,B) - min(R,G,B) is greater than a threshold T (here, T = 50); and 3) color heterogeneity. Figure 1 shows the distribution of the color saturation feature for spam and ham images; the two classes are well separated on this feature. No comparable separation is observed for the color heterogeneity feature. The authors claimed approximately 80% detection accuracy.

Figure 1. Distribution of color saturation feature for spam and ham images
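
The color saturation feature defined above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `color_saturation_fraction` is hypothetical, and the image is assumed to be available as a flat list of (R, G, B) tuples with 8-bit channel values.

```python
def color_saturation_fraction(pixels, threshold=50):
    """Fraction of pixels whose max(R,G,B) - min(R,G,B) exceeds `threshold`.

    `pixels` is assumed to be a flat list of (R, G, B) tuples with
    channel values in 0..255. The default threshold follows T = 50
    as quoted in the text.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    saturated = sum(
        1 for r, g, b in pixels if max(r, g, b) - min(r, g, b) > threshold
    )
    return saturated / len(pixels)


# Toy four-pixel image: two near-gray pixels and two strongly colored ones.
toy_pixels = [(200, 200, 200), (10, 12, 11), (255, 0, 0), (30, 120, 90)]
saturation = color_saturation_fraction(toy_pixels)
```

For the toy image, only the last two pixels have a channel spread above 50, so the feature is one half. A classifier would use this single number alongside the other features rather than on its own.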


The method proposed by Nhung and Phuong (2007) uses computationally efficient edge-based feature extraction to compute a vector of similarity measures (L1 distances) from an image to a small set of templates. Edge Directions (ED) and the Edge Orientation Autocorrelogram (EOAC) are used as translation- and scale-invariant edge-based features. ED is a histogram of edge angles and reflects global shape information. The scheme exploits the text-intensive nature of image spam, because text elements have distinctive shape characteristics that set them apart from the background and other elements. Figure 2(b) shows the output of edge detection with the Sobel operator for the sample spam image in Figure 2(a).

Figure 2. Edge direction feature for sample spam image
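
To make the ED feature concrete, the sketch below builds a normalized histogram of Sobel edge angles and compares two images by L1 distance. It is a minimal illustration, not the authors' code: the number of bins, the magnitude threshold, and the function names are assumptions, the image is assumed to be a 2-D list of grayscale values, and EOAC is not covered.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def edge_direction_histogram(gray, bins=8, min_magnitude=100.0):
    """Normalized histogram of Sobel gradient angles over strong edge pixels.

    `gray` is a 2-D list of grayscale values. `bins` and `min_magnitude`
    are illustrative choices, not values taken from the paper.
    """
    rows, cols = len(gray), len(gray[0])
    hist = [0] * bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    value = gray[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * value
                    gy += SOBEL_Y[j][i] * value
            if math.hypot(gx, gy) < min_magnitude:
                continue  # weak response: not treated as an edge pixel
            angle = math.atan2(gy, gx) % (2 * math.pi)  # map to [0, 2*pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += 1
    total = sum(hist)
    # Dividing by the edge-pixel count makes the histogram independent
    # of how many edge pixels the image contains.
    return [h / total for h in hist] if total else [0.0] * bins


def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two 6x6 toy images: one vertical edge and one horizontal edge.
vertical = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
horizontal = [[0] * 6 for _ in range(3)] + [[255] * 6 for _ in range(3)]

hist_vertical = edge_direction_histogram(vertical)
hist_horizontal = edge_direction_histogram(horizontal)
distance = l1_distance(hist_vertical, hist_horizontal)
```

In the scheme described above, distances of this kind to each template would form the feature vector passed to the classifier.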


The authors used the SVM classifier in the Weka tool and experimented on a personal dataset only. They claim an overall accuracy of 80% for the scheme. Using edge-based features alone allows fast processing and captures regularities in the shapes of text-intensive spam images, but it may fail to generalize.

In the same year, Byun et al. (2007) considered four spam image properties for image-based spam detection: color moment, color heterogeneity, conspicuousness, and self-similarity. They applied multi-class characterization instead of single-class characterization to improve detection robustness, together with the maximal figure-of-merit (MFoM) learning algorithm to design the classifiers. Spam images are first categorized as either synthetic (text-intensive, artificially modified images with diverse background regions) or non-synthetic (images with no artificial modifications). Figure 3(a)-(f) shows the distribution of first- and second-order color moments in spam and ham images. Here the first-order moments show wider separation than the second-order central moments.

Figure 3. Distribution of first and second order color moments in spam and ham images
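
As a rough illustration of the color moment feature, the sketch below computes the first-order moment (mean) and second-order central moment (variance) of each channel. The color space and exact moment definitions used by Byun et al. are not given in this section, so the use of RGB, the population variance, and the name `color_moments` are all assumptions.

```python
def color_moments(pixels):
    """Per-channel mean and variance of a flat list of (R, G, B) tuples.

    Returns (means, variances), each a 3-tuple ordered as R, G, B.
    RGB and population variance are illustrative assumptions.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    n = len(pixels)
    means = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    variances = tuple(
        sum((p[c] - means[c]) ** 2 for p in pixels) / n for c in range(3)
    )
    return means, variances


# Toy image: red varies across pixels, green and blue are constant.
toy_pixels = [(0, 10, 255), (100, 10, 255), (200, 10, 255)]
means, variances = color_moments(toy_pixels)
```

The six resulting numbers (three means, three variances) correspond to the six kinds of distribution that a figure with panels (a)-(f) could plot, one per channel and moment order.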


The authors calculated color heterogeneity by first scaling the image by the maximum possible intensity in the RGB channels and then converting the scaled image to an indexed image using minimum variance quantization. The RMS error between the original image and the indexed image is used as the color heterogeneity feature. We did not find this feature significant in our experiments, although natural images have more color heterogeneity and hence lower RMS errors than spam images. The conspicuousness feature, based on the high-contrast property of spam images, and the self-similarity feature, based on their uniform backgrounds, are computationally expensive to calculate. The authors claimed a detection rate of 81.5% with 5.6% misclassification of legitimate images, and performance that compares favorably with the scheme of Aradhye et al. (2005).
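
The scale, quantize, and RMS-error pipeline for color heterogeneity can be sketched as follows. Note the substitution: for brevity this example uses simple uniform quantization per channel, not the minimum variance quantization used by the authors, so its numbers will differ from theirs. The function name and the `levels` parameter are hypothetical.

```python
import math


def color_heterogeneity(pixels, levels=4, max_intensity=255):
    """RMS error between a scaled image and its quantized version.

    `pixels` is a flat list of (R, G, B) tuples. Each channel value is
    scaled to [0, 1] and snapped to the nearest of `levels` evenly
    spaced values. Uniform quantization is a stand-in here for
    minimum variance quantization.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    if levels < 2:
        raise ValueError("levels must be at least 2")
    squared_error = 0.0
    for pixel in pixels:
        for value in pixel:
            scaled = value / max_intensity
            quantized = round(scaled * (levels - 1)) / (levels - 1)
            squared_error += (scaled - quantized) ** 2
    return math.sqrt(squared_error / (3 * len(pixels)))


# Pixels already on the two-level palette quantize without error.
flat_error = color_heterogeneity([(0, 255, 0), (255, 255, 255)], levels=2)

# A dark gray pixel is pulled to black, leaving an error of 0.2 per channel.
gray_error = color_heterogeneity([(51, 51, 51)], levels=2)
```

An adaptive palette, such as one produced by minimum variance quantization, would fit each image's own colors, so the error would then depend on how many distinct colors the image contains relative to the palette size.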
