Image Spam: Feature Extraction

Copyright: © 2017 |Pages: 32
DOI: 10.4018/978-1-68318-013-5.ch003

Abstract

Spam features are the distinctive characteristics of spam that are used to differentiate it from genuine messages. Each message m is processed by a feature extraction module that represents m as an n-dimensional feature vector x = (x1, x2, …, xn). In text-based spam filters, for example, a feature can be a word, and a feature vector may be composed of the words extracted from a message. Each message is associated with one feature vector. Based on the characteristics discussed in the previous chapter, this chapter extracts features from image spam that capture those characteristics, in order to build robust spam detection algorithms. These features are broadly classified into high-level metadata features, low-level image features such as color, grayscale and texture-related features, and embedded-text-related features.
Chapter Preview

3.1. High-Level Metadata Features

High-level metadata features such as image width, height, aspect ratio, file size, compression and image area can be extracted from the image header for spam image categorization (He, Wen and Zheng, 2009; Wang et al., 2010). Figure 1, Figure 2, Figure 3 and Figure 4 show the distributions of File Size (feature Image_FileSize), Image Area (feature Image_Area), Compression Ratio (feature Compression_Ratio) and Aspect Ratio (feature Aspect_Ratio), respectively, for the hunter dataset, which contains 600 images in total (300 spam and 300 ham, i.e. genuine, images). These metadata features are easy to analyse and need little processing time, so they are extracted from both ham and spam images and can help distinguish natural images from spam images. For a two-dimensional image with length = m and width = n, Image_Area, Compression_Ratio and Aspect_Ratio are given by Equation 3.1, Equation 3.2 and Equation 3.3, respectively.

Image_Area = m × n    (3.1)
[Equation 3.2: definition of Compression_Ratio, equation image not available]    (3.2)
[Equation 3.3: definition of Aspect_Ratio, equation image not available]    (3.3)
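The metadata features named above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: the function name `metadata_features` and the class `MetadataFeatures` are hypothetical, and the formulas used for Aspect_Ratio (length divided by width) and Compression_Ratio (pixels per byte of file size) are assumptions based on common usage, which may differ from Equations 3.2 and 3.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataFeatures:
    image_file_size: int      # file size in bytes
    image_area: int           # m * n, in pixels
    aspect_ratio: float       # ASSUMPTION: m / n (length over width)
    compression_ratio: float  # ASSUMPTION: (m * n) / file size


def metadata_features(m: int, n: int, file_size_bytes: int) -> MetadataFeatures:
    """Compute header-level features for an image of length m and width n."""
    if m <= 0 or n <= 0 or file_size_bytes <= 0:
        raise ValueError("dimensions and file size must be positive")
    area = m * n
    return MetadataFeatures(
        image_file_size=file_size_bytes,
        image_area=area,
        aspect_ratio=m / n,
        compression_ratio=area / file_size_bytes,
    )


# Hypothetical banner-like image: 600 x 200 pixels stored in 30,000 bytes.
features = metadata_features(m=600, n=200, file_size_bytes=30_000)
print(features)
```

In practice m and n would be read from the image header (for example with an imaging library) and the file size from the file system; here they are passed in directly so the sketch stays self-contained.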

3.1.1. File Size

Spam images generally have larger file sizes than ham images. Figure 1 shows the histogram of file size for the dataset described above. The file size of spam images varies from 0 to 13 × 10^4 Kbytes, whereas that of ham images varies from 0 to 5 × 10^4 Kbytes. File size can therefore be a very good feature for segregating spam images in the first layer of a hierarchical model (Feng et al., 2011). The file size threshold should be chosen so that it yields very few false positives and false negatives. As the figure shows, if the threshold lies in the overlapping area, errors will be introduced into the result.

Figure 1. Histogram of file size of spam and natural images
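The threshold choice discussed in this section can be illustrated with a small sketch. It assumes a simple rule (file size above the threshold means spam) and searches the observed sizes for the threshold with the fewest combined false positives and false negatives. The function name `best_threshold` and the sample sizes are hypothetical and are not taken from the dataset.

```python
def best_threshold(spam_sizes, ham_sizes):
    """Return (threshold, false_positives, false_negatives).

    Rule assumed here: an image is flagged as spam when size > threshold.
    Candidate thresholds are the observed sizes; the first one with the
    lowest total error count is returned.
    """
    candidates = sorted(set(spam_sizes) | set(ham_sizes))
    best = None
    for t in candidates:
        fp = sum(1 for s in ham_sizes if s > t)    # ham flagged as spam
        fn = sum(1 for s in spam_sizes if s <= t)  # spam passed as ham
        if best is None or fp + fn < best[1] + best[2]:
            best = (t, fp, fn)
    return best


# Hypothetical file sizes in Kbytes, with overlapping ranges.
spam = [12, 35, 48, 60, 90, 130]
ham = [3, 8, 15, 20, 40, 50]

threshold, fp, fn = best_threshold(spam, ham)
print(threshold, fp, fn)
```

Because the two samples overlap, no threshold removes every error. The search only finds the least harmful compromise, which is the situation the histogram describes.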

3.1.2. Image Area

Many spam images are banners, which gives them distinctive dimensions of length and breadth. Figure 2 shows the histogram of image area for the dataset described above. The area of spam images varies from 0.3 to 5.5 × 10^5, whereas it varies from 0 to 0.7x10 for ham images. The distributions show little overlap between spam and genuine images; hence this feature can be exploited in the first layer of a hierarchical spam detection model. The area threshold should be chosen so that it yields very few false positives and false negatives. As the figure shows, if the threshold lies in the overlapping area, errors will be introduced into the result, whereas if it lies in the non-overlapping area, the results will be more accurate.

Figure 2. Histogram of image area of spam and natural images
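One way to use the non-overlapping regions in a first layer is to decide only when the area falls in a range occupied by a single class, and to defer everything else to later layers. The sketch below illustrates that idea under stated assumptions. The function names, the "defer" outcome and the sample areas are hypothetical, and this is not presented as the chapter's algorithm.

```python
def in_range(value, values):
    """True if value lies within the observed [min, max] of values."""
    return min(values) <= value <= max(values)


def first_layer_decision(area, spam_areas, ham_areas):
    """Classify by image area only when the evidence is unambiguous.

    Returns "spam" or "ham" when the area falls in the range of exactly
    one class, and "defer" when it falls in the overlapping region or
    outside both observed ranges.
    """
    in_spam = in_range(area, spam_areas)
    in_ham = in_range(area, ham_areas)
    if in_spam and not in_ham:
        return "spam"
    if in_ham and not in_spam:
        return "ham"
    return "defer"


# Hypothetical training areas in pixels (spam range overlaps the ham range).
spam_areas = [30_000, 120_000, 300_000, 550_000]
ham_areas = [5_000, 20_000, 45_000, 70_000]

for area in (10_000, 50_000, 400_000):
    print(area, first_layer_decision(area, spam_areas, ham_areas))
```

Deferring in the overlapping region trades coverage for accuracy: the first layer stays cheap and rarely wrong, while ambiguous images are passed to features that cost more to compute.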
