A Data Mining Driven Approach for Web Classification and Filtering Based on Multimodal Content Analysis

Mohamed Hammami (Faculté des Sciences de Sfax, Tunisia), Youssef Chahir (Université de Caen, France) and Liming Chen (Ecole Centrale de Lyon, France)
DOI: 10.4018/978-1-59904-274-9.ch002

The ever-growing Web has brought a proliferation of objectionable content, such as sex, violence, and racism, and efficient tools are needed for classifying and filtering undesirable Web content. In this chapter, we investigate this problem through WebGuard, our automatic machine-learning-based pornographic Web site classification and filtering system. Because the Internet is becoming increasingly visual and multimedia-oriented, as pornographic Web sites exemplify, we focus on using skin color-related visual content-based analysis alongside textual and structural content-based analysis to improve pornographic Web site filtering. Most commercial filtering products on the market rely mainly on textual content-based analysis, such as indicative keyword detection or checking against manually collected blacklists. The originality of our work lies in adding structural and visual content-based analysis to the classical textual content-based analysis, together with several major data mining techniques for learning and classification. On a test bed of 400 Web sites, comprising 200 adult sites and 200 nonpornographic ones, WebGuard achieved a 96.1% classification accuracy rate when only textual and structural content-based analysis was used, and a 97.4% classification accuracy rate when skin color-related visual content-based analysis was added. Further experiments on a blacklist of 12,311 adult Web sites manually collected and classified by the French Ministry of Education showed that WebGuard achieved an 87.82% classification accuracy rate using only textual and structural content-based analysis, and a 95.62% classification accuracy rate when visual content-based analysis was added. The basic framework of WebGuard can be applied to other Web site categorization problems that combine, as most do today, textual and visual content.

Complete Chapter List

Table of Contents
Jairo Gutierrez
Chapter 1
Varadharajan Sridhar, June Park
Survivability, also known as terminal reliability, refers to keeping at least one path between specified network nodes so that some or all of...
Design of High Capacity Survivable Networks
Chapter 2
Mohamed Hammami, Youssef Chahir, Liming Chen
Along with the ever growing Web is the proliferation of objectionable content, such as sex, violence, racism, and so forth. We need efficient tools...
A Data Mining Driven Approach for Web Classification and Filtering Based on Multimodal Content Analysis
Chapter 3
Bryan Houliston, Nurul Sarkar
Wi-Fi (also known as IEEE 802.11b) networks are gaining widespread popularity as wireless local area networks (WLANs) due to their simplicity in...
Wi-Fi Deployment in Large New Zealand Organizations: A Survey
Chapter 4
Kevin Curran, Noel Broderick
Over the years, the number of Web users has increased dramatically, unfortunately leading to the inherent problem of congestion. This can affect each...
Prevalent Factors Involved in Delays Associated with Page Downloads
Chapter 5
Ted Chia-Han Lo, Jairo Gutiérrez
The research reported in this chapter studied the relevance of the application of network quality of service (QoS) technologies for modern...
Network Quality of Service for Enterprise Resource Planning Systems: A Case Study Approach
Chapter 6
César García-Díaz, Fernando Beltrán
Congestion effects are the negative externalities or social costs that users generate on each other when using a shared network resource. Under a...
Cost-Based Congestion Pricing in Network Priority Models Using Axiomatic Cost Allocation Methods
Chapter 7
Ismail Khalil Ibrahim, Ashraf Ahmad, David Taniar
Mobile multimedia, referring to multimedia information exchange over wireless networks or wireless Internet, is made possible due to the popularity...
Mobile Multimedia: Communication Technologies, Business Drivers, Service and Applications
Chapter 8
Agustinus Borgy Waluyo, David Taniar, Bala Srinivasan
The emergence of wireless computing motivates radical changes in how information is obtained. Our paper discusses a practical realisation of an...
Mobile Information Systems in a Hospital Organization Setting
Chapter 9
Say Ying Lim, David Taniar, Bala Srinivasan
In this chapter, we present an extensive study of the available types of data caching in a mobile database environment. We explore the different...
Data Caching in a Mobile Database Environment
Chapter 10
John Goh, David Taniar
Mining walking patterns from mobile users represents an interesting research area in the field of data mining, which is about extracting patterns and...
Mining Walking Pattern from Mobile Users
Chapter 11
Subhankar Dhar
This chapter presents the current state of the art of mobile ad hoc networks and some important problems and challenges related to routing, power...
Applications and Future Trends in Mobile Ad Hoc Networks
Chapter 12
Kevin Curran, Elaine Smyth
Signal leakage means that wireless network communications can be picked up outside the physical boundaries of the building in which they are being...
Addressing WiFi Security Concerns
Chapter 13
Byung Kwan Lee, Seung Hae Yang, Tai-Chi Lee
Unlike the SET (secure electronic transaction) protocol, the SEEP (highly secure electronic payment) protocol proposed in this chapter uses ECC...
A SEEP Protocol Design Using 3BC, ECC(F2m) and HECC Algorithm
Chapter 14
Kevin Curran, John Honan
This chapter investigates the problem of e-mail spam, and identifies methods to minimize the volumes. The analysis focuses on the hashcash...
Fighting the Problem of Unsolicited E-Mail Using a Hashcash Proof-of-Work Approach
About the Authors