Semantics-Based Document Categorization Employing Semi-Supervised Learning

Jan Žižka (Mendel University in Brno, Czech Republic) and František Dařena (Mendel University in Brno, Czech Republic)
DOI: 10.4018/978-1-4666-8690-8.ch005


The automated categorization of unstructured textual documents according to their semantic content plays an important role, particularly given the ever-growing volume of such data originating from the Internet. When a sufficient number of labeled examples is available, a suitable supervised machine learning classifier can be trained. When no labels are available, an unsupervised learning method can be applied; however, the missing label information often leads to worse classification results. This chapter demonstrates a method based on semi-supervised learning in which a small set of manually labeled examples improves the categorization process in comparison with clustering, with results comparable to the output of supervised learning. As an illustration, a real-world dataset from the Internet is used as the input to supervised, unsupervised, and semi-supervised learning. The results are shown for different numbers of initial labeled samples used as “seeds” to automatically label the remaining volume of unlabeled items.
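The seeding idea in the abstract can be illustrated with a minimal sketch. The preview does not specify the chapter's actual algorithm, so the code below substitutes a generic self-training loop with a nearest-centroid rule over bag-of-words vectors. The function names, the toy documents, and the "pos"/"neg" labels are all hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Lowercased whitespace tokens with their counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def class_centroids(vectors, labels):
    # Summed term counts per class; cosine ignores the scale,
    # so a sum behaves like a mean centroid here.
    sums = {}
    for vec, lab in zip(vectors, labels):
        if lab is not None:
            sums.setdefault(lab, Counter()).update(vec)
    return sums

def self_train(docs, seed_labels, batch=1):
    # Assumed stand-in technique: repeatedly label the unlabeled
    # documents whose best class beats the runner-up by the widest
    # margin, then recompute the centroids.
    if all(lab is None for lab in seed_labels):
        raise ValueError("at least one labeled seed is required")
    vectors = [bag_of_words(d) for d in docs]
    labels = list(seed_labels)
    while any(lab is None for lab in labels):
        cents = class_centroids(vectors, labels)
        scored = []
        for i, lab in enumerate(labels):
            if lab is None:
                sims = sorted(
                    ((cosine(vectors[i], c), name) for name, c in cents.items()),
                    reverse=True,
                )
                best_sim, best_label = sims[0]
                runner_up = sims[1][0] if len(sims) > 1 else 0.0
                scored.append((best_sim - runner_up, i, best_label))
        scored.sort(reverse=True)
        for _, i, lab in scored[:batch]:
            labels[i] = lab
    return labels

# Hypothetical toy data: two labeled seeds, four unlabeled items (None).
docs = [
    "great hotel friendly staff clean room",
    "terrible hotel dirty room rude staff",
    "friendly staff and great breakfast",
    "dirty bathroom and rude reception",
    "great breakfast lovely view",
    "rude reception terrible noise",
]
seeds = ["pos", "neg", None, None, None, None]
predicted = self_train(docs, seeds)
print(predicted)
```

Labeling only the most confident items first lets each newly labeled document extend the class vocabulary before harder cases are decided. In this toy data, the fifth document shares a single word with the positive seed at the start and gains a second match once the third document has been labeled. With real data, the number and quality of seeds would be expected to matter, which is the effect the chapter reports on.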
Chapter Preview

Text Mining Using Machine Learning Approach

In line with the focus of this chapter, and without attempting an exact definition, text mining is generally understood as a specialized branch of data mining. Data mining is the computational process of discovering knowledge in large data sets of any type, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. Text mining focuses on revealing knowledge in large volumes of text data, namely by analyzing text in natural languages, which are (or, for some old languages, were) used by human beings. People use their spoken and written natural languages to communicate pieces of knowledge or information to each other. Current computer technology makes it possible to store textual data in very large volumes; however, such stores (databases) make sense only when the data can later be put to use, that is, when the part of it relevant to a particular task (information) can be retrieved and the knowledge hidden in it can be acquired.
