Semantics-Based Document Categorization Employing Semi-Supervised Learning


ISBN13: 9781522517597|ISBN10: 1522517596|EISBN13: 9781522517603
DOI: 10.4018/978-1-5225-1759-7.ch077

MLA

Žižka, Jan, and František Dařena. "Semantics-Based Document Categorization Employing Semi-Supervised Learning." Artificial Intelligence: Concepts, Methodologies, Tools, and Applications, edited by Information Resources Management Association, IGI Global, 2017, pp. 1884-1912. https://doi.org/10.4018/978-1-5225-1759-7.ch077

APA

Žižka, J., & Dařena, F. (2017). Semantics-Based Document Categorization Employing Semi-Supervised Learning. In Information Resources Management Association (Ed.), Artificial Intelligence: Concepts, Methodologies, Tools, and Applications (pp. 1884-1912). IGI Global. https://doi.org/10.4018/978-1-5225-1759-7.ch077

Chicago

Žižka, Jan, and František Dařena. "Semantics-Based Document Categorization Employing Semi-Supervised Learning." In Artificial Intelligence: Concepts, Methodologies, Tools, and Applications, edited by Information Resources Management Association, 1884-1912. Hershey, PA: IGI Global, 2017. https://doi.org/10.4018/978-1-5225-1759-7.ch077


Abstract

The automated categorization of unstructured textual documents according to their semantic content plays an important role, particularly given the ever-growing volume of such data originating from the Internet. Given a sufficient number of labeled examples, a suitable supervised machine-learning classifier can be trained. When no labels are available, an unsupervised learning method can be applied; however, the missing label information often leads to worse classification results. This chapter demonstrates a method based on semi-supervised learning in which a small set of manually labeled examples improves the categorization process in comparison with clustering, and the results are comparable with the output of supervised learning. For illustration, a real-world dataset from the Internet is used as the input to supervised, unsupervised, and semi-supervised learning. Results are shown for different numbers of starting labeled samples used as “seeds” to automatically label the remaining volume of unlabeled items.
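
The idea of using a few labeled "seeds" to label the remaining documents can be illustrated with a minimal self-training sketch. This is not the chapter's method, which the abstract does not specify. It swaps in a simple nearest-centroid classifier over bag-of-words vectors, and the function names, the `threshold` parameter, and the toy review texts are all assumptions invented for the illustration.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Turn a document into a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(labeled):
    """Sum the vectors of each class (cosine similarity ignores the scale)."""
    result = {}
    for vec, label in labeled:
        result.setdefault(label, Counter()).update(vec)
    return result


def self_train(seeds, unlabeled, threshold=0.3):
    """Label `unlabeled` texts starting from a few (text, label) seeds.

    In each round, every pending document whose best similarity to a class
    centroid reaches `threshold` (an assumed, hypothetical cutoff) is labeled
    and added to the training set. The loop stops when no document qualifies.
    """
    labeled = [(bag_of_words(text), label) for text, label in seeds]
    pending = {text: bag_of_words(text) for text in unlabeled}
    assigned = {}
    while pending:
        cents = centroids(labeled)
        accepted = []
        for text, vec in pending.items():
            label, score = max(
                ((y, cosine(vec, c)) for y, c in cents.items()),
                key=lambda pair: pair[1],
            )
            if score >= threshold:
                accepted.append((text, vec, label))
        if not accepted:
            break
        for text, vec, label in accepted:
            assigned[text] = label
            labeled.append((vec, label))
            del pending[text]
    return assigned


# Invented toy data, not the dataset used in the chapter.
seeds = [
    ("great hotel friendly staff", "pos"),
    ("dirty room rude staff", "neg"),
]
unlabeled = [
    "friendly staff and great breakfast",
    "rude reception and dirty bathroom",
    "breakfast was great",
    "bathroom was dirty",
]
result = self_train(seeds, unlabeled)
```

With this toy data, the two shorter documents fall below the threshold in the first round and are labeled only in the second round, after the centroids have absorbed words such as "breakfast" and "bathroom" from the newly labeled documents. That growth of the training set is what distinguishes self-training from classifying once with the seeds alone.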
