Multidimensional Text Warehousing for Automated Text Classification

Jiyun Kim (University of Seoul, Seoul, Korea) and Han-joon Kim (University of Seoul, Seoul, Korea)
Copyright: © 2018 | Pages: 16
DOI: 10.4018/JITR.2018040110

Abstract

In the era of big data, a data warehouse is an integrated multidimensional database that provides the basis for the decision making required to establish crucial business strategies. Efficient, effective analysis requires a data organization system that integrates and manages data of various dimensions. However, conventional data warehousing techniques do not consider the various data manipulation operations required for data-mining activities. With the current explosion of text data, much research has examined text (or document) repositories to support text mining and document retrieval. This article therefore presents a method of developing a text warehouse that provides a machine-learning-based text classification service. Each document is represented as a term-by-concept matrix using a 3rd-order tensor-based textual representation model, which emphasizes the meaning of the words occurring in the document. As a result, the proposed text warehouse makes it possible to develop a semantic Naïve Bayes text classifier simply by executing appropriate SQL statements.

Preliminaries For Semantic Text Warehouse

Text Representation Model

In the vector space model, documents are represented as vectors in which each element is the weight of a term. By contrast, our semantic tensor space model represents documents as 2nd-order tensors (i.e., matrices) in $\mathbb{R}^{|S|} \otimes \mathbb{R}^{|V|}$, where $|S|$ is the number of concepts (or semantics), $|V|$ is the number of indexed terms, and $\mathbb{R}^{|S|}$ and $\mathbb{R}^{|V|}$ are the vector spaces for the concepts and terms, respectively. We regard the ‘concept space’ as an independent space on an equal footing with the ‘term’ and ‘document’ spaces used in the vector space model (Hong et al., 2015; Kim & Chang, 2014).

According to the principle of formal concept analysis, a concept is defined by a pair consisting of an ‘intent’ and an ‘extent’. Here, the extent is the set of instances that belong to the concept, and the intent is the set of all attributes common to the instances in the extent. In our work, the extent of a concept consists of the set of documents related to the concept, whereas the intent consists of a set of keywords extracted from those documents. Figure 1 illustrates a term-by-document matrix and a term-by-document-by-concept tensor representation for a given corpus. To represent a document corpus, rather than a term-by-document matrix, we can generate a 3rd-order tensor with distinct document, term, and concept spaces. As a result, we can also represent terms or concepts as matrices: given a 3rd-order tensor of a document corpus, we can represent a component of each space using the other two vector spaces. That is, we can represent a document as a concept-by-term matrix, a term as a concept-by-document matrix, and a concept as a term-by-document matrix.
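As a rough illustration of the intent/extent definition at the start of this passage, the sketch below derives the intent of a set of documents as the keywords they all share, and then the extent of that intent. The toy `context` mapping and the function names `intent_of` and `extent_of` are assumptions made for illustration. The code follows the generic formal concept analysis definition, not the authors' keyword-extraction procedure, which this excerpt does not describe.

```python
from typing import Dict, Set

# Hypothetical toy context: document id -> set of keywords (attributes).
context: Dict[str, Set[str]] = {
    "d1": {"tensor", "matrix", "warehouse"},
    "d2": {"tensor", "matrix", "classifier"},
    "d3": {"sql", "warehouse"},
}


def intent_of(extent: Set[str], ctx: Dict[str, Set[str]]) -> Set[str]:
    """Return the attributes shared by every document in the extent."""
    if not extent:
        # By convention, the empty extent shares every attribute.
        return set().union(*ctx.values())
    return set.intersection(*(ctx[doc] for doc in extent))


def extent_of(intent: Set[str], ctx: Dict[str, Set[str]]) -> Set[str]:
    """Return the documents that carry every attribute in the intent."""
    return {doc for doc, attrs in ctx.items() if intent <= attrs}


extent = {"d1", "d2"}
intent = intent_of(extent, context)   # keywords common to d1 and d2
closed = extent_of(intent, context)   # documents carrying all of those keywords

print(sorted(intent), sorted(closed))
```

Here the pair (`closed`, `intent`) forms a formal concept because mapping the intent back to documents returns the same extent that produced it.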

Figure 1. 3rd-order tensor of a document corpus
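The three matrix views of a 3rd-order corpus tensor can be sketched with plain nested lists. This is a minimal illustration, not the authors' implementation: the term, document, and concept labels, the weights, and the helper names `document_slice`, `term_slice`, and `concept_slice` are all hypothetical, and the axes are assumed to be ordered term, document, concept.

```python
from typing import List

Matrix = List[List[float]]

# Hypothetical labels for the three spaces.
terms = ["tensor", "sql", "bayes"]
docs = ["d1", "d2"]
concepts = ["mathematics", "databases"]

# tensor[t][d][c]: illustrative weight of term t in document d under concept c.
tensor: List[Matrix] = [
    [[2.0, 0.0], [0.0, 0.0]],  # "tensor"
    [[0.0, 1.0], [0.0, 3.0]],  # "sql"
    [[1.0, 0.0], [1.0, 1.0]],  # "bayes"
]


def document_slice(d: int) -> Matrix:
    """Concept-by-term matrix for document d."""
    return [[tensor[t][d][c] for t in range(len(terms))]
            for c in range(len(concepts))]


def term_slice(t: int) -> Matrix:
    """Concept-by-document matrix for term t."""
    return [[tensor[t][d][c] for d in range(len(docs))]
            for c in range(len(concepts))]


def concept_slice(c: int) -> Matrix:
    """Term-by-document matrix for concept c."""
    return [[tensor[t][d][c] for d in range(len(docs))]
            for t in range(len(terms))]


doc_view = document_slice(0)      # d1 as a 2 x 3 concept-by-term matrix
term_view = term_slice(1)         # "sql" as a 2 x 2 concept-by-document matrix
concept_view = concept_slice(1)   # "databases" as a 3 x 2 term-by-document matrix

print(doc_view)
```

Fixing one index and letting the other two vary is all that each view does, which is why any component of one space can be described by the remaining two spaces.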
