Text Representation Model
In the vector space model, documents are represented as vectors whose elements are term weights. By contrast, our semantic tensor space model represents documents as 2nd-order tensors (i.e., matrices) in ℜ^|S| ⊗ ℜ^|V|, where |S| is the number of concepts (or semantics), |V| is the number of indexed terms, and ℜ^|S| and ℜ^|V| are the vector spaces for concepts and terms, respectively. We regard the ‘concept space’ as an independent space on an equal footing with the ‘term’ and ‘document’ spaces used in the vector space model (Hong et al., 2015; Kim & Chang, 2014).
According to the principle of formal concept analysis, a concept is defined by a pair of an ‘intent’ and an ‘extent’. The extent is the set of instances included in the concept, and the intent is the set of attributes shared by all instances in the extent. In our work, the extent representing a concept is a set of documents related to the concept, whereas the intent is a set of keywords extracted from those documents. Figure 1 illustrates a term-by-document matrix representation and a term-by-document-by-concept tensor representation for a given corpus. To represent a document corpus, rather than a term-by-document matrix, we can generate a 3rd-order tensor with distinct document, term, and concept spaces. As a result, given a 3rd-order tensor of a document corpus, we can represent a component of each space as a matrix over the other two vector spaces. That is, we can represent a document as a concept-by-term matrix, a term as a concept-by-document matrix, and a concept as a term-by-document matrix.
Figure 1. 3rd-order tensor of a document corpus
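The mode slicing described above can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the tensor sizes and random values are assumptions standing in for real term and concept weightings.

```python
import numpy as np

# Toy dimensions (illustrative): |V| terms, |D| documents, |S| concepts.
n_terms, n_docs, n_concepts = 5, 3, 2

# A 3rd-order term-by-document-by-concept tensor for a corpus;
# random values stand in for real weightings.
rng = np.random.default_rng(0)
T = rng.random((n_terms, n_docs, n_concepts))

# Fixing one index yields a matrix over the other two spaces:
doc = T[:, 0, :].T      # document 0 as a concept-by-term matrix
term = T[0, :, :].T     # term 0 as a concept-by-document matrix
concept = T[:, :, 0]    # concept 0 as a term-by-document matrix

assert doc.shape == (n_concepts, n_terms)
assert term.shape == (n_concepts, n_docs)
assert concept.shape == (n_terms, n_docs)
```

Each slice is simply a cross-section of the same tensor, which is why the three matrix views of documents, terms, and concepts stay mutually consistent.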