Below we describe the details of the applied methods. First, we describe the basics of data preprocessing and feature extraction. Next, we briefly explain all classifiers with their settings and the modifications applied in the experiments, including the proposed CNN-based model.
Data Preprocessing
The sentences from the original dataset used in this study (Ptaszynski et al., 2010, 2015a, 2015b, 2016; Nitta et al., 2013) were preprocessed in the following ways:
- Tokenization: All words, punctuation marks, etc. are separated by spaces (later: TOK).
- Lemmatization: Like the above, but the words are represented in their generic (dictionary) forms, or “lemmas” (later: LEM).
- Parts of Speech: Words are replaced with their representative parts of speech (later: POS).
- Tokens With POS: Both words and POS information are included in one element (later: TOK+POS).
- Lemmas With POS: Like the above, but with lemmas instead of words (later: LEM+POS).
- Tokens With Named Entity Recognition: Words are encoded together with information on which named entities (private name of a person, organization, numericals, etc.) appear in the sentence. The NER information is annotated by CaboCha (later: TOK+NER).
- Lemmas With NER: Like the above, but with lemmas (later: LEM+NER).
- Chunking: Larger sub-parts of sentences separated syntactically, such as noun phrases, verb phrases, predicates, etc., but without dependency relations (later: CHNK).
- Dependency Structure: Same as above, but with information regarding syntactic relations between chunks (later: DEP).
- Chunking With NER: Information on named entities is encoded in chunks (later: CHNK+NER).
- Dependency Structure With Named Entities: Both dependency relations and named entities are included in each element (later: DEP+NER).
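To make the word-level representation schemes concrete, the following is a minimal sketch (not the actual pipeline, which relies on a morphological analyzer such as MeCab and on CaboCha) that renders one hand-analyzed sentence in the TOK, LEM, POS, and combined schemes; the per-token analysis and the `represent` helper are illustrative assumptions:

```python
# Hypothetical per-token analysis (surface, lemma, POS) for the example
# phrase "kimochi_ii hi" ("pleasant day"); in the real experiments this
# information would come from a morphological analyzer, not be hardcoded.
analysis = [
    ("kimochi_ii", "kimochi_ii", "ADJ"),
    ("hi", "hi", "N"),
]

def represent(tokens, scheme):
    """Render one analyzed sentence in a given preprocessing scheme."""
    if scheme == "TOK":
        return " ".join(surface for surface, _, _ in tokens)
    if scheme == "LEM":
        return " ".join(lemma for _, lemma, _ in tokens)
    if scheme == "POS":
        return " ".join(pos for _, _, pos in tokens)
    if scheme == "TOK+POS":
        return " ".join(f"{surface}/{pos}" for surface, _, pos in tokens)
    if scheme == "LEM+POS":
        return " ".join(f"{lemma}/{pos}" for _, lemma, pos in tokens)
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("TOK", "LEM", "POS", "TOK+POS", "LEM+POS"):
    print(scheme, "->", represent(analysis, scheme))
# e.g. POS -> ADJ N, TOK+POS -> kimochi_ii/ADJ hi/N
```

The NER-, chunk-, and dependency-based schemes follow the same idea, but attach their labels to chunks rather than to individual tokens.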
Five examples of preprocessing are presented in Table 2 in Chapter 5. Theoretically, the more generalized a sentence representation is, the fewer unique patterns it will contain, but the produced patterns will be more frequent. For example, in the sentence from Table 2 in Chapter 5 we can see that the simple phrase kimochi_ii hi (“pleasant day”) is represented by the POS pattern ADJ N. We can safely assume that there will be more occurrences of ADJ N than of kimochi_ii hi, because many word combinations can be represented by this pattern. We compared the classification results of each classifier under the different preprocessing methods to find out whether it is better to represent sentences in a more generalized or a more specific way. Generalization is also closely related to the notion of Feature Density, which we propose to use for optimizing the proposed method.
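The frequency effect of generalization can be demonstrated with a small counting sketch. The three-sentence toy corpus below is an invented illustration (not data from the actual dataset): the specific token bigram kimochi_ii hi occurs once, while the generalized POS bigram ADJ N covers every sentence:

```python
from collections import Counter

# Invented toy corpus of (surface, POS) analyses -- illustrative only.
corpus = [
    [("kimochi_ii", "ADJ"), ("hi", "N")],   # "pleasant day"
    [("atsui", "ADJ"), ("hi", "N")],        # "hot day"
    [("samui", "ADJ"), ("yoru", "N")],      # "cold night"
]

def bigrams(sentence, level):
    """Extract bigrams at the token (level=0) or POS (level=1) layer."""
    items = [token[level] for token in sentence]
    return list(zip(items, items[1:]))

tok_counts = Counter(bg for s in corpus for bg in bigrams(s, 0))
pos_counts = Counter(bg for s in corpus for bg in bigrams(s, 1))

print(tok_counts[("kimochi_ii", "hi")])  # the specific phrase: 1 occurrence
print(pos_counts[("ADJ", "N")])          # the generalized pattern: 3 occurrences
```

The same trade-off drives the comparison across preprocessing schemes: generalized representations yield fewer distinct but more frequent patterns.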