Co-Occurrence-Based Error Correction Approach to Word Segmentation


Ekawat Chaowicharat (Mahidol University, Thailand) and Kanlaya Naruedomkul (Mahidol University, Thailand)
DOI: 10.4018/978-1-61350-447-5.ch023


A number of word segmentation algorithms have been proposed, but there is still room for improvement. This chapter proposes Co-occurrence-Based Error Correction (CBEC), a novel Thai word segmentation approach designed to provide accurate segmentation results based on context and purpose. CBEC works in three steps. First, it quickly segments the input string using any available algorithm; maximal matching was used in the experiment. Next, it checks the segmentation output against an error risk data bank to determine whether the output carries any error risk. The error risk data bank is built from a training corpus; the current version is based on the training corpus made available at BEST 2009. Finally, CBEC re-segments the input string using the co-occurrence score of the word sequence to ensure the accuracy of the segmentation result.
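The risk check and the co-occurrence-based re-segmentation can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the chapter: the error risk bank is modeled as a set of adjacent token pairs, the co-occurrence score as a sum of pair counts over all word pairs in a candidate, re-segmentation is assumed to run only when a risk is found, and the names `has_error_risk`, `cooccurrence_score`, and `cbec_correct` as well as the toy counts are hypothetical.

```python
from itertools import combinations


def all_segmentations(text, lexicon):
    """Enumerate every split of text into lexicon words (exponential; toy use only)."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        head = text[:j]
        if head in lexicon:
            for rest in all_segmentations(text[j:], lexicon):
                results.append([head] + rest)
    return results


def has_error_risk(tokens, risk_bank):
    """Assumed risk test: an adjacent token pair is listed in the risk bank."""
    return any(pair in risk_bank for pair in zip(tokens, tokens[1:]))


def cooccurrence_score(tokens, cooc):
    """Assumed scoring rule: sum the counts of every unordered word pair."""
    return sum(cooc.get(frozenset(p), 0) for p in combinations(tokens, 2))


def cbec_correct(text, initial_tokens, lexicon, risk_bank, cooc):
    """Keep the initial segmentation unless it is risky; then pick the best candidate."""
    if not has_error_risk(initial_tokens, risk_bank):
        return initial_tokens
    candidates = all_segmentations(text, lexicon)
    if not candidates:
        return initial_tokens
    return max(candidates, key=lambda c: cooccurrence_score(c, cooc))


# Toy data (hypothetical values, not from the BEST 2009 corpus).
lexicon = {"ตา", "กลม", "ตาก", "ลม", "โต"}
risk_bank = {("ตาก", "ลม")}
cooc = {
    frozenset({"ตา", "กลม"}): 3,
    frozenset({"กลม", "โต"}): 5,
    frozenset({"ตาก", "ลม"}): 2,
}
initial = ["ตาก", "ลม", "โต"]  # what a greedy longest-match pass would return
result = cbec_correct("ตากลมโต", initial, lexicon, risk_bank, cooc)
print(result)  # ['ตา', 'กลม', 'โต']
```

In this toy case the initial output contains the risky pair, so both candidate segmentations are scored and the one whose words co-occur more often wins. A practical version would restrict re-segmentation to the risky span instead of enumerating candidates for the whole string.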
Chapter Preview

Relevant Issues

We aim to deliver a segmentation approach that resolves the segmentation ambiguity problem and provides reliable results for any NLP application. To that end, a number of issues must be carefully explored, including the definition of the term “word”, the types of error that may occur in the segmentation output, and the effects of the relationships between the words that appear in the input string. Our approach begins by quickly segmenting the input string using any available segmentation algorithm. A maximal matching algorithm (Wang & Huang, 2005) was used in our experiment because it is fast and simple.
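To make the initial step concrete, the following is a minimal sketch of one common form of maximal matching: greedy, left-to-right, longest match against a lexicon. It is not necessarily the exact variant used in the experiment; the function name `maximal_matching`, the `max_word_len` parameter, and the four-word toy lexicon are assumptions for illustration.

```python
def maximal_matching(text, lexicon, max_word_len=None):
    """Greedy forward longest-match segmentation against a set of known words."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word is found.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No lexicon word starts here: emit a single character as unknown.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon (hypothetical): "ตา" eye, "กลม" round, "ตาก" to expose, "ลม" wind.
toy_lexicon = {"ตา", "กลม", "ตาก", "ลม"}
print(maximal_matching("ตากลม", toy_lexicon))  # ['ตาก', 'ลม']
```

The string "ตากลม" can be read as "ตา กลม" (round eyes) or "ตาก ลม" (exposed to wind). The greedy pass always commits to the longer first word, so it returns the second reading regardless of context, which is the kind of ambiguity a later correction step has to resolve.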
