Introduction
Online language learning is gaining popularity, and many sites offer some form of mutual correction. Learners from different native language backgrounds are paired, and each corrects the other’s writing as a native speaker of the language being learned. This setting differs from a traditional classroom in that there is no teacher to guide the learning process; it is essentially a student-to-student relationship. Previous research (Flanagan, 2013) proposed automatically generated quizzes as a way for online language learners to reflect on their past errors. However, a method of detecting errors is required before a learner can practice on errors similar to those they have made. In addition, the profile of a foreign language learner can contain valuable information about the problems they are likely to face during the learning process, and could be used to help personalize feedback. A particularly important attribute is the learner’s native language background, as it defines the language knowledge they already have. Native Language Identification (NLI) is the process of determining the native language of a foreign language learner by analyzing a piece of their writing. Fundamentally, the problem can be thought of as identifying characteristic features that reflect how a learner applies native language knowledge when using the language they are learning. Previous research has shown that learners from different native language backgrounds show different characteristics in their use of a foreign language (Swan, 2001).
Recently, research into the automation of NLI has been gaining popularity, and the process has several practical applications. These include providing targeted feedback on detected and potential errors in learner writing, based on known problems for native language groups, and forensic linguistic author profiling, where the native language of the author can be an important feature for investigation (Tetreault, 2013).
In this paper, we approach the problem of identifying characteristic differences, and of classifying learners’ native languages, from the perspective of writing errors. The motivation is that learner writing can contain words, in particular nouns, that have a strong relationship with the learner’s native language. While these words can be a good indicator of the learner’s native language, their use depends heavily on the subject or theme of the writing and has less to do with the language learning process. For example, the nouns used by a learner writing a personal diary differ from those used in an essay on a subject that requires specialist nouns, such as computer science or mathematics. Analysis of learner writing errors is less dependent on the subject of the writing, because the target of analysis is writing error concepts rather than the actual words of the learner’s writing.
A set of 15 predicted writing error scores, made from the normalized output of 15 different Support Vector Machine (SVM) classifiers trained in previous research (Flanagan, 2013), is used as the basis of this analysis. We refer to these predicted writing error scores as a 15-dimension error prediction vector. A preliminary investigation by clustering will be used to show the differences in co-occurring writing errors between native language groups. The error prediction vector will then be analyzed by SVM machine learning to classify a learner’s native language. As a naïve baseline against which to compare the effectiveness of the proposed method, we will classify the native language using all words. In the final section of this paper, we will examine the influence of words that have strong cultural or nationalistic associations, such as nouns representing people, places, food, and religion. A method of removing words that are characteristic of a native language will be proposed. This method will then be applied to filter cultural or nationalistic words out of the corpus, providing an alternative “non-biased” baseline for critical evaluation of the proposed error prediction vector method.
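As a rough illustration of how a 15-dimension error prediction vector might be assembled, the following Python sketch normalizes the raw outputs of 15 error classifiers for a small set of texts. It is a minimal sketch under stated assumptions, not the method used in this paper: the normalization scheme is not specified above, so per-classifier min-max scaling is assumed, and the function names and the toy raw scores are hypothetical.

```python
from typing import List

NUM_ERROR_CLASSIFIERS = 15  # one SVM per writing error category, as described in the text


def min_max_normalize(column: List[float]) -> List[float]:
    """Scale one classifier's raw scores to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A classifier that gives every text the same score carries no information.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def error_prediction_vectors(raw_scores: List[List[float]]) -> List[List[float]]:
    """Build one 15-dimension vector per text.

    raw_scores[i][j] is the hypothetical raw output of error classifier j on text i.
    """
    if any(len(row) != NUM_ERROR_CLASSIFIERS for row in raw_scores):
        raise ValueError("each text needs one score per error classifier")
    # Normalize column by column (per classifier), then transpose back to per-text rows.
    columns = [min_max_normalize(list(col)) for col in zip(*raw_scores)]
    return [list(row) for row in zip(*columns)]


# Toy raw scores for three texts; illustrative values only, not real classifier output.
toy_raw = [
    [float(j - 7) for j in range(NUM_ERROR_CLASSIFIERS)],
    [float(7 - j) for j in range(NUM_ERROR_CLASSIFIERS)],
    [0.0] * NUM_ERROR_CLASSIFIERS,
]
vectors = error_prediction_vectors(toy_raw)
```

Each row of `vectors` could then serve as the feature vector for one text in the clustering and native language classification steps, with the learner’s native language as the class label.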