A language model is a description of language. Although grammars have long been the prevalent tool for modelling language, interest has recently shifted towards statistical modelling. This chapter refers to speech recognition experiments, although statistical language models are applicable across a wide range of applications, such as machine translation and information retrieval.
Statistical modelling attempts to estimate the frequency of word sequences. If a sequence of words is $s = w_1 w_2 \ldots w_k$, its probability can be expressed as:

$$P(s) = \prod_{i=1}^{k} P(w_i \mid w_1 \ldots w_{i-1})$$
It is reasonable to simplify this computation by approximating word sequence generation as an (n-1)-order Markov process (Jelinek, 1998), in which each word is assumed to depend only on the n-1 preceding words. Bigram (n=2) and trigram (n=3) models are common choices. Even with the context limited in this way, such models have a vast number of probabilities that need to be estimated. The text available for building the model is called the "training corpus" and typically contains many millions of words. Unfortunately, even in a very large training corpus, many of the possible n-grams are never encountered. This problem is addressed by smoothing techniques (Chen & Goodman, 1996).
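The bigram case and the unseen n-gram problem can be illustrated with a minimal sketch. It assumes add-one (Laplace) smoothing, chosen here only because it is the simplest scheme to show; it is not claimed to be the technique used in the chapter's experiments. The function names and the two-sentence toy corpus are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous word, word) pairs
    # Units that can be predicted: all words plus the end-of-sentence marker.
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_prob(prev, word, unigrams, bigrams, vocab):
    """Add-one (Laplace) estimate of P(word | prev); an illustrative choice only."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def sentence_prob(sentence, unigrams, bigrams, vocab):
    """Probability of a sentence under the first-order Markov (bigram) approximation."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams, vocab)
    return prob

# Hypothetical toy corpus, far smaller than any real training corpus.
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
unigrams, bigrams, vocab = train_bigram(corpus)

seen = bigram_prob("the", "cat", unigrams, bigrams, vocab)    # bigram observed in training
unseen = bigram_prob("cat", "dog", unigrams, bigrams, vocab)  # bigram never observed
print(seen, unseen)
```

Without smoothing, the unseen bigram would receive probability zero and so would every sentence containing it. With add-one smoothing it receives a small non-zero estimate, at the cost of taking probability mass away from observed bigrams.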
Which is the best modelling unit? Words are a common choice, but units smaller (or larger) than words can also be used. Word-based n-gram models are well suited to modelling English (Jelinek, 1998). Inflective languages, however, have several characteristics that harm the predictive power of standard models.
In general, all Indo-European languages are inflective, but a serious problem arises with languages that are inflected to a greater extent (e.g. Russian, Czech, Slovenian). Agglutinative languages (e.g. Hungarian, Finnish, Estonian) have an even more complex inflectional grammar, and compound words pose a major problem in addition to inflections. Inflective languages add inflectional morphemes to words. Inflectional morphemes indicate the grammatical information of a word (for example case, number, person, etc.). They are commonly added by affixing, which includes prefixing (adding a morpheme before the base), suffixing (adding it after the base) and, much less commonly, infixing (adding it inside the base). A high degree of affixation contributes to an explosion in the number of different word forms, making it difficult, or even impossible, to estimate language model probabilities robustly. Rich morphology leads to high OOV (Out-Of-Vocabulary) rates, and data sparsity is therefore the main problem.
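The effect of inflection on the OOV rate can be sketched with a toy example. The code below compares word units with naive stem-and-ending units on a handful of Slovenian noun forms chosen only for illustration. The ending list, the `+` marker and the splitting rule are hypothetical simplifications and do not represent the segmentation approaches analyzed in this chapter.

```python
def oov_rate(test_tokens, vocabulary):
    """Percentage of test tokens that are not in the vocabulary."""
    unknown = sum(1 for token in test_tokens if token not in vocabulary)
    return 100.0 * unknown / len(test_tokens)

# Hypothetical toy ending list, an assumption made for this sketch only.
ENDINGS = ("a", "e", "i", "o")

def split_word(word):
    """Naively split a word into a stem and a '+'-marked one-letter ending."""
    if len(word) > 2 and word.endswith(ENDINGS):
        return [word[:-1], "+" + word[-1]]
    return [word]

def to_subwords(tokens):
    """Replace every word by its stem and ending units."""
    return [unit for token in tokens for unit in split_word(token)]

# Inflected forms of "miza" (table) and "knjiga" (book).
train = ["miza", "mize", "knjiga", "knjigo"]
test = ["mizo", "knjige", "miza", "knjigi"]

word_oov = oov_rate(test, set(train))
subword_oov = oov_rate(to_subwords(test), set(to_subwords(train)))
print(word_oov, subword_oov)
```

With word units, three of the four test forms are unknown although both nouns occur in the training data. With stem-and-ending units only the unseen ending remains unknown, which is the kind of sparsity reduction that motivates the choice of smaller modelling units.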
This chapter focuses on modelling unit choice for inflective languages with the aim of reducing data sparsity. Linguistic and data-driven approaches were analyzed for this purpose.
Key Terms in this Chapter
Vocabulary: A set of words (or other units) being modelled. The same vocabulary is used by the language model and the target application.
Sub-Word Unit: Modelling unit smaller than a word. Sub-word units are usually morphemes, stems and endings, roots, etc.
Corpus: A large collection of texts, usually in electronic form. A corpus has greater value if it is tokenized (segmented into sentences, words, etc.) and linguistically annotated (for example POS-tagged and lemmatized).
Perplexity: A measure of a language model’s quality. It can be interpreted as the geometric mean of the branching factor of the language model. A language model with perplexity X faces the same difficulty as it would with an imaginary language in which every word can be followed by X different words with equal probability.
Inflective Language: A language characterized by the use of inflections. Inflection is the modification of a word in order to reflect grammatical information, such as gender, number, person, etc.
Unknown Word: A single distinguished word to which all words not in the vocabulary are mapped. Vocabularies are typically fixed at tens of thousands of words.
Language Model: A description of language. In statistical language modelling it is a set of probability estimates.
Out-Of-Vocabulary Rate: The proportion of unknown words in a new sample of language (called a test set), usually expressed as a percentage.
n-Gram Model: A model based on the statistical properties of n-grams. An n-gram model predicts the i-th unit based on knowledge of the n-1 previous units. In n-gram modelling, the assumption is made that each unit depends only on the n-1 previously observed units. This is the main deficiency of n-gram modelling, since dependencies in language have been shown to span significantly longer ranges.
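The perplexity entry above can be made concrete with a short sketch. It assumes the standard definition of perplexity as the inverse geometric mean of the per-word probabilities that a model assigns to a test text. The function name and the probability values are hypothetical.

```python
import math

def perplexity(word_probs):
    """Perplexity from the conditional probabilities assigned to each test word."""
    log_sum = sum(math.log2(p) for p in word_probs)
    # 2 to the power of the average negative log2 probability per word.
    return 2 ** (-log_sum / len(word_probs))

# An imaginary language in which each word is followed by 50 equally likely words.
uniform = perplexity([1 / 50] * 200)

# A model that is more certain about some words than about others.
mixed = perplexity([0.5, 0.25, 0.5, 0.25])
print(uniform, mixed)
```

For the uniform case the result is 50 up to floating-point rounding, matching the interpretation of perplexity X as a choice among X equally probable words.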