Statistical Modelling of Highly Inflective Languages

Mirjam Sepesy Maucec; Zdravko Kacic

doi:10.4018/978-1-59904-849-9.ch215

Source Title: Encyclopedia of Artificial Intelligence

Abstract

A language model is a description of language. Although grammar was long the prevalent tool for modelling language, interest has recently shifted towards statistical modelling. This chapter refers to speech recognition experiments, although statistical language models are applicable across a wide range of applications, such as machine translation and information retrieval. Statistical modelling attempts to estimate the frequency of word sequences.

Chapter Preview

Introduction

Statistical modelling attempts to estimate the frequency of word sequences. If a sequence of words is s = w₁w₂...w_k, the probability can be expressed, by the chain rule, as:

$$P(s) = \prod_{i=1}^{k} P(w_i \mid w_1 w_2 \ldots w_{i-1})$$

It is reasonable to simplify this computation by approximating the word sequence generation as an (n-1)-order Markov process (Jelinek, 1998). Bigram (n=2) and trigram (n=3) models are common choices. Although the context is limited, such models still have a vast number of probabilities that need to be estimated. The text available for building the model is called the 'training corpus' and typically contains many millions of words. Unfortunately, even in a very large training corpus, many of the possible n-grams are never encountered. This problem is addressed by smoothing techniques (Chen & Goodman, 1996).
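The bigram case can be illustrated with a minimal sketch. The toy corpus, the function names, and the sentence-boundary markers `<s>` and `</s>` are assumptions made for illustration, and add-one smoothing is used here only because it is the simplest way to give unseen bigrams a non-zero probability; it is not claimed to be the technique used in the chapter.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences padded with <s>, </s>."""
    history_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        history_counts.update(padded[:-1])               # every position that has a successor
        bigram_counts.update(zip(padded[:-1], padded[1:]))
    # Units that can be predicted: all observed words plus the end marker.
    vocab = {w for tokens in sentences for w in tokens} | {"</s>"}
    return history_counts, bigram_counts, vocab

def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

# Toy training corpus (illustrative only).
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
history_counts, bigram_counts, vocab = train_bigram(corpus)

seen = bigram_prob("the", "cat", history_counts, bigram_counts, vocab)    # observed bigram
unseen = bigram_prob("cat", "dog", history_counts, bigram_counts, vocab)  # never observed
```

In this sketch the unseen bigram still receives a small probability, and the estimates for any fixed history sum to one over the vocabulary, which is the property smoothing has to preserve.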

Which is the best modelling unit? Words are a common choice, but units smaller (or larger) than words can also be used. Word-based n-gram models are best suited to modelling the English language (Jelinek, 1998). Inflective languages, however, have several characteristics that harm the predictive power of standard models.

In general, all Indo-European languages are inflective, but a serious problem arises for languages that are inflected to a greater extent (e.g., Russian, Czech, Slovenian). Agglutinative languages (e.g., Hungarian, Finnish, Estonian) have even more complex inflectional grammar where, besides inflections, compound words are a big problem. Inflective languages add inflectional morphemes to words. Inflectional morphemes indicate the grammatical information of a word (for example case, number, person, etc.). Inflectional morphemes are commonly added by affixing, which includes prefixing (adding a morpheme before the base), suffixing (adding it after the base), and, much less commonly, infixing (adding it inside the base). A high degree of affixation contributes to the explosion of different word forms, making it difficult, or even impossible, to robustly estimate language model probabilities. Rich morphology leads to high OOV (Out-Of-Vocabulary) rates and, therefore, data sparsity is the main problem.

This chapter focuses on modelling unit choice for inflective languages with the aim of reducing data sparsity. Linguistic and data-driven approaches were analyzed for this purpose.

Key Terms in this Chapter

Vocabulary: A set of words (or other units) being modelled. The same vocabulary is used by the language model and the target application.

Sub-Word Unit: A modelling unit smaller than a word. Sub-word units are usually morphemes, stems and endings, roots, etc.
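As a rough illustration of stem-and-ending units, the sketch below splits a word at its longest matching ending. The function name `split_word`, the `min_stem` parameter, and the small ending list are hypothetical choices made for this example; the word forms are inflected forms of the Slovenian noun "miza" (table), and a real system would need a far more complete linguistic or data-driven segmentation.

```python
def split_word(word, endings, min_stem=2):
    """Split a word into (stem, ending) using the longest matching ending.

    Returns (word, "") when no ending matches or the stem would be too short.
    """
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)], ending
    return word, ""

# Illustrative ending list (an assumption, not taken from the chapter).
ENDINGS = {"a", "e", "i", "o", "am", "ah", "ami"}

forms = ["miza", "mize", "mizi", "mizo", "mizam", "mizah", "mizami"]
units = [split_word(w, ENDINGS) for w in forms]
stems = {stem for stem, _ in units}
```

Here seven distinct word forms collapse to a single stem plus a handful of endings that are shared with other words, which is how sub-word units can reduce the number of distinct modelling units.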

Corpus: A large collection of texts, usually in electronic form. The corpus has greater value if it is tokenized (segmented into sentences, words, etc.) and linguistically annotated (for example, POS-tagged and lemmatized).

Perplexity: A measure of a language model’s quality. It can be interpreted as the geometric mean of the branching factor of the language model. A language model with perplexity X faces the same difficulty as it would with an imaginary language in which every word can be followed by X different words with equal probability.
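The interpretation above can be checked with a short sketch. It assumes the standard definition of perplexity as the exponential of the negative average log-probability that the model assigns to each word of a test sequence; the function name `perplexity` and the input format (a list of per-word probabilities) are assumptions for illustration.

```python
import math

def perplexity(word_probs):
    """Perplexity from the model probability assigned to each word of a test sequence."""
    if not word_probs:
        raise ValueError("at least one word probability is required")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# A model that always spreads its probability evenly over 10 possible next words.
uniform_ppl = perplexity([0.1] * 20)

# Branching factors of 2 and 8 at two positions: the geometric mean is 4.
mixed_ppl = perplexity([0.5, 0.125])
```

In the first case the perplexity is approximately 10, matching the imaginary language in which every word can be followed by 10 equally likely words.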

Inflective Language: A language characterized by the use of inflections. Inflection is the modification of a word in order to reflect grammatical information, such as gender, number, person, etc.

Unknown Word: Vocabularies are typically fixed at tens of thousands of words. All words not in the vocabulary are mapped to a single distinguished word, usually called the unknown word.

Language Model: A description of language. In statistical language modelling it is a set of probability estimates.

Out-Of-Vocabulary Rate: The proportion of unknown words in a new sample of language (called a test set), usually expressed as a percentage.
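The vocabulary, unknown-word, and OOV-rate definitions above fit together as in the following minimal sketch. The helper names, the `<unk>` symbol, the frequency-based vocabulary selection, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical symbol standing for the unknown word

def build_vocabulary(train_tokens, size):
    """Keep the `size` most frequent training words as the fixed vocabulary."""
    return {word for word, _ in Counter(train_tokens).most_common(size)}

def map_unknown(tokens, vocab):
    """Replace every out-of-vocabulary token with the unknown word."""
    return [w if w in vocab else UNK for w in tokens]

def oov_rate(test_tokens, vocab):
    """Percentage of test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    unknown = sum(1 for w in test_tokens if w not in vocab)
    return 100.0 * unknown / len(test_tokens)

train = "the cat sleeps the dog sleeps the cat eats".split()
test = "the bird sleeps soundly".split()

vocab = build_vocabulary(train, size=3)
mapped = map_unknown(test, vocab)
rate = oov_rate(test, vocab)
```

With this toy data two of the four test tokens fall outside the three-word vocabulary, so the OOV rate is 50%. For a highly inflective language the same vocabulary size covers fewer running words, because each lemma is spread over many word forms.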

n-Gram Model: A model based on the statistical properties of n-grams. An n-gram model predicts the i-th unit based on knowledge of the n-1 previous units. In n-gram modelling, the assumption is made that each unit depends only on the n-1 previously observed units. This is the main deficiency of n-gram modelling, because it has been shown that the range of dependencies is significantly longer.
