Evolutionary Grammatical Inference

Evolutionary Grammatical Inference

Ernesto Rodrigues (Federal University of Technology, Brazil) and Heitor Silvério Lopes (Federal University of Technology, Brazil)
Copyright: © 2009 |Pages: 7
DOI: 10.4018/978-1-59904-849-9.ch091
OnDemand PDF Download:
$37.50

Abstract

Grammatical Inference (also known as grammar induction) is the problem of learning a grammar for a language from a set of examples. In a broad sense, some data is presented to the learner that should return a grammar capable of explaining to some extent the input data. The grammar inferred from data can then be used to classify unseen data or provide some suitable model for it. The classical formalization of Grammatical Inference (GI) is known as Language Identification in the Limit (Gold, 1967). Here, there are a finite set S+ of strings known to belong to the language L (the positive examples) and another finite set S- of strings not belonging to L (the negative examples). The language L is said to be identifiable in the limit if there exists a procedure to find a grammar G such that S+ ? L(G), S- ? L(G) and, in the limit, for sufficiently large S+ and S-, L = L(G). The disjoint sets S+ and S- are given to provide clues for the inference of the production rules P of the unknown grammar G used to generate the language L. Grammatical inference include such diverse fields as speech and natural language processing, gene analysis, pattern recognition, image processing, sequence prediction, information retrieval, cryptography, and many more. An excellent source for a state-of-the art overview of the subject is provided in (de la Higuera, 2005). Traditionally, most work in GI has been focused on the inference of regular grammars trying to induce finite-state automata, which can be efficiently learned. For context free languages some recent approaches have shown limited success (Starckie, Costie & Zaanen, 2004), because the search space of possible grammars is infinite. Basically, the parenthesis and palindrome languages are common test cases for the effectiveness of grammatical inference methods. Both languages are context-free. The parenthesis language is deterministic but the palindrome language is nondeterministic (de la Higuera, 2005). The use of evolutionary methods for context-free grammatical inference are not new, but only a few attempts have been successful. Wyard (1991) used Genetic Algorithm (GA) to infer grammars for the language of correctly balanced and nested parentheses with success, but fails on the language of sentences containing the same number of a’s and b’s (anbn language). In another attempt (Wyard, 1994), he obtained positive results on the inference of two classes of context-free grammars: the class of n-symbol palindromes with 2 = n = 4 and a class of small natural language grammars. Sen and Janakiraman (1992) applied a GA using a pushdown automata to the inference and successfully learned the anbn language and the parentheses balancing problem. But their approach does not scale well. Huijsen (1994) applied GA to infer context-free grammars for the parentheses balancing problem, the language of equal numbers of a’s and b’s and the even-length 2-symbol palindromes. Huijsen uses a “markerbased” encoding scheme with has the main advantage of allowing variable length chromosomes. The inference of regular grammars was successful but the inference of context-free grammars failed. Those results obtained in earlier attempts using GA to context-free grammatical inference were limited. The first attempt to use Genetic Programming (GP) for grammatical inference used a pushdown automata (Dunay, 1994) and successfully learned the parenthesis language, but failed for the anbn language. Korkmaz and Ucoluk (2001) also presented a GP approach using a prototype theory, which provides a way to recognize similarity between the grammars in the population. With this representation, it is possible to recognize the so-called building blocks but the results are preliminary. Javed and his colleagues (2004) proposed a Genetic Programming (GP) approach with grammar-specific heuristic operators with non-random construction of the initial grammar population. Their approach succeeded in inducing small context-free grammars. More recently, Rodrigues and Lopes (2006) proposed a hybrid GP approach that uses a confusion matrix to compute the fitness. They also proposed a local search mechanism that uses information obtained from the sentence parsing to generate a set of useful productions. The system was used for the parenthesis and palindromes languages with success.
Chapter Preview
Top

Introduction

Grammatical Inference (also known as grammar induction) is the problem of learning a grammar for a language from a set of examples. In a broad sense, some data is presented to the learner that should return a grammar capable of explaining to some extent the input data. The grammar inferred from data can then be used to classify unseen data or provide some suitable model for it.

The classical formalization of Grammatical Inference (GI) is known as Language Identification in the Limit (Gold, 1967). Here, there are a finite set S+ of strings known to belong to the language L (the positive examples) and another finite set S- of strings not belonging to L (the negative examples). The language L is said to be identifiable in the limit if there exists a procedure to find a grammar G such that S+ ⊆ L(G), S- ⊄ L(G) and, in the limit, for sufficiently large S+ and S-, L = L(G). The disjoint sets S+ and S- are given to provide clues for the inference of the production rules P of the unknown grammar G used to generate the language L.

Grammatical inference include such diverse fields as speech and natural language processing, gene analysis, pattern recognition, image processing, sequence prediction, information retrieval, cryptography, and many more. An excellent source for a state-of-the art overview of the subject is provided in (de la Higuera, 2005).

Traditionally, most work in GI has been focused on the inference of regular grammars trying to induce finite-state automata, which can be efficiently learned. For context free languages some recent approaches have shown limited success (Starckie, Costie & Zaanen, 2004), because the search space of possible grammars is infinite. Basically, the parenthesis and palindrome languages are common test cases for the effectiveness of grammatical inference methods. Both languages are context-free. The parenthesis language is deterministic but the palindrome language is nondeterministic (de la Higuera, 2005).

The use of evolutionary methods for context-free grammatical inference are not new, but only a few attempts have been successful.

Wyard (1991) used Genetic Algorithm (GA) to infer grammars for the language of correctly balanced and nested parentheses with success, but fails on the language of sentences containing the same number of a’s and b’s (anbn language). In another attempt (Wyard, 1994), he obtained positive results on the inference of two classes of context-free grammars: the class of n-symbol palindromes with 2 ≤ n ≤ 4 and a class of small natural language grammars.

Sen and Janakiraman (1992) applied a GA using a pushdown automata to the inference and successfully learned the anbn language and the parentheses balancing problem. But their approach does not scale well.

Huijsen (1994) applied GA to infer context-free grammars for the parentheses balancing problem, the language of equal numbers of a’s and b’s and the even-length 
2-symbol palindromes. Huijsen uses a “markerbased” encoding scheme with has the main advantage of allowing variable length chromosomes. The inference of regular grammars was successful but the inference of context-free grammars failed.

Those results obtained in earlier attempts using GA to context-free grammatical inference were limited. The first attempt to use Genetic Programming (GP) for grammatical inference used a pushdown automata (Dunay, 1994) and successfully learned the parenthesis language, but failed for the anbn language.

Korkmaz and Ucoluk (2001) also presented a GP approach using a prototype theory, which provides a way to recognize similarity between the grammars in the population. With this representation, it is possible to recognize the so-called building blocks but the results are preliminary.

Javed and his colleagues (2004) proposed a Genetic Programming (GP) approach with grammar-specific heuristic operators with non-random construction of the initial grammar population. Their approach succeeded in inducing small context-free grammars.

Key Terms in this Chapter

Genetic Algorithm: A type of evolutionary computation algorithm in which candidate solutions are represented typically by vectors of integers or bit strings, that is, by vectors of binary values 0 and 1

Local Search: A type of search method that starts at some point in search space and iteratively moves from position to neighbouring position using heuristics.

CYK: A Cocke-Younger-Kasami algorithm used to determine whether the sentence can be generated by the grammar.

Search Space: Set of all candidate solutions of a given problem instance.

Heuristic: Function used for making certain decisions within an algorithm

Finite Automata: A model of behavior composed of a finite number of states, transitions between those states, and actions. They are used to recognize regular languages

Pushdown Automata: A finite automaton that can make use of a stack containing data. They are used to recognize context-free language.

in the context of search algorithms: typically used for guiding the search process

Evolutionary Computation: Large and diverse class of population-based search algorithms that is inspired by the process of biological evolution through selection, mutation and recombination. They are iterative algorithms that start with an initial population of candidate solutions and then repeatedly apply a series of the genetic operators

Complete Chapter List

Search this Book:
Reset