Issues in the Syntactic Parsing of Queries for a Natural Language Interface to Databases

Alexander Gelbukh, José A. Martínez F., Andres Verastegui, Alberto Ochoa
DOI: 10.4018/978-1-7998-4730-4.ch007

Abstract

This chapter presents an exhaustive parser developed for use in a natural language interface to databases (NLIDB) project. The chapter includes a brief description of state-of-the-art NLIDBs, covering the methods they use and the performance of some interfaces. Some of the general problems of natural language interfaces to databases are also explained. The exhaustive parser was developed with the aim of improving the overall performance of the interface; therefore, the interface is also briefly described. Finally, the chapter presents the drawbacks discovered during the experimental tests of the parser, which show that it is unsuitable for improving the performance of the NLIDB.

Introduction

Linguistics is the science that studies the origin, evolution, and structure of human language. Human beings acquire and master language easily because they are genetically and socially predisposed to acquire it naturally. The scientific study of human language, however, is a complicated task, mainly because the variability of language makes it difficult to construct a general theory that describes how language works.

In computer science, two types of languages are distinguished: formal languages and natural languages (NLs).

A formal language is a set of strings of symbols formed according to a rule or rules that determine how the symbols in a given collection can be combined (Lexico, 1920). A formal language is deliberately designed by people, who explicitly define the rules for its construction and use. Some examples of formal languages are programming languages and mathematical logic. In contrast, an NL is a language that has arisen in a natural way and has not been deliberately designed (Lexico, n.d.). NLs emerge spontaneously in human societies as a means for people to communicate with each other. Some examples of NLs are English, Russian, and Spanish. One of the main differences between the two types of language is variability. Formal languages have limited variability because they are subject to very precise rules that prevent ambiguity. NLs, on the contrary, are highly variable. This variability is a communicative advantage because it allows the same information to be expressed in many different ways, but from the computational point of view, it makes understanding that information a very challenging problem for a computer.

In computer science, Artificial Intelligence aims at developing systems that mimic human intelligence. Natural Language Processing (NLP) is a sub-area of Artificial Intelligence and, more specifically, of Computational Linguistics. NLP studies the interaction between computers and humans through NL (written or spoken), aiming at the implementation of systems that ease human-computer communication and make it as simple as communication among people. One sub-area of NLP is Natural Language Interfaces to Databases. A Natural Language Interface to Databases (NLIDB) is a system that allows the user to access information stored in a database by typing requests expressed in some natural language (Androutsopoulos, Ritchie, & Thanisch, 1995).

Nowadays, a huge amount of information is stored in databases (DBs), most commonly relational databases, and database query languages are used to retrieve it. A relational database is a self-describing collection of interrelated tables (Kroenke & Auer, 2012). More intuitively, a relational database is a database that complies with the relational model: it allows relations to be established among the data stored in different tables, and through those relations, the data of the tables can be combined. The query language most commonly used to access the information stored in a relational database is SQL (Structured Query Language). SQL, originally developed at IBM, is an international standard used by almost every relational database. It is used to query data, define data structures, modify data, and specify security restrictions in a database (ISO, 1989). Unfortunately, querying a database using SQL requires technical knowledge and expertise; therefore, it is difficult for a casual or inexperienced user to formulate queries in SQL. The vast majority of users who formulate queries in SQL are information technology professionals. NLIDBs were developed to make the information stored in databases accessible to anyone: their purpose is to allow inexperienced users to formulate database queries in natural language, without having to use SQL. An NLIDB translates NL queries into SQL queries, which extract the requested information from the database, and shows that information to the user.
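
The relational model and the role of SQL described above can be illustrated with a minimal sketch. The sketch uses Python's standard sqlite3 module; the schema (tables airline and flight), the sample rows, and the function names are hypothetical and are not taken from the chapter. It shows two tables related through a key and the kind of SQL query that a user would otherwise have to write by hand.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory relational database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE airline (
            airline_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        -- flight.airline_id relates each flight to one row of airline
        CREATE TABLE flight (
            flight_id   INTEGER PRIMARY KEY,
            airline_id  INTEGER NOT NULL REFERENCES airline(airline_id),
            origin      TEXT NOT NULL,
            destination TEXT NOT NULL
        );
        INSERT INTO airline VALUES (1, 'Aero Uno'), (2, 'Sky Dos');
        INSERT INTO flight VALUES
            (10, 1, 'Boston', 'Denver'),
            (11, 2, 'Boston', 'Atlanta'),
            (12, 1, 'Denver', 'Boston');
        """
    )
    return conn


def airlines_departing_from(conn: sqlite3.Connection, city: str) -> list[str]:
    """Return the names of the airlines with at least one flight leaving `city`."""
    # The join follows the relation between the two tables.
    sql = """
        SELECT DISTINCT a.name
        FROM flight AS f
        JOIN airline AS a ON a.airline_id = f.airline_id
        WHERE f.origin = ?
        ORDER BY a.name
    """
    return [row[0] for row in conn.execute(sql, (city,))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(airlines_departing_from(conn, "Boston"))  # ['Aero Uno', 'Sky Dos']
```

Even for this simple question ("Which airlines fly out of Boston?"), the user must know the table names, the key that relates them, and the join syntax, which is the barrier an NLIDB is meant to remove.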

Background

An intuitive definition of an NLIDB is the following: an NLIDB is a system that translates a natural language query into a database query. The only function of the interface is to translate the query; the information stored in the database is retrieved by a database management system, such as Access or PostgreSQL. The data flow of an NLIDB is depicted in Figure 1.

Figure 1. The flow of an NLIDB
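
The division of labor just described (the interface only translates; the database management system retrieves the data) can be sketched as follows. This is a toy illustration, not the exhaustive parser presented in this chapter: the translation step is a single hypothetical regular-expression pattern that handles one-word city names only, and the flight table, its rows, and all function names are assumptions made for the example. SQLite, accessed through Python's standard sqlite3 module, plays the role of the database management system.

```python
import re
import sqlite3

# Hypothetical pattern covering one query form; a real NLIDB would rely on
# lexical, syntactic, and semantic analysis instead of a single regex.
PATTERN = re.compile(
    r"^(?:show|list) flights from (\w+) to (\w+)\??$", re.IGNORECASE
)


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with a hypothetical flight table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flight (flight_id INTEGER PRIMARY KEY, "
        "origin TEXT, destination TEXT)"
    )
    conn.executemany(
        "INSERT INTO flight VALUES (?, ?, ?)",
        [
            (10, "Boston", "Denver"),
            (11, "Boston", "Atlanta"),
            (12, "Denver", "Boston"),
            (13, "Boston", "Denver"),
        ],
    )
    return conn


def translate(nl_query: str):
    """Interface step: map an NL query to (sql, params), or None if unsupported."""
    match = PATTERN.match(nl_query.strip())
    if match is None:
        return None
    origin, destination = match.groups()
    sql = (
        "SELECT flight_id FROM flight "
        "WHERE origin = ? AND destination = ? ORDER BY flight_id"
    )
    # Normalize capitalization so it matches the stored city names.
    return sql, (origin.capitalize(), destination.capitalize())


def answer(conn: sqlite3.Connection, nl_query: str) -> list[int]:
    """Full flow: translate the query, then let the DBMS execute it."""
    translated = translate(nl_query)
    if translated is None:
        raise ValueError(f"query not understood: {nl_query!r}")
    sql, params = translated
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = make_db()
    print(answer(conn, "Show flights from Boston to Denver"))  # [10, 13]
```

Any rewording outside the single pattern (for example, "Which flights leave Boston for Denver?") makes translate return None, which illustrates why the variability of natural language makes the translation step the hard part of an NLIDB.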
