Unlike natural languages, programming languages are strictly stylized entities created to facilitate human communication with computers. In order to make programming languages recognizable by computers, one of the key challenges is to describe and implement language syntax and semantics such that the program can be translated into machine-readable code. This process is normally considered as the front-end of a compiler, which is mainly related to the programming language, but not the target machine. This article will address the most important aspects in building a compiler front-end; that is, syntax and semantic analysis, including related theories, technologies and tools, as well as existing problems and future trends. As the main focus, formal syntax and semantic specifications will be discussed in detail. The article provides the reader with a high-level overview of the language implementation process, as well as some commonly used terms and development practices.
The task of describing the syntax and semantics of a programming language in a precise but comprehensible manner is critical to the language’s success (Sebesta, 2008). The syntax of a programming language is the representation of its programmable entities, for example, expressions, declarations and commands. The semantics is the actual meaning of the syntax entities. Since the 1960s (Sebesta, 2008), intensive research efforts have been made to formalize the language implementation process. Great success has been made in the syntax analysis domain. Context-free grammars are widely used to describe the syntax of programming languages, as well as notations for automatic parser generation (Slonneger & Kurtz, 1995). Formal specifications are also useful in describing semantics in a precise and logical manner, which is helpful for compiler implementation and program correctness proofs (Slonneger & Kurtz, 1995). However, there is no universally accepted formal method for semantic description (Sebesta, 2008), due to the fact that the semantics of programming languages are quite diverse, and it is difficult to invent a simple notation to satisfy all the computation needs of various kinds of programming languages. Overall, compiler development is still generally considered as one of the most appropriate software applications that can be implemented systematically using formal specifications.
Context-free grammar, BNF and EBNF. In the 1950s, Noam Chomsky invented four levels of grammars to formally describe different kinds of languages (Chomsky, 1959). These grammars, from Type-0 to Type-3, are rated by their expressive power in decreasing order, which is known as the Chomsky hierarchy. The two weaker grammar types (i.e., regular grammars, Type-3; and context-free grammars, Type-2) are well-suited to describe the lexemes (i.e., the atomic-level syntactic units) and the syntactic grammar of programming languages, respectively. Backus-Naur Form (BNF) was introduced shortly after the Chomsky hierarchy. BNF has the same expressive power as context-free grammar and it was first used in describing ALGOL 60 (Naur, 1960). BNF has an extended version called Extended BNF, or simply EBNF, where a set of operators are added to facilitate the expression of production rules.
Key Terms in this Chapter
Front-End: Among all the compiler phases, those mainly related to the programming language but not the target machine are normally considered as the front-end of a compiler.
Compiler-Compiler: Compiler-compilers are tools that can automatically generate compilers or interpreters from programming language descriptions.
Attribute Grammar: Attribute grammar is an extension of context-free grammars in which each symbol has an associated set of attributes that carry semantic information, and each production has a set of semantic rules associated with attribute computation.
Semantics: The semantics of a programming language is the actual meaning of the syntax entities. It describes what the programs written in this language can achieve.
Top-Down Parsing: Top-down parsing is a parsing strategy that iteratively expands the leftmost nonterminal (initially, the start symbol) according to its corresponding productions and matches the expanded sentences against the input program from left to right.
Bottom-Up Parsing: Bottom-up parsing is a parsing strategy that identifies terminal symbols from the input stream first, and combines them successively in a rightmost way to reduce to nonterminals, until the start symbol is reduced.
Aspect-Oriented Programming: Aspect-oriented Programming (AOP) provides special language constructs called aspects that modularize crosscutting concerns in conventional program structures (e.g., a concern that is spread across class hierarchies of object-oriented programs). A translator called a weaver is responsible for merging the additional code specified in an aspect language into the base language at compile time.
Context-Free Grammar: A context-free grammar is a quadruple (N, T, P, S), where N is a set of nonterminal symbols; T is a set of terminal symbols with T n N = Ø; the relation P ? N × (N ? T)* is a finite set of production rules and S is the start symbol with S ? N. The production of the form A ? a means A derives a, where A ? N and a? (N ? T)*.
Syntax: The syntax of a programming language is the representation of its programmable entities, for example, expressions, declarations and commands. It describes the appearance of programs written in this language.