Exactly 1800 Words on Languages and Parsing

Terence Parr

Excerpt from:

Language Translation Using PCCTS & C++
by Terence J. Parr
Published by Automata Publishing Company
Publication date: January 1997
ISBN: 0962748854

We give only a taste of language theory here and in a very loose fashion. However, it should give you enough information and define enough terms to get you through the rest of the book.

In the Spring of 1983, as first year computer science students at Purdue University, we were assigned the problem of recognizing arithmetic expressions, which could include nested parentheses. We were given a specification that described what the expressions looked like and were asked to produce a Pascal program that recognized such expressions. The specification looked something like

expr -> factor
factor-> term ( "+" term )*
term -> atom ( "*" atom )*
atom -> "(" expr ")"
atom -> INTEGER
atom -> IDENTIFIER

where INTEGER and IDENTIFIER were shorthands for strings of digits and strings of letters, respectively; "( "+" term )*" indicated that zero or more "+" term sequences could be seen. [Or did we get term and factor mixed up?]. The rules described the structure of small pieces of the expression language. For example,

term -> atom ( "*" atom )*

was read by saying, "a term is an atom followed by zero or more ""*" atom" sequences." We remember thinking what a marvelously precise means of describing an infinitely large set of input strings.

We decided that our program would have one function to recognize each rule in the grammar to keep things nice and neat. In this manner, references to rules would become, possibly recursive, procedure calls in our program. References to actual input strings such as "(" and INTEGER were all hard coded to eat white space and look for the particular string. This got to be repetitive and so we decided to factor out the common operations among all input string matching code. Further, it seemed easier to treat input strings as single "words" when trying to match the grammatical structure of the expressions. We eventually came up with a function called getword that returned an integer describing what input vocabulary word was found. It also made sense to assume that some variable would always hold the next word to be matched; that is, after matching a word, the variable would be set to the result of calling getword again.

With the benefit of our current knowledge, now we would say that we were provided with a term definitiongrammar consisting of a set of term definitionrules that specified the set of all possible sentences in the expression term definitionlanguage; that is, the grammar specified the Syntax, term definitionsyntax of the language. Rules with more than one alternative were considered to have multiple Production, term definitionproductions. Our program was a parser that made calls to a analyzerlexical analyzer or scanner (our getword function) that broke up the input character stream into term definitionvocabulary symbols, or tokens. The program we built was a classic example of a recursive-descent parser. A generic term for this type of parsing is top-down because when you look at the parse tree, the parse starts at the top (the start symbol) and works its way down the tree. Recursive-descent parsers are a set of mutually recursive procedures that normally use a single symbol of lookahead to make parsing decisions. For example, rule atom could be encoded in C as

int atom()
{
    // use lookahead to decide which alternative applies
    switch ( current_token ) {
       case LPAREN : // -> "(" expr ")"
          current_token = getword();
          expr();
          if ( current_token != RPAREN ) error-clause;
          current_token = getword();
          break;
       case INTEGER :   // -> INTEGER
          current_token = getword();
          break;
       case IDENTIFIER :   // -> IDENTIFIER
          current_token = getword();       
          break;
       default :
          error-clause (missing LPAREN, INTEGER, or IDENTIFIER)
   }
}

where the variable current_token is the lookahead symbol.

LL(1)Parsers that follow this simple formula can be classified as LL(1), which is a shorthand indicating that the input is matched from left-to-right (as opposed to backwards) and that parsing decisions are made on the left edge of alternative productions with 1 symbol of lookahead. This amounts to saying that LL(1) parsers must predict which alternative production will be successfully matched by examining only the set of tokens that can be matched first by each production. We loosely define this set of tokens that predicts alternatives to be the lookahead set. Normally, this is set of tokens that can be matched first by a production p and is called FIRST(p); e.g.,

FIRST("(" expr ")")

is the singleton set {"("}. Occasionally, the FOLLOW set is used to predict alternatives. FOLLOW(r) is the set of all tokens that can be matched following references to rule r. For example, given the grammar

rule         -> optional_ID SEMICOLON
optional_ID  -> IDENTIFIER
optional_ID  ->

FOLLOW(optional_ID) is { SEMICOLON } because SEMICOLON follows the reference to optional_ID. The lookahead set for the empty production is defined to be the FOLLOW of the invoking rule. Therefore, SEMICOLON predicts the empty production of optional_ID.

When the lookahead sets from alternative productions are not disjoint, we say that the parsing decision is nondeterministic or ambiguous. In other words, there is at least one token that predicts more than one alternative. Most of the time, this is a bad thing.

LL(1) parsers may be generalized to LL(k) for k>1. For example, the following grammar is ambiguous upon token A:

a -> A B C
a -> A D E

where A, B, C, D, and E are some vocabulary tokens. Because both productions have a common prefix of A, an LL(1) parser could not determine which production was going to successfully match. However, if the parser could see ahead to both the A and what followed A on the input stream, the parser could determine which production was going to match. An LL(2) parser is such a creature; hence, rule a is unambiguous in the LL(2) sense. A grammar for which a deterministic LL(k) parser can be built is LL(k). A language for which an LL(k) grammar exists is LL(k).

Because recursive-descent parsers are just piles of code, more sophisticated predictions can be made than simple lookahead buffer comparisons. For example, what if two productions are exactly alike syntactically, but are semantically different? What if the productions have different meanings (usually depending upon context or other information)? Consider the following grammar:

element -> ID "(" expression_list ")"
       // array reference

element -> ID "(" expression_list ")"
              // procedure call

It is perfectly reasonable to separate these two cases because while they look the same, array references and procedure calls are very different semantically. The definition of the ID must be consulted to determine which production to match. In a hand-built parser, you could do this:

element()
{
   if ( current_token == ID && isarray(current_text) ) {
      match an array reference;
   }
   else if ( current_token == ID && isprocedure(current_text) ) {
      match a procedure call;
   }
   else error;
}

where isarray(current_text) is some function you have defined that returns true if the ID is previously defined as an array; isprocedure would be defined similarly. We call such a parser a predicated LL(k) parser, pred-LL(k), because at least one parsing decision was predicated upon information not available to a pure LL(k) parser. The forms isarray(current_text) and isprocedure(current_text) are considered "semantic predicates. We could modify the grammar as follows:

element -> <<isarray(current_text)>>? ID "("expression_list ")" 
element -> <<isprocedure(current_text)>>? ID "("expression_list ")"

where <<...>>? is a semantic predicate in ANTLR 1.xx notation.

Pred-LL(k) parsing covers another type of predicated parsing decision. Consider the following grammar:

a -> (A)* B
a -> (A)* C

No matter how large we make the k of LL(k), a sequence of k+1 A's could always be presented to the parser, and the parser could not see past the A's to the B or C. This grammar then is non-LL(k) for any fixed value of k. Because there are real grammars that might have constructs requiring arbitrary lookahead, we introduced another type of predicate called Syntactic predicatesyntactic predicates. A syntactic predicate specifies a grammar fragment that uniquely predicts the associated production. The grammar could be modified as follows:

a -> ( (A)* B )? (A)* B
a -> (A)* C

which indicates that, to predict the first production, zero or more A's must be seen followed by a B. If this syntactic predicate fails, the second production will be attempted by default; hence, no predicate is required at its left edge. Clearly, this ability to scan arbitrarily ahead, renders the class of pred-LL(k) languages much larger than the class of LL(k) languages.

Bottom-up Parsers

And now, for something completely different: a bit about the class of languages and parsers called LR(k). LR(k) parsers are considered bottom-up parsers because they try to match the leaves of the parse tree as they work their way up the parse tree towards the start symbol at the root. A simple way to illustrate LR parsing is to consider a simple language as described the following grammar.

a -> A B C
a -> A B D

Loosely speaking, an LR-based parser consumes input symbols until it finds a viable complete production; for the purposes of this discussion, all productions are viable. Input token A would be tested against both productions. Since A matches neither completely, another input token would be consumed, and AB would be compared against the productions. Again, a complete right-hand-side would not be matched. The next input symbol would be consumed, say token D. At this point, ABD matches the second right-hand-side and the parser would report that it had found input for rule a. The process of consuming input is called Shiftingshifting, and the process of matching complete right-hand-sides is called Reducingreducing. (The right-hand-side is reduced to the left-hand-side.) In our example, no lookahead is required to determine that a valid sentence was found because the entire production can be seen before making a decision. Therefore, this grammar is LR(0).

LR(k) recognizers (and their variants such as LALR(k)) are stronger than LL(k) recognizers because the LR strategy uses more context information. For an LR parser, the context consists of all grammar productions consistent with the previously seen input. This context often includes several "pending" grammar productions. Intuitively, an LR(k) parser attempts to match multiple productions at the same time and postpones making a decision until sufficient input has been seen. In contrast, the context for an LL parser is restricted to the sequence of previously matched productions and the position within the current grammar production being matched. An LL(k) parser must make decisions about which production to match without having seen any portion of the pending productions - it has access to less context information. Hence, LL(k) parsers rely heavily on lookahead. We note that our LR(0) grammar is LL(3) as a case in point.

On the other hand, our pred-LL(k) parsers are stronger than LR(k) parsers for two reasons. First, semantic predicates may be used to parse context sensitive languages. Second, pred-LL(k) parsers have access to arbitrary lookahead. Further, embedding actions in an LR grammar can introduce ambiguities, thus reducing the strength of LR.