LL and LR Translator Need k
Terence Parr and Russell Quong
Appeared in SIGPLAN Notices
Volume 31 #2 Februrary 1996
Terence J. Parr
MageLang Institute
parrt@magelang.com
Russell W. Quong
quong@best.com
ABSTRACT
Language translation is a harder and more
important problem than language recognition. In
particular, programmers implement translators not
recognizers. Yet too often, translation is equated
with the simpler task of syntactic parsing. This
misconception coupled with computing limitations of
past computers has led to the almost exclusive use of
LR(1) and LL(1) in parser generators. We claim that
use of k>1 lookahead can and should be available
in practice, as it simplifies the translator
development. We give several examples justifying our
arguments.
0. Introduction
Language translation is a harder and more
important problem than language recognition. In
particular, programmers implement translators not
recognizers. Yet too often, translation is equated with
the simpler task of syntactic parsing. This misconception
coupled with speed and memory limitations of past
computers has led to the almost exclusive use of LR(1)
and LL(1) in parser generators.
A translator consists of a syntactic parser or recognizer
augmented with semantic actions that are triggered
by the recognizer. Semantic actions are typically used
for constructing abstract syntax tree nodes, adding
entries to the symbol table, or emitting intermediate
code. Thus, any program that does parsing, such a
compiler front-end, a source code interpreter or a
database format converter, is a translator not just a
recognizer. Due to the semantic actions, building a
translator is harder than building a recognizer, because
a translator must execute semantic actions immediately
upon seeing the appropriate left contexts.
We claim that use of k>1 lookahead can and should
be available in practice, as it simplifies translator
development. As this paper is motivated by practice not
theory, we justify our arguments via examples;
theoretical issues are of secondary concern. We focus on
parser generator tools which automatically implement an
LL(k) parser [RoS70] [PQ94] or an LR(k) parser [Knu65]
[Joh75] given a grammar specification, because all else
being equal, we believe programmers would rather use a
parser generator rather than implementing a translator by
hand. In addition, programmers should be able to write
natural grammars, that reflect the underlying structure
of the language and to freely intersperse actions in the
grammar.
The amount of lookahead, k, has been limited almost
exclusively to k=1 in widely-used parser generators. Two
reasons, one practical and one theoretical, explain this
situation. However, the first reason is outdated and the
second is simply erroneous, as we demonstrate in this
paper.
- First, use of k>1 lookahead for a fixed k
throughout a parser requires significantly more
space and time, in both the parser generator and
the resulting parser, than using k=1. However,
the use of k>1 is feasible now because
computers are more powerful and because
space-efficient heuristics [Par93] have been
developed that circumvent the intractable nature
of parsing with k>1.
- Second, many have felt that k>1 is unnecessary
when using an LR(1)-based tool because LR(1)
equals LR(k>1) in recognition power [Knu65].
However, this viewpoint is theoretical only and
is erroneous in practice.
- In translation LR(1) is strictly weaker
than LR(k>1). The presence of semantic
actions, which must be executed
immediately when encountered, makes it
impossible to convert an arbitrary LR(k)
grammar to an LR(k-1) grammar, as shown
in \refsec{theory}.
- Even if one were simply recognizing
input, converting an LR(k) grammar to an
LR(1) grammar is completely impractical,
as the resulting LR(1) would be gigantic.
More importantly, outwardly the LR(1)
grammar would retain almost none of the
structure of the original grammar.
Both LR or LL parsers use a finite state machine;
their recognition strength is a function of (i) the
amount of context} and (ii) the amount of lookahead. We
loosely view the context as the information encapsulated
in the current parser state. The lookahead is the current
set of unparsed tokens available to the parser. The
parser uses the lookahead to make parsing decisions. We
use k to denote the length of the strings in the
lookahead set.
LR(k) recognizers are stronger than LL(k) recognizers
because the LR strategy uses more context information.
For an LR parser, the context consists of all grammar
productions consistent with the previously seen input.
This context often includes several ``pending'' grammar
productions. Intuitively, an LR(k) parser attempts to
match multiple productions at once and postpones making a
decision until sufficient input has been seen. In
contrast, the context for an LL parser is restricted to
the sequence of previously matched productions and the
position within the current grammar production being
matched. An LL(k) parser must make decisions about which
production to match without having seen any portion of
the pending productions---it has access to less context
information. Hence, LL(k) parsers rely heavily on
lookahead.
As an example, consider using a parser rather than a
scanner to recognize integers such as 17 or real
numbers such as 17.89, where each character is a
token. An LR parser would have no problem. It can defer
deciding between integer or real until it sees the
presence or absence of a decimal point using a natural
grammar.
lr_gram : digits
| digits "." digits
;
In contrast, an LL parser cannot simultaneously parse
more than one pending production and cannot see past the
common integer prefix to what follows. Thus, the
programmer must left-factor the productions into a
production matching an integer value followed by an
optional fractional value.
ll_gram : digits decimal ;
decimal : "." digits
|
;
In Section 1, we discuss LL(k)'s reliance on lookahead
in more detail. While pure LR(k) recognizers have more
context information than LL(k) recognizers, we argue in
Section 2 that the introduction of semantic actions
reduces the amount of available context for an LR(k)
parser so that increasing the benefit of lookahead for
LR(k) parsers as well. In Section 3, we show that there
are languages of practical interest that are more easily
specified with LR/LL(k>1) grammars than by LR/LL(1)
grammars. Section 4briefly discusses the relative
strengths of LR(1), LL(k) and LR(k), and Section 5 points
out the problem associated with implementing parsers with
large lookahead.
1. LL(k) translators
The need for greater lookahead in LL parsers is simply
the need for more recognition strength. The inability of
LL to automatically left-factor common prefixes of
competing rules often forces the programmer to manually
left-factor these rules. At best this process leads to
unnatural grammars; at worst, it may be insufficient as
there are languages that are LL(k) but not LL(k-1)
[FiL91]. A larger lookahead k allows the parser to see
past more common prefixes, increases recognition strength
and minimizes the need for manual left-factoring. Once
the programmer has developed an LL(k) grammar, the
addition of semantic actions has no effect. Thus, if we
can recognize input with LL(k) we can translate it, too.
In practice, the use of LL(k>1) parsers both
simplifies grammar development and allows the use of more
natural grammars. In Section 3, we provide an example
showing LL(2) is much more convenient than LL(1).
2. LR(k) translators
LR(k) is stronger than LL(k) because it provisionally
matches multiple productions at the same time---each
parser state encodes a set of partially-recognized
productions. The parser is not required to choose between
them until the right edge of the winning production. In
this section, we show how inserting actions at positions
other than at production right edges can affect LR(k)'s
recognition strength by decreasing its context
information. Practical grammars both contain actions ``in
the middle'' of productions, (or more precisely at
non-right-edge positions) and require these actions to be
executed immediately when encountered. It is these
actions which introduce parsing conflicts. Thus for LR
parsing, we also urge the use of k>1 lookahead tokens
to overcome the lack of context information. We begin by
describing how adding actions affects an LR(k) grammar.
While an LR(k) parser can delay parsing decisions, it
cannot in general delay action execution. Further, the
parser can only execute actions once it has identified
which single production to match, because only those
actions within the appropriate production should be
executed. Actions embedded within a production force the
parser to make decisions much sooner than normal; i.e.,
after the symbols to the left of the action position have
been matched. A premature parsing decision is done with
less context information and weakens the parser. For
example, if an action were placed at the left edge of a
production, the LR(k) parser would be forced to decide if
that production would succeed without having matched any
of it. Unfortunately, this worst case scenario can reduce
the strength of LR(k) to LL(k).
The core of our argument for k>1 hinges on the fact
that (i) actions must be executed immediately and (ii)
actions can occur in the middle or at the left edge of a
rule. We now give two examples to empirically prove for
our assertion.
(i) Consider translating the
following ANSI C type definition and variable
declaration.
typedef unsigned int boolean;
boolean ignoreCase;
To properly translate this code, the symbol table must
be updated immediately upon seeing the first boolean
in the typedef declaration so that the
subsequent use of boolean will be viewed as a
type, instead of an unknown identifier. (A standard
technique is to have the lexical analyzer return
different tokens TOK_TYPE, TOK_VAR, Or TOK_FN
by examining the symbol table). The following grammar
could be used to recognize the input.
typedef_rule : TYPEDEF declaration {add typename}\ ';' ;
vardef_rule : declaration {add varname}\ ;
As a subtle point, in an LR(1)/LL(1) parser where the
parser always sees one token ahead of what has been
matched, the action ``add typename'' must occur before
the semicolon in the type definition rule in order to
update the symbol table before the lexer sees boolean
in the ensuing statement. If the action execution were
delayed until after the recognition of the ';',
the lexer would not return the correct token type.
(In this paper, we ignore the standard trick of having
the lexer use character lookahead to fake k>1
lookahead symbols. This trick only works in restricted
cases, such as when the lookahead symbol at k=2 is a
single character in length. Worse, this sort of hack
results in lexers that are hopelessly interwined with
parsers. Anyone who has tried to implement a C++ parser
with LALR(1) or LL(1) is familiar with the difficulties.)
(ii) Our next example shows that
actions do occur naturally in the middle of rules---some
actions cannot be moved to the right edges of a
production. Again, our example uses an action to update
the symbol table for lexer. In Jim Roskind's [Roskind]
precisely-tuned YACC C++ grammar, his comments indicate
that parsing scoped names of the form xxx::yyy
require the lexer to look up the name yyy in the
namespace for Xxx. (In C++, The Notation xxx::yyy
refers to name yyy in scope xxx where the scope
is assumed to be global if xxx is absent.) Thus the token type
for yyy is context dependent, as it might be a member
variable, member function or type name local depending on
the class xxx. The following is the rule from
Roskind's grammar that handles this situation for global
scope overrides.
global_scope
: { direct lexer to look for upcoming name in file scope } "::"
;
Here, the parser must execute an action to provide
context information to the lexer so that it can return
the correct token type for the identifier following the
scope override operator ``::''. Further, this
action must be executed before the parser has had a
chance to request the token following the ``::''.
Thus, we have demonstrated that actions must sometimes
occur in the middle of a production and that introducing
such actions reduces the amount of context information
available to an LR(k) parser. This reduction in context
is accompanied by a reduction in parsing strength, which
can be offset simply with more lookahead information.
3. Example grammars needing k>1
When designing a language, it is a good idea to keep
the syntax simple enough to be described easily with an
LR(1) or LL(1) grammar; however, it may be the case that
the most natural grammar for the language requires
k>1. Also, in practice, existing languages have quirks
that are best handled via parsers with greater lookahead.
(Languages designers are not always familiar with parsing
theory).
We now give two examples of standard languages with
non-LR(1)/LL(1) constructs. The first shows that
increasing k for an LL(k) parser can obviate need to
left-factor a grammar. The second example shows that
k>1 lookahead symbols are useful for LR(k) grammars
even without embedded actions. In the following grammars,
tokens are either CAPITALIZED or they are enclosed in
single quotes such as ':'.
Example 1: We illustrate the advantage of an
LL(2) grammar over an LL(1) grammar for recognizing a
fragment of the C language. Consider distinguishing
between C labels ``ID :'' and C assignment
statements "ID = ...", Where ID
represents an identifier token. In the following grammar
fragment, rule {\tt stat} requires two lookahead symbols
and is easily expressed with an LL(2) grammar. This
common language feature is hard to express with an LL(1)
grammar because the ID token is matched in many
different grammar locations. Left-factoring the common ID
prefix in stat}
and expr would results in a much
less natural grammar.
stat: ID ":" stat /* statement label */
| expr ";" /* assignment stat */
;
expr: ID "=" expr
| INT
;
Although manual left-factoring can sometimes solve the
problem, it is not ideal for two reasons. First, even if
left-factoring yields a reasonable grammar, the LL(k)
grammar for k>1 is simply more convenient, which tends
to reduce development time. Secondly, while
left-factoring might be theoretically possible, it can be
practically implausible, as in this example, where expr
occurs throughout the grammer.
Example 2: Consider specifying a
grammar for the "YACC language" which describes
the input to YACC itself. Surprisingly, the most natural
grammar is LR(2), not LR(1), because the semicolon at the
end of a YACC rule is optional, which implies that YACC
cannot easily recognize its own input language. Jim
Roskind [Roskind], who pointed out this fact to us, also
provided the following example YACC input. The problem
arises when parsing "... t t : ...".
s : A
| s A
| t
t : B
| t B
Two symbols of lookahead are required to determine
when a rule ends. The difficulty is that we have to scan
past a token to check for a following ":"
to determine whether or not to the current rule is
complete. A natural grammar (using extended BNF) for the
YACC language would look like the follow
yacc_input
: ( rule )*
;
rule: RULENAME ':' alt ( '|' alt )* [ ';' ]
;
/* This rule requires LL/LR(2) */
alt : ( TOKEN | RULENAME )*
;
where "(...)*" means zero-or-more and
"[...]'' means optional. The
"(...)*" subrule in alt
is essentially a loop that consumes
TOKENs or RULENAMEs until something that can
follow an alt is seen on the input. The loop
will consume tokens until it sees a '|' (correct),
';' (correct) or a ':' (incorrect). With
some work, this natural LR(2) grammar can be transformed
into an LR(1) grammar, but the resulting LR(1) grammar
would reflect little of the structure of the YACC
language. But the use of an unnatural grammar completely
negates the advantage of using a high-level grammar. (In
fact, if one just wants to recognize the input, a regular
expression probably suffices.)
4. Theoretical argument
In this section, we briefly describe how actions may
be introduced into an LR(k) grammar at non-right-edge
positions and then describe how the effect of action
introduction can reduce the strength of LR(k) to that of
LL(k) in the worst case.
An LR(k) parser's recognition strength lies with the
fact that a given parser state may correspond to multiple
grammar positions. Consequently, the parser cannot know
which embedded action to execute until it uniquely
identifies which surrounding production to reduce; i.e.,
until it performs a reduce. This implies that productions
with actions at non right-edge positions must be
``cracked'' to force the actions to a right edge. For
example, a production of the form
a : { action} A;
would be transformed into two productions:
a : b A ;
b : { action} ; /* production only has action */
The introduction of empty productions such as this can
introduce parsing conflicts because the parser must
decide if that production will succeed before any portion
of the cracked production has been matched.
Brosgol [Bro80] showed that a grammar is LL(K) iff
that grammar augmented with a reference to a unique empty
production on each left edge is LR(k). In other words, if
a grammar augmented with references to empty productions
on each left edge is still LR(k), then the original LR(k)
grammar is also LL(k). One may conclude that, while it is
unlikely that an action will be required on every left
edge, LR(k) can be reduced in strength to that of LL(k)
in the worst case.
Carrying the work of Brosgol further, we note that
LR(k) must also be stronger than LR(1) in the worst case
action placement scenario. This result can be seen by
observing that LR(1) is equally strong as LL(1) in the
worst case and, since LL(1) subset LL(k), LR(1) subset
LL(k); by transitivity, LR(1) subset LR(k). We conclude
that LR(k) cannot be arbitrarily rewritten to be LR(1)
when actions may be placed arbitrarily in that grammar.
Theory agrees with practice in that converting an
LR(k) grammar to be LR(1) (even if the converted grammar
were not huge) is not plausible and that LR(k) is
significantly weakened when actions are introduced.
Increasing the amount of lookahead beyond k=1 for both LL
and LR parsers is useful. A trivial grammar which is
LR(2) but not LR(1) due to actions is as follows.
start: { printf("X ahead"); } A X
| A Y
;
An LR(1) parser that sees the input ``A''
cannot determine whether or not to execute the action.
The grammar cannot be left-factored to reduce the
lookahead requirements:
start: { printf("X ahead"); } A (X|Y)
;
The parser in this case would execute the action for
both productions. Until the "X" is
seen, the action cannot be executed.
5. Implementation Issues
While we have shown LL(k) and LR(k) parser with k>1
to be useful, we have thus far not considered the cost of
implementing these parsers. In theory, storing full
lookahead information for one decision requires O(|T|^k)
space, where |T| is the number of token types. In
practice, it is expensive for a parser generator to
compute full lookahead for k>1 and the resulting
parsers are usually impractically large. Various methods
have been examined to obtain k>1 lookahead. For
example, [BeS86} And [CuC73] explored using infinite
regular languages for lookahead decisions and [ADG91]
computed lookahead k-sequences at parse-time rather than
statically. Unfortunately, practical and widely-used
parser generators have not been developed using these
techniques.
We have implemented an effective heuristic [Par93]
called linear-approximate lookahead (denoted LL1(k)
and LR1(k)) that retains most of the power of
full k >= 1 lookahead, but eliminates the exponential
cost of conventional lookahead. Storing
linear-approximate decisions requires O(|T| x k) space, a
significant savings over the O(|T|^k) required for full
lookahead. By using this heuristic whenever possible,
grammars can effectively use reasonably large k (k=4 for
large grammars on a 1995 PC workstation). We construct
practical parsing decisions for k>1 in the following
manner:
- k must be modulated according to the needs of
each parsing decision (this technique is well
known; [DeR71] suggested this in his paper on
SLR(k)).
- Most importantly, linear-approximate lookahead
must be used to squash the exponential growth for
the k>1 case. We rely upon conventional k>1
lookahead decisions only when the approximation
is inadequate.
- Even when a full lookahead decision is required,
we compute a hybrid approximate/conventional
decision in order to reduce the time and space
requirements. We take advantage of the fact that,
in most cases, approximate lookahead is nearly
sufficient---we generate decisions containing an
approximate decision plus a few k-tuple
comparisons to test for the lookahead sequences
that violate the constraints for approximate
lookahead.
To illustrate the spirit of linear-approximate
lookahead and its attendant time and space savings,
consider the following contrived LL(2) grammar.
a : (A B|C D|F E|H K|J K|A M)
| F I
;
where all symbols beginning with an uppercase letter
are token references. The following pseudo-code predicts
alternatives of rule ``a'':
if (t1,t2) \in {(AB),(CD),(FE),(HK),(JK),(AM)} then
predict first production};
elseif (t1,t2) = (FI) then
predict second production};
endif
where ti is the ith token of lookahead. Even if a hash
table were used to perform the 2-tuple set membership,
six 2-tuples would still have to be stored.
Alternatively, a series of 2-tuple tests would be both
slow and costly in space.
Now consider that, while the first token of lookahead
cannot uniquely identify which production to predict, the
sets of tokens at lookahead depth k=2 for productions one
and two are disjoint. Consequently, the parsing decision
can be reduced to the following pseudo-code:
if t2 in {B,D,E,K,K,M} then
predict first production};
elseif t2 = I then
predict second production};
endif
The set membership used in this case can
be done with a simple bit set test in constant time.
Five years ago, we discovered and
implemented linear-approximate lookahead in the
widely-used ANTLR [PQ94] parser generator, making
LL(k>1) parsing a practical reality (the first general
release for k>1 ANTLR was three years ago). The
public-domain parser generator ANTLR, described in
[PQ94], is available at ftp://ftp.parr-research.com/pub/pccts
. Finally, we know of a LR(k) parser generator [Grosch95]
that uses a variant of LR1(k) that is nearing
completion.
6. Conclusion
We have shown a practical need for k>1 lookahead in
both LL and LR parser generators. LL(1) is too weak,
while the presence of actions reduces the strength of
LR(1). Finally, with current computers and heuristic
approaches, use of k>1 is feasible.
Bibliography
- ADG91
- M. Ancona, G. Dodero, V. Gianuzzi, and M.
Morgavi. Efficient Construction of LR(k) States
and Tables. ACM TOPLAS, 13(1):150--178,
January 1991.
- BeS86
- Manuel E. Bermudez and Karl M. Schimpf. A
Practical Arbitrary Lookahead LR Parsing
Technique. Proceedings of the 1986 Symposium
on Compiler Construction; SIGPLAN Notices,
21(7), July 1986.
- Bro80
- B.M. Brosgol. Deterministic Translation
Grammars. Garland Publishing, New York, 1980.
Reprint of TR-3-74, Center for Research in
Computer Technology, Harvard University, 1974.
- CuC73
- Karel Culik and Rina Cohen. LR-Regular
Grammars---an Extension of LR(k) Grammars. Journal
of Computer and System Sciences, 7:66--96,
1973.
- DeR71
- Frank DeRemer. Simple LR(k) Grammars. Communications
of the ACM, 14(7):453--460, 1971.
- FiL91
- Charles N. Fischer and Richard J. LeBlanc. Crafting
a Compiler with C. Benjamin/Cummings
Publishing Company, Redwood City CA, 1991.
- Grosch95
- Josef Grosch grosch@cocoab.sub.com.
Private communications, April 1995.
- Joh75
- Stephen C. Johnson. Yacc: Yet another
compiler-compiler. Technical Report TR32, Bell
Laboratories; Murray Hill, NJ, 1975.
- Knu65
- Donald Knuth. On the Translation of Languages
from Left to Right. Information and Control,
8:607--639, 1965.
- PQ94
- Terence J. Parr and Russell W. Quong. ANTLR: A
Predicated-LL(k) Parser Generator. To appear
in Journal of Software Practice and Experience,
1995.
- Par93
- Terence John Parr. Obtaining Practical
Variants of LL(k) and LR(k) for k>1 by
Splitting the Atomic k-Tuple. PhD thesis,
Purdue University, West Lafayette, Indiana,
August 1993.
- RoS70
- D. J. Rosenkrantz and R. E. Stearns. Properties
of Deterministic Top-Down Grammars. Information
and Control, 17:225--256, 1970.
- Roskind
- James Roskind. Private communications, September
1994.