Manual Tree Manual
by Michael Barnett
Foundations of Software Engineering
Microsoft Research
http://www.research.microsoft.com/fse
22 August 2000
========================================================
This document describes:
1. How to write Antlr rules that construct trees of various shapes.
2. How to write manual tree construction semantic actions. While the
automatic tree construction of Antlr is usually what one wants to use,
there are situations where one needs/wants to perform manual tree
construction.
3. How to write tree parser rules that explore the abstract syntax trees
in a different order than the default left-to-right depth-first
traversal.
NOTATION
========================================
An identifier written in all caps is a token. If it is in lower case
from the beginning of the alphabet, then it is a non-terminal (i.e., a
parser rule), and if it is in lower case from the end of the alphabet,
then it is either a token or a non-terminal. Angled brackets around an
identifier means the tree built for source input xxx.
Non-terminals represent both a rule in the grammar and the sub-tree
built up by the rule during parsing. They are completely equivalent;
context makes clear which meaning is intended.
Example
XYZ A token, either created by the lexer, or a synthetic
token.
a Either a rule named "a" or the sub-tree built by rule a.
x Either a token or a rule, it doesn't matter.
AUTOMATIC TREES AND THEIR MANUAL COUNTERPARTS
========================================
Here are some rules with automatic tree construction, with each followed
by an equivalent rule with manual tree construction that builds exactly
the same tree. If you aren't sure why you would want the manual
versions, see the last section.
r1_auto: x y z
;
r1_manual:! e1:x e2:y e3:z { ## = #(#e1,#e2,#e3); }
;
r1_manual:! { ## = nullAST; }
e1:x { ## = #(nullAST,##,#e1); } // add x as a sibling
e2:y { ## = #(nullAST,##,#e2); } // add y as a sibling
e3:z { ## = #(nullAST,##,#e3); } // add z as a sibling
r1_manual:! { ## = nullAST; }
e1:x { ##->addChild(#e1); }
e2:y { ##->addChild(#e2); }
e3:z { ##->addChild(#e3); }
All r1 rules should build trees that look like: #(x y z).
r2_auto: x A^ y
;
r2_manual: e1:x a:A e2:y { ## = #(#a,#e1,#e2); }
;
r2_manual: e1:x A e2:y { ## = #([A],#e1,#e2); }
;
r2_manual:! { ## = nullAST; }
e1:x { ## = #(nullAST,##,#e1); }
a:A { ## = #(a,##); }
e2:y { ##->addChild(#e2); }
All r2 rules should build trees that look like: #(A x y).
LEFT-BRANCHING TREES
========================================
To get left-branching structures, use iteration. For instance, suppose
the source language has a left-associative operator represented by the
token OP, and that the rule x recognizes the operands to OP. (Usually, x
is a "lower" rule than left_ in that it recognizes expressions involving
an operator that binds tighter than OP. See the Java or C examples in
the distribution.)
left_auto: x (OP^ x)*
;
left_manual: e1:x ( /* nothing, but allows automatic tree construction
for empty case */
|! OP e2:x { ## = #([OP],#e1,#e2); #e1 = ##; } )+
;
left_manual:! e1:x { ## = #e1; }
(OP e2:x { ## = #([OP],#e1,#e2); #e1 = ##; })*
;
All left rules should build trees that look like:
#(OP #(OP ) )
RIGHT-BRANCHING TREES
========================================
To get right branching structures for right-associative expressions
using operator OP, use recursion on the rule itself:
right_auto: x (OP^ right_auto)
;
right_manual:! e1:x { ## = #e1; } (OP e2:right_manual { ## =
#([OP],#e1,#e2); })?
;
All left rules should build trees that look like:
#(OP #(OP ))
Suppose you want, for whatever reason, the keyword "elif" in your
language so nested if statements are written as:
if B1 then
S1
elif B2 then
S2
elif B3 then
S3
else
S4
You would want the trees to look like this:
#([IF],,,#([IF],,,#([IF],,,)))
(Note: With the plain if-then-else construct this problem doesn't come
up, because Antlr's default matching means that an if-statement in an
else-branch ends up constructing a tree like the one above.)
If you use the following simple rule for if-statements, you won't get
the proper trees.
if: IF^ expression THEN! block
(ELIF! expression THEN! block)*
(ELSE! block)?
;
You would get a tree like this for the above example:
#(IF )
But here is a rule for getting a right branching tree:
if!:
IF
e:expression
THEN
b:block
r:restOfIf[#e,#b]
{ ## = #r; }
;
restOfIf![ANTLR_USE_NAMESPACE(antlr)RefAST condition,
ANTLR_USE_NAMESPACE(antlr)RefAST then_clause]:
/* nothing */
{ ## = #([IF,"IF"],condition,then_clause); }
| ELSE b:block
{ ## = #([IF,"IF"],condition,then_clause,#b); }
| ELSEIF
e:expression
THEN
b2:block
r:restOfIf[#e,#b2]
{ ## = #([IF,"IF"],condition,then_clause,#r); }
;
WHY USE MANUAL TREE CONSTRUCTION?
========================================
Why use the manual rules? One example is that I wanted all of my binary
operators to form trees with a common root. That is, I wanted my ASTs to
look like this: #([BINARY],OP,left,right) so I could use a single
alternative in my tree parser for all binary operators. This constrasts
with the tree grammar for Java in the examples directory which has one
alternative per operator. So I used an action like the following:
{ ## = #([BINARY],OP,#e1,#e2); }
Another, perhaps more persuasive, reason to put different tokens at the
root of trees is to be able to step over them efficiently in a tree
parser. For instance, a parser rule like:
parser_rule: x (y)* z
;
leads to trees that look like: #(x y...y z), where "y...y" represents an
arbitrary number of y sub-trees. Now if you have a tree parser that
wants to explore the x and z sub-trees, but doesn't want to explore the
y sub-trees, it isn't possible. But if all of the y sub-trees are
gathered under a single root:
parser_rule: x gather_y_trees z
;
gather_y_trees: (y)* { ## = #([Y_TREES],##); }
;
then the trees look like this #(x #(Y_TREES y...y) z). Now the tree
parser can be written as:
tree_parser_rule: #(x . z)
;
explore_y_trees: #(Y_TREES (y)*)
;
The rule tree_parser_rule will now skip over the y sub-trees without
exploring them. This assumes you will explore the y sub-trees in another
part of the grammar or in a separate tree grammar. For the former,
suppose you want to explore the y sub-trees after you explore the z
sub-tree:
tree_parser_rule: #(x y:. z { explore_y_trees(#y); } )
;