Terence Parr
University of San Francisco
parrt@cs.usfca.edu
September 5, 2003
One of the most common translation task is to read some input, augment it, and spit it back out with the original format. For example, you may want to read in Java, add some instrumentation method calls, and spit the same program back out. Currently, there is no obvious means of doing this simply in ANTLR. I have conjured up an experimental means of supporting x to x translation without modifying ANTLR itself. This example provides modified token, token stream, tree node, and ast factory objects along with a simple grammar to demonstrate their use. The grammar describes a simple calculator language and builds the usual ASTs. There is also a tree walker grammar that asks a few of the operator nodes to print out the tokens, in original order, that were matched for that operation (as opposed to the tokens matched for the parser rule that matched that operation).
This experiment is not perfect. For example, I have not bothered to figure out how to handle whitespace or tokens w/o nodes (SEMI) that occur in between statements. The token stream tracks everything, but when I walk the trees to print, for example, all nodes associated with an assignment except the SEMI are printed; SEMI is not added to the tree even though it is parsed. Clearly, one only has to look at the tokens stored by the token stream after the end of one statement and before the start of the next statement. Exercise for the reader. ;) If someone wants to extend this, please do so and post at ANTLR.org or send to me. :)
The problem comes down to the types of trees ANTLR encourages you to
build: ASTs (ANTLR) versus parse trees. A parse tree encodes the
rules used to match some input and has tokens at the leaves with rule
nodes as subtree roots. In contrast, an AST consists only of token
nodes where subtree roots are taken to be operators. For example,
given input "3+4" the parse tree might look like:
expr
|
addExpr
/ \
atom + atom
| |
3 4
whereas the AST would look like:
+
/ \
3 4
Clearly, the parse tree is sensitive to changes in the grammar by definition. More importantly, the rule nodes are noise really. It is better from a tree walker perspective to have trees where the structure encodes the relationship between tokens without all the rule nodes getting in the way. An AST is sort of a canonical form that is insensitive to changes in the way you express your grammar. Regardless of how you express a grammar, the AST could always remain the same.
So ASTs are what you want from a translation point of view, but they have one big disadvantage: the order of the original token stream is not preserved. With a parse tree, you can simply walk the leaves in a depth-order traversal and you're done. An AST specifically moves tokens out of order. There is no well-defined order traversal for an AST that will print out the tokens in the original order.
The first solution that came to mind was tracking the index of the first and last token associated with every node and its children. So, for example, the "+" operator subtree root would track the min/max index for itself and all children. This min/max definition is inductive (recursive) with the overall root node having min..max of 0..n-1 where n is the number of tokens in the input stream.
This solution is nice because it doesn't matter how you build the
trees--root nodes always can spit out all tokens for itself and below
in the right order. For example, "a+b*c" would look like the
following, partially annotated with min..max token index:
+ 0..4
/ \
a * 2..4
/ \
b c 4..4
The identifier nodes have min=max=tokenIndex; so, node "a" has 0..0 and "c" has 4..4.
To print out just the multiplication, you would print from tokens 2 to 4 from the overall list of tokens you've tracked.
What about whitespace (this extends to comments, of course)? Consider
"a + b ". The token list is
index: 0 1 2 3 4 5
token: ID WS PLUS WS ID WS
and the tree is
+ 0..4
/ \
a b
So printing tokens associated with the "+" gets you "3 + 4", missing the final whitespace. I left my solution like this as I was not sure how I wanted to handle inter-expression whitespace etc...
If you have two expressions like "a+b; c+d" then you'd get an AST like
this:
0..3 5..7
+-------+
/ \ / \
a b c d
where the token list is:
index: 0 1 2 3 4 5 6 7
token: ID PLUS ID SEMI WS ID PLUS ID
Again, the SEMI is not added to the tree, but is parsed and is tracked in the list of tokens. In other words, the information is available, but I did not incorporate it into my subsequence printing routines.
character stream
|
v
[CalcLexer]
|
token stream of TokenWithIndex objects
|
v
[TokenStreamTracker]
|
token stream missing whitespace
|
v
[CalcParser]
|
AST (tracking min/max index values) ASTMinMax objects
|
v
[CalcTreeWalker]
$ cat test2
a = 3 + 4 * 5
+ 2
;
print a+ 1 ;
$ java Main < test2
Tree: ( =[0..16] a[0..0] ( +[4..16] ( +[4..12] 3[4..4] ( *[8..12]
4[8..8] 5[12..12] ) ) 2[16..16] ) ) ( print[20..25] ( +[22..25]
a[22..22] 1[25..25] ) )
Whole token stream:
[0:"a",<4>,line=1,col=1]
[1:" ",<11>,line=1,col=2]
[2:"=",<5>,line=1,col=3]
[3:" ",<11>,line=1,col=4]
[4:"3",<10>,line=1,col=5]
[5:" ",<11>,line=1,col=6]
[6:"+",<8>,line=1,col=8]
[7:" ",<11>,line=1,col=9]
[8:"4",<10>,line=1,col=10]
[9:" ",<11>,line=1,col=11]
[10:"*",<9>,line=1,col=12]
[11:" ",<11>,line=1,col=13]
[12:"5",<10>,line=1,col=14]
[13:"
",<11>,line=1,col=15]
[14:"+",<8>,line=2,col=5]
[15:" ",<11>,line=2,col=6]
[16:"2",<10>,line=2,col=7]
[17:"
",<11>,line=2,col=8]
[18:";",<6>,line=3,col=3]
[19:"
",<11>,line=3,col=4]
[20:"print",<7>,line=5,col=1]
[21:" ",<11>,line=5,col=6]
[22:"a",<4>,line=5,col=7]
[23:"+",<8>,line=5,col=8]
[24:" ",<11>,line=5,col=9]
[25:"1",<10>,line=5,col=10]
[26:" ",<11>,line=5,col=11]
[27:";",<6>,line=5,col=12]
[28:"
",<11>,line=5,col=13]
Original text:
-----
a = 3 + 4 * 5
+ 2
;
print a+ 1 ;
-----
Walking tree, showing tokens for various subtrees
addition: 3 + 4 * 5
addition: 3 + 4 * 5
+ 2
assign: a = 3 + 4 * 5
+ 2
addition: a+ 1
print: print a+ 1
| File | Description |
| ASTMinMax.java | A simple AST node that tracks the Token from which it was created and the min and max token index associated with this node and any children. For leaf nodes, min==max==token.getIndex(). |
| Main.java | Main program that creates the various pieces and hooks them up like the black box diagram above. |
| TokenStreamTracker.java | A TokenStream object that tracks all incoming tokens and can discard tokens like whitespace so the parser doesn't see them. |
| TokenWithIndex.java | The usual token but with an index into the TokenStreamTracker token list. |
| TrackerASTFactory.java | Modification to the standard tree construction facility so that min/max is set for subtree roots when children are added. |
| calc.g | The lexer, parser, and tree parser grammars. |
| test1 | A trivial test file |
| test2 | Test file: two statements |
Generated from calc.g:
CalcLexer.java
CalcParser.java
CalcParserTokenTypes.java
CalcTreeWalker.java
Here is a zip file of everything.