|
Home |
Download |
News |
Wiki |
About ANTLR |
Feedback |
Support |
Bugs
|
|
|
Latest version is 2.7.7. Download now! » |
|
|
ANTLR 3.0 Tree Grammars, Tree Parsing
June 28, 2005Ok, last night I figured out how to handle rewrites during tree parsing. Simple! Just mirror what I do for token rewrite stream but for tree rewrite stream. Queue up a series of rewrites until the end and then alter the tree, giving it back to the user. Some people have mentioned doing this on a rule-by-rule basis for tree construction (during parsing) but this will work great on a grammar basis for tree rewriting too. :) I think it was Paul Lucas and Loring and Stanchfield who were talking about a related thing on a rule basis. Anyway, seems tree parsing really works. Got it doing the simpleC
example. Here it is in it's full glory
Note that in the tree parser, it's just the parser grammar to the right of the -> for the most part!
June 27, 2005Ok, so it looks like I have tree parsing working nicely. :) The code generation templates are essentially identical so it was fairly straightforward--I tricked the LL(*) analyzer into seeing A B as different from ^(A B) by adding imaginary navigation nodes DOWN and UP. Works great. Now I've got to make a decision about rewrites in a tree parser. I.e., given a tree and option output=AST, should the tree parser build a completely new tree or should it just rewrite the current one? I'm trying to think of a situation where you want to keep a tree intact during a multiple phase translation. I suppose debugging is one reason, but then if you want to keep that tree you can easily just do a dupTree on it before you pass it to the rewrite phase. Further, that dup will be faster than having the parser dup nodes. So, imagine you have
This parses trees like
The DECL and type nodes should stay intact; only the declarator result has changed. The ID node will be reused but merged into a new ^(ASSIGN ID INT["0"]) subtree that has two imaginary nodes. The result becomes the new declarator tree up in decl. Ack! The problem with rewriting a tree is that you are modifying the tree you are trying to parse. This is ok, if you do only local rewrites but what if you use an action to alter nodes in the tree that will be parsed later? You could easily create an infinite loop where a rule creates nodes that are parsed later via that same rule which creates more nodes in the future ad naseum. Ugh. One easy way to solve this is of course with memory. The TreeNodeStream object could build up an entire list of the nodes a priori, before the parse begins. In this way, the parse will proceed exactly as you expect even if you unlink / rewrite every node in the tree. But the cost is 4 bytes ptr per node and with 500,000 nodes as the example somebody gave, that's already 2G of RAM. :( This could be mapped to the disk like virtual memory, but... Of course, creating a new tree of 500k nodes is even more expensive. This "alter the future" is more common than you think. What if you're generating code for a translator to C from Java and you create memory in some statement of a block. As you do that you may want to generate a free() call at the end of the block. When the parser finishes the last "real" statement, it will start to parse your free() calls! The problem is that those are C statements not Java! The parser will puke. The choice appears:
Come to think of it, there could be some really strange bugs introduced if you rewrite a subtree that is later parsed using it's original structure (but it no longer has that stronger and some of your actions will fail!). Argh! June 23, 2005Just released ANTLR 3.0ea3, which has the tree construction and rewrite rules stuff...now on to tree grammars real quick. GrammarsProgrammers are used to having parsers recognize structure in single-dimensional streams of tokens. These parsers often build trees that are later walked in some manner to extract information or to generate a translation. Tree parsers do the same thing except that they recognize structure in a two-dimensional stream of tree nodes. The key to unifying parsing and tree parsing is to serialize a tree into a token stream by adding imaginary tokens DOWN and UP to encode tree structure. From a grammar point of view, though, a parser grammar and a tree grammar look identical except for the addition of ^(root children...) tree structure syntax. The syntax mirrors the construction syntax in token parser rewrite rules. Here is a simple example that matches a list of declaration trees such
as:
Here's the grammar:
So, all I need to do is go into the ANTLR antlr.g parser that builds
trees and have it turn
into
and the rest of ANTLR will be none-the-wiser :) The LL(*) analysis
algorithm will just simply work and provide k>1 lookahead for tree
parsers. ANTLR will see the following two alternatives as different:
To actually parse at runtime, I'll need a CommonTreeNodeStream that serializes a tree, but it should be easy. This part is brain dead simple. However, what do labels mean and what can actions do? Labels, types, ...SO, in the following grammar fragment, what type is $t? What type
is $ID?
If this were a parser grammar, $ID would be the token matched for
ID and $ID.tree is the tree created from the matched token. I
sort of like the idea that I can do multi-stage parsing. Take a
parser that parses a token stream and generates a token stream. Feed
that stream to another parser (a tree parser) that feeds off a flat
tree (token stream). So, take this grammar:
and then grab the resulting tree, passing it to another parser:
that matches the altered token stream. I wonder if this could help with parsing C++ and other nasty beasts. Hmmm...not that common. Actually, you could reuse the same grammar! I.e., the grammar only cares that it gets a token stream. It doesn't matter the source. You could have a parser eat its own output. Cute, but useful? Well, I like consistency...who knows what it will be useful for. The reason I mention this is it affects the type of $ID. If you want the same parser to parse tokens and tree nodes, the actions will only work if the type of $ID is the same. I guess you might do token stream to token stream (flat AST node stream) parsing, but the grammars would be different so you could make one a tree parser. Also, $ID cannot be assumed to be Token for tree parsing as your tree nodes may not track a token object as payload--you might, for example, inline all the data as fields rather than use a payload. I guess $ID.text could still work... $ID is type Token in a parser grammar because it is the fundamental input element. That sort of implies that a tree garmmar should have $ID be a tree node, whatever your ASTLabelType is. It would be useful to define $ID.tree perhaps to make it easy to transfer a parser grammar to a tree grammar. Tree Parser TemplateThe treeParser template should be virtually identical minus token type defs and TokenStream -> TreeNodeStream. TreeParser objectNeed a TreeParser. Hmmm....can't subclass Parser, but pulling out
a shared GenericParser to reuse match() etc... can be a problem.
It can use just input.LA(1), which will work when I use field:
but the generated parser for token stream will need input.LT(1), which is undefined for the super-interface IntStream. Crap, I don't want all "label = input.LT(1);" assignments to be wrapped in a case (slow and ugly). I know. Make any method that needs input to take it as an IntStream parameter as LA() is the only method needed. Code gen would change to match(input, tokenType) but I don't think the it will take much time to pass another parameter. I use input.LT(1) a lot, but I could change the code gen so Token is replaced with Object for tree parsing...yeah, I guess that will work. TreeNodeStreamHere is the interface that the tree parser would need to parse and yet
keep all code generation identical for things like template
tokenRef.
A CommonTreeNodeStream would keep a stack of pointers into the tree and a stack of int indexes that indicated where (which node/child) the stream leaves off at each level of the tree. AST to AST transformationsCopy or rewrite inline? Just posted email to the list in response to post by Akhilesh Mritunjai: "Yup, in tree parser, and I vote for default in-line. As for creating a new tree, thats what 'buildAST=true' is there for. Am I right ? Coupled together with a..." Ok, so output=AST (in 3.0 parlance) would construct a tree from
whatever the input was: tokens or trees. A tree parser with rewrite
rules, however, would do a rewrite. Let's see an example:
which creates trees like this:
(ain't that easy using the new notation?!!) then we'd have a tree grammar that created a new tree from the old:
(note how the tree grammar is the collection of right-hand-sides from the -> rewrites). heh, somebody ought to write this down as doc ;) Now, to do rewrites, one would leave off the output option:
Yes...I guess that makes sense. The method generated for rule
"program" would have to replace the children list of DECL with the
new subtrees created by rule decl. I guess like this:
I am not sure how to do that easily, though I of course see the need
(particularly given the tree sizes you refer to below...yikes!) The
TreeNodeStream that serializes the trees for two-dimensional parsing
could be asked to replace the node I suppose...sort of like asking an
iterator to do a delete. Hmm...that is an interesting approach. It
would only allow replacement of a node under the "cursor" at
input.LT(1) I guess. It would have to make sure not to try to parse
the new node. You might have a single node replaced with two nodes:
If the input were say B B then after the first call to b the "X Y" would replace the first B, yielding X Y B, but the TreeNodeStream would have to do a "replace and jump over new nodes" so the current pointer was on the 2nd B not the Y. Regardless, the approach would invisibly allow a completely new tree to be created or to have a tree modified inline. Wow.
|
|||||||||||||||||||||||||||||||||||||||||||||||||