|
Home |
Download |
News |
Wiki |
About ANTLR |
Feedback |
Support |
Bugs
|
|
|
Latest version is 2.7.7. Download now! » |
|
|
Aspects, Actions, Rewriting, Attributes
June 25, 2007Tree and token stream rewriting. Starting with simple token
rewriting. I've thought about this and I think we can fold rewriting
patterns into grammar composition and token stream stuff. Here's a
simple rule that just replaces a token:
Presumably a variant of the TokenRewriteStream can insert tokens not just text. This would just replace the start..stop with the right hand side. If 'var' is a string, it would have to be tokenized. Hmm...only reason to go back to tokens is if you are going to do multiple parses. If going to text, just use output=template. Ok, if going token to token, should really be able to use same lexer
as we're going language L to L.
where ival(token) returns the integer value of the token. Syntax-directed token rewritesThat is not as obvious as an exemplar-based rewrite where i,j,k,x,y,z,v
are IDs and c,n,m are ints and s,t,u are strings (Idea I took from
Andy Tripp):
The right hand side would be tokenized statically (given a
pre-existing lexer) except the actions, which must be tokenized after
evaluation:
Andy has rules like:
But, for speed and sake of context, I like specifying the rule that
matches the input. I also like delimiters as I might want to match
white space at some point:
We need the <...> because we can't correctly/generally identify x and
n without them. Need to identify the elements on the left and
tokenize the right if we are going to token stream. First break up
into rule elements and literals:
then tokenize:
where we assume that the ".add()" notation is also in the language we are translating. If going to text we can leave as a template. In that case, the rewrite
string is converted to have x, n as template attribute refs <x> and
<n>:
Uh oh. We want to keep the same tokens if possible rather than retokenizing I'd say. So can't convert to string and then retokenize. Need to keep tokens in tact as somebody might annotate tokens with info during translation. We'll accept a string with holes but no ST extensions unless you jump into a {...} action and use %{...}. Repeated elementsTake TXL's example of dot product: (1 2 3).(3 2 1).
That rewrite would get translated (double quote string tokenized) to
where the {...} action would return a string that is tokenized at run-time. The $n and $m are List objects per usual ANTLR interpretation. Ok, now, let's turn to rule references. Rule referencesstat: "if ( true ) <s1=stat> else <s2=stat>" -> "<s1>" "if ( true ) <stat>" -> "<stat>" Also,
The 0 will be tokenized as INT. That has to be converted to a predicated element. See next section. More Andy rewrites:
Seems like I need to use <expr> not v1, v2:
PredicatesWhat about conditions patterns must satisfy? Should it be just a
predicate on the front or preds on the individual elements? Need to
constrain tokens:
becomes
Have to allow predicates also on left:
Input traversal and multiple passesHow many times are the rules applied and in what order? First, the rewrites are only applied in the grammatical context specified by the rule. Then, the rules are ordered by precedence just like autobacktracking mode. After each replacement, I assume we'd continue parsing after the rewritten input construct and look for the next match. Would we then just rewind the stream and do another pass until nothing changes? Perhaps, but it's pretty inefficient to pass over the entire input again just to look for a few patterns on expr. I think TXL will repeatedly apply the expr rewrite rules for a single expr until nothing changes and then continue. Maybe we could do this as an option? Or, for efficiency, we could map all locations in the input that start an expr and then do like an increment parse, jumping from expr to expr. The problem is that we'd lose context; i.e., which rule invoked that expr. Another way would be to make the parse tree during a first pass and then alter the parse tree during each pass. We could then jump straight to the expr subtree root nodes and look upwards for the context. Actually, repeatedly applying the expr rule set until nothing changes ala TXL is probably the right idea. No other rewrite rules can match the expr input so only a single overall pass is necessary. So there is local iteration, but each rule set represents a single pass. You could define multiple rule sets to define multiple passes; e.g., remove identify operations then define implicit scalars. Avoid parse tree construction to save time/space. When we have input f(f(x)) and we are translating f() to g(),
when does the arg list get translated?
First we apply the rewrite to the outside and get g(f(x)). Then we try the rewrites again for primary. We can't jump over the args. OH! It's invoking rule expr so it will get back down to primary and do it recursively. Marvelous. This way it kind of manually says which rewrites to go do. Pattern matching implementationI think we can just list the patterns first in the indicated rule and
turn on autobacktracking.
How do we make it spin around expr though until it doesn't change?
Perhaps a method that invokes the rule:
But, every reference to a rule with rewrites would have to call a meta-function like this instead of the rule directly. Hmm...does that mean the grammar must be regenerated every time I add a rewrite to a new rule? Is this related to grammar composition where I pull in components of a previous grammar? Hidden channelsHow do we handle comments in between tokens we rewrite? Ric Klaren
had the idea to match off-channel tokens on the left so we can
reference on the right.
where the dot in front means "hidden" like in UNIX filenames. May 9, 2006Gotta figure out how to deal with actions during backtracking. Started FAQ entry which explains the issue. What if we could auto unroll any object/data-structured you identified somehow as transactional? You would give me an undo() method for each object, like a hashtable. I could record the arguments to pass to undo so it knows how to undo. This would not be perfect as the arguments could point at stuff that could change (like a stringbuffer), but would be a poor man's transactions. We need to keep in mind that MOST of the time, this feature is unnecessary. It's only needed when you need both syn and sem preds and they interact. Let's see. Here's a simple example to chew on:
First time through on input
ANTLR backtracks through block and statement doing no actions. Then upon success, rewinds, and does it again with feeling. Upon input
ANTLR backtracks through block and statement doing no actions. Upon failure, it rewinds and does 2nd alt, calling block with feeling. This is the normal case. What if we needed a semantic predicate in statement that needed to know whether an ID was a var?
Here, you MUST have the define() function call executed or else the ...? sem pred will always fail. Alright. We need to unroll the define() action for sure but we could avoid if by throwing out the entire scope. If we executed the push/pop scope actions, the definitions would actually disappear already and automatically! Wow, interesting! Let's keep going with the undo() method though. I'd need to track the scope and ID.text and VariableSymbol objects so undo would know what to do. How can I know what to track? Impossible. Users will have to provide a specific undo action that I can execute:
That would unroll it no problem. Let's assume that plain ... actions always exec when an @undo is provided (heh, I like that). Where would the undo action get executed though? What if the undo action references a local variable? It won't even compile if I move it out to the commit/rollback of the synpred up in a(). The undo cannot be done in the rule as statement will be called multiple times. It has t be at the transaction end (synpred fail/success) where I roll back the input. Oh! Also, if it references input.LT(1) that will be at some unknown location during the undo action's rollback! There would have to be lots of restrictions on the undo actions. For example, I can see already that $ID.text will be unavailable in the rollback action in a(). Ok, taking the cue from Jeff Barnes, perhaps the undo action actually creates an object that parameterizes the action and executes it during rollback in reverse order:
Here it would evaluate scope and $ID.text in it's original context but execute the action during rollback. The object could be:
Later we can walk undoActions and undo everything. Hmm...would this work? Surely in this case. The key is passing all necessary data to the undo as evaluated in its original context. January 6, 2006I've been resisting the temptation to introduce a new symbol to handle
templates. For example, to avoid adding a new special symbol, I
introduced $templates::foo(args) as the template constructor to build
foo, which translates to:
With other template libs, you'd do $Java::method(...) etc... Now I'm finding real situations where ST integration can be improved. The ctor problem is reasonably solved, but setting attributes still sucks: st.setAttribute("arg1", e1); is repeated many times all over. A better notation would be $st.arg1 = e1; but that is highly ambiguous with the $x.y notation. Anyway, after looking at a number of real examples now, I find the urge to introduce a new symbol to help the human brain separate rule/scope attribute references from template syntactic sugar. I propose the following (developed in collaboration with Jean Bovet and influenced by Hartmut Kocher's suggestions):
What about other template scopes?
Make sense? Objections? hard to tell until you see a real example ;)
I see a lot of
which would be now
or even simpler:
I see lots of setAttribute stuff like:
Now it would be:
So x is just a variable like you expect; to use shortcut access to ST stuff, you need the %. Note how the $ stuff is now clearly separated from the % template stuff. December 23, 2005On Dec 23, 2005, at 3:12 AM, Hartmut Kocher wrote: 1) Call different templates (even when I only return one template, I might use some templates during calculation). We could use a common syntax here like $ST["name"] to extract a template form the group. This would be a reserved word. Yes, i have been working with this cool start-up company recently and I have been trying to use the template stuff. I have not needed multiple templates from a single rule, but let me add my own frustrations here and we can probably come up with a nice bit of extension. John Mitchell suggested this exact strategy...get something working (avoiding the complicated thing I was designing) and then try to build something with it! ;) So, I do this a few times in a Java grammar: StringTemplate fcallST = templateLib.getInstanceOf("fcall"); fcallST.setAttribute("ids", $type.st); // type is a rule reference above $anEnclosingTemplateVisibleHere.setAttribute("fcalls",fcallST); Wouldn't it be nice to do as you say (a rewrite like constructor but in an action):
and then we could have a syntax like pyStringTemplate to use array syntax instead of setAttribute:
The problem of course is clear: The dollar syntax $x.y is way to simple to handle the above and it's ambiguous anyway. We'd need a keyword as you say I think, but that makes it messy.
So, do we need a special symbol for inline stringtemplate stuff? Ick. @ is action, $ is attribute. Hm...keyword is probably better. Ack...can't just randomly look for a keyword...need a symbol. Hmm...well, what about, symbol plus braces:
I don't like overloading $ though but I hate to create another special symbol just for templates like:
Is there some generic thing I could do with %...? Ick. Would they have to nest?
Probably. To me that is not very readable to the casual observer where the explicit is very readable. Perhaps your $templates::fcall(ids=$type.st) would work with "templates" being the keyword / built-in scope of templates. Actually that would work. :)
It saves a bit of typing and is more clear. :) Better yet: How about defining a string template group in the grammar: group MyTemplates; // This only a defines a property in the parser to set a string template group. Later, we could allow to define the templates inline. I think you want this to be dynamic so, as you request below, you can reset the template group for each pass using setTemplateLib(). Then one could call templates of the group using a syntax like MyTemplates::Name. There would be a default group, which is used if the group name is omitted. We can just use the $templates::name(...) generic mechanism so we can dynamically change the actual group. Your way might be useful with the ST interface I plan on implementing. In this way, I could verify that your template names actually make sense. Heh, i like it! -> xxx calls default group "xxx" and returns this as result -> MyTemplates::xxx calls "xxx" in the the MyTemplates group and returns this as result. Interesting. You want to be able to define multiple template
libraries? Ok, so
I dislike overriding the $scope::x syntax so much, but it seems like it would work. How would we specify the new template groups?
then you could say $Java::fcall(...) and $Debug::dump(...). But, the lib names should be interfaces so you're not locked into a specific template group, right? This would generate methods like:
in the parser so you could set the actual lib. How often would this multiple template lib thing be used do you think? Within actions $MyTemplates::xxx would call the template. Now, we could nest them: -> ResultTemplate(MyTemplates::xxx($Name), MyTemplates::yyy($Id)), or whatever... Yup. Oh, right we'd need to allow
syntax too. 2) Build a collection of templates. These are useful to pass them to other templates... Currently, only the result of a rule can be added to a list. This could be expanded to support multiple lists, one per template. Just allow the same syntax to be used in front of each template call: myList+=MyTemplates::xxx adds the template to a list called myList. You can pass in multiple templates to fill with templates you create in that rule. You can pass multiple return values back and just add to a list manually. How often do you need to return multiple templates though? When I need this I just access the outer template where they will go and then setAttribute for each new template. I've yet to need multiple return templates. Maybe we should add the possibility to define lists in a scope: scope global { Use it like this: global::myList+=MyTemplates::xxx(...). We're getting closer and closer to the simple neutral action language I was thinking of adding later ;) Hm...this is getting a big much. I think:
would be just as good ;) Overall, I think, template manipulation within ANTLR should be expanded. Of I plan on it...just trying to let actual need guide us. :) course, I can code that now within actions but it would be far more user friendly if normal cases could be expressed in a standard and uniform way. Yes, let's come up with a nice extension. I have several 2.7.5 grammars in C# and Java, which only differs in the action code because of different method and class names. Using ANTLR 3 with templates I was hoping to move all differences to a set of templates. This only works if templates can be manipulated within ANTLR itself. Can't you just a different template group with setTemplateLib? BTW: I was thinking about using multiple passes using the same action code, but using different template groups for each pass... Are you using setTemplateLib? Let me add my own frustrations now. This is a drag:
Could we do this?
What if we always returned the template for a single element?
the template for identifier / builtInType would be autocopied to $type.st return value. I don't like this actually... it's not consistent with having multiple elements in a production. Perhaps
is "better". Weird looking though. Here is a case where I might have just plain foo instead of a.b.c. In either case I need to set the return type. I hate doing ( x -> template(...) ) syntax in the middle of a rule. I want that -> on the end! Here is what I do now:
Oh, actually we can do this properly even now. Duh, just queue up the IDs.
November 23, 2005I've changed my mind on this syntax too (yet more changes in my tests/examples). John Mitchell and others warned me that access to dynamically scoped stuff vs parameters etc... should be syntactically different. I agreed but was loath to use an extra valuable char like @, which I note is now very useful for actions. Anyway, imagine how things currently work:
Since function calls body, body can access the dynamic scope of function. You must explicitly say was is visible with the scope notation. Setting the dynamic value $name looks exactly like the parameter $funcName. In the body rule, you must specifically reference the scope with the attribute as $function.name, but this still doesn't scream "the value is in another rule, dude!" I am proposing to avoid use of a special '@' symbol or whatever in
favor of the C++ style scope override "::". THat said, I think the
distinction should be sort of "local" vs "global" so that you can say
$name inside the rule that defines the scope, but you must use
$scope::name in an invoked rule like body:
What about shared global scopes? I propose that we always have to use the fully-qualified scope in this case.
Speaking of which, we need to be able to specify some init code in the global scope so we can avoid code duplication and do avoid forgetting. I shied away from it originally because I didn't know what the syntax would look like and whether it would be useful. This bit me hard yesterday when building an ANTLR+ST example. How about just adding the init action into the scope?
Actually I'll be using probably @init ... (yet another tweak to the syntax...sorry! It's rapidly converging though I'm letting actual experience show me my faulty syntax). November 18, 2005Ok, talked to John Mitchell for like 2 hours by phone yesterday (well, actually today the 17th, but I wanted this entry to be separate so I bumped the day). He convinced me that the ANTLR+StringTemplate integration should be braindead simple, which means it's not as automated as I was going to make it. It's very clear now and experience will then guide me to make shortcuts later. :) Ok, the idea is simple:
Actions Currently, one cannot specify a set of instance vars/methods for the
lexer in a combined grammar:
The int i; action only goes in the parser. Further note that the
header works for Java, but C++ with it's requirement of
topologically sorted elements (ick), you need multiple headers. We
solved this in v2 by having named headers like:
Seems we need to generalize this for all languages and make a consistent means of setting actions in various places. ANTLR should not care where these go; the code generator either would use them or not. It would just convert the actions to template attributes, which the code gen was able to use or ignore. Perhaps the Target object should indicate the valid set of these; probably. Anyway, how about this:
The lexer_members name would actually be translated to members by
ANTLR for the code generator when it handled the lexer grammar (antlr
pulls apart the above combined grammar into two grammars and runs
itself recursively on the lexer grammar). So, in a pure lexer
grammar, you'd do:
What about rule init actions? Now we do:
Instead perhaps we do:
Where we avoid using a name with the @ as it's clear it can only be
the init action. Actually, what if we add the notion of a "finally"
action:
Yeah, let's make it @init. These actions are translated to attributes of the rule template in the code generator. Sounds ok, but I don't like that the @ char conflicts with the StringTemplate usage for specifying regions. This brings us to the StringTemplate code gen integration. Altering code generation To alter the output code is easy with v3: just change the templates,
but then you have to get ANTLR to see your new templates etc... I
think John Mitchell suggested putting the templates directly in the
grammar (probably at the bottom). This way the templates are
convenient to override and they are carried along with the templates.
How about the following notation:
The @element.prematch() means to override the region defined in
template element, which is defined as for Java:
Issue: the @ inside the codegen section doesn't mean action like it does outside per the above discussion...it means StringTemplate region. Issue: The curlies on the outside could actually be hard to match. I'd have to call the StringTemplateGroup group file lexer to correctly match the templates and know when to stop parsing at the final right curlie. Hmm... November 17, 2005Ok, after yesterday's efforts, I think I have narrowed down my problems to:
Ok, so it looks like I still am not even sure of the scope of my issues. Defining template attributes Heh, i know the list of referenced attributes. I should be able to do this:
and
where I need $ID instead of ID since the templates are like actions--the ANTLR-related stuff has to stand out. Why can't I refer to the grammar elements directly ???? rather than using
template argument attributes as in
Oh, for external templates, we'll need it anyway:
Also, it gets messy when you add ST syntax:
I don't like augmenting ST syntax. Hmm...Also what about:
Perhaps I treat templates like actions and translate $ids+ to antlr_list_ids and $id to antlr_id or some such. In templates though the + is probably unnecessary as ST assumes 0, 1, or more elements by default. Do I really want to encourage inline templates? Sure looks good for journal papers. November 16, 2005I completed my rewrite of the ANTLR+ST example and it highlights a few things:
Let's do one possible rewrite of cminus.g from Language Translation Using ANTLR and StringTemplate side by side:
November 15, 2005More thoughts...trying to redo the article I did on ANTLR+ST using new notation. Some problems. Trying to fix. We'll need some basic template construction for single elements, for
example:
and
November 14, 2005Ok, I've done lots of thinking about adding template rewrite rules to ANTLR v3. General points:
RewritesNormally you will want to separate the templates into a separate group
so you can plug in another group of templates for a retargetable code
generator:
where template decl might look like:
which generates stuff like "x = 3;" and so on. You can have multiple rewrites:
You could specify that last alt with an empty rewrite also:
Predicated rewrites. Just like tree rewrites, you can predicate the
templates:
Here's a real example:
Reference to templates in a StringTemplateGroup file. The code gen will add setter/getter methods for accessing a template lib pointer so that the translator is retargetable. Inline templatesSometimes you will just want to build something quickly and will want
templates inline:
where template(...) is like lamba in some languages; that is, you
are saying that a piece of code is coming--here, a template is
approaching. W/o it, the arg list (a=ID,b=INT) might look like
subrule or something:
I also contemplated:
which is sort of similar to the anonymous template stuff in StringTemplate. Big templates. When the template gets big or multi-line, you can
use the <<...>>:
As you can see though, large templates should really be pulled out and put into a StringTemplateGroup file. Template argumentsOne of the trickiest parts of this is the bridge between ANTLR and StringTemplate. The argument list is an assignment where the left-hand-side (lhs) is a StringTemplate attribute and the right-hand-side (rhs) is some ANTLR value. You may reference:
Once inside a template, you are free to access properties of the
incoming object whether it is a Token or some rule's return value
structure.
Dumping rule a's template, you'd see: "3 C abc" if the ID is abc. May 1, 2005I'm neck deep in the action translation at the moment; mostly refactoring and adding unit tests after speaking with John Mitchell yesterday at length. After he smacked me around a little, I am convinced that I need to make some changes. One is that we will use $x to refer to the stuff you're used to like parameters, return values, token labels, and rule labels; @x.y will refer to the new dynamic attribute stuff that you've been hearing about. John and I agreed that a separate symbol is justified as the dynamic attributes are so different than the regular attributes. Anyway, on to simple $x references. One of the things that drives me nuts in 2.x is that sometimes you can forget the #x or $x and have it still get through ANTLR to compile and run...but #x, for example, is different than x! Ack! This kind of bug has trapped me for hours sometimes. My proposal is that all labels and simple attributes like parameters are generated with a prefix or something so that you cannot accidentally reference them in an action. This helps me track which return values you access, for example, which helps me generate better code. If nobody references a rule return value, I won't generate code for it. :) So if you had:
you'd get compilation errors on s, x, y, id, and f. You need to do
Note that multiple return values require that I build a struct or simple class to hold the values so s and x do not even exist as simple variable references. January 30, 2005I finished off a first pass of the new attribute stuff in early
January, but am just getting around to documenting it. Here is my
semi-useful test case:
The rules classDef and slist share a single stack of symbols attribute scopes. Each entry into one of those methods pushes a new symbols object onto the same stack. Rules methods and decl may access the top of the attribute stack with @symbols.n or @symbols.names. Dynamic scoping. Yum! The init{...} action is just a regular rule init action, but it is used to initialize the new scope. I didn't think a constructor-like thing associated with the scope{...} was warranted. Scopes unify conceptsEverything is a scope now. The rule arguments and return values are all just scopes (note multiple return values now fall out naturally). The one difference is that rule arguments and return values are lexically scope. I don't want a future rule reference to see arguments of a previous call; those should be "private" to the rule. Note the (useless) action
that asks the reference to rule field for it's stop token. Each
rule reference can access start and stop tokens that indicate the
first and last token matched within that rule. Nice, eh? The return
value of field is:
Token ScopesReferences to tokens are now seen as var=TOKEN and @var is a scope
of attributes. Every label on a token has an implicit scope defined
as:
So, properties of Token objects are also just another kind of "scope". :) I am contemplating allowing @ID as a scope so @ID.text could be used without resorting to a label on the ID if it is unique in the production. static attributesAs I mentioned previously, sometimes you want an attribute in a scope
to only have one value. This means static in C++ and derivatives,
but in our case it's a wee bit more complicated--we need to have one
value shared only within the parser not across multiple instances of
the parser object. Hence, we need static attributes to actually be
implemented with an instance variable. Note that in a program with
only one parser instance, static attributes map to static parser
fields. Currently in Java, for
ANTLR generates:
Translation of @-referencesIn ANTLR 2.x, each language target implementor had to provide a
grammar that would parse actions and translate $getText and other
junk. In ANTLR 3.0, the code generator does all the parsing and
merely asks the templates to translate individual references via a few
templates. For example, here are some of the attribute translation
templates for Java:
Accessing non-top-of-stack scopesSometimes you will want to walk up the stack of attribute scopes to
see previous values. I think I will simply let the user ask for the
underlying stack data structure so they can use target language
constructs and methods to do whatever they want. I can imagine
something like:
or perhaps @symbols.stack is better. December 31, 2004Implicitly-named scope attributeslong stream of consciousness that ultimately convinced me of the right syntax/semantics, though I didn't summarize my findings in one spot. The "explicitly-named-scope-only" spec is bothering me. In other
words, sometimes you want to share attribute scopes like symbols,
but other times you don't want to share at all like an attribute to
hold "name of the current method I'm parsing".
Now, name is a horribly unspecific identifier just like a global variable called name would be. This is the original reason for having named scopes. On the other hand, we could ref it as @methodDef.name and then it's actually a good name and follows the @scope.attribute syntax perfectly. Ok, so I'll allow you to define attributes within the named scope of a rule; makes good enough sense to me. If you are in rule r, then you should be able to refer to r.attr
with either @r.attr or @attr? I definitely do not want to inherit
attribute names into invoked rules:
It's too easy to name two attributes the same thing and then mistakenly get the wrong value 10 levels down in the call chain (for big grammars anyway). For example, is @name the class name or method name. Better to be specific. If you need ambiguity about where the symbol is defined such as symbols scope entered in classDef and slist, then you need to create a named scope. On the other hand @x within rule a is probably ok. Perhaps I should do an attribute lookup in the immediately surrounding scope (return values, parameters, implicitly-defined attributes) if you have just @attributeName. Handy when the rule name is long. So, full syntax for rule definition vis-a-vis attributes:
Return values are like attributes visible to that rule and to the calling rule (but read only to the calling rule). Arguments are like attributes visible only to that rule. Attributes are visible to that rule and any rule invoked from that rule (read-write); they are exactly like local variables that are dynamically scoped and visible to any method that can see it's stack activation record. Specifying type names across targetsJava and C++ differ for variable definitions in that Java easily separates type and name of the variable whereas C++ has the name in the declarator part as in char *p[3] (array of 3 char pointers). So, how do we keep a language independent syntax for argument definitions and so on? The answer is that ANTLR must ignore the definitions of arguments, return values, and attributes; it treats them as actions. The only restraint is that the type specifier cannot have a comma or semicolon in it because those will separate the definitions (and any square brackets must balance). How can you have a comma in a type? C++ allows char (*p)(int,float) for example. So, you can have whatever funky stuff you want as long as ANTLR is
able to pull the individual declarations out:
If your language can do without static typing like Python, then do this:
So, if you defined a named attribute scope such as this:
you'd see the two variables i and names defined as-is in the generated C++ as part of the symbols scope definition. They will probably just end up fields in a class definition. Heh, what is the type of a scope? For example, if you want to record
a scope in an instance variable for use after parsing, you need to
know the type. Given:
what is the type? Perhaps:
The scope_symbols is target and code-generator specific, but so what. All the type names are target specific already.
Multiple return value syntaxHmm...how about
It looks like parallel assignment. Makes more sense than x,s=b[34]. This brings up a lexical issue for ANTLR. When it sees [...] out of context (as the lexer has no context; only the parser has context), which of the following is it?
The expression list is the worst: could be anything in quotes for example. The comma and semicolon do not separate the expressions. Ok, only solution is to totally ignore what is inside of [...] in all
cases except that when the [...] is interpreted as a list of
attributes. In this case, each target must indicate what the
separator is between attribute definitions. For Java/C/C++ and
most languages the comma would be replaced with a semicolon. Hmm...I
don't like that asymmetry though with named scopes, which are already
semicolon separated. Certainly it would be weird to have all semicolons:
What about attributes being semicolon-separated all the time:
Rather than attributes keyword, we'd use an unnamed scope (implicitly using the rule name as the scope name). Interesting. At least the semicolon thing would be consistently used only in curlies (like tokens section too, right?). You can have multiple scopes; in this case, rule classDef has an implicitly-named scope and it uses symbols scope. Initialization of attributesHow do you set the initial values of attributes? You might want the
attributes of scope symbols to be initialized differently depending
on where the scope is used (classDef vs slist). Do we put in a
constructor? Nope, because you might need to exec some code to
determine the constructor arguments. Easier to just let the init
action of a rule set the attributes:
Better to have something simple, obvious, and flexible than something that looks good in the manual or in a journal paper. ;) You can define your own void init(scope_symbols s) method if you want to factor out the initialiation code. This solution is also much more natural for non-OO languages. More Syntax MusingsNote that for the init-action, I have added init. I just noticed that this rule header is getting rather complicated. The {...} might look like an explicitly-defined scope without the init or a separator like ';' after scope. All is cool as long as the keywords are there. Ooops. What about options? At the moment, they are just ID=value
(comma-separated) before the ':' of a rule or subrule. Add options
keyword back in?
Still needed for subrules I suppose:
Boy that looks so weird now without semicolons between elements in the
rule and subrule. Ok, perhaps add separators:
where I have not added semicolons after [...] or after {...}.
For subrules:
What if we got rid of returns in favor of old PCCTS syntax:
Nah...returns is just too obvious to get rid of and > is nonobvious. What if there are only options:
Then the semicolon is not a separator but a terminator and looks
pretty superfluous. Ugh. If we made semicolon a separator, it might
be better, but it would force some order on the rule header:
Actually specifying semicolons as separators is rather hard in a context-free grammar. Ack. I might need the init last anyway so ANTLR has seen all definitions before it has to look up the @refs. Attribute cardinalityDo we need cardinality modifiers to indicate how many values an
attribute holds? I guess not when int and List make it pretty
clear:
numNames can only have one value. Setting it twice must reset it. Only in StringTemplate, where there are no attribute types, do you have issues with setting an attribute more than once. Now, what about static vs dynamic? The static keyword works, but that means I have to parse the damn
{...}, which could be really wild for some languages. Heck, python
would look like this probably:
I guess if I require static as the first word of any definition within curlies, I could pull it out and then give it to the StringTemplate that generates scopes in the code generator. Uh oh, there is no separator for Python. How do I know where the start of the declarations are? Ok, looks like any set of declarations will have to be semicolon terminated and then I'll send a list of attribute definitions to the code generator. I'll do the same for argument lists. I'll have the ANTLR lexer grab as a big chunk of text anything in {...} and [...] and pull it apart later when I know what is inside by context. Come to think of it, I definitely need a list of attribute definitions so I can pull static attributes out for use with C output. Attribute operatorsDo we want accessor methods for setting/getting attributes? You might want to override what setting an attribute means. For example, when you set a parent attribute that points to the previous scope you might want it to go to the parent and addChild or something. For now, I'll not add method syntax to the attributes. Oh, what about accessing scopes other than the top of the stack? What about the "previous"? @symbols[-1] for previous? @symbols[i] is the ith element from 1..n for depth i? @symbols[n] would be the top of stack. Depth is better than size or else people will think 0..n-1 instead of 1..n. I guess we'd need @symbols.depth or @symbols.size as a built in attribute. 0..n-1 is more natural to programmers, but 1..n depth avoids a constant i-1 everywhere. Tokens, labels, and attributesIt has always been a problem explaining the difference between ':' and
'=' in ANTLR. The ':' is for labels and '=' is for return values.
Let's just get rid of the ':' and use '='. For token references and
rule references both:
So @id is the named scope for the attributes of a token. Uh oh, how
do you refer to the return value of a rule and the tree that it will
create (or later perhaps the text it matched)? Ok, let's try again.
Get rid of the parallel assignment concept and access the return
values as attributes, which is more consistent.
Heh, this is nice. You don't have to define the result of calling a method. That always drove me nuts in 2.x. To get an int return value I had to go to the init-action and define a temporary local. Ick. Cool. We have syntax now for the tree created or the text matched by a rule: @x.tree and @x.text. Consistent with tokens then: @id.tree and @id.text. I like it. A token has attributes as well; by default it has line,
charPositionInLine, and text. Does a tree? Sure: token and
then the children. Should we allow people to define this in the
grammar like this:
Not sure. It's pretty easy to let them define a new Token subclass, but then I guess I couldn't check references like @x.tree.def for validity at compile-time. Just because they can define the attributes, doesn't mean they can define methods. I have not found myself defining anything other than accessor methods for Token so that's probably ok. AST on the other hand can have lots of added functionality. What about heterogeneous trees? Seems that you don't want to define a bunch of different types in the grammar file; you'd have trouble with inheritance relationships and so on. The target language should really be the one to define different ASTs, but that makes them inconsistent with Token definitions. For a first pass, let's leave out Token and AST definitions. The new Token definition has just about anything you'd want for now. Worry about creating application-specific tokens and trees later. StringTemplate integrationWhat happens when we integrate ST whose attributes are totally
untyped? Surely it's attributes will have to be visible so you can
set them:
I have added Smalltalk-like argument syntax to define the attributes
and then have cardinality operators. More often for large
(retargetable) translators, you'd refer to the template by name rather
than using an anonymous template:
elsewhere, in a StringTemplateGroup file, you'd have department
defined with parameters and cardinality:
Are the attributes of department visible in rules invoked by rule? They'd have to be, but would need to be fully qualified (just as they need to be fully qualified within rule). So I'd push a new StringTemplate object for each invocation of rule. Nice. Consistent. I like it. Actions and code generationIn ANTLR 2.x there are things like $getText and #foo that need to be parsed out of actions and translated. Each code generation target must build one of these translators. I'm going to invert it for 3.0 so that I'll cut out the @scope.attribute references and then ask a template to translate it for whatever target and then stick the resulting template back into the action. Attribute persistenceHow to keep a tree or list/set of attribute scopes around after parsing? In other words, what if you want the tree of symbols scopes to hang around after you process the data? The easiest thing to do for a symbol table "tree of scopes" is link up
the scopes with a "parent" pointer and a list of children.
After parsing, symbolTableRoot would point to the root of the symbol table. If you are building a parse tree or an AST, then pointers to the scopes could automatically be kept for easy access later. December 30, 2004More thoughts on attributes now that I've gotten error recovery working well. Talked to John Mitchell on the phone to refresh my memory. He stressed named scopes and cardinality specification and making clear the static vs dynamic distinction. After the conversation and walking around, I think I have a decent first pass at what attributes should look like. There are a number of issues:
IntroductionFirst the basic idea. Every time your parser enters a left curly such as in class definition or statement list in Java or C++ like programming language your symbol table must enter a new symbol scope. Rather than having a global symbol table that tracks scopes, you could define a local variable, "List names;", in rule classDef and slist to track the symbols in a class or statement list scope. Unfortunately, rules called from classDef and slist could not see that local because most programming languages (assume we're generating parsers in Java) have lexical scoping (that is, variables "die" at the right curly}). So you'd have to pass the list down as a parameter to every rule invoked below. Ugly. Wouldn't it be nice if the invoked rules could simply see the local variable? In our case, the fact that locals are allocated on the stack is the key not the fact that they are locally and lexically scoped (we want the opposite, right?). Anyway, we want dynamically scoped variables; we'll call them
attributes and have named scopes to reduce collision between
attributes that have the same name. In the case of the list of names,
you could define a scope such as:
and then have rules indicate that they are in that scope:
Consider what the parse tree might look like for input:
where <...> indicates some appropriate input. Including named attribute symbols, here is what it might look like: There are three scopes for this input and like magic the parser would create three scopes tracking them in scope-like fashion so that as you leave a rule, the top scope is popped off. Accessing an attribute in a grammar action would access the top scope by default though we should allow for random access to the stack of scopes. The diagram shows three references to @symbols. The value depends on where in the grammar the @symbols reference is and how deeply nested the parser is. For example, the same grammar action executes @symbols in rule atom but for our input the action actually references different names lists. The implementation for this scheme is pretty obvious:
Static vs Dynamic AttributesSometimes you do not want multiple copies of an attribute, but yet you
want it encapsulated/scoped with other attributes. For example,
imagine we want to track globally the number of symbols. You can add
the variable to scope symbols, but indicate that it's static vs
dynamic:
This amounts to an instance variable really, but for targets like C you would need to encapsulate numNames in the struct holding attributes of symbols. So, there is only one copy of numNames no matter how many scopes are created. This is directly analogous to the difference between static class variables and nonstatic instance variables in object oriented languages. In fieldDef then you could increment the count:
where "@symbols.numNames++" would translate to "((scope_symbols)symbols.top()).numNames++". Parameters, Return Values and AttributesPerhaps parameters and return values (yes, multiple return values)
should be attributes within the scope named after the rule. Heh, I
like it. So
Not bad actually. This would distinguish them nicely from local
variables and token/rule labels:
Actually should labels on token references be "scopes" for the attributes (which are properties, in this case) of the Token? So the reference to id.getText() should be @id.text perhaps. Think about later
October 27, 2004Dynamic attribute scopesRegarding attribute scopes (where you don't have to pass a value down
as a parameter down the call chain) such as:
I ran this scoped attribute / dynamic scoping attribute stuff intended for 3.0 past an attribute grammar expert at dinner. She agreed it was interesting; she suggested that Uwe Kastens had already done something similar. I just found some of his lecture notes. He calls it "Dependency pattern INCLUDING". "An attribute at the root of a subtree is used from within the subtree. Propagation through the context in between is omitted." So that is what we are talking about with dynamic scoping 'cept that we use named scopes rather than rulename.attr. Thankfully the antlr2004 workshop discovered this need! Rewrite rulesMany rewrite engines exist such as TXL, Stratego, DMS, etc... They are huge, however, and clearly not intended as little tools that get integrated into somebody else's application. Further, they are extremely complicated; the average programmer will surely be intimidated. Large declarative systems also pose problems. Anybody who has done a bit of prolog realizes that when a bunch of rules don't get the right result, it's pretty hard to figure out why not. Seems to me that we should be able to produce a rewrite system that is more deterministic. It could be constructive/imperative such as aspects that get attached to a grammar or declarative with a small set of rules. (I'm tossing in another reference to James Gosling's ACE syntax-based C preprocessor. It's pretty nice, if specific to C. It is a declarative style macro system). DeclarativeYesterday I had a list of student usernames and had to generate
perforce revision control client specs.
First, I used awk to generate Java code:
from data like this:
Then I built a Java program that walked the users HashMap and used
a StringTemplate to dump out all the specs:
Now, awk is powerful enough to do this itself I guess with some work,
but it seems that we should be able to do a simple declarative tool
that handled it better. I'll use the task as an example. Perhaps,
something like this:
where <INT> and <ID> could be some predefined tokens and
<inputRecord> is some rule in a grammar like:
That would make the replace rule kind of an aspect attached to the inputRecord rule that would get executed whenever it matches a line of input. You could inline the rule too:
where the tool would implicitly create an anonymous rule that matched INT followed by ID. If we don't match replace patterns up with grammar rules, then an "engine" would have to search the entire tree looking for subtree matches. That would be pretty slow in the worst case, but for one-off transformations like the above, who cares? Still I like the "aspect" nature of this where you decorate a grammar in a separate spec. AspectsSo what if we take a stock grammar for Java, for example, and then
allow a rewrite spec to feed off of it like a series of aspects? For
example, one could do this:
I don't like all the labels in there but how else to reference what is
matched for <ID> and <expr>? Experience with ANTLR 2.x suggests
that allowing the user to reference <ID> when unique causes
trouble...actually I think that was because we allowed ID not #ID
to mean the tree and ID-as-tree conflicted with
ID-as-token-type-constant. So perhaps when unique references:
The key here is that you get to "speak" in the source language directly rather than in the terms of parse trees or ASTs (you don't care anymore)! Notice that you must specify what kind of construct you are matching with the "<statement>:" syntax; that's the aspect part where you are decorating a stock grammar w/o having to modify it directly. This could solve a lot of our "how do we reuse grammars?" problems. Ok, so the with clause would actually be a StringTemplate so you
could do cool formatting stuff. For example, if you had a grammar
for a language that had the enum construct:
you could convert to Java like this:
where <it> is the standard StringTemplate iterated value as it walks
the tokens matched by exprList; <i> is also a standard attribute
that iterates 1..n. Ack! I just realized that the commas would still
be in the token list matched for exprList. Do we want to allow
properties of the rule to be the tokens with a particular type? It
would look like:
Hmm...not very general I guess. Another way is more explicit, but
does not reuse the input grammar. Perhaps the <enum>: reference on
the front just says apply this pattern when you see a node in the
parse tree that is enum:
where <(',' ID)*> is an input pattern conforming to an ANTLR grammar fragment. Wow...might make sense to a human, but not sure how we'd implement that. Predicated rewritesSometimes you don't want all syntactic patterns to rewrite, just some
depending on a predicate. For example, you might want to rewrite
the Day enumeration one way and all others another way.
where the conditions would be arbitrary code written by the programmer just like in an ANTLR grammar. This is a big difference from "closed" systems like TXL that ask you to use their predicate mechanism; there's can probably be proved to do correct translations but ours would be easier for programmers and more flexible in the sense that a programmer could use his/her own data structures. The other difference of this approach to TXL and other systems seems to be that I would make it very clear when something gets executed and would probably generate a new version of the underlying grammar that had these rewrite actions in them so you could learn exactly what was happening and you could still debug with your favorite debugger. I would like to resist a black box "engine" if we can. It seems a crucial idea that you do rewrites in a series of stages, with a single stage doing a single well-defined task. Otherwise, you have the problem of a single rewrite generating output that another rewrite in the same stage should rewrite. This is like in the C preprocessor where the result of one macro must invoke anothe macro to get expanded to text. This leads to madness for complicated macros. Separating into separate stages is a good idea. The only trouble is that once you do a rewrite, your next stage's "input" no longer conforms to the original grammar. TXL and friends build combined grammars so that at any point you can get the parse tree for a constrct in the input or output language. Not sure what to do here. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||