|
Home |
Download |
News |
Wiki |
About ANTLR |
Feedback |
Support |
Bugs
|
|
|
Latest version is 2.7.7. Download now! » |
|
|
ANTLR 3 Lexers, Parsing Integration
November 6, 2004Had a long conversation with John Mitchell today and firmed up a very very clean answer to all my problems. John suggested a very nice filtering scheme too that obviates the need to do skip() or any of that cruft in the lexer. Simplified lexersAfter reviewing a bunch of lexers, it seems that much of the fancy functionality like '!' operator and such are not used. I am stripping down the lexers so that they have notions of:
So in lexer actions you can do this:
and that's about it. No parsing of $setType type stuff. Token buffers and lookaheadSo, the lexer always creates tokens for comments and whitespace, which is the commonly needed functionality. The parser simply ignores any tokens whose channel is not the comments/whitespace channel. It may cost a wee bit in the lookahead operation but not as much as you'd expect due to the left-factoring implicitly done by DFA decision making. The parser will incorporate TokenStreamRewriteEngine functionality so that you can always alter the input stream to do easy rewrites. That functionality will go a long way. AST creation will ultimately just point into the token buffer rather than copy text and so on. November 5, 2004Literals go into DFA, parser-lexer token type communicationAfter discussions with the usual suspects, particularly with Loring Craymer bugging me, I'm convinced that we should not have a literals table like ANTLR 2.x had. This was a liability due to weak LL-based lexers (LL lexers shine elsewhere). Part of what tipped me over the edge towards literals in DFAs, was my generation of bytecodes. DFAs in java via static objects is just too huge. By my estimates we can get about 2000 states into a single method, hence, single DFA. There are perhaps <100 keywords in just about every language, which should fit just fine. We could get maybe 250 keywords into a single DFA. At first, I thought that this simplifies the interaction of lexer and
parser. The parser grammar could reference anything in double
quotes--the string gets would get converted to a rule in the lexer.
For example,
would yield a P.tokens vocabulary interchange file:
Then, a lexer grammar could pull in this vocab:
The generated lexer would behave as if you had typed:
Every literal in the parser would get shoved into the lexer as a separate rule. BUT, I've just decided I don't like this given the discussion of integrated specs below. When you build a lexer, the grammar would dictate completely the rules. The incoming vocab would merely set the token types so a parser and lexer could sync up. I don't like the idea any more of having a parser be able to force behavior on a lexer (then it's consistent for people that build their own lexer or use, shudder, flex). Literals in your grammar will not be sent to the lexer. I guess that implies that you could not use "for" in the parser grammar, you'd have to say FOR. That is, unless you use a combined spec. The only means of communication between parser and lexer vocabulary-wise is token types. Clean and simple. I like it. I guess this will encourage the use of combined specs. Hmm...they could get big, but PCCTS seemed ok with combined spec. Integrated parser, lexer specificationWith the power of recursion and real LL-based lexing (instead of a DFA based approach), ANTLR 2.0 lost some convenience of specification. It was harder to just list a bunch of lexer rules and have ANTLR know how to separate them (e.g., INT vs FLOAT). You had to do lots of left-factoring. Ick. Anyway, this made it more difficult to reproduce the single specification PCCTS (ANTLR 1.0) introduced so ANTLR 2 dropped the concept. ANTLR 3.0 is a hybrid approach. It uses LL(*), which is LL plus a
full DFA decision-making algorithm that can see over arbitrary
left-prefixes common among more than one rule. We get the best of
both worlds. With this increased power, one should simply be able to
list a bunch of lexer rules and have ANTLR separate them while still
providing recursion, readable code, and other LL-based lexer
advantages. If we can just list rules and have it work, we should be
able to merge the lexer/parser spec again, at least for many
languages. Naturally, from the single spec, I would just
automatically create the lexer grammar for the user. For example, one
might code the following combined grammar:
ANTLR would automatically generate the lexer grammar:
It saves you the trouble of making the separate file and possibly mismatching the literals referenced in the parser and defined in the lexer. (Actually, ANTLR wouldn't even have to generate a file, it can build a text string and then invoke {Grammar}'s constructor to suck it in. Cool, I never did like the {parser.dlg} intermediate file from PCCTS days.) More importantly we can provide error messages now if you reference a token that has no lexical definition; Ric Klaren will jump for joy! How does this automatically-generated lexer file affect debugging? I think that it would be fine to let programmers debug the code generated specifically for the lexer (even though they did not create that lexer file specifically). As far as error messages, they would all point into the combined spec the user wrote. Does a combined spec remove the need for the tokens section? Nope. That item is used to introduce imaginary tokens and associate tree node data types with tokens etc... "Protected" lexer rulesWe would still need lexical rules that don't yield tokens--they are
just referenced from other lexer rules. ANTLR 2 uses the protected
keyword, which is hideous. I'd want to be able to do this still:
Heh, we don't need a keyword! If the token rule, DIGIT here, is not referenced from within the parser rules, it's not a token! Heh, that's awesome! Hmm...what to do when the lexer is in a separate file? Oh, if DIGIT is not in the token vocabulary import file, then it's not visible outside and is hence not a real token rule. Heh, I like this! Woohoo! Lexer rule return valuesI'm thinking that each method generated for a lexer rule should return a token. My only hesitation is for rules that are not real tokens. It would be expensive to compute/create a token object for each reference to, say, DIGIT. I do want the text matched for any rule, though. Perhaps, each rule returns an index into a text buffer indicating what text was matched for that rule. No, would need a range; would have to create a range object. Just as bad. I don't like the way ANTLR 2 does it--only create a token if somebody references the protected rule. Ah! I know, let's just leave it up to the programmer to specify what a rule returns if it's a non-token rule. For example, you could define DIGIT to return a char. Rules like INT that are real tokens would implicitly return a token? Setting token type, text, tokenIf one needs to alter the token type in a rule, we've been using
$setType etc..., which is a drag because then ANTLR needs to parse
lexical actions and look for $setType. Perhaps we should simply
mandate that each language target should do what is natural in that
language. For example, Java/C++ could do setType and C could do
set_type. The think I don't like this because a method may not be
what you want. You might want to set a local variable called ttype
as is done now, which is used to create the token for the rule.
Also, an invocation of nextToken could actually generate more than one token; sometimes you need to insert extra tokens (though that is rare). Perhaps we need an emit() method that just adds a token to an output queue. Then the user can compute whatever tokens they want and do an emit. But using the emit, there would be no "global" instance variables for token type, token etc... Methods for lexers wouldn't return anything then by default--they just compute and add tokens (or none at all). By default, each token rule would call emit at the end of the rule? Nope. We'd get duplicate emit instructions if the user put one inside a lexical action and I don't want to have to go find all the emits in the actions. I suppose you could check for empty token object emission programmatically by checking if the output queue had changed. That would work: if you don't emit a token, one will be emitted for you based about the text matched for the rule and token type of the enclosing rule. If you have one token rule call another token rule, the invoking rule would know still that a token had already been emitted and would skip the automatic emission I guess. Ok, back to setting just the token type not the token. Perhaps we
mandate that a local variable exists called ttype and you just set
it directly in any target language syntax.
The INT method would start by setting ttype=INT; and then you could "override" it by setting ttype manually. If you do not emit a token object manually also, then the INT rule would emit a new token object based upon ttype and the text matched for that rule. Well, that is, unless ttype is the "skip" token type. Being efficient with text / tokensKeywords and operators are very common tokens for just about any input
so it seems a waste to recreate a new token and string for them each
time. The only problem is that you would have screwed up line number
information, but for keywords that may not be too much of a problem.
Anyway, you can manually hack this:
Normally, a lexer has a text buffer (StringBuffer in Java) that builds up the text for a token. Then a Token object is created and a String is created from the text buffer and stuffed into the Token. Most token text could be more efficiently specified with two indices that indicate the start/stop index in the incoming char stream. This saves building String objects and doing double char copies (once from input buffer to lexer text buffer and then again into a String object). Much faster. The only time you need a text buffer is when you are not getting an exact copy of the input buffer such as n converted to newline char or when you want to strip the quote marks from the text of a string. Sometimes you want to totally rewrite the entire input text with something else. So, we need a hybrid approach where we can have Token objects that use indices and Token objects that have an actual string in them. A simple subclass will work. What is the signal in the lexer? Hmm...perhaps if I see any rule with a '!' (don't include text) suffix, I use the internal buffer. Ack. No, the '!' doesn't seem that useful after looking at a bunch of grammars. Better to make people create their own token if the text is different. Skipping tokensHow do you say "skip"? Currently we set the token type to a special value Token.SKIP. I've always hated this non-explicit skipping mechanism. Perhaps if you don't emit anything then nextToken would look for another. Actually, that is the signal I just said we'd use for auto-emission of a token. Damn, perhaps a special token type is the only reasonable answer. Actually, a skip() method will work. It can set a boolean that the nextToken() method uses to restart. Nope...see the Nov 6 entry above. FilteringHow do we do that "filter=true" and "filter=aRule" functionality like ANTLR 2? Ah. Easy. When the predictor DFA for all the lexer rule returns an error, just invoke the filter rule. If you have set filter=true rather than a rule, it defaults to a rule with just '.' wildcard in it. A char is consumed and the nextToken() rule tries again. Token objects are not created for filtered text. If you don't want a token created for each whitespace token, for example, you can put the whitespace rule into the filter. Single combined parser-lexer recognizer object?One could argue that a single combined parser/lexer object should be generated instead of separate objects, but they are really two separate recognizers. It is also useful to insert a token stream filter between the lexer and parser. I suppose one could send in a filter to the combined object; the object could just insert the filter between the parser and the TokenSource (lexer). Hmm... in general it just seems kind of messy to combine recognizres due to conflicting definitions of "match" for token and character in the same object. |
|||||||||||||||||||||||||||||||||||||||||||||||||