Home | Download | News | Wiki | About ANTLR | Feedback | Support | Bugs


Latest version is 2.7.7.
Download now! »

Download
» Home
» Download
» News
»Using ANTLR
» Documentation
» Wiki
» FAQ
» Articles
» Grammars
» File Sharing
» Code API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»TML
»PCCTS
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



ANTLR 3.0 Error Reporting and Recovery

RSS Feed
TJP: My current thoughts on error handling

December 29, 2004

Re: error recovery and semantic actions, trees

Ok, so we handle errors within a rule for mismatched tokens now. This introduces a new problem: actions may refer to labels on missing tokens. For example,

expr : left:INT PLUS right:INT SEMI
       {System.out.println("sum="+Integer.parseInt(left.getText())+
                           Integer.parseInt(right.getText()));
       }
     ;

What happens when you have input "3+;"? You will get an error saying that INT is missing, found SEMI. It will "insert" an imaginary INT token to continue parsing. There's the problem: it actually doesn't create a token--right will be null.

I think I have a simple solution: have token labels be initialized to an ERROR token so at least you won't get null pointer exceptions.

expr() {
  int left=ERROR, right=ERROR;
  left = input.LT(1);
  match(INT);
  match(PLUS);
  right = input.LT(1);
  match(INT);
  match(SEMI);
  System.out.println("sum="+Integer.parseInt(left.getText())+
                     Integer.parseInt(right.getText()));
}

In this way, if an error occurs in any of the match() routines, an action may still refer to the token object, it will just get empty text and token type. Not optimal, but better than putting if(error) in front of every action.

Heh, what happens if somebody wants to use their own Token object? Then ERROR will require an invalid downcast. The programmer would have to define their own ERROR object. Hmm...ok, worry about this later.

Trees. So what happens to tree construction? I haven't thought about tree construction yet, but I need to make sure the error handling will work with tree construction. Normally, syntax errors prevent subsequent processing by a translator. Just like a compiler won't generate code if you have syntax errors.

Erroneous input will lead to invalid tree structures leading to tree parsing errors. That said, sometimes you want the bad tree in order to display in a GUI what went wrong with some input, for example. Should an ERROR node be inserted for missing tokens? Probably. What about deleted tokens? Probably should leave in the tree, but marked in some way to indicate a syntax error. Come to think of it, we can actually create a valid tree for most if not all mismatched token situations. It's not possible to create a useful tree, however, when there is no viable alternative (e.g., you enter statement but the first token is a right curly).

For input "3+;" I would like to see this tree:

  +
 / \
3   ERROR

assuming the semicolon is not to be included in the tree. For input "3+)4", I'd like to see

  +
 / \
3   4

because we can give the correct tree, but how to encode the error? What about "()"? I guess you'd want a single ERROR node as parens are not normally included in the tree. What about "[]"? Perhaps you want:

 INDEX
   |
 ERROR

A simple mechanism might be: "upon error, added ERROR token to the tree and start the error recovery mechanism avoiding tree additions during error recovery." In that way, you'd get the real tokens in the right spot I think. Imagine, we want brackets in the tree, then input "[3+]" would build

 INDEX
  /|\
 [ + ]
  / \
 3  ERROR

Well, I hope.

Whether you can parse the tree is another matter. Perhaps the tree parser would bail out of the current rule upon ERROR token or something.

December 28, 2004

For v3.0, I have dramatically improved the error recovery to something similar to what Josef Grosch had in Cocktail. Was pretty easy. Anyway, it now does single token insertion / deletion if necessary to keep going. So

if ( i>0 then i=1;

says "mismatched token 'then'; expecting ')'; inserting ')'" and keeps going in the same production. :) If you have

if ( i>0 ) blort then i=1;

it says "mismatched token 'blort'; expecting 'then'; deleting blort". It notices that "then" is next; so it can trivially delete the blort. :) Ah... All done in one simple routine. Best part is that it does not bail out of the ifStatement rule like it used to. PCCTS had something like this, but you had to manually tell it to do it.

Further, better than PCCTS and ANTLR 2, I am using actual local following-token-set information rather than global FOLLOW information to resync. It seems WAY better so far in my testing of the java grammar--it knows precisely what can follow a rule reference rather than what could follow any reference to that rule. :) This is per Wirth circa 1970.

Class Parser has a huge number of comments.

June 28, 2004

Had to drive 250 miles yesterday...lots of time to think about error reporting/recovery improvements.

Enhancing Error Reporting

One of the annoying things about simple error messages such as

line 20:12: expecting ID found FLOAT 

is that there are probably 200 references to token ID in your grammar. Where in the grammar was the parser when it found the mismatched token? Further, you have little contextual information about where the parser was in the input stream. Line 20, column 12 is useful, but you have to go into the file to see where that was. I would love the following (particularly for debugging grammars):

line 20:12: expecting ID found LPAREN
        at: methodDef : (access)* returnType @ ID (<args>)? "{" ...
     input: ... public static void @ ( int x ) { float ...

where the @ indicates the position. Now you know precisely where in the grammar and where in the input stream the error occurred. Knowing the location in the grammar is particularly important for tree grammars as the relationship between input and text input and tree node can often be muddy especially for artificial AST nodes.

For debugging, given valid input, the question you have is where in the grammar did it fail to match. It is at that point that you go fix your grammar usually.

So, how do we implement this. Well, first we need to pass the input stream to the error reporting routine. If the stream is capable of looking backwards and forwards then context of the input can be printed out as the user wants.

How do we get the grammar fragment information? One simple way is to generate a string representation of the entire grammar and then have a mapping from grammar position to char position within the string. The error reporting routine could walk backwards and forwards from the char position to compute the string representing just the surrounding alternative. We could use an instance variable called gposition and constantly keep it updated:

gposition=92;
match(ID);
gposition=93;
expr();
gposition=94;
{ // subrule
    switch ( input.LA(1) ) {
        ...
        default : // error
           reportNoViableException(input);
           throw new NoViableAltException(input, gposition);
    }
}
...

or update match() to include the position because that is the only place where you can get an error:

match(ID, 92);
expr();
{ // subrule
    switch ( input.LA(1) ) {
        ...
        default : // error
           reportNoViableException(input, 94);
           throw new NoViableAltException(input, 94);
    }
}

I don't like burdening all calls to match() with an extra parameter, but users of ANTLR can optimize by simply changing the code generator to not include this position parameter and then provide a different match() method. I just looked at the C output (non exception handling version) to see how many parameters there were to the match routine...lots! Check this out:

_zzmatch(3, &zzBadText, &zzMissText, &zzMissTok, &zzBadTok, &zzMissSet)

Rather than print the grammar out as a string for use by the error reporting, perhaps the entire tree structure (i.e., the processed grammar) should be made available to the running parser. The grammar position would map to an actual AST node. The cost would be a slightly larger string (flattened AST as string) compared to the original grammar, but would potentially provide more grammar information to a reporting facility. Hmm...actually, since the trees have down/right pointers (not up/left) it would be hard to walk backwards to print out a larger portion of the grammar. Ok, forget that.

Arguably this grammar string error message would be more useful to the programmer than an end user of the tool surrounding the parser. Then again, at a high level showing the possible surrounding structure could be useful to the user if the grammar was clean enough. When confronted with a syntax error, users will often consult the manual to see the valid syntax. The realities of grammar development though means that many of the rules, introduced for coding convenience, just won't make any sense to the end-user. I think that the existing paraphrase option is probably good enough for end-users, though I still want this grammar print out option for development.

The grammar string would be small, like 10k, as would the table, but they would really take up space at the bottom of the parser. The position to grammar char position mapping could be thousands of entries long. Hm...seems like this should be a debugging-only feature. How would this be turned on/off? I don't want option bloat.

Paraphrase

ANTLR has a paraphase option so that you can indicate (only at the rule level I think at the moment) what kind of thing you are trying to match with a rule in English rather than using the rule name. ANTLR can then generate something like:

line 20:12: expecting ID found LPAREN in boolean expression

where you have:

ifExpr
options {paraphrase="boolean expression";}
    : expr 
    ;

if an error occurs anywhere in rule ifExpr.

I would add paraphrase for subrules so that any no viable alt exception could print something reasonable. From the user point of view, this is better than printing the name of the rule or the actual grammar position per the above. For debugging a grammar, however, I think I'd like to have the grammar print out.

Paraphase has to be dynamic, with a stack of paraphrases. So that putting paraphase="expression" in rule expr will result in that paraphrase being used for any error within a rule invoked from expr.

What about paraphrase at the alt level? Most of the time I don't make separate rules for the various statements. Paraphrase at the rule or subrule level is blunt: all errors would say "error in statement" not "error in IF statement" or whatever. The syntax for options on an alt could get really hairy. Oh I know, use a subrule around it:

stat
paraphase="statement"
    : ( paraphrase="IF statement" : "if" expr "then" stat )
    | ( paraphrase="assignment" : ID '=' expr ';' )
    | "break" // uses "statement" as paraphrase
    ;

Parse trees, derivations

A standard feature of ANTLR will be to create parse trees; they are marginally useful for translation work, but are of a great help when figuring out why your "x=4;" input assignment statement gets interpreted as a weird kind of declaration. The parse tree or (easier to print linearly) the derivation would reveal everything without having to trace the grammar via print statements like option traceParser or via a debugger.

Upon recognition error, the parse tree or the final step in the derivation sequence, derived from the tree, could be printed. Actually, the current derivation shows sort of where you are in the grammar and the unprocessed input at the same time. For example,

line 20:12: expecting ID found LPAREN
derivation: <access> <access> <type> @ ( int x ) { float ...

Hmm...perhaps a derivation is more useful when your input was succcesfully parsed, but improperly interpreted. Yes, that's it. I would rather see the exact syntax (i.e., the grammar) where an error occurred.

Auto missing and extra token error correction

Good automatic error recovery is hard, but there are a few simple strategies that work very naturally. The first, described in my blog entry the other day, is to consume tokens until a token in the FOLLOW(that-rule) is found.

This is pretty catastrophic failure, however, for a rule. Any mistmatched token or no viable alt exception exits that rule. It is often the case that the input is simply missing a token such as:

i = 4*(x+3;

or have typed an extra token:

i = 4*(x+3));

Sometimes you want the whole rule to fail (you want to simply say "bad expression" regardless of where in the deeply-nested expr-call-chain the parser found the error). Many times though you want the rule to fix the single token deletion/insertion and keep going in the same rule. PCCTS's "@" operator told PCCTS to handle the error locally like that. I am now thinking that there could an option indicating whether errors should be handled locally within the alternative (is it the same option as defaultErrorHandling or a new option that could be turned on/off independently?). By default, NoViableAlt errors would, as usual, throw exceptions and be caught within the rule. Mismatched token exceptions, however, would be handled at the "match site", perhaps within match() itself.

For example, let's start with the default handling of errors in the following rule:

stat : IF expr @THEN stat
     | "break" ID SEMI
     | ...
     ;

If the lookahead was inconsistent with any of the alternatives for rule stat, then NoViableAltException would be caught within stat. An error would be reported and then the parser would consume tokens until a token in FOLLOW(stat) was found. If a match() invocation found a mismatch between the desired token and the current input symbol, it would look at what followed the token reference. If the current symbol is in the set of tokens that can follow the token reference, report a missing token and continue without consuming a token.

What about extra tokens? Hmm...i wonder if it's really useful to delete a single extra token? Is that very common? I think that missing comma or semicolon is very common, but an extra ID or INT or whatever is not that common. Anyway, if the current symbol is not something that could appear after the token reference, then a single token is consumed. If this following symbol, now the current symbol, is not consistent with what follows, an error is reported and an exception is thrown. Hmm...i may not implement this part. I think I will start by throwing an exception unless I think it's a single symbol omission. Then the normal exception catching mechanism will deal with it.

Another reason to have the grammar position: it can let us know what the FOLLOW(some-token) is for error recovery.

If default error handling is turned off, then any error, mismatched token or no viable alt, is handled by throwing the exception and not catching it automatically in the rule. You have to manually put in exception handlers.

Allow resync sets for rules? In other words, instead of relying on the FOLLOW, perhaps you could specify the "stop set" as I think they are called? (You could specify how the parser should resync even on the alternative level maybe)

expr
    resync = {SEMICOLON, RPAREN, RBRACK, RCURLY};
    :   multExpr (PLUS multExpr)*
    ;

multExpr
    :   atom (MULT atom)*
    ;

The stop set would be inherited by any rule in the rule call chain below it. Means a stack of these has to be maintained? Sure...no problem:

expr
   resync = {SEMICOLON, RPAREN, RBRACK, RCURLY};
   :  multExpr
   ;

multExpr : ... ;

would generate:

expr() {
    pushResyncSet(...);
    ...
}

multExpr() {
    ...
    // error here looks at resynch set pushed in expr
}

But wait, errors should be trapped where the resync is defined. The set should indicate what tokens to look for before returning from that method, which implies default error catching would be turned off dynamically for any rule invoked from that rule. In the above, a problem in multExpr() should blast out of it and return to expr() where it would consume until a token was found in the resync set (or EOF) and then expr() would return. So the exception is generated in the nested rule, but trapped in expr.

Exceptions in targets w/o native exceptions

How to implement these exceptions in C or other languages without native exceptions? C has longjmp() but that is not a great solution. I wonder if error return values could handle this? I suppose with a goto after each rule reference checking it's error.

Perhaps I'll define the default error handling as the required functionality for a target language. Being able to specify exception handlers would be target-dependent. I fired off a note to Tom Moog, maintainer of PCCTS, to see what his thoughts are for exception handling for C target.

June 24, 2004

ANTLR 2.x Error Handling

Current ANTLR 2.x has a very simple and straightforward error handling mechanism replicating the exception-based strategy pioneered by PCCTS. Upon MismatchedToken (or Char) or NoViableAlt, an exception is thrown by match() or the lookahead prediction, respectively. Catch{} blocks are added to each rule by default, whose action is to call reportError(exception) and then consume tokens until a token in the FOLLOW(that-rule) is found. After consuming, the rule with the error returns normally to the invoker, hoping that it has resynchronized properly. This simple mechanism works suprisingly well. For example, when an error occurs in a statement, it will consume tokens until it sees a semicolon. It consumes the semicolon and returns to the statementList rule, which looks for another statement. To have a single error terminate the entire parse, just turn off default error handling with an option. Exceptions must then be caught by the code that invokes the start rule of the recognizer.

ANTLR 2.x can also attach exceptions to rule references to allow errors occurring in some deeply nested rule invocation chain (such as expr) to be caught where you have more contextual information such as "bad IF expr" vs "bad WHILE expr". For example, given when an error occurs during the recognition of the conditional of an if-statement, a good way to recover would be to consume tokens until the then is found on the input stream:

stat : IF e:expr THEN stat
     ;
     exception[e]
       default : <<print error; consumeUntilToken(THEN);>>

The parser would continue with the parse after the expr reference (because we attached the exception handler to the rule reference) and look for the then right away.

PCCTS 1.x Error Handling

PCCTS by default did not do the automatic try/catch insertion. Instead, you had to either manually add exception catches or you could use a so-called global handler that was essentially a macro that got inserted after each rule. For example:

exception catch NoViableAlt : <<blah blah>>

a : A ;
    exception
      catch MismatchedToken : <<ack;>>
b : B ;

This grammar fragment is functionally equivalent to:

a : A ;
    exception
      catch MismatchedToken : <<ack;>>
      catch NoViableAlt : <<blah blah>>
b : B ;
    exception
      catch NoViableAlt : <<blah blah>>

PCCTS also had a NoSemViableAlt exception that was thrown when you had predicates in the decision that failed. It is an indication that, while one or more of the alternatives were syntactically ok, none was both semantically and syntactically valid.

PCCTS also had an interesting mechanism for handling single symbol insertion or multiple symbol deletion, which also worked remarkably well: the @ operator. You could suffix any token reference with the @ operator, which indicates that if that token is not seen on the input stream, errors are to be handled immediately rather than throwing a MismatchedToken exception. This was great for handling missing then tokens:

stat : IF expr @THEN stat
     | ...
     ;

Case 1: missing THEN. If THEN is not found, the parser reports "missing THEN", but keeps going if the next symbol is FIRST(stat).

Case 2: delete random token(s). If THEN is not found, the parser tries to consume tokens until it sees FIRST(stat) and then continues by invoking rule stat.

This all easily implemented by reporting the appropriate error and consuming until FIRST(stat) then continuing.

When working at NeXT on the C/Obj-C/C++ grammar, we found that this automatic mechanism was very very close to a handbuilt parser in terms of intelligent error recovery.

Proposed ANTLR 3.0 Error Handling

For 3.0, my general summary would be: "I want it all (and more...)!". Clearly, I will keep the exception-based strategy, but I would like to provide much more information / capabilities for reporting and recovery. When generating code for non-OO languages, however, I'm not sure how this will all translate, but I remember PCCTS doing parsing exceptions for C as well.

I would like to reduce the number of options for both simplicity and for code generation's sake, but I can't think of a good way to specify the default catch action. So, the code generator will automatically insert the try/catch structure around each rule with a default error action of "consume until FOLLOW(that rule) and return". The programmer can alter the code gen template to alter this default behavior if they want or simply put a manual catch after each rule.

You will be able to specify exception handlers for any rule. I also like being able to catch errors generated from a rule reference, but I don't like having labels be unique to the rule so I can attach an exception. Unique labels make actions that reference rule labels harder to write. I also don't like the wacky random @ symbol PCCTS used (I was out of characters and in a hurry). ;) So, labels will not be unique and I must find a way to attach an exception to a rule reference. Because this will not be that common, perhaps one puts the exception inline:

stat : IF (expr) catch[exception-type e] {action}
       THEN stat
     | ...
     ;

where an exception could be attached to a block (...) to make it more clear. Or perhaps we make labels on blocks unique and then can move the exception to bottom of rule:

stat : IF eblk:(expr)
       THEN stat
     | ...
     ;
     exception[eblk]: // catch only exceptions generated in expr
       catch[exception-type e] {action}
     exception :      // rule level exception
       catch[exception-type e] {action}

Nah...i don't like this unique label stuff, but then how do I trap those inline exceptions?

We will need FIRST, FOLLOW sets available for reporting and recovery.

The lexer will not do this recovery strategy. Upon error it will probably just bail out and attempt a new token.

Information about the grammar not just the token names/types should be included in the output recognizer. For example, I would like to have set of what could have been matched next at any point for reporting purposes. For debugging or even for better error messages in general, I would like to spit out what position in the grammar corresponds to the current parser state when an error is detected.

Currently error messages for tree grammars are horrible. We must be able to look back into source (line,col,text) to give a good error.

I will be building derivations / parse trees for debugging and testing. These must be consistent up until error occurrence and we must be able to access the partial derivations / trees.