ANTLR Meets Sed

Terence Parr

SED and AWK are great tools bestowed upon us by the great Uncle UNIX. They have one serious limitation, however: both are line-oriented, so they cannot handle even simple translation problems for structured files like HTML. Consider performing an operation on the file names in <IMG> tags. The minute a tag spans more than one line, AWK and SED break down.

ANTLR 2.5.0 introduced an AWK-like lexical filtering mode that forces generated lexers to ignore any characters that do not match a lexical rule exactly. To turn ANTLR into SED, all you have to do is supply a lexical filter rule that prints out the characters that don't match anything, rather than discarding them. Then it's up to the other lexical rules to generate whatever output they want.

Consider the following contrived example that turns <br> and <p> tags into their uppercase equivalents and dumps anything other than those tags to standard output:

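class T extends Lexer;
options {
  k=2;                              // two characters of lookahead
  filter=IGNORE;                    // unmatched input goes to rule IGNORE
  charVocabulary = '\3'..'\177';    // range of characters the lexer accepts
}

// Tag rules: print the uppercase form of each recognized tag.
P : "<p>"  {System.out.print("<P>");};
BR: "<br>" {System.out.print("<BR>");};

// Filter rule: echo everything the tag rules did not match.
protected IGNORE
  : ( "\r\n" | '\r' | '\n' )
    {newline(); System.out.println("");}   // bump line count, echo a newline
  | c:. {System.out.print(c);}             // echo any other character as is
  ;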

Rather than have a "filter=sed" option, it is simple enough to use this idiom: put a print statement in a filter rule.
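
To make the behavior of the grammar above concrete, here is a hand-written Java sketch of the same idea. It is not the code ANTLR generates and it does not use the ANTLR runtime; the class name TagFilterSketch and its filter method are hypothetical names chosen for illustration. It also simplifies two things: line endings are copied through unchanged instead of being rewritten with println, and no character vocabulary is enforced.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Locale;

public class TagFilterSketch {
    // Tags to rewrite; these mirror rules P and BR in the grammar.
    private static final String[] TAGS = { "<p>", "<br>" };

    // Returns the input with each known tag uppercased and every other
    // character copied through unchanged (the job of the IGNORE rule).
    public static String filter(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            String matched = null;
            for (String tag : TAGS) {
                if (text.startsWith(tag, i)) {
                    matched = tag;
                    break;
                }
            }
            if (matched != null) {
                // A "lexical rule" matched: emit its translation.
                out.append(matched.toUpperCase(Locale.ROOT));
                i += matched.length();
            } else {
                // Nothing matched: echo one character and move on.
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Reads all of standard input, filters it, and writes the result.
    public static void main(String[] args) throws IOException {
        Reader in = new InputStreamReader(System.in);
        StringBuilder buf = new StringBuilder();
        int ch;
        while ((ch = in.read()) != -1) {
            buf.append((char) ch);
        }
        System.out.print(filter(buf.toString()));
    }
}
```

Because the sketch works on a stream of characters rather than on lines, a newline is just another character to echo. The ANTLR version gets this structure for free from the filter rule, so adding a new translation means adding one lexical rule rather than editing a loop.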