Selectively Matching XML Tags

Terence Parr

Everything is XML-based these days. If you have a file full of data, chances are you are using markup tags to describe the contents. Check out the following XML:

<alice>
<category>
<pattern>He says</pattern>
<template>She says</template>
</category>
...
</alice>

That's easy enough to recognize with ANTLR. Your lexer would match each of those specific tags with an explicit rule like:

O_CATEGORY : "<category>" ;

You could have a parser match up the <category> and </category> tags and so on, or have the lexer do it, since ANTLR's lexical grammars are stronger than regular expressions. The basic rule would look something like this (ignoring the issue of errant text between tags):

protected
CATEGORY
    :   O_CATEGORY
        PATTERN (THAT)? TEMPLATE
        C_CATEGORY
    ;

Easy enough.

The Problem

But, what if you want to allow arbitrary HTML tags within the template so that you can format the response provided to the user? In other words, you want to specify an AIML category like this:

<category>
<pattern>
What's a grammar?
</pattern>
<template>
A <b>grammar</b> looks like this:

<pre><tt>
foo : bar | blort | ... ;
</tt></pre>
</template>
</category>

It's pretty clear you do not want to type in a complete HTML specification. So, the question posed for this Field Guide entry is: "How do I recognize some tags and ignore others as plain text?"

Before allowing HTML tags, the rule for a template might look like:

protected
TEMPLATE
    :   O_TEMPLATE
        TEXT
        C_TEMPLATE
    ;

// Grab text until next tag; presume it's "</template>"
protected
TEXT:   (~'<')+ // chew until you hit a tag start char
    ;

Allowing HTML tags such as <b> renders the TEXT rule useless because it will match text until it finds any tag, not just </template>. Without seeing the entire <...> tag, you do not know whether to stop; that is, whether the tag is </template> or something else. ANTLR has no lookahead operator that says "break out of the (...)+ loop upon a particular pattern" (you can only say "match x upon seeing pattern y" via the (y)=>x syntactic predicate). [What ANTLR needs is the new Perl nongreedy (...)*? expression, if I read the documentation correctly. The same problem shows up when trying to stop consuming C comment text at the string "*/".]
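To make the contrast concrete, here is a small sketch in Python rather than ANTLR, chosen only because Python's re module supports the nongreedy operator mentioned above. The sample string and the variable names are made up for illustration; this is not something ANTLR itself can express.

```python
import re

# Hypothetical template body followed by its closing tag.
template = "A <b>grammar</b> looks like this:</template>"

# Analogue of (~'<')+ : consume anything but '<'.
# It stops at the first tag of any kind, here <b>.
greedy_text = re.match(r"[^<]+", template).group(0)

# Nongreedy match: consume as little as possible until the
# specific closing tag </template> appears.
nongreedy_text = re.match(r"(.*?)</template>", template, re.DOTALL).group(1)

print(repr(greedy_text))
print(repr(nongreedy_text))
```

The first match gives up as soon as it sees the '<' of <b>, while the second carries on through the embedded tags and stops only at </template>.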

The Solution

Clearly you cannot have rule TEXT match tags as well as plain text. You will have to match all tags as tags even if you do not need them for document structure, leaving TEXT as a simple "anything but a tag" rule.

Do you enumerate all of the HTML tags in the TAG rule? No. Have TAG match the lexical form of any tag and then enter the tags you care about as literals. The lexer's literals-testing mechanism compares every matched tag against the set of literals and sets the token type if it finds a match. So, <b> will have token type TAG as far as the parser is concerned, but <template> (in the literals table) will get its own unique token type.

In the parser or the lexer, define the token label / literal pairs via the ANTLR 2.6.0 tokens section:

tokens {
    O_ALICE   ="<alice>",
    O_CATEGORY="<category>",
    O_PATTERN ="<pattern>",
    O_THAT    ="<that>",
    O_TEMPLATE="<template>",
    C_ALICE   ="</alice>",
    C_CATEGORY="</category>",
    C_PATTERN ="</pattern>",
    C_THAT    ="</that>",
    C_TEMPLATE="</template>"
}
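If the literals test is unfamiliar, the following Python sketch imitates the idea outside of ANTLR. It is only an approximation under stated assumptions: the LITERALS dictionary stands in for the literals table, the tokenize function and the regular expression are my own inventions, and malformed input (such as a '<' with no closing '>') is not handled.

```python
import re

# Hypothetical stand-in for the literals table built from the tokens section.
LITERALS = {
    "<alice>": "O_ALICE",
    "<category>": "O_CATEGORY",
    "<pattern>": "O_PATTERN",
    "<that>": "O_THAT",
    "<template>": "O_TEMPLATE",
    "</alice>": "C_ALICE",
    "</category>": "C_CATEGORY",
    "</pattern>": "C_PATTERN",
    "</that>": "C_THAT",
    "</template>": "C_TEMPLATE",
}

# A tag is '<' up to the next '>'; text is a run of anything but '<'.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def tokenize(source):
    """Return a list of (token_type, lexeme) pairs."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        lexeme = match.group(0)
        if lexeme.startswith("<"):
            # Every tag is matched the same way; the table lookup
            # decides whether it keeps the generic type TAG.
            token_type = LITERALS.get(lexeme, "TAG")
        else:
            token_type = "TEXT"
        tokens.append((token_type, lexeme))
    return tokens


for token in tokenize("<template>A <b>grammar</b></template>"):
    print(token)
```

Here <b> and </b> come out as plain TAG tokens, while <template> and </template> get their own types because they appear in the table.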

The complete lexer is trivial:

class AliceLexer extends Lexer;
options {
    charVocabulary = '\3'..'\377';
}

TAG
    :   '<' (~'>')* '>'
    ;

TEXT
    :   (
            /* The language for matching any flavor of
             * newline is ambiguous, so shut off the warning:
             * '\r' '\n' can be matched in one alternative or
             * by matching '\r' in one iteration and '\n' in
             * another.
             */
            options {
                generateAmbigWarnings=false;
            }
        :   '\r' '\n'       {newline();}
        |   '\r'            {newline();}
        |   '\n'            {newline();}
        |   ~('<'|'\n'|'\r')
        )+
    ;

In the parser, you can define template as:

template
    :    O_TEMPLATE
           stuff
         C_TEMPLATE
    ;

stuff
    :    ( TAG | TEXT )*
    ;

Herein lies the trick. You cannot easily make uninteresting tags come to the parser as TEXT; however, you can lump all uninteresting tags together as a single token type: TAG. C_TEMPLATE is a special case of a tag, just as a keyword is a special case of an identifier, and hence comes to the parser with a unique token type rather than TAG. Also notice that the parser sees complete tags of arbitrary length as simple integer token types, thus overcoming the arbitrary lookahead requirement that a lexer-only solution has.
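To see why one token of lookahead is enough once tags arrive as whole tokens, here is a rough Python sketch of what the template and stuff rules do. It is not the code ANTLR generates; the parse_template function and the (token_type, lexeme) pair representation are assumptions made for this illustration.

```python
def parse_template(tokens, pos=0):
    """Match O_TEMPLATE ( TAG | TEXT )* C_TEMPLATE starting at tokens[pos].

    tokens is a list of (token_type, lexeme) pairs.
    Returns (body_text, next_pos).
    """

    def expect(token_type):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != token_type:
            found = tokens[pos][0] if pos < len(tokens) else "EOF"
            raise SyntaxError(f"expected {token_type}, found {found}")
        pos += 1

    expect("O_TEMPLATE")
    body = []
    # stuff : ( TAG | TEXT )* ;
    # Looking at the type of the next token is enough to decide
    # whether to keep looping, no matter how long the tag is.
    while pos < len(tokens) and tokens[pos][0] in ("TAG", "TEXT"):
        body.append(tokens[pos][1])
        pos += 1
    expect("C_TEMPLATE")
    return "".join(body), pos


sample = [
    ("O_TEMPLATE", "<template>"),
    ("TEXT", "A "),
    ("TAG", "<b>"),
    ("TEXT", "grammar"),
    ("TAG", "</b>"),
    ("C_TEMPLATE", "</template>"),
]
print(parse_template(sample))
```

The loop never inspects the characters of a tag; it only asks whether the next token is TAG or TEXT, which is the whole point of pushing the tag recognition into the lexer.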

The Complete System

The parser I built for Alice is set up to send an event whenever a category (pattern/template) is found. I created an AliceReader that hides all of the lexer/parser creation code as well.