Conversion of a C++ Parser from PCCTS to ANTLR

Jianguo Zuo, Jianghan University,Wuhan,China
David Wigg, (wiggjd@bcs.org.uk)
Dr.S.E.Black, (blackse@sbu.ac.uk)
Centre for Systems and Software Engineering
South Bank University
London, UK
http://www.sbu.ac.uk/csse

August 2002

We have recently completed the conversion of cplusplus.g to CPP_parser.g to run under ANTLR 2.7.1. (to produce C++ code). A relatively inexperienced programmer (in C++ and ANTLR) took about a year for the conversion and it was rather more difficult than we had hoped. Tom Nurkkala's notes were very useful and to further help others I attach some more detailed notes on our experiences to add to your library of Articles on your website at antlr.org. If you notice any errors or omissions please let us know. If you would like to add any comments to our text, please do. When I have tidied up the new C++ parser, taken out our own code and retested it I should be able to send you a copy for your library of grammar files.

David

The following notes were produced by Jianguo Zuo who converted an original version (1995 1.2) of a C++ parser, cplusplus.g, file which ran under PCCTS 1.33 to run under ANTLR 2.7.1. This conversion was done as part of a project to produce output for another project. As all our embedded code was written in C++ the output from ANTLR was required in C++ so that our embedded code could remain in C++.

These notes are supplied as supplementary to a set of notes supplied by Tom Nurkkala and now recorded at http://www.antlr.org/papers/133to220.html which should also be consulted. We would like to record our appreciation of Tom's original notes which were a great help for us. We hope that the following notes, which repeat most (but not necessarily all) of Tom's notes but add some original points of our own, will assist future conversions.

If you have any queries about these notes please address them to David Wigg at wiggjd@bcs.org.uk who supervised this work and who maintains the current versions of the software.

Index of chapters:

  1. Convert the lexer part of StdCparser.g (ANTLR version) to generate C++ code
  2. Modify the lexer part of StdCparser.g to adapt to the parser part of the old grammar file, cplusplus.g
  3. Convert the parser part of the old grammar file, cplusplus.g, from PCCTS-1.33 version to ANTLR-2.7.1 version, CPP_parser.g.
  4. Use the converted lexer part of StdCparser.g and the converted parser part of cplusplus.g to construct a new grammar file, CPP_parser.g.
  5. Make necessary changes to the main.cpp file and support.cpp file.
  6. Create project and compile new parser
  7. Final result
  1. Convert the lexer part of StdCparser.g (ANTLR version) to generate C++ code

    The input for either PCCTS or ANTLR is a grammar file (*.g file) which consists of two parts, Lexer part and Parser part, in order to generate source code files: Lexer.cpp, Lexer.hpp, Parser.cpp and Parser.hpp respectively. PCCTS and ANTLR use completely different approaches to generate Lexer source code, so the lexer part of the old grammar file, cplusplus.g, must be replaced with a new one. Fortunately we have a grammar file for Standard C (StdCparser.g) from the ANTLR community. We can build a new lexer part for our new grammar file based on the lexer part of the StdCparser.g file. This StdCparser.g file is a java version, so we have to convert it to produce C++ version first and then make a few modifications to cope with parsing C++.

    1. Re-write the source code of "Ctoken.java" file to "Ctoken.hpp" and "Ctoken.cpp". Re-write the source code of "LineObject.java" file to "LineObject.hpp" and LineObject.cpp". The corresponding .hpp files will be included in the header of the new CPP_parser.g file (see below).
    2. Change "class StdCLexer extends Lexer;" to "class CPPLexer extends Lexer;" The "options" para remains the same.
    3. Change from StdCParser,
      {
      LineObject lineObject = new LineObject();
      String originalSource = "";
      PreprocessorInfoChannel preprocessorInfoChannel 
      = new PreprocessorInfoChannel();
      int tokenNumber = 0;
      boolean countingTokens = true;
      int deferredLineCount = 0;
      }
      

      to CPP_parser.g.
      {
      ANTLR_USE_NAMESPACE(antlr)LineObject lineObject;
      ANTLR_USE_NAMESPACE(std)string originalSource;
      int tokenNumber;
      bool countingTokens;
      int deferredLineCount;
      }
      

      and change all other references to "String" to "ANTLR_USE_NAMESPACE(std)string" and change all other references to "boolean" to "bool".
    4. Change all actions
      { ttype = Token.SKIP; }
      

      to
      { ttype = ANTLR_USE_NAMESPACE(antlr)Token::SKIP; }
      
  2. Modifying the lexer part of StdCparser.g to adapt to the parser part of the old grammar file, cplusplus.g

    Lexer and Parser are two closely connected parts in a grammar file. The Lexer recognizes the input character stream and convert it to tokens, which are then used by Parser. So the names of tokens in both parts must be consistent and the Lexer must supply all the tokens needed by the Parser.

    1. Change the names of tokens in the StdCparser.g file to the names of the same tokens in cplusplus.g. For example, change the name for "=" from "ASSIGN" to "ASSIGNEQUAL", change the name for "..." from "VARARGS" to "ELLIPSIS", etc.
    2. Add the tokens lacked in StdCparser.g from cplusplus.g file. These tokens were:
      POINTERTOMBR: "->*";
      DOTMBR: ".*";
      SCOPE: "::";
      
    3. Change from StdCParser.g.
      Number
              :( ( Digit )+ ( '.' | 'e' | 'E' ) )=> ( Digit )+
                  ( '.' ( Digit )* ( Exponent )?
                  | Exponent
                  )                       { _ttype = DoubleDoubleConst;   }
                  ( FloatSuffix           { _ttype = FloatDoubleConst;    }
                  | LongSuffix            { _ttype = LongDoubleConst;     }
                  )?
              |   ( "..." )=> "..."       { _ttype = VARARGS;     }
              |   '.'                     { _ttype = DOT; }
                  ( ( Digit )+ ( Exponent )?
                                          { _ttype = DoubleDoubleConst;   }
                    ( FloatSuffix         { _ttype = FloatDoubleConst;    }
                    | LongSuffix          { _ttype = LongDoubleConst;     }
                    )?
                  )?
              |   '0' ( '0'..'7' )*       { _ttype = IntOctalConst;       }
                  ( LongSuffix            { _ttype = LongOctalConst;      }
                  | UnsignedSuffix        { _ttype = UnsignedOctalConst;  }
                  )?
              |   '1'..'9' ( Digit )*     { _ttype = IntIntConst;         }
                  ( LongSuffix            { _ttype = LongIntConst;        }
                  | UnsignedSuffix        { _ttype = UnsignedIntConst;    }
                  )?
              |   '0' ( 'x' | 'X' ) ( 'a'..'f' | 'A'..'F' | Digit )+
                                          { _ttype = IntHexConst;         }
                  ( LongSuffix            { _ttype = LongHexConst;        }
                  | UnsignedSuffix        { _ttype = UnsignedHexConst;    }
                  )?
              ;
      

      to CPP_parser.g,
      	Number
      	:( ( Digit )+ ( '.' | 'e' | 'E' ) )=> ( Digit )+
      	( '.' ( Digit )* ( Exponent )?{_ttype =FLOATONE;} 
      	| Exponent{_ttype =FLOATTWO;} 
      	)
      	( FloatSuffix     
      	| LongSuffix           
      	)?
      	|( "..." )=> "..."{ _ttype = ELLIPSIS;}
      	|'.'{ _ttype = DOT;}
      	( ( Digit )+ ( Exponent )?{_ttype =FLOATONE;} 
      	  ( FloatSuffix      
      	    | LongSuffix
      	      )?
      	      )?
      	      |'0' ( '0'..'7' )*       
      	      ( LongSuffix            
      	      | UnsignedSuffix
      	      )?{_ttype = OCTALINT;}
      	      |'1'..'9' ( Digit )*     
      	      ( LongSuffix            
      	      | UnsignedSuffix
      	      )?{_ttype = DECIMALINT;}  
      	      |'0' ( 'x' | 'X' ) ( 'a'..'f' | 'A'..'F' | Digit )+
      	      ( LongSuffix            
      	      | UnsignedSuffix
      	      )?{_ttype = HEXADECIMALINT;}   
      	      ;
      
    4. Change from StdCParser.g,
      PREPROC_DIRECTIVE
      options 
      { paraphrase = "a line directive"; }
              :
              '#'
      	( ( "line" || (( ' ' | '\t' | '\014')+ '0'..'9')) => LineDirective      
      	| (~'\n')*           { setPreprocessingDirective(getText()); }
      	)
      	{  
      	_ttype = Token.SKIP;
      	}
              ;
      
      	protected LineDirective
      	{
      	boolean oldCountingTokens = countingTokens;
      	countingTokens = false;
      	}
      	:
      	 {
      	 lineObject = new LineObject();
      	 deferredLineCount = 0;
      	 }
      	 ("line")?  //in case the directive started "#line", but not for GNU 
      	 (Space)+
      	 n:Number { lineObject.setLine(Integer.parseInt(n.getText())); } 
      	 (Space)+
      	 (fn:StringLiteral 
      	 {  
      	 try{ 
      	 lineObject.setSource(fn.getText().substring
      	 (1,fn.getText().length()-1)); 
      	 } 
      	 catch (StringIndexOutOfBoundsException e) 
      	 { //not possible
      	 } 
      	 }
      	 | fi:ID{ lineObject.setSource(fi.getText()); }
      	 )?
      	 (Space)*
      	 ("1"{ lineObject.setEnteringFile(true); } )?
      	 (Space)*
      	 ("2"{ lineObject.setReturningToFile(true); } )?
      	 (Space)*
      	 ("3"{ lineObject.setSystemHeader(true); } )?
      	 (Space)*
      	 ("4"{ lineObject.setTreatAsC(true); } )?
      	 (~('\r' | '\n'))*
      	 ("\r\n" | "\r" | "\n")
      	 {
      	 preprocessorInfoChannel.addLineForTokenNumber
      	 (new LineObject(lineObject),new Integer(tokenNumber));
      	 countingTokens = oldCountingTokens;
      	 }
      	 ;
      
      	 to CPP_parser.g, 
      
      	 PREPROC_DIRECTIVE
      	 options 
      	 { paraphrase = "a line directive"; }
              :
              '#'
      	( "line" || (( ' ' | '\t' | '\014')+ '0'..'9')) => LineDirective
      
      	  {$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); newline();} 
              ;
      
      	protected LineDirective
      	{
      	char mydirective[200]; // to hold text from line
      	int x;
      	}
      	:
      	{
      	for (x=0;LA(x)!=10;x+=1)  // to copy text from line
      	{
      	mydirective[x] = char(LA(x));
      	}
      	mydirective[x] = 0;
      	}
      	("line")?  //in case the directive started "#line", but not for GNU
      	(Space)+
      	n:Number //{ lineObject.setLine(Integer.parseInt(n.getText())); } 
      	(Space)+
      	        (~('\r' | '\n'))*
      		{
      		process_hash_line (mydirective, _line); // see main()
      		}
      		        ("\r\n" | "\r" | "\n")
      			;
      
      			to deal with line number and file stuff from preprocessor.
      
  3. Convert the parser part of the old grammar file from PCCTS version to ANTLR-2.7.1 version.
    1. A header section in grammar file for ANTLR is used to contain the source code which must be placed before any other generated code. For Java, it is used to specify the imported classes.

      And for C++, header section would be used to specify header files required in the generated output, So we change it to:

      header 
      {
      #include "CToken.hpp"
      #include "LineObject.hpp"
      #include "antlr/CharScanner.hpp"
      #include "CPPDictionary.h"
      #include "var_types.h"
      
      extern int this_line;
      extern int debug;  // Set to 1 in main for debugging
      extern void process_hash_line(char *, int);
      
      #define MAX_VAR_LENGTH 50
      
    2. ANTLR uses a language indication in its options section within the grammar to indicate what language code it should generate. The default language is "Java". To generate C++ code, we added a C++ indication as follows:
      Options { language="Cpp"; }
      
    3. Change"class CPPParser {...}" to "class CPPParser extends Parser;" and add
      options
              {
              k = 2;
              exportVocab = STDC;
              buildAST = false;
      	codeGenMakeSwitchThreshold = 2;
              codeGenBitsetTestThreshold = 3;
              }
      
    4. Change"#header" to "header".
    5. Convert syntactic predicates.

      Change"(A)? B" to "(A)=>B", "(A)?" not followed by anything else, to "(A)=>A".

    6. Convert the optional clauses.

      Change"{....}" to "(....)?".

    7. Convert semantic actions and semantic predicates, Change"<<....>>" to "{....}" , "<<....>>?" to "{....}?".
    8. Convert context-guarded predicate,

      Change"(context-guard)?=><>" to "(context-guard)=>{semantic-predicate}?"

    9. Convert argument and return values for rules,

      Change"rulename[arg1,....,argn] > [v]" to "rulename[arg1,....,argn] returns [v]" in rule name,

      Change"rulename[arg1,....,argn] > [$v]" to "v = rulename[arg1,....,argn]" in rule body.

      Change"{$v=vAUTO;}" to "{v=vAUTO;}" etc. in rule body.

    10. Change"#pragma approx" to "options {warnWhenFollowAmbig=false;}:"
    11. Change"LT(param)->getText()" to "(LT(param)->getText()).data()", "LT(1)->getText()" to "(LT(1)->getText()).data()" etc..
    12. Change"beginEnumDefinition(char *);" to "beginEnumDefinition(const char *);" and change
      {beginEnumDefinition($id->getText());} to
      {beginEnumDefinition((id->getText()).data());}.
      
  4. Use the converted lexer part of StdCparser.g and the converted parser part of cplusplus.g to construct a new grammar file, CPP_parser.g.

    Now we have got the Lexer part for C++ and the Parser part for C++, and both are in ANTLR-2.7.1 version. What we do now is just combining the two parts into a new grammar file, CPP_parser.g. This grammar file is to be input to ANTLR-2.7.1 to generate a parser for C++ in C++. According to the requirement of ANTLR-2.7.1, the Parser part is in front of the Lexer part.

    To make the lexer and parser compatible we used the literal form of keywords like "auto", "bool", "break" etc. in the parser part instead of using the tokens (AUTO, BOOL, BREAK etc.). This meant we did not need to define the keywords as tokens in the lexer part of the grammar file.

    This new file was processed through ANTLR in a dos window using,

    "java antlr.Tool CPP_parser.g"
    
  5. Make necessary changes to the main.cpp file and support.cpp file.
    1. The linkage of the generated files through ANTLR is also different from PCCTS. The old main.cpp file uses the template:
      "ParserBlackBox p(argv[1]);"
      

      to act as the interface. In the new main.cpp file, we replace it with a function,

      "static void parseFile(const string& f)" 
      

      and add,
      "parser.init();" 
      

      before the "parser.translation_unit();" in the definition of the function.

      The file support.cpp is closely associated with all the other files in parser. So some corresponding changes must be made to it. For example, we have changed,

    2. change"#include " to "#include " etc.
    3. changeall "LT(n)->getText()" to "(LT(n)->getText()).data()"
    4. change"Eof" to "EOF"
    5. changeall "ANTLRParser::" references to "LLkParser::"
    6. change"isTypeName(char *s)" to "isTypeName(const char *s)"
    7. change"declaratorID(char *id)" to "declaratorID(const char *id)"
    8. change
      "classForwardDeclaration(TypeSpecifier ts,DeclSpecifier ds,char *tag)" to
      "classForwardDeclaration(TypeSpecifier ts,DeclSpecifier ds,const char *tag)"
      
    9. change"beginClassDefinition(TypeSpecifier ts,char *tag)" to "beginClassDefinition(TypeSpecifier ts,const char *tag)"
    10. change"enumElement(char *e)" to "enumElement(const char *e)"
    11. change"beginEnumDefinition(char *e)" to "beginEnumDefinition(const char *e)"
    12. change"templateTypeParameter(char *t)" to "templateTypeParameter(const char *t)"
  6. Create project and compile new parser

    Using MSVC++6.00 we loaded CPPLexer.cpp, CPPParser.cpp, Dictionary.cpp, LineObject.cpp, main.cpp and support.cpp with required .h and .hpp files.

    However, the new parser failed to parse our data correctly. We had to modify the new grammar file, CPP_parser.g, manually to make it produce a parser with the same functionality as the old version as follows.

    1. The context-guarded predicate such as,
      "(SCOPE | ID)=>{qualifiedItemIsOneOf(qiType|qiCtor) && 
      !qualifiedItemIsOneOf(qiType|qiCtor,-1)}?"
      

      does not generate right code via ANTLR-2.7.1. So we have changed it to the semantic predicate,
      "{(!((LA(1)==SCOPE||(LA(1)==ID)))||(qualifiedItemIsOneOf(qiType|qiCtor) && 
      !qualifiedItemIsOneOf(qiType|qiCtor,-1)))}? "
      
    2. In PCCTS, a semantic predicate can be automatically be hoisted into the prediction decision of other rules but ANTLR-2.7.1 does not do this. What we have done to solve such problem is to copy the corresponding prediction decision in the parser.cpp of old version into the corresponding rules in the new grammar file as semantic predicates to force it to generate the same prediction decision in the same rules in the new parser.cpp.
    3. The optional statement,
      class_head
      :    
      (  "struct"  
      |  "union" 
      |  "class" 
      )    
      (ID ( base_clause)? )?   
      ;
      

      did not produce a properly optional construction in the generated code. What we have done to solve the problem is to change this to,
      class_head   
      :
      (  "struct"  
      |  "union" 
      |  "class" 
      )    
      (ID ( base_clause)? )? LCURLY 
      ; 
      

      and change "class_head LCURLY" in all the other rules to just "class_head".
  7. Final result

    The resulting parser has been used successfully on the pre-processed version (.i files) of a number of source files. However, we cannot guarantee that either all changes required to convert from PCCTS to ANTLR have been mentioned nor that the changes mentioned are all correct or the best solutions. Please treat these notes as an aid to conversion only.

    If you find any errors or omissions, please let us know.