Skip to main content
University of Glasgow
A-Z : ACADEMIC DEPARTMENTS | SERVICES | STAFF |
Faculty of Information and Mathematical Systems > Department of Computing Science

Languages and Machines

Page 1/1: 1

2010-02-19

Generating an expression grammar for SableCC

Yes, you've read it correctly: this post is about generating a grammar for SableCC, the Java-based parser generator.

Expression grammars in SableCC

An expression grammar is that part of a grammar that describes the syntax for operator-based expressions, e.g.

2*3-4>>5+6*7

The grammar for this expression must take into account the precedence, associativity and fixity of the operators. For example, the SableCC productions for a grammar for arithmetic expressions consisting of factors and terms (i.e. sums of products) look like this (omitting the transformation annotations):

Productions

grammar         = exp_list;

exp_list        =
                    {list}    exp_list separator exp 
                  | {single}  exp           
                  ;

exp             =  
                    {plus}    exp plus factor 
                  | {factor}  factor          
                  ;

factor          =
                    {mult}    factor mult term 
                  | {term}    term             
                  ;

term              {-> exp} =
                    {number}  number            
                  | {exp}     l_par exp r_par   
                  ;

separator       = {semicolon} semi;

The AST is quite simple:

Abstract Syntax Tree

grammar           = exp+ ;

exp               =
                    {plus}    [l]:exp  [r]:exp |
                    {mult}    [l]:exp  [r]:exp |
                    {number}  number
                  ;

This example illustrates an unfortunate aspect of SableCC: the complete expression grammar has to be written out as production rules for every level of precedence, with the correct associativity and fixity. Even the example for a simple two-operator grammar is already quite complicated. Also note that every operator (or to be precise every precedence level) requires a separate AST node. A real-life programming language such as C++ or Perl has typically about 30 different operators with more than 10 levels of precedence. Writing an expression grammar for such a set of operators is daunting to say the least.

An expression grammar generator

However, the production rules for expression grammars are quite regular, so it is possible to generate the expression grammar using a script. That means we will be creating a parser generator grammar generator.

Every level of precedence requires a separate production rule and as observed above, because of the limited way in which in SableCC the concrete syntax can be transformed into the abstract syntax, every one of these rules results in a different node in the AST. It is therefore essential that we can not only generate the grammar but also at least a skeleton for the visitor methods for every operator node in the AST. Each of the visitor methods should contain a switch/case statement to perform the correct action for each operator. Unfortunately, Java's switch/case does not allow matching of strings. Therefore we need to map the operators to a unique number.

Perl to the rescue

For jobs like this, I very much prefer Perl over any other language; however, the design of the expression grammar generator is not Perl-specific (nor indeed SableCC-specific), so you can easily port it to your language of choice.

To generate the production rules (and the corresponding tokens, AST and visitor methods), we start from a table that succinctly describes our expression grammar:

my %optable=(
    1 => ['Prefix',['++', '--'],'AssocNone'],
    2 => [' ',['**'],'AssocRight'],
    3 => ['Prefix',['!','~'],'AssocRight'], 
    4 => ['Infix',['*','/','%'],'AssocLeft'], 
    5 => ['Infix',['+','-'],'AssocLeft'], 
    6 => ['Infix',['<<','>>'],'AssocLeft'],
    7 => ['Infix',['<=','>=','>','<'],'AssocNone'],
    8 => ['Infix',['==','!='],'AssocNone'], 
    9 => ['Infix',['&','bitand'],'AssocLeft'],
    10 => ['Infix',['|','^','bitor','xor'],'AssocLeft'],
    11 => ['Infix',['&&','and'],'AssocLeft'],
    12 => ['Infix',['||','or'],'AssocLeft'],
    13 => ['Infix',['='],'AssocRight'],
);

Based on this table, the script will generate the grammar and the corresponding Java code for the nodes from templates (expressions_templ.sablecc3, ExpressionsVisitor_templ.java). The templates contain commented placeholders, e.g. //OP_EXPR_PRODUCTIONS, which are replaced by the code generated by the script.

We create the operator tokens from corresponding helpers as follows:

# Helpers
$line=~/OP_EXPR_HELPERS/ && do { 
    for my $lev (1..$nlevs) {
        my @ops = @{$optable{$lev}->[1]};    
        print $GRAM "ops_l$lev = '".join ("' | '",@ops)."';\n";
    }
};

# Tokens
$line=~/OP_EXPR_TOKENS/ && do { 
    for my $lev (1..$nlevs) {
        my $opname=($lev==$signop)?'sign':"op_l$lev"; # special case: sign
        print $GRAM  "$opname = ops_l$lev;\n";
    }
};

The actual production rules depend on the fixity and associativity. For example, the Infix AssocRight rule template:

                print $GRAM  "{op_l$lev} terml$lm1 op_l$lev term_l$lev  
                {-> New expr.l$lev (op_l$lev,term_l$lm1.expr,term_l$lev.expr) }\n";

where $lm1=$lev-1.

AST generation is equally straightforward:

# AST    
$line=~/OP_EXPR_AST/ && do {
    for my $lev (1..$nlevs) {
        my $opname=($lev==$signop)?'sign':"op_l$lev"; # special case: sign
        my $fixity=$optable{$lev}->[0];
        if ($fixity eq 'Prefix') {
            print $GRAM "                 {l$lev}     [op]:op_l$lev [l]:expr  |\n";
        } elsif ($fixity eq 'Postfix') {
            print $GRAM "                 {l$lev}     [op]:op_l$lev [r]:expr  |\n";
        } else  {
            print $GRAM "                 {l$lev}    [op]:$opname [l]:expr  [r]:expr |\n";
        }
    }
};

We then generate an operator table for the Java class for our AST walker:

# Java operator table

$line=~/OP_EXPR_OPTABLE_ENTRIES/ && do {
    for my $lev (1..$nlevs) {
        my @ops = @{$optable{$lev}->[1]}; 
        my $i=0;
        for my $op (@ops) {
            print $JAVA "\t\toptable.put(\"$op\",$lev$i);\n";
            $i++;
        }
    }
};

Each operator gets a unique number based on its level and its position in the list. The Java methods for each operator precedence level contain a switch/case based on this number.

        print $JAVA "
        public void outAL${lev}Expr(AL${lev}Expr node) {
            String opstr=node.getOp().toString().trim();
            switch (optable.get(opstr)) {
        ";        
        my @ops = @{$optable{$lev}->[1]}; 
        my $i=0;
        for my $op (@ops) {
                print $JAVA "
                case $lev$i: // '$op'
                setNodeInt (node, getNodeInt (node.getL()) $op getNodeInt (node.getR()));
                break;
                ";                
            $i++;                
        }                
        print $JAVA "\n}\n}\n";

For example, a tree walker which performs integer computation has following method for the level-4 node:

    public void outAL4Expr(AL4Expr node) {
        String opstr=node.getOp().toString().trim();
        switch (optable.get(opstr)) {

                case 40: // '*'
                setNodeInt (node, getNodeInt (node.getL()) * getNodeInt (node.getR()));
                break;

                case 41: // '/'
                setNodeInt (node, getNodeInt (node.getL()) / getNodeInt (node.getR()));
                break;

                case 42: // '%'
                setNodeInt (node, getNodeInt (node.getL()) % getNodeInt (node.getR()));
                break;

        }
    }

Trying it out

To try it out, you can run the example, all files are in the archive BlogExample-ExpressionGrammar.zip.

The script is not very generic, all the file names and tables are hardcoded. So you simply run

$ perl expression_parser_generator.pl

This will generate the grammar expressions_gen.sablecc3, so you run sablecc on it:

$ sablecc expression_parser_generator.pl

The example provides a Main.java and a Calculate.java, which is based on the generated ExpressionsVisitor_gen.java. Compile and run:

$ javac Main.java
$ java Main test_expressions.gc

Every expression in the test file should evaluate to 42.

Conclusion

One of the deficiencies of SableCC is the lack of an expression grammar generator. I turns out it is not hard to write one in Perl, you're welcome to it. For full details, see the source code.

Content Request - posted by gotomeeting - 2/10/2011 17:38:09
One of the deficiencies of SableCC is the lack of an expression grammar generator. I turns out it is not hard to write one in Perl, you're welcome to it. For full details, see the source code.
powered by Blosxom