A Simple Compiler - Part 4: Semantic Analysis - the Symbol Table

Introduction

A Review:

Before we get started this time, let us review what we have covered so far.

In compiler development, design is king. If you design your language before you start coding, you'll have a much better chance at success. Along these lines, developing a well-thought out and non-ambiguous grammar is a must. You simply cannot skip this step.

Grammar:

A grammar is composed of rules. Each rule is called a production.

Productions are composed of terminals and non-terminals. Terminals are language elements, such as keywords and operators.

The syntax for productions that we are using is: a non-terminal on the left, followed by an equal sign, and then the rule(s) for that production, terminated with a period.

For instance:

 Teeny = 'program' ident ';' CmpndStmt '.' .

'Teeny' is a non-terminal, 'program', 'ident', ';' and '.' are terminals. 'CmpndStmt' is another terminal, and must be defined for the grammar to be complete.

Special symbols used include '|' (or) for alternation, '(' and ')' for grouping, '[', ']' means optional, and '{', '}' means zero or more.

We also reviewed a simple formula for converting grammar rules to actual parsing code.

For example, here is the Teeny 03 grammar for expressions:

 Expr      = Term { ('+' | '-') Term } .
 Term      = Factor { ('*' | '/') Factor } .
 Factor    = number
           | ('+' | '-') Factor
           | '(' Expr ')'
           .

As written, this causes negation to have the highest precedence, followed by multiplication and division, and finally, addition and subtraction. Can you determine why?

For example, given the following expression: 2 * -4

We start with 'Expr'. The first symbol on the right hand side is 'Term'. Term is a non-terminal, so we find the Term production. The Term production leads us to Factor, another non-terminal. We find Factor. Factor gives us a list of choices. '2' is a number, so we have a match. The Factor production has been satisfied, so we return back to Term. The rest of our expression is now: '* - 4'. Next, Terms says to look for zero or more '*' or '/' followed by another Factor. The '*' in our test expression is matched, so now we go back to Factor again. The restof of our expression is now: '-4'. This time in Factor we don't have a number, we have '-'. And Factor has a rule for that. This rule takes the '-', and then invokes Factor again. Recursing into Factor, all that is left of our sample expression is: '4'. This matches number, and so return back to Term. Our sample expression is now: '' (the empty string). There is not another '*' or '/', so we return to Expr. There is not a '+' or '-', so we have finished parsing our sample expression.

Assuming that sym is set to the first symbol in the input stream, code that implements these rules follows:

 def expr
   term()
   while sym in ['+', '-']
     getsym()
     term()
   end
 end

 def term
   factor()
   while sym in ['*', '/']
     getsym()
     factor()
   end
 end

 def factor
   case sym
     number : getsym()
     '+' : getsym(); factor()
     '-' : getsym(); factor()
     '(' : getsym(); expr(); expect(')')
     otherwise error('expecting a factor')
   end
 end

Note how closely the implementation code follows the grammar.

The Symbol Table

Constants:

Our current grammar is:

 Teeny     = 'program' ident ';' CmpndStmt '.' .
 CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .
 Stmt      = Writeln .
 Writeln   = 'writeln' '(' [(string|Expr)
               {',' (string|Expr) }] ')' .
 Expr      = Term { ('+' | '-') Term } .
 Term      = Factor { ('*' | '/') Factor } .
 Factor    = number
           | ('+' | '-') Factor
           | '(' Expr ')'
           .

We would like to add variables to our language. But in order to keep things as simple as possible, first we'll add constants, which can be considered as a subset of variables. These are similar to variables, but they are always assigned an initial value, and that value can not be changed. Constants are more easily added than variables (since we do not have to worry about subsequent assignments), so that is what we'll add this time.

Grammar Changes:

A constant statement looks like:

 const a = 1, b = 7, x = 243;

One way of specifying this in EBNF notation:

 Decl      = 'const' Const {',' Const } ';' .
 Const     = ident '=' number .

This says that the symbol 'const', is followed by an identifier, followed by '=', followed by a number. This number can be followed by a ',', to denote more constants. Finally, the constant declaration is ended by a ';'.

Here is our Teeny .04 grammar, with Constant declarations added, and constants (in the form of an ident) added to Factor:

 Teeny     = 'program' ident ';' {Decl} CmpndStmt '.' .
 Decl      = 'const' Const {',' Const } ';' .
 Const     = ident '=' number .
 CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .
 Stmt      = Writeln .
 Writeln   = 'writeln' '(' [(string|Expr)
               {',' (string|Expr) }] ')' .
 Expr      = Term { ('+' | '-') Term } .
 Term      = Factor { ('*' | '/') Factor } .
 Factor    = number
           | ident
           | ('+' | '-') Factor
           | '(' Expr ')'
           .

Notice that we've added the non-terminal Decl to the Teeny production, and in such a way as to allow zero or more Constant declarations. We have also added a production for Decl, and a production for Const. Finally, we have added ident to the Factor production, to allow for the use of Constants. Additionally, we had to add the symbols '=' and 'const' to our Scanner.

Code to parse a Decl might look like:

 def decl
   while sym == 'const'
     getsym()
     const()
     while sym == ','
       getsym()
       const()
     end
     expect(';')
   end
 end

 def const
   expect(ident)
   expect('=')
   expect(number)
 end

A Teeny .04 sample program using Constants:

  program four;
  const x = 3, y = 2;
  begin
    writeln('hello, world!');
    writeln('2 + 3 * 4 = ', 2 + 3 * 4);
    writeln('x is ', x, ' and y is ', y);
    writeln('x * y is ', x * y);
  end.

Enter the Symbol Table:

As you might guess, the changes to the scanner and the parser are rather simple. But, when we are parsing a Factor, or generating code, how do we remember that a certain identifier is a Constant? That is where the symbol table comes into play.

A symbol table is an abstract data type that is used to hold names of variables, constants, and functions. Along with the name, attributes such as data type, data address, data size, initial value, and scope are stored. Identifiers are added to the symbol table when they are initially declared. Later, when an identifier is referenced, the symbol table is consulted to make sure the identifier is known and properly used. As you might surmise, languages where identifiers can be used before they are declared pose problems for compiler writers, and essentially require the use of an AST.

In Teeny .04, whenever a constant is recognized, the constant identifier - and the constants value - is added to the symbol table. In our Toy compiler, we'll use an array of structures for the symbol table, and treat it as a simple stack. Production compilers typically use a hash table, while some prefer trees or tries.

Implementation:

Our symbol table will have the following fields:

Here is our simple symbol table class:

 class Sym_table {
 public:
   Sym_table() {tx = 0;}
   void enter_const(const char id[], int val);
   int  position(const char id[]);
   int  getval(int i) {return table[i].val;}
 private:
   enum {Table_max=100};
   struct Table {
     Ident id;             // ident name
     int val;              // value
   };

   int tx;                 // index
   Table table[Table_max]; // entry 0 used as sentinel
 };

When a constant declaration is recognized, the constant name and value are added to the symbol table. Here is the code for recognizing a constant and adding it to the symbol table:

 //Decl  = 'const' Const {',' Const } ';' .
 //Const = ident '=' number .
 void Parser::decl() {
   while (accept(constsym)) {
     do {
       expect(idsym);
       expect(eqlsym);
       expect(numbersym);
       if (position(getid()))
         error.warn("already defined");
       else
         enter_const(getid(), Scanner::getval());
     } while (accept(comma));
     expect(semi);
   }
 }

When a constant is found via the factor production, the name is looked up in the symbol table, and the value for the name is found. Code is then generated to push that value onto the stack. It is an error if the constant name is not found in the symbol table. Here is the new code for factor:

  if (sym == idsym) {
    int i = position(getid());
    if (i == 0)
      error.warn("undefined identifier");
    else
      gen.load_number(Sym_table::getval(i));
    getsym();
  }

Summary of changes to the compiler:

Summary:

We reviewed the different parts of a compiler, specifically Scanning, Parsing, Semantic analysis and Code generation. We reviewed grammars and their importance. We reviewed how to derive parsing code from a grammar. We introduced the Symbol Table. We added constants to our language, which gives a simple example of the use of a symbol table. We showed some of the major changes to Teeny.

References