A Simple Compiler - Part 3: More about grammars.

Some notes about the teeny compiler listing from last time:

Code is generated for a virtual machine. This virtual machine is based on the ones found in the texts by Wirth and Terry.

There is no error recovery - when an error is found, the compiler simply halts. We'll improve this in the future.

In the scanner, idsym is assigned to the (empty) last entry in the keywords table, hence idsym is the Symbol type if the symbol is not a keyword. Also, the keywords table is searched in a linear fashion. In a production compiler, a hash table should be used.

More about grammars:

We've looked at some very simple grammars. Do you remember the difference between terminals (basic language elements) and non-terminals (syntactic elements that are substituted for). Do you remember the EBNF meta symbols? () for grouping, | for alternation, {} for items occurring zero or more times, and [] for optional items.

Ready for something a little more complex? Remember our writeln statement? Lets allow for multiple strings, separated by commas, so that we can now write:

 writeln("string 1", "string 2", "string 3")

This isn't really useful now, but it will be once our language supports expressions. Here is the grammar:

 Writeln = 'writeln' '(' String { ',' String } ')' .

This says that writeln expects a left parenthesis, followed by a string, optionally followed by zero or more Strings, separated by commas. How would we parse this?

 if accept(writeln)
   expect(lparen)
   expect(litstrsym)
   while accept(comma)
     expect(litstrsym)
   end
   expect(rparen)
 end

See how the parsing code follows the grammar? As an aside, we could rewrite the above as:

 if accept(writeln)
   expect(lparen)
   repeat
     expect(litstrsym)
   until not accept(comma)
   expect(rparen)
 end

But let's slightly change the grammar, so that the parameters to Writeln are optional. That way, we can write just a newline by itself:

 Writeln = 'writeln' '(' [String { ',' String }] ')' .

Now the code becomes:

 if accept(writeln)
   expect(lparen)
   if not accept(rparen)
     repeat
       expect(litstrsym)
     until not accept(comma)
     expect(rparen)
   endif
 end

What about arithmetic expressions, for instance just the basics, those including plus, minus, multiply and divide.

 Expr    = Term {('+'|'-') Term} .
 Term    = Factor {('*'|'/') Factor} .
 Factor  = number
         | ('+'|'-') Factor
         | '(' Expr ')'
         .

Do you see how the above grammar forces the standard arithmetic precedence?

How would we parse the above? Just like we've been doing all along. Here goes:

To parse Expr:

  Expr = Term {('+'|'-') Term} .

 expr()
   term()
   while sym in [plus, minus]
     getsym()
     term()
   end
 end

  Term = Factor {('*'|'/') Factor} .

 term()
   factor()
   while sym in [multiply, divide]
     getsym()
     factor()
   end
 end

  Factor  = number
          | ('+'|'-') Factor
          | '(' Expr ')'
          .

 factor()
   if sym == number
     getsym()
   elseif sym in [plus, minus]
     getsym()
     factor()
   elseif sym == lparen
     getsym()
     expr()
     expect(rparen)
   else
     error()
   end
 end

I'm not trying to overemphasize this point, but do you see how simple it is to implement the parser, as long as you have a grammar? See how important a grammar is?

But all isn't quite as easy as it seems. In order to parse your grammar via recursive descent, you must make sure that your grammar is not ambiguous. Here is an example of an ambiguous grammar:

 Statement = 'if' Expr 'then' StmtSeq 'end'
           | 'while' Expr 'do' StmtSeq 'end'
           | ident ':=' Expr
           | ident '(' [ Expr {',' Expr}] ')'
           .

Can you see the ambiguity? Both assignment (ident ':=' Expr) and procedure call (ident '(' [ Expr {',' Expr}] ')') start with the same terminal. This won't work in a recursive descent parser. For that, we would need to use a more sophisticated algorithm.

So how can we solve this problem? Well, the only way is to either redesign your language or somehow change the grammar in an appropriate fashion. We'll solve the problem by doing the latter when we implement assignment and procedures.

Here is the new grammar for our teeny (version .03) language:

 Teeny     = 'program' ident ';' CmpndStmt '.' .
 CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .
 Stmt      = Writeln .
 Writeln   = 'writeln' '(' [(string|Expr)
               {',' (string|Expr) }] ')' .
 Expr      = Term { ('+' | '-') Term } .
 Term      = Factor { ('*' | '/') Factor } .
 Factor    = number
           | ('+' | '-') Factor
           | '(' Expr ')'
           .

And here is a sample teeny v0.03 program:

 program three;
 begin
   writeln('hello, world!');
   writeln('2 + 3 * 4 = ', 2 + 3 * 4);
 end.

It took about 100 lines of code to implement the new features. The major changes are listed here:

Next time we'll implement integer constants.

References

File:teeny.h

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <ctype.h>

 enum {Ident_max = 16, Max_str = 200, Max_stack = 1000};
 typedef char Ident[Ident_max+1];

 typedef enum {
   wrc=1, wri, wrl, lit, neg, mul, dvd, add, sub, hlt} Opcode;

 typedef struct {
   Opcode opcode;
   int operand;
 } Instruction;

 void interpret();

 class Msg {
 public:
   Msg() {err_line = 0; err_col = 0;}
   void fatal(const char msg[]);
   void warn(const char msg[]);
   void runtime(const char msg[]);
   void set_line(int line) { err_line = line; }
   void set_col(int col) { err_col = col; }
 private:
   int err_line, err_col;
 };

 class Scanner {
 public:
   typedef enum {
     nul, lparen='(', rparen=')', times='*', plus='+', comma=',',
     minus='-', period='.', divide='/', semi=';', numbersym,
     idsym, litstrsym, program, begin, end, writeln, eofsym
   } Symbol;

   bool init_scanner(const char fn[]);
   Symbol getsym();
   int getval() { return val; }
   const char *getid() { return id; }
   const char *getlitstr() { return lit_str; }

 private:
   void read_ch();
   Symbol get_ident();
   Symbol get_number();
   Symbol get_string(int delimiter);

   int val;
   Ident id;
   char lit_str[Max_str];
   int ch;
   int cur_line, cur_col;
   FILE *f;
   static char *key_words[];
   static Symbol key_syms[];
 };

 class VirtualMachine {
 public:
   VirtualMachine() { code_index = 0; }
   void load_number(int n) { gen(lit, n); }
   void writeln() { gen(wrl, 0); }
   void write_int() { gen(wri, 0); }
   void halt() { gen(hlt, 0); }
   const Instruction *getcode() { return code; }

   void binary_op(Scanner::Symbol op);
   void unary_op(Scanner::Symbol op);
   void write_str(const char *);

 private:
   enum {MAX_CODE=500};
   void gen(Opcode opcode, int operand);
   Instruction code[MAX_CODE];
   int code_index;
 };

 class Parser : Scanner {
 public:
   void parse(char fn[]);
 private:
   Symbol sym;

   bool accept(Symbol s);
   void expect(Symbol s);
   bool is_addop();
   bool is_mulop();
   void factor();
   void term();
   void expr();
   bool stmt();
   void cmpndstmt();
 };

File:teeny.cpp

 #include "teeny.h"

 VirtualMachine vm;
 Msg error;

 void Msg::fatal(const char msg[]) {
   if (err_line == 0)
     printf(msg);
   else
     printf("line %d col %d %s\n", err_line, err_col, msg);
   exit(2);
 }

 void Msg::runtime(const char msg[]) {
   printf("Runtime error: %s\n", msg);
   exit(2);
 }

 void Msg::warn(const char msg[]) {
   printf("line %d col %d %s\n", err_line, err_col, msg);
   exit(2);
 }

 /---------------- Virtual Machine interpreter ----------------/
 void check_stack(int top) {
   if (top >= Max_stack)
     error.runtime("stack overflow");
 }

 void interpret() {
   int top = 0, stack[Max_stack];
   const Instruction *pc = vm.getcode();
   for (;;) {
     switch (pc->opcode) {
       case wrc: printf("%c", pc->operand); break;
       case wri: printf("%d", stack[top]); top--; break;
       case wrl: printf("\n"); break;
       case lit:
         top++;
         check_stack(top);
         stack[top] = pc->operand;
         break;
       case neg: stack[top] = -stack[top]; break;
       case mul: top--; stack[top] *= stack[top + 1]; break;
       case dvd: top--; stack[top] /= stack[top + 1]; break;
       case add: top--; stack[top] += stack[top + 1]; break;
       case sub: top--; stack[top] -= stack[top + 1]; break;
       case hlt: return;
       default: error.runtime("whoops! unknown opcode");
     }
     pc++;
   }
 }

 /---------------- Code generator -----------------------------/
 void VirtualMachine::gen(Opcode opcode, int operand=0) {
   if (code_index < MAX_CODE) {
     code[code_index].opcode = opcode;
     code[code_index].operand = operand;
     code_index++;
   } else error.fatal("program too long");
 }

 void VirtualMachine::binary_op(Scanner::Symbol op) {
   switch (op) {
     case Scanner::times:  gen(mul); break;
     case Scanner::divide: gen(dvd); break;
     case Scanner::plus:   gen(add); break;
     case Scanner::minus:  gen(sub); break;
     default: error.warn("unknown binary operator");
   }
 }

 void VirtualMachine::unary_op(Scanner::Symbol op) {
   switch (op) {
     case Scanner::minus: gen(neg); break;
     default: error.warn("unknown unary operator");
   }
 }

 void VirtualMachine::write_str(const char *lit_str) {
   for (int i = 0; lit_str[i] != '\0'; i++)
     gen(wrc, lit_str[i]);
 }

 /---------------- Lexical analyzer ---------------------------/
 char *Scanner::key_words[] = {"begin", "end", "program",
     "writeln", ""};
 Scanner::Symbol Scanner::key_syms[] = {begin, end, program,
     writeln, idsym};

 void Scanner::read_ch() {
   if (ch == '\n') {
     cur_line++;
     cur_col = 0;
   }
   ch = fgetc(f);
   if (ch != EOF)
     cur_col++;
 }

 Scanner::Symbol Scanner::get_string(int delimiter) {
   int i = 0;
   for (;;) {
     read_ch();
     if (ch == delimiter) {
       read_ch();
       break;
     }
     if (ch == EOF) {
       error.warn("eof in string");
       break;
     }
     if (ch < ' ') error.warn("illegal character in string");
     if (i < Max_str) {
       lit_str[i] = (char)ch;
       i++;
     }
   }
   lit_str[i] = '\0';
   return litstrsym;
 }

 Scanner::Symbol Scanner::get_ident() {
   int i = 0;
   do {
     if (i < Ident_max) {
       id[i] = (char)ch;
       i++;
     }
     read_ch();
   } while (ch != EOF && isalnum(ch));
   id[i] = '\0';
   for (i = 0;
       key_words[i][0] != '\0' && strcmp(id, key_words[i]) != 0;
       i++)
     ;
   return key_syms[i];
 }

 Scanner::Symbol Scanner::get_number() {
   val = 0;
   do {
     val = 10 * val + (ch - '0');
     read_ch();
   } while (ch != EOF && isdigit(ch));
   return numbersym;
 }

 Scanner::Symbol Scanner::getsym() {
   int save_ch;
   while (ch != EOF && ch <= ' ')
     read_ch();
   error.set_line(cur_line); error.set_col(cur_col);
   save_ch = ch;
   switch (ch) {
     case EOF: return eofsym;
     case '+':case '-':case '*':case '/':case '(':case ')':
     case '.':case ';':case ',':
       read_ch(); return (Symbol)save_ch;
     case '\'': return get_string(ch);
     default:
       if (isalpha(ch)) return get_ident();
       if (isdigit(ch)) return get_number();
       error.warn("illegal character");
       read_ch();
       return getsym();
   }
 }

 bool Scanner::init_scanner(const char fn[]) {
   if ((f = fopen(fn, "r")) == NULL) return false;
   error.set_line(1);
   error.set_col(0);
   cur_line = 1;
   cur_col = 0;
   ch = ' ';
   read_ch();
   return true;
 }

 /---------------- Parser -------------------------------------/
 bool Parser::accept(Symbol s) {
   if (s != sym) return false;
   sym = getsym();
   return true;
 }

 void Parser::expect(Symbol s) {
   if (!accept(s)) error.warn("syntax error");
 }

 bool Parser::is_addop() {
     return sym == plus || sym == minus;
 }

 bool Parser::is_mulop() {
     return sym == times || sym == divide;
 }

 // Factor = number
 //        | ('+' | '-') Factor
 //        | '(' Expr ')'
 //        .
 void Parser::factor() {
   if (sym == numbersym) {
     vm.load_number(getval());
     sym = getsym();
   } else if (is_addop()) {
     Symbol op = sym;
     sym = getsym();
     factor();
     if (op == minus)    //bug fix from Allan (Qdepartment, June 15, 2006)
       vm.unary_op(op);
   } else if (accept(lparen)) {
     expr();
     expect(rparen);
   } else {
     error.warn("factor expected");
   }
 }

 // Term = Factor { ('*' | '/') Factor } .
 void Parser::term() {
   factor();
   while (is_mulop()) {
     Symbol op = sym;
     sym = getsym();
     factor();
     vm.binary_op(op);
   }
 }

 // Expr = Term { ('+' | '-') Term } .
 void Parser::expr() {
   term();
   while (is_addop()) {
     Symbol op = sym;
     sym = getsym();
     term();
     vm.binary_op(op);
   }
 }

 // 'writeln' '(' [(string|Expr) {',' (string|Expr) }] ')' .
 bool Parser::stmt() {
   if (accept(writeln)) {
     expect(lparen);
     if (!accept(rparen)) {
       do {
         if (sym == litstrsym) {
           vm.write_str(getlitstr());
           sym = getsym();
         } else {
           expr();
           vm.write_int();
         }
       } while (accept(comma));
       expect(rparen);
     }
     vm.writeln();
     return true;
   }
   return false;
 }

 // CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .
 void Parser::cmpndstmt() {
   expect(begin);
   do {
     stmt();
   } while (accept(semi));
   expect(end);
 }

 // Pascal = 'program' id ';' 'begin' Stmtseq 'end' '.' .
 void Parser::parse(char fn[]) {
   if (!init_scanner(fn)) error.fatal("no source");
   sym = getsym();
   expect(program);
   expect(idsym);
   expect(semi);
   cmpndstmt();
   expect(period);
   vm.halt();
 }

 /---------------- Main driver --------------------------------/
 int main(int argc, char *argv[]) {
   Parser parser;
   parser.parse(argv[1]);
   interpret();
   return 0;
 }