A Simple Compiler - Part 1: Lexical analysis.

The goal of this series of articles is to develop a simple compiler. Along the way, I'll show how easy it is to do so.

There are four major parts to a compiler: Lexical analysis, Parsing, Semantic analysis, and Code generation.

Briefly, Lexical analysis breaks the source code into its lexical units. Parsing combines those units into sentences, using the grammar (see below) to make sure the are allowable. Semantic analysis makes sure the sentences make sense, especially in areas that are not so easily specified via the grammar. Type checking is a good example. Code generation takes the output of the Parser (many times in the format of an Abstract Syntax Tree) and converts it to (virtual) machine code, assembly code, or perhaps even code in another programming language - C is a popular target.

Lexical analysis:

Also called scanning, this part of a compiler breaks the source code into meaningful symbols that the parser can work with. Typically, the scanner returns an enumerated type (or constant, depending on the language) representing the symbol just scanned. Scanning is the easiest and most well-defined aspect of compiling.

Additional responsibilities of the scanner include removing comments, identifying keywords, and converting numbers to internal form.

For an example, given the following assignment statement from a BASIC-like language:

 average = total / num_items

A scanner might return, in the order given:

 Symbol:         Source text:
 -------         -----------

 identifier      average
 assign          =
 total           total
 divide          /
 identifier      num_items

For a function call, from a C-family language:

 puts("Hello, World");

A scanner might return, in the order given:

 Symbol:         Source text:
 -------         -----------

 identifier      puts
 left_paren      (
 string_literal  "Hello, World"
 right_paren     )
 semicolon       ;

When designing a new language, it is a good idea to think about the symbols that make up the language, and define them in such a way that they can be specified in a non-ambiguous fashion.

A simple example:

Suppose we have a simple language that allows you to display the output of constant integer expressions, featuring the addition and multiplication operators. An example statement in the language:

 write(2001 + 42 * 457)

From this, we might say that the language is composed of Identifiers, integer constants, '(', ')', '+', '*'.

We could specify this more formally using regular expressions:

 identifiers:        [a-zA-Z][a-zA-Z0-9]*
 integer constants:  [0-9]+
 left parenthesis    (
 right parenthesis   )
 plus                +
 times               *

Note that we are assuming the language is free-form and not line based.

Tools exist that will take a specification not too far removed from this and automatically create a scanner. Lex and Flex are both popular scanner generators. Also, many parser generators include built-in scanner generators. Examples are Coco/R and Antlr.

It turns out that scanners, especially for non-ambiguously defined languages, are fairly easy to write.

Many compiler texts recommend constructing a scanner via a finite state machine. But for our purposes, a simple ad-hoc scanner is sufficient.

The main routine of a scanner, which returns an enumerated constant of the next symbol read is:

 getsym()
     -- skip white space
     while the char is a space or eol
         keep reading characters
     switch on the char
         when '+' read next char; return plus
         when '*' read next char; return times
         ...
         when 'a'..'z', 'A'..'Z':
             while the char is alphanumeric
                 accumulate characters in id
             return ident or keyword symbol, as appropriate
         when '0'..'9'
             while the char is a digit
                 accumulate the integer value in val
             return number
         otherwise
             read next char;
             return null
 end getsym

A C++ version follows below.

Next time: Parsing.

References


 // A simple scanner for a language with identifiers, integers,
 // parenthesis, and plus and minus symbols.
 #include <stdio.h>
 #include <string.h>
 #include <ctype.h>

 enum {Ident_max = 16};
 typedef char Ident[Ident_max+1];

 typedef enum {
     null, times, plus, rparen, lparen, number, writesym, eofsym
 } Symbol;

 class Scanner {
 public:
     bool init(const char fn[]);
     Symbol getsym();
     int getval() { return val; }
     const char *getid() { return id; }

 private:
     void read_ch();
     Symbol get_ident();
     Symbol get_number();
     void enter_kw(Symbol sym, Ident name);

     enum {KW=25};

     int val;
     Ident id;
     int ch;
     int nkw;
     char *keyTab[KW+1];
     Symbol keySym[KW+1];
     FILE *f;
 };

 void Scanner::read_ch() {
     ch = fgetc(f);
     if (ch != EOF)
         putchar((char)ch);
 }

 Symbol Scanner::get_ident() {
     int i;

     i = 0;
     do {
         if (i < Ident_max) {
             id[i] = (char)ch;
             i++;
         }
         read_ch();
     } while (ch != EOF && isalnum(ch));
     id[i] = '\0';
     for (i = 0; i < nkw && strcmp(id, keyTab[i]) != 0; i++)
         ;
     return keySym[i];
 }

 Symbol Scanner::get_number() {
     val = 0;
     do {
         val = 10 * val + (ch - '0');
         read_ch();
     } while (ch != EOF && isdigit(ch));
     return number;
 }

 Symbol Scanner::getsym() {
     while (ch != EOF && ch <= ' ')
         read_ch();
     switch (ch) {
         case EOF: return eofsym;
         case '+': read_ch(); return plus;
         case '*': read_ch(); return times;
         case '(': read_ch(); return lparen;
         case ')': read_ch(); return rparen;
         default:
             if (isalpha(ch))
                 return get_ident();
             if (isdigit(ch))
                 return get_number();
             read_ch();
             return null;
     }
 }

 void Scanner::enter_kw(Symbol sym, Ident name) {
     keyTab[nkw] = name;
     keySym[nkw] = sym;
     nkw++;
 }

 bool Scanner::init(const char fn[]) {
     if ((f = fopen(fn, "r")) == NULL)
         return false;
     ch = ' ';
     read_ch();
     nkw = 0;
     enter_kw(writesym,  "write");
     enter_kw(null,      "");
     return true;
 }

 int main(int argc, char *argv[]) {
     Scanner s;
     Symbol sym;

     if (argc < 2) {
         printf("Need a filename\n");
         return 2;
     }
     if (!s.init(argv[1])) {
         printf("Can't open %s\n", argv[1]);
         return 2;
     }
     do {
         sym = s.getsym();
     } while (sym != eofsym);

     return 0;
 }