lexer.lit - website - Website contents

lexer.lit (73462B)
      1 @code_type c .c
      2 @comment_type /* %s */
      3 @compiler lit -t lexer.lit && gcc -Wall -Wextra -Wstrict-aliasing=3 -Wwrite-strings -Wvla -Wcast-align=strict -Wstrict-prototypes -Wstringop-overflow=4 -Wshadow -fanalyzer tokenizer.c input.c hash_table.c token.c util.c -D TEST_TOKENIZER -g -O0 -o lextest && valgrind --leak-check=full ./lextest tokenizer.c && rm lextest
      4 
      5 @title Lexer
      6 @add_css ./style.css
      7 @s General Project Structure
      8 Since this is the first article, I'll outline the project structure for the C- compiler.
      9 
     10 The project has a series of pretty typical stages:
     11 
     12 1.  The lexer. This takes a file as input and emits a series of tokens (Its input is already preprocessed, I outsource that to "gcc -E").
     13 2.  The parser. This takes the tokens and builds an abstract syntax tree (AST).
     14 4.  The type checker/semantic analyzer. This is used to ensure that the types of variables and functions are correct.
     15 5.  The code generator. This takes the AST and generates an intermediate representation (IR).
     16 6.  The optimizer. This takes the IR and optimizes it. This'll be broken up into a few stages.
     17 7.  The lowerer. This takes the IR and lowers it to a simpler IR.
     18 8.  The register allocator. This takes the IR, which has instructions in an infinite number of registers, and assigns them to a finite number of registers.
     19 9.  The code emitter. This takes the IR and emits RISC-V assembly.
     20 
     21 As far as possible, I'd like to keep each of these stages separate. One benefit of this is that it simplifies memory management greatly. I plan to use an arena allocator for each stage, and by making sure the only thing on the actual heap is the output of the stage, and all temporary data is stored in the arena, I can free all the memory used by a stage by simply freeing the arena.
     22 
     23 @s Some Rules
     24 
     25 Here are some rules (more like guidelines) that I plan to follow for this project; they're mostly just to keep things simple and consistent.
     26 
     27 1. PROGRAM LIKE IT'S 1999
     28 
     29 > 640 KB ought to be enough for anybody. - Bill Gates
     30 
     31 Maybe not that little, But I'm going to try to keep the project as simple as possible, 640 KB probably won't be enough, but I'll still aim for less than 10 MB of memory usage.
     32 
     33 This places a lot of constraints on the project, but I think it's a good exercise in minimalism.
     34 
     35 Some consequences of this are that I'll have to use memory-wise algorithms, be very careful about program structure, and avoid some of the bigger libraries (which will help with making this project self-hosting in the future).
     36 
     37 2. PROGRAM IN C++--
     38 
     39 I'm not a big fan of C++, but its class system helps prevent a lot of ugly bugs. To that end, I'm going to try and keep data structures out of header files, and only expose functions that operate on those data structures, to create a sort of approximation of a class. This has a few benefits:
     40 
     41 *   Quicker compilation. A change to a data structure will only require one file to be recompiled, rather than every file that includes the header.
     42 *   Less chance of bugs. If a function is the only way to interact with a data structure, then it's much harder to misuse that data structure.
     43 *   Run time type checking. I can include some sort of tag in the first field of every data structure to ensure that the correct functions are being called.
     44 
     45 3. DON'T GET FANCY
     46 
     47 My goal here isn't to write the fastest interpreter in the world, or the most complete. I just want to make something that works and can be understood by someone else.
     48 
     49 That means I'm going to avoid a lot of the tricks that are used in production interpreters, and focus more on simplicity and readability.
     50 
     51 4. DESIGN FOR DEBUGGING
     52 
     53 This code is going to be peppered with asserts and contain mechanisms to print out the state of the program at any point.
     54 
     55 This might be painful, but it'll make debugging a lot simpler and let users look under the hood.
     56 
     57 5. SMART DATA, STUPID CODE
     58 
     59 A lot of times, the right data structure can replace 50-100 lines of procedural code. I'm going to try and design data structures which make the algorithms as simple as possible.
     60 
     61 For example, instead of writing 50-100 lines of code to hold every keyword in the language, I can just use a simple hash table.
     62 
     63 @s Misc
     64 THIS IS A LITERATE PROGRAM! Go to [this link](https://reagancfischer.dev/projects/cminus/code/lexer.lit) to see the file that generated this HTML.
     65 
     66 @s The Lexer
     67 
     68 A lexical analyzer reads source code and produces tokens, which are the smallest units of meaning in a language. These tokens are then used by the parser to build an abstract syntax tree (AST), which represents the structure of the program.
     69 
     70 For example, in the C programming language, tokens include keywords (if, else, while, etc.), identifiers (variable names), numbers, and punctuation (braces, semicolons, etc.).
     71 
     72 Given a string like `int main() { return 0; }`, the lexer would produce a series of tokens like `INT`, `IDENTIFIER(main)`, `LPAREN`, `RPAREN`, `LBRACE`, `RETURN`, `INTCONSTANT(0)`, `SEMICOLON`, `RBRACE`.
     73 
     74 @s Design
     75 
     76 I'll break the lexer into several modules:
     77 - `token.c` will contain the token data structure and functions to create and destroy tokens.
     78 - `input.c` will contain the input data structure and functions to read from the input file.
     79 - `tokenizer.c` will contain the main lexer logic.
     80 
     81 @s Token Interface
     82 
     83 We'll need several components to represent a token:
     84 - The type of token, which will be an enum with values like `TOK_IF` or `TOK_INTEGER_U32`.
     85 - The value of the token. Some tokens, like keywords, don't have a value. Others, like identifiers or constants, do.
     86 - The line and column of the token, used for error messages.
     87 
     88 To implement a class system in C, we'll hide the token implementation details behind an opaque pointer. We'll use a forward declaration of the token type in the header file and define the token type in the implementation file.
     89 
     90 @s
     91 --- Opaque Token Type
     92 typedef struct token token_t;
     93 ---
     94 
     95 @s
     96 We'll need functions to create and destroy tokens.
     97 --- Token Creation and Destruction Interface
     98 token_t *token_create(c_token_types kind, int lin, int col, int len);
     99 token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len);
    100 token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len);
    101 token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len);
    102 token_t *token_create_string(c_token_types kind, int lin, int col, const char *s, int len);
    103 void token_destroy(token_t *token);
    104 ---
    105 
    106 
    107 @s
    108 We'll also need functions to access the token data.
    109 --- Token Interface
    110 c_token_types token_type(token_t *token);
    111 int64_t token_int(token_t *token);
    112 double token_float(token_t *token);
    113 const char *token_string(token_t *token);
    114 char token_char(token_t *token);
    115 int token_line(token_t *token);
    116 int token_column(token_t *token);
    117 void print_token(token_t *tok);
    118 ---
    119 
    120 @s
    121 We'll need types to represent the different kinds of tokens.
    122 --- Token Types
    123 typedef enum {
    124   // Control Keywords
    125   TOK_IF,
    126   TOK_ELSE,
    127   TOK_SWITCH,
    128   TOK_CASE,
    129   TOK_DEFAULT,
    130   TOK_WHILE,
    131   TOK_DO,
    132   TOK_FOR,
    133   TOK_CONTINUE,
    134   TOK_BREAK,
    135   TOK_RETURN,
    136   TOK_GOTO,
    137 
    138   // Type Keywords
    139   TOK_VOID,
    140   TOK_CHAR,
    141   TOK_SHORT,
    142   TOK_INT,
    143   TOK_LONG,
    144   TOK_FLOAT,
    145   TOK_DOUBLE,
    146   TOK_SIGNED,
    147   TOK_UNSIGNED,
    148   TOK_STRUCT,
    149   TOK_UNION,
    150   TOK_ENUM,
    151   TOK_TYPEDEF,
    152 
    153   // Storage Class/Specifier Keywords
    154   TOK_AUTO,
    155   TOK_REGISTER,
    156   TOK_STATIC,
    157   TOK_EXTERN,
    158   TOK_CONST,
    159   TOK_VOLATILE,
    160 
    161   // Misc Keywords
    162   TOK_SIZEOF,
    163 
    164   // Operators
    165   TOK_ADD,            // +
    166   TOK_SUB,            // -
    167   TOK_MUL,            // *
    168   TOK_DIV,            // /
    169   TOK_MOD,            // %
    170   TOK_BIT_AND,        // &
    171   TOK_BIT_OR,         // |
    172   TOK_BIT_XOR,        // ^
    173   TOK_BIT_NOT,        // ~
    174   TOK_LSHIFT,         // <<
    175   TOK_RSHIFT,         // >>
    176   TOK_NOT,            // !
    177   TOK_ASSIGN,         // =
    178   TOK_LT,             // <
    179   TOK_GT,             // >
    180   TOK_INC,            // ++
    181   TOK_DEC,            // --
    182   TOK_EQ,             // ==
    183   TOK_NE,             // !=
    184   TOK_LE,             // <=
    185   TOK_GE,             // >=
    186   TOK_AND,            // &&
    187   TOK_OR,             // ||
    188   TOK_MEMBER_POINTER, // ->
    189   TOK_MEMBER,         // .
    190   TOK_COND_DECISION,  // :
    191   TOK_COND,           // ?
    192   TOK_ASSIGN_ADD,     // +=
    193   TOK_ASSIGN_SUB,     // -=
    194   TOK_ASSIGN_MUL,     // *=
    195   TOK_ASSIGN_DIV,     // /=
    196   TOK_ASSIGN_MOD,     // %=
    197   TOK_ASSIGN_BITAND,  // &=
    198   TOK_ASSIGN_BITOR,   // |=
    199   TOK_ASSIGN_BITXOR,  // ^=
    200   TOK_ASSIGN_LSHIFT,  // <<=
    201   TOK_ASSIGN_RSHIFT,  // >>=
    202 
    203   // Separators
    204   TOK_LEFT_PAREN,    // (
    205   TOK_RIGHT_PAREN,   // )
    206   TOK_LEFT_BRACKET,  // [
    207   TOK_RIGHT_BRACKET, // ]
    208   TOK_LEFT_BRACE,    // {
    209   TOK_RIGHT_BRACE,   // }
    210   TOK_COMMA,         // ,
    211   TOK_SEMICOLON,     // ;
    212   TOK_DOT,           // .
    213   TOK_ELLIPSIS,      // ...
    214   TOK_HASH,          // #
    215 
    216   // Identifiers
    217   TOK_ID,
    218   TOK_TYPEDEF_NAME,
    219 
    220   // Constants
    221   TOK_INTEGER_U32,  // u
    222   TOK_INTEGER_U64,  // ul
    223   TOK_INTEGER_S32,  // (no suffix)
    224   TOK_INTEGER_S64,  // l
    225   TOK_FLOAT_32,     // f
    226   TOK_FLOAT_64,     // (no suffix)
    227   TOK_CHAR_CONST,         // 'c'
    228   TOK_STRING_ASCII, // "string" (width of 8 bits)
    229 
    230   // Special
    231   TOK_EOF,
    232   TOK_ERROR,
    233 } c_token_types;
    234 ---
    235 
    236 @s
    237 We bring this all together in `token.h`. Line and column are exposed as global variables because `skip_whitespace` will need to update them.
    238 --- token.h
    239 #ifndef TOKEN_H
    240 #define TOKEN_H
    241 #include <stdint.h> // For int64_t
    242 @{Token Types}
    243 @{Opaque Token Type}
    244 @{Token Creation and Destruction}
    245 @{Token Interface}
    246 extern int column;
    247 extern int line;
    248 #endif
    249 ---
    250 
    251 @s Token Implementation
    252 
    253 Now that we have the interface, we can implement the token data structure. We'll need:
    254 - The token type.
    255 - A way to store extra data.
    256 - Implementations of the functions defined in the interface.
    257 
    258 To verify the token isn't corrupt, we'll use a tag. `TOK_MAGIC_1` represents a token with optional data, and `TOK_MAGIC_2` represents a token without optional data.
    259 
    260 A zero-length array is used in the token data structure. This GCC extension allows us to allocate memory for the token data structure and the token data in one allocation.
    261 --- Token Data Structure
    262 #define TOK_MAGIC_1 0x544F4B454E544F4Bul // "TOKENTOK"
    263 #define TOK_MAGIC_2 0x544F4B544F4B454Eul // "TOKTOKEN"
    264 
    265 struct token {
    266   long magic;
    267   int line;
    268   int column;
    269   short kind;
    270   long opt_data[0];
    271 };
    272 
    273 typedef struct token token_t;
    274 
    275 struct token_data {
    276   union {
    277     int64_t i;
    278     double f;
    279     const char *s;
    280     char c;
    281   } data;
    282 };
    283 
    284 typedef struct token_data token_data_t;
    285 int column = 1;
    286 int line = 1;
    287 ---
    288 
    289 
    290 @s
    291 We'll implement an interface for accessing the token data and a macro for accessing optional data.
    292 --- Token Data Access
    293 #define token_data(token) ((struct token_data *)((token)->opt_data))
    294 
    295 c_token_types token_type(token_t *token) {
    296   assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
    297   return token->kind;
    298 }
    299 
    300 int64_t token_int(token_t *token) {
    301   assert(token->kind == TOK_INTEGER_U32 || token->kind == TOK_INTEGER_U64 || token->kind == TOK_INTEGER_S32 || token->kind == TOK_INTEGER_S64);
    302   assert(token->magic == TOK_MAGIC_1);
    303   return token_data(token)->data.i;
    304 }
    305 
    306 double token_float(token_t *token) {
    307   assert(token->kind == TOK_FLOAT_32 || token->kind == TOK_FLOAT_64);
    308   assert(token->magic == TOK_MAGIC_1);
    309   return token_data(token)->data.f;
    310 }
    311 
    312 const char *token_string(token_t *token) {
    313   assert(token->kind == TOK_STRING_ASCII || token->kind == TOK_ID || token->kind == TOK_TYPEDEF_NAME);
    314   assert(token->magic == TOK_MAGIC_1);
    315   return token_data(token)->data.s;
    316 }
    317 
    318 char token_char(token_t *token) {
    319   assert(token->kind == TOK_CHAR_CONST);
    320   assert(token->magic == TOK_MAGIC_1);
    321   return token_data(token)->data.c;
    322 }
    323 
    324 int token_line(token_t *token) {
    325   assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
    326   return token->line;
    327 }
    328 
    329 int token_column(token_t *token) {
    330   assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
    331   return token->column;
    332 }
    333 ---
    334 
    335 @s
    336 For debugging, we'll add a function to print the token type.
    337 --- Token Debugging
    338 @{Token Type Enum to String}
    339 @{Unescape String}
    340 @{Print Token}
    341 ---
    342 
    343 @s
    344 This function returns a string with the token type name.
    345 --- Token Type Enum to String
    346 const char *token_name_from_type(c_token_types type) {
    347   switch (type) {
    348   case TOK_IF:
    349     return "TOK_IF";
    350   case TOK_ELSE:
    351     return "TOK_ELSE";
    352   case TOK_SWITCH:
    353     return "TOK_SWITCH";
    354   case TOK_CASE:
    355     return "TOK_CASE";
    356   case TOK_DEFAULT:
    357     return "TOK_DEFAULT";
    358   case TOK_WHILE:
    359     return "TOK_WHILE";
    360   case TOK_DO:
    361     return "TOK_DO";
    362   case TOK_FOR:
    363     return "TOK_FOR";
    364   case TOK_CONTINUE:
    365     return "TOK_CONTINUE";
    366   case TOK_BREAK:
    367     return "TOK_BREAK";
    368   case TOK_RETURN:
    369     return "TOK_RETURN";
    370   case TOK_GOTO:
    371     return "TOK_GOTO";
    372   case TOK_VOID:
    373     return "TOK_VOID";
    374   case TOK_CHAR:
    375     return "TOK_CHAR";
    376   case TOK_SHORT:
    377     return "TOK_SHORT";
    378   case TOK_INT:
    379     return "TOK_INT";
    380   case TOK_LONG:
    381     return "TOK_LONG";
    382   case TOK_FLOAT:
    383     return "TOK_FLOAT";
    384   case TOK_DOUBLE:
    385     return "TOK_DOUBLE";
    386   case TOK_SIGNED:
    387     return "TOK_SIGNED";
    388   case TOK_UNSIGNED:
    389     return "TOK_UNSIGNED";
    390   case TOK_STRUCT:
    391     return "TOK_STRUCT";
    392   case TOK_UNION:
    393     return "TOK_UNION";
    394   case TOK_ENUM:
    395     return "TOK_ENUM";
    396   case TOK_TYPEDEF:
    397     return "TOK_TYPEDEF";
    398   case TOK_AUTO:
    399     return "TOK_AUTO";
    400   case TOK_REGISTER:
    401     return "TOK_REGISTER";
    402   case TOK_STATIC:
    403     return "TOK_STATIC";
    404   case TOK_EXTERN:
    405     return "TOK_EXTERN";
    406   case TOK_CONST:
    407     return "TOK_CONST";
    408   case TOK_VOLATILE:
    409     return "TOK_VOLATILE";
    410   case TOK_SIZEOF:
    411     return "TOK_SIZEOF";
    412   case TOK_ADD:
    413     return "TOK_ADD";
    414   case TOK_SUB:
    415     return "TOK_SUB";
    416   case TOK_MUL:
    417     return "TOK_MUL";
    418   case TOK_DIV:
    419     return "TOK_DIV";
    420   case TOK_MOD:
    421     return "TOK_MOD";
    422   case TOK_BIT_AND:
    423     return "TOK_BIT_AND";
    424   case TOK_BIT_OR:
    425     return "TOK_BIT_OR";
    426   case TOK_BIT_XOR:
    427     return "TOK_BIT_XOR";
    428   case TOK_BIT_NOT:
    429     return "TOK_BIT_NOT";
    430   case TOK_LSHIFT:
    431     return "TOK_LSHIFT";
    432   case TOK_RSHIFT:
    433     return "TOK_RSHIFT";
    434   case TOK_NOT:
    435     return "TOK_NOT";
    436   case TOK_ASSIGN:
    437     return "TOK_ASSIGN";
    438   case TOK_LT:
    439     return "TOK_LT";
    440   case TOK_GT:
    441     return "TOK_GT";
    442   case TOK_INC:
    443     return "TOK_INC";
    444   case TOK_DEC:
    445     return "TOK_DEC";
    446   case TOK_EQ:
    447     return "TOK_EQ";
    448   case TOK_NE:
    449     return "TOK_NE";
    450   case TOK_LE:
    451     return "TOK_LE";
    452   case TOK_GE:
    453     return "TOK_GE";
    454   case TOK_AND:
    455     return "TOK_AND";
    456   case TOK_OR:
    457     return "TOK_OR";
    458   case TOK_MEMBER_POINTER:
    459     return "TOK_MEMBER_POINTER";
    460   case TOK_MEMBER:
    461     return "TOK_MEMBER";
    462   case TOK_COND_DECISION:
    463     return "TOK_COND_DECISION";
    464   case TOK_COND:
    465     return "TOK_COND";
    466   case TOK_ASSIGN_ADD:
    467     return "TOK_ASSIGN_ADD";
    468   case TOK_ASSIGN_SUB:
    469     return "TOK_ASSIGN_SUB";
    470   case TOK_ASSIGN_MUL:
    471     return "TOK_ASSIGN_MUL";
    472   case TOK_ASSIGN_DIV:
    473     return "TOK_ASSIGN_DIV";
    474   case TOK_ASSIGN_MOD:
    475     return "TOK_ASSIGN_MOD";
    476   case TOK_ASSIGN_BITAND:
    477     return "TOK_ASSIGN_BITAND";
    478   case TOK_ASSIGN_BITOR:
    479     return "TOK_ASSIGN_BITOR";
    480   case TOK_ASSIGN_BITXOR:
    481     return "TOK_ASSIGN_BITXOR";
    482   case TOK_ASSIGN_LSHIFT:
    483     return "TOK_ASSIGN_LSHIFT";
    484   case TOK_ASSIGN_RSHIFT:
    485     return "TOK_ASSIGN_RSHIFT";
    486   case TOK_HASH:
    487     return "TOK_HASH";
    488   case TOK_ID:
    489     return "TOK_ID";
    490   case TOK_TYPEDEF_NAME:
    491     return "TOK_TYPEDEF_NAME";
    492   case TOK_INTEGER_U32:
    493     return "TOK_INTEGER_U32";
    494   case TOK_INTEGER_U64:
    495     return "TOK_INTEGER_U64";
    496   case TOK_INTEGER_S32:
    497     return "TOK_INTEGER_S32";
    498   case TOK_INTEGER_S64:
    499     return "TOK_INTEGER_S64";
    500   case TOK_FLOAT_32:
    501     return "TOK_FLOAT_32";
    502   case TOK_FLOAT_64:
    503     return "TOK_FLOAT_64";
    504   case TOK_CHAR_CONST:
    505     return "TOK_CHAR_CONST";
    506   case TOK_STRING_ASCII:
    507     return "TOK_STRING_ASCII";
    508   case TOK_EOF:
    509     return "TOK_EOF";
    510   case TOK_ERROR:
    511     return "TOK_ERROR";
    512   case TOK_LEFT_PAREN:
    513     return "TOK_LEFT_PAREN";
    514   case TOK_RIGHT_PAREN:
    515     return "TOK_RIGHT_PAREN";
    516   case TOK_LEFT_BRACKET:
    517     return "TOK_LEFT_BRACKET";
    518   case TOK_RIGHT_BRACKET:
    519     return "TOK_RIGHT_BRACKET";
    520   case TOK_LEFT_BRACE:
    521     return "TOK_LEFT_BRACE";
    522   case TOK_RIGHT_BRACE:
    523     return "TOK_RIGHT_BRACE";
    524   case TOK_COMMA:
    525     return "TOK_COMMA";
    526   case TOK_SEMICOLON:
    527     return "TOK_SEMICOLON";
    528   case TOK_DOT:
    529     return "TOK_DOT";
    530   case TOK_ELLIPSIS:
    531     return "TOK_ELLIPSIS";
    532   }
    533   return "UNKNOWN";
    534 }
    535 ---
    536 
    537 @s
    538 This function adds escape characters to a string for printing.
    539 --- Unescape String
    540 #define clamp(x, min, max) ((x) < (min) ? (min) : (x) > (max) ? (max) : (x))
    541 char *re_escape_string(const char *str) {
    542   int len = strlen(str);
    543   char *buf = malloc(len * 2 + 1);
    544   if (!buf) {
    545     fprintf(stderr, "Out of memory. Cannot escape string\n");
    546     exit(1);
    547   }
    548   int i = 0;
    549   for (int j = 0; j < len; j++) {
    550     switch (str[j]) {
    551     case '\a': buf[i++] = '\\'; buf[i++] = 'a'; break;
    552     case '\b': buf[i++] = '\\'; buf[i++] = 'b'; break;
    553     case '\f': buf[i++] = '\\'; buf[i++] = 'f'; break;
    554     case '\n': buf[i++] = '\\'; buf[i++] = 'n'; break;
    555     case '\r': buf[i++] = '\\'; buf[i++] = 'r'; break;
    556     case '\t': buf[i++] = '\\'; buf[i++] = 't'; break;
    557     case '\v': buf[i++] = '\\'; buf[i++] = 'v'; break;
    558     case '\\': buf[i++] = '\\'; buf[i++] = '\\'; break;
    559     case '\'': buf[i++] = '\\'; buf[i++] = '\''; break;
    560     case '"': buf[i++] = '\\'; buf[i++] = '"'; break;
    561     default: {
    562       if (isprint(str[j])) {
    563         buf[i++] = str[j];
    564       } else {
    565         buf[i++] = '\\';
    566         buf[i++] = 'x';
    567         buf[i++] = "0123456789abcdef"[clamp(str[j] >> 4, 0, 0xf)];
    568         buf[i++] = "0123456789abcdef"[clamp(str[j] & 0xf, 0, 0xf)];
    569       }
    570     }
    571     }
    572   }
    573   buf[i] = '\0';
    574   return buf;
    575 }
    576 ---
    577 
    578 
    579 @s
    580 This function prints the token type and value.
    581 --- Print Token
    582 void print_token(token_t *tok) {
    583   if (!tok) {
    584     printf("NULL\n");
    585     return;
    586   }
    587   const char *name = token_name_from_type(tok->kind);
    588   switch (tok->kind) {
    589   case TOK_ID:
    590   case TOK_STRING_ASCII: {
    591     char *escaped = re_escape_string(token_string(tok));
    592     printf("%s: \"%s\"@%d:%d\n", name, escaped, tok->line, tok->column);
    593     free(escaped);
    594     break;
    595   }
    596   case TOK_TYPEDEF_NAME: {
    597     char *escaped = re_escape_string(token_string(tok));
    598     printf("%s: %s@%d:%d\n", name, escaped, tok->line, tok->column);
    599     free(escaped);
    600     break;
    601   }
    602   case TOK_CHAR_CONST:
    603     printf("%s: '%c'@%d:%d\n", name, token_char(tok), tok->line, tok->column);
    604     break;
    605   case TOK_INTEGER_S32:
    606   case TOK_INTEGER_U32:
    607   case TOK_INTEGER_S64:
    608   case TOK_INTEGER_U64:
    609     printf("%s: %ld@%d:%d\n", name, token_int(tok), tok->line, tok->column);
    610     break;
    611   case TOK_FLOAT_32:
    612   case TOK_FLOAT_64:
    613     printf("%s: %f@%d:%d\n", name, token_float(tok), tok->line, tok->column);
    614     break;
    615   default:
    616     printf("%s@%d:%d\n", name, tok->line, tok->column);
    617     break;
    618   }
    619 }
    620 ---
    621 
    622 @s
    623 Now we can implement functions to create and destroy tokens. We'll start with the easy ones.
    624 --- Token Creation and Destruction
    625 token_t *token_data_create(c_token_types kind, int lin, int col, int len) {
    626   token_t *token = malloc(sizeof(token_t) + sizeof(struct token_data));
    627   if (!token) {
    628     fputs("Out of memory\n", stderr);
    629     exit(1);
    630   }
    631   token->magic = TOK_MAGIC_1;
    632   token->line = lin;
    633   token->column = col;
    634   column += len;
    635   token->kind = kind;
    636   return token;
    637 }
    638 
    639 token_t *token_create(c_token_types kind, int lin, int col, int len) {
    640   token_t *token = malloc(sizeof(token_t));
    641   if (!token) {
    642     fputs("Out of memory\n", stderr);
    643     exit(1);
    644   }
    645   token->magic = TOK_MAGIC_2;
    646   token->line = lin;
    647   token->column = col;
    648   column += len;
    649   token->kind = kind;
    650   return token;
    651 }
    652 
    653 token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len) {
    654   token_t *token = token_data_create(kind, lin, col, len);
    655   token_data(token)->data.i = i;
    656   return token;
    657 }
    658 
    659 token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len) {
    660   token_t *token = token_data_create(kind, lin, col, len);
    661   token_data(token)->data.f = f;
    662   return token;
    663 }
    664 
    665 token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len) {
    666   token_t *token = token_data_create(kind, lin, col, len);
    667   token_data(token)->data.c = c;
    668   return token;
    669 }
    670 
    671 void token_destroy(token_t *token) {
    672   if (token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2) {
    673     free(token);
    674   } else {
    675     fputs("Corrupt token\n", stderr);
    676     exit(1);
    677   }
    678 }
    679 ---
    680 
    681 @s
    682 `token_create_string` can be implemented either the easy way or the right way. Let's try the easy way.
    683 --- Token Create String
    684 token_t *token_create_string(c_token_types kind, int lin, int col, const char *s, int len) {
    685   token_t *token = token_create(kind, lin, col, len);
    686   token_data(token)->data.s = strdup(s);
    687   return token;
    688 }
    689 ---
    690 
    691 @s
    692 There's an issue with this approach. `token_create_string` will be called for every identifier and every string in a program. Imagine a large program, say a shell, with a bunch of user input and output. That program will likely have 20-40 calls to `fprintf`, `fscanf`, `strchr`, `strtok`, each. We create a new string for each of those calls. That's a lot of duplicates, and can quickly add up to a lot of memory usage.
    693 
    694 To fix this, we use a hash table to store the strings. We'll define a hash table in `hash_table.h` and `hash_table.c`.
    695 
    696 @s Hash Table
    697 A hash table is a data structure that maps keys to values. It's commonly used for implementing symbol tables to store variables and functions. To create a generic hash table, we'll need:
    698 
    699 1. A hash function to convert keys into array indices
    700 2. A comparison function to check if two keys are equal
    701 3. An opaque type to represent the hash table
    702 4. A destructor function to clean up keys and values when removing entries
    703 
    704 Let's start with the interface:
    705 
    706 @s
    707 --- Hash Table Opaque Types
    708 typedef struct hash_table hash_table_t;
    709 typedef int (*hash_table_cmp_fn)(const void *key1, const void *key2);
    710 typedef unsigned int (*hash_table_hash_fn)(const void *key);
    711 typedef void (*hash_table_dtor)(void *data, int is_key);
    712 ---
    713 
    714 @s
    715 --- Hash Table Creation and Destruction
    716 hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor);
    717 void hash_table_destroy(hash_table_t *table);
    718 ---
    719 
    720 @s
    721 --- Hash Table Access
    722 void *hash_table_get(const hash_table_t *table, const void *key);
    723 void hash_table_put(hash_table_t *table, void *key, void *value);
    724 void hash_table_remove(hash_table_t *table, const void *key);
    725 ---
    726 
    727 @s
    728 --- hash_table.h
    729 #ifndef HASH_TABLE_H
    730 #define HASH_TABLE_H
    731 
    732 @{Hash Table Opaque Types}
    733 @{Hash Table Creation and Destruction}
    734 @{Hash Table Access}
    735 
    736 #endif /* HASH_TABLE_H */
    737 ---
    738 
    739 @s
    740 Now let's implement the hash table:
    741 
    742 --- hash_table.c
    743 #include <stdlib.h>
    744 #include <string.h>
    745 #include <stdio.h>
    746 #include "hash_table.h"
    747 
    748 @{Hash Table Data Structure}
    749 @{Hash Table Entry Data Structure}
    750 
    751 hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor) {
    752     @{Allocate and Initialize Hash Table}
    753     return table;
    754 }
    755 
    756 void hash_table_destroy(hash_table_t *table) {
    757     if (!table) return;
    758     @{Destroy Entries}
    759     free(table->entries);
    760     free(table);
    761 }
    762 
    763 void *hash_table_get(const hash_table_t *table, const void *key) {
    764     if (!table || !key) return NULL;
    765     @{Get Entry By Hash}
    766     @{Loop Through Entries and Return Value if Match}
    767     return NULL;
    768 }
    769 
    770 void hash_table_put(hash_table_t *table, void *key, void *value) {
    771     if (!table || !key) return;
    772     @{Get Entry By Hash}
    773     @{Loop Through Entries and Replace Value if Key Matches}
    774     @{Allocate New Entry if No Match}
    775 }
    776 
    777 void hash_table_remove(hash_table_t *table, const void *key) {
    778     if (!table || !key) return;
    779     @{Get Entry By Hash}
    780     @{Loop Through Entries and Remove Entry if Key Matches}
    781 }
    782 
    783 #ifdef TEST_HASH_TABLE
    784 #include <assert.h>
    785 
    786 static int string_cmp(const void *key1, const void *key2) {
    787     return strcmp((const char *)key1, (const char *)key2);
    788 }
    789 
    790 static unsigned int string_hash(const void *key) {
    791     unsigned int hash = 5381;
    792     const unsigned char *str = key;
    793     int c;
    794 
    795     while ((c = *str++))
    796     {
    797         hash = ((hash << 5) + hash) + c;
    798     }
    799 
    800     return hash;
    801 }
    802 
    803 int main(void) {
    804     hash_table_t *table = hash_table_create(16, string_cmp, string_hash, NULL);
    805     assert(table != NULL);
    806 
    807     hash_table_put(table, "foo", "bar");
    808     hash_table_put(table, "foo", "baz");
    809     assert(strcmp((const char *)hash_table_get(table, "foo"), "baz") == 0);
    810 
    811     hash_table_remove(table, "foo");
    812     assert(hash_table_get(table, "foo") == NULL);
    813 
    814     hash_table_destroy(table);
    815     printf("All tests passed!\n");
    816     return 0;
    817 }
    818 #endif
    819 ---
    820 
    821 @s
    822 The hash table data structure contains an array of entry pointers, the size of the array, and function pointers for comparison, hashing, and destruction.
    823 
    824 --- Hash Table Data Structure
    825 struct hash_table {
    826     struct hash_table_entry **entries;
    827     int size;
    828     hash_table_cmp_fn cmp;
    829     hash_table_hash_fn hash;
    830     hash_table_dtor dtor;
    831 };
    832 ---
    833 
    834 @s
    835 Each entry in the hash table contains a key, a value, and a pointer to the next entry in the chain for collision resolution via chaining.
    836 
    837 --- Hash Table Entry Data Structure
    838 struct hash_table_entry {
    839     void *key;
    840     void *value;
    841     struct hash_table_entry *next;
    842 };
    843 ---
    844 
    845 
    846 @s
    847 To allocate a hash table, we allocate memory for the table structure and its entries, initialize the entries to NULL, and set the function pointers.
    848 
    849 --- Allocate and Initialize Hash Table
    850 hash_table_t *table = malloc(sizeof(struct hash_table));
    851 if (!table) {
    852     fprintf(stderr, "Error: Out of memory, could not allocate hash table\n");
    853     exit(EXIT_FAILURE);
    854 }
    855 table->entries = calloc(size, sizeof(struct hash_table_entry *));
    856 if (!table->entries) {
    857     fprintf(stderr, "Error: Out of memory, could not allocate hash table entries\n");
    858     free(table);
    859     exit(EXIT_FAILURE);
    860 }
    861 table->size = size;
    862 table->cmp = cmp;
    863 table->hash = hash;
    864 table->dtor = dtor;
    865 ---
    866 
    867 @s
    868 To destroy the entries in a hash table,  we iterate through all entries, free the keys and values using the destructor if provided, and free the entry itself.
    869 --- Destroy Entries
    870 for (int i = 0; i < table->size; i++) {
    871     struct hash_table_entry *entry = table->entries[i];
    872     while (entry) {
    873         struct hash_table_entry *next = entry->next;
    874         if (table->dtor) {
    875             table->dtor(entry->key, 1);
    876             table->dtor(entry->value, 0);
    877         }
    878         free(entry);
    879         entry = next;
    880     }
    881 }
    882 ---
    883 
    884 @s
    885 To retrieve an entry's hash bucket, we apply the hash function to the key and take the modulus of the result with the table size.
    886 
    887 --- Get Entry By Hash
    888 unsigned int hash = table->hash(key) % table->size;
    889 struct hash_table_entry *entry = table->entries[hash];
    890 ---
    891 
    892 @s
    893 When putting a new entry in the table, we first check if the key already exists. If it does, we update the value; otherwise, we create a new entry.
    894 
    895 --- Loop Through Entries and Replace Value if Key Matches
    896 while (entry) {
    897     if (table->cmp(entry->key, key) == 0) {
    898         if (table->dtor) table->dtor(entry->value, 0);
    899         entry->value = value;
    900         return;
    901     }
    902     entry = entry->next;
    903 }
    904 ---
    905 
    906 @s
    907 If no matching key is found, we create a new entry and insert it at the beginning of the hash bucket. 
    908 
    909 This can possibly improve performance due to a property called temporal locality. When we access an entry, we're likely to access it again soon. Since the entry is at the beginning of the list, it's likely to be accessed again soon.
    910 
    911 --- Allocate New Entry if No Match
    912 struct hash_table_entry *new_entry = malloc(sizeof(struct hash_table_entry));
    913 if (!new_entry) {
    914     fprintf(stderr, "Error: Out of memory, could not allocate hash table entry\n");
    915     exit(EXIT_FAILURE);
    916 }
    917 new_entry->key = key;
    918 new_entry->value = value;
    919 new_entry->next = table->entries[hash];
    920 table->entries[hash] = new_entry;
    921 ---
    922 
    923 @s
    924 To remove an entry, we find its bucket, update the linked list to bypass it, then free the entry and its contents.
    925 
    926 --- Loop Through Entries and Remove Entry if Key Matches
    927 struct hash_table_entry *prev = NULL;
    928 while (entry) {
    929     if (table->cmp(entry->key, key) == 0) {
    930         if (prev)
    931             prev->next = entry->next;
    932         else
    933             table->entries[hash] = entry->next;
    934         
    935         if (table->dtor) {
    936             table->dtor(entry->key, 1);
    937             table->dtor(entry->value, 0);
    938         }
    939         free(entry);
    940         return;
    941     }
    942     prev = entry;
    943     entry = entry->next;
    944 }
    945 ---
    946 
    947 @s
    948 To retrieve a value from a given bucket, we just walk the list and return the value if a matching key is found.
    949 
    950 --- Loop Through Entries and Return Value if Match
    951 while (entry) {
    952     if (table->cmp(entry->key, key) == 0) {
    953         return entry->value;
    954     }
    955     entry = entry->next;
    956 }
    957 ---
    958 
    959 @s
    960 We're now almost ready to implement `token_create_string` the right way. First, we'll need a good hash function. 
    961 
    962 Hash functions are a very interesting topic and there's a lot of good research on them. The hash function we use should be fast, have a low collision rate, and be able to handle strings of any length.
    963 
    964 We can't just sum the characters in a string, because that would mean that "stop" and "pots" would have the same hash. Multiplying has the same problem. If we take each to the power of its position in the string, we get a better distribution, but it's still awful.
    965 
    966 Using a simple python program, I brute-forced all possible 4-character strings and ran our power-hash function on them, the result showed that for 456976 possible strings, only 3760 were unique, which is terrible.
    967 
    968 Instead of trying to come up with a new hash function, we can use one that's been well-tested and is known to work well.
    969 
    970 The first time I wrote this, I used the hash function from the 'Red Dragon Book' (Compilers: Principles, Techniques, and Tools). 
    971 --- Hash Function
    972 unsigned long hash_string(void *key) {
    973   unsigned long hash = 0, g;
    974   char *p = key;
    975   while (*p) {
    976     hash = (hash \<\< 4) + *p++;
    977     if ((g = hash \& 0xf0000000) != 0) {
    978       hash ^= g \>\> 24;
    979       hash ^= g;
    980     }
    981   }
    982   return hash;
    983 }
    984 ---
    985 This is a bit slow on modern processors because it's not very cache-friendly. We can do better. Let's use its child, ELFHash, from libc.
    986 
    987 As you can see in the code below, this function avoids extra operations and should be much faster. 
    988 --- Hash Function :=
    989 unsigned int hash_string(const void *key) {
    990   unsigned long hash = 0, hi = 0;
    991   const char *p = key;
    992   hash = *p;
    993   if (hash != 0 && p[1] != 0) {
    994     hash = (hash << 4) + p[1];
    995     if (p[2] != 0) {
    996       hash = (hash << 4) + p[2];
    997       if (p[3] != 0) {
    998         hash = (hash << 4) + p[3];
    999         if (p[4] != 0) {
   1000           hash = (hash << 4) + p[4];
   1001           p += 5;
   1002           while (*p != 0) {
   1003             hash = (hash << 4) + *p++;
   1004             hi = hash & 0xf0000000l;
   1005             hash ^= hi >> 24;
   1006           }
   1007           hash &= 0x0fffffffl;
   1008         }
   1009       }
   1010     }
   1011   }
   1012   return hash;
   1013 }
   1014 ---
   1015 
   1016 @s
   1017 We also need a comparison function for strings.
   1018 --- String Comparison
   1019 int cmp_string(const void *key1, const void *key2) {
   1020   return strcmp((char *)key1, (char *)key2);
   1021 }
   1022 ---
   1023 
   1024 @s
   1025 Finally, we'll need a destructor for entries.
   1026 --- String Destructor
   1027 void dtor_string(void *value, int is_key) {
   1028   if (is_key) {
   1029     free(value); // Since the key and value are the same, we only need to free once.
   1030   }
   1031 }
   1032 ---
   1033 @s
   1034 These functions go in util.c
   1035 --- util.c
   1036 #include <string.h>
   1037 #include <stdlib.h>
   1038 #include "hash_table.h"
   1039 @{String Comparison}
   1040 @{Hash Function}
   1041 @{String Destructor}
   1042 ---
   1043 --- util.h
   1044 #ifndef UTIL_H
   1045 #define UTIL_H
   1046 #include "hash_table.h"
   1047 int cmp_string(const void *key1, const void *key2);
   1048 unsigned int hash_string(const void *key);
   1049 void dtor_string(void *value, int is_key);
   1050 #endif
   1051 ---
   1052 
   1053 @s
   1054 Now we can implement `token_create_string` the right way.
   1055 
   1056 You might notice that we're using the same key and value. This way of using a hash table is normally called a set. We're using it to store strings, but we could use it to store anything we want to deduplicate.
   1057 --- Token Create String --- :=
   1058 hash_table_t *string_table;
   1059 token_t *token_create_string(c_token_types kind, int lin, int col,
   1060                                     const char *s, int len) {
   1061   if (string_table == NULL) {
   1062     string_table = hash_table_create(2048, cmp_string, hash_string, dtor_string);
   1063   }
   1064   token_t *token = token_data_create(kind, lin, col, len);
   1065   char *key = hash_table_get(string_table, (void *)s);
   1066   if (key == NULL) {
   1067     key = strdup(s);
   1068     hash_table_put(string_table, key, key);
   1069   }
   1070   token_data(token)->data.s = key;
   1071   return token;
   1072 }
   1073 ---
   1074 
   1075 @s
   1076 We'll add an external declaration for `string_table` in `token.h` so other programs can take advantage of it.
   1077 --- token.h :=
   1078 #ifndef TOKEN_H
   1079 #define TOKEN_H
   1080 #include <stdint.h> // We use this for int64_t
   1081 #include "hash_table.h" // We need this for the string table
   1082 @{Token Types}
   1083 @{Opaque Token Type}
   1084 @{Token Creation and Destruction Interface}
   1085 @{Token Interface}
   1086 extern hash_table_t *string_table;
   1087 extern int column;
   1088 extern int line;
   1089 #endif
   1090 ---
   1091 
   1092 @s
   1093 Finally, we implement the token data structure in `token.c`.
   1094 --- token.c
   1095 #include <stdlib.h>
   1096 #include <string.h>
   1097 #include <stdio.h>
   1098 #include <assert.h>
   1099 #include <ctype.h>
   1100 #include "token.h"
   1101 #include "hash_table.h"
   1102 #include "util.h"
   1103 @{Token Data Structure}
   1104 @{Token Data Access}
   1105 @{Token Creation and Destruction}
   1106 @{Token Create String}
   1107 @{Token Debugging}
   1108 ---
   1109 
   1110 @s Input
   1111 
   1112 Input will provide a simple interface for reading characters from a file. The stream itself is deliberately hidden from the tokenizer, so that the tokenizer doesn't have to worry about buffering or anything like that.
   1113 
   1114 @s
   1115 --- Input Interface
   1116 void input_init(const char *filename);
   1117 int input_getc(void);
   1118 void input_ungetc(int c);
   1119 void input_destroy(void);
   1120 ---
   1121 
   1122 When the program wants to start reading a file, it calls `input_init` with the filename. It can then call `input_getc` to get the next character in the file. If there's no more input, `input_getc` will return `EOF`.
   1123 
   1124 There's also an `input_ungetc` function, which allows the program to put a character back into the stream. I'll only allow one character to be put back, but that should be enough for the tokenizer.
   1125 
   1126 Finally, when the program is done reading the file, it should call `input_destroy` to clean up.
   1127 
   1128 @s Input Design Decisions
   1129 
   1130 Per rule 1, we're trying to keep memory usage low. That means that instead of reading the entire file into memory, we'll need to read it in chunks. There are a couple of choices for how to do this:
   1131 
   1132 1. **Read a line at a time**: This is a more natural approach, but it has two drawbacks. First, it requires a large buffer to store the line (C normally specifies `BUFSIZ` as 8192 bytes). Second, if the line is longer than `BUFSIZ`, we'll have to read the line in chunks anyway.
   1133 
   1134 2. **Choose some arbitrary buffer size and read that many bytes at a time**: This is the approach I'm going to take. It's a little less natural, but it's more memory efficient.
   1135 
   1136 Input will read chunks of 128 bytes at a time, reusing the same static buffer. This limitation is not visible to the tokenizer, which will only see the `input_getc` interface.
   1137 
   1138 When the buffer is exhausted, `input_getc` will call `nextline`, which will read the next chunk of the file.
   1139 
   1140 @s Input Implementation
   1141 
   1142 The implementation of the input module is pretty straightforward. We have the following data structures and defines as globals:
   1143 
   1144 @s
   1145 --- Input Data
   1146 #define CHUNK_SIZE 128
   1147 static char buffer[CHUNK_SIZE];
   1148 static int buffer_pos = 0;
   1149 static int buffer_size = 0;
   1150 static char unget_buffer_stack[8];
   1151 static int unget_buffer_stack_pos = 0;
   1152 
   1153 static FILE *file = NULL;
   1154 ---
   1155 
   1156 When the program calls `input_init`, we open the file.
   1157 
   1158 @s
   1159 --- Input Initialization
   1160 void input_init(const char *filename) {
   1161   file = fopen(filename, "r");
   1162   if (file == NULL) {
   1163     fprintf(stderr, "Error: Cannot open file %s\n", filename);
   1164     exit(1);
   1165   }
   1166 }
   1167 ---
   1168 
   1169 When the program calls `input_getc`, we return the next character in the buffer. If the buffer is exhausted, we call `nextline`. We also track the line and column.
   1170 
   1171 @s
   1172 --- Input Get Character
   1173 int input_getc(void) {
   1174   if (unget_buffer_stack_pos > 0) {
   1175     return unget_buffer_stack[--unget_buffer_stack_pos];
   1176   }
   1177   if (buffer_pos == buffer_size) {
   1178     buffer_size = fread(buffer, 1, CHUNK_SIZE, file);
   1179     buffer_pos = 0;
   1180   }
   1181   if (buffer_size == 0) {
   1182     return EOF;
   1183   }
   1184   char c = buffer[buffer_pos++];
   1185   return c;
   1186 }
   1187 ---
   1188 
   1189 When the program calls `input_ungetc`, we save the character in the `unget_buffer`.
   1190 
   1191 @s
   1192 --- Input Unget Character
   1193 void input_ungetc(int c) {
   1194   unget_buffer_stack[unget_buffer_stack_pos++] = c;
   1195 }
   1196 ---
   1197 
   1198 Since we're not using dynamic memory allocation, cleanup is pretty simple.
   1199 
   1200 @s
   1201 --- Input Destroy
   1202 void input_destroy(void) {
   1203   fclose(file);
   1204 }
   1205 ---
   1206 
   1207 @s
   1208 We put the whole thing together in `input.c`.
   1209 --- input.c
   1210 #include <stdio.h>
   1211 #include <stdlib.h>
   1212 #include "input.h"
   1213 @{Input Data}
   1214 @{Input Initialization}
   1215 @{Input Get Character}
   1216 @{Input Unget Character}
   1217 @{Input Destroy}
   1218 ---
   1219 
   1220 @s
   1221 We'll need an external declaration for `file` in `input.h` so other programs can take advantage of it.
   1222 --- input.h
   1223 #ifndef INPUT_H
   1224 #define INPUT_H
   1225 @{Input Interface}
   1226 #endif
   1227 ---
   1228 
   1229 @s
   1230 We'll implement the lexer interface in `tokenizer.h`
   1231 --- tokenizer.h
   1232 #ifndef TOKENIZER_H
   1233 #define TOKENIZER_H
   1234 #include "token.h"
   1235 #include "input.h"
   1236 @{Tokenization Interface}
   1237 #endif
   1238 ---
   1239 
   1240 @s
   1241 The tokenization interface will have a couple of functions. `next_token` will return the next token in the input stream, `init_tokenizer` will initialize the tokenizer, and `destroy_tokenizer` will clean up.
   1242 
   1243 We'll also have some helper functions for lookahead and matching. 
   1244 
   1245 'peek_token' will return the next token without consuming it (it technically does advance the input stream, but it saves the token so it can be reused). 
   1246 
   1247 'consume' will consume the next token if it matches a given kind. If it doesn't match, it will print an error message and exit.
   1248 --- Tokenization Interface
   1249 void init_tokenizer(const char *filename);
   1250 void destroy_tokenizer(void);
   1251 token_t *next_token(void);
   1252 void reject_token(token_t *token);
   1253 token_t *peek_token(void);
   1254 void consume(c_token_types kind);
   1255 void consume_alt(c_token_types *kinds, int n);
   1256 ---
   1257 
   1258 @s
   1259 Now we can finally implement the tokenizer. A stack for storing tokens for lookahead is defined, as is a hash table, modified by the parser, for typedefs.
   1260 --- tokenizer.c
   1261 #include <ctype.h>
   1262 #include <errno.h>
   1263 #include <float.h>
   1264 #include <math.h>
   1265 #include <stdarg.h>
   1266 #include <stdint.h>
   1267 #include <stdio.h>
   1268 #include <stdlib.h>
   1269 #include <string.h>
   1270 #include "tokenizer.h"
   1271 #include "token.h"
   1272 #include "input.h"
   1273 #include "util.h"
   1274 token_t *left_stack[8];
   1275 int left_stack_pos = 0;
   1276 hash_table_t *typedef_table = NULL;
   1277 @{Utility Functions}
   1278 @{Tokenization Function}
   1279 ---
   1280 
   1281 @s
   1282 Utility functions are everything that doesn't directly tokenize the input.
   1283 --- Utility Functions
   1284 void init_tokenizer(const char *filename) {
   1285   input_init(filename);
   1286   typedef_table = hash_table_create(16, cmp_string, hash_string, dtor_string);
   1287 }
   1288 
   1289 void destroy_tokenizer(void) {
   1290   input_destroy();
   1291 }
   1292 
   1293 void reject_token(token_t *token) {
   1294   left_stack[left_stack_pos++] = token;
   1295 }
   1296 
   1297 token_t *peek_token(void) {
   1298   if (left_stack_pos > 0) {
   1299     return left_stack[left_stack_pos - 1];
   1300   }
   1301     token_t *token = next_token();
   1302     reject_token(token);
   1303     return token;
   1304 }
   1305 
   1306 @{Stringify Type}
   1307 
   1308 void consume(c_token_types kind) {
   1309   token_t *token = next_token();
   1310   if (token_type(token) != kind) {
   1311     fprintf(stderr, "Error: Expected token of type \"%s\", got \"%s\"\n", stringify_type(kind), stringify_type(token_type(token)));
   1312     exit(1);
   1313   }
   1314   token_destroy(token);
   1315 }
   1316 
   1317 void consume_alt(c_token_types *kinds, int n) {
   1318   token_t *token = next_token();
   1319   for (int i = 0; i < n; i++) {
   1320     if (token_type(token) == kinds[i]) {
   1321       token_destroy(token);
   1322       return;
   1323     }
   1324   }
   1325   fprintf(stderr, "Error: Expected one of the following tokens: ");
   1326   for (int i = 0; i < n; i++) {
   1327     fprintf(stderr, "\"%s\" ", stringify_type(kinds[i]));
   1328   }
   1329   fprintf(stderr, "got \"%s\"\n", stringify_type(token_type(token)));
   1330   exit(1);
   1331 }
   1332 ---
   1333 
   1334 @s
   1335 We'll need a helper function to convert token types to strings. It's pretty simple, just tedious.
   1336 --- Stringify Type
   1337 const char *stringify_type(c_token_types type) {
   1338   switch (type) {
   1339   case TOK_IF:
   1340     return "if";
   1341   case TOK_ELSE:
   1342     return "else";
   1343   case TOK_SWITCH:
   1344     return "switch";
   1345   case TOK_CASE:
   1346     return "case";
   1347   case TOK_DEFAULT:
   1348     return "default";
   1349   case TOK_WHILE:
   1350     return "while";
   1351   case TOK_DO:
   1352     return "do";
   1353   case TOK_FOR:
   1354     return "for";
   1355   case TOK_CONTINUE:
   1356     return "continue";
   1357   case TOK_BREAK:
   1358     return "break";
   1359   case TOK_RETURN:
   1360     return "return";
   1361   case TOK_GOTO:
   1362     return "goto";
   1363   case TOK_VOID:
   1364     return "void";
   1365   case TOK_CHAR:
   1366     return "char";
   1367   case TOK_SHORT:
   1368     return "short";
   1369   case TOK_INT:
   1370     return "int";
   1371   case TOK_LONG:
   1372     return "long";
   1373   case TOK_FLOAT:
   1374     return "float";
   1375   case TOK_DOUBLE:
   1376     return "double";
   1377   case TOK_SIGNED:
   1378     return "signed";
   1379   case TOK_UNSIGNED:
   1380     return "unsigned";
   1381   case TOK_STRUCT:
   1382     return "struct";
   1383   case TOK_UNION:
   1384     return "union";
   1385   case TOK_ENUM:
   1386     return "enum";
   1387   case TOK_TYPEDEF:
   1388     return "typedef";
   1389   case TOK_AUTO:
   1390     return "auto";
   1391   case TOK_REGISTER:
   1392     return "register";
   1393   case TOK_STATIC:
   1394     return "static";
   1395   case TOK_EXTERN:
   1396     return "extern";
   1397   case TOK_CONST:
   1398     return "const";
   1399   case TOK_VOLATILE:
   1400     return "volatile";
   1401   case TOK_SIZEOF:
   1402     return "sizeof";
   1403   case TOK_ADD:
   1404     return "+";
   1405   case TOK_SUB:
   1406     return "-";
   1407   case TOK_MUL:
   1408     return "*";
   1409   case TOK_DIV:
   1410     return "/";
   1411   case TOK_MOD:
   1412     return "%";
   1413   case TOK_BIT_AND:
   1414     return "&";
   1415   case TOK_BIT_OR:
   1416     return "|";
   1417   case TOK_BIT_XOR:
   1418     return "^";
   1419   case TOK_BIT_NOT:
   1420     return "~";
   1421   case TOK_LSHIFT:
   1422     return "<<";
   1423   case TOK_RSHIFT:
   1424     return ">>";
   1425   case TOK_NOT:
   1426     return "!";
   1427   case TOK_ASSIGN:
   1428     return "=";
   1429   case TOK_LT:
   1430     return "<";
   1431   case TOK_GT:
   1432     return ">";
   1433   case TOK_INC:
   1434     return "++";
   1435   case TOK_DEC:
   1436     return "--";
   1437   case TOK_EQ:
   1438     return "==";
   1439   case TOK_NE:
   1440     return "!=";
   1441   case TOK_LE:
   1442     return "<=";
   1443   case TOK_GE:
   1444     return ">=";
   1445   case TOK_AND:
   1446     return "&&";
   1447   case TOK_OR:
   1448     return "||";
   1449   case TOK_MEMBER_POINTER:
   1450     return "->";
   1451   case TOK_MEMBER:
   1452     return ".";
   1453   case TOK_COND_DECISION:
   1454     return ":";
   1455   case TOK_COND:
   1456     return "?";
   1457   case TOK_ASSIGN_ADD:
   1458     return "+=";
   1459   case TOK_ASSIGN_SUB:
   1460     return "-=";
   1461   case TOK_ASSIGN_MUL:
   1462     return "*=";
   1463   case TOK_ASSIGN_DIV:
   1464     return "/=";
   1465   case TOK_ASSIGN_MOD:
   1466     return "%=";
   1467   case TOK_ASSIGN_BITAND:
   1468     return "&=";
   1469   case TOK_ASSIGN_BITOR:
   1470     return "|=";
   1471   case TOK_ASSIGN_BITXOR:
   1472     return "^=";
   1473   case TOK_ASSIGN_LSHIFT:
   1474     return "<<=";
   1475   case TOK_ASSIGN_RSHIFT:
   1476     return ">>=";
   1477   case TOK_HASH:
   1478     return "#";
   1479   case TOK_ID:
   1480     return "identifier";
   1481   case TOK_TYPEDEF_NAME:
   1482     return "typedef name";
   1483   case TOK_INTEGER_U32:
   1484   case TOK_INTEGER_U64:
   1485   case TOK_INTEGER_S32:
   1486   case TOK_INTEGER_S64:
   1487     return "integer constant";
   1488   case TOK_FLOAT_32:
   1489   case TOK_FLOAT_64:
   1490     return "floating constant";
   1491   case TOK_CHAR_CONST:
   1492     return "character constant";
   1493   case TOK_STRING_ASCII:
   1494     return "string constant";
   1495   case TOK_EOF:
   1496     return "EOF";
   1497   case TOK_ERROR:
   1498     return "error";
   1499   case TOK_LEFT_PAREN:
   1500     return "(";
   1501   case TOK_RIGHT_PAREN:
   1502     return ")";
   1503   case TOK_LEFT_BRACKET:
   1504     return "[";
   1505   case TOK_RIGHT_BRACKET:
   1506     return "]";
   1507   case TOK_LEFT_BRACE:
   1508     return "{";
   1509   case TOK_RIGHT_BRACE:
   1510     return "}";
   1511   case TOK_COMMA:
   1512     return ",";
   1513   case TOK_SEMICOLON:
   1514     return ";";
   1515   case TOK_DOT:
   1516     return ".";
   1517   case TOK_ELLIPSIS:
   1518     return "...";
   1519   }
   1520   return "UNKNOWN";
   1521 }
   1522 ---
   1523 
   1524 @s
   1525 Now we can implement the tokenization function. The pattern is pretty simple: we call each of the tokenization functions in turn until we find a match. If we don't find a match, we print an error message and exit.
   1526 You might wonder why skip_whitespace can return a token. This makes handling the divide operator easier as comments also start with a slash.
   1527 --- Tokenization Function
   1528 char file_name[1024];
   1529 @{Warning/Error Functions}
   1530 @{Skip Whitespace}
   1531 @{Tokenize Identifier}
   1532 @{Tokenize Number}
   1533 @{Tokenize String}
   1534 @{Tokenize Character}
   1535 @{Tokenize Operator}
   1536 token_t *next_token(void) {
   1537   if (left_stack_pos > 0) {
   1538     return left_stack[--left_stack_pos];
   1539   }
   1540   token_t *tok = skip_whitespace();
   1541   if (tok != NULL) {
   1542     return tok;
   1543   }
   1544   tok = read_identifier();
   1545   if (tok != NULL) {
   1546     return tok;
   1547   }
   1548   tok = read_number();
   1549   if (tok != NULL) {
   1550     return tok;
   1551   }
   1552   tok = read_char_constant();
   1553   if (tok != NULL) {
   1554     return tok;
   1555   }
   1556   tok = read_string_literal();
   1557   if (tok != NULL) {
   1558     return tok;
   1559   }
   1560   tok = read_operator();
   1561   if (tok != NULL) {
   1562     return tok;
   1563   }
   1564   int c = input_getc();
   1565   if (c == EOF) {
   1566     return NULL;
   1567   }
   1568   tok_warn(
   1569       "Warning: Ignoring unexpected character '%c' at line %d, column %d\n", c,
   1570       line, column);
   1571   return next_token();
   1572 }
   1573 
   1574 #ifdef TEST_TOKENIZER
   1575 @{Run Test}
   1576 #endif
   1577 ---
   1578 
   1579 @s
   1580 We'll need a couple of helper functions to skip whitespace and print warnings/errors.
   1581 --- Warning/Error Functions
   1582 void tok_error(const char *fmt, ...) {
   1583   va_list args;
   1584   va_start(args, fmt);
   1585   fprintf(stderr, "Error in file %s at line %d, column %d: ", file_name, line,
   1586           column);
   1587   vfprintf(stderr, fmt, args);
   1588   va_end(args);
   1589 }
   1590 
   1591 void tok_warn(const char *fmt, ...) {
   1592   va_list args;
   1593   va_start(args, fmt);
   1594   fprintf(stderr, "Warning in file %s at line %d, column %d: ", file_name, line,
   1595           column);
   1596   vfprintf(stderr, fmt, args);
   1597   va_end(args);
   1598 }
   1599 ---
   1600 
   1601 @s
   1602 The `skip_whitespace` function is pretty simple. It just skips over any comments, whitespace, and line directives.
   1603 --- Skip Whitespace
   1604 static token_t *skip_whitespace(void) {
   1605   int c;
   1606   while ((c = input_getc()) != EOF) {
   1607     if (isspace(c)) { // Whitespace
   1608       if (c == '\n') {
   1609         line++;
   1610         column = 1;
   1611       } else {
   1612         column++;
   1613       }
   1614     } else if (c == '#') // GCC preprocessor line control directive.
   1615     {
   1616       char buf[512];
   1617       int i = 0;
   1618       while ((c = input_getc()) != EOF && c != '\n') {
   1619         buf[i++] = c;
   1620         column++;
   1621       }
   1622       buf[i] = '\0';
   1623       if (sscanf(buf, "%d \"%[^\"]\"", &line, file_name) == 2) {
   1624         column = 1;
   1625       } else {
   1626         tok_error("Invalid #line directive\n");
   1627       }
   1628       if (c == EOF) {
   1629         return NULL;
   1630       }
   1631     } else if (c == '/') { // Comment
   1632       c = input_getc();
   1633       if (c == '/') {
   1634         while ((c = input_getc()) != EOF && c != '\n') {
   1635           column++;
   1636         }
   1637         if (c == EOF) {
   1638           return NULL;
   1639         }
   1640         line++;
   1641         column = 1;
   1642       } else if (c == '*') { // Multiline comment
   1643         while ((c = input_getc()) != EOF) {
   1644           if (c == '*') {
   1645             c = input_getc();
   1646             if (c == '/') {
   1647               break;
   1648             }
   1649           } else if (c == '\n') {
   1650             line++;
   1651             column = 1;
   1652           } else {
   1653             column++;
   1654           }
   1655         }
   1656         if (c == EOF) {
   1657           return NULL;
   1658         }
   1659       } else { // Handled here to simplify the code.
   1660         if (c == '=')
   1661           return token_create(TOK_ASSIGN_DIV, line, column, 2);
   1662         input_ungetc(c);
   1663         return token_create(TOK_DIV, line, column, 1);
   1664       }
   1665     } else {
   1666       input_ungetc(c);
   1667       return NULL;
   1668     }
   1669   }
   1670   return NULL;
   1671 }
   1672 ---
   1673 
   1674 @s
   1675 The `read_identifier` function reads an identifier from the input stream. C identifiers can contain letters, digits, and underscores, but they can't start with a digit.
   1676 --- Tokenize Identifier
   1677 @{Get Keyword}
   1678 static token_t *read_identifier(void) {
   1679   int c;
   1680   char buf[1024];
   1681   int i = 0;
   1682   c = input_getc();
   1683   if (!isalpha(c) && c != '_') {
   1684     input_ungetc(c);
   1685     return NULL;
   1686   }
   1687   buf[i++] = c;
   1688   while ((c = input_getc()) != EOF) {
   1689     if (!isalnum(c) && c != '_') {
   1690       input_ungetc(c);
   1691       break;
   1692     }
   1693     buf[i++] = c;
   1694     if (i >= 1008) {
   1695       tok_error("Identifier too long\n");
   1696       exit(1);
   1697     }
   1698   }
   1699   buf[i] = '\0';
   1700   column += i;
   1701   // Check if it's a keyword
   1702   c_token_types kind = get_keyword(buf, i);
   1703   if (kind != TOK_ID) {
   1704     return token_create(kind, line, column, i);
   1705   }
   1706   // Check if it's a typedef
   1707   if (hash_table_get(typedef_table, buf) != NULL) {
   1708     return token_create(TOK_TYPEDEF_NAME, line, column, i);
   1709   }
   1710   return token_create_string(kind, line, column, buf, i);
   1711 }
   1712 ---
   1713 
   1714 @s
   1715 The `get_keyword` function is a simple decision tree for identifying keywords. The code is pretty tedious, but it works.
   1716 --- Get Keyword
   1717 c_token_types get_keyword(const char *buf, int len) {
   1718   switch (buf[0]) {
   1719   case 'a':
   1720     if (len == 4 && buf[1] == 'u' && buf[2] == 't' && buf[3] == 'o')
   1721       return TOK_AUTO;
   1722     break;
   1723 
   1724   case 'b':
   1725     if (len == 5 && buf[1] == 'r' && buf[2] == 'e' && buf[3] == 'a' &&
   1726         buf[4] == 'k')
   1727       return TOK_BREAK;
   1728     break;
   1729 
   1730   case 'c':
   1731     switch (buf[1]) {
   1732     case 'a':
   1733       if (len == 4 && buf[2] == 's' && buf[3] == 'e')
   1734         return TOK_CASE;
   1735       break;
   1736     case 'h':
   1737       if (len == 4 && buf[2] == 'a' && buf[3] == 'r')
   1738         return TOK_CHAR;
   1739       break;
   1740     case 'o':
   1741       if (len == 5 && buf[2] == 'n' && buf[3] == 's' && buf[4] == 't')
   1742         return TOK_CONST;
   1743       if (len == 8 && buf[2] == 'n' && buf[3] == 't' && buf[4] == 'i' &&
   1744           buf[5] == 'n' && buf[6] == 'u' && buf[7] == 'e')
   1745         return TOK_CONTINUE;
   1746       break;
   1747     }
   1748     break;
   1749 
   1750   case 'd':
   1751     switch (buf[1]) {
   1752     case 'e':
   1753       if (len == 7 && buf[2] == 'f' && buf[3] == 'a' && buf[4] == 'u' &&
   1754           buf[5] == 'l' && buf[6] == 't')
   1755         return TOK_DEFAULT;
   1756       break;
   1757     case 'o':
   1758       if (len == 2 && buf[2] == '\0')
   1759         return TOK_DO;
   1760       if (len == 6 && buf[2] == 'u' && buf[3] == 'b' && buf[4] == 'l' &&
   1761           buf[5] == 'e')
   1762         return TOK_DOUBLE;
   1763       break;
   1764     }
   1765     break;
   1766 
   1767   case 'e':
   1768     switch (buf[1]) {
   1769     case 'l':
   1770       if (len == 4 && buf[2] == 's' && buf[3] == 'e')
   1771         return TOK_ELSE;
   1772       break;
   1773     case 'n':
   1774       if (len == 4 && buf[2] == 'u' && buf[3] == 'm')
   1775         return TOK_ENUM;
   1776       break;
   1777     case 'x':
   1778       if (len == 6 && buf[2] == 't' && buf[3] == 'e' && buf[4] == 'r' &&
   1779           buf[5] == 'n')
   1780         return TOK_EXTERN;
   1781       break;
   1782     }
   1783     break;
   1784 
   1785   case 'f':
   1786     switch (buf[1]) {
   1787     case 'l':
   1788       if (len == 5 && buf[2] == 'o' && buf[3] == 'a' && buf[4] == 't')
   1789         return TOK_FLOAT;
   1790       break;
   1791     case 'o':
   1792       if (len == 3 && buf[2] == 'r')
   1793         return TOK_FOR;
   1794       break;
   1795     }
   1796     break;
   1797 
   1798   case 'g':
   1799     if (len == 4 && buf[1] == 'o' && buf[2] == 't' && buf[3] == 'o')
   1800       return TOK_GOTO;
   1801     break;
   1802 
   1803   case 'i':
   1804     switch (buf[1]) {
   1805     case 'f':
   1806       if (len == 2 && buf[2] == '\0')
   1807         return TOK_IF;
   1808       break;
   1809     case 'n':
   1810       if (len == 3 && buf[2] == 't')
   1811         return TOK_INT;
   1812       break;
   1813     }
   1814     break;
   1815 
   1816   case 'l':
   1817     if (len == 4 && buf[1] == 'o' && buf[2] == 'n' && buf[3] == 'g')
   1818       return TOK_LONG;
   1819     break;
   1820 
   1821   case 'r':
   1822     switch (buf[1]) {
   1823     case 'e':
   1824       if (len == 8 && buf[2] == 'g' && buf[3] == 'i' && buf[4] == 's' &&
   1825           buf[5] == 't' && buf[6] == 'e' && buf[7] == 'r')
   1826         return TOK_REGISTER;
   1827       if (len == 6 && buf[2] == 't' && buf[3] == 'u' && buf[4] == 'r' &&
   1828           buf[5] == 'n')
   1829         return TOK_RETURN;
   1830       break;
   1831     }
   1832     break;
   1833 
   1834   case 's':
   1835     switch (buf[1]) {
   1836     case 'h':
   1837       if (len == 5 && buf[2] == 'o' && buf[3] == 'r' && buf[4] == 't')
   1838         return TOK_SHORT;
   1839       break;
   1840     case 't':
   1841       if (len == 6 && buf[2] == 'a' && buf[3] == 't' && buf[4] == 'i' &&
   1842           buf[5] == 'c')
   1843         return TOK_STATIC;
   1844       break;
   1845     case 'i':
   1846       if (len == 6 && buf[2] == 'g' && buf[3] == 'n' && buf[4] == 'e' &&
   1847           buf[5] == 'd')
   1848         return TOK_SIGNED;
   1849       if (len == 6 && buf[2] == 'z' && buf[3] == 'e' && buf[4] == 'o' &&
   1850           buf[5] == 'f')
   1851         return TOK_SIZEOF;
   1852       break;
   1853     case 'r':
   1854       if (len == 6 && buf[2] == 'u' && buf[3] == 'c' && buf[4] == 't')
   1855         return TOK_STRUCT;
   1856       break;
   1857     case 'w':
   1858       if (len == 6 && buf[2] == 'i' && buf[3] == 't' && buf[4] == 'c' &&
   1859           buf[5] == 'h')
   1860         return TOK_SWITCH;
   1861       break;
   1862     }
   1863     break;
   1864 
   1865   case 't':
   1866     if (len == 7 && buf[1] == 'y' && buf[2] == 'p' && buf[3] == 'e' &&
   1867         buf[4] == 'd' && buf[5] == 'e' && buf[6] == 'f')
   1868       return TOK_TYPEDEF;
   1869     break;
   1870 
   1871   case 'u':
   1872     switch (buf[1]) {
   1873     case 'n':
   1874       if (len == 5 && buf[2] == 'i' && buf[3] == 'o' && buf[4] == 'n')
   1875         return TOK_UNION;
   1876       if (len == 8 && buf[2] == 's' && buf[3] == 'i' && buf[4] == 'g' &&
   1877           buf[5] == 'n' && buf[6] == 'e' && buf[7] == 'd')
   1878         return TOK_UNSIGNED;
   1879       break;
   1880     }
   1881     break;
   1882 
   1883   case 'v':
   1884     switch (buf[1]) {
   1885     case 'o':
   1886       if (len == 4 && buf[2] == 'i' && buf[3] == 'd')
   1887         return TOK_VOID;
   1888       if (len == 8 && buf[2] == 'l' && buf[3] == 'a' && buf[4] == 't' &&
   1889           buf[5] == 'i' && buf[6] == 'l' && buf[7] == 'e')
   1890         return TOK_VOLATILE;
   1891       break;
   1892     }
   1893     break;
   1894 
   1895   case 'w':
   1896     if (len == 5 && buf[1] == 'h' && buf[2] == 'i' && buf[3] == 'l' &&
   1897         buf[4] == 'e')
   1898       return TOK_WHILE;
   1899     break;
   1900 
   1901   default:
   1902     return TOK_ID;
   1903   }
   1904   return TOK_ID;
   1905 }
   1906 ---
   1907 
   1908 @s
   1909 The `read_operator` function works similarly to the `read_identifier` function. It uses a decision tree to identify operators. 
   1910 --- Tokenize Operator
   1911 
   1912 token_t *read_operator(void) {
   1913   int c;
   1914   c = input_getc();
   1915   switch (c) {
   1916   case '!': {
   1917     c = input_getc();
   1918     if (c == '=')
   1919       return token_create(TOK_NE, line, column, 2);
   1920     input_ungetc(c);
   1921     return token_create(TOK_NOT, line, column, 1);
   1922   }
   1923   case '%': {
   1924     c = input_getc();
   1925     if (c == '=')
   1926       return token_create(TOK_ASSIGN_MOD, line, column, 2);
   1927     input_ungetc(c);
   1928     return token_create(TOK_MOD, line, column, 1);
   1929   }
   1930   case '&': {
   1931     c = input_getc();
   1932     if (c == '&')
   1933       return token_create(TOK_AND, line, column, 2);
   1934     if (c == '=')
   1935       return token_create(TOK_ASSIGN_BITAND, line, column, 2);
   1936     input_ungetc(c);
   1937     return token_create(TOK_BIT_AND, line, column, 1);
   1938   }
   1939   case '(':
   1940     return token_create(TOK_LEFT_PAREN, line, column, 1);
   1941   case ')':
   1942     return token_create(TOK_RIGHT_PAREN, line, column, 1);
   1943   case '*': {
   1944     c = input_getc();
   1945     if (c == '=')
   1946       return token_create(TOK_ASSIGN_MUL, line, column, 2);
   1947     input_ungetc(c);
   1948     return token_create(TOK_MUL, line, column, 1);
   1949   }
   1950   case '+': {
   1951     c = input_getc();
   1952     if (c == '+')
   1953       return token_create(TOK_INC, line, column, 2);
   1954     if (c == '=')
   1955       return token_create(TOK_ASSIGN_ADD, line, column, 2);
   1956     input_ungetc(c);
   1957     return token_create(TOK_ADD, line, column, 2);
   1958   }
   1959   case ',':
   1960     return token_create(TOK_COMMA, line, column, 1);
   1961   case '-': {
   1962     c = input_getc();
   1963     if (c == '-')
   1964       return token_create(TOK_DEC, line, column, 2);
   1965     if (c == '=')
   1966       return token_create(TOK_ASSIGN_SUB, line, column, 2);
   1967     if (c == '>')
   1968       return token_create(TOK_MEMBER_POINTER, line, column, 2);
   1969     input_ungetc(c);
   1970     return token_create(TOK_SUB, line, column, 1);
   1971   }
   1972   case '.': {
   1973     c = input_getc();
   1974     if (c == '.') {
   1975       c = input_getc();
   1976       if (c == '.') {
   1977         return token_create(TOK_ELLIPSIS, line, column, 3);
   1978       } else {
   1979         // Bail out, can't store more than one unget
   1980         tok_error("Unexpected character '.' at line %d, column %d\n", line,
   1981                   column);
   1982         exit(1);
   1983       }
   1984     }
   1985     return token_create('.', line, column, 1);
   1986   }
   1987   case '/': {
   1988     c = input_getc();
   1989     if (c == '=')
   1990       return token_create(TOK_ASSIGN_DIV, line, column, 2);
   1991     input_ungetc(c);
   1992     return token_create(TOK_DIV, line, column, 1);
   1993   }
   1994   case ':':
   1995     return token_create(TOK_COND_DECISION, line, column, 1);
   1996   case ';':
   1997     return token_create(TOK_SEMICOLON, line, column, 1);
   1998   case '<': {
   1999     c = input_getc();
   2000     if (c == '<') {
   2001       c = input_getc();
   2002       if (c == '=')
   2003         return token_create(TOK_ASSIGN_LSHIFT, line, column, 3);
   2004       input_ungetc(c);
   2005       return token_create(TOK_LSHIFT, line, column, 2);
   2006     }
   2007     if (c == '=')
   2008       return token_create(TOK_LE, line, column, 2);
   2009     input_ungetc(c);
   2010     return token_create(TOK_LT, line, column, 1);
   2011   }
   2012   case '=': {
   2013     c = input_getc();
   2014     if (c == '=')
   2015       return token_create(TOK_ASSIGN, line, column, 2);
   2016     input_ungetc(c);
   2017     return token_create(TOK_ASSIGN, line, column, 1);
   2018   }
   2019   case '>': {
   2020     c = input_getc();
   2021     if (c == '>') {
   2022       c = input_getc();
   2023       if (c == '=')
   2024         return token_create(TOK_ASSIGN_RSHIFT, line, column, 3);
   2025       input_ungetc(c);
   2026       return token_create(TOK_RSHIFT, line, column, 2);
   2027     }
   2028     if (c == '=')
   2029       return token_create(TOK_GE, line, column, 2);
   2030     input_ungetc(c);
   2031     return token_create(TOK_GT, line, column, 1);
   2032   }
   2033   case '?':
   2034     return token_create(TOK_COND, line, column, 1);
   2035   case '[':
   2036     return token_create(TOK_LEFT_BRACKET, line, column, 1);
   2037   case ']':
   2038     return token_create(TOK_RIGHT_BRACKET, line, column, 1);
   2039   case '^': {
   2040     c = input_getc();
   2041     if (c == '=')
   2042       return token_create(TOK_ASSIGN_BITXOR, line, column, 2);
   2043     input_ungetc(c);
   2044     return token_create(TOK_BIT_XOR, line, column, 1);
   2045   }
   2046   case '{':
   2047     return token_create(TOK_LEFT_BRACE, line, column, 1);
   2048   case '|': {
   2049     c = input_getc();
   2050     if (c == '|')
   2051       return token_create(TOK_OR, line, column, 2);
   2052     if (c == '=')
   2053       return token_create(TOK_ASSIGN_BITOR, line, column, 2);
   2054     input_ungetc(c);
   2055     return token_create(TOK_BIT_OR, line, column, 1);
   2056   }
   2057   case '}':
   2058     return token_create(TOK_RIGHT_BRACE, line, column, 1);
   2059   case '~':
   2060     return token_create(TOK_BIT_NOT, line, column, 1);
   2061   default:
   2062     input_ungetc(c);
   2063     return NULL;
   2064   }
   2065 
   2066   return NULL;
   2067 }
   2068 ---
   2069 
   2070 @s
   2071 The `read_number` function reads a number from the input stream. It can be an integer or a floating-point number.
   2072 
   2073 I've broken it up a bit to make it easier to read. 
   2074 
   2075 --- Tokenize Number
   2076 static token_t *read_number(void) {
   2077   int c;
   2078   char buf[1024];
   2079   int i = 0;
   2080   c = input_getc();
   2081   @{Check for valid prefix}
   2082   int radix = 10;
   2083   @{Process Radix}
   2084   int is_float = 0;
   2085   @{Read Number Loop}
   2086   buf[i] = '\0';
   2087   @{Process Suffixes}
   2088   @{Check for conflicting suffixes}
   2089   if (is_float) {
   2090     @{Convert to float}
   2091   } else {
   2092     @{Convert to integer}
   2093   }
   2094   return NULL;
   2095 }
   2096 ---
   2097 
   2098 @s
   2099 To determine if a character is a valid prefix for a number, we need to check if it's a digit or a period followed by a digit
   2100 --- Check for valid prefix
   2101  // If we don't have a digit or decimal point, it's not a number
   2102   if (!isdigit(c) && c != '.') {
   2103     input_ungetc(c);
   2104     return NULL;
   2105   }
   2106   // Decimal point followed by non-digit is a struct member
   2107   if (c == '.') {
   2108     char cnext = input_getc();
   2109     if (!isdigit(cnext)) {
   2110       input_ungetc(cnext);
   2111       return token_create(TOK_MEMBER, line, column, 1);
   2112     }
   2113     input_ungetc(cnext);
   2114   }
   2115 ---
   2116 
   2117 @s
   2118 A C constant starting with a zero is either an octal or hexadecimal constant. We need to check the next character to determine which one it is.
   2119 --- Process Radix
   2120   // Check for hex and octal.
   2121   if (c == '0') {
   2122     char cnext = input_getc();
   2123     if (cnext == 'x' || cnext == 'X') {
   2124       // Hex, discard the 0x
   2125       radix = 16;
   2126     } else {
   2127       // Octal, discard the 0
   2128       input_ungetc(cnext);
   2129       radix = 8;
   2130     }
   2131   } else {
   2132     // Decimal, append the first digit
   2133     buf[i++] = c;
   2134   }
   2135 ---
   2136 
   2137 @s
   2138 
   2139 --- Read Number Loop
   2140    while ((c = input_getc()) != EOF) {
   2141     // Since there can be multiple writes to the buffer, we want to make sure we
   2142     // don't overflow by giving a 4 byte pad
   2143     if (i > 1020) {
   2144       tok_error("Number too long\n");
   2145       return NULL;
   2146     }
   2147     // Valid digits for the radix: 0-9 for decimal, 0-7 for octal, 0-9 and
   2148     // a-f/A-F for hex
   2149     if ((radix == 10 && isdigit(c)) || (radix == 16 && isxdigit(c)) ||
   2150         (radix == 8 && c >= '0' && c <= '7')) {
   2151       buf[i++] = c;
   2152       // Decimal point and not a float yet, must be a float
   2153     } else if (c == '.' && !is_float) {
   2154       is_float = 1;
   2155       if (radix != 10) {
   2156         tok_error("Invalid floating point constant, expected decimal, got %s\n",
   2157                   radix == 16 ? "hexadecimal" : "octal");
   2158         return NULL;
   2159       }
   2160       buf[i++] = c;
   2161     }
   2162     // Exponent on the end of a constant. (By standard this forces it to be a
   2163     // float)
   2164     else if (c == 'e' || c == 'E') {
   2165       buf[i++] = c;
   2166       c = input_getc();
   2167       // Sign on the exponent
   2168       if (c == '+' || c == '-') {
   2169         buf[i++] = c;
   2170         c = input_getc();
   2171       }
   2172       // Exponent must be a digit, I.E no 1e1.2
   2173       if (!isdigit(c)) {
   2174         tok_error("Invalid floating point exponent\n");
   2175         return NULL;
   2176       }
   2177       buf[i++] = c;
   2178       is_float = 1;
   2179     } else {
   2180       // Reached the end, unget the character so other functions can read it
   2181       input_ungetc(c);
   2182       break;
   2183     }
   2184   }
   2185 ---
   2186 
   2187 @s 
   2188 C constants can have suffixes to indicate their type. We need to check for these suffixes and set the appropriate flags.
   2189 
   2190 --- Process Suffixes
   2191   int is_unsigned = 0;
   2192   int is_long = 0;
   2193   int is_single = 0;
   2194   while (1) {
   2195     c = input_getc();
   2196     if (c == 'u' || c == 'U') {
   2197       if (is_unsigned) {
   2198         tok_warn(
   2199             "Warning: Duplicate suffix 'u' for integer constant ignored\n");
   2200       }
   2201       is_unsigned = 1;
   2202     } else if (c == 'l' || c == 'L') {
   2203       if (is_long) {
   2204         tok_warn(
   2205             "Warning: Duplicate suffix 'l' for integer constant ignored\n");
   2206       }
   2207       is_long = 1;
   2208     } else if (c == 'f' || c == 'F') {
   2209       if (is_single) {
   2210         tok_warn("Warning: Duplicate suffix 'f' for floating point constant "
   2211                  "ignored\n");
   2212       }
   2213       is_single = 1;
   2214     } else {
   2215       input_ungetc(c);
   2216       break;
   2217     }
   2218   }
   2219 ---
   2220 
   2221 @s
   2222 If we find conflicting suffixes, we print a warning and ignore the suffixes.
   2223 --- Check for conflicting suffixes
   2224     if (is_single && is_long) {
   2225     tok_warn("Warning: Invalid suffixes 'l' and 'f' for floating point "
   2226              "constant. Ignoring 'l'\n");
   2227     is_long = 0;
   2228   }
   2229   if (is_single && is_unsigned) {
   2230     tok_warn("Warning: Invalid suffixes 'u' and 'f' for floating point "
   2231              "constant. Ignoring 'u'\n");
   2232     is_unsigned = 0;
   2233   }
   2234   if (is_single && !is_float) {
   2235     tok_warn(
   2236         "Warning: Invalid suffix 'f' for integer constant. Ignoring 'f'\n");
   2237     is_single = 0;
   2238   }
   2239 ---
   2240 
   2241 @s
   2242 If the string contains a float, we pass it to strtod. We need to make sure that the number is in range for the given type and check for errors from strtod
   2243 
   2244 --- Convert to float
   2245     errno = 0;
   2246     // Strtod generates a unix-style error when it's given something out of
   2247     // range, so we want to get on top of that quickly instead of ignoring it
   2248     // That way we can avoid some nasty NAN-propagation in the constant folder.
   2249     double f = strtod(buf, NULL);
   2250     if (errno == ERANGE) {
   2251       tok_error("Floating point constant out of range\n");
   2252       return NULL;
   2253     }
   2254     // Warn if the constant is out of range for a float, I.E it's too big or too
   2255     // small
   2256     if (is_single && (f < FLT_MIN || f > FLT_MAX)) {
   2257       tok_warn(
   2258           "Warning: Floating point constant %f is out of range for float\n", f);
   2259     }
   2260     // Warn if the constant is too precise for a float
   2261     if (is_single && fabs((double)((float)f) - f) >= FLT_EPSILON) {
   2262       tok_warn("Warning: Converting double precision floating point constant "
   2263                "%f to float loses "
   2264                "precision\n",
   2265                f);
   2266     }
   2267     return token_create_float(is_single ? TOK_FLOAT_32
   2268                                         : TOK_FLOAT_64,
   2269                               line, column, f, i);
   2270 ---
   2271 
   2272 @s
   2273 If the string contains a number, we pass it to stroll. We need to make sure that the number is in range for the given type and check for errors from strtoll
   2274 
   2275 --- Convert to integer
   2276     errno = 0;
   2277     uint64_t int_ = strtoull(buf, NULL, radix);
   2278     // Same as above, but for integers
   2279     if (errno == ERANGE) {
   2280       tok_error("Integer constant out of range\n");
   2281       return NULL;
   2282     }
   2283     if (is_unsigned) {
   2284       if (is_long) {
   2285         return token_create_int(TOK_INTEGER_U64, line, column, int_, i);
   2286       } else {
   2287         if (int_ > UINT32_MAX) {
   2288           tok_warn(
   2289               "Warning: Integer constant %lld is out of range for unsigned "
   2290               "int\n",
   2291               int_);
   2292         }
   2293         return token_create_int(TOK_INTEGER_U32, line, column, int_, i);
   2294       }
   2295     } else {
   2296       if (is_long) {
   2297         // If the highest bit is set, that means this will overflow a signed
   2298         // long (Due to two's complement)
   2299         if (int_ & (1UL << 63)) {
   2300           tok_warn(
   2301               "Warning: Integer constant %lld is out of range for long long\n",
   2302               i);
   2303         }
   2304         return token_create_int(TOK_INTEGER_S64, line, column, int_, i);
   2305       } else {
   2306         if (int_ & (1UL << 31)) {
   2307           tok_warn("Warning: Integer constant %lld is out of range for int\n",
   2308                    int_);
   2309         }
   2310         return token_create_int(TOK_INTEGER_S32, line, column, int_, i);
   2311       }
   2312     }
   2313 ---
   2314 
   2315 @s
   2316 The `read_char_constant` function reads a character constant from the input stream. It can be a single character or a multi-character escape sequence.
   2317 --- Tokenize Character
   2318 static token_t *read_char_constant(void) {
   2319   int c;
   2320   int len = 0;
   2321   c = input_getc();
   2322   if (c != '\'') {
   2323     input_ungetc(c);
   2324     return NULL;
   2325   }
   2326   len++;
   2327   c = input_getc();
   2328   if (c == '\'') {
   2329     tok_error("Empty character constant\n");
   2330     return NULL;
   2331   }
   2332   if (c == '\\') {
   2333     c = read_escape_sequence(&len);
   2334   }
   2335   int val = c;
   2336   c = input_getc();
   2337   if (c != '\'') {
   2338     tok_error("Expected closing quote for character constant\n");
   2339     return NULL;
   2340   }
   2341   len++;
   2342   return token_create_char(TOK_CHAR_CONST, line, column, val, len);
   2343 }
   2344 ---
   2345 
   2346 @s
   2347 The `read_string_literal` function reads a string literal from the input stream.
   2348 
   2349 For this function, an automatic-lifetime buffer is used to store the string it becomes too large. At that point, a heap-allocated buffer is used.
   2350 This way we can avoid unnecessary heap allocations for small strings.
   2351 --- Tokenize String
   2352 @{Read Escape Sequence}
   2353 static token_t *read_string_literal(void) {
   2354   int c;
   2355   c = input_getc();
   2356   if (c != '"') {
   2357     input_ungetc(c);
   2358     return NULL;
   2359   }
   2360   int i = 0;
   2361   char s_buf[512];
   2362   char *buf = s_buf;
   2363   int len = 512;
   2364   int esc_pad = 0;
   2365   while ((c = input_getc()) != EOF) {
   2366     if (c == '"') {
   2367       // Implicit skip of closing quote
   2368       break;
   2369     }
   2370     if (c == '\\') {
   2371       c = read_escape_sequence(&esc_pad);
   2372       if (c == 0) {
   2373         return NULL;
   2374       }
   2375     }
   2376     if (i >= len) {
   2377       if (buf == s_buf) {
   2378         buf = malloc(1024);
   2379         if (buf == NULL) {
   2380           fputs("Out of memory. Could not parse string literal.\n", stderr);
   2381           exit(1);
   2382         }
   2383         memcpy(buf, s_buf, 512);
   2384         len *= 2;
   2385       } else {
   2386         len *= 2;
   2387         buf = realloc(buf, len);
   2388       }
   2389     }
   2390     buf[i++] = c;
   2391   }
   2392   buf[i] = '\0';
   2393   if (c == EOF) {
   2394     tok_error("Unterminated string literal\n");
   2395     if (buf != s_buf) {
   2396       free(buf);
   2397     }
   2398     return NULL;
   2399   }
   2400 
   2401   token_t *tok = token_create_string(TOK_STRING_ASCII, line, column, buf,
   2402                                      i + esc_pad + 2);
   2403   if (buf != s_buf) {
   2404     free(buf);
   2405   }
   2406   return tok;
   2407 }
   2408 ---
   2409 
   2410 @s
   2411 Escape sequences in C can either be single characters or octal/hexadecimal values. We need to handle both cases.
   2412 --- Read Escape Sequence
   2413 static char read_escape_sequence(int *len) {
   2414   int c = input_getc();
   2415   *len += 1;
   2416   switch (c) {
   2417   case 'a':
   2418     return '\a';
   2419   case 'b':
   2420     return '\b';
   2421   case 'f':
   2422     return '\f';
   2423   case 'n':
   2424     return '\n';
   2425   case 'r':
   2426     return '\r';
   2427   case 't':
   2428     return '\t';
   2429   case 'v':
   2430     return '\v';
   2431   case '\'':
   2432     return '\'';
   2433   case '"':
   2434     return '"';
   2435   case '?':
   2436     return '?';
   2437   case '\\':
   2438     return '\\';
   2439   case '0':
   2440     return '\0';
   2441   case 'x': {
   2442     c = input_getc();
   2443     if (!isxdigit(c)) {
   2444       tok_error("Invalid hexadecimal escape sequence\n");
   2445       return 0;
   2446     }
   2447     int val = 0;
   2448     while (isxdigit(c)) {
   2449       *len += 1;
   2450       val = val * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
   2451       c = input_getc();
   2452     }
   2453     input_ungetc(c);
   2454     return (char)val;
   2455   }
   2456   default:
   2457     if (!isdigit(c)) {
   2458       tok_error("Invalid escape sequence\n");
   2459       return 0;
   2460     }
   2461     int val = 0;
   2462     while (isdigit(c)) {
   2463       *len += 1;
   2464       val = val * 8 + c - '0';
   2465       c = input_getc();
   2466     }
   2467     input_ungetc(c);
   2468     return (char)val;
   2469   }
   2470 }
   2471 ---
   2472 
   2473 @s
   2474 Finally, I'll add some code for running the tokenizer as its own program. This way we can test it out.
   2475 --- Run Test
   2476 char *preprocess(char *in) {
   2477   char *output_name = malloc(1024);
   2478   snprintf(output_name, 1024, "%s.preprocessed", in);
   2479   char *command = malloc(2048);
   2480   snprintf(command, 2048, "gcc -E -xc %s -o %s", in, output_name);
   2481   system(command);
   2482   free(command);
   2483   return output_name;
   2484 }
   2485 
   2486 // Tokenize the input file
   2487 int main(int argc, char **argv) {
   2488   if (argc != 2) {
   2489     fprintf(stderr, "Usage: %s <input.c>\n", argv[0]);
   2490     return 1;
   2491   }
   2492   char *input_name = argv[1];
   2493   char *preprocessed = preprocess(input_name);
   2494   init_tokenizer(preprocessed);
   2495   token_t *tok;
   2496   while ((tok = next_token()) != NULL) {
   2497     print_token(tok);
   2498     token_destroy(tok);
   2499   }
   2500   destroy_tokenizer();
   2501   remove(preprocessed);
   2502   free(preprocessed);
   2503   hash_table_destroy(string_table);
   2504   return 0;
   2505 }
   2506 ---
   2507 
   2508 ### Bugs/Errata
   2509 
   2510 I wrote this code in a single sitting, so there are bound to be bugs. I'll list them here as I find them. The code you see here is the final version, with all bugs fixed.
   2511 
   2512 *   had `buffer_pos == buffer_size - 1`, left in from trying to plug some code for lookahead in, didn't work out, but I forgot to remove it, causes fallthrough to `buffer_size == 0` check which if true returns EOF, preventing input initialization. Fixed by changing to `buffer_pos == buffer_size`.
   2513 *   assertion `token->kind == TOK_STRING_ASCII` failed in token\_string. Forgot to expand check for identifiers which also use token\_string. Fixed by changing to `token->kind == TOK_STRING_ASCII || token->kind == TOK_ID || token->kind == TOK_TID`.
   2514 *   token\_create\_string - call to `hash_table_get` with freed key. Fixed by moving the call to free after the call to `hash_table_get`.
   2515 *   ibid - Design of hash table and call to `hash_table_get` in token\_create\_string created double free. Fixed by rewriting part of function.
   2516 *   Tokenizer missing code to handle GCC preprocessor line directives. Fixed by adding code to handle them.
   2517 *   Destructor for string literals missing in tokenizer teardown, added it in.
   2518 *   read\_number - check `int_ > INT32_MAX` does not work due to some weird casting. Added explicit cast.
   2519 *   read\_char\_constant - Forgot to handle '\\0'. Added.
   2520 *   skip\_whitespace - When a division operator occurs in code, skip\_whitespace assumes it's a comment. Fixed by adding a check for division operator.
   2521 *   hash\_string - Optimization, not a bug, Dragon Book hash function not very fast due to misprediction. Replaced with ELF hash function.
   2522 *   read\_identifier - strlen gets called 3 times even though we already get the string len by incrementing an array index. Ew. Used i instead of strlen.
   2523 *   read\_identifier - stringized version of keywords stored, not needed. Code added to call token\_create instead of token\_create\_string for keywords.
   2524 *   Everywhere - Checks added for memory allocation failure.
   2525 *   Not a bug. Removed the seperate token type for TID. Will try to handle in the parser.
   2526 
   2527 ### Conclusion
   2528 
   2529 That's it! The C Minus tokenizer is complete. It's hopefully pretty understandable, and given the testing I've put it through, it should be pretty robust.
   2530 
   2531 Next time, we'll start on the parser.
   2532 
   2533 # Source code, biblography
   2534 
   2535 Source code for the tokenizer is available [here](/projects/cminus/code/tokenizer.c), header file is available [here](/projects/cminus/code/tokenizer.h).
   2536 
   2537 Source code for the input module is available [here](/projects/cminus/code/input.c), header file is available [here](/projects/cminus/code/input.h).
   2538 
   2539 Source code for the hash table is available [here](/projects/cminus/code/hash_table.c), header file is available [here](/projects/cminus/code/hash_table.h).
   2540 
   2541 Source code for the token module is available [here](/projects/cminus/code/token.c), header file is available [here](/projects/cminus/code/token.h).
   2542 
   2543 A lot of the logic for this project is from either the Dragon Book, Engineering a Compiler, or LCC: A Retargetable Compiler for ANSI C. Grammars are from The C Reference Manual.
   2544 
   2545 I got the idea for using zero-width arrays for optional content (the struct hack) from hacking on some weird virtio drivers (they seem to love it).
   2546 
   2547 Crafting Interpreters by Bob Nystrom inspired me to do this project, so if you see any similarities, there's probably some unintentional influence there.
   2548 
   2549 The code for the ELF hash function is from the glibc source code.
   2550 
   2551 The idea for decision trees came from LCC.
   2552 
   2553 Literate programming rendered using [literate](https://github.com/zyedidia/Literate).
   2554 
   2555   <footer style=" text-align: center; padding: 20px;">
   2556     <p>© 2024 Reagan Fischer. If for some reason you want to use my AMAZING code (lol), it's available under the MIT
   2557       license <a href="/projects/cminus/code/LICENSE.md">here</a>.</p>
   2558   </footer>
	website Website contents
	git clone git://git.reagancfischer.dev/website.git
	Log \| Files \| Refs