lexer_new.lit (21010B)
1 @code_type c .c 2 @comment_type /* %s */ 3 @compiler lit -t lexer.lit && gcc -Wall -Wextra -Wpedantic -Wstrict-aliasing=3 -Wwrite-strings -Wvla -Wcast-align=strict -Wstrict-prototypes -Wstringop-overflow=4 -Wshadow -fanalyzer tokenizer.c input.c hash_table.c -D TOK_TEST -g -O0 && rm a.out 4 5 @title Lexer 6 @add_css ../style.css 7 @s General Project Structure 8 Since this is the first article, I'll outline the project structure for the C- compiler. 9 10 The project has a series of pretty typical stages: 11 12 1. The lexer. This takes a file as input and emits a series of tokens (Its input is already preprocessed, I outsource that to "gcc -E"). 13 2. The parser. This takes the tokens and builds an abstract syntax tree (AST). 14 3. The symbol table. This exists in a sort of in-between space next to the lexer and parser. It's used to store information about variables and functions. 15 4. The type checker. This is used to ensure that the types of variables and functions are correct. 16 5. The code generator. This takes the AST and generates an intermediate representation (IR). 17 6. The optimizer. This takes the IR and optimizes it. This'll be broken up into a few stages. 18 7. The lowerer. This takes the IR and lowers it to a simpler IR. 19 8. The register allocator. This takes the IR, which has instructions in an infinite number of registers, and assigns them to a finite number of registers. 20 9. The code emitter. This takes the IR and emits RISC-V assembly. 21 22 As far as possible, I'd like to keep each of these stages separate. One benefit of this is that it simplifies memory management greatly. I plan to use an arena allocator for each stage, and by making sure the only thing on the actual heap is the output of the stage, and all temporary data is stored in the arena, I can free all the memory used by a stage by simply freeing the arena. 23 24 @s Some Rules 25 26 Here are some rules (more like guidelines) that I plan to follow for this project; they're mostly just to keep things simple and consistent. 27 28 1. PROGRAM LIKE IT'S 1999 29 30 > 640 KB ought to be enough for anybody. - Bill Gates 31 32 Maybe not that little, But I'm going to try to keep the project as simple as possible, 640 KB probably won't be enough, but I'll still aim for less than 10 MB of memory usage. 33 34 This places a lot of constraints on the project, but I think it's a good exercise in minimalism. 35 36 Some consequences of this are that I'll have to use memory-wise algorithms, be very careful about program structure, and avoid some of the bigger libraries (which will help with making this project self-hosting in the future). 37 38 2. PROGRAM IN C++-- 39 40 I'm not a big fan of C++, but its class system helps prevent a lot of ugly bugs. To that end, I'm going to try and keep data structures out of header files, and only expose functions that operate on those data structures, to create a sort of approximation of a class. This has a few benefits: 41 42 * Quicker compilation. A change to a data structure will only require one file to be recompiled, rather than every file that includes the header. 43 * Less chance of bugs. If a function is the only way to interact with a data structure, then it's much harder to misuse that data structure. 44 * Run time type checking. I can include some sort of tag in the first field of every data structure to ensure that the correct functions are being called. 45 46 3. DON'T GET FANCY 47 48 My goal here isn't to write the fastest interpreter in the world, or the most complete. I just want to make something that works and can be understood by someone else. 49 50 That means I'm going to avoid a lot of the tricks that are used in production interpreters, and focus more on simplicity and readability. 51 52 4. DESIGN FOR DEBUGGING 53 54 This code is going to be peppered with asserts and contain mechanisms to print out the state of the program at any point. 55 56 This might be painful, but it'll make debugging a lot simpler and let users look under the hood. 57 58 5. SMART DATA, STUPID CODE 59 60 A lot of times, the right data structure can replace 50-100 lines of procedural code. I'm going to try and design data structures which make the algorithms as simple as possible. 61 62 For example, instead of writing 50-100 lines of code to hold every keyword in the language, I can just use a simple hash table. 63 64 @s Misc 65 THIS IS A LITERATE PROGRAM! Go to [this link](https://reagancfischer.dev/lexer.lit) to see the file that generated this HTML. 66 67 @s The Lexer 68 69 A lexical analyzer reads source code and produces tokens, which are the smallest unit of meaning in a language. For example, in the C programming language, the tokens are things like keywords (if, else, while, etc.), identifiers (variable names), numbers, and punctuation (braces, semicolons, etc.). 70 71 Given a string like `int main() { return 0; }`, the lexer would produce a series of tokens like `INT`, `IDENTIFIER(main)`, `LPAREN`, `RPAREN`, `LBRACE`, `RETURN`, `INTCONSTANT(0)`, `SEMICOLON`, `RBRACE`. 72 73 @s Design 74 75 I'll break the lexer up into a couple of modules. `token.c` will contain the token data structure and functions to create and destroy tokens. `input.c` will contain the input data structure and functions to read from the input file. `tokenizer.c` will contain the main lexer logic. 76 77 @s Token Interface 78 Tokens are the smallest unit of meaning in a language. They're used by the parser to build an abstract syntax tree (AST). We'll need a couple of things to represent a token: 79 * The type of token. This will be an enum, with values like `TOK_CTK_IF` or `TOK_CONST_INTEGER_U32`. 80 * The value of the token. Some tokens, like keywords, don't have a value. Others, like identifiers or constants, do. 81 * The line and column of the token. This is used for error messages. 82 83 As I mentioned earlier, we're trying to implement a sort of class system in C. For that, we'll need to hide the token implementation details behind an opaque pointer. We could just have a `void` pointer, but that stops us from being able to use compile-time type checking. 84 Instead, we'll use a forward declaration of the token type in the header file, and then define the token type in the implementation file. 85 @s 86 --- Opaque Token Type 87 typedef struct token token_t; 88 --- 89 @s 90 We'll need a couple of functions to create and destroy tokens. 91 --- Token Creation and Destruction 92 token_t *token_data_create(c_token_types kind, int lin, int col, 93 int len); 94 95 token_t *token_create(c_token_types kind, int lin, int col, int len); 96 97 token_t *token_create_int(c_token_types kind, int lin, int col, 98 int64_t i, int len); 99 100 token_t *token_create_float(c_token_types kind, int lin, int col, 101 double f, int len); 102 103 token_t *token_create_char(c_token_types kind, int lin, int col, char c, 104 int len); 105 106 token_t *token_create_string(c_token_types kind, int lin, int col, 107 const char *s, int len); 108 109 void token_destroy(token_t *token); 110 --- 111 @s 112 We'll also need some functions to access the token data. 113 --- Token Interface 114 c_token_types token_type(token_t *token); 115 116 int64_t token_int(token_t *token); 117 118 double token_float(token_t *token); 119 120 const char *token_string(token_t *token); 121 122 char token_char(token_t *token); 123 124 int token_line(token_t *token); 125 126 int token_column(token_t *token); 127 --- 128 @s 129 We'll need some types to represent the different kinds of tokens. 130 --- Token Types 131 typedef enum { 132 // Control Keywords 133 TOK_CTK_IF, 134 TOK_CTK_ELSE, 135 TOK_CTK_SWITCH, 136 TOK_CTK_CASE, 137 TOK_CTK_DEFAULT, 138 TOK_CTK_WHILE, 139 TOK_CTK_DO, 140 TOK_CTK_FOR, 141 TOK_CTK_CONTINUE, 142 TOK_CTK_BREAK, 143 TOK_CTK_RETURN, 144 TOK_CTK_GOTO, 145 146 // Type Keywords 147 TOK_TK_VOID, 148 TOK_TK_CHAR, 149 TOK_TK_SHORT, 150 TOK_TK_INT, 151 TOK_TK_LONG, 152 TOK_TK_FLOAT, 153 TOK_TK_DOUBLE, 154 TOK_TK_SIGNED, 155 TOK_TK_UNSIGNED, 156 TOK_TK_STRUCT, 157 TOK_TK_UNION, 158 TOK_TK_ENUM, 159 TOK_TK_TYPEDEF, 160 161 // Storage Class/Specifier Keywords 162 TOK_SCSK_AUTO, 163 TOK_SCSK_REGISTER, 164 TOK_SCSK_STATIC, 165 TOK_SCSK_EXTERN, 166 TOK_SCSK_CONST, 167 TOK_SCSK_VOLATILE, 168 169 // Misc Keywords 170 TOK_MK_SIZEOF, 171 172 // Operators 173 TOK_OP_ADD, // + 174 TOK_OP_SUB, // - 175 TOK_OP_MUL, // * 176 TOK_OP_DIV, // / 177 TOK_OP_MOD, // % 178 TOK_OP_BIT_AND, // & 179 TOK_OP_BIT_OR, // | 180 TOK_OP_BIT_XOR, // ^ 181 TOK_OP_BIT_NOT, // ~ 182 TOK_OP_LSHIFT, // << 183 TOK_OP_RSHIFT, // >> 184 TOK_OP_NOT, // ! 185 TOK_OP_ASSIGN, // = 186 TOK_OP_LT, // < 187 TOK_OP_GT, // > 188 TOK_OP_INC, // ++ 189 TOK_OP_DEC, // -- 190 TOK_OP_EQ, // == 191 TOK_OP_NE, // != 192 TOK_OP_LE, // <= 193 TOK_OP_GE, // >= 194 TOK_OP_AND, // && 195 TOK_OP_OR, // || 196 TOK_OP_MEMBER_POINTER, // -> 197 TOK_OP_MEMBER, // . 198 TOK_OP_COND_DECISION, // : 199 TOK_OP_COND, // ? 200 TOK_OP_ASSIGN_ADD, // += 201 TOK_OP_ASSIGN_SUB, // -= 202 TOK_OP_ASSIGN_MUL, // *= 203 TOK_OP_ASSIGN_DIV, // /= 204 TOK_OP_ASSIGN_MOD, // %= 205 TOK_OP_ASSIGN_BITAND, // &= 206 TOK_OP_ASSIGN_BITOR, // |= 207 TOK_OP_ASSIGN_BITXOR, // ^= 208 TOK_OP_ASSIGN_LSHIFT, // <<= 209 TOK_OP_ASSIGN_RSHIFT, // >>= 210 211 // Separators 212 TOK_SEP_LEFT_PAREN, // ( 213 TOK_SEP_RIGHT_PAREN, // ) 214 TOK_SEP_LEFT_BRACKET, // [ 215 TOK_SEP_RIGHT_BRACKET, // ] 216 TOK_SEP_LEFT_BRACE, // { 217 TOK_SEP_RIGHT_BRACE, // } 218 TOK_SEP_COMMA, // , 219 TOK_SEP_SEMICOLON, // ; 220 TOK_SEP_DOT, // . 221 TOK_SEP_ELLIPSIS, // ... 222 TOK_SEP_HASH, // # 223 224 // Identifiers 225 TOK_ID, 226 227 // Constants 228 TOK_CONST_INTEGER_U32, // u 229 TOK_CONST_INTEGER_U64, // ul 230 TOK_CONST_INTEGER_S32, // (no suffix) 231 TOK_CONST_INTEGER_S64, // l 232 TOK_CONST_FLOAT_32, // f 233 TOK_CONST_FLOAT_64, // (no suffix) 234 TOK_CONST_CHAR, // 'c' 235 TOK_CONST_STRING_ASCII, // "string" (width of 8 bits) 236 237 // Special 238 TOK_SPECIAL_EOF, 239 TOK_SPECIAL_ERROR, 240 } c_token_types; 241 --- 242 @s 243 We bring this all together in `token.h`. 244 --- token.h 245 #ifndef TOKEN_H 246 #define TOKEN_H 247 #include <stdint.h> // We use this for int64_t 248 @{Token Types} 249 @{Opaque Token Type} 250 @{Token Creation and Destruction} 251 @{Token Interface} 252 #endif 253 --- 254 255 @s Token Implementation 256 Now that we have the interface, we can implement the token data structure. We'll need a couple of things: 257 * The token type. 258 * A way to store extra data. 259 * Implementations of the functions we defined in the interface. 260 261 @s 262 One problem is we haven't defined a way to verify that the token we're getting isn't corrupt. We'll use a tag for that. 263 --- Token Data Structure 264 #define TOK_MAGIC_1 0x544F4B454E544F4Bul // "TOKENTOK" 265 #define TOK_MAGIC_2 0x544F4B544F4B454Eul // "TOKTOKEN" 266 struct token { 267 long magic; 268 int line; 269 int column; 270 short kind; 271 long opt_data[0]; 272 }; 273 274 typedef struct token token_t; 275 276 struct token_data { 277 union { 278 long long i; 279 double f; 280 const char *s; 281 char c; 282 } data; 283 }; 284 285 typedef struct token_data token_data_t; 286 --- 287 You might notice that a zero-length array is used in the token data structure. This is a GCC extension that allows us to allocate memory for the token data structure and the token data in one allocation. This is a bit of a hack, but it's a common pattern in C code. 288 @s 289 To access this extra data, we define a macro 290 --- Token Data Access 291 #define token_data(token) ((struct token_data *)((token)->opt_data)) 292 --- 293 @s 294 Now we can implement the functions we defined in the interface. 295 --- Token Creation and Destruction (Except `token_create_string`) 296 static token_t *token_data_create(c_token_types kind, int lin, int col, int len) { 297 token_t *token = malloc(sizeof(token_t) + sizeof(struct token_data)); 298 if (token == NULL) { 299 fputs("Out of memory\n", stderr); 300 exit(1); 301 } 302 token->magic = TOK_MAGIC_1; 303 token->line = lin; 304 token->column = col; 305 column += len; 306 token->kind = kind; 307 return token; 308 } 309 310 static token_t *token_create(c_token_types kind, int lin, int col, int len) { 311 token_t *token = malloc(sizeof(token_t)); 312 if (token == NULL) { 313 fputs("Out of memory\n", stderr); 314 exit(1); 315 } 316 token->magic = TOK_MAGIC_2; 317 token->line = lin; 318 token->column = col; 319 column += len; 320 token->kind = kind; 321 return token; 322 } 323 324 static token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len) { 325 token_t *token = token_data_create(kind, lin, col, len); 326 token_data(token)->data.i = i; 327 return token; 328 } 329 330 static token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len) { 331 token_t *token = token_data_create(kind, lin, col, len); 332 token_data(token)->data.f = f; 333 return token; 334 } 335 336 static token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len) { 337 token_t *token = token_data_create(kind, lin, col, len); 338 token_data(token)->data.c = c; 339 return token; 340 } 341 342 void token_destroy(token_t *token) { 343 if (token->magic == TOK_MAGIC_1) { 344 free(token); 345 } else if (token->magic == TOK_MAGIC_2) { 346 free(token); 347 if (kind == TOK_CONST_STRING_ASCII) { 348 free(token_data(token)->data.s); 349 } 350 } else { 351 fputs("Corrupt token\n", stderr); 352 exit(1); 353 } 354 } 355 --- 356 @s 357 `token_create_string` can be implemented either the easy way or the right way. Let's try the easy way. 358 --- token_create_string 359 token_t *token_create_string(c_token_types kind, int lin, int col, 360 const char *s, int len) { 361 token_t *token = token_create(kind, lin, col, len); 362 token_data(token)->data.s = strdup(s); 363 return token; 364 } 365 --- 366 @s 367 There's an issue with this approach. `token_create_string` will be called for every identifier and every string in a program. Imagine a large program, say a shell, with a bunch of user input and output. That program will likely have 20-40 calls to `fprintf`, `fscanf`, `strchr`, `strtok`, each. We create a new string for each of those calls. That's a lot of duplicates, and can quickly add up to a lot of memory usage. 368 To fix this, we use a hash table to store the strings. We'll define a hash table in `hash_table.h` and `hash_table.c`. 369 @s Hash Table 370 A hash table is a data structure that maps keys to values. It's commonly used to store information, such as variables and functions in a symbol table. To implement a generic hash table, we'll need several things: 371 * A function to hash the keys. 372 * A function to compare keys. 373 * An opaque type for the hash table. 374 * A function to destroy deleted keys and values. 375 376 Let's start with the interface. 377 378 @s 379 --- Hash Table Opaque Types 380 typedef struct hash_table hash_table_t; 381 typedef int (*hash_table_cmp_fn)(void *key1, void *key2); 382 typedef unsigned int (*hash_table_hash_fn)(void *key); 383 typedef void (*hash_table_dtor)(void *value, int is_key); 384 --- 385 386 @s 387 --- Hash Table Creation and Destruction 388 hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor); 389 void hash_table_destroy(hash_table_t *table); 390 --- 391 392 @s 393 --- Hash Table Access 394 void *hash_table_get(hash_table_t *table, void *key); 395 void hash_table_put(hash_table_t *table, void *key, void *value); 396 void hash_table_remove(hash_table_t *table, void *key); 397 --- 398 399 @s 400 --- hash_table.h 401 #ifndef HASH_TABLE_H 402 #define HASH_TABLE_H 403 @{Hash Table Opaque Types} 404 @{Hash Table Creation and Destruction} 405 @{Hash Table Access} 406 #endif 407 --- 408 409 @s 410 Let's implement the hash table now. 411 412 --- hash_table.c 413 #include <stdlib.h> 414 #include <string.h> 415 #include <stdio.h> 416 #include "hash_table.h" 417 418 @{Hash Table Data Structure} 419 @{Hash Table Entry Data Structure} 420 421 hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor) { 422 @{Allocate and Initialize Hash Table} 423 return table; 424 } 425 426 void hash_table_destroy(hash_table_t *table) { 427 @{Destroy Entries} 428 free(table->entries); 429 free(table); 430 } 431 432 void *hash_table_get(hash_table_t *table, void *key) { 433 @{Get Entry By Hash} 434 @{Loop Through Entries and Return Value if Match} 435 return NULL; 436 } 437 438 void hash_table_put(hash_table_t *table, void *key, void *value) { 439 @{Get Entry By Hash} 440 @{Loop Through Entries and Replace Value if Key Matches} 441 @{Allocate New Entry if No Match} 442 } 443 444 void hash_table_remove(hash_table_t *table, void *key) { 445 @{Get Entry By Hash} 446 @{Loop Through Entries and Remove Entry if Key Matches} 447 } 448 449 #ifdef TEST_HASH_TABLE 450 #include <assert.h> 451 #include <stdio.h> 452 #include <string.h> 453 454 int string_cmp(void *key1, void *key2) { 455 return strcmp((char *)key1, (char *)key2); 456 } 457 458 unsigned long string_hash(void *key) { 459 unsigned long hash = 5381; 460 char *str = (char *)key; 461 while (*str != '\0') { 462 hash = ((hash << 5) + hash) + *str; 463 str++; 464 } 465 return hash; 466 } 467 468 int main() { 469 hash_table_t *table = hash_table_create(16, string_cmp, string_hash, NULL); 470 hash_table_put(table, "foo", "bar"); 471 hash_table_put(table, "foo", "baz"); 472 assert(strcmp((char *)hash_table_get(table, "foo"), "baz") == 0); 473 hash_table_remove(table, "foo"); 474 assert(hash_table_get(table, "foo") == NULL); 475 hash_table_destroy(table); 476 return 0; 477 } 478 #endif 479 --- 480 481 @s 482 For the hash table data structure, we'll define a pointer to an array of entries, the size of the array, and the hash/comparison functions. 483 484 --- Hash Table Data Structure 485 struct hash_table { 486 struct hash_table_entry **entries; 487 int size; 488 hash_table_cmp_fn cmp; 489 hash_table_hash_fn hash; 490 hash_table_dtor dtor; 491 }; 492 --- 493 494 @s 495 Entries in the hash table will have a key, a value, and a link to the next entry in the chain. 496 497 --- Hash Table Entry Data Structure 498 struct hash_table_entry { 499 void *key; 500 void *value; 501 struct hash_table_entry *next; 502 }; 503 --- 504 505 @s 506 Allocating a hash table involves allocating memory for the hash table itself and the entries, zeroing out the entries, and setting the hash and comparison functions. 507 508 --- Allocate and Initialize Hash Table 509 hash_table_t *table = malloc(sizeof(struct hash_table)); 510 if (table == NULL) { 511 fputs("Out of memory, could not allocate hash table\n", stderr); 512 exit(1); 513 } 514 table->entries = calloc(size, sizeof(struct hash_table_entry *)); 515 if (table->entries == NULL) { 516 fputs("Out of memory, could not allocate hash table entries\n", stderr); 517 exit(1); 518 } 519 table->size = size; 520 table->cmp = cmp; 521 table->hash = hash; 522 table->dtor = dtor; 523 --- 524 525 @s 526 To destroy a hash table, we loop through the entries, freeing the keys and values, and then free the entries and the table itself. 527 528 --- Destroy Entries 529 for (int i = 0; i < table->size; i++) { 530 struct hash_table_entry *entry = table->entries[i]; 531 while (entry != NULL) { 532 struct hash_table_entry *next = entry->next; 533 if (table->dtor != NULL) { 534 table->dtor(entry->key, 1); 535 table->dtor(entry->value, 0); 536 } 537 free(entry); 538 entry = next; 539 } 540 } 541 --- 542 543 @s 544 To get an entry from the hash table, we hash the key, loop through the entries, and return the value if we find a match. 545 546 --- Get Entry By Hash 547 unsigned int hash = table->hash(key) % table->size; 548 struct hash_table_entry *entry = table->entries[hash]; 549 --- 550 551 @s 552 To put an entry in the hash table, we hash the key, loop through the entries, and replace the value if we find a match. 553 554 --- Loop Through Entries and Replace Value if Key Matches 555 while (entry != NULL) { 556 if (table->cmp(entry->key, key) == 0) { 557 entry->value = value; 558 return; 559 } 560 entry = entry->next; 561 } 562 --- 563 564 @s 565 If we don't find a match, we allocate a new entry, set the key and value, and insert it at the head of the linked list. 566 This exploits a property in computer science called locality of reference. The gist of that is that when you write to a piece of memory, you're likely to read from it again soon. By putting the new entry at the head of the linked list, we increase the chances that we'll find it quickly next time. 567 568 --- Allocate New Entry if No Match 569 struct hash_table_entry *new_entry = malloc(sizeof(struct hash_table_entry)); 570 if (new_entry == NULL) { 571 fputs("Out of memory, could not allocate hash table entry\n", stderr); 572 exit(1); 573 } 574 new_entry->key = key; 575 new_entry->value = value; 576 new_entry->next = table->entries[hash]; 577 table->entries[hash] = new_entry; 578 --- 579 580 @s 581 To remove an entry from the hash table, we hash the key, loop through the entries, and remove the entry if we find a match. 582 583 --- Loop Through Entries and Remove Entry if Key Matches 584 struct hash_table_entry *prev = NULL; 585 while (entry != NULL) { 586 if (table->cmp(entry->key, key) == 0) { 587 if (prev == NULL) { 588 table->entries[hash] = entry->next; 589 } else { 590 prev->next = entry->next; 591 } 592 if (table->dtor != NULL) { 593 table->dtor(entry->key, 1); 594 table->dtor(entry->value, 0); 595 } 596 free(entry); 597 return; 598 } 599 prev = entry; 600 entry = entry->next; 601 } 602 --- 603 604 @s 605 To find a value associated with a given key in the hash table, we hash the string, loop through the entries, and return the value if a match is found. 606 607 --- Loop Through Entries and Return Value if Match 608 while (entry != NULL) { 609 if (table->cmp(entry->key, key) == 0) { 610 return entry->value; 611 } 612 entry = entry->next; 613 } 614 ---