lexer.lit (73462B)
1 @code_type c .c 2 @comment_type /* %s */ 3 @compiler lit -t lexer.lit && gcc -Wall -Wextra -Wstrict-aliasing=3 -Wwrite-strings -Wvla -Wcast-align=strict -Wstrict-prototypes -Wstringop-overflow=4 -Wshadow -fanalyzer tokenizer.c input.c hash_table.c token.c util.c -D TEST_TOKENIZER -g -O0 -o lextest && valgrind --leak-check=full ./lextest tokenizer.c && rm lextest 4 5 @title Lexer 6 @add_css ./style.css 7 @s General Project Structure 8 Since this is the first article, I'll outline the project structure for the C- compiler. 9 10 The project has a series of pretty typical stages: 11 12 1. The lexer. This takes a file as input and emits a series of tokens (Its input is already preprocessed, I outsource that to "gcc -E"). 13 2. The parser. This takes the tokens and builds an abstract syntax tree (AST). 14 4. The type checker/semantic analyzer. This is used to ensure that the types of variables and functions are correct. 15 5. The code generator. This takes the AST and generates an intermediate representation (IR). 16 6. The optimizer. This takes the IR and optimizes it. This'll be broken up into a few stages. 17 7. The lowerer. This takes the IR and lowers it to a simpler IR. 18 8. The register allocator. This takes the IR, which has instructions in an infinite number of registers, and assigns them to a finite number of registers. 19 9. The code emitter. This takes the IR and emits RISC-V assembly. 20 21 As far as possible, I'd like to keep each of these stages separate. One benefit of this is that it simplifies memory management greatly. I plan to use an arena allocator for each stage, and by making sure the only thing on the actual heap is the output of the stage, and all temporary data is stored in the arena, I can free all the memory used by a stage by simply freeing the arena. 22 23 @s Some Rules 24 25 Here are some rules (more like guidelines) that I plan to follow for this project; they're mostly just to keep things simple and consistent. 26 27 1. PROGRAM LIKE IT'S 1999 28 29 > 640 KB ought to be enough for anybody. - Bill Gates 30 31 Maybe not that little, But I'm going to try to keep the project as simple as possible, 640 KB probably won't be enough, but I'll still aim for less than 10 MB of memory usage. 32 33 This places a lot of constraints on the project, but I think it's a good exercise in minimalism. 34 35 Some consequences of this are that I'll have to use memory-wise algorithms, be very careful about program structure, and avoid some of the bigger libraries (which will help with making this project self-hosting in the future). 36 37 2. PROGRAM IN C++-- 38 39 I'm not a big fan of C++, but its class system helps prevent a lot of ugly bugs. To that end, I'm going to try and keep data structures out of header files, and only expose functions that operate on those data structures, to create a sort of approximation of a class. This has a few benefits: 40 41 * Quicker compilation. A change to a data structure will only require one file to be recompiled, rather than every file that includes the header. 42 * Less chance of bugs. If a function is the only way to interact with a data structure, then it's much harder to misuse that data structure. 43 * Run time type checking. I can include some sort of tag in the first field of every data structure to ensure that the correct functions are being called. 44 45 3. DON'T GET FANCY 46 47 My goal here isn't to write the fastest interpreter in the world, or the most complete. I just want to make something that works and can be understood by someone else. 48 49 That means I'm going to avoid a lot of the tricks that are used in production interpreters, and focus more on simplicity and readability. 50 51 4. DESIGN FOR DEBUGGING 52 53 This code is going to be peppered with asserts and contain mechanisms to print out the state of the program at any point. 54 55 This might be painful, but it'll make debugging a lot simpler and let users look under the hood. 56 57 5. SMART DATA, STUPID CODE 58 59 A lot of times, the right data structure can replace 50-100 lines of procedural code. I'm going to try and design data structures which make the algorithms as simple as possible. 60 61 For example, instead of writing 50-100 lines of code to hold every keyword in the language, I can just use a simple hash table. 62 63 @s Misc 64 THIS IS A LITERATE PROGRAM! Go to [this link](https://reagancfischer.dev/projects/cminus/code/lexer.lit) to see the file that generated this HTML. 65 66 @s The Lexer 67 68 A lexical analyzer reads source code and produces tokens, which are the smallest units of meaning in a language. These tokens are then used by the parser to build an abstract syntax tree (AST), which represents the structure of the program. 69 70 For example, in the C programming language, tokens include keywords (if, else, while, etc.), identifiers (variable names), numbers, and punctuation (braces, semicolons, etc.). 71 72 Given a string like `int main() { return 0; }`, the lexer would produce a series of tokens like `INT`, `IDENTIFIER(main)`, `LPAREN`, `RPAREN`, `LBRACE`, `RETURN`, `INTCONSTANT(0)`, `SEMICOLON`, `RBRACE`. 73 74 @s Design 75 76 I'll break the lexer into several modules: 77 - `token.c` will contain the token data structure and functions to create and destroy tokens. 78 - `input.c` will contain the input data structure and functions to read from the input file. 79 - `tokenizer.c` will contain the main lexer logic. 80 81 @s Token Interface 82 83 We'll need several components to represent a token: 84 - The type of token, which will be an enum with values like `TOK_IF` or `TOK_INTEGER_U32`. 85 - The value of the token. Some tokens, like keywords, don't have a value. Others, like identifiers or constants, do. 86 - The line and column of the token, used for error messages. 87 88 To implement a class system in C, we'll hide the token implementation details behind an opaque pointer. We'll use a forward declaration of the token type in the header file and define the token type in the implementation file. 89 90 @s 91 --- Opaque Token Type 92 typedef struct token token_t; 93 --- 94 95 @s 96 We'll need functions to create and destroy tokens. 97 --- Token Creation and Destruction Interface 98 token_t *token_create(c_token_types kind, int lin, int col, int len); 99 token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len); 100 token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len); 101 token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len); 102 token_t *token_create_string(c_token_types kind, int lin, int col, const char *s, int len); 103 void token_destroy(token_t *token); 104 --- 105 106 107 @s 108 We'll also need functions to access the token data. 109 --- Token Interface 110 c_token_types token_type(token_t *token); 111 int64_t token_int(token_t *token); 112 double token_float(token_t *token); 113 const char *token_string(token_t *token); 114 char token_char(token_t *token); 115 int token_line(token_t *token); 116 int token_column(token_t *token); 117 void print_token(token_t *tok); 118 --- 119 120 @s 121 We'll need types to represent the different kinds of tokens. 122 --- Token Types 123 typedef enum { 124 // Control Keywords 125 TOK_IF, 126 TOK_ELSE, 127 TOK_SWITCH, 128 TOK_CASE, 129 TOK_DEFAULT, 130 TOK_WHILE, 131 TOK_DO, 132 TOK_FOR, 133 TOK_CONTINUE, 134 TOK_BREAK, 135 TOK_RETURN, 136 TOK_GOTO, 137 138 // Type Keywords 139 TOK_VOID, 140 TOK_CHAR, 141 TOK_SHORT, 142 TOK_INT, 143 TOK_LONG, 144 TOK_FLOAT, 145 TOK_DOUBLE, 146 TOK_SIGNED, 147 TOK_UNSIGNED, 148 TOK_STRUCT, 149 TOK_UNION, 150 TOK_ENUM, 151 TOK_TYPEDEF, 152 153 // Storage Class/Specifier Keywords 154 TOK_AUTO, 155 TOK_REGISTER, 156 TOK_STATIC, 157 TOK_EXTERN, 158 TOK_CONST, 159 TOK_VOLATILE, 160 161 // Misc Keywords 162 TOK_SIZEOF, 163 164 // Operators 165 TOK_ADD, // + 166 TOK_SUB, // - 167 TOK_MUL, // * 168 TOK_DIV, // / 169 TOK_MOD, // % 170 TOK_BIT_AND, // & 171 TOK_BIT_OR, // | 172 TOK_BIT_XOR, // ^ 173 TOK_BIT_NOT, // ~ 174 TOK_LSHIFT, // << 175 TOK_RSHIFT, // >> 176 TOK_NOT, // ! 177 TOK_ASSIGN, // = 178 TOK_LT, // < 179 TOK_GT, // > 180 TOK_INC, // ++ 181 TOK_DEC, // -- 182 TOK_EQ, // == 183 TOK_NE, // != 184 TOK_LE, // <= 185 TOK_GE, // >= 186 TOK_AND, // && 187 TOK_OR, // || 188 TOK_MEMBER_POINTER, // -> 189 TOK_MEMBER, // . 190 TOK_COND_DECISION, // : 191 TOK_COND, // ? 192 TOK_ASSIGN_ADD, // += 193 TOK_ASSIGN_SUB, // -= 194 TOK_ASSIGN_MUL, // *= 195 TOK_ASSIGN_DIV, // /= 196 TOK_ASSIGN_MOD, // %= 197 TOK_ASSIGN_BITAND, // &= 198 TOK_ASSIGN_BITOR, // |= 199 TOK_ASSIGN_BITXOR, // ^= 200 TOK_ASSIGN_LSHIFT, // <<= 201 TOK_ASSIGN_RSHIFT, // >>= 202 203 // Separators 204 TOK_LEFT_PAREN, // ( 205 TOK_RIGHT_PAREN, // ) 206 TOK_LEFT_BRACKET, // [ 207 TOK_RIGHT_BRACKET, // ] 208 TOK_LEFT_BRACE, // { 209 TOK_RIGHT_BRACE, // } 210 TOK_COMMA, // , 211 TOK_SEMICOLON, // ; 212 TOK_DOT, // . 213 TOK_ELLIPSIS, // ... 214 TOK_HASH, // # 215 216 // Identifiers 217 TOK_ID, 218 TOK_TYPEDEF_NAME, 219 220 // Constants 221 TOK_INTEGER_U32, // u 222 TOK_INTEGER_U64, // ul 223 TOK_INTEGER_S32, // (no suffix) 224 TOK_INTEGER_S64, // l 225 TOK_FLOAT_32, // f 226 TOK_FLOAT_64, // (no suffix) 227 TOK_CHAR_CONST, // 'c' 228 TOK_STRING_ASCII, // "string" (width of 8 bits) 229 230 // Special 231 TOK_EOF, 232 TOK_ERROR, 233 } c_token_types; 234 --- 235 236 @s 237 We bring this all together in `token.h`. Line and column are exposed as global variables because `skip_whitespace` will need to update them. 238 --- token.h 239 #ifndef TOKEN_H 240 #define TOKEN_H 241 #include <stdint.h> // For int64_t 242 @{Token Types} 243 @{Opaque Token Type} 244 @{Token Creation and Destruction} 245 @{Token Interface} 246 extern int column; 247 extern int line; 248 #endif 249 --- 250 251 @s Token Implementation 252 253 Now that we have the interface, we can implement the token data structure. We'll need: 254 - The token type. 255 - A way to store extra data. 256 - Implementations of the functions defined in the interface. 257 258 To verify the token isn't corrupt, we'll use a tag. `TOK_MAGIC_1` represents a token with optional data, and `TOK_MAGIC_2` represents a token without optional data. 259 260 A zero-length array is used in the token data structure. This GCC extension allows us to allocate memory for the token data structure and the token data in one allocation. 261 --- Token Data Structure 262 #define TOK_MAGIC_1 0x544F4B454E544F4Bul // "TOKENTOK" 263 #define TOK_MAGIC_2 0x544F4B544F4B454Eul // "TOKTOKEN" 264 265 struct token { 266 long magic; 267 int line; 268 int column; 269 short kind; 270 long opt_data[0]; 271 }; 272 273 typedef struct token token_t; 274 275 struct token_data { 276 union { 277 int64_t i; 278 double f; 279 const char *s; 280 char c; 281 } data; 282 }; 283 284 typedef struct token_data token_data_t; 285 int column = 1; 286 int line = 1; 287 --- 288 289 290 @s 291 We'll implement an interface for accessing the token data and a macro for accessing optional data. 292 --- Token Data Access 293 #define token_data(token) ((struct token_data *)((token)->opt_data)) 294 295 c_token_types token_type(token_t *token) { 296 assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2); 297 return token->kind; 298 } 299 300 int64_t token_int(token_t *token) { 301 assert(token->kind == TOK_INTEGER_U32 || token->kind == TOK_INTEGER_U64 || token->kind == TOK_INTEGER_S32 || token->kind == TOK_INTEGER_S64); 302 assert(token->magic == TOK_MAGIC_1); 303 return token_data(token)->data.i; 304 } 305 306 double token_float(token_t *token) { 307 assert(token->kind == TOK_FLOAT_32 || token->kind == TOK_FLOAT_64); 308 assert(token->magic == TOK_MAGIC_1); 309 return token_data(token)->data.f; 310 } 311 312 const char *token_string(token_t *token) { 313 assert(token->kind == TOK_STRING_ASCII || token->kind == TOK_ID || token->kind == TOK_TYPEDEF_NAME); 314 assert(token->magic == TOK_MAGIC_1); 315 return token_data(token)->data.s; 316 } 317 318 char token_char(token_t *token) { 319 assert(token->kind == TOK_CHAR_CONST); 320 assert(token->magic == TOK_MAGIC_1); 321 return token_data(token)->data.c; 322 } 323 324 int token_line(token_t *token) { 325 assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2); 326 return token->line; 327 } 328 329 int token_column(token_t *token) { 330 assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2); 331 return token->column; 332 } 333 --- 334 335 @s 336 For debugging, we'll add a function to print the token type. 337 --- Token Debugging 338 @{Token Type Enum to String} 339 @{Unescape String} 340 @{Print Token} 341 --- 342 343 @s 344 This function returns a string with the token type name. 345 --- Token Type Enum to String 346 const char *token_name_from_type(c_token_types type) { 347 switch (type) { 348 case TOK_IF: 349 return "TOK_IF"; 350 case TOK_ELSE: 351 return "TOK_ELSE"; 352 case TOK_SWITCH: 353 return "TOK_SWITCH"; 354 case TOK_CASE: 355 return "TOK_CASE"; 356 case TOK_DEFAULT: 357 return "TOK_DEFAULT"; 358 case TOK_WHILE: 359 return "TOK_WHILE"; 360 case TOK_DO: 361 return "TOK_DO"; 362 case TOK_FOR: 363 return "TOK_FOR"; 364 case TOK_CONTINUE: 365 return "TOK_CONTINUE"; 366 case TOK_BREAK: 367 return "TOK_BREAK"; 368 case TOK_RETURN: 369 return "TOK_RETURN"; 370 case TOK_GOTO: 371 return "TOK_GOTO"; 372 case TOK_VOID: 373 return "TOK_VOID"; 374 case TOK_CHAR: 375 return "TOK_CHAR"; 376 case TOK_SHORT: 377 return "TOK_SHORT"; 378 case TOK_INT: 379 return "TOK_INT"; 380 case TOK_LONG: 381 return "TOK_LONG"; 382 case TOK_FLOAT: 383 return "TOK_FLOAT"; 384 case TOK_DOUBLE: 385 return "TOK_DOUBLE"; 386 case TOK_SIGNED: 387 return "TOK_SIGNED"; 388 case TOK_UNSIGNED: 389 return "TOK_UNSIGNED"; 390 case TOK_STRUCT: 391 return "TOK_STRUCT"; 392 case TOK_UNION: 393 return "TOK_UNION"; 394 case TOK_ENUM: 395 return "TOK_ENUM"; 396 case TOK_TYPEDEF: 397 return "TOK_TYPEDEF"; 398 case TOK_AUTO: 399 return "TOK_AUTO"; 400 case TOK_REGISTER: 401 return "TOK_REGISTER"; 402 case TOK_STATIC: 403 return "TOK_STATIC"; 404 case TOK_EXTERN: 405 return "TOK_EXTERN"; 406 case TOK_CONST: 407 return "TOK_CONST"; 408 case TOK_VOLATILE: 409 return "TOK_VOLATILE"; 410 case TOK_SIZEOF: 411 return "TOK_SIZEOF"; 412 case TOK_ADD: 413 return "TOK_ADD"; 414 case TOK_SUB: 415 return "TOK_SUB"; 416 case TOK_MUL: 417 return "TOK_MUL"; 418 case TOK_DIV: 419 return "TOK_DIV"; 420 case TOK_MOD: 421 return "TOK_MOD"; 422 case TOK_BIT_AND: 423 return "TOK_BIT_AND"; 424 case TOK_BIT_OR: 425 return "TOK_BIT_OR"; 426 case TOK_BIT_XOR: 427 return "TOK_BIT_XOR"; 428 case TOK_BIT_NOT: 429 return "TOK_BIT_NOT"; 430 case TOK_LSHIFT: 431 return "TOK_LSHIFT"; 432 case TOK_RSHIFT: 433 return "TOK_RSHIFT"; 434 case TOK_NOT: 435 return "TOK_NOT"; 436 case TOK_ASSIGN: 437 return "TOK_ASSIGN"; 438 case TOK_LT: 439 return "TOK_LT"; 440 case TOK_GT: 441 return "TOK_GT"; 442 case TOK_INC: 443 return "TOK_INC"; 444 case TOK_DEC: 445 return "TOK_DEC"; 446 case TOK_EQ: 447 return "TOK_EQ"; 448 case TOK_NE: 449 return "TOK_NE"; 450 case TOK_LE: 451 return "TOK_LE"; 452 case TOK_GE: 453 return "TOK_GE"; 454 case TOK_AND: 455 return "TOK_AND"; 456 case TOK_OR: 457 return "TOK_OR"; 458 case TOK_MEMBER_POINTER: 459 return "TOK_MEMBER_POINTER"; 460 case TOK_MEMBER: 461 return "TOK_MEMBER"; 462 case TOK_COND_DECISION: 463 return "TOK_COND_DECISION"; 464 case TOK_COND: 465 return "TOK_COND"; 466 case TOK_ASSIGN_ADD: 467 return "TOK_ASSIGN_ADD"; 468 case TOK_ASSIGN_SUB: 469 return "TOK_ASSIGN_SUB"; 470 case TOK_ASSIGN_MUL: 471 return "TOK_ASSIGN_MUL"; 472 case TOK_ASSIGN_DIV: 473 return "TOK_ASSIGN_DIV"; 474 case TOK_ASSIGN_MOD: 475 return "TOK_ASSIGN_MOD"; 476 case TOK_ASSIGN_BITAND: 477 return "TOK_ASSIGN_BITAND"; 478 case TOK_ASSIGN_BITOR: 479 return "TOK_ASSIGN_BITOR"; 480 case TOK_ASSIGN_BITXOR: 481 return "TOK_ASSIGN_BITXOR"; 482 case TOK_ASSIGN_LSHIFT: 483 return "TOK_ASSIGN_LSHIFT"; 484 case TOK_ASSIGN_RSHIFT: 485 return "TOK_ASSIGN_RSHIFT"; 486 case TOK_HASH: 487 return "TOK_HASH"; 488 case TOK_ID: 489 return "TOK_ID"; 490 case TOK_TYPEDEF_NAME: 491 return "TOK_TYPEDEF_NAME"; 492 case TOK_INTEGER_U32: 493 return "TOK_INTEGER_U32"; 494 case TOK_INTEGER_U64: 495 return "TOK_INTEGER_U64"; 496 case TOK_INTEGER_S32: 497 return "TOK_INTEGER_S32"; 498 case TOK_INTEGER_S64: 499 return "TOK_INTEGER_S64"; 500 case TOK_FLOAT_32: 501 return "TOK_FLOAT_32"; 502 case TOK_FLOAT_64: 503 return "TOK_FLOAT_64"; 504 case TOK_CHAR_CONST: 505 return "TOK_CHAR_CONST"; 506 case TOK_STRING_ASCII: 507 return "TOK_STRING_ASCII"; 508 case TOK_EOF: 509 return "TOK_EOF"; 510 case TOK_ERROR: 511 return "TOK_ERROR"; 512 case TOK_LEFT_PAREN: 513 return "TOK_LEFT_PAREN"; 514 case TOK_RIGHT_PAREN: 515 return "TOK_RIGHT_PAREN"; 516 case TOK_LEFT_BRACKET: 517 return "TOK_LEFT_BRACKET"; 518 case TOK_RIGHT_BRACKET: 519 return "TOK_RIGHT_BRACKET"; 520 case TOK_LEFT_BRACE: 521 return "TOK_LEFT_BRACE"; 522 case TOK_RIGHT_BRACE: 523 return "TOK_RIGHT_BRACE"; 524 case TOK_COMMA: 525 return "TOK_COMMA"; 526 case TOK_SEMICOLON: 527 return "TOK_SEMICOLON"; 528 case TOK_DOT: 529 return "TOK_DOT"; 530 case TOK_ELLIPSIS: 531 return "TOK_ELLIPSIS"; 532 } 533 return "UNKNOWN"; 534 } 535 --- 536 537 @s 538 This function adds escape characters to a string for printing. 539 --- Unescape String 540 #define clamp(x, min, max) ((x) < (min) ? (min) : (x) > (max) ? (max) : (x)) 541 char *re_escape_string(const char *str) { 542 int len = strlen(str); 543 char *buf = malloc(len * 2 + 1); 544 if (!buf) { 545 fprintf(stderr, "Out of memory. Cannot escape string\n"); 546 exit(1); 547 } 548 int i = 0; 549 for (int j = 0; j < len; j++) { 550 switch (str[j]) { 551 case '\a': buf[i++] = '\\'; buf[i++] = 'a'; break; 552 case '\b': buf[i++] = '\\'; buf[i++] = 'b'; break; 553 case '\f': buf[i++] = '\\'; buf[i++] = 'f'; break; 554 case '\n': buf[i++] = '\\'; buf[i++] = 'n'; break; 555 case '\r': buf[i++] = '\\'; buf[i++] = 'r'; break; 556 case '\t': buf[i++] = '\\'; buf[i++] = 't'; break; 557 case '\v': buf[i++] = '\\'; buf[i++] = 'v'; break; 558 case '\\': buf[i++] = '\\'; buf[i++] = '\\'; break; 559 case '\'': buf[i++] = '\\'; buf[i++] = '\''; break; 560 case '"': buf[i++] = '\\'; buf[i++] = '"'; break; 561 default: { 562 if (isprint(str[j])) { 563 buf[i++] = str[j]; 564 } else { 565 buf[i++] = '\\'; 566 buf[i++] = 'x'; 567 buf[i++] = "0123456789abcdef"[clamp(str[j] >> 4, 0, 0xf)]; 568 buf[i++] = "0123456789abcdef"[clamp(str[j] & 0xf, 0, 0xf)]; 569 } 570 } 571 } 572 } 573 buf[i] = '\0'; 574 return buf; 575 } 576 --- 577 578 579 @s 580 This function prints the token type and value. 581 --- Print Token 582 void print_token(token_t *tok) { 583 if (!tok) { 584 printf("NULL\n"); 585 return; 586 } 587 const char *name = token_name_from_type(tok->kind); 588 switch (tok->kind) { 589 case TOK_ID: 590 case TOK_STRING_ASCII: { 591 char *escaped = re_escape_string(token_string(tok)); 592 printf("%s: \"%s\"@%d:%d\n", name, escaped, tok->line, tok->column); 593 free(escaped); 594 break; 595 } 596 case TOK_TYPEDEF_NAME: { 597 char *escaped = re_escape_string(token_string(tok)); 598 printf("%s: %s@%d:%d\n", name, escaped, tok->line, tok->column); 599 free(escaped); 600 break; 601 } 602 case TOK_CHAR_CONST: 603 printf("%s: '%c'@%d:%d\n", name, token_char(tok), tok->line, tok->column); 604 break; 605 case TOK_INTEGER_S32: 606 case TOK_INTEGER_U32: 607 case TOK_INTEGER_S64: 608 case TOK_INTEGER_U64: 609 printf("%s: %ld@%d:%d\n", name, token_int(tok), tok->line, tok->column); 610 break; 611 case TOK_FLOAT_32: 612 case TOK_FLOAT_64: 613 printf("%s: %f@%d:%d\n", name, token_float(tok), tok->line, tok->column); 614 break; 615 default: 616 printf("%s@%d:%d\n", name, tok->line, tok->column); 617 break; 618 } 619 } 620 --- 621 622 @s 623 Now we can implement functions to create and destroy tokens. We'll start with the easy ones. 624 --- Token Creation and Destruction 625 token_t *token_data_create(c_token_types kind, int lin, int col, int len) { 626 token_t *token = malloc(sizeof(token_t) + sizeof(struct token_data)); 627 if (!token) { 628 fputs("Out of memory\n", stderr); 629 exit(1); 630 } 631 token->magic = TOK_MAGIC_1; 632 token->line = lin; 633 token->column = col; 634 column += len; 635 token->kind = kind; 636 return token; 637 } 638 639 token_t *token_create(c_token_types kind, int lin, int col, int len) { 640 token_t *token = malloc(sizeof(token_t)); 641 if (!token) { 642 fputs("Out of memory\n", stderr); 643 exit(1); 644 } 645 token->magic = TOK_MAGIC_2; 646 token->line = lin; 647 token->column = col; 648 column += len; 649 token->kind = kind; 650 return token; 651 } 652 653 token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len) { 654 token_t *token = token_data_create(kind, lin, col, len); 655 token_data(token)->data.i = i; 656 return token; 657 } 658 659 token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len) { 660 token_t *token = token_data_create(kind, lin, col, len); 661 token_data(token)->data.f = f; 662 return token; 663 } 664 665 token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len) { 666 token_t *token = token_data_create(kind, lin, col, len); 667 token_data(token)->data.c = c; 668 return token; 669 } 670 671 void token_destroy(token_t *token) { 672 if (token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2) { 673 free(token); 674 } else { 675 fputs("Corrupt token\n", stderr); 676 exit(1); 677 } 678 } 679 --- 680 681 @s 682 `token_create_string` can be implemented either the easy way or the right way. Let's try the easy way. 683 --- Token Create String 684 token_t *token_create_string(c_token_types kind, int lin, int col, const char *s, int len) { 685 token_t *token = token_create(kind, lin, col, len); 686 token_data(token)->data.s = strdup(s); 687 return token; 688 } 689 --- 690 691 @s 692 There's an issue with this approach. `token_create_string` will be called for every identifier and every string in a program. Imagine a large program, say a shell, with a bunch of user input and output. That program will likely have 20-40 calls to `fprintf`, `fscanf`, `strchr`, `strtok`, each. We create a new string for each of those calls. That's a lot of duplicates, and can quickly add up to a lot of memory usage. 693 694 To fix this, we use a hash table to store the strings. We'll define a hash table in `hash_table.h` and `hash_table.c`. 695 696 @s Hash Table 697 A hash table is a data structure that maps keys to values. It's commonly used for implementing symbol tables to store variables and functions. To create a generic hash table, we'll need: 698 699 1. A hash function to convert keys into array indices 700 2. A comparison function to check if two keys are equal 701 3. An opaque type to represent the hash table 702 4. A destructor function to clean up keys and values when removing entries 703 704 Let's start with the interface: 705 706 @s 707 --- Hash Table Opaque Types 708 typedef struct hash_table hash_table_t; 709 typedef int (*hash_table_cmp_fn)(const void *key1, const void *key2); 710 typedef unsigned int (*hash_table_hash_fn)(const void *key); 711 typedef void (*hash_table_dtor)(void *data, int is_key); 712 --- 713 714 @s 715 --- Hash Table Creation and Destruction 716 hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor); 717 void hash_table_destroy(hash_table_t *table); 718 --- 719 720 @s 721 --- Hash Table Access 722 void *hash_table_get(const hash_table_t *table, const void *key); 723 void hash_table_put(hash_table_t *table, void *key, void *value); 724 void hash_table_remove(hash_table_t *table, const void *key); 725 --- 726 727 @s 728 --- hash_table.h 729 #ifndef HASH_TABLE_H 730 #define HASH_TABLE_H 731 732 @{Hash Table Opaque Types} 733 @{Hash Table Creation and Destruction} 734 @{Hash Table Access} 735 736 #endif /* HASH_TABLE_H */ 737 --- 738 739 @s 740 Now let's implement the hash table: 741 742 --- hash_table.c 743 #include <stdlib.h> 744 #include <string.h> 745 #include <stdio.h> 746 #include "hash_table.h" 747 748 @{Hash Table Data Structure} 749 @{Hash Table Entry Data Structure} 750 751 hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor) { 752 @{Allocate and Initialize Hash Table} 753 return table; 754 } 755 756 void hash_table_destroy(hash_table_t *table) { 757 if (!table) return; 758 @{Destroy Entries} 759 free(table->entries); 760 free(table); 761 } 762 763 void *hash_table_get(const hash_table_t *table, const void *key) { 764 if (!table || !key) return NULL; 765 @{Get Entry By Hash} 766 @{Loop Through Entries and Return Value if Match} 767 return NULL; 768 } 769 770 void hash_table_put(hash_table_t *table, void *key, void *value) { 771 if (!table || !key) return; 772 @{Get Entry By Hash} 773 @{Loop Through Entries and Replace Value if Key Matches} 774 @{Allocate New Entry if No Match} 775 } 776 777 void hash_table_remove(hash_table_t *table, const void *key) { 778 if (!table || !key) return; 779 @{Get Entry By Hash} 780 @{Loop Through Entries and Remove Entry if Key Matches} 781 } 782 783 #ifdef TEST_HASH_TABLE 784 #include <assert.h> 785 786 static int string_cmp(const void *key1, const void *key2) { 787 return strcmp((const char *)key1, (const char *)key2); 788 } 789 790 static unsigned int string_hash(const void *key) { 791 unsigned int hash = 5381; 792 const unsigned char *str = key; 793 int c; 794 795 while ((c = *str++)) 796 { 797 hash = ((hash << 5) + hash) + c; 798 } 799 800 return hash; 801 } 802 803 int main(void) { 804 hash_table_t *table = hash_table_create(16, string_cmp, string_hash, NULL); 805 assert(table != NULL); 806 807 hash_table_put(table, "foo", "bar"); 808 hash_table_put(table, "foo", "baz"); 809 assert(strcmp((const char *)hash_table_get(table, "foo"), "baz") == 0); 810 811 hash_table_remove(table, "foo"); 812 assert(hash_table_get(table, "foo") == NULL); 813 814 hash_table_destroy(table); 815 printf("All tests passed!\n"); 816 return 0; 817 } 818 #endif 819 --- 820 821 @s 822 The hash table data structure contains an array of entry pointers, the size of the array, and function pointers for comparison, hashing, and destruction. 823 824 --- Hash Table Data Structure 825 struct hash_table { 826 struct hash_table_entry **entries; 827 int size; 828 hash_table_cmp_fn cmp; 829 hash_table_hash_fn hash; 830 hash_table_dtor dtor; 831 }; 832 --- 833 834 @s 835 Each entry in the hash table contains a key, a value, and a pointer to the next entry in the chain for collision resolution via chaining. 836 837 --- Hash Table Entry Data Structure 838 struct hash_table_entry { 839 void *key; 840 void *value; 841 struct hash_table_entry *next; 842 }; 843 --- 844 845 846 @s 847 To allocate a hash table, we allocate memory for the table structure and its entries, initialize the entries to NULL, and set the function pointers. 848 849 --- Allocate and Initialize Hash Table 850 hash_table_t *table = malloc(sizeof(struct hash_table)); 851 if (!table) { 852 fprintf(stderr, "Error: Out of memory, could not allocate hash table\n"); 853 exit(EXIT_FAILURE); 854 } 855 table->entries = calloc(size, sizeof(struct hash_table_entry *)); 856 if (!table->entries) { 857 fprintf(stderr, "Error: Out of memory, could not allocate hash table entries\n"); 858 free(table); 859 exit(EXIT_FAILURE); 860 } 861 table->size = size; 862 table->cmp = cmp; 863 table->hash = hash; 864 table->dtor = dtor; 865 --- 866 867 @s 868 To destroy the entries in a hash table, we iterate through all entries, free the keys and values using the destructor if provided, and free the entry itself. 869 --- Destroy Entries 870 for (int i = 0; i < table->size; i++) { 871 struct hash_table_entry *entry = table->entries[i]; 872 while (entry) { 873 struct hash_table_entry *next = entry->next; 874 if (table->dtor) { 875 table->dtor(entry->key, 1); 876 table->dtor(entry->value, 0); 877 } 878 free(entry); 879 entry = next; 880 } 881 } 882 --- 883 884 @s 885 To retrieve an entry's hash bucket, we apply the hash function to the key and take the modulus of the result with the table size. 886 887 --- Get Entry By Hash 888 unsigned int hash = table->hash(key) % table->size; 889 struct hash_table_entry *entry = table->entries[hash]; 890 --- 891 892 @s 893 When putting a new entry in the table, we first check if the key already exists. If it does, we update the value; otherwise, we create a new entry. 894 895 --- Loop Through Entries and Replace Value if Key Matches 896 while (entry) { 897 if (table->cmp(entry->key, key) == 0) { 898 if (table->dtor) table->dtor(entry->value, 0); 899 entry->value = value; 900 return; 901 } 902 entry = entry->next; 903 } 904 --- 905 906 @s 907 If no matching key is found, we create a new entry and insert it at the beginning of the hash bucket. 908 909 This can possibly improve performance due to a property called temporal locality. When we access an entry, we're likely to access it again soon. Since the entry is at the beginning of the list, it's likely to be accessed again soon. 910 911 --- Allocate New Entry if No Match 912 struct hash_table_entry *new_entry = malloc(sizeof(struct hash_table_entry)); 913 if (!new_entry) { 914 fprintf(stderr, "Error: Out of memory, could not allocate hash table entry\n"); 915 exit(EXIT_FAILURE); 916 } 917 new_entry->key = key; 918 new_entry->value = value; 919 new_entry->next = table->entries[hash]; 920 table->entries[hash] = new_entry; 921 --- 922 923 @s 924 To remove an entry, we find its bucket, update the linked list to bypass it, then free the entry and its contents. 925 926 --- Loop Through Entries and Remove Entry if Key Matches 927 struct hash_table_entry *prev = NULL; 928 while (entry) { 929 if (table->cmp(entry->key, key) == 0) { 930 if (prev) 931 prev->next = entry->next; 932 else 933 table->entries[hash] = entry->next; 934 935 if (table->dtor) { 936 table->dtor(entry->key, 1); 937 table->dtor(entry->value, 0); 938 } 939 free(entry); 940 return; 941 } 942 prev = entry; 943 entry = entry->next; 944 } 945 --- 946 947 @s 948 To retrieve a value from a given bucket, we just walk the list and return the value if a matching key is found. 949 950 --- Loop Through Entries and Return Value if Match 951 while (entry) { 952 if (table->cmp(entry->key, key) == 0) { 953 return entry->value; 954 } 955 entry = entry->next; 956 } 957 --- 958 959 @s 960 We're now almost ready to implement `token_create_string` the right way. First, we'll need a good hash function. 961 962 Hash functions are a very interesting topic and there's a lot of good research on them. The hash function we use should be fast, have a low collision rate, and be able to handle strings of any length. 963 964 We can't just sum the characters in a string, because that would mean that "stop" and "pots" would have the same hash. Multiplying has the same problem. If we take each to the power of its position in the string, we get a better distribution, but it's still awful. 965 966 Using a simple python program, I brute-forced all possible 4-character strings and ran our power-hash function on them, the result showed that for 456976 possible strings, only 3760 were unique, which is terrible. 967 968 Instead of trying to come up with a new hash function, we can use one that's been well-tested and is known to work well. 969 970 The first time I wrote this, I used the hash function from the 'Red Dragon Book' (Compilers: Principles, Techniques, and Tools). 971 --- Hash Function 972 unsigned long hash_string(void *key) { 973 unsigned long hash = 0, g; 974 char *p = key; 975 while (*p) { 976 hash = (hash \<\< 4) + *p++; 977 if ((g = hash \& 0xf0000000) != 0) { 978 hash ^= g \>\> 24; 979 hash ^= g; 980 } 981 } 982 return hash; 983 } 984 --- 985 This is a bit slow on modern processors because it's not very cache-friendly. We can do better. Let's use its child, ELFHash, from libc. 986 987 As you can see in the code below, this function avoids extra operations and should be much faster. 988 --- Hash Function := 989 unsigned int hash_string(const void *key) { 990 unsigned long hash = 0, hi = 0; 991 const char *p = key; 992 hash = *p; 993 if (hash != 0 && p[1] != 0) { 994 hash = (hash << 4) + p[1]; 995 if (p[2] != 0) { 996 hash = (hash << 4) + p[2]; 997 if (p[3] != 0) { 998 hash = (hash << 4) + p[3]; 999 if (p[4] != 0) { 1000 hash = (hash << 4) + p[4]; 1001 p += 5; 1002 while (*p != 0) { 1003 hash = (hash << 4) + *p++; 1004 hi = hash & 0xf0000000l; 1005 hash ^= hi >> 24; 1006 } 1007 hash &= 0x0fffffffl; 1008 } 1009 } 1010 } 1011 } 1012 return hash; 1013 } 1014 --- 1015 1016 @s 1017 We also need a comparison function for strings. 1018 --- String Comparison 1019 int cmp_string(const void *key1, const void *key2) { 1020 return strcmp((char *)key1, (char *)key2); 1021 } 1022 --- 1023 1024 @s 1025 Finally, we'll need a destructor for entries. 1026 --- String Destructor 1027 void dtor_string(void *value, int is_key) { 1028 if (is_key) { 1029 free(value); // Since the key and value are the same, we only need to free once. 1030 } 1031 } 1032 --- 1033 @s 1034 These functions go in util.c 1035 --- util.c 1036 #include <string.h> 1037 #include <stdlib.h> 1038 #include "hash_table.h" 1039 @{String Comparison} 1040 @{Hash Function} 1041 @{String Destructor} 1042 --- 1043 --- util.h 1044 #ifndef UTIL_H 1045 #define UTIL_H 1046 #include "hash_table.h" 1047 int cmp_string(const void *key1, const void *key2); 1048 unsigned int hash_string(const void *key); 1049 void dtor_string(void *value, int is_key); 1050 #endif 1051 --- 1052 1053 @s 1054 Now we can implement `token_create_string` the right way. 1055 1056 You might notice that we're using the same key and value. This way of using a hash table is normally called a set. We're using it to store strings, but we could use it to store anything we want to deduplicate. 1057 --- Token Create String --- := 1058 hash_table_t *string_table; 1059 token_t *token_create_string(c_token_types kind, int lin, int col, 1060 const char *s, int len) { 1061 if (string_table == NULL) { 1062 string_table = hash_table_create(2048, cmp_string, hash_string, dtor_string); 1063 } 1064 token_t *token = token_data_create(kind, lin, col, len); 1065 char *key = hash_table_get(string_table, (void *)s); 1066 if (key == NULL) { 1067 key = strdup(s); 1068 hash_table_put(string_table, key, key); 1069 } 1070 token_data(token)->data.s = key; 1071 return token; 1072 } 1073 --- 1074 1075 @s 1076 We'll add an external declaration for `string_table` in `token.h` so other programs can take advantage of it. 1077 --- token.h := 1078 #ifndef TOKEN_H 1079 #define TOKEN_H 1080 #include <stdint.h> // We use this for int64_t 1081 #include "hash_table.h" // We need this for the string table 1082 @{Token Types} 1083 @{Opaque Token Type} 1084 @{Token Creation and Destruction Interface} 1085 @{Token Interface} 1086 extern hash_table_t *string_table; 1087 extern int column; 1088 extern int line; 1089 #endif 1090 --- 1091 1092 @s 1093 Finally, we implement the token data structure in `token.c`. 1094 --- token.c 1095 #include <stdlib.h> 1096 #include <string.h> 1097 #include <stdio.h> 1098 #include <assert.h> 1099 #include <ctype.h> 1100 #include "token.h" 1101 #include "hash_table.h" 1102 #include "util.h" 1103 @{Token Data Structure} 1104 @{Token Data Access} 1105 @{Token Creation and Destruction} 1106 @{Token Create String} 1107 @{Token Debugging} 1108 --- 1109 1110 @s Input 1111 1112 Input will provide a simple interface for reading characters from a file. The stream itself is deliberately hidden from the tokenizer, so that the tokenizer doesn't have to worry about buffering or anything like that. 1113 1114 @s 1115 --- Input Interface 1116 void input_init(const char *filename); 1117 int input_getc(void); 1118 void input_ungetc(int c); 1119 void input_destroy(void); 1120 --- 1121 1122 When the program wants to start reading a file, it calls `input_init` with the filename. It can then call `input_getc` to get the next character in the file. If there's no more input, `input_getc` will return `EOF`. 1123 1124 There's also an `input_ungetc` function, which allows the program to put a character back into the stream. I'll only allow one character to be put back, but that should be enough for the tokenizer. 1125 1126 Finally, when the program is done reading the file, it should call `input_destroy` to clean up. 1127 1128 @s Input Design Decisions 1129 1130 Per rule 1, we're trying to keep memory usage low. That means that instead of reading the entire file into memory, we'll need to read it in chunks. There are a couple of choices for how to do this: 1131 1132 1. **Read a line at a time**: This is a more natural approach, but it has two drawbacks. First, it requires a large buffer to store the line (C normally specifies `BUFSIZ` as 8192 bytes). Second, if the line is longer than `BUFSIZ`, we'll have to read the line in chunks anyway. 1133 1134 2. **Choose some arbitrary buffer size and read that many bytes at a time**: This is the approach I'm going to take. It's a little less natural, but it's more memory efficient. 1135 1136 Input will read chunks of 128 bytes at a time, reusing the same static buffer. This limitation is not visible to the tokenizer, which will only see the `input_getc` interface. 1137 1138 When the buffer is exhausted, `input_getc` will call `nextline`, which will read the next chunk of the file. 1139 1140 @s Input Implementation 1141 1142 The implementation of the input module is pretty straightforward. We have the following data structures and defines as globals: 1143 1144 @s 1145 --- Input Data 1146 #define CHUNK_SIZE 128 1147 static char buffer[CHUNK_SIZE]; 1148 static int buffer_pos = 0; 1149 static int buffer_size = 0; 1150 static char unget_buffer_stack[8]; 1151 static int unget_buffer_stack_pos = 0; 1152 1153 static FILE *file = NULL; 1154 --- 1155 1156 When the program calls `input_init`, we open the file. 1157 1158 @s 1159 --- Input Initialization 1160 void input_init(const char *filename) { 1161 file = fopen(filename, "r"); 1162 if (file == NULL) { 1163 fprintf(stderr, "Error: Cannot open file %s\n", filename); 1164 exit(1); 1165 } 1166 } 1167 --- 1168 1169 When the program calls `input_getc`, we return the next character in the buffer. If the buffer is exhausted, we call `nextline`. We also track the line and column. 1170 1171 @s 1172 --- Input Get Character 1173 int input_getc(void) { 1174 if (unget_buffer_stack_pos > 0) { 1175 return unget_buffer_stack[--unget_buffer_stack_pos]; 1176 } 1177 if (buffer_pos == buffer_size) { 1178 buffer_size = fread(buffer, 1, CHUNK_SIZE, file); 1179 buffer_pos = 0; 1180 } 1181 if (buffer_size == 0) { 1182 return EOF; 1183 } 1184 char c = buffer[buffer_pos++]; 1185 return c; 1186 } 1187 --- 1188 1189 When the program calls `input_ungetc`, we save the character in the `unget_buffer`. 1190 1191 @s 1192 --- Input Unget Character 1193 void input_ungetc(int c) { 1194 unget_buffer_stack[unget_buffer_stack_pos++] = c; 1195 } 1196 --- 1197 1198 Since we're not using dynamic memory allocation, cleanup is pretty simple. 1199 1200 @s 1201 --- Input Destroy 1202 void input_destroy(void) { 1203 fclose(file); 1204 } 1205 --- 1206 1207 @s 1208 We put the whole thing together in `input.c`. 1209 --- input.c 1210 #include <stdio.h> 1211 #include <stdlib.h> 1212 #include "input.h" 1213 @{Input Data} 1214 @{Input Initialization} 1215 @{Input Get Character} 1216 @{Input Unget Character} 1217 @{Input Destroy} 1218 --- 1219 1220 @s 1221 We'll need an external declaration for `file` in `input.h` so other programs can take advantage of it. 1222 --- input.h 1223 #ifndef INPUT_H 1224 #define INPUT_H 1225 @{Input Interface} 1226 #endif 1227 --- 1228 1229 @s 1230 We'll implement the lexer interface in `tokenizer.h` 1231 --- tokenizer.h 1232 #ifndef TOKENIZER_H 1233 #define TOKENIZER_H 1234 #include "token.h" 1235 #include "input.h" 1236 @{Tokenization Interface} 1237 #endif 1238 --- 1239 1240 @s 1241 The tokenization interface will have a couple of functions. `next_token` will return the next token in the input stream, `init_tokenizer` will initialize the tokenizer, and `destroy_tokenizer` will clean up. 1242 1243 We'll also have some helper functions for lookahead and matching. 1244 1245 'peek_token' will return the next token without consuming it (it technically does advance the input stream, but it saves the token so it can be reused). 1246 1247 'consume' will consume the next token if it matches a given kind. If it doesn't match, it will print an error message and exit. 1248 --- Tokenization Interface 1249 void init_tokenizer(const char *filename); 1250 void destroy_tokenizer(void); 1251 token_t *next_token(void); 1252 void reject_token(token_t *token); 1253 token_t *peek_token(void); 1254 void consume(c_token_types kind); 1255 void consume_alt(c_token_types *kinds, int n); 1256 --- 1257 1258 @s 1259 Now we can finally implement the tokenizer. A stack for storing tokens for lookahead is defined, as is a hash table, modified by the parser, for typedefs. 1260 --- tokenizer.c 1261 #include <ctype.h> 1262 #include <errno.h> 1263 #include <float.h> 1264 #include <math.h> 1265 #include <stdarg.h> 1266 #include <stdint.h> 1267 #include <stdio.h> 1268 #include <stdlib.h> 1269 #include <string.h> 1270 #include "tokenizer.h" 1271 #include "token.h" 1272 #include "input.h" 1273 #include "util.h" 1274 token_t *left_stack[8]; 1275 int left_stack_pos = 0; 1276 hash_table_t *typedef_table = NULL; 1277 @{Utility Functions} 1278 @{Tokenization Function} 1279 --- 1280 1281 @s 1282 Utility functions are everything that doesn't directly tokenize the input. 1283 --- Utility Functions 1284 void init_tokenizer(const char *filename) { 1285 input_init(filename); 1286 typedef_table = hash_table_create(16, cmp_string, hash_string, dtor_string); 1287 } 1288 1289 void destroy_tokenizer(void) { 1290 input_destroy(); 1291 } 1292 1293 void reject_token(token_t *token) { 1294 left_stack[left_stack_pos++] = token; 1295 } 1296 1297 token_t *peek_token(void) { 1298 if (left_stack_pos > 0) { 1299 return left_stack[left_stack_pos - 1]; 1300 } 1301 token_t *token = next_token(); 1302 reject_token(token); 1303 return token; 1304 } 1305 1306 @{Stringify Type} 1307 1308 void consume(c_token_types kind) { 1309 token_t *token = next_token(); 1310 if (token_type(token) != kind) { 1311 fprintf(stderr, "Error: Expected token of type \"%s\", got \"%s\"\n", stringify_type(kind), stringify_type(token_type(token))); 1312 exit(1); 1313 } 1314 token_destroy(token); 1315 } 1316 1317 void consume_alt(c_token_types *kinds, int n) { 1318 token_t *token = next_token(); 1319 for (int i = 0; i < n; i++) { 1320 if (token_type(token) == kinds[i]) { 1321 token_destroy(token); 1322 return; 1323 } 1324 } 1325 fprintf(stderr, "Error: Expected one of the following tokens: "); 1326 for (int i = 0; i < n; i++) { 1327 fprintf(stderr, "\"%s\" ", stringify_type(kinds[i])); 1328 } 1329 fprintf(stderr, "got \"%s\"\n", stringify_type(token_type(token))); 1330 exit(1); 1331 } 1332 --- 1333 1334 @s 1335 We'll need a helper function to convert token types to strings. It's pretty simple, just tedious. 1336 --- Stringify Type 1337 const char *stringify_type(c_token_types type) { 1338 switch (type) { 1339 case TOK_IF: 1340 return "if"; 1341 case TOK_ELSE: 1342 return "else"; 1343 case TOK_SWITCH: 1344 return "switch"; 1345 case TOK_CASE: 1346 return "case"; 1347 case TOK_DEFAULT: 1348 return "default"; 1349 case TOK_WHILE: 1350 return "while"; 1351 case TOK_DO: 1352 return "do"; 1353 case TOK_FOR: 1354 return "for"; 1355 case TOK_CONTINUE: 1356 return "continue"; 1357 case TOK_BREAK: 1358 return "break"; 1359 case TOK_RETURN: 1360 return "return"; 1361 case TOK_GOTO: 1362 return "goto"; 1363 case TOK_VOID: 1364 return "void"; 1365 case TOK_CHAR: 1366 return "char"; 1367 case TOK_SHORT: 1368 return "short"; 1369 case TOK_INT: 1370 return "int"; 1371 case TOK_LONG: 1372 return "long"; 1373 case TOK_FLOAT: 1374 return "float"; 1375 case TOK_DOUBLE: 1376 return "double"; 1377 case TOK_SIGNED: 1378 return "signed"; 1379 case TOK_UNSIGNED: 1380 return "unsigned"; 1381 case TOK_STRUCT: 1382 return "struct"; 1383 case TOK_UNION: 1384 return "union"; 1385 case TOK_ENUM: 1386 return "enum"; 1387 case TOK_TYPEDEF: 1388 return "typedef"; 1389 case TOK_AUTO: 1390 return "auto"; 1391 case TOK_REGISTER: 1392 return "register"; 1393 case TOK_STATIC: 1394 return "static"; 1395 case TOK_EXTERN: 1396 return "extern"; 1397 case TOK_CONST: 1398 return "const"; 1399 case TOK_VOLATILE: 1400 return "volatile"; 1401 case TOK_SIZEOF: 1402 return "sizeof"; 1403 case TOK_ADD: 1404 return "+"; 1405 case TOK_SUB: 1406 return "-"; 1407 case TOK_MUL: 1408 return "*"; 1409 case TOK_DIV: 1410 return "/"; 1411 case TOK_MOD: 1412 return "%"; 1413 case TOK_BIT_AND: 1414 return "&"; 1415 case TOK_BIT_OR: 1416 return "|"; 1417 case TOK_BIT_XOR: 1418 return "^"; 1419 case TOK_BIT_NOT: 1420 return "~"; 1421 case TOK_LSHIFT: 1422 return "<<"; 1423 case TOK_RSHIFT: 1424 return ">>"; 1425 case TOK_NOT: 1426 return "!"; 1427 case TOK_ASSIGN: 1428 return "="; 1429 case TOK_LT: 1430 return "<"; 1431 case TOK_GT: 1432 return ">"; 1433 case TOK_INC: 1434 return "++"; 1435 case TOK_DEC: 1436 return "--"; 1437 case TOK_EQ: 1438 return "=="; 1439 case TOK_NE: 1440 return "!="; 1441 case TOK_LE: 1442 return "<="; 1443 case TOK_GE: 1444 return ">="; 1445 case TOK_AND: 1446 return "&&"; 1447 case TOK_OR: 1448 return "||"; 1449 case TOK_MEMBER_POINTER: 1450 return "->"; 1451 case TOK_MEMBER: 1452 return "."; 1453 case TOK_COND_DECISION: 1454 return ":"; 1455 case TOK_COND: 1456 return "?"; 1457 case TOK_ASSIGN_ADD: 1458 return "+="; 1459 case TOK_ASSIGN_SUB: 1460 return "-="; 1461 case TOK_ASSIGN_MUL: 1462 return "*="; 1463 case TOK_ASSIGN_DIV: 1464 return "/="; 1465 case TOK_ASSIGN_MOD: 1466 return "%="; 1467 case TOK_ASSIGN_BITAND: 1468 return "&="; 1469 case TOK_ASSIGN_BITOR: 1470 return "|="; 1471 case TOK_ASSIGN_BITXOR: 1472 return "^="; 1473 case TOK_ASSIGN_LSHIFT: 1474 return "<<="; 1475 case TOK_ASSIGN_RSHIFT: 1476 return ">>="; 1477 case TOK_HASH: 1478 return "#"; 1479 case TOK_ID: 1480 return "identifier"; 1481 case TOK_TYPEDEF_NAME: 1482 return "typedef name"; 1483 case TOK_INTEGER_U32: 1484 case TOK_INTEGER_U64: 1485 case TOK_INTEGER_S32: 1486 case TOK_INTEGER_S64: 1487 return "integer constant"; 1488 case TOK_FLOAT_32: 1489 case TOK_FLOAT_64: 1490 return "floating constant"; 1491 case TOK_CHAR_CONST: 1492 return "character constant"; 1493 case TOK_STRING_ASCII: 1494 return "string constant"; 1495 case TOK_EOF: 1496 return "EOF"; 1497 case TOK_ERROR: 1498 return "error"; 1499 case TOK_LEFT_PAREN: 1500 return "("; 1501 case TOK_RIGHT_PAREN: 1502 return ")"; 1503 case TOK_LEFT_BRACKET: 1504 return "["; 1505 case TOK_RIGHT_BRACKET: 1506 return "]"; 1507 case TOK_LEFT_BRACE: 1508 return "{"; 1509 case TOK_RIGHT_BRACE: 1510 return "}"; 1511 case TOK_COMMA: 1512 return ","; 1513 case TOK_SEMICOLON: 1514 return ";"; 1515 case TOK_DOT: 1516 return "."; 1517 case TOK_ELLIPSIS: 1518 return "..."; 1519 } 1520 return "UNKNOWN"; 1521 } 1522 --- 1523 1524 @s 1525 Now we can implement the tokenization function. The pattern is pretty simple: we call each of the tokenization functions in turn until we find a match. If we don't find a match, we print an error message and exit. 1526 You might wonder why skip_whitespace can return a token. This makes handling the divide operator easier as comments also start with a slash. 1527 --- Tokenization Function 1528 char file_name[1024]; 1529 @{Warning/Error Functions} 1530 @{Skip Whitespace} 1531 @{Tokenize Identifier} 1532 @{Tokenize Number} 1533 @{Tokenize String} 1534 @{Tokenize Character} 1535 @{Tokenize Operator} 1536 token_t *next_token(void) { 1537 if (left_stack_pos > 0) { 1538 return left_stack[--left_stack_pos]; 1539 } 1540 token_t *tok = skip_whitespace(); 1541 if (tok != NULL) { 1542 return tok; 1543 } 1544 tok = read_identifier(); 1545 if (tok != NULL) { 1546 return tok; 1547 } 1548 tok = read_number(); 1549 if (tok != NULL) { 1550 return tok; 1551 } 1552 tok = read_char_constant(); 1553 if (tok != NULL) { 1554 return tok; 1555 } 1556 tok = read_string_literal(); 1557 if (tok != NULL) { 1558 return tok; 1559 } 1560 tok = read_operator(); 1561 if (tok != NULL) { 1562 return tok; 1563 } 1564 int c = input_getc(); 1565 if (c == EOF) { 1566 return NULL; 1567 } 1568 tok_warn( 1569 "Warning: Ignoring unexpected character '%c' at line %d, column %d\n", c, 1570 line, column); 1571 return next_token(); 1572 } 1573 1574 #ifdef TEST_TOKENIZER 1575 @{Run Test} 1576 #endif 1577 --- 1578 1579 @s 1580 We'll need a couple of helper functions to skip whitespace and print warnings/errors. 1581 --- Warning/Error Functions 1582 void tok_error(const char *fmt, ...) { 1583 va_list args; 1584 va_start(args, fmt); 1585 fprintf(stderr, "Error in file %s at line %d, column %d: ", file_name, line, 1586 column); 1587 vfprintf(stderr, fmt, args); 1588 va_end(args); 1589 } 1590 1591 void tok_warn(const char *fmt, ...) { 1592 va_list args; 1593 va_start(args, fmt); 1594 fprintf(stderr, "Warning in file %s at line %d, column %d: ", file_name, line, 1595 column); 1596 vfprintf(stderr, fmt, args); 1597 va_end(args); 1598 } 1599 --- 1600 1601 @s 1602 The `skip_whitespace` function is pretty simple. It just skips over any comments, whitespace, and line directives. 1603 --- Skip Whitespace 1604 static token_t *skip_whitespace(void) { 1605 int c; 1606 while ((c = input_getc()) != EOF) { 1607 if (isspace(c)) { // Whitespace 1608 if (c == '\n') { 1609 line++; 1610 column = 1; 1611 } else { 1612 column++; 1613 } 1614 } else if (c == '#') // GCC preprocessor line control directive. 1615 { 1616 char buf[512]; 1617 int i = 0; 1618 while ((c = input_getc()) != EOF && c != '\n') { 1619 buf[i++] = c; 1620 column++; 1621 } 1622 buf[i] = '\0'; 1623 if (sscanf(buf, "%d \"%[^\"]\"", &line, file_name) == 2) { 1624 column = 1; 1625 } else { 1626 tok_error("Invalid #line directive\n"); 1627 } 1628 if (c == EOF) { 1629 return NULL; 1630 } 1631 } else if (c == '/') { // Comment 1632 c = input_getc(); 1633 if (c == '/') { 1634 while ((c = input_getc()) != EOF && c != '\n') { 1635 column++; 1636 } 1637 if (c == EOF) { 1638 return NULL; 1639 } 1640 line++; 1641 column = 1; 1642 } else if (c == '*') { // Multiline comment 1643 while ((c = input_getc()) != EOF) { 1644 if (c == '*') { 1645 c = input_getc(); 1646 if (c == '/') { 1647 break; 1648 } 1649 } else if (c == '\n') { 1650 line++; 1651 column = 1; 1652 } else { 1653 column++; 1654 } 1655 } 1656 if (c == EOF) { 1657 return NULL; 1658 } 1659 } else { // Handled here to simplify the code. 1660 if (c == '=') 1661 return token_create(TOK_ASSIGN_DIV, line, column, 2); 1662 input_ungetc(c); 1663 return token_create(TOK_DIV, line, column, 1); 1664 } 1665 } else { 1666 input_ungetc(c); 1667 return NULL; 1668 } 1669 } 1670 return NULL; 1671 } 1672 --- 1673 1674 @s 1675 The `read_identifier` function reads an identifier from the input stream. C identifiers can contain letters, digits, and underscores, but they can't start with a digit. 1676 --- Tokenize Identifier 1677 @{Get Keyword} 1678 static token_t *read_identifier(void) { 1679 int c; 1680 char buf[1024]; 1681 int i = 0; 1682 c = input_getc(); 1683 if (!isalpha(c) && c != '_') { 1684 input_ungetc(c); 1685 return NULL; 1686 } 1687 buf[i++] = c; 1688 while ((c = input_getc()) != EOF) { 1689 if (!isalnum(c) && c != '_') { 1690 input_ungetc(c); 1691 break; 1692 } 1693 buf[i++] = c; 1694 if (i >= 1008) { 1695 tok_error("Identifier too long\n"); 1696 exit(1); 1697 } 1698 } 1699 buf[i] = '\0'; 1700 column += i; 1701 // Check if it's a keyword 1702 c_token_types kind = get_keyword(buf, i); 1703 if (kind != TOK_ID) { 1704 return token_create(kind, line, column, i); 1705 } 1706 // Check if it's a typedef 1707 if (hash_table_get(typedef_table, buf) != NULL) { 1708 return token_create(TOK_TYPEDEF_NAME, line, column, i); 1709 } 1710 return token_create_string(kind, line, column, buf, i); 1711 } 1712 --- 1713 1714 @s 1715 The `get_keyword` function is a simple decision tree for identifying keywords. The code is pretty tedious, but it works. 1716 --- Get Keyword 1717 c_token_types get_keyword(const char *buf, int len) { 1718 switch (buf[0]) { 1719 case 'a': 1720 if (len == 4 && buf[1] == 'u' && buf[2] == 't' && buf[3] == 'o') 1721 return TOK_AUTO; 1722 break; 1723 1724 case 'b': 1725 if (len == 5 && buf[1] == 'r' && buf[2] == 'e' && buf[3] == 'a' && 1726 buf[4] == 'k') 1727 return TOK_BREAK; 1728 break; 1729 1730 case 'c': 1731 switch (buf[1]) { 1732 case 'a': 1733 if (len == 4 && buf[2] == 's' && buf[3] == 'e') 1734 return TOK_CASE; 1735 break; 1736 case 'h': 1737 if (len == 4 && buf[2] == 'a' && buf[3] == 'r') 1738 return TOK_CHAR; 1739 break; 1740 case 'o': 1741 if (len == 5 && buf[2] == 'n' && buf[3] == 's' && buf[4] == 't') 1742 return TOK_CONST; 1743 if (len == 8 && buf[2] == 'n' && buf[3] == 't' && buf[4] == 'i' && 1744 buf[5] == 'n' && buf[6] == 'u' && buf[7] == 'e') 1745 return TOK_CONTINUE; 1746 break; 1747 } 1748 break; 1749 1750 case 'd': 1751 switch (buf[1]) { 1752 case 'e': 1753 if (len == 7 && buf[2] == 'f' && buf[3] == 'a' && buf[4] == 'u' && 1754 buf[5] == 'l' && buf[6] == 't') 1755 return TOK_DEFAULT; 1756 break; 1757 case 'o': 1758 if (len == 2 && buf[2] == '\0') 1759 return TOK_DO; 1760 if (len == 6 && buf[2] == 'u' && buf[3] == 'b' && buf[4] == 'l' && 1761 buf[5] == 'e') 1762 return TOK_DOUBLE; 1763 break; 1764 } 1765 break; 1766 1767 case 'e': 1768 switch (buf[1]) { 1769 case 'l': 1770 if (len == 4 && buf[2] == 's' && buf[3] == 'e') 1771 return TOK_ELSE; 1772 break; 1773 case 'n': 1774 if (len == 4 && buf[2] == 'u' && buf[3] == 'm') 1775 return TOK_ENUM; 1776 break; 1777 case 'x': 1778 if (len == 6 && buf[2] == 't' && buf[3] == 'e' && buf[4] == 'r' && 1779 buf[5] == 'n') 1780 return TOK_EXTERN; 1781 break; 1782 } 1783 break; 1784 1785 case 'f': 1786 switch (buf[1]) { 1787 case 'l': 1788 if (len == 5 && buf[2] == 'o' && buf[3] == 'a' && buf[4] == 't') 1789 return TOK_FLOAT; 1790 break; 1791 case 'o': 1792 if (len == 3 && buf[2] == 'r') 1793 return TOK_FOR; 1794 break; 1795 } 1796 break; 1797 1798 case 'g': 1799 if (len == 4 && buf[1] == 'o' && buf[2] == 't' && buf[3] == 'o') 1800 return TOK_GOTO; 1801 break; 1802 1803 case 'i': 1804 switch (buf[1]) { 1805 case 'f': 1806 if (len == 2 && buf[2] == '\0') 1807 return TOK_IF; 1808 break; 1809 case 'n': 1810 if (len == 3 && buf[2] == 't') 1811 return TOK_INT; 1812 break; 1813 } 1814 break; 1815 1816 case 'l': 1817 if (len == 4 && buf[1] == 'o' && buf[2] == 'n' && buf[3] == 'g') 1818 return TOK_LONG; 1819 break; 1820 1821 case 'r': 1822 switch (buf[1]) { 1823 case 'e': 1824 if (len == 8 && buf[2] == 'g' && buf[3] == 'i' && buf[4] == 's' && 1825 buf[5] == 't' && buf[6] == 'e' && buf[7] == 'r') 1826 return TOK_REGISTER; 1827 if (len == 6 && buf[2] == 't' && buf[3] == 'u' && buf[4] == 'r' && 1828 buf[5] == 'n') 1829 return TOK_RETURN; 1830 break; 1831 } 1832 break; 1833 1834 case 's': 1835 switch (buf[1]) { 1836 case 'h': 1837 if (len == 5 && buf[2] == 'o' && buf[3] == 'r' && buf[4] == 't') 1838 return TOK_SHORT; 1839 break; 1840 case 't': 1841 if (len == 6 && buf[2] == 'a' && buf[3] == 't' && buf[4] == 'i' && 1842 buf[5] == 'c') 1843 return TOK_STATIC; 1844 break; 1845 case 'i': 1846 if (len == 6 && buf[2] == 'g' && buf[3] == 'n' && buf[4] == 'e' && 1847 buf[5] == 'd') 1848 return TOK_SIGNED; 1849 if (len == 6 && buf[2] == 'z' && buf[3] == 'e' && buf[4] == 'o' && 1850 buf[5] == 'f') 1851 return TOK_SIZEOF; 1852 break; 1853 case 'r': 1854 if (len == 6 && buf[2] == 'u' && buf[3] == 'c' && buf[4] == 't') 1855 return TOK_STRUCT; 1856 break; 1857 case 'w': 1858 if (len == 6 && buf[2] == 'i' && buf[3] == 't' && buf[4] == 'c' && 1859 buf[5] == 'h') 1860 return TOK_SWITCH; 1861 break; 1862 } 1863 break; 1864 1865 case 't': 1866 if (len == 7 && buf[1] == 'y' && buf[2] == 'p' && buf[3] == 'e' && 1867 buf[4] == 'd' && buf[5] == 'e' && buf[6] == 'f') 1868 return TOK_TYPEDEF; 1869 break; 1870 1871 case 'u': 1872 switch (buf[1]) { 1873 case 'n': 1874 if (len == 5 && buf[2] == 'i' && buf[3] == 'o' && buf[4] == 'n') 1875 return TOK_UNION; 1876 if (len == 8 && buf[2] == 's' && buf[3] == 'i' && buf[4] == 'g' && 1877 buf[5] == 'n' && buf[6] == 'e' && buf[7] == 'd') 1878 return TOK_UNSIGNED; 1879 break; 1880 } 1881 break; 1882 1883 case 'v': 1884 switch (buf[1]) { 1885 case 'o': 1886 if (len == 4 && buf[2] == 'i' && buf[3] == 'd') 1887 return TOK_VOID; 1888 if (len == 8 && buf[2] == 'l' && buf[3] == 'a' && buf[4] == 't' && 1889 buf[5] == 'i' && buf[6] == 'l' && buf[7] == 'e') 1890 return TOK_VOLATILE; 1891 break; 1892 } 1893 break; 1894 1895 case 'w': 1896 if (len == 5 && buf[1] == 'h' && buf[2] == 'i' && buf[3] == 'l' && 1897 buf[4] == 'e') 1898 return TOK_WHILE; 1899 break; 1900 1901 default: 1902 return TOK_ID; 1903 } 1904 return TOK_ID; 1905 } 1906 --- 1907 1908 @s 1909 The `read_operator` function works similarly to the `read_identifier` function. It uses a decision tree to identify operators. 1910 --- Tokenize Operator 1911 1912 token_t *read_operator(void) { 1913 int c; 1914 c = input_getc(); 1915 switch (c) { 1916 case '!': { 1917 c = input_getc(); 1918 if (c == '=') 1919 return token_create(TOK_NE, line, column, 2); 1920 input_ungetc(c); 1921 return token_create(TOK_NOT, line, column, 1); 1922 } 1923 case '%': { 1924 c = input_getc(); 1925 if (c == '=') 1926 return token_create(TOK_ASSIGN_MOD, line, column, 2); 1927 input_ungetc(c); 1928 return token_create(TOK_MOD, line, column, 1); 1929 } 1930 case '&': { 1931 c = input_getc(); 1932 if (c == '&') 1933 return token_create(TOK_AND, line, column, 2); 1934 if (c == '=') 1935 return token_create(TOK_ASSIGN_BITAND, line, column, 2); 1936 input_ungetc(c); 1937 return token_create(TOK_BIT_AND, line, column, 1); 1938 } 1939 case '(': 1940 return token_create(TOK_LEFT_PAREN, line, column, 1); 1941 case ')': 1942 return token_create(TOK_RIGHT_PAREN, line, column, 1); 1943 case '*': { 1944 c = input_getc(); 1945 if (c == '=') 1946 return token_create(TOK_ASSIGN_MUL, line, column, 2); 1947 input_ungetc(c); 1948 return token_create(TOK_MUL, line, column, 1); 1949 } 1950 case '+': { 1951 c = input_getc(); 1952 if (c == '+') 1953 return token_create(TOK_INC, line, column, 2); 1954 if (c == '=') 1955 return token_create(TOK_ASSIGN_ADD, line, column, 2); 1956 input_ungetc(c); 1957 return token_create(TOK_ADD, line, column, 2); 1958 } 1959 case ',': 1960 return token_create(TOK_COMMA, line, column, 1); 1961 case '-': { 1962 c = input_getc(); 1963 if (c == '-') 1964 return token_create(TOK_DEC, line, column, 2); 1965 if (c == '=') 1966 return token_create(TOK_ASSIGN_SUB, line, column, 2); 1967 if (c == '>') 1968 return token_create(TOK_MEMBER_POINTER, line, column, 2); 1969 input_ungetc(c); 1970 return token_create(TOK_SUB, line, column, 1); 1971 } 1972 case '.': { 1973 c = input_getc(); 1974 if (c == '.') { 1975 c = input_getc(); 1976 if (c == '.') { 1977 return token_create(TOK_ELLIPSIS, line, column, 3); 1978 } else { 1979 // Bail out, can't store more than one unget 1980 tok_error("Unexpected character '.' at line %d, column %d\n", line, 1981 column); 1982 exit(1); 1983 } 1984 } 1985 return token_create('.', line, column, 1); 1986 } 1987 case '/': { 1988 c = input_getc(); 1989 if (c == '=') 1990 return token_create(TOK_ASSIGN_DIV, line, column, 2); 1991 input_ungetc(c); 1992 return token_create(TOK_DIV, line, column, 1); 1993 } 1994 case ':': 1995 return token_create(TOK_COND_DECISION, line, column, 1); 1996 case ';': 1997 return token_create(TOK_SEMICOLON, line, column, 1); 1998 case '<': { 1999 c = input_getc(); 2000 if (c == '<') { 2001 c = input_getc(); 2002 if (c == '=') 2003 return token_create(TOK_ASSIGN_LSHIFT, line, column, 3); 2004 input_ungetc(c); 2005 return token_create(TOK_LSHIFT, line, column, 2); 2006 } 2007 if (c == '=') 2008 return token_create(TOK_LE, line, column, 2); 2009 input_ungetc(c); 2010 return token_create(TOK_LT, line, column, 1); 2011 } 2012 case '=': { 2013 c = input_getc(); 2014 if (c == '=') 2015 return token_create(TOK_ASSIGN, line, column, 2); 2016 input_ungetc(c); 2017 return token_create(TOK_ASSIGN, line, column, 1); 2018 } 2019 case '>': { 2020 c = input_getc(); 2021 if (c == '>') { 2022 c = input_getc(); 2023 if (c == '=') 2024 return token_create(TOK_ASSIGN_RSHIFT, line, column, 3); 2025 input_ungetc(c); 2026 return token_create(TOK_RSHIFT, line, column, 2); 2027 } 2028 if (c == '=') 2029 return token_create(TOK_GE, line, column, 2); 2030 input_ungetc(c); 2031 return token_create(TOK_GT, line, column, 1); 2032 } 2033 case '?': 2034 return token_create(TOK_COND, line, column, 1); 2035 case '[': 2036 return token_create(TOK_LEFT_BRACKET, line, column, 1); 2037 case ']': 2038 return token_create(TOK_RIGHT_BRACKET, line, column, 1); 2039 case '^': { 2040 c = input_getc(); 2041 if (c == '=') 2042 return token_create(TOK_ASSIGN_BITXOR, line, column, 2); 2043 input_ungetc(c); 2044 return token_create(TOK_BIT_XOR, line, column, 1); 2045 } 2046 case '{': 2047 return token_create(TOK_LEFT_BRACE, line, column, 1); 2048 case '|': { 2049 c = input_getc(); 2050 if (c == '|') 2051 return token_create(TOK_OR, line, column, 2); 2052 if (c == '=') 2053 return token_create(TOK_ASSIGN_BITOR, line, column, 2); 2054 input_ungetc(c); 2055 return token_create(TOK_BIT_OR, line, column, 1); 2056 } 2057 case '}': 2058 return token_create(TOK_RIGHT_BRACE, line, column, 1); 2059 case '~': 2060 return token_create(TOK_BIT_NOT, line, column, 1); 2061 default: 2062 input_ungetc(c); 2063 return NULL; 2064 } 2065 2066 return NULL; 2067 } 2068 --- 2069 2070 @s 2071 The `read_number` function reads a number from the input stream. It can be an integer or a floating-point number. 2072 2073 I've broken it up a bit to make it easier to read. 2074 2075 --- Tokenize Number 2076 static token_t *read_number(void) { 2077 int c; 2078 char buf[1024]; 2079 int i = 0; 2080 c = input_getc(); 2081 @{Check for valid prefix} 2082 int radix = 10; 2083 @{Process Radix} 2084 int is_float = 0; 2085 @{Read Number Loop} 2086 buf[i] = '\0'; 2087 @{Process Suffixes} 2088 @{Check for conflicting suffixes} 2089 if (is_float) { 2090 @{Convert to float} 2091 } else { 2092 @{Convert to integer} 2093 } 2094 return NULL; 2095 } 2096 --- 2097 2098 @s 2099 To determine if a character is a valid prefix for a number, we need to check if it's a digit or a period followed by a digit 2100 --- Check for valid prefix 2101 // If we don't have a digit or decimal point, it's not a number 2102 if (!isdigit(c) && c != '.') { 2103 input_ungetc(c); 2104 return NULL; 2105 } 2106 // Decimal point followed by non-digit is a struct member 2107 if (c == '.') { 2108 char cnext = input_getc(); 2109 if (!isdigit(cnext)) { 2110 input_ungetc(cnext); 2111 return token_create(TOK_MEMBER, line, column, 1); 2112 } 2113 input_ungetc(cnext); 2114 } 2115 --- 2116 2117 @s 2118 A C constant starting with a zero is either an octal or hexadecimal constant. We need to check the next character to determine which one it is. 2119 --- Process Radix 2120 // Check for hex and octal. 2121 if (c == '0') { 2122 char cnext = input_getc(); 2123 if (cnext == 'x' || cnext == 'X') { 2124 // Hex, discard the 0x 2125 radix = 16; 2126 } else { 2127 // Octal, discard the 0 2128 input_ungetc(cnext); 2129 radix = 8; 2130 } 2131 } else { 2132 // Decimal, append the first digit 2133 buf[i++] = c; 2134 } 2135 --- 2136 2137 @s 2138 2139 --- Read Number Loop 2140 while ((c = input_getc()) != EOF) { 2141 // Since there can be multiple writes to the buffer, we want to make sure we 2142 // don't overflow by giving a 4 byte pad 2143 if (i > 1020) { 2144 tok_error("Number too long\n"); 2145 return NULL; 2146 } 2147 // Valid digits for the radix: 0-9 for decimal, 0-7 for octal, 0-9 and 2148 // a-f/A-F for hex 2149 if ((radix == 10 && isdigit(c)) || (radix == 16 && isxdigit(c)) || 2150 (radix == 8 && c >= '0' && c <= '7')) { 2151 buf[i++] = c; 2152 // Decimal point and not a float yet, must be a float 2153 } else if (c == '.' && !is_float) { 2154 is_float = 1; 2155 if (radix != 10) { 2156 tok_error("Invalid floating point constant, expected decimal, got %s\n", 2157 radix == 16 ? "hexadecimal" : "octal"); 2158 return NULL; 2159 } 2160 buf[i++] = c; 2161 } 2162 // Exponent on the end of a constant. (By standard this forces it to be a 2163 // float) 2164 else if (c == 'e' || c == 'E') { 2165 buf[i++] = c; 2166 c = input_getc(); 2167 // Sign on the exponent 2168 if (c == '+' || c == '-') { 2169 buf[i++] = c; 2170 c = input_getc(); 2171 } 2172 // Exponent must be a digit, I.E no 1e1.2 2173 if (!isdigit(c)) { 2174 tok_error("Invalid floating point exponent\n"); 2175 return NULL; 2176 } 2177 buf[i++] = c; 2178 is_float = 1; 2179 } else { 2180 // Reached the end, unget the character so other functions can read it 2181 input_ungetc(c); 2182 break; 2183 } 2184 } 2185 --- 2186 2187 @s 2188 C constants can have suffixes to indicate their type. We need to check for these suffixes and set the appropriate flags. 2189 2190 --- Process Suffixes 2191 int is_unsigned = 0; 2192 int is_long = 0; 2193 int is_single = 0; 2194 while (1) { 2195 c = input_getc(); 2196 if (c == 'u' || c == 'U') { 2197 if (is_unsigned) { 2198 tok_warn( 2199 "Warning: Duplicate suffix 'u' for integer constant ignored\n"); 2200 } 2201 is_unsigned = 1; 2202 } else if (c == 'l' || c == 'L') { 2203 if (is_long) { 2204 tok_warn( 2205 "Warning: Duplicate suffix 'l' for integer constant ignored\n"); 2206 } 2207 is_long = 1; 2208 } else if (c == 'f' || c == 'F') { 2209 if (is_single) { 2210 tok_warn("Warning: Duplicate suffix 'f' for floating point constant " 2211 "ignored\n"); 2212 } 2213 is_single = 1; 2214 } else { 2215 input_ungetc(c); 2216 break; 2217 } 2218 } 2219 --- 2220 2221 @s 2222 If we find conflicting suffixes, we print a warning and ignore the suffixes. 2223 --- Check for conflicting suffixes 2224 if (is_single && is_long) { 2225 tok_warn("Warning: Invalid suffixes 'l' and 'f' for floating point " 2226 "constant. Ignoring 'l'\n"); 2227 is_long = 0; 2228 } 2229 if (is_single && is_unsigned) { 2230 tok_warn("Warning: Invalid suffixes 'u' and 'f' for floating point " 2231 "constant. Ignoring 'u'\n"); 2232 is_unsigned = 0; 2233 } 2234 if (is_single && !is_float) { 2235 tok_warn( 2236 "Warning: Invalid suffix 'f' for integer constant. Ignoring 'f'\n"); 2237 is_single = 0; 2238 } 2239 --- 2240 2241 @s 2242 If the string contains a float, we pass it to strtod. We need to make sure that the number is in range for the given type and check for errors from strtod 2243 2244 --- Convert to float 2245 errno = 0; 2246 // Strtod generates a unix-style error when it's given something out of 2247 // range, so we want to get on top of that quickly instead of ignoring it 2248 // That way we can avoid some nasty NAN-propagation in the constant folder. 2249 double f = strtod(buf, NULL); 2250 if (errno == ERANGE) { 2251 tok_error("Floating point constant out of range\n"); 2252 return NULL; 2253 } 2254 // Warn if the constant is out of range for a float, I.E it's too big or too 2255 // small 2256 if (is_single && (f < FLT_MIN || f > FLT_MAX)) { 2257 tok_warn( 2258 "Warning: Floating point constant %f is out of range for float\n", f); 2259 } 2260 // Warn if the constant is too precise for a float 2261 if (is_single && fabs((double)((float)f) - f) >= FLT_EPSILON) { 2262 tok_warn("Warning: Converting double precision floating point constant " 2263 "%f to float loses " 2264 "precision\n", 2265 f); 2266 } 2267 return token_create_float(is_single ? TOK_FLOAT_32 2268 : TOK_FLOAT_64, 2269 line, column, f, i); 2270 --- 2271 2272 @s 2273 If the string contains a number, we pass it to stroll. We need to make sure that the number is in range for the given type and check for errors from strtoll 2274 2275 --- Convert to integer 2276 errno = 0; 2277 uint64_t int_ = strtoull(buf, NULL, radix); 2278 // Same as above, but for integers 2279 if (errno == ERANGE) { 2280 tok_error("Integer constant out of range\n"); 2281 return NULL; 2282 } 2283 if (is_unsigned) { 2284 if (is_long) { 2285 return token_create_int(TOK_INTEGER_U64, line, column, int_, i); 2286 } else { 2287 if (int_ > UINT32_MAX) { 2288 tok_warn( 2289 "Warning: Integer constant %lld is out of range for unsigned " 2290 "int\n", 2291 int_); 2292 } 2293 return token_create_int(TOK_INTEGER_U32, line, column, int_, i); 2294 } 2295 } else { 2296 if (is_long) { 2297 // If the highest bit is set, that means this will overflow a signed 2298 // long (Due to two's complement) 2299 if (int_ & (1UL << 63)) { 2300 tok_warn( 2301 "Warning: Integer constant %lld is out of range for long long\n", 2302 i); 2303 } 2304 return token_create_int(TOK_INTEGER_S64, line, column, int_, i); 2305 } else { 2306 if (int_ & (1UL << 31)) { 2307 tok_warn("Warning: Integer constant %lld is out of range for int\n", 2308 int_); 2309 } 2310 return token_create_int(TOK_INTEGER_S32, line, column, int_, i); 2311 } 2312 } 2313 --- 2314 2315 @s 2316 The `read_char_constant` function reads a character constant from the input stream. It can be a single character or a multi-character escape sequence. 2317 --- Tokenize Character 2318 static token_t *read_char_constant(void) { 2319 int c; 2320 int len = 0; 2321 c = input_getc(); 2322 if (c != '\'') { 2323 input_ungetc(c); 2324 return NULL; 2325 } 2326 len++; 2327 c = input_getc(); 2328 if (c == '\'') { 2329 tok_error("Empty character constant\n"); 2330 return NULL; 2331 } 2332 if (c == '\\') { 2333 c = read_escape_sequence(&len); 2334 } 2335 int val = c; 2336 c = input_getc(); 2337 if (c != '\'') { 2338 tok_error("Expected closing quote for character constant\n"); 2339 return NULL; 2340 } 2341 len++; 2342 return token_create_char(TOK_CHAR_CONST, line, column, val, len); 2343 } 2344 --- 2345 2346 @s 2347 The `read_string_literal` function reads a string literal from the input stream. 2348 2349 For this function, an automatic-lifetime buffer is used to store the string it becomes too large. At that point, a heap-allocated buffer is used. 2350 This way we can avoid unnecessary heap allocations for small strings. 2351 --- Tokenize String 2352 @{Read Escape Sequence} 2353 static token_t *read_string_literal(void) { 2354 int c; 2355 c = input_getc(); 2356 if (c != '"') { 2357 input_ungetc(c); 2358 return NULL; 2359 } 2360 int i = 0; 2361 char s_buf[512]; 2362 char *buf = s_buf; 2363 int len = 512; 2364 int esc_pad = 0; 2365 while ((c = input_getc()) != EOF) { 2366 if (c == '"') { 2367 // Implicit skip of closing quote 2368 break; 2369 } 2370 if (c == '\\') { 2371 c = read_escape_sequence(&esc_pad); 2372 if (c == 0) { 2373 return NULL; 2374 } 2375 } 2376 if (i >= len) { 2377 if (buf == s_buf) { 2378 buf = malloc(1024); 2379 if (buf == NULL) { 2380 fputs("Out of memory. Could not parse string literal.\n", stderr); 2381 exit(1); 2382 } 2383 memcpy(buf, s_buf, 512); 2384 len *= 2; 2385 } else { 2386 len *= 2; 2387 buf = realloc(buf, len); 2388 } 2389 } 2390 buf[i++] = c; 2391 } 2392 buf[i] = '\0'; 2393 if (c == EOF) { 2394 tok_error("Unterminated string literal\n"); 2395 if (buf != s_buf) { 2396 free(buf); 2397 } 2398 return NULL; 2399 } 2400 2401 token_t *tok = token_create_string(TOK_STRING_ASCII, line, column, buf, 2402 i + esc_pad + 2); 2403 if (buf != s_buf) { 2404 free(buf); 2405 } 2406 return tok; 2407 } 2408 --- 2409 2410 @s 2411 Escape sequences in C can either be single characters or octal/hexadecimal values. We need to handle both cases. 2412 --- Read Escape Sequence 2413 static char read_escape_sequence(int *len) { 2414 int c = input_getc(); 2415 *len += 1; 2416 switch (c) { 2417 case 'a': 2418 return '\a'; 2419 case 'b': 2420 return '\b'; 2421 case 'f': 2422 return '\f'; 2423 case 'n': 2424 return '\n'; 2425 case 'r': 2426 return '\r'; 2427 case 't': 2428 return '\t'; 2429 case 'v': 2430 return '\v'; 2431 case '\'': 2432 return '\''; 2433 case '"': 2434 return '"'; 2435 case '?': 2436 return '?'; 2437 case '\\': 2438 return '\\'; 2439 case '0': 2440 return '\0'; 2441 case 'x': { 2442 c = input_getc(); 2443 if (!isxdigit(c)) { 2444 tok_error("Invalid hexadecimal escape sequence\n"); 2445 return 0; 2446 } 2447 int val = 0; 2448 while (isxdigit(c)) { 2449 *len += 1; 2450 val = val * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10); 2451 c = input_getc(); 2452 } 2453 input_ungetc(c); 2454 return (char)val; 2455 } 2456 default: 2457 if (!isdigit(c)) { 2458 tok_error("Invalid escape sequence\n"); 2459 return 0; 2460 } 2461 int val = 0; 2462 while (isdigit(c)) { 2463 *len += 1; 2464 val = val * 8 + c - '0'; 2465 c = input_getc(); 2466 } 2467 input_ungetc(c); 2468 return (char)val; 2469 } 2470 } 2471 --- 2472 2473 @s 2474 Finally, I'll add some code for running the tokenizer as its own program. This way we can test it out. 2475 --- Run Test 2476 char *preprocess(char *in) { 2477 char *output_name = malloc(1024); 2478 snprintf(output_name, 1024, "%s.preprocessed", in); 2479 char *command = malloc(2048); 2480 snprintf(command, 2048, "gcc -E -xc %s -o %s", in, output_name); 2481 system(command); 2482 free(command); 2483 return output_name; 2484 } 2485 2486 // Tokenize the input file 2487 int main(int argc, char **argv) { 2488 if (argc != 2) { 2489 fprintf(stderr, "Usage: %s <input.c>\n", argv[0]); 2490 return 1; 2491 } 2492 char *input_name = argv[1]; 2493 char *preprocessed = preprocess(input_name); 2494 init_tokenizer(preprocessed); 2495 token_t *tok; 2496 while ((tok = next_token()) != NULL) { 2497 print_token(tok); 2498 token_destroy(tok); 2499 } 2500 destroy_tokenizer(); 2501 remove(preprocessed); 2502 free(preprocessed); 2503 hash_table_destroy(string_table); 2504 return 0; 2505 } 2506 --- 2507 2508 ### Bugs/Errata 2509 2510 I wrote this code in a single sitting, so there are bound to be bugs. I'll list them here as I find them. The code you see here is the final version, with all bugs fixed. 2511 2512 * had `buffer_pos == buffer_size - 1`, left in from trying to plug some code for lookahead in, didn't work out, but I forgot to remove it, causes fallthrough to `buffer_size == 0` check which if true returns EOF, preventing input initialization. Fixed by changing to `buffer_pos == buffer_size`. 2513 * assertion `token->kind == TOK_STRING_ASCII` failed in token\_string. Forgot to expand check for identifiers which also use token\_string. Fixed by changing to `token->kind == TOK_STRING_ASCII || token->kind == TOK_ID || token->kind == TOK_TID`. 2514 * token\_create\_string - call to `hash_table_get` with freed key. Fixed by moving the call to free after the call to `hash_table_get`. 2515 * ibid - Design of hash table and call to `hash_table_get` in token\_create\_string created double free. Fixed by rewriting part of function. 2516 * Tokenizer missing code to handle GCC preprocessor line directives. Fixed by adding code to handle them. 2517 * Destructor for string literals missing in tokenizer teardown, added it in. 2518 * read\_number - check `int_ > INT32_MAX` does not work due to some weird casting. Added explicit cast. 2519 * read\_char\_constant - Forgot to handle '\\0'. Added. 2520 * skip\_whitespace - When a division operator occurs in code, skip\_whitespace assumes it's a comment. Fixed by adding a check for division operator. 2521 * hash\_string - Optimization, not a bug, Dragon Book hash function not very fast due to misprediction. Replaced with ELF hash function. 2522 * read\_identifier - strlen gets called 3 times even though we already get the string len by incrementing an array index. Ew. Used i instead of strlen. 2523 * read\_identifier - stringized version of keywords stored, not needed. Code added to call token\_create instead of token\_create\_string for keywords. 2524 * Everywhere - Checks added for memory allocation failure. 2525 * Not a bug. Removed the seperate token type for TID. Will try to handle in the parser. 2526 2527 ### Conclusion 2528 2529 That's it! The C Minus tokenizer is complete. It's hopefully pretty understandable, and given the testing I've put it through, it should be pretty robust. 2530 2531 Next time, we'll start on the parser. 2532 2533 # Source code, biblography 2534 2535 Source code for the tokenizer is available [here](/projects/cminus/code/tokenizer.c), header file is available [here](/projects/cminus/code/tokenizer.h). 2536 2537 Source code for the input module is available [here](/projects/cminus/code/input.c), header file is available [here](/projects/cminus/code/input.h). 2538 2539 Source code for the hash table is available [here](/projects/cminus/code/hash_table.c), header file is available [here](/projects/cminus/code/hash_table.h). 2540 2541 Source code for the token module is available [here](/projects/cminus/code/token.c), header file is available [here](/projects/cminus/code/token.h). 2542 2543 A lot of the logic for this project is from either the Dragon Book, Engineering a Compiler, or LCC: A Retargetable Compiler for ANSI C. Grammars are from The C Reference Manual. 2544 2545 I got the idea for using zero-width arrays for optional content (the struct hack) from hacking on some weird virtio drivers (they seem to love it). 2546 2547 Crafting Interpreters by Bob Nystrom inspired me to do this project, so if you see any similarities, there's probably some unintentional influence there. 2548 2549 The code for the ELF hash function is from the glibc source code. 2550 2551 The idea for decision trees came from LCC. 2552 2553 Literate programming rendered using [literate](https://github.com/zyedidia/Literate). 2554 2555 <footer style=" text-align: center; padding: 20px;"> 2556 <p>© 2024 Reagan Fischer. If for some reason you want to use my AMAZING code (lol), it's available under the MIT 2557 license <a href="/projects/cminus/code/LICENSE.md">here</a>.</p> 2558 </footer>