website

Website contents
git clone git://git.reagancfischer.dev/website.git
Log | Files | Refs

commit 844dc55b3121e79e47834c38ceffcd221d39bb18
parent 8439e14a4356442171bad9768ce804cbb038a33b
Author: Reagan <rfische2@uccs.edu>
Date:   Thu, 22 Aug 2024 12:40:29 -0600

Start of new literate program

Diffstat:
Mprojects/cminus/lexer_new.lit | 557+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
Mprojects/style.css | 13-------------
2 files changed, 543 insertions(+), 27 deletions(-)

diff --git a/projects/cminus/lexer_new.lit b/projects/cminus/lexer_new.lit @@ -21,11 +21,11 @@ The project has a series of pretty typical stages: As far as possible, I'd like to keep each of these stages separate. One benefit of this is that it simplifies memory management greatly. I plan to use an arena allocator for each stage, and by making sure the only thing on the actual heap is the output of the stage, and all temporary data is stored in the arena, I can free all the memory used by a stage by simply freeing the arena. -## Some Rules +@s Some Rules Here are some rules (more like guidelines) that I plan to follow for this project; they're mostly just to keep things simple and consistent. -#### 1\. PROGRAM LIKE IT'S 1999 +1. PROGRAM LIKE IT'S 1999 > 640 KB ought to be enough for anybody. - Bill Gates @@ -35,7 +35,7 @@ This places a lot of constraints on the project, but I think it's a good exercis Some consequences of this are that I'll have to use memory-wise algorithms, be very careful about program structure, and avoid some of the bigger libraries (which will help with making this project self-hosting in the future). -#### 2\. PROGRAM IN C++-- +2. PROGRAM IN C++-- I'm not a big fan of C++, but its class system helps prevent a lot of ugly bugs. To that end, I'm going to try and keep data structures out of header files, and only expose functions that operate on those data structures, to create a sort of approximation of a class. This has a few benefits: @@ -43,44 +43,572 @@ I'm not a big fan of C++, but its class system helps prevent a lot of ugly bugs. * Less chance of bugs. If a function is the only way to interact with a data structure, then it's much harder to misuse that data structure. * Run time type checking. I can include some sort of tag in the first field of every data structure to ensure that the correct functions are being called. -#### 3\. DON'T GET FANCY +3. DON'T GET FANCY My goal here isn't to write the fastest interpreter in the world, or the most complete. I just want to make something that works and can be understood by someone else. That means I'm going to avoid a lot of the tricks that are used in production interpreters, and focus more on simplicity and readability. -#### 4\. DESIGN FOR DEBUGGING +4. DESIGN FOR DEBUGGING This code is going to be peppered with asserts and contain mechanisms to print out the state of the program at any point. This might be painful, but it'll make debugging a lot simpler and let users look under the hood. -#### 5\. SMART DATA, STUPID CODE +5. SMART DATA, STUPID CODE A lot of times, the right data structure can replace 50-100 lines of procedural code. I'm going to try and design data structures which make the algorithms as simple as possible. For example, instead of writing 50-100 lines of code to hold every keyword in the language, I can just use a simple hash table. -#### Misc +@s Misc THIS IS A LITERATE PROGRAM! Go to [this link](https://reagancfischer.dev/lexer.lit) to see the file that generated this HTML. -## The Lexer +@s The Lexer A lexical analyzer reads source code and produces tokens, which are the smallest unit of meaning in a language. For example, in the C programming language, the tokens are things like keywords (if, else, while, etc.), identifiers (variable names), numbers, and punctuation (braces, semicolons, etc.). Given a string like `int main() { return 0; }`, the lexer would produce a series of tokens like `INT`, `IDENTIFIER(main)`, `LPAREN`, `RPAREN`, `LBRACE`, `RETURN`, `INTCONSTANT(0)`, `SEMICOLON`, `RBRACE`. -### Design +@s Design I'll break the lexer up into a couple of modules. `token.c` will contain the token data structure and functions to create and destroy tokens. `input.c` will contain the input data structure and functions to read from the input file. `tokenizer.c` will contain the main lexer logic. -### Token +@s Token Interface Tokens are the smallest unit of meaning in a language. They're used by the parser to build an abstract syntax tree (AST). We'll need a couple of things to represent a token: * The type of token. This will be an enum, with values like `TOK_CTK_IF` or `TOK_CONST_INTEGER_U32`. * The value of the token. Some tokens, like keywords, don't have a value. Others, like identifiers or constants, do. * The line and column of the token. This is used for error messages. -As I mentioned earlier, we're trying to implement a sort of class system in C. To that end, tokens will be opaque to all other modules. The only way to interact with them will be through functions in `token.c`. -### Input - -#### Input Interface +As I mentioned earlier, we're trying to implement a sort of class system in C. For that, we'll need to hide the token implementation details behind an opaque pointer. We could just have a `void` pointer, but that stops us from being able to use compile-time type checking. +Instead, we'll use a forward declaration of the token type in the header file, and then define the token type in the implementation file. +@s +--- Opaque Token Type +typedef struct token token_t; +--- +@s +We'll need a couple of functions to create and destroy tokens. +--- Token Creation and Destruction +token_t *token_data_create(c_token_types kind, int lin, int col, + int len); + +token_t *token_create(c_token_types kind, int lin, int col, int len); + +token_t *token_create_int(c_token_types kind, int lin, int col, + int64_t i, int len); + +token_t *token_create_float(c_token_types kind, int lin, int col, + double f, int len); + +token_t *token_create_char(c_token_types kind, int lin, int col, char c, + int len); + +token_t *token_create_string(c_token_types kind, int lin, int col, + const char *s, int len); + +void token_destroy(token_t *token); +--- +@s +We'll also need some functions to access the token data. +--- Token Interface +c_token_types token_type(token_t *token); + +int64_t token_int(token_t *token); + +double token_float(token_t *token); + +const char *token_string(token_t *token); + +char token_char(token_t *token); + +int token_line(token_t *token); + +int token_column(token_t *token); +--- +@s +We'll need some types to represent the different kinds of tokens. +--- Token Types +typedef enum { + // Control Keywords + TOK_CTK_IF, + TOK_CTK_ELSE, + TOK_CTK_SWITCH, + TOK_CTK_CASE, + TOK_CTK_DEFAULT, + TOK_CTK_WHILE, + TOK_CTK_DO, + TOK_CTK_FOR, + TOK_CTK_CONTINUE, + TOK_CTK_BREAK, + TOK_CTK_RETURN, + TOK_CTK_GOTO, + + // Type Keywords + TOK_TK_VOID, + TOK_TK_CHAR, + TOK_TK_SHORT, + TOK_TK_INT, + TOK_TK_LONG, + TOK_TK_FLOAT, + TOK_TK_DOUBLE, + TOK_TK_SIGNED, + TOK_TK_UNSIGNED, + TOK_TK_STRUCT, + TOK_TK_UNION, + TOK_TK_ENUM, + TOK_TK_TYPEDEF, + + // Storage Class/Specifier Keywords + TOK_SCSK_AUTO, + TOK_SCSK_REGISTER, + TOK_SCSK_STATIC, + TOK_SCSK_EXTERN, + TOK_SCSK_CONST, + TOK_SCSK_VOLATILE, + + // Misc Keywords + TOK_MK_SIZEOF, + + // Operators + TOK_OP_ADD, // + + TOK_OP_SUB, // - + TOK_OP_MUL, // * + TOK_OP_DIV, // / + TOK_OP_MOD, // % + TOK_OP_BIT_AND, // & + TOK_OP_BIT_OR, // | + TOK_OP_BIT_XOR, // ^ + TOK_OP_BIT_NOT, // ~ + TOK_OP_LSHIFT, // << + TOK_OP_RSHIFT, // >> + TOK_OP_NOT, // ! + TOK_OP_ASSIGN, // = + TOK_OP_LT, // < + TOK_OP_GT, // > + TOK_OP_INC, // ++ + TOK_OP_DEC, // -- + TOK_OP_EQ, // == + TOK_OP_NE, // != + TOK_OP_LE, // <= + TOK_OP_GE, // >= + TOK_OP_AND, // && + TOK_OP_OR, // || + TOK_OP_MEMBER_POINTER, // -> + TOK_OP_MEMBER, // . + TOK_OP_COND_DECISION, // : + TOK_OP_COND, // ? + TOK_OP_ASSIGN_ADD, // += + TOK_OP_ASSIGN_SUB, // -= + TOK_OP_ASSIGN_MUL, // *= + TOK_OP_ASSIGN_DIV, // /= + TOK_OP_ASSIGN_MOD, // %= + TOK_OP_ASSIGN_BITAND, // &= + TOK_OP_ASSIGN_BITOR, // |= + TOK_OP_ASSIGN_BITXOR, // ^= + TOK_OP_ASSIGN_LSHIFT, // <<= + TOK_OP_ASSIGN_RSHIFT, // >>= + + // Separators + TOK_SEP_LEFT_PAREN, // ( + TOK_SEP_RIGHT_PAREN, // ) + TOK_SEP_LEFT_BRACKET, // [ + TOK_SEP_RIGHT_BRACKET, // ] + TOK_SEP_LEFT_BRACE, // { + TOK_SEP_RIGHT_BRACE, // } + TOK_SEP_COMMA, // , + TOK_SEP_SEMICOLON, // ; + TOK_SEP_DOT, // . + TOK_SEP_ELLIPSIS, // ... + TOK_SEP_HASH, // # + + // Identifiers + TOK_ID, + + // Constants + TOK_CONST_INTEGER_U32, // u + TOK_CONST_INTEGER_U64, // ul + TOK_CONST_INTEGER_S32, // (no suffix) + TOK_CONST_INTEGER_S64, // l + TOK_CONST_FLOAT_32, // f + TOK_CONST_FLOAT_64, // (no suffix) + TOK_CONST_CHAR, // 'c' + TOK_CONST_STRING_ASCII, // "string" (width of 8 bits) + + // Special + TOK_SPECIAL_EOF, + TOK_SPECIAL_ERROR, +} c_token_types; +--- +@s +We bring this all together in `token.h`. +--- token.h +#ifndef TOKEN_H +#define TOKEN_H +#include <stdint.h> // We use this for int64_t +@{Token Types} +@{Opaque Token Type} +@{Token Creation and Destruction} +@{Token Interface} +#endif +--- + +@s Token Implementation +Now that we have the interface, we can implement the token data structure. We'll need a couple of things: +* The token type. +* A way to store extra data. +* Implementations of the functions we defined in the interface. + +@s +One problem is we haven't defined a way to verify that the token we're getting isn't corrupt. We'll use a tag for that. +--- Token Data Structure +#define TOK_MAGIC_1 0x544F4B454E544F4Bul // "TOKENTOK" +#define TOK_MAGIC_2 0x544F4B544F4B454Eul // "TOKTOKEN" +struct token { + long magic; + int line; + int column; + short kind; + long opt_data[0]; +}; + +typedef struct token token_t; + +struct token_data { + union { + long long i; + double f; + const char *s; + char c; + } data; +}; + +typedef struct token_data token_data_t; +--- +You might notice that a zero-length array is used in the token data structure. This is a GCC extension that allows us to allocate memory for the token data structure and the token data in one allocation. This is a bit of a hack, but it's a common pattern in C code. +@s +To access this extra data, we define a macro +--- Token Data Access +#define token_data(token) ((struct token_data *)((token)->opt_data)) +--- +@s +Now we can implement the functions we defined in the interface. +--- Token Creation and Destruction (Except `token_create_string`) +static token_t *token_data_create(c_token_types kind, int lin, int col, int len) { + token_t *token = malloc(sizeof(token_t) + sizeof(struct token_data)); + if (token == NULL) { + fputs("Out of memory\n", stderr); + exit(1); + } + token->magic = TOK_MAGIC_1; + token->line = lin; + token->column = col; + column += len; + token->kind = kind; + return token; +} + +static token_t *token_create(c_token_types kind, int lin, int col, int len) { + token_t *token = malloc(sizeof(token_t)); + if (token == NULL) { + fputs("Out of memory\n", stderr); + exit(1); + } + token->magic = TOK_MAGIC_2; + token->line = lin; + token->column = col; + column += len; + token->kind = kind; + return token; +} + +static token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len) { + token_t *token = token_data_create(kind, lin, col, len); + token_data(token)->data.i = i; + return token; +} + +static token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len) { + token_t *token = token_data_create(kind, lin, col, len); + token_data(token)->data.f = f; + return token; +} + +static token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len) { + token_t *token = token_data_create(kind, lin, col, len); + token_data(token)->data.c = c; + return token; +} + +void token_destroy(token_t *token) { + if (token->magic == TOK_MAGIC_1) { + free(token); + } else if (token->magic == TOK_MAGIC_2) { + free(token); + if (kind == TOK_CONST_STRING_ASCII) { + free(token_data(token)->data.s); + } + } else { + fputs("Corrupt token\n", stderr); + exit(1); + } +} +--- +@s +`token_create_string` can be implemented either the easy way or the right way. Let's try the easy way. +--- token_create_string +token_t *token_create_string(c_token_types kind, int lin, int col, + const char *s, int len) { + token_t *token = token_create(kind, lin, col, len); + token_data(token)->data.s = strdup(s); + return token; +} +--- +@s +There's an issue with this approach. `token_create_string` will be called for every identifier and every string in a program. Imagine a large program, say a shell, with a bunch of user input and output. That program will likely have 20-40 calls to `fprintf`, `fscanf`, `strchr`, `strtok`, each. We create a new string for each of those calls. That's a lot of duplicates, and can quickly add up to a lot of memory usage. +To fix this, we use a hash table to store the strings. We'll define a hash table in `hash_table.h` and `hash_table.c`. +@s Hash Table +A hash table is a data structure that maps keys to values. It's commonly used to store information, such as variables and functions in a symbol table. To implement a generic hash table, we'll need several things: +* A function to hash the keys. +* A function to compare keys. +* An opaque type for the hash table. +* A function to destroy deleted keys and values. + +Let's start with the interface. + +@s +--- Hash Table Opaque Types +typedef struct hash_table hash_table_t; +typedef int (*hash_table_cmp_fn)(void *key1, void *key2); +typedef unsigned int (*hash_table_hash_fn)(void *key); +typedef void (*hash_table_dtor)(void *value, int is_key); +--- + +@s +--- Hash Table Creation and Destruction +hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor); +void hash_table_destroy(hash_table_t *table); +--- + +@s +--- Hash Table Access +void *hash_table_get(hash_table_t *table, void *key); +void hash_table_put(hash_table_t *table, void *key, void *value); +void hash_table_remove(hash_table_t *table, void *key); +--- + +@s +--- hash_table.h +#ifndef HASH_TABLE_H +#define HASH_TABLE_H +@{Hash Table Opaque Types} +@{Hash Table Creation and Destruction} +@{Hash Table Access} +#endif +--- + +@s +Let's implement the hash table now. + +--- hash_table.c +#include <stdlib.h> +#include <string.h> +#include <stdio.h> +#include "hash_table.h" + +@{Hash Table Data Structure} +@{Hash Table Entry Data Structure} + +hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp, hash_table_hash_fn hash, hash_table_dtor dtor) { + @{Allocate and Initialize Hash Table} + return table; +} + +void hash_table_destroy(hash_table_t *table) { + @{Destroy Entries} + free(table->entries); + free(table); +} + +void *hash_table_get(hash_table_t *table, void *key) { + @{Get Entry By Hash} + @{Loop Through Entries and Return Value if Match} + return NULL; +} + +void hash_table_put(hash_table_t *table, void *key, void *value) { + @{Get Entry By Hash} + @{Loop Through Entries and Replace Value if Key Matches} + @{Allocate New Entry if No Match} +} + +void hash_table_remove(hash_table_t *table, void *key) { + @{Get Entry By Hash} + @{Loop Through Entries and Remove Entry if Key Matches} +} + +#ifdef TEST_HASH_TABLE +#include <assert.h> +#include <stdio.h> +#include <string.h> + +int string_cmp(void *key1, void *key2) { + return strcmp((char *)key1, (char *)key2); +} + +unsigned long string_hash(void *key) { + unsigned long hash = 5381; + char *str = (char *)key; + while (*str != '\0') { + hash = ((hash << 5) + hash) + *str; + str++; + } + return hash; +} + +int main() { + hash_table_t *table = hash_table_create(16, string_cmp, string_hash, NULL); + hash_table_put(table, "foo", "bar"); + hash_table_put(table, "foo", "baz"); + assert(strcmp((char *)hash_table_get(table, "foo"), "baz") == 0); + hash_table_remove(table, "foo"); + assert(hash_table_get(table, "foo") == NULL); + hash_table_destroy(table); + return 0; +} +#endif +--- + +@s +For the hash table data structure, we'll define a pointer to an array of entries, the size of the array, and the hash/comparison functions. + +--- Hash Table Data Structure +struct hash_table { + struct hash_table_entry **entries; + int size; + hash_table_cmp_fn cmp; + hash_table_hash_fn hash; + hash_table_dtor dtor; +}; +--- + +@s +Entries in the hash table will have a key, a value, and a link to the next entry in the chain. + +--- Hash Table Entry Data Structure +struct hash_table_entry { + void *key; + void *value; + struct hash_table_entry *next; +}; +--- + +@s +Allocating a hash table involves allocating memory for the hash table itself and the entries, zeroing out the entries, and setting the hash and comparison functions. + +--- Allocate and Initialize Hash Table +hash_table_t *table = malloc(sizeof(struct hash_table)); +if (table == NULL) { + fputs("Out of memory, could not allocate hash table\n", stderr); + exit(1); +} +table->entries = calloc(size, sizeof(struct hash_table_entry *)); +if (table->entries == NULL) { + fputs("Out of memory, could not allocate hash table entries\n", stderr); + exit(1); +} +table->size = size; +table->cmp = cmp; +table->hash = hash; +table->dtor = dtor; +--- + +@s +To destroy a hash table, we loop through the entries, freeing the keys and values, and then free the entries and the table itself. + +--- Destroy Entries +for (int i = 0; i < table->size; i++) { + struct hash_table_entry *entry = table->entries[i]; + while (entry != NULL) { + struct hash_table_entry *next = entry->next; + if (table->dtor != NULL) { + table->dtor(entry->key, 1); + table->dtor(entry->value, 0); + } + free(entry); + entry = next; + } +} +--- + +@s +To get an entry from the hash table, we hash the key, loop through the entries, and return the value if we find a match. + +--- Get Entry By Hash +unsigned int hash = table->hash(key) % table->size; +struct hash_table_entry *entry = table->entries[hash]; +--- + +@s +To put an entry in the hash table, we hash the key, loop through the entries, and replace the value if we find a match. + +--- Loop Through Entries and Replace Value if Key Matches +while (entry != NULL) { + if (table->cmp(entry->key, key) == 0) { + entry->value = value; + return; + } + entry = entry->next; +} +--- + +@s +If we don't find a match, we allocate a new entry, set the key and value, and insert it at the head of the linked list. +This exploits a property in computer science called locality of reference. The gist of that is that when you write to a piece of memory, you're likely to read from it again soon. By putting the new entry at the head of the linked list, we increase the chances that we'll find it quickly next time. + +--- Allocate New Entry if No Match +struct hash_table_entry *new_entry = malloc(sizeof(struct hash_table_entry)); +if (new_entry == NULL) { + fputs("Out of memory, could not allocate hash table entry\n", stderr); + exit(1); +} +new_entry->key = key; +new_entry->value = value; +new_entry->next = table->entries[hash]; +table->entries[hash] = new_entry; +--- + +@s +To remove an entry from the hash table, we hash the key, loop through the entries, and remove the entry if we find a match. + +--- Loop Through Entries and Remove Entry if Key Matches +struct hash_table_entry *prev = NULL; +while (entry != NULL) { + if (table->cmp(entry->key, key) == 0) { + if (prev == NULL) { + table->entries[hash] = entry->next; + } else { + prev->next = entry->next; + } + if (table->dtor != NULL) { + table->dtor(entry->key, 1); + table->dtor(entry->value, 0); + } + free(entry); + return; + } + prev = entry; + entry = entry->next; +} +--- + +@s +To find a value associated with a given key in the hash table, we hash the string, loop through the entries, and return the value if a match is found. + +--- Loop Through Entries and Return Value if Match +while (entry != NULL) { + if (table->cmp(entry->key, key) == 0) { + return entry->value; + } + entry = entry->next; +} +---+ \ No newline at end of file diff --git a/projects/style.css b/projects/style.css @@ -1,17 +1,4 @@ /* code blocks (Style from jmeiners.com/lc3-vm, CC BY-NC-SA 4.0, used with attribution) */ -code, -.block-header, -.file-name - { - font-size: 11pt; - font-family: 'Fira Mono', Menlo, Monaco, Consolas, Liberation Mono, Courier New, monospace; -} - -.file-name-hr -{ - font-size: 13pt; - font-family: 'Fira Mono', Menlo, Monaco, Consolas, Liberation Mono, Courier New, monospace; -} /* Quotes and Block Quotes */ blockquote {