# Lexer & Parser
## Lexer (`src/syntax/native/lexer.tg`)
The lexer converts Thagore source text into a flat token stream. It is indentation-aware, tracking indent levels to emit `INDENT`/`DEDENT` tokens (similar to Python).
### Tokenization Process
- Comment stripping — `#` and `//` line comments are removed
- Line-by-line scanning — each line is processed individually
- Indentation tracking — indent level changes emit `INDENT`/`DEDENT` tokens using an indent stack (see the sketch after this list)
- Token emission — identifiers, numbers, strings, operators, and keywords are emitted
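A minimal Python sketch of the indent-stack idea (not the Thagore lexer itself): spaces only, no tabs, blank lines skipped, and `LINE` standing in for a line's real tokens. `E_INDENT` marks a line whose indentation matches no open level, mirroring the error token listed below.

```python
def indent_tokens(lines):
    """Emit INDENT/DEDENT markers from leading-space counts using an indent stack."""
    stack = [0]                      # open indentation levels, outermost first
    out = []
    for line in lines:
        if not line.strip():
            continue                 # blank lines do not affect indentation
        level = len(line) - len(line.lstrip(" "))
        if level > stack[-1]:        # deeper than before: one INDENT
            stack.append(level)
            out.append("INDENT")
        while level < stack[-1]:     # shallower: pop until we match, one DEDENT each
            stack.pop()
            out.append("DEDENT")
        if level != stack[-1]:
            out.append("E_INDENT")   # indentation matches no open level
        out.append("LINE")           # stand-in for the line's actual tokens
    while stack[-1] > 0:             # close any still-open blocks at EOF
        stack.pop()
        out.append("DEDENT")
    return out

print(indent_tokens(["if x:", "    y", "        z", "done"]))
# ['LINE', 'INDENT', 'LINE', 'INDENT', 'LINE', 'DEDENT', 'DEDENT', 'LINE']
```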
### Token Format
Tokens are encoded as a semicolon-delimited string:
```
KIND|TEXT|LINE|COL|INDENT;KIND|TEXT|LINE|COL|INDENT;...
```

For example, `let x = 42` produces tokens like:

```
LET|let|1|1|0;IDENT|x|1|5|0;SYMBOL|=|1|7|0;NUMBER|42|1|9|0;
```
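To make the encoding concrete, here is a small Python sketch (not the Thagore implementation) that decodes such a stream back into tuples; it assumes token text never contains `;` or `|`, which the real lexer may handle differently.

```python
def decode_tokens(stream: str):
    """Split a ';'-delimited token stream into (kind, text, line, col, indent) tuples."""
    tokens = []
    for entry in stream.split(";"):
        if not entry:
            continue  # skip the empty entry after the trailing ';'
        kind, text, line, col, indent = entry.split("|")
        tokens.append((kind, text, int(line), int(col), int(indent)))
    return tokens

print(decode_tokens("LET|let|1|1|0;IDENT|x|1|5|0;SYMBOL|=|1|7|0;NUMBER|42|1|9|0;"))
# [('LET', 'let', 1, 1, 0), ('IDENT', 'x', 1, 5, 0), ('SYMBOL', '=', 1, 7, 0), ('NUMBER', '42', 1, 9, 0)]
```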
### Token Types

| Token | Description | Example |
|---|---|---|
| `IDENT` | Identifier | `x`, `myFunc`, `_count` |
| `NUMBER` | Integer literal | `42`, `0`, `100` |
| `STRING` | String literal | `"hello"` |
| `INTERPOLATED_STRING` | Interpolated string | `v"hello {name}"` |
| `FUNC` | `func` keyword | `func` |
| `LET` | `let` keyword | `let` |
| `IF` | `if` keyword | `if` |
| `ELSE` | `else` keyword | `else` |
| `WHILE` | `while` keyword | `while` |
| `FOR` | `for` keyword | `for` |
| `IN` | `in` keyword | `in` |
| `MATCH` | `match` keyword | `match` |
| `ENUM` | `enum` keyword | `enum` |
| `TYPE` | `type` keyword | `type` |
| `TRAIT` | `trait` keyword | `trait` |
| `IMPL` | `impl` keyword | `impl` |
| `PUB` | `pub` keyword | `pub` |
| `UNSAFE` | `unsafe` keyword | `unsafe` |
| `DEFER` | `defer` keyword | `defer` |
| `COMPTIME` | `comptime` keyword | `comptime` |
| `RETURN` | `return` keyword | `return` |
| `EQEQ` | Equality operator | `==` |
| `NEQ` | Not-equal operator | `!=` |
| `GTE` | Greater-or-equal | `>=` |
| `LTE` | Less-or-equal | `<=` |
| `ARROW` | Arrow operator | `->` |
| `RANGE` | Range operator | `..` |
| `SYMBOL` | Single-char operator | `+`, `-`, `*`, `/`, `(`, `)`, `:`, `.`, `=` |
| `INDENT` | Indentation increase | (structural) |
| `DEDENT` | Indentation decrease | (structural) |
| `NEWLINE` | Line separator | (structural) |
| `E_INDENT` | Indent mismatch error | (error token) |
| `EOF` | End of file | (structural) |
### Key Lexer Functions
| Function | Purpose |
|---|---|
| `tokenize_native(source)` | Main tokenizer entry point — returns full token stream |
| `strip_comments(source)` | Remove `#` and `//` comments from source |
| `has_sig_indentation(source)` | Check if source uses significant indentation |
| `token_summary(source)` | Alias for `tokenize_native` |
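For illustration, a rough Python sketch of what the comment stripping listed above might look like; the real `strip_comments` lives in `lexer.tg` and may handle escapes and other string forms this sketch ignores.

```python
def strip_comments(source: str) -> str:
    """Remove '#' and '//' line comments, skipping markers inside double-quoted strings."""
    out_lines = []
    for line in source.splitlines():
        in_string = False
        cut = len(line)
        i = 0
        while i < len(line):
            ch = line[i]
            if ch == '"':
                in_string = not in_string
            elif not in_string and (ch == "#" or line[i:i + 2] == "//"):
                cut = i              # comment starts here; drop the rest of the line
                break
            i += 1
        out_lines.append(line[:cut].rstrip())
    return "\n".join(out_lines)

print(strip_comments('let x = 42  # answer\nlet s = "a // b"  // trailing'))
# let x = 42
# let s = "a // b"
```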
## Parser (`src/syntax/native/parser.tg`)
The parser processes the token stream to build an AST represented as the ProgramAstNative struct.
### ProgramAstNative Structure

```
struct ProgramAstNative:
    source: String          # Original source text
    normalized: String      # Cleaned/normalized source
    token_stream: String    # Tokenized representation
    enums: String           # Parsed enum declarations
    aliases: String         # Type alias declarations
    funcs: String           # Function declarations
    node_rows: String       # AST node rows
    next_node_id: i32       # Auto-increment node ID
    # Feature counters (17 fields):
    enum_payload_count: i32
    match_count: i32
    range_loop_count: i32
    if_expr_count: i32
    closure_count: i32
    unsafe_count: i32
    extension_impl_count: i32
    visibility_count: i32
    tuple_destruct_count: i32
    array_literal_count: i32
    slice_expr_count: i32
    loop_label_count: i32
    raw_string_count: i32
    interpolated_string_count: i32
    result_sugar_count: i32
    defer_scope_count: i32
    comptime_count: i32
```
### Parsing Process

- Tokenize — call `tokenize_native(source)` to get the token stream
- Line-by-line header analysis (see the sketch after this list):
  - Detect `func` declarations → extract name, params, return type
  - Detect `enum` declarations → extract name and variants
  - Detect `type` aliases → extract name and target
  - Detect `struct` declarations → extract fields
  - Detect `impl` blocks → track method implementations
- Feature counting — count every advanced feature usage for the feature set
- AST construction — build node rows with id, parent_id, kind, and payload
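As a rough illustration of the header-analysis and feature-counting steps, here is a Python sketch, not the actual `parser.tg` code. It assumes Python-like declaration headers (e.g. a hypothetical `func main() -> i32:` form), only pulls out declaration names, and tracks just two of the counters; the real parser also extracts params, return types, variants, fields, and payloads.

```python
def analyze_headers(normalized: str):
    """Line-by-line header analysis: collect func/enum/type headers and bump feature counters."""
    funcs, enums, aliases = [], [], []
    counters = {"match": 0, "unsafe": 0}
    for line in normalized.splitlines():
        stripped = line.strip()
        if stripped.startswith("func "):
            funcs.append(stripped[5:].split("(")[0].strip())   # name before '('
        elif stripped.startswith("enum "):
            enums.append(stripped[5:].rstrip(":").strip())
        elif stripped.startswith("type "):
            aliases.append(stripped[5:].strip())
        if stripped.startswith("match "):
            counters["match"] += 1
        if "unsafe" in stripped.split():
            counters["unsafe"] += 1
    return funcs, enums, aliases, counters

src = "enum Color:\n    Red\nfunc main() -> i32:\n    match x:\n        _ -> 0\n"
print(analyze_headers(src))
# (['main'], ['Color'], [], {'match': 1, 'unsafe': 0})
```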
### Key Parser Functions
| Function | Return | Purpose |
|---|---|---|
| `parse_to_ast(source)` | `ProgramAstNative` | Full AST parse |
| `parse_normalized(source)` | `String` | Get normalized source |
| `parse_enums(source)` | `String` | Extract enum declarations |
| `parse_aliases(source)` | `String` | Extract type aliases |
| `parse_funcs(source)` | `String` | Extract function signatures |
| `parse_feature_set(source)` | `String` | Get the set of feature flags |
### Feature Set
The parser builds a feature set string like:
```
;enum_payload;match;range_loop;if_expr;
```

This tells downstream stages (typechecker, lowering, emitter) which advanced features are used, enabling targeted code generation paths.
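The surrounding semicolons make membership checks a simple substring test. A hedged Python sketch of the idea (the downstream stages themselves are written in Thagore, so this is illustrative only and assumes flags are stored exactly as `;name;` substrings, as in the example above):

```python
def has_feature(feature_set: str, name: str) -> bool:
    """Check whether a feature flag is present in the ';'-delimited set."""
    return f";{name};" in feature_set

flags = ";enum_payload;match;range_loop;if_expr;"
print(has_feature(flags, "match"))      # True
print(has_feature(flags, "comptime"))   # False
```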