Lexer & Parser

Lexer (src/syntax/native/lexer.tg)

The lexer converts Thagore source text into a flat token stream. It is indentation-aware, tracking indent levels to emit INDENT/DEDENT tokens (similar to Python).

Tokenization Process

  1. Comment stripping: # and // line comments are removed
  2. Line-by-line scanning: each line is processed individually
  3. Indentation tracking: changes in indent level emit INDENT/DEDENT tokens, tracked with an indent stack
  4. Token emission: identifiers, numbers, strings, operators, and keywords are emitted
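
The indent-stack idea in step 3 can be illustrated with a short Python sketch. This is not the Thagore implementation: the function name indent_events, the tab width, and the choice to skip blank lines are assumptions made for the example. It only shows how a stack of open indent widths yields INDENT, DEDENT, and E_INDENT events.

```python
def indent_events(lines, tab_width=4):
    """Return (kind, line_no) events for indentation changes.

    Hypothetical helper for illustration; the real lexer's rules
    (tab handling, blank lines) are not described in the source.
    """
    stack = [0]          # indent widths of the blocks currently open
    events = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue     # assumption: blank lines never change indentation
        expanded = line.expandtabs(tab_width)
        width = len(expanded) - len(expanded.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            events.append(("INDENT", line_no))
        else:
            while width < stack[-1]:
                stack.pop()
                events.append(("DEDENT", line_no))
            if width != stack[-1]:
                # dedent landed on a width that was never opened
                events.append(("E_INDENT", line_no))
    while len(stack) > 1:
        # close any blocks still open at end of input
        stack.pop()
        events.append(("DEDENT", len(lines)))
    return events


if __name__ == "__main__":
    demo = ["func f():", "    let x = 1", "    if x:", "        return x", "let y = 2"]
    for event in indent_events(demo):
        print(event)
```

One DEDENT is emitted per closed block, so a line that drops two levels at once produces two DEDENT events.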

Token Format

Tokens are encoded as a single string. Each token is a record of five pipe-separated fields, and each record ends with a semicolon:

KIND|TEXT|LINE|COL|INDENT;KIND|TEXT|LINE|COL|INDENT;...

For example, let x = 42 produces tokens like:

LET|let|1|1|0;IDENT|x|1|5|0;SYMBOL|=|1|7|0;NUMBER|42|1|9|0;
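
To make the encoding concrete, here is a minimal Python sketch that decodes such a string back into records. The Token type and decode_tokens function are hypothetical names, not part of the Thagore toolchain. The sketch also assumes that TEXT never contains | or ; since the source does not say how the lexer escapes those characters.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int
    col: int
    indent: int


def decode_tokens(stream: str) -> List[Token]:
    """Split a KIND|TEXT|LINE|COL|INDENT;... string into Token records.

    Assumption: TEXT contains neither '|' nor ';'.
    """
    tokens = []
    for record in stream.split(";"):
        if not record:
            continue  # the trailing ';' leaves one empty piece
        kind, text, line, col, indent = record.split("|")
        tokens.append(Token(kind, text, int(line), int(col), int(indent)))
    return tokens


if __name__ == "__main__":
    stream = "LET|let|1|1|0;IDENT|x|1|5|0;SYMBOL|=|1|7|0;NUMBER|42|1|9|0;"
    for tok in decode_tokens(stream):
        print(tok)
```

In the example above, COL is 1-based: x sits at column 5 and 42 at column 9 of the line let x = 42.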

Token Types

Token                Description            Example
IDENT                Identifier             x, myFunc, _count
NUMBER               Integer literal        42, 0, 100
STRING               String literal         "hello"
INTERPOLATED_STRING  Interpolated string    v"hello {name}"
FUNC                 func keyword           func
LET                  let keyword            let
IF                   if keyword             if
ELSE                 else keyword           else
WHILE                while keyword          while
FOR                  for keyword            for
IN                   in keyword             in
MATCH                match keyword          match
ENUM                 enum keyword           enum
TYPE                 type keyword           type
TRAIT                trait keyword          trait
IMPL                 impl keyword           impl
PUB                  pub keyword            pub
UNSAFE               unsafe keyword         unsafe
DEFER                defer keyword          defer
COMPTIME             comptime keyword       comptime
RETURN               return keyword         return
EQEQ                 Equality operator      ==
NEQ                  Not-equal operator     !=
GTE                  Greater-or-equal       >=
LTE                  Less-or-equal          <=
ARROW                Arrow operator         ->
RANGE                Range operator         ..
SYMBOL               Single-char operator   +, -, *, /, (, ), :, ., =
INDENT               Indentation increase   (structural)
DEDENT               Indentation decrease   (structural)
NEWLINE              Line separator         (structural)
E_INDENT             Indent mismatch error  (error token)
EOF                  End of file            (structural)
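
Two details follow from this table: keyword tokens are named after the upper-cased keyword, and two-character operators such as == must be tried before falling back to a single-character SYMBOL. The Python sketch below illustrates both. classify_word and scan_operator are hypothetical helpers, and longest-match-first is an assumption about how the real lexer resolves == versus =.

```python
# Keywords and two-character operators copied from the token table.
KEYWORDS = {
    "func", "let", "if", "else", "while", "for", "in", "match", "enum",
    "type", "trait", "impl", "pub", "unsafe", "defer", "comptime", "return",
}
TWO_CHAR_OPERATORS = {
    "==": "EQEQ", "!=": "NEQ", ">=": "GTE",
    "<=": "LTE", "->": "ARROW", "..": "RANGE",
}


def classify_word(word):
    """Return the token kind for an identifier-shaped word."""
    return word.upper() if word in KEYWORDS else "IDENT"


def scan_operator(line, pos):
    """Return (kind, text) for the operator starting at line[pos].

    Assumption: the longest match wins, so '==' is never read as two '='.
    """
    pair = line[pos:pos + 2]
    if pair in TWO_CHAR_OPERATORS:
        return TWO_CHAR_OPERATORS[pair], pair
    return "SYMBOL", line[pos]


if __name__ == "__main__":
    print(classify_word("match"), classify_word("myFunc"))
    print(scan_operator("a == b", 2), scan_operator("a = b", 2))
```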

Key Lexer Functions

Function                     Purpose
tokenize_native(source)      Main tokenizer entry point; returns the full token stream
strip_comments(source)       Remove # and // comments from source
has_sig_indentation(source)  Check whether source uses significant indentation
token_summary(source)        Alias for tokenize_native
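
As an illustration of what comment stripping involves, the Python sketch below removes # and // line comments while leaving the line count unchanged, so later LINE numbers stay valid. It is not the code behind strip_comments. Whether the real function ignores comment markers inside string literals, and whether strings use backslash escapes, is not stated in the source, so both are assumptions here. Raw and interpolated strings are not treated specially.

```python
def strip_line_comment(line):
    """Drop a trailing '#' or '//' comment from one line.

    Assumptions: markers inside "..." literals are not comments,
    and a backslash escapes the next character inside a literal.
    """
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "\\":
                i += 2          # skip the escaped character
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "#" or line.startswith("//", i):
            return line[:i].rstrip()
        i += 1
    return line


def strip_comments_sketch(source):
    """Apply strip_line_comment to every line, keeping line numbering."""
    return "\n".join(strip_line_comment(line) for line in source.split("\n"))


if __name__ == "__main__":
    print(strip_comments_sketch('let x = 42 # answer\nlet s = "a # b" // note'))
```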

Parser (src/syntax/native/parser.tg)

The parser processes the token stream to build an AST represented as the ProgramAstNative struct.

ProgramAstNative Structure

struct ProgramAstNative:
    source: String                    # Original source text
    normalized: String                # Cleaned/normalized source
    token_stream: String              # Tokenized representation
    enums: String                     # Parsed enum declarations
    aliases: String                   # Type alias declarations
    funcs: String                     # Function declarations
    node_rows: String                 # AST node rows
    next_node_id: i32                 # Auto-increment node ID
    # Feature counters (17 fields):
    enum_payload_count: i32
    match_count: i32
    range_loop_count: i32
    if_expr_count: i32
    closure_count: i32
    unsafe_count: i32
    extension_impl_count: i32
    visibility_count: i32
    tuple_destruct_count: i32
    array_literal_count: i32
    slice_expr_count: i32
    loop_label_count: i32
    raw_string_count: i32
    interpolated_string_count: i32
    result_sugar_count: i32
    defer_scope_count: i32
    comptime_count: i32

Parsing Process

  1. Tokenize — call tokenize_native(source) to get the token stream
  2. Line-by-line header analysis:
    • Detect func declarations → extract name, params, return type
    • Detect enum declarations → extract name and variants
    • Detect type aliases → extract name and target
    • Detect struct declarations → extract fields
    • Detect impl blocks → track method implementations
  3. Feature counting — count every advanced feature usage for the feature set
  4. AST construction — build node rows with id, parent_id, kind, and payload
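
Steps 2 and 4 can be sketched together in Python: scan the tokens for declaration keywords followed by a name, and record each hit as a node row carrying id, parent_id, kind, and payload. Everything specific here is assumed for illustration: the AstBuilder class, the node kind names (program, func, enum, alias), the use of 0 as the root's parent, and IDs starting at 1. The real parser stores rows in the node_rows string, whose encoding the source does not describe.

```python
class AstBuilder:
    """Collects (id, parent_id, kind, payload) rows with auto-incrementing IDs."""

    def __init__(self):
        self.rows = []
        self.next_node_id = 1   # assumption: IDs start at 1

    def add(self, parent_id, kind, payload):
        node_id = self.next_node_id
        self.next_node_id += 1
        self.rows.append((node_id, parent_id, kind, payload))
        return node_id


def collect_headers(tokens):
    """Build rows for func/enum/type headers from (kind, text) token pairs."""
    builder = AstBuilder()
    root = builder.add(0, "program", "")   # 0 = "no parent" (assumption)
    decl_kinds = {"FUNC": "func", "ENUM": "enum", "TYPE": "alias"}
    # Look at each token together with the one that follows it.
    for (kind, _text), (next_kind, next_text) in zip(tokens, tokens[1:]):
        if kind in decl_kinds and next_kind == "IDENT":
            builder.add(root, decl_kinds[kind], next_text)
    return builder


if __name__ == "__main__":
    toks = [("FUNC", "func"), ("IDENT", "main"), ("SYMBOL", "("), ("SYMBOL", ")"),
            ("SYMBOL", ":"), ("NEWLINE", ""), ("ENUM", "enum"), ("IDENT", "Color")]
    for row in collect_headers(toks).rows:
        print(row)
```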

Key Parser Functions

Function                   Return            Purpose
parse_to_ast(source)       ProgramAstNative  Full AST parse
parse_normalized(source)   String            Get normalized source
parse_enums(source)        String            Extract enum declarations
parse_aliases(source)      String            Extract type aliases
parse_funcs(source)        String            Extract function signatures
parse_feature_set(source)  String            Get the feature set string

Feature Set

The parser builds a feature set string like:

;enum_payload;match;range_loop;if_expr;

This tells downstream stages (typechecker, lowering, emitter) which advanced features are used, enabling targeted code generation paths.
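
A downstream stage could test for a feature with a delimiter-aware substring check, as in the Python sketch below. Both functions are hypothetical. The sketch assumes that each feature name is its counter field name without the _count suffix (consistent with the example string above) and that a feature is listed when its counter is non-zero. The ordering and the empty-set encoding are guesses.

```python
def build_feature_set(counters):
    """Turn {'match_count': 2, ...} into ';match;...;'.

    Assumption: a feature is listed when its counter is greater than zero.
    """
    suffix = "_count"
    names = [
        name[: -len(suffix)]
        for name, count in counters.items()
        if name.endswith(suffix) and count > 0
    ]
    if not names:
        return ";"              # assumption: empty set is a lone delimiter
    return ";" + ";".join(names) + ";"


def has_feature(feature_set, name):
    """Check membership using the surrounding ';' delimiters."""
    return ";" + name + ";" in feature_set


if __name__ == "__main__":
    counters = {"enum_payload_count": 2, "match_count": 1,
                "range_loop_count": 3, "if_expr_count": 1, "closure_count": 0}
    fs = build_feature_set(counters)
    print(fs, has_feature(fs, "match"), has_feature(fs, "closure"))
```

Wrapping the name in semicolons keeps a partial name such as range from matching range_loop, which is a likely reason for the leading and trailing delimiters.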