Building a Karel parser as a compiler front end

6 de junio de 2026 · 20 min de lectura · Ingeniería

A walkthrough of a Java lexer, LL(1) recursive-descent parser, and scope checker for a small Karel-style language.

Small languages are useful because they make compiler theory visible.

Karel-Lite is a compact Java 17 implementation of a lexer and LL(1) recursive-descent parser for a Karel-style teaching language. It does not try to execute the program, build an AST, or generate bytecode. Its job is narrower and cleaner:

Given a source file, decide whether the program is valid.

The output contract is intentionally simple:

YES.
NO. <diagnostic>

That simple answer hides a real compiler-front-end pipeline. The project has to recognize regular tokens, validate a context-free grammar, and enforce scope-related semantic rules that the grammar alone cannot express.

Core model The lexer handles regular structure, the parser handles context-free syntax, and the scope checker handles static semantics. That separation is what makes this small repo worth reading.

The language shape

The source language uses parenthesized instructions. That gives it a Lisp-like surface and makes the grammar easier to parse predictively:

( defvar rotate 3 )
( = rotate 1 )
( if (can-move? :north) (move-dir 1 :front) (null) )
( defun foo ( c p )
  ( = c 7 )
  ( put :chips c )
  ( put :balloons p )
  ( move rotate ) )
( foo 1 3 )

The language includes:

Construct	Examples	What it tests
Variables	`(defvar x 1)`, `(= x 2)`	Declarations and assignment checks
Movement	`(move 1)`, `(jump steps)`, `(turn :left)`	Command parsing and value parsing
Direction calls	`(move-dir 1 :front)`, `(face :north)`	Direction subgrammars
Resource actions	`(put :chips 1)`, `(pick :balloons n)`	Object selectors plus values
Conditions	`(blocked?)`, `(not (blocked?))`	Nested predicate syntax
Control flow	`if`, `loop`, `repeat`	Structured instruction parsing
Functions	`(defun foo (a) ...)`, `(foo 1)`	Scope, arity, and call validation
Blocks	`((move 1) (turn :left))`	Nested instruction lists and scopes

The point is not that Karel-Lite is a large language. The point is that even a small language forces the same boundaries a real compiler has to respect.

From grammar to recognizer

The grammar in grammar.txt is written in the standard formal-language shape:

G := {terminals, nonTerminals, P, productionRules}

The terminals are the atomic symbols the parser can see after lexing: keywords, commands, punctuation, identifiers, numbers, and constants. The non-terminals are syntactic categories such as program, instruction, command, condition, and function definition.

The start symbol is P, for program.

P  -> [Is]
Is -> I {I}
I  -> '(' [( CMD | CS | FC | Is )] ')'

Those three rules define the top-level shape.

Non-terminal	Meaning	Practical consequence
`P`	Program	A program may be empty, or it may contain an instruction list.
`Is`	Non-empty instruction list	Once a block starts, it must contain at least one instruction.
`I`	One parenthesized instruction	Every command, control structure, function call, or block lives inside `(` and `)`.

That last rule is doing a lot of work. Because every instruction is parenthesized, the parser can use ( as the entry point for nearly everything. Once inside the parentheses, the next token tells it which branch it is parsing.

The lexer: regular languages first

Before parsing, the source file has to become a stream of tokens. That is the lexer’s job.

The lexer defines token rules with regular expressions:

addTokenRule(TokenType.DEF_VAR_KW, "defvar");
addTokenRule(TokenType.MOVE_DIR_CMD, "move-dir");
addTokenRule(TokenType.NUMBER, "[1-9]\\d*|0");
addTokenRule(TokenType.NAME, "[_a-z][_a-z\\d]*");

This is the regular-language layer. A keyword like defvar, a number like 123, and a name like rotate can all be recognized with finite-state logic.

The implementation keeps a small scanner state:

START
MATCH
PARTIAL_MATCH

That is important because some tokens are prefixes of longer tokens. move is a valid command, but move-dir and move-face are also valid commands. If the lexer emitted move as soon as it saw those four characters, it would tokenize move-dir incorrectly.

Instead, it keeps reading while the current candidate is still a possible prefix of a longer token. This is the classic maximal-munch idea: emit the longest valid token that fits.

The lexer also appends an explicit end marker:

END_SYMBOL -> "$"

That gives the parser a real token to accept at the end. A program is not valid just because some prefix parsed; it is valid only if the parser consumes the whole token stream.

The parser: one method per grammar idea

The parser is hand-written recursive descent. The grammar is not hidden inside a generator. It is expressed directly as Java methods.

Grammar idea	Parser method
Program	`parseP()`
Instruction list	`parseIs(...)`
Instruction	`parseI()`
Command	`parseCMD()`
Control structure	`parseCS()`
Condition	`parseCOND()`
Function definition	`parseFD()`
Function call	`parseFC()`
Value	`parseV()`
Direction/object	`parseD1()`, `parseD2()`, `parseD3()`, `parseX()`

The core parser primitive is accept.

private void accept(TokenType expected) throws Exception {
    TokenType actual = SigTok().getType();
    if (actual.equals(expected)) {
        currentPosition++;
    } else {
        throw new Exception("Expected token: " + expected + " but found: " + actual);
    }
}

If the next token is the expected one, consume it. If not, the source program is rejected.

That gives the parser a mechanical rhythm:

look at the next token
choose a production
accept the required tokens
call smaller parse methods for nested structures
fail immediately when the grammar contract is broken

Method by method: grammar executed as code

The clearest way to read the parser is to treat every parseX method as the implementation of one grammar production. The method name is not incidental: parseCMD processes CMD, parseCOND processes COND, parseFD processes FD, and so on.

That works because the grammar is mostly predictive. The parser can look at the next token, choose the only legal branch, consume the expected terminals, and delegate nested non-terminals to smaller parser methods.

Method	Grammar idea	Equivalent CFG rule
`parseP`	Program	`P -> [Is]`
`parseIs`	Instruction list	`Is -> I {I}`
`parseI`	Instruction	`I -> '(' optional instruction-content ')'`
`parseCMD`	Built-in command	`CMD` command alternatives
`parseCS`	Control structure	`CS` control-structure alternatives
`parseFD`	Function definition	`FD -> defun name '(' [PARAMS] ')' Is`
`parseCOND`	Condition	`COND -> '(' predicate ')'`
`parseFC`	Function call	`FC -> name [Vs]`
`parseV`	Value	`V` value alternatives
`parseVs`	Argument list	`Vs` one-or-more argument sequence
`parseD1/D2/D3/X`	Direction/object	`D1`, `D2`, `D3`, and `X` selector rules

`parseP`: the start symbol

Formal rule:

P -> [Is]

Operational rule:

if the next token is '(':
  parse an instruction list
then require END_SYMBOL

The grammar says the program may contain an instruction list, but it does not have to. The implementation mirrors that exactly: parseP only calls parseIs when the next token is a left parenthesis. Then it calls accept(END_SYMBOL) so a valid prefix is not enough. The whole input has to be consumed.

This is the parser’s root contract:

valid program = optional instruction list + no leftover tokens

END_SYMBOL is an implementation sentinel appended by the lexer. It is not one of the user-written Karel tokens, but it is part of the operational grammar the parser uses.

`parseIs`: one or more instructions

Formal rule:

Is -> I {I}

Operational rule:

push a new scope
define any function parameters in that scope
parse one instruction
while the next token is '(':
  parse another instruction
pop the scope

The first parseI() implements the required I. The while loop implements {I}, meaning zero or more additional instructions.

This method is also where syntax and semantics meet. A context-free grammar can describe “a block is a list of instructions,” but the implementation gives every block a scope. That means the code is doing two jobs:

Layer	What `parseIs` does
Syntax	Recognizes `I {I}`.
Semantics	Creates a scope, inserts parameters, and removes scope at exit.

That is why nested blocks matter in this language. They are not just visual grouping. They control name visibility.

`parseI`: dispatching an instruction

Formal rule:

I -> '(' [( CMD | CS | FC | Is )] ')'

Operational rule:

accept '('
look at the next token
route to CMD, CS, FC, or Is
accept ')'

The method processes the rule in two phases. First, it consumes the opening parenthesis. Then it decides what is inside the instruction.

Lookahead token	Grammar branch	Method called
Token in `SELECTION_SET_CMD`	`CMD`	`parseCMD()`
Token in `SELECTION_SET_CS`	`CS`	`parseCS()`
`NAME`	`FC`	`parseFC()`
`LEFT_P`	`Is`	`parseIs(...)`
`RIGHT_P`	empty content	no nested method

That table is the LL(1) heart of the parser. One lookahead token is enough to choose the branch.

The empty-content case comes from the optional bracket in the grammar:

[( CMD | CS | FC | Is )]

Because the content is optional, () is accepted. A stricter language could remove that optional bracket, but this implementation follows the rule as written.

`parseCMD`: built-in command alternatives

Formal rule:

CMD -> ( ( 'defvar' | '=' ) 'name'
       | 'move'
       | 'jump'
       | ( 'put' | 'pick' ) X ) V
     | 'move-dir' V D3
     | 'run-dirs' Ds
     | ( 'move-face' V | 'face' ) ':' D2
     | 'turn' D1
     | 'null'

Operationally, parseCMD is a switch over the first command token. Each case implements one alternative in the production.

Code branch	CFG alternative	How the method processes it
`DEF_VAR_KW`, `ASSIGN_OP`	`(defvar or =) name V`	Accept keyword/operator, accept `NAME`, then parse `V`.
`MOVE_CMD`, `JUMP_CMD`	`(move or jump) V`	Accept command, then parse a value.
`PUT_CMD`, `PICK_CMD`	`(put or pick) X V`	Accept command, parse object selector, parse value.
`MOVE_DIR_CMD`	`move-dir V D3`	Accept command, parse value, parse relative direction.
`RUN_DIRS_CMD`	`run-dirs Ds`	Accept command, parse one or more `D3` directions.
`MOVE_FACE_CMD`	`move-face V ':' D2`	Accept command, parse value, accept colon, parse absolute direction.
`FACE_CMD`	`face ':' D2`	Accept command, accept colon, parse absolute direction.
`TURN_CMD`	`turn D1`	Accept command, parse turn direction.
`NULL_CMD`	`null`	Accept `null` and nothing else.

The variable commands add semantic checks:

defvar name V
  -> define name in the current scope

= name V
  -> require name to already exist

That means parseCMD is not just syntax recognition. It also enforces a simple declaration-before-use rule.

`parseCS`: choosing a control structure

Formal rule:

CS -> 'if' COND I [I]
    | ('loop' COND | 'repeat' V) I
    | FD

Operationally, parseCS switches on the first token:

First token	CFG alternative	Method behavior
`IF_KW`	`if COND I [I]`	Accept `if`, parse condition, parse then-instruction, optionally parse else-instruction.
`LOOP_KW`	`loop COND I`	Accept `loop`, parse condition, parse body instruction.
`REPEAT_KW`	`repeat V I`	Accept `repeat`, parse count value, parse body instruction.
`DEFUN_KW`	`FD`	Delegate to `parseFD()`.

This method does not execute control flow. It only verifies that the structure of each control form is legal. The body of an if, loop, or repeat is still just an I, so all ordinary instruction rules apply recursively inside those constructs.

`parseFD`: function definitions

Formal rule:

FD -> 'defun' 'name' '(' [PARAMS] ')' Is
PARAMS -> name {name}

Operational rule:

accept defun
read function name
accept '('
if the next token is NAME:
  parse parameter names
reject duplicate visible name+arity
define the function signature
accept ')'
parse function body as Is(params)

parseFD is where the parser gives function definitions their static meaning. The grammar can say a function has a name, optional parameters, and a body. The code adds the compile-time rules:

Rule	Implemented by
Parameters are names	`parsePARAMS()` accepts one or more `NAME` tokens.
Function signatures use arity	`checkFunctionDefinition(functionName, params.size())`.
Same signature cannot be redefined	Throws before registering the function.
Function body sees parameters	Calls `parseIs(params)` so parameters become local variables.
Direct recursion works	Registers the signature before parsing the body.

This is the first place where the article’s “compiler front end” claim becomes concrete. parseFD recognizes grammar, installs a symbol, and then parses the body under a new scope.

`parseCOND`: recursive predicate syntax

Formal rule:

COND -> '(' (
          ( 'facing?' | 'can-move?' ) [':'] D2
        | 'blocked?'
        | ( ( 'can-put?' | 'can-pick?' ) X | 'isZero?' ) V
        | 'not' COND
        ) ')'

Operationally, every condition starts by accepting ( and ends by accepting ). The token inside the parentheses chooses the predicate form.

Predicate token	CFG alternative	Method behavior
`FACING_KW`, `CAN_MOVE_KW`	`(facing? or can-move?) [':'] D2`	Accept predicate, optionally accept colon, parse absolute direction.
`BLOCKED_KW`	`blocked?`	Accept predicate only.
`CAN_PUT_KW`, `CAN_PICK_KW`	`(can-put? or can-pick?) X V`	Accept predicate, parse object selector, parse value.
`IS_ZERO_KW`	`isZero? V`	Accept predicate, parse value.
`NOT_KW`	`not COND`	Accept `not`, then recursively parse another condition.

The recursive not COND branch is the important theory point. This is why the language needs a context-free parser rather than a pile of regexes. Conditions can nest:

(not (not (blocked?)))

The parser handles that because parseCOND can call itself.

`parseFC`, `parseVs`, and function-call arity

Formal rules:

FC -> name [Vs]
Vs -> ( V | VC ) {( V | VC )}

Operational rule:

read function name
if the next token can start an argument:
  parse one or more arguments and count them
require a visible function with matching name and arity

The argument list is split into helpers:

Method	CFG rule	Code behavior
`parseFC`	`FC -> name [Vs]`	Reads the call name, optionally parses arguments, checks `name + arity`.
`parseVs`	`Vs -> (V or VC) repeated`	Parses at least one argument, then loops while another argument can start.
`parseVorVC`	`V or VC`	Routes to `parseV` or `parseVC` based on the next token.

The grammar only describes the call shape. The code enforces the semantic rule:

(foo 1 2)

is valid only if a visible function foo exists with arity 2.

`parseV`, `parseVC`, and argument values

Formal rules:

V  -> number | name | constant
VC -> left | right | around | north | south | east | west
    | balloons | chips | front | back

parseV accepts numeric values, variable names, and built-in constants. When the value is a NAME, it performs an extra scope check:

if V is NAME:
  require the variable to exist

parseVC handles symbolic constants that can be passed as function arguments. That is why function calls can pass values such as north or chips, even though ordinary numeric command values use V.

This distinction is a grammar design choice:

Rule	Used for	Name checking?
`V`	command counts, repeat counts, variable values	Yes, when token is `NAME`.
`VC`	symbolic function-call constants	No variable lookup; token itself is the value.

Direction and selector methods

The small direction/object parser methods are grammar guards. They keep tokens from being accepted in the wrong semantic position.

Formal rules:

D1 -> ':' (left | right | around)
D2 -> north | south | east | west
X  -> ':' (balloons | chips)
D3 -> ':' (front | right | left | back)
Ds -> D3 {D3}

Operational mapping:

Method	CFG rule	What it enforces
`parseD1`	`D1`	`turn` can only use `:left`, `:right`, or `:around`.
`parseD2`	`D2`	absolute directions are only `north/south/east/west`.
`parseX`	`X`	object selectors are only `:balloons` or `:chips`.
`parseD3`	`D3`	relative directions are only `:front/:right/:left/:back`.
`parseDs`	`Ds`	`run-dirs` accepts one or more `D3` directions.

These helpers are small, but they are the reason a phrase like this is rejected:

(turn :north)

north is a valid language token, but it is not valid in D1.

Why this is LL(1)

Karel-Lite is parsed as an LL(1) language. In practical terms:

L: read input left to right
L: follow a leftmost derivation
1: choose the next grammar branch with one token of lookahead

The parser’s selection sets are the code version of FIRST sets. They say which tokens can start a given syntactic category.

private static final Set<TokenType> SELECTION_SET_CMD = Set.of(
    TokenType.DEF_VAR_KW, TokenType.ASSIGN_OP, TokenType.MOVE_CMD,
    TokenType.JUMP_CMD, TokenType.PICK_CMD, TokenType.PUT_CMD,
    TokenType.MOVE_DIR_CMD, TokenType.RUN_DIRS_CMD,
    TokenType.MOVE_FACE_CMD, TokenType.FACE_CMD, TokenType.TURN_CMD,
    TokenType.NULL_CMD
);

So inside parseI(), the parser can choose the right branch from one token:

if next token starts a command: parseCMD
if next token starts a control structure: parseCS
if next token is a name: parseFC
if next token is '(': parse a nested instruction list

That is the reason the language design matters. The syntax is shaped so the parser rarely has to guess.

How the grammar avoids ambiguity

The grammar is not just context-free. It is designed to be predictive. That is the difference between a grammar that merely describes a language and a grammar that a one-token-lookahead parser can process cleanly.

For an LL(1) parser, every point of choice needs a deterministic answer:

Given the next token, which production should I use?

That is where selection sets matter. A selection set is the set of tokens that tell the parser a production branch is valid. For an ordinary alternative, the selection set is basically its FIRST set:

A -> alpha | beta | gamma

SELECT(alpha) = FIRST(alpha)
SELECT(beta)  = FIRST(beta)
SELECT(gamma) = FIRST(gamma)

For the parser to choose with one token, those sets must not overlap:

intersection(SELECT(alpha), SELECT(beta)) = empty
intersection(SELECT(alpha), SELECT(gamma)) = empty
intersection(SELECT(beta), SELECT(gamma)) = empty

If two branches can start with the same token, the parser cannot know which one to use without more lookahead or backtracking.

Alternatives must start differently

The control-structure production is the cleanest example:

CS -> 'if' COND I [I]
    | ('loop' COND | 'repeat' V) I
    | FD
FD -> 'defun' 'name' '(' [PARAMS] ')' Is

The alternatives have disjoint starts:

Branch	Selection set
`if ...`	`{ if }`
`loop ...`	`{ loop }`
`repeat ...`	`{ repeat }`
`FD`	`{ defun }`

That is why parseCS() can be a simple switch. There is no ambiguity between an if, a loop, a repeat, and a function definition.

The same idea appears inside parseI():

I -> '(' [( CMD | CS | FC | Is )] ')'

Once the parser consumes (, the next token chooses the content branch.

Branch	Selection set
`CMD`	`{ defvar, =, move, jump, pick, put, move-dir, run-dirs, move-face, face, turn, null }`
`CS`	`{ if, loop, repeat, defun }`
`FC`	`{ name }`
`Is`	`{ ( }`
empty optional content	`{ ) }` as the follow token

Those sets do not overlap. That is why the parser can decide:

defvar -> parseCMD
if     -> parseCS
name   -> parseFC
(      -> parseIs
)      -> empty instruction content

Optional rules must not collide with what follows

Optionals add one extra requirement. If a production has an optional piece:

A -> X [B] C

then the parser needs to know whether the next token starts B or means “skip B and continue with C.”

So the rule is:

FIRST(B) must not overlap FIRST(C)

More generally, if the optional can disappear, its selection set includes the tokens that may follow it:

SELECT(B) = FIRST(B)
SELECT(empty) = FOLLOW([B])
intersection(FIRST(B), FOLLOW([B])) must be empty

Karel-Lite uses that pattern several times.

Optional grammar	Optional FIRST set	Follow token set	Why it is safe
`P -> [Is]`	`{ ( }`	`{ END_SYMBOL }`	A program body starts with `(`; the end marker means empty program.
`I -> '(' [content] ')'`	command/control/name/`(`	`{ ) }`	`)` cannot start content, so it means empty instruction body.
`if COND I [I]`	`{ ( }`	`{ ) }`	An else-branch starts with `(`; `)` means no else-branch.
`FD -> defun name '(' [PARAMS] ')' Is`	`{ name }`	`{ ) }`	Parameters start with a name; `)` means zero parameters.
`FC -> name [Vs]`	value/constant tokens	`{ ) }`	Arguments start with values; `)` means zero arguments.
`(facing? or can-move?) [':'] D2`	`{ : }`	`{ north, south, east, west }`	The colon is optional because absolute directions never start with `:`.

That last case is a good small example. Both forms are accepted:

(can-move? north)
(can-move? :north)

The parser can support the optional colon because : is not a valid first token for D2. If D2 could also start with :, the parser would not know whether to consume the optional colon or leave it for D2.

Repetition must know when to stop

Repetition has the same issue. A repeated group must have a start set that does not overlap with the tokens that come after the group.

A -> B {C} D

The parser keeps consuming C while the lookahead is in FIRST(C). It stops when the lookahead belongs to what follows.

So:

FIRST(C) must not overlap FIRST(D)

Karel-Lite examples:

Repetition	Repeated FIRST set	Follow token set	Parser behavior
`Is -> I {I}`	`{ ( }`	`{ ), END_SYMBOL }`	Keep parsing instructions while the next token is `(`.
`PARAMS -> name {name}`	`{ name }`	`{ ) }`	Keep parsing parameter names until `)`.
`Vs -> (V or VC) {(V or VC)}`	value/constant tokens	`{ ) }`	Keep parsing arguments until the function-call close.
`Ds -> D3 {D3}`	`{ : }`	`{ ) }`	Keep parsing relative directions until command close.

This is why parseDs() can be:

parse one D3
while the next token is ':':
  parse another D3

It is not guessing. D3 starts with :, and the only thing that follows the direction list in run-dirs is the closing ).

Shared tokens are safe when there is no local choice

One subtle point: selection sets do not need to be globally disjoint across the entire grammar. They only need to be disjoint at a point where the parser must choose between alternatives.

For example, D1 and D3 both start with :.

D1 -> ':' (left | right | around)
D3 -> ':' (front | right | left | back)

That would be a conflict if the parser had a rule like:

Direction -> D1 | D3

because both branches would start with :.

Karel-Lite avoids that local ambiguity by using the command token to choose the context first:

turn     -> parseD1
move-dir -> parseD3
run-dirs -> parseDs -> parseD3

By the time the parser sees the colon, it already knows which direction category is expected.

The lexer does a related job for command names. In raw text, move is a prefix of move-dir and move-face. That could look ambiguous, but the lexer uses maximal munch and emits distinct tokens:

move      -> MOVE_CMD
move-dir  -> MOVE_DIR_CMD
move-face -> MOVE_FACE_CMD

The parser receives token types, not raw prefixes, so parseCMD() can switch on those command tokens cleanly.

The design rule

The rule of thumb is:

At every parser decision point, the next token should identify exactly one path.

For alternatives, each branch needs a disjoint selection set. For optionals, the optional branch must not start with a token that can also follow the optional. For repetitions, the repeated element must not start with a token that also means “the repetition is done.”

That is the real grammar design work behind this repo. The grammar is shaped so the code can be simple.

Program and block rules

The top-level program rule is:

P -> [Is]

That means an empty program is accepted. If the first token is (, the parser parses an instruction list. Either way, it then requires the end marker.

The instruction-list rule is:

Is -> I {I}

So a block is one or more instructions. In code, parseIs parses one instruction, then keeps parsing while the next token is another left parenthesis.

parseIs also creates a scope:

push new scope
define function parameters in that scope
parse instructions
pop scope

That means nested blocks are not just syntax. They also create static semantic boundaries.

Instruction rules

The instruction rule is:

I -> '(' [( CMD | CS | FC | Is )] ')'

An instruction always starts with ( and ends with ). Between them, it may contain:

Branch	Meaning	Starts with
`CMD`	Built-in command	`move`, `defvar`, `=`, `put`, etc.
`CS`	Control structure	`if`, `loop`, `repeat`, `defun`
`FC`	Function call	`name`
`Is`	Nested instruction list	`(`

The optional content means () is syntactically valid in this grammar. That is not necessarily a feature one would keep in a production language, but it is honest to the grammar implemented here.

Values, constants, and selectors

The main value rule is:

V -> 'number' | 'name' | 'constant'

Numbers and built-in constants are accepted directly. Names are different: the parser verifies that a variable with that name exists in the current scope or an outer scope.

That is not context-free syntax anymore. It is static semantics.

The language also has value constants used in function arguments:

VC -> left | right | around | north | south | east | west | balloons | chips | front | back

The distinction matters. V is for numeric-like values and variable references. VC is for symbolic constants that can be passed to functions.

Selectors and direction categories are intentionally split:

Rule	Valid values	Used by
`X`	`:balloons`, `:chips`	`put`, `pick`, `can-put?`, `can-pick?`
`D1`	`:left`, `:right`, `:around`	`turn`
`D2`	`north`, `south`, `east`, `west`	`face`, `move-face`, `facing?`, `can-move?`
`D3`	`:front`, `:right`, `:left`, `:back`	`move-dir`, `run-dirs`

This keeps the parser from accepting semantically strange phrases such as:

(turn :north)
(put :front 1)
(face :left)

Those tokens exist in the language, but not in those positions.

Command rules

Commands are parsed in parseCMD().

The grammar is:

CMD -> ( ( 'defvar' | '=' ) 'name'
       | 'move'
       | 'jump'
       | ( 'put' | 'pick' ) X ) V
     | 'move-dir' V D3
     | 'run-dirs' Ds
     | ( 'move-face' V | 'face' ) ':' D2
     | 'turn' D1
     | 'null'

That compact rule turns into several command families.

Command	Required shape	Validation performed
`defvar`	`(defvar name V)`	Defines `name` in current scope.
assignment	`(= name V)`	Requires `name` to already exist.
`move` / `jump`	`(move V)`, `(jump V)`	Parses a valid value.
`put` / `pick`	`(put X V)`, `(pick X V)`	Requires `X` to be `:chips` or `:balloons`.
`move-dir`	`(move-dir V D3)`	Requires relative direction.
`run-dirs`	`(run-dirs D3 {D3})`	Requires one or more relative directions.
`move-face`	`(move-face V : D2)`	Requires amount plus absolute direction.
`face`	`(face : D2)`	Requires absolute direction.
`turn`	`(turn D1)`	Requires turn direction.
`null`	`(null)`	Accepts no extra arguments.

The variable distinction is the most important rule here. defvar extends the current scope. = does not. Assignment must refer to an existing variable.

(defvar x 1)  -> valid
(= x 2)       -> valid after x exists
(= y 2)       -> invalid if y was never defined

Names used as values are checked too:

(move x)      -> valid only if x is defined

So a syntactically valid command can still be rejected by the semantic layer.

Control structures

Control structures are parsed in parseCS().

The grammar is:

CS -> 'if' COND I [I]
    | ('loop' COND | 'repeat' V) I
    | FD

There are four branches: if, loop, repeat, and function definition.

If

The shape is:

(if COND I [I])

So an if always has:

A condition.
A then-instruction.
Optionally, an else-instruction.

Example:

(if (can-move? :north)
  (move-dir 1 :front)
  (null))

The parser does not evaluate the condition. It validates that the condition has legal syntax, then validates each branch as an instruction. If another ( is present after the then-branch, it parses that as the optional else-branch.

Loop

The shape is:

(loop COND I)

Example:

(loop (not (blocked?)) (move 1))

Again, this is not runtime execution. The parser only checks that the condition is valid and that the loop body is a valid instruction.

Repeat

The shape is:

(repeat V I)

Example:

(repeat Spaces (put :chips 1))

The repeat count is parsed as V, which means it can be a number, a variable, or a built-in constant. If it is a variable name, the scope checker verifies that the name exists.

Function definition

Function definitions are parsed as a control-structure branch because defun lives where other structured constructs live.

FD -> 'defun' 'name' '(' [PARAMS] ')' Is

Example:

(defun foo (c p)
  (= c 7)
  (put :chips c)
  (put :balloons p)
  (move rotate))

The parser:

Accepts defun.
Reads the function name.
Parses zero or more parameter names.
Rejects a duplicate function with the same name and arity in visible scope.
Registers the function signature.
Parses the function body as an instruction list with parameters pre-defined as local variables.

The registration-before-body detail matters. It allows direct recursion:

(defun goend ()
  (if (not (blocked?))
    ((move one) (goend))
    (null)))

The function goend can call itself because its signature is known before the body is parsed.

Conditions

Conditions are parsed in parseCOND().

The grammar is:

COND -> '(' (
          ( 'facing?' | 'can-move?' ) [':'] D2
        | 'blocked?'
        | ( ( 'can-put?' | 'can-pick?' ) X | 'isZero?' ) V
        | 'not' COND
        ) ')'

Every condition is parenthesized. The predicate then determines the rest of the shape.

Predicate	Shape	Meaning at parse time
`facing?`	`(facing? [:] D2)`	Requires an absolute direction.
`can-move?`	`(can-move? [:] D2)`	Requires an absolute direction.
`blocked?`	`(blocked?)`	No arguments.
`can-put?` / `can-pick?`	`(can-put? X V)`	Requires object selector + value.
`isZero?`	`(isZero? V)`	Requires a value.
`not`	`(not COND)`	Recursively parses a condition.

The optional colon in facing? and can-move? is a useful detail. Both of these are valid:

(can-move? north)
(can-move? :north)

The parser handles that by accepting a colon if it appears, then parsing D2.

The not rule is recursively defined. That is a direct example of why a context-free grammar is the right model for the language: nested conditions can be arbitrarily deep.

(not (not (blocked?)))

A regular expression is the wrong tool for that structure. A parser with a stack is the right tool.

Function calls and arity

Function calls are simple on the surface:

FC -> 'name' [Vs]
Vs -> ( V | VC ) {( V | VC )}

Examples:

(foo)
(foo 1)
(foo 1 north chips)

The parser reads the function name, counts the arguments, and checks whether a function with that name and arity exists in current or outer scope.

That means these are different signatures:

foo 0
foo 1
foo 2

The Scope object stores functions by name + arity, so the validation rule is not just “does a function named foo exist?” It is “does a function named foo exist with exactly this number of arguments?”

This fails:

(defun foo (a) (move a))
(foo 1 2)

because foo was defined with one parameter but called with two arguments.

The semantic layer

This repo is most interesting where it leaves pure context-free parsing and enters static semantics.

A context-free grammar can validate this shape:

(move x)

But it cannot, by itself, answer:

Was x declared before this use?

That is why the parser keeps a stack of scopes.

current block scope
outer block scope
outer function or program scope
...

Variable lookup walks those scopes from inner to outer. Function lookup does the same, but with name plus arity.

Rule	Behavior
Variable declaration	Adds the variable to the current scope.
Variable assignment	Requires the variable in current or outer scope.
Variable use as `V`	Requires the variable in current or outer scope.
Function definition	Adds `name arity` to the current scope.
Function call	Requires matching `name arity` in visible scope.
Function parameters	Become variables in the function body scope.
Variable shadowing	Allowed.
Same-arity function redeclaration	Rejected if visible in current or outer scope.

That is exactly the boundary a real compiler front end has:

lexer: this is a NAME token
parser: a NAME token is legal in this position
semantic checker: this particular name is or is not defined here

What the project does not do

It is useful to be precise about the boundary.

Karel-Lite does not:

build an AST;
execute robot commands;
type-check numeric versus directional values beyond the grammar categories;
optimize anything;
recover from errors and continue parsing;
support mutual recursion unless the called function has already been parsed.

Those are not flaws in the article-worthy sense. They clarify the project’s actual shape: it is a recognizer and static validator for a small language.

That makes it a good portfolio artifact. The implementation is small enough to read in one sitting, but it contains the core ideas behind larger language tools: scanning, token categories, predictive parsing, grammar productions, scope, arity, and diagnostics.

Why it matters

The project is interesting because it shows that compiler construction is not magic. A language front end is a disciplined sequence of increasingly richer questions:

Can this text be split into valid tokens?
Can those tokens form a valid syntax tree shape?
Do the names and calls make sense in scope?
Did we consume the whole input?

Karel-Lite answers those questions with direct code instead of hiding them behind a parser generator. That is why it reads well as engineering work: the formal language theory is not decorative. It is the structure of the program.

The repo is small, but the ideas are not. Regular languages handle tokens. Context-free grammars handle nested syntax. LL(1) selection sets make parsing deterministic. A scope stack handles the semantic rules that syntax alone cannot see.

That is a compiler front end in miniature.

The language shape

From grammar to recognizer

The lexer: regular languages first

The parser: one method per grammar idea

Method by method: grammar executed as code

parseP: the start symbol

parseIs: one or more instructions

parseI: dispatching an instruction

parseCMD: built-in command alternatives

parseCS: choosing a control structure

parseFD: function definitions

parseCOND: recursive predicate syntax

parseFC, parseVs, and function-call arity

parseV, parseVC, and argument values

Direction and selector methods

Why this is LL(1)

How the grammar avoids ambiguity

Alternatives must start differently

Optional rules must not collide with what follows

Repetition must know when to stop

Shared tokens are safe when there is no local choice

The design rule

Program and block rules

Instruction rules

Values, constants, and selectors

Command rules

Control structures

If

Loop

Repeat

Function definition

Conditions

Function calls and arity

The semantic layer

What the project does not do

Why it matters

`parseP`: the start symbol

`parseIs`: one or more instructions

`parseI`: dispatching an instruction

`parseCMD`: built-in command alternatives

`parseCS`: choosing a control structure

`parseFD`: function definitions

`parseCOND`: recursive predicate syntax

`parseFC`, `parseVs`, and function-call arity

`parseV`, `parseVC`, and argument values