Building a Karel parser as a compiler front end
A walkthrough of a Java lexer, LL(1) recursive-descent parser, and scope checker for a small Karel-style language.
Small languages are useful because they make compiler theory visible.
Karel-Lite is a compact Java 17 implementation of a lexer and LL(1) recursive-descent parser for a Karel-style teaching language. It does not try to execute the program, build an AST, or generate bytecode. Its job is narrower and cleaner:
Given a source file, decide whether the program is valid.
The output contract is intentionally simple:
YES.
NO. <diagnostic>
That simple answer hides a real compiler-front-end pipeline. The project has to recognize regular tokens, validate a context-free grammar, and enforce scope-related semantic rules that the grammar alone cannot express.
The language shape
The source language uses parenthesized instructions. That gives it a Lisp-like surface and makes the grammar easier to parse predictively:
( defvar rotate 3 )
( = rotate 1 )
( if (can-move? :north) (move-dir 1 :front) (null) )
( defun foo ( c p )
( = c 7 )
( put :chips c )
( put :balloons p )
( move rotate ) )
( foo 1 3 )
The language includes:
| Construct | Examples | What it tests |
|---|---|---|
| Variables | (defvar x 1), (= x 2) | Declarations and assignment checks |
| Movement | (move 1), (jump steps), (turn :left) | Command parsing and value parsing |
| Direction calls | (move-dir 1 :front), (face :north) | Direction subgrammars |
| Resource actions | (put :chips 1), (pick :balloons n) | Object selectors plus values |
| Conditions | (blocked?), (not (blocked?)) | Nested predicate syntax |
| Control flow | if, loop, repeat | Structured instruction parsing |
| Functions | (defun foo (a) ...), (foo 1) | Scope, arity, and call validation |
| Blocks | ((move 1) (turn :left)) | Nested instruction lists and scopes |
The point is not that Karel-Lite is a large language. The point is that even a small language forces the same boundaries a real compiler has to respect.
From grammar to recognizer
The grammar in grammar.txt is written in the standard formal-language shape:
G := {terminals, nonTerminals, P, productionRules}
The terminals are the atomic symbols the parser can see after lexing: keywords, commands, punctuation, identifiers, numbers, and constants. The non-terminals are syntactic categories such as program, instruction, command, condition, and function definition.
The start symbol is P, for program.
P -> [Is]
Is -> I {I}
I -> '(' [( CMD | CS | FC | Is )] ')'
Those three rules define the top-level shape.
| Non-terminal | Meaning | Practical consequence |
|---|---|---|
P | Program | A program may be empty, or it may contain an instruction list. |
Is | Non-empty instruction list | Once a block starts, it must contain at least one instruction. |
I | One parenthesized instruction | Every command, control structure, function call, or block lives inside ( and ). |
That last rule is doing a lot of work. Because every instruction is
parenthesized, the parser can use ( as the entry point for nearly everything.
Once inside the parentheses, the next token tells it which branch it is parsing.
The lexer: regular languages first
Before parsing, the source file has to become a stream of tokens. That is the lexer’s job.
The lexer defines token rules with regular expressions:
addTokenRule(TokenType.DEF_VAR_KW, "defvar");
addTokenRule(TokenType.MOVE_DIR_CMD, "move-dir");
addTokenRule(TokenType.NUMBER, "[1-9]\\d*|0");
addTokenRule(TokenType.NAME, "[_a-z][_a-z\\d]*");
This is the regular-language layer. A keyword like defvar, a number like
123, and a name like rotate can all be recognized with finite-state logic.
The implementation keeps a small scanner state:
START
MATCH
PARTIAL_MATCH
That is important because some tokens are prefixes of longer tokens. move is
a valid command, but move-dir and move-face are also valid commands. If the
lexer emitted move as soon as it saw those four characters, it would tokenize
move-dir incorrectly.
Instead, it keeps reading while the current candidate is still a possible prefix of a longer token. This is the classic maximal-munch idea: emit the longest valid token that fits.
The lexer also appends an explicit end marker:
END_SYMBOL -> "$"
That gives the parser a real token to accept at the end. A program is not valid just because some prefix parsed; it is valid only if the parser consumes the whole token stream.
The parser: one method per grammar idea
The parser is hand-written recursive descent. The grammar is not hidden inside a generator. It is expressed directly as Java methods.
| Grammar idea | Parser method |
|---|---|
| Program | parseP() |
| Instruction list | parseIs(...) |
| Instruction | parseI() |
| Command | parseCMD() |
| Control structure | parseCS() |
| Condition | parseCOND() |
| Function definition | parseFD() |
| Function call | parseFC() |
| Value | parseV() |
| Direction/object | parseD1(), parseD2(), parseD3(), parseX() |
The core parser primitive is accept.
private void accept(TokenType expected) throws Exception {
TokenType actual = SigTok().getType();
if (actual.equals(expected)) {
currentPosition++;
} else {
throw new Exception("Expected token: " + expected + " but found: " + actual);
}
}
If the next token is the expected one, consume it. If not, the source program is rejected.
That gives the parser a mechanical rhythm:
look at the next token
choose a production
accept the required tokens
call smaller parse methods for nested structures
fail immediately when the grammar contract is broken
Method by method: grammar executed as code
The clearest way to read the parser is to treat every parseX method as the
implementation of one grammar production. The method name is not incidental:
parseCMD processes CMD, parseCOND processes COND, parseFD processes
FD, and so on.
That works because the grammar is mostly predictive. The parser can look at the next token, choose the only legal branch, consume the expected terminals, and delegate nested non-terminals to smaller parser methods.
| Method | Grammar idea | Equivalent CFG rule |
|---|---|---|
parseP | Program | P -> [Is] |
parseIs | Instruction list | Is -> I {I} |
parseI | Instruction | I -> '(' optional instruction-content ')' |
parseCMD | Built-in command | CMD command alternatives |
parseCS | Control structure | CS control-structure alternatives |
parseFD | Function definition | FD -> defun name '(' [PARAMS] ')' Is |
parseCOND | Condition | COND -> '(' predicate ')' |
parseFC | Function call | FC -> name [Vs] |
parseV | Value | V value alternatives |
parseVs | Argument list | Vs one-or-more argument sequence |
parseD1/D2/D3/X | Direction/object | D1, D2, D3, and X selector rules |
parseP: the start symbol
Formal rule:
P -> [Is]
Operational rule:
if the next token is '(':
parse an instruction list
then require END_SYMBOL
The grammar says the program may contain an instruction list, but it does not
have to. The implementation mirrors that exactly: parseP only calls parseIs
when the next token is a left parenthesis. Then it calls accept(END_SYMBOL) so
a valid prefix is not enough. The whole input has to be consumed.
This is the parser’s root contract:
valid program = optional instruction list + no leftover tokens
END_SYMBOL is an implementation sentinel appended by the lexer. It is not one
of the user-written Karel tokens, but it is part of the operational grammar the
parser uses.
parseIs: one or more instructions
Formal rule:
Is -> I {I}
Operational rule:
push a new scope
define any function parameters in that scope
parse one instruction
while the next token is '(':
parse another instruction
pop the scope
The first parseI() implements the required I. The while loop implements
{I}, meaning zero or more additional instructions.
This method is also where syntax and semantics meet. A context-free grammar can describe “a block is a list of instructions,” but the implementation gives every block a scope. That means the code is doing two jobs:
| Layer | What parseIs does |
|---|---|
| Syntax | Recognizes I {I}. |
| Semantics | Creates a scope, inserts parameters, and removes scope at exit. |
That is why nested blocks matter in this language. They are not just visual grouping. They control name visibility.
parseI: dispatching an instruction
Formal rule:
I -> '(' [( CMD | CS | FC | Is )] ')'
Operational rule:
accept '('
look at the next token
route to CMD, CS, FC, or Is
accept ')'
The method processes the rule in two phases. First, it consumes the opening parenthesis. Then it decides what is inside the instruction.
| Lookahead token | Grammar branch | Method called |
|---|---|---|
Token in SELECTION_SET_CMD | CMD | parseCMD() |
Token in SELECTION_SET_CS | CS | parseCS() |
NAME | FC | parseFC() |
LEFT_P | Is | parseIs(...) |
RIGHT_P | empty content | no nested method |
That table is the LL(1) heart of the parser. One lookahead token is enough to choose the branch.
The empty-content case comes from the optional bracket in the grammar:
[( CMD | CS | FC | Is )]
Because the content is optional, () is accepted. A stricter language could
remove that optional bracket, but this implementation follows the rule as
written.
parseCMD: built-in command alternatives
Formal rule:
CMD -> ( ( 'defvar' | '=' ) 'name'
| 'move'
| 'jump'
| ( 'put' | 'pick' ) X ) V
| 'move-dir' V D3
| 'run-dirs' Ds
| ( 'move-face' V | 'face' ) ':' D2
| 'turn' D1
| 'null'
Operationally, parseCMD is a switch over the first command token. Each case
implements one alternative in the production.
| Code branch | CFG alternative | How the method processes it |
|---|---|---|
DEF_VAR_KW, ASSIGN_OP | (defvar or =) name V | Accept keyword/operator, accept NAME, then parse V. |
MOVE_CMD, JUMP_CMD | (move or jump) V | Accept command, then parse a value. |
PUT_CMD, PICK_CMD | (put or pick) X V | Accept command, parse object selector, parse value. |
MOVE_DIR_CMD | move-dir V D3 | Accept command, parse value, parse relative direction. |
RUN_DIRS_CMD | run-dirs Ds | Accept command, parse one or more D3 directions. |
MOVE_FACE_CMD | move-face V ':' D2 | Accept command, parse value, accept colon, parse absolute direction. |
FACE_CMD | face ':' D2 | Accept command, accept colon, parse absolute direction. |
TURN_CMD | turn D1 | Accept command, parse turn direction. |
NULL_CMD | null | Accept null and nothing else. |
The variable commands add semantic checks:
defvar name V
-> define name in the current scope
= name V
-> require name to already exist
That means parseCMD is not just syntax recognition. It also enforces a simple
declaration-before-use rule.
parseCS: choosing a control structure
Formal rule:
CS -> 'if' COND I [I]
| ('loop' COND | 'repeat' V) I
| FD
Operationally, parseCS switches on the first token:
| First token | CFG alternative | Method behavior |
|---|---|---|
IF_KW | if COND I [I] | Accept if, parse condition, parse then-instruction, optionally parse else-instruction. |
LOOP_KW | loop COND I | Accept loop, parse condition, parse body instruction. |
REPEAT_KW | repeat V I | Accept repeat, parse count value, parse body instruction. |
DEFUN_KW | FD | Delegate to parseFD(). |
This method does not execute control flow. It only verifies that the structure
of each control form is legal. The body of an if, loop, or repeat is still
just an I, so all ordinary instruction rules apply recursively inside those
constructs.
parseFD: function definitions
Formal rule:
FD -> 'defun' 'name' '(' [PARAMS] ')' Is
PARAMS -> name {name}
Operational rule:
accept defun
read function name
accept '('
if the next token is NAME:
parse parameter names
reject duplicate visible name+arity
define the function signature
accept ')'
parse function body as Is(params)
parseFD is where the parser gives function definitions their static meaning.
The grammar can say a function has a name, optional parameters, and a body. The
code adds the compile-time rules:
| Rule | Implemented by |
|---|---|
| Parameters are names | parsePARAMS() accepts one or more NAME tokens. |
| Function signatures use arity | checkFunctionDefinition(functionName, params.size()). |
| Same signature cannot be redefined | Throws before registering the function. |
| Function body sees parameters | Calls parseIs(params) so parameters become local variables. |
| Direct recursion works | Registers the signature before parsing the body. |
This is the first place where the article’s “compiler front end” claim becomes
concrete. parseFD recognizes grammar, installs a symbol, and then parses the
body under a new scope.
parseCOND: recursive predicate syntax
Formal rule:
COND -> '(' (
( 'facing?' | 'can-move?' ) [':'] D2
| 'blocked?'
| ( ( 'can-put?' | 'can-pick?' ) X | 'isZero?' ) V
| 'not' COND
) ')'
Operationally, every condition starts by accepting ( and ends by accepting
). The token inside the parentheses chooses the predicate form.
| Predicate token | CFG alternative | Method behavior |
|---|---|---|
FACING_KW, CAN_MOVE_KW | (facing? or can-move?) [':'] D2 | Accept predicate, optionally accept colon, parse absolute direction. |
BLOCKED_KW | blocked? | Accept predicate only. |
CAN_PUT_KW, CAN_PICK_KW | (can-put? or can-pick?) X V | Accept predicate, parse object selector, parse value. |
IS_ZERO_KW | isZero? V | Accept predicate, parse value. |
NOT_KW | not COND | Accept not, then recursively parse another condition. |
The recursive not COND branch is the important theory point. This is why the
language needs a context-free parser rather than a pile of regexes. Conditions
can nest:
(not (not (blocked?)))
The parser handles that because parseCOND can call itself.
parseFC, parseVs, and function-call arity
Formal rules:
FC -> name [Vs]
Vs -> ( V | VC ) {( V | VC )}
Operational rule:
read function name
if the next token can start an argument:
parse one or more arguments and count them
require a visible function with matching name and arity
The argument list is split into helpers:
| Method | CFG rule | Code behavior |
|---|---|---|
parseFC | FC -> name [Vs] | Reads the call name, optionally parses arguments, checks name + arity. |
parseVs | Vs -> (V or VC) repeated | Parses at least one argument, then loops while another argument can start. |
parseVorVC | V or VC | Routes to parseV or parseVC based on the next token. |
The grammar only describes the call shape. The code enforces the semantic rule:
(foo 1 2)
is valid only if a visible function foo exists with arity 2.
parseV, parseVC, and argument values
Formal rules:
V -> number | name | constant
VC -> left | right | around | north | south | east | west
| balloons | chips | front | back
parseV accepts numeric values, variable names, and built-in constants. When
the value is a NAME, it performs an extra scope check:
if V is NAME:
require the variable to exist
parseVC handles symbolic constants that can be passed as function arguments.
That is why function calls can pass values such as north or chips, even
though ordinary numeric command values use V.
This distinction is a grammar design choice:
| Rule | Used for | Name checking? |
|---|---|---|
V | command counts, repeat counts, variable values | Yes, when token is NAME. |
VC | symbolic function-call constants | No variable lookup; token itself is the value. |
Direction and selector methods
The small direction/object parser methods are grammar guards. They keep tokens from being accepted in the wrong semantic position.
Formal rules:
D1 -> ':' (left | right | around)
D2 -> north | south | east | west
X -> ':' (balloons | chips)
D3 -> ':' (front | right | left | back)
Ds -> D3 {D3}
Operational mapping:
| Method | CFG rule | What it enforces |
|---|---|---|
parseD1 | D1 | turn can only use :left, :right, or :around. |
parseD2 | D2 | absolute directions are only north/south/east/west. |
parseX | X | object selectors are only :balloons or :chips. |
parseD3 | D3 | relative directions are only :front/:right/:left/:back. |
parseDs | Ds | run-dirs accepts one or more D3 directions. |
These helpers are small, but they are the reason a phrase like this is rejected:
(turn :north)
north is a valid language token, but it is not valid in D1.
Why this is LL(1)
Karel-Lite is parsed as an LL(1) language. In practical terms:
L: read input left to right
L: follow a leftmost derivation
1: choose the next grammar branch with one token of lookahead
The parser’s selection sets are the code version of FIRST sets. They say which tokens can start a given syntactic category.
private static final Set<TokenType> SELECTION_SET_CMD = Set.of(
TokenType.DEF_VAR_KW, TokenType.ASSIGN_OP, TokenType.MOVE_CMD,
TokenType.JUMP_CMD, TokenType.PICK_CMD, TokenType.PUT_CMD,
TokenType.MOVE_DIR_CMD, TokenType.RUN_DIRS_CMD,
TokenType.MOVE_FACE_CMD, TokenType.FACE_CMD, TokenType.TURN_CMD,
TokenType.NULL_CMD
);
So inside parseI(), the parser can choose the right branch from one token:
if next token starts a command: parseCMD
if next token starts a control structure: parseCS
if next token is a name: parseFC
if next token is '(': parse a nested instruction list
That is the reason the language design matters. The syntax is shaped so the parser rarely has to guess.
How the grammar avoids ambiguity
The grammar is not just context-free. It is designed to be predictive. That is the difference between a grammar that merely describes a language and a grammar that a one-token-lookahead parser can process cleanly.
For an LL(1) parser, every point of choice needs a deterministic answer:
Given the next token, which production should I use?
That is where selection sets matter. A selection set is the set of tokens that tell the parser a production branch is valid. For an ordinary alternative, the selection set is basically its FIRST set:
A -> alpha | beta | gamma
SELECT(alpha) = FIRST(alpha)
SELECT(beta) = FIRST(beta)
SELECT(gamma) = FIRST(gamma)
For the parser to choose with one token, those sets must not overlap:
intersection(SELECT(alpha), SELECT(beta)) = empty
intersection(SELECT(alpha), SELECT(gamma)) = empty
intersection(SELECT(beta), SELECT(gamma)) = empty
If two branches can start with the same token, the parser cannot know which one to use without more lookahead or backtracking.
Alternatives must start differently
The control-structure production is the cleanest example:
CS -> 'if' COND I [I]
| ('loop' COND | 'repeat' V) I
| FD
FD -> 'defun' 'name' '(' [PARAMS] ')' Is
The alternatives have disjoint starts:
| Branch | Selection set |
|---|---|
if ... | { if } |
loop ... | { loop } |
repeat ... | { repeat } |
FD | { defun } |
That is why parseCS() can be a simple switch. There is no ambiguity between
an if, a loop, a repeat, and a function definition.
The same idea appears inside parseI():
I -> '(' [( CMD | CS | FC | Is )] ')'
Once the parser consumes (, the next token chooses the content branch.
| Branch | Selection set |
|---|---|
CMD | { defvar, =, move, jump, pick, put, move-dir, run-dirs, move-face, face, turn, null } |
CS | { if, loop, repeat, defun } |
FC | { name } |
Is | { ( } |
| empty optional content | { ) } as the follow token |
Those sets do not overlap. That is why the parser can decide:
defvar -> parseCMD
if -> parseCS
name -> parseFC
( -> parseIs
) -> empty instruction content
Optional rules must not collide with what follows
Optionals add one extra requirement. If a production has an optional piece:
A -> X [B] C
then the parser needs to know whether the next token starts B or means “skip
B and continue with C.”
So the rule is:
FIRST(B) must not overlap FIRST(C)
More generally, if the optional can disappear, its selection set includes the tokens that may follow it:
SELECT(B) = FIRST(B)
SELECT(empty) = FOLLOW([B])
intersection(FIRST(B), FOLLOW([B])) must be empty
Karel-Lite uses that pattern several times.
| Optional grammar | Optional FIRST set | Follow token set | Why it is safe |
|---|---|---|---|
P -> [Is] | { ( } | { END_SYMBOL } | A program body starts with (; the end marker means empty program. |
I -> '(' [content] ')' | command/control/name/( | { ) } | ) cannot start content, so it means empty instruction body. |
if COND I [I] | { ( } | { ) } | An else-branch starts with (; ) means no else-branch. |
FD -> defun name '(' [PARAMS] ')' Is | { name } | { ) } | Parameters start with a name; ) means zero parameters. |
FC -> name [Vs] | value/constant tokens | { ) } | Arguments start with values; ) means zero arguments. |
(facing? or can-move?) [':'] D2 | { : } | { north, south, east, west } | The colon is optional because absolute directions never start with :. |
That last case is a good small example. Both forms are accepted:
(can-move? north)
(can-move? :north)
The parser can support the optional colon because : is not a valid first token
for D2. If D2 could also start with :, the parser would not know whether
to consume the optional colon or leave it for D2.
Repetition must know when to stop
Repetition has the same issue. A repeated group must have a start set that does not overlap with the tokens that come after the group.
A -> B {C} D
The parser keeps consuming C while the lookahead is in FIRST(C). It stops
when the lookahead belongs to what follows.
So:
FIRST(C) must not overlap FIRST(D)
Karel-Lite examples:
| Repetition | Repeated FIRST set | Follow token set | Parser behavior |
|---|---|---|---|
Is -> I {I} | { ( } | { ), END_SYMBOL } | Keep parsing instructions while the next token is (. |
PARAMS -> name {name} | { name } | { ) } | Keep parsing parameter names until ). |
Vs -> (V or VC) {(V or VC)} | value/constant tokens | { ) } | Keep parsing arguments until the function-call close. |
Ds -> D3 {D3} | { : } | { ) } | Keep parsing relative directions until command close. |
This is why parseDs() can be:
parse one D3
while the next token is ':':
parse another D3
It is not guessing. D3 starts with :, and the only thing that follows the
direction list in run-dirs is the closing ).
Shared tokens are safe when there is no local choice
One subtle point: selection sets do not need to be globally disjoint across the entire grammar. They only need to be disjoint at a point where the parser must choose between alternatives.
For example, D1 and D3 both start with :.
D1 -> ':' (left | right | around)
D3 -> ':' (front | right | left | back)
That would be a conflict if the parser had a rule like:
Direction -> D1 | D3
because both branches would start with :.
Karel-Lite avoids that local ambiguity by using the command token to choose the context first:
turn -> parseD1
move-dir -> parseD3
run-dirs -> parseDs -> parseD3
By the time the parser sees the colon, it already knows which direction category is expected.
The lexer does a related job for command names. In raw text, move is a prefix
of move-dir and move-face. That could look ambiguous, but the lexer uses
maximal munch and emits distinct tokens:
move -> MOVE_CMD
move-dir -> MOVE_DIR_CMD
move-face -> MOVE_FACE_CMD
The parser receives token types, not raw prefixes, so parseCMD() can switch on
those command tokens cleanly.
The design rule
The rule of thumb is:
At every parser decision point, the next token should identify exactly one path.
For alternatives, each branch needs a disjoint selection set. For optionals, the optional branch must not start with a token that can also follow the optional. For repetitions, the repeated element must not start with a token that also means “the repetition is done.”
That is the real grammar design work behind this repo. The grammar is shaped so the code can be simple.
Program and block rules
The top-level program rule is:
P -> [Is]
That means an empty program is accepted. If the first token is (, the parser
parses an instruction list. Either way, it then requires the end marker.
The instruction-list rule is:
Is -> I {I}
So a block is one or more instructions. In code, parseIs parses one
instruction, then keeps parsing while the next token is another left
parenthesis.
parseIs also creates a scope:
push new scope
define function parameters in that scope
parse instructions
pop scope
That means nested blocks are not just syntax. They also create static semantic boundaries.
Instruction rules
The instruction rule is:
I -> '(' [( CMD | CS | FC | Is )] ')'
An instruction always starts with ( and ends with ). Between them, it may
contain:
| Branch | Meaning | Starts with |
|---|---|---|
CMD | Built-in command | move, defvar, =, put, etc. |
CS | Control structure | if, loop, repeat, defun |
FC | Function call | name |
Is | Nested instruction list | ( |
The optional content means () is syntactically valid in this grammar. That is
not necessarily a feature one would keep in a production language, but it is
honest to the grammar implemented here.
Values, constants, and selectors
The main value rule is:
V -> 'number' | 'name' | 'constant'
Numbers and built-in constants are accepted directly. Names are different: the parser verifies that a variable with that name exists in the current scope or an outer scope.
That is not context-free syntax anymore. It is static semantics.
The language also has value constants used in function arguments:
VC -> left | right | around | north | south | east | west | balloons | chips | front | back
The distinction matters. V is for numeric-like values and variable references.
VC is for symbolic constants that can be passed to functions.
Selectors and direction categories are intentionally split:
| Rule | Valid values | Used by |
|---|---|---|
X | :balloons, :chips | put, pick, can-put?, can-pick? |
D1 | :left, :right, :around | turn |
D2 | north, south, east, west | face, move-face, facing?, can-move? |
D3 | :front, :right, :left, :back | move-dir, run-dirs |
This keeps the parser from accepting semantically strange phrases such as:
(turn :north)
(put :front 1)
(face :left)
Those tokens exist in the language, but not in those positions.
Command rules
Commands are parsed in parseCMD().
The grammar is:
CMD -> ( ( 'defvar' | '=' ) 'name'
| 'move'
| 'jump'
| ( 'put' | 'pick' ) X ) V
| 'move-dir' V D3
| 'run-dirs' Ds
| ( 'move-face' V | 'face' ) ':' D2
| 'turn' D1
| 'null'
That compact rule turns into several command families.
| Command | Required shape | Validation performed |
|---|---|---|
defvar | (defvar name V) | Defines name in current scope. |
| assignment | (= name V) | Requires name to already exist. |
move / jump | (move V), (jump V) | Parses a valid value. |
put / pick | (put X V), (pick X V) | Requires X to be :chips or :balloons. |
move-dir | (move-dir V D3) | Requires relative direction. |
run-dirs | (run-dirs D3 {D3}) | Requires one or more relative directions. |
move-face | (move-face V : D2) | Requires amount plus absolute direction. |
face | (face : D2) | Requires absolute direction. |
turn | (turn D1) | Requires turn direction. |
null | (null) | Accepts no extra arguments. |
The variable distinction is the most important rule here. defvar extends the
current scope. = does not. Assignment must refer to an existing variable.
(defvar x 1) -> valid
(= x 2) -> valid after x exists
(= y 2) -> invalid if y was never defined
Names used as values are checked too:
(move x) -> valid only if x is defined
So a syntactically valid command can still be rejected by the semantic layer.
Control structures
Control structures are parsed in parseCS().
The grammar is:
CS -> 'if' COND I [I]
| ('loop' COND | 'repeat' V) I
| FD
There are four branches: if, loop, repeat, and function definition.
If
The shape is:
(if COND I [I])
So an if always has:
- A condition.
- A then-instruction.
- Optionally, an else-instruction.
Example:
(if (can-move? :north)
(move-dir 1 :front)
(null))
The parser does not evaluate the condition. It validates that the condition has
legal syntax, then validates each branch as an instruction. If another ( is
present after the then-branch, it parses that as the optional else-branch.
Loop
The shape is:
(loop COND I)
Example:
(loop (not (blocked?)) (move 1))
Again, this is not runtime execution. The parser only checks that the condition is valid and that the loop body is a valid instruction.
Repeat
The shape is:
(repeat V I)
Example:
(repeat Spaces (put :chips 1))
The repeat count is parsed as V, which means it can be a number, a variable,
or a built-in constant. If it is a variable name, the scope checker verifies
that the name exists.
Function definition
Function definitions are parsed as a control-structure branch because defun
lives where other structured constructs live.
FD -> 'defun' 'name' '(' [PARAMS] ')' Is
Example:
(defun foo (c p)
(= c 7)
(put :chips c)
(put :balloons p)
(move rotate))
The parser:
- Accepts
defun. - Reads the function name.
- Parses zero or more parameter names.
- Rejects a duplicate function with the same name and arity in visible scope.
- Registers the function signature.
- Parses the function body as an instruction list with parameters pre-defined as local variables.
The registration-before-body detail matters. It allows direct recursion:
(defun goend ()
(if (not (blocked?))
((move one) (goend))
(null)))
The function goend can call itself because its signature is known before the
body is parsed.
Conditions
Conditions are parsed in parseCOND().
The grammar is:
COND -> '(' (
( 'facing?' | 'can-move?' ) [':'] D2
| 'blocked?'
| ( ( 'can-put?' | 'can-pick?' ) X | 'isZero?' ) V
| 'not' COND
) ')'
Every condition is parenthesized. The predicate then determines the rest of the shape.
| Predicate | Shape | Meaning at parse time |
|---|---|---|
facing? | (facing? [:] D2) | Requires an absolute direction. |
can-move? | (can-move? [:] D2) | Requires an absolute direction. |
blocked? | (blocked?) | No arguments. |
can-put? / can-pick? | (can-put? X V) | Requires object selector + value. |
isZero? | (isZero? V) | Requires a value. |
not | (not COND) | Recursively parses a condition. |
The optional colon in facing? and can-move? is a useful detail. Both of
these are valid:
(can-move? north)
(can-move? :north)
The parser handles that by accepting a colon if it appears, then parsing D2.
The not rule is recursively defined. That is a direct example of why a
context-free grammar is the right model for the language: nested conditions can
be arbitrarily deep.
(not (not (blocked?)))
A regular expression is the wrong tool for that structure. A parser with a stack is the right tool.
Function calls and arity
Function calls are simple on the surface:
FC -> 'name' [Vs]
Vs -> ( V | VC ) {( V | VC )}
Examples:
(foo)
(foo 1)
(foo 1 north chips)
The parser reads the function name, counts the arguments, and checks whether a function with that name and arity exists in current or outer scope.
That means these are different signatures:
foo 0
foo 1
foo 2
The Scope object stores functions by name + arity, so the validation rule is
not just “does a function named foo exist?” It is “does a function named foo
exist with exactly this number of arguments?”
This fails:
(defun foo (a) (move a))
(foo 1 2)
because foo was defined with one parameter but called with two arguments.
The semantic layer
This repo is most interesting where it leaves pure context-free parsing and enters static semantics.
A context-free grammar can validate this shape:
(move x)
But it cannot, by itself, answer:
Was x declared before this use?
That is why the parser keeps a stack of scopes.
current block scope
outer block scope
outer function or program scope
...
Variable lookup walks those scopes from inner to outer. Function lookup does the same, but with name plus arity.
| Rule | Behavior |
|---|---|
| Variable declaration | Adds the variable to the current scope. |
| Variable assignment | Requires the variable in current or outer scope. |
Variable use as V | Requires the variable in current or outer scope. |
| Function definition | Adds name arity to the current scope. |
| Function call | Requires matching name arity in visible scope. |
| Function parameters | Become variables in the function body scope. |
| Variable shadowing | Allowed. |
| Same-arity function redeclaration | Rejected if visible in current or outer scope. |
That is exactly the boundary a real compiler front end has:
lexer: this is a NAME token
parser: a NAME token is legal in this position
semantic checker: this particular name is or is not defined here
What the project does not do
It is useful to be precise about the boundary.
Karel-Lite does not:
- build an AST;
- execute robot commands;
- type-check numeric versus directional values beyond the grammar categories;
- optimize anything;
- recover from errors and continue parsing;
- support mutual recursion unless the called function has already been parsed.
Those are not flaws in the article-worthy sense. They clarify the project’s actual shape: it is a recognizer and static validator for a small language.
That makes it a good portfolio artifact. The implementation is small enough to read in one sitting, but it contains the core ideas behind larger language tools: scanning, token categories, predictive parsing, grammar productions, scope, arity, and diagnostics.
Why it matters
The project is interesting because it shows that compiler construction is not magic. A language front end is a disciplined sequence of increasingly richer questions:
Can this text be split into valid tokens?
Can those tokens form a valid syntax tree shape?
Do the names and calls make sense in scope?
Did we consume the whole input?
Karel-Lite answers those questions with direct code instead of hiding them behind a parser generator. That is why it reads well as engineering work: the formal language theory is not decorative. It is the structure of the program.
The repo is small, but the ideas are not. Regular languages handle tokens. Context-free grammars handle nested syntax. LL(1) selection sets make parsing deterministic. A scope stack handles the semantic rules that syntax alone cannot see.
That is a compiler front end in miniature.