Build a Simple Compiler
Introduction
Compilers transform source code into executable machine code. In this tutorial, we'll build a compiler that compiles a simple programming language to bytecode, which is then executed by a virtual machine.
This approach (compiler + VM) is used by many popular languages including Python, Ruby, and Java—the key advantage is portability and security.
- A bytecode instruction set
- A compiler that transforms AST to bytecode
- A stack-based virtual machine
- An object system for runtime values
- Function call support
- A complete REPL
- How compilers work internally
- Bytecode instruction design
- Stack-based virtual machines
- Symbol tables and scope
- Code generation
- Runtime stack management
Core Concepts
Compilation vs Interpretation
Interpretation executes source code directly by parsing and evaluating each statement. Compilation translates source code into an intermediate representation (bytecode or machine code) that can be executed more efficiently.
Our compiler will take the same language we built in the interpreter tutorial and compile it to bytecode instead of interpreting it directly.
Bytecode
Bytecode is a compact binary representation of program instructions. Each bytecode instruction is typically one byte (hence the name), though operands can be multi-byte.
Our bytecode will have instructions like:
OpConstant- Load a constant onto the stackOpAdd- Pop two values, add them, push resultOpPop- Pop and discard top of stackOpJump- Unconditional jumpOpCall- Call a function
Stack-Based VM
A stack-based VM uses an operand stack to compute values. Operations pop their operands from the stack and push results back. This is simpler than register-based VMs and works well for bytecode.
Symbol Table
The compiler maintains a symbol table to track variable names and their stack locations. When compiling a variable reference, we look up its slot on the stack.
Project Overview
Our compiler will target the same language features as our interpreter:
| Feature | Description |
|---|---|
| Integers | Integer literals and arithmetic |
| Booleans | true/false with logical operations |
| Strings | String literals and concatenation |
| Variables | Let bindings with local scope |
| Functions | Function literals and calls |
| Conditionals | If-else expressions |
| Return | Return statements with values |
Prerequisites
- Python 3.8+ - Installed on your system
- Completion of Interpreter tutorial - We'll reuse the lexer and AST
- Understanding of stacks - LIFO data structure
Bytecode Format
Create opcode.py to define our bytecode instructions:
class OpCode:
CONSTANT = 0
NULL = 1
TRUE = 2
FALSE = 3
POP = 4
ADD = 5
SUB = 6
MUL = 7
DIV = 8
MOD = 9
POW = 10
MINUS = 11
BANG = 12
JUMP = 13
JUMP_NOT_TRUTHY = 14
NULL_POP = 15
GET_GLOBAL = 16
SET_GLOBAL = 17
GET_LOCAL = 18
SET_LOCAL = 19
GET_BUILTIN = 20
CLOSURE = 21
CALL = 22
RETURN_VALUE = 23
RETURN = 24
ARRAY = 25
HASH = 26
INDEX = 27
EQ = 28
NE = 29
GT = 30
GE = 31
LT = 32
LE = 33
AND = 34
OR = 35
INSTRUCTIONS = {
0: "OpConstant",
1: "OpNull",
2: "OpTrue",
3: "OpFalse",
4: "OpPop",
5: "OpAdd",
6: "OpSub",
7: "OpMul",
8: "OpDiv",
9: "OpMod",
10: "OpPow",
11: "OpMinus",
12: "OpBang",
13: "OpJump",
14: "OpJumpNotTruthy",
15: "OpNullPop",
16: "OpGetGlobal",
17: "OpSetGlobal",
18: "OpGetLocal",
19: "OpSetLocal",
20: "OpGetBuiltin",
21: "OpClosure",
22: "OpCall",
23: "OpReturnValue",
24: "OpReturn",
25: "OpArray",
26: "OpHash",
27: "OpIndex",
28: "OpEq",
29: "OpNe",
30: "OpGt",
31: "OpGe",
32: "OpLt",
33: "OpLe",
34: "OpAnd",
35: "OpOr",
}
Instructions Explained
Each instruction has a specific purpose:
OpConstant- Takes a 2-byte operand for the constant indexOpAdd/Sub/Mul/Div- Binary operations, pop 2, push 1OpJump- Takes a 2-byte jump destinationOpGetLocal/SetLocal- Takes 1-byte slot index
The Compiler
Create compiler.py to compile AST to bytecode:
import struct
from lexer import Lexer
from parser import Parser
from ast import *
from opcode import OpCode
from instructions import Instructions, make
class Bytecode:
def __init__(self, instructions, constants):
self.instructions = instructions
self.constants = constants
class Compiler:
def __init__(self):
self.instructions = []
self.constants = []
self.symbol_table = SymbolTable()
self.scopes = [Scope()]
self.scope_index = 0
def compile(self, node):
if isinstance(node, Program):
for stmt in node.statements:
self.compile(stmt)
elif isinstance(node, ExpressionStatement):
self.compile(node.expression)
self.emit(OpCode.POP)
elif isinstance(node, IntegerLiteral):
self.emit_constant(node.value)
elif isinstance(node, StringLiteral):
self.emit_constant(node.value)
elif isinstance(node, BooleanLiteral):
if node.value:
self.emit(OpCode.TRUE)
else:
self.emit(OpCode.FALSE)
elif isinstance(node, PrefixExpression):
self.compile(node.right)
if node.operator == '!':
self.emit(OpCode.BANG)
elif node.operator == '-':
self.emit(OpCode.MINUS)
elif isinstance(node, InfixExpression):
if node.operator == '+':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.ADD)
elif node.operator == '-':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.SUB)
elif node.operator == '*':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.MUL)
elif node.operator == '/':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.DIV)
elif node.operator == '==':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.EQ)
elif node.operator == '!=':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.NE)
elif node.operator == '>':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.GT)
elif node.operator == '<':
self.compile(node.left)
self.compile(node.right)
self.emit(OpCode.GT)
self.emit(OpCode.SWAP)
elif isinstance(node, IfExpression):
self.compile(node.condition)
jump_not_truthy = self.emit(OpCode.JUMP_NOT_TRUTHY, 9999)
self.compile(node.consequence)
self.emit(OpCode.POP)
if node.alternate:
jump = self.emit(OpCode.JUMP, 9999)
self.change_operand(jump_not_truthy, len(self.instructions))
self.compile(node.alternate)
else:
self.change_operand(jump_not_truthy, len(self.instructions))
self.emit(OpCode.POP)
elif isinstance(node, LetStatement):
self.compile(node.value)
symbol = self.define_symbol(node.name.value)
if symbol.scope == 'global':
self.emit(OpCode.SET_GLOBAL, symbol.index)
else:
self.emit(OpCode.SET_LOCAL, symbol.index)
elif isinstance(node, Identifier):
symbol = self.symbol_table.resolve(node.value)
if symbol:
if symbol.scope == 'global':
self.emit(OpCode.GET_GLOBAL, symbol.index)
else:
self.emit(OpCode.GET_LOCAL, symbol.index)
elif isinstance(node, FunctionLiteral):
fn_index = len(self.constants)
self.constants.append(node)
self.emit(OpCode.CLOSURE, fn_index, 0)
elif isinstance(node, CallExpression):
self.compile(node.function)
for arg in node.arguments:
self.compile(arg)
self.emit(OpCode.CALL, len(node.arguments))
elif isinstance(node, ReturnStatement):
if node.value:
self.compile(node.value)
else:
self.emit(OpCode.NULL)
self.emit(OpCode.RETURN_VALUE)
elif isinstance(node, LetStatement):
self.compile(node.value)
symbol = self.define_symbol(node.name.value)
def emit_constant(self, value):
const_index = self.add_constant(value)
self.emit(OpCode.CONSTANT, const_index)
def add_constant(self, value):
self.constants.append(value)
return len(self.constants) - 1
def emit(self, op, *operands):
ins = make(op, operands)
self.instructions.extend(ins)
return len(self.instructions) - len(ins)
def change_operand(self, pos, new_operand):
self.instructions[pos:pos+2] = struct.pack('>H', new_operand)
def define_symbol(self, name):
symbol = self.symbol_table.define(name)
return symbol
def bytecode(self):
return Bytecode(Instructions(self.instructions), self.constants)
class SymbolTable:
def __init__(self):
self.store = {}
self.free = []
self.counter = 0
def define(self, name):
symbol = Symbol(name, self.counter, 'global')
self.store[name] = symbol
self.counter += 1
return symbol
def resolve(self, name):
return self.store.get(name)
class Symbol:
def __init__(self, name, index, scope):
self.name = name
self.index = index
self.scope = scope
class Scope:
def __init__(self):
self.symbols = []
self.num_locals = 0
def define(self, name):
symbol = Symbol(name, self.num_locals, 'local')
self.symbols.append(symbol)
self.num_locals += 1
return symbol
Compiler Design
The compiler traverses the AST and emits bytecode instructions. Key design decisions:
- Symbol Table - Tracks variable names and their stack slots
- Constant Pool - Stores all constants (integers, strings, functions)
- Emission - Each AST node emits corresponding bytecode
- Jumps - Uses placeholder positions, fixed up after target known
The Virtual Machine
Create vm.py to execute bytecode:
import struct
from opcode import OpCode
from object import *
class VM:
def __init__(self, bytecode):
self.constants = bytecode.constants
self.instructions = bytecode.instructions
self.stack = [None] * 2048
self.sp = 0
self.globals = [None] * 64
self.frames = [Frame(None, 0, 0)]
self.frame_index = 0
def run(self):
while self.current_frame().ip < len(self.instructions):
ip = self.current_frame().ip
op = self.instructions[ip]
if op == OpCode.CONSTANT:
idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
self.push(self.constants[idx])
elif op == OpCode.NULL:
self.push(NULL)
elif op == OpCode.TRUE:
self.push(TRUE)
elif op == OpCode.FALSE:
self.push(FALSE)
elif op == OpCode.ADD:
right = self.pop()
left = self.pop()
if left.type == STRING and right.type == STRING:
self.push(String(left.value + right.value))
else:
self.push(Integer(left.value + right.value))
elif op == OpCode.SUB:
right = self.pop()
left = self.pop()
self.push(Integer(left.value - right.value))
elif op == OpCode.MUL:
right = self.pop()
left = self.pop()
self.push(Integer(left.value * right.value))
elif op == OpCode.DIV:
right = self.pop()
left = self.pop()
self.push(Integer(left.value // right.value))
elif op == OpCode.POP:
self.pop()
elif op == OpCode.MINUS:
val = self.pop()
self.push(Integer(-val.value))
elif op == OpCode.BANG:
val = self.pop()
self.push(TRUE if self.falsy(val) else FALSE)
elif op == OpCode.EQ:
right = self.pop()
left = self.pop()
self.push(TRUE if left.value == right.value else FALSE)
elif op == OpCode.NE:
right = self.pop()
left = self.pop()
self.push(TRUE if left.value != right.value else FALSE)
elif op == OpCode.GT:
right = self.pop()
left = self.pop()
self.push(TRUE if left.value > right.value else FALSE)
elif op == OpCode.GET_GLOBAL:
idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
self.push(self.globals[idx])
elif op == OpCode.SET_GLOBAL:
idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
self.globals[idx] = self.peek()
elif op == OpCode.GET_LOCAL:
slot = self.instructions[ip+1]
self.push(self.current_frame().locals[slot])
elif op == OpCode.SET_LOCAL:
slot = self.instructions[ip+1]
self.current_frame().locals[slot] = self.peek()
elif op == OpCode.JUMP:
pos = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
self.current_frame().ip = pos - 1
elif op == OpCode.JUMP_NOT_TRUTHY:
pos = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
condition = self.pop()
if self.falsy(condition):
self.current_frame().ip = pos - 1
elif op == OpCode.CLOSURE:
const_idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
free_count = self.instructions[ip+3]
fn = self.constants[const_idx]
closure = Closure(fn, [NULL] * free_count)
self.push(closure)
elif op == OpCode.CALL:
num_args = self.instructions[ip+1]
self.call_closure(num_args)
elif op == OpCode.RETURN_VALUE:
ret_val = self.pop()
frame = self.pop_frame()
self.sp = frame.sp - 1
self.push(ret_val)
elif op == OpCode.RETURN:
self.pop_frame()
self.sp = self.current_frame().sp - 1
self.push(NULL)
self.current_frame().ip += 1 + self.operand_count(op)
return self.pop() if self.sp > 0 else NULL
def push(self, obj):
self.stack[self.sp] = obj
self.sp += 1
def pop(self):
obj = self.stack[self.sp - 1]
self.sp -= 1
return obj
def peek(self):
return self.stack[self.sp - 1]
def current_frame(self):
return self.frames[self.frame_index]
def call_closure(self, num_args):
fn = self.stack[self.sp - 1 - num_args]
if fn.type != CLOSURE:
raise RuntimeError(f"not a function: {fn.type}")
frame = Frame(fn, self.sp - num_args, len(fn.free))
for i, free in enumerate(fn.free):
frame.locals[i] = free
self.frames.append(frame)
self.frame_index += 1
def pop_frame(self):
frame = self.frames.pop()
self.frame_index -= 1
return frame
def falsy(self, obj):
return obj == NULL or obj == FALSE
class Frame:
def __init__(self, closure, sp, num_locals):
self.closure = closure
self.ip = 0
self.sp = sp
self.locals = [NULL] * num_locals
Object System
Create object.py for runtime values:
class Object:
def type(self):
return self.__class__.__name__.upper()
class Integer(Object):
def __init__(self, value):
self.value = value
class String(Object):
def __init__(self, value):
self.value = value
class Boolean(Object):
def __init__(self, value):
self.value = value
class Null(Object):
_instance = None
def __new__(cls):
if cls._instance is None:
cls._instance = super().__new__(cls)
return cls._instance
class Function(Object):
def __init__(self, parameters, body, num_locals):
self.parameters = parameters
self.body = body
self.num_locals = num_locals
class Closure(Object):
def __init__(self, function, free):
self.function = function
self.free = free
class ReturnValue(Object):
def __init__(self, value):
self.value = value
NULL = Null()
TRUE = Boolean(True)
FALSE = Boolean(False)
INTEGER = "INTEGER"
STRING = "STRING"
BOOLEAN = "BOOLEAN"
NULL_TYPE = "NULL"
FUNCTION = "FUNCTION"
CLOSURE = "CLOSURE"
RETURN_VALUE = "RETURN_VALUE"
Testing the Compiler
Create a REPL to test the compiler:
from lexer import Lexer
from parser import Parser
from compiler import Compiler
from vm import VM
from object import NULL
def run(source):
lexer = Lexer(source)
parser = Parser(lexer)
program = parser.parse_program()
if parser.errors:
print("Parser errors:")
for err in parser.errors:
print(f" {err}")
return
compiler = Compiler()
try:
compiler.compile(program)
except Exception as e:
print(f"Compiler error: {e}")
return
bytecode = compiler.bytecode()
vm = VM(bytecode)
try:
result = vm.run()
if result and result != NULL:
print(result.value if hasattr(result, 'value') else result)
except Exception as e:
print(f"Runtime error: {e}")
def repl():
print("Bytecode Compiler VM v0.1")
print("Type 'exit' to quit")
print("=" * 40)
while True:
try:
source = input(">> ")
if source.lower() in ('exit', 'quit'):
break
if not source.strip():
continue
run(source)
except KeyboardInterrupt:
print("\nExiting...")
break
if __name__ == "__main__":
repl()
Test Examples
# Arithmetic
>> 5 + 3
8
>> 10 - 4 * 2
2
>> (10 - 4) * 2
12
# Variables
>> let x = 5;
>> let y = 10;
>> x + y
15
# Conditionals
>> let max = fn(a, b) { if (a > b) { a; } else { b; } };
>> max(5, 10)
10
# Functions
>> let fib = fn(n) { if (n <= 1) { n; } else { fib(n - 1) + fib(n - 2); } };
>> fib(10)
55
Summary
Congratulations! You've built a complete compiler and virtual machine. Here's what you learned:
- Bytecode Design - How to design efficient bytecode instructions
- Compilation - How to translate AST to bytecode
- Symbol Tables - How to track variables during compilation
- Stack VM - How to execute bytecode on a virtual machine
- Function Calls - How to implement closures and function calls
- Runtime - How to manage runtime values and stack frames
Possible Extensions
- Add arrays and hash maps
- Implement garbage collection
- Add method calls on objects
- Implement tail call optimization
- Add coroutines/generators
- Build a debugger with breakpoints