← Back to Tutorials
Python

Build a Simple Compiler

Difficulty: Advanced Est. Time: ~6 hours

Introduction

Compilers transform source code into executable machine code. In this tutorial, we'll build a compiler that compiles a simple programming language to bytecode, which is then executed by a virtual machine.

This approach (compiler + VM) is used by many popular languages including Python, Ruby, and Java—the key advantage is portability and security.

What You'll Build
  • A bytecode instruction set
  • A compiler that transforms AST to bytecode
  • A stack-based virtual machine
  • An object system for runtime values
  • Function call support
  • A complete REPL
What You'll Learn
  • How compilers work internally
  • Bytecode instruction design
  • Stack-based virtual machines
  • Symbol tables and scope
  • Code generation
  • Runtime stack management

Core Concepts

Compilation vs Interpretation

Interpretation executes source code directly by parsing and evaluating each statement. Compilation translates source code into an intermediate representation (bytecode or machine code) that can be executed more efficiently.

Our compiler will take the same language we built in the interpreter tutorial and compile it to bytecode instead of interpreting it directly.

Bytecode

Bytecode is a compact binary representation of program instructions. Each bytecode instruction is typically one byte (hence the name), though operands can be multi-byte.

Our bytecode will have instructions like:

  • OpConstant - Load a constant onto the stack
  • OpAdd - Pop two values, add them, push result
  • OpPop - Pop and discard top of stack
  • OpJump - Unconditional jump
  • OpCall - Call a function

Stack-Based VM

A stack-based VM uses an operand stack to compute values. Operations pop their operands from the stack and push results back. This is simpler than register-based VMs and works well for bytecode.

Symbol Table

The compiler maintains a symbol table to track variable names and their stack locations. When compiling a variable reference, we look up its slot on the stack.

Project Overview

Our compiler will target the same language features as our interpreter:

Feature Description
Integers Integer literals and arithmetic
Booleans true/false with logical operations
Strings String literals and concatenation
Variables Let bindings with local scope
Functions Function literals and calls
Conditionals If-else expressions
Return Return statements with values

Prerequisites

  • Python 3.8+ - Installed on your system
  • Completion of Interpreter tutorial - We'll reuse the lexer and AST
  • Understanding of stacks - LIFO data structure

Bytecode Format

Create opcode.py to define our bytecode instructions:

class OpCode:
    CONSTANT = 0
    NULL = 1
    TRUE = 2
    FALSE = 3
    POP = 4
    ADD = 5
    SUB = 6
    MUL = 7
    DIV = 8
    MOD = 9
    POW = 10
    MINUS = 11
    BANG = 12
    JUMP = 13
    JUMP_NOT_TRUTHY = 14
    NULL_POP = 15
    GET_GLOBAL = 16
    SET_GLOBAL = 17
    GET_LOCAL = 18
    SET_LOCAL = 19
    GET_BUILTIN = 20
    CLOSURE = 21
    CALL = 22
    RETURN_VALUE = 23
    RETURN = 24
    ARRAY = 25
    HASH = 26
    INDEX = 27
    EQ = 28
    NE = 29
    GT = 30
    GE = 31
    LT = 32
    LE = 33
    AND = 34
    OR = 35


INSTRUCTIONS = {
    0: "OpConstant",
    1: "OpNull",
    2: "OpTrue",
    3: "OpFalse",
    4: "OpPop",
    5: "OpAdd",
    6: "OpSub",
    7: "OpMul",
    8: "OpDiv",
    9: "OpMod",
    10: "OpPow",
    11: "OpMinus",
    12: "OpBang",
    13: "OpJump",
    14: "OpJumpNotTruthy",
    15: "OpNullPop",
    16: "OpGetGlobal",
    17: "OpSetGlobal",
    18: "OpGetLocal",
    19: "OpSetLocal",
    20: "OpGetBuiltin",
    21: "OpClosure",
    22: "OpCall",
    23: "OpReturnValue",
    24: "OpReturn",
    25: "OpArray",
    26: "OpHash",
    27: "OpIndex",
    28: "OpEq",
    29: "OpNe",
    30: "OpGt",
    31: "OpGe",
    32: "OpLt",
    33: "OpLe",
    34: "OpAnd",
    35: "OpOr",
}

Instructions Explained

Each instruction has a specific purpose:

  • OpConstant - Takes a 2-byte operand for the constant index
  • OpAdd/Sub/Mul/Div - Binary operations, pop 2, push 1
  • OpJump - Takes a 2-byte jump destination
  • OpGetLocal/SetLocal - Takes 1-byte slot index

The Compiler

Create compiler.py to compile AST to bytecode:

import struct
from lexer import Lexer
from parser import Parser
from ast import *
from opcode import OpCode
from instructions import Instructions, make

class Bytecode:
    def __init__(self, instructions, constants):
        self.instructions = instructions
        self.constants = constants


class Compiler:
    def __init__(self):
        self.instructions = []
        self.constants = []
        self.symbol_table = SymbolTable()
        self.scopes = [Scope()]
        self.scope_index = 0
    
    def compile(self, node):
        if isinstance(node, Program):
            for stmt in node.statements:
                self.compile(stmt)
        elif isinstance(node, ExpressionStatement):
            self.compile(node.expression)
            self.emit(OpCode.POP)
        elif isinstance(node, IntegerLiteral):
            self.emit_constant(node.value)
        elif isinstance(node, StringLiteral):
            self.emit_constant(node.value)
        elif isinstance(node, BooleanLiteral):
            if node.value:
                self.emit(OpCode.TRUE)
            else:
                self.emit(OpCode.FALSE)
        elif isinstance(node, PrefixExpression):
            self.compile(node.right)
            if node.operator == '!':
                self.emit(OpCode.BANG)
            elif node.operator == '-':
                self.emit(OpCode.MINUS)
        elif isinstance(node, InfixExpression):
            if node.operator == '+':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.ADD)
            elif node.operator == '-':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.SUB)
            elif node.operator == '*':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.MUL)
            elif node.operator == '/':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.DIV)
            elif node.operator == '==':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.EQ)
            elif node.operator == '!=':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.NE)
            elif node.operator == '>':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.GT)
            elif node.operator == '<':
                self.compile(node.left)
                self.compile(node.right)
                self.emit(OpCode.GT)
                self.emit(OpCode.SWAP)
        elif isinstance(node, IfExpression):
            self.compile(node.condition)
            jump_not_truthy = self.emit(OpCode.JUMP_NOT_TRUTHY, 9999)
            self.compile(node.consequence)
            self.emit(OpCode.POP)
            if node.alternate:
                jump = self.emit(OpCode.JUMP, 9999)
                self.change_operand(jump_not_truthy, len(self.instructions))
                self.compile(node.alternate)
            else:
                self.change_operand(jump_not_truthy, len(self.instructions))
            self.emit(OpCode.POP)
        elif isinstance(node, LetStatement):
            self.compile(node.value)
            symbol = self.define_symbol(node.name.value)
            if symbol.scope == 'global':
                self.emit(OpCode.SET_GLOBAL, symbol.index)
            else:
                self.emit(OpCode.SET_LOCAL, symbol.index)
        elif isinstance(node, Identifier):
            symbol = self.symbol_table.resolve(node.value)
            if symbol:
                if symbol.scope == 'global':
                    self.emit(OpCode.GET_GLOBAL, symbol.index)
                else:
                    self.emit(OpCode.GET_LOCAL, symbol.index)
        elif isinstance(node, FunctionLiteral):
            fn_index = len(self.constants)
            self.constants.append(node)
            self.emit(OpCode.CLOSURE, fn_index, 0)
        elif isinstance(node, CallExpression):
            self.compile(node.function)
            for arg in node.arguments:
                self.compile(arg)
            self.emit(OpCode.CALL, len(node.arguments))
        elif isinstance(node, ReturnStatement):
            if node.value:
                self.compile(node.value)
            else:
                self.emit(OpCode.NULL)
            self.emit(OpCode.RETURN_VALUE)
        elif isinstance(node, LetStatement):
            self.compile(node.value)
            symbol = self.define_symbol(node.name.value)

    def emit_constant(self, value):
        const_index = self.add_constant(value)
        self.emit(OpCode.CONSTANT, const_index)

    def add_constant(self, value):
        self.constants.append(value)
        return len(self.constants) - 1

    def emit(self, op, *operands):
        ins = make(op, operands)
        self.instructions.extend(ins)
        return len(self.instructions) - len(ins)

    def change_operand(self, pos, new_operand):
        self.instructions[pos:pos+2] = struct.pack('>H', new_operand)

    def define_symbol(self, name):
        symbol = self.symbol_table.define(name)
        return symbol

    def bytecode(self):
        return Bytecode(Instructions(self.instructions), self.constants)


class SymbolTable:
    def __init__(self):
        self.store = {}
        self.free = []
        self.counter = 0

    def define(self, name):
        symbol = Symbol(name, self.counter, 'global')
        self.store[name] = symbol
        self.counter += 1
        return symbol

    def resolve(self, name):
        return self.store.get(name)


class Symbol:
    def __init__(self, name, index, scope):
        self.name = name
        self.index = index
        self.scope = scope


class Scope:
    def __init__(self):
        self.symbols = []
        self.num_locals = 0

    def define(self, name):
        symbol = Symbol(name, self.num_locals, 'local')
        self.symbols.append(symbol)
        self.num_locals += 1
        return symbol

Compiler Design

The compiler traverses the AST and emits bytecode instructions. Key design decisions:

  • Symbol Table - Tracks variable names and their stack slots
  • Constant Pool - Stores all constants (integers, strings, functions)
  • Emission - Each AST node emits corresponding bytecode
  • Jumps - Uses placeholder positions, fixed up after target known

The Virtual Machine

Create vm.py to execute bytecode:

import struct
from opcode import OpCode
from object import *

class VM:
    def __init__(self, bytecode):
        self.constants = bytecode.constants
        self.instructions = bytecode.instructions
        self.stack = [None] * 2048
        self.sp = 0
        self.globals = [None] * 64
        self.frames = [Frame(None, 0, 0)]
        self.frame_index = 0

    def run(self):
        while self.current_frame().ip < len(self.instructions):
            ip = self.current_frame().ip
            op = self.instructions[ip]
            
            if op == OpCode.CONSTANT:
                idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
                self.push(self.constants[idx])
            elif op == OpCode.NULL:
                self.push(NULL)
            elif op == OpCode.TRUE:
                self.push(TRUE)
            elif op == OpCode.FALSE:
                self.push(FALSE)
            elif op == OpCode.ADD:
                right = self.pop()
                left = self.pop()
                if left.type == STRING and right.type == STRING:
                    self.push(String(left.value + right.value))
                else:
                    self.push(Integer(left.value + right.value))
            elif op == OpCode.SUB:
                right = self.pop()
                left = self.pop()
                self.push(Integer(left.value - right.value))
            elif op == OpCode.MUL:
                right = self.pop()
                left = self.pop()
                self.push(Integer(left.value * right.value))
            elif op == OpCode.DIV:
                right = self.pop()
                left = self.pop()
                self.push(Integer(left.value // right.value))
            elif op == OpCode.POP:
                self.pop()
            elif op == OpCode.MINUS:
                val = self.pop()
                self.push(Integer(-val.value))
            elif op == OpCode.BANG:
                val = self.pop()
                self.push(TRUE if self.falsy(val) else FALSE)
            elif op == OpCode.EQ:
                right = self.pop()
                left = self.pop()
                self.push(TRUE if left.value == right.value else FALSE)
            elif op == OpCode.NE:
                right = self.pop()
                left = self.pop()
                self.push(TRUE if left.value != right.value else FALSE)
            elif op == OpCode.GT:
                right = self.pop()
                left = self.pop()
                self.push(TRUE if left.value > right.value else FALSE)
            elif op == OpCode.GET_GLOBAL:
                idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
                self.push(self.globals[idx])
            elif op == OpCode.SET_GLOBAL:
                idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
                self.globals[idx] = self.peek()
            elif op == OpCode.GET_LOCAL:
                slot = self.instructions[ip+1]
                self.push(self.current_frame().locals[slot])
            elif op == OpCode.SET_LOCAL:
                slot = self.instructions[ip+1]
                self.current_frame().locals[slot] = self.peek()
            elif op == OpCode.JUMP:
                pos = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
                self.current_frame().ip = pos - 1
            elif op == OpCode.JUMP_NOT_TRUTHY:
                pos = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
                condition = self.pop()
                if self.falsy(condition):
                    self.current_frame().ip = pos - 1
            elif op == OpCode.CLOSURE:
                const_idx = struct.unpack('>H', bytes(self.instructions[ip+1:ip+3]))[0]
                free_count = self.instructions[ip+3]
                fn = self.constants[const_idx]
                closure = Closure(fn, [NULL] * free_count)
                self.push(closure)
            elif op == OpCode.CALL:
                num_args = self.instructions[ip+1]
                self.call_closure(num_args)
            elif op == OpCode.RETURN_VALUE:
                ret_val = self.pop()
                frame = self.pop_frame()
                self.sp = frame.sp - 1
                self.push(ret_val)
            elif op == OpCode.RETURN:
                self.pop_frame()
                self.sp = self.current_frame().sp - 1
                self.push(NULL)
            
            self.current_frame().ip += 1 + self.operand_count(op)

        return self.pop() if self.sp > 0 else NULL

    def push(self, obj):
        self.stack[self.sp] = obj
        self.sp += 1

    def pop(self):
        obj = self.stack[self.sp - 1]
        self.sp -= 1
        return obj

    def peek(self):
        return self.stack[self.sp - 1]

    def current_frame(self):
        return self.frames[self.frame_index]

    def call_closure(self, num_args):
        fn = self.stack[self.sp - 1 - num_args]
        if fn.type != CLOSURE:
            raise RuntimeError(f"not a function: {fn.type}")
        
        frame = Frame(fn, self.sp - num_args, len(fn.free))
        for i, free in enumerate(fn.free):
            frame.locals[i] = free
        
        self.frames.append(frame)
        self.frame_index += 1

    def pop_frame(self):
        frame = self.frames.pop()
        self.frame_index -= 1
        return frame

    def falsy(self, obj):
        return obj == NULL or obj == FALSE


class Frame:
    def __init__(self, closure, sp, num_locals):
        self.closure = closure
        self.ip = 0
        self.sp = sp
        self.locals = [NULL] * num_locals

Object System

Create object.py for runtime values:

class Object:
    def type(self):
        return self.__class__.__name__.upper()


class Integer(Object):
    def __init__(self, value):
        self.value = value


class String(Object):
    def __init__(self, value):
        self.value = value


class Boolean(Object):
    def __init__(self, value):
        self.value = value


class Null(Object):
    _instance = None
    
    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


class Function(Object):
    def __init__(self, parameters, body, num_locals):
        self.parameters = parameters
        self.body = body
        self.num_locals = num_locals


class Closure(Object):
    def __init__(self, function, free):
        self.function = function
        self.free = free


class ReturnValue(Object):
    def __init__(self, value):
        self.value = value


NULL = Null()
TRUE = Boolean(True)
FALSE = Boolean(False)
INTEGER = "INTEGER"
STRING = "STRING"
BOOLEAN = "BOOLEAN"
NULL_TYPE = "NULL"
FUNCTION = "FUNCTION"
CLOSURE = "CLOSURE"
RETURN_VALUE = "RETURN_VALUE"

Testing the Compiler

Create a REPL to test the compiler:

from lexer import Lexer
from parser import Parser
from compiler import Compiler
from vm import VM
from object import NULL

def run(source):
    lexer = Lexer(source)
    parser = Parser(lexer)
    program = parser.parse_program()
    
    if parser.errors:
        print("Parser errors:")
        for err in parser.errors:
            print(f"  {err}")
        return
    
    compiler = Compiler()
    try:
        compiler.compile(program)
    except Exception as e:
        print(f"Compiler error: {e}")
        return
    
    bytecode = compiler.bytecode()
    vm = VM(bytecode)
    
    try:
        result = vm.run()
        if result and result != NULL:
            print(result.value if hasattr(result, 'value') else result)
    except Exception as e:
        print(f"Runtime error: {e}")


def repl():
    print("Bytecode Compiler VM v0.1")
    print("Type 'exit' to quit")
    print("=" * 40)
    
    while True:
        try:
            source = input(">> ")
            if source.lower() in ('exit', 'quit'):
                break
            if not source.strip():
                continue
            run(source)
        except KeyboardInterrupt:
            print("\nExiting...")
            break


if __name__ == "__main__":
    repl()

Test Examples

# Arithmetic
>> 5 + 3
8
>> 10 - 4 * 2
2
>> (10 - 4) * 2
12

# Variables
>> let x = 5;
>> let y = 10;
>> x + y
15

# Conditionals
>> let max = fn(a, b) { if (a > b) { a; } else { b; } };
>> max(5, 10)
10

# Functions
>> let fib = fn(n) { if (n <= 1) { n; } else { fib(n - 1) + fib(n - 2); } };
>> fib(10)
55

Summary

Congratulations! You've built a complete compiler and virtual machine. Here's what you learned:

  • Bytecode Design - How to design efficient bytecode instructions
  • Compilation - How to translate AST to bytecode
  • Symbol Tables - How to track variables during compilation
  • Stack VM - How to execute bytecode on a virtual machine
  • Function Calls - How to implement closures and function calls
  • Runtime - How to manage runtime values and stack frames

Possible Extensions

  • Add arrays and hash maps
  • Implement garbage collection
  • Add method calls on objects
  • Implement tail call optimization
  • Add coroutines/generators
  • Build a debugger with breakpoints