Assignment 1
Welcome to the first assignment. In this assignment we’re first going to set some groundwork: we’re going to sketch out what our infrastructure will look like for our multiple intertwined components. I’ll paint a picture of how this will carry us through the course.
Then we’ll start by compiling the very smallest programs: integers. If your program consists of a single integer, we’ll compile it. No more, no less. Then we’ll add on some other constants like booleans and characters.
Since that’s kind of boring still, we’ll add some primitive functions to operate on those constants. You’ll be able to use your compiler as a calculator!
Since calculators are more useful with variable names, we’ll add the ability to
bind local variables with the let construct.
Last, we’ll add conditional branches, so your programs can make decisions.
Throughout this homework and others, you should be reading the Ghuloum tutorial as a second resource. While the mechanics are slightly different—host language, target language—the pointer tagging scheme and general ideas are going to be very nearly the same.
I recommend starting this assignment early. The earlier the better. Really, I recommend this with all assignments, but starting off the quarter with a good baseline make the rest easier.
Let’s go!
Setting up shop
We’re going to have three main components that interact:
- a compiler that produces bytecode
- an interpreter that runs that bytecode
- a test harness that supplies multiple programs, runs them through the pipeline, and compares the resulting value against an expected value
Why the split between a compiler and an interpreter, you may wonder. It’s not strictly necessary. Many a Scheme interpreter is what we call “tree-walking”—directly operates on a parsed representation of the source string. But this class is about compilers, so we need to do a little more transformation than just parsing. You can think about it as taking the tree-walking out of the interpreter and putting it an earlier stage.
We could go as far as Ghuloum does—compiling Scheme directly to x86 assembly—but then we wouldn’t get the fun of writing an interpreter too.
Your compiler will take Scheme code in the form of a string. Inside the
compiler, you will have a component—the parser—that is responsible for
turning that text into something more readily understandable by your compiler.
Generally, this will look like some kind of tree structure. In my simple
compiler, it is a lot of nested Python lists.
scheme_parse("1") # 1
scheme_parse("()") # []
scheme_parse("(+ a 3)") # ["+", "a", 3]
If you prefer to use a more typed approach—classes and subclasses, abstract datatypes (ADTs), etc—you are welcome to.
Let’s scaffold a couple of functions:
class Parser:
def __init__(self, source: str):
self.source = source
self.pos = 0
self.length = len(source)
def parse(self) -> object:
raise NotImplementedError("parse")
def scheme_parse(source: str) -> object:
return Parser(source).parse()
class Compiler:
def __init__(self):
self.code = []
def compile(self, expr):
raise NotImplementedError("compile")
def write_to_stream(self, f):
raise NotImplementedError("write_to_stream")
import enum
class I(enum.IntEnum):
# Where all of our opcodes will go
pass
We’re going to incrementally modify the parser, the code generator, and the interpreter as we add features. Let’s start with constants.
Integers
Let’s make sure we can read numbers first. This is what the structure should
look like. You will have to implement parse_number and peek and
skip_whitespace.
class Parser:
def parse(self) -> object:
self.skip_whitespace()
match self.peek():
case '':
raise EOFError("Unexpected end of input")
case c if c.isdigit():
return self.parse_number()
case c:
raise NotImplementedError(f"Parser only supports numbers currently. Found {c}")
Now add tests.
class ParseTests(unittest.TestCase):
def _parse(self, source: str) -> object:
return Parser(source).parse()
def test_parse_fixnum(self):
self.assertEqual(self._parse("42"), 42)
def test_parse_fixnum_with_whitespace(self):
self.assertEqual(self._parse(" 42"), 42)
What other tests might you want? What is a valid number? At minimum, we should be able to parse positive base-10 integers.
If you are using classes or ADTs, you may find it helpful to write some testing
helper functions like is_int(expr) or is_int_equals(expr, expected).
Important: make sure your tests are running. Make sure they can fail.
Now let’s compile numbers. Remember the contract we are going to be keeping between all of the operations we build: operations leave their results on the stack.
So what does it mean to compile a number? That’s right, push it on the stack.
class Compiler:
def __init__(self):
self.code = []
self.max_locals_count = 0
def compile(self, expr, env):
emit = self.code.append
match expr:
case int(_):
emit(I.LOAD64)
emit(box_fixnum(expr))
def compile_function(self, expr):
self.compile(expr)
self.code.append(I.RETURN)
This is where the pointer tagging comes in. All values that flow through the
Scheme program are tagged pointers: we don’t want to allocate memory for small
numbers, which are very frequent, so we pretend they are pointers. To avoid
having a different opcode for every kind of small in-pointer constant
(LOAD_FIXNUM, LOAD_CHAR, …), we tag the objects in the compiler instead
of in the interpreter and put the fake pointer directly in the bytecode.
Write a box_fixnum that mirrors the tagging scheme used in the Ghuloum
tutorial.
Question: what happens if we try to compile a bigger integer than fits in our tagged pointer representation?
Add opcodes LOAD64 and RETURN.
class I(enum.IntEnum):
# Starts at 0 and increments
LOAD64 = enum.auto()
RETURN = enum.auto()
Test your compiler. Make sure it emits the bytecode you expect.
Now let’s write the bytecode to disk so we can interpret it. The exact format you choose is not important, but it is important that you are consistent between your compiler and your interpreter. They have to speak a common language: a binary bytecode format.
Here is an example write_to_stream function that writes the instructions in our
code in little-endian form, with each opcode and argument taking a full 8
bytes:
class Compiler:
def write_to_stream(self, f):
for op in self.code:
f.write(op.to_bytes(8, "little"))
Here is a little driver that reads a Scheme program from stdin and writes the bytecode to stdout:
import sys
def compile_program():
source = sys.stdin.read()
program = scheme_parse(source)
compiler = Compiler()
compiler.compile_function(program)
compiler.write_to_stream(sys.stdout)
if __name__ == "__main__":
compile_program()
This means you can use your compiler like so:
$ python3 compiler.py < test.scm > test.bc
$
Inspect the bytecode with xxd or hexdump. Is it what you expect? If you
don’t know how to read the output of these tools, there are manuals online.
Question: what is the difference between
compileandcompile_function? If you don’t have an answer now, wait until the next section and think about it some more.
Now, finally, let us write the interpreter part of this section. Pick up a different programming language (say, C, C++, Rust, Java).
Write a function to load bytecode from stdin into memory as a series of opcodes and their arguments. Keep in mind that you will have to read the binary in little-endian, as we wrote it.
Once you have the code in-memory as a series of 64-bit values, you can write an interpret function. This function takes in the bytecode and produces a value.
It consists of two bits of state:
- a
pc, or program counter, which is an index into the given code and always points at the next instruction to be executed (or next argument to be read) - a
stack, which holds the intermediate results of some computation
It is the job of the interpreter to march the pc inexorably forward until it
hits a RETURN opcode, at which point it must pop a value off the stack and
finally return from the interpret function.
Here is some pseudocode in Python for what your interpreter should look like:
def interpret(code):
pc = 0
stack = []
def push(val):
...
def pop():
...
def readword():
...
while True:
instr = readword()
match instr:
case I.LOAD64:
...
case I.RETURN:
...
case _:
raise NotImplementedError(instr)
raise Exception("fell off the end")
The function that calls the interpret function should also print the
resulting value to stdout for inspection. This will help us write tests for our
code: we can use the printed value as a proxy for program correctness.
This means you can use your interpreter like so:
$ ./interpreter < test.bc
5
$
But remember: all of the values are tagged. This means that you must write a function to inspect the tagged value, see what kind of object it is, and print it accordingly.
(Make sure to print a newline after printing the value.)
Other constants
Repeat the same exercise in the parser, compiler, and printer: teach your system about characters, and booleans.