I wrote previously about different slices of life in the various implementations of the Scrapscript programming language. Until two weeks ago, the most interesting implementation was a so-called “baseline compiler” directly from the AST to C. Now we have an intermediate representation—an IR.
The AST is fine. It’s a very abstract representation of the progam and it is easy to generate directly from the source text. None of the names are resolved (there are a bunch of strings everywhere), there is a lot left implicit in the representation, and evaluating it requires having a stack or recursion to keep temporary data around. But hey, it works and it’s reasonably small.
It’s possible to incrementally rewrite the AST by adding new types of nodes or
folding existing ones together. For example, it’s possible to find patterns
similar to Binop(BinopKind.ADD, Int(2), Int(3))
and rewrite them directly
into Int(5)
. This kind of optimization either doesn’t go very far or also
requires keeping significant amounts of context on the side.
It’s also possible to try and retrofit some of this constant folding into other AST passes that run anyway. You can kind of see this in the post about the compiler tricks where we opportunistically fold constant data into its own section so it does not need to be heap allocated.
Instead, I chose to create a new program representation designed to be optimized and rewritten: an SSA IR. True to the spirit of the project, it’s a home-grown one, not some existing library like LLVM.
This is partly because it keeps things simple—the IR, some optimization passes, and the compiler from the IR to C fit neatly into ~1000 lines of code—and partly because projects like LLVM are not designed for languages like this. They can probably do some impressive things, but they would do better if purpose-built domain-specific analyses unwrapped some of the functional abstractions first.
Still, you might be wondering why we then compile our IR to C instead of LLVM. Two reasons: 1) we already have an existing C runtime and 2) I really, really, really do not want to be tied to a particular LLVM version and its quirks. One of the most frustrating things when hacking on someone else’s language project is watching it download a huge LLVM bundle and then compile it. It does a heck of a lot. It’s very impressive. It’s many person-lifetimes of engineering marvels. But it’s too big for this dinky little tugboat of a project.
(What about QBE? Well, if I can figure out how to integrate our GC and its notion of handles into a QBE IR, great. But for now it’s C only.)
Let’s start small and make a list. The Scrapscript expression [1, 2]
, when
evaluated, creates a list containing two elements. The language doesn’t
expressly guarantee how lists are stored in memory but the baseline compiler
and C runtime store them as cons cells—as linked lists. The IR to build up
this linked list looks like this1:
fn0 {
bb0 {
v0 = Const<[]>
v1 = Const<2>
v2 = ListCons v1, v0
v3 = Const<1>
v4 = ListCons v3, v2
Return v4
}
}
This is a text representation generated by a to_string
function. The actual
IR involves classes and pointers and data structures, but I sometimes find it
helpful to think in “text space”.
The IR contains one function, fn0
, which has one basic block, bb0
, which
has a number of instructions. Each instruction is given a name which also
corresponds to the value it produces.
The first two are the Const
instruction and they carry constant data known to
the compiler at compile-time. These are represented by “immediates” enclosed in
<
angle brackets>
.
The third is the ListCons
instruction. This corresponds to a heap allocation:
building a new linked list node at run-time. It takes two operands, v1
and
v0
, which represent the head and tail of the new linked list node,
respectively.
The last one is the Return
instruction, which exits the current function with
the given operand. Return
is a control flow instruction. Control flow is only
allowed to happen at the end of a basic block, which makes all the instructions
inside a block nicely control flow free.
Let’s do a slightly more complicated example that involves pattern matching.
We’ll take a look at the IR corresponding to the Scrapscript program | 1 -> 2 + 3
,
which is a pattern matching function. If the argument to the function is equal
to 1, it returns 5. Otherwise, it aborts.
We’ll first look at the function that contains—that creates—the pattern matching function. The “main function”, if you will.
fn0 {
bb0 {
v0 = NewClosure<fn1>
Return v0
}
}
This function allocates a closure object on the heap corresponding to the
function fn1
. we’ll come back to the details later, but the important thing
to know is that unlike in the source program and AST, fn1
is not a name to be
looked up later as a string. Instead, it’s a direct pointer to a known IR
function. And that function’s IR looks like this:
fn1 {
bb0 {
v0 = Param<0; $clo>
v1 = Param<1; arg_0>
Jump bb2
}
bb2 {
v2 = IsIntEqualWord v1, 1
CondBranch v2, bb3, bb1
}
bb1 {
MatchFail
}
bb3 {
v3 = Const<2>
v4 = Const<3>
v5 = IntAdd v3, v4
Return v5
}
}
Whoa, multiple blocks! Let’s go through them one by one.
The first block contains instructions that exist to give names to function parameters. It might be a little confusing that we have not one but two parameters for a function that has no parameters visible in the program text. Here’s why we need each of them:
arg_0
, because it exists implicitly in the program.
It’s the unnamed thing being matched—in this case, checked to see if the
value is 1.Then we have a new control instruction, Jump
, that represents continuing the
flow of execution at another block. In this case, bb2
.
bb2
checks if v1
is equal to the integer 1 and returns a native boolean
(not a Scrapscript object). This gets fed into our third control instruction so
far, CondBranch
.
CondBranch
takes a condition and two basic blocks. If the condition is true,
it passes control the first one. If it’s false, to the second. This gives us
the ability to support conditionals in the source language.
In the true case, if the argument is 1, we add up 2 and 3 using the IntAdd
instruction. Otherwise, in bb1
, we encounter our last control instruction:
MatchFail
. This aborts the program.
You might notice that this IR snippet leaves something to be desired. For one,
we have an unconditional jump from bb0
to bb2
and bb2
has no other
incoming control flow from any other blocks. It seems like we should be able to
smush the contents of bb2
into bb0
instead.
You might also notice that the closure parameter is completely unused. We
could get rid of it because nothing uses v0
.
Last, you might notice that, similar to the first example with the list, we have two constant operands to an instruction that seems like it should be able to be computed at compile-time.
Some good news: all of these can be dealt with using very generic compiler passes over the IR.
Let’s tackle each of the aforementioned concerns one by one.
To merge basic blocks, I adapted a compiler pass called CleanCFG from the Cinder project, which I previously adapted from the compiler pass CleanCFG from the HHVM project.
@dataclasses.dataclass
class CleanCFG:
fn: IRFunction
def run(self) -> None:
changed = True
while changed:
changed = False
for block in self.fn.cfg.rpo():
if not block.instrs:
# Ignore transient empty blocks.
continue
# Keep working on the current block until no further changes are made.
while self.absorb_dst_block(block):
pass
changed = self.remove_unreachable_blocks()
# ...
To remove dead code, we have a reasonably straightforward implementation of dead code elimination (DCE). DCE is kind of like a garbage collector for instructions: it finds the root set (the required instructions, kind of like the backbone of the program), marks them, and then marks the instructions they indirectly/transitively depend on. Then it removes (sweeps) the rest.
Because each instruction embeds a union-find data structure in it, I
“soft-delete” instructions by marking them equivalent to a no-op instruction. I
could have also gone through and deleted them, but deleting instructions from
the middle of list while iterating over it felt finicky and bothersome. Nop
instructions are not shown in the to_string
representation of the IR and
don’t compile to any C code.
@dataclasses.dataclass
class DeadCodeElimination:
fn: IRFunction
def is_critical(self, instr: Instr) -> bool:
if isinstance(instr, Const):
return False
if isinstance(instr, IntAdd):
return False
# TODO(max): Add more. Track heap effects?
return True
def run(self) -> None:
worklist: list[Instr] = []
marked: set[Instr] = set()
blocks = self.fn.cfg.rpo()
# Mark
for block in blocks:
for instr in block.instrs:
instr = instr.find()
if self.is_critical(instr):
marked.add(instr)
worklist.append(instr)
while worklist:
instr = worklist.pop(0).find()
if isinstance(instr, HasOperands):
for op in instr.operands:
op = op.find()
if op not in marked:
marked.add(op)
worklist.append(op)
# Sweep
for block in blocks:
for instr in block.instrs:
instr = instr.find()
if instr not in marked:
instr.make_equal_to(Nop())
The last problem, the lack of constant folding, could be solved by a pass that looks something like this:
def simplify(fn):
changed = True
while changed:
changed = False
for block in fn.blocks:
for instr in block.instrs:
changed |= try_simplify(instr)
This would probably work, but it would do many passes over the entire control flow graph for each function. For big functions with many small reduction steps, this could take a long time.
Additionally, it doesn’t include any type inference. Each instruction is
limited to the type information visible from its local context. For small
optimizations like our 2+3
or list cons, this might be fine, but there are
situations where local type information would not be enough to simplify
an instruction.
Instead, we crack open the Constant propagation with conditional branches paper (known informally as SCCP) which takes a smarter worklist approach.
Two worklists, actually. It uses one to hold basic blocks and one to hold instructions. Starting at the entrypoint, it constructs a set of reachable blocks and flows types (members of the constant propagation lattice) through the instructions.
The exciting bit is that it uses the types in discovering the control flow: if a conditional branch is known to branch a certain way at analysis time, the other branch is not analyzed at all and its computation does not pollute the analysis results.
Here is a snippet of SCCP, excluding the type lattice. The type lattice is an interesting concept that I am implementing poorly and I would like to do a better job (faster, smaller code, …) later.
@dataclasses.dataclass
class SCCP:
fn: IRFunction
instr_type: dict[Instr, ConstantLattice]
block_executable: set[Block]
instr_uses: dict[Instr, set[Instr]]
def type_of(self, instr: Instr) -> ConstantLattice:
...
def run(self) -> dict[Instr, ConstantLattice]:
block_worklist: list[Block] = [self.fn.cfg.entry]
instr_worklist: list[Instr] = []
while block_worklist or instr_worklist:
if instr_worklist and (instr := instr_worklist.pop(0)):
instr = instr.find()
if isinstance(instr, HasOperands):
for operand in instr.operands:
if operand not in self.instr_uses:
self.instr_uses[operand] = set()
self.instr_uses[operand].add(instr)
new_type: ConstantLattice = CBottom()
if isinstance(instr, Const):
value = instr.value
if isinstance(value, Int):
new_type = CInt(value.value)
if isinstance(value, List):
new_type = CList()
# ...
elif isinstance(instr, CondBranch):
match self.type_of(instr.operands[0]):
case CCBool(True):
instr.make_equal_to(Jump(instr.conseq))
block_worklist.append(instr.conseq)
case CCBool(False):
instr.make_equal_to(Jump(instr.alt))
block_worklist.append(instr.alt)
case CBottom():
pass
case _:
block_worklist.append(instr.conseq)
block_worklist.append(instr.alt)
elif isinstance(instr, IntAdd):
match (self.type_of(instr.operands[0]), self.type_of(instr.operands[1])):
case (CInt(int(l)), CInt(int(r))):
new_type = CInt(l + r)
instr.make_equal_to(Const(Int(l+r)))
case (CInt(_), CInt(_)):
new_type = CInt()
# ...
old_type = self.type_of(instr)
if union(old_type, new_type) != old_type:
self.instr_type[instr] = new_type
for use in self.instr_uses.get(instr, set()):
instr_worklist.append(use)
if block_worklist and (block := block_worklist.pop(0)):
if block not in self.block_executable:
self.block_executable.add(block)
instr_worklist.extend(block.instrs)
return self.instr_type
One thing to note is that SCCP requires either pre-computing or incrementally discovering use edges in your IR. That is, when it updates the type of instruction A, it needs to know what other instructions use A so that they can be re-queued for analysis.
Also, please ignore the using-lists-as-queues thing which is terribly slow. You should instead use a proper queue or deque structure.
SCCP is pretty powerful but at least as shown in the paper, it is a little bit limited. The paper constrains the lattice to ensure that the number of passes over the IR is bounded by a small constant factor (the height of the lattice), but I think more interesting things happen if your lattice is taller…
What if you use your full type lattice instead of just a constant propagation
lattice? It should still have a finite height to guarantee that the analysis
terminates, but you could get more information than just “constant or not”. In
the example above, you can see a little snippet of that: CInt
indicates that
an object is an integer, whereas CInt(...)
indicates that it is an integer
with a specific value. You can maybe imagine extending this to work on sets
of functions, at which point this begins to look a little bit like
CFA/points-to analysis.
I may have some writing coming out in a couple of weeks about interprocedural SCCP.
In any case, here is the result of running these passes over the CFG. You can see that a lot of the things we thought were sub-optimal have been removed:
fn1 {
bb0 {
v0 = Param<1; arg_0>
v1 = IsIntEqualWord v0, 1
CondBranch v1, bb3, bb1
}
bb1 {
MatchFail
}
bb3 {
v2 = Const<5>
Return v2
}
}
Types are great but they aren’t everything. Sometimes, the analysis runs on a totally different lattice that describes a different property of how the data is used. Sometimes, the analysis even runs backwards!
A great example of this is liveness analysis, which is used in, among other things, register allocation. Liveness starts from the end(s) of a function and work their way up. Variables start dead, become live at their first use, and then become… undead? I guess? at their definition:
a = 1 # a is (un)dead before this expression
b = a + 2 # a is live in this expression and above
print(b) # a is dead here
It’s also useful for something I mentioned obliquely in previous Scrapscript posts for handle elision. For some context, the garbage collector deletes unused memory. In order to find out what is used, it iterates over the root set—things known to the runtime absolutely be live—and anything transitively reachable from there. Anything not reachable is garbage.
In an interpreter, this is “easy”—scan the stack data structure you’re using
for intermediate results. In compiled code, we have no such stack (…ish), so
we have to instead tell the garbage collector which pointers on the native
stack to keep around. We do this with some macros called GC_PROTECT
and
OBJECT_HANDLE
which push pointers to those pointers (struct object**
) to
some global data structure accessible by the garbage collector.
struct object* f_0(struct object* this, struct object* x) {
HANDLES();
GC_PROTECT(x);
if (!(is_num(x))) {
fprintf(stderr, "assertion is_num(x) failed\n");
abort();
}
OBJECT_HANDLE(tmp_1, num_add(x, _mksmallint(1)));
// ...
}
This is great, but unfortunately three bad things happen:
Right now we conservatively use handles for every intermediate object. However, using a liveness analysis, we can determine which objects don’t live across allocations and therefore don’t need to be stored in handles. Neat! We can also use some additional type/range information to track which IR nodes correspond to small integers, small strings, etc and aren’t heap allocated at all. These objects do not ever need handles.
There are other similarly interesting abstract interpretation based analyses we could do to generate better code, such as allocation removal/escape analysis, load/store forwarding, and dead store removal.2
There are also non-abstract-interpretation based analyses that I should add like common subexpression elimination (CSE).
Control flow graph based IRs give you basic blocks and linearizations of
instructions and stuff like that. CFGs are useful on their own but require
constructing additional metadata on the side to track what “version of a name”
something actually points to in an optimization. For example, in the following
snippet, which definition of x
is being used?
# Not SSA
x = 3
x = 4
print(x + 5)
In this case, it might be clear from a top-to-bottom reading that the second
definition, x = 4
, is in use at print(x + 5)
. But for an analysis, that
requires keeping a bunch of state around on the side.
Why not instead unroll that information directly into your IR? If you can encode this metadata into the name, your analysis gets much easier. Consider:
# SSA
x0 = 3
x1 = 4
print(x1 + 5)
Now that we’ve added the SSA invariant of one definition per name, we know that
we’re definitely using x1
. No context needed.
Once you start thinking like this you eventually arrive at the same place as Scott Ananian and co: Static Single Information form.
I’m going to handwave a bit because I don’t think a lot of compilers need “full SSI”, but the essence is that you can encode other properties of the SSA variables in new names.
In the following example pseudocode, we can learn something about the x
variable in the “then” branch of the if-statement: it’s 0.
def foo(x):
if x == 0:
return x + 5
# ...
We could keep track of this in some type environment in a static analysis, but that information would be specific to a context in the CFG: the particular basic block (and blocks dominated by it).
Or, we could add a refinement instruction. Maybe something like this:
def foo(x):
if x == 0:
x1 = refine(x, 0)
return x1 + 5
# ...
Once we replace uses of x
dominated by that basic block, we can run a type
inference and cache the fact that x1
is always 0. Then it can be looked up
without context.
There are probably other similar ideas to SSA and SSI here that I am missing, so please write in.
At this point we can construct and optimize an IR but it’s not executable yet. For that, we have to leave SSA. I chose to compile to C because we have an existing functional runtime written in C.
The code generator is almost the same as in the baseline compiler except that we don’t need to think about inventing temporary variables or doing ad-hoc optimizations anymore—all of that is done for us in either the IR builder or the optimizer.
The code generator mostly consists of iterating over functions, which iterate over basic blocks, which iterate over instructions:
class IRFunction:
# ...
def to_c(self) -> str:
with io.StringIO() as f:
f.write(f"{self.c_decl()} {{\n")
f.write("HANDLES();\n")
for param in self.params:
f.write(f"GC_PROTECT({param});\n")
gvn = InstrId()
for block in self.cfg.rpo():
self._block_to_c(f, block, gvn)
f.write("}")
return f.getvalue()
return
def _block_to_c(self, f: io.StringIO, block: Block, gvn: InstrId) -> None:
f.write(f"{block.name()}:;\n")
for instr in block.instrs:
instr = instr.find()
if isinstance(instr, Nop):
continue
f.write(self._instr_to_c(instr, gvn))
def _instr_to_c(self, instr: Instr, gvn: InstrId) -> str:
def _handle(rhs: str) -> str:
return f"OBJECT_HANDLE({gvn.name(instr)}, {rhs});\n"
if isinstance(instr, IntAdd):
operands = ", ".join(gvn.name(op) for op in instr.operands)
return _handle(f"num_add({operands})")
# ...
Oh. Yeah. This gvn
/InstrId
thing. For the IR-to-string printer and the
IR-to-C printer both, we have to come up with names to point to each
instruction. To do this, we have an identity hash table that assigns string
names to instructions in the order that it sees them.
This way, we don’t have to store any additional IDs or names on the Instr
class and the numbers don’t grow very big as we continue to optimize the
function (and allocate new instructions).
One nice thing about this code generator is that it’s out-of-SSA on easy mode:
the only join points we have are at function calls and returns (so far), so we
don’t (yet) have any need to deconstruct Phi
instructions into parallel
copies.
When the optimizer gets more advanced, we’ll probably need to invent a Phi
instruction and deal with that.
That’s all for today, folks. I’ll mostly be continuing to write some optimization passes for the IR. I’m using this as a playground to learn about compiler things, which is neat, but it’s a little disappointing that there are not (yet?) large and slow programs in Scrapscript that I can use to benchmark the compiler.
If you’re sitting there saying to yourself, “Golly gee, that looks a lot like the Cinder IR,” a) I am impressed you put two and two together based on vibes and b) yeah, I spent a lot of time reading that particular text representation so I am trying to mimic the style here. The actual design decisions in the IR data structures are pretty different. ↩
In a class I took with Olin Shivers, he called this process “automated theorem finding” and I like that perspective. ↩