A compiler IR for Scrapscript

February 4, 2025

I wrote previously about different slices of life in the various implementations of the Scrapscript programming language. Until two weeks ago, the most interesting implementation was a so-called “baseline compiler” directly from the AST to C. Now we have an intermediate representation—an IR.

Why add an IR?

The AST is fine. It’s a very abstract representation of the progam text and it is easy to generate directly from the source text. None of the names are resolved (there are a bunch of strings everywhere), there is a lot left implicit in the representation, and evaluating it requires having a stack or recursion to keep temporary data around. But hey, it works and it’s reasonably small.

It’s possible to incrementally rewrite the AST by adding new types of nodes or folding existing ones together. For example, it’s possible to find patterns similar to Binop(BinopKind.ADD, Int(2), Int(3)) and rewrite them directly into Int(5). This kind of optimization either doesn’t go very far or also requires keeping significant amounts of context on the side.

It’s also possible to try and retrofit some of this constant folding into other AST passes that run anyway. You can kind of see this in the post about the compiler tricks where we opportunistically fold constant data into its own section so it does not need to be heap allocated.

Instead, I chose to create a new program representation designed to be optimized and rewritten: an SSA IR. True to the spirit of the project, it’s a home-grown one, not some existing library like LLVM.

This is partly because it keeps things simple—the IR, some optimization passes, and the compiler from the IR to C fit neatly into ~1000 lines of code—and partly because projects like LLVM are not designed for languages like this. They can probably do some impressive things, but they would do better if purpose-built domain-specific analyses unwrapped some of the functional abstractions first.

Still, you might be wondering why we then compile our IR to C instead of LLVM. Two reasons: 1) we already have an existing C runtime and 2) I really, really, really do not want to be tied to a particular LLVM version and its quirks. One of the most frustrating things when hacking on someone else’s language project is watching it download a huge LLVM bundle and then compile it. It does a heck of a lot. It’s very impressive. It’s many person-lifetimes of engineering marvels. But it’s too big for this dinky little tugboat of a project.

(What about QBE? Well, if I can figure out how to integrate our GC and its notion of handles into a QBE IR, great. But for now it’s C only.)

What does the IR look like?

A small example

Let’s start small and make a list. The Scrapscript expression [1, 2], when evaluated, creates a list containing two elements. The language doesn’t expressly guarantee how lists are stored in memory but the baseline compiler and C runtime store them as cons cells—as linked lists. The IR to build up this linked list looks like this¹:

fn0 {
  bb0 {
    v0 = Const<[]>
    v1 = Const<2>
    v2 = ListCons v1, v0
    v3 = Const<1>
    v4 = ListCons v3, v2
    Return v4
  }
}

This is a text representation generated by a to_string function. The actual IR involves classes and pointers and data structures, but I sometimes find it helpful to think in “text space”.

The IR contains one function, fn0, which has one basic block, bb0, which has a number of instructions. Each instruction is given a name which also corresponds to the value it produces.

The first two are the Const instruction and they carry constant data known to the compiler at compile-time. These are represented by “immediates” enclosed in <angle brackets>.

The third is the ListCons instruction. This corresponds to a heap allocation: building a new linked list node at run-time. It takes two operands, v1 and v0, which represent the head and tail of the new linked list node, respectively.

The last one is the Return instruction, which exits the current function with the given operand. Return is a control flow instruction. Control flow is only allowed to happen at the end of a basic block, which makes all the instructions inside a block nicely control flow free.

A bigger example

Let’s do a slightly more complicated example that involves pattern matching. We’ll take a look at the IR corresponding to the Scrapscript program | 1 -> 2 + 3, which is a pattern matching function. If the argument to the function is equal to 1, it returns 5. Otherwise, it aborts.

We’ll first look at the function that contains—that creates—the pattern matching function. The “main function”, if you will.

fn0 {
  bb0 {
    v0 = NewClosure<fn1>
    Return v0
  }
}

This function allocates a closure object on the heap corresponding to the function fn1. we’ll come back to the details later, but the important thing to know is that unlike in the source program and AST, fn1 is not a name to be looked up later as a string. Instead, it’s a direct pointer to a known IR function. And that function’s IR looks like this:

fn1 {
  bb0 {
    v0 = Param<0; $clo>
    v1 = Param<1; arg_0>
    Jump bb2
  }
  bb2 {
    v2 = IsIntEqualWord v1, 1
    CondBranch v2, bb3, bb1
  }
  bb1 {
    MatchFail
  }
  bb3 {
    v3 = Const<2>
    v4 = Const<3>
    v5 = IntAdd v3, v4
    Return v5
  }
}

Whoa, multiple blocks! Let’s go through them one by one.

The first block contains instructions that exist to give names to function parameters. It might be a little confusing that we have not one but two parameters for a function that has no parameters visible in the program text. Here’s why we need each of them:

We need the second one, arg_0, because it exists implicitly in the program. It’s the unnamed thing being matched—in this case, checked to see if the value is 1.
We have (but don’t apparently need) the first one because in its un-optimized state, every function has a closure parameter. This is part of the “closure conversion” process whereby the abstract world-of-ideas notion of “function code + environment” gets turned (later) into “C code + tack-on struct”. It makes functions callable with a very generic (slow) calling convention by default, but we might be able to optimize functions into a faster calling convention later in the project.

Then we have a new control instruction, Jump, that represents continuing the flow of execution at another block. In this case, bb2.

bb2 checks if v1 is equal to the integer 1 and returns a native boolean (not a Scrapscript object). This gets fed into our third control instruction so far, CondBranch.

CondBranch takes a condition and two basic blocks. If the condition is true, it passes control the first one. If it’s false, to the second. This gives us the ability to support conditionals in the source language.

In the true case, if the argument is 1, we add up 2 and 3 using the IntAdd instruction. Otherwise, in bb1, we encounter our last control instruction: MatchFail. This aborts the program.

Cleaning up

You might notice that this IR snippet leaves something to be desired. For one, we have an unconditional jump from bb0 to bb2 and bb2 has no other incoming control flow from any other blocks. It seems like we should be able to smush the contents of bb2 into bb0 instead.

You might also notice that the closure parameter is completely unused. We could get rid of it because nothing uses v0.

Last, you might notice that, similar to the first example with the list, we have two constant operands to an instruction that seems like it should be able to be computed at compile-time.

Some good news: all of these can be dealt with using very generic compiler passes over the IR.

Some optimization passes

Let’s tackle each of the aforementioned concerns one by one.

To merge basic blocks, I adapted a compiler pass called CleanCFG from the Cinder project, which I previously adapted from the compiler pass CleanCFG from the HHVM project.

@dataclasses.dataclass
class CleanCFG:
    fn: IRFunction

    def run(self) -> None:
        changed = True
        while changed:
            changed = False
            for block in self.fn.cfg.rpo():
                if not block.instrs:
                    # Ignore transient empty blocks.
                    continue
                # Keep working on the current block until no further changes are made.
                while self.absorb_dst_block(block):
                    pass
            changed = self.remove_unreachable_blocks()

    # ...

(where rpo is a reverse post-order traversal of the CFG)

To remove dead code, we have a reasonably straightforward implementation of dead code elimination (DCE). DCE is kind of like a garbage collector for instructions: it finds the root set (the required instructions, kind of like the backbone of the program), marks them, and then marks the instructions they indirectly/transitively depend on. Then it removes (sweeps) the rest.

Because each instruction embeds a union-find data structure in it, I “soft-delete” instructions by marking them equivalent to a no-op instruction. I could have also gone through and deleted them, but deleting instructions from the middle of list while iterating over it felt finicky and bothersome. Nop instructions are not shown in the to_string representation of the IR and don’t compile to any C code.

@dataclasses.dataclass
class DeadCodeElimination:
    fn: IRFunction

    def is_critical(self, instr: Instr) -> bool:
        if isinstance(instr, Const):
            return False
        if isinstance(instr, IntAdd):
            return False
        # TODO(max): Add more. Track heap effects?
        return True

    def run(self) -> None:
        worklist: list[Instr] = []
        marked: set[Instr] = set()
        blocks = self.fn.cfg.rpo()
        # Mark
        for block in blocks:
            for instr in block.instrs:
                instr = instr.find()
                if self.is_critical(instr):
                    marked.add(instr)
                    worklist.append(instr)
        while worklist:
            instr = worklist.pop(0).find()
            if isinstance(instr, HasOperands):
                for op in instr.operands:
                    op = op.find()
                    if op not in marked:
                        marked.add(op)
                        worklist.append(op)
        # Sweep
        for block in blocks:
            for instr in block.instrs:
                instr = instr.find()
                if instr not in marked:
                    instr.make_equal_to(Nop())

The last problem, the lack of constant folding, could be solved by a pass that looks something like this:

def simplify(fn):
    changed = True
    while changed:
        changed = False
        for block in fn.blocks:
            for instr in block.instrs:
                changed |= try_simplify(instr)

This would probably work, but it would do many passes over the entire control flow graph for each function. For big functions with many small reduction steps, this could take a long time.

Additionally, it doesn’t include any type inference. Each instruction is limited to the type information visible from its local context. For small optimizations like our 2+3 or list cons, this might be fine, but there are situations where local type information would not be enough to simplify an instruction.

Instead, we crack open the Constant propagation with conditional branches paper (known informally as SCCP) which takes a smarter worklist approach.

Two worklists, actually. It uses one to hold basic blocks and one to hold instructions. Starting at the entrypoint, it constructs a set of reachable blocks and flows types (members of the constant propagation lattice) through the instructions.

The exciting bit is that it uses the types in discovering the control flow: if a conditional branch is known to branch a certain way at analysis time, the other branch is not analyzed at all and its computation does not pollute the analysis results.

Here is a snippet of SCCP, excluding the type lattice. The type lattice is an interesting concept that I am implementing poorly and I would like to do a better job (faster, smaller code, …) later.

@dataclasses.dataclass
class SCCP:
    fn: IRFunction
    instr_type: dict[Instr, ConstantLattice]
    block_executable: set[Block]
    instr_uses: dict[Instr, set[Instr]]

    def type_of(self, instr: Instr) -> ConstantLattice:
        ...

    def run(self) -> dict[Instr, ConstantLattice]:
        block_worklist: list[Block] = [self.fn.cfg.entry]
        instr_worklist: list[Instr] = []

        while block_worklist or instr_worklist:
            if instr_worklist and (instr := instr_worklist.pop(0)):
                instr = instr.find()
                if isinstance(instr, HasOperands):
                    for operand in instr.operands:
                        if operand not in self.instr_uses:
                            self.instr_uses[operand] = set()
                        self.instr_uses[operand].add(instr)
                new_type: ConstantLattice = CBottom()
                if isinstance(instr, Const):
                    value = instr.value
                    if isinstance(value, Int):
                        new_type = CInt(value.value)
                    if isinstance(value, List):
                        new_type = CList()
                # ...
                elif isinstance(instr, CondBranch):
                    match self.type_of(instr.operands[0]):
                        case CCBool(True):
                            instr.make_equal_to(Jump(instr.conseq))
                            block_worklist.append(instr.conseq)
                        case CCBool(False):
                            instr.make_equal_to(Jump(instr.alt))
                            block_worklist.append(instr.alt)
                        case CBottom():
                            pass
                        case _:
                            block_worklist.append(instr.conseq)
                            block_worklist.append(instr.alt)
                elif isinstance(instr, IntAdd):
                    match (self.type_of(instr.operands[0]), self.type_of(instr.operands[1])):
                        case (CInt(int(l)), CInt(int(r))):
                            new_type = CInt(l + r)
                            instr.make_equal_to(Const(Int(l+r)))
                        case (CInt(_), CInt(_)):
                            new_type = CInt()
                # ...
                old_type = self.type_of(instr)
                if union(old_type, new_type) != old_type:
                    self.instr_type[instr] = new_type
                    for use in self.instr_uses.get(instr, set()):
                        instr_worklist.append(use)
            if block_worklist and (block := block_worklist.pop(0)):
                if block not in self.block_executable:
                    self.block_executable.add(block)
                    instr_worklist.extend(block.instrs)

        return self.instr_type

One thing to note is that SCCP requires either pre-computing or incrementally discovering use edges in your IR. That is, when it updates the type of instruction A, it needs to know what other instructions use A so that they can be re-queued for analysis.

Also, please ignore the using-lists-as-queues thing which is terribly slow. You should instead use a proper queue or deque structure.

SCCP is pretty powerful but at least as shown in the paper, it is a little bit limited. The paper constrains the lattice to ensure that the number of passes over the IR is bounded by a small constant factor (the height of the lattice), but I think more interesting things happen if your lattice is taller…

What if you use your full type lattice instead of just a constant propagation lattice? It should still have a finite height to guarantee that the analysis terminates, but you could get more information than just “constant or not”. In the example above, you can see a little snippet of that: CInt indicates that an object is an integer, whereas CInt(...) indicates that it is an integer with a specific value. You can maybe imagine extending this to work on sets of functions, at which point this begins to look a little bit like CFA/points-to analysis.

I may have some writing coming out in a couple of weeks about interprocedural SCCP.

In any case, here is the result of running these passes over the CFG. You can see that a lot of the things we thought were sub-optimal have been removed:

fn1 {
  bb0 {
    v0 = Param<1; arg_0>
    v1 = IsIntEqualWord v0, 1
    CondBranch v1, bb3, bb1
  }
  bb1 {
    MatchFail
  }
  bb3 {
    v2 = Const<5>
    Return v2
  }
}

More advanced optimizations

Types are great but they aren’t everything. Sometimes, the analysis runs on a totally different lattice that describes a different property of how the data is used. Sometimes, the analysis even runs backwards!

A great example of this is liveness analysis, which is used in, among other things, register allocation. Liveness starts from the end(s) of a function and work their way up. Variables start dead, become live at their first use, and then become… undead? I guess? at their definition:

a = 1  # a is (un)dead before this expression
b = a + 2  # a is live in this expression and above
print(b)  # a is dead here

It’s also useful for something I mentioned obliquely in previous Scrapscript posts for handle elision. For some context, the garbage collector deletes unused memory. In order to find out what is used, it iterates over the root set—things known to the runtime absolutely be live—and anything transitively reachable from there. Anything not reachable is garbage.

In an interpreter, this is “easy”—scan the stack data structure you’re using for intermediate results. In compiled code, we have no such stack (…ish), so we have to instead tell the garbage collector which pointers on the native stack to keep around. We do this with some macros called GC_PROTECT and OBJECT_HANDLE which push pointers to those pointers (struct object**) to some global data structure accessible by the garbage collector (often called a shadow stack).

struct object* f_0(struct object* this, struct object* x) {
  HANDLES();
  GC_PROTECT(x);
  if (!(is_num(x))) {
    fprintf(stderr, "assertion is_num(x) failed\n");
    abort();
  }
  OBJECT_HANDLE(tmp_1, num_add(x, _mksmallint(1)));
  // ...
}

This is great, but unfortunately three bad things happen:

These objects become kind of opaque to the C compiler’s optimizer because from its function-at-a-time perspective, “anything” could happen to those pointers at function call boundaries
This not only limits Clang’s (for example) optimization potential but also clutters the generated code with a bunch of loads and stores
And it makes garbage collection slower because there are more roots to scan

Right now we conservatively use handles for every intermediate object. However, using a liveness analysis, we can determine which objects don’t live across allocations and therefore don’t need to be stored in handles. Neat! We can also use some additional type/range information to track which IR nodes correspond to small integers, small strings, etc and aren’t heap allocated at all. These objects do not ever need handles.

There are other similarly interesting abstract interpretation based analyses we could do to generate better code, such as allocation removal/escape analysis, load/store forwarding, and dead store removal.²

There are also non-abstract-interpretation based analyses that I should add like common subexpression elimination (CSE).

Design decisions: what’s up with SSA?

Control flow graph based IRs give you basic blocks and linearizations of instructions and stuff like that. CFGs are useful on their own but require constructing additional metadata on the side to track what “version of a name” something actually points to in an optimization. For example, in the following snippet, which definition of x is being used?

# Not SSA
x = 3
x = 4
print(x + 5)

In this case, it might be clear from a top-to-bottom reading that the second definition, x = 4, is in use at print(x + 5). But for an analysis, that requires keeping a bunch of state around on the side.

Why not instead unroll that information directly into your IR? If you can encode this metadata into the name, your analysis gets much easier. Consider:

# SSA
x0 = 3
x1 = 4
print(x1 + 5)

Now that we’ve added the SSA invariant of one definition per name, we know that we’re definitely using x1. No context needed.

Once you start thinking like this you eventually arrive at the same place as Scott Ananian and co: Static Single Information form.

SSI

I’m going to handwave a bit because I don’t think a lot of compilers need “full SSI”, but the essence is that you can encode other properties of the SSA variables in new names (this is apparently called “sparseness” in the literature).

In the following example pseudocode, we can learn something about the x variable in the “then” branch of the if-statement: it’s 0.

def foo(x):
    if x == 0:
        return x + 5
  # ...

We could keep track of this in some type environment in a static analysis, but that information would be specific to a context in the CFG: the particular basic block (and blocks dominated by it).

Or, we could add a refinement instruction. Maybe something like this:

def foo(x):
    if x == 0:
        x1 = refine(x, 0)
        return x1 + 5
  # ...

Once we replace uses of x dominated by that basic block, we can run a type inference and cache the fact that x1 is always 0. Then it can be looked up without context.

There are probably other similar ideas to SSA and SSI here that I am missing, so please write in.

Compiling to C

At this point we can construct and optimize an IR but it’s not executable yet. For that, we have to leave SSA. I chose to compile to C because we have an existing functional runtime written in C.

The code generator is almost the same as in the baseline compiler except that we don’t need to think about inventing temporary variables or doing ad-hoc optimizations anymore—all of that is done for us in either the IR builder or the optimizer.

The code generator mostly consists of iterating over functions, which iterate over basic blocks, which iterate over instructions:

class IRFunction:
    # ...
    def to_c(self) -> str:
        with io.StringIO() as f:
            f.write(f"{self.c_decl()} {{\n")
            f.write("HANDLES();\n")
            for param in self.params:
                f.write(f"GC_PROTECT({param});\n")
            gvn = InstrId()
            for block in self.cfg.rpo():
                self._block_to_c(f, block, gvn)
            f.write("}")
            return f.getvalue()
        return

    def _block_to_c(self, f: io.StringIO, block: Block, gvn: InstrId) -> None:
        f.write(f"{block.name()}:;\n")
        for instr in block.instrs:
            instr = instr.find()
            if isinstance(instr, Nop):
                continue
            f.write(self._instr_to_c(instr, gvn))

    def _instr_to_c(self, instr: Instr, gvn: InstrId) -> str:
        def _handle(rhs: str) -> str:
            return f"OBJECT_HANDLE({gvn.name(instr)}, {rhs});\n"

        if isinstance(instr, IntAdd):
            operands = ", ".join(gvn.name(op) for op in instr.operands)
            return _handle(f"num_add({operands})")
        # ...

Oh. Yeah. This gvn/InstrId thing. For the IR-to-string printer and the IR-to-C printer both, we have to come up with names to point to each instruction. To do this, we have an identity hash table that assigns string names to instructions in the order that it sees them.

This way, we don’t have to store any additional IDs or names on the Instr class and the numbers don’t grow very big as we continue to optimize the function (and allocate new instructions).

One nice thing about this code generator is that it’s out-of-SSA on easy mode: the only join points we have are at function calls and returns (so far), so we don’t (yet) have any need to deconstruct Phi instructions into parallel copies. (Speaking of join points, I’d love to implement proper tail call optimization soon because it’s the only way to do loops. That will probably require some Shenanigans.)

When the optimizer gets more advanced, we’ll probably need to invent a Phi instruction and deal with that.

Wrapping up

That’s all for today, folks. The code is available in this branch, which I will merge when it catches up and has all the weird ad-hoc optimizations that the baseline compiler does. I’ll mostly be continuing to write some optimization passes for the IR. I’m using this as a playground to learn about compiler things, which is neat, but it’s a little disappointing that there are not (yet?) large and slow programs in Scrapscript that I can use to benchmark the compiler.

Thank you to CF Bolz-Tereick for excellent writing feedback.

If you’re sitting there saying to yourself, “Golly gee, that looks a lot like the Cinder IR,” a) I am impressed you put two and two together based on vibes and b) yeah, I spent a lot of time reading that particular text representation so I am trying to mimic the style here. The actual design decisions in the IR data structures are pretty different. ↩
In a class I took with Olin Shivers, he called this process “automated theorem finding” and I like that perspective. ↩