A quick look at destination-driven code generation

November 9, 2023

Picture this: you’re sitting there, writing a compiler, when all of a sudden you have to generate assembly. You have some intermediate representation (IR) but now you have to turn virtual registers into machine registers. This is called register allocation.

Register allocation is tricky. It’s also slow. Even very fast register allocators like linear scan can dominate compile time. So let’s skip it. Let’s write a silly compiler in Python and see if we can improve the generated code without going full regalloc.

We will use a small math language that supports constant integers, addition, and multiplication:

# Ignore the @ir decorator. It's not important except for some dataclasses
# shenanigans.
@ir
class Instr:
    # ...
    pass


@ir
class Const(Instr):
    value: int

    def __repr__(self):
        return f"{self.var()} = {self.value}"


@ir
class Binary(Instr):
    left: Instr
    right: Instr

    def operands(self):
        return (self.left, self.right)


@ir
class Add(Binary):
    pass


@ir
class Mul(Binary):
    pass

We’re going to pretend that we can’t constant-fold these expressions to one number because that’s not the point of this post.

The IR’s text form will look like this:

v0 = 2
v1 = 3
v2 = Add v0, v1

Nothing too fancy. No control-flow. No function calls. No data structures.

We can write a recursive tree-walking evaluator for this IR in a couple lines of code:

def eval_exp(exp):
    if isinstance(exp, Const):
        return exp.value
    elif isinstance(exp, Add):
        left = eval_exp(exp.left)
        right = eval_exp(exp.right)
        return left + right
    elif isinstance(exp, Mul):
        left = eval_exp(exp.left)
        right = eval_exp(exp.right)
        return left * right
    else:
        raise NotImplementedError(f"unexpected exp {exp}")
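
For example, the three-instruction IR listing from earlier is the nested expression Add(Const(2), Const(3)), and evaluating it gives 5:

exp = Add(Const(2), Const(3))  # v0 = 2; v1 = 3; v2 = Add v0, v1
print(eval_exp(exp))  # 5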

And the corresponding tree-walking compiler to x86-64 looks similar enough if you squint:

def naive_compile(exp):
    tmp = RCX
    if isinstance(exp, Const):
        return [X86.Mov(RAX, Imm(exp.value))]
    elif isinstance(exp, Add):
        right_code = naive_compile(exp.right)
        left_code = naive_compile(exp.left)
        return [
            *right_code,
            X86.Push(RAX),
            *left_code,
            X86.Pop(tmp),
            X86.Add(RAX, tmp),
        ]
    elif isinstance(exp, Mul):
        right_code = naive_compile(exp.right)
        left_code = naive_compile(exp.left)
        return [
            *right_code,
            X86.Push(RAX),
            *left_code,
            X86.Pop(tmp),
            X86.Mul(RAX, tmp),
        ]
    else:
        raise NotImplementedError(f"unexpected exp {exp}")

(I wrote some datatypes for a small set of x86-64 instructions and operands so we can construct them in our compiler without slinging a bunch of strings.)

In each case, our goal is to get the result into the register RAX. In the constant case, we emit one instruction to move the value—the immediate—into RAX.

In the recursive binary cases, we generate code for the left and right side of the expression. We put the compiled code for the right hand side first, then for the left hand side.

Since both cases will try to put the result in RAX (and intermediate calls to the compiler could trash any register we haven’t specifically set aside), we have to stash the result from the right hand side in a temporary location: the stack. Then, when it comes time to do the operation, we pop it off the stack into a temporary register (RCX).

This works because we maintain two invariants:

  1. the result of compiling any expression always ends up in RAX, and
  2. compiling an expression leaves the stack balanced: every push it emits is matched by a later pop, so a value we stash before a recursive call is still on top of the stack when we need it.

So what does the generated code look like? I will answer in the form of unit tests.

Let’s look at Const first:

class NaiveCompilerTests(unittest.TestCase):
    def _alloc(self, exp):
        x86 = naive_compile(exp)
        return [str(op) for op in x86]

    def test_const(self):
        exp = Const(2)
        self.assertEqual(
            self._alloc(exp),
            ["X86.Mov(dst=rax, src=Imm(2))"],
        )

Right. Nice. Just about what we expected.

And now the binary cases:

class NaiveCompilerTests(unittest.TestCase):
    # ...
    def test_add(self):
        exp = Add(Const(2), Const(3))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(2))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
            ],
        )

    def test_mul(self):
        exp = Mul(Const(2), Const(3))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(2))",
                "X86.Pop(dst=rcx)",
                "X86.Mul(dst=rax, src=rcx)",
            ],
        )

Hmm. You know, now that I look at it, maybe moving everything to RAX was not the right answer. It kind of looks like there is a lot of data movement going on. I guess it can’t be helped.

And some more tests with deeper nesting, for good measure:

class NaiveCompilerTests(unittest.TestCase):
    # ...
    def test_mul_add(self):
        exp = Mul(
            Add(Const(1), Const(2)),
            Add(Const(3), Const(4)),
        )
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=rax, src=Imm(4))",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(2))",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(1))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Pop(dst=rcx)",
                "X86.Mul(dst=rax, src=rcx)",
            ],
        )

    def test_add_deep(self):
        exp = Add(Const(2), Add(Const(3), Add(Const(4), Const(5))))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=rax, src=Imm(5))",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(4))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Push(src=rax)",
                "X86.Mov(dst=rax, src=Imm(2))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
            ],
        )

Actually, yikes, I take it back. We have to do something about this. The code exploded. So what’s to be done?

We could write a full register allocator like linear scan or graph coloring [1], but that sounds like a lot of work. Also, sometimes (like in a baseline compiler) your code generation needs to be really fast and those don’t fit the performance budget.

Thankfully, Dybvig and the gang are back with a cool idea called destination-driven code generation (DDCG).

Destination-driven code generation

The paper (PDF) is approachable enough, but if you want an even more approachable introduction, there’s a great talk (PDF) by Kevin Millikin explaining how V8 implemented it in one of their first compilers. I implemented a small programming language in C++ using this technique some time ago (and even wrote part of a blog post) but never got around to finishing it.

The key insight is that in a recursive code generator, the caller of the recursive compilation step knows where it wants the result of the callee to go—so we should not be copying everything around through some result register like RAX. We should instead pass the destination we want as a parameter.

For this post [2], there are two kinds of destinations: the stack and the accumulator (RAX).

class Dest:
    STACK = 0
    ACCUM = 1

We can take these destinations and thread them through the compiler, starting with the top-level function requesting its result end up in RAX.

The destination-driven compiler looks suspiciously similar to the normal tree-walking one:

def ddcg_compile(code):
    return _ddcg_compile(code, Dest.ACCUM)

def _ddcg_compile(exp, dst):
    tmp = RCX
    if isinstance(exp, Const):
        return _plug_imm(dst, exp.value)
    elif isinstance(exp, Add):
        return [
            *_ddcg_compile(exp.left, Dest.STACK),
            *_ddcg_compile(exp.right, Dest.ACCUM),
            X86.Pop(tmp),
            X86.Add(RAX, tmp),
            *_plug_reg(dst, RAX),
        ]
    elif isinstance(exp, Mul):
        return [
            *_ddcg_compile(exp.left, Dest.STACK),
            *_ddcg_compile(exp.right, Dest.ACCUM),
            X86.Pop(tmp),
            X86.Mul(RAX, tmp),
            *_plug_reg(dst, RAX),
        ]
    else:
        raise NotImplementedError(exp)

It still compiles the left and right operands recursively. It still pushes (by passing the stack destination) and pops. But it has these funny-looking plug functions.

The plug functions are responsible for doing most of the data movement. They can be pushes or movs or—sometimes—nothing at all, if the register is already in the right place.

def _plug_imm(dst, value):
    if dst == Dest.STACK:
        return [X86.Push(Imm(value))]
    elif dst == Dest.ACCUM:
        return [X86.Mov(RAX, Imm(value))]
    else:
        raise NotImplementedError(dst)


def _plug_reg(dst, reg):
    if dst == Dest.STACK:
        return [X86.Push(reg)]
    elif dst == Dest.ACCUM:
        if reg == RAX:
            return []
        return [X86.Mov(RAX, reg)]
    else:
        raise NotImplementedError(dst)

Not so bad. I wonder why the accumulator was singled out in the original paper instead of allowing for arbitrary destination registers. Perhaps something to look at in the future.

So what does the generated code look like? Again, some tests:

class DDCGTests(unittest.TestCase):
    def _alloc(self, exp):
        x86 = ddcg_compile(exp)
        return [str(op) for op in x86]

    def test_const(self):
        exp = Const(2)
        self.assertEqual(
            self._alloc(exp),
            ["X86.Mov(dst=rax, src=Imm(2))"],
        )

    def test_add(self):
        exp = Add(Const(2), Const(3))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Push(src=Imm(2))",
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
            ],
        )

    def test_mul(self):
        exp = Mul(Const(2), Const(3))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Push(src=Imm(2))",
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Pop(dst=rcx)",
                "X86.Mul(dst=rax, src=rcx)",
            ],
        )

    def test_mul_add(self):
        exp = Mul(
            Add(Const(1), Const(2)),
            Add(Const(3), Const(4)),
        )
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Push(src=Imm(1))",
                "X86.Mov(dst=rax, src=Imm(2))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Push(src=rax)",
                "X86.Push(src=Imm(3))",
                "X86.Mov(dst=rax, src=Imm(4))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Pop(dst=rcx)",
                "X86.Mul(dst=rax, src=rcx)",
            ],
        )

    def test_add_deep(self):
        exp = Add(Const(2), Add(Const(3), Add(Const(4), Const(5))))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Push(src=Imm(2))",
                "X86.Push(src=Imm(3))",
                "X86.Push(src=Imm(4))",
                "X86.Mov(dst=rax, src=Imm(5))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
            ],
        )

If you look at the tests side-by-side with the naive compiler, you should see a major difference: there are way fewer movs because not everything has to go through RAX all the time.

But there are still some pushes. Can we do better? Yes, but after a brief detour.

Adding control destinations

The original paper has a notion of control destinations which helps generate better code for conditionals and loops. Since this post is really only a tiny demo and does not have control flow, it does not make sense to implement here. But I recommend checking out the original paper, getting confused, then checking the V8 presentation, getting confused, then checking out my implementation.

Back to our regularly scheduled programming.

Adding a virtual stack

I saw this cool JIT challenge by Takashi Kokubun to do simple native code generation for Ruby bytecode. Buried in it is a very cool trick: use a “stack” of registers instead of your actual stack. (EDIT: I also apparently saw this in JIT Compilers 102 by Charlie Cummings but forgot!) That is, instead of generating push and pop instructions, move to registers. I translated some of his sample Ruby code to Python:

STACK = ["r8", "r9"]
stack_size = 0

def push(src):
    # Instead of a real push, move the value into the next free stack register.
    global stack_size
    instr = X86.Mov(STACK[stack_size], src)
    stack_size += 1
    return instr

def pop(dst):
    # Instead of a real pop, move the value back out of the top stack register.
    global stack_size
    stack_size -= 1
    return X86.Mov(dst, STACK[stack_size])

His code assumes (because it is a tiny demo) that the temporaries all fit in our virtual stack. They may not always fit, so it is possible to extend push and pop to fall back to using the system stack.

Adding this strategy to the existing destination-driven code generator requires some state—the stack_size—so I turned the compiler into a class. The code it generates looks similar enough but has fewer pushes and pops. (I sketch what such a class might look like after the tests below.)

class DDCGStackTests(unittest.TestCase):
    def _alloc(self, exp):
        gen = DDCGStack()
        gen.compile(exp)
        return [str(op) for op in gen.code]

    def test_const(self):
        exp = Const(2)
        self.assertEqual(
            self._alloc(exp),
            ["X86.Mov(dst=rax, src=Imm(2))"],
        )

    def test_add(self):
        exp = Add(Const(2), Const(3))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=r8, src=Imm(2))",
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Add(dst=rax, src=r8)",
            ],
        )

    def test_mul(self):
        exp = Mul(Const(2), Const(3))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=r8, src=Imm(2))",
                "X86.Mov(dst=rax, src=Imm(3))",
                "X86.Mul(dst=rax, src=r8)",
            ],
        )

    def test_mul_add(self):
        exp = Mul(
            Add(Const(1), Const(2)),
            Add(Const(3), Const(4)),
        )
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=r8, src=Imm(1))",
                "X86.Mov(dst=rax, src=Imm(2))",
                "X86.Add(dst=rax, src=r8)",
                "X86.Mov(dst=r8, src=rax)",
                "X86.Mov(dst=r9, src=Imm(3))",
                "X86.Mov(dst=rax, src=Imm(4))",
                "X86.Add(dst=rax, src=r9)",
                "X86.Mul(dst=rax, src=r8)",
            ],
        )

    def test_add_deep(self):
        # This tests pushing and popping beyond the limits of our virtual
        # stack.
        assert len(STACK_REGS) == 2
        exp = Add(Const(2), Add(Const(3), Add(Const(4), Const(5))))
        self.assertEqual(
            self._alloc(exp),
            [
                "X86.Mov(dst=r8, src=Imm(2))",
                "X86.Mov(dst=r9, src=Imm(3))",
                "X86.Push(src=Imm(4))",
                "X86.Mov(dst=rax, src=Imm(5))",
                "X86.Pop(dst=rcx)",
                "X86.Add(dst=rax, src=rcx)",
                "X86.Add(dst=rax, src=r9)",
                "X86.Add(dst=rax, src=r8)",
            ],
        )

(In fact, the tests you are seeing here are me taking it one step further with yet another trick: if the top of the stack is already in a register when you need to use it, use that register directly instead of moving it to another one. So we completely eliminate virtual stack pops!)
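
Here is a minimal sketch of what such a DDCGStack class could look like with both tricks: the register stack (falling back to real pushes and pops when the registers run out) and the use-the-top-register-directly shortcut. The names STACK_REGS, R8, and R9 are assumptions based on the tests above; the class in the actual repository may be organized differently.

STACK_REGS = [R8, R9]  # registers that stand in for the top of the stack


class DDCGStack:
    def __init__(self):
        self.code = []
        self.stack_size = 0

    def compile(self, exp):
        self._compile(exp, Dest.ACCUM)

    def _emit(self, op):
        self.code.append(op)

    def _push(self, src):
        # Use a free stack register if there is one; otherwise spill to the
        # machine stack with a real push.
        if self.stack_size < len(STACK_REGS):
            self._emit(X86.Mov(STACK_REGS[self.stack_size], src))
        else:
            self._emit(X86.Push(src))
        self.stack_size += 1

    def _pop_operand(self):
        # If the top of the virtual stack lives in a register, hand that
        # register back directly; otherwise pop the spilled value into RCX.
        self.stack_size -= 1
        if self.stack_size < len(STACK_REGS):
            return STACK_REGS[self.stack_size]
        self._emit(X86.Pop(RCX))
        return RCX

    def _compile(self, exp, dst):
        if isinstance(exp, Const):
            self._plug_imm(dst, exp.value)
        elif isinstance(exp, (Add, Mul)):
            self._compile(exp.left, Dest.STACK)
            self._compile(exp.right, Dest.ACCUM)
            tmp = self._pop_operand()
            op = X86.Add if isinstance(exp, Add) else X86.Mul
            self._emit(op(RAX, tmp))
            self._plug_reg(dst, RAX)
        else:
            raise NotImplementedError(exp)

    def _plug_imm(self, dst, value):
        if dst == Dest.STACK:
            self._push(Imm(value))
        elif dst == Dest.ACCUM:
            self._emit(X86.Mov(RAX, Imm(value)))
        else:
            raise NotImplementedError(dst)

    def _plug_reg(self, dst, reg):
        if dst == Dest.STACK:
            self._push(reg)
        elif dst == Dest.ACCUM:
            if reg != RAX:
                self._emit(X86.Mov(RAX, reg))
        else:
            raise NotImplementedError(dst)

Hand-tracing test_mul_add and test_add_deep through this sketch reproduces the instruction sequences above, including the real Push/Pop pair once both stack registers are occupied.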

Turning push and pop into inter-register movs hypothetically saves on latency but at the expense of code size. The push imm opcode can be as few as two bytes whereas the mov reg, imm form is seven bytes. I haven’t measured a darn thing, though.
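
To make the size claim concrete, here is a quick check with the two encodings hand-assembled from the instruction set manual (these byte sequences are mine, not produced by the compiler in this post):

# push imm8 vs. the REX.W mov of an immediate into r8:
push_imm8 = bytes.fromhex("6a04")               # push 4
mov_r8_imm32 = bytes.fromhex("49c7c004000000")  # mov r8, 4
assert len(push_imm8) == 2
assert len(mov_r8_imm32) == 7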

I don’t know what this register-stack trick is called or if anyone has tried this combination of compilation techniques before.

Testing

So we can generate these funky-looking x86-64 instructions and check that we have generated what we expect. Great—we have characterization tests. But do they run? I’m glad you asked, because I wrote a little x86-64 emulator [3].

With this emulator, I can do both compilation tests and execution tests. I can also make sure that the value returned from the emulator matches our tree-walking interpreter. How cool is that?
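
I won’t reproduce the emulator here, but an execution test might look roughly like the sketch below. The Emulator name and its run and register-reading API are placeholders for whatever the real emulator exposes, so treat this as a sketch rather than the actual test code:

class ExecutionTests(unittest.TestCase):
    def _run(self, exp):
        # Hypothetical API: execute the generated instructions, then read the
        # result out of RAX.
        emu = Emulator()
        emu.run(ddcg_compile(exp))
        return emu.reg(RAX)

    def test_matches_interpreter(self):
        exp = Mul(Add(Const(1), Const(2)), Add(Const(3), Const(4)))
        # Both the emulator and the tree-walking evaluator should say 21.
        self.assertEqual(self._run(exp), eval_exp(exp))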

But do the instructions work in real life? Maybe! I suppose the next step would be to mmap an executable buffer, assemble the instructions into it, and use ctypes to call our compiled expression from Python or something. But we don’t really have parameters, so I am not sure how that would work. In my C++ implementation, I do execute the code for real.
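
For the zero-argument case, something like the following sketch would probably do it (this is mine, not from the post; it assumes Linux, and the hand-assembled bytes for mov rax, 42; ret stand in for an assembled version of our X86 instructions):

import ctypes
import mmap

# mov rax, 42; ret -- hand-assembled stand-in for the compiler's output.
machine_code = bytes.fromhex("48c7c02a000000c3")

buf = mmap.mmap(-1, len(machine_code),
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(machine_code)

# Treat the buffer's address as a zero-argument function returning an int64.
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
func = ctypes.CFUNCTYPE(ctypes.c_int64)(addr)
print(func())  # 42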

Next steps and conclusion

In an ideal world I would have implemented the whole paper in this little Python implementation, but I don’t have that kind of time right now. The paper even has some more ideas in its conclusion that I did not mention here. As a reminder, I did implement a much bigger language and control destinations in my C++ implementation, which comes with a half-baked write-up in post.md.

I also wonder if this can be applied to static single assignment (SSA) IRs that have phi nodes. I’ve never written a pass to lower from SSA, but I’ve heard the parallel copy is really hard to get right.

Also, could this be extended to support arbitrary registers as data destinations? Allowing only the accumulator is kind of limiting.

I wonder if we could prove the generated code correct using Chris Fallin’s register allocation symbolic checker.

Can we combine this with delayed code generation?

Last, how easy is it to do DDCG straight from stack-based bytecode? I think the implicit tree-like lifetimes don’t exist anymore, which might make it harder.

The full code is available on GitHub.

Further reading

Check out the V8 implementation of DDCG:

Also check out copy-and-patch (PDF), another very fast compilation technique that results in good codegen. It leverages the host compiler to generate code so it is (somewhat) architecture independent.

Also check out the LuaJIT register allocator / solid-state register allocator, which only works on straight-line code but is very fast.

  1. Fun fact: despite optimal graph coloring normally being NP-complete, interference graphs of SSA programs are apparently chordal, and graph coloring of chordal graphs can be done in polynomial time. So since we have an SSA IR, we could do a “reasonably fast” graph coloring allocator. Maybe another time.

  2. The paper has a third kind of destination for data that is only used for its effect and need not be materialized. We don’t hit that case in this blog post because there are no side-effecting instructions and no control-flow. 

  3. You didn’t ask, but yes, I did write separate tests for the emulator.