Compiling a Lisp: Primitive binary functions

September 16, 2020


Welcome back to the “Compiling a Lisp” series. Last time, we added some primitive unary instructions like add1 and integer->char. This time, we’re going to add some primitive binary functions like + and <. After this post, we’ll be able to compile programs like:

(< (+ 1 2) (- 4 3))

Note that these expressions may look like function calls but, like last chapter, they are not opening new stack frames (which I’ll explain more about later). Instead, the compiler will recognize that the programmer is directly applying the symbol + and generate special code. You can think about this kind of like an inlined function call.

It’s important to remember that the compiler has a certain internal contract: the result of any given compiled expression is stored in rax. This isn’t some intrinsic property of all compilers, but it’s one we’ve kept so far in this series.

This is similar to but not the same as the calling convention that I mentioned earlier, where function results are stored in rax. That calling convention is for interacting with other people’s code. Within your own generated code, there are no rules. So we could pick any other register, really, for storing intermediate results.

Now that we’re building primitive functions that can take two arguments, you might notice a problem: our strategy of storing the result in rax won’t work on its own. If we were to naïvely write something like the following to implement +, then rax would get overwritten in the code generated by compiling operand1(args):

int Compile_call(Buffer *buf, ASTNode *callable, ASTNode *args) {
  if (AST_is_symbol(callable)) {
    // ...
    if (AST_symbol_matches(callable, "+")) {
      _(Compile_expr(buf, operand2(args)));
      // The result of this is stored in rax ^
      _(Compile_expr(buf, operand1(args)));
      // Oops, we just overwrote rax ^
      Emit_add_something(buf, /*dst=*/kRax);
      return 0;
    }
    // ...
  }
  // ...
}

We could try to work around this with some kind of register allocation algorithm that takes advantage of rcx, rdx, and friends. Or we could allocate all intermediate values on the stack and move on with our lives. I prefer the latter: it's simpler.

Stack background info

Since we can’t yet save our compiled programs to disk, there’s some amount of setup that has to happen before they’re run. Right now, the C programs I’m providing along with this series compile to binaries that just run the test suites for the compiler. They don’t actually run full programs. For this reason, there are already some call frames on the stack by the time our generated code is run.

Let’s take a look at the stack at the moment we enter a compiled Lisp program:

|                  | High addresses
+------------------+
|  main            |
+------------------+ |
|  ~ some data ~   | |
|  ~ some data ~   | |
+------------------+ |
|  compile_test    | |
+------------------+ |
|  ~ some data ~   | |
|  ~ some data ~   | v
+------------------+
|  Testing_exe...  | rsp (stack pointer)
+------------------+
|                  | <-- Our frame!
|                  | Low addresses

In this diagram, we have the C program's main function, which has its own local variables and so on. The main function calls the compile_test suite. This in turn calls the Testing_execute_expr function (abbreviated in the diagram), which is responsible for calling into our generated code. Every call stores the return address (some place to find the next instruction to execute) on the stack and adjusts rsp down.

Refresher: the call stack grows down. Why? Check out this StackOverflow answer that quotes an architect on the Intel 4004 and 8080 architectures. It’s stayed the same ever since.

In this diagram, we have rsp pointing at a return address somewhere inside the function Testing_execute_expr, since that's what called our Lisp entrypoint. We have some data “above” (higher addresses) rsp that we're not allowed to poke at, and we have this empty space “below” (lower addresses) rsp that is in our current stack frame. I say “empty” because we haven't yet stored anything there, not because it's necessarily zeroed out. I don't think there are any guarantees about the values in this stack frame.

We can use our stack frame to write and read values for our current Lisp program. With every recursive subexpression, we can allocate a little more stack space to keep track of the values. When I say “allocate”, I mean “subtract from the stack pointer”, because the stack is already a contiguous space in memory allocated for us. For example, here is how we can write to the stack:

mov [rsp-8], 0x4

This puts the integer 4 at displacement -8 from rsp. On the stack diagram above, it would be at the slot labeled “Our frame”. It’s also possible to read with a positive or zero displacement, but those point to previous stack frames and the return address, respectively. So let’s avoid manipulating those.

Note that I used a multiple of 8. Not every store has to be to an address that is a multiple of 8, but it's natural (and, I think, also faster) to store 8-byte-sized things at 8-byte-aligned addresses.

Let's walk through a real example to get more hands-on experience with this stack storage idea. We'll use the program (+ 1 2). The compiled version of that program should:

- compile 2, leaving its encoded value (0x8) in rax
- store rax in the first free stack slot, [rsp-8]
- compile 1, leaving its encoded value (0x4) in rax
- add the value at [rsp-8] to rax

So after compiling that, the stack will look like this:

|                  | High addresses
+------------------+
|  Testing_exe...  | RSP
+------------------+
|  0x8             | RSP-8 (result of compile(2))
|                  | Low addresses

And the result will be in rax, per our internal compiler contract.

This is all well and good, but at some point our compiled programs will need to emit push instructions or make function calls of their own. Both of these modify the stack pointer: push decrements rsp and writes its operand to the new top of the stack, and call is roughly equivalent to a push of the return address followed by a jmp.

For that reason, x86-64 comes with another register, rbp, designed to hold the base pointer. While the stack pointer tracks the “top” (low address) of the stack, the base pointer keeps a stable reference to the “bottom” (high address) of the current stack frame.

This is why in a lot of compiled code you see the following instructions repeated[1]:

myfunction:
push rbp
mov rbp, rsp
sub rsp, N  ; optional; allocate stack space for locals
; ... function body ...
mov rsp, rbp  ; required if you subtracted from rsp above
pop rbp
ret

The first three instructions, called the prologue, save the caller's rbp to the stack, set rbp to the current stack pointer, and optionally carve out stack space for locals. With rbp fixed, it's possible to maintain steady references to variable locations on the stack even as rsp changes. Yes, the compiler could instead adjust its internal table of references every time it emits code that modifies rsp, but that sounds much harder.

The last three instructions, called the epilogue, restore rsp to the base of the frame, pop the saved rbp back into rbp, and return to the caller.

To confirm this for yourself, check out this sample compiled C code. Look at the disassembly following the label square. Prologue, code, epilogue.
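
If you don't feel like clicking through, a function like this one (my guess at the shape of the sample; any small function will do) compiled without optimizations shows the same pattern:

// Compile with `gcc -O0 -S` and look at the output: push rbp / mov rbp, rsp
// at the top, pop rbp / ret at the bottom.
int square(int num) {
  return num * num;
}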

Stack allocation infrastructure

Until now, we haven’t needed to keep track of much as we recursively traverse expression trees. Now, in order to keep track of how much space on the stack any given compiled code will need, we have to add more state to our compiler. We’ll call this state the stack_index — Ghuloum calls it si — and we’ll pass it around as a parameter. Whatever it’s called, it points to the first writable (unused) index in the stack at any given point.

In compiled functions, the first writable index is -kWordSize (-8): after the prologue runs, [rbp] holds the caller's saved rbp (with the return address just above it at [rbp+8]), so [rbp-8] is the first free slot.

int Compile_function(Buffer *buf, ASTNode *node) {
  Buffer_write_arr(buf, kFunctionPrologue, sizeof kFunctionPrologue);
  _(Compile_expr(buf, node, -kWordSize));
  Buffer_write_arr(buf, kFunctionEpilogue, sizeof kFunctionEpilogue);
  return 0;
}

I've also gone ahead and added the prologue and epilogue. They're stored in static arrays, which makes them easier to modify and also makes them accessible to the testing helpers: the helpers can check that our expected code is book-ended by the prologue and epilogue.

static const byte kFunctionPrologue[] = {
    // push rbp
    0x55,
    // mov rbp, rsp
    kRexPrefix, 0x89, 0xe5,
};

static const byte kFunctionEpilogue[] = {
    // pop rbp
    0x5d,
    // ret
    0xc3,
};
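
The checking logic might look something like the following (a sketch of the idea, not the post's actual helper; the real assertion macro, EXPECT_FUNCTION_CONTAINS_CODE, appears in the tests below):

#include <stdbool.h>
#include <string.h>

// Return true if `code` starts with the function prologue, ends with the
// epilogue, and the expected body appears immediately after the prologue.
static bool function_code_matches(const byte *code, size_t code_size,
                                  const byte *body, size_t body_size) {
  size_t prologue_size = sizeof kFunctionPrologue;
  size_t epilogue_size = sizeof kFunctionEpilogue;
  if (code_size < prologue_size + body_size + epilogue_size) {
    return false;
  }
  return memcmp(code, kFunctionPrologue, prologue_size) == 0 &&
         memcmp(code + prologue_size, body, body_size) == 0 &&
         memcmp(code + code_size - epilogue_size, kFunctionEpilogue,
                epilogue_size) == 0;
}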

For Compile_expr, we just pass this new stack index through.

int Compile_expr(Buffer *buf, ASTNode *node, word stack_index) {
  // ...
  if (AST_is_pair(node)) {
    return Compile_call(buf, AST_pair_car(node), AST_pair_cdr(node),
                        stack_index);
  }
  // ...
}

And for Compile_call, we actually get to use it. Let's look back at our stack storage strategy for compiling (+ 1 2), now replacing rsp with rbp:

- compile 2, leaving its encoded value in rax
- store rax at [rbp + stack_index]
- compile 1, leaving its encoded value in rax
- add the value at [rbp + stack_index] to rax

For binary functions, this can be generalized to:

- compile the second operand, leaving its result in rax
- store rax at [rbp + stack_index]
- compile the first operand with stack_index bumped down by one word, leaving its result in rax
- combine rax with the value stored at [rbp + stack_index]

The key is this: for the first recursive call to Compile_expr, the compiler is allowed to emit code that can use the current stack_index and anything below that on the stack. For the second recursive call to Compile_expr, the compiler has to bump stack_index, since we’ve stored the result of the first compiled call at stack_index.
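
To make that concrete, here is a hand-traced sketch of the bookkeeping for a nested expression:

// (+ (+ 1 2) (+ 3 4)), compiled with stack_index = -8:
//   compile (+ 3 4) with stack_index = -8
//     (internally it parks 4's value at [rbp-8]; the slot is free again
//      once the subexpression is done)
//   store rax (the encoded 7) at [rbp-8]
//   compile (+ 1 2) with stack_index = -16
//     (internally it parks 2's value at [rbp-16], leaving [rbp-8] untouched)
//   add [rbp-8] to rax, leaving the encoded 10 in rax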

Take a look at our implementation of binary add:

int Compile_call(Buffer *buf, ASTNode *callable, ASTNode *args,
                 word stack_index) {
  if (AST_is_symbol(callable)) {
    // ...
    if (AST_symbol_matches(callable, "+")) {
      _(Compile_expr(buf, operand2(args), stack_index));
      Emit_store_reg_indirect(buf, /*dst=*/Ind(kRbp, stack_index),
                              /*src=*/kRax);
      _(Compile_expr(buf, operand1(args), stack_index - kWordSize));
      Emit_add_reg_indirect(buf, /*dst=*/kRax, /*src=*/Ind(kRbp, stack_index));
      return 0;
    }
    // ...
  }
  // ...
}
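
Tracing this on (+ 1 2) with stack_index equal to -kWordSize, the compiler emits the equivalent of the following calls (a sketch; I'm assuming integer literals go through Emit_mov_reg_imm32, which matches the bytes in the test below):

Emit_mov_reg_imm32(buf, kRax, Object_encode_integer(2)); // mov rax, 0x8
Emit_store_reg_indirect(buf, Ind(kRbp, -8), kRax);       // mov [rbp-8], rax
Emit_mov_reg_imm32(buf, kRax, Object_encode_integer(1)); // mov rax, 0x4
Emit_add_reg_indirect(buf, kRax, Ind(kRbp, -8));         // add rax, [rbp-8]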

In this snippet, Ind stands for “indirect”, and is a constructor for a struct. This is an easy and readable way to represent (register, displacement) pairs for use in reading from and writing to memory. We'll cover this in more detail below, in the section on instruction encoding.

To prove to ourselves that this approach works, we’ll add some tests later.

Other binary functions

Subtraction, multiplication, and division are much the same as addition. We’re also going to completely ignore overflow, underflow, etc.
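
For example, here is what subtraction looks like; it's the + case with the final emitter swapped (a sketch: Emit_sub_reg_indirect is my assumed name for an emitter shaped like the add one in the opcode section below):

    if (AST_symbol_matches(callable, "-")) {
      _(Compile_expr(buf, operand2(args), stack_index));
      Emit_store_reg_indirect(buf, /*dst=*/Ind(kRbp, stack_index),
                              /*src=*/kRax);
      _(Compile_expr(buf, operand1(args), stack_index - kWordSize));
      Emit_sub_reg_indirect(buf, /*dst=*/kRax, /*src=*/Ind(kRbp, stack_index));
      return 0;
    }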

Equality is different in that it does some comparisons after the fact (see Primitive unary functions). To check if two values are equal, we compare their pointers:

    if (AST_symbol_matches(callable, "=")) {
      _(Compile_expr(buf, operand2(args), stack_index));
      Emit_store_reg_indirect(buf, /*dst=*/Ind(kRbp, stack_index),
                              /*src=*/kRax);
      _(Compile_expr(buf, operand1(args), stack_index - kWordSize));
      // Compare the two encoded values directly
      Emit_cmp_reg_indirect(buf, kRax, Ind(kRbp, stack_index));
      // Zero rax, then set its low byte (al) to 1 if the values were equal
      Emit_mov_reg_imm32(buf, kRax, 0);
      Emit_setcc_imm8(buf, kEqual, kAl);
      // Turn that 0/1 into an encoded boolean
      Emit_shl_reg_imm8(buf, kRax, kBoolShift);
      Emit_or_reg_imm8(buf, kRax, kBoolTag);
      return 0;
    }

It uses a new comparison opcode that compares a register with some memory. This is why we can’t use the Compile_compare_imm32 helper function.

The less-than operator (<) is very similar to equality, but instead we use setcc with the kLess flag instead of the kEqual flag.
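
Spelled out, it's the = snippet with one constant swapped (reproduced here as a sketch; everything is derived from the code above):

    if (AST_symbol_matches(callable, "<")) {
      _(Compile_expr(buf, operand2(args), stack_index));
      Emit_store_reg_indirect(buf, /*dst=*/Ind(kRbp, stack_index),
                              /*src=*/kRax);
      _(Compile_expr(buf, operand1(args), stack_index - kWordSize));
      Emit_cmp_reg_indirect(buf, kRax, Ind(kRbp, stack_index));
      Emit_mov_reg_imm32(buf, kRax, 0);
      Emit_setcc_imm8(buf, kLess, kAl);
      Emit_shl_reg_imm8(buf, kRax, kBoolShift);
      Emit_or_reg_imm8(buf, kRax, kBoolTag);
      return 0;
    }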

New opcodes

We used some new opcodes today, so let’s take a look at the implementations. First, here is the indirection implementation I mentioned earlier:

typedef struct Indirect {
  Register reg;
  int8_t disp;
} Indirect;

Indirect Ind(Register reg, int8_t disp) {
  return (Indirect){.reg = reg, .disp = disp};
}

I would have used the same name in the struct and the constructor but unfortunately that’s not allowed.

Here’s an implementation of an opcode that uses this Indirect type. This emits code for instructions of the form mov [reg+disp], src.

uint8_t disp8(int8_t disp) { return disp >= 0 ? disp : 0x100 + disp; }

void Emit_store_reg_indirect(Buffer *buf, Indirect dst, Register src) {
  // REX.W prefix: this is a 64-bit operation
  Buffer_write8(buf, kRexPrefix);
  // Opcode for MOV r/m64, r64 (register into memory)
  Buffer_write8(buf, 0x89);
  // ModR/M byte: mod=01 (register-indirect plus 8-bit displacement),
  // reg=src, rm=dst.reg
  Buffer_write8(buf, 0x40 + src * 8 + dst.reg);
  Buffer_write8(buf, disp8(dst.disp));
}

The disp8 function is a helper that encodes displacements as single unsigned bytes, wrapping negative values around two's-complement style; -8, for example, becomes 0xf8, which you can spot in the test disassembly below.

The opcodes for add, sub, and cmp are similar enough to this one, except src and dst are swapped. mul is a little funky because it doesn’t take two parameters. It assumes that one of the operands is always in rax.
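
For instance, here's what the add variant might look like (a sketch; the actual implementation may differ, but 0x03 is the opcode for ADD r64, r/m64, and the encoding matches the 48 03 45 f8 bytes in the test below):

void Emit_add_reg_indirect(Buffer *buf, Register dst, Indirect src) {
  Buffer_write8(buf, kRexPrefix);
  // Opcode for ADD r64, r/m64 (note the flipped direction: memory into
  // register)
  Buffer_write8(buf, 0x03);
  // ModR/M byte: mod=01 (disp8), reg=dst, rm=src.reg
  Buffer_write8(buf, 0x40 + dst * 8 + src.reg);
  Buffer_write8(buf, disp8(src.disp));
}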

Testing

As usual, we’ll close with some snippets of tests.

Here's a test for +. I'm trying to see if interleaving the assembly text with the hex bytes makes the tests more readable. Thanks, Kartik, for the suggestion.

TEST compile_binary_plus(Buffer *buf) {
  ASTNode *node = new_binary_call("+", AST_new_integer(5), AST_new_integer(8));
  int compile_result = Compile_function(buf, node);
  ASSERT_EQ(compile_result, 0);
  byte expected[] = {
      // 0:  48 c7 c0 20 00 00 00    mov    rax,0x20
      0x48, 0xc7, 0xc0, 0x20, 0x00, 0x00, 0x00,
      // 7:  48 89 45 f8             mov    QWORD PTR [rbp-0x8],rax
      0x48, 0x89, 0x45, 0xf8,
      // b:  48 c7 c0 14 00 00 00    mov    rax,0x14
      0x48, 0xc7, 0xc0, 0x14, 0x00, 0x00, 0x00,
      // 12: 48 03 45 f8             add    rax,QWORD PTR [rbp-0x8]
      0x48, 0x03, 0x45, 0xf8};
  EXPECT_FUNCTION_CONTAINS_CODE(buf, expected);
  Buffer_make_executable(buf);
  uword result = Testing_execute_expr(buf);
  ASSERT_EQ(result, Object_encode_integer(13));
  AST_heap_free(node);
  PASS();
}

Here’s a test for <.

TEST compile_binary_lt_with_left_greater_than_right_returns_false(Buffer *buf)
{
  ASTNode *node = new_binary_call("<", AST_new_integer(6), AST_new_integer(5));
  int compile_result = Compile_function(buf, node);
  ASSERT_EQ(compile_result, 0);
  Buffer_make_executable(buf);
  uword result = Testing_execute_expr(buf);
  ASSERT_EQ_FMT(Object_false(), result, "0x%lx");
  AST_heap_free(node);
  PASS();
}

There are more tests in the implementation, as usual. Take a look if you like.

This has been a more complicated post than the previous ones, I think. The stack allocation may not make sense immediately. It might take some time to sink in. Try writing some of the code yourself and see if that helps.

Next time we'll add a reader (parser) so we can input expressions more easily, and after that, the ability to bind variables using let. See you then!

  1. You may also see an enter instruction paired with a leave instruction. These are roughly equivalent to the prologue and epilogue sequences above. Read more here