Inline caching: quickening

February 3, 2021

In my last post I discussed inline caching as a technique for runtime optimization. I ended the post with some extensions to the basic technique, like quickening. If you have not read the previous post, I recommend it. This post will make many references to it.

Quickening involves bytecode rewriting — self-modifying code — to remove some branches and indirection in the common path. Stefan Brunthaler writes about it in his papers Efficient Interpretation using Quickening and Inline Caching Meets Quickening.

The problem

Let’s take a look at a fragment of the caching interpreter from the last post so we can talk about the problem more concretely. You can also get the sources from the repo and open interpreter.c in your preferred editor.

void add_update_cache(Frame* frame, Object* left, Object* right) {
  // Look up the method for adding on left's type, store it in the cache
  // for this opcode, and call it.
  Method method = lookup_method(object_type(left), kAdd);
  cache_at_put(frame, object_type(left), method);
  Object* result = (*method)(left, right);
  push(frame, result);
}

void eval_code_cached(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        CachedValue cached = cache_at(frame);
        Method method = cached.value;
        if (method == NULL || cached.key != object_type(left)) {
          // Cache is empty or has the wrong key: look the method up and
          // fill the cache.
          add_update_cache(frame, left, right);
          break;
        }
        // Cache hit: call the cached method.
        Object* result = (*method)(left, right);
        push(frame, result);
        break;
      }
      // ...
    }
    frame->pc += kBytecodeSize;
  }
}

As I mentioned in the last post, the ADD opcode handler has three cases to handle:

  1. Cache is empty
  2. Cache has the wrong key
  3. Cache has the right key

Since Deutsch & Schiffman found that types at a given site don't vary that much, the third case is the fast path. This means we should do as little as possible there. And right now, we're doing too much work.

Why should we have to check if the cache slot is empty when, on the fast path, it never is? And why should we then have to make an indirect call? On some CPUs, indirect calls are much slower than direct calls. And this assumes the compiler generates a call instruction at all; it's very possible that a compiler would decide to inline a direct call.
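
To make the difference concrete, here is a sketch of the two call shapes. The function names are illustrative, and int_add is the direct integer-addition helper we will meet later in this post:

// Indirect: the callee is only known at run time, so the compiler must
// emit a call through a function pointer and cannot inline the callee.
Object* call_indirect(Method method, Object* left, Object* right) {
  return (*method)(left, right);
}

// Direct: the callee is known at compile time, so the compiler is free to
// emit a direct call or to inline int_add entirely.
Object* call_direct(Object* left, Object* right) {
  return int_add(left, right);
}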

Quickening is a technique that reduces the number of checks by explicitly marking state transitions in the bytecode.

Removing the empty check

In order to remove one of the checks — the method == NULL check — we can add a new opcode, ADD_CACHED. The ADD_CACHED opcode can skip the check because our interpreter will maintain the following invariant:

Invariant: The opcode ADD_CACHED will appear in the bytecode stream if and only if there is an entry in the cache at that opcode.

After ADD adds something to the cache, it can rewrite itself to ADD_CACHED. This way, the next time around, we have satisfied the invariant.

[State diagram: ADD rewrites itself to ADD_CACHED, and ADD_CACHED keeps transitioning to itself on later executions.]

Let’s see how that looks:

void eval_code_quickening(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        add_update_cache(frame, left, right);
        // Rewrite this opcode so the next execution skips the empty check.
        code->bytecode[frame->pc] = ADD_CACHED;
        break;
      }
      case ADD_CACHED: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        CachedValue cached = cache_at(frame);
        if (cached.key != object_type(left)) {
          // Wrong key: re-fill the cache. The invariant guarantees the
          // cache is not empty, so there is no NULL check here.
          add_update_cache(frame, left, right);
          break;
        }
        Method method = cached.value;
        Object* result = (*method)(left, right);
        push(frame, result);
        break;
      }
    // ...
    }
    frame->pc += kBytecodeSize;
  }
}

Not too different. We’ve shuffled the code around a little bit but overall it looks fairly similar. We still get to share some code in add_update_cache, so there isn’t too much duplication, either.

Now that we’ve moved the empty check, it’s time to remove the indirect call.

Removing the indirect call

Let’s assume for a minute that you, the writer of a language runtime, know that most of the time, when people write a + b, the operation refers to integer addition.

Not many other primitive types implement addition. Floating point numbers frequently use the same operator (though languages like OCaml do not). Maybe strings. And maybe your language allows for overloading the plus operator. But most people don't do that. They add numbers.

In that case, you want to remove as much of the overhead as possible for adding two numbers. So let's introduce a new opcode, ADD_INT, that is specialized for integer addition.

In an ideal world, we would just be able to pop two objects, add them, and move on. But in our current reality, we still have to deal with the possibility of programmers passing in a non-integer every once in a while.

So first, we check if the left operand is actually an integer. If it isn't, we populate the cache and transition to ADD_CACHED. I'll get to why we do that in a moment.

And if we did actually get an int, great. We call this new function, do_add_int:

void do_add_int(Frame* frame, Object* left, Object* right) {
  Object* result = int_add(left, right);
  push(frame, result);
}
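
We have not looked inside int_add in this post. As a minimal sketch, assuming a tagged-union Object representation with an int_value field and a new_int allocator (see interpreter.c in the repo for the real definitions), it might look like:

// A minimal sketch of int_add. The int_value field and new_int allocator
// are assumptions about the object representation; check the repo.
Object* int_add(Object* left, Object* right) {
  assert(object_type(left) == kInt && object_type(right) == kInt);
  return new_int(left->int_value + right->int_value);
}

With do_add_int in hand, here is the handler for the new opcode: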

void eval_code_quickening(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD_INT: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        if (object_type(left) != kInt) {
          // The integer assumption was wrong: fall back to the generic
          // cached handler instead of flip-flopping between opcodes.
          add_update_cache(frame, left, right);
          code->bytecode[frame->pc] = ADD_CACHED;
          break;
        }
        do_add_int(frame, left, right);
        break;
      }
    // ...
    }
    frame->pc += kBytecodeSize;
  }
}

This is a nice opcode handler for ADD_INT, but right now it’s orphaned. Some opcode has to take the leap and rewrite itself to ADD_INT, otherwise it’ll never get run.

I suggest we make ADD do the transition. This keeps ADD_CACHED fast for other types. If ADD observes that the left hand side of the operation is an integer, it’ll call do_add_int and rewrite itself.

[State diagram: ADD rewrites itself to ADD_INT when an int is observed and to ADD_CACHED when a non-int is observed; ADD_INT keeps transitioning to itself on ints and falls back to ADD_CACHED on a non-int; ADD_CACHED transitions to itself.]

Let’s see how that looks in code.

void eval_code_quickening(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        if (object_type(left) == kInt) {
          // Integer observed: specialize this opcode for ints.
          do_add_int(frame, left, right);
          code->bytecode[frame->pc] = ADD_INT;
          break;
        }
        // Non-integer observed: go straight to the generic cached opcode.
        add_update_cache(frame, left, right);
        code->bytecode[frame->pc] = ADD_CACHED;
        break;
      }
    // ...
    }
    frame->pc += kBytecodeSize;
  }
}

Back to “why transition from ADD_INT to ADD_CACHED”. Two thoughts:

  1. We could transition back to ADD. In that case, this code would perform poorly in an environment where the programmer passes multiple different types at this opcode. There would be a lot of bytecode rewriting overhead as the opcode flips back and forth between ADD and ADD_INT.

  2. We could also assume the non-int is a hiccup and not rewrite. This would perform poorly if the argument is an integer the first time but something else on every subsequent operation. There would be a lot of lookup_method calls.

A great extension here would be to add a polymorphic cache. Those are designed to efficiently handle a small (normally less than five) number of repeated types at a given point.
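
As a rough sketch of what that could look like in this interpreter (PolymorphicCache, CacheEntry, kMaxCacheEntries, and cache_lookup are hypothetical names, not part of the demo):

// A hypothetical polymorphic cache: a small fixed-size array of
// (type, method) entries per cache site, searched linearly. A real
// runtime would mark the site megamorphic once the array fills up
// and stop caching there.
#define kMaxCacheEntries 4

typedef struct {
  ObjectType key;
  Method value;
} CacheEntry;

typedef struct {
  CacheEntry entries[kMaxCacheEntries];
  int num_entries;
} PolymorphicCache;

Method cache_lookup(PolymorphicCache* cache, ObjectType key) {
  for (int i = 0; i < cache->num_entries; i++) {
    if (cache->entries[i].key == key) {
      return cache->entries[i].value;
    }
  }
  return NULL;  // caller falls back to lookup_method and adds an entry
}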

Why is this faster?

Even if we leave the interpreter in this state, as a small C bytecode interpreter, we save a couple of instructions and some call overhead on the fast path for integer addition. This is a decent win for math-heavy applications.

In the best case, though, we save a great many instructions. It's entirely possible that the compiler will optimize the entire body of ADD_INT to something like:

pop rax
pop rcx
cmp rax, $IntTag
jne slow_path
add rcx, rax
push rcx
jmp next_opcode
slow_path:
; ...

It won’t look exactly like that, due to our object representation and because our push/pop functions do not operate on the C call stack, but it will be a little closer than before. But what if we could fix these issues and trim down the code even further?

Then we might have something like the Dart intermediate implementation of addition for small integers on x86-64. The following C++ code emits assembly for a specialized small integer handler:

void CheckedSmiOpInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
  // ...
  Register left = locs()->in(0).reg();
  Register right = locs()->in(1).reg();
  // Check both left and right are small integers
  __ movq(TMP, left);
  __ orq(TMP, right);
  __ testq(TMP, compiler::Immediate(kSmiTagMask));
  __ j(NOT_ZERO, slow_path->entry_label());
  Register result = locs()->out(0).reg();
  __ movq(result, left);
  __ addq(result, right);
  // ...
}

This example is a little bit different since it is using an optimizing compiler and assumes the input and output are both in registers, but still expresses the main ideas.

See also the JVM template interpreter implementation for binary operations on small integers:

void TemplateTable::iop2(Operation op) {
  // ...
  __ pop_i(rdx);
  __ addl (rax, rdx);
  // ...
}

which pops a value off the stack into rdx and adds it to rax. I think this is because the JVM caches the top of the stack in the register rax at all times, but I have not been able to confirm this. It would explain why the result ends up in rax and why there is no push, though.

Exploring further

There are a number of improvements that could be made to this very simple demo. Bytecode rewriting can unlock a lot of performance gains with additional work.

Maybe I will even write about them in the future.