Another entry in the Toy Optimizer series.
A long, long time ago (two years!) CF Bolz-Tereick and I made a video, and an accompanying GitHub Gist, about load/store forwarding in the Toy Optimizer. I said I would write a blog post about it, but never found the time—it got lost amid a sea of large life changes.
It’s a neat idea: do an abstract interpretation over the trace, modeling the heap at compile-time, removing redundant reads and writes. That means it’s possible to optimize traces like this:
v0 = ...
v1 = load(v0, 5)
v2 = store(v0, 6, 123)
v3 = load(v0, 6)
v4 = load(v0, 5)
v5 = do_something(v1, v3, v4)
into traces like this:
v0 = ...
v1 = load(v0, 5)
v2 = store(v0, 6, 123)
v5 = do_something(v1, 123, v1)
This indicates that we were able to remove two redundant loads by keeping around information about previous loads and stores. Let’s get to work making this possible.
We’ll start off with the usual infrastructure from the Toy Optimizer series: a very stringly-typed representation of a trace-based SSA IR and a union-find rewrite mechanism.
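If you haven’t read the earlier entries, here is a rough sketch of that infrastructure, simplified to just what the examples below need (the real series code differs in the details):

from typing import Dict, List, Optional, Tuple


class Value:
    def find(self) -> "Value":
        raise NotImplementedError("abstract base class")


class Constant(Value):
    def __init__(self, value: object):
        self.value = value

    def find(self) -> "Value":
        return self


class Operation(Value):
    def __init__(self, name: str, args: List[Value]):
        self.name = name
        self.args = args
        # Union-find: set when this operation gets rewritten to another value.
        self.forwarded: Optional[Value] = None

    def find(self) -> "Value":
        # Follow the forwarding chain to the representative value.
        op: Value = self
        while isinstance(op, Operation):
            if op.forwarded is None:
                return op
            op = op.forwarded
        return op

    def arg(self, index: int) -> "Value":
        return self.args[index].find()

    def make_equal_to(self, value: Value) -> None:
        self.find().forwarded = value


class Block(list):
    def __getattr__(self, opname: str):
        # bb.load(obj, 0) builds a "load" Operation, appends it, and returns it.
        def build(*args):
            wrapped = [arg if isinstance(arg, Value) else Constant(arg)
                       for arg in args]
            op = Operation(opname, wrapped)
            self.append(op)
            return op
        return build


def bb_to_str(bb: Block, varprefix: str = "var") -> str:
    # Name the operations var0, var1, ... and render one line per operation.
    names: Dict[Operation, str] = {}

    def value_to_str(value: Value) -> str:
        value = value.find()
        if isinstance(value, Constant):
            return str(value.value)
        return names[value]

    lines = []
    for index, op in enumerate(bb):
        names[op] = f"{varprefix}{index}"
        args = ", ".join(value_to_str(arg) for arg in op.args)
        lines.append(f"{names[op]} = {op.name}({args})")
    return "\n".join(lines)

The important pieces are make_equal_to and find: rewriting an operation to an earlier value is just a pointer swap in the union-find structure, and bb_to_str chases those forwardings when printing arguments.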
With that infrastructure in hand, we can start writing a new optimization pass and our first test:
def optimize_load_store(bb: Block):
opt_bb = Block()
# TODO: copy an optimized version of bb into opt_bb
return opt_bb
def test_two_loads():
bb = Block()
var0 = bb.getarg(0)
var1 = bb.load(var0, 0)
var2 = bb.load(var0, 0)
bb.escape(var1)
bb.escape(var2)
opt_bb = optimize_load_store(bb)
assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = escape(var1)
var3 = escape(var1)"""
This test is asserting that we can remove duplicate loads. Why load twice if we can cache the result? Let’s make that happen.
To do this, we’ll model the heap at compile-time. When I say “model”, I mean that we will have an imprecise but correct abstract representation of the heap: we don’t (and can’t) have knowledge of every value, but we can know for sure that some addresses have certain values.
For example, if we have observed a load from object O at offset 8, v0 = load(O, 8), we know that the SSA value v0 is the value at heap[(O, 8)]. That sounds tautological, but it’s not: future loads can make use of this information.
def get_num(op: Operation, index: int = 1):
    # Fetch a constant argument (by default the offset, which is argument 1).
    assert isinstance(op.arg(index), Constant)
    return op.arg(index).value
def optimize_load_store(bb: Block):
opt_bb = Block()
# Stores things we know about the heap at... compile-time.
# Key: an object and an offset pair acting as a heap address
# Value: a previous SSA value we know exists at that address
compile_time_heap: Dict[Tuple[Value, int], Value] = {}
for op in bb:
if op.name == "load":
obj = op.arg(0)
offset = get_num(op, 1)
load_info = (obj, offset)
previous = compile_time_heap.get(load_info)
if previous is not None:
op.make_equal_to(previous)
continue
compile_time_heap[load_info] = op
opt_bb.append(op)
return opt_bb
This pass records information about loads and uses the result of a previous cached load operation if available. We treat the pair of (SSA value, offset) as an address into our abstract heap.
That’s great! If you run our simple test, it should now pass. But what happens if we store into that address before the second load? Oops…
def test_store_to_same_object_offset_invalidates_load():
bb = Block()
var0 = bb.getarg(0)
var1 = bb.load(var0, 0)
var2 = bb.store(var0, 0, 5)
var3 = bb.load(var0, 0)
bb.escape(var1)
bb.escape(var3)
opt_bb = optimize_load_store(bb)
assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = store(var0, 0, 5)
var3 = load(var0, 0)
var4 = escape(var1)
var5 = escape(var3)"""
This test fails because we are incorrectly keeping around var1 in our
abstract heap. We need to get rid of it and not replace var3 with var1.
So it turns out we have to also model stores in order to cache loads correctly. One valid, albeit aggressive, way to do that is to throw away all the information we know at each store operation:
def optimize_load_store(bb: Block):
opt_bb = Block()
compile_time_heap: Dict[Tuple[Value, int], Value] = {}
for op in bb:
if op.name == "store":
compile_time_heap.clear()
elif op.name == "load":
# ...
opt_bb.append(op)
return opt_bb
That makes our test pass—yay!—but at great cost. It means any store operation throws away everything we have cached, so we can no longer remove redundant loads that follow it. In our world where we frequently read from and write to objects, this is what we call a huge bummer.
For example, a store to offset 4 on some object should never interfere with a load from a different offset on the same object1. We should be able to keep our load from offset 0 cached here:
def test_store_to_same_object_different_offset_does_not_invalidate_load():
bb = Block()
var0 = bb.getarg(0)
var1 = bb.load(var0, 0)
var2 = bb.store(var0, 4, 5)
var3 = bb.load(var0, 0)
bb.escape(var1)
bb.escape(var3)
opt_bb = optimize_load_store(bb)
assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = store(var0, 4, 5)
var3 = escape(var1)
var4 = escape(var1)"""
We could instead try checking whether our specific (object, offset) pair is in the heap and removing only the cached information about that exact object and offset. That would definitely help!
def optimize_load_store(bb: Block):
opt_bb = Block()
compile_time_heap: Dict[Tuple[Value, int], Value] = {}
for op in bb:
if op.name == "store":
load_info = (op.arg(0), get_num(op, 1))
if load_info in compile_time_heap:
del compile_time_heap[load_info]
elif op.name == "load":
# ...
opt_bb.append(op)
return opt_bb
It makes our test pass, too, which is great news.
Unfortunately, this runs into problems due to aliasing: it’s entirely possible that our compile-time heap could contain a pair (v0, 0) and a pair (v1, 0) where v0 and v1 are actually the same object (a fact the optimizer has no way of knowing). Then we might incorrectly reuse a stale cached load because the optimizer doesn’t know our abstract addresses (v0, 0) and (v1, 0) are the same pointer at run-time.
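Concretely, consider a trace like this, where v0 and v1 happen to be the same object at run-time:

v0 = ...
v1 = ...
v2 = load(v0, 0)
v3 = store(v1, 0, 7)
v4 = load(v0, 0)
v5 = do_something(v2, v4)

With per-(object, offset) invalidation, the store only clears (v1, 0), so the second load would incorrectly get replaced with v2, even though the store just wrote 7 to that very address.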
This breaks the rules of abstract interpretation: our abstract interpreter has to correctly model all possible outcomes at run-time. To me, that suggests picking some tactic in between clearing all information (correct but over-eager) and clearing only exact matches of object+offset (incorrect).
The concept that will help us here is the alias class: a way to efficiently partition the addresses in our abstract heap into completely disjoint sets. Writes to an address in one class never affect addresses in another class.
Our very scrappy alias classes will be based just on the offset: each offset is its own alias class. If we write to any object at offset K, we have to invalidate all of our compile-time knowledge about offset K—even if it’s for another object. This is a nice middle ground, and it’s possible because our (made up) object system guarantees that distinct objects do not overlap, and also that we are not writing out-of-bounds.2
So let’s remove all of the entries from compile_time_heap where the offset
matches the offset in the current store:
def optimize_load_store(bb: Block):
opt_bb = Block()
compile_time_heap: Dict[Tuple[Value, int], Value] = {}
for op in bb:
if op.name == "store":
offset = get_num(op, 1)
compile_time_heap = {
load_info: value
for load_info, value in compile_time_heap.items()
if load_info[1] != offset
}
elif op.name == "load":
# ...
opt_bb.append(op)
return opt_bb
Great! Now our test passes.
This concludes the load optimization section of the post. We have modeled enough of loads and stores that we can eliminate redundant loads. Very cool. But we can go further.
Stores don’t just invalidate information. They also give us new information!
Any time we see an operation of the form v1 = store(v0, 8, 5) we also learn
that load(v0, 8) == 5! Until it gets invalidated, anyway.
For example, in this test, we can eliminate the load from var0 at offset 0:
def test_load_after_store_removed():
bb = Block()
var0 = bb.getarg(0)
bb.store(var0, 0, 5)
var1 = bb.load(var0, 0)
var2 = bb.load(var0, 1)
bb.escape(var1)
bb.escape(var2)
opt_bb = optimize_load_store(bb)
assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = store(var0, 0, 5)
var2 = load(var0, 1)
var3 = escape(5)
var4 = escape(var2)"""
Making that work is thankfully not very hard; we need only add that new information to the compile-time heap after removing all the potentially-aliased info:
def optimize_load_store(bb: Block):
opt_bb = Block()
compile_time_heap: Dict[Tuple[Value, int], Value] = {}
for op in bb:
if op.name == "store":
offset = get_num(op, 1)
compile_time_heap = # ... as before ...
obj = op.arg(0)
new_value = op.arg(2)
compile_time_heap[(obj, offset)] = new_value # NEW!
elif op.name == "load":
# ...
opt_bb.append(op)
return opt_bb
This makes the test pass. It makes another test fail (test_store_to_same_object_offset_invalidates_load), but only because—oops—we now know more: its second load now gets replaced with the stored constant. You can delete the old test because the new test supersedes it.
Now, note that we are not removing the store. This is because we have nothing
in our optimizer that keeps track of what might have observed the side-effects
of the store. What if the object got escaped? Or someone did a load later on?
We would only be able to remove the store (continue) if we could guarantee it
was not observable.
In our current framework, this only happens in one case: someone is doing a store of the exact same value that already exists in our compile-time heap. That is, either the same constant, or the same SSA value. If we see this, then we can completely skip the second store instruction.
Here’s a test case for that, where we have gained information from the load instruction that we can then use to get rid of the store instruction:
def test_load_then_store():
bb = Block()
arg1 = bb.getarg(0)
var1 = bb.load(arg1, 0)
bb.store(arg1, 0, var1)
bb.escape(var1)
opt_bb = optimize_load_store(bb)
assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = escape(var1)"""
Let’s make it pass. To do that, first we’ll write an equality function that works for both constants and operations. Constants are equal if their values are equal, and operations are equal only if they are the very same operation (identical by pointer).
def eq_value(left: Value|None, right: Value) -> bool:
if isinstance(left, Constant) and isinstance(right, Constant):
return left.value == right.value
return left is right
This is a partial equality: if two operations are not equal under eq_value,
it doesn’t mean that they are different, only that we don’t know that they are
the same.
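To see what that means in practice, here is a quick illustration, using the Constant and Operation classes from the sketch earlier:

obj = Operation("getarg", [Constant(0)])
load_a = Operation("load", [obj, Constant(0)])
load_b = Operation("load", [obj, Constant(0)])
assert eq_value(Constant(5), Constant(5))  # equal constant values
assert not eq_value(None, Constant(5))     # nothing cached at that address yet
assert eq_value(load_a, load_a)            # the very same operation
# Two distinct loads are not *known* to be equal, even if they would load the
# same value at run-time:
assert not eq_value(load_a, load_b)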
Then we need only check if the current value in the compile-time heap is the same as the value being stored. If it is, wonderful. No need to store: continue and don’t append the operation to opt_bb:
def optimize_load_store(bb: Block):
opt_bb = Block()
compile_time_heap: Dict[Tuple[Value, int], Value] = {}
for op in bb:
if op.name == "store":
obj = op.arg(0)
offset = get_num(op, 1)
store_info = (obj, offset)
current_value = compile_time_heap.get(store_info)
new_value = op.arg(2)
if eq_value(current_value, new_value): # NEW!
continue
compile_time_heap = # ... as before ...
# ...
elif op.name == "load":
load_info = (op.arg(0), get_num(op, 1))
if load_info in compile_time_heap:
op.make_equal_to(compile_time_heap[load_info])
continue
compile_time_heap[load_info] = op
opt_bb.append(op)
return opt_bb
This makes our load-then-store test pass, and it makes other tests pass too, like eliminating a store right after an identical store!
def test_store_after_store():
bb = Block()
arg1 = bb.getarg(0)
bb.store(arg1, 0, 5)
bb.store(arg1, 0, 5)
opt_bb = optimize_load_store(bb)
assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = store(var0, 0, 5)"""
Unfortunately, this only works if the values—constants or SSA values—are known to be the same. If we store different values, we can’t optimize. In the live stream, we left this as an exercise for the viewer:
@pytest.mark.xfail
def test_exercise_for_the_reader():
bb = Block()
arg0 = bb.getarg(0)
var0 = bb.store(arg0, 0, 5)
var1 = bb.store(arg0, 0, 7)
var2 = bb.load(arg0, 0)
bb.escape(var2)
opt_bb = optimize_load_store(bb)
assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = store(var0, 0, 7)
var2 = escape(7)"""
We would only be able to optimize this away if we had some notion of a store being dead. In this case, that is a store in which the value is never read before being overwritten.
TODO, I suppose. I have not gotten this far yet. If I get around to it, I will come back and update the post.
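In the meantime, here is a rough, untested sketch of what a separate block-local dead store elimination pass could look like, using the same offset-based aliasing assumptions as before (the pass name and structure here are made up for illustration): walk the trace backwards and drop any store whose (object, offset) address is definitely written again before anything could observe it.

from typing import Set, Tuple


def eliminate_dead_stores(bb: Block) -> Block:
    # Addresses that are definitely overwritten later in the block before
    # anything could read them.
    overwritten: Set[Tuple[Value, int]] = set()
    kept = []
    for op in reversed(bb):
        if op.name == "store":
            store_info = (op.arg(0), get_num(op, 1))
            if store_info in overwritten:
                continue  # a later store to the same address wins; drop this one
            overwritten.add(store_info)
        elif op.name == "load":
            # A load at this offset might alias any object we are tracking,
            # so stores at this offset become observable again.
            offset = get_num(op, 1)
            overwritten = {info for info in overwritten if info[1] != offset}
        else:
            # escape, calls, ...: assume they might read arbitrary memory.
            overwritten.clear()
        kept.append(op)
    opt_bb = Block()
    opt_bb.extend(reversed(kept))
    return opt_bb

Run after optimize_load_store, this would drop the dead store(var0, 0, 5) in the exercise test above, since the forward pass has already forwarded the load to the constant 7.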
This small optimization pass may seem silly or fiddly—when would we ever see something like this in a real IR?—but it’s pretty useful. Here’s the Ruby code that got me thinking about it again some years later for ZJIT:
class C
def initialize
@a = 1
@b = 2
@c = 3
end
end
CRuby has a shape system and ZJIT makes use of it, so we end up optimizing this code (if it’s monomorphic) into a series of shape checks and stores. The HIR might end up looking something like the mess below, where I’ve annotated the shape guards (which can be thought of as loads) and stores with asterisks:
fn initialize@tmp/init.rb:3:
# ...
bb2(v6:BasicObject):
v10:Fixnum[1] = Const Value(1)
v31:HeapBasicObject = GuardType v6, HeapBasicObject
* v32:HeapBasicObject = GuardShape v31, 0x400000
* StoreField v32, :@a@0x10, v10
WriteBarrier v32, v10
v35:CShape[0x40008e] = Const CShape(0x40008e)
* StoreField v32, :_shape_id@0x4, v35
v16:Fixnum[2] = Const Value(2)
v37:HeapBasicObject = GuardType v6, HeapBasicObject
* v38:HeapBasicObject = GuardShape v37, 0x40008e
* StoreField v38, :@b@0x18, v16
WriteBarrier v38, v16
v41:CShape[0x40008f] = Const CShape(0x40008f)
* StoreField v38, :_shape_id@0x4, v41
v22:Fixnum[3] = Const Value(3)
v43:HeapBasicObject = GuardType v6, HeapBasicObject
* v44:HeapBasicObject = GuardShape v43, 0x40008f
* StoreField v44, :@c@0x20, v22
WriteBarrier v44, v22
v47:CShape[0x400090] = Const CShape(0x400090)
* StoreField v44, :_shape_id@0x4, v47
CheckInterrupts
Return v22
If we had store-load forwarding in ZJIT, we could get rid of the intermediate
shape guards; they would know the shape from the previous StoreField
instruction. If we had dead store elimination, we could get rid of the
intermediate shape writes; they are never read. (And the repeated type guards that re-check whether it’s a heap object are just silly and need to get removed eventually.)
This is on the roadmap and will make object initialization even faster than it is right now.
Thanks for reading the text version of the video that CF and I made a while back. Now you know how to do load/store elimination on traces.
I think this does not need too much extra work to get it going on full CFGs; a block is pretty much the same as a trace, so you can do a block-local version without much fuss. If you want to go global, you need dominator information and gen-kill sets.
Maybe I will touch on this in a future post…
Thank you to CF, who walked me through this live on a stream two years ago! This blog post wouldn’t be possible without you.
In this toy optimizer example, we are assuming that all reads and writes are the same size and that different offsets don’t overlap at all. This is often the case for managed runtimes, where object fields are pointer-sized and all reads/writes are pointer-aligned. ↩
We could do better. If we had type information, we could also use that to make alias classes. Writes to a List will never overlap with writes to a Map, for example. This requires your compiler to have strict aliasing—if you can freely cast between types, as in C, then this tactic goes out the window.
This is called Type-based alias analysis (PDF). ↩