Another entry in the Toy Optimizer series.
Last time, we did load-store forwarding in the context of our Toy Optimizer. We managed to cache the results of both reads from and writes to the heap—at compile-time!
We were careful to mind object aliasing: we separated our heap information into
alias classes based on what offset the reads/writes referenced. This way, even if we
didn’t know whether objects a and b aliased, we could at least know that
different offsets would never alias (assuming our objects don’t overlap and
memory accesses are on word-sized slots). This is a coarse-grained heuristic.
Fortunately, we often have much more information available at compile-time than just the offset, so we should use it. I mentioned in a footnote that we could use type information, for example, to improve our alias analysis. We’ll add a lightweight form of type-based alias analysis (TBAA) (PDF) in this post.
We return once again to Fil Pizlo land, specifically How I implement SSA form. We’re going to be using the hierarchical heap effect representation from the post in our implementation, but you can use your own type representation if you have one already.
This representation divides the heap into disjoint regions by type. Consider,
for example, that Array objects and String objects do not overlap. A
LinkedList pointer is never going to alias an Integer pointer. They can
therefore be reasoned about separately.
But sometimes you don’t have perfect type information available. If your
language has an Object base class that all objects inherit from, then the
Object heap overlaps with, say, the Array heap. So you need some way to
represent that too; a flat enum of types doesn’t work cleanly.
Here is an example simplified type hierarchy:
Any
  Object
    Array
    String
  Other
Where Other might represent different parts of the runtime’s data structures,
and could be further segmented into GC, Thread, etc.
Fil’s idea is that we can represent each node in that hierarchy with a tuple of
integers [start, end) (inclusive, exclusive) that represent the pre- and
post-order traversals of the tree. Or, if tree traversals are not engraved into
your bones, they represent the range of all the nested objects within them.
Any      [0, 3)
  Object [0, 2)
    Array  [0, 1)
    String [1, 2)
  Other  [2, 3)
Then the “does this write interfere with this read” check—the aliasing check—is a range overlap query.
Here’s a perhaps over-engineered Python implementation of the range and heap hierarchy based on the Ruby generator and C++ runtime code from JavaScriptCore:
class HeapRange:
    def __init__(self, start: int, end: int) -> None:
        self.start = start
        self.end = end

    def __repr__(self) -> str:
        return f"[{self.start}, {self.end})"

    def is_empty(self) -> bool:
        return self.start == self.end

    def overlaps(self, other: "HeapRange") -> bool:
        # Empty ranges interfere with nothing
        if self.is_empty() or other.is_empty():
            return False
        return self.end > other.start and other.end > self.start


class AbstractHeap:
    def __init__(self, name: str) -> None:
        self.name = name
        self.parent = None
        self.children = []
        self.range = None

    def add_child(self, name: str) -> "AbstractHeap":
        result = AbstractHeap(name)
        result.parent = self
        self.children.append(result)
        return result

    def compute(self, start: int) -> None:
        current = start
        if not self.children:
            # A leaf occupies exactly one slot in the numbering.
            self.range = HeapRange(start, current + 1)
            return
        for child in self.children:
            child.compute(current)
            current = child.range.end
        # An interior node spans all of its children.
        self.range = HeapRange(start, current)


Any = AbstractHeap("Any")
Object = Any.add_child("Object")
Array = Object.add_child("Array")
String = Object.add_child("String")
Other = Any.add_child("Other")
Any.compute(0)
Where Any.compute(0) kicks off the tree-numbering scheme.
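To make the overlap check concrete, here is a quick sanity check against the ranges computed above; each assertion follows directly from the interval arithmetic:

# Array [0, 1) and String [1, 2) are disjoint leaves: they never alias.
assert not Array.range.overlaps(String.range)
# Object [0, 2) contains Array [0, 1): they may alias.
assert Object.range.overlaps(Array.range)
# Any [0, 3) overlaps every non-empty heap.
assert Any.range.overlaps(Other.range)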
Fil’s implementation also covers a bunch of additional abstract heaps such as SSAState and Control because his implementation is also used for code motion and whatnot. That can be added later, but we will not do so in this post.
So there you have it: a type representation. Now we need to use it in our load-store forwarding.
Recall that our load-store optimization pass looks like this:
def optimize_load_store(bb: Block):
    opt_bb = Block()
    # Stores things we know about the heap at... compile-time.
    # Key: an object and an offset pair acting as a heap address
    # Value: a previous SSA value we know exists at that address
    compile_time_heap: Dict[Tuple[Value, int], Value] = {}
    for op in bb:
        if op.name == "store":
            obj = op.arg(0)
            offset = get_num(op, 1)
            store_info = (obj, offset)
            current_value = compile_time_heap.get(store_info)
            new_value = op.arg(2)
            if eq_value(current_value, new_value):
                # Redundant store of a value we know is already there.
                continue
            # Invalidate every cached fact at the same offset; the
            # store might clobber any of them.
            compile_time_heap = {
                load_info: value
                for load_info, value in compile_time_heap.items()
                if load_info[1] != offset
            }
            compile_time_heap[store_info] = new_value
        elif op.name == "load":
            load_info = (op.arg(0), get_num(op, 1))
            if load_info in compile_time_heap:
                # Forward the cached value and drop the load.
                op.make_equal_to(compile_time_heap[load_info])
                continue
            compile_time_heap[load_info] = op
        opt_bb.append(op)
    return opt_bb
At its core, it iterates over the instructions, keeping a representation of the heap at compile-time. Reads get cached, writes get cached, and writes also invalidate the state of compile-time information about fields that may alias.
In this case, our may-alias check asks only whether the offsets are equal. This means that the following unit test will fail:
def test_store_to_same_offset_different_heaps_does_not_invalidate_load():
    bb = Block()
    var0 = bb.getarg(0)
    var0.info = Array
    var1 = bb.getarg(1)
    var1.info = String
    var2 = bb.store(var0, 0, 3)
    var3 = bb.store(var1, 0, 4)
    var4 = bb.load(var0, 0)
    bb.escape(var4)
    opt_bb = optimize_load_store(bb)
    assert (
        bb_to_str(opt_bb)
        == """\
var0 = getarg(0)
var1 = getarg(1)
var2 = store(var0, 0, 3)
var3 = store(var1, 0, 4)
var4 = escape(3)"""
    )
This test is expecting the write to var0 to still remain cached even though
we wrote to the same offset in var1—because we have annotated var0 as
being an Array and var1 as being a String. If we account for type
information in our alias analysis, we can get this test to pass.
After doing a bunch of fussing around with the load-store forwarding (many rewrites), I eventually got it down to a very short diff:
+def may_alias(left: Value, right: Value) -> bool:
+    return (left.info or Any).range.overlaps((right.info or Any).range)
+
+
 def optimize_load_store(bb: Block):
     opt_bb = Block()
     # Stores things we know about the heap at... compile-time.
@@ -138,6 +210,10 @@ def optimize_load_store(bb: Block):
                 load_info: value
                 for load_info, value in compile_time_heap.items()
                 if load_info[1] != offset
+                or not may_alias(load_info[0], obj)
             }
             compile_time_heap[store_info] = new_value
If we don’t have any type/alias information, we default to “I know nothing”
(Any) for each object. Then we check range overlap.
The boolean logic in optimize_load_store looks a little weird, maybe. But we
can also rewrite it (via De Morgan’s law) as:
{
    ... for ...
    if not (load_info[1] == offset
            and may_alias(load_info[0], obj))
}
So we keep cached state for every field that is known, either by offset or by type, not to alias the store. Maybe that is clearer (but not as nice a diff).
Note that the type representation is not so important here! You could use a bitset version of the type information if you want. The important things are that you can cheaply construct types and check overlap between them.
Nice, now our test passes! We can differentiate between memory accesses on objects of different types.
But what if we knew more?
Sometimes we know where an object came from. For example, we may have seen it get allocated in the trace. If we saw an object’s allocation, we know that it does not alias (for example) any object that was passed in via a parameter. We can use this kind of information to our advantage.
For example, in the following made-up IR snippet:
trace(arg0):
  v0 = malloc(8)
  v1 = malloc(16)
  ...
We know that (among other facts) v0 doesn’t alias arg0 or v1 because we
have seen its allocation site.
I saw this in the old V8 IR Hydrogen’s lightweight alias analysis¹:
enum HAliasing {
  kMustAlias,
  kMayAlias,
  kNoAlias
};

HAliasing Query(HValue* a, HValue* b) {
  // The same SSA value always references the same object.
  if (a == b) return kMustAlias;
  if (a->IsAllocate() || a->IsInnerAllocatedObject()) {
    // Two non-identical allocations can never be aliases.
    if (b->IsAllocate()) return kNoAlias;
    if (b->IsInnerAllocatedObject()) return kNoAlias;
    // An allocation can never alias a parameter or a constant.
    if (b->IsParameter()) return kNoAlias;
    if (b->IsConstant()) return kNoAlias;
  }
  if (b->IsAllocate() || b->IsInnerAllocatedObject()) {
    // An allocation can never alias a parameter or a constant.
    if (a->IsParameter()) return kNoAlias;
    if (a->IsConstant()) return kNoAlias;
  }
  // Constant objects can be distinguished statically.
  if (a->IsConstant() && b->IsConstant()) {
    return a->Equals(b) ? kMustAlias : kNoAlias;
  }
  return kMayAlias;
}
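We could fold the same kind of provenance reasoning into our may_alias, in front of the type check. Here is a sketch, assuming the Operation and Constant classes from earlier posts in the series and that allocations show up as malloc instructions; none of this is in the toy optimizer as written:

def is_allocation(v: Value) -> bool:
    # In a trace, a fresh allocation is one we saw happen: a malloc.
    return isinstance(v, Operation) and v.name == "malloc"

def from_outside(v: Value) -> bool:
    # Parameters and constants existed before the trace started.
    return isinstance(v, Constant) or (
        isinstance(v, Operation) and v.name == "getarg"
    )

def must_not_alias(a: Value, b: Value) -> bool:
    # Two distinct allocations never alias, and an allocation never
    # aliases anything that flowed in from outside the trace.
    if is_allocation(a):
        return (a is not b and is_allocation(b)) or from_outside(b)
    return False

def may_alias(left: Value, right: Value) -> bool:
    if must_not_alias(left, right) or must_not_alias(right, left):
        return False
    return (left.info or Any).range.overlaps((right.info or Any).range)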
There is plenty of other useful information, such as branching on pointer equality:

if (a == b) { ... } else { ... }

In the true branch, a and b are known to be the same object; in the false branch, they are known to be different objects. If you have other fun ones, please write in.
We only handle loads and stores in our optimizer. Unfortunately, this means we may accidentally cache stale information. Consider: what happens if a function call (or any other opaque instruction) writes into an object we are tracking?
The conservative approach is to invalidate all cached information on a function call. This is definitely correct, but it’s a bummer for the optimizer. Can we do anything?
Well, perhaps we are calling a well-known function or a specific IR instruction. In that case, we can annotate it with effects in the same abstract heap model: if the instruction does not write, or only writes to some heaps, we can at least only partially invalidate our heap.
known_builtin_functions = {
    "Array_length": Effects(reads=Array, writes=Empty()),
    "Object_setShape": Effects(reads=Empty(), writes=Object),
    "String_setEncoding": Effects(reads=Empty(), writes=String),
}
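Effects and Empty() don’t exist yet, so here is a minimal sketch of what they might look like on top of our AbstractHeap, plus the partial invalidation a call with known effects would trigger. Effects, Empty, and invalidate_for_call are names made up for this post:

from dataclasses import dataclass

def Empty() -> AbstractHeap:
    # A heap with an empty range overlaps nothing: no reads, no writes.
    heap = AbstractHeap("Empty")
    heap.range = HeapRange(0, 0)
    return heap

@dataclass
class Effects:
    reads: AbstractHeap
    writes: AbstractHeap

def invalidate_for_call(compile_time_heap, effects):
    # Keep only cached facts about objects whose heap the callee
    # cannot write to.
    return {
        (obj, offset): value
        for (obj, offset), value in compile_time_heap.items()
        if not effects.writes.range.overlaps((obj.info or Any).range)
    }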
However, if the function is unknown or otherwise opaque, we need at least more advanced alias information and perhaps even (partial) escape analysis.
Consider: even if an instruction takes no operands, we have no idea what state it has access to. If it writes to any object A, we cannot safely keep cached information about any other object B unless we know for sure that A and B do not alias. And we don’t know what the instruction writes to. So the only way we might know it is safe to keep information about B is if B was allocated locally and has not escaped.
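Here is a minimal sketch of that idea in our single-pass setting. It assumes we keep a set of objects allocated in the trace that have not yet escaped, and that Operation exposes its argument list; neither exists in the toy optimizer as written:

def handle_unknown_call(op, compile_time_heap, non_escaped):
    # Every operand of an opaque call escapes: the callee may stash a
    # pointer to it anywhere.
    for arg in op.args:
        non_escaped.discard(arg)
    # Keep cached facts only about objects the callee provably cannot
    # reach: the ones we allocated that have not escaped.
    return {
        (obj, offset): value
        for (obj, offset), value in compile_time_heap.items()
        if obj in non_escaped
    }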
Some runtimes such as ART pre-compute all of their alias information in a bit matrix. This makes more sense if you are using alias information in a full control-flow graph, where you might need to iterate over the graph a few times. In a trace context, you can do a lot in one single pass—no need to make a matrix.
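For comparison, a precomputed version of our may_alias might look like the following dense pairwise table. This is a sketch of the idea only, not ART’s actual data structure:

def precompute_may_alias(values):
    # matrix[i][j] is True when values[i] and values[j] may alias.
    # A production version would pack each row into a bitset.
    return [[may_alias(a, b) for b in values] for a in values]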
As usual, this is a toy IR and a toy optimizer, so it’s hard to say how much faster it makes its toy programs.
In general, though, there is a dial for analysis and optimization that trades precision against speed. This is a happy point on that dial: only a tiny incremental analysis cost over offset-only invalidation, in exchange for higher precision. I like that tradeoff.
It is also very useful in JIT compilers, where the managed language is generally a little better-behaved than a C-like language. Somewhere in your IR there will be a lot of duplicate loads and stores left over from a strength reduction pass, and this can clean up the mess.
Thanks for joining me as I worked through a small use of type-based alias analysis. I hope you enjoyed it.
Thank you to Chris Gregory for helpful feedback.
¹ I made a fork of V8 to go spelunking around the Hydrogen IR. I reset the V8 repo to the last commit before they deleted Hydrogen in favor of their new sea-of-nodes IR, TurboFan.