Optimizing compilers like to keep track of each IR instruction’s effects. An instruction’s effects vary wildly: from having no effects at all, to writing a specific variable, to being completely unknown (writing all state).
This post can be thought of as a continuation of What I talk about when I talk about IRs, specifically the section talking about asking the right questions. When we talk about effects, we should ask the right questions: not what opcode is this? but instead what effects does this opcode have?
Different compilers represent and track these effects differently. I’ve been thinking about how to represent these effects all year, so I have been doing some reading. In this post I will give some summaries of the landscape of approaches. Please feel free to suggest more.
Internal IR effect tracking is similar to the programming language notion of algebraic effects in type systems, but internally, compilers keep track of finer-grained effects. Effects such as “writes to a local variable”, “writes to a list”, or “reads from the stack” indicate what instructions can be re-ordered, duplicated, or removed entirely.
For example, consider the following pseudocode for some made-up language that stands in for a snippet of compiler IR:
# ...
v = some_var[0]
another_var[0] = 5
# ...
The goal of effects is to communicate to the compiler if, for example, these two IR
instructions can be re-ordered. The second instruction might write to a
location that the first one reads. But it also might not! This is about knowing
if some_var and another_var alias—if they are different names that
refer to the same object.
We can sometimes answer that question directly, but often it’s cheaper to
compute an approximate answer: could they even alias? It’s possible that
some_var and another_var have different types, meaning that (as long as you
have strict aliasing) the Load and Store operations that implement these
reads and writes by definition touch different locations. And if they look
at disjoint locations, there need not be any explicit order enforced.
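To make that concrete, here is a minimal Rust sketch of this kind of type-based may-alias check. The TypeClass names and the could_alias function are invented for illustration, not taken from any particular compiler:

// Hypothetical coarse type classes for the variables in the snippet.
#[derive(PartialEq, Clone, Copy)]
enum TypeClass {
    Tuple,
    List,
    Unknown,
}

// Conservative check: two names could alias only if their type
// classes might be the same. Unknown has to be assumed to match
// anything, so the answer degrades gracefully to "maybe".
fn could_alias(a: TypeClass, b: TypeClass) -> bool {
    a == TypeClass::Unknown || b == TypeClass::Unknown || a == b
}

fn main() {
    // If some_var is a tuple and another_var is a list, they are
    // provably disjoint and the load and store can be re-ordered.
    assert!(!could_alias(TypeClass::Tuple, TypeClass::List));
    // Without type information we must assume they might alias.
    assert!(could_alias(TypeClass::Unknown, TypeClass::List));
}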
Different compilers keep track of this information differently. The null effect analysis gives up and says “every instruction is maximally effectful” and therefore “we can’t re-order or delete any instructions”. That’s probably fine for a first stab at a compiler, where you will get a big speed up purely based on strength reductions. Over-approximations of effects should always be valid.
But at some point you start wanting to do dead code elimination (DCE), or common subexpression elimination (CSE), or load/store elimination, or move instructions around, and you start wondering how to represent effects. That’s where I am right now. So here’s a catalog of different compilers I have looked at recently.
There are two main ways I have seen to represent effects: bitsets and heap range lists. We’ll look at one example compiler for each, talk a bit about tradeoffs, then give a bunch of references to other major compilers.
We’ll start with Cinder, a Python JIT, because that’s what I used to work on.
Cinder tracks heap effects for its high-level IR (HIR) in
instr_effects.h. Pretty much everything happens in
the memoryEffects(const Instr& instr) function, which is expected to know
everything about what effects the given instruction might have.
The data representation is a bitset lattice called AliasClass, which is defined in alias_class.h. Each
bit in the bitset represents a distinct location in the heap: reads from and
writes to each of these locations are guaranteed not to affect any of the other
locations.
Here is the X-macro that defines it:
#define HIR_BASIC_ACLS(X) \
  X(ArrayItem) \
  X(CellItem) \
  X(DictItem) \
  X(FuncArgs) \
  X(FuncAttr) \
  X(Global) \
  X(InObjectAttr) \
  X(ListItem) \
  X(Other) \
  X(TupleItem) \
  X(TypeAttrCache) \
  X(TypeMethodCache)

enum BitIndexes {
#define ACLS(name) k##name##Bit,
  HIR_BASIC_ACLS(ACLS)
#undef ACLS
};
Note that each bit implicitly represents a set: ListItem does not refer to a
specific list index, but the infinite set of all possible list indices. It’s
any list index. Still, every list index is completely disjoint from, say, every
entry in a global variable table.
(And, to be clear, an object in a list might be the same as an object in a global variable table. The objects themselves can alias. But the thing being written to or read from, the thing being side effected, is the container.)
Like other bitset lattices, it’s possible to union the sets by or-ing the bits. It’s possible to query for overlap by and-ing the bits.
class AliasClass {
  // The union of two AliasClass
  AliasClass operator|(AliasClass other) const {
    return AliasClass{bits_ | other.bits_};
  }

  // The intersection (overlap) of two AliasClass
  AliasClass operator&(AliasClass other) const {
    return AliasClass{bits_ & other.bits_};
  }
};
If this sounds familiar, it’s because (as the repo notes) it’s a similar idea to Cinder’s type lattice representation.
Like other lattices, there is both a bottom element (no effects) and a top element (all possible effects):
#define HIR_OR_BITS(name) | k##name

#define HIR_UNION_ACLS(X) \
  /* Bottom union */ \
  X(Empty, 0) \
  /* Top union */ \
  X(Any, 0 HIR_BASIC_ACLS(HIR_OR_BITS)) \
  /* Memory locations accessible by managed code */ \
  X(ManagedHeapAny, kAny & ~kFuncArgs)
Union operations naturally hit a fixpoint at Any and intersection operations
naturally hit a fixpoint at Empty.
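To see those fixpoints concretely, here is a tiny Rust sketch of the same lattice operations (the location constants are invented stand-ins for the X-macro output):

// A miniature bitset alias lattice in the style of Cinder's AliasClass.
const LIST_ITEM: u64 = 1 << 0;
const TUPLE_ITEM: u64 = 1 << 1;
const GLOBAL: u64 = 1 << 2;

const EMPTY: u64 = 0; // bottom: no effects
const ANY: u64 = LIST_ITEM | TUPLE_ITEM | GLOBAL; // top: all effects

fn main() {
    // Union saturates at the top element...
    assert_eq!(ANY | LIST_ITEM, ANY);
    // ...and intersection saturates at the bottom element.
    assert_eq!(EMPTY & GLOBAL, EMPTY);
    // Disjoint heaps have an empty overlap.
    assert_eq!(LIST_ITEM & TUPLE_ITEM, EMPTY);
}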
All of this together lets the optimizer ask and answer questions such as:

- does this instruction write to any memory at all?
- could this instruction’s writes overlap with that instruction’s reads?
- can these two instructions be safely re-ordered?

and more.
Let’s take a look at an (imaginary) IR version of the code snippet in the intro and see what analyzing it might look like in the optimizer. Here is the fake IR:
v0: Tuple = ...
v1: List = ...
v2: Int[5] = ...
# v = some_var[0]
v3: Object = LoadTupleItem v0, 0
# another_var[0] = 5
StoreListItem v1, 0, v2
You can imagine that LoadTupleItem declares that it reads from the
TupleItem heap and StoreListItem declares that it writes to the ListItem
heap. Because tuple and list pointers cannot be cast into one another and
therefore cannot alias, these are
disjoint heaps in our bitset. Therefore ListItem & TupleItem == 0, therefore
these memory operations can never interfere! They can (for example) be
re-ordered arbitrarily.
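As a hedged sketch of how an optimizer might consume these summaries, something like the following could decide reorderability. The MemoryEffects shape and can_reorder here are my invention, loosely modeled on Cinder’s memoryEffects rather than its actual API:

// Invented per-instruction effect summary over the alias bitset.
struct MemoryEffects {
    may_read: u64,
    may_store: u64,
}

// Two instructions can be freely reordered if neither one's writes
// overlap the other's reads or writes.
fn can_reorder(a: &MemoryEffects, b: &MemoryEffects) -> bool {
    let a_writes_b = a.may_store & (b.may_read | b.may_store);
    let b_writes_a = b.may_store & (a.may_read | a.may_store);
    a_writes_b == 0 && b_writes_a == 0
}

fn main() {
    const TUPLE_ITEM: u64 = 1 << 0;
    const LIST_ITEM: u64 = 1 << 1;
    // LoadTupleItem reads the tuple heap; StoreListItem writes the
    // list heap. Disjoint bits, so reordering is allowed.
    let load = MemoryEffects { may_read: TUPLE_ITEM, may_store: 0 };
    let store = MemoryEffects { may_read: 0, may_store: LIST_ITEM };
    assert!(can_reorder(&load, &store));
}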
In Cinder, these memory effects could in the future be used for instruction re-ordering, but they are today mostly used in two places: the refcount insertion pass and DCE.
DCE involves first finding the set of instructions that need to be kept around
because they are useful/important/have effects. So here is what the Cinder DCE
isUseful looks like:
bool isUseful(Instr& instr) {
  return instr.IsTerminator() || instr.IsSnapshot() ||
      (instr.asDeoptBase() != nullptr && !instr.IsPrimitiveBox()) ||
      (!instr.IsPhi() && memoryEffects(instr).may_store != AEmpty);
}
There are some other checks in there but memoryEffects is right there at the
core of it!
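For flavor, here is a hedged Rust sketch of the mark phase that this kind of DCE typically builds on top of isUseful. The Instr shape is invented, and ids are assumed to be dense indices into the instruction list:

use std::collections::HashSet;

// Invented instruction shape: an id, the ids of the instructions it
// uses, and a precomputed isUseful-style flag.
struct Instr {
    id: usize,
    operands: Vec<usize>,
    is_useful: bool,
}

// Mark phase: start from the useful roots and transitively mark
// everything they depend on. Unmarked instructions are dead.
fn mark_live(instrs: &[Instr]) -> HashSet<usize> {
    let mut live = HashSet::new();
    let mut worklist: Vec<usize> =
        instrs.iter().filter(|i| i.is_useful).map(|i| i.id).collect();
    while let Some(id) = worklist.pop() {
        if live.insert(id) {
            // A live instruction keeps its operands alive too.
            worklist.extend(&instrs[id].operands);
        }
    }
    live
}

fn main() {
    let instrs = vec![
        Instr { id: 0, operands: vec![], is_useful: false },  // pure value
        Instr { id: 1, operands: vec![0], is_useful: true },  // a store
        Instr { id: 2, operands: vec![], is_useful: false },  // dead
    ];
    let live = mark_live(&instrs);
    assert!(live.contains(&0) && live.contains(&1) && !live.contains(&2));
}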
Now that we have seen the bitset representation of effects and an implementation in Cinder, let’s take a look at a different representation and an implementation in JavaScriptCore.
I keep coming back to How I implement SSA form by Fil Pizlo, one of the significant contributors to JavaScriptCore (JSC). In particular, I keep coming back to the Uniform Effect Representation section. This notion of “abstract heaps” felt very… well, abstract. Somehow more abstract than the bitset representation. The pre-order and post-order integer pair as a way to represent nested heap effects just did not click.
It didn’t make any sense until I actually went spelunking in JavaScriptCore and found one of several implementations—because, you know, JSC is six compilers in a trenchcoat[citation needed].
DFG, B3, DOMJIT, and probably others all have their own abstract heap implementations. We’ll look at DOMJIT mostly because it’s a smaller example and also illustrates something else that’s interesting: builtins. We’ll come back to builtins in a minute.
Let’s take a look at how DOMJIT structures its abstract heaps: a YAML file.
DOM:
  Tree:
    Node:
      - Node_firstChild
      - Node_lastChild
      - Node_parentNode
      - Node_nextSibling
      - Node_previousSibling
      - Node_ownerDocument
    Document:
      - Document_documentElement
      - Document_body
It’s a hierarchy. Node_firstChild is a subheap of Node is a subheap of…
and so on. A write to any Node_nextSibling is a write to Node is a write to
… Sibling heaps are unrelated: Node_firstChild and Node_lastChild, for
example, are disjoint.
To get a feel for this, I wired up a simplified version of ZJIT’s bitset generator (for types!) to read a YAML document and generate a bitset. It generated the following Rust code:
mod bits {
    pub const Empty: u64 = 0u64;
    pub const Document_body: u64 = 1u64 << 0;
    pub const Document_documentElement: u64 = 1u64 << 1;
    pub const Document: u64 = Document_body | Document_documentElement;
    pub const Node_firstChild: u64 = 1u64 << 2;
    pub const Node_lastChild: u64 = 1u64 << 3;
    pub const Node_nextSibling: u64 = 1u64 << 4;
    pub const Node_ownerDocument: u64 = 1u64 << 5;
    pub const Node_parentNode: u64 = 1u64 << 6;
    pub const Node_previousSibling: u64 = 1u64 << 7;
    pub const Node: u64 = Node_firstChild | Node_lastChild | Node_nextSibling
        | Node_ownerDocument | Node_parentNode | Node_previousSibling;
    pub const Tree: u64 = Document | Node;
    pub const DOM: u64 = Tree;
    pub const NumTypeBits: u64 = 8;
}
It’s not a fancy X-macro, but it’s a short and flexible Ruby script.
Then I took the DOMJIT abstract heap generator—also funnily enough a short Ruby script—modified the output format slightly, and had it generate its int pairs:
mod bits {
    /* DOMJIT Abstract Heap Tree.
       DOM<0,8>:
           Tree<0,8>:
               Node<0,6>:
                   Node_firstChild<0,1>
                   Node_lastChild<1,2>
                   Node_parentNode<2,3>
                   Node_nextSibling<3,4>
                   Node_previousSibling<4,5>
                   Node_ownerDocument<5,6>
               Document<6,8>:
                   Document_documentElement<6,7>
                   Document_body<7,8>
    */
    pub const DOM: HeapRange = HeapRange { start: 0, end: 8 };
    pub const Tree: HeapRange = HeapRange { start: 0, end: 8 };
    pub const Node: HeapRange = HeapRange { start: 0, end: 6 };
    pub const Node_firstChild: HeapRange = HeapRange { start: 0, end: 1 };
    pub const Node_lastChild: HeapRange = HeapRange { start: 1, end: 2 };
    pub const Node_parentNode: HeapRange = HeapRange { start: 2, end: 3 };
    pub const Node_nextSibling: HeapRange = HeapRange { start: 3, end: 4 };
    pub const Node_previousSibling: HeapRange = HeapRange { start: 4, end: 5 };
    pub const Node_ownerDocument: HeapRange = HeapRange { start: 5, end: 6 };
    pub const Document: HeapRange = HeapRange { start: 6, end: 8 };
    pub const Document_documentElement: HeapRange = HeapRange { start: 6, end: 7 };
    pub const Document_body: HeapRange = HeapRange { start: 7, end: 8 };
}
It already comes with a little diagram, which is super helpful for readability.
An empty range represents empty heap effects: if the start and end are the
same number, there are no effects. There is no single Empty value, but any
empty range could be normalized to HeapRange { start: 0, end: 0 }.
Maybe this was obvious to you, dear reader, but this pre-order/post-order thing is about nested ranges! Seeing the output of the generator laid out clearly like this made it make a lot more sense for me.
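To convince yourself, here is a small self-contained Rust sketch using the generated constants above: because a subheap’s range nests inside its parent’s, the subheap check is just interval containment. The contains method is my addition, not DOMJIT’s:

#[derive(Clone, Copy)]
struct HeapRange { start: u32, end: u32 }

impl HeapRange {
    // A subheap's pre/post-order range nests inside its parent's.
    fn contains(&self, other: &HeapRange) -> bool {
        self.start <= other.start && other.end <= self.end
    }
}

fn main() {
    let node = HeapRange { start: 0, end: 6 };
    let first_child = HeapRange { start: 0, end: 1 };
    let document = HeapRange { start: 6, end: 8 };
    // Node_firstChild is a subheap of Node...
    assert!(node.contains(&first_child));
    // ...while Document occupies a disjoint part of the numbering.
    assert!(!node.contains(&document));
}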
What about checking overlap? Here is the implementation in JSC:
namespace WTF {

// Check if two ranges overlap assuming that neither range is empty.
template<typename T>
constexpr bool nonEmptyRangesOverlap(T leftMin, T leftMax, T rightMin, T rightMax)
{
    ASSERT_UNDER_CONSTEXPR_CONTEXT(leftMin < leftMax);
    ASSERT_UNDER_CONSTEXPR_CONTEXT(rightMin < rightMax);
    return leftMax > rightMin && rightMax > leftMin;
}

// Pass ranges with the min being inclusive and the max being exclusive.
template<typename T>
constexpr bool rangesOverlap(T leftMin, T leftMax, T rightMin, T rightMax)
{
    ASSERT_UNDER_CONSTEXPR_CONTEXT(leftMin <= leftMax);
    ASSERT_UNDER_CONSTEXPR_CONTEXT(rightMin <= rightMax);

    // Empty ranges interfere with nothing.
    if (leftMin == leftMax)
        return false;
    if (rightMin == rightMax)
        return false;

    return nonEmptyRangesOverlap(leftMin, leftMax, rightMin, rightMax);
}

} // namespace WTF
class HeapRange {
    bool overlaps(const HeapRange& other) const {
        return WTF::rangesOverlap(m_begin, m_end, other.m_begin, other.m_end);
    }
};
(See also How to check for overlapping intervals and Range overlap in two compares for more fun.)
While bitsets are a dense representation (you have to hold every bit), they are compact and precise: a 64- or 128-bit bitset holds any combination of its locations and fits in a machine register or two. The union and intersection operations are very cheap.
With int ranges, it’s a little more complicated. An imprecise union of a and
b can take the maximal range that covers both a and b. To get a more
precise union, you have to keep track of both. In the worst case, if you want
efficient arbitrary queries, you need to store your int ranges in an interval
tree. So what gives?
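As an example of the imprecision, here is a sketch of the cheap single-range union (imprecise_union is invented). Merging Node_firstChild<0,1> with Node_ownerDocument<5,6> widens the result to <0,6>, i.e. all of Node:

#[derive(Clone, Copy)]
struct HeapRange { start: u32, end: u32 }

// The cheap, imprecise union: the smallest single range covering both
// inputs. Always a safe over-approximation, but it swallows unrelated
// sibling heaps that sit between the two inputs.
fn imprecise_union(a: HeapRange, b: HeapRange) -> HeapRange {
    HeapRange {
        start: a.start.min(b.start),
        end: a.end.max(b.end),
    }
}

fn main() {
    let first_child = HeapRange { start: 0, end: 1 };    // Node_firstChild<0,1>
    let owner_document = HeapRange { start: 5, end: 6 }; // Node_ownerDocument<5,6>
    let u = imprecise_union(first_child, owner_document);
    assert_eq!((u.start, u.end), (0, 6)); // widened to all of Node<0,6>
}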
I asked Fil: if both bitsets and int ranges answer the same question, why use int ranges? He said that it’s more flexible long-term: bitsets get expensive as soon as you need over 128 bits (you might need to heap allocate them!) whereas ranges have no such ceiling. But doesn’t holding sequences of ranges require heap allocation? Well, despite Fil writing this in his SSA post:
The purpose of the effect representation baked into the IR is to provide a precise always-available baseline for alias information that is super easy to work with. […] you can have instructions report that they read/write multiple heaps […] you can have a utility function that produces such lists on demand.
It’s important to note that this doesn’t actually involve any allocation of lists. JSC does this very clever thing where they have “functors” that they pass in as arguments that compress/summarize what they want out of an instruction’s effects.
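In Rust terms, I imagine the trick looks something like this: pass closures instead of materializing a list of effects. All the names here are invented, not JSC’s API:

#[derive(Clone, Copy)]
enum Opcode { LoadTupleItem, StoreListItem }

const LIST_ITEM: u64 = 1 << 1;

// Instead of returning a heap-allocated list of effects, call a
// caller-supplied closure once per written heap. Callers that only
// want a summary (e.g. "does any write overlap?") never allocate.
fn for_each_write(op: Opcode, mut write: impl FnMut(u64)) {
    match op {
        Opcode::LoadTupleItem => {} // reads only; no writes to report
        Opcode::StoreListItem => write(LIST_ITEM),
    }
}

fn writes_overlap(op: Opcode, heap: u64) -> bool {
    let mut result = false;
    for_each_write(op, |w| result |= (w & heap) != 0);
    result
}

fn main() {
    assert!(writes_overlap(Opcode::StoreListItem, LIST_ITEM));
    assert!(!writes_overlap(Opcode::LoadTupleItem, LIST_ITEM));
}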
Let’s take a look at how the DFG (for example) uses these heap ranges in analysis. The DFG is structured in such a way that it can make use of the DOMJIT heap ranges directly, which is neat.
Note that AbstractHeap in the example below is a thin wrapper over the DFG
compiler’s own DOMJIT::HeapRange equivalent:
class AbstractHeapOverlaps {
public:
    AbstractHeapOverlaps(AbstractHeap heap)
        : m_heap(heap)
        , m_result(false)
    {
    }

    void operator()(AbstractHeap otherHeap) const
    {
        if (m_result)
            return;
        m_result = m_heap.overlaps(otherHeap);
    }

    bool result() const { return m_result; }

private:
    AbstractHeap m_heap;
    mutable bool m_result;
};

bool writesOverlap(Graph& graph, Node* node, AbstractHeap heap)
{
    NoOpClobberize noOp;
    AbstractHeapOverlaps addWrite(heap);
    clobberize(graph, node, noOp, addWrite, noOp);
    return addWrite.result();
}
clobberize is the function that calls these functors (noOp or addWrite in
this case) for each effect that the given IR instruction node declares.
I’ve pulled some relevant snippets of clobberize, which is quite long, that I
think are interesting.
First, some instructions (constants, here) have no effects. There’s some
utility in the def(PureValue(...)) call, but I didn’t fully understand it.
Then there are some instructions that conditionally have effects depending on the use types of their operands.1 Taking the absolute value of an Int32 or a Double is effect-free but otherwise looks like it can run arbitrary code.
Some run-time IR guards that might cause side exits are annotated as
such—they write to the SideState heap.
Local variable instructions read specific heaps indexed by what looks like the local’s index (though I’m not sure). This means accesses to two different locals won’t alias!
It looks like instructions that allocate can’t be re-ordered with one another: they both read
and write the HeapObjectCount. This probably limits the amount of allocation
sinking that can be done.
Then there’s CallDOM, which is the builtins stuff I was talking about. We’ll
come back to that after the code block.
template<typename ReadFunctor, typename WriteFunctor, typename DefFunctor, typename ClobberTopFunctor>
void clobberize(Graph& graph, Node* node, const ReadFunctor& read, const WriteFunctor& write, const DefFunctor& def)
{
    // ...
    switch (node->op()) {
    case JSConstant:
    case DoubleConstant:
    case Int52Constant:
        def(PureValue(node, node->constant()));
        return;

    case ArithAbs:
        if (node->child1().useKind() == Int32Use || node->child1().useKind() == DoubleRepUse)
            def(PureValue(node, node->arithMode()));
        else
            clobberTop();
        return;

    case AssertInBounds:
    case AssertNotEmpty:
        write(SideState);
        return;

    case GetLocal:
        read(AbstractHeap(Stack, node->operand()));
        def(HeapLocation(StackLoc, AbstractHeap(Stack, node->operand())), LazyNode(node));
        return;

    case NewArrayWithSize:
    case NewArrayWithSizeAndStructure:
        read(HeapObjectCount);
        write(HeapObjectCount);
        return;

    case CallDOM: {
        const DOMJIT::Signature* signature = node->signature();
        DOMJIT::Effect effect = signature->effect;
        if (effect.reads) {
            if (effect.reads == DOMJIT::HeapRange::top())
                read(World);
            else
                read(AbstractHeap(DOMState, effect.reads.rawRepresentation()));
        }
        if (effect.writes) {
            if (effect.writes == DOMJIT::HeapRange::top()) {
                if (Options::validateDFGClobberize())
                    clobberTopFunctor();
                write(Heap);
            } else
                write(AbstractHeap(DOMState, effect.writes.rawRepresentation()));
        }
        ASSERT_WITH_MESSAGE(effect.def == DOMJIT::HeapRange::top(), "Currently, we do not accept any def for CallDOM.");
        return;
    }
    }
}
(Remember that these AbstractHeap operations are very similar to DOMJIT’s
HeapRange with a couple more details—and in some cases even contain DOMJIT
HeapRanges!)
This CallDOM node is the way for the DOM APIs in the browser—a significant
chunk of the builtins, which are written in C++—to communicate what they do
to the optimizing compiler. Without any annotations, the JIT has to assume that
a call into C++ could do anything to the JIT state. Bummer!
But because, for example, Node.firstChild annotates what
memory it reads from and what it doesn’t write to,
the JIT can optimize around it better—or even remove the access completely.
It means the JIT can reason about calls to known builtins the same way that
it reasons about normal JIT opcodes.
(Incidentally it looks like it doesn’t even make a C call, but instead is inlined as a little memory read snippet using a JIT builder API. Neat.)
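As a hedged sketch of the general idea (types and names invented, not WebKit’s), a builtin’s effect annotation can be as small as a pair of ranges:

#[derive(Clone, Copy)]
struct HeapRange { start: u32, end: u32 }

// An invented effect signature for a builtin: which abstract heaps a
// native function reads and writes.
struct Effect {
    reads: HeapRange,
    writes: HeapRange,
}

// Something like Node.firstChild only reads the Node_firstChild heap
// and writes nothing, so the JIT may CSE or even eliminate the call.
const NODE_FIRST_CHILD: Effect = Effect {
    reads: HeapRange { start: 0, end: 1 },  // Node_firstChild<0,1>
    writes: HeapRange { start: 0, end: 0 }, // empty: writes nothing
};

fn main() {
    // An empty write range means the call can't clobber anything.
    assert_eq!(NODE_FIRST_CHILD.writes.start, NODE_FIRST_CHILD.writes.end);
}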
Last, we’ll look at Simple, which has a slightly different take on all of this.
Simple is Cliff Click’s pet Sea of Nodes (SoN) project to try and showcase the idea to the world—outside of a HotSpot C2 context.
This one is a little harder for me to understand but it looks like each
translation unit has a StartNode that doles out
a separate memory node for each alias class. Each IR node then takes
data dependencies on whatever effect nodes it might use.
Alias classes are split up based on the paper Type-Based Alias Analysis (PDF): “Our approach is a form of TBAA similar to the ‘FieldTypeDecl’ algorithm described in the paper.”
The Simple project is structured into sequential implementation stages and alias classes come into the picture in Chapter 10.
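Here is my mental model as a hedged Rust sketch; everything is invented, inspired by the description above rather than Simple’s actual code. The key point is that the “current memory” is tracked per alias class, so an effectful node only orders against operations in its own class:

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum AliasClass { ListItem, TupleItem }

// Hypothetical builder tracking one "current memory" SSA value per
// alias class, in the spirit of a StartNode handing out a memory
// node per class.
struct Builder {
    next_id: usize,
    current_mem: HashMap<AliasClass, usize>,
}

impl Builder {
    fn new() -> Self {
        // Pretend the Start node (id 0) defines all initial memory.
        let mut current_mem = HashMap::new();
        current_mem.insert(AliasClass::ListItem, 0);
        current_mem.insert(AliasClass::TupleItem, 0);
        Builder { next_id: 1, current_mem }
    }

    // A store depends on, and replaces, memory for its class only.
    fn store(&mut self, class: AliasClass) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        self.current_mem.insert(class, id);
        id
    }

    // A load depends on the current memory node for its class.
    fn load_dep(&self, class: AliasClass) -> usize {
        self.current_mem[&class]
    }
}

fn main() {
    let mut b = Builder::new();
    b.store(AliasClass::ListItem);
    // A tuple load still depends on Start's memory, not the list
    // store: the two operations are unordered relative to each other.
    assert_eq!(b.load_dep(AliasClass::TupleItem), 0);
}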
Because I spent a while spelunking through implementations to see how other projects do this, here is a list of the projects I looked at. Mostly, they use bitsets.
HHVM, a JIT for the Hack language, also uses a bitset for its memory effects. See for example: alias-class.h and memory-effects.h.
HHVM has a couple places that use this information, such as a definition-sinking pass, alias analysis, DCE, store elimination, refcount opts, and more.
If you are wondering why the HHVM representation looks similar to the Cinder representation, it’s because some former HHVM engineers such as Brett Simmers also worked on Cinder!
(Note that I am linking an ART fork on GitHub as a reference, but the upstream code is hosted on googlesource.)
Android’s ART Java runtime also
uses a bitset for its effect representation. It’s a very compact class called
SideEffects in nodes.h.
The side effects are used in loop-invariant code motion, global value numbering, write barrier elimination, scheduling, and more.
CoreCLR mostly uses a bitset for its SideEffectSet
class. This one is interesting though because it also splits out effects
specifically to include sets of local variables (LclVarSet).
V8 is also about six completely different compilers in a trenchcoat.
Turboshaft uses a struct in operations.h called
OpEffects which is two bitsets for reads/writes of effects. This is used in
value numbering as well as a bunch of
other small optimization passes they call “reducers”.
Maglev also has this thing called NodeT::kProperties in their IR
nodes that looks like a bitset and is used in their various
reducers. It has effect query methods on it such as can_eager_deopt and
can_write.
Until recently, V8 also used Sea of Nodes as its IR representation, which tracks side effects more explicitly in the structure of the IR itself.
Guile Scheme looks like it has a custom tagging scheme type thing.
Both bitsets and int ranges are perfectly cromulent ways of representing heap effects for your IR. The Sea of Nodes approach is also probably okay since it powers HotSpot C2 and (for a time) V8.
Remember to ask the right questions of your IR when doing analysis.
Thank you to Fil Pizlo for writing his initial GitHub Gist and sending me on this journey and thank you to Chris Gregory, Brett Simmers, and Ufuk Kayserilioglu for feedback on making some of the explanations more helpful.
This is because the DFG compiler does this interesting thing where they track and guard the input types on use vs having types attached to the input’s own def. It might be a clean way to handle shapes inside the type system while also allowing the type+shape of an object to change over time (which it can do in many dynamic language runtimes). ↩