I have a lot of thoughts about the design of compiler intermediate representations (IRs). In this post I’m going to try and communicate some of those ideas and why I think they are important.
The overarching idea is being able to make decisions with only local information.
That comes in a couple of different flavors. We’ll assume that we’re compiling a method at a time, instead of something more trace-like (tracing, tracelets, basic block versioning, etc).
A function will normally have some control-flow: if, while, for, any amount of jumping around within a function. Let’s look at an example function in a language with advanced control-flow constructs:
int sumto(int n) {
  int result = 0;
  for (result = 0; n >= 0; n -= 1) {
    result += n;
  }
  return result;
}
Most compilers will deconstruct this for loop, with its many nested expressions, into simple comparisons and jumps. In order to resolve jump targets in your compiler, you may have some notion of labels (in this case, words ending with a colon):
int sumto(int n) {
entry:
  int result = 0
header:
  if n < 0, jumpto end
  result = result + n
  n = n - 1
  jumpto header
end:
  return result
}
This looks kind of like a pseudo-assembly language. It has its high-level language features decomposed into many smaller instructions. It also has implicit fallthrough between labeled sections (for example, entry into header).
I mention these things because they, like the rest of the ideas in this post,
are points in an IR design space. Representing code this way is an explicit
choice, not a given. For example, one could make the jumps explicit by adding a
jumpto header at the end of entry.
As soon as we add that instruction, the code becomes position-independent: as long as we start with entry, the chunks of code between labels could be ordered any which way: they are addressed by name and have no implicit ordering requirements.
This may seem arbitrary, but it gives the optimizer more flexibility. If some optimization rule decides, for example, that a branch to block4 may rarely be taken, it can freely re-order it toward the end of the function (or even on a different page!) so that more hot code can be on a single cache line.
Explicit jumps and labels turn the code from a strictly linear assembly into a control-flow graph (CFG). Each sequence of code without internal control-flow is called a basic block and is a vertex in this graph. The directed edges represent jumps between blocks. See for example this crude GraphViz representation:
We’re actually kind of looking at extended basic blocks (EBBs), which allow for multiple control exits per block but only one control entry. A strict basic block representation of the above code would look, in text form, something like this:
int sumto(int n) {
entry:
  int result = 0
  condbranch n < 0, iftrue: end, iffalse: header
header:
  result = result + n
  n = n - 1
  condbranch n < 0, iftrue: end, iffalse: header
end:
  return result
}
Notice how each block has exactly one terminator (control-flow instruction), with (in this case) 0 or 2 targets.
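To make that concrete, here is a minimal sketch in Python (hypothetical names, not any particular compiler's API) of blocks whose successor edges fall straight out of their single terminator:

# A hypothetical encoding of the strict basic block form above: each block
# is a list of simple instructions plus exactly one terminator at the end.
blocks = {
    "entry":  {"body": ["int result = 0"],
               "term": ("condbranch", "n < 0", "end", "header")},
    "header": {"body": ["result = result + n", "n = n - 1"],
               "term": ("condbranch", "n < 0", "end", "header")},
    "end":    {"body": [],
               "term": ("return", "result")},
}

def successors(block):
    # The CFG edges come straight from the terminator: zero targets for a
    # return, two for a conditional branch.
    kind, *rest = block["term"]
    return [] if kind == "return" else rest[1:]

for label, block in blocks.items():
    print(label, "->", successors(block))
# entry -> ['end', 'header']
# header -> ['end', 'header']
# end -> []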
Opinions differ about the merits and issues of extended vs normal basic blocks. Most compilers I see use normal basic blocks. In either case, bringing the IR into a graph form gives us an advantage: thanks to Cousot and Cousot, our favorite power couple, we know how to do abstract interpretation on graphs and we can use this to build an advanced optimizer. See, for example, my intro to abstract interpretation post.
Some IRs are stack based. For concatenative languages or some newer JIT compilers, IRs are formatted in such a way that each opcode reads its operands from a stack and writes its outputs to a stack. This is reminiscent of a point-free coding style in languages such as Haskell or OCaml.
inc = (+ 1)
In this style, there is an implicit shared state: the stack. Dataflow is explicit (pushes and pops) and instructions can only be rearranged if the stack structure is preserved. This requires some non-local reasoning: to move an instruction, one must also rearrange the stack.
PUSH 1
PUSH 2
ADD
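To make the shared state explicit, here is a tiny hypothetical Python evaluator for this kind of stack IR (the opcode names are made up for illustration):

def eval_stack_ir(code):
    # Every opcode communicates through the single shared operand stack.
    stack = []
    for op, *args in code:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
    return stack

print(eval_stack_ir([("PUSH", 1), ("PUSH", 2), ("ADD",)]))  # [3]

Reordering the two pushes here happens to be harmless, but in general moving an instruction means reshuffling the stack around it.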
By contrast, in a register-based IR, things are more explicit. Instructions take named inputs (v0, v12, etc) and produce named outputs. Instructions can be slightly more easily moved around (modulo effects) as long as inputs remain defined. Local variables do not exist. The stack does not exist. Everything is IR “variables”.
v1 = CONST 1
v2 = CONST 2
v3 = ADD v1 v2
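As a rough illustration of the “inputs remain defined” constraint, here is a hypothetical Python check that a proposed instruction order never uses a name before its definition (effects are ignored entirely):

# Hypothetical register-style instructions: (dest, opcode, *operands).
instrs = [
    ("v1", "CONST", 1),
    ("v2", "CONST", 2),
    ("v3", "ADD", "v1", "v2"),
]

def defs_precede_uses(instrs):
    # True if every operand that names another instruction is defined
    # earlier in the list.
    defined = set()
    for dest, _op, *operands in instrs:
        if any(isinstance(o, str) and o not in defined for o in operands):
            return False
        defined.add(dest)
    return True

print(defs_precede_uses(instrs))                  # True
print(defs_precede_uses(list(reversed(instrs))))  # False: v3's inputs not yet defined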
The constraints (names being defined) are part of the IR. This gets a little bit tricky if it’s possible to define a name multiple times.
v1 = CONST 1
v2 = CONST 2
x = ADD v1 v2
x = ADD x v2
v3 = MUL x 12
What does x mean in the instruction for v3? Which definition does it refer to? In order to reason about the instruction v3 = MUL x 12, we have to keep around some context. This is non-trivial: requiring compiler writers to constantly truck around side-tables and update them as they do analysis is slow and error-prone. Fortunately, if we enforce some interesting rules, we can push that analysis work into one pass up-front…
Static single assignment (SSA) was introduced by a bunch of folks at IBM (see my blog
post about the different implementations). In SSA-based IRs, each
variable can only be defined once. Put another way, a variable is its
defining instruction; alternately, a variable and its defining instruction are
addressed by the same name. The previous example is not valid SSA; x has two definitions.
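A sketch of that single-definition rule as a quick structural check (hypothetical Python, reusing the (dest, opcode, operands) tuple shape from the earlier sketch):

def is_ssa(instrs):
    # Valid SSA: no destination name is defined more than once.
    seen = set()
    for dest, *_rest in instrs:
        if dest in seen:
            return False
        seen.add(dest)
    return True

print(is_ssa([
    ("v1", "CONST", 1),
    ("v2", "CONST", 2),
    ("x", "ADD", "v1", "v2"),
    ("x", "ADD", "x", "v2"),
]))  # False: x is defined twice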
If we turn the previous example into SSA, we can now use a different name for each instruction. This is related to the unique name assumption or the global names property: names do not depend on context.
v1 = CONST 1
v2 = CONST 2
x0 = ADD v1 v2
x1 = ADD x0 v2
Now we can identify each different ADD instruction by the variable it defines. This is useful in analysis…
I took a class with Olin Shivers and he described abstract interpretation as “automated theorem finding”. Unlike theorem provers such as Lean and Rocq, where you have to manually prove the properties you want, static analysis finds interesting facts that already exist in your program (and optimizers use them to make your program faster).
Your static analysis pass(es) can annotate your IR nodes with little bits of information such as:
- the type int
- the constant value 5
- a range between 2 and 5
- equivalence to another expression such as y*z
If your static analysis is over SSA, then generally the analysis is easier and (potentially) storing facts is easier. This is due to a property called sparseness. Where a static analysis over non-SSA programs has to store facts about all variables at all program points, an analysis over SSA need only store facts about all variables, independent of context.
I sometimes describe this as “pushing time through the IR” but I don’t know that that makes a ton of sense.
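A hedged sketch of the difference in Python (the program points and names are hypothetical): a dense analysis keys its facts by program point and variable, while a sparse analysis over SSA keys them by variable alone:

# Dense: one entry per (program point, variable); the table grows with
# the number of program points.
dense_facts = {
    ("@1", "x0"): "int",
    ("@2", "x0"): "int",
    ("@2", "x1"): "int",
}

# Sparse: each SSA name has exactly one definition, so a single fact per
# name is valid everywhere that name is in scope.
sparse_facts = {
    "x0": "int",
    "x1": "int",
}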
Potentially more subtle here is that we could represent the above IR snippet as a list of tuples, where instructions are related via some other table (say, a “variable environment”):
code = [
    ("v1", "CONST", 1),
    ("v2", "CONST", 2),
    ("x0", "ADD", "v1", "v2"),
    ("x1", "ADD", "x0", "v2"),
]
Instead, though, we could allocate an object for each instruction and let them refer to one another by pointer (or index, if using Rust or something). Then they directly refer to one another (no need for a side-table), which might be faster and more compact. We can re-create nice names as needed for printing. Then, when optimizing, we look up the type information of an operand by directly reading a field (operands[N]->type or similar).
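Here is a rough sketch of that object-per-instruction style in Python (a C or Rust compiler would use pointers or indices instead; the field names are made up):

from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str
    operands: list   # direct references to other Instr objects
    type: str = "any"

v1 = Instr("CONST", [], type="int")
v2 = Instr("CONST", [], type="int")
v3 = Instr("ADD", [v1, v2])

# No side table needed: an optimization reads the operand's type directly.
print(v3.operands[0].type)  # int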
Another thing to note: when you start adding type information to your IR, you’re going to start asking type information questions in your analysis. Questions such as “what type is this instruction?”, where “type” could span a semilattice, and even refer to a specific run-time object by its pointer. In that case, it’s important to ask the right questions. For example:
don’t ask “is this a Const instruction?”; instead ask “does this instruction have a constant type?” Const instructions are likely not the only opcodes that could produce specific objects; if you have an instruction like GuardIsObject, for example, that burns a specific expected pointer into the generated code, the type (and therefore the pointer) will come from the GuardIsObject instruction.
The big idea is that types represent a different slice of your IR than the opcodes and should be treated as such.
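As a sketch of asking the type question rather than the opcode question (hypothetical Python, continuing the Instr sketch above; GuardIsObject and the “const:” type encoding are made up for illustration):

def is_const_opcode(instr):
    # The narrow question: matches only the CONST opcode.
    return instr.opcode == "CONST"

def has_constant_type(instr):
    # The better question: any instruction whose type pins down a single
    # value counts, regardless of its opcode.
    return instr.type.startswith("const:")

# A guard that burns a specific expected pointer into the generated code:
guard = Instr("GuardIsObject", [v3], type="const:0x7f3a9c012340")
print(is_const_opcode(guard), has_constant_type(guard))  # False True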
Anyway, SSA only stores type information about instructions and does not encode information that we might later learn in the IR. With basic SSA, there’s not a good way to encode refinements…
Static single information (SSI) form gives us new ways to encode metadata about instructions (variables). It was introduced by C. Scott Ananian in 1999 in his MS thesis (PDF). (I also discussed it briefly in the Scrapscript IR post.) Consider the following SSA program (represented as pseudo-Python):
# @0
x = int(...)
# @1
if x >= 0:
    # @2
    return x
else:
    # @4
    result = -x
    # @5
    return result
x is undefined at @0. x is defined and an integer at @1. But then we do something interesting: we split control flow based on the run-time value of x. We can use this split to attach new and interesting information to x. For a non-sparse analysis, we can record some fact on the side. That’s fine.
# @0
x = int(...)
# @1
if x >= 0:
    # @2
    LearnFact(x, nonnegative)
    # @3
    return x
else:
    # @4
    LearnFact(x, negative)
    # @5
    result = -x
    # @6
    return result
When doing a dataflow analysis, we can keep track of the fact that at @3, x is nonnegative, and at @5, x is negative. This is neat: we can then determine that all paths through this function return a nonnegative integer.
Importantly, LearnFact(x, nonnegative) does not override the existing known type of x. Instead, it is a refinement: a set intersection. A lattice meet. (Picture a Venn diagram with two overlapping circles, Integer and Nonnegative; the refined type is the middle bit where they overlap.)
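A small sketch of that refinement in Python, modelling each type as a (toy, finite) set of allowed values so that the meet is literally set intersection:

# Toy stand-ins for abstract types as sets of allowed values.
INT = frozenset(range(-3, 4))
NONNEGATIVE = frozenset(v for v in INT if v >= 0)
NEGATIVE = frozenset(v for v in INT if v < 0)

def meet(a, b):
    # Lattice meet on this toy domain: keep only values both types allow.
    return a & b

# LearnFact(x, nonnegative) refines x's existing type instead of replacing it:
print(sorted(meet(INT, NONNEGATIVE)))  # [0, 1, 2, 3]
print(sorted(meet(INT, NEGATIVE)))     # [-3, -2, -1]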
If we want to keep our information sparse, though, we have to add a new definition to the IR.
# @0
x = int(...)
# @1
if x >= 0:
    # @2
    newx0 = LearnFact(x, nonnegative)
    # @3
    return newx0
else:
    # @4
    newx1 = LearnFact(x, negative)
    # @5
    result = -newx1
    # @6
    return result
This is complicated (choosing which variables to split, replacing all uses to maintain SSA, etc) but it gives us new places to store information inside the IR. It means that every time we refer to newx0, we know that it is nonnegative, and every time we refer to newx1, we know that it is negative. This information is independent of context!
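A very simplified sketch of those mechanics in Python (hypothetical; a real pass would consult dominator information instead of only renaming within a single successor block):

def split_at_branch(block, var, fact, new_name):
    # Insert a LearnFact definition at the top of a successor block and
    # rewrite that block's uses of `var` to the refined name.
    learn = (new_name, "LearnFact", var, fact)
    renamed = [
        tuple(new_name if part == var else part for part in instr)
        for instr in block
    ]
    return [learn] + renamed

else_block = [("result", "NEG", "x"), (None, "RETURN", "result")]
print(split_at_branch(else_block, "x", "negative", "newx1"))
# [('newx1', 'LearnFact', 'x', 'negative'),
#  ('result', 'NEG', 'newx1'),
#  (None, 'RETURN', 'result')]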
I should note that you can get a lot of the benefits of SSI without going “full SSI”. There is no need to split every variable at every branch, nor add a special new merge instruction.
Okay, so we can encode a lot of information very sparsely in the IR. That’s neat. It’s powerful. But we should also be mindful that even in this very sparse representation, we are encoding information implicitly that we may not want to: execution order.
In a traditional CFG representation, the instructions are already scheduled, or ordered. Normally this comes from the programmer in the original source form and is faithfully maintained. We get data use edges in an IR like SSA, but the control information is left implicit. Some forms of IR, however, seek to reify both data and control dependencies into the IR itself. One such IR design is sea of nodes (SoN), which was originally designed by Cliff Click during his PhD.
In sea of nodes, every instruction gets its own vertex in the graph. Instructions have use edges to their operands, which can be either data or some other ordering property (control, effects, etc). The main idea is that IR nodes are by default unordered and are only ordered later, after effect analysis has removed a bunch of use edges.
Per Vognsen also notes that there is another motivating example of sea of nodes: in the previous SSI example, the LearnFact(x, nonnegative) cannot be validly hoisted above the x >= 0 check. In a “normal” IR, this is implicit in the ordering. In a sea of nodes world, this is explicitly marked with an edge from the LearnFact to the if x >= 0. I think Graal, for example, calls these LearnFact nodes “Pi nodes”.
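A rough sketch of that shape in Python (hypothetical node and edge names): each node records the data, control, and effect edges it depends on, and nothing else pins down its position:

from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    data: list = field(default_factory=list)     # value inputs
    control: list = field(default_factory=list)  # control dependence
    effects: list = field(default_factory=list)  # memory/effect ordering

start = Node("Start")
x = Node("Param")
zero = Node("Const")  # the constant 0
cmp = Node("GreaterEq", data=[x, zero])
branch = Node("If", data=[cmp], control=[start])
# The control edge is the only thing that keeps this refinement from being
# scheduled above the x >= 0 check.
learn = Node("LearnFact", data=[x], control=[branch])
print(learn.control[0].op)  # If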
I think I need to re-read the original paper, read a modern implementation (I get the feeling it’s not done exactly the same way anymore), and then go write more about it later. For now, see Simple, by Cliff Click and friends. It is an implementation in Java and a little book to go with it.
Design neighbors include value dependence graphs (VDG), value state dependence graphs (VSDG), region value state dependence graphs (RVSDG), and program dependence graphs (PDG).
There’s probably more in this vein to be explored right now and probably more to invent in the future, too. Some other potentially interesting concepts to think about include:
Thank you to Chris Fallin, Hunter Goldstein, and Per Vognsen for valuable feedback on drafts of this post.