Much of the code and education that resulted in this post happened with Aaron Patterson.
The fundamental problem in register allocation is to take an IR that uses virtual registers (as many as you like) and rewrite it to use a finite number of physical registers and stack space1.
This is an example of a code snippet using virtual registers:
add R1, R2 -> R3
add R1, R3 -> R4
ret R4
And here is the same example after it has been passed through a register allocator (note that Rs changed to Ps):
add Stack[0], P0 -> P1
add Stack[0], P1 -> P0
ret
Each virtual register was assigned a physical place: R1 to the stack, R2 to P0, R3 to P1, and R4 also to P0 (since we weren’t using R2 anymore).
People use register allocators like they use garbage collectors: it’s an abstraction that can manage your resources for you, maybe with some cost. When writing the back-end of a compiler, it’s probably much easier to have a separate register-allocator-in-a-box than manually managing variable lifetimes while also considering all of your different target architectures.
How do JIT compilers do register allocation? Well, “everyone knows” that “every JIT does its own variant of linear scan”2. This bothered me for some time because I’ve worked on a couple of JITs and still didn’t understand the backend bits.
There are a couple different approaches to register allocation, but in this post we’ll focus on linear scan of SSA.
I started reading Linear Scan Register Allocation on SSA Form (PDF, 2010) by Wimmer and Franz after writing A catalog of ways to generate SSA. Reading alone didn’t make a ton of sense—I ended up with a lot of very frustrated margin notes. I started trying to implement it alongside the paper. As it turns out, though, there is a rich history of papers in this area that it leans on really heavily. I needed to follow the chain of references!
For example, here is a lovely explanation of the process, start to finish, from Christian Wimmer’s Master’s thesis (PDF, 2004).
LINEAR_SCAN
  // order blocks and operations (including loop detection)
  COMPUTE_BLOCK_ORDER
  NUMBER_OPERATIONS
  // create intervals with live ranges
  COMPUTE_LOCAL_LIVE_SETS
  COMPUTE_GLOBAL_LIVE_SETS
  BUILD_INTERVALS
  // allocate registers
  WALK_INTERVALS
  RESOLVE_DATA_FLOW
  // replace virtual registers with physical registers
  ASSIGN_REG_NUM
  // special handling for the Intel FPU stack
  ALLOCATE_FPU_STACK
There it is, all laid out at once. It’s very refreshing when compared to all of the compact research papers.
I didn’t realize that there were more than one or two papers on linear scan. So this post will also incidentally serve as a bit of a survey or a history of linear scan—as best as I can figure it out, anyway. If you were in or near the room where it happened, please feel free to reach out and correct some parts.
Throughout this post, we’ll use an example SSA code snippet from Wimmer2010, adapted from phi-SSA to block-argument-SSA. Wimmer2010’s code snippet is between the arrows and we add some filler (as alluded to in the paper):
label B1(R10, R11):
jmp B2($1, R11)
# vvvvvvvvvv #
label B2(R12, R13)
cmp R13, $1
branch lessThan B4()
label B3()
mul R12, R13 -> R14
sub R13, $1 -> R15
jump B2(R14, R15)
label B4()
# ^^^^^^^^^^ #
add R10, R12 -> R16
ret R16
Virtual registers start with R and are defined either with an arrow or by a block parameter.
Because it takes a moment to untangle the unfamiliar syntax and draw the control-flow graph by hand, I’ve also provided the same code in graphical form. Block names (and block parameters) are shaded with grey.
We have one entry block, B1, that is implied in Wimmer2010. Its only job is to define R10 and R11 for the rest of the CFG.
Then we have a loop between B2 and B3. Wimmer2010 uses an implicit fallthrough there; instead, we generate a conditional branch with explicit jump targets. This makes it possible to re-order blocks as much as we like. The contents of B4 are also just to fill in the blanks from Wimmer2010 and add some variable uses.
Our goal for the post is to analyze this CFG, assign physical locations (registers or stack slots) to each virtual register, and then rewrite the code appropriately.
For now, let’s rewind the clock and look at how linear scan came about.
Linear scan register allocation (LSRA) has been around for a while. It’s neat because it does the actual register assignment part of register allocation in one pass over your low-level IR. (We’ll talk more about what that means in a minute.)
It first appeared in the literature in tcc: A System for Fast, Flexible, and High-level Dynamic Code Generation (PDF, 1997) by Poletto, Engler, and Kaashoek. (Until writing this post, I had never seen this paper. It was only on a re-read of the 1999 paper (below) that I noticed it.) In this paper, they mostly describe a staged variant of C called ‘C (TickC), for which a fast register allocator is quite useful.
Then came a paper called Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. It adds some optimizations (lifetime holes, binpacking) to the algorithm presented in Poletto1997.
Then came the first paper I read, and I think the paper everyone refers to when they talk about linear scan: Linear Scan Register Allocation (PDF, 1999) by Poletto and Sarkar. In this paper, they give a fast alternative to graph coloring register allocation, especially motivated by just-in-time compilers. In retrospect, it seems to be a bit of a rehash of the previous two papers.
Linear scan (1997, 1999) operates on live ranges instead of virtual registers. A live range is a pair of integers [start, end) (end is exclusive) that begins when the register is defined and ends when it is last used. This means the order of instructions in your program is assumed to have already been fixed into a single linear sequence! It also means that you have given each instruction a number that represents its position in that order.
This may or may not be a surprising requirement, depending on your compilers background. It was surprising to me because I often live in control-flow graph fantasy land where blocks are unordered and instructions sometimes float around. But if you live in a land of basic blocks that are already in reverse post order, then it may be less surprising.
In non-SSA-land, these live ranges are different from the virtual registers: they represent the lifetime of each version of a virtual register. For an example, consider the following code snippet:
... -> a
add 1, a -> b
add 1, b -> c
add 1, c -> a
add 1, a -> d
There are two definitions of a and they each live for different amounts of time:

                a  b  c  a  d
... -> a        |              <- the first a
add 1, a -> b   v  |
add 1, b -> c      v  |
add 1, c -> a         v  |     <- the second a
add 1, a -> d            v  |
In fact, the ranges are completely disjoint. It wouldn’t make sense for the register allocator to consider whole variables, because there’s no reason the two definitions of a should necessarily live in the same physical register.
In SSA land, it’s a little different: since each virtual register only has one definition (by, uh, definition), live ranges map 1:1 onto virtual registers. We’ll focus on SSA for the remainder of the post because this is what I am currently interested in. The research community seems to have decided that allocating directly on SSA gives more information to the register allocator3.
Linear scan starts at the point in your compiler process where you already know these live ranges—it assumes you have already done some kind of analysis to build that mapping.
In this blog post, we’re going to back up to the point where we’ve just built our SSA low-level IR and have yet to do any work on it. We’ll do all of the analysis from scratch.
Part of this analysis is called liveness analysis.
The result of liveness analysis is a mapping of BasicBlock -> Set[Instruction] that tells you which virtual registers (remember, since we’re in SSA, instruction == vreg) are alive (used later) at the beginning of the basic block. This is called a live-in set. For example:
B0:
... -> R12
... -> R13
jmp B1
B1:
mul R12, R13 -> R14
sub R13, 1 -> R15
jmp B2
B2:
add R14, R15 -> R16
ret R16
We compute liveness by working backwards: walking in reverse, a variable becomes live at its last use and stays live until we reach its definition.
In this case, at the end of B2, nothing is live. If we step backwards to the ret, we see a use: R16 becomes live. If we step once more, we see its definition—R16 is no longer live—but now we see a use of R14 and R15, which become live. This leaves us with R14 and R15 being live-in to B2.
This live-in set becomes B1’s live-out set because B1 is B2’s predecessor. We continue in B1. We could continue backwards linearly through the blocks. In fact, I encourage you to do it as an exercise; you can check your work against the sets below. You should have a (potentially empty) set of registers per basic block.
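For reference, here is what you should end up with for this three-block example:

live-in(B2) = {R14, R15}
live-in(B1) = {R12, R13}
live-in(B0) = {}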
It gets more interesting, though, when we have branches: what does it mean when two blocks’ live-in results merge into their shared predecessor? If we have two blocks A and B that are successors of a block C, the live-in sets get unioned together.
That is, if there were some register R0 live-in to B and some register R1 live-in to A, both R0 and R1 would be live-out of C. They may also be live-in to C, but that entirely depends on the contents of C.
Since the number of virtual registers in a given program is finite, sets of them form a good lattice for an abstract interpreter. That’s right, we’re doing AI.
In this liveness analysis, we’ll compute a gen set (the registers a block uses) and a kill set (the registers a block defines) for each basic block, then iterate over all of the blocks, recomputing live-in sets until they stop changing.
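Written out as the usual backward dataflow equations, which the code below implements directly with bitset operators:

live_out(B) = union of live_in(S) for each successor S of B
live_in(B)  = gen(B) | (live_out(B) & ~kill(B))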
We store gen, kill, and live-in sets as bitsets, using some APIs conveniently available on Ruby’s Integer class.
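For instance, just to illustrate the representation (bit n stands for virtual register n):

set = 0
set |= (1 << 14)        # add R14 to the set
set |= (1 << 15)        # add R15
(set & (1 << 14)) != 0  # => true: R14 is in the set
set &= ~(1 << 14)       # remove R14
set[14]                 # => 0: Integer#[] reads back a single bit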
Like most abstract interpretations, it doesn’t matter what order we iterate over the collection of basic blocks for correctness, but it does matter for performance. In this case, iterating backwards (post_order) converges much faster than forwards (reverse_post_order):
class Function
  def compute_initial_liveness_sets order
    # Map of Block -> what variables it alone needs to be live-in
    gen = Hash.new 0
    # Map of Block -> what variables it alone defines
    kill = Hash.new 0
    order.each do |block|
      block.instructions.reverse_each do |insn|
        out = insn.out&.as_vreg
        if out
          kill[block] |= (1 << out.num)
        end
        insn.vreg_ins.each do |vreg|
          gen[block] |= (1 << vreg.num)
        end
      end
      block.parameters.each do |param|
        kill[block] |= (1 << param.num)
      end
    end
    [gen, kill]
  end

  def analyze_liveness
    order = post_order
    gen, kill = compute_initial_liveness_sets(order)
    # Map from Block -> what variables are live-in
    live_in = Hash.new 0
    changed = true
    while changed
      changed = false
      for block in order
        # Union-ing all the successors' live-in sets gives us this block's
        # live-out, which is a good starting point for computing the live-in
        block_live = block.successors.map { |succ| live_in[succ] }.reduce(0, :|)
        block_live |= gen[block]
        block_live &= ~kill[block]
        if live_in[block] != block_live
          changed = true
          live_in[block] = block_live
        end
      end
    end
    live_in
  end
end
We could also use a worklist here, and it would be faster, but eh. Repeatedly iterating over all blocks is fine for now.
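If you’re curious, here’s a sketch of what that worklist variant might look like. It assumes a Block#predecessors accessor, which our blocks don’t currently store (later on, resolve_ssa counts predecessors by hand):

class Function
  # A sketch only: same gen/kill/live-in machinery as above, but a block is
  # re-processed only when one of its successors' live-in sets changes.
  def analyze_liveness_with_worklist
    order = post_order
    gen, kill = compute_initial_liveness_sets(order)
    live_in = Hash.new 0
    worklist = order.dup
    until worklist.empty?
      block = worklist.shift
      block_live = block.successors.map { |succ| live_in[succ] }.reduce(0, :|)
      block_live |= gen[block]
      block_live &= ~kill[block]
      if live_in[block] != block_live
        live_in[block] = block_live
        # A change to this block's live-in changes its predecessors'
        # live-out, so revisit them (hypothetical predecessors accessor)
        block.predecessors.each do |pred|
          worklist << pred unless worklist.include?(pred)
        end
      end
    end
    live_in
  end
end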
The Wimmer2010 paper skips this liveness analysis entirely by assuming some computed information about your CFG: where loops start and end. It also requires all loop blocks be contiguous. Then it makes variables defined before a loop and used at any point inside the loop live for the whole loop. By having this information available, it folds the liveness analysis into the live range building, which we’ll instead do separately in a moment.
The Wimmer approach sounded complicated and finicky. Maybe it is, maybe it isn’t. So I went with a dataflow liveness analysis instead. If it turns out to be the slow part, maybe it will matter enough to learn about this loop tagging method.
For now, we will pick a schedule for the control-flow graph.
In order to build live ranges, you have to have some kind of numbering system for your instructions, otherwise a live range’s start and end are meaningless. We can write a function that fixes a particular block order (in this case, reverse post-order) and then assigns each block and instruction a number in a linear sequence. You can think of this as flattening or projecting the graph:
class Function
  def number_instructions!
    @block_order = rpo
    number = 16 # just so we match the Wimmer2010 paper
    @block_order.each do |blk|
      blk.number = number
      number += 2
      blk.instructions.each do |insn|
        insn.number = number
        number += 2
      end
      blk.to = number
    end
  end
end
A couple interesting things to note: we count up in increments of two, which leaves the odd numbers free for any instructions (such as moves) we insert later, and blocks themselves get numbers marking where they begin and end.
Even though we have extra instructions, it looks very similar to the example in the Wimmer2010 paper.
16: label B1(R10, R11):
18: jmp B2($1, R11)
# vvvvvvvvvv #
20: label B2(R12, R13)
22: cmp R13, $1
24: branch lessThan B4()
26: label B3()
28: mul R12, R13 -> R14
30: sub R13, $1 -> R15
32: jump B2(R14, R15)
34: label B4()
# ^^^^^^^^^^ #
36: add R10, R12 -> R16
38: ret R16
Since we’re not going to be messing with the order of the instructions within a block anymore, all we have to do going forward is make sure that we iterate through the blocks in @block_order.
Finally, we have all that we need to compute live ranges.
We’ll more or less copy the algorithm to compute live ranges from the Wimmer2010 paper, with a couple of differences that we’ll point out below.
I know I said we were going to be computing live ranges. So why am I presenting you with a function called build_intervals? That’s because early in the history of linear scan (Traub1998!), people moved from having a single range for a particular virtual register to having multiple disjoint ranges. This collection of multiple ranges is called an interval and it exists to free up registers in the context of branches.
For example, in our IR snippet (above), R12 is defined in B2 as a block parameter, used in B3, and then not used again until some indeterminate point in B4. (Our example uses it immediately in an add instruction to keep things short, but pretend the second use is some time away.)
The Wimmer2010 paper creates a lifetime hole between 28 and 34, meaning that the interval for R12 (called i12) is [[20, 28), [34, ...)]. Interval holes are not strictly necessary—they exist to generate better code. So for this post, we’re going to start simple and assume 1 interval == 1 range. We may come back later and add additional ranges, but that will require some fixes to our later implementation. We’ll note where we think those fixes should happen.
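To give a flavor of the shape that would take, here is a hypothetical sketch (not the class we actually use below):

# Hypothetical: an interval with lifetime holes is a sorted list of
# disjoint half-open [from, to) ranges rather than a single range
class IntervalWithHoles
  attr_reader :ranges

  def initialize
    @ranges = [] # sorted by begin, pairwise disjoint
  end

  # Any gap between two consecutive ranges is a lifetime hole
  def covers?(point)
    @ranges.any? { |r| r.begin <= point && point < r.end }
  end
end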
BUILDINTERVALS
for each block b in reverse order do
  live = union of successor.liveIn for each successor of b
  for each phi function phi of successors of b do
    live.add(phi.inputOf(b))
  for each opd in live do
    intervals[opd].addRange(b.from, b.to)
  for each operation op of b in reverse order do
    for each output operand opd of op do
      intervals[opd].setFrom(op.id)
      live.remove(opd)
    for each input operand opd of op do
      intervals[opd].addRange(b.from, op.id)
      live.add(opd)
  for each phi function phi of b do
    live.remove(phi.output)
  if b is loop header then
    loopEnd = last block of the loop starting at b
    for each opd in live do
      intervals[opd].addRange(b.from, loopEnd.to)
  b.liveIn = live
Anyway, here is the mostly-copied annotated implementation of BuildIntervals from the Wimmer2010 paper:
class Function
  def build_intervals live_in
    intervals = Hash.new { |hash, key| hash[key] = Interval.new }
    @block_order.each do |block|
      # live = union of successor.liveIn for each successor of b
      # this is the *live out* of the current block since we're going to be
      # iterating backwards over instructions
      live = block.successors.map { |succ| live_in[succ] }.reduce(0, :|)
      # for each phi function phi of successors of b do
      #   live.add(phi.inputOf(b))
      live |= block.out_vregs.map { |vreg| 1 << vreg.num }.reduce(0, :|)
      each_bit(live) do |idx|
        opd = vreg idx
        intervals[opd].add_range(block.from, block.to)
      end
      block.instructions.reverse.each do |insn|
        out = insn.out&.as_vreg
        if out
          # for each output operand opd of op do
          #   intervals[opd].setFrom(op.id)
          intervals[out].set_from(insn.number)
        end
        # for each input operand opd of op do
        #   intervals[opd].addRange(b.from, op.id)
        insn.vreg_ins.each do |opd|
          intervals[opd].add_range(block.from, insn.number)
        end
      end
    end
    intervals.default_proc = nil
    intervals.freeze
  end
end
end
Another difference is that since we’re using block parameters, we don’t really have this phi.inputOf thing. That’s just the block argument.
The last difference is that since we’re skipping the loop liveness hack, we don’t modify a block’s live set as we iterate through instructions.
I know we said we’re building live ranges, so our Interval class only has one Range on it. This is Ruby’s built-in Range, but it’s really just being used as a tuple of integers here.
class Interval
  attr_reader :range

  def add_range(from, to)
    if to <= from
      raise ArgumentError, "Invalid range: #{from} to #{to}"
    end
    if !@range
      @range = Range.new(from, to)
      return
    end
    @range = Range.new([@range.begin, from].min, [@range.end, to].max)
  end

  def set_from(from)
    @range = if @range
      if @range.end <= from
        raise ArgumentError, "Invalid range: #{from} to #{@range.end}"
      end
      Range.new(from, @range.end)
    else
      # This happens when we don't have a use of the vreg
      # If we don't have a use, the live range is very short
      Range.new(from, from)
    end
  end

  def ==(other)
    other.is_a?(Interval) && @range == other.range
  end
end
Note that there’s some implicit behavior happening here: add_range builds the smallest single range that covers both the existing range and the incoming information, and set_from may shrink it.

For example, if we have [1, 5) and someone calls add_range(7, 10), we end up with [1, 10). There’s no gap in the middle.

And if we have [1, 7) and someone calls set_from(3), we end up with [3, 7).
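A quick demonstration of both behaviors with the class above (the inclusive .. in Ruby’s Range display is cosmetic; we treat it as a half-open pair):

interval = Interval.new
interval.add_range(1, 5)
interval.add_range(7, 10)
interval.range # => 1..10, i.e. [1, 10); the gap is absorbed

interval = Interval.new
interval.add_range(1, 7)
interval.set_from(3)
interval.range # => 3..7, i.e. [3, 7)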
After figuring out from scratch some of these assumptions about what the interval/range API should and should not do, Aaron and I realized that there was some actual code for add_range in a different, earlier paper: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer.
ADDRANGE(i: Instruction; b: Block; end: integer)
  if b.first.n ≤ i.n ≤ b.last.n then
    range ← [i.n, end[
  else
    range ← [b.first.n, end[
  add range to interval[i.n] // merging adjacent ranges
Unfortunately, many other versions of this PDF look absolutely horrible (like bad OCR) and I had to do some digging to find the version linked above.
Finally we can start thinking about doing some actual register assignment. Let’s return to the 90s.
Because we have faithfully kept 1 interval == 1 range, we can re-use the linear scan algorithm from Poletto1999 (which looks, at a glance, to be the same as 1997).
I recommend looking at the PDF side by side with the code. We have tried to keep the structure very similar.
LinearScanRegisterAllocation
  active ← {}
  foreach live interval i, in order of increasing start point
    ExpireOldIntervals(i)
    if length(active) = R then
      SpillAtInterval(i)
    else
      register[i] ← a register removed from pool of free registers
      add i to active, sorted by increasing end point

ExpireOldIntervals(i)
  foreach interval j in active, in order of increasing end point
    if endpoint[j] ≥ startpoint[i] then
      return
    remove j from active
    add register[j] to pool of free registers

SpillAtInterval(i)
  spill ← last interval in active
  if endpoint[spill] > endpoint[i] then
    register[i] ← register[spill]
    location[spill] ← new stack location
    remove spill from active
    add i to active, sorted by increasing end point
  else
    location[i] ← new stack location
class Function
  def ye_olde_linear_scan intervals, num_registers
    if num_registers <= 0
      raise ArgumentError, "Number of registers must be positive"
    end
    free_registers = Set.new 0...num_registers
    active = [] # Active intervals, sorted by increasing end point
    assignment = {} # Map from Interval to PReg|StackSlot
    num_stack_slots = 0
    # Iterate through intervals in order of increasing start point
    sorted_intervals = intervals.sort_by { |_, interval| interval.range.begin }
    sorted_intervals.each do |_vreg, interval|
      # expire_old_intervals(interval)
      active.select! do |active_interval|
        if active_interval.range.end > interval.range.begin
          true
        else
          operand = assignment.fetch(active_interval)
          raise "Should be assigned a register" unless operand.is_a?(PReg)
          free_registers.add(operand.name)
          false
        end
      end
      if active.length == num_registers
        # spill_at_interval(interval)
        # Pick an interval to spill. Picking the longest-lived active one is
        # a heuristic from the original linear scan paper.
        spill = active.last
        # In either case, we need to allocate a slot on the stack.
        slot = StackSlot.new(num_stack_slots)
        num_stack_slots += 1
        if spill.range.end > interval.range.end
          # The last active interval ends further away than the current
          # interval; spill the last active interval.
          assignment[interval] = assignment[spill]
          raise "Should be assigned a register" unless assignment[interval].is_a?(PReg)
          assignment[spill] = slot
          active.pop # We know spill is the last one
          # Insert interval into already-sorted active
          insert_idx = active.bsearch_index { |i| i.range.end >= interval.range.end } || active.length
          active.insert(insert_idx, interval)
        else
          # The current interval ends further away than the last active
          # interval; spill the current interval.
          assignment[interval] = slot
        end
      else
        reg = free_registers.min
        free_registers.delete(reg)
        assignment[interval] = PReg.new(reg)
        # Insert interval into already-sorted active
        insert_idx = active.bsearch_index { |i| i.range.end >= interval.range.end } || active.length
        active.insert(insert_idx, interval)
      end
    end
    [assignment, num_stack_slots]
  end
end
Internalizing this took us a bit. It is mostly a three-state machine: at any point in the scan, an interval is active (currently holding a register), expired (the scan has moved past its end point and its register has returned to the free pool), or spilled (living in a stack slot).
We would like to come back to this and incrementally modify it as we add lifetime holes to intervals.
I finally understood, very late in the game, that linear scan assigns one location per virtual register. Ever. It’s not that every virtual register gets a shot in a register and then gets moved to a stack slot—that would be interval splitting and hopefully we get to that later—if a register gets spilled, it’s in a stack slot from beginning to end.
I only found this out accidentally after trying to figure out a bug (that wasn’t a bug) due to a lovely sentence in Optimized Interval Splitting in a Linear Scan Register Allocator (PDF, 2005) by Wimmer and Mössenböck:
However, it cannot deal with lifetime holes and does not split intervals, so an interval has either a register assigned for the whole lifetime, or it is spilled completely.
Also,
In particular, it is not possible to implement the algorithm without reserving a scratch register: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register
Also,
Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately
Marvelous.
Let’s take a look at the code snippet again. Here it is before register allocation, using virtual registers:
16: label B1(R10, R11):
18: jmp B2($1, R11)
# vvvvvvvvvv #
20: label B2(R12, R13)
22: cmp R13, $1
24: branch lessThan B4()
26: label B3()
28: mul R12, R13 -> R14
30: sub R13, $1 -> R15
32: jump B2(R14, R15)
34: label B4()
# ^^^^^^^^^^ #
36: add R10, R12 -> R16
38: ret R16
Let’s run it through register allocation with incrementally decreasing numbers of physical registers available. We get the following assignments:
4 registers: {R10: P0, R11: P1, R12: P1, R13: P2, R14: P3, R15: P2, R16: P0}
3 registers: {R10: Stack[0], R11: P1, R12: P1, R13: P2, R14: P0, R15: P2, R16: P0}
2 registers: {R10: Stack[0], R11: P1, R12: Stack[1], R13: P0, R14: P1, R15: P0, R16: P0}
1 register:  {R10: Stack[0], R11: P0, R12: Stack[1], R13: P0, R14: Stack[2], R15: P0, R16: P0}
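For reference, here is roughly the driver that produces those assignments, assuming the Function API we’ve built up in this post (a sketch, not tested code):

live_in = function.analyze_liveness
function.number_instructions!
intervals = function.build_intervals(live_in)
assignment, num_stack_slots = function.ye_olde_linear_scan(intervals, 4)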
Some other things to note:
If you have a register free, choosing which register to allocate is a heuristic! It is tunable. There is probably some research out there that explores the space.
In fact, you might even consider not allocating a register greedily. What might that look like? I have no idea.
Spilling the interval with the furthest endpoint is a heuristic! You can pick any active interval you want. In Register Spilling and Live-Range Splitting for SSA-Form Programs (PDF, 2009) by Braun and Hack, for example, they present the MIN algorithm, which spills the interval with the furthest next use.
This requires slightly more information and takes slightly more time than the default heuristic but apparently generates much better code.
Also, block ordering? You guessed it. Heuristic.
Here is an example “slideshow” I generated by running linear scan with 2 registers. Use the arrow keys to navigate forward and backward in time4.
At this point we have register assignments: we have a hash table mapping intervals to physical locations. That’s great but we’re still in SSA form: labelled code regions don’t have block arguments in hardware. We need to write some code to take us out of SSA and into the real world.
We can use a modified Wimmer2010 as a great starting point here. It handles more than we need right now—lifetime holes—but we can simplify.
RESOLVE
for each control flow edge from predecessor to successor do
  for each interval it live at begin of successor do
    if it starts at begin of successor then
      phi = phi function defining it
      opd = phi.inputOf(predecessor)
      if opd is a constant then
        moveFrom = opd
      else
        moveFrom = location of intervals[opd] at end of predecessor
    else
      moveFrom = location of it at end of predecessor
    moveTo = location of it at begin of successor
    if moveFrom ≠ moveTo then
      mapping.add(moveFrom, moveTo)
  mapping.orderAndInsertMoves()
Because we have a 1:1 mapping of virtual registers to live ranges, we know that every interval live at the beginning of a block is either:

- live-through: defined elsewhere and, since every interval gets exactly one location, needing no move at all, or
- defined by one of the block’s parameters, needing a move from the corresponding argument in each predecessor.
For this reason, we only handle the second case in our SSA resolution. If we added lifetime holes, we would have to go back to the full Wimmer SSA resolution.
This means that we’re going to iterate over every outbound edge from every block. For each edge, we’re going to insert some parallel moves.
class Function
  def resolve_ssa intervals, assignments
    # ...
    @block_order.each do |predecessor|
      outgoing_edges = predecessor.edges
      num_successors = outgoing_edges.length
      outgoing_edges.each do |edge|
        mapping = []
        successor = edge.block
        edge.args.zip(successor.parameters).each do |moveFrom, moveTo|
          if moveFrom != moveTo
            mapping << [moveFrom, moveTo]
          end
        end
        # predecessor.order_and_insert_moves(mapping)
        # TODO: order_and_insert_moves
      end
    end
    # Remove all block parameters and arguments; we have resolved SSA
    @block_order.each do |block|
      block.parameters.clear
      block.edges.each do |edge|
        edge.args.clear
      end
    end
  end
end
This already looks very similar to the RESOLVE function from Wimmer2010.
Unfortunately, Wimmer2010 basically waves off orderAndInsertMoves with an “eh, it’s already in the literature” comment.
What’s not made clear, though, is that this particular subroutine has been the source of a significant number of bugs in the literature. Only recently did some folks roll through and suggest (proven!) fixes.
This sent us on a deep rabbit hole of trying to understand what bugs occur, when, and how to fix them. We implemented both the Leroy and the Boissinot algorithms. We found differences between Boissinot2009, Boissinot2010, and the SSA book implementation following those algorithms. We found Paul Sokolovsky’s implementation with bugfixes. We found Dmitry Stogov’s unmerged pull request to the same repository to fix another bug.
We looked at Benoit Boissinot’s thesis again and emailed him some questions. He responded! And then he even put up an amended version of his algorithm in Rust with tests and fuzzing.
All this is to say that this is still causing people grief and, though I understand page limits, I wish parallel moves were not handwaved away.
We ended up with this implementation which passes all of the tests from Sokolovsky’s repository as well as the example from Boissinot’s thesis (though, as we discussed in the email, the example solution in the thesis is incorrect5).
# copies contains an array of [src, dst] arrays
def sequentialize copies
  ready = [] # Contains only destination regs ("available")
  to_do = [] # Contains only destination regs
  pred = {} # Map of destination reg -> what reg is written to it (its source)
  loc = {} # Map of reg -> the current location where the initial value of reg is available ("resource")
  result = []
  emit_copy = ->(src, dst) {
    # We add an arrow here just for clarity in reading this algorithm because
    # different people do [src, dst] and [dst, src] depending on if they prefer
    # Intel or AT&T
    result << [src, "->", dst]
  }
  # In Ruby, loc[x] is nil if x not in loc, so this loop could be omitted
  copies.each do |(src, dst)|
    loc[dst] = nil
  end
  copies.each do |(src, dst)|
    loc[src] = src
    if pred.key? dst # Alternatively, to_do.include? dst
      raise "Conflicting assignments to destination #{dst}, latest: #{[dst, src]}"
    end
    pred[dst] = src
    to_do << dst
  end
  copies.each do |(src, dst)|
    if !loc[dst]
      # All destinations that are not also sources can be written to immediately (tree leaves)
      ready << dst
    end
  end
  while !to_do.empty?
    while b = ready.pop
      a = loc[pred[b]] # a in the paper
      emit_copy.(a, b)
      # pred[b] is now living at b
      loc[pred[b]] = b
      if to_do.include?(a)
        to_do.delete a
      end
      if pred[b] == a && pred.include?(a)
        ready << a
      end
    end
    if to_do.empty?
      break
    end
    dst = to_do.pop
    if dst != loc[pred[dst]]
      emit_copy.(dst, "tmp")
      loc[dst] = "tmp"
      ready << dst
    end
  end
  result
end
Leroy’s algorithm, which is shorter, passes almost all the tests—in one test case, it uses one more temporary variable than Boissinot’s does. We haven’t spent much time looking at why.
def move_one i, src, dst, status, result
  return if src[i] == dst[i]
  status[i] = :being_moved
  for j in 0...(src.length) do
    if src[j] == dst[i]
      case status[j]
      when :to_move
        move_one j, src, dst, status, result
      when :being_moved
        result << [src[j], "->", "tmp"]
        src[j] = "tmp"
      end
    end
  end
  result << [src[i], "->", dst[i]]
  status[i] = :moved
end

def leroy_sequentialize copies
  # `it` is Ruby 3.4's implicit block parameter
  src = copies.map { it[0] }
  dst = copies.map { it[1] }
  status = [:to_move] * src.length
  result = []
  status.each_with_index do |item, i|
    if item == :to_move
      move_one i, src, dst, status, result
    end
  end
  result
end
Whatever algorithm you choose, you now have a way to parallel move some registers to some other registers. You have avoided the “swap problem”.
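For example, a two-register swap—the classic case that a naive in-order copy would clobber—sequentializes through the temporary:

sequentialize([["R0", "R1"], ["R1", "R0"]])
# => [["R0", "->", "tmp"], ["R1", "->", "R0"], ["tmp", "->", "R1"]]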
class Function
  def resolve_ssa intervals, assignments
    # ...
    # predecessor.order_and_insert_moves(mapping)
    sequence = sequentialize(mapping).map do |(src, _, dst)|
      Insn.new(:mov, dst, [src])
    end
    # TODO: insert the moves!
    # ...
  end
end
That’s great. You can generate an ordered list of instructions from a tangled graph. But where do you put them? What about the “lost copy” problem?
As it turns out, we still need to handle critical edge splitting. Let’s consider what it means to insert moves at an edge between blocks A -> B when the surrounding CFG looks a couple of different ways:

Case 1: A -> B
Case 2: A -> B and A -> C
Case 3: A -> B and D -> B
Case 4: A -> B and A -> C and D -> B
These are the four (really, three) cases we may come across.
In Case 1, if we only have two neighboring blocks A and B, we can insert the moves into either block. It doesn’t matter: at the end of A or at the beginning of B are both fine.
In Case 2, if A has two successors, then we should insert the moves at the beginning of B. That way we won’t be mucking things up for the edge A -> C.
In Case 3, if B has two predecessors, then we should insert the moves at the end of A. That way we won’t be mucking things up for the edge D -> B.
Case 4 is the most complicated. There is no extant place in the graph we can insert moves. If we insert in A, we mess things up for A -> C. If we insert in B, we mess things up for D -> B. Inserting in C or D doesn’t make any sense. What is there to do?
As it turns out, Case 4 is called a critical edge. And we have to split it.
We can insert a new block E along the edge A -> B and put the moves in E! That way they still happen along the edge without affecting any other blocks. Neat.
In Ruby code, that looks like:
class Function
  def resolve_ssa intervals, assignments
    num_predecessors = Hash.new 0
    @block_order.each do |block|
      block.edges.each do |edge|
        num_predecessors[edge.block] += 1
      end
    end
    # ...
    # predecessor.order_and_insert_moves(mapping)
    sequence = ...
    # If we don't have any moves to insert, we don't have any block to
    # insert
    next if sequence.empty?
    if num_predecessors[successor] > 1 && num_successors > 1
      # Make a new interstitial block
      b = new_block
      b.insert_moves_at_start sequence
      b.instructions << Insn.new(:jump, nil, [Edge.new(successor, [])])
      edge.block = b
    elsif num_successors > 1
      # Insert into the beginning of the block
      successor.insert_moves_at_start sequence
    else
      # Insert into the end of the block... before the terminator
      predecessor.insert_moves_at_end sequence
    end
    # ...
  end
end
Adding a new block invalidates the cached @block_order, so we also need to recompute that.
We could also avoid that by splitting critical edges earlier, before numbering. Then, when we arrive in resolve_ssa, we can clean up branches to empty blocks!
(See also Nick’s post on critical edge splitting, which also links to Faddegon’s thesis, which I should at least skim.)
And that’s it, folks. We have gone from virtual registers in SSA form to physical locations. Everything’s all hunky-dory. We can just turn these LIR instructions into their very similar looking machine equivalents, right?
Not so fast…
You may have noticed that the original linear scan paper does not mention calls or other register constraints. I didn’t really think about it until I wanted to make a function call. The authors of later linear scan papers definitely noticed, though; Wimmer2005 writes the following about Poletto1999:
When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately.
Fun. We will start off by handling calls and method parameters separately, note that it’s not amazing code, and then eventually implement the later papers, which handle register constraints more naturally.
We’ll call this new function, handle_caller_saved_regs, after register allocation but before SSA resolution. We do it after register allocation so we know where each virtual register goes, but before resolution so we can still inspect the virtual register operands.
Its goal is to do a couple of things:

- Insert push and pop instructions around call instructions to preserve virtual registers that are used on the other side of the call. We only care about preserving virtual registers that are stored in physical registers, though; no need to preserve anything that already lives on the stack.
- Move the call’s arguments into the designated parameter registers, and its result out of the return register, around each call instruction.

We’ll also remove the call operands since we’re placing them in special registers explicitly now.
class Function
  def handle_caller_saved_regs intervals, assignments, return_reg, param_regs
    @block_order.each do |block|
      x = block.instructions.flat_map do |insn|
        if insn.name == :call
          survivors = intervals.select { |_vreg, interval|
            interval.survives?(insn.number)
          }.map(&:first).select { |vreg|
            assignments[intervals[vreg]].is_a?(PReg)
          }
          mov_input = insn.out
          insn.out = return_reg
          ins = insn.ins.drop(1)
          raise if ins.length > param_regs.length
          insn.ins.replace(insn.ins.first(1))
          mapping = ins.zip(param_regs).to_h
          sequence = sequentialize(mapping).map do |(src, _, dst)|
            Insn.new(:mov, dst, [src])
          end
          survivors.map { |s| Insn.new(:push, nil, [s]) } +
            sequence +
            [insn, Insn.new(:mov, mov_input, [return_reg])] +
            survivors.map { |s| Insn.new(:pop, nil, [s]) }.reverse
        else
          insn
        end
      end
      block.instructions.replace(x)
    end
  end
end
(Unfortunately, this sidesteps handling the less-fun bit of calls in ABIs where after the 6th parameter, they are expected on the stack. It also completely ignores ABI size constraints.)
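To make the rewrite concrete, here is a hypothetical before/after for one call, assuming P0 is the return register, P7 and P6 are the first two parameter registers, and R10 is live across the call in a physical register:

# before (operands: callee first, then arguments)
40: call $f, R20, R21 -> R22

# after
    push R10        # preserve the surviving register
    mov R20 -> P7   # parallel-move the arguments into parameter registers
    mov R21 -> P6
40: call $f         # output rewritten to the return register
    mov P0 -> R22   # move the result into the call's original output
    pop R10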
Now, you may have noticed that we don’t do anything special for the incoming params of the function we’re compiling! That’s another thing we have to handle. Thankfully, we can handle it with yet another parallel move (wow!) at the end of resolve_ssa.
class Function
  def resolve_ssa intervals, assignments
    # ...
    # We're typically going to have more param regs than block parameters.
    # When we zip the param regs with block params, we'll end up with param
    # regs mapping to nil. We filter those away by selecting for tuples
    # that have a truthy second value:
    # [[x, y], [z, nil]].select(&:last) (reject the second tuple)
    mapping = param_regs.zip(entry_block.parameters).select(&:last).to_h
    sequence = sequentialize(mapping).map do |(src, _, dst)|
      Insn.new(:mov, dst, [src])
    end
    entry_block.insert_moves_at_start(sequence)
  end
end
Again, this is yet another kind of thing where some of the later papers have much better ergonomics and also much better generated code.
But this is really cool! If you have arrived at this point with me, we have successfully made it to 1997 and that is nothing to sneeze at. We have even adapted research from 1997 to work with SSA, avoiding several significant classes of bugs along the way.
We have just built an enormously complex machine. Even out of the gate, with the original linear scan, there is a lot of machinery. It’s possible to write tests that spot-check sample programs of all shapes and sizes but it’s very difficult to anticipate every possible edge case that will appear in the real world.
Even if the original algorithm you’re using has been proven correct, your implementation may have subtle bugs due to (for example) having slightly different invariants or even transcription errors.
We have all these proof tools at our disposal: we can write an abstract interpreter that verifies properties of one graph, but it’s very hard (impossible?) to scale that to sets of graphs.
Maybe that’s enough, though. In one of my favorite blog posts, Chris Fallin writes about writing a register allocation verifier based on abstract interpretation. It can verify one concrete LIR function at a time. It’s fast enough that it can be left on in debug builds. This means that a decent chunk of the time (tests, CI, maybe a production cluster) we can get a very clear signal that every register assignment that passes through the verifier satisfies some invariants.
Furthermore, we are not limited to Real World Code. With the advent of fuzzing, one can imagine an always-on fuzzer that tries to break the register allocator. A verifier can then catch bugs that come from exploring this huge search space.
Some time after finding Chris’s blog post, I also stumbled across the very same thing in V8!
I find this stuff so cool. I’ll also mention Boissinot’s Rust code again because it does something similar for parallel moves.
It’s possible to do linear scan allocation in reverse, at least on traces without control-flow. See for example The Solid-State Register Allocator, the LuaJIT register allocator, and Reverse Linear Scan Allocation is probably a good idea. By doing linear scan this way, it is also possible to avoid computing liveness and intervals. I am not sure if this works on programs with control-flow, though.
We built a register allocator that works on SSA. Hopefully next time we will add features such as lifetime holes, interval splitting, and register hints.
The full Ruby code listing is not (yet?) public.
Thanks to Waleed Khan and Iain Ireland for giving feedback on this post.
It’s not just about registers, either. In 2016, Facebook engineer Dave legendarily used linear-scan register allocation to book meeting rooms. ↩
Well. As I said on one of the social media sites earlier this year, “All AOT compilers are alike; each JIT compiler is fucked up in its own way.”
JavaScript:
Java:
Python:
Ruby:
PHP:
Lua:
Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer notes:
Our allocator relies on static single assignment form, which simplifies data flow analysis and tends to produce short live intervals.
Register allocation for programs in SSA-form (PDF, 2006) by Hack, Grund, and Goos notes that interference graphs for SSA programs are chordal and can be optimally colored in quadratic time.
SSA Elimination after Register Allocation (PDF, 2008) by Pereira and Palsberg notes:
One of the main advantages of SSA based register allocation is the separation of phases between spilling and register assignment.
Cliff Click (private communication, 2025) notes:
It’s easier. Got it already, why lose it […] spilling always uses use/def and def/use edges.
This is inspired by Rasmus Andersson’s graph coloring visualization that I saw some years ago. ↩
The example in the thesis is to sequentialize the following parallel copy:
The solution in the thesis is:
but we think this is incorrect. Solving manually, Aaron and I got:
which is what the code gives us, too. ↩