I work on Cinder, a just-in-time (JIT) compiler built on top of CPython. If you aren’t familiar with Cinder and want to learn more, a previous post about the inliner gives a decent overview of the JIT. This post will talk about how we use binary search to isolate miscompiled functions, a technique that is applicable to any compiler if you have the right infrastructure.
Sometimes—frequently—I change an optimization pass or a code generation
step and I break something. In the best case scenario, I end up with a failing
test and the test name or body gives me enough clues to fix my silly little
mistake. But in the worst case I can’t even boot to the Python prompt because
I messed something up so badly. This can manifest as an exception, a failing
assert
, or even a segmentation fault.
Since the runtime could have compiled any number of functions in that boot process or test run, there are a lot of moving parts. I generally don’t want to look at the source, intermediate representations, and assembly of 1000 different Python functions. That’s too much. It would be nice to have one to two functions to look at and make inferences. This is where bisect comes in.
You may have heard of bisecting from geometry or from git
bisect
1. Those are the two places I heard about it.
It means to cut in half. In our case and the Git case, not just once—many
times!
We want to take our list of compiled functions and continuously cut it in half until we reach a very small group of functions that cause us trouble.
Say we run ./python -X jit some_program.py
and it crashes. In the course of
this run, the JIT compiles functions A
, B
, C
, and D
, runs some code,
and aborts. We can’t know which function is miscompiled, so we will try and
bisect:
A
and B
together. Success.C
and D
together. Crash.C
. Success.D
. Crash.Looks like D
is the troublemaker.
Since we can discard half of this set each time, we can find our miscompiled functions in logarithmic time2. That’s awesome. Even on slow-running repros, this rarely takes significant time. At worst, I go make tea.
While it’s excellent in many cases, bisecting has some prerequisites:
We already have the second two due to some server architecture constraints.
Cinder can be run with -X jit-list-file=somefile.txt
and will only compile
functions on the JIT list (but will still only compile on first run).
The bisect script is a wrapper like ./jitlist_bisect.py ./python -X jit ...
which interprets the debug output to figure out which functions were compiled
and passes in the JIT list.
Sometimes a JIT list will cause the crash but each split half won’t. In that case we hold each half fixed and try bisecting the other half to figure out what candidates we need. Update: I have been told that this is called delta debugging.
The script is less than 200 lines of Python and can be found here. A typical run looks like this:
$ ./Tools/scripts/jitlist_bisect.py ./build/python -X jit -m unittest test.test_import
INFO:root:Generating initial jit-list
INFO:root:Verifying jit-list
INFO:root:step fixed[0] and jitlist[504]
INFO:root:504 candidates
INFO:root:252 candidates
INFO:root:126 candidates
INFO:root:63 candidates
INFO:root:32 candidates
INFO:root:16 candidates
INFO:root:8 candidates
INFO:root:4 candidates
INFO:root:2 candidates
Bisect finished with 1 functions in jitlist.txt
$ cat jitlist.txt
warnings:_add_filter
$
Time to go look at the warnings
module.
See the core of the implementation extracted into a copy/pastable non-JIT-specific couple of functions here.
Manually or automatically slimming down your reproducing source code also helps with this approach. It makes the repro runtime shorter and sometimes removes other moving parts like needing to send network traffic or something. We can probably use some form of tracing and bisect to automatically slim the repro.
Compiler and interpreter unit tests can be a pain to write but they have saved me countless times over the past couple of years. Being able to isolate each and every small optimizer change is so, so helpful.
Speaking of automatically slimming the repro, C-Reduce is a tool by John Regehr and his collaborators. It takes a C source file and runner script and automatically bisects it to some failing case. From their homepage:
C-Reduce is a tool that takes a large C, C++, or OpenCL file that has a property of interest (such as triggering a compiler bug) and automatically produces a much smaller C/C++ file that has the same property. It is intended for use by people who discover and report bugs in compilers and other tools that process source code.
Incidentally, at the time of writing, their website looks pretty similar to this one.
I also wrote a blog post about it.
Victor Stinner wrote test.bisect_cmd
3 to debug failing
tests in CPython. This is useful when they only fail in certain arrangements!
This tool takes a slightly different approach from delta debugging and
instead uses random sampling.
Hacker News user eimrine also notes that this is similar to fuzzing, a technique to find bugs by changing input data or input programs.
Other implementations of/similar to C-Reduce:
I saw somewhere (but can no longer find the link to) someone’s implementation
of delta debugging on commit logs (more advanced than git bisect
). If you
find this, please send it my way.
It’s a super helpful tool and I highly recommend getting familiar with it. It has saved me hours in figuring out what commit caused problems. ↩
This is one of the few regressions that has stuck with me from my undergraduate algorithms course. ↩
Note that this has since been renamed from test.bisect
to
test.bisect_cmd
to avoid conflicting with the built-in bisect
module. ↩