Compilers for free with weval

May 19, 2024

Chris Fallin came and gave a talk to the Northeastern Programming Research Laboratory last month. He talked about his work on a new project called weval, a WebAssembly partial evaluator (and then helped me write this post!).

Partial evaluation is neat. In short, it’s all about taking an existing program, modifying it to hold some of its inputs as constants, and then letting the compiler/optimizer go hog wild on it. The result is still a program—not a value—and it’s usually faster than the original program.

The usual small example is the power function. If you have a function that takes two arguments, x and y, and returns x^y:

int power(int x, int y) {
  int result = 1;
  for (int i = 0; i < y; i++) {
    result *= x;
  }
  return result;
}

If you partially evaluate this function with respect to y at y = 5, you get a new function that takes one argument, x, and returns x^5:

int power_5(int x) {
  int result = 1;
  for (int i = 0; i < 5; i++) {
    result *= x;
  }
  return result;
}

Now, to you, this might not look that different from the original function. But to an optimizer, it is a new world of opportunity. The optimizer can unroll the loop and remove the conditional:

int power_5(int x) {
  return x * x * x * x * x;
}

weval does that for entire WebAssembly modules. WebAssembly modules that are normally much bigger than a small power function. You might want to use it if, for example, your WebAssembly module is an interpreter. Imagine a world where you have compiled a runtime such as SpiderMonkey or CPython to WebAssembly. You could then run your Python or JavaScript programs on the WebAssembly runtime, but they would be slower than if you had compiled them directly to WebAssembly. And even if you compiled the JS/Python directly to Wasm, it would probably be slow unless your compiler did some fancy static analysis. This is where weval comes in.

Enter weval

SpiderMonkey and CPython are both huge. Instead, we’re going to do a little demo of a tiny interpreter that I wrote with Chris. Our interpreter doesn’t do much—local variables, an accumulator, arithmetic, and branching. But it’s enough to show off the performance boosts that come with weval.

#define FOR_EACH_INSTRUCTION(V)                                                \
  V(LOAD_IMMEDIATE)                                                            \
  V(STORE_LOCAL)                                                               \
  V(LOAD_LOCAL)                                                                \
  V(PRINT)                                                                     \
  V(PRINTI)                                                                    \
  V(JMPNZ)                                                                     \
  V(INC)                                                                       \
  V(DEC)                                                                       \
  V(ADD)                                                                       \
  V(HALT)

It’s designed as a little loop that reads the next instructions and dispatches with a switch. It’s not the fastest design1, but that’s okay.

uword Execute(uword *program) {
  // ...
  while (true) {
    Instruction op = (Instruction)program[pc++];
    switch (op) {
    case LOAD_IMMEDIATE: {
      uword value = program[pc++];
      accumulator = (Object)value;
      break;
    }
    case STORE_LOCAL: {
      uword idx = program[pc++];
      LOCAL_AT_PUT(idx, accumulator);
      break;
    }
    case LOAD_LOCAL: {
      uword idx = program[pc++];
      accumulator = LOCAL_AT(idx);
      break;
    }
    case PRINT: {
      const char *msg = (const char *)program[pc++];
      printf("%s", msg);
      break;
    }
    case PRINTI: {
      printf("%" PRIu64, accumulator);
      break;
    }
    case HALT: {
      return accumulator;
    }
    case JMPNZ: {
      uword offset = program[pc++];
      if (accumulator != 0) {
        pc = offset;
      }
      break;
    }
    case INC: {
      accumulator++;
      break;
    }
    case DEC: {
      accumulator--;
      break;
    }
    case ADD: {
      uword idx1 = program[pc++];
      uword idx2 = program[pc++];
      accumulator = LOCAL_AT(idx1) + LOCAL_AT(idx2);
      break;
    }
    // ...
    }
  }
}
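
For reference, the // ... at the top of Execute elides the interpreter state. Here's a sketch of what it could look like; the exact types and sizes are assumptions (the values in our test program need 64 bits, so uword is presumably a 64-bit integer), not necessarily what the repo does:

typedef uint64_t uword;   // needs 64 bits: the final result is about 5e15
typedef uword Object;     // assumption: the accumulator holds plain integers

uword Execute(uword *program) {
  uword locals[256] = {0};  // backing storage for LOCAL_AT/LOCAL_AT_PUT
  uword pc = 0;             // program counter: an index into program[]
  Object accumulator = 0;   // the single accumulator register
  while (true) {
    // ... dispatch loop as shown above ...
  }
}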

Using this bytecode, we can write a simple program that adds up all the numbers from 1 to 100 million:

enum {
  result = 0,
  loopc = 1,
};
uword program[] = {
  // result = 0
  LOAD_IMMEDIATE, 0,
  STORE_LOCAL, result,
  // loopc = 100_000_000
  LOAD_IMMEDIATE, 100000000,
  STORE_LOCAL, loopc,

  // loop:
  // result += loopc
  ADD, result, loopc,
  STORE_LOCAL, result,
  // loopc--
  LOAD_LOCAL, loopc,
  DEC,
  STORE_LOCAL, loopc,
  // if loopc != 0, jump to loop
  JMPNZ, 8,

  // print result
  PRINT, (uword)"Result: ",
  LOAD_LOCAL, result,
  PRINTI,
  PRINT, (uword)"\n",
  HALT,
};
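
(Sanity check: 1 + 2 + … + 100,000,000 = 100,000,000 * 100,000,001 / 2 = 5,000,000,050,000,000, so that's the number we should see printed.)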

We can compile this interpreter program with any C or C++ compiler, feed the interpreter the bytecode, and it will print the result after about 350ms:

$ c++ -O2 peval.cc -o peval.out
$ ./peval.out
Result: 5000000050000000
$

But let’s assume you want to sandbox this program with WebAssembly. Thankfully, there’s this project called wasi-sdk that provides near drop-in replacements for Clang that target WebAssembly. We can compile the interpreter with wasi-sdk and run it with wasmtime or any other WebAssembly runtime that provides a WASI polyfill2. This runs in about 530ms:

$ /opt/wasi-sdk/bin/clang++ -O2 peval.cc -o peval.normal.wasm
$ wasmtime peval.normal.wasm
Result: 5000000050000000
$

But really what we wanted all along was to deploy the program—not the interpreter too, just the program—in the sandbox. We can do that by smushing the bytecode and the interpreter together with weval. This runs in about 40ms:

$ /opt/wasi-sdk/bin/clang++ -O2 -DDO_WEVAL -I include peval.cc -o peval.wasm
$ weval weval -i peval.wasm -o peval.wevaled.wasm -w
$ wasmtime peval.wevaled.wasm
Result: 5000000050000000
$

First of all: let’s step back. We had an interpreter written in C++ that took 350ms to run. We made it a little slower (530ms) by compiling it to WebAssembly. Then we got a roughly 13x speedup over that by using weval, which also leaves us nearly 9x faster than the native interpreter. That’s nuts. That’s probably close to what we would get if we hand-wrote a little compiler for our bytecode machine, but I did not have to write a compiler.

Method        Time (ms)
C++           350
WASI          530
WASI+weval    40 (!!)

Check out an asciicast if you want to feel that difference.

You might notice that I added some sneaky flags like -DDO_WEVAL and -I include in there. What’s going on?

Specializing the interpreter

Big picture: give the interpreter function access to constant bytecode.

Well, while weval works correctly on any combination of WebAssembly module and its input, it works best when you give it a little help and tell it what data is constant. In order to do that, we pre-initialize the WebAssembly module using a project called wizer. It gives you, the programmer, hooks to set up some memory before turning the running state back into a WebAssembly module. Let’s look at a diagram of the situation as it is right now:

[Diagram: "This is too many levels of nesting"]

Right now, at run-time, the interpreter loads the bytecode and runs it. The bytecode is not known ahead of time, so the interpreter has to be general.

In order to specialize the interpreter, we do three steps at WebAssembly module initialization time:

  1. Load the bytecode
  2. Create a specialized version of the interpreter function with constant arguments
  3. Run constant propagation and other compiler passes on the function

In this example, we know the bytecode is constant. We can tell weval this by using one of its helper intrinsics. In this case, we create a copy of the Execute function with constant arguments (the program). Now we have two functions: Execute and ExecuteSpecialized. All of this happens in the init function:

template <bool IsSpecialized>
uword Execute(uword *program) {
    // ...
}


uword (*ExecuteSpecialized)(uword *) = 0;

#ifdef DO_WEVAL
void init() {
  weval::weval(&ExecuteSpecialized, &Execute<true>, /*func_id=*/123,
               weval::SpecializeMemory<uword *>(program, sizeof program));
}

WIZER_INIT(init);
WEVAL_DEFINE_GLOBALS();
#endif

int main(int argc, char **argv) {
  if (ExecuteSpecialized) {
    ExecuteSpecialized(nullptr);
  } else {
    Execute<false>(program);
  }
}

Now that the code has been loaded and marked constant, the picture looks more like this:

While the code is now constant, weval isn’t magic. By default it won’t restructure control flow; it only simplifies branches that become constant (it doesn’t even know what an interpreter is!).
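
As a tiny illustration of the kind of simplification weval does get for free (this is generic constant folding, not weval's API, and dispatch_if_op_is_known is a made-up function):

// Once op is a compile-time constant, ordinary constant propagation folds
// this whole switch down to "return 30;".
int dispatch_if_op_is_known(void) {
  const int op = 3;
  switch (op) {
  case 1: return 10;
  case 2: return 20;
  case 3: return 30;
  default: return -1;
  }
}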

In order to start making ExecuteSpecialized faster, we have to drop little hints into the interpreter for weval to pick up. We want to actually specialize the control flow—make the control flow in the bytecode become control flow in the new function itself—so we tell weval about the PC to let it expand out the code.

Modifying the interpreter

Big picture: unroll the loop by specializing on the program counter.

We can start off by telling weval what variable to use as a specialization context. In this case, since we know that the bytecode is constant, we can specialize on the pc—the program counter. This lets weval completely unroll the interpreter loop.

uword Execute(uword *program) {
  while (true) {
    // ...
    switch (op) {
      // ...
    }
    weval::update_context(pc);
  }
}

After running weval on the bundled module, and letting weval unroll the loop, the picture looks like this:

This means that from this point forward, we have used weval to turn our interpreter into a compiler. There are only two optimizations so far—unrolling the loop and constant propagation—but they are very effective. The result is a fully specialized WebAssembly module with no interpreter loop and no bytecode dispatch.
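
To make that concrete, here is a hand-written sketch of what the specialized function for our example program is morally equivalent to. weval emits Wasm rather than C++, and ExecuteUnrolledSketch is a made-up name, so treat this as an illustration, not its actual output:

uword ExecuteUnrolledSketch() {
  // Each statement corresponds to one bytecode instruction; the dispatch
  // switch, the pc, and the bytecode reads are gone, but the locals accesses
  // through memory remain.
  Object accumulator = 0;                             // LOAD_IMMEDIATE 0
  LOCAL_AT_PUT(result, accumulator);                  // STORE_LOCAL result
  accumulator = 100000000;                            // LOAD_IMMEDIATE 100000000
  LOCAL_AT_PUT(loopc, accumulator);                   // STORE_LOCAL loopc
loop:                                                 // bytecode offset 8
  accumulator = LOCAL_AT(result) + LOCAL_AT(loopc);   // ADD result, loopc
  LOCAL_AT_PUT(result, accumulator);                  // STORE_LOCAL result
  accumulator = LOCAL_AT(loopc);                      // LOAD_LOCAL loopc
  accumulator--;                                      // DEC
  LOCAL_AT_PUT(loopc, accumulator);                   // STORE_LOCAL loopc
  if (accumulator != 0) goto loop;                    // JMPNZ 8
  printf("Result: ");                                 // PRINT
  accumulator = LOCAL_AT(result);                     // LOAD_LOCAL result
  printf("%" PRIu64, accumulator);                    // PRINTI
  printf("\n");                                       // PRINT
  return accumulator;                                 // HALT
}

Note that the backwards JMPNZ becomes a real backwards branch, so the bytecode-level loop can stay a loop instead of being flattened into a hundred million copies.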

Weval’s compiler passes are not magic. They are the same passes that any compiler would run on your code. They can unroll the interpreter loop and turn bytecode into straight-line WebAssembly code. But that code still has local variable writes push and local variable reads and all the other overhead of the interpreter. So there’s more to be done…

But what if we modified it more?

Big picture: turn interpreter local variables into WebAssembly local variables by telling weval where they are.

Memory can be hard to reason about in a compiler. weval isn’t a whole-program optimizing compiler and might not be able to prove that a memory location (in this case, the locals array) never escapes or aliases something else. But we, the interpreter authors, know that. So we can add more hints.

Right now, LOCAL_AT and LOCAL_AT_PUT are macros that read and write to the locals array:

#define LOCAL_AT(idx) (locals[idx])
#define LOCAL_AT_PUT(idx, val) (locals[idx] = val)

That’s all well and good for the interpreter, but it’s not great for the compiled code. What we really want is to give weval the ability to reason about each memory location—each local index—separately as an SSA value.

In order to do that, we use weval intrinsics: weval_read_reg and weval_write_reg. For maximum flexibility, we have a couple of macros that switch between the two:

#ifdef DO_WEVAL
#define LOCAL_AT(idx) (IsSpecialized ? weval_read_reg(idx) : locals[idx])
#define LOCAL_AT_PUT(idx, val)                                                 \
  if (IsSpecialized) {                                                         \
    weval_write_reg(idx, val);                                                 \
  } else {                                                                     \
    locals[idx] = val;                                                         \
  }
#else
#define LOCAL_AT(idx) (locals[idx])
#define LOCAL_AT_PUT(idx, val) (locals[idx] = val)
#endif

Now weval can reason about each local variable separately, and they eventually get compiled to normal WebAssembly locals.
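
Putting the pieces together, the fully specialized code ends up morally equivalent to something like this (again a hand-written illustration with a made-up name, not weval's actual output):

uword ExecutePromotedSketch() {
  // result and loopc now live in plain locals (and therefore Wasm locals /
  // SSA values) instead of going through the locals array in memory.
  uword result_reg = 0;
  uword loopc_reg = 100000000;
  uword accumulator = 0;
  do {
    result_reg += loopc_reg;
    accumulator = --loopc_reg;
  } while (accumulator != 0);
  printf("Result: ");
  printf("%" PRIu64, result_reg);
  printf("\n");
  return result_reg;
}

That's pretty much the code you would hope a hand-written compiler for this bytecode would produce.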

Wrapping up

The big idea here is that it’s possible to incrementally unravel an interpreter into a compiler by specializing on constant data and then doing normal compiler passes. The more you can specialize at build-time, the faster the resulting generated code will be.

Check out the code in my already old fork of weval. It includes surprise benchmarks of Wasm JITs in different JS runtimes, too!

Looking forward

Chris gave some more detail about how weval works in this talk (PDF), including a description of how the interpreter function is actually combined with the bytecode. The main idea is to use the PC values as a “context” in a context-sensitive dataflow analysis, so regular constant propagation will see the PC value and opcode for just one interpreter loop iteration, rather than the union of all of them (as a static analysis normally would). There are a bunch of fiddly details to make it work well, and Chris also plans to write a blog post covering weval and its application to SpiderMonkey soon.

Also, our interpreter is tiny and not very interesting on its own. It’s only useful to explain some weval concepts to you. But the same principles apply to much larger interpreters, too! There’s SpiderMonkey, yes, and the same could also probably be done for CPython, the main Python runtime. CPython even has support for being compiled to WebAssembly already!

Imagine compiling Python directly down to WebAssembly… maybe coming soon?

Wilder ideas

CPython already has support for vectorcall function pointers, which give a callable a swappable fast-path entry point. That’s a portable place to hook in JIT-compiled code. We could maybe use this to turn weval into a Wasm JIT for CPython.
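
For reference, the vectorcall hook is just a function pointer; its declaration in CPython's headers looks like this, and a JIT could point it at generated code instead of the interpreter:

// From CPython's C API (PEP 590). A callable exposes a function pointer with
// this shape; swapping it out is how a JIT could plug in compiled code.
typedef PyObject *(*vectorcallfunc)(PyObject *callable, PyObject *const *args,
                                    size_t nargsf, PyObject *kwnames);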

Similar projects

BuildIt is a similar project for C++ that takes a library approach.

  1. A lot of interpreters use a computed goto to dispatch instructions (there’s a small sketch of the pattern at the end of this post). This is a little faster than a switch statement, but it’s also not portable: it’s a GNU extension rather than standard C, so not every compiler supports it (MSVC, for example). 

  2. This is pretty straightforward and means the program can even be run in the browser. But it’s not the focus of this post. See the repo for the little JS wrapper that makes this possible.
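
For the curious, the computed-goto dispatch mentioned in footnote 1 looks roughly like this. It's a sketch of the pattern using a hypothetical two-opcode VM (opcode 0 increments the accumulator, opcode 1 halts), not the interpreter from this post, and it relies on the GNU labels-as-values extension, so it builds with GCC/Clang but not MSVC:

uword ExecuteThreaded(const uword *program) {
  // Table of label addresses, indexed by opcode (a GNU extension).
  static void *dispatch_table[] = {&&do_inc, &&do_halt};
  uword accumulator = 0;
  uword pc = 0;
#define DISPATCH() goto *dispatch_table[program[pc++]]
  DISPATCH();
do_inc:
  accumulator++;
  DISPATCH();
do_halt:
  return accumulator;
#undef DISPATCH
}

Each handler jumps straight to the next one instead of going back through a shared switch, which tends to be friendlier to the branch predictor.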