Matt Keeter put up The Prospero Challenge, which is like catnip for me. It’s a well-scoped project: we have a slow program. Make it faster within these constraints. In this post, I will describe two very small changes that can speed up his sample program with minimal effort.
His sample tiny implementation, which I will reproduce here, uses Python and NumPy to compute pixels in parallel while parsing the input:
import numpy as np
with open('prospero.vm') as f:
text = f.read().strip()
image_size = 1024
space = np.linspace(-1, 1, image_size)
(x, y) = np.meshgrid(space, -space)
v = {}
for line in text.split('\n'):
if line.startswith('#'):
continue
[out, op, *args] = line.split()
match op:
case "var-x": v[out] = x
case "var-y": v[out] = y
case "const": v[out] = float(args[0])
case "add": v[out] = v[args[0]] + v[args[1]]
case "sub": v[out] = v[args[0]] - v[args[1]]
case "mul": v[out] = v[args[0]] * v[args[1]]
case "max": v[out] = np.maximum(v[args[0]], v[args[1]])
case "min": v[out] = np.minimum(v[args[0]], v[args[1]])
case "neg": v[out] = -v[args[0]]
case "square": v[out] = v[args[0]] * v[args[0]]
case "sqrt": v[out] = np.sqrt(v[args[0]])
case _: raise Exception(f"unknown opcode '{op}'")
out = v[out]
with open('out.ppm', 'wb') as f: # write the image out
f.write(f'P5\n{image_size} {image_size}\n255\n'.encode())
f.write(((out < 0) * 255).astype(np.uint8).tobytes())
I saw that, and while I didn’t know what linspace
or meshgrid
did, I got
the general idea. This program runs in 40 seconds on my machine.
Matt made the observation that oops, it’s storing every single frame and that takes up… uh…
>>> (8 * 1024 * 1024 * 7000) / 1000 / 1000 / 1000
58.720256
>>>
Oops, nearly 60 gigabytes of matrices. Fortunately for us, each intermediate
variable is not needed for the entirety of the render: we can find out when
each variable is used for the last time and insert a call to delete that frame
from v
afterward. This is called liveness analysis.
To do that, we’ll need to load the program into memory first so we can analyze it:
prog = []
with open('prospero.vm') as f:
for line in f:
if line.startswith('#'):
continue
prog.append(line.split())
# ...
for (out, op, *args) in prog:
match op:
case "var-x": v[out] = x
# ...
Now, in order to find out when a frame is used for the last time, we can seek backwards on the program to find out when it is used for the “first” time:
with_gc = []
seen = set()
for (out, op, *args) in reversed(prog):
# Don't try to GC constants
# Also don't add GC to the beginning (end) of the program
if op != "const" and with_gc:
for arg in args:
# Delete variable at first (last) use
if arg not in seen:
with_gc.append(("_", "delete", arg))
seen.update(args)
with_gc.append((out, op, *args))
prog = with_gc[::-1]
We just have to be careful not to insert calls to delete constant values like
3.14
– they are not names that will be in v
. We also don’t want to insert
any instructions to delete data at the end of the program (in backwards mode,
before we have added any instructions to with_gc
).
Now we edit the interpreter to handle the new delete
instruction:
for (out, op, *args) in prog:
match op:
case "delete": del v[args[0]]
# ...
After that modification, the program takes only 10 seconds and no more than one gigabyte of RAM. Nice.
Matt also mentions the GPU. I have a 1080ti in my desktop, so let’s see if we can make use of it.
A quick internet search reveals that CuPy should be a drop-in replacement for
NumPy that runs on the GPU. Cool. I installed it with uv pip install
cupy-cuda11x
and replaced the first line with:
import cupy as np
Incredibly, the whole program ran in one and a half seconds. It can also do a 2048x2048 in three seconds. I tried 4096x4096 and ran out of GPU RAM, I think.
Traceback (most recent call last):
File "/home/max/Documents/code/minifidget/test.py", line 38, in <module>
case "add": v[out] = v[args[0]] + v[args[1]]
File "cupy/_core/core.pyx", line 1318, in cupy._core.core._ndarray_base.__add__
File "cupy/_core/_kernel.pyx", line 1349, in cupy._core._kernel.ufunc.__call__
File "cupy/_core/_kernel.pyx", line 645, in cupy._core._kernel._get_out_args_from_optionals
File "cupy/_core/core.pyx", line 2884, in cupy._core.core._ndarray_init
File "cupy/_core/core.pyx", line 257, in cupy._core.core._ndarray_base._init_fast
File "cupy/cuda/memory.pyx", line 738, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1424, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1445, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1116, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1137, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 1382, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
File "cupy/cuda/memory.pyx", line 1385, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 134,217,728 bytes (allocated so far: 5,771,395,072 bytes).
Can it be made faster still? Almost certainly yes. I know some folks working on some meta-JIT hackery that might make it go even faster…