Prospero challenge, now with more garbage collection

March 24, 2025

Matt Keeter put up The Prospero Challenge, which is like catnip for me. It’s a well-scoped project: we have a slow program. Make it faster within these constraints. In this post, I will describe two very small changes that can speed up his sample program with minimal effort.

His sample tiny implementation, which I will reproduce here, uses Python and NumPy to compute pixels in parallel while parsing the input:

import numpy as np

with open('prospero.vm') as f:
    text = f.read().strip()

image_size = 1024
space = np.linspace(-1, 1, image_size)
(x, y) = np.meshgrid(space, -space)
v = {}

for line in text.split('\n'):
    if line.startswith('#'):
        continue
    [out, op, *args] = line.split()
    match op:
        case "var-x": v[out] = x
        case "var-y": v[out] = y
        case "const": v[out] = float(args[0])
        case "add": v[out] = v[args[0]] + v[args[1]]
        case "sub": v[out] = v[args[0]] - v[args[1]]
        case "mul": v[out] = v[args[0]] * v[args[1]]
        case "max": v[out] = np.maximum(v[args[0]], v[args[1]])
        case "min": v[out] = np.minimum(v[args[0]], v[args[1]])
        case "neg": v[out] = -v[args[0]]
        case "square": v[out] = v[args[0]] * v[args[0]]
        case "sqrt": v[out] = np.sqrt(v[args[0]])
        case _: raise Exception(f"unknown opcode '{op}'")
out = v[out]

with open('out.ppm', 'wb') as f: # write the image out
    f.write(f'P5\n{image_size} {image_size}\n255\n'.encode())
    f.write(((out < 0) * 255).astype(np.uint8).tobytes())

I saw that, and while I didn’t know what linspace or meshgrid did, I got the general idea. This program runs in 40 seconds on my machine.

Matt made the observation that oops, it’s storing every single frame and that takes up… uh…

>>> (8 * 1024 * 1024 * 7000) / 1000 / 1000 / 1000
58.720256
>>>

Oops, nearly 60 gigabytes of matrices. Fortunately for us, each intermediate variable is not needed for the entirety of the render: we can find out when each variable is used for the last time and insert a call to delete that frame from v afterward. This is called liveness analysis.

Garbage collection

To do that, we’ll need to load the program into memory first so we can analyze it:

prog = []

with open('prospero.vm') as f:
    for line in f:
        if line.startswith('#'):
            continue
        prog.append(line.split())

# ...

for (out, op, *args) in prog:
    match op:
    case "var-x": v[out] = x
    # ...

Now, in order to find out when a frame is used for the last time, we can seek backwards on the program to find out when it is used for the “first” time:

with_gc = []
seen = set()

for (out, op, *args) in reversed(prog):
    # Don't try to GC constants
    # Also don't add GC to the beginning (end) of the program
    if op != "const" and with_gc:
        for arg in args:
            # Delete variable at first (last) use
            if arg not in seen:
                with_gc.append(("_", "delete", arg))
        seen.update(args)
    with_gc.append((out, op, *args))

prog = with_gc[::-1]

We just have to be careful not to insert calls to delete constant values like 3.14 – they are not names that will be in v. We also don’t want to insert any instructions to delete data at the end of the program (in backwards mode, before we have added any instructions to with_gc).

Now we edit the interpreter to handle the new delete instruction:

for (out, op, *args) in prog:
    match op:
        case "delete": del v[args[0]]
        # ...

After that modification, the program takes only 10 seconds and no more than one gigabyte of RAM. Nice.

Matt also mentions the GPU. I have a 1080ti in my desktop, so let’s see if we can make use of it.

Enter the GPU

A quick internet search reveals that CuPy should be a drop-in replacement for NumPy that runs on the GPU. Cool. I installed it with uv pip install cupy-cuda11x and replaced the first line with:

import cupy as np

Incredibly, the whole program ran in one and a half seconds. It can also do a 2048x2048 in three seconds. I tried 4096x4096 and ran out of GPU RAM, I think.

Traceback (most recent call last):
  File "/home/max/Documents/code/minifidget/test.py", line 38, in <module>
    case "add": v[out] = v[args[0]] + v[args[1]]
  File "cupy/_core/core.pyx", line 1318, in cupy._core.core._ndarray_base.__add__
  File "cupy/_core/_kernel.pyx", line 1349, in cupy._core._kernel.ufunc.__call__
  File "cupy/_core/_kernel.pyx", line 645, in cupy._core._kernel._get_out_args_from_optionals
  File "cupy/_core/core.pyx", line 2884, in cupy._core.core._ndarray_init
  File "cupy/_core/core.pyx", line 257, in cupy._core.core._ndarray_base._init_fast
  File "cupy/cuda/memory.pyx", line 738, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1424, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1445, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1116, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1137, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 1382, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
  File "cupy/cuda/memory.pyx", line 1385, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 134,217,728 bytes (allocated so far: 5,771,395,072 bytes).

Can it be made faster still? Almost certainly yes. I know some folks working on some meta-JIT hackery that might make it go even faster…