Optimizing Django by not being silly

March 23, 2025

I just saw this post and it reminded me of a time when we had a similar situation, but with string operations in our VM. The project is now defunct but the code is open. Let’s go back in time.

It was 2018 and I had just joined a team that was bringing up a new Python runtime (then nicknamed Pyro, now called Skybison). The goal was to use the last 30 years of VM engineering research and folk knowledge to re-design everything from scratch for performance. Also, importantly, we were going to only have one way to encode strings: UTF-8¹ (this is important for later).

We started from a tiny C++ core that could only run bytecode produced by the CPython runtime’s compiler. When I joined, the focus was on adding new features. At the time, the big project was getting importlib working. We needed some string functions such as str.rpartition, so I wrote a quick and dirty implementation. This helped get find_spec and a bunch of other functions working.

Eventually, we got importlib, then we started work on “Django minimal”, then we started benchmarking “Django workload”. This was around a year later: October 2019.

It’s a really cool feeling watching a project grow up. What used to shell out to CPython to compile bytecode from inside the runtime (wild thing to do in practice, but it worked well for bringup) now had its own bytecode compiler and could run a webserver and was being profiled and optimized for performance.

Unfortunately, in order to be a fast Python runtime, you have to run Python code quickly. And we soon hit the weirdest bottleneck in Django URL parsing. It showed up as something ludicrous like strIndex taking 90% of total run-time for a given request.

Now, let me refresh your memory about UTF-8: indexing into an arbitrary string is an O(N) operation because the codepoints (characters, roughly) are variable length. You need to start from the beginning and count up until you hit the expected index. For runtimes that use UTF-8, you want to avoid loops that look like this:

for i in range(len(s)):
    s[i]  # each index operation is O(N)!

and instead do something like this:

for c in s:
    c  # we already have the character from str.__iter__, which is fast

Our C++ code to index into a string looked something like the following. The core bit is that str.__getitem__ calls offsetByCodePoints to turn the index (in codepoint-space) to an offset (in byte-space). This in turn calls offset which has to loop through the encoded string, jumping numChars bytes (depending on how wide each codepoint is):

bool METH(str, __getitem___intrinsic)(Thread* thread) {
  // ...
  if (0 <= idx && idx < len) {
    word offset = self.offsetByCodePoints(0, idx);
    if (offset < len) {
      // .. fetch the code point and put it on the stack ...
      return true;
    }
  }
  // ...
}

word RawDataArray::offsetByCodePoints(word index, word count) const {
  const byte* data = dataArrayData(*this);
  return offset(data, length(), index, count);
}

static inline word offset(const byte* data, word len, word index, word count) {
  if (count >= 0) {
    while (count-- && index < len) {
      index += UTF8::numChars(data[index]);
    }
    return Utils::minimum(index, len);
  }
  while (count < 0) {
    index--;
    if (index < 0) return -1;
    if (UTF8::isLeadByte(data[index])) count++;
  }
  return index;
}

Now, keeping all of that that in mind, let’s look at my implementation of str.rpartition, the hackily implemented function from earlier (full commit here):

class str:
    # ...
    def rpartition(self, sep):
        # TODO(T37438017): Write in C++
        before, itself, after = self[::-1].partition(sep[::-1])[::-1]
        return before[::-1], itself[::-1], after[::-1]

Haha. Oh. Oof. No. Want to count all the ways this is horrible? rpartition should be able to instead iterate backwards through the string once (fast!), splitting precisely twice. Instead, this:

reverses strings (allocating a bunch of short-lived new strings)
makes a bunch of method calls
reverses strings again

I rewrote the function to do the right thing, which was to have the surface implementation bits in Python (argument parsing, type checking) and then shell out to our fast core string utility:

class str:
    # ...
    def rpartition(self, sep):
        _str_guard(self)
        _str_guard(sep)
        return _str_rpartition(self, sep)

Of course, we didn’t have this fast core string utility yet, so I had to implement it. See the rewrite commit for some fun C++.

I don’t have benchmark numbers for you because this was six years ago and the project was still secret back then. All I remember was that strIndex completely disappeared from the profiles. You can trace some of the other fun performance work by paging through the commits around this time.

Conclusion

A potentially subtle conclusion for this article is that performance profiles might only tell you part of the story. If we looked only at the C++ profiles for our Python runtime, we would see that a runtime function, strIndex, was hot. The inference in this case should not be that we need to make strIndex faster.

Instead, the inference should be that strIndex is being called too frequently by some other function. While I knew it because I was the performance culprit, this is often much harder to reason about in bigger and older systems. A multi-language profiler might shed light on the fact that str.rpartition was called N times and in nested call frames str.__getitem__ was called N*1000 times (for example), and strIndex calls were indirectly coming from there. Then we might conclude that str.rpartition’s implementation left something to be desired.

Well, UTF-8 with surrogate pairs. But not UCS-1, UCS-2, or UCS-4. Bringing up a new codecs implementation led to one guy on the team being known as the UTF-8 guy. Sometimes UTF-8 guy would accidentally find heap buffer overflows in CPython while trying to understand how their CJK implementation worked. ↩