What’s New in Python 3.13 - Performance Improvements

16 Nov 2024 at 10:41PM in Software

In this series looking at features introduced by every version of Python 3, we take a look at some of the new features added in Python 3.13. In this article we look at the improvements to the interpreter itself: the new option to disable the GIL, and the addition of a JIT compiler.

This is the 34th of the 34 articles that currently make up the “Python 3 Releases” series.


After our brief first article in this series looking at what’s new in Python 3.13, this second installment will drill into some of the deeper changes in this release—the optional removal of the GIL to improve multithreading performance, and the addition of a JIT compiler to improve performance more generally. So we have one removal and one addition, both of which have the potential to improve performance in their own ways.

GIL Removal

Let’s kick off by looking at the optional removal of the Global Interpreter Lock, or GIL. For anyone that’s unaware, this is a global mutex which is held whenever any thread is running Python bytecode—broadly speaking, this means that only a single thread can actually be executing code at any time.

Why Does The GIL Exist?

You might wonder what’s the point of that? Well essentially it avoids the overhead and risks of hundreds of fine-grained locks to protect individual data structures within the Python interpreter. This is why Python very, very rarely segfaults on you unlike, say, Perl which, in my experience1, will happily dump core on you the moment you concurrently access a builtin data structure from two threads.

Now this isn’t nearly so bad as it might sound, because code generally spends a lot of its time waiting for I/O, whether that’s from disk or the network, and the GIL is always released to allow other threads to execute during these blocking operations. Also, extension modules (e.g. in C/C++) have long had the option to release the GIL whilst performing long-running work that doesn’t touch Python objects.

Previous Solutions and Workarounds

A few mechanisms have been put in place for those people who have CPU-intensive multithreaded workloads and really need to spread them over multiple cores. One of the main ones is multiprocessing, which is a module that offers thread-like convenience but actually distributes the work across multiple processes. This actually works really well if you have a workflow which is the classic work queue and pool of workers, where each work item can be fully encapsulated and processed statelessly.
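To make that concrete, here’s a minimal sketch of that work-queue-and-pool pattern using multiprocessing.Pool. The hash_chunk function, the chunk size and the number of work items are all just illustrative choices rather than anything prescribed by the module.

import hashlib
import multiprocessing
import random


def hash_chunk(size: int) -> str:
    # Each work item is fully self-contained: generate `size` bytes of
    # random data and return its SHA1 digest.
    return hashlib.sha1(random.randbytes(size)).hexdigest()


def main() -> None:
    # The pool distributes items across worker processes, sidestepping the
    # GIL entirely at the cost of pickling the inputs and results.
    with multiprocessing.Pool() as pool:
        digests = pool.map(hash_chunk, [1024 * 1024] * 100)
    print(f"Hashed {len(digests)} chunks using {multiprocessing.cpu_count()} cores")


if __name__ == "__main__":
    main()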

Not everyone has such cases, however, and some use-cases rely heavily on shared data structures and the like. What are you to do in such cases?

There have been previous discussions and even attempts at removing the GIL, and back in 2007 Guido blogged about this. Back in 1996, whilst I was biting my nails and waiting for my exam results to see if I would get into my first choice of university, Greg Stein was busy working on a patch against version 1.4 of the language to remove the GIL. You can still find the README and the patch itself online if you’re interested.

Whilst the move from the GIL to more fine-grained locking might have improved multithreaded performance, unfortunately the additional overhead noticeably worsened single-threaded performance. In the context of the late 90’s and early 00’s, when multicore processors were still the preserve of only wealthier computer users, this was not regarded as an acceptable tradeoff.

Fast-forward to 2015, and another attempt by Larry Hastings which he rather vividly named the Gilectomy. You can still find the code on GitHub and watch his talk at PyCon 2016 about it. I won’t go into the details, but Larry explains it well and also Backblaze has a great little article which provides some more technical context. Suffice to say, the issue was still primarily direct and indirect performance consequences rendering the effort unsatisfactory.

Final Approach

So this brings us to the third main attempt by Sam Gross, which has eventually grown into PEP 703 and the option to remove the GIL that we see in Python 3.13 today.

You can still find his original nogil repo, although that’s no longer being worked on now that the feature is included in the official tree. You can also find the original design document. But if you want to know how things are working now, as opposed to the historical context, your best bet is to skip all that and just check out the PEP.

There’s some excellent and detailed technical discussion in the PEP, but for those who want a slightly higher-level overview I’ve tried to summarise the three main sets of changes in the following sections:

  • Reference counting
  • Memory management
  • Container thread-safety

If you’re not interested in the theory and just want to see how this behaves in practice, feel free to skip down to the Trying It Out section below.

Reference Counting

The first thing the GIL protects is the reference count on objects—it’s critical that this is accurate, since it determines the lifetime of the object2. Using the GIL, the reference count doesn’t need any special synchronisation, and can be incremented and decremented freely.

Biased Reference Counting

To avoid the overheads of adding locking to all refcount accesses, a technique called biased reference counting is used. This uses two separate refcounts for each object—one for the owning thread and one for everything else. This means that refcount changes by the owning thread3 can avoid the overheads of synchronisation, since only the one thread uses that count. Other threads, however, must synchronise their access and pay the performance cost—but these other accesses are assumed to be typically a lot less common.

Within the PyObject structure, the ownership of an object is indicated by an ob_tid field which tracks the owning thread ID—a value of zero indicates that no thread currently owns this object. The ob_ref_local field tracks the local reference count, and the ob_ref_shared value counts all the other references.

PyObject structure when GIL is disabled

There’s some more detail in that the low-order bits of ob_ref_shared are used as flags to track state which would require more care on deallocation, allowing the common case to take a more efficient path.

One such case is if any weak references have been created; another is if the shared reference count has become negative, due to ownership transfers between the owning thread and other threads leading to imbalances in the increments and decrements of the shared refcount. In these cases, the two reference counts are merged and the biased reference counting scheme is disabled for this object, which requires the slower deallocation path.

If you want the full, gory details about this, go read the PEP.
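As a rough mental model, here’s the fast/slow path split sketched in pure Python. This is only a conceptual sketch: the real implementation is C inside the PyObject header and uses atomic instructions, so the class and the lock below are stand-ins rather than anything that exists in CPython.

import threading


class BiasedRefCount:
    # Simplified illustration of biased reference counting; field names
    # mirror the PEP but the logic is heavily reduced.
    def __init__(self) -> None:
        self.ob_tid = threading.get_ident()   # owning thread
        self.ob_ref_local = 1                 # owner-only count, no locking
        self.ob_ref_shared = 0                # everyone else
        self._shared_lock = threading.Lock()  # stands in for atomic ops

    def incref(self) -> None:
        if threading.get_ident() == self.ob_tid:
            # Fast path: only the owning thread ever touches ob_ref_local,
            # so a plain increment is safe.
            self.ob_ref_local += 1
        else:
            # Slow path: other threads must synchronise their access.
            with self._shared_lock:
                self.ob_ref_shared += 1

    def decref(self) -> None:
        if threading.get_ident() == self.ob_tid:
            self.ob_ref_local -= 1
        else:
            with self._shared_lock:
                self.ob_ref_shared -= 1
        # Deallocation happens when the *merged* count reaches zero; the
        # real code handles that with the flag bits in ob_ref_shared.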

Immortalisation

In addition to this, immortalisation is used to avoid reference counting entirely for some objects. This is simply that singletons like True, False and None, as well as interned strings and some other values, are kept alive for the lifetime of the interpreter, so refcount changes on them are ignored and hence incur no performance hit.

Immortality is conferred by setting every bit in ob_ref_local—after millennia of humanity searching for this secret, who knew it was so easy?
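You can observe the effect from Python, at least indirectly. On 3.12 and later, where PEP 683 introduced immortal objects, the reported refcount of None is a fixed sentinel value that doesn’t move however many references you create. Here’s a quick check which makes no assumptions about the exact number reported:

import sys

# None is immortal, so its reported refcount is a fixed sentinel value and
# piling up new references doesn't change it.  (The exact number is an
# implementation detail, so we only check that it stays the same.)
before = sys.getrefcount(None)
lots_of_refs = [None] * 1_000_000
after = sys.getrefcount(None)
print(f"before={before} after={after} unchanged={before == after}")

# An ordinary object, by contrast, sees its refcount move as expected.
obj = object()
print(sys.getrefcount(obj))   # baseline
more = [obj] * 10
print(sys.getrefcount(obj))   # ten higher than the baseline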

Deferred Reference Counting

The final reference counting technique is deferred reference counting, which is used for objects such as top-level functions, code objects, modules and methods, which are frequently accessed by multiple threads, but which should still not be immortal. For objects using deferred reference counting4, the interpreter skips the usual changes to the refcount as objects are pushed and popped from a thread’s stack.

Instead, garbage collection (GC) halts all threads so that even without the GIL, it can safely compute the skipped refcount updates by looking for deferred references from thread stacks—I believe that transitive references by other objects are never deferred, so will be tracked by the actual refcount. This does mean that the destruction of some objects might be deferred to GC when previously it would have been on final refcount decrement, but since these types of objects commonly have cyclic references anyway then it may not make any difference in most cases.

As a point of detail, there are two stop-the-world thread pauses in each run of the GC. The first is for an initial cycle to detect garbage which is involved in reference cycles, then the threads are restarted while __del__() methods are called. Once these complete, threads are paused once more while objects that remain unreachable are actually deleted.

Also, the GC has been made non-generational, as these stop-the-world pauses would be disruptive if carried out frequently for a younger generation of objects. As a result, all objects are held in the same single generation when the GIL is removed.

Thread States

To support the ability of the GC to stop threads, all threads are now in one of three states:

  • ATTACHED when they would have otherwise held the GIL.
  • DETACHED when they would have otherwise released the GIL.
  • GC used for stopping threads for GC.

When the GC needs to stop threads, it can safely transition any thread in DETACHED state into GC using an atomic CAS operation. Any thread which is in ATTACHED state is cooperatively asked to pause itself and set its own status to GC—this is handled during the eval breaker, which I briefly discussed in one of the 3.12 articles.

Once all threads are in GC state, aside from the one actually performing the GC, then the garbage collection can happen. Once it’s complete, all the threads in GC are set to DETACHED and woken up—if they were previously ATTACHED then they can go back to that state and resume running Python code, and if they were previously DETACHED they ignore the notification and stay that way.

Memory Management

The next thing protected by the GIL is the internal pymalloc memory allocator, which is optimised for small object allocations. You can read more about this in the official documentation if you’re interested.

The implementation of this isn’t thread-safe—it’s never needed to be, because it’s always been protected by the GIL. As a result, when the GIL is removed this allocator is replaced with mimalloc, which is an open source general-purpose memory allocator developed by Microsoft.

Some changes have been made to support lock-free retrieval of objects from dict and list objects—we’ll look at these in the section on Container Thread-Safety.

As a result of the changes, the PEP does impose some additional restrictions, so when you’re compiling in free-threaded mode you must abide by the following:

  • Python objects must be allocated through Python’s own APIs, such as PyObject_Malloc().
  • The PyObject_Malloc() call must only be used for allocating objects, not buffers or other storage.
  • The PyMem_SetAllocator() API must only be used to set allocators which themselves eventually delegate to something like PyObject_Malloc().

Container Thread-Safety

The basic means to provide actual thread safety, with the GIL removed, is per-object mutexes, as you might have guessed. To make modifications to a dict, list, set, or similar, code needs to be holding the lock. Specialised operations which require internal access to two or more objects at a time, such as the specialisation of list.extend(iterable) where iterable is a set, require locks on both objects to be held. You might immediately say this has the potential to introduce a risk of deadlocks, and you’d be right—we’ll look at this in a second.
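From Python code the important point is that the consistency you got from the GIL should still hold. As a quick sanity check, assuming (as the PEP intends) that a single append remains atomic under the per-object locks, concurrent mutation of a shared list shouldn’t lose updates on either build:

import threading

# Each append takes the list's per-object lock on the free-threaded build
# (or relies on the GIL on a standard build), so no updates should be lost.
shared: list[int] = []


def worker(n: int) -> None:
    for _ in range(100_000):
        shared.append(n)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # expect 800000 on both builds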

Optimistic Lock Avoidance

To improve performance, however, some read operations have been optimised to not require taking a lock, and this optimisation requires some cooperation from the memory allocator, which I mentioned earlier. The operations which avoid locking are:

  • Retrieving a list item (i.e. list[idx])
  • Retrieving a dict item (i.e. dict[key])
  • Iterating to the next item from a list or dict
  • List containment checks (i.e. elem in list)

To see why allocator cooperation is required to support this, let’s consider what needs to happen. When you want to retrieve an item, you need to check whether its reference count is zero and, if it’s not, increment it. The increment is because we now have a new reference to the object, and the check for zero is to make sure that we don’t obtain a reference to an object which has already dropped out of scope and either been deleted, or will be at the next GC cycle.

You can use a conditional increment operation, which increments if the value is non-zero already, and returns an error if it’s zero. Without any locking, however, the object’s memory may have already been reused for some other purpose, so we can’t even be sure the reference count still exists at this location in memory—this makes a naive conditional increment unsafe.

To resolve this, the GIL removal changes arrange that even if the object is freed, the reference count remains at the same location in memory—this makes the lock-free conditional increment operation safe even if you’re performing the check after the object has been deleted.
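Here’s a toy model of that conditional increment, just to make the shape of the logic concrete. None of these names exist in CPython: ToyObject and try_incref are purely illustrative, and a Python lock stands in for what is really an atomic operation in C.

import threading


class ToyObject:
    def __init__(self) -> None:
        self.refcount = 1
        self._lock = threading.Lock()  # stands in for atomic instructions

    def try_incref(self) -> bool:
        # Increment only if the count is still non-zero, i.e. the object
        # hasn't already been freed out from under us.
        with self._lock:
            if self.refcount == 0:
                return False
            self.refcount += 1
            return True


def lockfree_getitem(items: list, idx: int):
    # Read the slot without taking the list's lock...
    obj = items[idx]
    # ...then try to take our own reference.  If that fails, the object died
    # concurrently and the real implementation retries via the locked path.
    if obj.try_incref():
        return obj
    raise LookupError("object freed concurrently; fall back to the locked path")


items = [ToyObject() for _ in range(3)]
print(lockfree_getitem(items, 1).refcount)  # 2: the original plus our new reference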

Allocator Support

To see how the allocator support works, we need to go over some terminology of how mimalloc manages its memory. In mimalloc, allocations are known as blocks and are grouped into pages5 of consecutive blocks of the same size, and pages of blocks of different sizes are grouped into heaps. If no blocks in a page are allocated, it can be reinitialised with a different block size if required.

Mimalloc memory structure

The changes that Python has made are to use three separate heaps for different categories of objects, which hold their reference counts at different offsets within the block. Since every page belongs to exactly one heap, and each heap only contains objects with a consistent offset, all of the objects within a given page hold their reference count at the same offset into their block.

The three heaps are:

  • Garbage-collected objects with a managed dictionary (i.e. which have a __dict__ field), because this precedes the PyObject structure in memory.
  • Garbage-collected objects without a managed dictionary.
  • Non-garbage-collected objects, such as those that have been immortalised.

This means that the GC can easily find all the objects it should be examining within the first two heaps, and within each the reference count will always be at the same offset in each block.

The other aspect to this is that pages are prevented from being reused in other heaps or for a different block size, as this would destroy the guarantees outlined above. Of course, pages do need to be reused at some point to allow the allocator to adapt to the changing memory needs of a running application—as a result the restrictions are only temporary.

To determine when it’s safe to release a page so it can be used for a different block size, a different heap, or released back to the operating system, a series of per-thread sequence counters is used. A single global, monotonically increasing write sequence number is used, and whenever a page becomes free it’s tagged with the current value of this counter, and the counter is atomically incremented.

Each thread also observes this counter at safe intervals6 when it’s not performing any access to list or dict objects, and takes a copy of it as its local sequence number. A global read sequence number is derived any time a page might be reused, and this comprises the minimum of all of the current threads’ local sequence numbers.

Given all of this, a page may be safely reused when the read sequence number is greater than the page’s tagged sequence number. This ensures that all threads have exited any list or dict accesses begun before the page became empty, which guarantees that no thread is currently relying on the page reuse restrictions outlined above.

Page reuse counter system
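The bookkeeping is perhaps easier to see in code than in prose, so here’s a heavily simplified sketch of the scheme. The names and data structures are made up for illustration, the real state lives inside mimalloc and CPython’s C code, and the lock stands in for an atomic increment.

import threading

# A page tagged with sequence number S may be repurposed once every
# thread's locally observed sequence number is greater than S.
write_seq = 1                  # global, monotonically increasing
write_lock = threading.Lock()  # stands in for an atomic increment
thread_local_seq = {}          # thread id -> last observed write_seq
freed_pages = []               # (page, sequence number when it emptied)


def page_became_empty(page) -> None:
    global write_seq
    with write_lock:
        # Tag the empty page with the current write sequence number, then
        # bump the counter.
        freed_pages.append((page, write_seq))
        write_seq += 1


def thread_checkpoint() -> None:
    # Called at a "safe interval", i.e. when this thread is not in the
    # middle of a lock-free list/dict access.
    thread_local_seq[threading.get_ident()] = write_seq


def reusable_pages() -> list:
    # The read sequence number is the minimum of every thread's local copy;
    # any page tagged strictly before that point is safe to reuse.
    if not thread_local_seq:
        return []
    read_seq = min(thread_local_seq.values())
    return [page for page, tag in freed_pages if read_seq > tag]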

Critical Sections

Finally, we come back to the topic of deadlocks that I mentioned above. Whenever a thread of execution holds more than one lock, there always exists the possibility of deadlocks. If thread 1 acquires lock A, and thread 2 acquires lock B, and then they both attempt to acquire the other lock then you have a classic deadlock and your application will be stuck forever.

In some simple cases where a deterministic number of locks is required, this can be avoided by every thread acquiring locks in the same order—this could work in the specialised operations I mentioned above, which require only two locks. But in the general case, this strategy is of limited use.
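As a pure-Python illustration of that ordering trick, here’s a small helper which acquires a pair of locks in a single consistent order (by id(), much as Py_BEGIN_CRITICAL_SECTION2 orders by memory address). The acquire_both helper is my own invention, not anything in the standard library.

import threading
from contextlib import contextmanager


@contextmanager
def acquire_both(lock_a: threading.Lock, lock_b: threading.Lock):
    # Always take the two locks in a consistent global order (here, by id()),
    # so two threads locking the same pair can never deadlock against each
    # other.  Note that the real macros also release any enclosing locks.
    first, second = sorted((lock_a, lock_b), key=id)
    with first, second:
        yield


list_lock, set_lock = threading.Lock(), threading.Lock()

with acquire_both(list_lock, set_lock):
    ...  # safe to work with both objects here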

Instead, the general approach is that whenever a nested operation requires an additional lock, any outer locks are released—this ensures the code only holds one lock at a time. When the nested operation ends, the inner lock is released and any outer locks are reacquired. Additionally, locks are also released around potentially blocking operations—i.e. those operations that, prior to these changes, would have released the GIL anyway. This prevents deadlocks between locks and blocking operations which may themselves indirectly depend on those locks.

As a refinement to improve performance, however, such outer lock releases are deferred until the thread would actually block—if no blocking operations occur, such as contention on a lock or I/O, then the outer locks are never actually released.

There are four macros defined to help extension modules with this:

Py_BEGIN_CRITICAL_SECTION(PyObject *ob)
Begin a new critical section and acquire the mutex for ob, releasing any enclosing critical section locks before blocking if it’s already locked.
Py_END_CRITICAL_SECTION
End the most recent Py_BEGIN_CRITICAL_SECTION(), unlocking the mutex.
Py_BEGIN_CRITICAL_SECTION2(PyObject *ob1, PyObject *ob2)
As above but obtain locks on ob1 and ob2 in order of their memory address, to ensure consistent ordering. If either mutex is already locked, enclosing locks are released7.
Py_END_CRITICAL_SECTION2
End the most recent Py_BEGIN_CRITICAL_SECTION2(), unlocking the mutex.

Also, whenever a thread enters DETACHED state, it releases all locks in any active critical sections. When a thread enters ATTACHED state, only the most recent critical section lock should be reacquired.

Trying It Out

OK, so that was a lot of words, maybe it’s time to put things into action and see how this behaves. For transparency and some vague attempt at reproducibility, I’m using the Python builds provided by the uv tool under the names 3.13.0 and 3.13.0t. The full sys.version strings reported by the two builds are:

3.13.0 (main, Oct 16 2024, 08:05:40) [Clang 18.1.8 ]
3.13.0 experimental free-threading build (main, Oct 16 2024, 08:24:33) [Clang 18.1.8 ]

As an aside, it might be useful to note that you can tell if the GIL is enabled or not at runtime. Firstly, you can check if the compilation option was set using sysconfig.get_config_var("Py_GIL_DISABLED"). Secondly, if you’re using at least Python 3.13, you can call sys._is_gil_enabled() to check if the GIL is actually in use or not.
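Putting those two checks together, assuming the sysconfig variable and sys._is_gil_enabled() behave as just described:

import sys
import sysconfig

# Was this interpreter built with free-threading support at all?
print(sysconfig.get_config_var("Py_GIL_DISABLED"))  # 1 on the 3.13.0t build

# Is the GIL actually disabled in this running process?  (3.13+ only.)
print(sys._is_gil_enabled())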

CPU-Intensive Threading

To start with I thought I’d try and maximise the CPU-intensive nature of the application by calculating SHA1 checksums of random data.

I’ll say at this point that performance measurements are always really hard to do properly, and there may be flaws with this—for example, perhaps SHA1 checksums utilise some special coprocessor in my MacBook which limits scalability, or perhaps random was doing I/O to get entropy from the OS which acted as a false bottleneck. So this is very unscientific indeed, and all I’m trying to do is get some sort of vague idea.

That said, here’s the code I used:

import hashlib
import random
import sys
import threading
import time


def pointlessly_busy() -> None:
    for i in range(1_000):
        data = random.randbytes(1024 * 1024)
        hashlib.sha1(data).hexdigest()


def run_threads(num_threads: int) -> None:
    threads = {
        threading.Thread(target=pointlessly_busy, daemon=True)
        for i in range(num_threads)
    }
    print(f"Starting {num_threads} threads")
    started = time.monotonic()
    for thread in threads:
        thread.start()
    print("Waiting for threads to finish")
    for thread in threads:
        thread.join()
    print(f"Total time: {time.monotonic() - started:.2f}")


def main() -> int:
    run_threads(30)
    return 0


if __name__ == "__main__":
    sys.exit(main())

As you can see, we spin up 30 threads, each of which loops 1000 times and each time generates 1MB of random data and calculates the SHA1 hash of it, which it throws away. We time how long it takes between starting all the threads and them all completing.

Of course, it’s still going around a loop in Python bytecode, and I could have given it much chunkier work to do than just a SHA1 hash of 1MB of data, but this was somewhat intentional—if the work is all in a C extension then in principle it might release the GIL for that work, erasing any advantage of the free-threaded version.

Using the Python build with the GIL enabled, I got the following results from two consecutive runs:

$ python cpu-heavy-threading.py
Starting 30 threads
Waiting for threads to finish
Total time: 49.20
$ python cpu-heavy-threading.py
Starting 30 threads
Waiting for threads to finish
Total time: 49.02

And then with the GIL removed I got the following results:

$ python cpu-heavy-threading.py
Starting 30 threads
Waiting for threads to finish
Total time: 30.51
$ python cpu-heavy-threading.py
Starting 30 threads
Waiting for threads to finish
Total time: 30.69

You can see here that there’s a clear improvement, as you’d hope. Removing the GIL appears to reduce the running time by 37% on my 8-core8 M2 MacBook Pro. This is perhaps not as much as one might have hoped for what is ostensibly a fairly CPU-intensive task, but it’s certainly a significant performance boost.

Mixed Load Threading

I thought I’d try a variant of this which pushes in a bunch of I/O to slow things down a bit. In principle, you might expect that the delays in I/O might reduce the benefit of removing the GIL, but let’s find out.

Here’s the code for this version:

import hashlib
import os
import queue
import random
import sys
import tempfile
import threading
import time


def feed(work: queue.SimpleQueue) -> None:
    for i in range(1_000):
        with tempfile.NamedTemporaryFile(delete=False) as fd:
            fd.write(random.randbytes(1024 * 1024))
            work.put(fd.name)


def eat(work: queue.SimpleQueue) -> None:
    for i in range(1_000):
        filename = work.get()
        with open(filename, "rb") as fd:
            hashlib.sha1(fd.read()).hexdigest()
        os.remove(filename)


def run_threads(num_threads: int) -> None:
    work = queue.SimpleQueue()
    feeders = {
        threading.Thread(target=feed, args=(work,), daemon=True)
        for i in range(num_threads)
    }
    eaters = {
        threading.Thread(target=eat, args=(work,), daemon=True)
        for i in range(num_threads)
    }
    print(f"Starting {num_threads} feeder threads")
    started = time.monotonic()
    for thread in feeders:
        thread.start()
    print(f"Starting {num_threads} eater threads")
    for thread in eaters:
        thread.start()
    print("Waiting for threads to finish")
    for thread in feeders:
        thread.join()
    for thread in eaters:
        thread.join()
    print(f"Total time: {time.monotonic() - started:.2f}")


def main() -> int:
    run_threads(30)
    return 0


if __name__ == "__main__":
    sys.exit(main())

This one starts up thirty pairs of threads, feeders and eaters9. Feeders generate files that contain 1MB of random data and put the filename in a queue, and eaters take the filenames off the queue, read the file and calculate the SHA1 hash of the content.

This is an interesting one because on the one hand the threads are blocked on conventional file I/O, but on the other hand they’re frequently running Python bytecode, which will be bottlenecked on acquiring the GIL.

First we have with the GIL enabled:

$ python mixed-load-threading.py
Starting 30 feeder threads
Starting 30 eater threads
Waiting for threads to finish
Total time: 60.46
$ python mixed-load-threading.py
Starting 30 feeder threads
Starting 30 eater threads
Waiting for threads to finish
Total time: 59.65

And then with the GIL removed:

$ python mixed-load-threading.py
Starting 30 feeder threads
Starting 30 eater threads
Waiting for threads to finish
Total time: 38.37
$ python mixed-load-threading.py
Starting 30 feeder threads
Starting 30 eater threads
Waiting for threads to finish
Total time: 37.81

Despite changing up the task, we’re still seeing a very consistent reduction in runtime of around 37%. I found this a little surprising, but it’s a good reminder of why it’s so important not to waste time on early optimisation—particularly in concurrent environments it’s often hard to predict the performance impact of code changes.

CPU-Intensive Single Threaded

Of course, even this cursory analysis wouldn’t be complete without considering the single-threaded case—what impact does disabling the GIL have on a CPU-intensive process which doesn’t use threads at all?

For this third and final case, I used the following code:

import hashlib
import random
import sys
import time


def pointlessly_busy() -> None:
    for i in range(30_000):
        data = random.randbytes(1024 * 1024)
        hashlib.sha1(data).hexdigest()


def run_work() -> None:
    print(f"Starting work")
    started = time.monotonic()
    pointlessly_busy()
    print(f"Total time: {time.monotonic() - started:.2f}")


def main() -> int:
    run_work()
    return 0


if __name__ == "__main__":
    sys.exit(main())

You can see the results with the GIL enabled:

$ python cpu-heavy-single-threaded.py
Starting work
Total time: 61.11
$ python cpu-heavy-single-threaded.py
Starting work
Total time: 60.58

And also with the GIL removed:

$ python cpu-heavy-single-threaded.py
Starting work
Total time: 60.29
$ python cpu-heavy-single-threaded.py
Starting work
Total time: 60.34

Not much to say here—removing the GIL clearly didn’t have a great deal of impact on this CPU-intensive code. That’s not to say it might not impair the performance of other, more general Python code, but doing that sort of detailed analysis is rather outside the scope of this article. Suffice it to say that we’ve demonstrated that, at least in some cases, there isn’t a particular performance cost to these changes, even if you’re not using threading.

JIT Compiler

Now we come to an entirely different change to the internals of the interpreter, and another one which, for now at least, needs to be enabled at build time—support for just-in-time compilation, more commonly referred to as JIT.

Before I run through this, full disclosure: I haven’t had the chance to play with this feature myself. I also don’t think noddy little examples like the ones for GIL removal above will really work—the nature of the feature is such that you’ll see the best performance improvements on long-running applications, not toys which only run for a few seconds. So, if you’re looking for real-world results here, I’m afraid you’ll be disappointed—I’m just going to talk through the theory for now.

Recap: Bytecode

Until now, Python has always been an interpreted bytecode language—source code is compiled to bytecode, which is an instruction set that’s similar to machine code in principle but at a much higher level. Python bytecode targets a stack-based machine which executes high-level instructions that are somewhat efficiently encoded. Some examples of Python bytecode instructions are:

BINARY_OP(op)
Pops the top two values off the stack, combines them with operator op and pushes the result back on the stack.
STORE_SUBSCR
Pops a key, then a container and then a value from the stack, and stores the value in the container under the key.
CALL(argc)
Pops a callable, then self (or NULL), then argc parameters from the stack, and calls the callable with those parameters.
RETURN_VALUE
Returns the top value of the stack as the result to the caller of a function.

In total there are over 120 bytecode instructions, and as you can see they are still quite high-level operations, which makes it comparatively easy for Python source code to be compiled into them.
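If you’ve never looked at bytecode before, the dis module in the standard library will show it to you. Here’s a trivial function chosen so that the instructions mentioned above appear in its disassembly (the exact listing varies between Python versions):

import dis


def store_and_double(container, key, value):
    container[key] = value
    return value * 2


# Print the stack-machine instructions the compiler emitted; expect to see
# the likes of LOAD_FAST, STORE_SUBSCR, BINARY_OP and RETURN_VALUE.
dis.dis(store_and_double)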

Aside from expansions to the bytecode instruction set, this has been the way that Python has run for a very long time. There was a bit of a change in 3.11, when the specialising adaptive interpreter was introduced, which improved performance by dynamically replacing some bytecode instructions with specialised versions, based on observations of how the code is actually used at runtime.

This is a useful performance improvement, but it’s still fundamentally bound by the structure of the existing bytecode instructions—all it can do is swap in specialised implementations of them optimised for specific cases.
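You can peek at this specialisation with dis too, by passing adaptive=True once a function has been run enough times to warm up. The specialised instruction names you see, if any, are version-specific implementation details, so treat this as a curiosity rather than a guarantee:

import dis


def add_ints(a: int, b: int) -> int:
    return a + b


# Warm the function up so the specialising interpreter has had a chance to
# quicken its bytecode...
for _ in range(10_000):
    add_ints(1, 2)

# ...then ask for the adaptive view.  On a warmed-up function the generic
# BINARY_OP may show up as a specialised form such as BINARY_OP_ADD_INT.
dis.dis(add_ints, adaptive=True)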

Introducing JIT

Now we’ll look at the other performance improvement in this release—the addition of just-in-time compilation (JIT) on some platforms, which is covered in detail in PEP 744.

Starting in the previous release, 3.12, the CPython interpreter has had code changes which mean it’s actually constructed from a series of macros which effectively form a C-like domain-specific language. This helped to keep the codebase manageable when complicated changes such as the specialising adaptive interpreter were introduced, but also opens more possibilities for how code is interpreted.

One possible use is to perform optimisations on sequences of bytecode instructions, where they are broken down into smaller micro-operations, and known sequences of these are executed in a more optimal manner. Apparently this has been included since early 3.13 builds, but it’s not enabled by default—the problem is that the improvements due to the optimisation were cancelled out by the additional overhead of the interpreter wading through these micro-operations.

There is a solution to the overhead issue, which is to compile a sequence of these operations to machine instructions that can be run directly, removing the overhead of interpreting them. That’s exactly what’s been done in this release.

Since the optimisations often depend on runtime behaviour, it doesn’t make sense to force a complete restructuring of the interpreter to apply them in advance. Instead, they’re applied as needed at runtime, in a more generalised version of the specialising interpreter, and hence the term JIT.

Copy and Patch

There are two issues to be solved here:

  1. Compiling Python code to performant bytecode.
  2. Generating efficient machine code for a specified bytecode sequence.

The core Python development team clearly possess the skills for the first problem, and to solve the second we come back to the switch towards the domain-specific language in which the interpreter is now written. The macros used to build the interpreter define the micro-operations into which bytecode instructions are broken down for optimisation. This means that core developers don’t need to concern themselves with separate updates to the JIT backend—they just add or change bytecode definitions, and everything else happens as a natural consequence.

This approach is known as copy-and-patch, in which known sequences of micro-operations are matched against templates of machine code, with memory addresses and registers filled in as required. If you want some bedtime reading, you can read the original paper published in ACM. The downside of this approach is that it can be less efficient than more rigorous approaches, but on the plus side it doesn’t require a complete rewrite of the Python interpreter—that seems like a definite advantage.

Runtime Behaviour

I’m no expert in this, but it seems like we start with the kind of tier 1 specialised bytecode that we already had as of Python 3.11. The profiling that’s already done for adaptive specialisation means that we already have information about how hot code is, and when there’s potential for more optimisation to be beneficial (i.e. the code is “hot enough”), that representation is converted to a tier 2 representation—the micro-operations referred to earlier.

A stack-based virtual machine can be used to interpret tier 2 code, the same as tier 1, but this is primarily for debugging purposes. The real benefit comes with a further pass to convert tier 2 code to machine code, which can then be executed. My understanding, which may be flawed, is summarised in the diagram below.

JIT process in Python

Enabling JIT

Since this is still experimental, you’ll need to enable it in the interpreter at compilation time. To do this, you pass the --enable-experimental-jit option when configuring the build, giving it one of these values:

  • no to disable tier 2 instructions and JIT
  • yes to enable JIT by default
  • yes-off to build JIT but turn it off by default
  • interpreter to enable tier 2 instructions, but not JIT

If the interpreter is compiled with JIT support, you can disable it at runtime by setting the PYTHON_JIT=0 environment variable. If it’s compiled but disabled by default, you can enable it with PYTHON_JIT=1.

At runtime you can check if the interpreter was compiled with JIT support by checking sysconfig.get_config_var("PY_CORE_CFLAGS") and looking for -D_Py_TIER2 and/or -D_Py_JIT—the latter is only set if you chose yes or yes-off for the JIT support during compilation.
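Combining that with the runtime switch mentioned above, a small check along these lines should tell you what your interpreter was built with, assuming the flags appear in PY_CORE_CFLAGS exactly as described:

import os
import sysconfig

cflags = sysconfig.get_config_var("PY_CORE_CFLAGS") or ""

# -D_Py_TIER2 indicates tier 2 (micro-op) support was compiled in, and
# -D_Py_JIT that the copy-and-patch JIT itself was built.
print("tier 2 built:", "_Py_TIER2" in cflags)
print("JIT built:   ", "_Py_JIT" in cflags)

# Whether the JIT is actually used also depends on the PYTHON_JIT
# environment variable, if it's set.
print("PYTHON_JIT = ", os.environ.get("PYTHON_JIT", "<unset>"))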

Conclusions

The GIL removal, I must say, is very promising, as it does seem like we finally have an approach which has comparatively minimal impact on single-threaded tasks, but which provides really noticeable performance improvements for multithreaded applications. I was also pleasantly surprised to see this performance improvement was beneficial even for fairly I/O-heavy workloads.

It’s interesting to see the lengths to which the developers have gone to minimise the impact of these changes in common cases, and it seems to have paid off. My suspicion with complex optimisations is that the overhead of all the additional checks often ends up outweighing the benefits, but at least based on my very crude tests that doesn’t appear to be the case here.

The JIT compiler, on the other hand, seems less certain to me. Of course, I’m only speculating based on what I’ve read, as I didn’t try it out myself, but it appears that the performance gains are so far fairly unremarkable, and I suspect this is one reason that it’s still not enabled by default.

Still, the overhead of bytecode interpretation is clearly one of the bottlenecks which makes Python that bit slower than some other languages, and I think this approach does show promise to speed things up without making interpreter maintenance needlessly difficult. The question is really whether they’ll be able to find sufficient further improvements to be worth enabling by default.

In any case, that’s it for this article. Next time I’ll be looking at one remaining relatively minor language improvement, which is the addition of defined semantics for updating the value returned by locals(); and also running through at least some of the standard library changes.


  1. Admittedly limited and a very long time ago… 

  2. Anyone who’s worked on an extension module and had to manage the reference counts themselves can attest to the nasty problems which can occur if even a single increment or decrement is missed. 

  3. I believe the “owning thread” is always just the one that created the object, although there may be some subtleties around passing of ownership which I’ve not yet appreciated. 

  4. This seems to be indicated by setting one of the bits in ob_gc_bits in the PyObject structure, and also setting a bit in ob_ref_shared. You can set it yourself by calling PyUnstable_Object_EnableDeferredRefcount() on the object, if you want to. If you look at the implementation of this function, you’ll see that it performs enough checks that you shouldn’t consider messing with these bits directly yourself, or you’ll risk all kinds of failures. 

  5. Note that these are not memory pages in the operating system sense—these are more like what Doug Lea’s allocator would refer to as bins. 

  6. The exact point used in the current implementation is during mimalloc’s slow-path allocator, which is called regularly enough to be useful but not so much as to cause overhead. In principle, however, the value could be observed any time the thread is not actively relying on the page reuse being blocked—i.e. not accessing any list or dict objects. 

  7. You may wonder why a separate macro is needed to acquire two locks rather than just nest two critical sections inside each other. The answer is that these operations require both locks to be held simultaneously, and if you nest critical sections then the inner one may release the lock in the outer one if the thread blocks. As a result you need a separate macro to capture the different semantics required. 

  8. Technically it has 12 cores, 8 of which are “performance cores” and 4 of which are “efficiency cores”. Since I wasn’t running this as a lower priority process, my assumption is that it would be using the performance cores, and hence 8 cores is what’s available. It’s possible that, given the large number of threads, it still spread over the additional efficiency cores—sorry, I’m not enough of an expert in macOS kernel task scheduling to tell you for sure! 

  9. Most people would call them producers and consumers, but today I fancied a change. Maybe I’m turning into a communist—or maybe I’m just a bit peckish as I write this, who can say. 
