What’s New in Python 3.12 - F-Strings and Interpreter Changes

6 Feb 2024 at 5:05PM in Software
Photo by Dids on Pexels

In this series looking at features introduced by every version of Python 3, we take a look at the new features added in Python 3.12 related to f-strings, inlining of comprehensions, improved error reporting, the new monitoring API for debuggers, and better isolation for sub-interpreters.

This is the 31st of the 32 articles that currently make up the “Python 3 Releases” series.


Having looked at various improvements to type hints in the previous article, it’s time to look at some more of the improvements to Python in release 3.12. In this article we look at the fact that f-strings have had some of their restrictions lifted, and also some improvements to the CPython interpreter—namely: inlining of comprehensions, improvements to error reporting, a new API for tools such as debuggers, and better isolation for sub-interpreters.

As an aside, the last two features are somewhat obscure, so those not using those features might like to skip that part—I’ve put a note into the text below where I feel some developers might like to drop off. Personally I think it’s useful to have some insight into how the interpreter works, but I realise that some Python developers work at a higher level and just want to get their work done with a minimum of fuss, and perhaps delving into bytecode instructions isn’t their idea of a fun time.

F-Strings Enhancements

Since their introduction in Python 3.6 by PEP 498, f-strings have not had a formal grammar and have been saddled with various restrictions. At the time these were necessary to be able to implement the feature without modifying the existing lexer for the language, but some of them are slightly annoying. Here are three of the major limitations:

  • Impossible to use the quotes delimiting the f-string within the embedded Python expressions—e.g. f"Name: {details["name"]}".
  • Backslashes are explicitly not allowed in f-string expressions.
  • F-string expressions couldn’t span multiple lines except in a triple-quoted string form, and even then couldn’t include comments.

What’s Changed in Python 3.12

Restrictions like these have been removed by giving f-strings a formal grammar and implementing dedicated parsing code for them, as described in detail in PEP 701. As a result, the three limitations above no longer apply.
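If you’re not sure which grammar a given interpreter is using, a quick probe like the one below can tell you. This is only a sketch of my own: the SNIPPETS table and the accepted() helper are hypothetical names, and the snippets are held as ordinary strings and passed to compile() so that the file itself still parses on older releases.

```python
import sys

# Expressions that rely on the relaxed f-string grammar, stored as source text.
SNIPPETS = {
    "reused quotes": r'''f"{d["k"]}"''',
    "backslash in expression": r'''f"{'a\'b'}"''',
    "multi-line with comment": 'f"{1 +\n    2  # a comment\n}"',
}

def accepted(source):
    """Return True if this interpreter can compile the expression."""
    try:
        compile(source, "<snippet>", "eval")
    except SyntaxError:
        return False
    return True

results = {label: accepted(source) for label, source in SNIPPETS.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

On Python 3.12 or later I’d expect all three to be accepted, and on earlier versions all three to be rejected.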

Quotes in F-String Expressions

It’s now possible to use any valid quotes in the expression within an f-string, even the same quotes as were used to delimit the f-string:

>>> details = {"name": "Andy", "favourite_colour": "Pantone 2172 C"}
>>> f"{details["name"]} likes the colour {details["favourite_colour"]}"
'Andy likes the colour Pantone 2172 C'

As a consequence, this also allows arbitrary nesting of f-strings, should you find you need such a quirky feature, because you’re no longer forced to select a different type of quote for each level of nesting.

>>> f"{f"{f"{f"{f"{f"{2**8}"}"}"}"}"}"
'256'

Multiline F-string Expressions

It’s now possible for expressions to span multiple lines, even if using the single-quoted form of the f-string, wherever a newline would be acceptable in an expression in normal Python code.

>>> f"The items are: {", ".join([
...     "one",
...     "two",
...     "three"
... ])}"
'The items are: one, two, three'

It’s important to remember that these are expressions, not arbitrary Python code, so you can’t include multiple statements. That said, the inclusion of the walrus operator in Python 3.8 means that expressions can have assignment side-effects. This was the case even before Python 3.12, but I always find it slightly surprising.

>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
>>> f"{(x := 123),x + 456}"
'(123, 579)'
>>> x
123

Backslashes in F-string Expressions

Before Python 3.12 you weren’t permitted to use backslashes in f-string expressions, even if this use wasn’t to escape the delimiters used for the f-string itself:

Python 3.11
>>> f"{'Andy\'s blog'}"
  File "<stdin>", line 1
    f"{'Andy\'s blog'}"
SyntaxError: f-string expression part cannot include a backslash

This now works as you’d expect in Python 3.12:

Python 3.12
>>> f"{'Andy\'s blog'}"
"Andy's blog"

Taking this in conjunction with the other changes, you can now see that f-strings are quite powerful tools for constructing fairly complex output formats from raw data:

>>> items = ["one", "two", "three"]
>>> print(f"The items are:\n{"\n".join(
...     # Generate a series of '1: item' lines
...     f"{n}: {item}" for n, item in enumerate(items, start=1)
... )}\nAnd that is all the items.")
The items are:
1: one
2: two
3: three
And that is all the items.

Of course, if you want to generate really complex outputs then you’re almost certainly better off with a more fully-featured templating engine such as Jinja. But for the likes of rich debugging statements during development, f-strings can be handy and fast, and there’ll always be utility in that.

Aside on F-strings in Logging

This isn’t related to Python 3.12 directly, but since f-strings have become more powerful, people might be tempted to use them more widely. Since I mentioned debugging output above, I wanted to highlight a downside of using f-strings with the functions in logging: with an f-string the message is always formatted before it’s passed to the logging function, whereas with the %-style formatting built in to logging the formatting is only done if the message is actually going to be emitted.

To illustrate this, consider this rather contrived example:

>>> import timeit
>>> setup = """
... import logging
... import time
... class MyObject:
...     def __str__(self):
...         time.sleep(0.5)
...         return "MyObject"
... x = MyObject()
... """
>>> timeit.timeit('logging.debug(f"Value: {x}")', setup=setup, number=100)
>>> timeit.timeit('logging.debug("Value: %s", x)', setup=setup, number=100)

Here we’re using time.sleep(0.5) to simulate an object which might be expensive to marshal into a string, so you only want to incur that expense if you’re actually going to use the result. Using f-strings, you’ll incur that cost every time you pass that object into a logging function, as you can see from the first case where logging.debug() has an f-string passed to it. Even though we’re not configured to emit debug-level logs, the cost of marshalling the f-string is still incurred. In the second example, you can see that using the %s format option and passing the x directly to logging.debug() avoids this cost unless the string is actually generated.
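If you’d rather not wait around for time.sleep(), the same effect can be shown by counting conversions instead of timing them. This is a minimal sketch: the Expensive class and the logger name are hypothetical, and the logger is set to INFO so that debug messages are discarded.

```python
import logging

class Expensive:
    """Counts how many times it is converted to a string."""
    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "Expensive"

logger = logging.getLogger("fstring_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)               # DEBUG messages are dropped

eager = Expensive()
logger.debug(f"Value: {eager}")        # formatted before debug() is even called

lazy = Expensive()
logger.debug("Value: %s", lazy)        # formatting deferred, and never needed

guarded = Expensive()
if logger.isEnabledFor(logging.DEBUG):  # explicit guard keeps the f-string
    logger.debug(f"Value: {guarded}")

print(eager.str_calls, lazy.str_calls, guarded.str_calls)  # 1 0 0
```

The isEnabledFor() guard is one way to keep f-strings without paying for them, at the cost of an extra line around each call.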

If you’re only logging builtin values then this might not be such a big deal, and if performance is an utmost consideration then you probably wouldn’t be using Python anyway. But if you tend to sprinkle your applications with lots of detailed debugging, and rely on the fact that it’s disabled in production environments, then this overhead is definitely something of which you should be aware.

Hopefully in some future Python version they might find some way to improve things so you don’t have to use the increasingly outdated %-style formatting, but without breaking compatibility with existing code. There are some possible workarounds detailed in the logging cookbook, using custom classes instead of message strings and/or judicious use of LoggerAdapter, but none of it has ever seemed graceful enough to me to make the switch. Formatting with % isn’t so bad after all.
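To give a flavour of the custom-class workaround, here’s a rough sketch in the spirit of the cookbook’s approach. The LazyBrace class and the logger name are my own hypothetical choices; the trick relies on logging only calling str() on the message object when a record is actually formatted.

```python
import io
import logging

class LazyBrace:
    """Message object that defers str.format() until it is rendered."""
    def __init__(self, fmt, *args, **kwargs):
        self.fmt = fmt
        self.args = args
        self.kwargs = kwargs

    def __str__(self):
        return self.fmt.format(*self.args, **self.kwargs)

stream = io.StringIO()
logger = logging.getLogger("lazy_brace_demo")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False                       # keep output out of the root logger
logger.setLevel(logging.INFO)

logger.debug(LazyBrace("dropped {}", 1))       # never formatted
logger.info(LazyBrace("{name} has {count} items", name="basket", count=3))

output = stream.getvalue()
print(output, end="")  # basket has 3 items
```

It works, but wrapping every message in a helper class is exactly the sort of ceremony that has kept me on %-style formatting.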

A final thought is that even if they do decide to make changes, efficient logging with f-strings would be challenging because you’d have to know not to instantiate the f-string before passing it in. You’d need some way for a function to indicate that it wants its arguments lazily evaluated, or some such. If someone can think of a cunning way to make it work, however, it would certainly make it more convenient to write good quality logging.

Other Benefits

The introduction of support into the grammar has also had some other beneficial side-effects:

  • The previous code had to parse the f-string from the standard string token generated by lexing—use of the proper lexer simplifies this and reduces the chance of bugs.
  • The new code is able to make use of the enhanced error messages applied to general Python code—this is particularly useful as expressions embedded within strings can be quite challenging in complicated cases, so better error reporting is especially helpful.
  • Other Python implementations beyond CPython will now find it easier to ensure they’ve implemented f-string support correctly, now that the feature is part of the official Python grammar.

Inlining Comprehensions

Experienced Python programmers will know how useful comprehensions can be for writing compact yet readable code. If you’re a little hazy on what comprehensions are, here are some examples to remind you, or you can go and read the official documentation on list-comprehensions, set-comprehensions and dict-comprehensions.

>>> [i**2 for i in range(1, 10) if i % 2 == 1]
[1, 9, 25, 49, 81]
>>> {i // 3 for i in range(10)}
{0, 1, 2, 3}
>>> {i: i**2 for i in range(6)}
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
>>> items = ["one", "two", "three"]
>>> print("\n".join(f"{n:>2}. {item}" for n, item in enumerate(items)))
 0. one
 1. two
 2. three

Prior to Python 3.12, comprehensions were compiled as nested anonymous functions within the parent function. This made for a convenient implementation, but wasn’t optimal for performance as function calls actually have noticeable overhead in Python.

In Python 3.12 these have simply been inlined, which offers similar semantics but with better runtime performance.

To see the difference, consider this trivial Python function:

def add_one(gen):
    return [i+i for i in gen]

Under Python 3.11, this results in the following bytecode:

1           0 RESUME                   0

2           2 LOAD_CONST               1 (<code object <listcomp> at 0x1028db500)
            4 MAKE_FUNCTION            0
            6 LOAD_FAST                0 (gen)
            8 GET_ITER
           10 PRECALL                  0
           14 CALL                     0
           24 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x1028db500:
2           0 RESUME                   0
            2 BUILD_LIST               0
            4 LOAD_FAST                0 (.0)
      >>    6 FOR_ITER                 7 (to 22)
            8 STORE_FAST               1 (i)
           10 LOAD_FAST                1 (i)
           12 LOAD_FAST                1 (i)
           14 BINARY_OP                0 (+)
           18 LIST_APPEND              2
           20 JUMP_BACKWARD            8 (to 6)
      >>   22 RETURN_VALUE

You can see here that the comprehension has been turned into a bare code object which is called from within the other function, with all the overheads and inefficiencies that incurs.

Now let’s do exactly the same under Python 3.12 and see what we get:

1           0 RESUME                   0

2           2 LOAD_FAST                0 (gen)
            4 GET_ITER
            6 LOAD_FAST_AND_CLEAR      1 (i)
            8 SWAP                     2
           10 BUILD_LIST               0
           12 SWAP                     2
      >>   14 FOR_ITER                 7 (to 32)
           18 STORE_FAST               1 (i)
           20 LOAD_FAST                1 (i)
           22 LOAD_FAST                1 (i)
           24 BINARY_OP                0 (+)
           28 LIST_APPEND              2
           30 JUMP_BACKWARD            9 (to 14)
      >>   32 END_FOR
           34 SWAP                     2
           36 STORE_FAST               1 (i)
           38 RETURN_VALUE
      >>   40 SWAP                     2
           42 POP_TOP
           44 SWAP                     2
           46 STORE_FAST               1 (i)
           48 RERAISE                  0

You can clearly see that what was a separate function has been inlined into the bytecode for the add_one() function itself.
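If you don’t fancy reading bytecode, you can also check for the nested code object directly. This is a small sketch using a hypothetical nested_code_names() helper, which lists any code objects stored among a function’s constants. It also confirms that inlining doesn’t let the loop variable leak into the enclosing function.

```python
import sys
import types

def add_one(gen):
    return [i + 1 for i in gen]

def nested_code_names(func):
    """Names of the code objects stored as constants of func."""
    return [const.co_name for const in func.__code__.co_consts
            if isinstance(const, types.CodeType)]

def scope_check():
    i = "outer"
    squares = [i * i for i in range(3)]  # comprehension has its own i
    return i, squares

# Expect ['<listcomp>'] before 3.12 and [] from 3.12 onwards.
print(sys.version_info[:2], nested_code_names(add_one))
print(scope_check())  # ('outer', [0, 1, 4])
```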

To get a very crude grasp of what difference this makes to the performance, I ran this on both Python 3.11 and 3.12:

>>> import timeit
>>> setup = """
... x = range(100)
... def add_one(gen):
...     return [i+1 for i in gen]
... """
>>> timeit.timeit("add_one(x)", setup=setup)

Now this is such a simple case that I wasn’t really sure whether the inlining would make a difference in terms of speed, but it did: a million repetitions took 2082 ms on Python 3.11 and 1773 ms on 3.12, which is a reduction of about 15%. Real-world code will often invoke comprehensions many times rather than just once, so the savings there may well be higher. The Python documentation claims up to twice as fast, in fact.

Note that as per PEP 709, generator expressions aren’t yet inlined. I can see how these would be a more complex case due to the way they’re paused and resumed, but perhaps someone will be brave enough to take it on in future.
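A quick way to convince yourself of this is to look for the nested code object, which is still present for a generator expression. The function name total_plus_one() here is just a hypothetical example.

```python
import types

def total_plus_one(items):
    return sum(i + 1 for i in items)

# A generator expression is still compiled to its own code object.
nested = [const.co_name for const in total_plus_one.__code__.co_consts
          if isinstance(const, types.CodeType)]
print(nested)  # ['<genexpr>']
```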

Improved Error Reporting

In common with some other recent Python releases, 3.12 contains some improvements to error reporting.

Missing Modules

Modules from the standard library are suggested as possible missing imports if a NameError reaches the outermost scope. Note that this doesn’t apply to any modules outside the standard library, however.

>>> os.listdir("/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'os' is not defined. Did you forget to import 'os'?

Missing self

Within methods, a NameError that might be due to forgetting self will be flagged. Note that this is done by doing a lookup in the object’s attributes at that moment, so even attributes set externally will be considered. This is illustrated below.

>>> class MyClass:
...     def my_method(self):
...         print(foo)
>>> x = MyClass()
>>> x.my_method()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in my_method
NameError: name 'foo' is not defined
>>> x.foo = 123
>>> x.my_method()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in my_method
NameError: name 'foo' is not defined. Did you mean: 'self.foo'?
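If you want to see the same thing from a script rather than the REPL, something like this sketch should do it. It renders the exception with the traceback module and prints only the final line. I haven’t shown the output because it depends on the version: on 3.12 I’d expect the hint to be appended, whereas older releases will just show the plain message.

```python
import traceback

class MyClass:
    def my_method(self):
        return foo  # deliberately missing the "self." prefix

obj = MyClass()
obj.foo = 123  # attribute set externally, as in the session above

try:
    obj.my_method()
except NameError as exc:
    rendered = traceback.format_exception(type(exc), exc, exc.__traceback__)
    last_line = rendered[-1].rstrip()

print(last_line)
```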

Errors Importing Submodules

If someone erroneously types import X from Y instead of the correct from Y import X, the resultant SyntaxError is now more helpful.

>>> import abc from xyz
  File "<stdin>", line 1
    import abc from xyz
SyntaxError: Did you mean to use 'from ... import ...' instead?

Also, where the correct syntax is used but the name of the submodule is incorrect, Python now offers suggestions based on the submodules defined.

>>> from logging import handler
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'handler' from 'logging' ([...]). Did you mean: 'Handler'?

One interesting point: I’d expected that case to offer me a suggestion of handlers, which is genuinely a submodule of logging. This illustrates that, while useful, it’s probably not wise to rely on these suggestions too heavily.

Low Impact Interpreter Monitoring

The rest of this article covers features which may be a little obscure for many users of Python. This section covers a new monitoring API, which is only likely to be used by people writing debugging or profiling tools; and the next section covers some improvements to the use of sub-interpreters, which is a feature general Python developers probably don’t need. Even if they’re niche use-cases, I think they’re useful and interesting insight into the interpreter, but if you don’t agree then you might like to skip the rest of this particular article.

Debuggers and profilers are useful tools, but their implementation often has significant impact on the performance of the code on which they’re running in Python. This is not always troublesome, but can be annoying when running long tests or reproducing rare issues under a debugger, so reducing the performance impact is always useful.

Enter PEP 669 which introduces a new API which allows these tools to be written with lower impact on performance. It leverages the dynamic updates to running Python code added as part of the specialising adaptive interpreter introduced by PEP 659 in Python 3.11.

These changes introduce a new module sys.monitoring to allow callbacks to be registered for events of interest, many of which can be set globally or only on specific modules. Once the callbacks are registered, the specific events which should trigger these callbacks can be activated—by default all events are deactivated, so there is no monitoring overhead.

First let’s take a look at what facilities the sys.monitoring module offers, and then I’ll try to figure out how this works at a high level.

Use of sys.monitoring

There are three steps to receiving monitoring events to your callback functions:

  1. Register your monitoring tool. To allow multiple tools to work together independently, there are six slots and only one tool can use a given slot at a time. Before you can do anything else, you have to claim one of the slots for your analysis tool.
  2. Register callbacks. For each event you want to monitor, you register a callback function which will be invoked when the event occurs.
  3. Enable events. To commence monitoring, you then select the events that you wish to trigger your callbacks. This can be globally for all events of a specific type, or locally to a particular piece of code.

We’ll look at each of these steps, and then see a simple example of their use.

Registering Analysis Tools

Registering the monitoring tool is just a case of calling use_tool_id() with the numerical slot number from 0 to 5 and the name of the tool, and assuming you don’t get a ValueError raised then you’re registered. You then need to pass this same slot number into all the remaining calls.

Although you can use any slot you like, there are some pre-defined constants for specific types of tool to help avoid conflicts:

  • sys.monitoring.DEBUGGER_ID = 0
  • sys.monitoring.COVERAGE_ID = 1
  • sys.monitoring.PROFILER_ID = 2
  • sys.monitoring.OPTIMIZER_ID = 5
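As a rough sketch of claiming a slot politely, the hypothetical claim_slot() helper below tries a preferred ID first and falls back to any other free one. The tool name is made up, and the hasattr() check just lets the snippet run harmlessly on versions before 3.12.

```python
import sys

TOOL_NAME = "slot-demo"  # hypothetical tool name

def claim_slot(preferred):
    """Claim the preferred slot, falling back to any free one."""
    mon = sys.monitoring
    candidates = [preferred] + [n for n in range(6) if n != preferred]
    for tool_id in candidates:
        try:
            mon.use_tool_id(tool_id, TOOL_NAME)
        except ValueError:  # slot already taken by another tool
            continue
        return tool_id
    raise RuntimeError("all six monitoring slots are in use")

if hasattr(sys, "monitoring"):  # Python 3.12+
    slot = claim_slot(sys.monitoring.PROFILER_ID)
    owner = sys.monitoring.get_tool(slot)             # name of the owning tool
    sys.monitoring.free_tool_id(slot)                 # hand the slot back
    owner_after_free = sys.monitoring.get_tool(slot)  # None once released
else:
    slot = owner = owner_after_free = None

print(slot, owner, owner_after_free)
```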

As an aside, there are actually two other slots, IDs 6 and 7, but they’re not directly available. These are used to maintain the functionality provided by existing sys.settrace(), which uses slot 6, and sys.setprofile(), which uses slot 7. Experiments have shown this new API is considerably faster than sys.settrace() if only a small set of events are active, but only slightly better if many events occur, such as triggering on every line of code.

Registering Callbacks

Callbacks are registered with the helpfully named register_callback() function, which takes the tool ID, the event which should trigger the callback, and the callback itself. Each tool can only have a single callback registered for a given event, so calling this a second time for the same event will replace the callback and return the original one. Registering a callback of None effectively just unregisters any previous callback.
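The replace-and-return behaviour is easy to check with a sketch like this one. The callbacks do nothing, the tool name is made up, and I’ve assumed slot 4 is free since it has no predefined role.

```python
import sys

def first_cb(code, offset):
    pass

def second_cb(code, offset):
    pass

def demo():
    mon = sys.monitoring
    tool_id = 4  # assumption: this slot is unclaimed
    mon.use_tool_id(tool_id, "callback-demo")
    try:
        event = mon.events.PY_START
        return [
            mon.register_callback(tool_id, event, first_cb),   # nothing replaced
            mon.register_callback(tool_id, event, second_cb),  # replaces first_cb
            mon.register_callback(tool_id, event, None),       # removes second_cb
        ]
    finally:
        mon.free_tool_id(tool_id)

# Only run where sys.monitoring exists (Python 3.12+).
previous = demo() if hasattr(sys, "monitoring") else None
print(previous)
```

Each call returns whatever was registered before it, so the list ends up as None followed by the two callbacks in order.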

The signature of these callbacks depends on the event handled, with groups of related events sharing the same callback signature. In the Events section below, I run through all the events and the callback signature required when registering them.

Enabling Events

The final step before you start receiving callbacks is that you need to enable the specific event types that you want to see. This means that your tool can register all its callbacks as a static initialisation, and then enable/disable specific events as needed to help the developer analyse their code.

Events can be enabled globally on all code using sys.monitoring.set_events(), or you can pass a specific code object (e.g. a function’s __code__ attribute) to sys.monitoring.set_local_events() to enable them only within that code. The events actually enabled will be the union of those enabled globally and those enabled for that code object, so there’s no way for a particular code object to exclude certain events. They’d need to be disabled globally and then enabled on each specific code object in which you’re interested.

Multiple events can be specified using the bitwise OR operator. For example, to enable both PY_START and PY_RESUME events globally for a debugger, you could do:

sys.monitoring.set_events(
    sys.monitoring.DEBUGGER_ID,
    sys.monitoring.events.PY_START | sys.monitoring.events.PY_RESUME)
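For the local flavour, here’s a sketch which enables PY_START on just one function’s code object and confirms that a second function goes unnoticed. The watched() and ignored() functions and the tool name are hypothetical, and slot 4 is assumed to be free.

```python
import sys

def watched():
    return "watched"

def ignored():
    return "ignored"

def demo():
    mon = sys.monitoring
    tool_id = 4  # assumption: this slot is unclaimed
    started = []

    def on_start(code, offset):
        started.append(code.co_name)

    mon.use_tool_id(tool_id, "local-events-demo")
    try:
        mon.register_callback(tool_id, mon.events.PY_START, on_start)
        # Enable the event only for the code object behind watched().
        mon.set_local_events(tool_id, watched.__code__, mon.events.PY_START)
        watched()
        ignored()
    finally:
        mon.set_local_events(tool_id, watched.__code__, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.PY_START, None)
        mon.free_tool_id(tool_id)
    return started

# Only run where sys.monitoring exists (Python 3.12+).
started = demo() if hasattr(sys, "monitoring") else None
print(started)  # ['watched'] on 3.12+
```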

Event and Callback Types

All of the callbacks take a CodeType argument first, which is the compiled code which triggered the event; I’ve omitted it from the parameter descriptions below.

Many of the callbacks also pass one or more bytecode offsets to indicate the location in the code where an event is occurring. As you might be aware, these offsets are different to line numbers, but as we’ll see in the code example further on, it’s possible to convert these offsets into line numbers using the co_lines() method of the code object.

Any of these callbacks can return a special value sys.monitoring.DISABLE, which has the effect of disabling further callbacks for that particular tool on that particular event at that particular bytecode instruction. This can be used to improve performance, as disabled events don’t incur performance overhead. However, some exception-related event types aren’t necessarily tied to a particular location in the code and cannot be disabled in this way—this is noted under the individual event types below where applicable.

To re-enable any such disabled callbacks, sys.monitoring.restart_events() will do so for all tools.
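Here’s a sketch of DISABLE and restart_events() working together. The callback switches itself off after its first invocation, so repeated calls to the hypothetical target() function only register once until events are restarted. As before, the tool name is made up and slot 4 is assumed to be free.

```python
import sys

def demo():
    mon = sys.monitoring
    tool_id = 4  # assumption: this slot is unclaimed
    hits = []

    def target():
        pass

    def on_start(code, offset):
        hits.append(code.co_name)
        return mon.DISABLE  # no more PY_START callbacks for this location

    mon.use_tool_id(tool_id, "disable-demo")
    try:
        mon.register_callback(tool_id, mon.events.PY_START, on_start)
        mon.set_local_events(tool_id, target.__code__, mon.events.PY_START)
        for _ in range(3):
            target()
        before_restart = len(hits)
        mon.restart_events()  # re-enable everything disabled via DISABLE
        target()
        after_restart = len(hits)
    finally:
        mon.set_local_events(tool_id, target.__code__, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.PY_START, None)
        mon.free_tool_id(tool_id)
    return before_restart, after_restart

# Only run where sys.monitoring exists (Python 3.12+).
result = demo() if hasattr(sys, "monitoring") else None
print(result)
```

On 3.12 I’d expect one hit before the restart and two after it.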

Instruction Stepping

  • INSTRUCTION: Triggers just before each bytecode instruction which is executed.

As well as the CodeType parameter common to all these callbacks, this one also has a single bytecode instruction offset passed indicating the instruction that’s about to execute.

instruction_callback(code: CodeType, offset: int)

Line Stepping

  • LINE: Similar to instruction stepping above, but triggers just prior to executing the first bytecode instruction on a new source code line.

Instead of the offset passed to callback for INSTRUCTION, this one takes the source code line number.

line_callback(code: CodeType, line_number: int)
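A minimal sketch of line stepping might look like this. It records each line number relative to the start of a hypothetical sample() function, again assuming slot 4 is free. I haven’t shown the output, but I’d expect an entry for each of the three lines in the function body.

```python
import sys

def demo():
    mon = sys.monitoring
    tool_id = 4  # assumption: this slot is unclaimed
    lines = []

    def sample():
        a = 1
        b = 2
        return a + b

    def on_line(code, line_number):
        # Store lines relative to the "def" line to keep the output stable.
        lines.append(line_number - code.co_firstlineno)

    mon.use_tool_id(tool_id, "line-demo")
    try:
        mon.register_callback(tool_id, mon.events.LINE, on_line)
        mon.set_local_events(tool_id, sample.__code__, mon.events.LINE)
        sample()
    finally:
        mon.set_local_events(tool_id, sample.__code__, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.LINE, None)
        mon.free_tool_id(tool_id)
    return lines

# Only run where sys.monitoring exists (Python 3.12+).
relative_lines = demo() if hasattr(sys, "monitoring") else None
print(relative_lines)
```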

Function Entry and Exit

  • CALL: Triggers just prior to a function call being taken.
  • C_RETURN: Triggers just after the return from any callable which isn’t a Python function, typically from functions defined in extension modules or the core.
  • C_RAISE: Triggers just after an exception is raised from any callable which isn’t a Python function.

These events have some interdependencies:

  • The CALL event of a given function must be seen to also see the corresponding C_RETURN or C_RAISE event for that call.
  • It’s not possible to enable just one of C_RETURN and C_RAISE—either neither or both events must be enabled. If you try to enable just one you’ll get a ValueError.

The callback for all of these takes a bytecode instruction offset and also a reference to the callable that is being called or returned from. It also takes arg0, which is the first argument passed to the callable. As the example output later in this article shows, this is the case for C_RETURN and C_RAISE events as well as CALL, so these callbacks don’t receive the return value or the exception instance. If the call had no arguments, arg0 will be set to the special singleton sys.monitoring.MISSING.

call_callbacks(code: CodeType, offset: int, callable: object, arg0: object | MISSING)
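To see arg0 and MISSING in action, this sketch monitors only CALL events within a hypothetical workload() function which makes one call with an argument and one without. Slot 4 is assumed to be free, and MISSING is swapped for a readable string so the result prints cleanly.

```python
import sys

def demo():
    mon = sys.monitoring
    tool_id = 4  # assumption: this slot is unclaimed
    calls = []

    def workload():
        len("abc")  # one positional argument
        list()      # no arguments at all

    def on_call(code, offset, callable_, arg0):
        shown = "<MISSING>" if arg0 is mon.MISSING else arg0
        calls.append((callable_.__name__, shown))

    mon.use_tool_id(tool_id, "call-demo")
    try:
        mon.register_callback(tool_id, mon.events.CALL, on_call)
        mon.set_local_events(tool_id, workload.__code__, mon.events.CALL)
        workload()
    finally:
        mon.set_local_events(tool_id, workload.__code__, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.CALL, None)
        mon.free_tool_id(tool_id)
    return calls

# Only run where sys.monitoring exists (Python 3.12+).
calls = demo() if hasattr(sys, "monitoring") else None
print(calls)
```

On 3.12 I’d expect this to report len with 'abc' followed by list with the MISSING placeholder.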

Python Function Execution

  • PY_START: Triggers immediately after a Python function is entered for a new call, but before any of the instructions within the function are executed.
  • PY_RESUME: Triggers immediately after a Python function resumes, for generator and coroutine functions, except for functions which are resumed by calling the throw() method on the generator or coroutine object (see PY_THROW for that).

Aside from the code object, passes a single parameter offset which is the bytecode offset of the first instruction in the function which is about to be executed.

func_start_callback(code: CodeType, offset: int)

Python Function Exit

  • PY_RETURN: Triggers immediately before returning from a Python function, whilst still in the stack frame of the function.
  • PY_YIELD: Triggers immediately before yielding a value from a Python generator or coroutine.

Passed two additional parameters, where offset is the bytecode instruction doing the returning and retval is the value being returned or yielded as appropriate.

func_exit_callback(code: CodeType, offset: int, retval: object)
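As a sketch of these two events on a generator, the hypothetical countdown() function below yields twice and then returns a value, and the callbacks record what they’re given. The make_cb() factory, the tool name and the use of slot 4 are all my own assumptions.

```python
import sys

def demo():
    mon = sys.monitoring
    tool_id = 4  # assumption: this slot is unclaimed
    seen = []

    def countdown(n):
        while n:
            yield n
            n -= 1
        return "done"

    def make_cb(name):
        def exit_cb(code, offset, retval):
            seen.append((name, retval))
        return exit_cb

    both = mon.events.PY_YIELD | mon.events.PY_RETURN
    mon.use_tool_id(tool_id, "exit-demo")
    try:
        mon.register_callback(tool_id, mon.events.PY_YIELD, make_cb("PY_YIELD"))
        mon.register_callback(tool_id, mon.events.PY_RETURN, make_cb("PY_RETURN"))
        mon.set_local_events(tool_id, countdown.__code__, both)
        list(countdown(2))  # run the generator to completion
    finally:
        mon.set_local_events(tool_id, countdown.__code__, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.PY_YIELD, None)
        mon.register_callback(tool_id, mon.events.PY_RETURN, None)
        mon.free_tool_id(tool_id)
    return seen

# Only run where sys.monitoring exists (Python 3.12+).
seen = demo() if hasattr(sys, "monitoring") else None
print(seen)
```

On 3.12 I’d expect two PY_YIELD entries carrying 2 and 1, followed by a PY_RETURN carrying 'done'.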


Exception Handling

  • RAISE: Triggered where any exception is being raised except for the StopIteration raised for generators or coroutines (see STOP_ITERATION).
  • RERAISE: Triggered when an existing exception is being re-raised, such as at the end of a finally block which has been executed during stack unwinding.
  • EXCEPTION_HANDLED: Triggered whenever an exception is being caught and handled.
  • PY_UNWIND: Triggered on exit from a Python function whilst the stack is being unwound as part of exception handling.
  • PY_THROW: Not to be confused with RAISE, this event is triggered instead of PY_RESUME when a coroutine or generator function is resumed by calling the throw() method of the appropriate object.
  • STOP_ITERATION: Triggered by the special case of raising StopIteration. Note that since this is an inefficient way to return values, this isn’t always triggered where you expect it might be, only where the StopIteration would be visible. For example, if you iterate through the results of a generator using a for loop you’ll see it, but if you pass that same generator into the builtin sum() then you won’t.

Of the events in this group only STOP_ITERATION can be disabled—the others are not “local events” and if you attempt to return DISABLE from the callback for them, then that callback will be removed and a ValueError will be raised. This is a very confusing context in which to raise an exception, and from my brief testing you get some pretty confused stack traces, so my strong suggestion is that you just make sure you don’t do this or you’ll be in quite a painful place.

In all these cases, the additional parameters to the callback are the offset to the bytecode instruction where the event was triggered, and a reference to the exception triggering the action.

exception_callback(code: CodeType, offset: int, exception: BaseException)
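Since most of these aren’t local events, this sketch enables RAISE and PY_UNWIND globally and filters on the code object inside the callback instead. The risky() function, the make_cb() factory and the tool name are hypothetical, and slot 4 is assumed to be free. Note that the callbacks deliberately return nothing, given the warning above about DISABLE.

```python
import sys

def demo():
    mon = sys.monitoring
    tool_id = 4  # assumption: this slot is unclaimed
    seen = []

    def risky():
        raise KeyError("missing")

    def make_cb(name):
        def exc_cb(code, offset, exception):
            # Events are global, so ignore anything outside risky().
            if code is risky.__code__:
                seen.append((name, type(exception).__name__))
        return exc_cb

    both = mon.events.RAISE | mon.events.PY_UNWIND
    mon.use_tool_id(tool_id, "raise-demo")
    try:
        mon.register_callback(tool_id, mon.events.RAISE, make_cb("RAISE"))
        mon.register_callback(tool_id, mon.events.PY_UNWIND, make_cb("PY_UNWIND"))
        mon.set_events(tool_id, both)
        try:
            risky()
        except KeyError:
            pass
    finally:
        mon.set_events(tool_id, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.RAISE, None)
        mon.register_callback(tool_id, mon.events.PY_UNWIND, None)
        mon.free_tool_id(tool_id)
    return seen

# Only run where sys.monitoring exists (Python 3.12+).
seen = demo() if hasattr(sys, "monitoring") else None
print(seen)
```

Within risky() I’d expect to see the RAISE first, followed by the PY_UNWIND as the exception leaves the function.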


Branches and Jumps

  • BRANCH: A conditional branch is taken (or not).
  • JUMP: An unconditional branch is taken.

The branch operations are from src_offset to dst_offset within the code object passed as the first parameter.

branch_callback(code: CodeType, src_offset: int, dst_offset: int)

Illustration of Monitoring

To help demonstrate this, I created this very simple case below, which registers callbacks for a subset of the operations which just print their argument to standard output. Before we look at the code, there are a couple of notes that are worth bearing in mind when you read through the code:

  • The code_offset_to_line() function converts a bytecode instruction offset, as passed into the callbacks, into a source file and line number. If you want to know about the co_lines() method on the code object it uses to do this, read through PEP 626 for more details. The way it works isn’t important for the monitoring functions.
  • I’ve used decorator-style functions to return the callbacks purely for my own convenience, so that I can have a single implementation of the callback of each signature, but also have the specific event name available in it. For example, get_func_exec_cb() provides the callback used for both PY_START and PY_RESUME events, but having two copies of this which differed only in that name would have been dull to implement.

That said, here’s the code:

from os import path
from sys import monitoring

def code_offset_to_line(code, offset):
    filename = path.basename(code.co_filename)
    for start, end, lineno in code.co_lines():
        if start <= offset < end:
            return f"{filename}:{lineno}"
    return f"{filename}:??"

def get_func_call_cb(name):
    def func_call_cb(code, offset, callable, arg0):
        line = code_offset_to_line(code, offset)
        function = callable.__name__
        print(f"[{name}] {function}() {line} {arg0=}")
    return func_call_cb

def get_func_exec_cb(name):
    def func_exec_cb(code, offset):
        line = code_offset_to_line(code, offset)
        print(f"[{name}] {line}")
    return func_exec_cb

def get_func_ret_cb(name):
    def func_ret_cb(code, offset, retval):
        line = code_offset_to_line(code, offset)
        print(f"[{name}] {line} {retval=}")
    return func_ret_cb

def get_branch_cb(name):
    def branch_cb(code, src_offset, dst_offset):
        src_line = code_offset_to_line(code, src_offset)
        dst_line = code_offset_to_line(code, dst_offset)
        print(f"[{name}] {src_line=} {dst_line=}")
    return branch_cb

monitoring.use_tool_id(0, "andymon")
monitoring.register_callback(0, monitoring.events.CALL, get_func_call_cb("CALL"))
monitoring.register_callback(0, monitoring.events.C_RETURN, get_func_call_cb("C_RETURN"))
monitoring.register_callback(0, monitoring.events.C_RAISE, get_func_call_cb("C_RAISE"))
monitoring.register_callback(0, monitoring.events.PY_START, get_func_exec_cb("PY_START"))
monitoring.register_callback(0, monitoring.events.PY_RESUME, get_func_exec_cb("PY_RESUME"))
monitoring.register_callback(0, monitoring.events.PY_RETURN, get_func_ret_cb("PY_RETURN"))
monitoring.register_callback(0, monitoring.events.PY_YIELD, get_func_ret_cb("PY_YIELD"))
monitoring.register_callback(0, monitoring.events.BRANCH, get_branch_cb("BRANCH"))
monitoring.register_callback(0, monitoring.events.JUMP, get_branch_cb("JUMP"))
enable_events = (monitoring.events.CALL
                 | monitoring.events.C_RETURN
                 | monitoring.events.C_RAISE
                 | monitoring.events.PY_START
                 | monitoring.events.PY_RESUME
                 | monitoring.events.PY_RETURN
                 | monitoring.events.PY_YIELD
                 | monitoring.events.BRANCH
                 | monitoring.events.JUMP)
monitoring.set_events(0, enable_events)

def generator_func(start, end):
    for i in range(start, end):
        yield i + 1

def sum_under_ten(start, end):
    total = 0
    for i in generator_func(start, end):
        if i > 9:
            break
        total += i
    return total

print(sum_under_ten(6, 15))

If you execute this code, you should see this output:

[CALL] sum_under_ten() mon-demo.py:70 arg0=6
[PY_START] mon-demo.py:62
[CALL] generator_func() mon-demo.py:64 arg0=6
[PY_START] mon-demo.py:58
[CALL] range() mon-demo.py:59 arg0=6
[C_RETURN] range() mon-demo.py:59 arg0=6
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=7
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:67'
[JUMP] src_line='mon-demo.py:67' dst_line='mon-demo.py:64'
[PY_RESUME] mon-demo.py:60
[JUMP] src_line='mon-demo.py:60' dst_line='mon-demo.py:59'
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=8
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:67'
[JUMP] src_line='mon-demo.py:67' dst_line='mon-demo.py:64'
[PY_RESUME] mon-demo.py:60
[JUMP] src_line='mon-demo.py:60' dst_line='mon-demo.py:59'
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=9
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:67'
[JUMP] src_line='mon-demo.py:67' dst_line='mon-demo.py:64'
[PY_RESUME] mon-demo.py:60
[JUMP] src_line='mon-demo.py:60' dst_line='mon-demo.py:59'
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=10
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:66'
[PY_RETURN] mon-demo.py:68 retval=24
[CALL] print() mon-demo.py:70 arg0=24
[C_RETURN] print() mon-demo.py:70 arg0=24
[PY_RETURN] mon-demo.py:70 retval=None

Based on the earlier documentation, you should be able to match up this output with the code to see what’s happening here. One potentially puzzling point is the BRANCH events on the lines with for loops branch to themselves—this makes more sense when you realise that a single source code line evaluates to multiple bytecode instructions, and branches target an instruction rather than a source line.

This example also illustrates a slight asymmetry: CALL events are generated for C extensions and builtins (e.g. range()) as well as for Python functions (e.g. sum_under_ten()), but the return from range() triggers C_RETURN whereas the return from sum_under_ten() triggers PY_RETURN. On that topic, note that I’ve enabled C_RAISE even though it’s never triggered because, as previously noted, you can’t enable C_RETURN without C_RAISE as well.

Also note that there’s only one PY_START event for generator_func(), and all remaining re-entries are PY_RESUME events.
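These two observations can be checked on a smaller scale. The following is a minimal sketch, assuming Python 3.12 or later, which records the order of events for a tiny generator and for a function that calls a builtin. The record_events() helper, the tool name and the gen(), helper() and demo() functions are all illustrative names of my own choosing, not part of the script discussed above.

```python
import sys


def record_events(func, event_names, codes):
    """Run func() and return (event name, code name) pairs, in order.

    Only events raised for a code object in `codes` are kept, so the
    recording machinery doesn't show up in its own results.
    """
    mon = sys.monitoring
    # Tool IDs run from 0 to 5; pick one that nothing else has claimed.
    tool_id = next(i for i in range(6) if mon.get_tool(i) is None)
    seen = []

    def make_callback(name):
        # Every callback receives the code object first; the remaining
        # arguments vary by event, so they're ignored here.
        def callback(code, *details):
            if code in codes:
                seen.append((name, code.co_name))
        return callback

    mon.use_tool_id(tool_id, "record-events-sketch")
    try:
        event_set = mon.events.NO_EVENTS
        for name in event_names:
            event = getattr(mon.events, name)
            mon.register_callback(tool_id, event, make_callback(name))
            event_set |= event
        mon.set_events(tool_id, event_set)
        func()
    finally:
        # Always switch events off and release the tool ID again.
        mon.set_events(tool_id, mon.events.NO_EVENTS)
        for name in event_names:
            mon.register_callback(tool_id, getattr(mon.events, name), None)
        mon.free_tool_id(tool_id)
    return seen


def gen():
    for i in range(3):
        yield i


def helper(values):
    return len(values)  # len() is a builtin implemented in C


def demo():
    gen_events = record_events(
        lambda: list(gen()),
        ["PY_START", "PY_RESUME", "PY_YIELD"],
        {gen.__code__},
    )
    call_events = record_events(
        lambda: helper([1, 2]),
        ["CALL", "C_RETURN", "C_RAISE", "PY_RETURN"],
        {helper.__code__},
    )
    return gen_events, call_events
```

Calling demo() should show a single PY_START for gen() with every later re-entry reported as PY_RESUME, and for helper() a CALL and C_RETURN pair for len() followed by helper()'s own PY_RETURN. Filtering on the code object is a deliberate choice here: events are enabled globally, so without it the list would also pick up the lambdas and anything else that happens to run.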

Monitoring Implementation

So how does this work? Well, if you’ve read my previous article which covered the specialising adaptive interpreter features, or you’re otherwise familiar with them, you’ll recall that since Python 3.11 dynamic runtime performance improvements are made by specialising certain instructions. This new monitoring interface uses the same technique.

Let’s recap with an example. Suppose a LOAD_ATTR bytecode is generated: this instruction pops the object on the top of the stack, retrieves a specified attribute from it and pushes the value of that attribute back on to the stack. With the dynamic specialisation introduced in 3.11, this instruction becomes an adaptive version LOAD_ATTR_ADAPTIVE which records how it’s used. If it’s consistently used to obtain an attribute from an object instance via __dict__, say, then after a number of consecutive repetitions that instruction will dynamically be replaced by a LOAD_ATTR_INSTANCE_VALUE bytecode which can do the same job quicker because it doesn’t need to perform as many checks.

Of course, usage patterns can’t be assumed to stay the same for the lifetime of a script, so the specialised versions are written to confirm their assumptions are correct and revert back to the basic LOAD_ATTR if not—if this happens a few times, the instruction itself reverts back to LOAD_ATTR_ADAPTIVE and the process repeats. If you want a more detailed illustration, you can take a few minutes to read my previous article.
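As a quick way to watch this happen, here is a small sketch, assuming CPython 3.11 or later, which calls a function enough times for specialisation to kick in and then lists the opcode names currently held in its code object. The Point class and the get_x(), current_opnames() and warm_up() functions are hypothetical names used only for this illustration, and the figure of 1000 calls is simply a comfortably large number rather than the interpreter's actual threshold.

```python
import dis


class Point:
    def __init__(self, x):
        self.x = x


def get_x(point):
    return point.x  # compiles to a LOAD_ATTR instruction


def current_opnames(func):
    """Opcode names as they currently sit in the code object."""
    return [ins.opname for ins in dis.get_instructions(func, adaptive=True)]


def warm_up(func, arg, calls=1000):
    """Call func(arg) repeatedly so the interpreter can specialise it."""
    for _ in range(calls):
        func(arg)
    return current_opnames(func)
```

Printing warm_up(get_x, Point(1)) on a standard CPython build should show a specialised opcode whose name begins with LOAD_ATTR_ in place of the plain LOAD_ATTR, although the exact name depends on the interpreter version.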

So, this new API modifies instructions to instrumented versions of themselves, and potentially inserts new instructions for things like LINE events. This update happens during sys.monitoring.set_events() for any code objects present on the call stack of any thread, and traps are put into place to ensure other code is updated as soon as it’s called. This operation is performed for:

  • Single stepping: LINE and INSTRUCTION
  • Function entry: CALL
  • Python function execution: PY_START and PY_RESUME
  • Python function exit: PY_RETURN and PY_YIELD
  • Branching: JUMP and BRANCH

For other events such as RAISE, it’s cheaper instead to do a check when the event happens, so the bytecodes don’t need to be modified.

Let’s take a look at an example of this in action. Consider this simple implementation of a factorial function in Python:

def fact(x):
    if x < 2:
        return x
    return x * fact(x - 1)

If we use dis.dis() with adaptive=True and show_caches=True then we get the following bytecode disassembly:

1           0 RESUME                   0

2           2 LOAD_FAST__LOAD_CONST     0 (x)
            4 LOAD_CONST               1 (2)
            6 COMPARE_OP_INT           2 (<)
            8 CACHE                    0 (counter: 832)
           10 POP_JUMP_IF_FALSE        2 (to 16)

3          12 LOAD_FAST                0 (x)
           14 RETURN_VALUE

5     >>   16 LOAD_FAST                0 (x)
           18 LOAD_GLOBAL_MODULE       1 (NULL + fact)
           20 CACHE                    0 (counter: 832)
           22 CACHE                    0 (index: 9)
           24 CACHE                    0 (module_keys_version: 75)
           26 CACHE                    0 (builtin_keys_version: 0)
           28 LOAD_FAST__LOAD_CONST     0 (x)
           30 LOAD_CONST               2 (1)
           32 BINARY_OP_SUBTRACT_INT    10 (-)
           34 CACHE                    0 (counter: 832)
           36 CALL_PY_EXACT_ARGS       1
           38 CACHE                    0 (counter: 832)
           40 CACHE                    0 (func_version: 2062)
           42 CACHE                    0
           44 BINARY_OP_MULTIPLY_INT     5 (*)
           46 CACHE                    0 (counter: 832)
           48 RETURN_VALUE

At this point, I registered a tool, registered a dummy callback which did nothing, and enabled only the CALL and LINE events for illustrative purposes. If I called dis.dis() again at this point, I wouldn’t see any changes because the trap to modify the code on next execution won’t have fired. So I called the function one more time, and then dis.dis() showed the changes, which I’ve highlighted below:

1           0 RESUME                   0

2           2 INSTRUMENTED_LINE        0
            4 LOAD_CONST               1 (2)
            6 COMPARE_OP_INT           2 (<)
            8 CACHE                    0 (counter: 832)
           10 POP_JUMP_IF_FALSE        2 (to 16)

3          12 INSTRUMENTED_LINE        0
           14 RETURN_VALUE

5     >>   16 INSTRUMENTED_LINE        0
           18 LOAD_GLOBAL_MODULE       1 (NULL + fact)
           20 CACHE                    0 (counter: 768)
           22 CACHE                    0 (index: 9)
           24 CACHE                    0 (module_keys_version: 75)
           26 CACHE                    0 (builtin_keys_version: 0)
           28 LOAD_FAST                0 (x)
           30 LOAD_CONST               2 (1)
           32 BINARY_OP_SUBTRACT_INT    10 (-)
           34 CACHE                    0 (counter: 832)
           36 INSTRUMENTED_CALL        1
           38 RESERVED
           40 BINARY_OP_MULTIPLY_INT     8 (**)
           42 CACHE                    0 (counter: 0)
           44 BINARY_OP_MULTIPLY_INT     5 (*)
           46 CACHE                    0 (counter: 832)
           48 RETURN_VALUE

There are some interesting things to note here. Firstly, note that CALL_PY_EXACT_ARGS has been replaced with INSTRUMENTED_CALL. This illustrates one of the downsides of this approach to instrumentation, namely that the benefits of adaptive specialisation are lost. This means that instrumenting code will potentially worsen performance by a little more than just the overhead of calling the callbacks, as you lose the performance boost of specialisations.

I also note that each INSTRUMENTED_LINE bytecode appears to have replaced the first instruction of its line rather than being added alongside it. From briefly perusing the CPython code, it looks as if these instructions store the original opcode and execute it. This makes sense, because it’s not at all clear that inserting new instructions would be possible using the dynamic specialisation mechanism. Instead the original opcode is executed after invoking the callback, and when the event is disabled again the bytecode can be switched back to the original.
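The experiment is easy enough to repeat. The sketch below, assuming CPython 3.12 or later, enables CALL and LINE events with a do-nothing callback, runs the function once, and then reports which instructions have been swapped for instrumented versions. Rather than going through dis.dis() with adaptive=True, it reads the raw opcode bytes from the code object's _co_code_adaptive attribute, which is an undocumented CPython implementation detail, at the offsets of the original instructions. The instrumented_opcodes() name and the tool name are my own inventions for this illustration.

```python
import dis
import opcode
import sys


def fact(x):
    if x < 2:
        return x
    return x * fact(x - 1)


def instrumented_opcodes(func, *args):
    """Return {offset: opcode name} for instructions that were swapped."""
    mon = sys.monitoring
    # Tool IDs run from 0 to 5; pick one that nothing else has claimed.
    tool_id = next(i for i in range(6) if mon.get_tool(i) is None)
    events = (mon.events.CALL, mon.events.LINE)
    mon.use_tool_id(tool_id, "instrumentation-sketch")
    try:
        for event in events:
            # A dummy callback which does nothing.
            mon.register_callback(tool_id, event, lambda *details: None)
        mon.set_events(tool_id, mon.events.CALL | mon.events.LINE)
        func(*args)  # run once so the pending instrumentation is applied
        # CPython implementation detail: the bytes the interpreter runs.
        raw = func.__code__._co_code_adaptive
        swapped = {}
        # co_code still shows the original instructions, so it supplies
        # the offsets of real instructions (cache entries are skipped).
        for ins in dis.get_instructions(func):
            name = opcode.opname[raw[ins.offset]]
            if name.startswith("INSTRUMENTED_"):
                swapped[ins.offset] = name
        return swapped
    finally:
        mon.set_events(tool_id, mon.events.NO_EVENTS)
        for event in events:
            mon.register_callback(tool_id, event, None)
        mon.free_tool_id(tool_id)
```

Calling instrumented_opcodes(fact, 5) should report an INSTRUMENTED_LINE at the first instruction of each source line and an INSTRUMENTED_CALL where the recursive call is made. Looking only at the offsets of real instructions is a deliberate choice, since it avoids depending on how the cache entries that follow an instrumented instruction get decoded.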

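I also note that the CACHE bytecode in instruction 38, which adaptive instructions use to store their counters, has been changed to RESERVED. I’m a little puzzled why instruction 40 has been replaced with BINARY_OP_MULTIPLY_INT instead of RESERVED: either dis.dis() has mistranslated the opcode, or there’s something odd going on that I’m not quite understanding. One possibility, and this is only a guess, is that dis.dis() doesn’t know that INSTRUMENTED_CALL is followed by cache entries and so decodes them as instructions: the func_version of 2062 shown at that offset in the earlier listing is 0x080E, and its high byte of 8 is the argument that dis renders as ** for binary operations.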

All in all, it seems like a neat way to do things, and being able to instrument code with lower overhead is handy. Perhaps more to the point, I could imagine there are other uses to be found for this dynamic instruction swapping technique, and I’m interested to see what other applications come to light in the future.

Better Sub-Interpreter Isolation

Since the 1.x days of Python, it’s always been possible to run multiple separate interpreters in the same process, should you want to. These sub-interpreters could only be started using the Py_NewInterpreter() function in C code, and each one had its own collection of threads with their own stacks, plus its own copy of imported modules and most other things.

However, there were a few things that were still shared between the multiple sub-interpreters and one of these was the global interpreter lock (GIL). This meant that only one sub-interpreter could ever be executing Python code at one time, which meant that they were useless for making use of multiple processors. Since this seemed like a potentially handy use-case for sub-interpreters, this was a notable drawback.

In Python 3.12 the implementation of PEP 684 has moved much of the previously shared global state into per-interpreter storage, including the GIL. This means that those interpreters can now make use of multiple processors more effectively.

Per-Interpreter Data

To do this required quite a few changes. Firstly, the following global state was moved into the PyInterpreterState structure, which exists per sub-interpreter:

  • All mutable global objects.
  • The GIL.
  • Mutable data protected by the GIL or some other per-interpreter lock.
  • Mutable data that may be used differently in different modules, such as loaded extension modules.

Remaining Global Data

However, this does still leave some data global across interpreters, namely:

  • Immutable global objects, which are safe to share.
  • Effectively immutable internal data (e.g. some state that’s initialised once and never changed).
  • Any data guaranteed to only be modified in the main thread.
  • Mutable data that’s protected by any global lock remaining shared across interpreters.
  • Global state in atomic variables.

Memory Allocators

One of the trickier parts of the change was dealing with allocators. CPython provides three allocator domains, where each domain manages its own memory and can use an allocation strategy that’s optimised for a particular purpose. The three domains are:

Raw domain
Memory for general-purpose memory buffers where the allocation must come from the actual system allocator, or where the allocator can safely operate without holding the GIL.
Mem domain
Memory for Python buffers and general-purpose memory buffers where the allocation must be performed whilst holding the GIL. Memory is taken from a private heap managed by Python rather than directly from the system heap.
Object domain
Memory for Python objects. Holding the GIL is required. Memory is taken from the Python private heap.

Each domain has its own functions to allocate and free memory, and custom allocators for each domain can be set during runtime initialisation. Prior to Python 3.12 all of these allocators were global, shared by all sub-interpreters. Also, as a consequence of the mem and object domain allocators requiring the GIL, there was no particular need for them to be thread-safe.

This was fine up until the GIL got moved into the sub-interpreters—at this point, sharing the allocators across multiple interpreters was a recipe for crashes. At this point the choice was essentially to either make the allocators all thread-safe and leave them global; or move them into the scope of each sub-interpreter and not require thread-safety.

For reasons outlined in the PEP, the allocators were left shared and are now required to be thread-safe. This does mean that anyone using custom allocators which aren’t thread-safe, and also using sub-interpreters, is going to risk data races at runtime—but this is currently protected against by requiring own_gil to be false if custom allocators are used (see below).

New Sub-Interpreter API

A new function Py_NewInterpreterFromConfig() has been introduced which, unlike the original Py_NewInterpreter(), takes a configuration parameter[2] of type PyInterpreterConfig to control the behaviour of the new interpreter. This means that developers can adopt whichever of the new features will be compatible with their use-cases, but existing code still using Py_NewInterpreter() maintains backwards compatibility.

This configuration structure provides the following fields which can be set—the “legacy” values shown below are the compatibility mode options used for Py_NewInterpreter, and the “new” values are for fully isolated sub-interpreters:

int gil (legacy: SHARED_GIL, new: OWN_GIL)
This can be either PyInterpreterConfig_SHARED_GIL for the sub-interpreter to share the main interpreter’s GIL, or PyInterpreterConfig_OWN_GIL for it to have its own. There’s also a PyInterpreterConfig_DEFAULT_GIL definition, and it currently defaults to a shared GIL for backwards compatibility.
int use_main_obmalloc (legacy: 1, new: 0)
If zero, the sub-interpreter will use its own object domain allocator, otherwise it’ll share the main interpreter’s.
int allow_fork (legacy: 1, new: 0)
If zero, fork() is disallowed in any thread where that sub-interpreter is active, otherwise fork() is unrestricted. The subprocess module still works even if fork() is disallowed, however.
int allow_exec (legacy: 1, new: 0)
If zero, execv() and similar functions will be disallowed in any thread where that sub-interpreter is active. Once again, this doesn’t affect subprocess.
int allow_threads (legacy: 1, new: 1)
If zero, the threading module in that sub-interpreter won’t allow new threads to be created.
int allow_daemon_threads (legacy: 1, new: 0)
If allow_threads is non-zero, this can be set to zero to force threading to only create non-daemon threads.
int check_multi_interp_extensions (legacy: 0, new: 1)
If zero, all extension modules may be imported; otherwise, only multi-phase init extension modules may be imported. In particular, this must be non-zero if use_main_obmalloc is zero.
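To make it easier to see which settings distinguish the two modes, here is a pure-Python model of the “legacy” and “new” values listed above. It isn't the C API and doesn't create any interpreters: InterpreterConfigModel, LEGACY, ISOLATED and differences() are hypothetical names invented for this sketch, and the two string constants merely stand in for the real GIL macros.

```python
from dataclasses import dataclass, fields

SHARED_GIL = "shared"  # stands in for PyInterpreterConfig_SHARED_GIL
OWN_GIL = "own"        # stands in for PyInterpreterConfig_OWN_GIL


@dataclass(frozen=True)
class InterpreterConfigModel:
    """Pure-Python stand-in for the fields of PyInterpreterConfig."""
    gil: str
    use_main_obmalloc: bool
    allow_fork: bool
    allow_exec: bool
    allow_threads: bool
    allow_daemon_threads: bool
    check_multi_interp_extensions: bool


# The compatibility values used for Py_NewInterpreter().
LEGACY = InterpreterConfigModel(
    gil=SHARED_GIL,
    use_main_obmalloc=True,
    allow_fork=True,
    allow_exec=True,
    allow_threads=True,
    allow_daemon_threads=True,
    check_multi_interp_extensions=False,
)

# The values for a fully isolated sub-interpreter.
ISOLATED = InterpreterConfigModel(
    gil=OWN_GIL,
    use_main_obmalloc=False,
    allow_fork=False,
    allow_exec=False,
    allow_threads=True,
    allow_daemon_threads=False,
    check_multi_interp_extensions=True,
)


def differences(a, b):
    """Return {field name: (value in a, value in b)} where they differ."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }
```

Calling differences(LEGACY, ISOLATED) shows that every field except allow_threads changes between the two modes, which gives a sense of how much the legacy behaviour relies on shared state.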


The f-string changes might seem modest, but I run into the issue of clashing quotes just often enough that simply not having to worry about it any more is going to be helpful. The improved error reporting and the ability to nest f-strings are nice bonuses as well.

Inline comprehensions are one of those things that a lot of people won’t even realise have been implemented, but they’re really useful little constructs so it’s great to see their performance rising to match their concision.

The only thing I’m a little disappointed by is the failure to include generator expressions, even though I understand some of the particular difficulties they would introduce. This is because where I need performance, I’ll typically use a generator expression to avoid having to buffer up potentially large amounts in memory. With these new, faster comprehensions, however, there’s more value in weighing up the expected data size more carefully, as performance might now be significantly better using comprehensions for smaller data volumes.

Better error reporting is always useful, although it does feel a little like we’re getting into some slightly obscure corners here—perhaps that means all the low-hanging fruit for error reporting is already plucked.

Finally, I’m a big fan of this new approach to monitoring. Self-modifying code is a really handy trick if you can get it right, and a project the size of Python is well-tested and well-used enough to inspire confidence on that score. Even aside from conventional debugging and profiling, this approach might now be cheap enough to use in production for things like detecting when deprecated functions are called, or even rate-limiting calls to a particular piece of code. Of course, the best option would normally be to modify the code itself, but there may be cases where that’s inadvisable or impossible, and this non-invasive approach feels like one of those specialist tools you rarely use, but when you do need it, you’re really glad you have it.

That’s it for this article, and I think I’ve looked at all the major changes in the core language for Python 3.12. Next time I’ll run through some of the changes to the standard library which, on cursory inspection, seem fairly modest.

  1. And, if I’m honest, it’s a little concerning that someone might actually make use of this to change values in an f-string or something awful. But you can’t let potential for abuse stop you adding useful features to a language or you’d never have anything. 

  2. Technically the function takes two parameters, but the first of them is really an output parameter to return the PyThreadState for the new sub-interpreter. 

The next article in the “Python 3 Releases” series is What’s New in Python 3.12 - Library Changes
Sun 17 Mar, 2024