In this series looking at features introduced by every version of Python 3, we take a look at the new features added in Python 3.12 related to f-strings, inlining of comprehensions, improved error reporting, the new monitoring API for debuggers, and better isolation for sub-interpreters.
This is the 31st of the 32 articles that currently make up the “Python 3 Releases” series.
Having looked at various improvements to type hints in the previous article, it’s time to look at some more of the improvements to Python in release 3.12. In this article we look at the fact that f-strings have had some of their restrictions lifted, and also some improvements to the CPython interpreter—namely: inlining of comprehensions, improvements to error reporting, a new API for tools such as debuggers, and better isolation for sub-interpreters.
As an aside, the last two features are somewhat obscure, so those not using those features might like to skip that part—I’ve put a note into the text below where I feel some developers might like to drop off. Personally I think it’s useful to have some insight into how the interpreter works, but I realise that some Python developers work at a higher level and just want to get their work done with a minimum of fuss, and perhaps delving into bytecode instructions isn’t their idea of a fun time.
Since their introduction in Python 3.6 by PEP 498, f-strings have not had a formal grammar and have been saddled with various restrictions. At the time these were necessary to be able to implement the feature without modifying the existing lexer for the language, but some of them are slightly annoying. Three of the major limitations were that the expression part couldn't reuse the quote character delimiting the f-string itself, so the following was invalid:

f"Name: {details["name"]}"

It also couldn't contain backslashes, and couldn't span multiple lines in a single-quoted f-string. Restrictions like these have been removed by giving f-strings a formal grammar and implementing dedicated parsing code for them, as described in detail in PEP 701. As a result, the three limitations above no longer apply.
It’s now possible to use any valid quotes in the expression within an f-string, even the same quotes as were used to delimit the f-string:
>>> details = {"name": "Andy", "favourite_colour": "Pantone 2172 C"}
>>> f"{details["name"]} likes the colour {details["favourite_colour"]}"
'Andy likes the colour Pantone 2172 C'
As a consequence, this also allows arbitrary nesting of f-strings, should you find you need such a quirky feature, because you’re no longer forced to select a different type of quote for each level of nesting.
>>> f"{f"{f"{f"{f"{f"{2**8}"}"}"}"}"}"
'256'
It’s now possible for expressions to span multiple lines, even if using the single-quoted form of the f-string, wherever a newline would be acceptable in an expression in normal Python code.
>>> f"The items are: {", ".join([
... "one",
... "two",
... "three"
... ])}"
'The items are: one, two, three'
It’s important to remember that these are expressions, not arbitrary Python code, so you can’t include multiple semantic lines of code. That said, the inclusion of the walrus operator in Python 3.8 means that expressions can have assignment side-effects. This was the case even before Python 3.12, but I always find it slightly surprising1.
>>> x
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
>>> f"{(x := 123),x + 456}"
'(123, 579)'
>>> x
123
Before Python 3.12 you weren’t permitted to use backslashes in f-string expressions, even if this use wasn’t to escape the delimiters used for the f-string itself:
>>> f"{'Andy\'s blog'}"
File "<stdin>", line 1
f"{'Andy\'s blog'}"
^
SyntaxError: f-string expression part cannot include a backslash
This now works as you’d expect in Python 3.12:
>>> f"{'Andy\'s blog'}"
"Andy's blog"
Taking this in conjunction with the other changes, you can now see that f-strings are quite powerful tools for constructing fairly complex output formats from raw data:
>>> items = ["one", "two", "three"]
>>> print(f"The items are:\n{"\n".join(
... # Generate a series of '1: item' lines
... f"{n}: {item}" for n, item in enumerate(items, start=1)
... )}\nAnd that is all the items.")
The items are:
1: one
2: two
3: three
And that is all the items.
Of course, if you want to generate really complex outputs then you’re almost certainly better off with a more fully-featured templating engine such as Jinja. But for the likes of rich debugging statements during development, f-strings can be handy and fast, and there’ll always be utility in that.
This isn’t related to Python 3.12 directly, but since f-strings have become more powerful, people might be tempted to use them more widely. Since I mentioned debugging above, I wanted to highlight that there is a downside to using f-strings with methods from logging—with an f-string the string is always built up front to prepare the argument to pass to the logging method, whereas with the %-style formatting built in to logging the string formatting is only done if the logging message is actually going to be displayed.
To illustrate this, consider this rather contrived example:
>>> import timeit
>>> setup = """
... import logging
... import time
...
... class MyObject:
... def __str__(self):
... time.sleep(0.5)
... return "MyObject"
...
... x = MyObject()
... """
>>> timeit.timeit('logging.debug(f"Value: {x}")', setup=setup, number=100)
50.40898558299523
>>> timeit.timeit('logging.debug(f"Value: %s", x)', setup=setup, number=100)
9.970797691494226e-05
Here we’re using time.sleep(0.5) to simulate an object which might be expensive to marshal into a string, so you only want to incur that expense if you’re actually going to use the result. Using f-strings, you’ll incur that cost every time you pass that object into a logging function, as you can see from the first case where logging.debug() has an f-string passed to it. Even though we’re not configured to emit debug-level logs, the cost of marshalling the f-string is still incurred. In the second example, you can see that using the %s format option and passing x directly to logging.debug() avoids this cost unless the string is actually generated.
If you’re only logging builtin values then this might not be such a big deal, and if performance is an utmost consideration then you probably wouldn’t be using Python anyway. But if you tend to sprinkle your applications with lots of detailed debugging, and rely on the fact that it’s disabled in production environments, then this overhead is definitely something of which you should be aware.
Hopefully in some future Python version they might find some way to improve things so you don’t have to use the increasingly outdated %-style formatting, but without breaking compatibility with existing code. There are some possible workarounds detailed in the logging cookbook using custom classes instead of message strings and/or judicious use of LoggerAdapter, but none of it has ever seemed graceful enough to me to make the switch—formatting with % isn’t so bad after all.

A final thought is that even if they do decide to make changes, efficient logging with f-strings would be challenging because you’d have to know not to instantiate the f-string before passing it in—you’d need some way for a function to indicate that it wants its arguments lazily evaluated or somesuch. If someone can think of a cunning way to make it work, however, it would certainly make it more convenient to write good quality logging.
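For what it’s worth, here’s a minimal sketch of what such lazy evaluation can look like today—the Lazy wrapper is my own illustration (essentially the same trick as the cookbook’s wrapper classes mentioned above), not anything from the standard library:

import logging
import time

class Lazy:
    # Wraps a zero-argument callable; the callable (and hence any f-string
    # inside it) only runs if logging actually formats the message.
    def __init__(self, func):
        self.func = func

    def __str__(self):
        return str(self.func())

def expensive():
    time.sleep(0.5)  # stand-in for an object that's costly to marshal
    return "MyObject"

logging.basicConfig(level=logging.INFO)

# Debug is disabled, so __str__() is never called and we skip the 0.5s cost.
logging.debug("Value: %s", Lazy(lambda: f"Expensive: {expensive()}"))

# Info is enabled, so this one pays the cost when the record is formatted.
logging.info("Value: %s", Lazy(lambda: f"Expensive: {expensive()}"))

The awkwardness, of course, is that you have to remember to wrap every expensive argument—which is exactly the inconvenience discussed above.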
The introduction of f-string support into the formal grammar has also had some other beneficial side-effects, such as more precise error messages for syntax errors within f-strings.
Experienced Python programmers will know how useful comprehensions can be for writing compact yet readable code. If you’re a little hazy on what comprehensions are, here are some examples to remind you, or you can go and read the official documentation on list-comprehensions, set-comprehensions and dict-comprehensions.
>>> [i**2 for i in range(1, 10) if i % 2 == 1]
[1, 9, 25, 49, 81]
>>> {i // 3 for i in range(10)}
{0, 1, 2, 3}
>>> {i: i**2 for i in range(6)}
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
>>> items = ["one", "two", "three"]
>>> print("\n".join(f"{n:>2}. {item}" for n, item in enumerate(items)))
 0. one
 1. two
 2. three
Prior to Python 3.12, comprehensions were compiled as nested anonymous functions within the parent function. This made for a convenient implementation, but wasn’t optimal for performance as function calls actually have noticeable overhead in Python.
In Python 3.12 these have simply been inlined, which offers similar semantics but with better runtime performance.
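One quick way to see that the semantics really are preserved: even though the comprehension now runs inline, its loop variable still doesn’t leak into the enclosing scope—this is what the LOAD_FAST_AND_CLEAR and final STORE_FAST instructions in the 3.12 bytecode below take care of.

>>> x = "outer"
>>> [x * 2 for x in range(3)]
[0, 2, 4]
>>> x
'outer'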
To see the difference, consider this trivial Python function:
def add_one(gen):
    return [i+i for i in gen]
Under Python 3.11, this results in the following bytecode:
1 0 RESUME 0
2 2 LOAD_CONST 1 (<code object <listcomp> at 0x1028db500)
4 MAKE_FUNCTION 0
6 LOAD_FAST 0 (gen)
8 GET_ITER
10 PRECALL 0
14 CALL 0
24 RETURN_VALUE
Disassembly of <code object <listcomp> at 0x1028db500:
2 0 RESUME 0
2 BUILD_LIST 0
4 LOAD_FAST 0 (.0)
>> 6 FOR_ITER 7 (to 22)
8 STORE_FAST 1 (i)
10 LOAD_FAST 1 (i)
12 LOAD_FAST 1 (i)
14 BINARY_OP 0 (+)
18 LIST_APPEND 2
20 JUMP_BACKWARD 8 (to 6)
>> 22 RETURN_VALUE
You can see here that the comprehension has been turned into a bare code object which is called from within the other function, with all the overheads and inefficiencies that incurs.
Now let’s do exactly the same under Python 3.12 and see what we get:
1 0 RESUME 0
2 2 LOAD_FAST 0 (gen)
4 GET_ITER
6 LOAD_FAST_AND_CLEAR 1 (i)
8 SWAP 2
10 BUILD_LIST 0
12 SWAP 2
>> 14 FOR_ITER 7 (to 32)
18 STORE_FAST 1 (i)
20 LOAD_FAST 1 (i)
22 LOAD_FAST 1 (i)
24 BINARY_OP 0 (+)
28 LIST_APPEND 2
30 JUMP_BACKWARD 9 (to 14)
>> 32 END_FOR
34 SWAP 2
36 STORE_FAST 1 (i)
38 RETURN_VALUE
>> 40 SWAP 2
42 POP_TOP
44 SWAP 2
46 STORE_FAST 1 (i)
48 RERAISE 0
You can clearly see that what was a separate function has been inlined into the bytecode for the add_one() function itself.
To get a very crude grasp of what difference this makes to the performance, I ran this on both Python 3.11 and 3.12:
>>> import timeit
>>> setup = """
... x = range(100)
... def add_one(gen):
... return [i+1 for i in gen]
... """
>>> timeit.timeit("add_one(x)", setup=setup)
Now this is such a simple case that I wasn’t really sure whether the inlining would make a difference in terms of speed, but it did: a million repetitions took 2082 ms on Python 3.11 and 1773 ms on 3.12, which is a reduction of around 15%. Considering that real-world use-cases will likely invoke comprehensions far more often than this, the savings in practice could be higher still—the Python documentation claims up to twice as fast, in fact.
Note that, as per PEP 709, generator expressions aren’t yet inlined. I can see how these would be a more complex case due to the way they’re paused and resumed, but perhaps someone will be brave enough to take it on in future.
In common with some other recent Python releases, 3.12 contains some improvements to error reporting.
Modules from the standard library are suggested as possible missing imports if a NameError reaches the outermost scope—note that this doesn’t apply to any modules outside the standard library, however.
>>> os.listdir("/")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'os' is not defined. Did you forget to import 'os'?
Within methods, a NameError that might be due to forgetting self will be flagged. Note that this is done by doing a lookup in the object’s attributes at that moment, so even attributes set externally will be considered. This is illustrated below.
>>> class MyClass:
... def my_method(self):
... print(foo)
...
>>> x = MyClass()
>>> x.my_method()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in my_method
NameError: name 'foo' is not defined
>>> x.foo = 123
>>> x.my_method()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in my_method
NameError: name 'foo' is not defined. Did you mean: 'self.foo'?
If someone erroneously types import Z from Y instead of the correct from Y import Z, the resultant SyntaxError is now more helpful.
>>> import abc from xyz
File "<stdin>", line 1
import abc from xyz
^^^^^^^^^^^^^^^^^^^
SyntaxError: Did you mean to use 'from ... import ...' instead?
Also, where the correct syntax is used but the name being imported is incorrect, Python now offers suggestions based on the names defined in the module.
>>> from logging import handler
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'handler' from 'logging' ([...]). Did you mean: 'Handler'?
One interesting point: I’d expected that case to offer me a suggestion of handlers, which is genuinely a submodule of logging. This illustrates that, while useful, it’s probably not wise to rely on these suggestions too heavily.
The rest of this article covers features which may be a little obscure for many users of Python. This section covers a new monitoring API, which is only likely to be used by people writing debugging or profiling tools; and the next section covers some improvements to the use of sub-interpreters, which is a feature general Python developers probably don’t need. Even if they’re niche use-cases, I think they’re useful and interesting insight into the interpreter, but if you don’t agree then you might like to skip the rest of this particular article.
Debuggers and profilers are useful tools, but in Python their implementation often has a significant impact on the performance of the code they’re inspecting. This is not always troublesome, but can be annoying when running long tests or reproducing rare issues under a debugger, so reducing the performance impact is always useful.
Enter PEP 669, which introduces a new API allowing these tools to be written with lower impact on performance. It leverages the dynamic updates to running Python code added as part of the specialising adaptive interpreter introduced by PEP 659 in Python 3.11.
These changes introduce a new module sys.monitoring to allow callbacks to be registered for events of interest, many of which can be set globally or only on specific modules. Once the callbacks are registered, the specific events which should trigger these callbacks can be activated—by default all events are deactivated, so there is no monitoring overhead.
First let’s take a look at what facilities the sys.monitoring module offers, and then I’ll try to figure out how these work at a high level.

There are three steps to receiving monitoring events in your callback functions:

1. Register your tool in one of the available slots with use_tool_id().
2. Register a callback for each event of interest with register_callback().
3. Enable the specific events you want with set_events() or set_local_events().

We’ll look at each of these steps, and then see a simple example of their use.
Registering the monitoring tool is just a case of calling use_tool_id() with the numerical slot number from 0 to 5 and the name of the tool, and assuming you don’t get a ValueError raised then you’re registered. You then need to pass this same slot number into all the remaining calls.
Although you can use any slot you like, there are some pre-defined constants for specific types of tool to help avoid conflicts:
sys.monitoring.DEBUGGER_ID = 0
sys.monitoring.COVERAGE_ID = 1
sys.monitoring.PROFILER_ID = 2
sys.monitoring.OPTIMIZER_ID = 5
As an aside, there are actually two other slots, IDs 6 and 7, but they’re not directly available. These are used to maintain the functionality provided by the existing sys.settrace(), which uses slot 6, and sys.setprofile(), which uses slot 7. Experiments have shown this new API is considerably faster than sys.settrace() if only a small set of events are active, but only slightly better if many events occur, such as triggering on every line of code.
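As a brief sketch of this first step—the tool name here is arbitrary, and the slot choice just follows the predefined constant:

import sys

TOOL_ID = sys.monitoring.PROFILER_ID

# get_tool() returns the name registered in a slot, or None if it's free.
assert sys.monitoring.get_tool(TOOL_ID) is None
sys.monitoring.use_tool_id(TOOL_ID, "my-demo-profiler")

# ... register callbacks and enable events here ...

sys.monitoring.free_tool_id(TOOL_ID)  # release the slot once we're done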
Callbacks are registered with the helpfully named register_callback() function, which takes the tool ID, the event which should trigger the callback, and the callback itself. Each tool can only have a single callback registered per event, so calling this a second time will replace it and return the original one—registering a callback of None effectively just unregisters any previous callback.
The signature of these callbacks depends on the event handled, with groups of related events sharing the same callback signature. In the Events section below, I run through all the events and the callback signature required when registering them.
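For illustration, registering and then unregistering a simple LINE callback might look like this, reusing the hypothetical TOOL_ID from the sketch above:

def on_line(code, line_number):
    print(f"executing {code.co_filename}:{line_number}")

# Returns whichever callback was previously registered, or None.
previous = sys.monitoring.register_callback(
    TOOL_ID, sys.monitoring.events.LINE, on_line)

# Registering None unregisters the callback again.
sys.monitoring.register_callback(TOOL_ID, sys.monitoring.events.LINE, None)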
The final step before you start receiving callbacks is that you need to enable the specific event types that you want to see. This means that your tool can register all its callbacks as a static initialisation, and then enable/disable specific events as needed to help the developer analyse their code.
Events can be enabled globally on all code using sys.monitoring.set_events(), or you can pass a specific code object (e.g. a module or function) to sys.monitoring.set_local_events() to only enable them within that code. The events actually enabled will be the union of the events enabled globally and those enabled for that code object, so there’s no way for a particular code object to exclude certain events—they’d need to be disabled globally and then enabled on each specific code object in which you’re interested.
Multiple events can be specified using the bitwise OR operator, with the event flags living in the sys.monitoring.events namespace. For example, to enable both PY_START and PY_RESUME events globally for a debugger, you could do:

sys.monitoring.set_events(sys.monitoring.DEBUGGER_ID,
                          sys.monitoring.events.PY_START | sys.monitoring.events.PY_RESUME)
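Or, as a sketch of the more targeted form—assuming a tool has already been registered in the debugger slot—you could restrict LINE events to a single function’s code object:

def hot_function():
    return sum(range(10))

# Only hot_function()'s bytecode gets instrumented for LINE events.
sys.monitoring.set_local_events(sys.monitoring.DEBUGGER_ID,
                                hot_function.__code__,
                                sys.monitoring.events.LINE)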
All of the callbacks take a CodeType argument first, which is the compiled code which triggered the event; I’ve omitted this from the parameter documentation below. Many of the callbacks also pass one or more bytecode offsets to indicate the location in the code where an event is occurring. As you might be aware, these offsets are different from line numbers, but as we’ll see in the code example further on, it’s possible to convert these offsets into line numbers using the co_lines() method of the code object.
Any of these callbacks can return the special value sys.monitoring.DISABLE, which has the effect of disabling further callbacks for that particular tool on that particular event at that particular bytecode instruction. This can be used to improve performance, as disabled events don’t incur performance overhead. However, some exception-related event types aren’t necessarily tied to a particular location in the code and cannot be disabled in this way—this is noted under the individual event types below where applicable. To re-enable any such disabled callbacks, sys.monitoring.restart_events() will do so for all tools.
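As a sketch of how a coverage-style tool might use this, a LINE callback can record each location once and then disable itself there, so each line only costs you a callback once until restart_events() is called:

covered = set()

def coverage_callback(code, line_number):
    covered.add((code.co_filename, line_number))
    # Suppress further LINE events at this particular location.
    return sys.monitoring.DISABLE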
sys.monitoring.events.INSTRUCTION
Triggered when a VM instruction is about to be executed. As well as the CodeType parameter common to all these callbacks, this one also has a single bytecode instruction offset passed indicating the instruction that’s about to execute.
instruction_callback(code: CodeType, offset: int)
sys.monitoring.events.LINE
Triggered when an instruction which starts a new source line is about to be executed. Instead of the offset passed to the callback for INSTRUCTION, this one takes the source code line number.
line_callback(code: CodeType, line_number: int)
sys.monitoring.events.CALL
sys.monitoring.events.C_RETURN
sys.monitoring.events.C_RAISE
CALL is triggered when a call is about to be made from Python code; C_RETURN and C_RAISE are triggered when a callable other than a Python function returns or raises an exception respectively. These events have some interdependencies:

- The CALL event of a given call must be seen to also see the corresponding C_RETURN or C_RAISE event for that call.
- C_RETURN and C_RAISE are tied together—either neither or both events must be enabled. If you try to enable just one you’ll get a ValueError.

The callback for all of these takes a bytecode instruction offset and also a reference to the callable that is being called or returned from. It also takes arg0, which is the first argument to the function for CALL events, the return value for C_RETURN events, or the exception instance for C_RAISE events. If there is no such value, arg0 will be set to the special singleton sys.monitoring.MISSING.
call_callbacks(code: CodeType, offset: int, callable: object, arg0: object | MISSING)
sys.monitoring.events.PY_START
sys.monitoring.events.PY_RESUME
PY_START is triggered when a Python function is about to start executing, and PY_RESUME when a generator or coroutine is resumed, except where the resumption comes from calling the throw() method on the generator or coroutine object (see PY_THROW for that). Aside from the code object, these pass a single parameter offset which is the bytecode offset of the first instruction in the function which is about to be executed.
func_start_callback(code: CodeType, offset: int)
sys.monitoring.events.PY_RETURN
sys.monitoring.events.PY_YIELD
PY_RETURN is triggered when a Python function is about to return, and PY_YIELD when a generator or coroutine is about to yield. These are passed two additional parameters, where offset is the bytecode instruction doing the returning and retval is the value being returned or yielded as appropriate.
func_exit_callback(code: CodeType, offset: int, retval: object)
sys.monitoring.events.RAISE
sys.monitoring.events.RERAISE
sys.monitoring.events.EXCEPTION_HANDLED
sys.monitoring.events.PY_UNWIND
sys.monitoring.events.PY_THROW
sys.monitoring.events.STOP_ITERATION
This group covers the exception-related events:

- RAISE is triggered when an exception is raised, except those which also trigger one of the more specific events in this group—notably the artificial StopIteration raised for generators or coroutines (see STOP_ITERATION).
- RERAISE is triggered when an exception is re-raised, for example at the end of a finally block which has been executed during stack unwinding.
- EXCEPTION_HANDLED is triggered when an exception is caught by a handler.
- PY_UNWIND is triggered on exit from a Python function during exception unwinding.
- PY_THROW, along with RAISE, is triggered instead of PY_RESUME when a coroutine or generator function is resumed by calling the throw() method of the appropriate object.
- STOP_ITERATION is triggered when an artificial StopIteration is raised to carry the return value of a generator or coroutine. Note that since this is an inefficient way to return values, this isn’t always triggered where you expect it might be—only where the StopIteration would be visible. For example, if you iterate through the results of a generator using a for loop you’ll see it, but if you pass that same generator into the builtin sum() then you won’t.

Of the events in this group only STOP_ITERATION can be disabled—the others are not “local events” and if you attempt to return DISABLE from the callback for them, then that callback will be removed and a ValueError will be raised. This is a very confusing context in which to raise an exception, and from my brief testing you get some pretty confused stack traces, so my strong suggestion is that you just make sure you don’t do this or you’ll be in quite a painful place.

In all these cases, the additional parameters to the callback are the offset of the bytecode instruction where the event was triggered, and a reference to the exception triggering the action.
exception_callback(code: CodeType, offset: int, exception: BaseException)
sys.monitoring.events.BRANCH
sys.monitoring.events.JUMP
BRANCH is triggered when a conditional branch is taken (or not taken), and JUMP when an unconditional jump occurs in the bytecode. The branch operations are from src_offset to dst_offset within the code object passed as the first parameter.
branch_callback(code: CodeType, src_offset: int, dst_offset: int)
To help demonstrate this, I created the very simple case below, which registers callbacks for a subset of the events which just print their arguments to standard output. Before we look at the code, there are a couple of notes that are worth bearing in mind as you read through it:

- The code_offset_to_line() function converts a bytecode instruction offset, as passed into the callbacks, into a source file and line number. If you want to know about the co_lines() method on the code object it uses to do this, read through PEP 626 for more details. The way it works isn’t important for the monitoring functions.
- The get_func_exec_cb() factory provides the callback used for both PY_START and PY_RESUME events, since having two copies of the function which differed only in the event name would have been dull to implement.

That said, here’s the code:
[mon-demo.py — 70-line listing]
If you execute this code, you should see this output:
[CALL] sum_under_ten() mon-demo.py:70 arg0=6
[PY_START] mon-demo.py:62
[CALL] generator_func() mon-demo.py:64 arg0=6
[PY_START] mon-demo.py:58
[CALL] range() mon-demo.py:59 arg0=6
[C_RETURN] range() mon-demo.py:59 arg0=6
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=7
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:67'
[JUMP] src_line='mon-demo.py:67' dst_line='mon-demo.py:64'
[PY_RESUME] mon-demo.py:60
[JUMP] src_line='mon-demo.py:60' dst_line='mon-demo.py:59'
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=8
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:67'
[JUMP] src_line='mon-demo.py:67' dst_line='mon-demo.py:64'
[PY_RESUME] mon-demo.py:60
[JUMP] src_line='mon-demo.py:60' dst_line='mon-demo.py:59'
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=9
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:67'
[JUMP] src_line='mon-demo.py:67' dst_line='mon-demo.py:64'
[PY_RESUME] mon-demo.py:60
[JUMP] src_line='mon-demo.py:60' dst_line='mon-demo.py:59'
[BRANCH] src_line='mon-demo.py:59' dst_line='mon-demo.py:59'
[PY_YIELD] mon-demo.py:60 retval=10
[BRANCH] src_line='mon-demo.py:64' dst_line='mon-demo.py:64'
[BRANCH] src_line='mon-demo.py:65' dst_line='mon-demo.py:66'
[PY_RETURN] mon-demo.py:68 retval=24
[CALL] print() mon-demo.py:70 arg0=24
24
[C_RETURN] print() mon-demo.py:70 arg0=24
[PY_RETURN] mon-demo.py:70 retval=None
Based on the earlier documentation, you should be able to match up this output with the code to see what’s happening here. One potentially puzzling point is that the BRANCH events on the lines with for loops branch to themselves—this makes more sense when you realise that a single source code line evaluates to multiple bytecode instructions, and branches target an instruction rather than a source line.
This example also illustrates the slight asymmetry of the fact that CALL events are generated for both C extensions and builtins (e.g. range()) as well as Python functions (e.g. sum_under_ten()); however, the return from range() triggers C_RETURN whereas the return from sum_under_ten() triggers PY_RETURN. On that topic, note that I’ve enabled C_RAISE even though it’s never triggered, because you can’t enable C_RETURN without C_RAISE as well, as previously noted.
Also note that there’s only one PY_START event for generator_func(), and all remaining re-entries are PY_RESUME events.
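To make this concrete, here’s a minimal self-contained sketch in the same spirit as mon-demo.py above—the demo() function, tool name and choice of events are my own, so the output is simpler than the trace shown earlier:

import sys

mon = sys.monitoring
ev = sys.monitoring.events

TOOL_ID = mon.PROFILER_ID

def offset_to_line(code, offset):
    # co_lines() (PEP 626) yields (start, end, line) ranges of bytecode offsets.
    for start, end, line in code.co_lines():
        if line is not None and start <= offset < end:
            return f"{code.co_filename}:{line}"
    return f"{code.co_filename}:?"

def on_py_start(code, offset):
    print(f"[PY_START] {offset_to_line(code, offset)}")

def on_call(code, offset, func, arg0):
    name = getattr(func, "__name__", repr(func))
    print(f"[CALL] {name}() {offset_to_line(code, offset)} arg0={arg0!r}")

def on_py_return(code, offset, retval):
    print(f"[PY_RETURN] {offset_to_line(code, offset)} retval={retval!r}")

def demo(n):
    return sum(i for i in range(n) if i % 2 == 0)

mon.use_tool_id(TOOL_ID, "event-printer")
mon.register_callback(TOOL_ID, ev.PY_START, on_py_start)
mon.register_callback(TOOL_ID, ev.CALL, on_call)
mon.register_callback(TOOL_ID, ev.PY_RETURN, on_py_return)

# Instrument only demo() itself, so the monitoring calls in this script
# don't generate events of their own.
mon.set_local_events(TOOL_ID, demo.__code__,
                     ev.PY_START | ev.CALL | ev.PY_RETURN)

demo(6)

mon.set_local_events(TOOL_ID, demo.__code__, ev.NO_EVENTS)
mon.free_tool_id(TOOL_ID)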
So how does this work? Well, if you’ve read my previous article which covered the specialising adaptive interpreter features, or you’re otherwise familiar with them, you’ll recall that since Python 3.11 dynamic runtime performance improvements are made by specialising certain instructions. This new monitoring interface uses the same technique.
Let’s recap with an example: suppose a LOAD_ATTR bytecode is generated—this instruction pops the object on the top of the stack, retrieves a specified attribute from it and pushes the value of that attribute back on to the stack. With the dynamic specialisation introduced in 3.11, this instruction becomes an adaptive version LOAD_ATTR_ADAPTIVE which records how it’s used. If it’s consistently used to obtain an attribute from an object instance via __dict__, say, then after a number of consecutive repetitions that instruction will dynamically be replaced by a LOAD_ATTR_INSTANCE_VALUE bytecode which can do the same job more quickly because it doesn’t need to perform as many checks.
Of course, usage patterns can’t be assumed to stay the same for the lifetime of a script, so the specialised versions are written to confirm their assumptions are correct and revert back to the basic LOAD_ATTR if not—if this happens a few times, the instruction itself reverts back to LOAD_ATTR_ADAPTIVE and the process repeats. If you want a more detailed illustration, you can take a few minutes to read my previous article.
So, this new API modifies instructions to instrumented versions of themselves, and potentially inserts new instructions for things like LINE events. This update happens during sys.monitoring.set_events() for any code objects present on the call stack of any thread, and traps are put into place to ensure other code is updated as soon as it’s called. This operation is performed on:

- every instruction, for LINE and INSTRUCTION events
- call instructions, for CALL
- the instructions which start and resume functions, for PY_START and PY_RESUME
- return and yield instructions, for PY_RETURN and PY_YIELD
- jump and branch instructions, for JUMP and BRANCH
For other events such as RAISE, it’s cheaper to instead do a check when the event happens, so the bytecodes don’t need to be modified.
Let’s take a look at an example of this in action. Consider this simple implementation of a factorial function in Python:
def fact(x):
    if x < 2:
        return x

    return x * fact(x - 1)
If we use dis.dis() with adaptive=True and show_caches=True then we get the following bytecode disassembly:
1 0 RESUME 0
2 2 LOAD_FAST__LOAD_CONST 0 (x)
4 LOAD_CONST 1 (2)
6 COMPARE_OP_INT 2 (<)
8 CACHE 0 (counter: 832)
10 POP_JUMP_IF_FALSE 2 (to 16)
3 12 LOAD_FAST 0 (x)
14 RETURN_VALUE
5 >> 16 LOAD_FAST 0 (x)
18 LOAD_GLOBAL_MODULE 1 (NULL + fact)
20 CACHE 0 (counter: 832)
22 CACHE 0 (index: 9)
24 CACHE 0 (module_keys_version: 75)
26 CACHE 0 (builtin_keys_version: 0)
28 LOAD_FAST__LOAD_CONST 0 (x)
30 LOAD_CONST 2 (1)
32 BINARY_OP_SUBTRACT_INT 10 (-)
34 CACHE 0 (counter: 832)
36 CALL_PY_EXACT_ARGS 1
38 CACHE 0 (counter: 832)
40 CACHE 0 (func_version: 2062)
42 CACHE 0
44 BINARY_OP_MULTIPLY_INT 5 (*)
46 CACHE 0 (counter: 832)
48 RETURN_VALUE
At this point, I registered a tool, registered a dummy callback which did nothing, and enabled only the CALL and LINE events for illustrative purposes. If I called dis.dis() again at this point, I wouldn’t see any changes because the trap to modify the code on next execution won’t have fired. So I called the function one more time, and then dis.dis() showed the changes, which you can see below:
1 0 RESUME 0
2 2 INSTRUMENTED_LINE 0
4 LOAD_CONST 1 (2)
6 COMPARE_OP_INT 2 (<)
8 CACHE 0 (counter: 832)
10 POP_JUMP_IF_FALSE 2 (to 16)
3 12 INSTRUMENTED_LINE 0
14 RETURN_VALUE
5 >> 16 INSTRUMENTED_LINE 0
18 LOAD_GLOBAL_MODULE 1 (NULL + fact)
20 CACHE 0 (counter: 768)
22 CACHE 0 (index: 9)
24 CACHE 0 (module_keys_version: 75)
26 CACHE 0 (builtin_keys_version: 0)
28 LOAD_FAST 0 (x)
30 LOAD_CONST 2 (1)
32 BINARY_OP_SUBTRACT_INT 10 (-)
34 CACHE 0 (counter: 832)
36 INSTRUMENTED_CALL 1
38 RESERVED
40 BINARY_OP_MULTIPLY_INT 8 (**)
42 CACHE 0 (counter: 0)
44 BINARY_OP_MULTIPLY_INT 5 (*)
46 CACHE 0 (counter: 832)
48 RETURN_VALUE
There are some interesting things to note here. Firstly, note that CALL_PY_EXACT_ARGS has been replaced with INSTRUMENTED_CALL. This illustrates one of the downsides of this approach to instrumentation, namely that the benefits of adaptive specialisation are lost. This means that instrumenting code will potentially worsen performance by a little more than just the overhead of calling the callbacks, as you lose the performance boost of specialisations.
I also note that the INSTRUMENTED_LINE bytecode appears to have replaced existing instructions rather than being inserted in addition to them. From briefly perusing the CPython code, it looks as if these instructions store the original opcode and execute it after invoking the callback. This makes sense, because it’s not at all clear that inserting new instructions would be possible using the dynamic specialisation mechanism—and when the event is disabled again, the bytecode can simply be switched back to the original.
I also note that the CACHE bytecode in instruction 38, which adaptive instructions use to store their counters, has been changed to RESERVED. I’m a little puzzled why instruction 40 has been replaced with BINARY_OP_MULTIPLY_INT instead of RESERVED—either dis.dis() has mistranslated the opcode, or there’s something odd going on that I’m not quite understanding.
All in all, it seems like a neat way to do things, and being able to instrument code with lower overhead is handy. Perhaps more to the point, I could imagine there are other uses to be found for this dynamic instruction swapping technique, and I’m interested to see what other applications come to light in the future.
Since the 1.x days of Python, it’s always been possible to run multiple separate interpreters in the same process, should you want to. These sub-interpreters could only be started using the Py_NewInterpreter() function in C code, and each one had its own collection of threads with their own stacks, plus its own copy of imported modules and most other things.

However, there were a few things that were still shared between the multiple sub-interpreters, and one of these was the global interpreter lock (GIL). This meant that only one sub-interpreter could ever be executing Python code at a time, which made them useless for making use of multiple processors—since this seemed like a potentially handy use-case for sub-interpreters, this was a notable drawback.
In Python 3.12 the implementation of PEP 684 has moved much of the previously shared global state into per-interpreter storage, including the GIL. This means that those interpreters can now make use of multiple processors more effectively.
To do this required quite a few changes. Firstly, a good deal of previously global state was moved into the PyInterpreterState structure, which exists per sub-interpreter—most notably the GIL itself and the state of the object allocator. However, this does still leave some data shared across all interpreters, such as the memory allocators discussed below.
One of the trickier parts of the change was dealing with allocators. CPython provides three allocator domains, where each domain manages its own memory and can use an allocation strategy that’s optimised for a particular purpose. The three domains are:

- The “raw” domain, for general-purpose memory which must be safe to allocate without holding the GIL.
- The “mem” domain, for Python buffers and general-purpose memory, which requires the GIL to be held.
- The “object” domain, for memory belonging to Python objects, which also requires the GIL to be held.
Each domain has its own functions to allocate and free memory, and custom allocators for each domain can be set during runtime initialisation. Prior to Python 3.12 all of these allocators were global, shared by all sub-interpreters. Also, as a consequence of the mem and object domain allocators requiring the GIL, there was no particular need for them to be thread-safe.
This was fine up until the GIL got moved into the sub-interpreters—at that point, sharing the allocators across multiple interpreters was a recipe for crashes. The choice was essentially to either make the allocators all thread-safe and leave them global, or move them into the scope of each sub-interpreter and not require thread-safety.

For reasons outlined in the PEP, the allocators were left shared and are now required to be thread-safe. This does mean that anyone using custom allocators which aren’t thread-safe, and also using sub-interpreters, is going to risk data races at runtime—but this is currently protected against by requiring own_gil to be false if custom allocators are used (see below).
A new function Py_NewInterpreterFromConfig() has been introduced which, unlike the original Py_NewInterpreter(), takes a configuration parameter2 of type PyInterpreterConfig to control the behaviour of the new interpreter. This means that developers can adopt whichever of the new features are compatible with their use-cases, while existing code still using Py_NewInterpreter() maintains backwards compatibility.
This configuration structure provides the following fields which can be set—the “legacy” values shown below are the compatibility mode options used for Py_NewInterpreter(), and the “new” values are for fully isolated sub-interpreters:
int gil (legacy: SHARED_GIL, new: OWN_GIL)
Set to PyInterpreterConfig_SHARED_GIL for the sub-interpreter to share the main interpreter’s GIL, or PyInterpreterConfig_OWN_GIL for it to have its own. There’s also a PyInterpreterConfig_DEFAULT_GIL definition, and it currently defaults to a shared GIL for backwards compatibility.

int use_main_obmalloc (legacy: 1, new: 0)
If non-zero, the sub-interpreter shares the main interpreter’s object allocator state; if zero, it gets its own.

int allow_fork (legacy: 1, new: 0)
If zero, fork() is disallowed in any thread where that sub-interpreter is active, otherwise fork() is unrestricted. The subprocess module still works even if fork() is disallowed, however.

int allow_exec (legacy: 1, new: 0)
If zero, execv() and similar functions will be disallowed in any thread where that sub-interpreter is active. Once again, this doesn’t affect subprocess.

int allow_threads (legacy: 1, new: 1)
If zero, the threading module in that sub-interpreter won’t allow new threads to be created.

int allow_daemon_threads (legacy: 1, new: 0)
Provided allow_threads is non-zero, this can be set to zero to force threading to only create non-daemon threads.

int check_multi_interp_extensions (legacy: 0, new: 1)
If non-zero, importing an extension module will fail unless it explicitly declares support for multiple interpreters (via multi-phase initialisation, as per PEP 489). This may only be set to zero if use_main_obmalloc is also true.

The f-string changes might seem modest, but I run into the issue of clashing quotes just often enough that simply not having to worry about it any more is going to be helpful. The improved error reporting and the ability to nest f-strings are nice bonuses as well.
Inline comprehensions are one of those things that a lot of people won’t even realise has been implemented, but they’re really useful little constructs so it’s great to see their performance rising to match their concision.
The only thing I’m a little disappointed by is the failure to include generator expressions, even though I understand some of the particular difficulties they would introduce. Where I need performance, I’ll typically use a generator expression to avoid having to buffer up potentially large amounts of data in memory. With these new, faster comprehensions, however, there’s more value in weighing up the expected data size carefully, as performance might now be significantly better using comprehensions for smaller data volumes.
Better error reporting is always useful, although it does feel a little like we’re getting into some slightly obscure corners here—perhaps that means all the low-hanging fruit for error reporting is already plucked.
Finally, I’m a big fan of this new approach to monitoring—self-modifying code is a really handy trick if you can get it right, and a project the size of Python is well-tested and well-used enough to inspire confidence on that score. Even aside from conventional debugging and profiling, this approach might now be cheap enough to use in production for things like detecting when deprecated functions are called, or even rate-limiting calls to a particular piece of code. Of course, the best option would normally be to modify the code itself, but there may be cases where that’s inadvisable or impossible, and this non-invasive approach feels like one of those specialist tools you rarely use, but when you do need it, you’re really glad you have it.
That’s it for this article, and I think I’ve looked at all the major changes in the core language for Python 3.12. Next time I’ll run through some of the changes to the standard library which, on cursory inspection, seem fairly modest.
And, if I’m honest, it’s a little concerning that someone might actually make use of this to change values in an f-string or something awful. But you can’t let potential for abuse stop you adding useful features to a language or you’d never have anything. ↩
Technically the function takes two parameters, but the first of them is really an output parameter to return the PyThreadState for the new sub-interpreter. ↩