In this series looking at features introduced by every version of Python 3, we complete our look at 3.7 by checking the changes in the standard library. These include three new modules, as well as changes across many other modules.
This is the 16th of the 34 articles that currently make up the “Python 3 Releases” series.
Forgive me Internet for I have sinned — it has been seven months since my last blog post. But for one thing, that’s a lot shorter than some of my previous breaks; and for another, this isn’t a post about my blogging consistency — this is all about Python.
In the previous post we ran through the language changes in 3.7, and this time I'm going to touch on the library changes. There are quite a lot of very minor updates in this release, so I may well skip a few of the smaller ones. Then again, I always intend to do that and somehow my OCD tendencies always kick in and I end up covering nearly everything. Let's see how well I do this time.
We’ll start by looking at three entirely new libraries that were added.
This module, and a new set of C APIs, provide the feature of context variables. These are similar in principle to thread-local storage, except that they also correctly distinguish between asynchronous tasks (i.e. with asyncio
) as well as threads. This idea was initially proposed in PEP 550, but this was rather too grand in its scope so it was withdrawn and a simplified PEP 567 proposed instead — it’s this latter PEP which has been implemented in this release.
There are two concepts to grasp when using this module. The first are the context variables, as already mentioned. These are typically declared at the module level and act like keys into a context dictionary. The second concept is a context. In the case of thread-local storage, the context is always the current thread. With this module, however, the context is a concept that’s exposed to the developer and can be selected in code — this is what enables the same principle to be extended to asynchronous tasks which execute in the same OS thread.
The default behaviour is for each thread to have its own context, however, so let’s start with a simple threading example:
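A minimal sketch of such an example might look like the following, where the worker increments the variable and sleeps briefly, the pool is limited to three threads, and the main thread sets its own value to 10 before submitting nine jobs:

import concurrent.futures
import contextvars
import time

# Declared at module level: this is just the key used to look up the value
# in whichever context happens to be current.
context_var = contextvars.ContextVar("context_var", default=0)

def thread_func():
    # Increment the value within the current context and return what we see.
    context_var.set(context_var.get() + 1)
    time.sleep(0.1)
    return context_var.get()

context_var.set(10)

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(thread_func) for _ in range(9)]
    for future in concurrent.futures.as_completed(futures):
        print("Result:", future.result())

print("Final value:", context_var.get())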
Notice how context_var
is really just a key to use to access the variable — you intentionally access this globally, using its set()
and get()
methods. When you do so, the value from the current local context will automatically be used.
Since each of the threads should have its own local context, when we run that code we should see the context variable incremented from its default of 0
to 1
, but the context of the main thread should be untouched and finally still display as 10
as set near the top of the main script. At least that's what you might expect — let's see what output we get when I run it:
$ python context-vars-test.py
Result: 1
Result: 1
Result: 1
Result: 2
Result: 2
Result: 2
Result: 3
Result: 3
Result: 3
Final value: 10
Some of you may be ahead of me here, but I suspect a good number of you are scratching your heads. However, things might become clearer if you think about how we’re executing these threads. Think about the name ThreadPoolExecutor
— it’s a pool of threads, and we’re limiting the size of it to 3. This means the first three threads all display the behaviour we expect, but then the pool is exhausted and the executor waits until a thread becomes free to run the next instance. Because this thread is being reused, it uses the same context as the previous instance running in this thread, and that’s why we see the value value being incremented again.
It’s worth noting that it’s the time.sleep()
which makes the behaviour more-or-less deterministic in this example. If you take that away, the threads execute so quickly that the same 1-2 threads may be available even before the next submit()
call, so you’ll see some unpredictability in the results.
If we want each instance to use its own context regardless of which thread it's in, we can call contextvars.Context()
to construct an empty new context, and then use the run()
method to execute the function within that context. The following simple modification calls run()
in each worker, passing thread_func()
as an argument. This will yield the output you might expect.
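In terms of the sketch above, only the submit() call needs to change, so that a freshly constructed Context is used for each job and thread_func is passed as the argument to its run() method:

# Replaces the submit() line in the sketch above.
futures = [executor.submit(contextvars.Context().run, thread_func)
           for _ in range(9)]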
As well as constructing new contexts, you can create shallow copies of the current context by calling contextvars.copy_context()
. This acts like a fork, where all the values are inherited from the current context, but changes in the copy won’t impact the values in the original. You can see this with another small change to the example above:
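Again only the submit() call changes in the sketch, and note that the copy is taken in the main thread at the point each job is submitted, so each copy inherits the value 10:

# Replaces the submit() line in the sketch above.
futures = [executor.submit(contextvars.copy_context().run, thread_func)
           for _ in range(9)]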
Now you’ll see all the threads returning value 11
, but the main thread still displaying its value 10
at the end as before.
Finally, it’s worth noting that asyncio
has also been updated with support for contexts in this release. Each task has its own context, and there’s support for manually specifying a context when callbacks are invoked. The example below is more or less equivalent to the thread example above1. If you run it, you’ll note that the tasks all return 11
again, indicating that asyncio
is using copy_context()
to create the context for newly created tasks.
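A rough sketch of such an equivalent, using one task per job instead of a pool of threads:

import asyncio
import contextvars

context_var = contextvars.ContextVar("context_var", default=0)

async def task_func():
    # Each task runs in a copy of the context of the code which created it.
    context_var.set(context_var.get() + 1)
    await asyncio.sleep(0.1)
    return context_var.get()

async def main():
    context_var.set(10)
    tasks = [asyncio.create_task(task_func()) for _ in range(9)]
    for result in await asyncio.gather(*tasks):
        print("Result:", result)
    print("Final value:", context_var.get())

asyncio.run(main())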
The next module added in this release is the aptly named dataclasses
, proposed by PEP 557. This module adds a single decorator @dataclass
which can be added to a class as an easy way to generate simple structure-like classes. These support declaration of attributes by type annotations as with typing.NamedTuple
, except that in this case a full class is created rather than a subtype of tuple
, hence the attributes are mutable.
It’s perhaps best explained with a simple example:
>>> from dataclasses import dataclass
>>> from datetime import date
>>>
>>> @dataclass
... class Student:
... name: str
... dob: date
... class_name: str
... days_attended: int = 0
... days_absence: int = 0
...
>>> joe = Student()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __init__() missing 3 required positional arguments: 'name', 'dob', and 'class_name'
>>> joe = Student("Joe Bloggs", date(2013, 3, 14), class_name="4A")
>>> print(joe)
Student(name='Joe Bloggs', dob=datetime.date(2013, 3, 14), class_name='4A', days_attended=0, days_absence=0)
>>> joe.days_attended += 5
>>> joe.days_attended
5
This illustrates the declaration of attributes using type annotations, as well as the use of default values and the fact that the classes are given sensible __str__()
and __repr__()
methods which display the attribute values.
Classes are also given a __eq__()
method by default, although this can be disabled by passing eq=False
to the @dataclass
decorator. Additionally, ordering methods (__lt__()
, __le__()
, __gt__()
and __ge__()
) can be generated, although this isn’t done by default — if enabled by passing order=True
to the decorator, classes are ordered as if they were a tuple of the attributes in the order in which they’re declared.
By passing frozen=True
to the decorator, classes can be declared read-only, where any attempt to set an attribute will raise an exception. If this is done and eq=True
also, then an appropriate __hash__()
method will be automatically generated, to allow instances to be keys in hashed collections like dict
and set
. You can override this behaviour to generate a __hash__()
even for mutable types by passing unsafe_hash=True
to the decorator, but you’d best stay away from this sort of thing unless you’re extremely confident you know what you’re doing.
All in all, I see this being a useful generalisation of namedtuple
and it’s bound to come in handy for reducing boilerplate in simple cases.
The third new module is importlib.resources
, which is used to embed file-like resources inside Python packages. This is helpful for library authors who wish to distribute static data in files, but don’t want to worry whether those will be stored as actual files on the filesystem, or in some other form such as in importable zip archives.
I suspect this is a little niche, so I’m not going to drill into details. But to broadly illustrate how it can work, I set up a very silly example package called sillypkg
. It contains two modules, silly
and sillier
, and the use of importlib.resources
is within sillier.sillier_func()
— this wants to read a message from a text file included in the package and print it. Here are the contents of each file in the package:
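Each file is only a few lines, and sketches along these lines would do, with importlib.resources.read_text() doing the interesting work in sillier.py:

__init__.py:

from . import silly
from . import sillier

silly.py:

def silly_func():
    print("This function is silly.")

sillier.py:

from importlib import resources

def sillier_func():
    print(resources.read_text(__package__, "message.txt"), end="")

message.txt:

This message is not actually very silly. Woof.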
After creating the package directory I zipped it up, just to illustrate that the resources system could retrieve files from zips as well as standard directories:
$ unzip -l sillypkg.zip
Archive: sillypkg.zip
Length Date Time Name
--------- ---------- ----- ----
0 06-01-2022 12:04 sillypkg/
55 06-01-2022 11:55 sillypkg/silly.py
55 06-01-2022 11:55 sillypkg/__init__.py
47 06-01-2022 12:00 sillypkg/message.txt
116 06-01-2022 12:00 sillypkg/sillier.py
--------- -------
273 5 files
Finally, here you can see the modules being imported and the functions being called — the key part is the call to sillier_func()
, which correctly retrieves the contents of message.txt
:
>>> import sys
>>> sys.path.append("./sillypkg.zip")
>>> import sillypkg
>>> sillypkg.silly.silly_func()
This function is silly.
>>> sillypkg.sillier.sillier_func()
This message is not actually very silly. Woof.
I’m guessing this module will only be of significant use to a smallish subset of package maintainers, but I can certainly see how it would make life easier for anyone that falls into this use-case.
In terms of text processing, this release just sees some subtle, but potentially useful, changes to the re
module.
Prior to Python 3.7, re.split()
would fail if given a pattern which could match the empty string:
Python 3.6.13 (default, Mar 28 2021, 04:17:23)
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.split("(?=l)", "hello")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/andy/.pyenv/versions/3.6.13/lib/python3.6/re.py", line 212, in split
return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.
Whereas in this release these are now supported:
Python 3.7.10 (default, Mar 28 2021, 04:19:36)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.split("(?=l)", "hello")
['he', 'l', 'lo']
Also, a FutureWarning
is now raised if a pattern uses character sequences which might one day be used for set operations within character classes, as suggested by the Unicode specification.
>>> import re
>>> re.compile("[a-f&&aeiou]")
__main__:1: FutureWarning: Possible set intersection at position 4
re.compile('[a-f&&aeiou]')
There are some handy changes in three of the data types modules — some enhancements to collections.namedtuple
, a new method for parsing ISO time specifications in datetime
and some slightly obscure machinery that may be useful when creating enum.Enum
members programmatically.
There is a small but useful improvement to collections.namedtuple
which is the addition of a defaults
parameter to provide default values on construction of namedtuple
instances.
You can provide any iterable to specify the list of defaults. If the number of defaults provided is fewer than the number of attributes of the namedtuple
, they’re assigned to the rightmost set as is consistent with the fact that mandatory parameters must occur before those with default values.
You can see this in action in the small snippet below:
>>> import collections
>>>
>>> MyClass = collections.namedtuple(
... "MyClass",
... ("one", "two", "three", "four"),
... defaults=(333, 444))
>>> MyClass()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __new__() missing 2 required positional arguments: 'one' and 'two'
>>> MyClass(11, 22)
MyClass(one=11, two=22, three=333, four=444)
>>> MyClass(11, 22, 33)
MyClass(one=11, two=22, three=33, four=444)
The date
and datetime
objects now support a fromisoformat()
method, which parses date and datetime strings of the sort generated by the isoformat()
methods. These include those conforming to ISO 8601, although I have a feeling the methods accept a range of strings that is slightly broader than the ISO standard would strictly permit.
You may just think this is a convenience to avoid calling the strptime()
method, but there are actually some differences in the handling of timezones — the fromisoformat()
method is able to be more flexible, since it doesn't need to conform to a single fixed input format of the sort that must be specified when using strptime()
. If you’re interested in the gritty details, follow the full discussion on bpo-15873.
Here’s a quick illustration of this method in action, and some of the variations it can accept:
>>> import datetime
>>>
>>> datetime.datetime.fromisoformat("2022-06-02 11:15:00")
datetime.datetime(2022, 6, 2, 11, 15)
>>> datetime.datetime.fromisoformat("2022-06-02T11:15:00")
datetime.datetime(2022, 6, 2, 11, 15)
>>> datetime.datetime.fromisoformat("2022-06-02 11:15:12.123")
datetime.datetime(2022, 6, 2, 11, 15, 12, 123000)
>>> datetime.datetime.fromisoformat("2022-06-02 11:15:12+01:00")
datetime.datetime(2022, 6, 2, 11, 15, 12, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))
>>> datetime.date.fromisoformat("2022-06-02")
datetime.date(2022, 6, 2)
In addition to these new methods, there’s also a small change to the tzinfo
class to support sub-minute timezone offsets. I'd hazard a guess that the number of developers who need to worry about this case is pretty small, but if you want some insight into the motivation check out bpo-5288.
When creating enum.Enum
classes, it’s sometimes useful to do so programmatically rather than list every single constant. However, because of the introspection inherent in the declaration process, you can’t leave any class-scope variables hanging around, as demonstrated by this small example2:
>>> import calendar
>>> import enum
>>>
>>> class WeekDay(enum.Enum):
... namespace = vars()
... for i in range(len(calendar.day_name)):
... namespace[calendar.day_name[i]] = i
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in WeekDay
File "/Users/andy/.pyenv/versions/3.7.10/lib/python3.7/enum.py", line 105, in __setitem__
raise TypeError('Attempted to reuse key: %r' % key)
TypeError: Attempted to reuse key: 'i'
The problem is that the variables i
and namespace
hang around, and Enum
tries to create enumeration entries for them. As of release 3.7, however, there’s a new _ignore_
attribute which allows them to be skipped.
>>> class WeekDay(enum.Enum):
... _ignore_ = ("namespace", "i")
... namespace = vars()
... for i in range(len(calendar.day_name)):
... namespace[calendar.day_name[i]] = i
...
>>> WeekDay.Monday
<WeekDay.Monday: 0>
>>> WeekDay.Saturday
<WeekDay.Saturday: 5>
>>> WeekDay(3)
<WeekDay.Thursday: 3>
>>> WeekDay(3).value
3
>>> WeekDay(3).name
'Thursday'
On the operating system front there are some useful tweaks in logging
, a useful os.register_at_fork()
function, and some nanosecond resolution functions added to time
.
There are a few enhancements to logging
. The first is simply that Logger
instances can be pickled. The benefit here isn't so much pickling the loggers themselves, but making it easier to pickle other objects which just happen to have a Logger
instance inside them somewhere. Instances are just pickled into the name of the logger, so when they’re restored they’ll use or create a logger of the same name.
Next, the StreamHandler
class has a new setStream()
method to allow the output stream to be changed after construction. For example, this could be useful if you’re using sys.stderr
and you end up replacing sys.stderr
with a different stream and want to update all your handlers to use it. The function will flush all logs first, then replace the stream and return the old stream object, or None
if no change was made.
Finally, there’s a small but useful change to allow configuration passed to logging.config.fileConfig()
to use kwargs
to specify keyword arguments, alongside the existing args
for positional arguments. Here are two specifications of handlers in configparser
format:
[handler_args]
class=FileHandler
level=INFO
formatter=myformat
args=("foo.log", "w")
[handler_kwargs]
class=FileHandler
level=INFO
formatter=myformat
kwargs={"filename": "foo.log", "mode": "w"}
There are a handful of smaller improvements in the os
module. First up, the os.scandir()
function can now accept a directory file descriptor as well as a path name. This is useful if you’re calling it in a context such as an os.fwalk()
, which gives you directory descriptors on each iteration.
Next up there’s a new os.register_at_fork()
method which allows callbacks to be registered to be called just before or after a fork()
operation. This hook is provided by the Python wrapper around the underlying system call so these hooks won’t be called if, say, a C extension module calls fork()
itself, unless it goes to the trouble of calling the Python C APIs to trigger these hooks. Also, they’re only invoked if control returns to the Python interpreter, so you won’t see them called in cases like subprocess
.
You can install hooks to be run just prior to fork()
, or just after in either the parent or child. You can also install multiple hooks in any of these places. The example below illustrates all this.
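A sketch of the sort of script which produces the output shown below: two sets of hooks registered with os.register_at_fork(), followed by a single fork() call:

import os

def hook1_before():
    print("HOOK1: Before fork")

def hook1_parent():
    print("HOOK1: In parent")

def hook1_child():
    print("HOOK1: In child")

# First set of hooks, registered as named functions.
os.register_at_fork(before=hook1_before,
                    after_in_parent=hook1_parent,
                    after_in_child=hook1_child)

# Second set of hooks, which also report the current PID.
os.register_at_fork(before=lambda: print(f"HOOK2: Before pid={os.getpid()}"),
                    after_in_parent=lambda: print(f"HOOK2: Parent pid={os.getpid()}"),
                    after_in_child=lambda: print(f"HOOK2: Child pid={os.getpid()}"))

print(f"About to fork pid={os.getpid()}")
child_pid = os.fork()
if child_pid == 0:
    print(f"Forked (child) pid={os.getpid()}")
else:
    print(f"Forked (parent) child={child_pid} pid={os.getpid()}")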
And here is some output from running it:
About to fork pid=58507
HOOK2: Before pid=58507
HOOK1: Before fork
HOOK1: In parent
HOOK2: Parent pid=58507
Forked (parent) child=58521 pid=58507
HOOK1: In child
HOOK2: Child pid=58521
Forked (child) pid=58521
You can see that the before
hooks are executed in reverse order of registration, but the after_in_*
hooks are called in registration order. Note that if you run this, it's not deterministic whether the parent or child output comes out first, and it may even be interleaved — these are separate processes at this point and it's up to the operating system scheduler how to run them.
There are a couple of new functions os.preadv()
and os.pwritev()
. The os.preadv()
function simply combines the functionality of os.pread()
, which reads a specified number of bytes from a specified offset without changing the current file offset, with os.readv(), which splits the bytes read across a set of limited-size buffers. The os.pwritev()
similarly merges os.pwrite()
and os.writev()
.
Finally, there’s a small but noteworthy change to os.makedirs()
to create intermediate directories with full access permissions, modified by the user's umask, as opposed to the previous behaviour of applying the specified permissions to each intermediate directory. The issue with the old behaviour was that, since further child directories still need to be created inside the intermediate ones, it implicitly assumed the specified permissions included write access for the current user, which you may not wish to allow. The new behaviour is more consistent with the mkdir
utility.
The time
module now offers six new functions which are equivalent to existing ones but providing nanosecond resolution values, as specified in PEP 564:
Original | Nanosecond resolution
---|---
clock_gettime() | clock_gettime_ns()
clock_settime() | clock_settime_ns()
monotonic() | monotonic_ns()
perf_counter() | perf_counter_ns()
process_time() | process_time_ns()
time() | time_ns()
The actual precision of the values will, as always, be platform-dependent. The reasoning behind this change is that as we approach hardware clocks offering nanosecond-precision values, the use of float
to store these values starts to lose precision. A 64-bit IEEE 754 format floating point value starts to lose accuracy at nanosecond resolution if you store any period of time longer than around 104 days. By returning these values as int
they’re easy to deal with and store, and don’t lose precision.
There are also several new clock types that are supported:
- CLOCK_BOOTTIME (Linux only): The same as CLOCK_MONOTONIC except that it's adjusted to include time for which the system is suspended. This is useful if you need to be aware of suspend delays, but you don't want to deal with all the complexities of CLOCK_REALTIME.
- CLOCK_PROF (FreeBSD, NetBSD and OpenBSD only): A high-resolution per-process CPU timer.
- CLOCK_UPTIME (FreeBSD and OpenBSD only): A clock whose value is the time for which the system has been running and not suspended.

Finally, there's also a thread-specific version of the process_time()
function called thread_time()
, which returns the total of user and system CPU time consumed by the current thread. In keeping with the other changes in this release, there’s also a thread_time_ns()
version which returns the time in nanoseconds instead of fractional seconds.
A handful of improvements for concurrent.futures
and multiprocessing
, as well as a new queue.SimpleQueue
class which offers additional re-entrancy guarantees over the existing queue.Queue
class.
When using either ThreadPoolExecutor
or ProcessPoolExecutor
, it’s now possible to pass a callable object to perform global initialisation in each thread or process when it’s created for the first time. This is done by passing a callable as the initializer
parameter and, if required, initargs
as a tuple of arguments to pass to it.
This can be illustrated by the sample code below:
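A sketch along those lines, with a pool of three threads, an initialiser which just reports the thread it runs in, and six jobs submitted to the pool:

import concurrent.futures
import threading
import time

def thread_initialiser():
    # Called exactly once in each pool thread, when that thread is created.
    print(f"INIT thread={threading.get_ident()}")

def do_work(name):
    print(f"WORK {name} START thread={threading.get_ident()}")
    time.sleep(1)
    print(f"WORK {name} END thread={threading.get_ident()}")

with concurrent.futures.ThreadPoolExecutor(
        max_workers=3, initializer=thread_initialiser) as executor:
    for name in "ABCDEF":
        executor.submit(do_work, name)

print("All done")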
When executed, it’s clear that each thread only has thread_initialiser()
called once, when first created.
INIT thread=123145558577152
WORK A START thread=123145558577152
INIT thread=123145563832320
WORK B START thread=123145563832320
INIT thread=123145569087488
WORK C START thread=123145569087488
WORK A END thread=123145558577152
WORK D START thread=123145558577152
WORK C END thread=123145569087488
WORK E START thread=123145569087488
WORK B END thread=123145563832320
WORK F START thread=123145563832320
WORK D END thread=123145558577152
WORK E END thread=123145569087488
WORK F END thread=123145563832320
All done
Some changes to multiprocessing
for easier and cleaner termination of child processes.
There’s a new Process.close()
method for immediately closing a process object and freeing all the resources associated with it. This is in contrast to the situation prior to this release, where this only occurred when the garbage collector finaliser was called. This is only intended to be used once the process has terminated, and ValueError
is raised if it’s still running. Note that once the process is closed, most of the methods and attributes will raise ValueError
if invoked.
There’s another new method Process.kill()
, which is the same as Process.terminate()
except that on Unix systems it uses the signal SIGKILL
instead of SIGTERM
. For those less familiar with Unix, the difference is that processes can handle SIGTERM
and continue running, whereas SIGKILL
cannot be caught and will always3 reliably terminate the application.
Finally, there’s a fix for a slightly unusual edge case where a multiprocessing
child process itself spawns threads. Prior to this release, all threads would be terminated as soon as the main thread exited — this is expected behaviour for daemon threads, but it would occur even for non-daemon threads. As of Python 3.7, however, these threads are joined before the process terminates, to allow them to exit gracefully.
The queue
module has a new class SimpleQueue
, which is not susceptible to some re-entrancy bugs which can occur with the existing Queue
class. These issues can occur with signals, but they can also occur with other sources of re-entrancy like garbage collection. You can find a detailed walkthrough of some of these issues in this article on Code Without Rules.
The Queue
class has some additional features beyond a simple FIFO queue such as task tracking — consumers can indicate when each item has been processed and watchers can be notified when all items are done. This makes adding guarantees around re-entrancy particularly difficult, so instead the new SimpleQueue
has been added which offers more guarantees in exchange for only basic FIFO functionality — you can’t even specify a maximum queue size.
You should use SimpleQueue
if you will be calling put()
in any context in which an existing put()
may be executing in the same thread. Examples of this include signal handlers, __del__()
methods or weakref
callbacks. Since it’s often hard to predict when you might one day want to call things as your code evolves, my suggestion is to just always use SimpleQueue
unless you have a specific need of the functionality provided by Queue
.
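Functionally it behaves just like a basic FIFO queue, as a quick session shows:

>>> import queue
>>> q = queue.SimpleQueue()
>>> q.put("first")
>>> q.put("second")
>>> q.qsize()
2
>>> q.get()
'first'
>>> q.empty()
False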
The ever-useful subprocess
module has had a couple of handy improvements. First up the subprocess.run()
function now has a convenience parameter capture_output
— if this is True
, it’s equivalent to specifying stdout=subprocess.PIPE
and stderr=subprocess.PIPE
. This is a fairly common case, and it’s nice to see it made more convenient.
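For example, something like this, which previously needed both stdout and stderr to be specified explicitly:

>>> import subprocess
>>> result = subprocess.run(["echo", "hello"], capture_output=True)
>>> result.stdout
b'hello\n'
>>> result.stderr
b''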
Secondly, there’s now more graceful handling of KeyboardInterrupt
exceptions during execution. In particular, it now pauses briefly to allow the child to exit before continuing handling of the exception and sending SIGKILL
.
A lot of love to networking this release, with a variety of asyncio
improvements, some new socket options supported in socket
, and support for TLS 1.3 in ssl
.
In this release the asyncio
module has had numerous enhancements and optimisations. In this article I’ll touch on what I regard as the highlights.
The first of these is the addition of the asyncio.run()
function, which is intended to be used as the main entrypoint to kick off the top-level coroutine. It’s a convenience which manages the event loop and other details to avoid developers running into common problems.
Code that used to read something like this before 3.7:
async def some_coroutine():
    ...

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(some_coroutine())
finally:
    loop.close()
… is now rather more concise:
async def some_coroutine():
    ...

asyncio.run(some_coroutine())
Next up, the loop.start_tls()
method has been added to support protocols which offer the STARTTLS feature, such as SMTP, IMAP, POP3 and LDAP. Once the protocol-level handshake has been done and the TLS handshake should start, this method is called and it returns a new transport instance which the protocol must start using immediately. The original transport is no longer valid and should not be used.
There are new methods asyncio.current_task()
and asyncio.all_tasks()
for introspection purposes — these could be quite useful for diagnostics and logging purposes, and are worth remembering. These replace the previous Task.current_task()
and Task.all_tasks()
methods, which have been deprecated as they couldn’t be overridden by the event loop.
There’s a new protocol base class, asyncio.BufferedProtocol
. This is useful for implementing streaming protocols where you want to deal with the underlying data buffer yourself. Instead of calling data_received()
on the derived class, it instead calls get_buffer()
for the protocol to provide its buffer object, and then buffer_updated()
to indicate to the protocol code that there’s now more data in the buffer and it should potentially perform more parsing. This is really useful for, say, line-oriented protocols where you often don’t want to do any parsing until you receive a line-ending character.
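To make that concrete, a very rough sketch of a line-oriented protocol built on BufferedProtocol might look something like this; connection handling is omitted, and a real implementation would need to cope with the buffer filling up before a newline arrives:

import asyncio

class LineProtocol(asyncio.BufferedProtocol):
    """Rough sketch of a line-oriented protocol owning its own receive buffer."""

    def __init__(self):
        self._buffer = bytearray(65536)
        self._received = 0

    def get_buffer(self, sizehint):
        # Hand the transport the unused tail of our buffer to write into.
        return memoryview(self._buffer)[self._received:]

    def buffer_updated(self, nbytes):
        # The transport has written nbytes more bytes into our buffer.
        self._received += nbytes
        data = bytes(self._buffer[:self._received])
        while b"\n" in data:
            line, _, data = data.partition(b"\n")
            self.line_received(line)
        # Keep any trailing partial line around for next time.
        self._received = len(data)
        self._buffer[:self._received] = data

    def line_received(self, line):
        print("Line received:", line)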
There’s a new loop.sock_sendfile()
method which uses os.sendfile()
to send files to a socket where possible, for performance reasons. Where that system call is unavailable, it’s either simulated in code or an exception is raised depending on the arguments to sock_sendfile()
.
There have been some changes to the way asyncio.Server
instances are started. Before 3.7 these would always start serving as soon as they were created, but now this is controlled by the start_serving parameter to loop.create_server(): it defaults to True, but if you pass start_serving=False you can start serving later using the new start_serving()
method. There’s an is_serving()
method to check if the server is serving currently, and servers can also be used as context managers. Once the context manager exits, the server closes and will no longer accept new connections.
server_instance = await loop.create_server(...)
async with server_instance:
...
Finally, a small but noteworthy change is that TCP sockets created by asyncio
are now created with TCP_NODELAY
set by default to disable Nagle's algorithm. The issue ticket for the change (bpo-27456) asserts this as if it's a common-sense change, and nobody commenting on the ticket seems to be too worried about understanding it, but it's worth noting that this change is not an unqualified benefit. It will reduce apparent latency on the socket, by ensuring all data is transmitted as soon as it's available, but conversely it could reduce performance of bulk transfer protocols like HTTP if the protocol layer provides data in small chunks (i.e. smaller than the path MTU).
If you want to disable this option from a transport, you can do so by recovering the socket
object using the BaseTransport.get_extra_info()
method:
sock = transport.get_extra_info("socket")
if sock is not None:
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
There are a handful of useful improvements to the socket
module. First up is a convenience method sock.getblocking()
which returns True
if the socket is blocking, False
otherwise. This is convenient and readable, but it's only equivalent to checking sock.gettimeout() != 0,
which you could do in previous releases.
Next up there’s a new socket.close()
method for closing socket objects. Using this offers better compatibility across platforms than passing it to os.close()
, as I’m shocked to learn that apparently not everything is Unix.
There are some new Linux-specific socket options that have been exposed in the socket
module:
- TCP_CONGESTION: This allows the congestion control algorithm to be selected on a per-socket basis, chosen from those listed in /proc/sys/net/ipv4/tcp_{available,allowed}_congestion_control. There are plenty of articles on congestion control to read out there if you want more details, but unless you have unusual needs then I suspect you probably don't need to mess with this.
- TCP_USER_TIMEOUT: This specifies the maximum time that transmitted data may remain unacknowledged before the connection is forcibly closed and ETIMEDOUT returned to the application. If left at default settings, failures can take around 20 minutes to be detected.
- TCP_NOTSENT_LOWAT: This limits the amount of unsent data that can be queued in the socket's send buffer, so that the socket isn't reported as writable to calls like select() and poll() until the amount of queued unsent data falls below the threshold.
Also on Linux support for the AF_VSOCK
address family was added. This allows hypervisors and their guest virtual machines to communicate with each other regardless of the network configuration within those machines. The address family supports both SOCK_STREAM
and SOCK_DGRAM
sockets, although which of these, if any, are available depends on the underlying hypervisor which provides the actual communication. At least VMWare, KVM and Hyper-V are supported, provided you have a sufficiently recent kernel version.
Until this release, the Python ssl
module used its own match_hostname()
function for verifying that a given certificate matched a given hostname. More recent OpenSSL versions now perform this validation, however, and so the ssl
module has been updated to handle this during the SSL handshake. Any validation errors now raise SSLCertVerificationError
and abort the handshake. The match_hostname()
function is still available, but deprecated.
There’s been a change to how TLS Server Name Indication (SNI) works. Previously, whatever was passed as the remote hostname was passed in the SNI data as part of the “client hello”. However, it’s not valid for this to be an IP address, only a hostname, so now the extension will only be included if a hostname is specified. The whole purpose of this extension is to share multiple domains on a single IP address, so it’s not particularly useful to pass an IP address anyway.
Validation of server certificates containing internationalised domain names (IDNs) is now supported. As a side-effect the contents of SSLSocket.server_hostname
will be different for IDNs — previously this used to be in Unicode (U-label) but will now be ASCII-compatible (A-label). If you want more details of this difference, check out §2.3.2.1 of RFC 5890 for a lot more detail. If you just want to convert back again, you can just call decode("idna")
on it.
The ssl
module has support for TLS 1.3, as well as supporting OpenSSL 1.1.1 — this behaves slightly differently to previous versions, so do check out the documentation if you're going to use it. There are also new attributes for setting the minimum and maximum version of TLS to use when using PROTOCOL_TLS
— these are the minimum_version
and maximum_version
attributes of the SSLContext
object. These can be modified when using PROTOCOL_TLS
to affect the available versions to negotiate, but are read-only for any other protocol. They should be set to values of the new ssl.TLSVersion
enum, which provides values for specific versions but also the magic constants MINIMUM_SUPPORTED
and MAXIMUM_SUPPORTED
. There are also a series of constants such as ssl.HAS_TLSv1_3
to allow code to query which versions the current environment supports.
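As an example of the sort of thing this allows, a context restricted to TLS 1.2 or later might be set up something like this:

import ssl

# Negotiate any protocol version from TLS 1.2 up to the newest available.
context = ssl.SSLContext(ssl.PROTOCOL_TLS)
context.minimum_version = ssl.TLSVersion.TLSv1_2
context.maximum_version = ssl.TLSVersion.MAXIMUM_SUPPORTED

if ssl.HAS_TLSv1_3:
    print("TLS 1.3 is available in this environment")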
Some improvements to the HTTP servers offered by the http.server
module, some additional utility functions for checking IP networks for being subnets of each other, and a minor improvement to socketserver
.
There have been a couple of improvements to the SimpleHTTPRequestHandler
, which is a simple HTTP server for serving files from the current directory. Firstly, it now supports the If-Modified-Since
header — if a client specifies this, and the file hasn’t been modified since the date and time specified, then a 304 Not Modified response is sent back. To determine the last modification time, the st_mtime
field from os.fstat()
is used, so this may not work correctly if your underlying filesystem doesn’t give a realistic value for this.
The second improvement is simply that there’s a new directory
parameter for specifying the root directory from which to serve files, which is used instead of the current working directory.
In addition to these, there’s a new http.server.ThreadingHTTPServer
which uses socketserver.ThreadingMixIn
to handle client requests in threads instead of inline in the current thread of execution. A good use-case for this is when modern browsers pre-open multiple sockets to a server, but don’t necessarily use them all — a single-threaded blocking server would block waiting for the request to arrive on the first socket, whereas the browser may have made the request on a subsequently opened socket and be waiting for a response.
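Pulling those together, serving a directory of your choice with the threaded server only takes a couple of lines; the path and port below are just placeholders:

from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Serve files from /srv/static instead of the current working directory.
handler = partial(SimpleHTTPRequestHandler, directory="/srv/static")
ThreadingHTTPServer(("", 8000), handler).serve_forever()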
The ipaddress.IPv4Network
and ipaddress.IPv6Network
classes now offer subnet_of()
and supernet_of()
methods for testing whether one network is a subnet or supernet of another network respectively.
>>> from ipaddress import ip_network
>>>
>>> large = ip_network("10.1.2.0/23")
>>> small = ip_network("10.1.3.248/30")
>>>
>>> small.subnet_of(large)
True
>>> large.supernet_of(small)
True
>>> large.subnet_of(small)
False
>>> small.supernet_of(small)
True
>>> small.subnet_of(small)
True
Note that a network is both a subnet and supernet of itself, as you can see in the last couple of examples above.
The server_close()
method of both ThreadingMixIn
and ForkingMixIn
now waits for all non-daemon threads and child processes, as appropriate, to terminate before it exits.
If you don’t want this new behaviour, there’s also a new class member block_on_close
in each class which defaults to True
. If you change this to False
you get the pre-3.7 behaviour.
A small but interesting change to the garbage collector which promises to make things more efficient in situations where you plan to fork()
a large number of child worker processes.
In this release the garbage collector module gc
has two new functions gc.freeze()
and gc.unfreeze()
. When you freeze the collector it moves all objects which it’s tracking to a separate permanent generation, which are never garbage collected. This does not disable future garbage collection entirely — it simply grants all currently extant objects at that time immunity from collection. When you call gc.unfreeze()
it moves all those objects back into contention to be collected again.
Why might you want to do this? Well one good reason is if you plan to spawn a lot of worker processes under a POSIX-like system using fork()
— let me explain why. As you may know, the semantics of fork()
are that a new process is created whose address space is a complete duplicate of the current process. However, duplicating all that memory is expensive for such a common operation, so you may also know that modern systems like Linux use copy-on-write semantics. For anyone that’s not familiar, this means that the new process doesn’t actually copy any of the user-space memory pages of the original, they simply get shallow references to them which is very cheap. If they attempt to modify any of those pages then at that point a copy must be made, to prevent modifications impacting the original process, but many processes only modify a small subset of their pages after a fork()
so it’s a worthwhile saving.
Garbage collectors can really mess around with this optimisation, however. Imagine your process has built up a number of structures ready to be garbage-collected, but before the collector runs you execute a lot of fork()
calls to create a large number of child processes to use as workers for some CPU-intensive tasks. This is cheap, due to the copy-on-write. However, in each of those new processes the garbage collector is then dutifully invoked and goes ahead and cleans up the garbage. This means the memory pages containing the garbage are all duplicated separately in each child process, which burns a lot of cycles for very little gain.
The new freeze()
function is part of a mitigation of this problem — I say part, because there is another problem. If the garbage collector manages to free anything, this creates empty memory pages, which are also candidates for duplication by fork()
— these will be used by child processes, causing duplication of these empty pages into each child. Therefore, if you’re going to be creating a lot of child processes then you need to minimise the number of free pages inherited by the child as well as prevent it freeing older garbage. You can do this by following this procedure:
- Early on, call gc.disable() to prevent the garbage collector running automatically.
- Just before calling fork() to create worker processes, call gc.freeze().
- In the worker processes you can then call gc.enable() without risk of causing unnecessary copy-on-write overhead.

As long as you call your fork()
before doing a lot of work in the parent, this shouldn’t be too inefficient on memory because most of the objects which exist at that point are module-level and will probably hang around a long time anyway.
So let’s see a quick illustration of these methods in action. I’m not going to call fork()
here, just illustrate the gc
semantics. First let’s see a perfectly normal garbage collection, with two objects created with mutual references that can’t be freed by normal reference counting:
>>> import gc
>>> gc.set_debug(gc.DEBUG_COLLECTABLE)
>>> x = []
>>> y = [x]
>>> x.append(y)
>>> del x, y
>>> gc.collect()
gc: collectable <list 0x10c355cd0>
gc: collectable <list 0x10c360140>
2
The use of gc.set_debug(gc.DEBUG_COLLECTABLE)
produces the output you can see above when collectable objects are discovered. In these examples, it makes things a little easier to follow.
So now let’s see a sequence where we call gc.freeze()
. As an aside, I've split the output into chunks for discussion, but all of these sequences are part of the same single Python console session started with the example above.
>>> x = []
>>> y = [x]
>>> x.append(y)
>>> del x, y
>>> gc.freeze()
>>> gc.get_freeze_count()
4334
>>> gc.collect()
0
Here we can see that x
and y
have been moved to the permanent generation, so are immune from the gc.collect()
call. The output of gc.get_freeze_count()
is high because it includes all the other objects across all modules which happened to be tracked by the collector at the point I called the function.
>>> a = []
>>> b = [a]
>>> a.append(b)
>>> del a, b
>>> gc.freeze()
>>> gc.get_freeze_count()
4336
>>> gc.collect()
0
Here we see that further calls to gc.freeze()
are possible and will move more objects into the permanent generation. In this case the output of gc.get_freeze_count()
has increased by 2 due to the addition of a
and b
.
>>> i = []
>>> j = [i]
>>> i.append(j)
>>> del i, j
>>> gc.collect()
gc: collectable <list 0x10c2ffa50>
gc: collectable <list 0x10c35f7d0>
2
Here we demonstrate that objects which become garbage after the gc.freeze()
call are still collected as normal.
>>> gc.unfreeze()
>>> gc.get_freeze_count()
0
>>> gc.collect()
gc: collectable <list 0x10c2799b0>
gc: collectable <list 0x10c35fdc0>
gc: collectable <list 0x10c2812d0>
gc: collectable <list 0x10c2e20f0>
4
Finally, we use gc.unfreeze()
to move the permanent generation back into collectable garbage, and then we force them to be collected.
As usual, some smaller changes which are noteworthy but don’t require a lot of discussion.
- Intermixed arguments in argparse: In the argparse module it's normally assumed that all command-line options will occur before positional arguments. However, a new ArgumentParser.parse_intermixed_args() method allows for more flexible parsing where they can be freely intermixed, as with some Unix commands. However, this is at the expense of some of the features of argparse, such as subparsers.
- Null context manager: There's a new contextlib.nullcontext() object which can be used as a no-op context manager for cases where one is syntactically required but doesn't need to do anything (demonstrated briefly below).
- Blowfish and rounds in crypt: The crypt module now supports the Blowfish cipher, and crypt.mksalt() now offers a rounds parameter for specifying the number of hashing rounds, although note that the range of values accepted depends on the hashing algorithm used.
- Recursive dis.dis(): The dis.dis() function can now also disassemble nested code objects, such as within comprehensions, generator expressions and nested functions. There is also a new parameter to control the maximum depth of recursive disassembly.
- hmac.digest(): The hmac module is quite flexible, allowing data to be fed to the hmac object in chunks. This comes at a performance cost, however, which is annoying for simple cases of calculating a simple HMAC on a single string which fits in memory. To cater for these cases, a new hmac.digest() function has been added which uses an optimised C implementation where possible, calling directly into OpenSSL.
- math for IEEE 754 remainders: There's a new math.remainder() function that implements the IEEE 754 version of the remainder operation, as opposed to the behaviour of math.fmod(). If you'd like to know more about the differences, check out bpo-29962.
- pathlib.Path.is_mount(): The pathlib.Path class on POSIX systems now has an is_mount() method which returns True if the specified path is a mount point.
- SQLite online backups: There's a new backup() method on the sqlite3.Connection object which allows access to the SQLite online backup API.
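A couple of these, contextlib.nullcontext() and math.remainder(), are quick enough to show in a short session:

>>> import contextlib, math
>>> with contextlib.nullcontext() as ctx:
...     print(ctx)
...
None
>>> math.fmod(5, 3)
2.0
>>> math.remainder(5, 3)
-1.0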
When I started looking at Python 3.7 I thought it was a smallish release, but having gone through it in more detail, I can still see plenty of changes to the standard library to like here. The new contextvars
and dataclasses
modules both serve their relevant niches well, and are well worth bearing in mind. My only slight caution with dataclasses
is that it might make code less functional in style — using immutable values can lead to some really clean chains of generators and the like. But forcing developers to throw together fully custom classes just to hold a few mutable data members is overkill.
The addition of queue.SimpleQueue
is useful, although I do wonder if it may be underused simply because developers don’t fully comprehend the cases where Queue
is unsafe. Still, that’s arguably not the fault of the Python development team!
It’s great to see asyncio
continuing to develop, and it’s now getting to the point there it seems to have really settled down into a usable framework. I look forward to seeing how it continues to develop in future releases.
That’s it for Python 3.7 — as always I hope you find something useful to take away from this article. Next up, as you probably might have guessed by now, is Python 3.8.
If you’re wondering about the asyncio.run()
function, we’ll be covering that later in this article. ↩
As an aside, although this is a reasonable illustration for the purposes of enum
, what this code is doing is not a great idea in general. Day names vary by locale, so you’ll really regret doing this if anyone in a different locale tries to use your code. ↩
This isn’t quite true, since if a process is stuck within an uninterruptible system call then it can be even immune to SIGKILL
. Broken NFS mounts tend to be particularly prone to this. ↩