In this series looking at features introduced by every version of Python 3, we continue our look at Python 3.8, examining changes to the standard library. These include some useful new functionality in functools
, some new mathematical functions in math
and statistics
, some improvements for running servers on dual-stack hosts in asyncio
and socket
, and also a number of new features in typing
.
This is the 18th of the 32 articles that currently make up the “Python 3 Releases” series.
As usual, Python 3.8 contains improvements across a whole host of modules, although many of these are fairly limited in scope. The asyncio
module continues to develop with a series of changes, the functools
module has a number of new decorators which will likely be quite useful, and the math
and statistics
modules contain a bounty of useful new functions.
So let’s get started!
A fairly straightforward change, the long-standing behaviour of various functions in pprint
, which output the keys in dict
objects in lexicographically sorted order, can now be disabled by passing sort_dicts=False
. This makes sense now that the dict
implementation returns keys in the order of insertion, which is potentially useful in pprint()
output.
In addition to this new parameter, there’s a new convenience function pprint.pp()
which is essentially equivalent to functools.partial(pprint.pprint, sort_dicts=False)
.
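As a quick illustration of the difference, the sketch below formats the same dictionary both ways using pprint.pformat(). The sample dictionary is made up for the example.

```python
import pprint

# Keys deliberately inserted out of alphabetical order (made-up sample data).
data = {"zebra": 1, "apple": 2, "mango": 3}

sorted_text = pprint.pformat(data)                       # default: keys sorted
insertion_text = pprint.pformat(data, sort_dicts=False)  # insertion order kept

print(sorted_text)     # {'apple': 2, 'mango': 3, 'zebra': 1}
print(insertion_text)  # {'zebra': 1, 'apple': 2, 'mango': 3}
pprint.pp(data)        # pp() also keeps insertion order
```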
There are a few useful changes in the functools
module this release, around caching and implementing single dispatch.
Let’s kick off with a simple convenience — it’s now possible to use functools.lru_cache
as a normal decorator without requiring function call syntax.
# Prior to Python 3.8
@functools.lru_cache()
def function():
    ...

# In Python 3.8 we don't need the brackets!
@functools.lru_cache
def function():
    ...

# But arguments are still supported, of course
@functools.lru_cache(maxsize=128)
def function():
    ...
Staying with the topic of caching, a mechanism for caching immutable properties has been added, appropriately enough called functools.cached_property
. When used as the decorator of a function, it acts like @property
except that only the first read calls the decorated function — subsequent reads use the same cached value. Writes are permitted, and update the cached value without calling the underlying function again, and deleting the attribute with del
removes the cached value such that the next read will invoke the function once more.
This is all illustrated in the example below:
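A minimal sketch of a class with this behaviour is shown here. The Person class, its attribute names and the dates are assumptions chosen to be consistent with the output and discussion that follow, and the line numbers mentioned in that discussion will not match this sketch exactly.

```python
import datetime
import functools


class Person:
    """Hypothetical class used only to illustrate cached_property."""

    def __init__(self, first_name, surname, date_of_birth):
        self.first_name = first_name
        self.surname = surname
        self.date_of_birth = date_of_birth

    @functools.cached_property
    def name(self):
        print("Constructing name...")
        return f"{self.first_name} {self.surname}"

    @functools.cached_property
    def age(self):
        print("Calculating age...")
        today = datetime.date.today()
        born = self.date_of_birth
        # Subtract one if this year's birthday has not happened yet.
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        return today.year - born.year - (0 if had_birthday else 1)


person = Person("Charles", "Darwin", datetime.date(1809, 2, 12))
print(f"[1] {person.name}, {person.age}")  # first reads call the functions
print(f"[2] {person.name}, {person.age}")  # served from the cache
person.name = "Chuck Darwin"               # overwrites the cached value only
print(f"[3] {person.name}, {person.age}")
person.first_name = "Emma"                 # cached values are not refreshed
person.date_of_birth = datetime.date(1808, 5, 2)
print(f"[4] {person.name}, {person.age}")
del person.name                            # clear the cached values...
del person.age
print(f"[5] {person.name}, {person.age}")  # ...so the functions run again
```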
The result of executing this, at time of writing, is shown below:
Constructing name...
Calculating age...
[1] Charles Darwin, 213
[2] Charles Darwin, 213
[3] Chuck Darwin, 213
[4] Chuck Darwin, 213
Constructing name...
Calculating age...
[5] Emma Darwin, 214
You can see that the first call to each of the attributes on line 34 invokes the function, triggering the print()
statements on lines 14 and 19. However, the second call on line 35 simply uses the same cached values.
On line 36 we write to the cached value — this updates the result, but doesn’t update the underlying first_name
or surname
attributes, and neither does it invoke any functions. Then on lines 38-39 we update the underlying attributes, but nothing here triggers updates to the cached values so line 40 still prints the cached versions from before.
Finally, on lines 41-42 we use del
to clear the cached values, and this causes the reads triggered from line 43 to re-run the property functions and thus the output now reflects the changes we’d previously made to first_name
and date_of_birth
.
The third and final change in functools
is nothing to do with caching, but is a helper for those wanting to write functions with single dispatch. This is where you want to call different implementations of a method based on the type of a single argument.
This is an extension to the existing functools.singledispatch
which we looked at way back in one of the articles on Python 3.4. The difference here is that the decorator will ignore the initial self
or cls
argument and switch on the next one.
The way it works is the same as singledispatch
— you decorate the first occurrence of the method, and that one becomes the default case if none of the other types match. That method becomes an object which presents a register
decorator you can then use to register your type overloads, using type annotations in the signature of the overload function to specify the type.
Here’s a trivial example, which might make things clearer:
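A minimal sketch along these lines, assuming a hypothetical Processor class with a process() method, could look like this. The names are illustrative and chosen to be consistent with the output shown next.

```python
import functools


class Processor:
    """Hypothetical class illustrating functools.singledispatchmethod."""

    @functools.singledispatchmethod
    def process(self, arg1, arg2):
        # Default case: used when no registered type matches arg1.
        print(f"default version called {arg1=} {arg2=}")
        raise TypeError(f"Arg of type {type(arg1).__name__} not supported")

    @process.register
    def _(self, arg1: int, arg2):
        # Chosen when arg1 is an int (taken from the annotation).
        print(f"int version called {arg1=} {arg2=}")
        return "int"

    @process.register
    def _(self, arg1: str, arg2):
        # Chosen when arg1 is a str.
        print(f"str version called {arg1=} {arg2=}")
        return "str"


processor = Processor()
outcomes = []
for count, value in enumerate((123, "abc", 456.789), start=1):
    print(f"Calling with {type(value).__name__}...")
    try:
        outcomes.append(processor.process(value, count))
    except TypeError as exc:
        print(f"Exception: {exc}")
        outcomes.append("unsupported")
```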
If you execute that, you’ll see the following output:
Calling with int...
int version called arg1=123 arg2=1
Calling with str...
str version called arg1='abc' arg2=2
Calling with float...
default version called arg1=456.789 arg2=3
Exception: Arg of type float not supported
In general I’d say it’s better practice to use approaches such as polymorphism to handle switching implementations by type, but there are always those irritating cases where it doesn’t pan out — for example, if you’re not in control of the types used. For these cases this approach seems rather cleaner than rolling your own solution is likely to be.
There are some handy new functions in the math
module for particular cases, which are discussed in the sections below.
There’s a new math.dist()
function which calculates the Euclidean distance — that is, the straight-line distance between two points in Euclidean space. In two dimensions, for example, this will be the length of the hypotenuse of the right-angle triangle formed between the two points.
Although this will generally be either 2-dimensional or 3-dimensional space for most people, the function itself supports any number of dimensions — it just requires that the two points have the same number of dimensions.
>>> math.dist((0, 0), (3, 4))
5.0
>>> math.dist((0, 0), (21.65, 12.20))
24.85080481594107
>>> math.dist((1, 2, 3), (5, 8, 15))
14.0
>>> math.dist((0,) * 10, range(10))
16.88194301613413
The existing function math.hypot()
calculates the Euclidean norm of a point in 2D space. This has been extended to support N-dimensions.
It’s beyond the scope of this article to discuss norms in general, and the Euclidean norm specifically — the Wikipedia article linked above can help you out. For the purposes of this discussion, consider it essentially the same as the dist()
method above where the first argument is implicitly the origin. This function is the more basic building block which is potentially useful for things other than distances between points.
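To make the relationship concrete, this short sketch compares the N-dimensional hypot() against dist() measured from the origin. The sample point is arbitrary.

```python
import math

# An arbitrary 3D point; hypot() takes the coordinates as separate arguments.
point = (3.0, 4.0, 12.0)

norm = math.hypot(*point)
distance_from_origin = math.dist((0.0, 0.0, 0.0), point)

# Both compute sqrt(3**2 + 4**2 + 12**2), so they agree to rounding error.
print(norm, distance_from_origin)
```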
This is a simple one — there’s a new math.prod()
function which is similar to sum()
except that it calculates the product instead of the sum.
You can already do this using some other library functions, but it’s less convenient and slower:
>>> import math
>>>
>>> math.prod(range(1, 100))
93326215443944152681699238856266700490715968264381621468592
96389521759999322991560894146397615651828625369792082722375
82511852109168640000000000000000000000
>>>
>>> import functools
>>> import operator
>>>
>>> functools.reduce(operator.mul, range(1, 100), 1)
93326215443944152681699238856266700490715968264381621468592
96389521759999322991560894146397615651828625369792082722375
82511852109168640000000000000000000000
>>>
>>> import timeit
>>>
>>> timeit.timeit("math.prod(x)",
... setup="import math; x = range(1, 100)")
4.280957616996602
>>> timeit.timeit("functools.reduce(operator.mul, x, 1)",
... setup="import functools; import operator;"
... " x = range(1, 100)")
6.377806781005347
There are two new functions to calculate the combinations (math.comb()
) and permutations (math.perm()
) of selecting r
items from n
. If I think back to GCSE Mathematics, I recall that permutations are the number of ways of selecting r
items from a population of n
without replacement, where the same items selected in a different order are counted distinctly. Combinations count the same thing, except that the same items selected in a different order are not counted distinctly.
If I think back really hard, I recall that the formulae for these two are as follows:
\[ ^nP_r = \frac{n!}{(n-r)!} \] \[ ^nC_r = \frac{n!}{r!(n-r)!} \]
Still, it’s handy not to have to remember those, and these versions are faster.
>>> math.comb(59, 6)
45057474
>>> math.perm(59, 6)
32441381280
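As a sanity check, the factorial formulae above can be written out directly and compared with the new functions. The helper names perm_formula() and comb_formula() are made up for this sketch.

```python
import math


def perm_formula(n, r):
    """n! / (n - r)!, written out with factorials."""
    return math.factorial(n) // math.factorial(n - r)


def comb_formula(n, r):
    """n! / (r! * (n - r)!), written out with factorials."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


n, r = 59, 6
print(math.perm(n, r) == perm_formula(n, r))  # True
print(math.comb(n, r) == comb_formula(n, r))  # True
```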
There’s also a new function math.isqrt()
to calculate the integer square root. The integer square root of \(n\) is the largest integer \(m\) such that \(m^2 \le n\).
This has applications in areas such as primality testing, and it’s a tricky little function to get both correct and efficient for large inputs. For small values you can round off math.sqrt()
, but for larger values inaccuracies creep in and you get incorrect results.
>>> root = 67108865
>>> square = root ** 2
>>> math.isqrt(square - 1)
67108864
>>> math.floor(math.sqrt(square - 1))
67108865
In the example above you can see that using math.floor(math.sqrt(...))
overestimates the result by 1. As you move to much larger values the floating point errors increase.
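As an illustration of the primality-testing use case, here is a simple trial-division sketch that uses math.isqrt() as the upper bound for candidate divisors. The is_prime() helper is written for this example and is not a library function.

```python
import math


def is_prime(n):
    """Trial division, testing odd divisors up to the integer square root."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    # Any composite n has a divisor no larger than isqrt(n).
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


small_primes = [n for n in range(20) if is_prime(n)]
print(small_primes)  # [2, 3, 5, 7, 11, 13, 17, 19]
```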
The statistics
module has a generous helping of delicious new functions, so fill your plate with all this numerical analysis goodness.
There’s a new statistics.fmean()
function, which performs the same operation as mean()
except that it works entirely in floating point. This means it sacrifices a small amount of accuracy, though so little that almost all users would probably never care, and in return it gives significantly faster performance.
>>> setup = "import random, statistics; " \
... "data = [random.randint(1, 100) for i in range(1000)]"
>>> timeit.timeit("statistics.mean(data)", setup=setup, number=10000)
4.593662832000007
>>> timeit.timeit("statistics.fmean(data)", setup=setup, number=10000)
0.1214523390000295
In the comparison above you can see that mean()
takes around 38 times longer than fmean()
to complete.
The new statistics.geometric_mean()
calculates, you guessed it, the geometric mean. As opposed to the more common arithmetic mean, calculated as the sum of the set divided by its cardinality, the geometric mean is calculated as the nth root of the product of the set.
\[ \sqrt[n]{\prod\limits_{i=1}^{n} x_i} \]
This is a useful measure for certain situations such as proportional growth rates. It’s also particularly suitable for averaging results which have been normalised to different reference values, because of the particular property of the geometric mean that:
\[ G\left(\frac{X_i}{Y_i}\right) = \frac{G(X_i)}{G(Y_i)} \]
The addition of this function also means that, along with mean()
and harmonic_mean()
, the statistics
module now contains all three of the Pythagorean means. It’s good to see that Python has finally caught up with the ancient Greeks!
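A short sketch of both points, using made-up numbers: the geometric mean of a series of growth factors compounds back to the overall growth, and the mean of ratios equals the ratio of means.

```python
import math
import statistics

# Made-up yearly growth factors: +10%, +50%, -20%.
growth_factors = [1.10, 1.50, 0.80]
average_factor = statistics.geometric_mean(growth_factors)

# Applying the average factor three times matches the compounded growth.
compounded = math.prod(growth_factors)
print(average_factor ** 3, compounded)

# The ratio property: G(x_i / y_i) == G(x_i) / G(y_i), up to rounding.
xs = [2.0, 8.0, 32.0]
ys = [1.0, 4.0, 4.0]
mean_of_ratios = statistics.geometric_mean([x / y for x, y in zip(xs, ys)])
ratio_of_means = statistics.geometric_mean(xs) / statistics.geometric_mean(ys)
print(mean_of_ratios, ratio_of_means)
```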
The new statistics.multimode() function is a fairly straightforward variation of the existing statistics.mode()
function. When locating the modal value, that which occurs the greatest number of times in the input set, there’s always the possibility for multiple such values. The existing mode()
function returns the first such value encountered, whereas multimode()
returns a list
of all of them.
>>> data = "A" * 2 + "B" * 5 + "C" * 4 + "D" * 5 + "E" + "F" * 5
>>> statistics.mode(data)
'B'
>>> statistics.multimode(data)
['B', 'D', 'F']
A common statistical measure is to divide a large data set into four evenly sized groups and look at the three boundary values of these groups — these are the lower quartile, the median and the upper quartile respectively. Generalising this concept to n different groups instead of 4 yields the concept of quantiles. The statistics.quantiles()
function calculates these boundaries.
By default the quartiles are given, but other values of the n
parameter divide the data into that many groups — pass n=10
for deciles and n=100
for percentiles.
>>> statistics.quantiles(range(1, 100))
[25.0, 50.0, 75.0]
>>> statistics.quantiles(range(1, 100), n=5)
[20.0, 40.0, 60.0, 80.0]
>>> statistics.quantiles(range(1, 100), n=10)
[10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
Finally, we have the new statistics.NormalDist
class. This is a fairly flexible workhorse for doing operations on normally distributed data.
First we have to construct the distribution. This can be done directly, by passing in the two parameters: mu
is the arithmetic mean of the data and sigma
is the standard deviation. Alternatively, the from_samples()
class method uses fmean()
and stdev()
to estimate these values from a sample of data.
Once you’ve constructed the object, you can recover the mean
and stdev
attributes from it, but there are a variety of more interesting functions as well. The samples()
method can be used to generate a specified number of random samples which conform to the distribution, and the quantiles()
function can return where you’d expect the quantiles of the data to be based on the distribution.
>>> height_dist = statistics.NormalDist.from_samples(height_data)
>>> height_dist.mean
175.19763652682227
>>> height_dist.stdev
7.709514592235422
>>> height_dist.samples(3)
[179.58106311923885, 179.66015160925144, 173.56059959418297]
>>> statistics.fmean(height_dist.samples(1000000))
175.31327020208047
>>> height_dist.quantiles(n=4)
[169.99764795537234, 175.19763652682227, 180.3976250982722]
Distribution objects can also be multiplied by a constant to transform the distribution accordingly — this can be useful for things like unit conversion, which would apply to all data points equally. Addition, subtraction and division are also supported for other forms of translation and scaling.
>>> feet_height_dist = height_dist * 0.0328084
>>> feet_height_dist.mean
5.7479541382265955
>>> feet_height_dist.stdev
0.2529368385478966
>>>
There are also useful functions for dealing with probabilities. The pdf()
method evaluates the probability density function, giving the relative likelihood that a random variable will be close to the specified value. There’s also cdf()
, which uses the cumulative distribution function to return the probability that a value will be less than or equal to the specified value, and inv_cdf()
, which takes a probability and returns the point in the distribution where the cdf()
of that point would return the specified probability.
>>> feet_height_dist.pdf(5)
0.01991105275231589
>>> feet_height_dist.cdf(6)
0.8404908965170221
>>> feet_height_dist.inv_cdf(0.75)
5.918557443274153
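For comparison, here is a sketch that constructs a distribution directly from mu and sigma and uses cdf() to find the probability of falling within a range. The parameters are arbitrary illustrative values.

```python
import statistics

# Arbitrary illustrative parameters: mean 100, standard deviation 15.
dist = statistics.NormalDist(mu=100, sigma=15)

# Probability of a value within one standard deviation of the mean,
# computed as the difference of two cumulative probabilities (roughly 68%).
within_one_sigma = dist.cdf(115) - dist.cdf(85)
print(within_one_sigma)

# inv_cdf() is the inverse of cdf(): the median of a normal
# distribution is its mean.
median = dist.inv_cdf(0.5)
print(median)
```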
There are some other features that I haven’t covered here, so it’s well worth reading through the documentation if you’re doing analysis of normally distributed data.
There are a few improvements to os.path
, several of which only apply on Windows.
The various os.path
functions which return a bool
, such as os.path.exists()
and os.path.isdir()
, always used to raise ValueError
if passed a filename which contains invalid characters for the OS. For example, here’s an attempt to check for a filename which contains an embedded nul character on Python 3.7:
Python 3.7.10 (default, Mar 28 2021, 04:19:36)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.isdir("/foo\0bar")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/andy/.pyenv/versions/3.7.10/lib/python3.7/genericpath.py", line 42, in isdir
st = os.stat(s)
ValueError: embedded null byte
And here’s the new behaviour on Python 3.8:
Python 3.8.8 (default, Mar 28 2021, 04:22:11)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.isdir("/foo\0bar")
False
There is also a set of smaller changes on Windows in os.path
as well.
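os.path.expanduser() now uses the USERPROFILE environment variable on Windows in preference to HOME, since the former is more reliably set for normal user accounts.
os.path.isdir() no longer returns True when querying a link to a directory that no longer exists.
os.path.realpath() now resolves reparse points, including symlinks and directory junctions; reparse points are covered in the discussion of os.
Just a couple of small changes in pathlib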
. Firstly, the bool
-returning functions such as exists()
and is_symlink()
no longer raise ValueError
for invalid filenames, as discussed in the section on os.path
above.
Secondly, there’s a new Path.link_to()
function which creates a hard link to the current path. The naming is slightly unfortunate, because if you call my_path.link_to(target)
then it reads as if the path referred to by my_path
will become a link to target
— in fact the opposite is true, target
is created as a hard link to my_path
.
There are some smaller changes to a handful of functions in the shutil
module.
shutil.copytree() acquires dirs_exist_ok: by default copytree() raises an exception if any of the destination directories already exists. Passing dirs_exist_ok=True now disables this behaviour and allows the copy to proceed.
shutil.make_archive() defaults to pax: when using format="tar", the specific format used will now be pax instead of legacy GNU format. The same change has been made to the tarfile module.
shutil.rmtree() on Windows now removes directory junctions without first recursively removing their contents.
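The dirs_exist_ok behaviour of copytree() can be seen in a short sketch that works inside a temporary directory. The file and directory names are made up for the example.

```python
import pathlib
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, "src")
    dst = pathlib.Path(tmp, "dst")
    src.mkdir()
    dst.mkdir()  # the destination already exists
    (src / "notes.txt").write_text("hello")

    try:
        shutil.copytree(src, dst)  # default behaviour: refuses to proceed
        default_failed = False
    except FileExistsError:
        default_failed = True

    # With dirs_exist_ok=True the copy goes into the existing directory.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    copied_text = (dst / "notes.txt").read_text()

print(default_failed, copied_text)  # True hello
```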
A handful of changes in os
, mostly on the Windows platform.
First up there’s a new add_dll_directory()
function to provide additional search paths for loading DLLs, for example when using the ctypes
module. This is similar to LD_LIBRARY_PATH
on POSIX systems. The function returns a handle which has a close()
method which reverts the change again, or it can be used in a with
statement to achieve the same effect.
The second change on Windows is that the logic for reparse points, such as symlinks and directory junctions, has been moved from being Python-specific to being delegated to the operating system to handle. Now I’m very far from an expert on Windows, so I hope I don’t get any of these details wrong, but my understanding is that this expands the set of file-like objects that are supported to anything that the OS itself supports. This means that os.stat()
can query anything, whereas os.lstat()
will query anything which has the name surrogate bit set in the reparse point tag.
It’s worth noting that stat_result.st_mode
will only set the S_IFLNK
bit for actual symlinks — it will be clear for other reparse points. If you want to check for reparse points in general, you can look for stat.FILE_ATTRIBUTE_REPARSE_POINT
in stat_result.st_file_attributes
, and you can look at stat_result.st_reparse_tag
to get the reparse point tag in this case. The stat
module has some IO_REPARSE_TAG_*
constants to help check bits in the tag, but the list is not exhaustive.
In a related change, os.readlink()
is also now able to read directory junctions. Note, however, that os.path.islink()
still returns False
for these. As a consequence, if your code is LBYL-style and checks islink()
first then it’ll continue to treat junctions as if they were standard directories, but if your code is EAFP-style and just catches errors from readlink()
then your code may now behave differently when it encounters junctions.
Finally we have a change that’s distinctly more Linuxy — the Linux-specific memfd_create()
call has been made available in the os
module. This call creates an anonymous file in memory, and returns a file descriptor to it. This can be used like any other file, except that when the last reference to it is dropped then the memory is automatically released. In short, it has the same semantics as anonymous memory allocated with mmap()
using the MAP_ANONYMOUS
flag.
The call takes a mandatory name
parameter, which is used as the filename but doesn’t affect much except how entries in /proc/<pid>/fd
will appear. There’s also an optional flags
parameter which accepts the bitwise OR of various new flags in os
. These are:
os.MFD_CLOEXEC sets the close-on-exec flag on the new file descriptor, as passing O_CLOEXEC to open() would do, which in turn is the same as setting FD_CLOEXEC with fcntl() except that it avoids potential multithreaded race conditions. This flag closes the file descriptor automatically on an exec() call to load another binary.
os.MFD_ALLOW_SEALING allows seals to be applied to the file. If it is not specified, the F_SEAL_SEAL seal will be set, which prevents further seals from being added. A discussion of Linux file sealing is rather esoteric and outside the scope of this article, but the memfd_create() man page has a section which illustrates how it can work in general, and you can find the constants for types of seal defined in the fcntl module.
os.MFD_HUGETLB creates the anonymous file in the hugetlbfs filesystem using huge pages. This allows page sizes above the usual 4KB, which reduces the size of the page table, increasing performance for large chunks of memory, if supported by the kernel.
MFD_HUGE_<size>: for use with MFD_HUGETLB there are also constants to select the huge page size to use. Take a look at the documentation for os.memfd_create() for the full list.
MFD_HUGE_SHIFT and MFD_HUGE_MASK are also provided. If you look at the mmap() man page then you’ll see some discussion on using MAP_HUGE_SHIFT, and I think the same approach is meant to work here.
A couple of potentially useful changes in the threading
module.
The default behaviour when an exception propagates outside the main function of a thread is to print a traceback. However, there’s now a threading.excepthook
which can be overridden to handle such exceptions in a different way, such as writing them to a log file.
Here’s a simple illustration:
>>> import threading
>>> def my_thread_func():
...     print(">>> Thread started <<<")
...     raise Exception("Naughty!")
...
>>> thread = threading.Thread(target=my_thread_func)
>>> thread.start()
>>> Thread started <<<
Exception in thread Thread-1:
Traceback (most recent call last):
File "/Users/andy/.pyenv/versions/3.8.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
>>> self.run()
File "/Users/andy/.pyenv/versions/3.8.8/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "<stdin>", line 3, in my_thread_func
Exception: Naughty!
>>> thread.join()
>>>
>>> threading.excepthook = lambda exc: print(f"CAUGHT: {exc!r}")
>>> thread = threading.Thread(target=my_thread_func)
>>> thread.start()
>>> Thread started <<<
CAUGHT: _thread.ExceptHookArgs(exc_type=<class 'Exception'>, exc_value=Exception('Naughty!'), exc_traceback=<traceback object at 0x10663f280>, thread=<Thread(Thread-2, started 123145342038016)>)
>>> thread.join()
The threading.get_ident()
function, added in Python 3.3, returns a unique identifier for the current thread. The problem is that this doesn’t, in general, bear any relation to the operating system’s identifier for the thread, which can sometimes be useful to know for, say, logging purposes.
In Python 3.8, therefore, the threading.get_native_id()
function has been added, which returns the native thread ID of the current thread as assigned by the kernel. The downside is that this function isn’t guaranteed to be available on all platforms, but it seems to be supported on Windows, macOS, Linux and several of the other Unixes, so it should be useful for a lot of people.
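A small sketch showing the two identifiers side by side for a worker thread. It assumes a platform where get_native_id() is available, and the record_ids() helper is made up for the example.

```python
import threading

ids = {}


def record_ids():
    # Both calls report on the thread that calls them.
    ids["python"] = threading.get_ident()      # Python-level identifier
    ids["native"] = threading.get_native_id()  # kernel-assigned thread ID


worker = threading.Thread(target=record_ids)
worker.start()
worker.join()

print(f"get_ident()={ids['python']} get_native_id()={ids['native']}")
```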
The asyncio.run()
function, added in the previous release, has been upgraded from a provisional to a stable API, although this release doesn’t contain any changes to its functionality. Aside from this, there are some more substantive changes.
As of this release the CancelledError
exception, which is raised when asyncio
tasks are cancelled, now inherits from BaseException
rather than Exception
. This mirrors several similar changes in the past, such as in Python 2.6 where GeneratorExit
also had its base class changed from Exception
to BaseException
. The problem is the same in all these cases: unintended capture of exceptions. In the case of CancelledError
consider code like this:
try:
    await some_async_function()
except Exception:
    log.error("Task failed")
This seems fairly innocuous, but of course the CancelledError
will also be captured in that exception specification prior to Python 3.8. This in turn means that the CancelledError
will not propagate to the caller, and hence anyone waiting on that task will not be notified that the task was cancelled.
Consider the following short script:
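A minimal sketch of a script with this behaviour is shown here. The worker() and main() names, the sleep durations and the way the messages are collected are assumptions for illustration, and the result comment applies to Python 3.8 or later.

```python
import asyncio


async def worker():
    try:
        await asyncio.sleep(10)
    except Exception:
        # Before Python 3.8 this also swallowed CancelledError.
        print("Something went wrong")


async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0.1)  # give the worker a chance to start
    task.cancel()
    try:
        await task
        return "Task completed normally"
    except asyncio.CancelledError:
        return "Task was cancelled"


outcome = asyncio.run(main())
print(outcome)  # "Task was cancelled" on Python 3.8 and later
```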
Under Python 3.7, the result of running this is that main()
never finds out the task was cancelled:
$ python3.7 task_example.py
Something went wrong
Task completed normally
But under Python 3.8 the CancelledError
propagates as the programmer probably intended:
$ python3.8 task_example.py
Task was cancelled
This move was a little controversial, as you can see from the discussion on BPO-32528 — whichever decision was taken here, some programmers would have likely been bitten by it, either in past or future code. The best option would be for everyone to carefully consider whether BaseException
is the right base for their exceptions in future.
Another change is support for Happy Eyeballs. Despite the slightly clickbait-sounding name, Happy Eyeballs is actually a fairly useful IETF algorithm for improving responsiveness on dual-stack (i.e. supporting IPv4 and IPv6 concurrently) systems. These systems typically would prefer IPv6, as the newer standard, but this would lead to frustrating delays for parts of the Internet where the IPv6 path failed, as these requests would typically need to hit some timeout before falling back to IPv4. This algorithm tries connections nearly in parallel to give a fast response, but still prefers IPv6 given a choice. The full details can be found in RFC 8305.
Python now includes support for this in asyncio.loop.create_connection()
. There are two new parameters. Specifying happy_eyeballs_delay
activates the behaviour and specifies the delay between making adjacent connections — the RFC suggests 0.25
(250ms) for this value. The second parameter is interleave
and corresponds to what the RFC calls First Address Family Count — this only applies if getaddrinfo()
returns multiple possible addresses and controls how many IPv6 addresses are tried before switching to trying IPv4. I’d suggest not specifying this at all unless you know what you’re doing, as the default should be fine.
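A sketch of how the new parameter might be passed is shown here. The connect() helper and the example host are hypothetical, and the coroutine is only defined rather than run, since running it needs network access.

```python
import asyncio


async def connect(host, port):
    """Open a connection with Happy Eyeballs enabled (hypothetical helper)."""
    loop = asyncio.get_running_loop()
    transport, protocol = await loop.create_connection(
        asyncio.Protocol,          # a do-nothing protocol, for illustration
        host,
        port,
        happy_eyeballs_delay=0.25,  # the delay suggested by the RFC
    )
    return transport


# Example usage, requires network access so it is not run here:
# transport = asyncio.run(connect("example.org", 80))
```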
There are also a couple of smaller changes in asyncio
to note:
Task.get_coro() added: the Task object is a wrapper which is used to schedule an underlying coroutine for execution, and it now provides a get_coro() method to return the underlying coroutine object itself.
Tasks can be named: a task can now be given a name, either by passing the name keyword parameter to create_task() or by calling set_name() on the Task object. This could include calling asyncio.current_task().set_name(...) from within the task itself, which could be useful for diagnostic progress reporting or identification purposes for tasks which acquire work items after the point of creation.
There are a couple of related functions added to socket
which make it easier to create listening sockets.
The first is socket.has_dualstack_ipv6()
which simply returns True
if the current platform supports creating a socket bound to both an IPv4 and IPv6 address, or False
otherwise.
The second function is create_server()
which is a convenience for creating and binding a TCP socket, which is otherwise a tedious bit of boilerplate. This accepts a family
argument, which should be AF_INET
for IPv4 or AF_INET6
for IPv6. However, if you want to support both, you should pass AF_INET6
and also dualstack_ipv6=True
, which attempts to bind the socket to both families. This is commonly used with an empty string as the IP address, to bind to all interfaces, but if you pass an address it should be an IPv6 address — the IPv4 address used will be an IPv4-mapped IPv6 address.
Note that if you use dualstack_ipv6
and your platform doesn’t support dual-stack sockets, you’ll get a ValueError
. You can use has_dualstack_ipv6()
described above to avoid this, although I think EAFP would have been more Pythonic so I’m a little disappointed they didn’t make this a more specific exception that could be caught and handled.
import socket

if socket.has_dualstack_ipv6():
    sock = socket.create_server(("", 1234),
                                family=socket.AF_INET6,
                                dualstack_ipv6=True)
else:
    sock = socket.create_server(("", 1234))
The function also accepts parameters backlog
, which is passed to the listen()
call, and reuse_port
, which is used to control whether to set SO_REUSEPORT
. Overall, therefore, create_server()
performs something like this:
1. Create a socket of type SOCK_STREAM in the specified address family.
2. Set SO_REUSEADDR (not on Windows).
3. If reuse_port is True then set SO_REUSEPORT.
4. If family is AF_INET6 and dualstack_ipv6 is False, set the IPV6_V6ONLY option.
5. Call bind() on the socket.
6. Call listen() on the socket.
At this point the returned socket is ready to call accept()
to receive inbound connections.
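The sequence that create_server() performs can be sketched by hand as follows. This is a rough approximation written for illustration, not the library's actual implementation, and create_server_sketch() is a made-up name. It also clears IPV6_V6ONLY when dual-stack is requested, which is an assumption about how the dual-stack case has to work.

```python
import socket
import sys


def create_server_sketch(address, *, family=socket.AF_INET, backlog=None,
                         reuse_port=False, dualstack_ipv6=False):
    """Approximate the steps create_server() performs (illustrative only)."""
    sock = socket.socket(family, socket.SOCK_STREAM)
    try:
        if sys.platform != "win32":
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if reuse_port:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        if family == socket.AF_INET6:
            # Restrict to IPv6 only unless dual-stack was requested.
            v6only = 0 if dualstack_ipv6 else 1
            sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, v6only)
        sock.bind(address)
        if backlog is None:
            sock.listen()
        else:
            sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock


# Port 0 asks the OS to pick any free port on the loopback interface.
server = create_server_sketch(("127.0.0.1", 0))
port = server.getsockname()[1]
print(f"Listening on port {port}")
server.close()
```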
There are a few useful improvements for XML parsing, including some security improvements, support for wildcard searches within a namespace and support for XML canonicalisation (aka C14N).
There are various known attacks on XML parsers which can cause issues such as massive memory consumption or crashes on the client side, or even steal file content off the disk. One class of these is called XML External Entity (XXE) injection attacks. These rely on a feature which the XML standards require of parsers, but which is very rarely used: the ability to reference entities from external files. The article I linked has some great explanation of how these work.
In Python 3.8, the xml.sax
and xml.dom.minidom
modules no longer process external entities by default, to attempt to mitigate these security risks. If you do want to re-enable this feature in xml.sax
for some reason, apparently you can instantiate an xml.sax.xmlreader.XMLReader()
and call setFeature()
on it using xml.sax.handler.feature_external_ges
. But I suspect it’s probably a much better idea to simply never use this feature of XML.
The various findX()
functions within xml.etree.ElementTree
have acquired some handy support for searching within XML namespaces. Take a look at the example below, which illustrates that you can search within a namespace for any tag using "{namespace}*"
and you can search for a tag within any namespace with "{*}tag"
.
>>> import pprint
>>> import xml.etree.ElementTree as ET
>>>
>>> doc = '<foo xmlns:a="http://aaa" xmlns:b="http://bbb">' \
... '<one/><a:two/><b:three/></foo>'
>>> root = ET.fromstring(doc)
>>>
>>> pprint.pp(root.findall("*"))
[<Element 'one' at 0x101e90180>,
<Element '{http://aaa}two' at 0x101e90220>,
<Element '{http://bbb}three' at 0x101ece040>]
>>> pprint.pp(root.findall("{http://bbb}*"))
[<Element '{http://bbb}three' at 0x101ece040>]
>>> pprint.pp(root.findall("two"))
[]
>>> pprint.pp(root.findall("{*}two"))
[<Element '{http://aaa}two' at 0x101e90220>]
There’s a new xml.etree.ElementTree.canonicalize()
which performs XML canonicalisation, also known as C14N to save typing. This is a process for producing a standard byte representation of an XML document, so that things like cryptographic signatures can be calculated, where a single byte of inconsistency would lead to an error.
This function accepts either XML as a string, or a file path or file-like object using the from_file
keyword parameter. The XML is converted to the canonical form and written to an output file-like object, if provided via the out
keyword parameter, or returned as a text string if out
is not set.
Note that the output file receives the canonicalised version as a str
, so it should be opened with encoding="utf-8"
.
There are some options to control some of the operations, such as whether to strip whitespace and whether to replace namespaces with numbered aliases, but I won’t bother duplicating the documentation for those here.
Overall this is very useful as the process is quite convoluted and if you’re trying to calculate a cryptographic hash you generally have very little to go on when you’re diagnosing discrepancies. You tend to just have to guess what might be going wrong and fiddle around until the two sides match. Having this already implemented in the library, therefore, saves everyone going through this hassle.
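As a small illustration, the sketch below canonicalises two documents which differ only in attribute order, quoting style and how the empty element is written. The sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two made-up documents that are equivalent but differ byte-for-byte.
first = '<root b="2" a="1"><child/></root>'
second = "<root a='1'   b='2'><child></child></root>"

canonical_first = ET.canonicalize(first)
canonical_second = ET.canonicalize(second)

print(canonical_first)
print(canonical_first == canonical_second)  # True
```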
Finally, the xml.etree.ElementTree.XMLParser
class has some new features. Firstly, there are a couple of new callbacks that can be added to the handler. The start_ns()
method will be called for each new namespace declaration, prior to the start()
callback for the element which defines it — this method is passed the namespace prefix and the URI. There’s also a corresponding end_ns()
method which is called with the prefix just after the end()
method for the tag.
>>> from xml.etree.ElementTree import XMLParser
>>>
>>> class Handler:
...     def start(self, tag, attr):
...         print(f"START {tag=} {attr=}")
...     def end(self, tag):
...         print(f"END {tag=}")
...     def start_ns(self, prefix, uri):
...         print(f"START NS {prefix=} {uri=}")
...     def end_ns(self, prefix):
...         print(f"END NS {prefix=}")
...
>>> doc = '<foo xmlns:a="http://aaa" xmlns:b="http://bbb">' \
... '<one/><a:two/><b:three/></foo>'
>>> handler = Handler()
>>> parser = XMLParser(target=handler)
>>> parser.feed(doc)
START NS prefix='a' uri='http://aaa'
START NS prefix='b' uri='http://bbb'
START tag='foo' attr={}
START tag='one' attr={}
END tag='one'
START tag='{http://aaa}two' attr={}
END tag='{http://aaa}two'
START tag='{http://bbb}three' attr={}
END tag='{http://bbb}three'
END tag='foo'
END NS prefix='b'
END NS prefix='a'
The second change is that comments and processing instructions, which were previously ignored, can now be passed through by the builtin TreeBuilder
object. To enable this, there are new insert_comments
and insert_pis
keyword parameters, and there are also comment_factory
and pi_factory
parameters to specify the factory functions to use to construct these objects, instead of using the builtin Comment
and ProcessingInstruction
objects.
To specify these parameters, you need to construct your own TreeBuilder
and pass it to XMLParser
using the target
parameter.
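As a rough sketch of how that fits together, the example below parses the same made-up document twice: once with the default parser, and once with a `TreeBuilder` that has both new flags enabled. The sample document is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document containing a comment and a processing instruction.
doc = "<root><!-- a comment --><item/><?target some data?></root>"

# Default behaviour: comments and processing instructions are dropped.
plain_root = ET.fromstring(doc)

# Build our own TreeBuilder with the new flags and hand it to the parser.
builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
parser = ET.XMLParser(target=builder)
parser.feed(doc)
full_root = parser.close()  # close() returns whatever the target's close() returns

print(len(plain_root), len(full_root))
for child in full_root:
    print(child.tag, repr(child.text))
```

With the flags set, the comment and processing instruction appear as children of the root alongside the ordinary element.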
The support for type hinting continues at a healthy pace with some more improvements in the typing
module.
There’s a new typing.TypedDict
type which supports a heterogeneous dict
where the type of each value may differ. All keys must be str
and must be specified in advance using the usual class member type hint syntax.
import datetime
import typing

class Person(typing.TypedDict):
    first_name: str
    surname: str
    date_of_birth: datetime.date
At runtime this will be entirely equivalent to a dict
, but it allows type-checkers to validate the usage of values within it. If a key is used with an incorrect type, that’s expected to fail type checking, as is any use of a key not specifically listed. By default every listed key is required, but this can be relaxed by adding total=False
to the class definition. The listed keys then become optional, although any that are present must still have their specified types.
One subtle point that may not be immediately apparent is that initialisation with a dict
literal must include a specific type hint on the destination variable, otherwise the type-checker will assume that it is of type dict
instead of the TypedDict
subclass you’ve defined.
churchill: Person = {
    "first_name": "Winston",
    "surname": "Churchill",
    "date_of_birth": datetime.date(1874, 11, 30)
}
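For comparison, here is a small sketch of a `TypedDict` declared with `total=False`. The `PersonUpdate` class is a hypothetical name invented for this example, not something from the PEP.

```python
import datetime
from typing import TypedDict


class Person(TypedDict):
    first_name: str
    surname: str
    date_of_birth: datetime.date


# Hypothetical partial-update type: total=False makes every listed key optional.
class PersonUpdate(TypedDict, total=False):
    first_name: str
    surname: str


churchill: Person = {
    "first_name": "Winston",
    "surname": "Churchill",
    "date_of_birth": datetime.date(1874, 11, 30),
}

# Accepted by a type checker even though "first_name" is missing.
update: PersonUpdate = {"surname": "Spencer-Churchill"}

# At runtime both are plain dict objects, so the usual operations work.
churchill.update(update)
print(type(churchill), churchill["surname"])
```

A type checker should still reject a key that isn't listed, or a listed key with the wrong value type, in either class.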
Personally I tend to try to use custom classes for these sorts of cases, and this is made especially easy by the addition of dataclasses
in Python 3.7, as I talked about in a previous article. However, as PEP 589 discusses there are some cases where a dict
subclass has advantages. It feels to me as if this is straying a little away from the Zen of Python’s There should be one — and preferably only one — obvious way to do it, but Python is a broad church and there’s room for many opinions.
The next new type is typing.Literal
which allows the programmer to specify that a value must be one of a pre-determined list of values. For example:
def get_status(self) -> Literal["running", "stopping", "stopped"]:
    ...
One interesting point highlighted in PEP 586 is that even if you were free to break backwards-compatibility of an API and switched to an enum
for these values, that would only constrain the parameter to the enumeration type as a whole. It’s possible that only a subset of its values should be accepted or returned, and in these cases Literal
is still useful.
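A minimal runnable sketch of the idea follows. The `Status` alias and `is_active()` function are hypothetical names chosen for illustration.

```python
from typing import Literal, get_args

# Hypothetical alias so the set of allowed values is written only once.
Status = Literal["running", "stopping", "stopped"]


def is_active(status: Status) -> bool:
    """Return True unless the service has fully stopped."""
    return status != "stopped"


print(is_active("running"))

# A type checker should reject this call, but nothing stops it at runtime.
print(is_active("paused"))

# The permitted values can be recovered at runtime if you want to validate them.
print(get_args(Status))
```

If runtime validation matters, checking `status in get_args(Status)` is one way to reuse the same list rather than repeating it.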
Another addition in this release is typing.Final
for variables, and a corresponding decorator @final
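for methods and classes, added by PEP 591. These can be used to specify that:
A method should not be overridden in subclasses.
A class should not be subclassed.
A variable or attribute should not be reassigned.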
As usual none of this changes the runtime behaviour, but allows type checkers such as mypy
to perform additional validation. Consider the following code:
[Listing: finaltest.py, 57 lines]
If you run this, you’ll see the output is exactly as you’d expect if the final
and Final
specifiers weren’t there:
SOME_GLOBAL=4567
This method can be overridden
Overriding this will fail type checks
base.normal_attr=100 base.final_attr=333
FirstDerived.can_override()
FirstDerived.cannot_override()
derived.normal_attr=101 derived.final_attr=202
SecondDerived.can_override()
SecondDerived.cannot_override()
second.normal_attr=101 second.final_attr=202
However, if you run mypy
then you’ll see we’re breaking some constraints:
finaltest.py:23: error: Cannot assign to final name "final_attr"
finaltest.py:32: error: Cannot inherit from final class "FirstDerived"
finaltest.py:40: error: Cannot assign to final attribute "final_attr"
finaltest.py:41: error: Cannot assign to final name "SOME_GLOBAL"
Found 4 errors in 1 file (checked 1 source file)
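As a much smaller, self-contained illustration of the same ideas, here is a sketch with made-up names. It is not the `finaltest.py` listing, so the line numbers in the `mypy` output above don't apply to it.

```python
from typing import Final, final

MAX_RETRIES: Final = 3


class Base:
    @final
    def cannot_override(self) -> str:
        return "Base.cannot_override()"


@final
class Leaf(Base):
    # A type checker should flag this: the base method is marked @final.
    def cannot_override(self) -> str:
        return "Leaf.cannot_override()"


# A type checker should flag this too: Leaf is a final class.
class SubLeaf(Leaf):
    pass


# And this: MAX_RETRIES was declared Final.
MAX_RETRIES = 5

# None of it is enforced at runtime, so the script runs without complaint.
print(MAX_RETRIES, SubLeaf().cannot_override())
```

Each of the three flagged statements corresponds to one of the constraints that `Final` and `@final` express.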
Also included in this release are the changes outlined in PEP 544, which introduce a form of structural typing to Python. This is where compatibility between types is determined by analysing a type’s actual structure rather than relying on explicitly declared relationships such as inheritance, which is nominative typing.
The PEP refers to it as static duck typing, which I think is a good name. As I’m sure many of you are aware, in general duck typing refers to a system where objects are just checked for meeting a specified abstract interface at runtime, rather than their entire type. The key aspect, however, is that the object doesn’t need to declare that it meets this interface by, for example, inheriting from some abstract base class. The interface is checked against the object’s actual definition.
The static in that phrase is important, because Python already offers runtime facilities for checking whether objects meet particular interfaces without them having to be specifically declared. In the excerpt below, for example, the IntArray
class never declares itself as inheriting from collections.abc.Collection
, yet it is still reported as compatible by isinstance()
.
>>> import collections.abc
>>> from typing import List
>>>
>>> class IntArray:
...     values: List[int] = []
...     def __init__(self, initial=()):
...         self.values = list(int(i) for i in initial)
...     def __contains__(self, value):
...         return value in self.values
...     def __iter__(self):
...         return iter(self.values)
...     def __len__(self):
...         return len(self.values)
...     def __reversed__(self):
...         return reversed(self.values)
...     def __getitem__(self, idx):
...         return self.values[idx]
...     def index(self, *args, **kwargs):
...         return self.values.index(*args, **kwargs)
...     def count(self, *args, **kwargs):
...         return self.values.count(*args, **kwargs)
...
>>> instance = IntArray((1,2,3,4,5))
>>> isinstance(instance, collections.abc.Sized)
True
>>> isinstance(instance, collections.abc.Collection)
True
>>> isinstance(instance, collections.abc.Reversible)
True
>>> isinstance(instance, collections.abc.MutableSequence)
False
>>> isinstance(instance, collections.abc.Mapping)
False
A protocol is a declaration of the interface that a class must meet in order to be taken as supporting that protocol. Creating one simply involves declaring a class which inherits from typing.Protocol
and defining the required interface. This protocol can then be used by static type checkers to validate that the objects passed to a call conform to the specified interface.
>>> from abc import abstractmethod
>>> from typing import Protocol
>>>
>>> class SupportsNameAndID(Protocol):
...     name: str
...     @abstractmethod
...     def get_id(self) -> int:
...         ...
...
As well as implicitly supporting the interface by implementing the specified attributes and methods directly, it’s also fine to inherit explicitly from a protocol class. This can be useful to allow protocol classes to act as mixins, adding default concrete implementations of some or all of the methods. The general idea is that they’re very similar to regular abstract base classes (ABCs). One detail that’s worth noting, however, is that a class is only a protocol if it directly derives from typing.Protocol
, so classes further down the inheritance hierarchy are treated as just regular ABCs.
It’s also possible to decorate a protocol class with @typing.runtime_checkable
, which also means that isinstance()
and issubclass()
can be used to detect whether types conform at runtime. This can be used to log warnings, raise exceptions or anything else. The example below follows on from the same session above, except assuming that SupportsNameAndID
had been defined with the @runtime_checkable
class decorator.
>>> class One:
...     pass
...
>>> isinstance(One(), SupportsNameAndID)
False
>>>
>>> class Two:
...     def get_id(self) -> int:
...         return 123
...
>>> isinstance(Two(), SupportsNameAndID)
False
>>>
>>> class Three:
...     name: str = "default"
...     def get_id(self) -> int:
...         return 123
...
>>> isinstance(Three(), SupportsNameAndID)
True
All in all this is a useful means for developers to specify their own protocols to complement those already defined in collections.abc
.
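To tie the static and mixin aspects together, here is a sketch using an invented `SupportsGreeting` protocol. All the class and function names are assumptions for illustration only.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsGreeting(Protocol):
    name: str

    def greet(self) -> str:
        # Default implementation, only inherited by explicit subclasses.
        return f"Hello, {self.name}"


class Implicit:
    """Conforms structurally, without mentioning the protocol at all."""

    def __init__(self, name: str) -> None:
        self.name = name

    def greet(self) -> str:
        return f"Hi from {self.name}"


class Explicit(SupportsGreeting):
    """Inherits from the protocol, picking up the default greet()."""

    def __init__(self, name: str) -> None:
        self.name = name


def welcome(guest: SupportsGreeting) -> str:
    # A static type checker verifies callers pass something with name and greet().
    return guest.greet()


print(welcome(Implicit("Ada")))
print(welcome(Explicit("Grace")))
print(isinstance(object(), SupportsGreeting))
```

Note that `Explicit` is an ordinary class rather than a protocol itself, since it doesn't list `Protocol` directly among its bases.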
Last up in typing
are the new functions get_origin()
and get_args()
. These are used for breaking apart the specification of generic types into the core type, returned by get_origin()
, and the type(s) passed as argument(s), returned by get_args()
. This is perhaps best explained with some examples:
>>> import typing
>>>
>>> typing.get_origin(typing.List[typing.Tuple[int, ...]])
<class 'list'>
>>> typing.get_args(typing.List[typing.Tuple[int, ...]])
(typing.Tuple[int, ...],)
>>>
>>> typing.get_origin(typing.Tuple[int, ...])
<class 'tuple'>
>>> typing.get_args(typing.Tuple[int, ...])
(<class 'int'>, Ellipsis)
>>>
>>> typing.get_origin(typing.Dict[str, typing.Sequence[int]])
<class 'dict'>
>>> typing.get_args(typing.Dict[str, typing.Sequence[int]])
(<class 'str'>, typing.Sequence[int])
>>>
>>> typing.get_origin(typing.Hashable)
<class 'collections.abc.Hashable'>
>>> typing.get_args(typing.Hashable)
()
The unittest.mock.Mock
class is a real workhorse that can mock almost anything (if you’re not familiar with it, I gave a brief overview in an earlier article on Python 3.3). There’s one case where it can’t be easily used, however, which is when mocking asynchronous objects, for example an asynchronous context manager which provides __aenter__()
and __aexit__()
methods.
The issue is that the mock needs to be recognised as an async function and return an awaitable object instead of a direct result. In Python 3.8, the AsyncMock
class has been added to implement these semantics. You can see the difference in behaviour here:
>>> import asyncio
>>> from unittest import mock
>>>
>>> m1 = mock.Mock()
>>> m2 = mock.AsyncMock()
>>> asyncio.iscoroutinefunction(m1)
False
>>> asyncio.iscoroutinefunction(m2)
True
>>> m1()
<Mock name='mock()' id='4456666016'>
>>> m2()
<coroutine object AsyncMockMixin._execute_mock_call at 0x1099f28c0>
Here’s a very simple illustration of it in practice:
>>> async def get_value(obj):
... value = await obj()
... print(value)
...
>>> mock_obj = mock.AsyncMock(return_value=123)
>>> asyncio.run(get_value(mock_obj))
123
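The asynchronous context manager case mentioned earlier can be sketched in a similar way. As of Python 3.8 `MagicMock` also configures `__aenter__()` and `__aexit__()` as async mocks, so the example below relies on that. The `fetch()` function, the `query()` method and the SQL string are all hypothetical.

```python
import asyncio
from unittest import mock


async def fetch(session):
    """Hypothetical code under test: uses an async context manager."""
    async with session as conn:
        return await conn.query("SELECT 1")


# The object yielded by "async with" gets an awaitable query() method.
conn = mock.MagicMock()
conn.query = mock.AsyncMock(return_value=[1])

# MagicMock supplies awaitable __aenter__/__aexit__ from Python 3.8 onwards.
session = mock.MagicMock()
session.__aenter__.return_value = conn

result = asyncio.run(fetch(session))
print(result)

# AsyncMock adds assertions about awaits as well as calls.
session.__aenter__.assert_awaited_once()
conn.query.assert_awaited_once_with("SELECT 1")
```

The `assert_awaited_*` family mirrors the familiar `assert_called_*` methods, but checks that the result was actually awaited.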
A unittest.TestCase
has setUp()
and tearDown()
methods to allow instantiation and removal of test fixtures. However, if setUp()
does not complete successfully then tearDown()
is never called, which runs the risk of leaving things in a broken state. As a result these objects also have an addCleanup()
method to register functions which will be called after tearDown()
, but which are always called regardless of whether setUp()
succeeded.
The class also has corresponding setUpClass()
and tearDownClass()
class methods, which are called before and after tests within the class as a whole are run. In addition the module containing the tests can define setUpModule()
and tearDownModule()
functions which perform the same thing at module scope. However, until Python 3.8 neither of these cases had an equivalent of addCleanup()
that would be called in all cases.
As of Python 3.8 these now exist. There’s an addClassCleanup()
class method on TestCase
to add cleanup functions to be called after tearDownClass()
, and there’s a unittest.addModuleCleanup()
function to register functions to be called after tearDownModule()
. These will be invoked even if the relevant setup methods raise an exception.
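Here is a sketch of the three levels of cleanup registered side by side. The `events` list and test names are invented for illustration, and the module-level fixture is only picked up when the test class's module can be found in `sys.modules`, as it is when this runs as a script.

```python
import unittest

events = []  # Records the order in which things happen.


def setUpModule():
    # Module-level cleanup, runs after tearDownModule() would.
    unittest.addModuleCleanup(events.append, "module cleanup")


class CleanupOrderTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Class-level cleanup, runs after tearDownClass().
        cls.addClassCleanup(events.append, "class cleanup")

    def setUp(self):
        # Per-test cleanup, runs after tearDown().
        self.addCleanup(events.append, "test cleanup")

    def test_something(self):
        events.append("test")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanupOrderTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(events)
```

The cleanups unwind from the innermost scope outwards, so the per-test cleanup runs first and the module cleanup last.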
You can now write your test cases as async methods thanks to the new unittest.IsolatedAsyncioTestCase
base class. This is useful for testing your own async functions, which need to be executed in the context of an event loop, without having to write a load of boilerplate yourself each time.
If you derive from this new class, as you would TestCase
, then it accepts coroutines as test functions. It adds asyncSetUp()
and asyncTearDown()
async methods, which are additionally called just inside the existing setUp()
and tearDown()
, which are still normal (i.e. non-async) methods. There’s also an addAsyncCleanup()
method, similar to the other cleanup methods described above — this registers an async function to be called at cleanup time.
An event loop is constructed and the test methods are executed on it one at a time, asynchronously. Once execution is completed, any remaining tasks on the event loop are cancelled. Other than that, things operate more or less as with the standard TestCase
.
To summarise, assuming all of these are defined then the order of events will be:
1. Construct an instance of the IsolatedAsyncioTestCase class.
2. Call the setUp() method, if defined.
3. Add the asyncSetUp() method to the job queue, if defined.
4. Add the test method itself to the job queue.
5. Add the asyncTearDown() method to the job queue, if defined.
6. Call the tearDown() method, if defined.
The usual handful of changes that I noted, but didn’t think required much elaboration.
cProfile.Profile Context Manager
The cProfile.Profile class can now be used as a context manager by placing with cProfile.Profile() as profiler: around the block to be profiled. This is a useful convenience for calling the enable() and disable() methods of the Profile object.

dict for OrderedDict
A couple of places which had been changed to return collections.OrderedDict now just return dict again, as that now preserves insertion order as we discussed in a previous article on Python 3.6. These are the _asdict() method of collections.namedtuple() and csv.DictReader.

date.fromisocalendar() Added
Python 3.7 added the fromisoformat() method on date and datetime objects in the datetime module. In this release there’s a somewhat similar new method called fromisocalendar(), available on both date and datetime objects, which is based on a different part of the ISO 8601 standard, that of the week numbering convention. The function takes a year, a week number and a weekday within that week, and returns the correspondingly initialised object, and is essentially the converse of the existing isocalendar() method. For example, date.fromisocalendar(2020, 1, 1) gives 30 December 2019, the Monday of ISO week 1 of 2020.

itertools.accumulate() Initial Value
The accumulate() function in itertools repeatedly applies a binary operation, by default addition, to an iterable of values, yielding the cumulative results. In Python 3.8 there’s a new initial parameter which allows an initial value to be specified as if it were at the start of the iterable. For example, accumulate([1, 2, 3], initial=10) yields 10, 11, 13 and 16.

logging.basicConfig() Acquires force Parameter
A long-standing quirk of logging.basicConfig() is that if the root logger already has handlers configured, the call is silently ignored. This means that only the first call is generally effective. As of Python 3.8, however, it’s acquired a force parameter which, if True, will cause any existing handlers to be removed and closed before adding the new ones. This is helpful in any case where you want to re-initialise logging that may already have been initialised elsewhere.

madvise() For mmap
The mmap.mmap class now has an madvise() method which calls the madvise() system call. This allows applications to hint to the kernel what type of access it can expect to a memory-mapped block, which allows the kernel to optimise its choice of read-ahead and caching techniques. For example, if accesses will be random then read-ahead is probably of little use.

shlex.join()
The shlex.split() function splits a command-line respecting quoting, as a shell would. In Python 3.8 there’s now a shlex.join() that does the opposite, joining a list of tokens into a single string and inserting quoting and escapes as appropriate.

time.CLOCK_UPTIME_RAW (MacOS only)
The clock CLOCK_UPTIME_RAW has been made available. This is a monotonically increasing clock which isn’t affected by any time of day changes, and does not advance while the system is sleeping. This works exactly the same as CLOCK_UPTIME which is available on FreeBSD and OpenBSD.

sys.unraisablehook()
The new hook sys.unraisablehook() can now be set to handle cases where Python cannot handle a raised exception, for example those raised by __del__() during garbage collection. The default action is to print some output to stderr, but this hook can be defined to take other action instead, such as writing output to a log file.

unicodedata.is_normalized()
There’s a new function unicodedata.is_normalized() which checks whether a Unicode string is already in the specified normal form, in other words whether normalizing it would leave it unchanged. This can be performed much faster than calling normalize() and so if an already-normalized string is a common case then this can save some time. All the standard types of normalization (NFD, NFC, NFKD and NFKC) are supported. For the gory details of how Unicode normalization works, you can read the official standard, or the rather more accessible FAQ.

Another crop of handy features here. The enhancements to asyncio
, and the ability to write unit tests for coroutines conveniently, are making coroutine-style code really convenient to write now. I look forward to seeing what other enhancements are to come in this area. The typing
improvements are also great to see, and having gone through all these changes I’m rather belatedly trying to make more consistent use of the features they offer in my code.
A lot of the rest of these features are perhaps only useful to specific areas, such as the mathematical or networking enhancements. But that’s fine — with a steady stream of these domain-specific enhancements in every release, everyone is bound to find something useful to them every so often. What I do like is the way that the standard library still feels like a coherent set of core functionality — despite the range of domains supported now, there’s still clearly a lot of effort going into keeping out anything that’s too niche.
So, on to Python 3.9 next, and I’m really looking forward to getting all the way up to date in a few articles. I never had a clue what I was letting myself in for when I started this series, but it’s been a great learning experience for me, and I’m hoping parts of it are proving useful to others too. The limiting factor has been the availability of my time, not any lack of interest on my part — it’s a continual source of amazement to me how far things have come, having followed the path.
1. To be fair, I realise that the cases for BaseException are generally very rare, and probably only impact the standard library or anyone writing fundamental code execution frameworks. But it’s useful to bear in mind the high cost that a casual decision early on can have on many developers.
2. Single dispatch is a form of generic function where different implementations are chosen based on the type of a single argument. For example, you may want a function that performs different operations when passed a str vs. an int.
3. SO_REUSEPORT allows multiple sockets to bind to the same port, which can be useful. Unfortunately the behaviour differs across operating systems. Linux performs load-balancing, so multiple threads can have their own listen sockets bound to the same port and connections will be distributed across them, whereas MacOS and some of the BSDs always send connections to the most recently-bound process, to allow a new process to seamlessly take over from another. This article has some great details.
4. SO_REUSEADDR allows the socket to bind to the same address as a previous socket, as long as that socket is not actively listening any more (typically it’s in TIME_WAIT state).
5. In case it’s not obvious, the “14” of “C14N” refers to the 14 characters between the “C” and the “N”.