# What’s New in Python 3.8 - Library Changes

5 Jul 2022 at 11:38PM in Software

In this series looking at features introduced by every version of Python 3, we continue our look at Python 3.8, examining changes to the standard library. These include some useful new functionality in functools, some new mathematical functions in math and statistics, some improvements for running servers on dual-stack hosts in asyncio and socket, and also a number of new features in typing.

This is the 18th of the 22 articles that currently make up the “Python 3 Releases” series.

As usual, this release contains improvements across a whole host of modules, although many of these are fairly limited in scope. The asyncio module continues to develop with a series of changes, the functools module has a number of new decorators which will likely be quite useful, and the math and statistics modules contain a bounty of useful new functions.

So let’s get started!

## Data Types

### pprint

A fairly straightforward change: the long-standing behaviour of various functions in pprint, which output the keys of dict objects in lexicographically sorted order, can now be disabled by passing sort_dicts=False. This makes sense now that the dict implementation returns keys in insertion order, which is potentially useful to see in pprint() output.

In addition to this new parameter, there’s a new convenience function pprint.pp() which is essentially equivalent to functools.partial(pprint.pprint, sort_dicts=False).
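For instance, comparing the two behaviours on a small dict:

```python
import pprint

# Insertion order here is zebra, apple, mango
data = {"zebra": 1, "apple": 2, "mango": 3}

# Default behaviour still sorts keys lexicographically
pprint.pprint(data)                    # {'apple': 2, 'mango': 3, 'zebra': 1}

# sort_dicts=False preserves insertion order
pprint.pprint(data, sort_dicts=False)  # {'zebra': 1, 'apple': 2, 'mango': 3}

# pp() is a convenience for the same thing
pprint.pp(data)                        # {'zebra': 1, 'apple': 2, 'mango': 3}
```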

## Functional Programming

### functools

There are a few useful changes in the functools module this release, around caching and implementing single dispatch.

#### lru_cache

Let’s kick off with a simple convenience — it’s now possible to use functools.lru_cache as a normal decorator without requiring function call syntax.

# Prior to Python 3.8
@functools.lru_cache()
def function():
    ...

# In Python 3.8 we don't need the brackets!
@functools.lru_cache
def function():
    ...

# But arguments are still supported, of course
@functools.lru_cache(maxsize=128)
def function():
    ...


#### cached_property

Staying with the topic of caching, a mechanism for caching immutable properties has been added, appropriately enough called functools.cached_property. When used as the decorator of a function, it acts like @property except that only the first read calls the decorated function — subsequent reads use the same cached value. Writes are permitted, and update the cached value without calling the underlying function again, and deleting the attribute with del removes the cached value such that the next read will invoke the function once more.

This is all illustrated in the example below:

 1  import dataclasses
 2  import datetime
 3  import functools
 4
 5
 6  @dataclasses.dataclass
 7  class Person:
 8      first_name: str
 9      surname: str
10      date_of_birth: datetime.date
11
12      @functools.cached_property
13      def name(self):
14          print("Constructing name...")
15          return " ".join((self.first_name, self.surname)).title()
16
17      @functools.cached_property
18      def age(self):
19          print("Calculating age...")
20          today = datetime.date.today()
21          naive_age = today.year - self.date_of_birth.year
22          try:
23              birthday = self.date_of_birth.replace(year=today.year)
24          except ValueError:
25              birthday = self.date_of_birth.replace(
26                  year=today.year,
27                  month=self.date_of_birth.month + 1,
28                  day=1,
29              )
30          return naive_age - 1 if birthday > today else naive_age
31
32
33  darwin = Person("charles", "darwin", datetime.date(1809, 2, 12))
34  print(f"[1] {darwin.name}, {darwin.age}")
35  print(f"[2] {darwin.name}, {darwin.age}")
36  darwin.name = "Chuck Darwin"
37  print(f"[3] {darwin.name}, {darwin.age}")
38  darwin.first_name = "Emma"
39  darwin.date_of_birth = datetime.date(1808, 5, 2)
40  print(f"[4] {darwin.name}, {darwin.age}")
41  del darwin.name
42  del darwin.age
43  print(f"[5] {darwin.name}, {darwin.age}")

The result of executing this, at time of writing, is shown below:

Constructing name...
Calculating age...
[1] Charles Darwin, 213
[2] Charles Darwin, 213
[3] Chuck Darwin, 213
[4] Chuck Darwin, 213
Constructing name...
Calculating age...
[5] Emma Darwin, 214


You can see that the first call to each of the attributes on line 34 invokes the function, triggering the print() statements on lines 14 and 19. However, the second call on line 35 simply uses the same cached values.

On line 36 we write to the cached value — this updates the result, but doesn’t update the underlying first_name or surname attributes, and neither does it invoke any functions. Then on lines 38-39 we update the underlying attributes, but nothing here triggers updates to the cached values so line 40 still prints the cached versions from before.

Finally, on lines 41-42 we use del to clear the cached values, and this causes the reads triggered from line 43 to re-run the property functions and thus the output now reflects the changes we’d previously made to first_name and date_of_birth.

#### singledispatchmethod

The third and final change in functools is nothing to do with caching, but is a helper for those wanting to write functions with single dispatch. This is where you want to call different implementations of a method based on the type of a single argument.

This is an extension to the existing functools.singledispatch which we looked at way back in one of the articles on Python 3.4. The difference here is that the decorator will ignore the initial self or cls argument and switch on the next one.

The way it works is the same as singledispatch — you decorate the first occurrence of the method, and that one becomes the default case if none of the other types match. That method becomes an object which presents a register decorator you can then use to register your type overloads, using type annotations in the signature of the overload function to specify the type.

Here’s a trivial example, which might make things clearer:

import functools


class MyClass:
    @functools.singledispatchmethod
    def method(self, arg1, arg2: int):
        print(f"default version called {arg1=} {arg2=}")
        raise NotImplementedError(
            f"Arg of type {type(arg1).__name__} not supported"
        )

    @method.register
    def _(self, arg1: int, arg2: int):
        print(f"int version called {arg1=} {arg2=}")

    @method.register
    def _(self, arg1: str, arg2: int):
        print(f"str version called {arg1=} {arg2=}")


instance = MyClass()
print("Calling with int...")
instance.method(123, 1)
print("Calling with str...")
instance.method("abc", 2)
print("Calling with float...")
try:
    instance.method(456.789, 3)
except NotImplementedError as exc:
    print(f"Exception: {exc}")

If you execute that, you’ll see the following output:

Calling with int...
int version called arg1=123 arg2=1
Calling with str...
str version called arg1='abc' arg2=2
Calling with float...
default version called arg1=456.789 arg2=3
Exception: Arg of type float not supported


In general I’d say it’s better practice to use approaches such as polymorphism to handle switching implementations by type, but there are always those irritating cases where it doesn’t pan out — for example, if you’re not in control of the types used. For these cases this approach seems rather cleaner than rolling your own solution is likely to be.

## Numeric and Mathematical

### math

There are some handy new functions in the math module for particular cases, which are discussed in the sections below.

#### Euclidean Distance

There’s a new math.dist() function which calculates the Euclidean distance — that is, the straight-line distance between two points in Euclidean space. In two dimensions, for example, this will be the length of the hypotenuse of the right-angle triangle formed between the two points.

Although this will generally be either 2-dimensional or 3-dimensional space for most people, the function itself supports any number of dimensions — it just requires that the two points have the same number of dimensions.

>>> math.dist((0, 0), (3, 4))
5.0
>>> math.dist((0, 0), (21.65, 12.20))
24.85080481594107
>>> math.dist((1, 2, 3), (5, 8, 15))
14.0
>>> math.dist((0,) * 10, range(10))
16.88194301613413


#### Euclidean Norm

The existing function math.hypot() calculates the Euclidean norm of a point in 2D space. This has been extended to support N-dimensions.

It’s beyond the scope of this article to discuss norms in general, and the Euclidean norm specifically — the Wikipedia article linked above can help you out. For the purposes of this discussion, consider it essentially the same as the dist() method above where the first argument is implicitly the origin. This function is the more basic building block which is potentially useful for things other than distances between points.
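A couple of quick examples, using the classic 3-4-5 right-angled triangle and its 2-3-6-7 equivalent in three dimensions:

```python
import math

# 2D, the original behaviour: the 3-4-5 triangle
print(math.hypot(3, 4))                  # 5.0

# Python 3.8 accepts any number of coordinates
print(math.hypot(2, 3, 6))               # 7.0

# Equivalent to dist() measured from the origin
print(math.dist((0, 0, 0), (2, 3, 6)))   # 7.0
```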

#### Products

This is a simple one — there’s a new math.prod() function which is similar to the builtin sum() except that it calculates the product instead of the sum.

You can already do this using some other library functions, but it’s less convenient and slower:

>>> import math
>>>
>>> math.prod(range(1, 100))
93326215443944152681699238856266700490715968264381621468592
96389521759999322991560894146397615651828625369792082722375
82511852109168640000000000000000000000
>>>
>>> import functools
>>> import operator
>>>
>>> functools.reduce(operator.mul, range(1, 100), 1)
93326215443944152681699238856266700490715968264381621468592
96389521759999322991560894146397615651828625369792082722375
82511852109168640000000000000000000000
>>>
>>> import timeit
>>>
>>> timeit.timeit("math.prod(x)",
...               setup="import math; x = range(1, 100)")
4.280957616996602
>>> timeit.timeit("functools.reduce(operator.mul, x, 1)",
...               setup="import functools; import operator;"
...                     " x = range(1, 100)")
6.377806781005347


#### Combinations and Permutations

There are two new functions to calculate the combinations (math.comb()) and permutations (math.perm()) of selecting r items from n. If I think back to GCSE Mathematics, I recall that permutations are the number of ways of selecting r items from a population of n without replacement, where the same items selected in a different order are counted distinctly. Combinations represent the same thing, except that the same items selected in a different order are not counted distinctly.

If I think back really hard, I recall that the formulae for these two are as follows:

$^nP_r = \frac{n!}{(n-r)!}$ $^nC_r = \frac{n!}{r!(n-r)!}$

Still, it’s handy not to have to remember those, and these versions are faster.

>>> math.comb(59, 6)
45057474
>>> math.perm(59, 6)
32441381280


#### Integer Square Root

There’s also a new function math.isqrt() to calculate the integer square root. The integer square root of $n$ is the largest integer $m$ such that $m^2 \le n$.

This has applications in areas such as primality testing, and it’s a tricky little function to get both correct and efficient for large inputs. For small values you can round off math.sqrt(), but for larger values inaccuracies creep in and you get incorrect results.

>>> root = 67108865
>>> square = root ** 2
>>> math.isqrt(square - 1)
67108864
>>> math.floor(math.sqrt(square - 1))
67108865


In the example above you can see that using math.floor(math.sqrt(...)) overestimates the result by 1. As you move to much larger values the floating point errors increase.

### statistics

The statistics module has a generous helping of delicious new functions, so fill your plate with all this numerical analysis goodness.

#### fmean

There’s a new statistics.fmean() function, which performs the same operation as mean() except that it works entirely in floating point. This sacrifices a small amount of accuracy, but so small that almost all users would never care, and in return gives significantly faster performance.

>>> setup = "import random, statistics; " \
...         "data = [random.randint(1, 100) for i in range(1000)]"
>>> timeit.timeit("statistics.mean(data)", setup=setup, number=10000)
4.593662832000007
>>> timeit.timeit("statistics.fmean(data)", setup=setup, number=10000)
0.1214523390000295


In the comparison above you can see that mean() takes around 38 times longer than fmean() to complete.

#### geometric_mean

The new statistics.geometric_mean() calculates, you guessed it, the geometric mean. As opposed to the more common arithmetic mean, calculated as the sum of the set divided by its cardinality, the geometric mean is calculated as the nth root of the product of the set.

$\sqrt[n]{\prod\limits_{i=1}^{n} x_i}$

This is a useful measure for certain situations such as proportional growth rates. It’s also particularly suitable for averaging results which have been normalised to different reference values, because of the particular property of the geometric mean that:

$G\left(\frac{X_i}{Y_i}\right) = \frac{G(X_i)}{G(Y_i)}$

The addition of this function also means that, along with mean() and harmonic_mean(), the statistics module now contains all three of the Pythagorean means. It’s good to see that Python has finally caught up with the ancient Greeks!
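Here’s a quick demonstration of all three on a data set small enough to verify by hand:

```python
import statistics

data = [2, 8]

print(statistics.mean(data))            # 5, i.e. (2 + 8) / 2
print(statistics.geometric_mean(data))  # ~4.0, i.e. sqrt(2 * 8)
print(statistics.harmonic_mean(data))   # 3.2, i.e. 2 / (1/2 + 1/8)
```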

#### multimode

This is a fairly straightforward variation of the existing statistics.mode() function. When locating the modal value, that which occurs the greatest number of times in the input set, there’s always the possibility for multiple such values. The existing mode() function returns the first such value encountered, whereas multimode() returns a list of all of them.

>>> data = "A" * 2 + "B" * 5 + "C" * 4 + "D" * 5 + "E" + "F" * 5
>>> statistics.mode(data)
'B'
>>> statistics.multimode(data)
['B', 'D', 'F']


#### quantiles

A common statistical measure is to divide a large data set into four evenly sized groups and look at the three boundary values of these groups — these are the lower quartile, the median and the upper quartile respectively. Generalising this concept to n different groups instead of 4 yields the concept of quantiles. The statistics.quantiles() function calculates these boundaries.

By default the quartiles are given, but other values of the n parameter divide the data into that many groups — pass n=10 for deciles and n=100 for percentiles.

>>> statistics.quantiles(range(1, 100))
[25.0, 50.0, 75.0]
>>> statistics.quantiles(range(1, 100), n=5)
[20.0, 40.0, 60.0, 80.0]
>>> statistics.quantiles(range(1, 100), n=10)
[10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]


#### NormalDist

Finally, we have the new statistics.NormalDist class. This is a fairly flexible workhorse for doing operations on normally distributed data.

First we have to construct the distribution. This can be done directly, by passing in the two parameters: mu is the arithmetic mean of the data and sigma is the standard deviation. Alternatively, the from_samples() class method uses fmean() and stdev() to estimate these values from a sample of data.

Once you’ve constructed the object, you can recover the mean and stdev attributes from it, but there are a variety of more interesting functions as well. The samples() method can be used to generate a specified number of random samples which conform to the distribution, and the quantiles() function can return where you’d expect the quantiles of the data to be based on the distribution.

>>> height_dist = statistics.NormalDist.from_samples(height_data)
>>> height_dist.mean
175.19763652682227
>>> height_dist.stdev
7.709514592235422
>>> height_dist.samples(3)
[179.58106311923885, 179.66015160925144, 173.56059959418297]
>>> statistics.fmean(height_dist.samples(1000000))
175.31327020208047
>>> height_dist.quantiles(n=4)
[169.99764795537234, 175.19763652682227, 180.3976250982722]


Distribution objects can also be multiplied by a constant to transform the distribution accordingly — this can be useful for things like unit conversion, which would apply to all data points equally. Addition, subtraction and division are also supported for other forms of translation and scaling.

>>> feet_height_dist = height_dist * 0.0328084
>>> feet_height_dist.mean
5.7479541382265955
>>> feet_height_dist.stdev
0.2529368385478966
>>>


There are also useful functions for dealing with probabilities. The pdf() method uses the probability density function to return the relative likelihood that a random variable will be close to the specified value. There’s also cdf(), which uses the cumulative distribution function to return the probability that a value will be less than or equal to the specified value, and inv_cdf(), which takes a probability and returns the point in the distribution where the cdf() of that point would return the specified probability.

>>> feet_height_dist.pdf(5)
0.01991105275231589
>>> feet_height_dist.cdf(6)
0.8404908965170221
>>> feet_height_dist.inv_cdf(0.75)
5.918557443274153


There are some other features that I haven’t covered here, so it’s well worth reading through the documentation if you’re doing analysis of normally distributed data.

## File and Directory Access

### os.path

There are a few improvements to os.path, several of which only apply on Windows.

#### Fewer ValueErrors

The various os.path functions which return a bool, such as os.path.exists() and os.path.isdir(), always used to raise ValueError if passed a filename which contains invalid characters for the OS. For example, here’s an attempt to check for a filename which contains an embedded nul character on Python 3.7:

Python 3.7.10 (default, Mar 28 2021, 04:19:36)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
>>> import os
>>> os.path.isdir("/foo\0bar")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/andy/.pyenv/versions/3.7.10/lib/python3.7/genericpath.py", line 42, in isdir
st = os.stat(s)
ValueError: embedded null byte


And here’s the new behaviour on Python 3.8:

Python 3.8.8 (default, Mar 28 2021, 04:22:11)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
>>> import os
>>> os.path.isdir("/foo\0bar")
False


#### Windows Changes

There are a set of smaller changes on Windows in os.path as well.

os.path.expanduser()
Now prefers to use the USERPROFILE environment variable on Windows in preference to HOME, since the former is more reliably set for normal user accounts.
os.path.isdir()
No longer returns True when querying a link to a directory that no longer exists.
os.path.realpath()
Now has support for resolving reparse points, which are discussed in the section on os below.

### pathlib

Just a couple of small changes in pathlib. Firstly, the bool-returning functions such as exists() and is_symlink() no longer raise ValueError for invalid filenames, as discussed in the section on os.path above.

Secondly, there’s a new Path.link_to() function which creates a hard link to the current path. The naming is slightly unfortunate, because if you call my_path.link_to(target) then it reads as if the path referred to by my_path will become a link to target — in fact the opposite is true, target is created as a hard link to my_path.
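Here’s a short sketch to make the direction concrete. The filenames are invented for illustration, and I’ve guarded the call since link_to() was later deprecated in 3.10 and removed in 3.12 in favour of Path.hardlink_to():

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.txt"
    original.write_text("some content")
    new_link = Path(tmp) / "new_link.txt"

    # Reads as "make original a link to new_link", but actually creates
    # new_link as a hard link to original, like os.link(original, new_link)
    if hasattr(original, "link_to"):     # Python 3.8-3.11
        original.link_to(new_link)
    else:                                # removed in 3.12
        os.link(original, new_link)

    print(new_link.read_text())          # some content
    print(os.stat(original).st_nlink)    # 2, both names share one inode
```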

### shutil

There are some smaller changes to a handful of functions in the shutil module.

shutil.copytree() Acquires dirs_exist_ok
Typically copytree() raises an exception if any of the destination directories already exists — passing dirs_exist_ok=True now disables this behaviour and allows the copy to proceed.
shutil.make_archive()
This high-level interface to creating an archive file had a slight tweak when using format="tar" — now the specific format used will be pax instead of legacy GNU format. The same change has been made to the tarfile module.
shutil.rmtree()
On Windows (only) this function now removes directory junctions without recursing into them to remove their contents first.
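The dirs_exist_ok change is probably the one most people will hit, so here’s a brief sketch (the filenames are invented for illustration):

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "config.txt").write_text("settings")

    dst = Path(tmp) / "dst"
    dst.mkdir()  # the destination directory already exists

    # Before 3.8 this raised FileExistsError; now the copy proceeds
    shutil.copytree(src, dst, dirs_exist_ok=True)
    print((dst / "config.txt").read_text())  # settings
```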

## Generic Operating System Services

### os

A handful of changes in os, mostly on the Windows platform.

First up there’s a new add_dll_directory() function to provide additional search paths for loading DLLs, for example when using the ctypes module. This is similar to LD_LIBRARY_PATH on POSIX systems. The function returns a handle which has a close() method which reverts the change again, or it can be used in a with statement to achieve the same effect.

#### Reparse Points (Windows)

The second change on Windows is that the logic for reparse points, such as symlinks and directory junctions, has been moved from being Python-specific to being delegated to the operating system to handle. Now I’m very far from an expert on Windows, so I hope I don’t get any of these details wrong, but my understanding is that this expands the set of file-like objects that are supported to anything that the OS itself supports. This means that os.stat() can query anything, whereas os.lstat() will query anything which has the name surrogate bit set in the reparse point tag.

It’s worth noting that stat_result.st_mode will only set the S_IFLNK bit for actual symlinks — it will be clear for other reparse points. If you want to check for reparse points in general, you can look for stat.FILE_ATTRIBUTE_REPARSE_POINT in stat_result.st_file_attributes, and you can look at stat_result.st_reparse_tag to get the reparse point tag in this case. The stat module has some IO_REPARSE_TAG_* constants to help check bits in the tag, but the list is not exhaustive.

In a related change, os.readlink() is also now able to read directory junctions. Note, however, that os.path.islink() still returns False for these. As a consequence, if your code is LBYL-style and checks islink() first then it’ll continue to treat junctions as if they were standard directories, but if your code is EAFP-style and just catches errors from readlink() then your code may now behave differently when it encounters junctions.

#### memfd_create

Finally we have a change that’s distinctly more Linuxy — the Linux-specific memfd_create() call has been made available in the os module. This call creates an anonymous file in memory, and returns a file descriptor to it. This can be used like any other file, except that when the last reference to it is dropped then the memory is automatically released. In short, it has the same semantics as a file created with mmap() using the MAP_ANONYMOUS flag.

The call takes a mandatory name parameter, which is used as the filename but doesn’t affect much except how entries in /proc/<pid>/fd will appear. There’s also an optional flags parameter which accepts the bitwise OR of various new flags in os. These are:

os.MFD_CLOEXEC
Set the close-on-exec flag, the same as passing O_CLOEXEC to open() would do, which in turn is the same as setting FD_CLOEXEC with fcntl() except that it avoids potential multithreaded race conditions. This flag closes the file descriptor automatically on an exec() call to load another binary.
os.MFD_ALLOW_SEALING
This flag allows seals to be set on the file. Seals are a means of restricting further operations on the file, allowing multiple processes to deal with shared memory with less risk that a misbehaving or malicious process might trigger bugs. Without this flag the F_SEAL_SEAL seal is set, which prevents further seals from being added. A discussion of Linux file sealing is rather esoteric and outside the scope of this article, but the memfd_create() man page has a section which illustrates how it can work in general, and you can find the constants for the types of seal defined in the fcntl module.
os.MFD_HUGETLB
This flag causes the anonymous file to be created in the hugetlbfs filesystem using huge pages. This allows page sizes above the usual 4KB, which reduces the size of the page table, increasing performance for large chunks of memory, if supported by the kernel.
MFD_HUGE_<size>
For use with MFD_HUGETLB there are also constants to select the huge page size to use. Take a look at the documentation for os.memfd_create() for the full list.
MFD_HUGE_SHIFT and MFD_HUGE_MASK
I believe these two values are used to construct page size requests which aren’t represented by one of the fixed sizes listed, but documentation on how to use them is a little sketchy. If you look at the mmap() man page then you’ll see some discussion on using MAP_HUGE_SHIFT, and I think the same approach is meant to work here.
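Here’s a minimal sketch of the basic usage, guarded with hasattr() since the call only exists on Linux (the buffer name is arbitrary):

```python
import os

if hasattr(os, "memfd_create"):  # Linux-only
    # MFD_CLOEXEC is in fact the default value of the flags parameter
    fd = os.memfd_create("scratch-buffer", os.MFD_CLOEXEC)
    try:
        # The descriptor behaves like any regular file
        os.write(fd, b"hello, world")
        os.lseek(fd, 0, os.SEEK_SET)
        print(os.read(fd, 12))  # b'hello, world'
    finally:
        # Closing the last reference releases the memory
        os.close(fd)
```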

## Concurrent Execution

### threading

A couple of potentially useful changes in the threading module.

#### excepthook

The default behaviour when an exception propagates outside the main function of a thread is to print a traceback. However, there’s now a threading.excepthook hook which can be overridden to handle such exceptions in a different way, such as writing them to a log file.

Here’s a simple illustration:

>>> import threading
>>>
>>> def my_thread_func():
...     raise Exception("Naughty!")
...
>>> thread = threading.Thread(target=my_thread_func)
>>> thread.start()
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/andy/.pyenv/versions/3.8.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/andy/.pyenv/versions/3.8.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "<stdin>", line 2, in my_thread_func
Exception: Naughty!
>>>
>>> threading.excepthook = lambda args: print(f"CAUGHT: {args.exc_value!r}")
>>> thread = threading.Thread(target=my_thread_func)
>>> thread.start()
CAUGHT: Exception('Naughty!')

#### get_native_id

The threading.get_ident() function, added in Python 3.3, returns a unique identifier for the current thread. The problem is that this doesn’t, in general, have any relation to the operating system’s identifier for the thread, which can sometimes be useful to know for, say, logging purposes.

In Python 3.8, therefore, the threading.get_native_id() function has been added, which returns the native thread ID of the current thread assigned by the kernel. The downside is that this function isn’t guaranteed to be available on all platforms, but it seems to be supported on Windows, MacOS, Linux and several of the other Unixes, so it should be useful for a lot of people.
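A trivial illustration of the two identifiers side by side:

```python
import threading

def worker():
    # Each thread gets its own kernel-assigned ID
    print("worker:", threading.get_native_id())

# Python-level identifier, only unique among live threads
print("ident:", threading.get_ident())

# Kernel-level thread ID, matching what tools like ps and top report
print("native:", threading.get_native_id())

thread = threading.Thread(target=worker)
thread.start()
thread.join()
```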

## Networking and IPC

### asyncio

The asyncio.run() function, added in the previous release, has been upgraded from a provisional to a stable API, although this release doesn’t contain any changes to its functionality. Aside from this, there are some more substantive changes.

#### CancelledError

As of this release the CancelledError exception, which is raised when asyncio tasks are cancelled, inherits from BaseException rather than Exception. This mirrors several similar changes in the past, such as in Python 2.6 where GeneratorExit also had its base class changed from Exception to BaseException. The problem is the same in all these cases: unintended capture of exceptions. In the case of CancelledError consider code like this:

try:
    await some_async_function()
except Exception:
    pass  # handle "errors"


This seems fairly innocuous, but prior to Python 3.8 the CancelledError would also be captured by that except clause. This in turn means that the CancelledError will not propagate to the caller, and hence anyone waiting on that task will not be notified that the task was cancelled.

Consider the following short script:

import asyncio


async def to_be_cancelled():
    try:
        await asyncio.sleep(60)
    except Exception:
        print("Something went wrong")


async def main():
    task = asyncio.create_task(to_be_cancelled())
    await asyncio.sleep(1)
    task.cancel()
    try:
        await task
        print("Task completed normally")
    except asyncio.CancelledError:
        print("Task was cancelled")
    except Exception:
        print("Task raised exception")


asyncio.run(main())

Under Python 3.7, the result of running this is that main() never finds out the task was cancelled:

$ python3.7 task_example.py
Something went wrong
Task completed normally

But under Python 3.8 the CancelledError propagates as the programmer probably intended:

$ python3.8 task_example.py
Task was cancelled


This move was a little controversial, as you can see from the discussion on BPO-32528 — whichever decision was taken here, some programmers would have likely been bitten by it, either in past or future code. The best option would be for everyone to carefully consider whether BaseException is the right base for their exceptions in future.

#### Happy Eyeballs

Another change is support for Happy Eyeballs. Despite the slightly clickbait-sounding name, Happy Eyeballs is actually a fairly useful IETF algorithm for improving responsiveness on dual-stack (i.e. supporting IPv4 and IPv6 concurrently) systems. These systems typically would prefer IPv6, as the newer standard, but this would lead to frustrating delays for parts of the Internet where the IPv6 path failed, as these requests would typically need to hit some timeout before falling back to IPv4. This algorithm tries connections nearly in parallel to give a fast response, but still prefers IPv6 given a choice. The full details can be found in RFC 8305.

Python now includes support for this in asyncio.loop.create_connection(). There are two new parameters. Specifying happy_eyeballs_delay activates the behaviour and specifies the delay between making adjacent connections — the RFC suggests 0.25 (250ms) for this value. The second parameter is interleave and corresponds to what the RFC calls First Address Family Count — this only applies if getaddrinfo() returns multiple possible addresses and controls how many IPv6 addresses are tried before switching to trying IPv4. I’d suggest not specifying this at all unless you know what you’re doing, the default should be fine.
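As a sketch of where the parameter goes, here’s a connection to a throwaway local server via asyncio.open_connection(), which forwards keyword arguments through to loop.create_connection(). With a single localhost address the staggering is a no-op, but the parameter is accepted and activates the algorithm:

```python
import asyncio

async def main():
    # A throwaway local server so there's something to connect to
    server = await asyncio.start_server(
        lambda reader, writer: writer.close(), "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # happy_eyeballs_delay both enables the algorithm and sets the
    # stagger between connection attempts (RFC 8305 suggests 250ms)
    reader, writer = await asyncio.open_connection(
        "127.0.0.1", port, happy_eyeballs_delay=0.25)
    print("Connected")
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()

asyncio.run(main())
```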

#### Other asyncio Changes

There are also a couple of smaller changes in asyncio to note:

Task.get_coro() added
A Task object is a wrapper which is used to schedule an underlying coroutine for execution, and it now provides a get_coro() method to return the underlying coroutine object itself.
Task names
The name of a task can now be set, either by passing the name keyword parameter to create_task() or by calling set_name() on the Task object. This could include calling asyncio.current_task().set_name(...) from within the task itself, which could be useful for diagnostic progress reporting or identification purposes for tasks which acquire work items after the point of creation.
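Here’s a short sketch of both of these, with task names invented for illustration:

```python
import asyncio

async def worker():
    # A task can rename itself once it knows what it's working on
    asyncio.current_task().set_name("processing-item-42")
    await asyncio.sleep(0)

async def main():
    task = asyncio.create_task(worker(), name="worker-task")
    print(task.get_name())           # worker-task
    # get_coro() returns the wrapped coroutine object
    print(task.get_coro().__name__)  # worker
    await task
    print(task.get_name())           # processing-item-42

asyncio.run(main())
```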

### socket

There are a couple of related functions added to socket which make it easier to create listening sockets.

The first is socket.has_dualstack_ipv6() which simply returns True if the current platform supports creating a socket bound to both an IPv4 and IPv6 address, or False otherwise.

The second function is create_server(), which is a convenience for creating and binding a listening TCP socket, otherwise a tedious bit of boilerplate. This accepts a family argument, which should be AF_INET for IPv4 or AF_INET6 for IPv6. However, if you want to support both, you should pass AF_INET6 and also dualstack_ipv6=True, which attempts to bind the socket to both families. This is commonly used with an empty string as the IP address, to bind to all interfaces, but if you pass an address it should be an IPv6 address — the IPv4 address used will be an IPv4-mapped IPv6 address.

Note that if you use dualstack_ipv6 and your platform doesn’t support dual-stack sockets, you’ll get a ValueError. You can use has_dualstack_ipv6() described above to avoid this, although I think EAFP would have been more Pythonic so I’m a little disappointed they didn’t make this a more unique exception that could be caught and handled.

import socket

if socket.has_dualstack_ipv6():
    sock = socket.create_server(("", 1234),
                                family=socket.AF_INET6,
                                dualstack_ipv6=True)
else:
    sock = socket.create_server(("", 1234))


The function also accepts parameters backlog, which is passed to the listen() call, and reuse_port, which is used to control whether to set SO_REUSEPORT. Overall, therefore, create_server() performs something like this:

1. Create a socket of type SOCK_STREAM in the specified address family.
2. Set SO_REUSEADDR (not on Windows).
3. If reuse_port is True then set SO_REUSEPORT.
4. If family is AF_INET6 and dualstack_ipv6 is False, set IPV6_V6ONLY option.
5. Perform the bind() on the socket.
6. Perform the listen() on the socket.

At this point the returned socket is ready to call accept() to receive inbound connections.
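To see it in action, here’s a tiny single-shot echo server using create_server(), bound to an ephemeral port on localhost:

```python
import socket
import threading

# create_server() handles socket(), setsockopt(), bind() and listen()
server = socket.create_server(("127.0.0.1", 0), backlog=1)
host, port = server.getsockname()

def serve_one_client():
    conn, _addr = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))  # echo one message back

thread = threading.Thread(target=serve_one_client)
thread.start()

with socket.create_connection((host, port)) as client:
    client.sendall(b"ping")
    print(client.recv(1024))  # b'ping'

thread.join()
server.close()
```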

## Structured Markup Processing Tools¶

### xml¶

There are a few useful improvements for XML parsing, including some security improvements, support for wildcard searches within a namespace and support for XML canonicalisation (aka C14N).

#### Security Fixes¶

There are various known attacks on XML parsers which can cause issues such as massive memory consumption or crashes on the client side, or even steal file content off the disk. One class of these is called XML External Entity (XXE) injection attacks. These rely on a feature which the XML standards require of parsers, but which is very rarely used — the ability to reference entities from external files. The article I linked has some great explanation of how these work.

In Python 3.8, the xml.sax and xml.dom.minidom modules no longer process external entities by default, to attempt to mitigate these security risks. If you do want to re-enable this feature in xml.sax for some reason, apparently you can instantiate an xml.sax.xmlreader.XMLReader() and call setFeature() on it using xml.sax.handler.feature_external_ges. But I suspect it’s probably a much better idea to simply never use this feature of XML.

#### Finding Tags in Namespaces¶

The various findX() functions within xml.etree.ElementTree have acquired some handy support for searching within XML namespaces. Take a look at the example below, which illustrates that you can search within a namespace for any tag using "{namespace}*" and you can search for a tag within any namespace with "{*}tag".

>>> import pprint
>>> import xml.etree.ElementTree as ET
>>>
>>> doc = '<foo xmlns:a="http://aaa" xmlns:b="http://bbb">' \
...       '<one/><a:two/><b:three/></foo>'
>>> root = ET.fromstring(doc)
>>>
>>> pprint.pp(root.findall("*"))
[<Element 'one' at 0x101e90180>,
 <Element '{http://aaa}two' at 0x101e90220>,
 <Element '{http://bbb}three' at 0x101ece040>]
>>> pprint.pp(root.findall("{http://bbb}*"))
[<Element '{http://bbb}three' at 0x101ece040>]
>>> pprint.pp(root.findall("two"))
[]
>>> pprint.pp(root.findall("{*}two"))
[<Element '{http://aaa}two' at 0x101e90220>]


#### XML Canonicalisation¶

There’s a new xml.etree.ElementTree.canonicalize() function which performs XML canonicalisation, also known as C14N5 to save typing. This is a process for producing a standard byte representation of an XML document, so that things like cryptographic signatures can be calculated, where a single byte of inconsistency would lead to an error.

This function accepts either XML as a string, or a file path or file-like object using the from_file keyword parameter. The XML is converted to the canonical form and written to an output file-like object, if provided via the out keyword parameter, or returned as a text string if out is not set.

Note that the output file receives the canonicalised version as a str, so it should be opened with encoding="utf-8".

There are some options to control some of the operations, such as whether to strip whitespace and whether to replace namespaces with numbered aliases, but I won’t bother duplicating the documentation for those here.

Overall this is very useful as the process is quite convoluted and if you’re trying to calculate a cryptographic hash you generally have very little to go on when you’re diagnosing discrepancies — you tend to just have to guess what might be going wrong and fiddle around until the two sides match. Having this already implemented in the library, therefore, saves everyone going through this hassle.
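A minimal example illustrates the idea — note how two logically identical documents, differing only in attribute order, whitespace and empty-element syntax, canonicalise to the same bytes:

```python
import xml.etree.ElementTree as ET

# Two logically identical documents with superficial differences.
doc_a = '<root b="2" a="1"><child/></root>'
doc_b = '<root a="1"  b="2" ><child></child></root>'

# C14N sorts attributes and expands self-closing tags, among other
# normalisations, so both inputs produce identical output.
print(ET.canonicalize(doc_a))
# <root a="1" b="2"><child></child></root>
```

With a consistent byte representation like this, a hash or signature computed by two independent implementations will agree.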

#### New Features in XMLParser¶

Finally, the xml.etree.ElementTree.XMLParser class has some new features. Firstly, there are a couple of new callbacks that can be added to the handler. The start_ns() method will be called for each new namespace declaration, prior to the start() callback for the element which defines it — this method is passed the namespace prefix and the URI. There’s also a corresponding end_ns() method which is called with the prefix just after the end() method for the tag.

>>> from xml.etree.ElementTree import XMLParser
>>>
>>> class Handler:
...     def start(self, tag, attr):
...         print(f"START {tag=} {attr=}")
...     def end(self, tag):
...         print(f"END {tag=}")
...     def start_ns(self, prefix, uri):
...         print(f"START NS {prefix=} {uri=}")
...     def end_ns(self, prefix):
...         print(f"END NS {prefix=}")
...
>>> doc = '<foo xmlns:a="http://aaa" xmlns:b="http://bbb">' \
...       '<one/><a:two/><b:three/></foo>'
>>> handler = Handler()
>>> parser = XMLParser(target=handler)
>>> parser.feed(doc)
START NS prefix='a' uri='http://aaa'
START NS prefix='b' uri='http://bbb'
START tag='foo' attr={}
START tag='one' attr={}
END tag='one'
START tag='{http://aaa}two' attr={}
END tag='{http://aaa}two'
START tag='{http://bbb}three' attr={}
END tag='{http://bbb}three'
END tag='foo'
END NS prefix='b'
END NS prefix='a'


The second change is that comments and processing instructions, which were previously ignored, can now be passed through by the builtin TreeBuilder object. To enable this, there are new insert_comments and insert_pis keyword parameters, and there are also comment_factory and pi_factory parameters to specify the factory functions to use to construct these objects, instead of using the builtin Comment and ProcessingInstruction objects.

To specify these parameters, you need to construct your own TreeBuilder and pass it to XMLParser using the target parameter.
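As a brief sketch of how that fits together, assuming a simple document with a comment in it:

```python
import xml.etree.ElementTree as ET

# Ask the TreeBuilder to keep comments and processing instructions in
# the tree, then hand it to XMLParser via the target parameter.
builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
parser = ET.XMLParser(target=builder)

root = ET.fromstring('<root><!-- note --><child/></root>', parser=parser)

# The comment appears as a child element whose tag is the Comment
# factory function, with the comment body as its text.
print(root[0].tag is ET.Comment, repr(root[0].text))
# True ' note '
```

Without insert_comments=True the comment would simply be absent from the tree, as in previous releases.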

## Development Tools¶

### typing¶

The support for type hinting continues at a healthy pace with some more improvements in the typing module.

#### TypedDict¶

There’s a new typing.TypedDict type which supports a heterogeneous dict where the type of each value may differ. All keys must be str and must be specified in advance using the usual class member type hint syntax.

import datetime
import typing

class Person(typing.TypedDict):
    first_name: str
    surname: str
    date_of_birth: datetime.date


At runtime this will be entirely equivalent to a dict, but it allows type-checkers to validate the usage of values within it. If a key is used with an incorrect type, that’s expected to fail type checking, as is any use of a key not specifically listed. By default all of the listed keys are required to be present, but passing total=False as a class keyword argument makes them all optional: they must still have their specified types when present, but they may be omitted entirely. Keys which aren’t listed at all remain errors either way.

One subtle point that may not be immediately apparent is that initialisation with a dict literal must include a specific type hint on the destination variable, otherwise the type-checker will assume that it is of type dict instead of the TypedDict subclass you’ve defined.

churchill: Person = {
    "first_name": "Winston",
    "surname": "Churchill",
    "date_of_birth": datetime.date(1874, 11, 30)
}
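The total=False behaviour can be sketched like this, where PartialPerson is just an illustrative name:

```python
import typing

class PartialPerson(typing.TypedDict, total=False):
    first_name: str
    surname: str

# With total=False every listed key becomes optional, but when present
# it must still have its declared type; unlisted keys remain errors.
partial: PartialPerson = {"surname": "Churchill"}
print(partial)
```

At runtime this is still just a plain dict, of course — only a static type checker treats the keys as optional.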


Personally I tend to try to use custom classes for these sorts of cases, and this is made especially easy by the addition of dataclasses in Python 3.7, as I talked about in a previous article. However, as PEP 589 discusses there are some cases where a dict subclass has advantages. It feels to me as if this is straying a little away from the Zen of Python’s There should be one — and preferably only one — obvious way to do it, but Python is a broad church and there’s room for many opinions.

#### Literal¶

The next new type is typing.Literal which allows the programmer to specify a value must be one of a pre-determined list of values. For example:

from typing import Literal

def get_status(self) -> Literal["running", "stopping", "stopped"]:
    ...


One interesting point that’s highlighted in the PEP is that even if you assume you can break any backwards-compatibility of an API and were to use an enum for these values, all that does is constrain the type of the parameter to be that enumeration type, but it’s possible only a subset of the values from it should be accepted or returned — in these cases, Literal is still useful.

#### Final¶

Another addition in this release is typing.Final for variables, and a corresponding decorator @final for methods and classes, added by PEP 591. These can be used to specify that:

• A method should not be overridden.
• A class should not be subclassed.
• A variable or attribute should not be reassigned.

As usual none of this changes the runtime behaviour, but allows type checkers such as mypy to perform additional validation. Consider the following code:

finaltest.py

from typing import final, Final

SOME_GLOBAL: Final[int] = 1234


class Base:

    normal_attr: int = 100
    final_attr: Final[int] = 200

    def can_override(self):
        print("This method can be overridden")

    @final
    def cannot_override(self):
        print("Overriding this will fail type checks")


@final
class FirstDerived(Base):

    normal_attr: int = 101
    final_attr: int = 202

    def can_override(self):
        print("FirstDerived.can_override()")

    def cannot_override(self):
        print("FirstDerived.cannot_override()")


class SecondDerived(FirstDerived):
    def can_override(self):
        print("SecondDerived.can_override()")

    def cannot_override(self):
        print("SecondDerived.cannot_override()")


Base.final_attr = 333
SOME_GLOBAL = 4567
print(f"{SOME_GLOBAL=}")

base = Base()
base.can_override()
base.cannot_override()
print(f"{base.normal_attr=} {base.final_attr=}")

derived = FirstDerived()
derived.can_override()
derived.cannot_override()
print(f"{derived.normal_attr=} {derived.final_attr=}")

second = SecondDerived()
second.can_override()
second.cannot_override()
print(f"{second.normal_attr=} {second.final_attr=}")


If you run this, you’ll see the output is exactly as you’d expect if the final and Final specifiers weren’t there:

SOME_GLOBAL=4567
This method can be overridden
Overriding this will fail type checks
base.normal_attr=100 base.final_attr=333
FirstDerived.can_override()
FirstDerived.cannot_override()
derived.normal_attr=101 derived.final_attr=202
SecondDerived.can_override()
SecondDerived.cannot_override()
second.normal_attr=101 second.final_attr=202


However, if you run mypy then you’ll see we’re breaking some constraints:

finaltest.py:23: error: Cannot assign to final name "final_attr"
finaltest.py:32: error: Cannot inherit from final class "FirstDerived"
finaltest.py:40: error: Cannot assign to final attribute "final_attr"
finaltest.py:41: error: Cannot assign to final name "SOME_GLOBAL"
Found 4 errors in 1 file (checked 1 source file)


#### Protocol¶

Also included in this release are the changes outlined in PEP 544, which introduce a form of structural typing to Python. This is where compatibility between types is determined by analysing an object’s actual structure, rather than by its declared inheritance relationships, which is a form of nominative typing.

The PEP refers to it as static duck typing, which I think is a good name. As I’m sure many of you are aware, in general duck typing refers to a system where objects are just checked for meeting a specified abstract interface at runtime, rather than their entire type. The key aspect, however, is that the object doesn’t need to declare that it meets this interface by, for example, inheriting from some abstract base class. The interface is checked against the object’s actual definition.

The static in that phrase is important, because Python already offers runtime facilities for checking whether objects meet particular interfaces without them having to be specifically declared. In the excerpt below, for example, the IntArray class never declares itself as inheriting from collections.abc.Collection, yet it still returns itself as compatible from isinstance().

>>> import collections.abc
>>> from typing import List
>>>
>>> class IntArray:
...     values: List[int] = []
...     def __init__(self, initial=()):
...        self.values = list(int(i) for i in initial)
...     def __contains__(self, value):
...        return value in self.values
...     def __iter__(self):
...         return iter(self.values)
...     def __len__(self):
...         return len(self.values)
...     def __reversed__(self):
...         return reversed(self.values)
...     def __getitem__(self, idx):
...         return self.values[idx]
...     def index(self, *args, **kwargs):
...         return self.values.index(*args, **kwargs)
...     def count(self, *args, **kwargs):
...         return self.values.count(*args, **kwargs)
...
>>> instance = IntArray((1,2,3,4,5))
>>> isinstance(instance, collections.abc.Sized)
True
>>> isinstance(instance, collections.abc.Collection)
True
>>> isinstance(instance, collections.abc.Reversible)
True
>>> isinstance(instance, collections.abc.MutableSequence)
False
>>> isinstance(instance, collections.abc.Mapping)
False


A protocol is a declaration of the interface that a class must meet in order to be taken as supporting that protocol. Creating one simply involves declaring a class which inherits from typing.Protocol and defining the required interface. This protocol can then be used by static type checkers to validate that the objects passed to a call conform to the specified interface.

>>> from abc import abstractmethod
>>> from typing import Protocol
>>>
>>> class SupportsNameAndID(Protocol):
...     name: str
...     @abstractmethod
...     def get_id(self) -> int:
...         ...
...


As well as implicitly supporting the interface by implementing the specified attributes and methods directly, it’s also fine to inherit from Protocol instances — this can be useful to allow protocol classes to become mixins, adding default concrete implementations of some or all of the methods. The general idea is that they’re very similar to regular abstract base classes (ABCs). One detail that’s worth noting, however, is that a class is only a protocol if it directly derives from typing.Protocol — classes further down the inheritance hierarchy are treated as just regular ABCs.

It’s also possible to decorate a protocol class with @typing.runtime_checkable, which also means that isinstance() and issubclass() can be used to detect whether types conform at runtime. This can be used to log warnings, raise exceptions or anything else. The example below follows on from the same session above, except assuming that SupportsNameAndID had been defined with the @runtime_checkable class decorator.

>>> class One:
...     pass
...
>>> isinstance(One(), SupportsNameAndID)
False
>>>
>>> class Two:
...     def get_id(self) -> int:
...         return 123
...
>>> isinstance(Two(), SupportsNameAndID)
False
>>>
>>> class Three:
...     name: str = "default"
...     def get_id(self) -> int:
...         return 123
...
>>> isinstance(Three(), SupportsNameAndID)
True


All in all this is a useful means for developers to specify their own protocols to complement those already defined in collections.abc.

#### Generic Type Introspection¶

Last up in typing are the new functions get_origin() and get_args(). These are used for breaking apart the specification of generic types into the core type, returned by get_origin(), and the type(s) passed as argument(s), returned by get_args(). This is perhaps best explained with some examples:

>>> import typing
>>>
>>> typing.get_origin(typing.List[typing.Tuple[int, ...]])
<class 'list'>
>>> typing.get_args(typing.List[typing.Tuple[int, ...]])
(typing.Tuple[int, ...],)
>>>
>>> typing.get_origin(typing.Tuple[int, ...])
<class 'tuple'>
>>> typing.get_args(typing.Tuple[int, ...])
(<class 'int'>, Ellipsis)
>>>
>>> typing.get_origin(typing.Dict[str, typing.Sequence[int]])
<class 'dict'>
>>> typing.get_args(typing.Dict[str, typing.Sequence[int]])
(<class 'str'>, typing.Sequence[int])
>>>
>>> typing.get_origin(typing.Hashable)
<class 'collections.abc.Hashable'>
>>> typing.get_args(typing.Hashable)
()


### unittest¶

#### AsyncMock¶

The unittest.mock.Mock class is a real workhorse that can mock almost anything — if you’re not familiar with it, I gave a brief overview in an earlier article on Python 3.3. There’s one case where it can’t be easily used, however, which is when mocking asynchronous objects — for example, when mocking an asynchronous context manager which provides __aenter__() and __aexit__() methods.

The issue is that the mock needs to be recognised as an async function and return an awaitable object instead of a direct result. In Python 3.8, the AsyncMock class has been added to implement these semantics. You can see the difference in behaviour here:

>>> import asyncio
>>> from unittest import mock
>>>
>>> m1 = mock.Mock()
>>> m2 = mock.AsyncMock()
>>> asyncio.iscoroutinefunction(m1)
False
>>> asyncio.iscoroutinefunction(m2)
True
>>> m1()
<Mock name='mock()' id='4456666016'>
>>> m2()
<coroutine object AsyncMockMixin._execute_mock_call at 0x1099f28c0>


Here’s a very simple illustration of it in practice:

>>> async def get_value(obj):
...     value = await obj()
...     print(value)
...
>>> mock_obj = mock.AsyncMock(return_value=123)
>>> asyncio.run(get_value(mock_obj))
123
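Going back to the asynchronous context manager case mentioned earlier, here’s a small sketch (the fetch() and query() names are purely illustrative). In 3.8 MagicMock configures the async magic methods such as __aenter__() as AsyncMock instances automatically:

```python
import asyncio
from unittest import mock

async def fetch(conn_factory):
    # Hypothetical code under test: enters an async context manager
    # and awaits a method call on the object it yields.
    async with conn_factory() as conn:
        return await conn.query()

factory = mock.MagicMock()
# __aenter__ is already an AsyncMock; configure what it resolves to.
conn = factory.return_value.__aenter__.return_value
conn.query = mock.AsyncMock(return_value="rows")

print(asyncio.run(fetch(factory)))
# rows
```

The same pattern works for __aiter__() and __anext__() when mocking asynchronous iterators.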


#### Module and Class Cleanup¶

A unittest.TestCase has setUp() and tearDown() methods to allow instantiation and removal of test fixtures. However, if setUp() does not complete successfully then tearDown() is never called — this runs the risk of leaving things in a broken state. As a result these objects also have an addCleanup() method to register functions which will be called after tearDown(), but which are always called regardless of whether setUp() succeeded.

The class also has corresponding setUpClass() and tearDownClass() class methods, which are called before and after tests within the class as a whole are run. In addition the module containing the tests can define setUpModule() and tearDownModule() functions which do the same thing at module scope. However, until Python 3.8 neither of these cases had an equivalent of addCleanup() that would be called in all cases.

As of Python 3.8 these now exist. There’s an addClassCleanup() class method on TestCase to add cleanup functions to be called after tearDownClass(), and there’s a unittest.addModuleCleanup() function to register functions to be called after tearDownModule(). These will be invoked even if the relevant setup methods raise an exception.

#### Async Test Cases¶

You can now write your test cases as async methods thanks to the new unittest.IsolatedAsyncioTestCase base class. This is useful for testing your own async functions, which need to be executed in the context of an event loop, without having to write a load of boilerplate yourself each time.

If you derive from this new class, as you would TestCase, then it accepts coroutines as test functions. It adds asyncSetUp() and asyncTearDown() async methods, which are additionally called just inside the existing setUp() and tearDown(), which are still normal (i.e. non-async) methods. There’s also an addAsyncCleanup() method, similar to the other cleanup methods described above — this registers an async function to be called at cleanup time.

An event loop is constructed and the test methods are executed on it one at a time, asynchronously. Once execution is completed, any remaining tasks on the event loop are cancelled. Other than that, things operate more or less as with the standard TestCase.

To summarise, assuming all of these are defined then the order of events will be:

1. Construct an instance of the IsolatedAsyncioTestCase class.
2. Create an event loop and add a “runner” task with a job queue.
3. For each test method:
1. Call the setUp() method, if defined.
2. Send the asyncSetUp() method to the job queue, if defined.
3. If the test method returns an awaitable object, send it to the job queue.
4. Send the asyncTearDown() method to the job queue, if defined.
5. Call the tearDown() method, if defined.
6. Call any cleanups registered, either by adding to the job queue if awaitable, or directly if not.
4. Cancel all remaining tasks on the event loop.
5. Close the event loop.
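Putting that into practice is pleasantly brief — a minimal sketch:

```python
import asyncio
import unittest

class AsyncTests(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        # Runs on the event loop, just after the normal setUp().
        self.value = 123

    async def test_await(self):
        # Test methods can await directly, with no boilerplate for
        # creating or running an event loop.
        await asyncio.sleep(0)
        self.assertEqual(self.value, 123)
```

Run it with the normal unittest machinery (e.g. python -m unittest) and the event loop handling is entirely taken care of for you.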

## Smaller Changes¶

The usual handful of changes that I noted, but didn’t think required much elaboration.

cProfile.Profile Context Manager
You can now profile blocks of code using with cProfile.Profile() as profiler: around the block. This is a useful convenience for calling the enable() and disable() methods of the Profile object.
New Uses of dict for OrderedDict
A couple of places that used to return collections.OrderedDict now just return dict again, as that now preserves insertion order as we discussed in a previous article on Python 3.6. These are the _asdict() method of collections.namedtuple() and csv.DictReader.
date.fromisocalendar() Added
In the second article on Python 3.7 we covered the new fromisoformat() method on date and datetime objects in the datetime module. In this release there’s a somewhat similar new method just on date objects called fromisocalendar() which is based on a different part of the ISO 8601 standard, that of the week numbering convention. The function takes a year, a week number and a weekday within that week, and returns the correspondingly initialised date object, and is essentially the converse of the existing date.isocalendar() method.
itertools.accumulate() Initial Value
The accumulate() function in itertools repeatedly applies a binary operation to a list of values, by default addition, returning the list of cumulative results. In Python 3.8 there’s a new initial parameter which allows an initial value to be specified as if it were at the start of the iterable.
logging.basicConfig() Acquires force Parameter
The behaviour of logging.basicConfig() is that if the root logger already has handlers configured, the call is silently ignored. This means that only the first call is generally effective. As of Python 3.8, however, it’s acquired a force parameter which, if True, will cause any existing handlers to be removed and closed before adding the new ones. This is helpful in any case where you want to re-initialise logging that you think may have been initialised.
madvise() For mmap
The mmap.mmap class now has an madvise() method which calls the madvise() system call. This allows applications to hint to the kernel what type of access it can expect for a memory-mapped block, which allows the kernel to optimise its choice of read-ahead and caching techniques. For example, if accesses will be random then read-ahead is probably of little use.
shlex.join()
The existing shlex.split() function splits a command-line respecting quoting, as a shell would. In Python 3.8 there’s now a shlex.join() that does the opposite, inserting quoting and escapes as appropriate.
time.CLOCK_UPTIME_RAW (MacOS only)
On MacOS 10.12 and up, a new clock CLOCK_UPTIME_RAW has been made available. This is a monotonically increasing clock which isn’t affected by any time of day changes, and does not advance while the system is sleeping. This works exactly the same as CLOCK_UPTIME, which is available on FreeBSD and OpenBSD.
sys.unraisablehook()
The new sys.unraisablehook can be set to handle cases where Python cannot propagate a raised exception — for example, those raised by __del__() during garbage collection. The default action is to print some output to stderr, but this hook can be redefined to take other action instead, such as writing output to a log file.
unicodedata.is_normalized()
There’s a new function unicodedata.is_normalized() which checks whether a Unicode string is already in the specified normal form (i.e. whether normalize() would return it unchanged). This check can often be performed much faster than calling normalize(), so if an already-normalized string is a common case then this can save some time. All the standard normalization forms (NFD, NFC, NFKD and NFKC) are supported. For the gory details of how Unicode normalization works, you can read the official standard, or the rather more accessible FAQ.
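A few of the changes above are quick to demonstrate in one snippet:

```python
import datetime
import itertools
import shlex

# itertools.accumulate() with an initial value: the 100 behaves as if
# it were prepended to the input iterable.
print(list(itertools.accumulate([1, 2, 3], initial=100)))
# [100, 101, 103, 106]

# shlex.join() is the converse of shlex.split(), quoting as needed.
print(shlex.join(["echo", "hello world"]))
# echo 'hello world'

# date.fromisocalendar() converts (ISO year, week, weekday) to a date:
# the Monday of ISO week 1 of 2020 actually falls in December 2019.
print(datetime.date.fromisocalendar(2020, 1, 1))
# 2019-12-30
```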

## Conclusion¶

Another crop of handy features here. The enhancements to asyncio, and the ability to write unit tests for coroutines conveniently, are making coroutine-style code really convenient to write now. I look forward to seeing what other enhancements are to come in this area. The typing improvements are also great to see, and having gone through all these changes I’m rather belatedly trying to make more consistent use of the features they offer in my code.

A lot of the rest of these features are perhaps only useful to specific areas, such as the mathematical or networking enhancements. But that’s fine — with a steady stream of these domain-specific enhancements in every release, everyone is bound to find something useful to them every so often. What I do like is the way that the standard library still feels like a coherent set of core functionality — despite the range of domains supported now, there’s still clearly a lot of effort going into keeping out anything that’s too niche.

So, on to Python 3.9 next, and I’m really looking forward to getting all the way up to date in a few articles. I never had a clue what I was letting myself in for when I started this series, but it’s been a great learning experience for me, and I’m hoping parts of it are proving useful to others too. The limiting factor has been the availability of my time, not any lack of interest on my part — it’s a continual source of amazement to me how far things have come, having followed the path.

1. To be fair, I realise that the cases for BaseException are generally very rare, and probably only impact the standard library or anyone writing fundamental code execution frameworks. But it’s useful to bear in mind the high cost that a casual decision early on can have on many developers.

2. Single dispatch is a form of a generic function where different implementations are chosen based on the type of a single argument. For example, you may want a function that performs different operations when passed a str vs. an int.

3. SO_REUSEPORT allows multiple sockets to bind to the same port, which can be useful. Unfortunately the behaviour differs across operating systems — Linux performs load-balancing, so multiple threads can have their own listen sockets bound to the same port and connections will be distributed across them; whereas MacOS and some of the BSDs always send connections to the most recently-bound process, to allow a new process to seamlessly take over from another. This article has some great details.

4. SO_REUSEADDR allows the socket to bind to the same address as a previous socket, as long as that socket is not actively listening any more (typically it’s in TIME_WAIT state).

5. In case it’s not obvious, the “14” of “C14N” refers to the 14 characters between the “C” and the “N”.

Photo by David Clode on Unsplash