☑ Python 2to3: What’s New in 3.4 - Part 2

8 Apr 2021 at 11:47PM in Software

In this series looking at features introduced by every version of Python 3, this one is the second of two covering release 3.4. We look at improvements to the way multiprocessing spawns child processes, various powerful new facilities for code introspection, improvements to garbage collection, and a lot more besides.

In this article we conclude our look at Python 3.4 which started with the previous one in this series. Last time we took a look at the ensurepip module, file descriptor inheritance changes, the codecs module, and a series of other new modules which were added to the library. In this article we’ll be looking at a host of changes that have been made to existing modules, some long-awaited improvements to garbage collection and a few other small details.

Library Enhancements

The bulk of this article is going to be looking at changes to modules in the standard library. As usual, I’ve tried to group them by category to make things somewhat more approachable, and we’re kicking off with a category that I never even really knew existed in the standard library.

Audio

This release contained some improvements in a handful of modules for dealing with audio formats, and it wasn’t until I looked into the changes in these modules that I even knew they were there. This is one of the reasons I like to write these articles, so I’m including the changes here at least partly just to mention them in case anyone else was similarly unaware of their existence.

First up, the aifc module allows read/write access to AIFF and AIFF-C format files. This module has had some small tweaks:

  • getparams() now returns a namedtuple instead of a tuple.
  • aifc.open() can now be used as a context manager.
  • writeframesraw() and writeframes() now accept any bytes-like object.

Next we have the audioop module, which provides useful operations on raw audio fragments, such as converting between mono and stereo, converting between different raw audio formats, and searching for a snippet of audio within a larger fragment. As of Python 3.4, this module now offers a byteswap() method for endian conversion of all samples in a fragment, and also all functions now accept any bytes-like object.

The sunau module allows read/write access to Au format audio files. Its first three changes are essentially the same as those I mentioned above for aifc, so I won’t repeat them. The final change is that AU_write.setsampwidth() now supports 24-bit samples8.

The wave module has those same three changes as well. Additionally it is now able to write output to file descriptors which don’t support seeking, although in these cases the number of frames written in the header had better be correct when first written.

Concurrency

The multiprocessing module has had a few changes. First up is the concept of start methods, which gives the programmer control of how subprocesses are created. It’s especially useful to exercise this control when mixing threads and processes. There are three methods now supported on Unix, although spawn is the only option on Windows:

spawn
A fresh new Python interpreter process is started in a child process. This is potentially quite slow compared to the other methods, but does mean the child process doesn’t inherit any unnecessary file descriptors, and there are no potential issues with other threads because it’s a clean process. Under Unix this is achieved with a standard fork() and exec() pair. This is the default (and only) option on Windows.
fork
Uses fork() to create a child process, but doesn’t exec() into a new instance of the interpreter. Note that the file descriptor inheritance changes covered in the previous article only take effect across exec(), so with a bare fork() the child still receives copies of all the parent’s open file descriptors. In multithreaded code there can be further problems using a bare fork() like this. The call replicates the entire address space of the process as-is, but only the currently executing thread. If another thread happens to hold a mutex when the current thread calls fork(), for example, that mutex will still be held in the child process but with the thread holding it no longer extant, so this mutex will never be released6.
forkserver
The usual solution to mixing fork() and multithreaded code is to make sure you call fork() before any other threads are spawned. Since the current thread is the only one that’s ever existed up to that point, and it survives into the child process, then there’s no chance for the process global state to be in an indeterminate state. This solution is the purpose of the forkserver model. In this case, a separate process is created at startup, and this is used to fork all the new child processes. A Unix domain socket is created to communicate between the main process and the forkserver. When a new child is created, two pipes are created to send work to the child process and receive the exit status back, respectively. In the forkserver model, the client end file descriptors for these pipes are sent over the UDS to the fork server process. As a result, this method is only available on OSs that support sending FDs over UDSs (e.g. Linux). Note that the child process that the fork server process creates does not require a UDS, it inherits what it needs using standard fork() semantics.

This last model is a bit of a delicate dance, so out of interest I sniffed around the code and drew up this sequence diagram to illustrate how it happens.

forkserver sequence diagram

To set and query which of these methods is in use globally, the multiprocessing module provides get_start_method() and set_start_method(), and you can choose from any of the methods returned by get_all_start_methods().

As well as this you can now create a context with get_context(). This allows the start method to be set for a specific context, and the context object shares the same API as the multiprocessing module so you can just use methods on the object instead of the module functions to utilise the settings of that particular context. Any worker pools you create are specific to that context. This allows different libraries interoperating in the same application to avoid interfering with each other by each creating their own context instead of having to mess with global state.
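To make the context idea a little more concrete, here’s a minimal sketch of how a library might pick its own start method without touching the global setting. The helper name make_library_context() is my own invention for illustration, and nothing here actually starts a child process.

```python
import multiprocessing


def make_library_context(preferred="forkserver"):
    """Return a context using the preferred start method, or "spawn" if unavailable."""
    available = multiprocessing.get_all_start_methods()
    method = preferred if preferred in available else "spawn"
    return multiprocessing.get_context(method)


ctx = make_library_context()
chosen_method = ctx.get_start_method()
# ctx.Process, ctx.Pool, ctx.Queue and so on mirror the module-level API,
# but anything created through them uses this context's start method.
```

Since the global start method is never set, another library in the same process remains free to make a different choice.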

The threading module also has a minor improvement in the form of the main_thread() function, which returns a Thread object representing the main thread of execution.
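As a quick illustration, the sketch below calls main_thread() from inside a worker thread, where it differs from current_thread(). The worker() function and the seen dictionary are just names I’ve made up for the example.

```python
import threading

seen = {}


def worker():
    # Inside a worker, current_thread() is the worker itself,
    # whereas main_thread() still refers to the interpreter's main thread.
    seen["is_main"] = threading.current_thread() is threading.main_thread()
    seen["main_name"] = threading.main_thread().name


t = threading.Thread(target=worker)
t.start()
t.join()
```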

Cryptography

hashlib now provides a pbkdf2_hmac() function implementing the commonly used PKCS#5 key derivation function2. This is based on an existing hash digest algorithm (e.g. SHA-256) which is combined with a salt and repeated a specified number of times. As usual, the salt must be preserved so that the process can be repeated again to generate the same secret key from the same credential consistently in the future.

>>> import hashlib
>>> import os
>>>
>>> salt = os.urandom(16)
>>> hashlib.pbkdf2_hmac("sha256", b"password", salt, 100000)
b'Vwq\xfe\x87\x10.\x1c\xd8S\x17N\x04\xda\xb8\xc3\x8a\x14C\xf1\x10F\x9eaQ\x1f\xe4\xd04%L\xc9'
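To show why the salt needs to be preserved, here’s a rough sketch of storing and later verifying a password. The function names and the iteration count are assumptions for illustration rather than any recommended scheme, and hmac.compare_digest() is used purely to keep the comparison constant-time.

```python
import hashlib
import hmac
import os

ITERATIONS = 100000  # illustrative value, matching the example above


def hash_password(password, salt=None):
    """Return (salt, derived_key); a fresh random salt is generated if none is given."""
    if salt is None:
        salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password, salt, ITERATIONS)
    return salt, key


def verify_password(password, salt, expected_key):
    """Re-derive the key with the stored salt and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)


# Both the salt and the derived key would be stored alongside the user record.
salt, stored_key = hash_password(b"password")
```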

The hmac.new() function now accepts bytearray as well as bytes for the key, and the type of the data fed in may be any of the types accepted by hashlib. Also, the digest algorithm passed to new() may be any of the names recognised by hashlib, and the choice of MD5 as a default is deprecated — in future there will be no default.

Diagnostics & Testing

The dis module for disassembling bytecode has had some facilities added to allow user code better programmatic access. There’s a new Instruction class representing a bytecode instruction, with appropriate parameters for inspecting it, and a get_instructions() method which takes a callable and yields the bytecode instructions that comprise it as Instruction instances. For those who prefer a more object-oriented interface, the new Bytecode class offers similar facilities.

>>> import dis
>>>
>>> def func(arg):
...     print("Arg value: " + str(arg))
...     return arg * 2
>>>
>>> for instr in dis.get_instructions(func):
...     print(instr.offset, instr.opname, instr.argrepr)
...
0 LOAD_GLOBAL print
3 LOAD_CONST 'Arg value: '
6 LOAD_GLOBAL str
9 LOAD_FAST arg
12 CALL_FUNCTION 1 positional, 0 keyword pair
15 BINARY_ADD
16 CALL_FUNCTION 1 positional, 0 keyword pair
19 POP_TOP
20 LOAD_FAST arg
23 LOAD_CONST 2
26 BINARY_MULTIPLY
27 RETURN_VALUE
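For comparison, here’s a small sketch of the Bytecode class doing much the same job. The exact opcodes you get back vary between Python releases, so I haven’t shown any output here.

```python
import dis


def func(arg):
    return arg * 2


bytecode = dis.Bytecode(func)
# Iterating over a Bytecode object yields Instruction instances.
opnames = [instr.opname for instr in bytecode]
# dis() returns the formatted listing as a string instead of printing it.
listing = bytecode.dis()
```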

inspect, which provides functions for introspecting runtime objects, has also had some features added in 3.4. First up is a command-line interface, so by executing the module and passing a module name, or a specific function or class within that module, the source code will be displayed. Or if --details is passed then information about the specified object will be displayed instead.

$ python -m inspect shutil:copy
def copy(src, dst, *, follow_symlinks=True):
    """Copy data and mode bits ("cp src dst"). Return the file's destination.

    The destination may be a directory.

    If follow_symlinks is false, symlinks won't be followed. This
    resembles GNU's "cp -P src dst".

    If source and destination are the same file, a SameFileError will be
    raised.

    """
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    copyfile(src, dst, follow_symlinks=follow_symlinks)
    copymode(src, dst, follow_symlinks=follow_symlinks)
    return dst

$ python -m inspect --details shutil:copy
Target: shutil:copy
Origin: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/shutil.py
Cached: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/__pycache__/shutil.cpython-34.pyc
Line: 214

$ python -m inspect --details shutil
Target: shutil
Origin: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/shutil.py
Cached: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/__pycache__/shutil.cpython-34.pyc
Loader: <_frozen_importlib.SourceFileLoader object at 0x1051fa518>

Next there’s a new unwrap() method which is used to introspect on the original function that’s been wrapped by decorators. It works by following the chain of __wrapped__ attributes, which are set by the functools.wraps() decorator, or anything else that calls functools.update_wrapper().

>>> import functools
>>> import inspect
>>>
>>> def some_decorator(func):
...     @functools.wraps(func)
...     def wrapper_func(*args, **kwargs):
...         print("Calling " + func.__name__)
...         print("  - Args: " + repr(args))
...         print("  - KW args: " + repr(kwargs))
...         ret_val = func(*args, **kwargs)
...         print("Return from " + func.__name__ + ": " + repr(ret_val))
...         return ret_val
...     return wrapper_func
...
>>> @some_decorator
... def some_other_func(arg):
...     """Just doubles something."""
...     print("Prepare to be amazed as I double " + repr(arg))
...     return arg * 2
...
>>> some_other_func(123)
Calling some_other_func
  - Args: (123,)
  - KW args: {}
Prepare to be amazed as I double 123
Return from some_other_func: 246
246
>>> some_other_func("hello")
Calling some_other_func
  - Args: ('hello',)
  - KW args: {}
Prepare to be amazed as I double 'hello'
Return from some_other_func: 'hellohello'
'hellohello'
>>>
>>> some_other_func.__name__
'some_other_func'
>>> some_other_func.__doc__
'Just doubles something.'
>>>
>>> inspect.unwrap(some_other_func)(123)
Prepare to be amazed as I double 123
246

In an earlier article on Python 3.3, I spoke about the introduction of the inspect.signature() function. In Python 3.4 the existing inspect.getfullargspec() function, which returns information about a specified function’s parameters, is now based on signature() which means it supports a broader set of callables. One difference is that getfullargspec() still ignores __wrapped__ attributes, unlike signature(), so if you’re querying decorated functions then you may still need the latter.
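The difference is easiest to see with a decorated function. This is a minimal sketch using a do-nothing decorator of my own called passthrough().

```python
import functools
import inspect


def passthrough(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@passthrough
def add(a, b=2):
    return a + b


# signature() follows __wrapped__ back to the original function...
sig_text = str(inspect.signature(add))
# ...whereas getfullargspec() describes the wrapper itself.
spec = inspect.getfullargspec(add)
```

Here sig_text describes the parameters a and b of the original function, whereas spec only knows about the wrapper’s *args and **kwargs.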

On the subject of signature(), that has also changed in this release so that it no longer checks the type of the object passed in, but instead will work with anything that quacks like a function5. This now allows it to work with Cython functions, for example.

The logging module has a few tweaks. TimedRotatingFileHandler can now specify the time of day at which file rotation should happen, and SocketHandler and DatagramHandler now support Unix domain sockets by setting port=None. The configuration interface is also a little more flexible, as a configparser.RawConfigParser instance (or a subclass of it) can now be passed to fileConfig(), which allows an application to embed logging configuration in part of a larger file. On the same topic of configuration, the logging.config.listen() function, which spawns a thread listening on a socket for updated logging configurations for live modification of logging in a running process, can now be passed a validation function which is used to sanity check updated configurations before applying them.
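As a sketch of the first of these changes, the snippet below sets up a handler which rotates daily at 02:30 rather than at midnight. The log file location is a throwaway temporary directory, chosen just for the example.

```python
import datetime
import logging.handlers
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "app.log")  # throwaway location for the demo

handler = logging.handlers.TimedRotatingFileHandler(
    log_path,
    when="midnight",
    atTime=datetime.time(2, 30),  # rotate at 02:30 instead of midnight itself
    delay=True,                   # don't open the file until the first record is emitted
)
handler.close()
```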

The pprint module has had a couple of updates to deal more gracefully with long output. Firstly, there’s a new compact parameter which defaults to False. If you pass True then sequences are printed with as many items per line as will fit within the specified width, which defaults to 80. Secondly, long strings are now split over multiple lines using Python’s standard line continuation syntax.

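>>> import pprint
>>> # x is a nested dict of word lists, defined beforehand (definition not shown)
>>> pprint.pprint(x)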
{'key1': {'key1a': ['tis', 'but', 'a', 'scratch'],
          'key1b': ["it's", 'just', 'a', 'flesh', 'wound']},
 'key2': {'key2a': ["he's", 'not', 'the', 'messiah'],
          'key2b': ["he's", 'a', 'very', 'naughty', 'boy'],
          'key2c': ['alright',
                    'but',
                    'apart',
                    'from',
    # ... items elided from output for brevity ...
                    'ever',
                    'done',
                    'for',
                    'us']}}
>>> pprint.pprint(x, compact=True, width=75)
{'key1': {'key1a': ['tis', 'but', 'a', 'scratch'],
          'key1b': ["it's", 'just', 'a', 'flesh', 'wound']},
 'key2': {'key2a': ["he's", 'not', 'the', 'messiah'],
          'key2b': ["he's", 'a', 'very', 'naughty', 'boy'],
          'key2c': ['alright', 'but', 'apart', 'from', 'the',
                    'sanitation', 'the', 'medicine', 'education', 'wine',
                    'public', 'order', 'irrigation', 'roads', 'the',
                    'fresh-water', 'system', 'and', 'public', 'health',
                    'what', 'have', 'the', 'romans', 'ever', 'done',
                    'for', 'us']}}
>>> pprint.pprint(" ".join(x["key2"]["key2c"]), width=50)
('alright but apart from the sanitation the '
 'medicine education wine public order '
 'irrigation roads the fresh-water system and '
 'public health what have the romans ever done '
 'for us')

In the sys module there’s also a new function getallocatedblocks() which is a lighter-weight alternative to the new tracemalloc module described in the previous article. This function simply returns the number of blocks currently allocated by the interpreter, which is useful for tracing memory leaks. Since it’s so lightweight, you could easily have all your Python applications publish or log this metric at intervals to check for concerning behaviour like monotonically increasing usage.

One quirk I found is that the first time you call it, it seems to perform some allocations, so you want to call it at least twice before doing any comparisons to make sure it’s in a steady state. This behaviour may change on different platforms and Python releases, so just something to keep an eye on.

>>> import sys
>>> sys.getallocatedblocks()
17553
>>> sys.getallocatedblocks()
17558
>>> sys.getallocatedblocks()
17558
>>> x = "hello, world"
>>> sys.getallocatedblocks()
17559
>>> del x
>>> sys.getallocatedblocks()
17558

Yet more good news for debugging and testing are some changes to the unittest module. First up is subTest() which can be used as a context manager to allow one test method to generate multiple test cases dynamically. See the simple code below for an example.

>>> import unittest
>>>
>>> class SampleTest(unittest.TestCase):
...     def runTest(self):
...         for word in ("one", "two", "three", "four"):
...             with self.subTest(testword=word):
...                 self.assertEqual(len(word), 3)
...
>>> unittest.TextTestRunner(verbosity=2).run(SampleTest())
runTest (__main__.SampleTest) ...
======================================================================
FAIL: runTest (__main__.SampleTest) (testword='three')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 1, in runTest
AssertionError: 5 != 3

======================================================================
FAIL: runTest (__main__.SampleTest) (testword='four')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 1, in runTest
AssertionError: 4 != 3

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=2)

In addition to this, test discovery via TestLoader.discover() or python -m unittest discover now sorts test cases consistently between runs, which makes it much easier to compare them.

There’s also a new assertLogs() context manager, which can be used to ensure that code under test emits a log entry. By default this checks for any message of at least INFO level being emitted by any logger, but these parameters can be overridden. In general I don’t think it’s a good idea to tightly couple unit test cases with logging, since it can make things brittle — but there are cases where it’s important to log, such as specific text in a log file triggering an alert somewhere else. In these cases it’s important to catch cases where someone might change or remove the log entry without realising its importance, and being able to do so without explicitly mocking the logging library yourself will prove quite handy.
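Here’s a minimal sketch of that sort of test. The charge_card() function and the "payments" logger name are hypothetical, standing in for code whose log text some external alerting depends on.

```python
import logging
import unittest


def charge_card(amount):
    """Hypothetical function whose warning text is watched by an alerting system."""
    if amount > 1000:
        logging.getLogger("payments").warning("Large charge: %d", amount)
    return amount


class ChargeTest(unittest.TestCase):
    def test_large_charge_is_logged(self):
        # Fails if nothing at WARNING or above is logged on "payments" in the block.
        with self.assertLogs("payments", level="WARNING") as captured:
            charge_card(5000)
        self.assertEqual(captured.output, ["WARNING:payments:Large charge: 5000"])


result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(ChargeTest)
)
```

The object returned by the context manager also has a records attribute holding the LogRecord instances, if you need more than the formatted text.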

Internet

Following on from the policy framework added to the email package in Python 3.3, this release adds support for passing a policy argument to the as_string() method when generating string representations of messages. There is also a new as_bytes() method which is equivalent but returns bytes instead of str.

Another change in email is the addition of two subclasses for Message, which are EmailMessage and MIMEPart. The former should be used to represent email messages going forward and has a new default policy, with the base class Message being reserved for backwards compatibility using the compat32 policy. The latter represents a subpart of a MIME message and is identical to EmailMessage except for the omission of some headers which aren’t required for subparts.

Finally in email there’s a new module contentmanager which offers better facilities for managing message payloads. Currently this offers a ContentManager base class and a single concrete derivation, raw_data_manager, which is the one used by the default EmailPolicy. This offers some basic facilities for doing encoding/decoding to/from bytes and handling of headers for each message part. The contentmanager module also offers facilities for you to register your own managers if you would like to do so.
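Pulling these email changes together, here’s a short sketch which builds a message with EmailMessage and serialises it both ways. The addresses and content are placeholders.

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Test message"
msg.set_content("Hello, Bob!")  # delegated to the policy's content manager

as_bytes = msg.as_bytes()                    # bytes, using the message's own policy
as_smtp = msg.as_string(policy=policy.SMTP)  # str, with CRLF line endings for the wire
```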

Looking at the http module briefly, the BaseHTTPRequestHandler.send_error() method, which is used to send an error response to the client, now offers an explain parameter. Along with the existing optional message parameter, these can be set to override the default text for each HTTP error code that’s normally sent.

The response is formatted using the contents of the error_message_format attribute, which you can override, but the default is as shown below. You can see how the new %(explain)s expansion will be presented in the error.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
        <title>Error response</title>
    </head>
    <body>
        <h1>Error response</h1>
        <p>Error code: %(code)d</p>
        <p>Message: %(message)s.</p>
        <p>Error code explanation: %(code)s - %(explain)s.</p>
    </body>
</html>

The ipaddress module was provisional in Python 3.3, but is now considered a stable interface as of 3.4. In addition the IPv4Address and IPv6Address classes now offer an is_global attribute, which is True if the address is intended to be globally routable (i.e. is not reserved as a private address). At least this is what the documentation indicates. In practice, I found that only IPv6Address offers this feature and it’s missing from IPv4Address. It looks like this was noted and fixed in issue 21386 on the Python tracker, but that fix didn’t make it out until Python 3.5.

In any case, here’s an example of it being used for some IPv6 addresses:

>>> import ipaddress
>>> # Localhost address
... ipaddress.ip_address("::1").is_global
False
>>> # Private address range
... ipaddress.ip_address("fd12:3456:789a:1::1").is_global
False
>>> # python.org
... ipaddress.ip_address("2a04:4e42:4::223").is_global
True

The poplib module has a couple of extra functions. Firstly there’s capa() which returns the list of capabilities advertised by the POP server. Secondly there’s stls(), which issues the STLS command to upgrade a clear-text connection to SSL as specified by RFC 2595. For those familiar with it, this is very similar in operation to IMAP’s STARTTLS command.

In smtplib the exception type SMTPException is now a subclass of OSError, which allows both socket and protocol errors to be caught together and handled in a consistent way, in case that simplifies application logic. This sort of change highlights how important it is to pick the right base class for your exceptions in a library, because you may be able to make life considerably simpler for some of your users if you get it right.
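As a small sketch of what this buys you, the handler below catches OSError only, yet still deals with an SMTP protocol error. Both try_send() and broken_send() are made-up names, the latter standing in for a real smtplib call that fails.

```python
import smtplib


def try_send(send):
    """Call send() and report failure; one handler covers socket and SMTP errors."""
    try:
        send()
    except OSError as exc:
        return "delivery failed: {}".format(exc)
    return "sent"


def broken_send():
    # Stand-in for a real smtplib call that fails at the protocol level.
    raise smtplib.SMTPServerDisconnected("connection unexpectedly closed")


outcome = try_send(broken_send)
```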

The socket library has a few minor updates, the first being the new get_inheritable() and set_inheritable() methods on socket objects to change their inheritability as we discussed in the previous article. Also continuing from an earlier article on release 3.3, the new PF_CAN socket family has a new member: CAN_BCM is the broadcast manager protocol. But unless you’re writing Python code to run on a vehicle messaging bus then you can safely disregard this one.

One nice touch is that socket.AF_* and socket.SOCK_* constants are now defined in terms of the new enum module which we covered in the previous article. This means we can get some useful values out in log trace instead of magic numbers that we need to look up. The other change in this release is for Windows users, who can now enjoy inet_pton() and inet_ntop() for added IPv6 goodness.
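A brief sketch of what the enum-based constants give you in practice:

```python
import socket

# The symbolic name is available directly, which is handy for log messages.
family_name = socket.AF_INET.name
# The repr includes both the name and the underlying number.
type_repr = repr(socket.SOCK_STREAM)
# Members are still integers, so existing code comparing against ints keeps working.
still_an_int = socket.AF_INET == int(socket.AF_INET)
```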

There are some more extensive changes to the ssl module. Firstly, TLS v1.1 and v1.2 support has been added, using PROTOCOL_TLSv1_1 and PROTOCOL_TLSv1_2 respectively. This is where I have to remind myself that Python 3.4 was released in March 2014, as by 2021’s standards these versions are looking long in the tooth, being defined in 2006 and 2008 respectively. Indeed, all the major browser vendors deprecated 1.0 and 1.1 in March 2020.

Secondly, there’s a handy convenience function create_default_context() for creating an SSLContext with some sensible settings to provide reasonable security. These are stricter than the defaults in the SSLContext constructor, and are also subject to change if security best practices evolve. This gives code a better chance to stay up to date with security practices via a simple Python version upgrade, although I assume the downside is a slightly increased chance of introducing issues if (say) a client’s Python version is updated but the server is still using outdated settings so they fail to negotiate a mutually agreeable protocol version.

One detail about the create_default_context() function that I like is its purpose parameter, which selects different sets of parameter values for different purposes. This release includes two purposes: SERVER_AUTH is the default, which is for client-side connections to authenticate servers, and CLIENT_AUTH is for server-side connections to authenticate clients.
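Here’s a minimal sketch of creating a context for each purpose. The precise settings chosen are up to the ssl module and may change between releases, so treat the comments as describing the behaviour I’d expect rather than a guarantee.

```python
import ssl

# Client side: the purpose defaults to Purpose.SERVER_AUTH, so the server's
# certificate is required and its hostname is checked.
client_ctx = ssl.create_default_context()

# Server side: Purpose.CLIENT_AUTH does not demand a certificate from clients by default.
server_ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
```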

The SSLContext class method load_verify_locations() has a new cadata parameter, which allows certificates to be passed directly in PEM- or DER-encoded forms. This is in contrast to the existing cafile and capath parameters which both require certificates to be stored in files.

There’s a new function get_default_verify_paths() which returns the current list of paths OpenSSL will check for a default certificate authority (CA). These values are the same ones that are set with the existing set_default_verify_paths(). This will be useful for debugging: with encryption you want as much transparency as you can possibly get, because it can be very challenging to figure out the source of issues when your only feedback is generally a “yes” or “no”.

On the theme of transparency, SSLContext now has a cert_store_stats() method which returns statistics on the number of certificates loaded, and also a get_ca_certs() method to return a list of the currently loaded CA certificates.

A welcome addition is the ability to customise the certificate verification process by setting the verify_flags attribute on an SSLContext. This can be set by ORing together one or more flags. This release defines the following flags which relate to checks against certificate revocation lists (CRLs):

VERIFY_DEFAULT
Does not check any certificates against CRLs.
VERIFY_CRL_CHECK_LEAF
Only the peer certificate is checked against CRLs, but not any of the intermediate CA certificates in the chain of trust. Requires a CRL signed by the peer certificate’s issuer (i.e. its direct ancestor CA) to be loaded with load_verify_locations(), or validation will fail.
VERIFY_CRL_CHECK_CHAIN
In this mode, all certificates in the chain of trust are checked against their CRLs.
VERIFY_X509_STRICT
Also checks the full chain, but additionally disables workarounds for broken X.509 certificates.
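As a sketch of how these are applied, the snippet below enables leaf CRL checking on top of whatever flags the context already has. Remember that a suitable CRL would need to be loaded before any real connection could validate.

```python
import ssl

ctx = ssl.create_default_context()
# OR the new flag into the existing value rather than overwriting it.
ctx.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF
leaf_check_enabled = bool(ctx.verify_flags & ssl.VERIFY_CRL_CHECK_LEAF)
```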

Another useful addition for common users, the load_default_certs() method on SSLContext loads a set of standard CA certificates from locations which are platform-dependent. Note that if you use create_default_context() and you don’t pass your own CA certificate store, this method will be called for you.

>>> import pprint
>>> import ssl
>>>
>>> context = ssl.SSLContext(protocol=ssl.PROTOCOL_TLSv1_2)
>>> len(context.get_ca_certs())
0
>>> context.load_default_certs()
>>> len(context.get_ca_certs())
163
>>> pprint.pprint(context.get_ca_certs()[151])
{'issuer': ((('countryName', 'US'),),
            (('organizationName', 'VeriSign, Inc.'),),
            (('organizationalUnitName', 'VeriSign Trust Network'),),
            (('organizationalUnitName',
              '(c) 2008 VeriSign, Inc. - For authorized use only'),),
            (('commonName',
              'VeriSign Universal Root Certification Authority'),)),
 'notAfter': 'Dec  1 23:59:59 2037 GMT',
 'notBefore': 'Apr  2 00:00:00 2008 GMT',
 'serialNumber': '401AC46421B31321030EBBE4121AC51D',
 'subject': ((('countryName', 'US'),),
             (('organizationName', 'VeriSign, Inc.'),),
             (('organizationalUnitName', 'VeriSign Trust Network'),),
             (('organizationalUnitName',
               '(c) 2008 VeriSign, Inc. - For authorized use only'),),
             (('commonName',
               'VeriSign Universal Root Certification Authority'),)),
 'version': 3}

You may recall from the earlier article on Python 3.2 that client-side support for SNI (Server Name Indication) was added then. Well, Python 3.4 adds server-side support for SNI. This is achieved using the set_servername_callback() method7 of SSLContext, which registers a callback function which is invoked when the client uses SNI. The callback is invoked with three arguments: the SSLSocket instance, a string indicating the name the client has requested, and the SSLContext instance. A common role for this callback is to swap out the SSLContext attached to the socket for one which matches the server name that’s being requested — otherwise the certificate will fail to validate.
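To illustrate the shape of such a callback, here’s a rough sketch which swaps contexts based on the requested name. The hostnames and the contexts_by_host mapping are hypothetical, no certificates are loaded, and I’ve used PROTOCOL_TLS_SERVER which only arrived in a later Python release than the one under discussion.

```python
import ssl

# Hypothetical mapping of hostnames to pre-configured contexts; in real code each
# context would have the matching certificate loaded with load_cert_chain().
contexts_by_host = {
    "www.example.com": ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER),
    "api.example.com": ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER),
}


def choose_context(ssl_socket, server_name, original_context):
    """Swap in the context for the requested name, if we have one."""
    replacement = contexts_by_host.get(server_name)
    if replacement is not None:
        ssl_socket.context = replacement
    # Returning None tells the ssl module to carry on with the handshake.
    return None


default_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
default_context.set_servername_callback(choose_context)
```

If the client doesn’t use SNI then server_name is None, in which case this sketch simply leaves the original context in place.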

Finally in ssl, Windows users get two additional functions, enum_certificates() and enum_crls() which can retrieve certificates and CRLs from the Windows certificate store.

There are a number of improvements in the urllib.request module. It now supports URIs using the data: scheme with the DataHandler class. The HTTP method used by the Request class can be specified by overriding the method class attribute in a subclass. It’s also possible to now safely reuse Request objects — updating full_url or data causes all relevant internal state to be updated as appropriate. This means you can set up a template Request and then use that for multiple individual requests which differ only in the URL or the request body data.
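Here’s a short sketch covering all three of these changes. The PutRequest class and the example.com URLs are placeholders, and no network requests are made since only the data: URL is actually opened.

```python
import urllib.request


class PutRequest(urllib.request.Request):
    method = "PUT"  # class attribute picked up by get_method()


template = PutRequest("http://example.com/items/1", data=b"first")
first_method = template.get_method()

# Reuse the same object for a second request by updating the URL and body.
template.full_url = "http://example.com/items/2"
template.data = b"second"

# The data: scheme carries its payload in the URL, so no network access is needed.
with urllib.request.urlopen("data:text/plain;charset=utf-8,hello%20world") as response:
    body = response.read()
```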

Also in urllib, HTTPError exceptions now have a headers attribute which contains the HTTP headers from the response which triggered the error.

Language Support

A few changes have been made to some of the modules that support core language features.

First up is the abc module for defining abstract base classes. Previously, abstract base classes were defined using the metaclass keyword parameter to the class definition, which could sometimes confuse people:

import abc

class MyClass(metaclass=abc.ABCMeta):
    ...

Now there’s an abc.ABC base class so you can instead use this rather more readable version:

class MyClass(abc.ABC):
    ...

Next a useful change in contextlib, which now offers a suppress context manager to ignore exceptions in its block. If any of the listed exceptions occur, they are ignored and execution jumps to just outside the with block. This is really just a more concise and/or better self-documenting way of catching then ignoring the exceptions yourself.

>>> import contextlib
>>>
>>> with contextlib.suppress(OSError, IOError):
...     with open("/tmp/canopen", "w") as fd:
...         print("Success in /tmp")
...     with open("/cannotopen", "w") as fd:
...         print("Success in /")
...     print("Done both files")
...
Success in /tmp
>>>

There’s also a new redirect_stdout() context manager which temporarily redirects sys.stdout to any other stream, including io.StringIO to capture the output in a string. This is useful for dealing with poorly-designed code which writes its errors directly to standard output instead of raising exceptions. Oddly, there’s no equivalent redirect_stderr() to match this, however1.
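A minimal sketch of capturing output this way, where noisy() is a made-up stand-in for the sort of code described:

```python
import contextlib
import io


def noisy():
    """Stand-in for code that reports problems by printing instead of raising."""
    print("error: something went wrong")


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    noisy()
captured = buffer.getvalue()  # everything printed inside the with block
```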

Moving on there are some improvements to the functools module. First up is partialmethod() which works like partial() except that it’s used for defining partial specialisations of methods instead of direct callables. It supports descriptors like classmethod(), staticmethod(), and so on, and also any method that accepts self as the first positional argument.

import functools

class Host:

    def __init__(self):
        self.set_host_down()

    @property
    def state(self):
        return self._state

    def set_state(self, new_state):
        self._state = new_state

    set_host_up = functools.partialmethod(set_state, "up")
    set_host_down = functools.partialmethod(set_state, "down")

In the code above, set_host_up() and set_host_down() can be called as normal methods with no parameters, and just indirect into set_state() with the appropriate argument passed.

The other addition to functools is the singledispatch decorator. This allows the creation of a generic function which calls into one of several separate underlying implementation functions based on the type of the first parameter. The code below illustrates a generic function which calculates the square of an integer from several possible input types:

import functools

@functools.singledispatch
def my_func(value):
    return value ** 2

@my_func.register(str)
@my_func.register(bytes)
def _(value):
    return int(value) ** 2

@my_func.register(float)
def _(value):
    return round(value) ** 2

The importlib module has also had some more attention in this release. First up is a change to InspectLoader, an abstract base class for loaders[3] that can inspect the modules they load. This now has a method source_to_code() which converts Python source code to executable byte code. The default implementation calls the builtin compile() with appropriate arguments, but it would be possible to override this method to add other features: for example, to use ast.parse() to obtain the AST[4] of the code, then manipulate it somehow (e.g. to implement some optimisation), and then finally use compile() to convert this to executable Python code.

Also in InspectLoader, the get_code() method, which used to be abstract, now has a concrete default implementation. This method is responsible for returning the code object for a module. The documentation states that it should be overridden for performance reasons where possible, however, as the default implementation uses the get_source() method, which can be a somewhat expensive operation as it has to decode the source and do universal newline conversion.
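
To make the idea of overriding source_to_code() a little more concrete, here’s a rough sketch of a loader which tweaks the AST before compiling it. The ShoutingLoader class, its toy “upper-case every string literal” transformation and the module name "demo" are all invented for illustration, and it assumes Python 3.8 or later, where string literals are parsed as ast.Constant nodes. It also leans on the default get_code() to fetch the source and pass it along.

```python
import ast
import importlib.abc


class ShoutingLoader(importlib.abc.InspectLoader):
    """Hypothetical loader which serves a single in-memory source string."""

    def __init__(self, source):
        self._source = source

    def get_source(self, fullname):
        # The only abstract method we have to provide.
        return self._source

    def source_to_code(self, data, path="<string>"):
        # Parse to an AST, manipulate it, then compile the result.
        tree = ast.parse(data, filename=path)
        for node in ast.walk(tree):
            if isinstance(node, ast.Constant) and isinstance(node.value, str):
                node.value = node.value.upper()
        return compile(tree, path, "exec")


loader = ShoutingLoader("greeting = 'hello'\n")
namespace = {}
# The inherited get_code() calls get_source() and then source_to_code().
exec(loader.get_code("demo"), namespace)
print(namespace["greeting"])
```

A real loader would do something more useful with the tree, of course, but the shape of the override should be much the same.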

Speaking of get_source(), there’s a new importlib.util.decode_source() function that decodes source from bytes with universal newline processing — this is quite useful for implementing get_source() methods easily.
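
As a quick illustration, the sketch below feeds decode_source() some bytes of the sort you might read from a source file on disk. The sample content is made up, and uses an encoding declaration and Windows-style line endings to show both parts of the processing.

```python
import importlib.util

# Source bytes as they might be read from disk: a PEP 263 encoding
# declaration, Windows-style line endings and a non-ASCII character.
raw = "# -*- coding: latin-1 -*-\r\nname = 'café'\r\n".encode("latin-1")

# The declared encoding is honoured and line endings are normalised.
text = importlib.util.decode_source(raw)
print(repr(text))
```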

Potentially of interest to more people, imp.reload() is now importlib.reload(), as part of the ongoing deprecation of the imp module. In a similar vein, imp.get_magic() is replaced by importlib.util.MAGIC_NUMBER, and both imp.cache_from_source() and imp.source_from_cache() have moved to importlib.util as well.
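
For anyone updating old code, the replacements look something like the sketch below. The choice of colorsys as the module to reload and the pkg/mod.py path are arbitrary examples, and the exact cache filename will depend on your interpreter.

```python
import colorsys
import importlib
import importlib.util
import os

# Previously imp.reload(); the same module object is returned.
reloaded = importlib.reload(colorsys)

# Previously imp.get_magic(); now a simple bytes constant.
magic = importlib.util.MAGIC_NUMBER

# Previously imp.cache_from_source() and imp.source_from_cache().
source_path = os.path.join("pkg", "mod.py")
cache_path = importlib.util.cache_from_source(source_path)
round_trip = importlib.util.source_from_cache(cache_path)

print(magic, cache_path, round_trip)
```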

Following on from the discussion of namespace packages in the last article, the NamespaceLoader used now conforms to the InspectLoader interface, which has the concrete benefit that the runpy module, and hence the python -m <module> command-line option, now work with namespace packages too.

Finally in importlib, the ExtensionFileLoader in importlib.machinery has now received a get_filename() method, whose omission was simply an oversight in the original implementation.

The new descriptor DynamicClassAttribute has been added to the types module. You use this in cases where you want an attribute that acts differently based on whether it’s been accessed through an instance or directly through the class. It seems that the main use-case for this is when you want to define class attributes on a base class, but still allow subclasses to reuse the same names for their properties without conflicting. For this to work you need to define a __getattr__() method which handles lookups on the class itself, which in practice means putting it on the metaclass, but since this is quite an obscure little corner I’ll leave the official types documentation to go into more detail. I’ll just leave you with a code sample that illustrates its use:

import types

class MyMetaclass(type):
    def __getattr__(self, name):
        if name == "some_property":
            return MyMetaclass.some_property
        # Anything else is genuinely missing, so fail in the usual way.
        raise AttributeError(name)

    some_property = "meta"

class MyClass(metaclass=MyMetaclass):
    def __init__(self):
        self._some_property = "concrete"

    # Replace this decorator with @property to see the difference.
    @types.DynamicClassAttribute
    def some_property(self):
        return self._some_property

instance = MyClass()
print(instance.some_property)   # Should print "concrete"
print(MyClass.some_property)    # Should print "meta"

And to close off our language support features there are a handful of changes to the weakref module. First of all, the WeakMethod class has been added for taking a weak reference to a bound method. You can’t use a standard weak reference because bound methods are ephemeral: a new one is created each time the method is looked up on the instance, and it only exists for as long as something holds a reference to it. Therefore, if the weak reference was the only reference then it wouldn’t be enough to keep the bound method alive. Thus the WeakMethod class was added to simulate a weak reference to a bound method by re-creating the bound method as required until either the instance or the method no longer exists.

This class follows standard weakref.ref semantics where calling the weak reference returns either None or the object itself. Since the object in this example is itself callable, we need another pair of brackets to call it. This explains the m2()() you’ll see in the snippet below.

>>> import weakref
>>>
>>> class MyClass:
...     def __init__(self, value):
...         self._value = value
...     def my_method(self):
...         print("Method called")
...         return self._value
...
>>> instance = MyClass(123)
>>> m1 = weakref.ref(instance.my_method)
>>> # Standard weakrefs don't work here.
... repr(m1())
'None'
>>> m2 = weakref.WeakMethod(instance.my_method)
>>> repr(m2())
'<bound method MyClass.my_method of <__main__.MyClass object at 0x10abc8f60>>'
>>> repr(m2()())
Method called
'123'
>>> # Here you can see the bound method is re-created each time.
... m2() is m2()
False
>>> del instance
>>> # Now we've deleted the object, the method is gone.
... repr(m2())
'None'

There’s also a new class weakref.finalize which allows you to install a callback to be invoked when an object is garbage-collected. In this regard it works a bit like an externally installed __del__() method. You pass in an object instance and a callback function as well as, optionally, parameters to be passed to the callback. The finalize object is returned, but even if you delete this reference it remains installed and the callback will still be called when the object is destroyed. This includes when the interpreter exits, although you can set the atexit attribute to False to prevent this.

>>> import sys
>>> import weakref
>>>
>>> class MyClass:
...     pass
...
>>> def callback(arg):
...     print("Callback called with {}".format(arg))
...
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "one")
>>> # Deleting the finalize instance makes no difference
... del finalizer
>>> # The callback is still called when the instance is GC.
... del instance
Callback called with one
>>>
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "two")
>>> # You can trigger the callback earlier if you like.
... finalizer()
Callback called with two
>>> finalizer.alive
False
>>> # It's only called once, so it now won't fire on deletion.
... del instance
>>>
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "three")
>>> finalizer.atexit
True
>>> # Callback is invoked at system exit, if atexit=True
... sys.exit(0)
Callback called with three

Markup Languages

The html module has sprouted a handy little unescape() function which converts HTML character entities back to their Unicode equivalents.

>>> import html
>>> html.unescape("I spent &pound;50 on this &amp; that")
'I spent £50 on this & that'
>>> html.unescape("&pi; is the &numero;1 mathematical constant")
'π is the №1 mathematical constant'

The HTMLParser class has been updated to take advantage of this, so now there’s a convert_charrefs parameter that, if True, performs this conversion. For backwards-compatibility it defaults to False, but the documentation warns this will flip to True in a future release.
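
To see the difference the parameter makes, here’s a small sketch using a hypothetical TextCollector subclass (my own invention, not part of the library) which records the text and entity references that the parser reports:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper which records text and entity references."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []
        self.entities = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.entities.append(name)


def collect(markup, convert_charrefs):
    parser = TextCollector(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks), parser.entities


markup = "<p>&pound;50 &amp; up</p>"
converted = collect(markup, convert_charrefs=True)
unconverted = collect(markup, convert_charrefs=False)
print(converted)
print(unconverted)
```

With conversion enabled the entities arrive as part of the ordinary text, whereas with it disabled they’re reported separately and it’s up to the application to stitch things back together.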

The xml.etree module has also seen some changes, with a new XMLPullParser parser being added. This is intended for applications which can’t perform blocking reads of the data for any reason. Data is fed into the parser incrementally with the feed() method, and instead of the callback method approach used by XMLParser the XMLPullParser relies on the application to call a read_events() method to collect any parsed items found so far. I’ve found this sort of incremental parsing model really useful in the past where you may be parsing particularly large documents, since often you can process the information incrementally into some other useful data structure and save a lot of memory, so it’s worthwhile getting familiar with this class.

Each call to the read_events() method will yield a generator which allows you to iterate through the events. Once an item is read from the generator it’s removed from the list, but the call to read_events() itself doesn’t clear anything, so you don’t need to worry about avoiding partial reads of the generator before dropping it — the remaining events will still be there on your next call to read_events(). That said, creating multiple such generators and using them in parallel could have unpredictable results, and spanning them across threads is probably a particularly bad idea.

One important point to note is that if there is an error parsing the document, then this method is where the ParseError exception will be raised. That doesn’t mean parsing is deferred, though: the traceback in the example further down shows the error originating inside feed(), so the data is parsed as it’s fed in, and any error is held back and re-raised once read_events() reaches that point in the stream.

Each item yielded will be a 2-tuple of the event type and a payload which is event-type-specific. On the subject of event type, the constructor of XMLPullParser takes a list of event types that you’re interested in, which defaults to just end events. The event types you can specify in this release are:

Event     Meaning          Payload
start     Opening tag      Element object
end       Closing tag      Element object
start-ns  Start namespace  Tuple (prefix, uri)
end-ns    End namespace    None

It’s worth noting that the start event is raised as soon as the end of the opening tag is seen, so the Element object won’t have any text or tail attributes. If you care about these, it’s probably best to just filter on end events, where the entire element is returned. The start events are mostly useful so you can see the context in which intervening tags are used, including any attributes defined within the containing opening tag.

The start-ns event is generated prior to the opening tag which specifies the namespace prefix, and the end-ns event is generated just after its matching closing tag. In the tags that follow which use the namespace prefix the URI will be substituted in, since really the prefix is just an alias for the URI.

Here’s an example of its use showing that only events for completed items are returned, and showing what happens if the document is malformed:

>>> import xml.etree.ElementTree as ET
>>> import pprint
>>>
>>> parser = ET.XMLPullParser(("start", "end"))
>>> parser.feed("<document><one>Sometext</one><two><th")
>>> pprint.pprint(list(parser.read_events()))
[('start', <Element 'document' at 0x1057d2728>),
 ('start', <Element 'one' at 0x1057d2b38>),
 ('end', <Element 'one' at 0x1057d2b38>),
 ('start', <Element 'two' at 0x1057d2b88>)]
>>> parser.feed("ree>Moretext</three><four>Yet")
>>> pprint.pprint(list(parser.read_events()))
[('start', <Element 'three' at 0x1057d2c28>),
 ('end', <Element 'three' at 0x1057d2c28>),
 ('start', <Element 'four' at 0x1057d2c78>)]
>>> parser.feed("moretext</closewrongtag></two></document>")
>>> pprint.pprint(list(parser.read_events()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andy/.pyenv/versions/3.4.10/lib/python3.4/xml/etree/ElementTree.py", line 1281, in read_events
    raise event
  File "/Users/andy/.pyenv/versions/3.4.10/lib/python3.4/xml/etree/ElementTree.py", line 1239, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: mismatched tag: line 1, column 76
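
The namespace events described earlier can be observed in much the same way. In this sketch the document and its http://example.com/ns URI are made up for illustration:

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(("start-ns", "end-ns", "end"))
parser.feed('<root xmlns:x="http://example.com/ns">'
            '<x:item>hi</x:item></root>')
events = list(parser.read_events())

for kind, payload in events:
    if kind == "end":
        # The prefix has been replaced by the URI in the tag name.
        print(kind, payload.tag)
    else:
        # start-ns carries a (prefix, uri) tuple, end-ns carries None.
        print(kind, payload)
```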

Another small enhancement is that the tostring() and tostringlist() functions, as well as the ElementTree.write() method, now have a short_empty_elements keyword parameter. If set to True, which is the default, this causes empty tags to use the <tag /> shorthand. If set to False the expanded <tag></tag> form will be used instead.
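
A quick sketch of the two forms, using a made-up pair of elements:

```python
import xml.etree.ElementTree as ET

root = ET.Element("config")
ET.SubElement(root, "empty")

# The default uses the shorthand for elements with no content.
short_form = ET.tostring(root)
# Passing False forces separate opening and closing tags instead.
long_form = ET.tostring(root, short_empty_elements=False)

print(short_form)
print(long_form)
```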

Operating System

As well as the file descriptor inheritance features mentioned above, the os module also has a few more changes, listed below.

os.cpu_count() Added
Returns the number of CPUs available on the current platform, or None if it can’t be determined. This is now used as the implementation for multiprocessing.cpu_count().
os.path Improvements on Windows
On Windows, os.path.samestat() is now available, to tell if two stat() results refer to the same file, and os.path.ismount() now correctly recognises volumes which are mounted below the drive letter level.
os.open() New Flags

On platforms where the underlying call supports them, os.open() now supports two new flags.

  • O_PATH is used for obtaining a file descriptor to a path without actually opening it — reading or writing it will yield EBADF. This is useful for operations that don’t require us to access the file or directory such as fchdir().
  • O_TMPFILE creates an open file but never creates a directory entry for it, so it can be used as a temporary file. This is one step better than the usual approach of creating and then immediately deleting a temporary file, relying on the open filehandle to prevent the filesystem from reclaiming the blocks, because it doesn’t allow any window of opportunity to see the directory entry.
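
Here’s a cautious sketch of the O_TMPFILE approach. The open_anonymous_file() helper is my own invention, and since the flag is only available on some platforms and filesystems it simply returns None wherever the flag is missing or the call fails.

```python
import os
import tempfile


def open_anonymous_file(directory):
    """Return an fd for an unnamed file in directory, or None if unsupported."""
    tmpfile_flag = getattr(os, "O_TMPFILE", None)
    if tmpfile_flag is None:
        return None
    try:
        # O_TMPFILE takes a directory and must be opened for writing.
        return os.open(directory, os.O_RDWR | tmpfile_flag, 0o600)
    except OSError:
        return None


directory = tempfile.mkdtemp()
fd = open_anonymous_file(directory)
if fd is not None:
    os.write(fd, b"scratch data")
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 100)
    # The file never had a name, so the directory listing stays empty.
    entries = os.listdir(directory)
    os.close(fd)
    print(data, entries)
os.rmdir(directory)
```

In practice the tempfile module is usually the more portable choice, so treat this as a demonstration of the flag rather than a recommendation.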

macOS users get to benefit from some improvements to the plistlib module, which offers functions to read and write Apple .plist (property list) files. This module now sports an API that’s more consistent with other similar ones, with functions load(), loads(), dump() and dumps(). The module also now supports the binary file format, as well as the existing support for the XML version.

On Linux, the resource module has some additional features. The Linux-specific prlimit() system call has been exposed, which allows you to both set and retrieve the current limit for any process based on its PID. You provide a resource (e.g. resource.RLIMIT_NOFILE controls the number of open file descriptors permitted) and then you can either provide a new value for the resource to set it and return the prior value, or omit the limit argument to just query the current setting. Note that you may get PermissionError raised if the current process doesn’t have the CAP_SYS_RESOURCE capability.
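
As a rough sketch, querying and setting the open file limit of the current process looks something like this. It assumes a Linux system where prlimit() is available, and it deliberately sets the limit back to its existing values so that nothing actually changes.

```python
import os
import resource

pid = os.getpid()

if hasattr(resource, "prlimit"):
    # Omitting the limits argument just queries the current settings.
    soft, hard = resource.prlimit(pid, resource.RLIMIT_NOFILE)
    # Passing limits sets them and returns the previous values.
    previous = resource.prlimit(pid, resource.RLIMIT_NOFILE, (soft, hard))
else:
    # Fallback for platforms without prlimit(); current process only.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    previous = (soft, hard)

print("soft:", soft, "hard:", hard)
```

The interesting part is the pid argument, of course, since getrlimit() and setrlimit() already cover the current process.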

On a related note, since some Unix variants have additional RLIMIT_* constants available, these have also been exposed in the resource module:

  • RLIMIT_MSGQUEUE (on Linux)
  • RLIMIT_NICE (on Linux)
  • RLIMIT_RTPRIO (on Linux)
  • RLIMIT_RTTIME (on Linux)
  • RLIMIT_SIGPENDING (on Linux)
  • RLIMIT_SBSIZE (on FreeBSD)
  • RLIMIT_SWAP (on FreeBSD)
  • RLIMIT_NPTS (on FreeBSD)

The stat module is now backed by a C implementation _stat, which makes it much easier to expose the myriad of platform-dependent values that exist. Three new ST_MODE flags were also added:

S_IFDOOR
Doors are an IPC mechanism on Solaris.
S_IFPORT
Event ports are another Solaris mechanism, which is a unified interface to collecting event completions, rather like a generic version of poll().
S_IFWHT
A whiteout file is a special file which indicates there is, in fact, no such file. This is typically used in union mount filesystems, such as OverlayFS on Linux, to indicate that a file has been deleted in the overlay. Since the lower layers are often mounted read-only, the upper layer needs some indicator to layer over the top to stop the underlying files being visible.

Other Library Changes

Some other assorted updates that didn’t fit any of the themes above.

argparse.FileType Improvements
The class now accepts encoding and errors arguments that are passed straight on to the resultant open() call.
base64 Improvements
Encoding and decoding functions now accept any bytes-like object. Also, there are now functions to encode/decode Ascii85, both the variant used by Adobe for the PostScript and PDF formats, and also the one used by Git to encode binary patches.
dbm.open() Improvements
The dbm.open() call now supports use as a context manager.
glob.escape() Added
This escapes any special characters in a string to force it to be matched literally if passed to a globbing function.
importlib.machinery.ModuleSpec Added
PEP 451 describes a number of changes to importlib to continue to address some of the outstanding quirks and inconsistencies in this process. Primarily the change is to move some attributes from the module object itself to a new ModuleSpec object, which will be available via the __spec__ attribute. As far as I can tell this doesn’t offer a great deal of concrete benefits initially, but I believe it’s laying the foundations for further improvements to the import system in future releases. Check out the PEP for plenty of details.
re.fullmatch() Added
For matching regexes there was historically re.match() which only checked for a match starting at the beginning of the search string, and re.search() which would find a match starting anywhere. Now there’s also re.fullmatch(), and a corresponding method on compiled patterns, which finds matches covering the entire string (i.e. anchored at both ends).
selectors Added
The new module selectors was added as a higher-level abstraction over the implementations provided by the select module. This will probably make it easier for programmers who are less experienced with select(), poll() and friends to implement reliable applications, as these calls definitely have a few tricky quirks. That said, I would have thought the intention would be for most people to shift to using asyncio for these purposes, if they’re able.
shutil.copyfile() Raises SameFileError
If the source and destination are already the same file, shutil.copyfile() now raises the specific SameFileError exception, which allows applications to take special action in this case.
struct.iter_unpack() Added
For byte strings which consist of repeated packed structures concatenated together, this function provides an efficient way to iterate across them. There’s also a corresponding method of the same name on struct.Struct objects.
tarfile CLI Added
There’s now a command-line interface to the tarfile module which can be invoked with python -m tarfile.
textwrap Enhancements
The TextWrapper class now offers two new attributes: max_lines limits the number of lines in the output, and placeholder is appended to the output to indicate it was truncated due to the setting of max_lines. There’s also a handy new textwrap.shorten() convenience function that uses these facilities to shorten a single line to a specified length, and append placeholder if truncation occurred.
>>> import textwrap
>>>
>>> wrapper = textwrap.TextWrapper(max_lines=5, width=40,
        placeholder="[Read more...]")
>>> print("\n".join(wrapper.wrap(
        "House? You were lucky to have a house! We used to live in one"
        " room, all hundred and twenty-six of us, no furniture. Half the"
        " floor was missing; we were all huddled together in one corner"
        " for fear of falling.")))
House? You were lucky to have a house!
We used to live in one room, all hundred
and twenty-six of us, no furniture. Half
the floor was missing; we were all
huddled together in one[Read more...]
>>> textwrap.shorten("No no! 'E's pining!", width=30)
"No no! 'E's pining!"
>>> textwrap.shorten("'E's not pinin'! 'E's passed on!", width=30)
"'E's not pinin'! 'E's [...]"
PyZipFile.writepy() Enhancement
The zipfile.PyZipFile class is a specialised compressor for the purposes of creating ZIP archives of Python libraries. It now supports a filterfunc parameter which must be a function accepting a single argument. It will be called for each file added to the archive, being passed the full path, and if it returns False for any path then it’ll be excluded from the archive. This could be used to exclude unit test code, for example.
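
A few of the smaller additions in the list above lend themselves to a quick demonstration. The filenames and data in this sketch are made up for illustration:

```python
import glob
import re
import struct

# glob.escape() wraps the special characters *, ? and [ in brackets so
# that they only ever match themselves.
pattern = glob.escape("data[1]*.txt")
print(pattern)

# re.match() is happy with a match at the start of the string, whereas
# re.fullmatch() insists that the whole string matches.
prefix_only = re.match(r"\d+", "12345abc")
whole_string = re.fullmatch(r"\d+", "12345abc")
all_digits = re.fullmatch(r"\d+", "12345")
print(prefix_only, whole_string, all_digits)

# struct.iter_unpack() walks a buffer holding back-to-back records, here
# two records of a little-endian unsigned short followed by one byte.
packed = struct.pack("<HB", 1, 2) + struct.pack("<HB", 3, 4)
records = list(struct.iter_unpack("<HB", packed))
print(records)
```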

Builtin Changes

There were a collection of changes to builtins which are worth a quick mention.

min() and max() Defaults
You can now specify a default keyword-only parameter, to be returned if the iterable you pass is empty.
Absolute Module __file__ Path
The __file__ attribute of modules should now always use absolute paths, except for __main__.__file__ if the script was invoked with a relative name. Could be handy, especially when using this to generate log entries and the like.
bytes.join() Accepts Any Buffer
The join() method of bytes and bytearray used to be restricted to accepting objects of these types. Now in both cases it will accept any object supporting the buffer protocol.
memoryview Supports reversed()
Due to an oversight, memoryview was not previously registered as a subclass of collections.abc.Sequence, but as of Python 3.4 it is. Also, it can now be used with reversed().
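
Most of these are easiest to appreciate by trying them, so here’s a short sketch with some arbitrary sample values:

```python
import collections.abc

# min() and max() no longer have to raise ValueError on empty iterables.
smallest = min([], default=None)
largest = max([], default=0)

# join() accepts a mixture of bytes-like objects.
joined = b", ".join([b"a", bytearray(b"b"), memoryview(b"c")])

# memoryview is a registered Sequence and works with reversed().
view = memoryview(b"abc")
is_sequence = isinstance(view, collections.abc.Sequence)
backwards = list(reversed(view))

print(smallest, largest, joined, is_sequence, backwards)
```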

Garbage Collecting Reference Cycles

In this release there’s also an important change to the garbage collection process, as defined in PEP 442. This finally resolves some long-standing problems around garbage collection of reference cycles where the objects have custom finalisers (e.g. __del__() methods).

Just to make sure we’re on the same page, a reference cycle is when you have a series of objects which all hold a reference to each other where there is no clear “root” object which can be deleted first. This means that their reference counts never normally drop to zero, because there’s always another object holding a reference to them. If, like me, you’re a more visual thinker, here’s a simple illustration:

reference cycle diagram

It’s for these cases that the garbage collector was created. It detects reference cycles which no longer have any external references pointing to them[9], and breaks all the references within the cycle. This allows the reference counts to drop to zero and the normal object cleanup occurs.

This is fine, except when objects in the cycle have custom finalisers. In these cases, it’s not clear in what order the finalisers should be called, and also there’s the risk that the finalisers could make changes which themselves impact the garbage collection process. So historically the interpreter has balked at these cases and left the objects on the gc.garbage list for programmers to clean up using their specific knowledge of the objects in question. Of course, it’s always better never to create such reference cycles in the first place, but sometimes it’s surprisingly easy to do so by accident.

The good news is that in Python 3.4 this situation has been improved so that in almost every case the garbage collector will be able to collect reference cycles. The garbage collector now has two additional stages. In the first of these, the finalisers of all objects in isolated reference cycles are invoked. The only choice here is really to call them in an undefined order, so you should avoid making too many assumptions in the finalisers that you write.

The second new step, after all finalisers have been run, is to re-traverse the cycles and confirm they’re still isolated. This is required because the finalisers may have ended up creating references from outside the cycle which should keep it alive. If the cycle is no longer isolated, the collection is aborted this time around and the objects persist. Note that their finalisers will only ever be called once, however, and this won’t change if they’re resurrected in this fashion.

Assuming the collection wasn’t aborted, it now continues as normal.

This should cover most of the cases people are likely to hit. However, there’s an important exception which can still bite you: this change doesn’t affect objects defined in C extension modules which have a custom tp_dealloc function. These objects may still end up on gc.garbage, unfortunately.

The take-aways from this change appear to be:

  • Don’t rely on the order in which your finalisers will be called.
  • You shouldn’t need to worry about checking gc.garbage any more.
  • … Unless you’re using objects from C extensions which define custom finalisers.
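
To see the new behaviour in action, here’s a minimal sketch which builds a two-object cycle where both objects have finalisers. The Node class is made up for the purpose, and the sketch assumes CPython 3.4 or later.

```python
import gc


class Node:
    """Hypothetical class with a finaliser which records that it ran."""

    finalised = []

    def __init__(self, name):
        self.name = name
        self.partner = None

    def __del__(self):
        Node.finalised.append(self.name)


a, b = Node("a"), Node("b")
a.partner, b.partner = b, a   # Each object now keeps the other alive.
del a, b                      # No external references remain.

gc.collect()

# Both finalisers have run, in no particular order, and neither object
# has been left behind on gc.garbage.
print(sorted(Node.finalised))
print([obj for obj in gc.garbage if isinstance(obj, Node)])
```

Prior to this release I’d expect both objects to have ended up on gc.garbage instead, with neither finaliser being called.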

Other Changes

Here are the other changes I felt were noteworthy enough to mention, but not enough to jump into a lot of detail.

More secure hash algorithm
Python has updated its hashing algorithm to SipHash for security reasons. For a little more background you can see the CERT advisory on this issue from 2011, and PEP 456 has a lot more details.
UCD Updated to 6.3
The Unicode Character Database (UCD) has been updated to version 6.3. If you’re burning to know what it added, check out the UCD blog post.
Isolated mode option
The Python interpreter now supports a -I option to run in isolated mode. This removes the current directory from sys.path, as well as the user’s own site-packages directory, and also ignores all PYTHON* environment variables. The intention is to be able to run a script in a clean system-defined environment, without any user customisations being able to impact it. This can be specified on the shebang line of system scripts, for example.
Optimisations

As usual there are a number of optimisations, of which I’ve only included some of the more interesting ones here:

  • The UTF-32 decoder is now 3-4x faster.
  • Hash collisions in sets are cheaper due to an optimisation of trying some limited linear probing in the case of a collision, which can take advantage of cache locality, before falling back on open addressing if there are still repeated collisions (by default the limit for linear probing is 9).
  • Interpreter startup time has been reduced by around 30% by loading fewer modules by default.
  • html.escape() is around 10x faster.
  • os.urandom() now uses a lazily-opened persistent file descriptor to avoid the overhead of opening large numbers of file descriptors when run in parallel from multiple threads.

Conclusions

Another packed release yet again, and plenty of useful additions large and small. Blocking inheritance of file descriptors by default is one of those features that’s going to be helpful to a lot of people without them even knowing it, which is the sort of thing Python does well in general. The new modules in this release aren’t anything earth-shattering, but they’re all useful additions. The lack of something like the enum module in particular is something that has always felt like a bit of a rough edge. The diagnostic improvements like tracemalloc and the inspect improvements all feel like the type of thing that you won’t necessarily be using every day, but when you have need of them they’re priceless. The addition of subTest to unittest is definitely a handy one, as it makes failures in certain cases much more transparent than just realising the overall test has failed and having to insert logging statements to figure out why.

The incremental XMLPullParser is a great addition in my opinion, as I’ve always had a bit of a dislike of callback-based approaches since they always seem to force you to jump through more hoops than you’d like. Whichever one is a natural fit does depend on your use-case, however, so it’s good to have both approaches available to choose from. I’m also really glad to see the long-standing issue of garbage collecting reference cycles with custom finalisers has finally been tidied up; it’s one more change to give us more confidence using Python for very long-running daemon processes.

It does feel rather like a “tidying up loose ends and known issues” release, this one, but there’s plenty there to justify an upgrade. From what I know of later releases, I wonder if that was somewhat intentional — stabilising the platform for syntactic updates and other more sweeping changes in the future.


  1. Spoiler alert: it was added in the next release. 

  2. A key derivation function is used to “stretch” a password of somewhat limited length into a longer byte sequence that can be used as a cryptographic key. Doing this naively can significantly reduce the security of a system, so using established algorithms is strongly recommended. 

  3. These are the objects returned by the finder, and which are responsible for actually loading the module from disk into memory. If you want to know more, you can find more details in the Importer Protocol section of PEP 302.

  4. AST stands for Abstract Syntax Trees, and represents a normalised intermediate form for Python code which has been parsed but not yet compiled. 

  5. That’s a duck typing joke. Also, why do you never see ducks typing? They’ve never found a keyboard that quite fits the bill. That was another duck typing joke, although I use the word “joke” in its loosest possible sense in both cases. 

  6. For a more in-depth discussion of some of the issues using fork() in multithreaded code, this article has some good discussion. 

  7. Spoiler alert: this method is only applicable until Python 3.7 where it was deprecated in favour of a newer sni_callback attribute. The semantics are similar, however. 

  8. It’s useful to have support for it in software, of course, because you don’t necessarily have control of the formats you need to open. But as Benjamin Zwickel convincingly explains in The 24-Bit Delusion, 24-bit audio is generally a waste of time since audio equipment cannot reproduce it accurately enough. 

  9. For reference, this is what PEP 442 refers to as a cyclic isolate.

Photo by David Clode on Unsplash