Python 2to3: What’s New in 3.4 - Part 2

8 Apr 2021 at 11:47PM in Software

This is part 7 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.4 - Part 1.

In this series looking at features introduced by every version of Python 3, this one is the second of two covering release 3.4. We look at improvements to the way multiprocessing spawns child processes, various powerful new facilities for code introspection, improvements to garbage collection, and a lot more besides.


In this article we conclude our look at Python 3.4 which started with the previous one in this series. Last time we took a look at the ensurepip module, file descriptor inheritance changes, the codecs module, and a series of other new modules which were added to the library. In this article we’ll be looking at a host of changes that have been made to existing modules, some long-awaited improvements to garbage collection and a few other small details.

Library Enhancements

The bulk of this article is going to be looking at changes to modules in the standard library. As usual, I’ve tried to group them by category to make things somewhat more approachable, and we’re kicking off with a category that I never even really knew existed in the standard library.

Audio

This release contained some improvements in a handful of modules for dealing with audio formats, and it wasn’t until I looked into the changes in these modules that I even knew they were there. This is one of the reasons I like to write these articles, so I’m including the changes here at least partly just to mention them in case anyone else was similarly unaware of their existence.

First up, the aifc module allows read/write access to AIFF and AIFF-C format files. This module has had some small tweaks:

  • getparams() now returns a namedtuple instead of a plain tuple.
  • aifc.open() can now be used as a context manager.
  • writeframesraw() and writeframes() now accept any bytes-like object.

Next we have the audioop module, which provides useful operations on raw audio fragments, such as converting between mono and stereo, converting between different raw audio formats, and searching for a snippet of audio within a larger fragment. As of Python 3.4, this module now offers a byteswap() method for endian conversion of all samples in a fragment, and also all functions now accept any bytes-like object.
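
To give a feel for byteswap(), here's a minimal sketch of my own (not from the docs) with a fragment containing two 16-bit samples, the sample width being the second argument:

>>> import audioop
>>> # Two 16-bit samples; with width 2 each pair of bytes is swapped.
>>> audioop.byteswap(b"\x00\x01\x00\x02", 2)
b'\x01\x00\x02\x00'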

Next is the sunau module, which allows read/write access to Au format audio files. Its first three tweaks are essentially the same as the aifc ones I mentioned above, so I won’t repeat them. The final change is that AU_write.setsamplewidth() now supports 24-bit samples8.

Likewise the wave module has those same three changes as well. Additionally it can now write output to file descriptors which don’t support seeking, although in these cases the number of frames in the header must be correct when it’s first written.

Concurrency

The multiprocessing module has had a few changes. First up is the concept of start methods, which gives the programmer control of how subprocesses are created. It’s especially useful to exercise this control when mixing threads and processes. There are three methods now supported on Unix, although spawn is the only option on Windows:

spawn
A fresh Python interpreter is started in a new child process. This is potentially quite slow compared to the other methods, but it does mean the child process doesn’t inherit any unnecessary file descriptors, and there are no potential issues with other threads because it’s a clean process. Under Unix this is achieved with a standard fork() and exec() pair. This is the default (and only) option on Windows.
fork
Uses fork() to create a child process, but doesn’t exec() into a new instance of the interpreter. As mentioned at the start of the article, file descriptors will still not be inherited by default unless the programmer has explicitly set them to be inheritable. In multithreaded code, however, there can still be problems using a bare fork() like this. The call replicates the entire address space of the process as-is, but only the currently executing thread survives into the child. If another thread happens to have a mutex held when the current thread calls fork(), for example, that mutex will still be held in the child process but the thread holding it is no longer extant, so the mutex will never be released6.
forkserver
The usual solution to mixing fork() and multithreaded code is to make sure you call fork() before any other threads are spawned. Since the current thread is the only one that’s ever existed up to that point, and it survives into the child process, there’s no chance for process-wide state to be left inconsistent. This solution is exactly what the forkserver method provides. In this case, a separate process is created at startup, and this is used to fork all the new child processes. A Unix domain socket is created to communicate between the main process and the fork server. When a new child is created, two pipes are created to send work to the child process and receive the exit status back, respectively. In the forkserver method, the client end file descriptors for these pipes are sent over the UDS to the fork server process. As a result, this method is only available on OSs that support sending FDs over UDSs (e.g. Linux). Note that the child process that the fork server process creates does not require a UDS, it inherits what it needs using standard fork() semantics.

This last model is a bit of a delicate dance, so out of interest I sniffed around the code and drew up this sequence diagram to illustrate how it happens.

forkserver sequence diagram

To set and query which of these methods is in use globally, the multiprocessing module provides get_start_method() and set_start_method(), and you can choose from any of the methods returned by get_all_start_methods().

As well as this you can now create a context with get_context(). This allows the start method to be set for a specific context, and the context object shares the same API as the multiprocessing module so you can just use methods on the object instead of the module functions to utilise the settings of that particular context. Any worker pools you create are specific to that context. This allows different libraries interoperating in the same application to avoid interfering with each other by each creating their own context instead of having to mess with global state.
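
To make this a little more concrete, here’s a minimal sketch of using a per-context start method; the double() worker is just illustrative, and I’ve used spawn since it’s available on every platform:

import multiprocessing

def double(value):
    return value * 2

if __name__ == "__main__":
    # The start methods supported on this platform.
    print(multiprocessing.get_all_start_methods())
    # A context with its own start method, leaving the global default alone.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(double, [1, 2, 3]))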

The threading module also has a minor improvement in the form of the main_thread() function, which returns a Thread object representing the main thread of execution.

Cryptography

hashlib now provides a pbkdf2_hmac() function implementing the commonly used PKCS#5 key derivation function2. This applies an existing hash digest algorithm (e.g. SHA-256), combined with a salt, over a specified number of iterations. As usual, the salt must be preserved so that the process can be repeated to generate the same secret key from the same credential consistently in the future.

>>> import hashlib
>>> import os
>>>
>>> salt = os.urandom(16)
>>> hashlib.pbkdf2_hmac("sha256", b"password", salt, 100000)
b'Vwq\xfe\x87\x10.\x1c\xd8S\x17N\x04\xda\xb8\xc3\x8a\x14C\xf1\x10F\x9eaQ\x1f\xe4\xd04%L\xc9'

The hmac.new() function now accepts bytearray as well as bytes for the key, and the type of the data fed in may be any of the types accepted by hashlib. Also, the digest algorithm passed to new() may be any of the names recognised by hashlib, and the choice of MD5 as a default is deprecated — in future there will be no default.
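
As a brief sketch of the new flexibility, the key below is a bytearray and the digest is named with the same string hashlib would accept:

>>> import hmac
>>> mac = hmac.new(bytearray(b"secret key"), b"some message", "sha256")
>>> len(mac.hexdigest())
64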

Diagnostics & Testing

The dis module for disassembling bytecode has had some facilities added to allow user code better programmatic access. There’s a new Instruction class representing a bytecode instruction, with appropriate attributes for inspecting it, and a get_instructions() function which takes a callable and yields the bytecode instructions that comprise it as Instruction instances. For those who prefer a more object-oriented interface, the new Bytecode class offers similar facilities.

>>> import dis
>>>
>>> def func(arg):
...     print("Arg value: " + str(arg))
...     return arg * 2
>>>
>>> for instr in dis.get_instructions(func):
...     print(instr.offset, instr.opname, instr.argrepr)
...
0 LOAD_GLOBAL print
3 LOAD_CONST 'Arg value: '
6 LOAD_GLOBAL str
9 LOAD_FAST arg
12 CALL_FUNCTION 1 positional, 0 keyword pair
15 BINARY_ADD
16 CALL_FUNCTION 1 positional, 0 keyword pair
19 POP_TOP
20 LOAD_FAST arg
23 LOAD_CONST 2
26 BINARY_MULTIPLY
27 RETURN_VALUE

inspect, which provides functions for introspecting runtime objects, has also had some features added in 3.4. First up is a command-line interface: by executing the module and passing a module name, or a specific function or class within that module, the source code will be displayed. Or if --details is passed then information about the specified object will be displayed instead.

$ python -m inspect shutil:copy
def copy(src, dst, *, follow_symlinks=True):
    """Copy data and mode bits ("cp src dst"). Return the file's destination.

    The destination may be a directory.

    If follow_symlinks is false, symlinks won't be followed. This
    resembles GNU's "cp -P src dst".

    If source and destination are the same file, a SameFileError will be
    raised.

    """
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    copyfile(src, dst, follow_symlinks=follow_symlinks)
    copymode(src, dst, follow_symlinks=follow_symlinks)
    return dst

$ python -m inspect --details shutil:copy
Target: shutil:copy
Origin: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/shutil.py
Cached: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/__pycache__/shutil.cpython-34.pyc
Line: 214

$ python -m inspect --details shutil
Target: shutil
Origin: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/shutil.py
Cached: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/__pycache__/shutil.cpython-34.pyc
Loader: <_frozen_importlib.SourceFileLoader object at 0x1051fa518>

Next there’s a new unwrap() function which is used to recover the original function that’s been wrapped by decorators. It works by following the chain of __wrapped__ attributes, which are set by the functools.wraps() decorator, or anything else that calls functools.update_wrapper().

>>> import functools
>>> import inspect
>>>
>>> def some_decorator(func):
...     @functools.wraps(func)
...     def wrapper_func(*args, **kwargs):
...         print("Calling " + func.__name__)
...         print("  - Args: " + repr(args))
...         print("  - KW args: " + repr(kwargs))
...         ret_val = func(*args, **kwargs)
...         print("Return from " + func.__name__ + ": " + repr(ret_val))
...         return ret_val
...     return wrapper_func
...
>>> @some_decorator
... def some_other_func(arg):
...     """Just doubles something."""
...     print("Prepare to be amazed as I double " + repr(arg))
...     return arg * 2
...
>>> some_other_func(123)
Calling some_other_func
  - Args: (123,)
  - KW args: {}
Prepare to be amazed as I double 123
Return from some_other_func: 246
246
>>> some_other_func("hello")
Calling some_other_func
  - Args: ('hello',)
  - KW args: {}
Prepare to be amazed as I double 'hello'
Return from some_other_func: 'hellohello'
'hellohello'
>>>
>>> some_other_func.__name__
'some_other_func'
>>> some_other_func.__doc__
'Just doubles something.'
>>>
>>> inspect.unwrap(some_other_func)(123)
Prepare to be amazed as I double 123
246

In an earlier article on Python 3.3, I spoke about the introduction of the inspect.signature() function. In Python 3.4 the existing inspect.getfullargspec() function, which returns information about a specified function’s parameters, is now based on signature() which means it supports a broader set of callables. One difference is that getfullargspec() still ignores __wrapped__ attributes, unlike signature(), so if you’re querying decorated functions then you may still need the latter.

On the subject of signature(), that has also changed in this release so that it no longer checks the type of the object passed in, but instead will work with anything that quacks like a function5. This now allows it to work with Cython functions, for example.

The logging module has a few tweaks. TimedRotatingFileHandler can now specify the time of day at which file rotation should happen, and SocketHandler and DatagramHandler now support Unix domain sockets by setting port=None. The configuration interface is also a little more flexible, as a configparser.RawConfigParser instance (or a subclass of it) can now be passed to fileConfig(), which allows an application to embed logging configuration in part of a larger file. On the same topic of configuration, the logging.config.listen() function, which spawns a thread listening on a socket for updated logging configurations for live modification of logging in a running process, can now be passed a validation function which is used to sanity check updated configurations before applying them.
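
As a quick hedged sketch of the rotation time feature, the handler below would roll its file over at 02:30 each day rather than at midnight (the filename and retention count are just illustrative):

import datetime
import logging
import logging.handlers

# Roll over daily, but at 02:30 rather than the default of midnight.
handler = logging.handlers.TimedRotatingFileHandler(
    "app.log", when="midnight", backupCount=7,
    atTime=datetime.time(hour=2, minute=30))
logging.getLogger().addHandler(handler)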

The pprint module has had a couple of updates to deal more gracefully with long output. Firstly, there’s a new compact parameter which defaults to False. If you pass True then sequences are printed with as many items per line as will fit within the specified width, which defaults to 80. Secondly, long strings are now split over multiple lines using Python’s standard line continuation syntax.

>>> pprint.pprint(x)
{'key1': {'key1a': ['tis', 'but', 'a', 'scratch'],
          'key1b': ["it's", 'just', 'a', 'flesh', 'wound']},
 'key2': {'key2a': ["he's", 'not', 'the', 'messiah'],
          'key2b': ["he's", 'a', 'very', 'naughty', 'boy'],
          'key2c': ['alright',
                    'but',
                    'apart',
                    'from',
    # ... items elided from output for brevity ...
                    'ever',
                    'done',
                    'for',
                    'us']}}
>>> pprint.pprint(x, compact=True, width=75)
{'key1': {'key1a': ['tis', 'but', 'a', 'scratch'],
          'key1b': ["it's", 'just', 'a', 'flesh', 'wound']},
 'key2': {'key2a': ["he's", 'not', 'the', 'messiah'],
          'key2b': ["he's", 'a', 'very', 'naughty', 'boy'],
          'key2c': ['alright', 'but', 'apart', 'from', 'the',
                    'sanitation', 'the', 'medicine', 'education', 'wine',
                    'public', 'order', 'irrigation', 'roads', 'the',
                    'fresh-water', 'system', 'and', 'public', 'health',
                    'what', 'have', 'the', 'romans', 'ever', 'done',
                    'for', 'us']}}
>>> pprint.pprint(" ".join(x["key2"]["key2c"]), width=50)
('alright but apart from the sanitation the '
 'medicine education wine public order '
 'irrigation roads the fresh-water system and '
 'public health what have the romans ever done '
 'for us')

In the sys module there’s also a new function getallocatedblocks() which is a lighter-weight alternative to the new tracemalloc module described in the previous article. This function simply returns the number of blocks currently allocated by the interpreter, which is useful for tracing memory leaks. Since it’s so lightweight, you could easily have all your Python applications publish or log this metric at intervals to check for concerning behaviour like monotonically increasing usage.

One quirk I found is that the first time you call it, it seems to perform some allocations, so you want to call it at least twice before doing any comparisons to make sure it’s in a steady state. This behaviour may change on different platforms and Python releases, so just something to keep an eye on.

>>> import sys
>>> sys.getallocatedblocks()
17553
>>> sys.getallocatedblocks()
17558
>>> sys.getallocatedblocks()
17558
>>> x = "hello, world"
>>> sys.getallocatedblocks()
17559
>>> del x
>>> sys.getallocatedblocks()
17558

Yet more good news for debugging and testing are some changes to the unittest module. First up is subTest() which can be used as a context manager to allow one test method to generate multiple test cases dynamically. See the simple code below for an example.

>>> import unittest
>>>
>>> class SampleTest(unittest.TestCase):
...     def runTest(self):
...         for word in ("one", "two", "three", "four"):
...             with self.subTest(testword=word):
...                 self.assertEqual(len(word), 3)
...
>>> unittest.TextTestRunner(verbosity=2).run(SampleTest())
runTest (__main__.SampleTest) ...
======================================================================
FAIL: runTest (__main__.SampleTest) (testword='three')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 1, in runTest
AssertionError: 5 != 3

======================================================================
FAIL: runTest (__main__.SampleTest) (testword='four')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 1, in runTest
AssertionError: 4 != 3

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=2)

In addition to this, test discovery via TestLoader.discover() or python -m unittest discover, now sorts test cases consistently between runs which makes it much easier to compare them.

There’s also a new assertLogs() context manager, which can be used to ensure that code under test emits a log entry. By default this checks for any message of at least INFO level being emitted by any logger, but these parameters can be overridden. In general I don’t think it’s a good idea to tightly couple unit test cases with logging, since it can make things brittle — but there are cases where it’s important to log, such as specific text in a log file triggering an alert somewhere else. In these cases it’s important to catch cases where someone might change or remove the log entry without realising its importance, and being able to do so without explicitly mocking the logging library yourself will prove quite handy.
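
Here’s a minimal sketch of how that looks; the logger name and message are purely illustrative:

import logging
import unittest

class LoggingTest(unittest.TestCase):
    def test_warning_logged(self):
        with self.assertLogs("myapp", level="WARNING") as captured:
            logging.getLogger("myapp").warning("disk nearly full")
        # captured.output holds "LEVEL:logger:message" strings.
        self.assertEqual(captured.output,
                         ["WARNING:myapp:disk nearly full"])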

Internet

Following on from the policy framework added to the email package in Python 3.3, this release adds support for passing a policy argument to the as_string() method when generating string representations of messages. There is also a new as_bytes() method which is equivalent but returns bytes instead of str.

Another change in email is the addition of two subclasses of Message, which are EmailMessage and MIMEPart. The former should be used to represent email messages going forward and has a new default policy, with the base class Message being reserved for backwards compatibility using the compat32 policy. The latter represents a subpart of a MIME message and is identical to EmailMessage except for the omission of some headers which aren’t required for subparts.

Finally in email there’s a new module contentmanager which offers better facilities for managing message payloads. Currently this offers a ContentManager base class and a single concrete derivation, raw_data_manager, which is the one used by the default EmailPolicy. This offers some basic facilities for doing encoding/decoding to/from bytes and handling of headers for each message part. The contentmanager module also offers facilities for you to register your own managers if you would like to do so.

Looking at the http module briefly, the BaseHTTPRequestHandler.send_error() method, which is used to send an error response to the client, now offers an explain parameter. Along with the existing optional message parameter, these can be set to override the default text for each HTTP error code that’s normally sent.

The response is formatted using the contents of the error_message_format attribute, which you can override; the default is shown below. You can see how the new %(explain)s expansion will be presented in the error.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
        <title>Error response</title>
    </head>
    <body>
        <h1>Error response</h1>
        <p>Error code: %(code)d</p>
        <p>Message: %(message)s.</p>
        <p>Error code explanation: %(code)s - %(explain)s.</p>
    </body>
</html>

The ipaddress module was provisional in Python 3.3, but is now considered a stable interface as of 3.4. In addition the IPv4Address and IPv6Address classes now offer an is_global attribute, which is True if the address is intended to be globally routable (i.e. is not reserved as a private address). At least this is what the documentation indicates — in practice, I found that only IPv6Address offers this feature, it’s missing from IPv4Address. Looks like this was noted and fixed in issue 21386 on the Python tracker, but that fix didn’t make it out until Python 3.5.

In any case, here’s an example of it being used for some IPv6 addresses:

>>> import ipaddress
>>> # Localhost address
... ipaddress.ip_address("::1").is_global
False
>>> # Private address range
... ipaddress.ip_address("fd12:3456:789a:1::1").is_global
False
>>> # python.org
... ipaddress.ip_address("2a04:4e42:4::223").is_global
True

The poplib module has a couple of extra methods on its POP3 objects. Firstly there’s capa(), which returns the list of capabilities advertised by the POP server. Secondly there’s stls(), which issues the STLS command to upgrade a clear-text connection to SSL as specified by RFC 2595. For those familiar with it, this is very similar in operation to IMAP’s STARTTLS command.

In smtplib the exception type SMTPException is now a subclass of OSError, which allows both socket and protocol errors to be caught together and handled in a consistent way, in case that simplifies application logic. This sort of change highlights how important it is to pick the right base class for your exceptions in a library, because you may be able to make life considerably simpler for some of your users if you get it right.

The socket library has a few minor updates, the first being the new get_inheritable() and set_inheritable() methods on socket objects to change their inheritability, as we discussed in the previous article. Also continuing from an earlier article on release 3.3, the PF_CAN socket family has a new member: CAN_BCM, the broadcast manager protocol. But unless you’re writing Python code to run on a vehicle messaging bus then you can safely disregard this one.

One nice touch is that socket.AF_* and socket.SOCK_* constants are now defined in terms of the new enum module which we covered in the previous article. This means we can get some useful values out in log trace instead of magic numbers that we need to look up. The other change in this release is for Windows users, who can now enjoy inet_pton() and inet_ntop() for added IPv6 goodness.

There are some more extensive changes to the ssl module. Firstly, TLS v1.1 and v1.2 support has been added, using PROTOCOL_TLSv1_1 and PROTOCOL_TLSv1_2 respectively. This is where I have to remind myself that Python 3.4 was released in March 2014, as by 2021’s standards these versions are looking long in the tooth, being defined in 2006 and 2008 respectively. Indeed, all the major browser vendors deprecated TLS 1.0 and 1.1 in March 2020.

Secondly, there’s a handy convenience function create_default_context() for creating an SSLContext with some sensible settings to provide reasonable security. These are stricter than the defaults in the SSLContext constructor, and are also subject to change if security best practices evolve. This gives code a better chance to stay up to date with security practices via a simple Python version upgrade, although I assume the downside is a slightly increased chance of introducing issues if (say) a client’s Python version is updated but the server is still using outdated settings so they fail to negotiate a mutually agreeable protocol version.

One detail about the create_default_context() function that I like is its purpose parameter, which selects different sets of parameter values for different purposes. This release includes two purposes: SERVER_AUTH, the default, is for client-side connections which need to authenticate servers, and CLIENT_AUTH is for server-side connections which authenticate clients.
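
A minimal sketch of selecting each purpose looks like this:

>>> import ssl
>>> # Client-side context for validating the servers we connect to.
>>> client_ctx = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH)
>>> # Server-side context, for servers which authenticate their clients.
>>> server_ctx = ssl.create_default_context(purpose=ssl.Purpose.CLIENT_AUTH)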

The SSLContext class method load_verify_locations() has a new cadata parameter, which allows certificates to be passed directly in PEM- or DER-encoded forms. This is in contrast to the existing cafile and capath parameters which both require certificates to be stored in files.

There’s a new function get_default_verify_paths() which returns the current list of paths OpenSSL will check for a default certificate authority (CA). These values are the same ones that are set with the existing set_default_verify_paths(). This will be useful for debugging: with encryption you want as much transparency as you can possibly get, because it can be very challenging to figure out the source of issues when your only feedback is generally a “yes” or “no”.

On the theme of transparency, SSLContext now has a cert_store_stats() method which returns statistics on the number of certificates loaded, and also a get_ca_certs() method to return a list of the currently loaded CA certificates.

A welcome addition is the ability to customise the certificate verification process by setting the verify_flags attribute on an SSLContext. This can be set by ORing together one or more flags. This release defines the following flags, which relate to checks against certificate revocation lists (CRLs); there’s a brief sketch of setting them after the list:

VERIFY_DEFAULT
Does not check any certificates against CRLs.
VERIFY_CRL_CHECK_LEAF
Only the peer certificate is checked against CRLs, not any of the intermediate CA certificates in the chain of trust. Requires a CRL signed by the peer certificate’s issuer (i.e. its direct ancestor CA) to be loaded with load_verify_locations(), or validation will fail.
VERIFY_CRL_CHECK_CHAIN
In this mode, all certificates in the chain of trust are checked against their CRLs.
VERIFY_X509_STRICT
Also checks the full chain, but additionally disables workarounds for broken X.509 certificates.
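
Here’s the sketch promised above; the certificate bundle path is purely illustrative, and would need to contain a CRL issued by the peer certificate’s CA for the leaf check to pass:

>>> import ssl
>>> context = ssl.create_default_context()
>>> context.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF
>>> # CRLs are loaded in the same way as CA certificates.
>>> context.load_verify_locations(cafile="/path/to/ca-and-crl.pem")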

Another useful addition for common users, the load_default_certs() method on SSLContext loads a set of standard CA certificates from locations which are platform-dependent. Note that if you use create_default_context() and you don’t pass your own CA certificate store, this method will be called for you.

>>> import pprint
>>> import ssl
>>>
>>> context = ssl.SSLContext(protocol=ssl.PROTOCOL_TLSv1_2)
>>> len(context.get_ca_certs())
0
>>> context.load_default_certs()
>>> len(context.get_ca_certs())
163
>>> pprint.pprint(context.get_ca_certs()[151])
{'issuer': ((('countryName', 'US'),),
            (('organizationName', 'VeriSign, Inc.'),),
            (('organizationalUnitName', 'VeriSign Trust Network'),),
            (('organizationalUnitName',
              '(c) 2008 VeriSign, Inc. - For authorized use only'),),
            (('commonName',
              'VeriSign Universal Root Certification Authority'),)),
 'notAfter': 'Dec  1 23:59:59 2037 GMT',
 'notBefore': 'Apr  2 00:00:00 2008 GMT',
 'serialNumber': '401AC46421B31321030EBBE4121AC51D',
 'subject': ((('countryName', 'US'),),
             (('organizationName', 'VeriSign, Inc.'),),
             (('organizationalUnitName', 'VeriSign Trust Network'),),
             (('organizationalUnitName',
               '(c) 2008 VeriSign, Inc. - For authorized use only'),),
             (('commonName',
               'VeriSign Universal Root Certification Authority'),)),
 'version': 3}

You may recall from the earlier article on Python 3.2 that client-side support for SNI (Server Name Indication) was added then. Well, Python 3.4 adds server-side support for SNI. This is achieved using the set_servername_callback() method7 of SSLContext, which registers a callback function which is invoked when the client uses SNI. The callback is invoked with three arguments: the SSLSocket instance, a string indicating the name the client has requested, and the SSLContext instance. A common role for this callback is to swap out the SSLContext attached to the socket for one which matches the server name that’s being requested — otherwise the certificate will fail to validate.
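
Here’s a rough sketch of what such a callback might look like; the contexts mapping and the certificate setup behind it are hypothetical and elided:

import ssl

# Hypothetical per-hostname contexts, each loaded with its own certificate.
contexts = {"example.com": ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)}
default_context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)

def sni_callback(ssl_sock, server_name, initial_context):
    # Swap in the context matching the requested name, if we have one.
    if server_name in contexts:
        ssl_sock.context = contexts[server_name]
    # Returning None lets the handshake continue; an ALERT_DESCRIPTION_*
    # constant can be returned instead to abort it.
    return None

default_context.set_servername_callback(sni_callback)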

Finally in ssl, Windows users get two additional functions, enum_certificates() and enum_crls() which can retrieve certificates and CRLs from the Windows certificate store.

There are a number of improvements in the urllib.request module. It now supports URIs using the data: scheme with the DataHandler class. The HTTP method used by the Request class can be specified by overriding the method class attribute in a subclass. It’s also now possible to safely reuse Request objects — updating full_url or data causes all relevant internal state to be updated as appropriate. This means you can set up a template Request and then use that for multiple individual requests which differ only in the URL or the request body data.

Also in urllib, HTTPError exceptions now have a headers attribute which contains the HTTP headers from the response which triggered the error.

Language Support

A few changes have been made to some of the modules that support core language features.

First up is the abc module for defining abstract base classes. Previously, abstract base classes were defined using the metaclass keyword parameter to the class definition, which could sometimes confuse people:

import abc

class MyClass(metaclass=abc.ABCMeta):
    ...

Now there’s an abc.ABC base class so you can instead use this rather more readable version:

class MyClass(abc.ABC):
    ...

Next a useful change in contextlib, which now offers a suppress context manager to ignore exceptions in its block. If any of the listed exceptions occur, they are ignored and execution jumps to just outside the with block. This is really just a more concise and/or better self-documenting way of catching then ignoring the exceptions yourself.

>>> import contextlib
>>>
>>> with contextlib.suppress(OSError, IOError):
...     with open("/tmp/canopen", "w") as fd:
...         print("Success in /tmp")
...     with open("/cannotopen", "w") as fd:
...         print("Success in /")
...     print("Done both files")
...
Success in /tmp
>>>

There’s also a new redirect_stdout() context manager which temporarily redirects sys.stdout to any other stream, including io.StringIO to capture the output in a string. This is useful for dealing with poorly-designed code which writes its errors directly to standard output instead of raising exceptions. Oddly there’s no equivalent redirect_stderr() to match this, however1.
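
For example, capturing output that would otherwise go to the terminal (a minimal sketch):

>>> import contextlib
>>> import io
>>>
>>> buffer = io.StringIO()
>>> with contextlib.redirect_stdout(buffer):
...     print("this would normally go to stdout")
...
>>> buffer.getvalue()
'this would normally go to stdout\n'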

Moving on there are some improvements to the functools module. First up is partialmethod() which works like partial() except that it’s used for defining partial specialisations of methods instead of direct callables. It supports descriptors like classmethod(), staticmethod(), and so on, and also any method that accepts self as the first positional argument.

import functools

class Host:

    def __init__(self):
        self.set_host_down()

    @property
    def state(self):
        return self._state

    def set_state(self, new_state):
        self._state = new_state

    set_host_up = functools.partialmethod(set_state, "up")
    set_host_down = functools.partialmethod(set_state, "down")

In the code above, set_host_up() and set_host_down() can be called as normal methods with no parameters, and just indirect into set_state() with the appropriate argument passed.

The other addition to functools is the singledispatch decorator. This allows the creation of a generic function which calls into one of several separate underlying implementation functions based on the type of the first parameter. The code below illustrates a generic function which calculates the square of an integer from several possible input types:

import functools

@functools.singledispatch
def my_func(value):
    return value ** 2

@my_func.register(str)
@my_func.register(bytes)
def _(value):
    return int(value) ** 2

@my_func.register(float)
def _(value):
    return round(value) ** 2
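
Calling my_func() then dispatches on the type of the first argument, so with the registrations above you’d expect something like this:

>>> my_func(9)
81
>>> my_func("12")
144
>>> my_func(2.6)
9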

The importlib module has also had some more attention in this release. First up is a change to InspectLoader, the abstract base class for loaders3. This now has a method source_to_code() which converts Python source code to executable byte code. The default implementation calls the builtin compile() with appropriate arguments, but it would be possible to override this method to add other features — for example, to use ast.parse() to obtain the AST4 of the code, then manipulate it somehow (e.g. to implement some optimisation), and then finally use compile() to convert this to executable Python code.

Also in InspectLoader, the get_code() method, which used to be abstract, now has a concrete default implementation. This is responsible for returning the code object for a module. The documentation states that it should be overridden for performance reasons if possible, however, as the default one uses the get_source() method, which can be a somewhat expensive operation since it has to decode the source and do universal newline conversion.

Speaking of get_source(), there’s a new importlib.util.decode_source() function that decodes source from bytes with universal newline processing — this is quite useful for implementing get_source() methods easily.

Potentially of interest to more people, imp.reload() is now importlib.reload(), as part of the ongoing deprecation of the imp module. In a similar vein, imp.get_magic() is replaced by importlib.util.MAGIC_NUMBER, and both imp.cache_from_source() and imp.source_from_cache() have moved to importlib.util as well.

Following on from the discussion of namespace packages in the last article, the NamespaceLoader used now conforms to the InspectLoader interface, which has the concrete benefit that the runpy module, and hence the python -m <module> command-line option, now work with namespace packages too.

Finally in importlib, the ExtensionFileLoader in importlib.machinery has now received a get_filename() method, whose omission was simply an oversight in the original implementation.

The new descriptor DynamicClassAttribute has been added to the types module. You use this in cases where you want an attribute that acts differently based on whether it’s been accessed through an instance or directly through the class. It seems that the main use-case for this is when you want to define class attributes on a base class, but still allow subclasses to reuse the same names for their properties without conflicting. For this to work you need to define a __getattr__() method in your base class, but since this is quite an obscure little corner I’ll leave the official types documentation to go into more detail. I’ll just leave you with a code sample that illustrates its use:

import types

class MyMetaclass(type):
    def __getattr__(self, name):
        if name == "some_property":
            return MyMetaclass.some_property

    some_property = "meta"

class MyClass(metaclass=MyMetaclass):
    def __init__(self):
        self._some_property = "concrete"

    # Replace this decorator with @property to see the difference.
    @types.DynamicClassAttribute
    def some_property(self):
        return self._some_property

instance = MyClass()
print(instance.some_property)   # Should print "concrete"
print(MyClass.some_property)    # Should print "meta"

And to close off our language support features there are a handful of changes to the weakref module. First of all, the WeakMethod class has been added for taking a weak reference to a bound method. You can’t use a standard weak reference because bound methods are ephemeral, they only exist while they’re being called unless there’s another variable keeping a reference to them. Therefore, if the weak reference was the only reference then it wouldn’t be enough to keep them alive. Thus the WeakMethod class was added to simulate a weak reference to a bound method by re-creating the bound method as required until either the instance or the method no longer exist.

This class follows standard weakref.ref semantics, where calling the weak reference returns either None or the object itself. Since the object in this example is a callable, we need another pair of brackets to call it. This explains the m2()() you’ll see in the snippet below.

>>> import weakref
>>>
>>> class MyClass:
...     def __init__(self, value):
...         self._value = value
...     def my_method(self):
...         print("Method called")
...         return self._value
...
>>> instance = MyClass(123)
>>> m1 = weakref.ref(instance.my_method)
>>> # Standard weakrefs don't work here.
... repr(m1())
'None'
>>> m2 = weakref.WeakMethod(instance.my_method)
>>> repr(m2())
'<bound method MyClass.my_method of <__main__.MyClass object at 0x10abc8f60>>'
>>> repr(m2()())
Method called
'123'
>>> # Here you can see the bound method is re-created each time.
... m2() is m2()
False
>>> del instance
>>> # Now we've deleted the object, the method is gone.
... repr(m2())
'None'

There’s also a new class weakref.finalize which allows you to install a callback to be invoked when an object is garbage-collected. In this regard it works a bit like an externally installed __del__() method. You pass in an object instance and a callback function as well as, optionally, parameters to be passed to the callback. The finalize object is returned, but even if you delete this reference it remains installed and the callback will still be called when the object is destroyed. This includes when the interpreter exits, although you can set the atexit attribute to False to prevent this.

>>> import sys
>>> import weakref
>>>
>>> class MyClass:
...     pass
...
>>> def callback(arg):
...     print("Callback called with {}".format(arg))
...
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "one")
>>> # Deleting the finalize instance makes no difference
... del finalizer
>>> # The callback is still called when the instance is GC.
... del instance
Callback called with one
>>>
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "two")
>>> # You can trigger the callback earlier if you like.
... finalizer()
Callback called with two
>>> finalizer.alive
False
>>> # It's only called once, so it now won't fire on deletion.
... del instance
>>>
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "three")
>>> finalizer.atexit
True
>>> # Callback is invoked at system exit, if atexit=True
... sys.exit(0)
Callback called with three

Markup Languages

The html module has sprouted a handy little unescape() function which converts HTML character entities back to their unicode equivalents.

>>> import html
>>> html.unescape("I spent &pound;50 on this &amp; that")
'I spent £50 on this & that'
>>> html.unescape("&pi; is the &numero;1 mathematical constant")
'π is the №1 mathematical constant'

The HTMLParser class has been updated to take advantage of this, so now there’s a convert_charrefs parameter that, if True, performs this conversion. For backwards-compatibility it defaults to False, but the documentation warns this will flip to True in a future release.
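
Here’s a minimal sketch of the difference, using a trivial subclass which just prints the data it’s handed:

>>> from html.parser import HTMLParser
>>>
>>> class Printer(HTMLParser):
...     def handle_data(self, data):
...         print(repr(data))
...
>>> Printer(convert_charrefs=True).feed("<p>Fish &amp; Chips</p>")
'Fish & Chips'
>>> Printer(convert_charrefs=False).feed("<p>Fish &amp; Chips</p>")
'Fish '
' Chips'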

The xml.etree module has also seen some changes, with a new XMLPullParser parser being added. This is intended for applications which can’t perform blocking reads of the data for any reason. Data is fed into the parser incrementally with the feed() method, and instead of the callback method approach used by XMLParser the XMLPullParser relies on the application to call a read_events() method to collect any parsed items found so far. I’ve found this sort of incremental parsing model really useful in the past where you may be parsing particularly large documents, since often you can process the information incrementally into some other useful data structure and save a lot of memory, so it’s worthwhile getting familiar with this class.

Each call to the read_events() method will yield a generator which allows you to iterate through the events. Once an item is read from the generator it’s removed from the list, but the call to read_events() itself doesn’t clear anything, so you don’t need to worry about avoiding partial reads of the generator before dropping it — the remaining events will still be there on your next call to read_events(). That said, creating multiple such generators and using them in parallel could have unpredictable results, and spanning them across threads is probably a particularly bad idea.

One important point to note is that if there is an error parsing the document, then this method is where the ParseError exception will be raised. This implies that the feed() method just adds text to an input buffer and all the actual parsing happens on-demand in read_events().

Each item yielded will be a 2-tuple of the event type and a payload which is event-type-specific. On the subject of event types, the constructor of XMLPullParser takes a list of event types that you’re interested in, which defaults to just end events. The event types you can specify in this release are:

start
An opening tag has been seen. The payload is the Element object.
end
A closing tag has been seen. The payload is the completed Element object.
start-ns
The start of a namespace mapping. The payload is a (prefix, uri) tuple.
end-ns
The end of a namespace mapping. The payload is None.

It’s worth noting that the start event is raised as soon as the end of the opening tag is seen, so the Element object won’t have any text or tail attributes. If you care about these, probably best to just filter on end events, where the entire element is returned. The start events are mostly useful so you can see the context in which intervening tags are used, including any attributes defined within the containing opening tag.

The start-ns event is generated prior to the opening tag which specifies the namespace prefix, and the end-ns event is generated just after its matching closing tag. In the tags that follow which use the namespace prefix the URI will be substituted in, since really the prefix is just an alias for the URI.

Here’s an example of its use showing that only events for completed items are returned, and showing what happens if the document is malformed:

>>> import xml.etree.ElementTree as ET
>>> import pprint
>>>
>>> parser = ET.XMLPullParser(("start", "end"))
>>> parser.feed("<document><one>Sometext</one><two><th")
>>> pprint.pprint(list(parser.read_events()))
[('start', <Element 'document' at 0x1057d2728>),
 ('start', <Element 'one' at 0x1057d2b38>),
 ('end', <Element 'one' at 0x1057d2b38>),
 ('start', <Element 'two' at 0x1057d2b88>)]
>>> parser.feed("ree>Moretext</three><four>Yet")
>>> pprint.pprint(list(parser.read_events()))
[('start', <Element 'three' at 0x1057d2c28>),
 ('end', <Element 'three' at 0x1057d2c28>),
 ('start', <Element 'four' at 0x1057d2c78>)]
>>> parser.feed("moretext</closewrongtag></two></document>")
>>> pprint.pprint(list(parser.read_events()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andy/.pyenv/versions/3.4.10/lib/python3.4/xml/etree/ElementTree.py", line 1281, in read_events
    raise event
  File "/Users/andy/.pyenv/versions/3.4.10/lib/python3.4/xml/etree/ElementTree.py", line 1239, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: mismatched tag: line 1, column 76

Another small enhancement is that the tostring() and tostringlist() functions, as well as the ElementTree.write() method, now have a short_empty_elements keyword parameter. If set to True, which is the default, this causes empty tags to use the <tag /> shorthand. If set to False the expanded <tag></tag> form will be used instead.
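
A quick sketch of the difference:

>>> import xml.etree.ElementTree as ET
>>> ET.tostring(ET.Element("item"))
b'<item />'
>>> ET.tostring(ET.Element("item"), short_empty_elements=False)
b'<item></item>'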

Operating System

As well as the file descriptor inheritance features mentioned above, the os module also has a few more changes, listed below.

os.cpu_count() Added
Returns the number of CPUs available on the current platform, or None if it can’t be determined. This is now used as the implementation for multiprocessing.cpu_count().
os.path Improvements on Windows
On Windows, os.path.samestat() is now available, to tell if two stat() results refer to the same file, and os.path.ismount() now correctly recognises volumes which are mounted below the drive letter level.
os.open() New Flags

On platforms where the underlying call supports them, os.open() now supports two new flags.

  • O_PATH is used for obtaining a file descriptor to a path without actually opening it — reading or writing it will yield EBADF. This is useful for operations that don’t require us to access the file or directory such as fchdir().
  • O_TMPFILE creates an open file but never creates a directory entry for it, so it can be used as a temporary file. This is one step better than the usual approach of creating and then immediately deleting a temporary file, relying on the open filehandle to prevent the filesystem from reclaiming the blocks, because it doesn’t allow any window of opportunity to see the directory entry.

MacOS users get to benefit from some improvements to the plistlib module, which offers functions to read and write Apple .plist (property list) files. This module now sports an API that’s more consistent with other similar ones, with functions load(), loads(), dump() and dumps(). The module also now supports the binary file format, as well as the existing support for the XML version.
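
A minimal sketch of the new-style API with the binary format (the keys and values are just illustrative):

>>> import plistlib
>>> data = plistlib.dumps({"name": "example", "count": 3},
...                       fmt=plistlib.FMT_BINARY)
>>> plistlib.loads(data) == {"name": "example", "count": 3}
True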

On Linux, the resource module has some additional features. The Linux-specific prlimit() system call has been exposed, which allows you to both set and retrieve the current limit for any process based on its PID. You provide a resource (e.g. resource.RLIMIT_NOFILE controls the number of open file descriptors permitted) and then you can either provide a new value for the resource to set it and return the prior value, or omit the limit argument to just query the current setting. Note that you may get PermissionError raised if the current process doesn’t have the CAP_SYS_RESOURCE capability.
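
Here’s a hedged sketch of querying and then lowering the open files limit of the current process; the limit values shown are purely illustrative:

>>> import os
>>> import resource
>>> # With no limits argument this is just a query, returning (soft, hard).
>>> resource.prlimit(os.getpid(), resource.RLIMIT_NOFILE)
(1024, 4096)
>>> # Setting new limits returns the previous values.
>>> resource.prlimit(os.getpid(), resource.RLIMIT_NOFILE, (512, 4096))
(1024, 4096)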

On a related note, since some Unix variants have additional RLIMIT_* constants available, these have also been exposed in the resource module:

  • RLIMIT_MSGQUEUE (on Linux)
  • RLIMIT_NICE (on Linux)
  • RLIMIT_RTPRIO (on Linux)
  • RLIMIT_RTTIME (on Linux)
  • RLIMIT_SIGPENDING (on Linux)
  • RLIMIT_SBSIZE (on FreeBSD)
  • RLIMIT_SWAP (on FreeBSD)
  • RLIMIT_NPTS (on FreeBSD)

The stat module is now backed by a C implementation _stat, which makes it much easier to expose the myriad of platform-dependent values that exist. Three new ST_MODE flags were also added:

S_IFDOOR
Doors are an IPC mechanism on Solaris.
S_IFPORT
Event ports are another Solaris mechanism, providing a unified interface for collecting event completions, rather like a generic version of poll().
S_IFWHT
A whiteout file is a special file which indicates there is, in fact, no such file. This is typically used in union mount filesystems, such as OverlayFS on Linux, to indicate that a file has been deleted in the overlay. Since the lower layers are often mounted read-only, the upper layer needs some indicator it can layer over the top to stop the underlying file being visible.

Other Library Changes

Some other assorted updates that didn’t fit any of the themes above.

argparse.FileType Improvements
The class now accepts encoding and errors arguments that are passed straight on to the resultant open() call.
base64 Improvements
Encoding and decoding functions now accept any bytes-like object. Also, there are now functions to encode/decode Ascii85, both the variant used by Adobe for the PostScript and PDF formats, and also the one used by Git to encode binary patches.
dbm.open() Improvements
The dbm.open() call now supports use as a context manager.
glob.escape() Added
This escapes any special characters in a string to force it to be matched literally if passed to a globbing function.
importlib.machinery.ModuleSpec Added
PEP 451 describes a number of changes to importlib to continue to address some of the outstanding quirks and inconsistencies in this process. Primarily the change is to move some attributes from the module object itself to a new ModuleSpec object, which will be available via the __spec__ attribute. As far as I can tell this doesn’t offer a great deal of concrete benefits initially, but I believe it’s laying the foundations for further improvements to the import system in future releases. Check out the PEP for plenty of details.
re.fullmatch() Added
For matching regexes there was historically re.match() which only checked for a match starting at the beginning of the search string, and re.search() which would find a match starting anywhere. Now there’s also re.fullmatch(), and a corresponding method on compiled patterns, which finds matches covering the entire string (i.e. anchored at both ends).
selectors Added
The new module selectors was added as a higher-level abstraction over the implementations provided by the select module. This will probably make it easier for programmers who are less experienced with select(), poll() and friends to implement reliable applications, as these calls definitely have a few tricky quirks. That said, I would have thought the intention would be for most people to shift to using asyncio for these purposes, if they’re able.
shutil.copyfile() Raises SameFileError
For cases where the source and destination are already the same file, the SameFileError exception allows applications to take special action in this case.
struct.iter_unpack() Added
For strings which consist of repeated packed structures concatenated, this method provides an efficient way to iterate across them. There’s also a corresponding method of the same name on struct.Struct objects.
tarfile CLI Added
There’s now a command-line interface to the tarfile module which can be invoked with python -m tarfile.
textwrap Enhancements
The TextWrapper class now offers two new attributes: max_lines limits the number of lines in the output, and placeholder is appended to the output to indicate it was truncated due to the setting of max_lines. There’s also a handy new textwrap.shorten() convenience function that uses these facilities to shorten a single line to a specified length, appending the placeholder if truncation occurred.
>>> import textwrap
>>>
>>> wrapper = textwrap.TextWrapper(max_lines=5, width=40,
...         placeholder="[Read more...]")
>>> print("\n".join(wrapper.wrap(
...         "House? You were lucky to have a house! We used to live in one"
...         " room, all hundred and twenty-six of us, no furniture. Half the"
...         " floor was missing; we were all huddled together in one corner"
...         " for fear of falling.")))
House? You were lucky to have a house!
We used to live in one room, all hundred
and twenty-six of us, no furniture. Half
the floor was missing; we were all
huddled together in one[Read more...]
>>> textwrap.shorten("No no! 'E's pining!", width=30)
"No no! 'E's pining!"
>>> textwrap.shorten("'E's not pinin'! 'E's passed on!", width=30)
"'E's not pinin'! 'E's [...]"
PyZipFile.writepy() Enhancement
The zipfile.PyZipFile class is a specialised compressor for the purposes of creating ZIP archives of python libraries. It now supports a filterfunc parameter which must be a function accepting a single argument. It will be called for each file added to the archive, being passed the full path, and if it returns False for any path then it’ll be excluded from the archive. This could be used to exclude unit test code, for example.

Builtin Changes

There were a collection of changes to builtins which are worth a quick mention.

min() and max() Defaults
You can now specify a default keyword-only parameter, to be returned if the iterable you pass is empty.
Absolute Module __file__ Path
The __file__ attribute of modules should now always use absolute paths, except for __main__.__file__ if the script was invoked with a relative name. Could be handy, especially when using this to generate log entries and the like.
bytes.join() Accepts Any Buffer
The join() method of bytes and bytearray previously used to be restricted to accepting objects of those types. Now in both cases it will accept any object supporting the buffer protocol.
memoryview Supports reversed()
Due to an oversight, memoryview was not registered as a subclass of collections.Sequence. It is in Python 3.4. Also, it can now be used with reversed().

Garbage Collecting Reference Cycles

In this release there’s also an important change to the garbage collection process, as defined in PEP 442. This finally resolves some long-standing problems around garbage collection of reference cycles where the objects have custom finalisers (e.g. __del__() methods).

Just to make sure we’re on the same page, a reference cycle is when you have a series of objects which all hold a reference to each other where there is no clear “root” object which can be deleted first. This means that their reference counts never normally drop to zero, because there’s always another object holding a reference to them. If, like me, you’re a more visual thinker, here’s a simple illustration:

reference cycle diagram

It’s for these cases that the garbage collector was created. It will detect reference cycles where there are no longer any external references pointing to them9, and if so it’ll break all the references within the cycle. This allows the reference counts to drop to zero and the normal object cleanup occurs.

This is fine, except when more than one of the objects have custom finalisers. In these cases, it’s not clear in what order the finalisers should be called, and also there’s the risk that the finalisers could make changes which themselves impact the garbage collection process. So historically the interpreter has balked at these cases and left the objects on the gc.garbage list for programmers to clean up using their specific knowledge of the objects in question. Of course, it’s always better never to create such reference cycles in the first place, but sometimes it’s surprisingly easy to do so by accident.

The good news is that in Python 3.4 this situation has been improved so that in almost every case the garbage collector will be able to collect reference cycles. The garbage collector now has two additional stages. In the first of these, the finalisers of all objects in isolated reference cycles are invoked. The only choice here is really to call them in an undefined order, so you should avoid making too many assumptions in the finalisers that you write.

The second new step, after all finalisers have been run, is to re-traverse the cycles and confirm they’re still isolated. This is required because the finalisers may have ended up creating references from outside the cycle which should keep it alive. If the cycle is no longer isolated, the collection is aborted this time around and the objects persist. Note that their finalisers will only ever be called once, however, and this won’t change if they’re resurrected in this fashion.

Assuming the collection wasn’t aborted, it now continues as normal.

This should cover most of the cases people are likely to hit. However, there’s an important exception which can still bite you: this change doesn’t affect objects defined in C extension modules which have a custom tp_dealloc function. These objects may still end up on gc.garbage, unfortunately.

The take-aways from this change appear to be:

  • Don’t rely on the order in which your finalisers will be called.
  • You shouldn’t need to worry about checking gc.garbage any more.
  • … Unless you’re using objects from C extensions which define custom finalisers.

Other Changes

Here are the other changes I felt were noteworthy enough to mention, but not enough to jump into in a lot of detail.

More secure hash algorithm
Python has updated its hashing algorithm to SipHash for security reasons. For a little more background you can see the CERT advisory on this issue from 2011, and PEP 456 has a lot more details.
UCD Updated to 6.3
The Unicode Character Database (UCD) has been updated to version 6.3. If you’re burning to know what it added, check out the UCD blog post.
Isolated mode option
The Python interpreter now supports a -I option to run in isolated mode. This removes the current directory from sys.path, as well as the user’s own site-packages directory, and also ignores all PYTHON* environment variables. The intention is to be able to run a script in a clean system-defined environment, without any user customisations being able to impact it. This can be specified on the shebang line of system scripts, for example.
Optimisations

As usual there are a number of optimisations, of which I’ve only included some of the more interesting ones here:

  • The UTF-32 decoder is now 3-4x faster.
  • Hash collisions in sets are cheaper due to an optimisation: a limited amount of linear probing is tried first on a collision, which can take advantage of cache locality, before falling back to the usual pseudo-random probing if collisions persist (by default the limit for linear probing is 9 slots).
  • Interpreter startup time has been reduced by around 30% by loading fewer modules by default.
  • html.escape() is around 10x faster.
  • os.urandom() now uses a lazily-opened persistent file descriptor to avoid the overhead of opening large numbers of file descriptors when run in parallel from multiple threads.

Conclusions

Another packed release, then, with plenty of useful additions large and small. Blocking inheritance of file descriptors by default is one of those features that’s going to be helpful to a lot of people without them even knowing it, which is the sort of thing Python does well in general. The new modules in this release aren’t anything earth-shattering, but they’re all useful additions; the lack of something like the enum module in particular had always felt like a bit of a rough edge. The diagnostic improvements like tracemalloc and the inspect enhancements are the type of thing you won’t necessarily be using every day, but when you have a need of them they’re priceless. The addition of subTest to unittest is definitely a handy one, as it makes failures in certain cases much more transparent than just realising the overall test has failed and having to insert logging statements to figure out why.

The incremental XMLPullParser is a great addition in my opinion; I’ve always had a bit of a dislike of callback-based approaches since they always seem to force you to jump through more hoops than you’d like. Whichever one is a natural fit does depend on your use-case, however, so it’s good to have both approaches available to choose from. I’m also really glad to see the long-standing issue of garbage collecting reference cycles with custom finalisers has finally been tidied up; it’s one more change to give us more confidence using Python for very long-running daemon processes.

It does feel rather like a “tidying up loose ends and known issues” release, this one, but there’s plenty there to justify an upgrade. From what I know of later releases, I wonder if that was somewhat intentional — stabilising the platform for syntactic updates and other more sweeping changes in the future.


  1. Spoiler alert: it was added in the next release. 

  2. A key derivation function is used to “stretch” a password of somewhat limited length into a longer byte sequence that can be used as a cryptographic key. Doing this naively can significantly reduce the security of a system, so using established algorithms is strongly recommended. 

  3. These are the objects returned by the finder, and which are responsible for actually loading the module from disk into memory. If you want to know more, you can find more details in the Importer Protocol section of PEP 302

  4. AST stands for Abstract Syntax Trees, and represents a normalised intermediate form for Python code which has been parsed but not yet compiled. 

  5. That’s a duck typing joke. Also, why do you never see ducks typing? They’ve never found a keyboard that quite fits the bill. That was another duck typing joke, although I use the word “joke” in its loosest possible sense in both cases. 

  6. For a more in-depth discussion of some of the issues using fork() in multithreaded code, this article has some good discussion. 

  7. Spoiler alert: this method is only applicable until Python 3.7 where it was deprecated in favour of a newer sni_callback attribute. The semantics are similar, however. 

  8. It’s useful to have support for it in software, of course, because you don’t necessarily have control of the formats you need to open. But as Benjamin Zwickel convincingly explains in The 24-Bit Delusion, 24-bit audio is generally a waste of time since audio equipment cannot reproduce it accurately enough. 

  9. For reference, this is what PEP 442 refers to as a cyclic isolate

8 Apr 2021 at 11:47PM in Software
Photo by David Clode on Unsplash

☑ Python 2to3: What’s New in 3.4 - Part 1

4 Apr 2021 at 3:25PM in Software

This is part 6 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.3 - Part 2.

In this series looking at features introduced by every version of Python 3, this one is the first of two covering release 3.4. We look at a universal install of the pip utility, improvements to handling codecs, and the addition of the asyncio and enum modules, among other things.

green python two

Python 3.4 was released on March 16 2014, around 18 months after Python 3.3. That means I’m only writing this around seven years late, as opposed to my Python 3.0 overview which was twelve years behind — at this rate I should be caught up in time for the summer.

This release was mostly focused on standard library improvements and there weren’t any syntax changes. There’s a lot here to like, however, including a bevy of new modules and a whole herd of enhancements to existing ones, so let’s fire up our Python 3.4 interpreters and import some info.

What a Pip

For anyone who’s unaware of pip, it’s the most widely used package management tool for Python, its name being a recursive acronym for pip installs packages. Written by Ian Bicking, creator of virtualenv, it was originally called pyinstall and was intended as a more fully-featured alternative to easy_install, which was the official package installation tool at the time.

Since pip is the tool you naturally turn to for installing Python modules and tools, this raises an obvious question: how do you install pip for the first time? Typically the answer has been to install some OS package which includes it, and once you have it installed you can use it to install everything else. In the new release, however, there’s a new ensurepip module to perform this bootstrapping operation. It uses a private copy of pip that’s distributed with CPython, so it doesn’t require network access and can readily be used by anyone on any platform.

This approach is part of a wider standardisation effort around distributing Python packages, and pip was selected as a tool that’s already popular and also works well within virtual environments. Speaking of which, this release also updates the venv module to install pip in virtual environments by default, using ensurepip. This was something that virtualenv always did, and the lack of it in venv was a serious barrier to adoption of venv for a number of people. Additionally the CPython installers on Windows and MacOS also default to installing pip on these platforms. You can find full details in PEP 453.
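As a quick illustration, here’s a minimal sketch of driving ensurepip from Python itself rather than via python -m ensurepip on the command line. Note that bootstrap() installs into the running interpreter’s environment, so normally you’d only do this in a freshly created virtual environment.

import ensurepip

# Report the version of the private copy of pip bundled with CPython,
# then install it (upgrading any older copy already present).
print(ensurepip.version())
ensurepip.bootstrap(upgrade=True, verbosity=1)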

When you try newer languages like Go and Rust, coming from a heritage of C++ and the like, one of the biggest factors that leaps out at you isn’t so much the language itself but the convenience of the well integrated standard tooling. With this release I think Python has taken another step in this direction, with standard and consistent package management on all the major platforms.

File Descriptor Inheritance (or Lack Thereof)

Under POSIX, file descriptors are by default inherited by child processes during a fork() operation. This offers some concrete advantages, such as the child process automatically inheriting the stdin, stdout and stderr from the parent, and also allowing the parent to create a pipe with pipe() to communicate with the child process1.

However, this behaviour can cause confusion and bugs. For example, if the parent has a file open when it spawns a long-running daemon as a child process, that file descriptor may be held open indefinitely and the disk space associated with the file will not be freed even if it’s later deleted. Or if the parent had a large number of open file descriptors, the child may exhaust the remaining capacity if it too tries to open a large number. This is one reason why it’s common to iterate over all file descriptors and call close() on them after forking.

In Python 3.4, however, this behaviour has been modified so that file descriptors are not inherited. On POSIX systems this is implemented by setting FD_CLOEXEC on the descriptor via fcntl()2, which causes that descriptor to be closed when any of the execX() family of calls is made. On Windows, SetHandleInformation() is used to clear the HANDLE_FLAG_INHERIT flag to much the same effect.

Since inheritance of file descriptors is still desirable in some circumstances, the functions os.get_inheritable() and os.set_inheritable() can be used to query and set this behaviour on a per-filehandle basis. There are also os.get_handle_inheritable() and os.set_handle_inheritable() calls on Windows, if you’re using native Windows handles rather than the POSIX layer.

One important aspect to note here is that when using the FD_CLOEXEC flag, the close() happens on the execX() call, so if you call a plain vanilla os.fork() and continue execution in the same script then all the descriptors will still be open. To demonstrate the action of these methods, you’ll need to do something like this (which is Unix-specific since it assumes the existence of /tmp):

import os
import sys
import tempfile
import time

# Create script that we'll exec to test whether FD is still open.
fd, script_path = tempfile.mkstemp()
os.write(fd, b"import os\n")
os.write(fd, b"import sys\n")
os.write(fd, b"fd = int(sys.argv[1])\n")
os.write(fd, b"msg = sys.argv[2]\n")
os.write(fd, b"data = (msg + '\\n').encode('utf-8')\n")
os.write(fd, b"try:\n")
os.write(fd, b"    os.write(fd, data)\n")
os.write(fd, b"except Exception as exc:\n")
os.write(fd, b"    print('ERROR: ' + str(exc))\n")
os.close(fd)

# Create output file to which child processes will attempt to write.
fd, output_path = tempfile.mkstemp()
os.write(fd, b"Before fork\n")

# First attempt, should fail with "Bad file descriptor" as the default
# is for filehandles to not inherit over exec.
if os.fork() == 0:
    os.execl(
        sys.executable, script_path, script_path, str(fd), "FIRST"
    )
os.wait()

# Second attempt should succeed once fd is inheritable.
os.set_inheritable(fd, True)
if os.fork() == 0:
    os.execl(
        sys.executable, script_path, script_path, str(fd), "SECOND"
    )
os.wait()

# Now we re-read the file to see which attempts worked.
os.lseek(fd, 0, os.SEEK_SET)
print("Contents of file:")
print(os.read(fd, 4096).decode("utf-8"))
os.close(fd)

# Clean up temporary files.
os.remove(script_path)
os.remove(output_path)

When run, you should see something like the following:

ERROR: [Errno 9] Bad file descriptor
Contents of file:
Before fork
SECOND

That first line is the output from the first attempt to write the file, which fails. The contents of the output file clearly indicate that only the second write was successful.

In general I think this change is a very sensible one, as the previous default behaviour of inheriting file descriptors on POSIX systems probably took a lot of less experienced developers (and a few more experienced ones!) by surprise. It’s the sort of nasty surprise that you don’t realise is there until one of those odd cases where, say, you’re dealing with hundreds of open files at once and when you spawn a child process it suddenly starts complaining it’s hit the system limit on open file descriptors, and you wonder what on earth is going on. Such odd cases always seem to arrive when you have the tightest deadlines, too, so the last thing you need is to spend hours tracking down some weird file descriptor inheritance bug.

If you need to know more, PEP 446 has the lowdown, including references to real issues in various OSS projects caused by this behaviour.

Clarity on Codecs

The codecs module has long been a fixture in Python, since it was introduced in (I think!) Python 2.0, released over two decades ago. It was intended as a general framework for registering and using any sort of codec, and this can be seen from the diverse range of codecs it supports. For example, as well as obvious candidates like utf-8 and ascii, you’ve got options like base64, hex, zlib and bz2. You can even register your own with codecs.register().

However, most people don’t use codecs on a frequent basis, but they do use the convenience methods str.encode() and bytes.decode() all the time. This can cause confusion because while the encode() and decode() methods provided by codecs are generic, the convenience methods on str and bytes are not — these only support the limited set of text encodings that make sense for those classes.

In Python 3.4 this situation has been somewhat improved by more helpful error messages and improved documentation.

Firstly, the methods codecs.encode() and codecs.decode() are now documented, which they weren’t previously. This is probably because they’re really just convenient wrappers for calling lookup() and invoking the codec object thus created, but unless you’re doing a lot of encoding/decoding with the same codec, the simplicity of their interface is probably preferable. Since these are implemented in C under the hood, there shouldn’t be a lot of performance overhead for using these wrappers either.

>>> import codecs
>>> encoder = codecs.lookup("rot13")
>>> encoder.encode("123hello123")
('123uryyb123', 11)
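If you don’t need to keep the codec object around, the wrappers are more direct still; something like the following is what you’d expect to see:

>>> codecs.encode("123hello123", "rot13")
'123uryyb123'
>>> codecs.decode(b"68656c6c6f", "hex")
b'hello'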

Secondly, using one of the non-text encodings without going through the codecs module now yields a helpful error which points you in that direction.

>>> "123hello123".encode("rot13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: 'rot13' is not a text encoding; use codecs.encode() to handle arbitrary codecs

Finally, errors during encoding now use chained exceptions to ensure that the codec responsible for them is indicated as well as the underlying error raised by that codec.

>>> codecs.decode("abcdefgh", "hex")
Traceback (most recent call last):
  File "/Users/andy/.pyenv/versions/3.4.10/encodings/hex_codec.py", line 19, in hex_decode
    return (binascii.a2b_hex(input), len(input))
binascii.Error: Non-hexadecimal digit found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: decoding with 'hex' codec failed (Error: Non-hexadecimal digit found)

Hopefully all this will go some way to making things easier to grasp for anyone grappling with the nuances of codecs in Python.

New Modules

This release has a number of new modules, which are discussed in the sections below. I’ve skipped ensurepip since it’s already been discussed at the top of this article.

Asyncio

This release contains the new asyncio module which provides an event loop framework for Python. I’m not going to discuss it much in this article because I already covered it a few years ago in an article that was part of my coroutines series. The other reason not to go into things in too much detail here are that the situation evolved fairly rapidly from Python 3.4 to 3.7, so it probably makes more sense to have a more complete look in retrospect.

Briefly, it’s nominally the successor to the asyncore module for doing asynchronous I/O, which was always promising in principle but a bit of a disappointment in practice due to a lack of flexibility. This is far from the whole story, however, as it also forms the basis for the modern use of coroutines in Python.

Since I’m writing these articles with the benefit of hindsight, my strong suggestion is to either go find some other good tutorials on asyncio that were written in the last couple of years, and which use Python 3.7 as a basis; or wait until I get around to covering Python 3.7 myself, where I’ll run through it in more detail (especially since my previous articles stopped at Python 3.5).

Enum

Enumerations are something that Python’s been lacking for some time. This is partly due to the fact that it’s not too hard to find ways to work around this omission, but they’re often a little unsatisfactory. It’s also partly due to the fact that nobody could fully agree on the best way to implement them.

Well in Python 3.4 PEP 435 has come along to change all that, and it’s a handy little addition.

Enumerations are defined using the same syntax as a class:

from enum import Enum

class WeekDay(Enum):
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7

However, it’s important to note that this isn’t an ordinary class, as it’s constructed via the enum.EnumMeta metaclass. Don’t worry too much about the details, just be aware that an enumeration is essentially a new construct which reuses the class syntax, and you won’t be taken by surprise later.

You’ll notice that all the enumeration members need to be assigned a value; you can’t just list the member names on their own (although read on for a nuance to this). When you have an enumeration member you can query both its name and value, and both str() and repr() give sensible results. See the excerpt below for an illustration of all these aspects.

>>> WeekDay.WEDNESDAY.name
'WEDNESDAY'
>>> WeekDay.WEDNESDAY.value
3
>>> str(WeekDay.FRIDAY)
'WeekDay.FRIDAY'
>>> repr(WeekDay.FRIDAY)
'<WeekDay.FRIDAY: 5>'
>>> type(WeekDay.FRIDAY)
<enum 'WeekDay'>
>>> type(WeekDay)
<class 'enum.EnumMeta'>
>>> WeekDay.THURSDAY - WeekDay.MONDAY
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'WeekDay' and 'WeekDay'
>>> WeekDay.THURSDAY.value - WeekDay.MONDAY.value
3

I did mention that every enumeration member needs a value, but there is an enum.auto() helper to assign values automatically if all you need is something unique (strictly speaking this helper only arrived in Python 3.6, but it’s convenient to mention it here). The excerpt below illustrates this as well as iterating through an enumeration.

>>> from enum import Enum, auto
>>> class Colour(Enum):
...     RED = auto()
...     GREEN = auto()
...     BLUE = auto()
...
>>> print("\n".join(i.name + "=" + str(i.value) for i in Colour))
RED=1
GREEN=2
BLUE=3

Every enumeration name must be unique within a given enumeration definition, but the values can be duplicated if needed, which you can use to define aliases for values. If this isn’t desirable, the @enum.unique decorator can enforce uniqueness, raising a ValueError if not.

One thing that’s not immediately obvious from these examples is that enumeration member values may be any type, and different types may even be mixed within the same enumeration. I’m not sure how valuable this would be to do in practice, however.

Values can be compared by identity or equality, but comparing enumeration members to values of their underlying types always returns not equal. Aliases for the same underlying value are the same object, so they compare equal even by identity. Also note that when iterating through enumerations, aliases are skipped and the first definition for each value is used.

>>> class Numbers(Enum):
...     ONE = 1
...     UN = 1
...     EIN = 1
...     TWO = 2
...     DEUX = 2
...     ZWEI = 2
...
>>> Numbers.ONE is Numbers.UN
True
>>> Numbers.TWO == Numbers.ZWEI
True
>>> Numbers.ONE == Numbers.TWO
False
>>> Numbers.ONE is Numbers.TWO
False
>>> Numbers.ONE == 1
False
>>> list(Numbers)
[<Numbers.ONE: 1>, <Numbers.TWO: 2>]

If you really do need to include aliases in your iteration, the special __members__ dictionary can be used for that.

>>> import pprint
>>> pprint.pprint(Numbers.__members__)
mappingproxy({'DEUX': <Numbers.TWO: 2>,
              'EIN': <Numbers.ONE: 1>,
              'ONE': <Numbers.ONE: 1>,
              'TWO': <Numbers.TWO: 2>,
              'UN': <Numbers.ONE: 1>,
              'ZWEI': <Numbers.TWO: 2>})

Finally, the module also provides some subclasses of Enum which may be useful. For example, IntEnum is one which adds the ability to compare enumeration values with int as well as other enumeration values.
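As a quick sketch of the difference, IntEnum members take part in integer comparisons directly, unlike the plain Enum examples earlier:

>>> from enum import IntEnum
>>> class Priority(IntEnum):
...     LOW = 1
...     MEDIUM = 2
...     HIGH = 3
...
>>> Priority.MEDIUM == 2
True
>>> Priority.HIGH > Priority.LOW
True
>>> sorted([Priority.HIGH, Priority.LOW, Priority.MEDIUM])
[<Priority.LOW: 1>, <Priority.MEDIUM: 2>, <Priority.HIGH: 3>]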

This is a bit of a whirlwind tour of what’s quite a flexible module, but hopefully it gives you an idea of its capabilities. Check out the full documentation for more details.

Pathlib

This release sees the addition of a new library pathlib to manipulate filesystem paths, with semantics appropriate for different operating systems. This is intended to be a higher-level abstraction than that provided by the existing os.path library, which itself has some functions to abstract away from the filesystem details (e.g. os.path.join() which uses appropriate slashes to build a path).

There are common base classes across platforms, and then different subclasses for POSIX and Windows. The classes are also split into pure and concrete, where pure classes represent theoretical paths but lack any methods to interact with the concrete filesystem. The concrete equivalents have such methods, but can only be instantiated on the appropriate platform.

For reference, here is the class hierarchy:

pathlib class structure

When run on a POSIX system, the following excerpt illustrates which of the platform-specific classes can be instantiated, and also that the pure classes lack the filesystem methods that the concrete ones provide:

>>> import pathlib
>>> a = pathlib.PurePosixPath("/tmp")
>>> b = pathlib.PureWindowsPath("/tmp")
>>> c = pathlib.PosixPath("/tmp")
>>> d = pathlib.WindowsPath("/tmp")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andy/.pyenv/versions/3.4.10/pathlib.py", line 927, in __new__
    % (cls.__name__,))
NotImplementedError: cannot instantiate 'WindowsPath' on your system
>>> c.exists()
True
>>> len(list(c.iterdir()))
24
>>> a.exists()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PurePosixPath' object has no attribute 'exists'
>>> len(list(a.iterdir()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PurePosixPath' object has no attribute 'iterdir'

Of course, a lot of the time you’ll just want whatever path represents the platform on which you’re running, so if you instantiate plain old Path you’ll get the appropriate concrete representation.

>>> x = pathlib.Path("/tmp")
>>> type(x)
<class 'pathlib.PosixPath'>

One handy feature is that the division operator (slash) has been overridden so that you can append path elements with it. Note that this operator is the same on all platforms, and also you always use forward-slashes even on Windows. However, when you stringify the path, Windows paths will be given backslashes. The excerpt below illustrates these features, and also some of the manipulations that pure paths support.

>>> x = pathlib.PureWindowsPath("C:/") / "Users" / "andy"
>>> x
PureWindowsPath('C:/Users/andy')
>>> str(x)
'C:\\Users\\andy'
>>> x.parent
PureWindowsPath('C:/Users')
>>> [str(i) for i in x.parents]
['C:\\Users', 'C:\\']
>>> x.drive
'C:'

So far this is all pretty convenient, but perhaps nothing to write home about. However, there are some more notable features. One is glob matching, where you can test a given path for matches against a glob-style pattern with the match() method.

>>> x = pathlib.PurePath("a/b/c/d/e.py")
>>> x.match("*.py")
True
>>> x.match("d/*.py")
True
>>> x.match("a/*.py")
False
>>> x.match("a/*/*.py")
False
>>> x.match("a/*/*/*/*.py")
True
>>> x.match("d/?.py")
True
>>> x.match("d/??.py")
False

Then there’s relative_to() which is handy for getting the relative path of a file to some specified parent directory. It also raises an exception if the path isn’t under the parent directory, which makes checking for errors in paths specified by the user more convenient.

>>> x = pathlib.PurePath("/one/two/three/four/five.py")
>>> x.relative_to("/one/two/three")
PurePosixPath('four/five.py')
>>> x.relative_to("/xxx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../pathlib.py", line 819, in relative_to
    .format(str(self), str(formatted)))
ValueError: '/one/two/three/four/five.py' does not start with '/xxx'

And finally there are with_name() and with_suffix(), which are useful for manipulating parts of the filename, plus with_stem() which joined them in a later release (Python 3.9).

>>> x = pathlib.PurePath("/home/andy/file.md")
>>> x.with_name("newfilename.html")
PurePosixPath('/home/andy/newfilename.html')
>>> x.with_stem("newfile")
PurePosixPath('/home/andy/newfile.md')
>>> x.with_suffix(".html")
PurePosixPath('/home/andy/file.html')
>>> x.with_suffix("")
PurePosixPath('/home/andy/file')

The concrete classes add a lot more useful functionality for querying the content of directories and reading file ownership and metadata, but if you want more details I suggest you go read the excellent documentation. If you want the motivations behind some of the design decisions, go and read PEP 428.

Statistics

Both simple and useful, this new module contains some handy functions to calculate basic statistical measures from sets of data. All of these operations support the standard numeric types int, float, Decimal and Fraction and raise StatisticsError on errors, such as an empty data set being passed.

The following functions for determining different forms of average value are provided in this release:

mean()
Broadly equivalent to sum(data) / len(data) except supporting generalised iterators that can only be evaluated once and don’t support len().
median()
Broadly equivalent to sorted(data)[len(data) // 2] except supporting generalised iterators. Also, if the number of items in data is even then the mean of the two middle items is returned instead of selecting one of them, so the value is not necessarily one of the actual members of the data set in this case.
median_low() and median_high()
These are identical to median() and each other for data sets with an odd number of elements. If the number of elements is even, these return one of the two middle elements instead of their mean as median() does, with median_low() returning the lower of the two and median_high() the higher.
median_grouped()
This function implements the median of continuous data based on the frequency of values in fixed-width groups. Each value is interpreted as the midpoint of an interval, and the width of that interval is passed as the second argument. If omitted, the interval defaults to 1, which would represent continuous values that have been rounded to the nearest integer. The method involves identifying the median interval, and then using the proportion of values above and within that interval to interpolate an estimate of the median value within it3.
mode()
Returns the most commonly occurring value within the data set, or raises StatisticsError if more than one value is tied for the highest number of occurrences.

There are also functions to calculate the variance and standard deviation of the data:

pstdev() and stdev()
These calculate the population and sample standard deviation respectively.
pvariance() and variance()
These calculate the population and sample variance respectively.

These operations are generally fairly simple to implement yourself, but making them operate correctly on any iterator is slightly fiddly and it’s definitely handy to have them available in the standard library. I also have a funny feeling that we’ll be seeing more additions to this library in the future beyond the fairly basic set that’s been included initially.
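To make that concrete, here’s roughly what you’d expect to see for a small data set chosen so the arithmetic is easy to verify by hand:

>>> import statistics
>>> data = [1, 2, 2, 5]
>>> statistics.mean(data)
2.5
>>> statistics.median(data)
2.0
>>> statistics.median_low(data), statistics.median_high(data)
(2, 2)
>>> statistics.mode(data)
2
>>> statistics.pvariance(data)
2.25
>>> statistics.pstdev(data)
1.5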

Tracemalloc

As you can probably tell from the name, this module is intended to help you track down where memory is being allocated in your scripts. It does this by storing the line of code that allocated every block, and offering APIs which allow your code to query which files or lines of code have allocated the most blocks, and also compare snapshots between two points in time so you can track down the source of memory leaks.

Due to the memory and CPU overhead of performing this tracing it’s not enabled by default. You can start tracking at runtime with tracemalloc.start(), or to start it early you can set the PYTHONTRACEMALLOC environment variable or pass the -X tracemalloc command-line option. You can also store multiple frames of traceback against each block, at the cost of increased CPU and memory overhead, which can be helpful for tracing the source of memory allocations made by common shared code.

Once tracing is enabled you can grab a snapshot at any point with take_snapshot(), which returns a Snapshot instance that can be interrogated at any later point. Once you have a Snapshot instance you can call statistics() on it to get the memory allocations aggregated by source file, broken down by line number, or grouped by full traceback. There’s also a compare_to() method for examining the delta in memory allocations between two points, and there are dump() and load() methods for saving snapshots to disk for later analysis, which could be useful for tracing code in production environments.

As a quick example of these two methods, consider the following completely artificial code:

memory.py
import tracemalloc

import lib1
import lib2
import lib3

tracemalloc.start()

mystring1 = "abc" * 4096
mystring2 = "\U0002000B\U00020016\U00020017" * 4096
foo = lib1.Foo()
bar = lib2.Bar()
baz = lib3.Baz()

snapshot1 = tracemalloc.take_snapshot()
print("---- Initial snapshot:")
for entry in snapshot1.statistics("lineno"):
    print(entry)

del foo, bar, baz
snapshot2 = tracemalloc.take_snapshot()
print("\n---- Incremental snapshot:")
for entry in snapshot2.compare_to(snapshot1, "lineno"):
    print(entry)
lib1.py
import os

class Foo:
    def __init__(self):
        self.entropy_pool = [os.urandom(64) for i in range(100)]
lib2.py
class Bar:
    def __init__(self):
        self.name = "instance"
        self.id = 12345
lib3.py
import random
import string

class Baz:
    def __init__(self):
        self.values = []
        for i in range(100):
            buffer = ""
            for j in range(1024):
                buffer += random.choice(string.ascii_lowercase)
            self.values.append(buffer)

Let’s take a quick look at the two parts of the output that executing memory.py gives us. The first half that I get on my MacOS system is shown below — wherever you see “...” it’s where I’ve stripped out leading paths to avoid the need for word wrapping:

---- Initial snapshot:
.../lib3.py:10: size=105 KiB, count=101, average=1068 B
memory.py:10: size=48.1 KiB, count=1, average=48.1 KiB
memory.py:9: size=12.0 KiB, count=1, average=12.0 KiB
.../lib1.py:5: size=10.3 KiB, count=102, average=104 B
.../lib3.py:11: size=848 B, count=1, average=848 B
memory.py:13: size=536 B, count=2, average=268 B
.../python3.4/random.py:253: size=536 B, count=1, average=536 B
memory.py:12: size=56 B, count=1, average=56 B
memory.py:11: size=56 B, count=1, average=56 B
.../lib3.py:6: size=32 B, count=1, average=32 B
.../lib2.py:3: size=32 B, count=1, average=32 B

I’m not going to go through all of these, but let’s pick a few examples to check what we’re seeing makes sense. Note that the results from statistics() are always sorted in decreasing order of total memory consumption.

The first line indicates lib3.py:10 allocated memory 101 times, which is reassuring because it’s not allocating every time around the nested loop. Interesting to note that it’s one more time than the number of times around the outer loop, however, which perhaps implies there’s some allocation that was done the first time and then reused. The average allocation of 1068 bytes makes sense, since these are str objects of 1024 characters and based on sys.getsizeof("") on my platform each instance has an overhead of around 50 bytes.

Next up are memory.py:10 and memory.py:9 which are straightforward enough: single allocations for single strings. The sizes are such that the str overhead is lost in rounding errors, but do note that the string using extended Unicode characters4 requires 4 bytes per character and is therefore four times larger than the byte-per-character ASCII one. If you’ve read the earlier articles in this series, you may recall that this behaviour was introduced in Python 3.3.

Skipping forward slightly, the allocation on lib3.py:11 is interesting: when we append the str we’ve built to the list we get a single allocation of 848 bytes. I assume there’s some optimisation going on here, because if I increase the loop count the allocation count remains at one but the size increases.

The last thing I’ll call out is the two allocations on memory.py:13. I’m not quite sure exactly what’s triggering this, but it’s some sort of optimisation — even if the loop has zero iterations then these allocations still occur, but if I comment out the loop entirely then these allocations disappear. Fascinating stuff!

Now we’ll look at the second half of the output, comparing the initial snapshot to that after the class instances are deleted:

---- Incremental snapshot:
.../lib3.py:10: size=520 B (-105 KiB), count=1 (-100), average=520 B
.../lib1.py:5: size=0 B (-10.3 KiB), count=0 (-102)
.../python3.4/tracemalloc.py:462: size=1320 B (+1320 B), count=3 (+3), average=440 B
.../python3.4/tracemalloc.py:207: size=952 B (+952 B), count=3 (+3), average=317 B
.../python3.4/tracemalloc.py:165: size=920 B (+920 B), count=3 (+3), average=307 B
.../lib3.py:11: size=0 B (-848 B), count=0 (-1)
.../python3.4/tracemalloc.py:460: size=672 B (+672 B), count=1 (+1), average=672 B
.../python3.4/tracemalloc.py:432: size=520 B (+520 B), count=2 (+2), average=260 B
memory.py:18: size=472 B (+472 B), count=1 (+1), average=472 B
.../python3.4/tracemalloc.py:53: size=472 B (+472 B), count=1 (+1), average=472 B
.../python3.4/tracemalloc.py:192: size=440 B (+440 B), count=1 (+1), average=440 B
.../python3.4/tracemalloc.py:54: size=440 B (+440 B), count=1 (+1), average=440 B
.../python3.4/tracemalloc.py:65: size=432 B (+432 B), count=6 (+6), average=72 B
.../python3.4/tracemalloc.py:428: size=432 B (+432 B), count=1 (+1), average=432 B
.../python3.4/tracemalloc.py:349: size=208 B (+208 B), count=4 (+4), average=52 B
.../python3.4/tracemalloc.py:487: size=120 B (+120 B), count=2 (+2), average=60 B
memory.py:16: size=90 B (+90 B), count=2 (+2), average=45 B
.../python3.4/tracemalloc.py:461: size=64 B (+64 B), count=1 (+1), average=64 B
memory.py:13: size=480 B (-56 B), count=1 (-1), average=480 B
.../python3.4/tracemalloc.py:275: size=56 B (+56 B), count=1 (+1), average=56 B
.../python3.4/tracemalloc.py:189: size=56 B (+56 B), count=1 (+1), average=56 B
memory.py:12: size=0 B (-56 B), count=0 (-1)
memory.py:11: size=0 B (-56 B), count=0 (-1)
.../python3.4/tracemalloc.py:425: size=48 B (+48 B), count=1 (+1), average=48 B
.../python3.4/tracemalloc.py:277: size=32 B (+32 B), count=1 (+1), average=32 B
.../lib3.py:6: size=0 B (-32 B), count=0 (-1)
.../lib2.py:3: size=0 B (-32 B), count=0 (-1)
memory.py:10: size=48.1 KiB (+0 B), count=1 (+0), average=48.1 KiB
memory.py:9: size=12.0 KiB (+0 B), count=1 (+0), average=12.0 KiB
.../python3.4/random.py:253: size=536 B (+0 B), count=1 (+0), average=536 B

Firstly, there are of course a number of allocations within tracemalloc.py, which are the result of creating and analysing the previous snapshot. We’ll disregard these, because they depend on the details of the library implementation which we don’t have transparency into here.

Beyond this, most of the changes are as you’d expect. Interesting points to note are that one of the allocations lib3.py:10 was not freed, and only one of the two allocations from memory.py:13 was freed. Since these were the two cases where I was a little puzzled by the apparently spurious additional allocations, I’m not particularly surprised to see these two being the ones that weren’t freed afterwards.

In a simple example like this, it’s easy to see how you could track down memory leaks and similar issues. However, I suspect in a complex codebase it could be quite a challenge to focus in on the impactful allocations with the amount of detail provided. I guess the main reason people would turn to this module is only to track down major memory leaks rather than a few KB here and there, so at that point perhaps the important allocations would stand out clearly from the background noise.

Either way, it’s certainly a welcome addition to the library!

Conclusions

Great stuff so far, but we’ve got plenty of library enhancements still to get through. I’ll discuss those and a few other remaining details in the next post, and I’ll also sum up my overall thoughts on this release as a whole.


  1. So the parent process closes one end of the pipe and the child process closes the other end. If you want bidirectional communication you can do the same with another pipe, just the opposite way around. There are other ways for processes to communicate, of course, but this is one of the oldest. 

  2. If you want to get technical there’s a faster path used on platforms which support it which is to call ioctl() with either FIOCLEX or FIONCLEX to perform the same task. This is only because it’s generally a few percent faster than the equivalent fcntl() call, but less standard. 

  3. Or more concisely, median ≈ L + ((n / 2 − cf) / f) × w, where L is the lowest possible value from the median interval, n is the size of the data set, cf is the number of items below the median interval, f is the number of items within the median interval, and w is the interval width. 

  4. Specifically from the Supplementary Ideographic Plane

4 Apr 2021 at 3:25PM in Software
Photo by David Clode on Unsplash

March 2021

☑ Python 2to3: What’s New in 3.3 - Part 2

7 Mar 2021 at 11:27AM in Software

This is part 5 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.3 - Part 1.

The second of my two articles covering features added in Python 3.3, this one talks about a large number of changes to the standard library, especially in network and OS modules. I also discuss implicit namespace packages, which are a bit niche but can be useful for maintaining large families of packages.

green python two

This is the second and final article in this series looking at new features in Python 3.3, and we’ll primarily be drilling into a large number of changes to the Python libraries. There’s a lot of interesting stuff to cover on the Internet side, such as the new ipaddress module and changes to email, and also in terms of OS features, such as a slew of new POSIX functions that have been exposed.

Internet

There are a few module changes relating to networking and Internet protocols in this release.

ipaddress

There’s a new ipaddress module for storing IP addresses, as well as other related concepts like subnets and interfaces. All of the types have IPv4 and IPv6 variants, and offer some useful functionality for code to deal with IP addresses generically without needing to worry about the distinctions. The basic types are listed below.

IPv4Address & IPv6Address
Represents a single host address. The ip_address() utility function constructs the appropriate one of these from a string specification such as 192.168.0.1 or 2001:db8::1:0.
IPv4Network & IPv6Network
Represents a single subnet of addresses. The ip_network() utility function constructs one of these from a string specification such as 192.168.0.0/28 or 2001:db8::1:0/56. One thing to note is that because this represents an IP subnet rather than any particular host, it’s an error for any of the bits to be non-zero in the host part of the network specification.
IPv4Interface & IPv6Interface
Represents a host network interface, which has both a host IP address and network giving the details of the local subnets. The ip_interface() utility function constructs this from a string specification such as 192.168.1.20/28. Note that unlike the specification passed to ip_network(), this has non-zero bits in the host part of the specification.

The snippet below demonstrates some of the attributes of address objects:

>>> import ipaddress
>>> x = ipaddress.ip_address("2001:db8::1:0")
>>> x.packed
b' \x01\r\xb8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00'
>>> x.compressed
'2001:db8::1:0'
>>> x.exploded
'2001:0db8:0000:0000:0000:0000:0001:0000'
>>>
>>> x = ipaddress.ip_address("192.168.0.1")
>>> x.packed
b'\xc0\xa8\x00\x01'
>>> x.compressed
'192.168.0.1'
>>> x.exploded
'192.168.0.1'

This snippet illustrates a network and how it can be used to iterate over the addresses within it, as well as check for address membership in the subnet and overlaps with other subnets:

>>> x = ipaddress.ip_network("192.168.0.0/28")
>>> for addr in x:
...     print(repr(addr))
...
IPv4Address('192.168.0.0')
IPv4Address('192.168.0.1')
# ... (12 rows skipped)
IPv4Address('192.168.0.14')
IPv4Address('192.168.0.15')
>>> ipaddress.ip_address("192.168.0.2") in x
True
>>> ipaddress.ip_address("192.168.1.2") in x
False
>>> x.overlaps(ipaddress.ip_network("192.168.0.0/30"))
True
>>> x.overlaps(ipaddress.ip_network("192.168.1.0/30"))
False

And finally the interface can be queried for its address and netmask, as well retrieve its specification either as a netmask or in CIDR notation:

>>> x = ipaddress.ip_interface("192.168.0.25/28")
>>> x.network
IPv4Network('192.168.0.16/28')
>>> x.ip
IPv4Address('192.168.0.25')
>>> x.with_prefixlen
'192.168.0.25/28'
>>> x.with_netmask
'192.168.0.25/255.255.255.240'
>>> x.netmask
IPv4Address('255.255.255.240')
>>> x.is_private
True
>>> x.is_link_local
False

Having implemented a lot of this stuff manually in the past, having them here in the standard library is definitely a big convenience factor.

Email

The email module has always attempted to be compliant with the various MIME RFCs3. The email ecosystem is a broad church, however, and sometimes it’s useful to be able to customise certain behaviours, either to work on email held in non-compliant offline mailboxes or to connect to non-compliant email servers. For these purposes the email module now has a policy framework.

The Policy object controls the behaviour of various aspects of the email module. This can be specified when constructing an instance from email.parser to parse messages, or when constructing an email.message.Message directly, or when serialising out an email using the classes in email.generator.

In fact Policy is an abstract base class which is designed to be extensible, but instances must provide at least the following properties:

Property | Default | Meaning
max_line_length | 78 | Maximum line length, not including separators, when serialising.
linesep | "\n" | Character used to separate lines when serialising.
cte_type | "7bit" | If 8bit is used with a BytesGenerator then non-ASCII bytes may be used.
raise_on_defect | False | Raise errors during parsing instead of adding them to the defects list.

So, if you’ve ever found yourself sick of having to remember to override linesep="\r\n" in a lot of different places or similar, this new approach should be pretty handy.
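For example, here’s a minimal sketch of cloning the default policy to use CRLF line endings and a shorter wrap length, then applying it when serialising a message; clone() hands back a modified copy so you only specify the overrides once:

import io
from email.generator import Generator
from email.message import Message
from email.policy import default

# Build a trivial message, then flatten it using a CRLF policy.
msg = Message(policy=default)
msg["Subject"] = "Policy demonstration"
msg.set_payload("Hello, world")

crlf_policy = default.clone(linesep="\r\n", max_line_length=72)
out = io.StringIO()
Generator(out, policy=crlf_policy).flatten(msg)
print(repr(out.getvalue()))   # lines separated by '\r\n'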

However, one of the main motivations for introducing this system is that it now allows backwards-incompatible API changes to be made in a way which enables authors to opt in to them when ready, but without breaking existing code. If you default to the compat32 policy, you get an interface and functionality which is compatible with the old pre-3.3 behaviour.

There is also an EmailPolicy, however, which introduces a mechanism for handling email headers using custom classes. This policy implements the following controls:

Property | Default | Meaning
refold_source | long | Controls whether email headers are refolded by the generator.
header_factory | See note4 | Callable that takes name and value and returns a custom header object for that particular header.

The classes used to represent headers can implement custom behaviour and allow access to parsed details. Here’s an example using the default policy which implements the EmailPolicy with all default behaviours unchanged:

>>> from email.message import Message
>>> from email.policy import default
>>> msg = Message(policy=default)
>>> msg["To"] = "Andy Pearce <andy@andy-pearce>"
>>> type(msg["To"])
<class 'email.headerregistry._UniqueAddressHeader'>
>>> msg["To"].addresses
(Address(display_name='Andy Pearce', username='andy', domain='andy-pearce'),)
>>>
>>> import email.utils
>>> msg["Date"] = email.utils.localtime()
>>> type(msg["Date"])
<class 'email.headerregistry._UniqueDateHeader'>
>>> msg["Date"].datetime
datetime.datetime(2021, 3, 1, 17, 18, 21, 467804, tzinfo=datetime.timezone(datetime.timedelta(0), 'GMT'))
>>> print(msg)
To: Andy Pearce <andy@andy-pearce>
Date: Mon, 01 Mar 2021 17:18:21 +0000

These classes will handle aspects such as presenting Unicode representations to code, but serialising out using UTF-8 or similar encoding, so the programmer no longer has to deal with such complications, provided they selected the correct policy.

On a separate email-related note, the smtpd module now also supports RFC 5321, which adds an extension framework to allow optional additions to SMTP; and RFC 1870, which offers clients the ability to pre-declare the size of messages before sending them, so errors can be detected earlier without sending a lot of data needlessly.

The smtplib module also has some improvements. The classes now support a source_address keyword argument to specify the source address to use for binding the outgoing socket, for servers where there are multiple potential interfaces and it’s important that a particular one is used. The SMTP class can now also act as a context manager, issuing a QUIT command and disconnecting when the context expires.
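Putting those two together, here’s a sketch of what usage might look like; the host name and source address are just placeholders:

import smtplib

# Bind the outgoing connection to a specific local interface, and rely
# on the context manager to send QUIT and disconnect for us.
with smtplib.SMTP("mail.example.com",
                  source_address=("192.0.2.10", 0)) as smtp:
    smtp.noop()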

FTP

Also on the Internet-related front there were a handful of small enhancements to the ftplib module.

ftplib.FTP Now Accepts source_address
This is to specify the source address to use for binding the outgoing socket, for servers where there are multiple potential interfaces and it’s important that a particular one is used.
FTP_TLS.ccc()
The FTP_TLS class, which is a subclass of FTP which adds TLS support as per RFC 4217, has now acquired a ccc() method which reverts the connection back to plaintext. Apparently, this can be useful to take advantage of firewalls that know how to handle NAT with non-secure FTP without opening fixed ports. So now you know.
FTP.mlsd()
The mlsd() method has been added to FTP objects which uses the MLSD command specified by RFC 3659. This offers a better API than FTP.nlst(), returning a generator rather than a list and includes file metadata rather than just filenames. Not all FTP servers support the MLSD command, however.

Web Modules

The http, html and urllib packages also got some love in this release.

BaseHTTPRequestHandler Header Buffering
The http.server.BaseHTTPRequestHandler class now buffers the response headers and writes them out in one go when end_headers() is called, and a new flush_headers() method can be used to send the accumulated headers explicitly.
html.parser.HTMLParser Now Parses Invalid Markup
After a large collection of bug fixes, errors are no longer raised when parsing broken markup. As a result the old strict parameter of the constructor as well as the now-unused HTMLParseError have been deprecated.
html.entities.html5 Added
This is a useful dict that maps entity names to the equivalent characters, for example html5["amp;"] == "&". This includes all the Unicode characters too. If you want the full list, take a peek at §13.5 of the HTML standard.
urllib.Request Method Specification
The urllib.Request class now has a method parameter which can specify the HTTP method to use. Previously this was decided automatically between GET and POST based on whether body data was provided, and that behaviour is still the default if the method isn’t specified.

Sockets

Support For sendmsg() and recvmsg()
These two functions provide two main additional features over traditional sends: scatter/gather interfaces to send/receive to/from multiple buffers, and the ability to send and receive ancillary data. For more details on ancillary data, see the cmsg man page. There’s a short sketch of the scatter/gather interface just after this list.
PF_CAN Support
The socket class now supports the PF_CAN protocol family, which I don’t pretend to know much about but is an open source stack contributed by Volkswagen which bridges the Controller Area Network (CAN) standard for implementing a vehicle communications bus into the standard sockets layer. This one’s pretty niche, but it was just too cool not to mention5.
PF_RDS Support
Another additional protocol family supported in this release is PF_RDS which is the Reliable Datagram Sockets protocol. This is a protocol developed by Oracle which offers similar interfaces to UDP but offers guaranteed in-order delivery. Unlike TCP, however, it’s still datagram-based and connectionless. You now know at least as much about RDS as I do. If anyone knows why they didn’t just use SCTP, which already seems to offer them everything they need, let me know in the comments.
PF_SYSTEM Support
We all know that new protocol families always come in threes, and the third is PF_SYSTEM. This is a MacOS-specific set of protocols for communicating with kernel extensions6.
sethostname() Added
If the current process has sufficient privilege, sethostname() updates the system hostname. On Unix systems this will generally require running as root or, in the case of Linux at least, having the CAP_SYS_ADMIN capability.
socketserver.BaseServer Actions Hook
The class now calls a service_actions() method every time around the main poll loop. In the base class this method does nothing, but derived classes can implement it to perform periodic actions. Specifically, the ForkingMixIn now uses this hook to clean up any defunct child processes.
ssl Module Random Number Generation
A couple of new OpenSSL functions are exposed for random number generation, RAND_bytes() and RAND_pseudo_bytes(). However, os.urandom() is still preferable for most applications.
ssl Module Exceptions
These are now more fine-grained, and the following new exceptions have been added for particular cases: SSLZeroReturnError, SSLWantReadError, SSLWantWriteError, SSLSyscallError and SSLEOFError.
SSLContext.load_cert_chain() Passwords
The load_cert_chain() method now accepts a password parameter for cases where the private key is encrypted. It can be a str or bytes value containing the actual password, or a callable which will return the password. If specified, this overrides OpenSSL’s default password-prompting mechanism.
ssl Supports Additional Algorithms
Some changes have been made to properly support Diffie-Hellman key exchange on all platforms. In addition, the “PLUS” variants of SCRAM are now supported, which use a technique called channel binding to prevent some person-in-the-middle attacks.
SSL Compression
SSL sockets now have a compression() method to query the current compression algorithm in use. The SSL context also now supports an OP_NO_COMPRESSION option to disable compression.
ssl Next Protocol Negotiation
A new method ssl.SSLContext.set_npn_protocols() has been added to support the Next Protocol Negotiation (NPN) extension to TLS. This allows different application-level protocols to be specified in preference order. It was originally added to support Google’s SPDY, and although SPDY is now deprecated (and superseded by HTTP/2) this extension is general in nature and still useful.
ssl Error Introspection

Instances of ssl.SSLError now have two additional attributes:

  • library is a string indicating the OpenSSL subsystem responsible for the error (e.g. SSL, X509).
  • reason is a string code indicating the reason for the error (e.g. CERTIFICATE_VERIFY_FAILED).
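Here’s the scatter/gather sketch promised above: a single sendmsg() call sends several buffers in one go, and recvmsg() returns the data along with any ancillary data (none in this case). This assumes a Unix-like platform where socketpair() is available.

import socket

# One sendmsg() call gathers three separate buffers into one message.
a, b = socket.socketpair()
a.sendmsg([b"Hello, ", b"world", b"!\n"])

data, ancdata, msg_flags, address = b.recvmsg(64)
print(data)       # b'Hello, world!\n'
print(ancdata)    # [] since no ancillary data was sent

a.close()
b.close()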

New Collections

A few new data structures have been added as part of this release.

SimpleNamespace

There’s a new types.SimpleNamespace type which can be used in cases where you just want to hold some attributes. It’s essentially just a thin wrapper around a dict which allows the keys to be accessed as attributes instead of being subscripted. It’s also somewhat similar to an empty class definition, except for three main advantages:

  • You can initialise attributes in the constructor, as in types.SimpleNamespace(a=1, xyz=2).
  • It provides a readable repr() which follows the usual guideline that eval(repr(x)) == x.
  • It defines an equality operator which compares by equality of attributes, like a dict, unlike the default equality of classes, which compares by the result of id().
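A quick interactive sketch of those three points:

>>> import types
>>> ns = types.SimpleNamespace(a=1, xyz=2)
>>> ns.a
1
>>> ns.xyz = 3
>>> ns
namespace(a=1, xyz=3)
>>> ns == types.SimpleNamespace(a=1, xyz=3)
True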

ChainMap

There’s a new collections.ChainMap class which can group together multiple mappings to form a single unified updateable view. The class overall acts as a mapping, and read lookups are performed across each mapping in turn with the first match being returned. Updates and additions are always performed in the first mapping in the list, and note that this may mask the same key in later mappings (but it will leave those underlying mappings intact).

>>> import collections
>>> a = {"one": 1, "two": 2}
>>> b = {"three": 3, "four": 4}
>>> c = {"five": 5}
>>> chain = collections.ChainMap(a, b, c)
>>> chain["one"]
1
>>> chain["five"]
5
>>> chain.get("ten", "MISSING")
'MISSING'
>>> list(chain.keys())
['five', 'three', 'four', 'one', 'two']
>>> chain["one"] = 100
>>> chain["five"] = 500
>>> chain["six"] = 600
>>> list(chain.items())
[('five', 500), ('one', 100), ('three', 3), ('four', 4), ('six', 600), ('two', 2)]
>>> a
{'five': 500, 'six': 600, 'one': 100, 'two': 2}
>>> b
{'three': 3, 'four': 4}
>>> c
{'five': 5}

Operating System Features

There are a whole host of enhancements to the os, shutil and signal modules in this release which are covered below. I’ve tried to be brief, but include enough useful details for anyone who’s interested but not immediately familiar.

os Module

os.pipe2() Added
On platforms that support it, the pipe2() call is now available. This allows flags to be set on the file descriptors atomically at creation. The O_NONBLOCK flag might seem the most useful, although it’s O_CLOEXEC (close-on-exec) for which the atomicity is really essential. If you open a pipe and then try to set O_CLOEXEC separately, it’s possible for a different thread to call fork() and execve() between these two steps, thus leaving the file descriptor open in the resultant new process (which is exactly what O_CLOEXEC is meant to avoid). There’s a short sketch of this call just after this list.
os.sendfile() Added
In a similar vein, the sendfile() system call is now also available. This allows a specified number of bytes to be copied directly between two file descriptors entirely within the kernel, which avoids the overheads of a copy to and from userspace that read() and write() would incur. This is useful for, say, static file HTTP daemons.
os.get_terminal_size() Added
Queries the specified file descriptor, or sys.stdout by default, to obtain the window size of the attached terminal. On Unix systems (at least) it probably uses the TIOCGWINSZ command with ioctl(), so if the file descriptor isn’t attached to a terminal I’d expect you’d get an OSError due to inappropriate ioctl() for the device. There’s a higher-level shutil.get_terminal_size() discussed below which handles these errors, so it’s probably best to use that in most cases.
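As promised above, here’s a quick sketch of os.pipe2() on a Unix system; both flags are applied atomically when the pipe is created, with no window in which another thread could fork() and exec():

import os

# Both descriptors come back with close-on-exec and non-blocking set.
r, w = os.pipe2(os.O_CLOEXEC | os.O_NONBLOCK)

os.write(w, b"ping")
print(os.read(r, 4))    # b'ping'

os.close(r)
os.close(w)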
Avoiding Symlink Races

Bugs and security vulnerabilities can result from the use of symlinks in the filesystem if you implement the pattern of first obtaining a target filename, and then opening it in a different step. This is because the target of the symlink may be changed, either accidentally or maliciously, in the meantime. To avoid this, various os functions have been enhanced to deal with file descriptors instead of filenames, which avoids this issue. This also offers improved performance.

Firstly, there’s a new os.fwalk() function which is the same as os.walk() except that it accepts a directory file descriptor via the dir_fd parameter, and instead of the 3-tuple return it yields a 4-tuple of (dirpath, dirnames, filenames, dir_fd). Secondly, many functions now support accepting a dir_fd parameter, and any path names specified should be relative to that directory (e.g. access(), chmod(), stat()). This is not available on all platforms, and attempting to use it when not available will raise NotImplementedError. To check support, os.supports_dir_fd is a set of the functions that support it on the current platform.

Thirdly, many of these functions also now support a follow_symlinks parameter which, if False, means they’ll operate on the symlink itself as opposed to the target of the symlink. Once again, this isn’t always available, and you risk getting NotImplementedError if you don’t check that the function is in os.supports_follow_symlinks.

Finally, some functions now also support passing a file descriptor instead of a path (e.g. chdir(), chown(), stat()). Support is optional for this as well and you should check your functions are in os.supports_fd.
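Here’s a short sketch of the dir_fd pattern, using /tmp and a hypothetical file name, with the check against os.supports_dir_fd before relying on it:

import os

# Open the directory once, then perform operations relative to that
# descriptor to avoid races if the directory path is swapped under us.
dir_fd = os.open("/tmp", os.O_RDONLY)
try:
    if {os.open, os.stat} <= os.supports_dir_fd:
        # Both of these paths are interpreted relative to /tmp.
        fd = os.open("example.txt", os.O_CREAT | os.O_WRONLY, dir_fd=dir_fd)
        os.write(fd, b"hello\n")
        os.close(fd)
        print(os.stat("example.txt", dir_fd=dir_fd).st_size)
finally:
    os.close(dir_fd)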

os.access() With Effective IDs
There’s now an effective_ids parameter which, if True, checks access using the effective UID/GID as opposed to the real identifiers. This is platform-dependent, check os.supports_effective_ids, which once again is a set() of methods.
os.getpriority() & os.setpriority()
These underlying system calls are now also exposed, so processes can set “nice” values in the same way as with os.nice(), but for other processes too.
os.replace() Added
The behaviour of os.rename() is to overwrite the destination on POSIX platforms, but to raise an error on Windows. Now there's os.replace(), which does the same thing but always overwrites the destination on all platforms.
Nanosecond Precision File Timestamps
The functions os.stat(), os.fstat() and os.lstat() now support reading timestamps with nanosecond precision, where available on the platform. The os.utime() function supports updating timestamps with nanosecond precision.
Linux Extended Attributes Support
There are now a family of functions to support Linux extended attributes, namely os.getxattr(), os.listxattr(), os.removexattr() and os.setxattr(). These are key/value pairs that can be associated with files to attach metadata for multiple purposes, such as supporting Access Control Lists (ACLs). Support for these is platform-dependent, not just on the OS but potentially on the underlying filesystem in use as well (although most of the Linux ones seem to support them).
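
Here's a rough illustration, assuming a Linux filesystem with extended attribute support. The attribute name is arbitrary, although user-defined attributes generally need to live in the user. namespace:

import os

path = "example.txt"
open(path, "w").close()

os.setxattr(path, "user.comment", b"created by an example script")
print(os.listxattr(path))                  # ['user.comment']
print(os.getxattr(path, "user.comment"))   # b'created by an example script'
os.removexattr(path, "user.comment")
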
Linux Scheduling
On Linux (and any other supported platforms) the os module now allows access to the sched_*() family of functions which control CPU scheduling by the OS. You can find more details on the sched man page.
New POSIX Operations

Support for some additional POSIX filesystem and other operations was added in this release:

  • lockf() applies, tests or removes POSIX filesystem locks from a file.
  • pread() and pwrite() read from or write to a specified offset within a file descriptor, without changing that descriptor's current offset (see the sketch just after this list).
  • readv() and writev() provide scatter/gather read/write, where a single file can be read into, or written from, multiple separate buffers on the application side.
  • truncate() truncates or extends the specified path to an exact size. If the existing file was larger, the excess data is lost; if it was smaller, it's padded with nul bytes.
  • posix_fadvise() allows applications to declare an intention to use a specific access pattern on a file, to allow the filesystem to potentially make optimisations. This can be an intention for sequential access, random access, or an intention to read a particular block so it can be fetched into the cache.
  • posix_fallocate() reserves disk space for expansion of a particular file.
  • sync() flushes any filesystem caches to disk.
  • waitid() is a variant of waitpid() which allows more control over which child process state changes to wait for.
  • getgrouplist() returns the list of group IDs to which the specified username belongs.
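
The pread() and pwrite() pair are the ones I find myself reaching for most often, so here's a brief sketch of their behaviour. Note that the file offset is left untouched by both calls:

import os

fd = os.open("scratch.bin", os.O_CREAT | os.O_RDWR, 0o600)
os.write(fd, b"0123456789")

os.pwrite(fd, b"XYZ", 4)                # overwrite bytes 4-6 in place
print(os.pread(fd, 10, 0))              # b'0123XYZ789'
print(os.lseek(fd, 0, os.SEEK_CUR))     # 10, offset unchanged by pread/pwrite
os.close(fd)
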
os.times() and os.uname() Return Named Tuples
In an extension to the previous plain tuple return types, this allows results to be accessed by attribute name.
os.lseek() in Sparse Files
On some platforms, lseek() now supports additional options for the whence parameter, os.SEEK_HOLE and os.SEEK_DATA. These start at a specified offset and find the nearest location which either has data, or is a hole in the data. They’re only really useful in sparse files, because other files have contiguous data anyway.
stat.filemode() Added
Not strictly in the os module, but since the stat module is a companion to os.stat() I thought it most appropriate to cover here. The previously undocumented function tarfile.filemode() has been exposed as stat.filemode(), which converts a file mode such as 0o100755 into the string form -rwxr-xr-x.

shutil & shlex Modules

shlex.quote() Added
Actually this function hasn’t been added so much as moved in from the pipes module, but it was previously undocumented. It escapes all characters in a string which might otherwise have special significance to a shell.
shutil.disk_usage() Added
Returns the total, used and free disk space values for the partition on which the specified path resides. Under the hood this seems to use os.statvfs(), but this wrapper is more convenient and also works on Windows, which doesn’t provide statvfs().
shutil.chown() Now Accepts Names
As an alternative to the numeric IDs of the user and/or group.
shutil.get_terminal_size() Added
Attempts to discern the terminal window size. If the environment variables COLUMNS and LINES are defined, they’re used. Otherwise, os.get_terminal_size() (mentioned above) is called on sys.stdout. If this fails for any reason, the fallback values passed as a parameter are returned — these default to 80x24 if not specified.
shutil.copy2() and shutil.copystat() Improvements
These now correctly duplicate nanosecond-precision timestamps, as well as extended attributes on platforms that support them.
shutil.move() Symlinks
Now handles symlinks as POSIX mv does: when copying across filesystems it re-creates the symlink instead of copying the contents of the target file, which was the previous behaviour. It also now returns the destination path for convenience.
shutil.rmtree() Security
On platforms that support dir_fd in os.open() and os.unlink(), it's now used by shutil.rmtree() to avoid symlink attacks.

IPC Modules

New signal Functions
  • pthread_sigmask() allows querying and update of the signal mask for the current thread. If you’re interested in more details of the interactions between threads and signals, I found this article had some useful examples.
  • pthread_kill() sends a signal to a specified thread ID.
  • sigpending() is for examining the signals which are currently pending on the current thread or the process as a whole.
  • sigwait() and sigwaitinfo() both block until one of a set of signals becomes pending, with the latter returning more information about the signal which arrived.
  • sigtimedwait() is the same as sigwaitinfo() except that it only waits for a specified amount of time.
Signal Number On Wakeup FD
When using signal.set_wakeup_fd() to allow signals to wake up code waiting on file I/O events (e.g. using the select module), the signal number is now written as a byte to this FD, whereas previously a nul byte was written regardless of which signal arrived. This allows the handler in that polling loop to determine which signal arrived, if multiple are being waited on.
OSError Replaces RuntimeError in signal
When errors occur in the functions signal.signal() and signal.siginterrupt(), they now raise OSError with an errno attribute, as opposed to a simple RuntimeError previously.
subprocess Commands Can Be bytes
Previously this was not possible on POSIX platforms.
subprocess.DEVNULL Added
This allows output to be discarded on any platform.

threading Module

Threading Classes Can Be Subclassed

Several of the objects in threading used to be factory functions returning instances, but are now real classes and hence may be subclassed. This change includes:

  • threading.Condition
  • threading.Semaphore
  • threading.BoundedSemaphore
  • threading.Event
  • threading.Timer
threading.Thread Constructor Accepts daemon
A daemon keyword parameter has been added to the threading.Thread constructor to override the default behaviour of inheriting this from the parent thread.
threading.get_ident() Exposed
The function _thread.get_ident() is now exposed as a supported function threading.get_ident(), which returns the thread ID of the current thread.

time Module

The time module has several new functions which are useful. The first three of these are new clocks with different properties:

time.monotonic()
Returns the (fractional) number of seconds since some unspecified reference point. The absolute value of this time isn’t useful, but it’s guaranteed to monotonically increase and it’s unaffected by any changes to system time, so it’s useful to measure the time between two events in a way which won’t be broken during DST boundaries or the system administrator changing the clock.
time.perf_counter()
As time.monotonic(), but with the highest available resolution on the platform.
time.process_time()
Returns the total time spent active in the current process, including both system and user CPU time. Whilst the process is sleeping (blocked) this counter doesn’t tick up. The reference point is undefined, so only the difference between consecutive calls is valid.
time.get_clock_info()

This function returns details about the specified clock, which could be any of the options above (passed as a string) or "time" for the details of the time.time() standard system clock. The result is an object which has the following attributes:

  • adjustable is True if the clock may be changed by something external to the process (e.g. a system administrator or an NTP daemon).
  • implementation is the name of the underlying C function called to provide the timer value.
  • monotonic is True if the clock is guaranteed to never go backwards.
  • resolution is the resolution of the clock in fractional seconds.
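
As a quick illustration, noting that the exact implementation and resolution you see will depend on your platform:

import time

info = time.get_clock_info("monotonic")
print(info.monotonic)       # True, by definition for this clock
print(info.adjustable)      # typically False for a monotonic clock
print(info.implementation)  # name of the underlying C call, platform-dependent
print(info.resolution)      # fractional seconds, platform-dependent
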
Access To System Clocks

The time module has also exposed the following underlying system calls to query the status of various system clocks:

  • clock_getres() returns the resolution of the specified clock, in fractional seconds.
  • clock_gettime() returns the current time of the specified clock, in fractional seconds.
  • clock_settime() sets the time on the specified clock, if the process has appropriate privileges. The only clock for which that’s supported currently is CLOCK_REALTIME.

The clocks which can be specified in this release are:

  • time.CLOCK_REALTIME is the standard system clock.
  • time.CLOCK_MONOTONIC is a monotonically increasing clock since some unspecified reference point.
  • time.CLOCK_MONOTONIC_RAW provides access to the raw hardware timer that’s not subject to adjustments.
  • time.CLOCK_PROCESS_CPUTIME_ID counts CPU time on a per-process basis.
  • time.CLOCK_THREAD_CPUTIME_ID counts CPU time on a per-thread basis.
  • time.CLOCK_HIGHRES is a higher-resolution clock only available on Solaris.
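
A minimal sketch of querying these directly. These calls are Unix-only, and some of the constants above won't exist on every platform, so I've stuck to the portable ones:

import time

# Resolution and current value of the monotonic clock, in fractional seconds.
print(time.clock_getres(time.CLOCK_MONOTONIC))
print(time.clock_gettime(time.CLOCK_MONOTONIC))

# The realtime clock should agree (roughly) with time.time().
print(time.clock_gettime(time.CLOCK_REALTIME))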

Implicit Namespace Packages

This is a feature which is probably only of interest to a particular set of package maintainers, so I’m going to do my best not to drill into too much detail. However, there’s a certain level of context required for this to make sense — you can always skip to the next section if it gets too dull!

First I should touch on what a namespace package is in the first place. If you're a Python programmer, you'll probably be aware that the basic unit of code reusability is the module1. Modules can be imported individually, but they can also be collected into packages, which can contain modules or other packages. In their simplest forms, a module is a single .py file and a package is a directory which contains a file called __init__.py. The contents of this script are executed when the package is imported, but the very fact of the file's existence is what tags it as a package to Python, even if the file is empty.

So now we come to what on earth is a namespace package. Simply put, this is a logical package which presents a uniform name to be imported within Python code, but is physically split across multiple directories. For example, you may want to create a machinelearning package, which itself contains other packages like dimensionreduction, anomalydetection and clustering. For such a large domain, however, each of those packages is likely to consist of its own modules and subpackages, and have its own team of maintainers, and coordinating some common release strategy and packaging system across all those teams and repositories is going to be really painful. What you really want to do is have each team package and ship its own code independently, but still have them presented to the programmer as a uniform package. This would be a namespace package.

Python already had two approaches for doing this, one provided by setuptools and later another one provided by the pkgutil module in the standard library. Both of these rely on the namespace package providing some respective boilerplate __init__.py files to declare it as a namespace package. These are shown below for reference, but I’m not going to discuss them further because this section is about the new approach.

# The setuptools approach involves calling a function in __init__.py,
# and also requires some changes in setup.py.
__import__('pkg_resources').declare_namespace(__name__)

# The pkgutil approach just has each package add its own directory to
# the __path__ attribute for the namespace package, which defines the
# list of directories to search for modules and subpackages. This is
# more or less equivalent to a script modifying sys.path, but more
# carefully scoped to impact only the package in question.
__path__ = __import__('pkgutil').extend_path(__path__, __name__)

Both of these approaches share some issues, however. One of them is that when OS package maintainers (e.g. for Linux distributions) want somewhere to install these different things, they’d probably like to choose the same place, to keep things tidy. But this means all those packages are going to try and install an __init__.py file over the top of each other, which makes things tricky — the OS packaging system doesn’t know these files necessarily contain the same things and will generate all sorts of complaints about the conflict.

The new approach, therefore, is to make these packages implicit, where there’s no need for an __init__.py. You can just chuck some modules and/or sub-packages into a directory which is a subdirectory of something on sys.path and Python will treat that as a package and make the contents available. This is discussed in much more detail in PEP 420.

Beyond these rather niche use-cases of mega-packages, this feature seems like it should make life a little easier creating regular packages. After all, it’s quite common that you don’t really need any setup code in __init__.py, and creating that empty file just feels messy. So if we don’t need to these days then why bother?

Well, as a result of this change it’s true that regular packages can be created without the need for __init__.py, but the old approach is still the correct way to create a regular package, and has some advantages. The primary one is that omitting __init__.py is likely to break existing tools which attempt to search for code, such as unittest, pytest and mypy to name just a few. It’s also noteworthy that if you rely on namespace packages and then someone adds something to your namespace which contains an __init__.py, this ends the search process for the package in question since Python assumes this is a regular package. This means all your other implicit namespace packages will be suddenly hidden when the clashing regular package is installed. Using __init__.py consistently everywhere avoids this problem.

Furthermore, regular packages can be imported as soon as they're located on the path, but for namespace packages the entire path must be fully processed before the package can be created. The path entries must also be recalculated on every import, for example in case the user has added additional entries to sys.path which would contribute additional content to an existing namespace package. These factors can introduce performance issues when importing namespace packages.

There are also some more minor factors which favour regular packages, which I'm including below for completeness but which I doubt will be particularly compelling for many people.

  • Namespace packages lack some features of regular packages: they're missing a __file__ attribute and their __path__ attribute is read-only. These aren't likely to be a major issue for anyone, unless you have some grotty code which is trying to calculate paths relative to the source files in the package or similar.
  • The setuptools.find_packages() function won’t find these new style namespace packages, although there is now a setuptools.find_namespace_packages() function which will, so it should be a fairly simple issue to modify setup.py appropriately.
  • If you’ve implemented your own import finders and loaders as per PEP 302 then these will need to be modified to support this new approach. I’m guessing this is a pretty small slice of developers, though.

As a final note, if you are having any issues with imports, I strongly recommend checking out Nick Coghlan's excellent article Traps for the Unwary in Python's Import System, which discusses some of the most common problems you might run into.

Other Builtin Changes

There are a set of small but useful changes in some of the builtins that are worth noting.

open() Opener
There is a new opener parameter for open() calls, which is a callable invoked with arguments (filename, flags) and expected to return an open file descriptor, as os.open() would. This can be used to, for example, pass flags which aren't supported by open() itself, whilst still benefiting from the context manager behaviour that open() offers.
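
As a sketch of the sort of thing that's possible, here's an opener which resolves filenames relative to an already-open directory descriptor. The directory and filename are just for illustration:

import os

def opener_for(dir_fd):
    # open() calls this with (filename, flags) and expects a raw file
    # descriptor back, exactly as os.open() would return.
    def opener(path, flags):
        return os.open(path, flags, dir_fd=dir_fd)
    return opener

dir_fd = os.open("/tmp", os.O_RDONLY)
with open("example.txt", "w", opener=opener_for(dir_fd)) as handle:
    handle.write("hello\n")
os.close(dir_fd)
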
open() Exclusively
The x mode was added for exclusive creation, failing if the file already exists. This is equivalent to the O_EXCL flag to open() on POSIX systems.
print() Flushing
print() now has a flush keyword argument which, if set to True, flushes the output stream immediately after the output.
hash() Randomization
As of Python 3.3, a random salt is used during hashing operations by default. This improves security by making hash values less predictable between separate invocations of the interpreter, but it does mean you definitely must not rely on them being consistent if you serialise them out somewhere. I wrote a brief article about this topic around half a decade ago, as I was quite surprised at the time how serious a problem it can be.
str.casefold()
str objects now have a casefold() method to return a casefolded version of the string. This is intended for case-insensitive comparisons, and is a much more Unicode-friendly approach than calling upper() or lower(). A full discussion of why is outside the scope of this article, but I suggest the excellent article Truths Programmers Should Know About Case by James Bennett for an informative discussion of the complexities of case outside of Latin-1 languages. Spoiler: it's harder than you think, which should always be your default assumption for any I18n issues2.
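
The classic example is German, where lower() alone isn't enough for a reliable comparison:

>>> "Straße".lower() == "STRASSE".lower()
False
>>> "Straße".casefold() == "STRASSE".casefold()
True
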
copy() and clear()
There are now copy() and clear() methods on both list and bytearray objects, with the obvious semantics.
range Equality
Equality comparisons have been defined on range objects based on equality of the generated values. For example, range(3, 10, 3) == range(3, 12, 3). However, bear in mind this doesn’t evaluate the actual contents so range(3) != [0, 1, 2]. Also, applying transformations such as reversed seems to defeat these comparisons.
dict.setdefault() enhancement
Previously dict.setdefault() resulted in two hash lookups, one to check for an existing item and one for the insertion. Since a hash lookup can call into arbitrary Python code this meant that the operation was potentially non-atomic. This has been fixed in Python 3.3 to only perform the lookup once.
bytes Methods Taking int
The methods count(), find(), rfind(), index() and rindex() of bytes and bytearray objects now accept an integer in the range 0-255 to specify a single byte value.
memoryview changes
The memoryview class has a new implementation which fixes several previous ownership and lifetime issues which had led to crash reports. This release also adds a number of features, such as better support for multi-dimensional lists and more flexible slicing.

Other Module Changes

There were some other additional and improved modules which I’ll outline briefly below.

bz2 Rewritten

The bz2 module has been completely rewritten, adding several new features:

  • There’s a new bz2.open() function, which supports opening files in binary mode (where it operates just like the bzip2.BZ2File constructor) or text mode (where it applies an io.TextIOWrapper).
  • You can now pass any file-like object to bz2.BZ2File using the fileobj parameter.
  • Support for multi-stream inputs and outputs has been added.
  • All of the io.BufferedIOBase interface is now implemented by bz2.BZ2File, except for detach() and truncate().
Abstract Base Classes Moved To collections.abc
This avoids confusion with the concrete classes provided by collections. Aliases still exist at the top level, however, to preserve backwards-compatibility.
crypt.mksalt()
For convenience of generating a random salt, there’s a new crypt.mksalt() function to create the 2-character salt used by Unix passwords.
datetime Improvements

There are a few enhancements to the ever-useful datetime library.

  • Equality comparisons between naive and timezone-aware datetime objects used to raise TypeError, but it was decided this was inconsistent with the behaviour of other incomparable types. As of Python 3.3 this will simply return False instead. Note that other comparisons will still raise TypeError, however.
  • There’s a new datetime.timestamp() method to return an epoch timestamp representation. This is implicitly in UTC, so timezone-aware datetimes will be converted and naive datetimes will be assumed to be in the local timezone and converted using the platform’s mktime().
  • datetime.strftime() now supports years prior to 1000 CE.
  • datetime.astimezone() now assumes the system time zone if no parameters are passed.
decimal Rewritten in C
There’s a new C implementation of the decimal module using the high-performance libmpdec. There are some API changes as a result which I’m not going to go into here as I think most of them only impact edge cases.
functools.lru_cache() Type Segregation
Back in an earlier article we talked about the functools.lru_cache decorator for caching function results based on the parameters. This caching was based on checking the full set of arguments for equality with previous ones specified, and if they all compared equal then the cached result would be returned instead of calling the function. In this release, there's a new typed parameter which, if True, also requires the arguments to be of the same type to trigger the caching behaviour. For example, calling a function with 3 and then 3.0 would return the cached value with typed=False (the default) but would call the function twice with typed=True.
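
A small sketch to illustrate the difference, where the function itself is just a stand-in:

from functools import lru_cache

@lru_cache(maxsize=32, typed=True)
def expensive(value):
    print("computing for", repr(value))
    return value * 2

expensive(3)     # prints "computing for 3"
expensive(3.0)   # prints again: 3 == 3.0, but the types differ
expensive(3)     # served from the cache, nothing printed
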
importlib
A number of changes to the mechanics of importing so that importlib.__import__ is now used directly by __import__(). A number of other changes have had to happen behind the scenes to make this happen, but now it means that the import machinery is fully exposed as part of importlib which is great for transparency and for any code which needs to find and import modules programmatically. I considered this a little niche to cover in detail, but the release notes have some good discussion on it.
io.TextIOWrapper Buffering Optional
The constructor of io.TextIOWrapper has a new write_through optional argument. If set to True, write() calls are guaranteed not to be buffered but will be immediately passed to the underlying binary buffer.
itertools.accumulate() Supports Custom Function
This function, that was added in the previous release, now supports any binary function as opposed to just summing results. For example, passing func=operator.mul would give a running product of values.
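
For example, a running product looks like this:

>>> import itertools, operator
>>> list(itertools.accumulate([1, 2, 3, 4, 5], operator.mul))
[1, 2, 6, 24, 120]
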
logging.basicConfig() Supports Handlers
There’s now a handlers parameter on logging.basicConfig() which takes an iterable of handlers to be added the root logger. This is probably handy for those scripts that are just large enough to be worth using logging, particularly if you consider the code might one day form the basis of a reusable module, but which aren’t big enough to mess around setting up a logging configuration file.
lzma Added
Provides LZMA compression, first used in the 7-Zip program and now primarily provided by the xz utility. This library supports the .xz file format, and also the .lzma legacy format used by earlier versions of this utility.
math.log2() Added
Not just a convenient alias for math.log(x, 2), this will often be faster and/or more accurate than the existing approach, which involves the usual division of logs to convert the base.
pickle Dispatch Tables
The pickle.Pickler class constructor now takes a dispatch_table parameter which allows the pickling functions to be customised on a per-type basis.
sched Improvements

The sched module, for generalised event scheduling, has had a variety of improvements made to it:

  • run() can now be passed blocking=False to execute pending events and then return without blocking. This widens the scope of applications which can use the module.
  • sched.scheduler can now be used safely in multithreaded environments.
  • The parameters to the sched.scheduler constructor now have sensible defaults.
  • enter() and enterabs() methods now no longer require the argument parameter to be specified, and also support a kwargs parameter to pass values by keyword to the callback.
sys.implementation
There’s a new sys.implementation attribute which holds information about the current implementation being used. A full list of the attributes is beyond the scope of this article, but as one example sys.implementation.version is a version tuple in the same format as sys.version_info. The former contains the implmentation version whereas the latter specifes the Python language version implemented — for CPython the two will be the same, since this is the reference implementation, but for cases like PyPy the two will differ. PEP 412 has more details.
tarfile Supports LZMA
Using the new lzma module mentioned above.
textwrap Indent Function
A new indent() function allows a prefix to be added to every line in a given string. This functionality has been in the textwrap.TextWrapper class for some time, but is now exposed as its own function for convenience.
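
For example:

>>> import textwrap
>>> print(textwrap.indent("first line\nsecond line", "> "))
> first line
> second line
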
xml.etree.ElementTree C Extension
This module now uses its C extension by default, there’s no longer any need to import xml.etree.cElementTree, although that module remains for backwards compatibility.
zlib EOF
The zlib module now has a zlib.Decompress.eof attribute which is True if the end of the stream has been reached. If this is False but there is no more data, it indicates that the compressed stream has been truncated.

Other Changes

As usual, there were some minor things that struck me as less critical, but I wanted to mention nonetheless.

Raw bytes literals
“Raw” str literals are written r"..." and bytes literals are b"...". Previously, combining these required br"...", but as of Python 3.3 rb"..." will also work. Rejoice in the syntax errors thus avoided.
2.x-style Unicode literals
To ease transition of Python 2 code, u"..." literals are once again supported for str objects. This has no semantic significance in Python 3 since it is the default.
Fine-Grained Import Locks
Imports used to take a global lock, which could lead to some odd effects in the presence of multiple threads importing concurrently and code being run at import time. In Python 3.3 this has been switched to a per-module lock, so imports in multiple concurrent threads are still serialised correctly whilst still allowing different modules to be imported independently. If you enjoy learning about the subtle issues one must consider when trying to make concurrency bullet-proof, you may find issue 9260 an interesting read.
Windows Launcher
On Windows the Python installer now sets up a launcher which will run .py files when double-clicked. It even checks the shebang line to determine the Python version to use, if multiple are available.
Buffer Protocol Documentation
The buffer protocol documentation has been improved significantly.
Efficient Attribute Storage
The dict implementation used for holding attributes of objects has been updated to allow it to share the memory used for the key strings between multiple instances of a class. This can save 10-20% on memory footprints on heavily object-oriented code, and increased locality also achieves some modest performance improvements of up to 10%. PEP 412 has the full details.

Conclusions

So that’s Python 3.3, and what a lot there was in it! The yield from support is handy, but really just a taster of proper coroutines that are coming in future releases with the async keyword. The venv module is a bit of a game-changer in my opinion, because now that everyone can simply rely on it being there we can do a lot better documenting and automating development and runtime setups of Python applications. Similarly the addition of unittest.mock means everyone can use the powerful mocking features it provides to enhance unit tests without having to add to their project’s development-time dependencies. Testing is something where you want to lower the barrier to it as much as you can, to encourage everyone to use it freely.

The other thing that jumped out to me about this release in particular was the sheer breadth of new POSIX functions and other operating system functionality that are now exposed. It’s always a pet peeve of mine when my favourite system calls aren’t easily exposed in Python, so I love to see these sweeping improvements.

So all in all, no massive overhauls, but a huge array of useful features. What more could you ask from a point release?


  1. This could be pure Python or an extension module in another language like C or C++, but that distinction isn’t important for this discussion. 

  2. Or if you really want the nitty gritty, feel free to peruse §3.13 of the Unicode standard. But if you do — and with sincere apologies to the authors of the Unicode standard who’ve forgotten more about international alphabets than I’ll ever know — my advice is to brew some strong coffee first. 

  3. Well, since you asked that’s specifically RFC 2045, RFC 2046, RFC 2047, RFC 4288, RFC 4289 and RFC 2049

  4. The default header_factory is documented in the email.headerregistry module. 

  5. And let’s be honest, my “niche filter” is so close to the identity function that they could probably share a lawnmower. I tend to only miss out the things that apply to only around five people, three of whom don’t even use Python. 

  6. However, since KEXTs have been replaced with system extensions more recently, which run in user-space rather than in the kernel, then I don’t know whether the PF_SYSTEM protocols are going to remain relevant for very long. 

7 Mar 2021 at 11:27AM in Software
 |   | 
Photo by David Clode on Unsplash
 | 

☑ Python 2to3: What’s New in 3.3 - Part 1

6 Mar 2021 at 11:11PM in Software
 |   | 

This is part 4 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.2.

The fourth Python 3.x release brought another slew of great new features. So many, in fact, that I’ve split this release into two articles, of which this is the first. Highlights in this part include yield from expressions, mocking support in unittest and virtualenv support in the standard library.

green python two

The next release in this sequence of articles is Python 3.3, which was released just over 19 months after version 3.2. This one was another packed release and it contained so many features I decided to split this into two articles. In this first one we’ll be covering yield from expressions, which allow generators to delegate to each other, support for mocking built in to the standard library, the venv module, and a host of diagnostic improvements.

Builtin Virtual Environments

If you’ve been using Python for a decent length of time, you’re probably familiar with the virtualenv tool written by prolific Python contributor Ian Bicking1 around thirteen years ago. This was the sort of utility that you instantly wonder how you managed without it before, and it’s become a really key development2 tool for many Python developers.

As an acknowledgement of its importance, the Python team pulled a subset of its functionality into the standard Python library as the new venv module, and exposed a command-line interface with the pyvenv script. This is fully detailed in PEP 405.

On the face of it, this might not seem to be all that important, since virtualenv already exists and does a jolly good job all round. However, I think there are a whole host of benefits which make this strategically important. First and foremost, since it’s part of the standard distribution, there’s little chance that the core Python developers will make some change that renders it incompatible on any supported platform. It can also probably benefit from internal implementation details of Python on which an external project couldn’t safely rely, which may enable greater performance and/or reliability.

Secondly, the fact that it’s installed by default means that project maintainers have a baseline option they can count on, for installation or setup scripts, or just for documentation. This will no doubt cut down on support queries from inexperienced users who wonder why this virtualenv command isn’t working.

Thirdly, this acts as a defense against the forking of the project, which is always a background concern with open source. It’s not uncommon for one popular project to be forked and taken in two divergent directions, and then suddenly project maintainers and users alike need to worry about which one they’re going with, the support efforts of communities are split, and all sorts of other annoyances. Having support in the standard library means there’s an option that can be expected to work in all cases.

In any case, regardless of whether you feel this is an important feature or just a minor tweak, it’s at least handy to have venv always available on any platform where Python is installed.

As an aside, if you’re curious about how virtualenv works then Carl Meyer presented an interesting talk on the subject, of which you can find the video and slides online.

Generator Delegation

I actually already discussed this topic fairly well in my first article in my series on coroutines in Python a few years ago. But to save you the trouble of reading all that, or the gory details in PEP 380, I’ll briefly cover it here.

This is a fairly straightforward enhancement which allows generators to yield control to each other, performed using the new yield from expression. It’s perhaps best explained with a simple example:

>>> def fun2():
...     yield from range(10)
...     yield from range(30, 20, -2)
...
>>> list(fun2())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 28, 26, 24, 22]

On the face of it this is just a simple shorthand for for i in iter: yield i. However, there’s rather more to it when you consider the coroutine-like features that generators have where you can pass values into them, since these values also need to be routed directly to the delegate generator as well as the yielded values being routed back out.

There’s also an enhancement to generator return values. Previously the use of return within a generator was simply a way to terminate the generator, raising StopIteration, and it was a syntax error to provide an argument to the return statement. As of Python 3.3, however, this has been relaxed and a value is permitted. The value is returned to the caller by attaching it to the StopIteration exception, but where yield from is used this becomes the value to which the yield from expression evaluates.

This may seem a bit abstract and hard to grasp, so I’ve included an example of using these features for parsing HTTP chunk-encoded bodies. This is a format used for HTTP responses when the sender doesn’t know the size of the response up front: the data is split into chunks of a known size, and the length of each chunk is sent first followed by its data. This means the sender can keep transmitting data until it’s exhausted, and the reader can be processing it in parallel. The end of the data is indicated by an empty chunk.

This sort of message-based interpretation of data from a byte stream is always a little fiddly. It’s most efficient to read in large chunks from the socket, and in the case of a chunk header you don’t know how many bytes it’s going to be anyway, since the length is a variable number of digits. As a result, by the time you’ve read the data you need, the chances are your buffer already contains some of the next piece of data. If you want to structure your code well and split parsing the various pieces up into multiple functions, as the single responsibility principle suggests, then this means you’ve always got this odd bit of “overflow” data as the initial set to parse before reading more from the data source.

There’s also the aspect that it’s nice to decouple the parsing from the data source. For example, although you’d expect a HTTP response to generally come in from a socket object, there’ll always be someone who already has it in a string form and still wants to parse it — so why force them to jump through some hoops making their string look like a file object again, when you could just structure your code a little more elegantly to decouple the parsing and I/O?

For all of the above reasons, I think that generators make a fairly elegant solution to this issue. Take a look at the code below and then I’ll explain why it works and why I think this is potentially a useful approach.

def content_length_decoder(length, data=b""):
    """Can be used directly by requests with Content-Length header."""

    while len(data) < length:
        length -= len(data)
        data = yield data
    new_input = yield data[:length]
    return data[length:] + new_input

def chunked_decoder(data=b""):
    """Decodes HTTP bodies with chunked encoding."""

    while True:
        crlf_index = data.find(b"\r\n")
        if crlf_index < 0:
            # Loop until we have <len>CRLF chunk header.
            data += yield b""
            continue
        chunk_len = int(data[:crlf_index], 16)

        if chunk_len == 0:
            # Zero length chunk terminates body.
            return data[crlf_index+2:]

        chunk = content_length_decoder(chunk_len, data[crlf_index+2:])
        data = yield from chunk

        # Strip off trailing CRLF from end of chunk.
        while len(data) < 2:
            data += yield b""
        data = data[2:]

# This is an example of a chunk-encoded response, with the headers
# already stripped off.
body_pieces = (b"C\r\nStrange",
               b" wome\r\n1B\r\nn lying in",
               b" ponds, distribut\r\n13\r",
               b"\ning swords, is no b\r\n",
               b"20\r\nasis for a system of ",
               b"government!\r\n0\r\n\r\n")

decoder = chunked_decoder()
document = bytearray(next(decoder))
try:
    for input_piece in body_pieces:
        document += decoder.send(input_piece)
    document += decoder.send(b"")
except StopIteration as exc:
    # Generally expect this to be the final terminating CRLF, but HTTP
    # standard allows for "trailing headers" here.
    print("Trailing headers: " + repr(exc.value))
print("Document: " + repr(document))

The general idea here is that each generator parses data which is passed to it via its send() method. It processes input until its section is done, and then it returns control to the caller. Ultimately decoded data is yielded from the generators, and each one returns any unparsed input data via its StopIteration exception.

In the example above you can see how this allows content_length_decoder() to be factored out from chunked_decoder() and used to decode each chunk. This refactoring would allow a more complete implementation to reuse this same generator to decode bodies which have a Content-Length header instead of being sent in chunked encoding. Without yield from this delegation wouldn’t be possible unless orchestrated by the top-level code outside of the generators, and that breaks the abstraction.

This is just one example of using generators in this fashion which sprung to mind, and I’m sure there are better ones, but hopefully it illustrates some of the potential. Of course, there are more developments on coroutines in future versions of Python 3 which I’ll be looking at later in this series, or if you can’t wait then you can take a read through my earlier series of articles specifically on the topic of coroutines.

Unit Testing

The major change in Unit Testing in Python 3.3 is that the mocking library has been merged into the standard library as unittest.mock. A full overview of this library is way beyond the scope of this article, so I’ll briefly touch on the highlights with some simple examples.

The core classes are Mock and MagicMock, where MagicMock is a variation which has some additional behaviours around Python’s magic methods4. These classes will accept requests for any attribute or method call, and create a mock object to track accesses to them. Afterwards, your unit test can make assertions about which methods were called by the code under test, including which parameters were passed to them.

One aspect that’s perhaps not immediately obvious is that these two objects can represent more or less any object, such as functions or classes. For example, if you create a Mock instance which represents a class and then access a method on it, a child Mock object represents that method. This is possible in Python since everything comes down to attribute access at the end of the day — it just happens that calling a method queries an attribute __call__ on the object. Python’s duck-typing approach means that it doesn’t care whether it’s a genuine function that’s being called, or an object which implements __call__ such as Mock.

Here’s a short snippet which shows that without any configuration, a Mock object can be used to track calls to methods:

>>> from unittest import mock
>>> m = mock.Mock()
>>> m.any_method()
<Mock name='mock.any_method()' id='4352388752'>
>>> m.mock_calls
[call.any_method()]
>>> m.another_method(123, "hello")
<Mock name='mock.another_method()' id='4352401552'>
>>> m.mock_calls
[call.any_method(), call.another_method(123, 'hello')]

Here I’m using the mock_calls attribute, which tracks the calls made, but there are also a number of assert_X() methods which are probably more useful in the context of a unit test. They work in a very similar way to the existing assertions in unittest.
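
For example, here's a quick sketch of how a couple of those assertions behave:

from unittest import mock

m = mock.Mock()
m.some_method(123, key="value")

# These pass silently because the call really did happen...
m.some_method.assert_called_once_with(123, key="value")
m.assert_has_calls([mock.call.some_method(123, key="value")])

# ...whereas a mismatched expectation raises AssertionError.
try:
    m.some_method.assert_called_with(456)
except AssertionError as error:
    print("Mismatch detected:", error)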

This is great for methods with no return type and are side-effect free, but what about implementing those behaviours? Well, that’s pretty straightforward once you understand the basic structure. Let’s say you have a class and you want to add a method with a side-effect, you just create a new Mock object and assign that as an attribute with the name of the method to the mock that’s representing your object instance. Then you create some function which implements whatever side-effects you require, and you assign that to the special side_effect attribute of the Mock representing your method. And then you’re done:

>>> m = mock.Mock()
>>> m.mocked_method = mock.Mock()
>>> def mocked_method_side_effect(arg):
...     print("Called with " + repr(arg))
...     return arg * 2
...
>>> m.mocked_method.side_effect = mocked_method_side_effect
>>> m.mocked_method(123)
Called with 123
246

Finally, as an illustration of the MagicMock class, you can see from the snippet below that the standard Mock object refuses to auto-create magic methods, but MagicMock implements them in the same way. You can add side-effects and return values to these in the same way as any normal methods.

>>> m = mock.Mock()
>>> len(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'Mock' has no len()
>>> mm = mock.MagicMock()
>>> len(mm)
0
>>> mm[123]
<MagicMock name='mock.__getitem__()' id='4352367376'>
>>> mm.mock_calls
[call.__len__(), call.__getitem__(123)]
>>> mm.__len__.mock_calls
[call()]
>>> mm.__getitem__.mock_calls
[call(123)]

That covers the basics of creating mocks, but how about injecting them into your code under test? Well, of course sometimes you can do that yourself by passing in a mock object directly. But often you’ll need to change one of the dependencies of the code. To do this, you can use mock.patch as a decorator around your test methods to overwrite one or more dependencies with mocks. In the example below, the time.time() function is replaced by a MagicMock instance, and the return_value attribute is used to control the time reported to the code under test.

import random
import time
from unittest import mock

class StickyRandom:
    def __init__(self):
        self.last_time = 0
    def get_value(self):
        if time.time() - self.last_time > 60:
            self.last_value = random.random()
            self.last_time = time.time()
        return self.last_value

@mock.patch("time.time")
def test_sticky_random(time_mock):
    instance = StickyRandom()
    time_mock.return_value = 1000
    value1 = instance.get_value()
    time_mock.return_value = 1030
    value2 = instance.get_value()
    assert value1 == value2
    time_mock.return_value = 1090
    value3 = instance.get_value()
    assert value3 != value1

test_sticky_random()

So that’s it for my whirlwind tour of mocking. There’s a lot more to it than I’ve covered, of course, so do take the time to read through the full documentation.

Diagnostics Changes

There are a few changes which are helpful for exception handling and introspection.

Nicer OS Exceptions

The situation around catching errors in Operating System operations has always been a bit of a mess with too many exceptions covering what are very similar operations at their heart. This can cause all sorts of annoying bugs in error handling if you try to catch the wrong exception.

For example, if you fail to os.remove() a file you get an OSError but if you fail to open() it you get an IOError. So that’s two exceptions for I/O operations right there, but if you happen to be using sockets then you need to also worry about socket.error. If you’re using select you might get select.error, but equally you might get any of the above as well.

The upshot of all this is that for any block of code that does a bunch of I/O you end up having to either catch Exception, which can hide other bugs, or catch all of the above individually.

Thankfully in Python 3.3 this situation has been averted since these have all been collapsed into OSError as per PEP 3151. The full list that’s been rolled into this is:

  • OSError
  • IOError
  • EnvironmentError
  • WindowsError
  • mmap.error
  • socket.error
  • select.error

Never fear for your existing code, however, because the old names have all been maintained as aliases for OSError.

As well as this, however, there’s another change that’s even handier. Often you need to only catch some subset of errors and allow others to pass on as true error conditions. A common example of this is where you’re doing non-blocking operations, or you’ve specified some sort of timeout, and you want to ignore those cases but still catch other errors. In these cases, you often find yourself branching on errno like this:

import errno
import socket

def reliable_send(data, sock):
    while data:
        try:
            sent = sock.send(data)
            data = data[sent:]
        except socket.error as exc:
            if exc.errno == errno.EINTR:
                continue
            else:
                raise

It’s not terrible, but it breaks the usual idiom of each error being its own exception, and makes things just that bit harder to read.

Python 3.3 to the rescue! New exception types have been added which are derivations of OSError and correspond to the more common of these error cases, so they can be caught more gracefully. The new exceptions and the equivalent errno codes are:

New Exception Errno code(s)
BlockingIOError EAGAIN, EALREADY, EWOULDBLOCK, EINPROGRESS
ChildProcessError ECHILD
FileExistsError EEXIST
FileNotFoundError ENOENT
InterruptedError EINTR
IsADirectoryError EISDIR
NotADirectoryError ENOTDIR
PermissionError EACCES, EPERM
ProcessLookupError ESRCH
TimeoutError ETIMEDOUT
ConnectionError A base class for the remaining exceptions…
… BrokenPipeError EPIPE, ESHUTDOWN
… ConnectionAbortedError ECONNABORTED
… ConnectionRefusedError ECONNREFUSED
… ConnectionResetError ECONNRESET

The BlockingIOError exception also has a handy characters_written attribute, when using buffered I/O classes. This indicates how many characters were written before the filehandle became blocked.

To finish off this section, here’s a small example of how this might make code more readable. Take this code to handle a set of different errors which can occur when opening and attempting to read a particular filename:

import errno

try:
    # ...
except (IOError, OSError) as exc:
    if exc.errno == errno.EISDIR:
        print("Can't open directories")
    elif exc.errno in (errno.EPERM, errno.EACCES):
        print("Permission error")
    else:
        print("Unknown error")
except UnicodeDecodeError:
    print("Unicode error")
except Exception:
    print("Unknown error")

Particularly unpleasant here is the code duplication between handling unmatched errno codes and random other exceptions — although that’s just the duplication of a print() in this example, in reality that could become significant code duplication. With the new exceptions introduced in Python 3.3, however, this is all significantly cleaner:

try:
    # ...
except IsADirectoryError:
    print("Can't open directories")
except PermissionError:
    print("Permission error")
except UnicodeDecodeError:
    print("Unicode error")
except Exception:
    print("Unknown error")

Suppressing Exception Chaining

As we covered in the first post in this series, exceptions in Python 3 can be chained. When they are chained, the default traceback is updated to show this context, and earlier exceptions can be recovered from attributes of the latest.

You might also recall that it’s possible to explicitly chain exceptions with the syntax raise NewException() from exc. This sets the __cause__ attribute of the exception, as opposed to the __context__ attribute which records the original exception being handled if this one was raised within an existing exception handling block.

Well, Python 3.3 adds a new variant to this which can be used to suppress the display of any exceptions from __context__, which is raise NewException() from None. You can see an example of this behaviour below, which you can compare to the same example in the first post:

>>> try:
...     raise Exception("one")
... except Exception as exc1:
...     try:
...         raise Exception("two")
...     except Exception as exc2:
...         raise Exception("three") from None
...
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
Exception: three

The mechanics of how this is implemented could be a little confusing because they were changed after the feature was first implemented. The original PEP 409 specified the default value of __cause__ to be Ellipsis, which was a pretty arbitrary choice of a singleton which wasn’t an exception, so it couldn’t be confused with a real cause; and wasn’t None, so later code could detect if it had been explicitly set to None via the raise Exception() from None idiom.

It was later decided that this was overloading the purpose of __cause__ in an inelegant fashion, however, so PEP 415 was implemented which made no change to the language features introduced by PEP 409, but changed the implementation. The rather hacky use of Ellipsis was removed and a new __suppress_context__ attribute was added. The semantics are that whenever __cause__ is set (typically with raise X from Y), __suppress_context__ is flipped to true. This applies when you set __cause__ to another exception, in which case the presumption is that it’s more useful to show than __context__ since it’s by explicit programmer choice; or using the raise X from None idiom, which is just the language syntax for setting __suppress_context__ without changing __cause__. Note that regardless of the value of __suppress_context__, the contents of the __context__ attribute are still available, and any code you write in your own exception handler is, of course, not obliged to respect __suppress_context__.
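
To make that concrete, here's a quick check of those attributes after a raise ... from None. The context is suppressed in the traceback, but still retrievable:

>>> try:
...     try:
...         raise ValueError("original")
...     except ValueError:
...         raise KeyError("replacement") from None
... except KeyError as exc:
...     print(exc.__suppress_context__, type(exc.__context__).__name__, exc.__cause__)
...
True ValueError None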

I must admit, I’m struggling to think of cases where the detail of that change would make a big difference to code you write. However, I’ve learned over the years that exception handling is one of those areas of the code you tend to test less thoroughly, and those areas are exactly where it’s helpful to have a knowledge of the details, since it’s that much more likely you’ll find bugs here by code inspection rather than testing.

Introspection Improvements

Since time immemorial functions and classes have had a __name__ attribute. Well, they now have a little baby sibling, the __qualname__ attribute (PEP 3155), which indicates the full “path” of definition of this object, including any containing namespaces. The string representation has also been updated to use this new, longer, specification. The semantics are mostly fairly self-explanatory, I think, so probably best illustrated with an example:

>>> class One:
...     class Two:
...         def method(self):
...             def inner():
...                 pass
...             return inner
...
>>> One.__name__, One.__qualname__
('One', 'One')
>>> One.Two.__name__, One.Two.__qualname__
('Two', 'One.Two')
>>> One.Two.method.__name__, One.Two.method.__qualname__
('method', 'One.Two.method')
>>> inner = One.Two().method()
>>> inner.__name__, inner.__qualname__
('inner', 'One.Two.method.<locals>.inner')
>>> str(inner)
'<function One.Two.method.<locals>.inner at 0x10467b170>'

Also, there’s a new inspect.signature() function for introspection of callables (PEP 362). This returns an inspect.Signature instance which references other classes such as inspect.Parameter, and allows the signature of callables to be easily introspected in code. Again, an example is probably most helpful here to give you just a flavour of what’s exposed:

>>> def myfunction(one: int, two: str = "hello", *args: str, keyword: int = None):
...     print(one, two, args, keyword)
...
>>> myfunction(123, "monty", "python", "circus")
123 monty ('python', 'circus') None
>>> inspect.signature(myfunction)
<Signature (one: int, two: str = 'hello', *args: str, keyword: int = None)>
>>> inspect.signature(myfunction).parameters["keyword"]
<Parameter "keyword: int = None">
>>> inspect.signature(myfunction).parameters["keyword"].annotation
<class 'int'>
>>> repr(inspect.signature(myfunction).parameters["keyword"].default)
'None'
>>> print("\n".join(": ".join((name, repr(param._kind)))
        for name, param in inspect.signature(myfunction).parameters.items()))
one: <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>
two: <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>
args: <_ParameterKind.VAR_POSITIONAL: 2>
keyword: <_ParameterKind.KEYWORD_ONLY: 3>

Finally, there’s also a new function inspect.getclosurevars() which reports the names bound in a particular function:

>>> import inspect
>>> xxx = 999
>>> def outer():
...     aaa = 100
...     def middle():
...         bbb = 200
...         def inner():
...             ccc = 300
...             return aaa + bbb + ccc + xxx
...         return inner
...     return middle()
...
>>> inspect.getclosurevars(outer())
ClosureVars(nonlocals={'aaa': 100, 'bbb': 200}, globals={'xxx': 999}, builtins={}, unbound=set())

In a similar vein there’s also inspect.getgeneratorlocals() which dumps the current internal state of a generator. This could be very useful for diagnosing bugs in the context of the caller, particularly if you don’t own the code implementing the generator and so can’t easily add logging statements or similar:

>>> def generator(maxvalue):
...     cumulative = 0
...     for i in range(maxvalue):
...         cumulative += i
...         yield cumulative
...
>>> instance = generator(10)
>>> next(instance)
0
>>> next(instance)
1
>>> next(instance)
3
>>> next(instance)
6
>>> inspect.getgeneratorlocals(instance)
{'maxvalue': 10, 'cumulative': 6, 'i': 3}

faulthandler Module

There’s a new module in Python 3.3 called faulthandler which is used to show a Python traceback on an event like a segmentation fault. This could be very useful when developing or using C extension modules, which often fail with a crash, making it very hard to tell where the problem actually occurred. Of course, you can fire up a debugger and figure out the line of code if it’s your module, but if it’s someone else’s at least this will help you figure out whether the error lies in your code or not.

You can enable this support at runtime with faulthandler.enable(), or you can pass -X faulthandler to the interpreter on the command-line, or set the PYTHONFAULTHANDLER environment variable. Note that this will install signal handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS, and SIGILL — if you’re using your own signal handlers for any of these, you’ll probably want to call faulthandler.enable() first and then make sure you chain into the earlier handler from your own.

Here’s an example of it working — for the avoidance of doubt, I triggered the handler here myself by manually sending SIGSEGV to the process:

>>> import faulthandler
>>> import time
>>> faulthandler.enable()
>>>
>>> def innerfunc():
...     time.sleep(300)
...
>>> def outerfunc():
...     innerfunc()
...
>>> outerfunc()
Fatal Python error: Segmentation fault

Current thread 0x000000011966bdc0 (most recent call first):
  File "<stdin>", line 2 in innerfunc
  File "<stdin>", line 2 in outerfunc
  File "<stdin>", line 1 in <module>
[1]    16338 segmentation fault  python3

Module Tracing Callbacks

There are a couple of modules which have added the ability to register callbacks for tracing purposes.

The gc module now provides an attribute callbacks, which is a list of functions that will be called before and after each garbage collection pass. Each one is passed two parameters: the first is either "start" or "stop" to indicate whether this is before or after the collection pass, and the second is a dict providing details of the results.

>>> import gc
>>> def func(*args):
...     print("GC" + repr(args))
...
>>> gc.callbacks.append(func)
>>> class MyClass:
...     def __init__(self, arg):
...         self.arg = arg
...     def __del__(self):
...         pass
...
>>> x = MyClass(None)
>>> y = MyClass(x)
>>> z = MyClass(y)
>>> x.arg = z
>>> del x, y, z
>>> gc.collect()
GC('start', {'generation': 2, 'collected': 0, 'uncollectable': 0})
GC('stop', {'generation': 2, 'collected': 6, 'uncollectable': 0})
6

The sqlite3.Connection class has a method set_trace_callback() which can be used to register a callback function which will be called for every SQL statement that’s run by the backend, and it’s passed the statement as a string. Note this doesn’t just include statements passed to the execute() method of a cursor, but may include statements that the Python module itself runs, e.g. for transaction management.
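
To give a flavour, here’s a minimal sketch of using it; the table and values are just illustrative:

import sqlite3

conn = sqlite3.connect(":memory:")
# Every statement the backend executes gets passed to the callback as a string.
conn.set_trace_callback(print)

conn.execute("CREATE TABLE words (word TEXT)")
with conn:
    # As well as the INSERT, transaction statements issued by the module
    # itself (e.g. BEGIN) may appear in the trace output too.
    conn.execute("INSERT INTO words VALUES (?)", ("albatross",))
conn.close()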

Unicode Changes

With apologies to those already familiar with Unicode, a brief history lesson: Unicode was originally conceived as a 16-bit character set, which was thought to be sufficient to encode all languages in active use around the world. In 1996, however, the Unicode 2.0 standard expanded this to add 16 additional 16-bit “planes” to the set, to include scope for all characters ever used by any culture in history, plus other assorted symbols. This made it effectively a 21-bit character set3. The initial 16-bit set became the Basic Multilingual Plane (BMP), and the next two planes the Supplementary Multilingual Plane and Supplementary Ideographic Plane respectively.

OK, Unicode history lesson over. So what’s this got to do with Python? To understand that we need a brief Python history lesson. Python originally used 16-bit values for Unicode characters (i.e. UCS-2 encoding), which meant that it only supported characters in the BMP. In Python 2.2 support for “wide” builds was added, so by adding a particular configure flag when compiling the interpreter, it could be built to use UCS-4 instead. This had the advantage of allowing the full range of all Unicode planes, but at the expense of using 4 bytes for every character. Since most distributions would use the wide build, because they had to assume full Unicode support was necessary, this meant in Python 2.x unicode objects consisting primarily of Latin-1 were four times larger than they needed to be.

This was the case until Python 3.3, where the implementation of PEP 393 means that the concepts of narrow and wide builds have been removed and everyone can now take advantage of the ability to access all Unicode characters. This is done by deciding whether to use 1-, 2- or 4-byte characters at runtime based on the highest ordinal codepoint used in the string. So, pure ASCII or Latin-1 strings use 1-byte characters, strings composed entirely from within the BMP use 2-byte characters and if any other planes are used then 4-byte characters are used.

In the example below you can see this illustrated.

>>> # Standard ASCII has 1 byte per character plus 49 bytes overhead.
>>> sys.getsizeof("x" * 99)
148
>>> # Each new ASCII character adds 1 byte.
>>> sys.getsizeof("x" * 99 + "x")
149
>>> # Adding one BMP character expands the size of every character to
>>> # 2 bytes, plus 74 bytes overhead.
>>> sys.getsizeof("x" * 99 + "\N{bullet}")
274
>>> sys.getsizeof("x" * 99 + "\N{bullet}" + "x")
276
>>> # Moving beyond BMP expands the size of every character to 4 bytes,
>>> # plus 76 bytes overhead.
>>> sys.getsizeof("x" * 99 + "\N{taxi}")
476
>>> sys.getsizeof("x" * 99 + "\N{taxi}" + "x")
480

This basically offers the best of both worlds on the Python side. As well as reducing memory usage, this should also improve cache efficiency by putting values closer together in memory. In case you’re wondering about the value of this, it’s important to remember that part of the Supplementary Multilingual Plane is a funny little block called “Emoticons”, and we all know you’re not a proper application without putting "\N{face screaming in fear}" in a few critical error logs here and there. Just be aware that you may be quadrupling the size of the string in memory by doing so.

On another Unicode related note, support for aliases has been added to the \N{...} escape sequences. Some of these are abbreviations, such as \N{SHY} for \N{SOFT HYPHEN}, and some of them are previously used incorrect names for backwards compatibility where corrections have been made to the standard. In addition these aliases are also supported in unicodedata.lookup(), and this additionally supports pre-defined sequences as well. An example of a sequence would be LATIN SMALL LETTER M WITH TILDE which is equivalent to "m\N{COMBINING TILDE}". Here are some more examples:

>>> import unicodedata
>>> "\N{NBSP}" == "\N{NO-BREAK SPACE}" == "\u00A0"
True
>>> "\N{LATIN SMALL LETTER GHA}" == "\N{LATIN SMALL LETTER OI}"
True
>>> (unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
... == "\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}")
True

Conclusions

That’s it for this post, but we’re not done with Python 3.3 yet! Check out the following article for my tour of the remaining changes in this release, as well as some thoughts on the entire release.


  1. As an unrelated aside, a few months ago (at time of writing!) Ian Bicking wrote a review of his main projects which makes for some interesting reading. 

  2. And for some people a production release tool as well, although personally I think a slightly cleaner wrapper like shrinkwrap makes for a more supportable option. 

  3.  

  4. The ones with names of the form __xxx__()

6 Mar 2021 at 11:11PM in Software
 |   | 
Photo by David Clode on Unsplash
 | 

February 2021

☑ Python 2to3: What’s New in 3.2

7 Feb 2021 at 1:08PM in Software
 |   | 

This is part 3 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.1.

Another installment in my look at all the new features added to Python in each 3.x release, this one covering 3.2. There’s a lot covered including the argparse module, support for futures, changes to the GIL implementation, SNI support in SSL/TLS, and much more besides. This is my longest article ever by far! If you’re puzzled why I’m looking at releases that are years old, check out the first post in the series.

green python two

In this post I’m going to continue my examination of every Python 3.x release to date with a look at Python 3.2. I seem to remember this as a pretty big one, so there’s some possibility that this article will rival the first one in this series for length. In fact, it got so long that I also implemented “Table of Contents” support in my articles! So, grab yourself a coffee and snacks and let’s jump right in and see what hidden gems await us.

Command-line Arguments

We kick off with one of my favourite Python modules, argparse, defined in PEP 389. This is the latest in a series of modules for parsing command-line arguments, which is a topic close to my heart as I’ve written a lot of command-line utilities over the years. I spent a number of those years getting increasingly frustrated with the amount of boilerplate I needed to add every time for things like validating arguments and presenting help strings.

Python’s first attempt at this was the getopt module, which was essentially just exposing the POSIX getopt() function in Python, even offering a version that’s compatible with the GNU version. This works, and it’s handy for C programmers familiar with the API, but it makes you do most of the work of validation and such. The next option was optparse, which did a lot more work for you and was very useful indeed.

Whilst optparse did a lot of work of parsing options for you (e.g. --verbose), it left any other arguments in the list for you to parse yourself. This was always slightly frustrating for me, because let’s say you expect the user to pass a list of integers, it seemed inconvenient to force them to use options for it just to take advantage of the parsing and validation the module offers. Also, more complex command-line applications like git often have subcommands which are tedious to validate by hand as well.

The argparse module is a replacement for optparse which aims to address these limitations, and I think by this point we’ve got to something pretty comprehensive. Its usage is fairly similar to optparse, but adds enough flexibility to parse all sorts of arguments. It also can validate the types of arguments, provide command-line help automatically and allow subcommands to be validated.

The variety of options this module provides are massive, so there’s no way I’m going to attempt an exhaustive examination here. By way of illustration, I’ve implemented a very tiny subset of the git command-line as a demonstration of how subcommands work:

import argparse
import os

# These functions would normally carry out the subcommands.
def do_status(args):
    print("Normally I'd run the status command here")

def do_log(args):
    print("Normally I'd run the log command here")

# We construct the base parser here and add global options.
parser = argparse.ArgumentParser()
parser.add_argument("--version", action="version", version="%(prog)s 2.24")
parser.add_argument("-C", action="store", dest="working_dir", metavar="<path>",
                    help="Run as if was started in PATH")
parser.add_argument("-p", "--paginate", action="store_true", dest="paginate",
                    help="Enable pagination of output")
parser.add_argument("-P", "--no-pager", action="store_false", dest="paginate",
                    help="Disable pagingation of output")
parser.set_defaults(subcommand=None, paginate=True, working_dir=os.getcwd())

# We add a "status" subcommand with its own parser.
subparsers = parser.add_subparsers(title="Subcommands", description="Valid subcommands",
                                   help="additional help")
parser_status = subparsers.add_parser("status", help="Show working tree status")
parser_status.add_argument("-s", "--short", action="store_const", const="short",
                           dest="format", help="Use short format")
parser_status.add_argument("-z", action="store_const", const="\x00", dest="lineend",
                           help="Terminate output lines with NUL instead of LF")
parser_status.add_argument("pathspecs", metavar="<pathspec>", nargs="*",
                           help="One or more pathspecs to show")
parser_status.set_defaults(subcommand=do_status, format="long", lineend="\n")

# We add a "log" subcommand as well.
parser_log = subparsers.add_parser("log", help="Show commit logs")
parser_log.add_argument("-p", "--patch", action="store_true", dest="patch",
                        help="Generate patch")
parser_log.set_defaults(subcommand=do_log, patch=False)

# Shows how this parser could be used.
args = parser.parse_args()
if args.subcommand is None:
    print("No subcommand chosen")
    parser.print_help()
else:
    args.subcommand(args)

You can see the command-line help generated by the class below. First up, the output of running fakegit.py --help:

usage: fakegit.py [-h] [--version] [-C <path>] [-p] [-P] {status,log} ...

optional arguments:
  -h, --help      show this help message and exit
  --version       show program's version number and exit
  -C <path>       Run as if started in <path>
  -p, --paginate  Enable pagination of output
  -P, --no-pager  Disable pagination of output

Subcommands:
  Valid subcommands

  {status,log}    additional help
    status        Show working tree status
    log           Show commit logs

The subcommands also support their own command-line help, such as fakegit.py status --help:

usage: fakegit.py status [-h] [-s] [-z] [<pathspec> [<pathspec> ...]]

positional arguments:
  <pathspec>   One or more pathspecs to show

optional arguments:
  -h, --help   show this help message and exit
  -s, --short  Use short format
  -z           Terminate output lines with NUL instead of LF

Logging

The logging module has acquired the ability to be configured by passing a dict, as per PEP 391. Previously it could accept a config file in .ini format as parsed by the configparser module, but formats such as JSON and YAML are becoming more popular these days. To allow these to be used, logging now accepts a dict specifying the configuration, given that most of these formats can be trivially converted into that form, as illustrated for JSON:

import json
import logging.config
with open("logging-config.json") as conf_fd:
    config = json.load(conf_fd)
logging.config.dictConfig(config)

When you’re packaging a decent-sized application, storing logging configuration in a file makes it easier to maintain than hard-coding it in executable code. For example, it becomes easier to swap in a different logging configuration in different environments (e.g. pre-production and production). The fact that more popular formats can now be supported will open this flexibility to more developers.
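
For reference, here’s a minimal sketch of the sort of dict that dictConfig() accepts, where the handler and formatter names are just illustrative:

import logging
import logging.config

config = {
    "version": 1,
    "formatters": {
        "simple": {"format": "%(name)s -> %(levelname)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "level": "DEBUG",
        },
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

logging.config.dictConfig(config)
logging.getLogger("demo").info("configured from a dict")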

In addition to this, the logging.basicConfig() function now has a style parameter where you can select which type of string formatting token to use for the format string itself. All of the following are equivalent:

>>> import logging
>>> logging.basicConfig(style='%', format="%(name)s -> %(levelname)s: %(message)s")
>>> logging.basicConfig(style='{', format="{name} -> {levelname}: {message}")
>>> logging.basicConfig(style='$', format="$name -> $levelname: $message")

Also, if a log event occurs prior to configuring logging, there is a default setup of a StreamHandler connected to sys.stderr, which displays any message of WARNING level or higher. If you need to fiddle with this handler for any reason, it’s available as logging.lastResort.

Some other smaller changes:

  • Levels can now be supplied to setLevel() as strings such as INFO instead of integers like logging.INFO.
  • A getChild() method on Logger instances now returns a logger with a suffix appended to the name. For example, logging.getLogger("foo").getChild("bar.baz") will return the same logger as logging.getLogger("foo.bar.baz"). This is convenient when the first level of the name is __name__, as it often is by convention, or in cases where a parent logger is passed to some code which wants to create its own child logger from it.
  • The hasHandlers() method has also been added to Logger which returns True iff this logger, or a parent to which events are propagated, has at least one configured handler.
  • A new logging.setLogRecordFactory() and a corresponding getLogRecordFactory() have been added to allow programmers to override the log record creation process.

Concurrency

There are a number of changes in concurrency this release.

Futures

The largest change is a new concurrent.futures module in the library, specified by PEP 3148, and it’s a pretty useful one. The intention with the new concurrent namespace is to collect together high-level code for managing concurrency, but so far it’s only acquired the one futures module.

The intention here is to provide what has become a standard abstraction: a future, which represents the eventual result of a concurrent operation. In the Python module, the API style is deliberately decoupled from the implementation detail of what form of concurrency is used, whether it’s a thread, another process or some RPC to another host. This is useful as it allows the mechanism to be changed later if necessary without invalidating the business logic around it.

The style is to construct an executor which is where the flavour of concurrency is selected. Currently the module supports two options, ThreadPoolExecutor and ProcessPoolExecutor. The code can then schedule jobs to the executor, which returns a Future instance which can be used to obtain the results of the operation once it’s complete.

To exercise these in a simple example I wrote a basic password cracker, something that should benefit from parallelisation. I used PBKDF2 with SHA-256 for hashing the passwords, although only with 1000 iterations1 to keep running times reasonable on my laptop. Also, to keep things simple we assume that the password is a single dictionary word with no variations in case.

For comparison I first wrote a simple implementation which checks every word in /usr/share/dict/words with no parallelism:

import concurrent.futures
import hashlib
import sys

# The salt is normally stored alongside the password hash.
SALT = b"\xe2\x13*\xbb\x1a\xaar\t"
# This is the hash of a dictionary word.
TARGET_HASH = b"\xba<\xdfU\xc3\xdanx\x1b\x1c\xb0js\xf1\x19\xa9\xc5\xb9"\
              b"d!l\xa2\x14\x11K\x86\xac#\xc8\xc7\x8a\x91"
ITERATIONS = 1000

def calc_checksum(line):
    word = line.strip()
    return (word, hashlib.pbkdf2_hmac("sha256", word, SALT, ITERATIONS))

def main():
    with open("/usr/share/dict/words", "rb") as fd:
        for line in fd:
            check = calc_checksum(line)
            if check[1] == TARGET_HASH:
                print(line.strip())
    return 0

if __name__ == "__main__":
    sys.exit(main())

Here’s the output of time running it:

python3 crack.py  257.08s user 0.25s system 99% cpu 4:17.72 total

On my modest 2016 MacBook Pro, this took 4m 17s in total, and the CPU usage figures indicated that one core was basically maxed out, as you’d expect. Then I swapped out main() for a version that used ThreadPoolExecutor from concurrent.futures:

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        futures = set()
        with open("/usr/share/dict/words", "rb") as dict_fd:
            for line in dict_fd:
                futures.add(executor.submit(calc_checksum, line))
        for future in concurrent.futures.as_completed(futures):
            word, check = future.result()
            if check == TARGET_HASH:
                print(word)
    return 0

After creating a ThreadPoolExecutor which can use a maximum of 8 worker threads at any time, we then need to submit jobs to the executor. We do this in a loop around reading /usr/share/dict/words, submitting each word as a job to the executor to distribute among its workers. Once all the jobs are submitted, we then wait for them to complete and harvest the results.

Again, here’s the time output:

python3 crack.py  506.42s user 2.50s system 680% cpu 1:14.83 total

With my laptop’s four cores, I’d expect this would run around four times as fast2 and it more or less did, allowing for some overhead scheduling the work to the threads. The total run time was 1m 14s so a little less than the expected four times faster, but not a lot. The CPU usage was around 85% of the total of all four cores, which is again roughly what I’d expect. Running in a quarter of the time seems like a pretty good deal for only four lines of additional code!

Finally, just for fun I then swapped out ThreadPoolExecutor for ProcessPoolExecutor, which is the same but using child processes instead of threads:

    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        

And the time output with processes:

python3 crack.py  575.08s user 15.50s system 669% cpu 1:28.15 total

I didn’t expect this to make much difference to a CPU-bound task like this, provided that the hashing routines release the GIL as they’re supposed to. Indeed, it was actually somewhat slower than the threaded case, taking 1m 28s to execute in total. The total user time was higher for the same amount of work, so this definitely points to some decreased efficiency rather than just differences in background load or similar. I’m assuming that the overhead of the additional IPC and associated memory copying accounts for the increased time, but this sort of thing may well be platform-dependent.

As one final flourish, I tried to reduce the inefficiencies of the multiprocess case by batching the work into larger chunks using a recipe from the itertools documentation:

import concurrent.futures
import hashlib
import itertools
import sys

# The salt is normally stored alongside the password hash.
SALT = b'\xe2\x13*\xbb\x1a\xaar\t'
# This is the hash of a dictionary word.
TARGET_HASH = b"\xba<\xdfU\xc3\xdanx\x1b\x1c\xb0js\xf1\x19\xa9\xc5\xb9"\
              b"d!l\xa2\x14\x11K\x86\xac#\xc8\xc7\x8a\x91"
ITERATIONS = 1000

def calc_checksums(lines):
    return {
        word: hashlib.pbkdf2_hmac('sha256', word, SALT, ITERATIONS)
        for word in (line.strip() for line in lines if line is not None)
    }

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        futures = set()
        with open("/usr/share/dict/words", "rb") as dict_fd:
            for lines in grouper(dict_fd, 1000):
                futures.add(executor.submit(calc_checksums, lines))
        for future in concurrent.futures.as_completed(futures):
            results = future.result()
            for word, check in results.items():
                if check == TARGET_HASH:
                    print(word)
    return 0

if __name__ == "__main__":
    sys.exit(main())

This definitely made some difference, bringing the time down from 1m 28s to 1m 6s. The CPU usage also indicates more of the CPU time is being spent in user space, presumably due to less IPC.

python3 crack.py  509.95s user 1.20s system 764% cpu 1:06.83 total

I suspect that the multithreaded case would also benefit from some batching, but at this point I thought I’d better draw a line under it or I’d never finish this article.

Overall, I really like the concurrent.futures module, as it takes so much hassle out of processing things in parallel. There are still cases where the threading module is going to be more appropriate, such as some background thread which performs periodic actions asynchronously. But for cases where you have a specific task that you want to tackle synchronously but in parallel, this module wraps up a lot of the annoying details.

I’m excited to see what else might be added to concurrent in the future3!

Threading

Despite all the attention on concurrent.futures this release, the threading module has also had some attention with the addition of a new Barrier class. This is initialised with a number of threads to wait for. As individual threads call wait() on the barrier they are held up until the required number of threads are waiting, at which point all are allowed to proceed simultaneously. This is a little like the join() method, except the threads can continue to execute after the barrier.

import threading
import time

def wait_thread(name, barrier, delay):
    for i in range(3):
        print("{} starting {}s delay".format(name, delay))
        time.sleep(delay)
        print("{} finishing delay".format(name))
        barrier.wait()

num_threads = 5
barrier = threading.Barrier(num_threads)
threads = [
    threading.Thread(target=wait_thread, args=(str(i), barrier, (i+1) * 2))
    for i in range(num_threads)
]
print("Starting threads...")
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print("All threads finished.")

The Barrier can also be initialised with a timeout argument. If the timeout expires before the required number of threads have called wait() then all currently waiting threads are released and a BrokenBarrierError exception is raised from all the wait() methods.

I can think of a few use-cases where this synchronisation primitive might come in handy, such as multiple threads all producing streams of output which need to be synchronised with each other so one of them doesn’t get too far ahead of the other. For example, perhaps one thread is producing chunks of audio data and another chunks of video, you could use a barrier to ensure that neither of them gets ahead of the other.

Another small but useful change in threading is that the Lock.acquire(), RLock.acquire() and Semaphore.acquire() methods can now accept a timeout, instead of only allowing a simple choice between blocking and non-blocking as before. Also there’s been a fix to allow lock acquisitions to be interrupted by signals on pthreads platforms, which means that programs that deadlock on locks can be killed by repeated SIGINT (as opposed to requiring SIGKILL as they used to sometimes).
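
Here’s a minimal sketch of the timeout behaviour, with an arbitrary two second timeout:

import threading

lock = threading.Lock()
lock.acquire()  # The main thread holds the lock.

def worker():
    # acquire() now returns False if the timeout expires before the lock is free.
    if lock.acquire(timeout=2):
        try:
            print("Got the lock")
        finally:
            lock.release()
    else:
        print("Timed out waiting for the lock")

thread = threading.Thread(target=worker)
thread.start()
thread.join()
lock.release()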

Finally, threading.RLock has been moved from pure Python to a C implementation, which results in a 10-15x speedup using them.

GIL Overhaul

In another change that will impact all forms of threading in CPython, the code behind the GIL has been rewritten. The new implementation aims to offer more predictable switching intervals and reduced overhead due to lock contention.

Prior to this change, the GIL was released after a fixed number of bytecode instructions had been executed. However, this is a very crude way to measure a timeslice since the time taken to execute an instruction can vary from a few nanoseconds to much longer, since not all the expensive C functions in the library release the GIL while they operate. This can mean that scheduling between threads can be very unbalanced depending on their workload.

To replace this, the new approach releases the GIL at a fixed time interval, although the GIL is still only released at an instruction boundary. The specific interval is tunable through sys.setswitchinterval(), with the current default being 5 milliseconds. As well as being a more balanced way to share processor time among threads, this can also reduce the overhead of locks in heavily contended situations — this is because waiting for a lock which is already held by another thread can add significant overhead on some platforms (apparently OS X is particularly impacted by this).
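
If you want to check or tune the interval, it’s just a pair of functions in sys (the 0.005 shown is the documented default):

>>> import sys
>>> sys.getswitchinterval()
0.005
>>> sys.setswitchinterval(0.01)
>>> sys.getswitchinterval()
0.01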

If you want to get technical4, threads wishing to take the GIL first wait on a condition variable for it to be released, with a timeout equal to the switch interval. Hence, it’ll wake up either after this interval, or if the GIL is released by the holding thread if that’s earlier. At this point the requesting thread checks whether any context switches have already occurred, and if not it sets the volatile flag gil_drop_request, shared among all threads, to indicate that it’s requesting the release of the GIL. It then continues around this loop until it gets the lock, re-requesting GIL drop after a delay every time a new thread acquires it.

The holding thread, meanwhile, attempts to release the GIL when it performs blocking operations, or otherwise every time around the eval loop it checks if gil_drop_request is set and releases the GIL if so. In so doing, it wakes up any threads which are waiting on the GIL and relies on the OS to ensure fair scheduling among threads.

The advantage of this approach is that it provides an advisory cap on the amount of time a thread may hold the GIL, by delaying setting the gil_drop_request flag, but also allows the eval loop as long as it needs to finish processing its current bytecode instruction. It also minimises overhead in the simple case when no other thread has requested the GIL.

The final change is around thread switching. Prior to Python 3.2, the GIL was released for a handful of CPU cycles to allow the OS to schedule another thread, and then it was immediately reacquired. This was efficient if the common case is that no other threads are ready to run, and meant that threads running lots of very short opcodes weren’t unduly penalised, but in some cases this delay wasn’t sufficient to trigger the OS to context switch to a different thread. This can cause particular problems when you have an I/O-bound thread competing with a CPU-intensive one — the OS will attempt to schedule the I/O-bound thread, but it will immediately attempt to acquire the GIL and be suspended again. Meanwhile, the CPU-bound thread will tend to cling to the GIL for longer than it should, leading to higher I/O latency.

To combat this, the new system forces a thread switch at the end of the fixed interval if any other threads are waiting on the GIL. The OS is still responsible for scheduling which thread, this change just ensures that it’s not the previously running thread. It does this using a last_holder shared variable which points to the last holder of the GIL. When a thread releases the GIL, it additionally checks if last_holder is its own ID and if so, it waits on a condition variable for the value to change to another thread. This can’t cause a deadlock if no other threads are waiting, because in that case gil_drop_request isn’t set and this whole operation is skipped.

Overall I’m hopeful that these changes should make a positive impact to fair scheduling in multithreaded Python applications. As much as I’m sure everyone would love to find a way to remove the GIL entirely, it doesn’t seem like that’s likely for some time to come.

Date and Time

There are a host of small improvements to the datetime module to blast through.

First and foremost is that there’s now a timezone type which implements the tzinfo interface and can be used in simple cases of fixed offsets from UTC (i.e. no DST adjustments or the like). This means that creating a timezone-aware datetime at a known offset from UTC is now straightforward:

>>> from datetime import datetime, timedelta, timezone
>>> # Naive datetime (no timezone attached)
>>> datetime.now()
datetime.datetime(2021, 2, 6, 15, 26, 37, 818998)
>>> # Time in UTC (happens to be my timezone also!)
>>> datetime.now(timezone.utc)
datetime.datetime(2021, 2, 6, 15, 26, 46, 488588, tzinfo=datetime.timezone.utc)
>>> # Current time in New York (UTC-5) ignoring DST
>>> datetime.now(timezone(timedelta(0, -5*3600)))
datetime.datetime(2021, 2, 6, 10, 27, 41, 764597, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400)))

Also, timedelta objects can now be multiplied and divided by integers or floats, as well as divided by each other to determine how many of one interval fit into the other interval. This is all fairly straightforward by converting the values to a total number of seconds to perform the operations, but it’s convenient not to have to.

>>> timedelta(1, 20*60*60) * 1.5
datetime.timedelta(days=2, seconds=64800)
>>> timedelta(8, 3600) / 4
datetime.timedelta(days=2, seconds=900)
>>> timedelta(8, 3600) / timedelta(2, 900)
4.0

If you’re using Python to store information about the Late Medieval Period then you’re in luck, as datetime.date.strftime() can now cope with dates prior to 1900. If you want to expand your research to the Dark Ages, however, you’re out of luck since it still only handles dates from 1000 onwards.
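
For example, though note the exact output of %B depends on your locale:

>>> from datetime import date
>>> date(1485, 8, 22).strftime("%d %B %Y")
'22 August 1485'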

Also, use of two-digit years is being discouraged. Until now setting time.accept2dyear to True would allow you to use a 2-digit year in a time tuple and its century would be guessed. However, as of Python 3.2 using this logic will get you a DeprecationWarning. Quite right too, 2-digit years are quite an anachronism these days.

String Formatting

The str.format() method for string formatting is now joined by str.format_map() which, as the name implies, takes a mapping type to supply arguments by name.

>>> "You must cut down the mightiest {plant} in the forest with... a {fish}!"
    .format_map({"fish": "herring", "plant": "tree"})
'You must cut down the mightiest tree in the forest with... a herring!'

As well as a standard dict instance, you can pass any dict-like object and Python has plenty of these, such as ConfigParser and the objects created by the dbm modules.
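
One reason format_map() is more than just a shorthand for format(**mapping) is that the mapping is used directly, so a dict subclass with a __missing__() hook can supply fallbacks for absent keys. A minimal sketch:

>>> class Defaults(dict):
...     def __missing__(self, key):
...         return "<unknown {}>".format(key)
...
>>> "We are the knights who say {word}, {name}!".format_map(Defaults(word="Ni"))
'We are the knights who say Ni, <unknown name>!'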

There have also been some minor changes to formatting of numeric values as strings. Prior to this release converting a float or complex to string form with str() would show fewer decimal places than repr(). This was because the repr() level of precision would occasionally show surprising results, and the pragmatic way to avoid this being more of an issue was to make str() round to a lower precision.

However, as discussed in the previous article, repr() was changed to always select the shortest equivalent representation for these types in Python 3.1. Hence, in Python 3.2 the str() and repr() forms of these types have been unified to the same precision.

Function Enhancements

There are a series of enhancements to decorators provided by the functools module, plus a change to contextlib.

Firstly, just to make the example from the previous article more pointless, there is now a functools.lru_cache() decorator which can cache the results of a function based on its parameters. If the function is called with the same parameters, a cached result will be used if present.

This is really handy to drop in to commonly-used but slow functions for a very low effort speed boost. What’s even more useful is that you can call a cache_info() method of the decorated function to get statistics about the cache. There’s also a cache_clear() method if you need to invalidate the cache, although there’s unfortunately no option to clear only selected parameters.

>>> @functools.lru_cache(maxsize=10)
... def slow_func(arg):
...   return arg + 1
...
>>> slow_func(100)
101
>>> slow_func(200)
201
>>> slow_func(100)
101
>>> slow_func.cache_info()
CacheInfo(hits=1, misses=2, maxsize=10, currsize=2)

Secondly, there have been some improvements to functools.wraps() to improve introspection, such as a __wrapped__ attribute pointing back to the original callable and copying __annotations__ across to the wrapped version, if defined.
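
Here’s a small sketch showing both of those; the logged() decorator is just an invented example:

>>> import functools
>>> def logged(func):
...     @functools.wraps(func)
...     def wrapper(*args, **kwargs):
...         print("Calling", func.__name__)
...         return func(*args, **kwargs)
...     return wrapper
...
>>> @logged
... def add(x: int, y: int) -> int:
...     return x + y
...
>>> add.__annotations__
{'x': <class 'int'>, 'y': <class 'int'>, 'return': <class 'int'>}
>>> add.__wrapped__(1, 2)
3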

Thirdly, a new functools.total_ordering() class decorator has been provided. This is very useful for producing classes which support all the rich comparison operators with minimal effort. If you define a class with __eq__ and __lt__ and apply the @functools.total_ordering decorator to it, all the other rich comparison operators will be synthesized.

>>> import functools
>>> @functools.total_ordering
... class MyClass:
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other.value
...     def __eq__(self, other):
...         return self.value == other.value
...
>>> one = MyClass(100)
>>> two = MyClass(200)
>>> one < two
True
>>> one > two
False
>>> one == two
False
>>> one != two
True

Finally, there have been some changes which mean that the contextlib.contextmanager() decorator now results in a function which can be used both as a context manager (as previously) but now also as a function decorator. This could be pretty handy, although bear in mind if you yield a value which is normally bound in a with statement, there’s no equivalent approach for function decorators.

import contextlib

@contextlib.contextmanager
def log_entry_exit(ident):
    print("Entering {}".format(ident))
    yield
    print("Leaving {}".format(ident))

with log_entry_exit("foo"):
    print("In context")

@log_entry_exit("my_func")
def my_func(value):
    print("Value is {}".format(value))

my_func(123)

Itertools

Only one improvement to itertools, which is the addition of the accumulate() function. However, this has the potential to be pretty handy so I’ve given it its own section.

Passed an iterable, itertools.accumulate() will return the cumulative sum of all elements so far. This works with any type that supports the + operator:

>>> import itertools
>>> list(itertools.accumulate([1,2,3,4,5]))
[1, 3, 6, 10, 15]
>>> list(itertools.accumulate([[1,2],[3],[4,5,6]]))
[[1, 2], [1, 2, 3], [1, 2, 3, 4, 5, 6]]

For other types, you can define any binary function to combine them:

>>> import operator
>>> list(itertools.accumulate((set((1,2,3)), set((3,4,5))),
         func=operator.or_))
[{1, 2, 3}, {1, 2, 3, 4, 5}]

And it’s also possible to start with an initial value before anything’s added by providing the initial argument.

Collections

The collections module has had a few improvements.

Counter

The collections.Counter class added in the previous release has now been extended with a subtract() method which supports negative values. The existing semantics of -= as applied to a Counter never reduce a value below zero — it is simply removed from the set. This is consistent with how you’d expect a counter to work:

>>> x = Counter(a=10, b=20)
>>> x -= Counter(a=5, b=30)
>>> x
Counter({'a': 5})

However, in its interpretation as a multiset, you might actually want values to go negative. If so, you can use the new subtract() method:

>>> x = Counter(a=10, b=20)
>>> x.subtract(Counter(a=5, b=30))
>>> x
Counter({'a': 5, 'b': -10})

OrderedDict

As demonstrated in the previous article, it’s a little inconvenient to move something to the end of the insertion order. That’s been addressed in this release with the OrderedDict.move_to_end() method. By default this moves the item to the last position in the ordered sequence in the same way as x[key] = x.pop(key) would but is significantly more efficient. Alternatively you can call move_to_end(key, last=False) to move it to the first position in the sequence.
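
For example:

>>> from collections import OrderedDict
>>> d = OrderedDict([("one", 1), ("two", 2), ("three", 3)])
>>> d.move_to_end("one")
>>> list(d)
['two', 'three', 'one']
>>> d.move_to_end("three", last=False)
>>> list(d)
['three', 'two', 'one']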

Deque

Finally, collections.deque has two new methods, count() and reverse() which allow them to be used in more situations where code was designed to take a list.

>>> import collections
>>> x = collections.deque('antidisestablishmentarianism')
>>> x.count('i')
5
>>> x.reverse()
>>> x
deque(['m', 's', 'i', 'n', 'a', 'i', 'r', 'a', 't', 'n', 'e', 'm', 'h', 's',
'i', 'l', 'b', 'a', 't', 's', 'e', 's', 'i', 'd', 'i', 't', 'n', 'a'])

Internet Modules

The three modules email, mailbox and nntplib now correctly support the str and bytes types that Python 3 introduced. In particular, this means that messages in mixed encodings now work correctly. These changes also necessitated a number of fixes in the mailbox module, which should now behave correctly.

The email module has new functions message_from_bytes() and message_from_binary_file(), and classes BytesFeedParser and BytesParser, to allow messages read or stored in the form of bytes to be parsed into model objects. Also, the get_payload() method and Generator class have been updated to properly support the Content-Transfer-Encoding header, encoding or decoding as appropriate.
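
As a tiny illustration of parsing from bytes (the message itself is made up):

>>> from email import message_from_bytes
>>> msg = message_from_bytes(b"Subject: Spanish Inquisition\n\nNobody expects it.\n")
>>> msg["Subject"]
'Spanish Inquisition'
>>> msg.get_payload()
'Nobody expects it.\n'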

Sticking with the theme of email, imaplib now supports upgrade of an existing connection to TLS using the new imaplib.IMAP4.starttls() method.

The ftplib.FTP class now supports the context manager protocol to consume socket.error exceptions which are thrown and close the connection when done. This makes it pretty handy, but due to the way that FTP opens additional sockets, you need to be careful to close all these before the context manager exits or your application will hang. Consider the following example:

from ftplib import FTP

with FTP("ftp1.at.proftpd.org") as ftp:
    ftp.login()
    print(ftp.dir())
    sock = ftp.transfercmd("RETR README.MIRRORS")
    while True:
        data = sock.recv(8192)
        if not data:
            break
        print(data)
    sock.close()

Assuming that FTP site is still up, and README.MIRRORS is still available, that should execute fine. However, if you remove that sock.close() line then you should find it just hangs up and never terminates (perhaps until the TCP connection gets terminated due to being idle).

The socket.create_connection() function can also be used as a context manager, and swallows errors and closes the connection in the same way as the FTP class above.

The ssl module has seen some love with a host of small improvements. There’s a new SSLContext class to hold persistent connection data such as settings, certificates and private keys. This allows the settings to be reused for multiple connections, and provides a wrap_socket() method for creating a socket using the stored details.

There’s a new ssl.match_hostname() which applies RFC-specified rules for confirming that a specified certificate matches the specified hostname. The certificate specification it expects is as returned by SSLSocket.getpeercert(), but it’s not particularly hard to fake as shown in the session below.

>>> import ssl
>>> cert = {'subject': ((('commonName', '*.andy-pearce.com'),),)}
>>> ssl.match_hostname(cert, "www.andy-pearce.com")
>>> ssl.match_hostname(cert, "ftp.andy-pearce.com")
>>> ssl.match_hostname(cert, "www.andy-pearce.org")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andy/.pyenv/versions/3.2.6/lib/python3.2/ssl.py", line 162, in match_hostname
    % (hostname, dnsnames[0]))
ssl.CertificateError: hostname 'www.andy-pearce.org' doesn't match '*.andy-pearce.com'

This release also adds support for SNI (Server Name Indication), which is like virtual hosting but for SSL connections. This removes the longstanding issue whereby you can host as many domains on a single IP address for standard HTTP, but for SSL you needed a unique IP address for each domain. This is essentially because the virtual hosting of websites is implemented by passing the HTTP Host header, but since the SSL connection is set up prior to sending the HTTP request (by definition!) then the only thing you have to connect to is an IP address. The remote end needs to decide what certificate to send you, and since all it has to decide that is the IP address then you can’t have different certificates for different domains on the same IP. This is problematic because the certificate needs to match the domain or the browser will reject it.

SNI handles this by extending the SSL ClientHello message to include the domain. To implement this with the ssl module in Python, you need to specify the server_hostname parameter to SSLContext.wrap_socket().
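
Here’s a minimal sketch of how that might look, with an example hostname and no certificate verification for brevity:

import socket
import ssl

hostname = "www.example.com"   # Just an example host.

context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)

with socket.create_connection((hostname, 443)) as sock:
    # server_hostname is sent in the ClientHello so the server can pick
    # the right certificate for this domain.
    tls_sock = context.wrap_socket(sock, server_hostname=hostname)
    print(tls_sock.cipher())
    tls_sock.close()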

The http.client module has been updated to use the new certificate verification processes when using an HTTPSConnection. The request() method is now more flexible on sending request bodies — previously it required a file object, but now it will also accept an iterable providing that an explicit Content-Length header is sent. According to HTTP/1.1 this header shouldn’t be required, since requests can be sent using chunked encoding, which doesn’t require the length of the request body to be known up front. In practice, however, it’s common for servers not to bother supporting chunked requests, despite being mandated by the HTTP/1.1 standard. As a result, it’s sensible to regard Content-Length as mandatory for requests with a body. HTTP/2 has its own methods of streaming data, so once that gains wide acceptance chunked encoding won’t be needed anyway — but given the rate of adoption so far, I wouldn’t hold your breath.

The urllib.parse module has some changes as well, with urlparse() now supporting IPv6 and urldefrag() returning a collections.namedtuple for convenience. The urlencode() function can also now accept both str and bytes for the query parameter.

Markup Languages

There have been some significant updates to the xml.etree.ElementTree package, including the addition of the following top-level functions:

fromstringlist()
A handy method which builds an XML document from a series of fragment strings. In particular this means you can open a filehandle in text mode and have it parsed one line at a time, since iterating the filehandle yields individual lines.
tostringlist()
The opposite of fromstringlist(), generates the XML output in chunks. It doesn’t make any guarantees except that joining them all together will yield the same as generating the output as a single string, but in my experience each chunk is around 8192 bytes plus whatever takes it up to the next tag boundary.
register_namespace()
Allows you to register a namespace prefix globally, which can be useful for parsing lots of XML documents which make heavy use of namespaces.

The Element class also has a few extra methods:

Element.extend()
Appends children to the current element from a sequence, which must itself contain Element instances.
Element.iterfind()
As Element.findall() but yields elements instead of returning a list.
Element.itertext()
As Element.findtext() but iterates over all the current element and all child elements as opposed to just returning the first match.

The TreeBuilder class also has acquired the end() method to end the current element and doctype() to handle a doctype declaration.

Finally, a couple of unnecessary methods have been deprecated. Instead of getchildren() you can just use list(elem), and instead of getiterator() just use Element.iter().

Also in 3.2 there’s a new html module, but it only contains one function escape() so far which will do the obvious HTML-escaping.

>>> import html
>>> html.escape("<blink> & <marquee> tags are both deprecated")
'&lt;blink&gt; &amp; &lt;marquee&gt; tags are both deprecated'

Compression and Archiving

The gzip.GzipFile class now provides a peek() method which can read a number of bytes from the archive without advancing the read pointer. This can be very useful when implementing parsers which need to decide which function to branch into based on what’s next in the file, but where you also want those functions to read from the file themselves as a simpler interface.

The gzip module has also added the compress() and decompress() methods which simply perform in-memory compression/decompression without the need to construct a GzipFile instance. This has been a source of irritation for me in the past, so it’s great to see it finally addressed.
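
Here’s a quick sketch of both of those; the payload is obviously just an example:

import gzip
import io

data = gzip.compress(b"A Moose once bit my sister")   # One-shot, in-memory.
print(gzip.decompress(data))                          # b'A Moose once bit my sister'

with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
    # peek() returns upcoming bytes (possibly more or fewer than asked for)
    # without moving the read position...
    print(gz.peek(7))
    # ...so a subsequent read() still returns everything.
    print(gz.read())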

The zipfile module also had some improvements, with the ZipFile class now supporting use as a context manager. Also, the ZipExtFile object has had some performance improvements. This is the file-like object returned when you open a file within a ZIP archive using the ZipFile.open() method. You can also wrap it in io.BufferedReader for even better performance if you’re doing multiple smaller reads.

The tarfile module has changes, with tarfile.TarFile also supporting use as a context manager. Also, the add() method for adding files to the archive now supports a filter parameter which can modify attributes of the files as they’re added, or exclude them altogether. You pass a callable using this parameter, which is called on each file as it’s added. It’s passed a TarInfo structure which has the metainformation about the file, such as the permissions and owner. It can return a modified version of the structure (e.g. to squash all files to being owned by a specific user), or it can return None to block the file from being added.
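
As a sketch of how such a filter might look, with hypothetical paths:

import tarfile

def sanitise(member):
    # Exclude compiled bytecode entirely...
    if member.name.endswith(".pyc"):
        return None
    # ...and squash ownership on everything else.
    member.uid = member.gid = 0
    member.uname = member.gname = "root"
    return member

with tarfile.open("project.tar.gz", "w:gz") as tar:
    tar.add("project", filter=sanitise)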

Finally, the shutil module has also grown a couple of archive-related functions, make_archive() and unpack_archive(). These provide a convenient high-level interface to zipping up multiple files into an archive without having to mess around with the details of the individual compression modules. It also means that the format of your archives can be altered with minimal impact on your code by changing a parameter.

It supports the common archiving formats out of the box, but there’s also a register_archive_format() hook should you wish to add code to handle additional formats.
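
A brief sketch, again with made-up paths:

import shutil

# Pack everything under "project" into project-backup.tar.gz and return its path.
archive = shutil.make_archive("project-backup", "gztar", root_dir="project")

# Unpack it elsewhere; the format is inferred from the file extension.
shutil.unpack_archive(archive, extract_dir="restored")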

Math

There are some new functions in the math library, some of which look pretty handy.

isfinite()
Returns True iff the float argument is not a special value (e.g. NaN or infinity)
expm1()
Calculates exp(x) - 1 for small x in a way which doesn’t result in the loss of precision that can occur when subtracting nearly equal values.
erf() and erfc()
erf() is the Gaussian Error Function, which is useful for assessing how much of an outlier a data point is against a normal distribution. The erfc() function is simply the complement, where erfc(x) == 1 - erf(x).
gamma() and lgamma()
Implements the Gamma Function, which is an extension of factorial to cover continuous and complex numbers. I suspect for almost everyone math.factorial() will be what you’re looking for. Since the value grows so quickly, larger values will yield an OverflowError. To deal with this, the lgamma() function returns the natural logarithm of the value.

Compiled Code

There have been a couple of changes to the way that both compiled bytecode and shared object files are stored on disk. More casual users of Python might want to skip over this section, although I would say it’s always helpful to know what’s going on under the hood, if only to help diagnose problems you might run into.

PYC Directories

The previous scheme of storing .pyc files in the same directory as the .py files didn’t play nicely when the same source files were being used by multiple different interpreters. The interpreter would note that the file was created by another one, and replace it with its own. As the files swap back and forth, it cancels out the benefits of caching in the first place.

As a result, the name of the interpreter is now added to the .pyc filename, and to stop these files cluttering things up too much they’ve all been moved to a __pycache__ directory.

I suspect many people will not need to care about this any further than it being another entry for the .gitignore file. However, sometimes there can be odd effects with these compiled files, so it’s worth being aware of. For example, if a module is installed and used and then deleted, it might leave the .pyc files behind, confusing programmers who were expecting an import error. If you do want to check for this, there’s a new __cached__ attribute of an imported module indicating the file that was loaded, in addition to the existing __file__ attribute which continues to refer to the source file. The imp module also has some new functions which are useful for scripts that need to correlate source and compiled files for some reason, as illustrated by the session below:

>>> import mylib
>>> print(mylib.__file__)
/tmp/mylib.py
>>> print(mylib.__cached__)
/tmp/__pycache__/mylib.cpython-32.pyc
>>> import imp
>>> imp.get_tag()
'cpython-32'
>>> imp.cache_from_source("/tmp/mylib.py")
'/tmp/__pycache__/mylib.cpython-32.pyc'
>>> imp.source_from_cache("/tmp/__pycache__/mylib.cpython-32.pyc")
'/tmp/mylib.py'

There are also some corresponding changes to the py_compile, compileall and importlib.abc modules which are a bit esoteric to cover here, the documentation has you well covered. You can also find lots of details and a beautiful module loading flowchart in PEP 3147.

Shared Objects

Similar changes have been implemented for shared object files. These are compiled against a specific ABI (Application Binary Interface), and the ABI is sensitive to the major Python version, but the compilation flags that were used to compile the interpreter can also affect it. As a result, being able to support the same shared object compiled against multiple ABIs is useful.

The implementation is similar to that for compiled bytecode, where .so files acquire unique filenames based on the ABI and are collected into a shared directory pyshared. The suffix for the current interpreter can be queried using sysconfig:

>>> import sysconfig
>>> sysconfig.get_config_var("SOABI")
'cpython-32m-x86_64-linux-gnu'
>>> sysconfig.get_config_var("EXT_SUFFIX")
'.cpython-32m-x86_64-linux-gnu.so'

The interpreter is cpython, 32 is the version and the letters appended indicate the compilation flags. In this example, m corresponds to pymalloc.

If you want more details, PEP 3149 has a ton of interesting info.

Syntax Changes

The syntax of the language has been expanded to allow deletion of a variable that is free in a nested block. If that didn’t make any sense, it’s best explained with an example. The following code was legal in Python 2.x, but would raise a SyntaxError in Python 3.0 or 3.1. In Python 3.2, however, this is once again legal.

def outer_function(x):
    def inner():
        # Reference to x in a nested scope.
        return x
    inner()
    # Deleting variable referenced in nested scope.
    del x

So what happens if we were to call inner() again after the del x? We get exactly the same result as if we hadn’t assigned the local yet, which is a NameError with the message free variable 'x' referenced before assignment in enclosing scope. The following example may make this message clearer.

def outer_function():
    def inner():
        return x
    # print(inner()) here would raise NameError
    x = 123
    print(inner())  # Prints 123
    x = 456
    print(inner())  # Prints 456
    del x
    # print(inner()) here would raise NameError

An important example of an implicit del is at the end of an except block, so the following code would have raised a SyntaxError in Python 3.0-3.1, but is now valid again:

import traceback

def func():
    def print_exception():
        traceback.print_exception(type(exc), exc, exc.__traceback__)
    try:
        do_something_here()
    except Exception as exc:
        print_exception()
        # There is an implicit `del exc` here

Diagnostics and Testing

A new ResourceWarning has been added to detect issues such as gc.garbage not being empty at interpreter shutdown, indicating finalisation problems with the code. It’s also raised if a file object is destroyed before being properly closed.

This warning is silenced by default, but can be enabled by the warnings module, or using an appropriate -W option on the command-line. The session shown below shows the warning being triggered by destroying an unclosed file object:

>>> warnings.filterwarnings("default")
>>> f = open("/etc/passwd", "rb")
>>> del f
<stdin>:1: ResourceWarning: unclosed file <_io.BufferedReader name='/etc/passwd'>

Note that as of Python 3.4 most of the cases that could cause garbage collection to fail have been resolved, but we have to pretend we don’t know that for now.

There have also been a range of improvements to the unittest module. There are two new assertions, assertWarns() and assertWarnsRegex(), to test whether code raises appropriate warnings (e.g. DeprecationWarning). Another new assertion assertCountEqual() can be used to perform an order-independent comparison of two iterables — functionally this is equivalent to feeding them both into collections.Counter() and comparing the results. There is also a new maxDiff attribute for limiting the size of diff output when logging assertion failures.
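
Here’s a small sketch of a couple of those assertions in use; the deprecated function is invented for the example:

import unittest
import warnings

def old_api():
    warnings.warn("use new_api() instead", DeprecationWarning)
    return 42

class ExampleTests(unittest.TestCase):
    def test_deprecation_warning(self):
        with self.assertWarns(DeprecationWarning):
            old_api()

    def test_order_independent(self):
        # Same elements, same multiplicities, different order.
        self.assertCountEqual([1, 2, 2, 3], [3, 2, 1, 2])

if __name__ == "__main__":
    unittest.main()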

Some of the assertion names are being tidied up. Examples include assertRegex() being the new name for assertRegexpMatches() and assertTrue() replacing assert_(). The assertDictContainsSubset() assertion has also been deprecated because the arguments were in the wrong order, so it was never quite clear which argument was required to be a subset of which.

Finally, the command-line usage with python -m unittest has been made more flexible, so you can specify either module names or source file paths to indicate which tests to run. There are also additional options for python -m unittest discover for specifying which directory to search for tests, and a regex filter on the filenames to run.

Optimisations

Some performance tweaks are welcome to see. Firstly, the peephole optimizer is now smart enough to convert set literals consisting entirely of constants into a frozenset when they're used in membership tests. This makes things faster in cases like this:

import os.path

def is_archive(path):
    # Note that splitext() leaves the leading dot on the extension, e.g. ".zip".
    _, ext = os.path.splitext(path)
    return ext.lower() in {".zip", ".tgz", ".gz", ".tar", ".bz2"}
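
If you're curious, you can see the optimisation with the dis module: the set literal shows up as a single pre-built frozenset constant rather than a sequence of instructions to build a set on every call. The exact bytecode and the ordering of the elements vary between CPython versions, so treat the output below as indicative only:

import dis

dis.dis(is_archive)
# Amongst the output you should see a line along these lines:
#   LOAD_CONST    3 (frozenset({'.zip', '.tgz', '.gz', '.tar', '.bz2'}))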

The Timsort algorithm used by list.sort() and sorted() is now faster and uses less memory when a key function is supplied, thanks to a change in the way this case is handled internally. The performance and memory consumption of json decoding are also improved, particularly in cases where the same key is used repeatedly.

A faster substring search algorithm, which is based on the Boyer-Moore-Horspool algorithm, is used for a number of methods on str, bytes and bytearray objects such as split(), rsplit(), splitlines(), rfind() and rindex().

Finally, int to str conversions now process two digits at a time to reduce the number of arithmetic operations required.

Other Changes

There’s a whole host of little changes which didn’t sit nicely in their own section. Strap in and prepare for the data blast!

New WSGI Specification
As part of this release PEP 3333 has been accepted as an update to PEP 333, the original WSGI (Web Server Gateway Interface) specification. Primarily it tightens up the requirements around request/response header and body strings with regards to the types (str vs. bytes) and encodings to use. This is important reading for anyone building web apps that conform to WSGI.
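
To make that concrete, here's a minimal sketch of an application conforming to PEP 3333: the status and headers are native str objects, while the body must be an iterable of bytes.

def application(environ, start_response):
    status = "200 OK"                                           # native str
    headers = [("Content-Type", "text/plain; charset=utf-8")]   # native str pairs
    start_response(status, headers)
    return [u"Hello, WSGI".encode("utf-8")]                     # body is bytes

If you want to try it out, the standard library's wsgiref.simple_server can serve an application like this.
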
range Improvements
range objects now support index() and count() methods, as well as slicing and negative indices, to make them more interoperable with list and other sequences.
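
A quick interactive session shows the kind of thing which now works:

>>> r = range(0, 20, 2)
>>> r.index(10)
5
>>> r.count(10)
1
>>> r[-1]
18
>>> r[2:5]
range(4, 10, 2)
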
csv Improvements
The csv module now supports a unix_dialect output mode where all fields are quoted and lines are terminated with \n. Also, csv.DictWriter has a writeheader() method which writes a row of column headers to the output file, using the key names you provided at construction.
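
As a rough sketch (the filename and field names here are invented purely for illustration):

import csv

with open("archives.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "size"], dialect="unix")
    writer.writeheader()                                    # writes the "name","size" header row
    writer.writerow({"name": "example.zip", "size": 1024})
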
tempfile.TemporaryDirectory Added
The tempfile module now provides a TemporaryDirectory context manager for easy cleanup of temporary directories.
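
Usage is about as simple as you'd hope; this is a minimal sketch:

import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # tmp_dir is the path of a freshly created directory we can scribble in.
    with open(os.path.join(tmp_dir, "scratch.txt"), "w") as f:
        f.write("temporary data")
# The directory and everything in it is removed when the block exits.
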
Popen() Context Managers
os.popen() and subprocess.Popen() can now act as context managers to automatically close any associated file descriptors.
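
For example, something along these lines (assuming a Unix-like system with ls on the PATH):

import subprocess

# The pipe to the child's stdout is closed automatically when the block exits.
with subprocess.Popen(["ls", "-l"], stdout=subprocess.PIPE) as proc:
    listing = proc.stdout.read()
print(listing.decode())
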
configparser Always Uses Safe Parsing
configparser.SafeConfigParser has been renamed to ConfigParser to replace the old unsafe one. The default settings have also been updated to make things more predictable.
select.PIPE_BUF Added
The select module has added a PIPE_BUF constant which defines the minimum number of bytes which is guaranteed not to block when a select.select() has indicated that a pipe is ready for writing.
callable() Re-introduced
The callable() builtin from Python 2.x was re-added to the language, as it’s a more readable alternative to isinstance(x, collections.Callable).
ast.literal_eval() For Safer eval()
The ast module has a useful literal_eval() function which can be used to evaluate expressions more safely than the builtin eval().
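
It only accepts Python literals (strings, numbers, tuples, lists, dicts, sets, booleans and None), so anything which would need to execute code is rejected. The error message below is abbreviated:

>>> import ast
>>> ast.literal_eval("[1, 2.5, 'three', (4, None)]")
[1, 2.5, 'three', (4, None)]
>>> ast.literal_eval("__import__('os').remove('important_file')")
Traceback (most recent call last):
  ...
ValueError: malformed node or string: ...
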
reprlib.recursive_repr() Added
When writing __repr__() special methods, it’s easy to forget to handle the case where a container can contain a reference to itself, which easily leads to __repr__() calling itself in an endlessly recursive loop. The reprlib module now provides a recursive_repr() decorator which will detect the recursive call and add ... to the string representation instead.
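
Here's a minimal sketch of the kind of self-referential container this protects against (the Chain class is invented for illustration):

import reprlib

class Chain:
    def __init__(self):
        self.links = []

    @reprlib.recursive_repr()
    def __repr__(self):
        return "Chain({!r})".format(self.links)

chain = Chain()
chain.links.append(chain)   # The container now refers to itself.
print(repr(chain))          # Prints "Chain([...])" rather than recursing forever.
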
Numeric Type Hash Equivalence
Hash values of the various different numeric types should now be equal whenever their actual values are equal, e.g. hash(1) == hash(1.0) == hash(1+0j).
hashlib.algorithms_available Added
The hashlib module now provides the algorithms_available set which lists the hashing algorithms available on the current platform, as well as algorithms_guaranteed which lists the algorithms guaranteed to be available on all platforms.
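
On Python 3.2 the guaranteed set looks like the below; later releases add further entries, and algorithms_available depends on how the underlying OpenSSL was built:

>>> import hashlib
>>> sorted(hashlib.algorithms_guaranteed)
['md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512']
>>> hashlib.algorithms_guaranteed <= hashlib.algorithms_available
True
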
hasattr() Improvements
Some undesirable behaviour in hasattr() has been fixed. hasattr() works by calling getattr() and checking whether an exception is thrown. This approach allows it to support the multiple ways in which an attribute may be provided, such as implementing __getattr__(). However, prior to this release hasattr() would catch any exception, which could mask genuine bugs. As of Python 3.2 it will only catch AttributeError, allowing any other exception to propagate out.
memoryview.release() Added
Bit of an esoteric one this, but memoryview objects now have a release() method and support use as a context manager. These objects allow a zero-copy view into any object that supports the buffer protocol, which includes the builtins bytes and bytearray. Some objects may need to allocate resources in order to provide this view, particularly those provided by C/C++ extension modules. The release() method allows these resources to be freed earlier, without waiting for the memoryview object itself to go out of scope.
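
A short sketch of the context manager form:

data = bytearray(b"hello world")

with memoryview(data) as view:
    # Slicing a memoryview gives another view onto the same buffer with no copying.
    print(view[:5].tobytes())   # b'hello'
# The view has been released here; any further use of it raises ValueError.
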
structsequence Tool Improvements
The internal structsequence tool has been updated so that C structures returned by the likes of os.stat() and time.gmtime() now work like namedtuple and can be used anywhere where a tuple is expected.
Interpreter Quiet Mode
There’s a -q command-line option to the interpreter to enable “quiet” mode, which suppresses the copyright and version information being displayed in interactive mode. I struggle a little to think of cases where this would matter, I’ll be honest — perhaps if you’re embedding the interpreter as a feature in a larger application?

Conclusions

Well now, I must admit that I did not expect that to be double the size of the post covering Python 3.0! If you've made it here after reading the whole article in one go, I must say I'm impressed. Perhaps lay off the caffeine for a while…?

Overall it feels like a really massive release, this one. Admittedly I did cover a high proportion of the details, whereas in the first article I glossed over quite a lot as some of the changes were so massive I wanted to focus on them.

Out of all that, it’s really hard to pick only a few highlights, but I’ll give it a go. As I said at the outset I love argparse — anyone who writes command-line tools and cares about their usability should save a lot of hassle with this. Also, the concurrent.futures module is great — I’ve only really started using it recently, and I love how it makes it really convenient to add parallelism in simple cases to applications where the effort might otherwise be too high to justify the effort.

The functools.lru_cache() and functools.total_ordering() decorators are both great additions because they offer significant advantages with minimal coding effort, and this is the sort of feature that a language like Python should really be focusing on. It’s never going to beat C or Rust in the performance stakes, but it has real strengths in time to market, as well as the concision and elegance of code.

It’s also great to see some updates to the suite of Internet-facing modules, as having high quality implementations of these in the standard library is another great strength of Python that needs to be maintained. SSL adding support for SNI is a key improvement that can’t come too soon, as it still seems a long way off that we’ll be saying goodbye to the limited address space of IPv4.

Finally, the GIL changes are great to see. Although we’d all love to see the GIL be deprecated entirely, this is clearly a very difficult problem or it would have been addressed by now. Until someone can come up with something clever to achieve this, at least things are significantly better than they were for multithreaded Python applications.

So there we go, my longest article yet. If you have any feedback on the amount of detail that I’m putting in (either too much or too little!) then I’d love to hear from you. I recently changed my commenting system from Disqus to Hyvor which is much more privacy-focused and doesn’t require you to register an account to comment, and also has one-click feedback buttons. I find writing these articles extremely helpful for myself anyway, but it’s always nice to know if anyone else is reading them! If you’re reading this on the front-page, you can jump to the comments section of the article view using the link at the end of the article at the bottom-right.

OK, so that’s it — before I even think of looking at the Python 3.3 release notes, I’m going to go lie down in a darkened room with a damp cloth on my forehead.


  1. In real production environments you should use many more iterations than this, a bigger salt and ideally a better key derivation function like scrypt, as defined in RFC 7914. Unfortunately that won't be in Python until 3.6. 

  2. Maybe more due to hyperthreading, but my assumption was that it wouldn’t help much with a CPU-intensive task like password hashing. My results seemed to validate that assumption. 

  3. Spoiler alert: using my time machine I can tell you it’s not a lot else yet, at least as of 3.10.0a5. 

  4. And you know I love to get technical. 

7 Feb 2021 at 1:08PM in Software
Photo by David Clode on Unsplash
