☑ Python 2to3: What’s New in 3.5 - Part 4, Module Improvements

10 Jul 2021 at 10:38PM in Software

In this series looking at features introduced by every version of Python 3, this is the fourth looking at Python 3.5. In it we look at the major updates to the standard library which were made in this release.

This is the 11th of the 13 articles that currently make up the “Python 2to3” series.


As you’ve perhaps read in the preceding three articles in this series, Python 3.5 had quite an impressive number of new features. However, this didn’t come at the expense of the usual widespread improvements to the standard library as well, and in this fourth and final article on Python 3.5 we’ll go through most of these.

There are some further asyncio improvements, a whole lot of improvements in networking modules, some handy changes in the pathlib module, and more. But let’s kick off with some changes to containers.

Collections

There are a handful of improvements in the collections, collections.abc and enum modules.

OrderedDict Improvements

Firstly OrderedDict has been re-written in C for better performance. The release notes say this should improve its performance 4-100x, which basically means it should be faster, but exactly how much depends very much on what you’re doing with it.

Secondly, the items(), keys() and values() views now support iteration with reversed().

deque Now a MutableSequence
The deque class has added index(), insert() and copy() methods and also supports the + and * operators, which all together means it now fulfills the requirements of the MutableSequence abstract base class. This means a deque can be used to replace a list in most contexts, which is handy because for FIFO-like use-cases deques are great data structures.
New Abstract Base Classes
I always assumed the lack of collections.abc.Generator was an oversight, and it’s good to see it corrected in this release. This becomes particularly relevant in 3.9 when the abstract base classes support direct use as type hints, but that’s quite a few articles away yet. There are also some other new abstract base classes for async support, namely Awaitable, Coroutine, AsyncIterator, and AsyncIterable.
enum.Enum Takes start Parameter
For cases where enumerations are automatically numbered, the start parameter specifies the initial integer value.
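To make the collections changes above a little more concrete, here is a minimal sketch that exercises each of them in turn. The names (d, od, Color and so on) are purely illustrative and not taken from the release notes.

```python
from collections import OrderedDict, deque
from collections.abc import MutableSequence
from enum import Enum

# deque gained index(), insert(), copy() and the + and * operators.
d = deque([1, 2, 3])
d.insert(1, 99)                # d is now deque([1, 99, 2, 3])
position = d.index(99)         # 1
combined = d + deque([4])      # a new deque with 4 appended
repeated = deque("ab") * 2     # deque(['a', 'b', 'a', 'b'])
is_mutable_sequence = isinstance(d, MutableSequence)

# OrderedDict views can now be passed to reversed().
od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
reversed_keys = list(reversed(od.keys()))   # ['c', 'b', 'a']

# The functional Enum API accepts start to set the first automatic value.
Color = Enum("Color", "RED GREEN BLUE", start=10)
first_value = Color.RED.value               # 10
```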

Compression & Archiving

There’s a minor but useful change in the gzip module to allow the x character to be added to the mode argument to request exclusive creation of the file.

In a similarly small fashion, lzma.LZMADecompressor.decompress() now accepts an optional max_length parameter to put an upper limit on the size of the decompressed data, so you don’t accidentally fill up your disk.

In the tarfile module, open() now accepts the x mode modifier to require exclusive creation of the file. Also, the TarFile.extract() and TarFile.extractall() methods now accept a boolean numeric_owner parameter: if True, the numeric UID and GID from the tarfile are used to set ownership of extracted files, as opposed to the normal behaviour of translating the names into whatever IDs are in use on the local system.

The zipfile module now supports writing output to streams that don’t support seeking. Also, ZipFile.open() now also accepts an x specified in the mode string, to require exclusive creation of a new file.
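As a rough illustration of the exclusive creation mode and the max_length limit, the sketch below uses gzip and lzma. The temporary directory and the file name data.gz are just assumptions for the example.

```python
import gzip
import lzma
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.gz")

    # "x" requests exclusive creation: this succeeds because the file is new.
    with gzip.open(path, "xb") as f:
        f.write(b"first write")

    # A second exclusive open of the same path is refused.
    try:
        gzip.open(path, "xb")
        refused = False
    except FileExistsError:
        refused = True

    with gzip.open(path, "rb") as f:
        content = f.read()

# max_length caps how much decompressed data a single call may return.
blob = lzma.compress(b"a" * 10000)
decompressor = lzma.LZMADecompressor()
chunk = decompressor.decompress(blob, max_length=100)
```

The rest of the data stays buffered inside the decompressor, so calling decompress() again with empty input retrieves the next chunk.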

Concurrency

Several of the modules for concurrency saw some incremental improvements, including asyncio, concurrent.futures and multiprocessing.

asyncio

There are a number of assorted improvements to asyncio in this release, most of which are discussed below.

Debug Mode
There are set_debug() and get_debug() methods on event loop objects to enable/disable additional runtime checks. These include better context on exception tracebacks to indicate the affected task, warnings when operations are performed which may not be thread-safe, and warnings when asynchronous calls take longer than a certain time to return.
SSL in Proactor Loops
SSL support has been factored out from the specific type of loop in question, so is now available for all loop types.
loop.is_closed() Added
The new is_closed() method on event loops checks whether the loop has been closed.
async() Replaced
Instead of the old async() function, ensure_future() should now always be used.
Task Factories
It’s now possible to set a new factory for constructing Task objects, using the event loop’s new set_task_factory() and get_task_factory() methods. This factory will be used by the create_task() method, and the factory must be a callable that takes two parameters: the event loop to which the task is being added, and the coroutine to form the body of the task.
New methods on asyncio.Queue
The asyncio.Queue class is similar to that provided by the queue module, except that the get() and put() methods are coroutines. In this release a new join() coroutine method was added to block until the queue has no outstanding pending items. Note that this doesn’t just mean all items have been retrieved with get(), it also requires them to be processed — this is indicated by consumers calling a new task_done() method (not a coroutine) once each fetched task is complete.
Threadsafe Scheduling
There’s a new run_coroutine_threadsafe() function to submit coroutines to an event loop safely from other threads.
Future Construction
There’s a new method on event loops create_future() to create new Future objects, so alternative loop implementations can provide fast implementations. This method should be used going forward.
StreamReader.readuntil() Added
The high-level StreamReader class, for working with network connections, now has a handy readuntil() method which reads data up to and including a specified separator, and then removes that from the buffer and returns it. Handy for reading, say, HTTP headers up to the terminating \r\n\r\n.
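The interplay between join() and task_done() on asyncio.Queue is easier to see in code, so here is a small sketch. The worker() and main() coroutines are hypothetical names of my own, and the event loop is driven with run_until_complete() so that the snippet fits the 3.5 era as well as later releases.

```python
import asyncio


async def worker(queue, results):
    # Consume items forever; the caller cancels this task when done.
    while True:
        item = await queue.get()
        results.append(item * 2)
        # Tell the queue this item has been fully processed.
        queue.task_done()


async def main():
    queue = asyncio.Queue()
    results = []
    # ensure_future() is the replacement for the old async() function.
    task = asyncio.ensure_future(worker(queue, results))
    for i in range(5):
        await queue.put(i)
    # Returns only once task_done() has been called for every item.
    await queue.join()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return results


loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
    doubled = loop.run_until_complete(main())
finally:
    loop.close()
    asyncio.set_event_loop(None)
```

Note that join() would hang if the worker forgot to call task_done(), since retrieving an item with get() alone doesn't count as completing it.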

concurrent.futures

There are a couple of changes to allow better performance. Firstly, Executor.map() now offers a chunksize parameter which allows the code to control the batch size when assigning tasks to child processes with ProcessPoolExecutor. This is particularly convenient when tasks are quick, so the overhead of pushing them out individually would be considerable.

Secondly, the number of workers in the ThreadPoolExecutor is now optional, and defaults to 5x the number of CPUs. This makes it simpler to write code that can make best use of available resources on any platform.

Diagnostics & Testing

Some handy improvements to inspect and logging, as well as unittest.

inspect

The BoundArguments object represents the result of binding argument values to parameters from a Signature object. In Python 3.5, this has acquired a new apply_defaults() method which will add any default values for arguments not already bound in that instance. This is probably best demonstrated with an example.

>>> import inspect
>>> def my_func(arg1, arg2, arg3="three", arg4="four",
...             *args, kwarg1="eins"): pass
...
>>> signature = inspect.signature(my_func)
>>> bound_args = signature.bind("un", "deux", "troi")
>>> bound_args
<BoundArguments (arg1='un', arg2='deux', arg3='troi')>
>>> bound_args.apply_defaults()
>>> bound_args
<BoundArguments (arg1='un', arg2='deux', arg3='troi', arg4='four', args=(), kwarg1='eins')>
>>> bound_args.args
('un', 'deux', 'troi', 'four')
>>> bound_args.kwargs
{'kwarg1': 'eins'}

The signature() function itself has also gained a new follow_wrapped optional keyword argument. This controls whether to follow the __wrapped__ attribute that decorators add to link to the wrapped function, which we discussed in the article on Python 3.2. This defaults to True but you can now override it to False if you want the signature of the actual callable you pass, without following the __wrapped__ chain.
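A quick sketch of the difference follow_wrapped makes is shown here. The logged() decorator and add() function are made-up names for illustration only.

```python
import functools
import inspect


def logged(func):
    # functools.wraps() sets wrapper.__wrapped__ = func
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@logged
def add(x, y=1):
    return x + y


# Default behaviour: follow __wrapped__ back to the original function.
followed = str(inspect.signature(add))                        # '(x, y=1)'
# With follow_wrapped=False we see the wrapper's own signature.
raw = str(inspect.signature(add, follow_wrapped=False))       # '(*args, **kwargs)'
```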

There are also some new functions to inspect coroutine objects and functions:

iscoroutine()
Returns True if the specified object is a coroutine, returned from a coroutine function.
iscoroutinefunction()
Returns True if the specified function was defined with async def.
isawaitable()
Returns True if the specified object can be used as the target of an await expression, which can be a coroutine or any object with an __await__() method.
getcoroutinelocals()
This is the coroutine equivalent of getgeneratorlocals(), which was added in Python 3.3. In short, it returns a dict of the current values of local variables within the specified coroutine.
getcoroutinestate()
Returns the current state of the specified coroutine object, which is one of CORO_CREATED, CORO_RUNNING, CORO_SUSPENDED and CORO_CLOSED.
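The coroutine inspection functions listed above can be tried out without even running an event loop. In this sketch fetch() is a hypothetical coroutine function which is created and then closed without ever being awaited.

```python
import inspect


async def fetch(n):
    return n * 2


coro = fetch(3)

is_function = inspect.iscoroutinefunction(fetch)   # True: defined with async def
is_coroutine = inspect.iscoroutine(coro)           # True: result of calling it
is_awaitable = inspect.isawaitable(coro)           # True: usable with await
state_before = inspect.getcoroutinestate(coro)     # CORO_CREATED, not yet started
local_vars = inspect.getcoroutinelocals(coro)      # dict of the coroutine's locals

# Close it explicitly so Python doesn't warn that it was never awaited.
coro.close()
state_after = inspect.getcoroutinestate(coro)      # CORO_CLOSED
```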

logging

All logging methods which produce a log entry accept an exc_info argument, which you can set to True to include details of the currently handled exception (if any), or a 3-tuple of (type, value, traceback) as returned by sys.exc_info() to log details of the specified exception. The change in Python 3.5 is that you can now also just pass an instance of an exception to trigger this latter behaviour, which is convenient.
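Passing an exception instance is particularly useful when you want to log an exception after the except block has finished. The sketch below assumes a logger called exc_info_demo writing to an in-memory stream, both of which are just for illustration.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("exc_info_demo")
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False  # keep the demo output out of the root logger

saved = None
try:
    int("not a number")
except ValueError as exc:
    # Keep a reference, since the name exc is cleared when the block ends.
    saved = exc

# Later on, outside the except block, log it by passing the instance.
logger.error("conversion failed", exc_info=saved)
output = stream.getvalue()
```

The output contains the message followed by the usual traceback for the saved exception.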

The handlers.HTTPHandler class, which supports sending log messages to a web server via GET or POST, has been improved so that you can optionally pass an ssl.SSLContext instance to configure SSL settings used for the HTTP connection. As remote logging is likely to be the sort of thing where you’d want some authentication and confidentiality, this seems like a useful change.

The handlers.QueueListener is not actually a handler, but a companion to the handlers.QueueHandler class. The QueueHandler supports writing log messages to a queue, such as that provided by the queue or multiprocessing modules, and the QueueListener watches this queue and processes the log messages so enqueued. This allows threads and other processes to quickly deal with generating log messages without the overhead of potentially expensive handlers, and have the log messages dealt with asynchronously in a different thread.

This was actually added in Python 3.2, but the reason I’m mentioning it here is that in Python 3.5 a respect_handler_level parameter has been added to its constructor. If True, this class will filter log messages according to the threshold log level of each registered handler. Prior to this, every log message would always be passed to every handler.
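Here is a minimal sketch of respect_handler_level in action, assuming a single downstream handler with a WARNING threshold. The logger name queue_demo and the in-memory stream are illustrative choices.

```python
import io
import logging
import logging.handlers
import queue

log_queue = queue.Queue()
stream = io.StringIO()

# The "real" handler, which only wants WARNING and above.
target = logging.StreamHandler(stream)
target.setLevel(logging.WARNING)

listener = logging.handlers.QueueListener(
    log_queue, target, respect_handler_level=True)

logger = logging.getLogger("queue_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))

listener.start()
logger.info("filtered out by the listener")
logger.warning("passed on to the handler")
listener.stop()  # processes everything still queued, then joins the thread

lines = stream.getvalue().splitlines()
```

With respect_handler_level left at its default of False, both messages would reach the handler regardless of its level.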

unittest

In unittest, TestLoader.loadTestsFromModule() takes an optional pattern keyword argument. If a module defines a load_tests() function, to customise how tests are loaded, the pattern argument is passed as the third parameter to load_tests(). This allows the set of tests to be loaded to be filtered.

Errors during discovery are now exposed as TestLoader.errors, which could be useful for those running tests automatically as opposed to interactively. Also, when executing on the command-line there’s a new --locals flag which includes local variables in backtraces, for easier diagnosis of the reasons for failures.

The unittest.mock module also has some handy changes. First up is to address an irritating issue caused when you mistype an assertX() method on a Mock object. Because these objects manufacture attributes on demand, it will be treated as a regular method and the test will not raise a failure despite the error. That is until Python 3.5, because now any method name with a prefix of assert (or assret, a common typo) will cause an immediate AttributeError. If you happen to have such methods that you legitimately want to mock, you can disable this behaviour by passing unsafe=True to the Mock constructor.

There’s a new assert_not_called() method on Mock, which is slightly more readable than manually asserting call_count is zero. The MagicMock has also been enhanced to support a few new special methods, namely __truediv__(), __divmod__() and __matmul__().
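The mock changes are easy to demonstrate. In this sketch the misspelled assret_called_once() is deliberate, to show the typo protection and how unsafe=True switches it off.

```python
from unittest import mock

m = mock.Mock()

# Passes, because the mock has not been called yet.
m.assert_not_called()

# A misspelled assertion now fails loudly instead of silently passing.
try:
    m.assret_called_once()
    typo_caught = False
except AttributeError:
    typo_caught = True

# unsafe=True restores the old behaviour: the attribute is auto-created
# as a child mock, so calling it quietly does nothing useful.
relaxed = mock.Mock(unsafe=True)
child = relaxed.assret_called_once()

# After a call, assert_not_called() raises AssertionError.
m()
try:
    m.assert_not_called()
    second_check_failed = False
except AssertionError:
    second_check_failed = True
```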

Text Parsing

Some changes to make configparser more flexible, some json convenience changes, and some corners of re filled out to improve its expressiveness even more.

configparser

The constructor of ConfigParser now allows custom converters to be registered. You do this by passing the converters parameter as a dictionary mapping type names into functions which take the string value as an argument and return the converted value.

These work the same way as the built-in getint(), getfloat() and getboolean() methods in that ConfigParser doesn’t try to guess the type of values or convert them on reading — instead, you choose the conversion when you query the value. Any exceptions raised during conversion will occur when you query the value using the accessor. If you register a type "foo" then there’ll be a getfoo() method, so be careful when choosing the names of your types in the dictionary.

Here’s an example:

>>> import configparser
>>> import ipaddress
>>> convs = {
...     "intlist": lambda x: [int(i.strip()) for i in x.split(",")],
...     "ipaddr": lambda x: ipaddress.ip_address(x)
... }
>>> parser = configparser.ConfigParser(converters=convs)
>>> parser.read_string("""
... [my_section]
... one_value = 2, 3, 5, 7, 11, 13, 17
... another_value = 10.1.200.15
... some_other_value = 2002:0a01:c80f::
... """)
>>> parser.get("my_section", "one_value")
'2, 3, 5, 7, 11, 13, 17'
>>> parser.getintlist("my_section", "one_value")
[2, 3, 5, 7, 11, 13, 17]
>>> parser.get("my_section", "another_value")
'10.1.200.15'
>>> parser.getipaddr("my_section", "another_value")
IPv4Address('10.1.200.15')
>>> parser.getipaddr("my_section", "some_other_value")
IPv6Address('2002:a01:c80f::')
>>> parser.getipaddr("my_section", "one_value")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andy/.pyenv/versions/3.5.10/lib/python3.5/configparser.py", line 806, in _get_conv
    **kwargs)
  File "/Users/andy/.pyenv/versions/3.5.10/lib/python3.5/configparser.py", line 800, in _get
    return conv(self.get(section, option, **kwargs))
  File "<stdin>", line 3, in <lambda>
  File "/Users/andy/.pyenv/versions/3.5.10/lib/python3.5/ipaddress.py", line 54, in ip_address
    address)
ValueError: '2, 3, 5, 7, 11, 13, 17' does not appear to be an IPv4 or IPv6 address

json

A couple of little enhancements. Firstly, json.tool preserves the ordering of keys in JSON objects now, unless --sort-keys is passed, which will re-sort them lexicographically.

Secondly, decoding JSON now throws json.JSONDecodeError instead of ValueError for better context — however, since the former is a subclass of the latter, existing code that catches ValueError should continue to work.
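To illustrate the extra context, here is a short sketch that catches the new exception and looks at its attributes. The malformed document is arbitrary, and the exact message text varies between Python versions so I haven't shown it.

```python
import json

caught = None
try:
    json.loads('{"a": 1,}')
except json.JSONDecodeError as exc:
    caught = exc

# Existing handlers for ValueError still catch it.
is_value_error = isinstance(caught, ValueError)

# The new exception records where in the document parsing failed.
location = (caught.lineno, caught.colno, caught.pos)
document = caught.doc
```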

re

There are some changes with regards to matching groups in Python 3.5.

In Python 3.4 and prior, lookbehind assertions were not allowed to contain references to match groups. When compiling the pattern you’d get a warning, and then it would simply fail to match when used.

Python 3.4.10 (default, Mar 28 2021, 04:12:21)
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(r"(ABC|XYZ)...(?<=\1)DEF")
/Users/apearce16/.pyenv/versions/3.4.10/lib/python3.4/sre_parse.py:361: RuntimeWarning: group references in lookbehind assertions are not supported
  RuntimeWarning)
>>> assert pattern.match("ABCABCDEF")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

However, in Python 3.5 group references are permitted in lookbehind assertions, although only if they evaluate to a fixed-width pattern. You can also use conditional group references, such as (?(1)ABC|DEF) [1].

Python 3.5.10 (default, Mar 28 2021, 04:14:51)
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>>
>>> pattern = re.compile(r"(ABC|XYZ)...(?<=\1)DEF")
>>> assert pattern.match("ABCABCDEF")
>>>
>>> pattern = re.compile(r"((AAA|BBB):)?...(?<=(?(2)\2|XXX)):CCC")
>>> assert pattern.match("AAA:AAA:CCC")
>>> assert not pattern.match("BBB:AAA:CCC")
>>> assert pattern.match("BBB:BBB:CCC")
>>> assert not pattern.match("DDD:DDD:CCC")
>>> assert not pattern.match("DDD:CCC")
>>> assert pattern.match("XXX:CCC")
>>>

There’s also a change with regard to using matching groups in the replacement string of re.sub() and re.subn(). In Python 3.4, if a group failed to match in the source string and it was referenced in the replacement, then an exception would be raised:

Python 3.4.10 (default, Mar 28 2021, 04:12:21)
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import re
>>> re.sub("(AAA:)?BBB", r"<<\1>>", ":::AAA:BBB:::")
':::<<AAA:>>:::'
>>> re.sub("(AAA:)?BBB", r"<<\1>>", ":::BBB:::")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/apearce16/.pyenv/versions/3.4.10/lib/python3.4/re.py", line 179, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/Users/apearce16/.pyenv/versions/3.4.10/lib/python3.4/re.py", line 331, in filter
    return sre_parse.expand_template(template, match)
  File "/Users/apearce16/.pyenv/versions/3.4.10/lib/python3.4/sre_parse.py", line 888, in expand_template
    raise error("unmatched group")
sre_constants.error: unmatched group

However, in Python 3.5 instead of raising an exception, the group reference is simply replaced with an empty string:

Python 3.5.10 (default, Mar 28 2021, 04:14:51)
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub("(AAA:)?BBB", r"<<\1>>", ":::AAA:BBB:::")
':::<<AAA:>>:::'
>>> re.sub("(AAA:)?BBB", r"<<\1>>", ":::BBB:::")
':::<<>>:::'

In addition to these changes, the number of matching groups is no longer limited to 100 as it was previously. I might tentatively suggest, however, that if you have a case for a regular expression with more than 100 match groups, you should strongly consider using a more robust parsing mechanism than regular expressions. Or at least breaking your expression up into a set of smaller regular expressions.

Finally, unrelated to matching groups, the re.error instances have new attributes that provide more context on the error in question. This means that the errors you get pinpoint where in the pattern the error is occurring.
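As a brief sketch of those new attributes, the snippet below compiles a deliberately broken pattern and inspects the resulting error. The pattern itself is just an example with an unclosed group.

```python
import re

err = None
try:
    re.compile(r"abc(def")
except re.error as exc:
    err = exc

message = err.msg            # the unformatted error message
bad_pattern = err.pattern    # the pattern that failed to compile
offset = err.pos             # index into the pattern where compilation failed
line_and_column = (err.lineno, err.colno)
```

The lineno and colno attributes are mostly helpful for multi-line patterns written with re.VERBOSE.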

Internet

Quite a batch of changes to the networking modules this release. First we’ll get some smaller details out of the way, and then run through the modules with more significant updates.

In the imaplib module, the IMAP4 class has got the context manager treatment, so can now be used with the with statement. At the end of the enclosed block, it will automatically send the LOGOUT command.

The module also now supports UTF-8, implementing RFC 6855 and its pre-requisite RFC 5161. To activate this, the option must be enabled by passing UTF8=ACCEPT to IMAP4.enable() — the addition of this method is what supports RFC 5161. Whether this was successfully negotiated can be checked with IMAP4.utf8_enabled. Also, non-ASCII usernames and passwords are now possible, being encoded with UTF-8.

Similarly to imaplib, poplib now supports UTF-8 as per RFC 6856. This can be activated by calling POP3.utf8(), which returns the server response if successful or raises an error_proto exception if not.

email

The email module has a few more options. The Policy.mangle_from_ option controls whether lines in email bodies that start with From followed by a space have a > character inserted in front of them. This was previously always done to avoid confusion with the “From ” line that separates messages in the mbox format, but is now disabled by default in all policies except compat32 (for backwards compatibility).

The Message and EmailMessage classes now have a get_content_disposition() method which makes it easier to determine the canonical value for this important header, primarily to determine whether this is an inline or attachment MIME part. It’s worth remembering that in a multipart message, the payload is a list of such instances, so each can have its own disposition value.
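Here is a small sketch of get_content_disposition() on a multipart message. The subject, body and data.csv attachment are invented for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Monthly report"
msg.set_content("See the attached file.")
msg.add_attachment(b"col1,col2\n1,2\n",
                   maintype="text", subtype="csv",
                   filename="data.csv")

# Each part of the multipart message reports its own disposition.
# The plain body has no Content-Disposition header, so it reports None.
dispositions = [part.get_content_disposition() for part in msg.iter_parts()]

# Pick out just the attachments.
attachments = [part.get_filename() for part in msg.iter_parts()
               if part.get_content_disposition() == "attachment"]
```

The value is always returned in lower case with any parameters stripped, which saves parsing the header by hand.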

There are also changes to support UTF-8 headers and passing email.charset.Charset instances to the mime.text.MIMEText constructor, but these are a little too esoteric to go into in any detail.

http

There’s a new HTTPStatus enum which defines HTTP status codes. As well as the usual name and value attributes, each entry also defines a phrase and description, see the example below.

>>> import http
>>> http.HTTPStatus.NOT_FOUND.value
404
>>> http.HTTPStatus.NOT_FOUND.name
'NOT_FOUND'
>>> http.HTTPStatus.NOT_FOUND.phrase
'Not Found'
>>> http.HTTPStatus.NOT_FOUND.description
'Nothing matches the given URI'
>>> http.HTTPStatus.TOO_MANY_REQUESTS.description
'The user has sent too many requests in a given amount of time ("rate limiting")'

In http.client there’s also a new RemoteDisconnected exception which will be raised if the server closes the connection before at least the status line of the response can be returned. Previously the slightly less helpful BadStatusLine was raised with an empty status line, which could cause some confusion.

This exception is a subclass of ConnectionResetError which is itself derived from ConnectionError. In a related change, any ConnectionError now causes the underlying socket to be closed, to be reopened on the next request. This makes a great deal of sense since these errors typically mean either there’s no connection, or you’ve become somehow out of sync with the server, and the best option is to reconnect.

ipaddress

A couple of minor improvements. Firstly, the constructors for IPv4Network and IPv6Network now accept a 2-tuple of (address, netmask) where the latter can be an integer number of bits or a mask in dotted-decimal notation.

Secondly, addresses now have a reverse_pointer attribute which returns the name of the PTR record used for reverse DNS lookups.

>>> import ipaddress
>>> ipaddress.IPv4Network(("93.184.0.0", 16))
IPv4Network('93.184.0.0/16')
>>> ipaddress.IPv4Network(("93.184.0.0", "255.255.0.0"))
IPv4Network('93.184.0.0/16')
>>> ipaddress.IPv4Address("93.184.216.34").reverse_pointer
'34.216.184.93.in-addr.arpa'

smtpd

For those running a mailserver written in Python, you might like to be aware that smtpd has a few changes to support UTF-8. Those of you who prefer Google to read, sorry, handle your email instead can probably skip straight to the next section.

The SMTPServer and SMTPChannel constructors can now be passed a decode_data keyword parameter. If True, data is decoded as UTF-8 before being passed as a str to the process_message() function, which applications are expected to implement to handle incoming messages. If False, however, messages are just passed as a raw bytes object instead. The default is True in Python 3.5 for backwards compatibility, but the intention is to change the default to False in Python 3.6, as a framework like this shouldn’t necessarily assume UTF-8 encoding.

Additionally, if decode_data is False then the server advertises the 8BITMIME SMTP extension as specified by RFC 6152. This permits 8-bit characters to be used, whereas the original SMTP specification specified characters must be 7-bit. If a client application specifies BODY=8BITMIME on the MAIL line then this will be passed to process_message() using a new mail_options keyword parameter, which is a list of all the options the client specified.

There’s additionally support for RFC 6531, which extends SMTP to support UTF-8 in mailbox names and header fields. This is only advertised by the server if enable_SMTPUTF8=True is passed to the SMTPServer or SMTPChannel constructor. If the client application wishes to utilise this, they add SMTPUTF8 to the MAIL line and it’s again passed into process_message() via mail_options. It’s the responsibility of the implementation of this method to handle the UTF-8 values appropriately.

Finally, both the local bind address and the upstream SMTP relayer, which are passed as localaddr and remoteaddr respectively to the SMTPServer constructor, can now be specified as IPv6 addresses.

smtplib

The smtplib module supports several authentication methods [2] and typically you’d call SMTP.login(), passing in a username and password, to use whichever is the best one supported by the server. However, if you want to use an authentication method not supported by the library (e.g. DIGEST-MD5) then things are considerably trickier. Until now, that is, with the addition of the SMTP.auth() method. Code implementing a different auth method passes in a callable object which is used to process the server’s challenge.

In addition, both SMTP.sendmail() and SMTP.send_message() now also support the UTF-8 RFC 6531, by passing SMTPUTF8 in the mail_options parameter. This is required if you want to use non-ASCII characters in the email address fields.

socket

There are a few changes to the socket module that are worth noting. The first one is that functions with timeouts now use a monotonic clock instead of the system clock — this probably won’t impact too many people, but might save you from some maddeningly unreproducible bugs when the NTP daemon jumps in and fiddles with the system clock under your feet.

More generally applicable, socket objects now offer a sendfile() method which uses the os.sendfile() function that was exposed back in Python 3.3. This is 2-3x faster at transferring files across sockets as the copying is done entirely within the kernel.

Also, an annoying issue with the timeout of socket.sendall() has been fixed. Previously, each time data was successfully sent the timeout clock was reset — the consequence was that there wasn’t any way for calling code to impose an overall timeout on the operation in the event that the connection was slow but made occasional progress. As of Python 3.5, however, the timeout is treated as an overall timeout for the entire operation, which is much more useful.

Finally, the backlog parameter to socket.listen() is now optional — if omitted it defaults to SOMAXCONN which is the maximum permitted value, or 128 if that’s lower.
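To give a feel for the new sendfile() method on sockets, the sketch below pushes a small temporary file across a local socketpair(). The pair of connected sockets stands in for a real network connection, which is purely an assumption to keep the example self-contained.

```python
import socket
import tempfile

left, right = socket.socketpair()
with left, right, tempfile.TemporaryFile() as f:
    f.write(b"hello sendfile")
    f.flush()
    f.seek(0)

    # The file must be a regular file opened in binary mode. Where
    # os.sendfile() is not usable, plain send() is used instead.
    sent = left.sendfile(f)

    received = right.recv(1024)
```

The return value is the total number of bytes sent, and the file position is updated to match.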

ssl

There are quite a few changes in the ssl module, some of which are a little esoteric so I’ll try and be brief. I’m not holding out much hope, though.

First up is the new SSLObject class, which has been added for cases where you need to access the SSL protocol stack but you don’t want all the network IO that’s built into SSLSocket. The MemoryBIO class acts as a memory buffer to pass data in and out for this case. This would be useful for integrating SSL into an application’s wider poll loop, for example.

Next we have support for RFC 7301 application-layer protocol negotiation. This extension to TLS allows clients to negotiate which of several protocols to exchange over a single underlying secure channel at connection time, and is a key requirement for being able to run HTTP/2 over existing HTTP/1.1 SSL ports. The SSLContext class now has a set_alpn_protocols() method which is used for the application code to advertise during the TLS handshake — for example, if you want to use either HTTP/2 or HTTP/1.1 you could pass ["h2", "http/1.1"]. The SSLSocket has a corresponding selected_alpn_protocol() method which returns the protocol which was selected for the connection.
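Both SSLObject and ALPN can be sketched without touching the network. In the snippet below the hostname example.org is a placeholder and the handshake is deliberately left unfinished, since no server is involved: the point is just to show the first handshake message landing in the outgoing MemoryBIO.

```python
import ssl

context = ssl.create_default_context()

# Advertise the protocols we're willing to speak, in preference order.
if ssl.HAS_ALPN:
    context.set_alpn_protocols(["h2", "http/1.1"])

# Two in-memory buffers replace the socket: one per direction.
incoming = ssl.MemoryBIO()
outgoing = ssl.MemoryBIO()
tls = context.wrap_bio(incoming, outgoing, server_hostname="example.org")

try:
    tls.do_handshake()
except ssl.SSLWantReadError:
    # Expected: the handshake can't continue until the server replies.
    pass

# The bytes the application would now send over its own transport.
client_hello = outgoing.read()

# No protocol has been selected, as the handshake has not completed.
negotiated = tls.selected_alpn_protocol()   # None
```

In a real application you would write client_hello to your transport, feed whatever the server sends back into incoming.write(), and call do_handshake() again until it stops raising.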

And then there’s a collection of smaller enhancements:

SSLSocket.version() Added
To query the version of TLS in use.
SSLSocket.sendfile() Added
Similar to the same change mentioned earlier to socket.socket objects.
Non-Blocking SSLSocket.send() Change
If SSLSocket.send() was called on a non-blocking socket when it would normally block, it would previously return 0. Now it raises the ssl.SSLWantReadError or ssl.SSLWantWriteError exceptions as appropriate.
cert_time_to_seconds() Now Takes UTC
Previously it would treat the input time as local time, but the standards expect UTC.
Methods No Longer Reset Timeouts
Similar to the change in socket.sendall(), various SSLSocket methods no longer reset the timeout on a successful write: do_handshake(), read(), shutdown(), and write().

urllib

HTTP Basic Authentication is still used on the web for simple cases, as it’s easy to code and provides basic protection that may be suitable for low-risk cases over otherwise secure channels (e.g. TLS). This style of authentication demands that requests for protected resources that lack an appropriate Authorization header should trigger a 401 Unauthorized response. Some HTTP client libraries have come to depend on this behaviour, and won’t actually send an Authorization header until they see the 401 so they know that authorization is required. It turns out that Python’s urllib is one of those libraries.

The problem is that some servers don’t conform to this behaviour, often intentionally. A notable example is GitHub’s API, which responds with a standard 404 Not Found error instead of a 401, for rather vague hand-wavey reasons of security [3]. The solution to this is to pre-emptively send the Authorization header on the initial request instead of waiting for the 401, which is explicitly anticipated in §2 of RFC 2617 where it says:

A client MAY preemptively send the corresponding Authorization header with requests for resources in that space without receipt of another challenge from the server.

To facilitate this there’s a new urllib.request.HTTPPasswordMgrWithPriorAuth class which is similar to HTTPPasswordMgrWithDefaultRealm but which pre-emptively sends the Authorization header. Even in cases where the server responds with an appropriate 401, this approach also saves an unnecessary round-trip time.
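A sketch of how this might be wired up follows. The URL https://api.example.com/ and the credentials are placeholders, and nothing is actually sent over the network here.

```python
import urllib.request

manager = urllib.request.HTTPPasswordMgrWithPriorAuth()

# is_authenticated=True is what requests the pre-emptive header for
# any URL beneath the one given here.
manager.add_password(None, "https://api.example.com/",
                     "some_user", "some_token",
                     is_authenticated=True)

handler = urllib.request.HTTPBasicAuthHandler(manager)
opener = urllib.request.build_opener(handler)

# Check which URLs will have credentials sent on the first request.
will_preauth = manager.is_authenticated("https://api.example.com/repos")
wont_preauth = manager.is_authenticated("https://other.example.com/")
```

Calling opener.open() on a URL under the registered prefix should then include the Authorization header straight away, rather than waiting for a 401.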

As well as this new class, there are a handful of smaller changes:

parse.urlencode() Supports quote_via Argument
The urlencode() method needs to URL-encode values, escaping special characters which aren’t valid in URLs. It now accepts a quote_via parameter, where you can specify a function to transform a string to a URL-safe form. The default is to use urllib.parse.quote_plus(), which encodes spaces as +.
request.urlopen() Supports context Argument
This can be used to pass in an ssl.SSLContext object to use for HTTPS connections.
parse.urljoin() Updated
An issue has been fixed when adding relative URLs that use .. enough times to move outside the root of the URL space. As suggested by §5.4 of RFC 3986, these invalid excess sections should just be ignored, and that’s what urljoin() now does in Python 3.5.
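Two of those smaller urllib changes are quick to show in code. The query values and URLs below are arbitrary examples.

```python
from urllib.parse import quote, urlencode, urljoin

params = {"q": "hello world"}

# The default quoting function is quote_plus(), so spaces become +.
default_encoding = urlencode(params)                   # 'q=hello+world'

# Passing quote instead percent-encodes spaces.
percent_encoding = urlencode(params, quote_via=quote)  # 'q=hello%20world'

# Excess .. segments that would climb above the root are dropped.
joined = urljoin("http://example.com/a/b", "../../../c")  # 'http://example.com/c'
```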

Language Support

There are some small changes to modules that wrap up common coding tasks.

contextlib.redirect_stderr() Added
This context manager is great for handling badly-behaved code that writes directly to stderr. It takes an alternative file-like object as a parameter, which could be an io.StringIO instance, for example.
functools.lru_cache Implemented in C
This should offer significant performance improvements.
Deferring Module Loading
For cases where startup time is critical, importlib.util.LazyLoader has been added to allow the actual load of a module to be deferred until the first attribute access. I’d say this is generally a bad idea unless you really need it, however, as any exceptions or error messages which would have occurred at import are then deferred until first use and occur in a confusing context.
module_from_spec() Added
There’s a new function importlib.util.module_from_spec() which is now the preferred way to create a new module. The advantage over directly instantiating types.ModuleType is that it additionally sets some import-controlled attributes on the new module object based on the ModuleSpec that you pass in.
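As an example of redirect_stderr() from the list above, this sketch captures the output of a hypothetical noisy_function() into a string buffer.

```python
import contextlib
import io
import sys


def noisy_function():
    # Stands in for third-party code that writes straight to stderr.
    print("warning: something odd happened", file=sys.stderr)
    return 42


buffer = io.StringIO()
with contextlib.redirect_stderr(buffer):
    result = noisy_function()

captured = buffer.getvalue()
```

Bear in mind this works by replacing sys.stderr itself, so it won't catch output from subprocesses or C code that writes to the underlying file descriptor.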

Operating System

A few assorted tidbits for operating system features, as well as some slightly more substantial updates to pathlib and subprocess.

Firstly, the glob module now supports ** in the glob() and iglob() functions when recursive=True is passed. This acts rather like * except that it also matches directory separators; in other words, it recurses into subdirectories to find matches.
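Here is a small sketch of the recursive pattern, using a throwaway directory tree whose file names are made up for the purpose.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for relative in ("top.txt",
                     os.path.join("a", "mid.txt"),
                     os.path.join("a", "b", "deep.txt")):
        open(os.path.join(root, relative), "w").close()

    # A plain * only matches within a single directory.
    flat = glob.glob(os.path.join(root, "*.txt"))

    # ** matches any number of directory levels, including none.
    deep = glob.glob(os.path.join(root, "**", "*.txt"), recursive=True)

    flat_names = sorted(os.path.basename(p) for p in flat)
    deep_names = sorted(os.path.basename(p) for p in deep)
```

Without recursive=True the ** is treated just like a single *, so it only matches one directory level.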

In the os module, urandom() now uses getrandom() on Linux and getentropy() on OpenBSD, to avoid the need to open /dev/urandom, which can fail if your process is at its filehandle limit.

There are handy new os.get_blocking() and os.set_blocking() functions which operate on file descriptors, to avoid having to fiddle around with O_NONBLOCK directly.
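A minimal sketch of these two functions on a pipe follows. I'm assuming a Unix-like platform here, where a fresh pipe starts out in blocking mode.

```python
import os

# Assumption: Unix-like platform, where new pipes are blocking by default.
read_fd, write_fd = os.pipe()
try:
    before = os.get_blocking(read_fd)
    # Switch the read end to non-blocking without touching O_NONBLOCK by hand.
    os.set_blocking(read_fd, False)
    after = os.get_blocking(read_fd)
finally:
    os.close(read_fd)
    os.close(write_fd)

print(before, after)
```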

And for anyone, like me, who’s always been slightly annoyed at how useful os.path.commonprefix() isn’t, there’s a saviour in the form of os.path.commonpath(). The issue is that commonprefix() always returns the longest common prefix string, even if that breaks the name in the middle of a directory or filename. The new commonpath() does the same thing, but always breaks the path on a directory separator, so the result will be a valid path.

>>> os.path.commonprefix(("/home/andy/myfirstfile", "/home/andy/mysecondfile"))
'/home/andy/my'
>>> os.path.commonpath(("/home/andy/myfirstfile", "/home/andy/mysecondfile"))
'/home/andy'

The shutil.move() function now takes a new parameter copy_function to specify the function to use for copying if moving items between filesystems (within the same filesystem, os.rename() is still used instead). It defaults to shutil.copy2(), which copies all file content and metadata, but something like shutil.copy() might be more appropriate if you just want the content itself copied.
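The call itself looks something like the sketch below. Bear in mind that both paths here live in the same temporary directory, so os.rename() will almost certainly be used and copy_function won't actually be exercised; it only illustrates how the parameter is passed. The filenames are made up.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.txt")
    dst = os.path.join(tmp, "moved.txt")
    with open(src, "w") as fd:
        fd.write("payload")

    # If src and dst were on different filesystems, shutil.copy() would be
    # used here (content and permissions, but not the other metadata).
    final_path = shutil.move(src, dst, copy_function=shutil.copy)

    src_gone = not os.path.exists(src)
    with open(final_path) as fd:
        moved_content = fd.read()
```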

In the signal module, the various SIG* constants (e.g. signal.SIGTERM) have been replaced by values in the signal.Signals enumeration. This allows for more convenience when logging values, etc.
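As a brief sketch of what that buys you, the constants are still usable as integers but now carry their own names. I've used SIGINT since it should exist on every platform.

```python
import signal

sig = signal.SIGINT

# The constant is now a member of the signal.Signals enumeration...
is_enum_member = isinstance(sig, signal.Signals)
# ...so it knows its own name, which is handy for logging.
sig_name = sig.name
# It still compares equal to the underlying integer value.
round_trip = signal.Signals(int(sig))

print(sig_name, int(sig))
```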

pathlib

The pathlib module, which you may remember was added in Python 3.4, has some handy changes. Firstly, Path instances now have a samefile() method, which indicates whether this path refers to the same physical file as another path, which can be another Path object or just a string. This is particularly useful as it actually goes to the concrete filesystem, so it supports things like symbolic links to the same file.

>>> import os
>>> import pathlib
>>> import shutil
>>> os.makedirs("/tmp/test/one")
>>> os.makedirs("/tmp/test/two")
>>> os.makedirs("/tmp/test/three")
>>> with open("/tmp/test/one/testfile_one", "w") as fd:
...     fd.write("hello, world\n")
...
13
>>> shutil.copy("/tmp/test/one/testfile_one", "/tmp/test/two/testfile_two")
'/tmp/test/two/testfile_two'
>>> os.symlink("/tmp/test/one/testfile_one", "/tmp/test/three/testfile_three")
>>>
>>> mypath = pathlib.Path("/tmp/test/one/testfile_one")
>>> # True, because it's just an obfuscated version of the same path.
>>> mypath.samefile("/tmp/test/one/../../test/one/testfile_one")
True
>>> # False, because this was a copy of the file, not the same file.
>>> mypath.samefile("/tmp/test/two/testfile_two")
False
>>> # True, because this is a symlink to the same file.
>>> mypath.samefile("/tmp/test/three/testfile_three")
True

There are a few other improvements to Path as well:

Path.mkdir() Now Takes exist_ok
Passing exist_ok=True will suppress the FileExistsError if the target already exists, similar to mkdir -p on the command-line.
Path.expanduser() Added
Returns a new Path instance with ~ and ~username tokens expanded.
Path.home() Added
Constructs a Path of the user’s home directory.
New Read/Write Methods
There are four new methods on Path, which are read_text(), write_text(), read_bytes() and write_bytes(). These simplify read/write operations on files represented by Path objects.
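These additions can be sketched together fairly compactly. The directory and file names below are arbitrary, and I've used a temporary directory to avoid leaving anything behind.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp) / "demo"
    base.mkdir()
    # A second call would raise FileExistsError without exist_ok=True.
    base.mkdir(exist_ok=True)

    note = base / "note.txt"
    # write_text() opens, writes and closes, returning the character count.
    written = note.write_text("hello, world")
    text = note.read_text()
    raw = note.read_bytes()

# Path.home() and expanduser() should agree on where ~ points.
home = pathlib.Path.home()
expanded = pathlib.Path("~/projects").expanduser()

print(written, text, raw)
```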

subprocess

Previously, the typical approach to executing a child process was to construct a Popen instance, set attributes on it as required and then either call the communicate() method or manually deal with IO until you eventually detect the child has terminated with wait() or poll() and reap the return code.

The Popen class is extremely flexible, and enables many use-cases for subprocesses, both synchronous and asynchronous. However, there are often times where you don’t need this flexibility — you just want to execute something, wait for it to finish and then check its exit status. For these cases there’s now a convenience function subprocess.run().

This accepts most of the arguments that Popen.__init__() takes, so you still have a good deal of flexibility, but it does impose a synchronous model on calling code — it waits for the subprocess to terminate, and returns a CompletedProcess instance directly from the function to save you a separate call to recover the exit status. If you need more flexibility, there’s still the option of using the underlying Popen object directly instead, but run() is now the recommended approach if suitable.

Aside from the myriad parameters passed directly into the Popen constructor, it’s got some handy convenience parameters of its own. For the common case of capturing all output, pass stdout=subprocess.PIPE and stderr=subprocess.PIPE and recover the output from the stdout and stderr attributes of the CompletedProcess instance you get back (from Python 3.7 onwards, capture_output=True is a shorthand for this). Another common case is to want to only deal with output if the command failed, and for this you can pass check=True which will convert a non-zero exit status into a CalledProcessError exception, attributes of which conveniently contain the output, if it was captured. You can specify timeout which is passed to Popen.communicate(), but if the process does time out then run() arranges for it to be killed and reaped before re-raising the TimeoutExpired exception, which is handy.
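Here's a minimal sketch of run() in action, using the current Python interpreter as the child process so it doesn't depend on any particular external command. It sticks to stdout=PIPE and stderr=PIPE, which work on 3.5, and the 60 second timeout is an arbitrary choice.

```python
import subprocess
import sys

# Run a child, wait for it to finish, and capture its output as text.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
    timeout=60,
)
print(result.returncode, repr(result.stdout))

# With check=True a non-zero exit status becomes an exception instead.
try:
    subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit(3)"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
        timeout=60,
    )
    failed_code = None
except subprocess.CalledProcessError as exc:
    # The exception carries the exit status and any captured output.
    failed_code = exc.returncode
```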

Overall this promises to streamline a lot of common cases for spawning external commands.

Other Changes

Whatever didn’t fit elsewhere, but I thought was also worth a brief mention.

heapq.merge() Improvements
It’s now possible to pass a key parameter to heapq.merge() to customise the values used to compare elements. There’s also an optional reverse parameter to invert the result of the comparison. These mirror the same options to the sorted() builtin function.
locale.delocalize() Added
This function converts numbers to canonical form (no thousands separators, dot decimal point) from whatever format is expected in the current locale.
New math Constants
There are two new constants, math.inf and math.nan, to save you making typos in float("inf") and float("nan").
math.gcd Added
The fractions.gcd() function has been deprecated and now there’s a new math.gcd() function instead, to return the greatest common divisor of two numbers.
sqlite3.Row Class A Full Sequence
Now reversed() iteration and slice indexing work on these objects.
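A few of these are easy to try out together, so here's a quick sketch. The sample data and the in-memory SQLite query are invented purely for illustration, and I've left out locale.delocalize() since its results depend on the current locale.

```python
import heapq
import math
import sqlite3

# Inputs to merge() must already be sorted according to the key.
by_length = list(heapq.merge(["a", "ccc"], ["bb", "dddd"], key=len))
# With reverse=True the inputs are expected in descending order.
descending = list(heapq.merge([9, 5, 1], [8, 4], reverse=True))

# The new math additions.
divisor = math.gcd(12, 18)
bigger_than_anything = math.inf > 10 ** 308
not_a_number = math.isnan(math.nan)

# sqlite3.Row now supports slicing and reversed().
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
row = conn.execute("SELECT 1 AS a, 2 AS b, 3 AS c").fetchone()
first_two = row[0:2]
backwards = list(reversed(row))
conn.close()

print(by_length, descending, divisor, first_two, backwards)
```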

Conclusions

So that’s Python 3.5, such a big release it’s taken four articles just to cover it all. The coroutine changes and type hinting are both big features, and among all the rest of the changes are some real gems such as the addition of os.scandir(), automatic handling of EINTR and the addition of subprocess.run(). Frankly it’s been quite a mission going through it all, and I’m still continually amazed how many changes have been squeezed into every release of this already mature language.

Anyway, I’m already looking forward to seeing what’s in 3.6, but also hopeful it perhaps won’t be quite such a big release as I’m still holding out hope of catching all the way up to 3.9 before 3.10 turns up in October. Thanks for reading!


  1. For context, a conditional reference like (?(1)ABC|DEF) checks if match group 1 (in this example) matched successfully. If so, the conditional block matches ABC, and if not the conditional block matches DEF.

  2. Namely CRAM-MD5, PLAIN and LOGIN

  3. My assumption is that they’re saying you can tell whether a resource exists by making a request and seeing whether you get a 401 instead of a 404, but I’ve yet to see any explanation of why they can’t simply respond with a 401 to all requests that lack an Authorization header, and only return a 404 in cases where the user is correctly authenticated for a resource in that part of the URL space. This would have the distinction of not breaking the RFC. In my opinion, you should structure a REST API so that authentication is something that can happen based on a prefix of the URL, and you complete that process before you even check if that URL maps to a real resource or not. But it’s just my view.

Photo by David Clode on Unsplash