This is part 7 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.4 - Part 1.
In this series looking at features introduced by every version of Python 3, this one is the second of two covering release 3.4. We look at improvements to the way multiprocessing spawns child processes, various powerful new facilities for code introspection, improvements to garbage collection, and a lot more besides.
In this article we conclude our look at Python 3.4 which started with the previous one in this series. Last time we took a look at the ensurepip
module, file descriptor inheritance changes, the codecs
module, and a series of other new modules which were added to the library. In this article we’ll be looking at a host of changes that have been made to existing modules, some long-awaited improvements to garbage collection and a few other small details.
The bulk of this article is going to be looking at changes to modules in the standard library. As usual, I’ve tried to group them by category to make things somewhat more approachable, and we’re kicking off with a category that I never even really knew existed in the standard library.
This release contained some improvements in a handful of modules for dealing with audio formats, and it wasn’t until I looked into the changes in these modules that I even knew they were there. This is one of the reasons I like to write these articles, so I’m including the changes here at least partly just to mention them in case anyone else was similarly unaware of their existence.
First up, the aifc module allows read/write access to AIFF and AIFF-C format files. This module has had some small tweaks:

- getparams() now returns a namedtuple instead of a plain tuple.
- aifc.open() can now be used as a context manager.
- writeframesraw() and writeframes() now accept any bytes-like object.

Next we have the audioop module, which provides useful operations on raw audio fragments, such as converting between mono and stereo, converting between different raw audio formats, and searching for a snippet of audio within a larger fragment. As of Python 3.4, this module now offers a byteswap() function for endian conversion of all samples in a fragment, and also all functions now accept any bytes-like object.
The sunau module allows read/write access to Sun AU format audio files. The first three tweaks are essentially the same as for aifc I mentioned above, so I won't repeat them. The final change is that AU_write.setsamplewidth() now supports 24-bit samples8.
Likewise the wave module has those same three changes as well. Additionally it is now able to write output to file descriptors which don't support seeking, although in these cases the number of frames written in the header must be correct when it's first written, since it can't be corrected afterwards by seeking back.
The multiprocessing
module has had a few changes. First up is the concept of start methods, which gives the programmer control of how subprocesses are created. It’s especially useful to exercise this control when mixing threads and processes. There are three methods now supported on Unix, although spawn
is the only option on Windows:
- spawn: Uses a fork() and exec() pair to start a fresh instance of the interpreter. This is the default (and only) option on Windows.
- fork: Uses fork() to create a child process, but doesn't exec() into a new instance of the interpreter. As mentioned at the start of the article, by default file handles will still not be inherited unless the programmer has explicitly set them to be inheritable. In multithreaded code, however, there can still be problems using a bare fork() like this. The call replicates the entire address space of the process as-is, but only the currently executing thread of execution. If another thread happens to have a mutex held when the current thread calls fork(), for example, that mutex will still be held in the child process but with the thread holding it no longer extant, so this mutex will never be released6.
- forkserver: The standard advice when mixing fork() and multithreaded code is to make sure you call fork() before any other threads are spawned. Since the current thread is the only one that's ever existed up to that point, and it survives into the child process, there's no chance for the process global state to be in an indeterminate state. This solution is the purpose of the forkserver method. In this case, a separate process is created at startup, and this is used to fork all the new child processes. A Unix domain socket is created to communicate between the main process and the fork server. When a new child is created, two pipes are created to send work to the child process and receive the exit status back, respectively. In the forkserver model, the client end file descriptors for these pipes are sent over the UDS to the fork server process. As a result, this method is only available on OSs that support sending FDs over UDSs (e.g. Linux). Note that the child process that the fork server process creates does not require a UDS, it inherits what it needs using standard fork() semantics.

This last model is a bit of a delicate dance, so out of interest I sniffed around the code and drew up this sequence diagram to illustrate how it happens.
To set and query which of these methods is in use globally, the multiprocessing
module provides get_start_method()
and set_start_method()
, and you can choose from any of the methods returned by get_all_start_methods()
.
As well as this you can now create a context with get_context()
. This allows the start method to be set for a specific context, and the context object shares the same API as the multiprocessing
module so you can just use methods on the object instead of the module functions to utilise the settings of that particular context. Any worker pools you create are specific to that context. This allows different libraries interoperating in the same application to avoid interfering with each other by each creating their own context instead of having to mess with global state.
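To make this concrete, here's a minimal sketch of how the new functions fit together; the cube() worker and the pool size are just illustrative choices:

import multiprocessing

def cube(value):
    return value ** 3

if __name__ == "__main__":
    # Each library can pick its own start method without touching global state.
    ctx = multiprocessing.get_context("spawn")
    print(multiprocessing.get_all_start_methods())   # e.g. ['fork', 'spawn', 'forkserver']
    with ctx.Pool(processes=2) as pool:
        print(pool.map(cube, range(8)))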
The threading
module also has a minor improvement in the form of the main_thread()
function, which returns a Thread
object representing the main thread of execution.
hashlib
now provides a pbkdf2_hmac()
function implementing the commonly used PKCS#5 key derivation function2. This is based on an existing hash digest algorithm (e.g. SHA-256) which is combined with a salt and repeated a specified number of times. As usual, the salt must be preserved so that the process can be repeated again to generate the same secret key from the same credential consistently in the future.
>>> import hashlib
>>> import os
>>>
>>> salt = os.urandom(16)
>>> hashlib.pbkdf2_hmac("sha256", b"password", salt, 100000)
b'Vwq\xfe\x87\x10.\x1c\xd8S\x17N\x04\xda\xb8\xc3\x8a\x14C\xf1\x10F\x9eaQ\x1f\xe4\xd04%L\xc9'
The hmac.new()
function now accepts bytearray
as well as bytes
for the key, and the type of the data fed in may be any of the types accepted by hashlib
. Also, the digest algorithm passed to new()
may be any of the names recognised by hashlib
, and the choice of MD5 as a default is deprecated — in future there will be no default.
The dis
module for disassembling bytecode has had some facilities added to allow user code better programmatic access. There’s a new Instruction
class representing a bytecode instruction, with appropriate parameters for inspecting it, and a get_instructions()
method which takes a callable and yields the bytecode instructions that comprise it as Instruction
instances. For those who prefer a more object-oriented interface, the new Bytecode
class offers similar facilities.
>>> import dis
>>>
>>> def func(arg):
... print("Arg value: " + str(arg))
... return arg * 2
>>>
>>> for instr in dis.get_instructions(func):
... print(instr.offset, instr.opname, instr.argrepr)
...
0 LOAD_GLOBAL print
3 LOAD_CONST 'Arg value: '
6 LOAD_GLOBAL str
9 LOAD_FAST arg
12 CALL_FUNCTION 1 positional, 0 keyword pair
15 BINARY_ADD
16 CALL_FUNCTION 1 positional, 0 keyword pair
19 POP_TOP
20 LOAD_FAST arg
23 LOAD_CONST 2
26 BINARY_MULTIPLY
27 RETURN_VALUE
inspect
, which provides functions for introspecting runtime objects, has also had some features added in 3.4. First up is a command-line interface, so by executing the module and passing a module name, or a specific function or class within that module, the source code will be displayed. Or if --details
is passed then information about the specified object will be displayed instead.
$ python -m inspect shutil:copy
def copy(src, dst, *, follow_symlinks=True):
"""Copy data and mode bits ("cp src dst"). Return the file's destination.
The destination may be a directory.
If follow_symlinks is false, symlinks won't be followed. This
resembles GNU's "cp -P src dst".
If source and destination are the same file, a SameFileError will be
raised.
"""
if os.path.isdir(dst):
dst = os.path.join(dst, os.path.basename(src))
copyfile(src, dst, follow_symlinks=follow_symlinks)
copymode(src, dst, follow_symlinks=follow_symlinks)
return dst
$ python -m inspect --details shutil:copy
Target: shutil:copy
Origin: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/shutil.py
Cached: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/__pycache__/shutil.cpython-34.pyc
Line: 214
$ python -m inspect --details shutil
Target: shutil
Origin: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/shutil.py
Cached: /home/andy/.pyenv/versions/3.4.10/lib/python3.4/__pycache__/shutil.cpython-34.pyc
Loader: <_frozen_importlib.SourceFileLoader object at 0x1051fa518>
Next there’s a new unwrap()
method which is used to introspect on the original function that’s been wrapped by decorators. It works by following the chain of __wrapped__
attributes, which are set by the functools.wraps()
decorator, or anything else that calls functools.update_wrapper()
.
>>> import functools
>>> import inspect
>>>
>>> def some_decorator(func):
... @functools.wraps(func)
... def wrapper_func(*args, **kwargs):
... print("Calling " + func.__name__)
... print(" - Args: " + repr(args))
... print(" - KW args: " + repr(kwargs))
... ret_val = func(*args, **kwargs)
... print("Return from " + func.__name__ + ": " + repr(ret_val))
... return ret_val
... return wrapper_func
...
>>> @some_decorator
... def some_other_func(arg):
... """Just doubles something."""
... print("Prepare to be amazed as I double " + repr(arg))
... return arg * 2
...
>>> some_other_func(123)
Calling some_other_func
- Args: (123,)
- KW args: {}
Prepare to be amazed as I double 123
Return from some_other_func: 246
246
>>> some_other_func("hello")
Calling some_other_func
- Args: ('hello',)
- KW args: {}
Prepare to be amazed as I double 'hello'
Return from some_other_func: 'hellohello'
'hellohello'
>>>
>>> some_other_func.__name__
'some_other_func'
>>> some_other_func.__doc__
'Just doubles something.'
>>>
>>> inspect.unwrap(some_other_func)(123)
Prepare to be amazed as I double 123
246
In an earlier article on Python 3.3, I spoke about the introduction of the inspect.signature()
function. In Python 3.4 the existing inspect.getfullargspec()
function, which returns information about a specified function’s parameters, is now based on signature()
which means it supports a broader set of callables. One difference is that getfullargspec()
still ignores __wrapped__
attributes, unlike signature()
, so if you’re querying decorated functions then you may still need the latter.
On the subject of signature()
, that has also changed in this release so that it no longer checks the type of the object passed in, but instead will work with anything that quacks like a function5. This now allows it to work with Cython functions, for example.
The logging
module has a few tweaks. TimedRotatingFileHandler
can now specify the time of day at which file rotation should happen, and SocketHandler
and DatagramHandler
now support Unix domain sockets by setting port=None
. The configuration interface is also a little more flexible, as a configparser.RawConfigParser
instance (or a subclass of it) can now be passed to fileConfig()
, which allows an application to embed logging configuration in part of a larger file. On the same topic of configuration, the logging.config.listen()
function, which spawns a thread listening on a socket for updated logging configurations for live modification of logging in a running process, can now be passed a validation function which is used to sanity check updated configurations before applying them.
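For example, the new atTime parameter lets you pin the daily rotation to a quiet time of day; the filename here is just an assumption for illustration:

import datetime
import logging.handlers

# Rotate the (hypothetical) app.log file each night at 03:30 rather than midnight.
handler = logging.handlers.TimedRotatingFileHandler(
    "app.log", when="midnight", atTime=datetime.time(hour=3, minute=30))
logging.getLogger().addHandler(handler)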
The pprint
module has had a couple of updates to deal more gracefully with long output. Firstly, there’s a new compact
parameter which defaults to False
. If you pass True
then sequences are printed with as many items per line as will fit within the specified width
, which defaults to 80
. Secondly, long strings are now split over multiple lines using Python’s standard line continuation syntax.
>>> pprint.pprint(x)
{'key1': {'key1a': ['tis', 'but', 'a', 'scratch'],
'key1b': ["it's", 'just', 'a', 'flesh', 'wound']},
'key2': {'key2a': ["he's", 'not', 'the', 'messiah'],
'key2b': ["he's", 'a', 'very', 'naughty', 'boy'],
'key2c': ['alright',
'but',
'apart',
'from',
# ... items elided from output for brevity ...
'ever',
'done',
'for',
'us']}}
>>> pprint.pprint(x, compact=True, width=75)
{'key1': {'key1a': ['tis', 'but', 'a', 'scratch'],
'key1b': ["it's", 'just', 'a', 'flesh', 'wound']},
'key2': {'key2a': ["he's", 'not', 'the', 'messiah'],
'key2b': ["he's", 'a', 'very', 'naughty', 'boy'],
'key2c': ['alright', 'but', 'apart', 'from', 'the',
'sanitation', 'the', 'medicine', 'education', 'wine',
'public', 'order', 'irrigation', 'roads', 'the',
'fresh-water', 'system', 'and', 'public', 'health',
'what', 'have', 'the', 'romans', 'ever', 'done',
'for', 'us']}}
>>> pprint.pprint(" ".join(x["key2"]["key2c"]), width=50)
('alright but apart from the sanitation the '
'medicine education wine public order '
'irrigation roads the fresh-water system and '
'public health what have the romans ever done '
'for us')
In the sys
module there’s also a new function getallocatedblocks()
which is a lighter-weight alternative to the new tracemalloc
module described in the previous article. This function simply returns the number of blocks currently allocated by the interpreter, which is useful for tracing memory leaks. Since it’s so lightweight, you could easily have all your Python applications publish or log this metric at intervals to check for concerning behaviour like monotonically increasing usage.
One quirk I found is that the first time you call it, it seems to perform some allocations, so you want to call it at least twice before doing any comparisons to make sure it’s in a steady state. This behaviour may change on different platforms and Python releases, so just something to keep an eye on.
>>> import sys
>>> sys.getallocatedblocks()
17553
>>> sys.getallocatedblocks()
17558
>>> sys.getallocatedblocks()
17558
>>> x = "hello, world"
>>> sys.getallocatedblocks()
17559
>>> del x
>>> sys.getallocatedblocks()
17558
Yet more good news for debugging and testing are some changes to the unittest
module. First up is subTest()
which can be used as a context manager to allow one test method to generate multiple test cases dynamically. See the simple code below for an example.
>>> import unittest
>>>
>>> class SampleTest(unittest.TestCase):
def runTest(self):
for word in ("one", "two", "three", "four"):
with self.subTest(testword=word):
self.assertEqual(len(word), 3)
...
>>> unittest.TextTestRunner(verbosity=2).run(SampleTest())
runTest (__main__.SampleTest) ...
======================================================================
FAIL: runTest (__main__.SampleTest) (testword='three')
----------------------------------------------------------------------
Traceback (most recent call last):
File "<stdin>", line 1, in runTest
AssertionError: 5 != 3
======================================================================
FAIL: runTest (__main__.SampleTest) (testword='four')
----------------------------------------------------------------------
Traceback (most recent call last):
File "<stdin>", line 1, in runTest
AssertionError: 4 != 3
----------------------------------------------------------------------
Ran 1 test in 0.000s
FAILED (failures=2)
In addition to this, test discovery via TestLoader.discover()
or python -m unittest discover
, now sorts test cases consistently between runs which makes it much easier to compare them.
There’s also a new assertLogs()
context manager, which can be used to ensure that code under test emits a log entry. By default this checks for any message of at least INFO
level being emitted by any logger, but these parameters can be overridden. In general I don’t think it’s a good idea to tightly couple unit test cases with logging, since it can make things brittle — but there are cases where it’s important to log, such as specific text in a log file triggering an alert somewhere else. In these cases it’s important to catch cases where someone might change or remove the log entry without realising its importance, and being able to do so without explicitly mocking the logging
library yourself will prove quite handy.
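Here's a brief sketch of how it reads in practice; the logger name and message are of course just placeholders:

import logging
import unittest

class StorageTest(unittest.TestCase):
    def test_warns_when_disk_low(self):
        # The test fails if no matching log record is emitted inside the block.
        with self.assertLogs("myapp.storage", level="WARNING") as captured:
            logging.getLogger("myapp.storage").warning("disk space low: 5% remaining")
        self.assertIn("disk space low", captured.output[0])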
Following on from the policy framework added to the email
package in Python 3.3, this release adds support for passing a policy
argument to the as_string()
method when generating string representations of messages. There is also a new as_bytes()
method which is equivalent but returns bytes
instead of str
.
Another change in email
is the addition of two subclasses for Message
, which are EmailMessage
and MIMEPart
. The former should be used to represent email messages going forward and has a new default policy, with the base class Message
being reserved for backwards compatibility using the compat32
policy. The latter represents a subpart of a MIME message and is identical to EmailMessage
except for the omission of some headers which aren't required for subparts.
Finally in email
there’s a new module contentmanager
which offers better facilities for managing message payloads. Currently this offers a ContentManager
base class and a single concrete derivation, raw_data_manager
, which is the one used by the default EmailPolicy
. This offers some basic facilities for doing encoding/decoding to/from bytes and handling of headers for each message part. The contentmanager
module also offers facilities for you to register your own managers if you would like to do so.
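As a rough sketch of the provisional API, the snippet below builds a trivial message and renders it both ways; the addresses are placeholders, and the email.policy.SMTP policy is chosen purely for illustration:

from email.message import EmailMessage
from email.policy import SMTP

msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Trying the provisional API"
msg.set_payload("Nothing to see here.\n")

print(msg.as_string(policy=SMTP))   # str output, with \r\n line endings per the SMTP policy
print(msg.as_bytes()[:30])          # bytes serialisation, new in 3.4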
Looking at the http
module briefly, the BaseHTTPRequestHandler.send_error()
method, which is used to send an error response to the client, now offers an explain
parameter. Along with the existing optional message
parameter, these can be set to override the default text for each HTTP error code that’s normally sent.
The response is formatted using the contents of the error_message_format
attribute, which you can override, but the default is as shown below. You can see how the new %(explain)s
expansion will be presented in the error.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Error response</title>
</head>
<body>
<h1>Error response</h1>
<p>Error code: %(code)d</p>
<p>Message: %(message)s.</p>
<p>Error code explanation: %(code)s - %(explain)s.</p>
</body>
</html>
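As a sketch, a handler might override both pieces of text like this (the handler and its behaviour are hypothetical):

from http.server import BaseHTTPRequestHandler, HTTPServer

class GoneHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The new explain parameter fills the %(explain)s slot in the template above.
        self.send_error(410, message="Gone away",
                        explain="This resource was removed and will not be coming back.")

# HTTPServer(("", 8000), GoneHandler).serve_forever()   # uncomment to try it out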
The ipaddress
module was provisional in Python 3.3, but is now considered a stable interface as of 3.4. In addition the IPV4Address
and IPV6Address
classes now offer an is_global
attribute, which is True
if the address is intended to be globally routable (i.e. is not reserved as a private address). At least this is what the documentation indicates — in practice, I found that only IPV6Address
offers this feature, it’s missing from IPV4Address
. Looks like this was noted and fixed in issue 21386 on the Python tracker, but that fix didn’t make it out until Python 3.5.
In any case, here’s an example of it being used for some IPv6 addresses:
>>> import ipaddress
>>> # Localhost address
... ipaddress.ip_address("::1").is_global
False
>>> # Private address range
... ipaddress.ip_address("fd12:3456:789a:1::1").is_global
False
>>> # python.org
... ipaddress.ip_address("2a04:4e42:4::223").is_global
True
The poplib
module has a couple of extra functions. Firstly there’s capa()
which returns the list of capabilities advertised by the POP server. Secondly there’s stls()
, which issues the STLS
command to upgrade a clear-text connection to SSL as specified by RFC 2595. For those familiar with it, this is very similar in operation to IMAP’s STARTTLS
command.
In smtplib
the exception type SMTPException
is now a subclass of OSError
, which allows both socket and protocol errors to be caught together and handled in a consistent way, in case that simplifies application logic. This sort of change highlights how important it is to pick the right base class for your exceptions in a library, because you may be able to make life considerably simpler for some of your users if you get it right.
The socket
library has a few minor updates, the first being the new get_inheritable()
and set_inheritable()
methods on socket objects to change their inheritability as we discussed in the previous article. Also continuing from an earlier article on release 3.3, the new PF_CAN
socket family has a new member: CAN_BCM
is the broadcast manager protocol. But unless you’re writing Python code to run on a vehicle messaging bus then you can safely disregard this one.
One nice touch is that socket.AF_*
and socket.SOCK_*
constants are now defined in terms of the new enum
module which we covered in the previous article. This means we can get some useful values out in log trace instead of magic numbers that we need to look up. The other change in this release is for Windows users, who can now enjoy inet_pton()
and inet_ntop()
for added IPv6 goodness.
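The difference is easy to see in the interpreter, where the constants now display their symbolic names:

>>> import socket
>>> socket.AF_INET
<AddressFamily.AF_INET: 2>
>>> socket.SOCK_STREAM
<SocketKind.SOCK_STREAM: 1>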
There are some more extensive changes to the ssl
module. Firstly, TLS v1.1 and v1.2 support has been added, using PROTOCOL_TLSv1_1
and PROTOCOL_TLSv1_2
respectively. This is where I have to remind myself that Python 3.4 was released in March 2014, as by 2021’s standards these versions are looking long in the tooth, being defined in 2006 and 2008 respectively. Indeed, all the major browser vendors deprecated 1.0 and 1.1 in March 2020.
Secondly, there’s a handy convenience function create_default_context()
for creating an SSLContext
with some sensible settings to provide reasonable security. These are stricter than the defaults in the SSLContext
constructor, and are also subject to change if security best practices evolve. This gives code a better chance to stay up to date with security practices via a simple Python version upgrade, although I assume the downside is a slightly increased chance of introducing issues if (say) a client’s Python version is updated but the server is still using outdated settings so they fail to negotiate a mutually agreeable protocol version.
One detail about the create_default_context()
function that I like is it’s purpose
parameter, which selects different sets of parameter values for different purposes. This release includes two purposes, SERVER_AUTH
is the default which is for client-side connections to authenticate servers, and CLIENT_AUTH
is for server-side connections to authenticate clients.
The SSLContext
class method load_verify_locations()
has a new cadata
parameter, which allows certificates to be passed directly in PEM- or DER-encoded forms. This is in contrast to the existing cafile
and capath
parameters which both require certificates to be stored in files.
There’s a new function get_default_verify_paths()
which returns the current list of paths OpenSSL will check for a default certificate authority (CA). These values are the same ones that are set with the existing set_default_verify_paths()
. This will be useful for debugging, with encryption you want as much transparency as you can possibly get because it can be very challenging to figure out the source of issues when your only feedback is generally a “yes” or “no”.
On the theme of transparency, SSLContext now has a cert_store_stats() method which returns statistics on the number of certificates loaded, and also a get_ca_certs()
method to return a list of the currently loaded CA certificates.
A welcome addition is the ability to customise the certificate verification process by setting the verify_flags
attribute on an SSLContext
. This can be set by ORing together one or more flags. This release defines the following flags, which relate to checks against certificate revocation lists (CRLs):

- VERIFY_DEFAULT: the default mode, in which certificates are not checked against any CRL.
- VERIFY_CRL_CHECK_LEAF: the peer certificate alone is checked against a CRL, which must have been loaded with load_verify_locations(), or validation will fail.
- VERIFY_CRL_CHECK_CHAIN: all certificates in the chain are checked against CRLs, which again must have been loaded.
- VERIFY_X509_STRICT: disables workarounds for broken X.509 certificates.
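Setting them is just a bitwise OR on the context; the certificate bundle name below is hypothetical:

import ssl

context = ssl.create_default_context()
# Check the peer certificate against a CRL; the CRL data itself must also be loaded.
context.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF
# context.load_verify_locations("ca-bundle-with-crl.pem")   # hypothetical CA/CRL bundle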
Another useful addition for common users, the load_default_certs()
method on SSLContext
loads a set of standard CA certificates from locations which are platform-dependent. Note that if you use create_default_context()
and you don’t pass your own CA certificate store, this method will be called for you.
>>> import pprint
>>> import ssl
>>>
>>> context = ssl.SSLContext(protocol=ssl.PROTOCOL_TLSv1_2)
>>> len(context.get_ca_certs())
0
>>> context.load_default_certs()
>>> len(context.get_ca_certs())
163
>>> pprint.pprint(context.get_ca_certs()[151])
{'issuer': ((('countryName', 'US'),),
(('organizationName', 'VeriSign, Inc.'),),
(('organizationalUnitName', 'VeriSign Trust Network'),),
(('organizationalUnitName',
'(c) 2008 VeriSign, Inc. - For authorized use only'),),
(('commonName',
'VeriSign Universal Root Certification Authority'),)),
'notAfter': 'Dec 1 23:59:59 2037 GMT',
'notBefore': 'Apr 2 00:00:00 2008 GMT',
'serialNumber': '401AC46421B31321030EBBE4121AC51D',
'subject': ((('countryName', 'US'),),
(('organizationName', 'VeriSign, Inc.'),),
(('organizationalUnitName', 'VeriSign Trust Network'),),
(('organizationalUnitName',
'(c) 2008 VeriSign, Inc. - For authorized use only'),),
(('commonName',
'VeriSign Universal Root Certification Authority'),)),
'version': 3}
You may recall from the earlier article on Python 3.2 that client-side support for SNI (Server Name Indication) was added then. Well, Python 3.4 adds server-side support for SNI. This is achieved using the set_servername_callback()
method7 of SSLContext
, which registers a callback function which is invoked when the client uses SNI. The callback is invoked with three arguments: the SSLSocket
instance, a string indicating the name the client has requested, and the SSLContext
instance. A common role for this callback is to swap out the SSLContext
attached to the socket for one which matches the server name that’s being requested — otherwise the certificate will fail to validate.
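A server might use it along these lines; the hostnames and contexts here are purely illustrative (each would need its own certificate loaded in practice), and error handling is omitted:

import ssl

contexts = {
    "www.example.com": ssl.create_default_context(ssl.Purpose.CLIENT_AUTH),
    "api.example.com": ssl.create_default_context(ssl.Purpose.CLIENT_AUTH),
}

def sni_callback(ssl_socket, server_name, initial_context):
    # Swap in the context matching the name the client asked for, if we have one.
    replacement = contexts.get(server_name)
    if replacement is not None:
        ssl_socket.context = replacement
    # Returning None lets the handshake continue normally.

server_context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
server_context.set_servername_callback(sni_callback)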
Finally in ssl
, Windows users get two additional functions, enum_certificates()
and enum_crls()
which can retrieve certificates and CRLs from the Windows certificate store.
There a number of improvements in the urllib.request
module. It now supports URIs using the data:
scheme with the DataHandler
class. The HTTP method used by the Request
class can be specified by overriding the method
class attribute in a subclass. It’s also possible to now safely reuse Request
objects — updating full_url
or data
causes all relevant internal state to be updated as appropriate. This means you can set up a template Request
and then use that for multiple individual requests which differ only in the URL or the request body data.
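For instance, something along these lines should work; the URLs are placeholders and the actual network call is left commented out:

import urllib.request

req = urllib.request.Request("http://www.example.com/first",
                             headers={"User-Agent": "demo-client"})
# Re-point the same Request at another resource; the internal state is refreshed.
req.full_url = "http://www.example.com/second"
req.data = b"payload=1"        # adding body data also switches it to a POST
# response = urllib.request.urlopen(req)   # requires network access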
Also in urllib
, HTTPError
exceptions now have a headers
attribute which contains the HTTP headers from the response which triggered the error.
A few changes have been made to some of the modules that support core language features.
First up is the abc
module for defining abstract base classes. Previously, abstract base classes were defined using the metaclass
keyword parameter to the class definition, which could sometimes confuse people:
import abc
class MyClass(metaclass=abc.ABCMeta):
...
Now there’s an abc.ABC
base class so you can instead use this rather more readable version:
class MyClass(abc.ABC):
...
Next a useful change in contextlib
, which now offers a suppress
context manager to ignore exceptions in its block. If any of the listed exceptions occur, they are ignored and execution jumps to just outside the with
block. This is really just a more concise and/or better self-documenting way of catching then ignoring the exceptions yourself.
>>> import contextlib
>>>
>>> with contextlib.suppress(OSError, IOError):
... with open("/tmp/canopen", "w") as fd:
... print("Success in /tmp")
... with open("/cannotopen", "w") as fd:
... print("Success in /")
... print("Done both files")
...
Success in /tmp
>>>
There’s also a new redirect_stdout()
context manager which temporarily redirects sys.stdout
to any other stream, including io.StringIO
to capture the output in a string. This is useful for dealing with poorly-designed code which writes its errors directly to standard output instead of raising exceptions. Oddly there's no equivalent redirect_stderr() to match this, however1.
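It only takes a couple of lines to capture chatty output into a string:

>>> import contextlib
>>> import io
>>>
>>> buffer = io.StringIO()
>>> with contextlib.redirect_stdout(buffer):
...     print("this goes into the buffer, not the console")
...
>>> buffer.getvalue()
'this goes into the buffer, not the console\n'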
Moving on there are some improvements to the functools
module. First up is partialmethod()
which works like partial()
except that it’s used for defining partial specialisations of methods instead of direct callables. It supports descriptors like classmethod()
, staticmethod()
, and so on, and also any method that accepts self
as the first positional argument.
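Here's a sketch of how that looks; the Host class and its set_state() method are purely illustrative names, chosen to match the description that follows:

import functools

class Host:
    def __init__(self, name):
        self.name = name
        self.state = "unknown"

    def set_state(self, state):
        print("Host {} is now {}".format(self.name, state))
        self.state = state

    # Partial specialisations of set_state(), usable as ordinary methods.
    set_host_up = functools.partialmethod(set_state, "up")
    set_host_down = functools.partialmethod(set_state, "down")

host = Host("web01")
host.set_host_up()      # equivalent to host.set_state("up")
host.set_host_down()    # equivalent to host.set_state("down")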
In the code above, set_host_up()
and set_host_down()
can be called as normal methods with no parameters, and just indirect into set_state()
with the appropriate argument passed.
The other addition to functools
is the singledispatch
decorator. This allows the creation of a generic function which calls into one of several separate underlying implementation functions based on the type of the first parameter. The code below illustrates a generic function which calculates the square of an integer from several possible input types:
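A sketch of such a function might look like this; the particular types registered are just an illustration:

import functools

@functools.singledispatch
def square(value):
    # Fallback used when no more specific implementation is registered.
    raise TypeError("don't know how to square " + repr(value))

@square.register(int)
def _(value):
    return value * value

@square.register(str)
def _(value):
    return int(value) ** 2

@square.register(float)
def _(value):
    return int(round(value)) ** 2

print(square(4), square("5"), square(6.2))   # 16 25 36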
The importlib
module has also had some more attention in this release. First up is a change to InspectLoader
, the abstract base class for loaders3. This now has a method source_to_code()
which converts Python source code to executable byte code. The default implementation calls the builtin compile()
with appropriate arguments, but it would be possible to override this method to add other features — for example, to use ast.parse()
to obtain the AST4 of the code, then manipulate it somehow (e.g. to implement some optimisation), and then finally use compile()
to convert this to executable Python code.
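As a hedged sketch of that idea, the loader below (a made-up subclass of the standard SourceFileLoader) parses the source to an AST, could in principle rewrite it, and then compiles it:

import ast
import importlib.machinery

class InspectingLoader(importlib.machinery.SourceFileLoader):
    # Hypothetical loader: inspect (and potentially transform) the AST before compiling.
    def source_to_code(self, data, path="<string>"):
        tree = ast.parse(data)
        # ... any AST manipulation would go here ...
        return compile(tree, path, "exec", dont_inherit=True)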
Also in InspectLoader
the get_code()
, which used to be abstract, now has a concrete default implementation. This is responsible for returning the code object for a module. The documentation states that if possible it should be overridden for performance reasons, however, as the default one uses the get_source()
method which can be a somewhat expensive operation as it has to decode the source and do universal newline conversion.
Speaking of get_source()
, there’s a new importlib.util.decode_source()
function that decodes source from bytes with universal newline processing — this is quite useful for implementing get_source()
methods easily.
Potentially of interest to more people, imp.reload()
is now importlib.reload()
, as part of the ongoing deprecation of the imp
module. In a similar vein, imp.get_magic()
is replaced by importlib.util.MAGIC_NUMBER
, and both imp.cache_from_source()
and imp.source_from_cache()
have moved to importlib.util
as well.
Following on from the discussion of namespace packages in the last article, the NamespaceLoader
used now conforms to the InspectLoader
interface, which has the concrete benefit that the runpy
module, and hence the python -m <module>
command-line option, now work with namespace packages too.
Finally in importlib
, the ExtensionFileLoader
in importlib.machinery
has now received a get_filename()
method, whose omission was simply an oversight in the original implementation.
The new descriptor DynamicClassAttribute
has been added to the types
module. You use this in cases where you want an attribute that acts differently based on whether it's been accessed through an instance or directly through the class. It seems that the main use-case for this is when you want to define class attributes on a base class, but still allow subclasses to reuse the same names for their properties without conflicting. For this to work you need to define a __getattr__()
method in your base class, but since this is quite an obscure little corner then I’ll leave the official types
documentation to go into more detail. I’ll just leave you with a code sample that illustrates its use:
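Here's a small sketch of its behaviour; the class names are invented, and note that the __getattr__() which catches the class-level access lives on a metaclass here, mirroring the way the enum module uses this descriptor:

from types import DynamicClassAttribute

class WidgetMeta(type):
    def __getattr__(cls, name):
        # Class-level access to "label" ends up here, because the descriptor
        # below raises AttributeError when accessed on the class itself.
        if name == "label":
            return "the class-level label"
        raise AttributeError(name)

class Widget(metaclass=WidgetMeta):
    @DynamicClassAttribute
    def label(self):
        # Instance-level access behaves like a normal property.
        return "label for instance " + repr(self)

widget = Widget()
print(widget.label)    # goes through the property
print(Widget.label)    # routed to WidgetMeta.__getattr__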
And to close off our language support features there are a handful of changes to the weakref
module. First of all, the WeakMethod
class has been added for taking a weak reference to a bound method. You can’t use a standard weak reference because bound methods are ephemeral, they only exist while they’re being called unless there’s another variable keeping a reference to them. Therefore, if the weak reference was the only reference then it wouldn’t be enough to keep them alive. Thus the WeakMethod
class was added to simulate a weak reference to a bound method by re-creating the bound method as required until either the instance or the method no longer exist.
This class follows standard weakref.ref
semantics where calling the weak reference returns either None
or the object itself. Since the object in this example is a callable, then we need another pair of brackets to call that. This explains the m2()()
you’ll see in the snippet below.
>>> import weakref
>>>
>>> class MyClass:
... def __init__(self, value):
... self._value = value
... def my_method(self):
... print("Method called")
... return self._value
...
>>> instance = MyClass(123)
>>> m1 = weakref.ref(instance.my_method)
>>> # Standard weakrefs don't work here.
... repr(m1())
'None'
>>> m2 = weakref.WeakMethod(instance.my_method)
>>> repr(m2())
'<bound method MyClass.my_method of <__main__.MyClass object at 0x10abc8f60>>'
>>> repr(m2()())
Method called
'123'
>>> # Here you can see the bound method is re-created each time.
... m2() is m2()
False
>>> del instance
>>> # Now we've deleted the object, the method is gone.
... repr(m2())
'None'
There’s also a new class weakref.finalize
which allows you to install a callback to be invoked when an object is garbage-collected. In this regard it works a bit like an externally installed __del__()
method. You pass in an object instance and a callback function as well as, optionally, parameters to be passed to the callback. The finalize
object is returned, but even if you delete this reference it remains installed and the callback will still be called when the object is destroyed. This includes when the interpreter exits, although you can set the atexit
attribute to False
to prevent this.
>>> import sys
>>> import weakref
>>>
>>> class MyClass:
... pass
...
>>> def callback(arg):
... print("Callback called with {}".format(arg))
...
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "one")
>>> # Deleting the finalize instance makes no difference
... del finalizer
>>> # The callback is still called when the instance is GC.
... del instance
Callback called with one
>>>
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "two")
>>> # You can trigger the callback earlier if you like.
... finalizer()
Callback called with two
>>> finalizer.alive
False
>>> # It's only called once, so it now won't fire on deletion.
... del instance
>>>
>>> instance = MyClass()
>>> finalizer = weakref.finalize(instance, callback, "three")
>>> finalizer.atexit
True
>>> # Callback is invoked at system exit, if atexit=True
... sys.exit(0)
Callback called with three
The html
module has sprouted a handy little unescape()
function which converts HTML character entities back to their unicode equivalents.
>>> import html
>>> html.unescape("I spent £50 on this & that")
'I spent £50 on this & that'
>>> html.unescape("π is the №1 mathematical constant")
'π is the №1 mathematical constant'
The HTMLParser
class has been updated to take advantage of this, so now there’s a convert_charrefs
parameter that, if True
performs this conversion. For backwards-compatibility it defaults to False
, but the documentation warns this will flip to True
in a future release.
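You can opt in to the new behaviour explicitly today, which also makes for a handy way to strip tags while decoding entities; the little collector class here is just an illustration:

from html.parser import HTMLParser

class TextCollector(HTMLParser):
    def __init__(self):
        # Opt in explicitly, since the default only flips in a later release.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextCollector()
parser.feed("<p>Fish &amp; chips for &pound;3</p>")
print("".join(parser.chunks))   # Fish & chips for £3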
The xml.etree module has also seen some changes, with a new XMLPullParser
parser being added. This is intended for applications which can’t perform blocking reads of the data for any reason. Data is fed into the parser incrementally with the feed()
method, and instead of the callback method approach used by XMLParser
the XMLPullParser
relies on the application to call a read_events()
method to collect any parsed items found so far. I’ve found this sort of incremental parsing model really useful in the past where you may be parsing particularly large documents, since often you can process the information incrementally into some other useful data structure and save a lot of memory, so it’s worthwhile getting familiar with this class.
Each call to the read_events()
method will yield a generator which allows you to iterate through the events. Once an item is read from the generator it’s removed from the list, but the call to read_events()
itself doesn’t clear anything, so you don’t need to worry about avoiding partial reads of the generator before dropping it — the remaining events will still be there on your next call to read_events()
. That said, creating multiple such generators and using them in parallel could have unpredictable results, and spanning them across threads is probably a particularly bad idea.
One important point to note is that if there is an error parsing the document, then this method is where the ParseError
exception will be raised. This implies that the feed()
method just adds text to an input buffer and all the actual parsing happens on-demand in read_events()
.
Each item yielded will be a 2-tuple of the event type and a payload which is event-type-specific. On the subject of event type, the constructor of XMLPullParser
takes a list of event types that you’re interested in, which defaults to use end
events. The event types you can specify in this release are:
Event | Meaning | Payload |
---|---|---|
start | Opening tag | Element object |
end | Closing tag | Element object |
start-ns | Start namespace | Tuple (prefix, uri) |
end-ns | End namespace | None |
It’s worth noting that the start
event is raised as soon as the end of the opening tag is seen, so the Element
object won’t have any text
or tail attributes. If you care about these, probably best to just filter on end
events, where the entire element is returned. The start
events are mostly useful so you can see the context in which intervening tags are used, including any attributes defined within the containing opening tag.
The start-ns
event is generated prior to the opening tag which specifies the namespace prefix, and the end-ns
event is generated just after its matching closing tag. In the tags that follow which use the namespace prefix the URI will be substituted in, since really the prefix is just an alias for the URI.
Here’s an example of its use showing that only events for completed items are returned, and showing what happens if the document is malformed:
>>> import xml.etree.ElementTree as ET
>>> import pprint
>>>
>>> parser = ET.XMLPullParser(("start", "end"))
>>> parser.feed("<document><one>Sometext</one><two><th")
>>> pprint.pprint(list(parser.read_events()))
[('start', <Element 'document' at 0x1057d2728>),
('start', <Element 'one' at 0x1057d2b38>),
('end', <Element 'one' at 0x1057d2b38>),
('start', <Element 'two' at 0x1057d2b88>)]
>>> parser.feed("ree>Moretext</three><four>Yet")
>>> pprint.pprint(list(parser.read_events()))
[('start', <Element 'three' at 0x1057d2c28>),
('end', <Element 'three' at 0x1057d2c28>),
('start', <Element 'four' at 0x1057d2c78>)]
>>> parser.feed("moretext</closewrongtag></two></document>")
>>> pprint.pprint(list(parser.read_events()))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/andy/.pyenv/versions/3.4.10/lib/python3.4/xml/etree/ElementTree.py", line 1281, in read_events
raise event
File "/Users/andy/.pyenv/versions/3.4.10/lib/python3.4/xml/etree/ElementTree.py", line 1239, in feed
self._parser.feed(data)
xml.etree.ElementTree.ParseError: mismatched tag: line 1, column 76
Another small enhancement is that the tostring()
and tostringlist()
functions, as well as the ElementTree.write()
method, now have a short_empty_elements
keyword parameter. If set to True
, which is the default, this causes empty tags to use the <tag />
shorthand. If set to False
the expanded <tag></tag>
form will be used instead.
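For example:

>>> import io
>>> import xml.etree.ElementTree as ET
>>>
>>> root = ET.Element("root")
>>> child = ET.SubElement(root, "empty")
>>> buffer = io.StringIO()
>>> ET.ElementTree(root).write(buffer, encoding="unicode", short_empty_elements=False)
>>> buffer.getvalue()
'<root><empty></empty></root>'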
As well as the file descriptor inheritance features mentioned above, the os
module also has a few more changes, listed below.
- os.cpu_count() Added: Returns the number of CPUs in the system, or None if it can't be determined. This is now used as the implementation for multiprocessing.cpu_count().
- os.path Improvements on Windows: os.path.samestat() is now available, to tell if two stat() results refer to the same file, and os.path.ismount()
now correctly recognises volumes which are mounted below the drive letter level.os.open()
New FlagsOn platforms where the underlying call supports them, os.open()
now supports two new flags.
O_PATH
is used for obtaining a file descriptor to a path without actually opening it — reading or writing it will yield EBADF
. This is useful for operations that don’t require us to access the file or directory such as fchdir()
.O_TMPFILE
creates an open file but never creates a directory entry for it, so it can be used as a temporary file. This is one step better than the usual approach of creating and then immediately deleting a temporary file, relying on the open filehandle to prevent the filesystem from reclaiming the blocks, because it doesn’t allow any window of opportunity to see the directory entry.MacOS users get to benefit from some improvements to the plistlib
module, which offers functions to read and write Apple .plist
(property list) files. This module now sports an API that’s more consistent with other similar ones, with functions load()
, loads()
, dump()
and dumps()
. The module also now supports the binary file format, as well as the existing support for the XML version.
On Linux, the resource
module has some additional features. The Linux-specific prlimit()
system call has been exposed, which allows you to both set and retrieve the current limit for any process based on its PID. You provide a resource (e.g. resource.RLIMIT_NOFILE
controls the number of open file descriptors permitted) and then you can either provide a new value for the resource to set it and return the prior value, or omit the limit argument to just query the current setting. Note that you may get PermissionError
raised if the current process doesn’t have the CAP_SYS_RESOURCE
capability.
On a related note, since some Unix variants have additional RLIMIT_*
constants available, these have also been exposed in the resource
module:
- RLIMIT_MSGQUEUE (on Linux)
- RLIMIT_NICE (on Linux)
- RLIMIT_RTPRIO (on Linux)
- RLIMIT_RTTIME (on Linux)
- RLIMIT_SIGPENDING (on Linux)
- RLIMIT_SBSIZE (on FreeBSD)
- RLIMIT_SWAP (on FreeBSD)
- RLIMIT_NPTS (on FreeBSD)

The stat
module is now backed by a C implementation _stat
, which makes it much easier to expose the myriad of platform-dependent values that exist. Three new ST_MODE
flags were also added:
- S_IFDOOR: a Solaris door, a mechanism used for interprocess communication.
- S_IFPORT: a Solaris event port, an event notification facility which serves a similar role to poll().
- S_IFWHT: a whiteout file, used by some union filesystems to mark a file as deleted.
Some other assorted updates that didn’t fit any of the themes above.
- argparse.FileType Improvements: FileType now accepts encoding and errors arguments that are passed straight on to the resultant open() call.
- base64 Improvements: The encoding and decoding functions now accept any bytes-like object. Also, there are now functions to encode/decode Ascii85, both the variant used by Adobe for the PostScript and PDF formats, and also the one used by Git to encode binary patches.
- dbm.open() Improvements: The dbm.open() call now supports use as a context manager.
- glob.escape() Added: Escapes the special characters in a path so that they are matched literally rather than treated as wildcards.
- importlib.machinery.ModuleSpec Added: This comes from PEP 451 and allows importlib to continue to address some of the outstanding quirks and inconsistencies in the import process. Primarily the change is to move some attributes from the module object itself to a new ModuleSpec object, which will be available via the __spec__ attribute. As far as I can tell this doesn't offer a great deal of concrete benefits initially, but I believe it's laying the foundations for further improvements to the import system in future releases. Check out the PEP for plenty of details.
- re.fullmatch() Added: Previously there was re.match(), which only checked for a match starting at the beginning of the search string, and re.search(), which would find a match starting anywhere. Now there's also re.fullmatch(), and a corresponding method on compiled patterns, which finds matches covering the entire string (i.e. anchored at both ends).
- selectors Added: The selectors module was added as a higher-level abstraction over the implementations provided by the select module. This will probably make it easier for programmers who are less experienced with select(), poll() and friends to implement reliable applications, as these calls definitely have a few tricky quirks. That said, I would have thought the intention would be for most people to shift to using asyncio for these purposes, if they're able.
- shutil.copyfile() Raises SameFileError: If the source and destination are the same file, the new, more specific, SameFileError exception allows applications to take special action in this case.
- struct.iter_unpack() Added: This returns an iterator which unpacks successive instances of a given format from a buffer; there's also a corresponding method on struct.Struct objects.
- tarfile CLI Added: There's now a command-line interface for the tarfile module which can be invoked with python -m tarfile.
- textwrap Enhancements: The TextWrapper class now offers two new attributes: max_lines limits the number of lines in the output, and placeholder which is appended to the output to indicate it was truncated due to the setting of max_lines. There's also a new handy textwrap.shorten() convenience function that uses these facilities to shorten a single line to a specified length, appending the placeholder if truncation occurred, as the short example below shows.
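A quick illustration (the sample text is arbitrary):

>>> import textwrap
>>>
>>> quote = ("The quick brown fox jumps over the lazy dog, "
...          "then does it all over again to show off.")
>>> textwrap.shorten(quote, width=40)
'The quick brown fox jumps over the [...]'
>>> wrapper = textwrap.TextWrapper(width=30, max_lines=2, placeholder=" [...]")
>>> print(wrapper.fill(quote))
The quick brown fox jumps over
the lazy dog, then does [...]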
- PyZipFile.writepy() Enhancement: The zipfile.PyZipFile class is a specialised compressor for the purposes of creating ZIP archives of Python libraries. It now supports a filterfunc parameter which must be a function accepting a single argument. It will be called for each file added to the archive, being passed the full path, and if it returns False for any path then it'll be excluded from the archive. This could be used to exclude unit test code, for example.

There were a collection of changes to builtins which are worth a quick mention.
- min() and max() Defaults: These functions now accept a default keyword-only parameter, to be returned if the iterable you pass is empty.
- __file__ Path: The __file__ attribute of modules should now always use absolute paths, except for __main__.__file__ if the script was invoked with a relative name. Could be handy, especially when using this to generate log entries and the like.
- bytes.join() Accepts Any Buffer: The join() method of bytes and bytearray previously used to be restricted to accepting objects of these types. Now in both cases it will accept any object supporting the buffer protocol.
- memoryview Supports reversed(): memoryview was previously not registered as a subclass of collections.Sequence, but it is in Python 3.4. Also, it can now be used with reversed().

In this release there's also an important change to the garbage collection process, as defined in PEP 442. This finally resolves some long-standing problems around garbage collection of reference cycles where the objects have custom finalisers (e.g. __del__()
methods).
Just to make sure we’re on the same page, a reference cycle is when you have a series of objects which all hold a reference to each other where there is no clear “root” object which can be deleted first. This means that their reference counts never normally drop to zero, because there’s always another object holding a reference to them. If, like me, you’re a more visual thinker, here’s a simple illustration:
It’s for these cases that the garbage collector was created. It will detect reference cycles where there are no longer any external references pointing to them9, and if so it’ll break all the references within the cycle. This allows the references counts to drop to zero and the normal object cleanup occurs.
This is fine, except when more than one of the objects have custom finalisers. In these cases, it’s not clear in what order the finalisers should be called, and also there’s the risk that the finalisers could make changes which themselves impact the garbage collection process. So historically the interpreter has balked at these cases and left the objects on the gc.garbage
list for programmers to clean up using their specific knowledge of the objects in question. Of course, it’s always better never to create such reference cycles in the first place, but sometimes it’s surprisingly easy to do so by accident.
The good news is that in Python 3.4 this situation has been improved so that in almost every case the garbage collector will be able to collect reference cycles. The garbage collector now has two additional stages. In the first of these, the finalisers of all objects in isolated reference cycles are invoked. The only choice here is really to call them in an undefined order, so you should avoid making too many assumptions in the finalisers that you write.
The second new step, after all finalisers have been run, is to re-traverse the cycles and confirm they’re still isolated. This is required because the finalisers may have ended up creating references from outside the cycle which should keep it alive. If the cycle is no longer isolated, the collection is aborted this time around and the objects persist. Note that their finalisers will only ever be called once, however, and this won’t change if they’re resurrected in this fashion.
Assuming the collection wasn’t aborted, it now continues as normal.
This should cover most of the cases people are likely to hit. However, there’s an important exception which can still bite you: this change doesn’t affect objects defined in C extension modules which have a custom tp_dealloc
function. These objects may still end up on gc.garbage
, unfortunately.
The take-aways from this change appear to be:
- In pure Python code you generally don't need to worry about cyclic garbage with finalisers ending up in gc.garbage any more.
- Finalisers in such cycles are called in an undefined order, and only ever once, so it's best to keep them simple.
- Objects from C extension modules with a custom tp_dealloc function can still end up in gc.garbage.

Here are the other changes I felt were noteworthy enough to mention, but not enough to jump into a lot of detail.
- There's a new -I command-line option to run the interpreter in isolated mode. This removes the current directory from sys.path, as well as the user's own site-packages directory, and also ignores all PYTHON*
environment variables. The intention is to be able to run a script in a clean system-defined environment, without any user customisations being able to impact it. This can be specified on the shebang line of system scripts, for example.As usual there are a number of optimisations, of which I’ve only included some of the more interesting ones here:
- sets are cheaper due to an optimisation of trying some limited linear probing in the case of a collision, which can take advantage of cache locality, before falling back on open addressing if there are still repeated collisions (by default the limit for linear probing is 9).
- html.escape() is around 10x faster.
- os.urandom()
now uses a lazily-opened persistent file descriptor to avoid the overhead of opening large numbers of file descriptors when run in parallel from multiple threads.Another packed release yet again, and plenty of useful additions large and small. Blocking inheritance of file descriptors by default is one of those features that’s going to be helpful to a lot of people without them even knowing it, which is the sort of thing Python does well in general. The new modules in this release aren’t anything earth-shattering, but they’re all useful additions. The lack of something like the enum
module in particular is something that has always felt like a bit of a rough edge. The diagnostic improvements like tracemalloc
and the inspect
improvements all feel like the type of thing that you won't necessarily be using every day, but when you have a need of them then they're priceless. The addition of subTest
to unittest
is definitely a handy one, as it makes failures in certain cases much more transparent than just realising the overall test has failed and having to insert logging statements to figure out why.
The incremental XMLPullParser
is a great addition in my opinion; I've always had a bit of a dislike of callback-based approaches since they always seem to force you to jump through more hoops than you'd like. Whichever one is a natural fit does depend on your use-case, however, so it's good to have both approaches available to choose from. I'm also really glad to see the long-standing issue of garbage collecting reference cycles with custom finalisers has finally been tidied up — it's one more change to give us more confidence using Python for very long-running daemon processes.
It does feel rather like a “tidying up loose ends and known issues” release, this one, but there’s plenty there to justify an upgrade. From what I know of later releases, I wonder if that was somewhat intentional — stabilising the platform for syntactic updates and other more sweeping changes in the future.
Spoiler alert: it was added in the next release. ↩
A key derivation function is used to “stretch” a password of somewhat limited length into a longer byte sequence that can be used as a cryptographic key. Doing this naively can significantly reduce the security of a system, so using established algorithms is strongly recommended. ↩
These are the objects returned by the finder, and which are responsible for actually loading the module from disk into memory. If you want to know more, you can find more details in the Importer Protocol section of PEP 302. ↩
AST stands for Abstract Syntax Trees, and represents a normalised intermediate form for Python code which has been parsed but not yet compiled. ↩
That’s a duck typing joke. Also, why do you never see ducks typing? They’ve never found a keyboard that quite fits the bill. That was another duck typing joke, although I use the word “joke” in its loosest possible sensible in both cases. ↩
For a more in-depth discussion of some of the issues using fork()
in multithreaded code, this article has some good discussion. ↩
Spoiler alert: this method is only applicable until Python 3.7 where it was deprecated in favour of a newer sni_callback
attribute. The semantics are similar, however. ↩
It’s useful to have support for it in software, of course, because you don’t necessarily have control of the formats you ened to open. But as Benjamin Zwickel convincingly explains in The 24-Bit Delusion, 24-bit audio is generally a waste of time since audio equipment cannot reproduce it accurately enough. ↩
For reference, this is what PEP 442 refers to as a cyclic isolate. ↩
This is part 6 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.3 - Part 2.
In this series looking at features introduced by every version of Python 3, this one is the first of two covering release 3.4. We look at a universal install of the pip
utility, improvements to handling codecs, and the addition of the asyncio
and enum
modules, among other things.
Python 3.4 was released on March 16 2014, around 18 months after Python 3.3. That means I’m only writing this around seven years late, as opposed to my Python 3.0 overview which was twelve years behind — at this rate I should be caught up in time for the summer.
This release was mostly focused on standard library improvements and there weren’t any syntax changes. There’s a lot here to like, however, including a bevy of new modules and a whole herd of enhancements to existing ones, so let’s fire up our Python 3.4 interpreters and import some info.
For anyone who’s unaware of pip
, it is the most widely used package management tool for Python, its name being a recursive acronym for pip installs packages. Originally written by Ian Bicking, creator of virtualenv
, it was originally called pyinstall
and was written to be a more fully-featured alternative to easy_install
, which was the official package installation tool at the time.
Since pip
is the tool you naturally turn to for installing Python modules and tools, this always begs the question: how do you install pip
for the first time? Typically the answer has been to install some OS package with it in, and once you have it installed you can use it to install everything else. In the new release, however, there’s a new ensurepip
module to perform this bootstrapping operation. It uses a private copy of pip
that’s distributed with CPython, so it doesn’t require network access and can readily be used by anyone on any platform.
This approach is part of a wider standardisation effort around distributing Python packages, and pip
was selected as a tool that’s already popular and also works well within virtual environments. Speaking of which, this release also updates the venv
module to install pip
in virtual environments by default, using ensurepip
. This was something that virtualenv
always did, and the lack of it in venv
was a serious barrier to adoption of venv
for a number of people. Additionally the CPython installers on Windows and MacOS also default to installing pip
on these platforms. You can find full details in PEP 453.
When you try newer languages like Go and Rust, coming from a heritage of C++ and the like, one of the biggest factors that leaps out at you isn't so much the language itself but the convenience of the well integrated standard tooling. With this release I think Python has taken another step in this direction, with standard and consistent package management on all the major platforms.
Under POSIX, file descriptors are by default inherited by child processes during a fork()
operation. This offers some concrete advantages, such as the child process automatically inheriting the stdin
, stdout
and stderr
from the parent, and also allowing the parent to create a pipe with pipe()
to communicate with the child process1.
However, this behaviour can cause confusion and bugs. For example, if the child process is a long-running daemon then an inherited file descriptor may be held open indefinitely and the disk space associated with the file will not be freed. Or if the parent had a large number of open file descriptors, the child may exhaust the remaining space if it too tries to open a large number. This is one reason why it's common to iterate over all file descriptors and call close()
on them after forking.
In Python 3.4, however, this behaviour has been modified so that file descriptors are not inherited. This is implemented by setting FD_CLOEXEC
on the descriptor via fcntl()
2 on POSIX systems, which causes that descriptor to be closed when any of the execX()
family are called. On Windows, SetHandleInformation()
is used to clear HANDLE_FLAG_INHERIT
with much the same purpose.
Since inheritance of file descriptors is still desirable in some circumstances, the functions os.get_inheritable()
and os.set_inheritable()
can be used to query and set this behaviour on a per-filehandle basis. There are also os.get_handle_inheritable()
and os.set_handle_inheritable()
calls on Windows, if you’re using native Windows handles rather than the POSIX layer.
One important aspect to note here is that when using the FD_CLOEXEC
flag, the close()
happens on the execX()
call, so if you call a plain vanilla os.fork()
and continue execution in the same script then all the descriptors will still be open. To demonstrate the action of these methods, you’ll need to do something like this (which is Unix-specific since it assumes the existence of /tmp
):
[Code listing omitted: the original 47-line demonstration script hasn't survived in this copy.]
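The original listing isn't reproduced here, but as a rough sketch of the kind of demonstration involved (the file name and script structure are my own, not the author's), something along these lines shows the effect of os.set_inheritable() across an exec():

```python
import os
import sys

if len(sys.argv) > 2:
    # Child process: try to write to the file descriptor passed on the command-line.
    fd = int(sys.argv[1])
    try:
        os.write(fd, sys.argv[2].encode("ascii") + b"\n")
    except OSError as exc:
        print("ERROR:", exc)
    sys.exit(0)

# Parent process: open a file and write an initial line to it.
fd = os.open("/tmp/inherit-demo.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.write(fd, b"Before fork\n")

# By default the descriptor is non-inheritable in 3.4, so after the child
# interpreter has been exec()'d the write fails with EBADF.
os.spawnv(os.P_WAIT, sys.executable,
          [sys.executable, __file__, str(fd), "FIRST"])

# Mark the descriptor as inheritable and try again -- this time it works.
os.set_inheritable(fd, True)
os.spawnv(os.P_WAIT, sys.executable,
          [sys.executable, __file__, str(fd), "SECOND"])

os.close(fd)
print("Contents of file:")
with open("/tmp/inherit-demo.txt") as handle:
    print(handle.read(), end="")
```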
When run, you should see something like the following:
ERROR: [Errno 9] Bad file descriptor
Contents of file:
Before fork
SECOND
That first line is the output from the first attempt to write the file, which fails. The contents of the output file clearly indicate the second write was successful.
In general I think this change is a very sensible one, as the previous default behaviour of inheriting file descriptors on POSIX systems probably took a lot of less experienced developers (and a few more experienced ones!) by surprise. It's the sort of nasty surprise that you don't realise is there until those odd cases where, say, you're dealing with hundreds of open files at once, and when you spawn a child process it suddenly starts complaining it's hit the system limit on open file descriptors and you wonder what on earth is going on. It always seems that such odd cases crop up when you have the tightest deadlines, too, so the last thing you need is to spend hours tracking down some weird file descriptor inheritance bug.
If you need to know more, PEP 446 has the lowdown, including references to real issues in various OSS projects caused by this behaviour.
The codecs
module has long been a fixture in Python, since it was introduced in (I think!) Python 2.0, released over two decades ago. It was intended as a general framework for registering and using any sort of codec, and this can be seen from the diverse range of codecs it supports. For example, as well as obvious candidates like utf-8
and ascii
, you’ve got options like base64
, hex
, zlib
and bz2
. You can even register your own with codecs.register()
.
However, most people don’t use codecs
on a frequent basis, but they do use the convenience methods str.encode()
and bytes.decode()
all the time. This can cause confusion because while the encode()
and decode()
methods provided by codecs
are generic, the convenience methods on str
and bytes
are not — these only support the limited set of text encodings that make sense for those classes.
In Python 3.4 this situation has been somewhat improved by more helpful error messages and improved documentation.
Firstly, the methods codecs.encode()
and codecs.decode()
are now documented, which they weren’t previously. This is probably because they’re really they are just convenient wrappers for calling lookup()
and invoking the encoder object thus created, but unless you’re doing a lot of encoding/decoding with the same codec, the simplicity of their interface is probably preferable. Since these are C extension modules under the hood, there shouldn’t be a lot of performance overhead for using these wrappers either.
>>> import codecs
>>> encoder = codecs.lookup("rot13")
>>> encoder.encode("123hello123")
('123uryyb123', 11)
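If you don't need to reuse the codec object, the wrapper functions are more succinct. This short snippet performs the same conversion as the lookup above:

```python
import codecs

# One-off conversions without explicitly looking the codec up first.
print(codecs.encode("123hello123", "rot13"))   # 123uryyb123
print(codecs.decode("123uryyb123", "rot13"))   # 123hello123
```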
Secondly, using one of the non-text encodings without going through the codecs
module now yields a helpful error which points you in that direction.
>>> "123hello123".encode("rot13")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
LookupError: 'rot13' is not a text encoding; use codecs.encode() to handle arbitrary codecs
Finally, errors during encoding now use chained exceptions to ensure that the codec responsible for them is indicated as well as the underlying error raised by that codec.
>>> codecs.decode("abcdefgh", "hex")
Traceback (most recent call last):
File "/Users/andy/.pyenv/versions/3.4.10/encodings/hex_codec.py", line 19, in hex_decode
return (binascii.a2b_hex(input), len(input))
binascii.Error: Non-hexadecimal digit found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
binascii.Error: decoding with 'hex' codec failed (Error: Non-hexadecimal digit found)
Hopefully all this will go some way to making things easier to grasp for anyone grappling with the nuances of codecs in Python.
This release has a number of new modules, which are discussed in the sections below. I’ve skipped ensurepip
since it’s already been discussed at the top of this article.
This release contains the new asyncio
module which provides an event loop framework for Python. I’m not going to discuss it much in this article because I already covered it a few years ago in an article that was part of my coroutines series. The other reason not to go into things in too much detail here are that the situation evolved fairly rapidly from Python 3.4 to 3.7, so it probably makes more sense to have a more complete look in retrospect.
Briefly, it’s nominally the successor to the asyncore
module, for doing asynchronous I/O, which was always promising in principle but a bit of a disappointment in practice due to a lack of flexibility. This is far from the whole story, however, as it also forms the basis for the modern use of coroutines in Python.
Since I’m writing these articles with the benefit of hindsight, my strong suggestion is to either go find some other good tutorials on asyncio
that were written in the last couple of years, and which use Python 3.7 as a basis; or wait until I get around to covering Python 3.7 myself, where I’ll run through in more detail (especially since my previous articles stopped at Python 3.5).
Enumerations are something that Python’s been lacking for some time. This is partly due to the fact that it’s not too hard to find ways to work around this omission, but they’re often a little unsatisfactory. It’s also partly due to the fact that nobody could fully agree on the best way to implement them.
Well in Python 3.4 PEP 435 has come along to change all that, and it’s a handy little addition.
Enumerations are defined using the same syntax as a class:
from enum import Enum

class WeekDay(Enum):
MONDAY = 1
TUESDAY = 2
WEDNESDAY = 3
THURSDAY = 4
FRIDAY = 5
SATURDAY = 6
SUNDAY = 7
However, it’s important to note that this isn’t actually a class, as it’s linked to the enum.EnumMeta
metaclass. Don’t worry too much about the details, just be aware that this is not a class but essentially a new construct that uses the same syntax as classes, and you won’t be taken by surprise later.
You’ll notice that all the enumeration members need to be assigned a value, you can’t just list the member names on their own (although read on for a nuance to this). When you have an enumeration value you can query both its name and value, and also str
and repr
have sensible values. See the excerpt below for an illustration of all these aspects.
>>> WeekDay.WEDNESDAY.name
'WEDNESDAY'
>>> WeekDay.WEDNESDAY.value
3
>>> str(WeekDay.FRIDAY)
'WeekDay.FRIDAY'
>>> repr(WeekDay.FRIDAY)
'<WeekDay.FRIDAY: 5>'
>>> type(WeekDay.FRIDAY)
<enum 'WeekDay'>
>>> type(WeekDay)
<class 'enum.EnumMeta'>
>>> WeekDay.THURSDAY - WeekDay.MONDAY
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'WeekDay' and 'WeekDay'
>>> WeekDay.THURSDAY.value - WeekDay.MONDAY.value
3
I did mention that every enumeration member needs a value, but there is an enum.auto()
helper to automatically assign values if all you need is something unique (although note that auto() itself only arrived later, in Python 3.6). The excerpt below illustrates this as well as iterating through an enumeration.
>>> from enum import Enum, auto
>>> class Colour(Enum):
... RED = auto()
... GREEN = auto()
... BLUE = auto()
...
>>> print("\n".join(i.name + "=" + str(i.value) for i in Colour))
RED=1
GREEN=2
BLUE=3
Every enumeration name must be unique within a given enumeration definition, but the values can be duplicated if needed, which you can use to define aliases for values. If this isn't desirable, the @enum.unique
decorator can enforce uniqueness, raising a ValueError
if any values are duplicated, as the sketch below shows.
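For instance (the enumeration here is my own invention), applying the decorator to a definition containing an alias fails at class creation time:

```python
import enum

try:
    @enum.unique
    class Status(enum.Enum):
        NEW = 1
        OPEN = 2
        ACTIVE = 2   # alias for OPEN, so @unique rejects the whole class
except ValueError as error:
    print("Rejected:", error)
```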
One thing that’s not immediately obvious from these examples is that enumeration member values may be any type and different types may even be mixed within the same enumeration. I’m not sure how valuable this would be to do in practive, however.
Values can be compared by identity or equality, but comparing enumeration members to their underlying types always returns not equal. Even when comparing by identity, aliases for the same underlying value compare equal. Also note that when iterating through enumerations, aliases are skipped and the first definition for each value is used.
>>> class Numbers(Enum):
... ONE = 1
... UN = 1
... EIN = 1
... TWO = 2
... DEUX = 2
... ZWEI = 2
...
>>> Numbers.ONE is Numbers.UN
True
>>> Numbers.TWO == Numbers.ZWEI
True
>>> Numbers.ONE == Numbers.TWO
False
>>> Numbers.ONE is Numbers.TWO
False
>>> Numbers.ONE == 1
False
>>> list(Numbers)
[<Numbers.ONE: 1>, <Numbers.TWO: 2>]
If you really do need to include aliases in your iteration, the special __members__
dictionary can be used for that.
>>> import pprint
>>> pprint.pprint(Numbers.__members__)
mappingproxy({'DEUX': <Numbers.TWO: 2>,
'EIN': <Numbers.ONE: 1>,
'ONE': <Numbers.ONE: 1>,
'TWO': <Numbers.TWO: 2>,
'UN': <Numbers.ONE: 1>,
'ZWEI': <Numbers.TWO: 2>})
Finally, the module also provides some subclasses of Enum
which may be useful. For example, IntEnum
is one which adds the ability to compare enumeration values with int
as well as other enumeration values.
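A quick sketch of the difference this makes, using a made-up enumeration of my own:

```python
from enum import Enum, IntEnum

class Plain(Enum):
    ONE = 1

class Numeric(IntEnum):
    ONE = 1

print(Plain.ONE == 1)      # False: plain Enum members never equal ints
print(Numeric.ONE == 1)    # True: IntEnum members compare like ints
print(Numeric.ONE + 2)     # 3: they behave like ints in arithmetic too
```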
This is a bit of a whirlwind tour of what's been written to be quite a flexible module, but hopefully it gives you an idea of its capabilities. Check out the full documentation for more details.
This release sees the addition of a new library pathlib
to manipulate filesystem paths, with semantics appropriate for different operating systems. This is intended to be a higher-level abstraction than that provided by the existing os.path
library, which itself has some functions to abstract away from the filesystem details (e.g. os.path.join()
which uses appropriate slashes to build a path).
There are common base classes across platforms, and then different subclasses for POSIX and Windows. The classes are also split into pure and concrete, where pure classes represent theoretical paths but lack any methods to interact with the concrete filesystem. The concrete equivalents have such methods, but can only be instantiated on the appropriate platform.
For reference, the class hierarchy has PurePath at the root, with PurePosixPath and PureWindowsPath as its flavour-specific subclasses; Path derives from PurePath, and the concrete PosixPath and WindowsPath derive from Path plus the corresponding pure class.
When run on a POSIX system, the following excerpt illustrates which of the platform-specific classes can be instantiated, and also that the pure classes lack the filesystem methods that the concrete ones provide:
>>> import pathlib
>>> a = pathlib.PurePosixPath("/tmp")
>>> b = pathlib.PureWindowsPath("/tmp")
>>> c = pathlib.PosixPath("/tmp")
>>> d = pathlib.WindowsPath("/tmp")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/andy/.pyenv/versions/3.4.10/pathlib.py", line 927, in __new__
% (cls.__name__,))
NotImplementedError: cannot instantiate 'WindowsPath' on your system
>>> c.exists()
True
>>> len(list(c.iterdir()))
24
>>> a.exists()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'PurePosixPath' object has no attribute 'exists'
>>> len(list(a.iterdir()))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'PurePosixPath' object has no attribute 'iterdir'
Of course, a lot of the time you’ll just want whatever path represents the platform on which you’re running, so if you instantiate plain old Path
you’ll get the appropriate concrete representation.
>>> x = pathlib.Path("/tmp")
>>> type(x)
<class 'pathlib.PosixPath'>
One handy feature is that the division operator (slash) has been overridden so that you can append path elements with it. Note that this operator is the same on all platforms, and also you always use forward-slashes even on Windows. However, when you stringify the path, Windows paths will be given backslashes. The excerpt below illustrates these features, and also some of the manipulations that pure paths support.
>>> x = pathlib.PureWindowsPath("C:/") / "Users" / "andy"
>>> x
PureWindowsPath('C:/Users/andy')
>>> str(x)
'C:\\Users\\andy'
>>> x.parent
PureWindowsPath('C:/Users')
>>> [str(i) for i in x.parents]
['C:\\Users', 'C:\\']
>>> x.drive
'C:'
So far it’s pretty handy but perhaps nothing to write home about. However, there are some handy features. One is glob matching, where you can test a given path for matches against a glob-style pattern with the match()
method.
>>> x = pathlib.PurePath("a/b/c/d/e.py")
>>> x.match("*.py")
True
>>> x.match("d/*.py")
True
>>> x.match("a/*.py")
False
>>> x.match("a/*/*.py")
False
>>> x.match("a/*/*/*/*.py")
True
>>> x.match("d/?.py")
True
>>> x.match("d/??.py")
False
Then there’s relative_to()
which is handy for getting the relative path of a file to some specified parent directory. It also raises an exception if the path isn’t under the parent directory, which makes checking for errors in paths specified by the user more convenient.
>>> x = pathlib.PurePath("/one/two/three/four/five.py")
>>> x.relative_to("/one/two/three")
PurePosixPath('four/five.py')
>>> x.relative_to("/xxx")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../pathlib.py", line 819, in relative_to
.format(str(self), str(formatted)))
ValueError: '/one/two/three/four/five.py' does not start with '/xxx'
And finally there’s with_name()
, with_stem()
and with_suffix()
which are useful for making manipulations of parts of the filename.
>>> x = pathlib.PurePath("/home/andy/file.md")
>>> x.with_name("newfilename.html")
PurePosixPath('/home/andy/newfilename.html')
>>> x.with_stem("newfile")
PurePosixPath('/home/andy/newfile.md')
>>> x.with_suffix(".html")
PurePosixPath('/home/andy/file.html')
>>> x.with_suffix("")
PurePosixPath('/home/andy/file')
The concrete classes add a lot more useful functionality for querying the content of directories and reading file ownership and metadata, but if you want more details I suggest you go read the excellent documentation. If you want the motivations behind some of the design decisions, go and read PEP 428.
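As a small taste of the concrete side (the directory and pattern here are arbitrary placeholders of mine):

```python
import pathlib

path = pathlib.Path("/tmp")
# Iterate over matching entries and query their metadata directly.
for entry in sorted(path.glob("*.txt")):
    info = entry.stat()
    print(entry.name, info.st_size, "bytes")
```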
Both simple and useful, this new module contains some handy functions to calculate basic statistical measures from sets of data. All of these operations support the standard numeric types int
, float
, Decimal
and Fraction
and raise StatisticsError
on errors, such as an empty data set being passed.
The following functions for determining different forms of average value are provided in this release:
mean(): the arithmetic mean, equivalent to sum(data) / len(data) except supporting generalised iterators that can only be evaluated once and don't support len().
median(): the middle value, equivalent to data[len(data) // 2] on sorted data except supporting generalised iterators. Also, if the number of items in data is even then the mean of the two middle items is returned instead of selecting one of them, so the value is not necessarily one of the actual members of the data set in this case.
median_low() and median_high(): equivalent to median() and each other for data sets with an odd number of elements. If the number of elements is even, these return one of the two middle elements instead of their mean as median() does, with median_low() returning the lower of the two and median_high() the higher.
median_grouped(): the median of continuous data grouped into intervals of a specified width, which defaults to 1 and would represent continuous values that have been rounded to the nearest integer. The method involves identifying the median interval, and then using the proportion of values above and within that interval to interpolate an estimate of the median value within it3.
mode(): the single most common data point, raising StatisticsError if there's more than one value with equal-highest cardinality.

There are also functions to calculate the variance and standard deviation of the data:
pstdev() and stdev(): the population and sample standard deviation respectively.
pvariance() and variance(): the population and sample variance respectively.
These operations are generally fairly simple to implement yourself, but making them operate correctly on any iterator is slightly fiddly and it’s definitely handy to have them available in the standard library. I also have a funny feeling that we’ll be seeing more additions to this library in the future beyond the fairly basic set that’s been included initially.
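A minimal example with some made-up data:

```python
import statistics

data = [1.25, 0.25, 2.75, 1.25, 3.5, 0.5, 1.75]

print(statistics.mean(data))           # arithmetic mean of the sample
print(statistics.median(data))         # 1.25, the middle of the sorted values
print(statistics.mode([1, 1, 2, 3]))   # 1
print(statistics.variance(data))       # sample variance
print(statistics.pstdev(data))         # population standard deviation
```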
As you can probably tell from the name, this module is intended to help you track down where memory is being allocated in your scripts. It does this by storing the line of code that allocated every block, and offering APIs which allow your code to query which files or lines of code have allocated the most blocks, and also compare snapshots between two points in time so you can track down the source of memory leaks.
Due to the memory and CPU overhead of performing this tracing it’s not enabled by default. You can start tracking at runtime with tracemalloc.start()
, or to start it early you can set the PYTHONTRACEMALLOC
environment variable or pass the -X tracemalloc
command-line option. You can also store multiple frames of traceback against each block, at the cost of increased CPU and memory overhead, which can be helpful for tracing the source of memory allocations made by common shared code.
Once tracing is enabled you can grab a snapshot at any point with take_snapshot()
, which returns a Snapshot
instance which can be interrogated for information at any later point. Once you have a Snapshot
instance you can call statistics()
on it to get the memory allocations aggregated by source file, broken down by line number, or grouped by full backtrace. There's also a compare_to()
method for examining the delta in memory allocations between two points, and there are dump()
and load()
methods for saving snapshots to disk for later analysis, which could be useful for tracing code in production environments.
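To make the API concrete, here's a rough sketch of my own showing how these calls fit together; the workload is an arbitrary placeholder, not the author's example below:

```python
import tracemalloc

tracemalloc.start()

# Placeholder workload: allocate something worth measuring.
cache = ["x" * 1024 for _ in range(100)]

first = tracemalloc.take_snapshot()
for stat in first.statistics("lineno")[:5]:
    print(stat)

del cache

# Compare a later snapshot against the first to see what was freed or added.
second = tracemalloc.take_snapshot()
for stat in second.compare_to(first, "lineno")[:5]:
    print(stat)
```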
As a quick example of these two methods, consider the following completely artificial code:
[Code listings omitted: the example consisted of memory.py (24 lines) and three small helper modules, lib1.py, lib2.py and lib3.py, but only the output below has survived in this copy.]
Let’s take a quick look at the two parts of the output that executing memory.py
gives us. The first half that I get on my MacOS system is shown below — wherever you see “...
” it’s where I’ve stripped out leading paths to avoid the need for word wrapping:
---- Initial snapshot:
.../lib3.py:10: size=105 KiB, count=101, average=1068 B
memory.py:10: size=48.1 KiB, count=1, average=48.1 KiB
memory.py:9: size=12.0 KiB, count=1, average=12.0 KiB
.../lib1.py:5: size=10.3 KiB, count=102, average=104 B
.../lib3.py:11: size=848 B, count=1, average=848 B
memory.py:13: size=536 B, count=2, average=268 B
.../python3.4/random.py:253: size=536 B, count=1, average=536 B
memory.py:12: size=56 B, count=1, average=56 B
memory.py:11: size=56 B, count=1, average=56 B
.../lib3.py:6: size=32 B, count=1, average=32 B
.../lib2.py:3: size=32 B, count=1, average=32 B
I’m not going to go through all of these, but let’s pick a few examples to check what we’re seeing makes sense. Note that the results from statistics()
are always sorted in decreasing order of total memory consumption.
The first line indicates lib3.py:10
allocated memory 101 times, which is reassuring because it’s not allocating every time around the nested loop. Interesting to note that it’s one more time than the number of times around the outer loop, however, which perhaps implies there’s some allocation that was done the first time and then reused. The average allocation of 1068 bytes makes sense, since these are str
objects of 1024 characters and based on sys.getsizeof("")
on my platform each instance has an overhead of around 50 bytes.
Next up are memory.py:10
and memory.py:9
which are straightforward enough: single allocations for single strings. The sizes are such that the str
overhead is lost in rounding errors, but do note that the string using extended Unicode characters4 requires 4 bytes per character and is therefore four times larger than the byte-per-character ASCII one. If you’ve read the earlier articles in this series, you may recall that this behaviour was introduced in Python 3.3.
Skipping forward slightly, the allocation on lib3.py:11
is interesting: when we append the str
we’ve built to the list we get a single allocation of 848 bytes. I assume there’s some optimisation going on here, because if I increase the loop count the allocation count remains at one but the size increases.
The last thing I’ll call out is the two allocations on memory.py:13
. I’m not quite sure exactly what’s triggering this, but it’s some sort of optimisation — even if the loop has zero iterations then these allocations still occur, but if I comment out the loop entirely then these allocations disappear. Fascinating stuff!
Now we’ll look at the second half the output, comparing the initial snapshot to that after the class instances are deleted:
---- Incremental snapshot:
.../lib3.py:10: size=520 B (-105 KiB), count=1 (-100), average=520 B
.../lib1.py:5: size=0 B (-10.3 KiB), count=0 (-102)
.../python3.4/tracemalloc.py:462: size=1320 B (+1320 B), count=3 (+3), average=440 B
.../python3.4/tracemalloc.py:207: size=952 B (+952 B), count=3 (+3), average=317 B
.../python3.4/tracemalloc.py:165: size=920 B (+920 B), count=3 (+3), average=307 B
.../lib3.py:11: size=0 B (-848 B), count=0 (-1)
.../python3.4/tracemalloc.py:460: size=672 B (+672 B), count=1 (+1), average=672 B
.../python3.4/tracemalloc.py:432: size=520 B (+520 B), count=2 (+2), average=260 B
memory.py:18: size=472 B (+472 B), count=1 (+1), average=472 B
.../python3.4/tracemalloc.py:53: size=472 B (+472 B), count=1 (+1), average=472 B
.../python3.4/tracemalloc.py:192: size=440 B (+440 B), count=1 (+1), average=440 B
.../python3.4/tracemalloc.py:54: size=440 B (+440 B), count=1 (+1), average=440 B
.../python3.4/tracemalloc.py:65: size=432 B (+432 B), count=6 (+6), average=72 B
.../python3.4/tracemalloc.py:428: size=432 B (+432 B), count=1 (+1), average=432 B
.../python3.4/tracemalloc.py:349: size=208 B (+208 B), count=4 (+4), average=52 B
.../python3.4/tracemalloc.py:487: size=120 B (+120 B), count=2 (+2), average=60 B
memory.py:16: size=90 B (+90 B), count=2 (+2), average=45 B
.../python3.4/tracemalloc.py:461: size=64 B (+64 B), count=1 (+1), average=64 B
memory.py:13: size=480 B (-56 B), count=1 (-1), average=480 B
.../python3.4/tracemalloc.py:275: size=56 B (+56 B), count=1 (+1), average=56 B
.../python3.4/tracemalloc.py:189: size=56 B (+56 B), count=1 (+1), average=56 B
memory.py:12: size=0 B (-56 B), count=0 (-1)
memory.py:11: size=0 B (-56 B), count=0 (-1)
.../python3.4/tracemalloc.py:425: size=48 B (+48 B), count=1 (+1), average=48 B
.../python3.4/tracemalloc.py:277: size=32 B (+32 B), count=1 (+1), average=32 B
.../lib3.py:6: size=0 B (-32 B), count=0 (-1)
.../lib2.py:3: size=0 B (-32 B), count=0 (-1)
memory.py:10: size=48.1 KiB (+0 B), count=1 (+0), average=48.1 KiB
memory.py:9: size=12.0 KiB (+0 B), count=1 (+0), average=12.0 KiB
.../python3.4/random.py:253: size=536 B (+0 B), count=1 (+0), average=536 B
Firstly, there are of course a number of allocations within tracemalloc.py
, which are the result of creating and analysing the previous snapshot. We’ll disregard these, because they depend on the details of the library implementation which we don’t have transparency into here.
Beyond this, most of the changes are as you’d expect. Interesting points to note are that one of the allocations lib3.py:10
was not freed, and only one of the two allocations from memory.py:13
was freed. Since these were the two cases where I was a little puzzled by the apparently spurious additional allocations, I’m not particularly surprised to see these two being the ones that weren’t freed afterwards.
In a simple example like this, it’s easy to see how you could track down memory leaks and similar issues. However, I suspect in a complex codebase it could be quite a challenge to focus in on the impactful allocations with the amount of detail provided. I guess the main reason people would turn to this module is only to track down major memory leaks rather than a few KB here and there, so at that point perhaps the important allocations would stand out clearly from the background noise.
Either way, it’s certainly a welcome addition to the library!
Great stuff so far, but we’ve got plenty of library enhancements still to get through. I’ll discuss those and few other remaining details in the next post, and I’ll also sum up my overall thoughts on this release as a whole.
So the parent process closes one end of the pipe and the child process closes the other end. If you want bidirectional communication you can do the same with another pipe, just the opposite way around. There are other ways for processes to communicate, of course, but this is one of the oldest. ↩
If you want to get technical there’s a faster path used on platforms which support it which is to call ioctl()
with either FIOCLEX
or FIONCLEX
to perform the same task. This is only because it’s generally a few percent faster than the equivalent fcntl()
call, but less standard. ↩
Or more concisely, the estimate is L + ((n / 2 - cf) / f) * w, where L is the lowest possible value from the median interval, n is the size of the data set, cf is the number of items below the median interval, f is the number of items within the median interval, and w is the interval width. ↩
Specifically from the Supplementary Ideographic Plane. ↩
This is part 5 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.3 - Part 1.
The second of my two articles covering features added in Python 3.3, this one talks about a large number of changes to the standard library, especially in network and OS modules. I also discuss implicit namespace packages, which are a bit niche but can be useful for maintaining large families of packages.
This is the second and final article in this series looking at new features in Python 3.3, and we'll be primarily drilling into a large number of changes to the Python libraries. There's a lot of interesting stuff to cover on the Internet side, such as the new ipaddress
module and changes to email
, and also in terms of OS features such as a slew of new POSIX functions that have been exposed.
There are a few module changes relating to networking and Internet protocols in this release.
There’s a new ipaddress
module for storing IP addresses, as well as other related concepts like subnets and interfaces. All of the types have IPv4 and IPv6 variants, and offer some useful functionality for code to deal with IP addresses generically without needing to worry about the distinctions. The basic types are listed below.
IPv4Address & IPv6Address: the ip_address() utility function constructs the appropriate one of these from a string specification such as 192.168.0.1 or 2001:db8::1:0.
IPv4Network & IPv6Network: the ip_network() utility function constructs one of these from a string specification such as 192.168.0.0/28 or 2001:db8::1:0/56. One thing to note is that because this represents an IP subnet rather than any particular host, it's an error for any of the bits to be non-zero in the host part of the network specification.
IPv4Interface & IPv6Interface: the ip_interface() utility function constructs this from a string specification such as 192.168.1.20/28. Note that unlike the specification passed to ip_network(), this has non-zero bits in the host part of the specification.

The snippet below demonstrates some of the attributes of address objects:
>>> import ipaddress
>>> x = ipaddress.ip_address("2001:db8::1:0")
>>> x.packed
b' \x01\r\xb8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00'
>>> x.compressed
'2001:db8::1:0'
>>> x.exploded
'2001:0db8:0000:0000:0000:0000:0001:0000'
>>>
>>> x = ipaddress.ip_address("192.168.0.1")
>>> x.packed
b'\xc0\xa8\x00\x01'
>>> x.compressed
'192.168.0.1'
>>> x.exploded
'192.168.0.1'
This snippet illustrates a network and how it can be used to iterate over the addresses within it, as well as check for address membership in the subnet and overlaps with other subnets:
>>> x = ipaddress.ip_network("192.168.0.0/28")
>>> for addr in x:
... print(repr(addr))
...
IPv4Address('192.168.0.0')
IPv4Address('192.168.0.1')
# ... (12 rows skipped)
IPv4Address('192.168.0.14')
IPv4Address('192.168.0.15')
>>> ipaddress.ip_address("192.168.0.2") in x
True
>>> ipaddress.ip_address("192.168.1.2") in x
False
>>> x.overlaps(ipaddress.ip_network("192.168.0.0/30"))
True
>>> x.overlaps(ipaddress.ip_network("192.168.1.0/30"))
False
And finally the interface can be queried for its address and netmask, as well retrieve its specification either as a netmask or in CIDR notation:
>>> x = ipaddress.ip_interface("192.168.0.25/28")
>>> x.network
IPv4Network('192.168.0.16/28')
>>> x.ip
IPv4Address('192.168.0.25')
>>> x.with_prefixlen
'192.168.0.25/28'
>>> x.with_netmask
'192.168.0.25/255.255.255.240'
>>> x.netmask
IPv4Address('255.255.255.240')
>>> x.is_private
True
>>> x.is_link_local
False
Having implemented a lot of this stuff manually in the past, having them here in the standard library is definitely a big convenience factor.
The email
module has always attempted to be compliant with the various MIME RFCs3. The email ecosystem is a broad church, however, and sometimes it's useful to be able to customise certain behaviours, either to work on email held in non-compliant offline mailboxes or to connect to non-compliant email servers. For these purposes the email
module now has a policy framework.
The Policy
object controls the behaviour of various aspects of the email
module. This can be specified when constructing an instance from email.parser
to parse messages, or when constructing an email.message.Message
directly, or when serialising out an email using the classes in email.generator
.
In fact Policy
is an abstract base class which is designed to be extensible, but instances must provide at least the following properties:
Property | Default | Meaning |
---|---|---|
max_line_length | 78 | Maximum line length, not including separators, when serialising. |
linesep | "\n" | Character used to separate lines when serialising. |
cte_type | "7bit" | If 8bit is used with a BytesGenerator then non-ASCII may be used. |
raise_on_defect | False | Raise errors during parsing instead of adding them to the defects list. |
So, if you’ve ever found yourself sick of having to remember to override linesep="\r\n"
in a lot of different places or similar, this new approach should be pretty handy.
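For instance, something along these lines (a sketch with placeholder addresses of my own) serialises a message with CRLF line endings by cloning the default policy once, rather than remembering to override linesep at every call site:

```python
import io
from email.generator import Generator
from email.message import Message
from email.policy import default

# Policies are immutable; clone() returns a copy with the given overrides.
smtp_policy = default.clone(linesep="\r\n")

msg = Message(policy=smtp_policy)
msg["To"] = "someone@example.com"        # placeholder address
msg["Subject"] = "Policy demonstration"
msg.set_payload("Just testing.")

buffer = io.StringIO()
Generator(buffer, policy=smtp_policy).flatten(msg)
print(repr(buffer.getvalue()))           # lines separated by '\r\n'
```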
However, one of the main motivations to introducing this system is it now allows backwards-incompatible API changes to be made in a way which enables authors to opt-in to them when ready, but without breaking existing code. If you default to the compat32
policy, you get an interface and functionality which is compatible with the old pre-3.3 behaviour.
There is also an EmailPolicy
, however, which introduces a mechanism for handling email headers using custom classes. This policy implements the following controls:
Property | Default | Meaning |
---|---|---|
refold_source | long | Controls whether email headers are refolded by the generator. |
header_factory | See note4 | Callable that takes name and value and returns a custom header object for that particular header. |
The classes used to represent headers can implement custom behaviour and allow access to parsed details. Here’s an example using the default
policy which implements the EmailPolicy
with all default behaviours unchanged:
>>> from email.message import Message
>>> from email.policy import default
>>> msg = Message(policy=default)
>>> msg["To"] = "Andy Pearce <andy@andy-pearce>"
>>> type(msg["To"])
<class 'email.headerregistry._UniqueAddressHeader'>
>>> msg["To"].addresses
(Address(display_name='Andy Pearce', username='andy', domain='andy-pearce'),)
>>>
>>> import email.utils
>>> msg["Date"] = email.utils.localtime()
>>> type(msg["Date"])
<class 'email.headerregistry._UniqueDateHeader'>
>>> msg["Date"].datetime
datetime.datetime(2021, 3, 1, 17, 18, 21, 467804, tzinfo=datetime.timezone(datetime.timedelta(0), 'GMT'))
>>> print(msg)
To: Andy Pearce <andy@andy-pearce>
Date: Mon, 01 Mar 2021 17:18:21 +0000
These classes will handle aspects such as presenting Unicode representations to code, but serialising out using UTF-8 or similar encoding, so the programmer no longer has to deal with such complications, provided they selected the correct policy.
On a separate email-related note, the smtpd
module now also supports RFC 5321, which adds an extension framework to allow optional additions to SMTP; and RFC 1870, which offers clients the ability to pre-declare the size of messages so that errors can be detected early, before a lot of data is sent needlessly.
The smtplib
module also has some improvements. The classes now support a source_address
keyword argument to specify the source address to use for binding the outgoing socket, for servers where there are multiple potential interfaces and it’s important that a particular one is used. The SMTP
class can now act as a context manager, issuing a QUIT
command and disconnecting when the context expires.
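A sketch of how the two features fit together; the host and local address here are placeholders, and of course this needs a reachable server:

```python
import smtplib

# Bind the outgoing connection to a specific local interface, and let the
# context manager issue QUIT and disconnect for us on exit.
with smtplib.SMTP("mail.example.com", 25,
                  source_address=("192.0.2.10", 0)) as connection:
    connection.noop()
```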
Also on the Internet-related front there were a handful of small enhancements to the ftplib
module.
ftplib.FTP
Now Accepts source_address
FTP_TLS.ccc()
FTP_TLS
class, which is a subclass of FTP
which adds TLS support as per RFC 4217, has now acquired a ccc()
method which reverts the connection back to plaintext. Apparently, this can be useful to take advantage of firewalls that know how to handle NAT with non-secure FTP without opening fixed ports. So now you know.FTP.mlsd()
mlsd()
method has been added to FTP
objects which uses the MLSD
command specified by RFC 3659. This offers a better API than FTP.nlst()
, returning a generator rather than a list
and includes file metadata rather than just filenames. Not all FTP servers support the MLSD
command, however.The http
, html
and urllib
packages also or some love in this release.
BaseHTTPRequestHandler
Header Bufferinghttp.server.BaseHTTPRequestHandler
server now html.parser.HTMLParser
Now Parses Invalid Markupstrict
parameter of the constructor as well as the now-unused HTMLParseError
have been deprecated.html.entities.html5
Addeddict
that maps entity names to the equivalent characters, for example html5["amp;"] == "&"
. This includes all the Unicode characters too. If you want the full list, take a peek at §13.5 of the HTML standard.urllib.Request
Method Specificationurllib.Request
class now has a method
parameter which can specify the HTTP method to use. Previously this was decided automatically between GET
and POST
based on whether body data was provided, and that behaviour is still the default if the method isn’t specified.sendmsg()
and recvmsg()
cmsg
man page.PF_CAN
Supportsocket
class now supports the PF_CAN
protocol family, which I don’t pretend to know much about but is an open source stack contributed by Volkswagen which bridges the Controller Area Network (CAN) standard for implementing a vehicle communications bus into the standard sockets layer. This one’s pretty niche, but it was just too cool not to mention5.PF_RDS
SupportPF_RDS
which is the Reliable Datagram Sockets protocol. This is a protocol developed by Oracle which offers similar interfaces to UDP but offers guaranteed in-order delivery. Unlike TCP, however, it’s still datagram-based and connectionless. You now know at least as much about RDS as I do. If anyone knows why they didn’t just use SCTP, which already seems to offer them everything they need, let me know in the comments.PF_SYSTEM
SupportPF_SYSTEM
. This is a MacOS-specific set of protocols for communicating with kernel extensions6.sethostname()
Addedsethostname()
updates the system hostname. On Unix system this will generally require running as root
or, in the case of Linux at least, having the CAP_SYS_ADMIN
capability.socketserver.BaseServer
Actions Hookservice_actions()
method every time around the main poll loop. In the base class this method does nothing, but derived classes can implement it to perform periodic actions. Specifically, the ForkingMixIn
now uses this hook to clean up any defunct child processes.ssl
Module Random Number GenerationRAND_bytes()
and RAND_pseudo_bytes()
. However, os.urandom()
is still preferable for most applications.ssl
Module ExceptionsSSLZeroReturnError
, SSLWantReadError
, SSLWantWriteError
, SSLSyscallError
and SSLEOFError
.SSLContext.load_cert_chain()
Passwordsload_cert_chain()
method now accepts a password
parameter for cases where the private key is encrypted. It can be a str
or bytes
value containing the actual password, of a callable which will return the password. If specified, this overrides OpenSSL’s default password-prompting mechanism.ssl
Supports Additional Algorithmscompression()
method to query the current compression algorithm in use. The SSL context also now supports an OP_NO_COMPRESSION
option to disable compression.ssl
Next Protocol Negotiationssl.SSLContext.set_npn_protocols()
has been added to support the Next Protocol Negotiation (NPN) extension to TLS. This allows different application-level protocols to be specified in preference order. It was originally added to support Google’s SPDY, and although SPDY is now deprecated (and superceded by HTTP/2) this extension is general in nature and still useful.ssl
Error IntrospectionInstances of ssl.SSLError
now have two additional attributes:
library
is a string indicating the OpenSSL subsystem responsible for the error (e.g. SSL
, X509
).reason
is a string code indicating the reason for the error (e.g. CERTIFICATE_VERIFY_FAILED
).A few new data structures have been added as part of this release.
There’s a new types.SimpleNamespace
type which can be used in cases where you just want to hold some attributes. It’s essentially just a thin wrapper around a dict
which allows the keys to be accessed as attributes instead of being subscripted. It’s also somewhat similar to an empty class definition, except for three main advantages:
Attributes can be set directly in the constructor using keyword arguments, e.g. types.SimpleNamespace(a=1, xyz=2).
It has a helpful repr() which follows the usual guideline that eval(repr(x)) == x.
Equality compares by attribute values, like a dict, unlike the default equality of classes, which compares by the result of id().

There's a new collections.ChainMap
class which can group together multiple mappings to form a single unified updateable view. The class overall acts as a mapping, and read lookups are performed across each mapping in turn with the first match being returned. Updates and additions are always performed in the first mapping in the list, and note that this may mask the same key in later mappings (but it will leave the originally mapping intact).
>>> import collections
>>> a = {"one": 1, "two": 2}
>>> b = {"three": 3, "four": 4}
>>> c = {"five": 5}
>>> chain = collections.ChainMap(a, b, c)
>>> chain["one"]
1
>>> chain["five"]
5
>>> chain.get("ten", "MISSING")
'MISSING'
>>> list(chain.keys())
['five', 'three', 'four', 'one', 'two']
>>> chain["one"] = 100
>>> chain["five"] = 500
>>> chain["six"] = 600
>>> list(chain.items())
[('five', 500), ('one', 100), ('three', 3), ('four', 4), ('six', 600), ('two', 2)]
>>> a
{'five': 500, 'six': 600, 'one': 100, 'two': 2}
>>> b
{'three': 3, 'four': 4}
>>> c
{'five': 5}
There are a whole host of enhancements to the os
, shutil
and signal
modules in this release which are covered below. I’ve tried to be brief, but include enough useful details for anyone who’s interested but not immediately familiar.
os.pipe2()
Added: the new pipe2()
call is now available. This allows flags to be set on the file descriptors thus created atomically at creation. The O_NONBLOCK
flag might seem the most useful, although it’s for O_CLOEXEC
(close-on-exec
) where the atomicity is really essential. If you open a pipe and then try to set O_CLOEXEC
separately, it’s possible for a different thread to call fork()
and execve()
between these two, thus leaving the file descriptor open in the resultant new process (which is exactly what O_CLOEXEC
is meant to avoid).

os.sendfile()
Added: the sendfile()
system call is now also available. This allows a specified number of bytes to be copied directly between two file descriptors entirely within the kernel, which avoids the overheads of a copy to and from userspace that read()
and write()
would incur. This is useful for, say, static file HTTP daemons.

os.get_terminal_size()
Added: this takes a file descriptor, sys.stdout
by default, to obtain the window size of the attached terminal. On Unix systems (at least) it probably uses the TIOCGWINSZ
command with ioctl()
, so if the file descriptor isn’t attached to a terminal I’d expect you’d get an OSError
due to inappropriate ioctl()
for the device. There’s a higher-level shutil.get_terminal_size()
discussed below which handles these errors, so it’s probably best to use that in most cases.Bugs and security vulnerabilities can result from the use of symlinks in the filesystem if you implement the pattern of first obtaining a target filename, and then opening it in a different step. This is because the target of the symlink may be changed, either accidentally or maliciously, in the meantime. To avoid this, various os
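As a quick sketch of the pipe2() case (Unix-only), both flags are applied atomically at creation:

```python
import os

# Both ends are created non-blocking and close-on-exec in a single call,
# with no window in which another thread could fork() and leak them.
read_end, write_end = os.pipe2(os.O_NONBLOCK | os.O_CLOEXEC)

os.write(write_end, b"ping")
print(os.read(read_end, 4))   # b'ping'

os.close(read_end)
os.close(write_end)
```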
functions have been enhanced to deal with file descriptors instead of filenames, which avoids this issue. This also offers improved performance.
Firstly, there’s a new os.fwalk()
function which is the same as os.walk()
except that it takes a directory file descriptor as a parameter, with the dir_fd
parameter, and instead of the 3-tuple return it returns a 4-tuple of (dirpath, dirnames, filenames, dir_fd)
. Secondly, many functions now support accepting a dir_fd
parameter, and any path names specified should be relative to that directory (e.g. access()
, chmod()
, stat()
). This is not available on all platforms, and attempting to use it when not available will raise NotImplementedError
. To check support, os.supports_dir_fd
is a set
of the functions that support it on the current platform.
Thirdly, many of these functions also now support a follow_symlinks
parameter which, if False
, means they’ll operate on the symlink itself as opposed to the target of the symlink. Once again, this isn’t always available you risk getting NotImplementedError
if you don’t check the function is in os.supports_follows_symlinks
.
Finally, some functions now also support passing a file descriptor instead of a path (e.g. chdir()
, chown()
, stat()
). Support is optional for this as well and you should check your functions are in os.supports_fd
.
os.access()
With Effective IDseffective_ids
parameter which, if True
, checks access using the effective UID/GID as opposed to the real identifiers. This is platform-dependent, check os.supports_effective_ids
, which once again is a set()
of methods.os.getpriority()
& os.setpriority()
os.nice()
but for other processes too.os.replace()
Addedos.rename()
is to overwrite the destination on POSIX platforms, but raises an error on Windows. Now there’s os.replace()
which does the same thing but always overwrites the destination on all platforms.os.stat()
, os.fstat()
and os.lstat()
now support reading timestamps with nanosecond precision, where available on the platform. The os.utime()
function supports updating nanosecond timestamps.
, os.listxattr()
, os.removexattr()
and os.setxattr()
. These are key/value pairs that can be associated with files to attach metadata for multiple purposes, such as supporting Access Control Lists (ACLs). Support for these is platform-dependent, not just on the OS but potentially on the underlying filesystem in use as well (although most of the Linux ones seem to support them).os
module now allows access to the sched_*()
family of functions which control CPU scheduling by the OS. You can find more details on the sched
man page.Support for some additional POSIX filesystem and other operations was added in this release:
lockf()
applies, tests or removes POSIX filesystem locks from a file.pread()
and pwrite()
read/write from a specified offset within a current file descriptor but without changing the current file descriptor offset.readv()
and writev()
provide scatter/gather read/write, where a single file can be read into, or written from, multiple separate buffers on the application side.truncate()
] truncates or extends the specified path to be an exact size. If the existing file was larger, excess data is lost; if it was smaller, it’s padded with nul characters.posix_fadvise()
allows applications to declare an intention to use a specific access pattern on a file, to allow the filesystem to potentially make optimisations. This can be an intention for sequential access, random access, or an intention to read a particular block so it can be fetched into the cache.posix_fallocate()
reserves disk space for expansion of a particular file.sync()
flushes any filesystem caches to disk.waitid()
is a variant of waitpid()
which allows more control over which child process state changes to wait for.getgrouplist()
returns the list of group IDs to which the specified username belongs.os.times()
and os.uname()
Return Named Tuplestuple
return types, this allows results to be accessed by attribute name.os.lseek()
in Sparse Fileslseek()
now supports additional options for the whence
parameter, os.SEEK_HOLE
and os.SEEK_DATA
. These start at a specified offset and find the nearest location which either has data, or is a hole in the data. They’re only really useful in sparse files, because other files have contiguous data anyway.stat.filemode()
Addedos
module, but since the stat
module is a companion to os.stat()
I thought it most appropriate to cover here. An undocumented function tarfile.filemode()
has exposed as stat.filemode()
, which convert a file mode such as 0o100755
into the string form -rwxr-xr-x
.shlex.quote()
Addedpipes
module, but it was previously undocumented. It escapes all characters in a string which might otherwise have special significance to a shell.shutil.disk_usage()
Addedos.statvfs()
, but this wrapper is more convenient and also works on Windows, which doesn’t provide statvfs()
.shutil.chown()
Now Accept Namesshutil.get_terminal_size()
AddedCOLUMNS
and LINES
are defined, they’re used. Otherwise, os.get_terminal_size()
(mentioned above) is called on sys.stdout
. If this fails for any reason, the fallback values passed as a parameter are returned — these default to 80x24 if not specified.shutil.copy2()
and shutil.copystat()
Improvementsshutil.move()
Symlinksmv
does, re-creating the symlink instead of copying the contents of the target file when copying across filesystems, as used to be the previous behaviour. Also now also returns the destination path for convenience.shutil.rmtree()
Securitydir_fd
in os.open()
and os.unlink()
, it’s now used by shutil.rmtree
to avoid symlink attacks.pthread_sigmask()
allows querying and update of the signal mask for the current thread. If you’re interested in more details of the interactions between threads and signals, I found this article had some useful examples.pthread_kill()
sends a signal to a specified thread ID.sigpending()
is for examining the signals which are currently pending on the current thread or the process as a whole.sigwait()
and sigwaitinfo()
both block until one of a set of signals becomes pending, with the latter returning more information about the signal which arrived.sigtimedwait()
is the same as sigwaitinfo()
except that it only waits for a specified amount of time.signal.set_wakeup_fd()
to allow signals to wake up code waiting on file IO events (e.g. using the select
module), the signal number is now written as the byte into this FD, whereas previously simply a nul byte was written regardless of which signal arrived. This allows the handler of that polling loop to determine which signal arrived, if multiple are being waited on.OSError
Replaces RuntimeError
in signal
signal.signal()
and signal.siginterrupt()
, they now raise OSError
with an errno
attribute, as opposed to a simple RuntimeError
previously.subprocess
Commands Can Be bytes
subprocess.DEVNULL
AddedSeveral of the objects in threading
used to be factory functions returning instances, but are now real classes and hence may be subclassed. This change includes:
threading.Condition
threading.Semaphore
threading.BoundedSemaphore
threading.Event
threading.Timer
threading.Thread
Constructor Accepts daemon
daemon
keyword parameter has been added to the threading.Thread
constructor to override the default behaviour of inheriting this from the parent thread.threading.get_ident()
Exposed_thread.get_ident()
is now exposed as a supported function threading.get_ident()
, which returns the thread ID of the current thread.The time
module has several new functions which are useful. The first three of these are new clocks with different properties:
time.monotonic()
time.perf_counter()
time.monotonic()
but has the higest available resolution on the platform.time.process.time()
time.get_clock_info()
This function returns details about the specified clock, which could be any of the options above (passed as a string) or "time"
for the details of the time.time()
standard system clock. The result is an object which has the following attributes:
adjustable
is True
if the clock may be changed by something external to the process (e.g. a system administrator or an NTP daemon).implementation
is the name of the underlying C function called to provide the timer value.monotonic
is True
if the clock is guaranteed to never go backwards.resolution
is the resolution of the clock in fractional seconds.The time
module also has also exposed the following underlying system calls to query the status of various system clocks:
clock_getres()
returns the resolution of the specified clock, in fractional seconds.clock_gettime()
returns the current time of the specified clock, in fractional seconds.clock_settime()
sets the time on the specified clock, if the process has appropriate privileges. The only clock for which that’s supported currently is CLOCK_REALTIME
.The clocks which can be specified in this release are:
time.CLOCK_REALTIME
is the standard system clock.time.CLOCK_MONOTONIC
is a monotonically increasing clock since some unspecified reference point.time.CLOCK_MONOTONIC_RAW
provides access to the raw hardware timer that’s not subject to adjustments.time.CLOCK_PROCESS_CPUTIME_ID
counts CPU time on a per-process basis.time.CLOCK_THREAD_CPUTIME_ID
counts CPU time on a per-thread basis.time.CLOCK_HIGHRES
is a higher-resolution clock only available on Solaris.This is a feature which is probably only of interest to a particular set of package maintainers, so I’m going to do my best not to drill into too much detail. However, there’s a certain level of context required for this to make sense — you can always skip to the next section if it gets too dull!
First I should touch on what’s a namespace package in the first place. If you’re a Python programmer, you’ll probably be aware that the basic unit of code reusability is the module1. Modules can be imported individually, but they can also be collected into packages, which can contain modules or other packages. In its simplest forms, a module is a single .py
file and a package is a directory which contains a file called __init__.py
. The contents of this script are executed when the package is imported, but the very fact of the file's existence is what tags it as a package to Python, even if the file is empty.
So now we come to what on earth is a namespace package. Simply put, this is a logical package which presents a uniform name to be imported within Python code, but is physically split across multiple directories. For example, you may want to create a machinelearning
package, which itself contains other packages like dimensionreduction
, anomolydetection
and clustering
. For such a large domain, however, each of those packages is likely to consist of its own modules and subpackages, and have its own team of maintainers, and coordinating some common release strategy and packaging system across all those teams and repositories is going to be really painful. What you really want to do is have each team package and ship its own code independently, but still have them presented to the programmer as a uniform package. This would be a namespace package.
Python already had two approaches for doing this, one provided by setuptools
and later another one provided by the pkgutil
module in the standard library. Both of these rely on the namespace package providing some respective boilerplate __init__.py
files to declare it as a namespace package. These are shown below for reference, but I’m not going to discuss them further because this section is about the new approach.
# The setuptools approach involves calling a function in __init__.py,
# and also requires some changes in setup.py.
__import__('pkg_resources').declare_namespace(__name__)
# The pkgutil approach just has each package add its own directory to
# the __path__ attribute for the namespace package, which defines the
# list of directories to search for modules and subpackages. This is
# more or less equivalent to a script modifying sys.path, but more
# carefully scoped to impact only the package in question.
__path__ = __import__('pkgutil').extend_path(__path__, __name__)
Both of these approaches share some issues, however. One of them is that when OS package maintainers (e.g. for Linux distributions) want somewhere to install these different things, they’d probably like to choose the same place, to keep things tidy. But this means all those packages are going to try and install an __init__.py
file over the top of each other, which makes things tricky — the OS packaging system doesn’t know these files necessarily contain the same things and will generate all sorts of complaints about the conflict.
The new approach, therefore, is to make these packages implicit, where there’s no need for an __init__.py
. You can just chuck some modules and/or sub-packages into a directory which is a subdirectory of something on sys.path
and Python will treat that as a package and make the contents available. This is discussed in much more detail in PEP 420.
Beyond these rather niche use-cases of mega-packages, this feature seems like it should make life a little easier creating regular packages. After all, it’s quite common that you don’t really need any setup code in __init__.py
, and creating that empty file just feels messy. So if we no longer need it these days, why bother?
Well, as a result of this change it’s true that regular packages can be created without the need for __init__.py
, but the old approach is still the correct way to create a regular package, and has some advantages. The primary one is that omitting __init__.py
is likely to break existing tools which attempt to search for code, such as unittest
, pytest
and mypy
to name just a few. It’s also noteworthy that if you rely on namespace packages and then someone adds something to your namespace which contains an __init__.py
, this ends the search process for the package in question since Python assumes this is a regular package. This means all your other implicit namespace packages will be suddenly hidden when the clashing regular package is installed. Using __init__.py
consistently everywhere avoids this problem.
Furthermore, regular packages can be imported as soon as they're located on the path, but for namespace packages the entire path must be fully processed before the package can be created. The path entries must also be recalculated on every import, for example in case the user has added additional entries to sys.path
which would contribute additional content to an existing namespace package. These factors can introduce performance issues when importing namespace packages.
There are also some more minor factors which favour regular packages, which I'm including below for completeness but which I doubt will be particularly compelling for many people.
Namespace packages have no __file__ attribute, and their __path__ attribute is read-only. This isn't likely to be a major issue for anyone, unless you have some grotty code which is trying to calculate paths relative to the source files in the package or similar.
The setuptools.find_packages() function won't find these new-style namespace packages, although there is now a setuptools.find_namespace_packages() function which will, so it should be a fairly simple matter to modify setup.py appropriately.
As a final note, if you are having any issues with imports, I strongly recommend checking out Nick Coghlan's excellent article Traps for the Unwary in Python's Import System, which discusses some of the most common problems you might run into.
There are a set of small but useful changes in some of the builtins that are worth noting.
open() Opener
You can now pass an opener for open() calls, which is a callable invoked with arguments (filename, flags) and expected to return an open file descriptor, as os.open() would. This can be used to, for example, pass flags which aren't supported by open(), but still benefit from the context manager behaviour offered by open().

open() Exclusively
An x mode was added for exclusive creation, failing if the file already exists. This is equivalent to the O_EXCL flag to open() on POSIX systems.

print() Flushing
print() now has a flush keyword argument which, if set to True, flushes the output stream immediately after the output.

hash() Randomization
Hash randomization is now enabled by default, so the hash() values of str and bytes objects vary between runs of the interpreter. This is a security measure against denial-of-service attacks which rely on constructing large numbers of keys that hash to the same value, but it does mean hash values can no longer be assumed to be stable across processes.

str.casefold()
str objects now have a casefold() method to return a casefolded version of the string. This is intended to be used for case-insensitive comparisons, and is a much more Unicode-friendly approach than calling upper() or lower(). A full discussion of why is outside the scope of this article, but I suggest the excellent article Truths Programmers Should Know About Case by James Bennett for an informative look at the complexities of case outside of Latin-1 languages. Spoiler: it's harder than you think, which should always be your default assumption for any I18n issues2.

copy() and clear()
There are new copy() and clear() methods on both list and bytearray objects, with the obvious semantics.

range Equality
It's now possible to compare range objects based on equality of the generated values. For example, range(3, 10, 3) == range(3, 12, 3). However, bear in mind that this only applies between range objects, so range(3) != [0, 1, 2]. Also, applying transformations such as reversed seems to defeat these comparisons.

dict.setdefault() enhancement
Previously dict.setdefault() resulted in two hash lookups, one to check for an existing item and one for the insertion. Since a hash lookup can call into arbitrary Python code this meant that the operation was potentially non-atomic. This has been fixed in Python 3.3 to only perform the lookup once.

bytes Methods Taking int
The count(), find(), rfind(), index() and rindex() methods of bytes and bytearray objects now accept an integer in the range 0-255 to specify a single byte value.

memoryview changes
The memoryview class has a new implementation which fixes several previous ownership and lifetime issues which had led to crash reports. This release also adds a number of features, such as better support for multi-dimensional lists and more flexible slicing.

There were some other additional and improved modules which I'll outline briefly below.
bz2 Rewritten
The bz2 module has been completely rewritten, adding several new features:
There's a new bz2.open() function, which supports opening files in binary mode (where it operates just like the bz2.BZ2File constructor) or text mode (where it applies an io.TextIOWrapper).
Arbitrary file-like objects can now be wrapped by bz2.BZ2File using the fileobj parameter.
The io.BufferedIOBase interface is now implemented by bz2.BZ2File, except for detach() and truncate().

collections.abc
The abstract base classes have been moved to a new collections.abc module. Aliases still exist at the top level, however, to preserve backwards compatibility.

crypt.mksalt()
There's a new crypt.mksalt() function to create the 2-character salt used by Unix passwords.

datetime Improvements
There are a few enhancements to the ever-useful datetime library.
Equality comparisons between naive and aware datetime objects used to raise TypeError, but it was decided this was inconsistent with the behaviour of other incomparable types. As of Python 3.3 this will simply return False instead. Note that other comparisons will still raise TypeError, however.
There's a new datetime.timestamp() method to return an epoch timestamp representation. This is implicitly in UTC, so timezone-aware datetimes will be converted and naive datetimes will be assumed to be in the local timezone and converted using the platform's mktime().
datetime.strftime() now supports years prior to 1000 CE.
datetime.astimezone() now assumes the system time zone if no parameters are passed.
decimal Rewritten in C
This release features a rewrite of the decimal module using the high-performance libmpdec library. There are some API changes as a result which I'm not going to go into here as I think most of them only impact edge cases.

functools.lru_cache() Type Segregation
Python 3.2 added the functools.lru_cache() decorator for caching function results based on the parameters. This caching was based on checking the full set of arguments for equality with previous ones specified, and if they all compared equal then the cached result would be returned instead of calling the function. In this release, there's a new typed parameter which, if True, also enforces that the arguments are of the same type to trigger the caching behaviour. For example, calling a function with 3 and then 3.0 would return the cached value with typed=False (the default) but would call the function twice with typed=True.

importlib
importlib.__import__ is now used directly by __import__(). A number of other changes have had to happen behind the scenes to make this happen, but now it means that the import machinery is fully exposed as part of importlib, which is great for transparency and for any code which needs to find and import modules programmatically. I considered this a little niche to cover in detail, but the release notes have some good discussion on it.
io.TextIOWrapper Buffering Optional
io.TextIOWrapper has a new write_through optional argument. If set to True, write() calls are guaranteed not to be buffered but will be immediately passed to the underlying binary buffer.

itertools.accumulate() Supports Custom Function
itertools.accumulate() now accepts an optional func argument to specify the operation to apply instead of always summing the values. For example, func=operator.mul would give a running product of values.

logging.basicConfig() Supports Handlers
There's a new handlers parameter on logging.basicConfig() which takes an iterable of handlers to be added to the root logger. This is probably handy for those scripts that are just large enough to be worth using logging, particularly if you consider the code might one day form the basis of a reusable module, but which aren't big enough to mess around setting up a logging configuration file.

lzma Added
There's a new lzma module providing compression and decompression using the LZMA algorithm, as used by the xz utility. This library supports the .xz file format, and also the .lzma legacy format used by earlier versions of this utility.

math.log2() Added
Although equivalent to math.log(x, 2), this will often be faster and/or more accurate than the existing approach, which involves the usual division of logs to convert the base.

pickle Dispatch Tables
The pickle.Pickler class constructor now takes a dispatch_table parameter which allows the pickling functions to be customised on a per-type basis.
sched Improvements
The sched module, for generalised event scheduling, has had a variety of improvements made to it:
run() can now be passed blocking=False to execute pending events and then return without blocking. This widens the scope of applications which can use the module.
sched.scheduler can now be used safely in multithreaded environments.
The parameters of the sched.scheduler constructor now have sensible defaults.
The enter() and enterabs() methods now no longer require the argument parameter to be specified, and also support a kwargs parameter to pass values by keyword to the callback.

sys.implementation
There's a new sys.implementation attribute which holds information about the current implementation being used. A full list of the attributes is beyond the scope of this article, but as one example sys.implementation.version is a version tuple in the same format as sys.version_info. The former contains the implementation version whereas the latter specifies the Python language version implemented — for CPython the two will be the same, since this is the reference implementation, but for cases like PyPy the two will differ. PEP 421 has more details.
tarfile Supports LZMA
tarfile now supports LZMA compression, using the lzma module mentioned above.

textwrap Indent Function
The new indent() function allows a prefix to be added to every line in a given string. This functionality has been in the textwrap.TextWrapper class for some time, but is now exposed as its own function for convenience.

xml.etree.ElementTree C Extension
The xml.etree.ElementTree module now uses its C accelerator by default, so there's no longer any need to import xml.etree.cElementTree, although that module remains for backwards compatibility.

zlib EOF
The zlib module now has a zlib.Decompress.eof attribute which is True if the end of the stream has been reached. If this is False but there is no more data, it indicates that the compressed stream has been truncated.

As usual, there were some minor things that struck me as less critical, but I wanted to mention nonetheless.
Raw str literals are written r"..." and bytes literals b"...". Previously, combining the two required br"...", but as of Python 3.3 rb"..." will also work. Rejoice in the syntax errors thus avoided.
u"..." literals are once again supported for str objects. This has no semantic significance in Python 3 since it is the default.
There's a new launcher on Windows which will run .py files when double-clicked. It even checks the shebang line to determine the Python version to use, if multiple are available.
The dict implementation used for holding attributes of objects has been updated to allow it to share the memory used for the key strings between multiple instances of a class. This can save 10-20% on memory footprints on heavily object-oriented code, and increased locality also achieves some modest performance improvements of up to 10%. PEP 412 has the full details.
So that's Python 3.3, and what a lot there was in it! The yield from
support is handy, but really just a taster of proper coroutines that are coming in future releases with the async
keyword. The venv
module is a bit of a game-changer in my opinion, because now that everyone can simply rely on it being there we can do a lot better documenting and automating development and runtime setups of Python applications. Similarly the addition of unittest.mock
means everyone can use the powerful mocking features it provides to enhance unit tests without having to add to their project’s development-time dependencies. Testing is something where you want to lower the barrier to it as much as you can, to encourage everyone to use it freely.
The other thing that jumped out to me about this release in particular was the sheer breadth of new POSIX functions and other operating system functionality that are now exposed. It’s always a pet peeve of mine when my favourite system calls aren’t easily exposed in Python, so I love to see these sweeping improvements.
So all in all, no massive overhauls, but a huge array of useful features. What more could you ask from a point release?
This could be pure Python or an extension module in another language like C or C++, but that distinction isn't important for this discussion. ↩
Or if you really want the nitty gritty, feel free to peruse §3.13 of the Unicode standard. But if you do — and with sincere apologies to the authors of the Unicode standard who’ve forgotten more about international alphabets than I’ll ever know — my advice is to brew some strong coffee first. ↩
Well, since you asked that’s specifically RFC 2045, RFC 2046, RFC 2047, RFC 4288, RFC 4289 and RFC 2049. ↩
The default header_factory
is documented in the email.headerregistry
module. ↩
And let’s be honest, my “niche filter” is so close to the identity function that they could probably share a lawnmower. I tend to only miss out the things that apply to only around five people, three of whom don’t even use Python. ↩
However, since KEXTs have been replaced with system extensions more recently, which run in user-space rather than in the kernel, then I don’t know whether the PF_SYSTEM
protocols are going to remain relevant for very long. ↩
This is part 4 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.2.
The fourth Python 3.x release brought another slew of great new features. So many, in fact, that I’ve split this release into two articles, of which this is the first. Highlights in this part include yield from
expressions, mocking support in unittest
and virtualenv support in the standard library.
The next release in this sequence of articles is Python 3.3, which was released just over 19 months after version 3.2. This one was another packed release and it contained so many features I decided to split this into two articles. In this first one we’ll be covering yield from
expressions which allow generators to delegate to each other, support for mocking built in to the standard library, the venv
module, and a host of diagnostic improvements.
If you’ve been using Python for a decent length of time, you’re probably familiar with the virtualenv
tool written by prolific Python contributor Ian Bicking1 around thirteen years ago. This was the sort of utility that you instantly wonder how you managed without it before, and it’s become a really key development2 tool for many Python developers.
As an acknowledgement of its importance, the Python team pulled a subset of its functionality into the standard Python library as the new venv
module, and exposed a command-line interface with the pyvenv
script. This is fully detailed in PEP 405.
On the face of it, this might not seem to be all that important, since virtualenv
already exists and does a jolly good job all round. However, I think there are a whole host of benefits which make this strategically important. First and foremost, since it's part of the standard distribution, there's little chance that the core Python developers will make some change that renders it incompatible on any supported platform. It can also probably benefit from internal implementation details of Python on which an external project couldn't safely rely, which may enable greater performance and/or reliability.
Secondly, the fact that it's installed by default means that project maintainers have a baseline option they can count on, for installation or setup scripts, or just for documentation. This will no doubt cut down on support queries from inexperienced users who wonder why this virtualenv
command isn’t working.
Thirdly, this acts as a defence against the forking of the project, which is always a background concern with open source. It's not uncommon for one popular project to be forked and taken in two divergent directions, and then suddenly project maintainers and users alike need to worry about which one they're going with, the support efforts of communities are split, and all sorts of other annoyances. Having support in the standard library means there's an option that can be expected to work in all cases.
In any case, regardless of whether you feel this is an important feature or just a minor tweak, it’s at least handy to have venv
always available on any platform where Python is installed.
As an aside, if you’re curious about how virtualenv
works then Carl Meyer presented an interesting talk on the subject, of which you can find the video and slides online.
I actually already discussed this topic fairly well in my first article in my series on coroutines in Python a few years ago. But to save you the trouble of reading all that, or the gory details in PEP 380, I’ll briefly cover it here.
This is a fairly straightforward enhancement for generators to yield control to each other, which is performed using the new yield from
statement. It’s perhaps best explained with a simple example:
>>> def fun2():
... yield from range(10)
... yield from range(30, 20, -2)
...
>>> list(fun2())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 28, 26, 24, 22]
On the face of it this is just a simple shorthand for for i in iter: yield i
. However, there’s rather more to it when you consider the coroutine-like features that generators have where you can pass values into them, since these values also need to be routed directly to the delegate generator as well as the yielded values being routed back out.
There’s also an enhancement to generator return values. Previously the use of return
within a generator was simply a way to terminate the generator, raising StopIteration
, and it was a syntax error to provide an argument to the return
statement. As of Python 3.3, however, this has been relaxed and a value is permitted. The value is returned to the caller by attaching it to the StopIteration
exception, but where yield from
is used then this becomes the value to which the yield from
expression evaluates.
This may seem a bit abstract and hard to grasp, so I've included an example of using these features for parsing HTTP chunk-encoded bodies. This is a format used for HTTP responses if the sender doesn't know the size of the response up front, where the data is split into chunks of a known size and the length of a chunk is sent first followed by the data. This means the sender can keep transmitting data until it's exhausted, and the reader can be processing it in parallel. The end of the data is indicated by an empty chunk.
This sort of message-based interpretation of data from a byte stream is always a little fiddly. It's most efficient to read in large chunks from the socket, and in the case of a chunk header you don't know how many bytes it's going to be anyway, since the length is a variable number of digits. As a result, by the time you've read the data you need, the chances are your buffer already contains some of the next piece of data. If you want to structure your code well and split parsing the various pieces up into multiple functions, as the single responsibility principle suggests, then this means you've always got this odd bit of "overflow" data as the initial set to parse before reading more from the data source.
There’s also the aspect that it’s nice to decouple the parsing from the data source. For example, although you’d expect a HTTP response to generally come in from a socket object, there’ll always be someone who already has it in a string form and still wants to parse it — so why force them to jump through some hoops making their string look like a file object again, when you could just structure your code a little more elegantly to decouple the parsing and I/O?
For all of the above reasons, I think that generators make a fairly elegant solution to this issue. Take a look at the code below and then I’ll explain why it works and why I think this is potentially a useful approach.
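What follows is a minimal sketch of the approach rather than a complete implementation: it assumes well-formed input, ignores chunk extensions beyond discarding them, leaves any trailers in the surplus data, and the decode_chunked() driver at the end is just an illustrative way to feed data in.

def content_length_decoder(length, initial=b""):
    # Decode exactly `length` bytes. Each yield emits decoded data, and the
    # value sent back in reply is the next piece of raw input. Surplus input
    # is handed back as the generator's return value, carried on the
    # StopIteration exception (and hence the value of a yield from).
    buffered = initial
    remaining = length
    while remaining > 0:
        if not buffered:
            buffered = yield b""                 # nothing decoded yet, need input
            continue
        output, buffered = buffered[:remaining], buffered[remaining:]
        remaining -= len(output)
        if remaining > 0:
            buffered = yield output              # emit data, take more input
        else:
            extra = yield output                 # emit the final data for this section
            buffered += extra or b""             # input sent now belongs to the caller
    return buffered

def chunked_decoder():
    # Decode an HTTP chunked body: "<hex size>\r\n<data>\r\n" repeated, ended
    # by a zero-sized chunk. Yields decoded data, returns surplus input.
    buffered = b""
    while True:
        while b"\r\n" not in buffered:           # accumulate a complete size line
            buffered += (yield b"") or b""
        line, _, buffered = buffered.partition(b"\r\n")
        chunk_len = int(line.split(b";")[0], 16)
        if chunk_len == 0:
            return buffered                      # empty chunk terminates the body
        # Delegate the chunk data to the reusable decoder above.
        buffered = yield from content_length_decoder(chunk_len, buffered)
        while len(buffered) < 2:                 # strip the trailing CRLF
            buffered += (yield b"") or b""
        buffered = buffered[2:]

def decode_chunked(pieces):
    # Simple driver: feed each piece of raw input in via send() and collect
    # decoded output, then keep pumping empty input to drain the decoder.
    decoder = chunked_decoder()
    output = [next(decoder)]
    try:
        for piece in pieces:
            output.append(decoder.send(piece))
        while True:
            output.append(decoder.send(b""))
    except StopIteration:
        pass
    return b"".join(output)

print(decode_chunked([b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"]))   # b'hello world'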
The general idea here is that each generator parses data which is passed to it via its send()
method. It processes input until its section is done, and then it returns control to the caller. Ultimately decoded data is yielded from the generators, and each one returns any unparsed input data via its StopIteration
exception.
In the example above you can see how this allows content_length_decoder()
to be factored out from chunked_decoder()
and used to decode each chunk. This refactoring would allow a more complete implementation to reuse this same generator to decode bodies which have a Content-Length
header instead of being sent in chunked encoding. Without yield from
this delegation wouldn’t be possible unless orchestrated by the top-level code outside of the generators, and that breaks the abstraction.
This is just one example of using generators in this fashion which sprung to mind, and I’m sure there are better ones, but hopefully it illustrates some of the potential. Of course, there are more developments on coroutines in future versions of Python 3 which I’ll be looking at later in this series, or if you can’t wait then you can take a read through my earlier series of articles specifically on the topic of coroutines.
The major change in Unit Testing in Python 3.3 is that the mocking library has been merged into the standard library as unittest.mock
. A full overview of this library is way beyond the scope of this article, so I’ll briefly touch on the highlights with some simple examples.
The core classes are Mock
and MagicMock
, where MagicMock
is a variation which has some additional behaviours around Python’s magic methods4. These classes will accept requests for any attribute or method call, and create a mock object to track accesses to them. Afterwards, your unit test can make assertions about which methods were called by the code under test, including which parameters were passed to them.
One aspect that’s perhaps not immediately obvious is that these two objects represet more or less any object, such as functions or classes. For example, if you create a Mock
instance which represents a class and then access a method on it, a child Mock
object represents that method. This is possible in Python since everything comes down to attribute access at the end of the day — it just happens that calling a method queries an attribute __call__
on the object. Python’s duck-typing approach means that it doesn’t care whether it’s a genuine function that’s being called, or an object which implements __call__
such as Mock
.
Here’s a short snippet which shows that without any configuration, a Mock
object can be used to track calls to methods:
>>> from unittest import mock
>>> m = mock.Mock()
>>> m.any_method()
<Mock name='mock.any_method()' id='4352388752'>
>>> m.mock_calls
[call.any_method()]
>>> m.another_method(123, "hello")
<Mock name='mock.another_method()' id='4352401552'>
>>> m.mock_calls
[call.any_method(), call.another_method(123, 'hello')]
Here I’m using the mock_calls
attribute, which tracks the calls made, but there are also a number of assert_X()
methods which are probably more useful in the context of a unit test. They work in a very similar way to the existing assertions in unittest
.
This is great for methods with no return type and are side-effect free, but what about implementing those behaviours? Well, that’s pretty straightforward once you understand the basic structure. Let’s say you have a class and you want to add a method with a side-effect, you just create a new Mock
object and assign that as an attribute with the name of the method to the mock that’s representing your object instance. Then you create some function which implements whatever side-effects you require, and you assign that to the special side_effect
attribute of the Mock
representing your method. And then you’re done:
>>> m = mock.Mock()
>>> m.mocked_method = mock.Mock()
>>> def mocked_method_side_effect(arg):
... print("Called with " + repr(arg))
... return arg * 2
...
>>> m.mocked_method.side_effect = mocked_method_side_effect
>>> m.mocked_method(123)
Called with 123
246
Finally, as an illustration of the MagicMock
class, you can see from the snippet below that the standard Mock
object refuses to auto-create magic methods, but MagicMock
implements them in the same way. You can add side-effects and return values to these in the same way as any normal methods.
>>> m = mock.Mock()
>>> len(m)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type 'Mock' has no len()
>>> mm = mock.MagicMock()
>>> len(mm)
0
>>> mm[123]
<MagicMock name='mock.__getitem__()' id='4352367376'>
>>> mm.mock_calls
[call.__len__(), call.__getitem__(123)]
>>> mm.__len__.mock_calls
[call()]
>>> mm.__getitem__.mock_calls
[call(123)]
That covers the basics of creating mocks, but how about injecting them into your code under test? Well, of course sometimes you can do that yourself by passing in a mock object directly. But often you’ll need to change one of the dependencies of the code. To do this, you can use mock.patch
as a decorator around your test methods to overwrite one or more dependencies with mocks. In the example below, the time.time()
function is replaced by a MagicMock
instance, and the return_value
attribute is used to control the time reported to the code under test.
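Here's a minimal sketch of that pattern — the Timer class is just an invented stand-in for real code under test:

import time
import unittest
from unittest import mock

class Timer:
    """Trivial class under test which reports elapsed wall-clock time."""
    def __init__(self):
        self.start = time.time()
    def elapsed(self):
        return time.time() - self.start

class TestTimer(unittest.TestCase):
    @mock.patch("time.time")
    def test_elapsed(self, mock_time):
        # Control the clock: the call in __init__ sees 100.0, the call
        # in elapsed() sees 105.0.
        mock_time.return_value = 100.0
        timer = Timer()
        mock_time.return_value = 105.0
        self.assertEqual(timer.elapsed(), 5.0)

if __name__ == "__main__":
    unittest.main()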
So that’s it for my whirlwind tour of mocking. There’s a lot more to it than I’ve covered, of course, so do take the time to read through the full documentation.
There are a few changes which are helpful for exception handling and introspection.
The situation around catching errors in Operating System operations has always been a bit of a mess with too many exceptions covering what are very similar operations at their heart. This can cause all sorts of annoying bugs in error handling if you try to catch the wrong exception.
For example, if you fail to os.remove()
a file you get an OSError
but if you fail to open()
it you get an IOError
. So that’s two exceptions for I/O operations right there, but if you happen to be using sockets then you need to also worry about socket.error
. If you’re using select
you might get select.error
, but equally you might get any of the above as well.
The upshot of all this is that for any block of code that does a bunch of I/O you end up having to either catch Exception
, which can hide other bugs, or catch all of the above individually.
Thankfully in Python 3.3 this situation has been resolved, since these have all been collapsed into OSError
as per PEP 3151. The full list that’s been rolled into this is:
OSError
IOError
EnvironmentError
WindowsError
mmap.error
socket.error
select.error
Never fear for your existing code, however, because the old names have all been maintained as aliases for OSError
.
As well as this, however, there’s another change that’s even handier. Often you need to only catch some subset of errors and allow others to pass on as true error conditions. A common example of this is where you’re doing non-blocking operations, or you’ve specified some sort of timeout, and you want to ignore those cases but still catch other errors. In these cases, you often find yourself branching on errno
like this:
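Something along these lines, sketched here with a non-blocking socket read (the specific errno values are just illustrative):

import errno
import socket

def read_nonblocking(sock):
    """The old idiom: catch a broad exception and branch on errno by hand."""
    try:
        return sock.recv(4096)
    except socket.error as exc:
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return b""          # no data available right now, not an error
        elif exc.errno == errno.ECONNRESET:
            return None         # peer went away, treat as end of stream
        else:
            raise               # a real error, let it propagate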
It’s not terrible, but breaks the usual idiom of each error being its own exception, and make things just that bit harder to read.
Python 3.3 to the rescue! New exception types have been added which are derivations of OSError
and correspond to the more common of these error cases, so they can be caught more gracefully. The new exceptions and the equivalent errno
codes are:
New Exception | Errno code(s) |
---|---|
BlockingIOError |
EAGAIN , EALREADY , EWOULDBLOCK , EINPROGRESS |
ChildProcessError |
ECHILD |
FileExistsError |
EEXIST |
FileNotFoundError |
ENOENT |
InterruptedError |
EINTR |
IsADirectoryError |
EISDIR |
NotADirectoryError |
ENOTDIR |
PermissionError |
EACCES , EPERM |
ProcessLookupError |
ESRCH |
TimeoutError |
ETIMEDOUT |
ConnectionError |
A base class for the remaining exceptions… |
… BrokenPipeError |
EPIPE , ESHUTDOWN |
… ConnectionAbortedError |
ECONNABORTED |
… ConnectionRefusedError |
ECONNREFUSED |
… ConnectionResetError |
ECONNRESET |
The BlockingIOError
exception also has a handy characters_written
attribute, when using buffered I/O classes. This indicates how many characters were written before the filehandle became blocked.
To finish off this section, here's a small example of how this might make code more readable. Take this code to handle a set of different errors which can occur when opening and attempting to read a particular filename:
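A sketch of the kind of code I mean — the particular errno values handled are just illustrative:

import errno

def show_file(filename):
    # The old way: catch broad exceptions and branch on errno by hand.
    try:
        with open(filename) as fd:
            print(fd.read())
    except (IOError, OSError) as exc:
        if exc.errno == errno.ENOENT:
            print("No such file: " + filename)
        elif exc.errno in (errno.EACCES, errno.EPERM):
            print("Permission denied: " + filename)
        elif exc.errno == errno.EISDIR:
            print(filename + " is a directory")
        else:
            print("Failed to read " + filename + ": " + str(exc))
    except Exception as exc:
        print("Failed to read " + filename + ": " + str(exc))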
Particularly unpleasant here is the code duplication between handling unmatched errno
codes and random other exceptions — although that’s just the duplication of a print()
in this example, in reality that could become significant code duplication. With the new exceptions introduced in Python 3.3, however, this is all significantly cleaner:
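Something like this sketch, handling the same cases with the new exception types:

def show_file(filename):
    # The same logic using the OSError subclasses from PEP 3151.
    try:
        with open(filename) as fd:
            print(fd.read())
    except FileNotFoundError:
        print("No such file: " + filename)
    except PermissionError:
        print("Permission denied: " + filename)
    except IsADirectoryError:
        print(filename + " is a directory")
    except OSError as exc:
        print("Failed to read " + filename + ": " + str(exc))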
As we covered in the first post in this series, exceptions in Python 3 can be chained. When they are chained, the default traceback is updated to show this context, and earlier exceptions can be recovered from attributes of the latest.
You might also recall that it’s possible to explicitly chain exceptions with the syntax raise NewException() from exc
. This sets the __cause__
attribute of the exception, as opposed to the __context__
attribute which records the original exception being handled if this one was raised within an existing exception handling block.
Well, Python 3.3 adds a new variant to this which can be used to suppress the display of any exceptions from __context__
, which is raise NewException() from None
. You can see an example of this behaviour below, which you can compare to the same example in the first post:
>>> try:
... raise Exception("one")
... except Exception as exc1:
... try:
... raise Exception("two")
... except Exception as exc2:
... raise Exception("three") from None
...
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
Exception: three
The mechanics of how this is implemented could be a little confusing because they were changed after the feature was first implemented. The original PEP 409 specified the default value of __cause__
to be Ellipsis
, which was a pretty arbitrary choice as a singleton which wasn’t an exception, so it couldn’t be confused with a real cause; and wasn’t None
, so later code could detect if it had been explicitly set to None
via the raise Exception() from None
idiom.
It was later decided that this was overloading the purpose of __cause__
in an inelegant fashion, however, so PEP 415 was implemented which made no change to the language features introduced by PEP 409, but changed the implementation. The rather hacky use of Ellipsis
was removed and a new __suppress_context__
attribute was added. The semantics are that whenever __cause__
is set (typically with raise X from Y
), __suppress_context__
is flipped to true. This applies when you set __cause__
to another exception, in which case the presumption is that it’s more useful to show than __context__
since it’s by explicit programmer choice; or using the raise X from None
idiom, which is just the language syntax for setting __suppress_context__
without changing __cause__
. Note that regardless of the value of __suppress_context__
, the contents of the __context__
attribute are still available, and any code you write in your own exception handler is, of course, not obliged to respect __suppress_context__
.
I must admit, I’m struggling to think of cases where the detail of that change would make a big difference to code your write. However, I’ve learned over the years that exception handling is one of those areas of the code you tend to test less thoroughly, and those areas are exactly where it’s helpful to have a knowledge of the details since it’s that much more likely you’ll find bugs here by code inspection rather than testing.
Since time immemorial functions and classes have had a __name__
attribute. Well, it now has a little baby sibling, the __qualname__
attribute (PEP 3155) which indicates the full "path" of definition of this object, including any containing namespaces. The string representation has also been updated to use this new, longer, specification. The semantics are mostly fairly self-explanatory, I think, so probably best illustrated with an example:
>>> class One:
... class Two:
... def method(self):
... def inner():
... pass
... return inner
...
>>> One.__name__, One.__qualname__
('One', 'One')
>>> One.Two.__name__, One.Two.__qualname__
('Two', 'One.Two')
>>> One.Two.method.__name__, One.Two.method.__qualname__
('method', 'One.Two.method')
>>> inner = One.Two().method()
>>> inner.__name__, inner.__qualname__
('inner', 'One.Two.method.<locals>.inner')
>>> str(inner)
'<function One.Two.method.<locals>.inner at 0x10467b170>'
Also, there’s a new inspect.signature()
function for introspection of callables (PEP 362). This returns an inspect.Signature
instance which references other classes such as inspect.Parameter
and allows the signature of callables to be easily introspected in code. Again, an example is probably most helpful here to give you just a flavour of what's exposed:
>>> def myfunction(one: int, two: str = "hello", *args: str, keyword: int = None):
... print(one, two, args, keyword)
...
>>> myfunction(123, "monty", "python", "circus")
123 monty ('python', 'circus') None
>>> inspect.signature(myfunction)
<Signature (one: int, two: str = 'hello', *args: str, keyword: int = None)>
>>> inspect.signature(myfunction).parameters["keyword"]
<Parameter "keyword: int = None">
>>> inspect.signature(myfunction).parameters["keyword"].annotation
<class 'int'>
>>> repr(inspect.signature(myfunction).parameters["keyword"].default)
'None'
>>> print("\n".join(": ".join((name, repr(param._kind)))
for name, param in inspect.signature(myfunction).parameters.items()))
one: <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>
two: <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>
args: <_ParameterKind.VAR_POSITIONAL: 2>
keyword: <_ParameterKind.KEYWORD_ONLY: 3>
Finally, there’s also a new function inspect.getclosurevars()
which reports the names bound in a particular function:
>>> import inspect
>>> xxx = 999
>>> def outer():
... aaa = 100
... def middle():
... bbb = 200
... def inner():
... ccc = 300
... return aaa + bbb + ccc + xxx
... return inner
... return middle()
...
>>> inspect.getclosurevars(outer())
ClosureVars(nonlocals={'aaa': 100, 'bbb': 200}, globals={'xxx': 999}, builtins={}, unbound=set())
In a similar vein there’s also inspect.getgeneratorlocals()
which dumps the current internal state of a generator. This could be very useful for diagnosing bugs in the context of the caller, particularly if you don’t own the code implementing the generator and so can’t easily add logging statements or similar:
>>> def generator(maxvalue):
... cumulative = 0
... for i in range(maxvalue):
... cumulative += i
... yield cumulative
...
>>> instance = generator(10)
>>> next(instance)
0
>>> next(instance)
1
>>> next(instance)
3
>>> next(instance)
6
>>> inspect.getgeneratorlocals(instance)
{'maxvalue': 10, 'cumulative': 6, 'i': 3}
There’s a new module in Python 3.3 called faulthandler
which is used to show a Python traceback on an event like a segmentation fault. This could be very useful when developing or using C extension modules which often fail in a crash, making it very hard to tell where the problem actually occurred. Of course, you can fire up a debugger and figure out the line of code if it’s your module, but if it’s someone else’s at least this will help you figure out whether the error lies in your code or not.
You can enable this support at runtime with faulthandler.enable()
, or you can pass -X faulthandler
to the interpreter on the command-line, or set the PYTHONFAULTHANDLER
environment variable. Note that this will install signal handlers for SIGSEGV
, SIGFPE
, SIGABRT
, SIGBUS
, and SIGILL
— if you’re using your own signal handlers for any of these, you’ll probably want to call faulthandler.enable()
first and then make sure you chain into the earlier handler from your own.
Here’s an example of it working — for the avoidance of doubt, I triggered the handler here myself by manually sending SIGSEGV
to the process:
>>> import faulthandler
>>> import time
>>> faulthandler.enable()
>>>
>>> def innerfunc():
... time.sleep(300)
...
>>> def outerfunc():
... innerfunc()
...
>>> outerfunc()
Fatal Python error: Segmentation fault
Current thread 0x000000011966bdc0 (most recent call first):
File "<stdin>", line 2 in innerfunc
File "<stdin>", line 2 in outerfunc
File "<stdin>", line 1 in <module>
[1] 16338 segmentation fault python3
There are a couple of modules which have added the ability to register callbacks for tracing purposes.
The gc
module now provides an attribute callbacks
which is a list of functions which will be called before and after each garbage collection pass. Each one has two parameters passed, the first is either "start"
or "stop"
to indicate whether this is before or after the collection pass, and the second is a dict
providing details of the results.
>>> import gc
>>> def func(*args):
... print("GC" + repr(args))
...
>>> gc.callbacks.append(func)
>>> class MyClass:
... def __init__(self, arg):
... self.arg = arg
... def __del__(self):
... pass
...
>>> x = MyClass(None)
>>> y = MyClass(x)
>>> z = MyClass(y)
>>> x.arg = z
>>> del x, y, z
>>> gc.collect()
GC('start', {'generation': 2, 'collected': 0, 'uncollectable': 0})
GC('stop', {'generation': 2, 'collected': 6, 'uncollectable': 0})
6
The sqlite3.Connection
class has a method set_trace_callback()
which can be used to register a callback function which will be called for every SQL statement that’s run by the backend, and it’s passed the statement as a string. Note this doesn’t just include statements passed to the execute()
method of a cursor, but may include statements that the Python module itself runs, e.g. for transaction management.
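As a quick illustration — printing each statement is obviously just for demonstration, but it makes the implicit transaction handling visible:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.set_trace_callback(lambda statement: print("SQL:", statement))

conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES (?)", ("alice",))
conn.commit()   # the transaction statements issued by the module also show up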
With apologies to those already familiar with Unicode, a brief history lesson: Unicode was originally conceived as a 16-bit character set, which was thought to be sufficient to encode all languages in active use around the world. In 1996, however, the Unicode 2.0 standard expanded this to add 16 additional 16-bit "planes" to the set, to include scope for all characters ever used by any culture in history, plus other assorted symbols. This made it effectively a 21-bit character set3. The initial 16-bit set became the Basic Multilingual Plane (BMP), and the next two planes the Supplementary Multilingual Plane and Supplementary Ideographic Plane respectively.
OK, Unicode history lesson over. So what's this got to do with Python? To understand that we need a brief Python history lesson. Python originally used 16-bit values for Unicode characters (i.e. UCS-2 encoding), which meant that it only supported characters in the BMP. In Python 2.2 support for "wide" builds was added, so by adding a particular configure flag when compiling the interpreter, it could be built to use UCS-4 instead. This had the advantage of allowing the full range of all Unicode planes, but at the expense of using 4 bytes for every character. Since most distributions would use the wide build, because they had to assume full Unicode support was necessary, this meant in Python 2.x unicode
objects consisting primarily of Latin-1 were four times larger than they needed to be.
This has been the case until Python 3.3, where the implementation of PEP 393 means that the concepts of narrow and wide builds has been removed and everyone can now take advantage of the ability to access all Unicode characters. This is done by deciding whether to use 1-, 2- or 4-byte characters at runtime based on the highest ordinal codepoint used in the string. So, pure ASCII or Latin-1 strings use 1-byte characters, strings composed entirely from within the BMP use 2-byte characters and if any other planes are used then 4-byte characters are used.
In the example below you can see this illustrated.
>>> # Standard ASCII has 1 byte per character plus 49 bytes overhead.
>>> sys.getsizeof("x" * 99)
148
>>> # Each new ASCII character adds 1 byte.
>>> sys.getsizeof("x" * 99 + "x")
149
>>> # Adding one BMP character expands the size of every character to
>>> # 2 bytes, plus 74 bytes overhead.
>>> sys.getsizeof("x" * 99 + "\N{bullet}")
274
>>> sys.getsizeof("x" * 99 + "\N{bullet}" + "x")
276
>>> # Moving beyond BMP expands the size of every character to 4 bytes,
>>> # plus 76 bytes overhead.
>>> sys.getsizeof("x" * 99 + "\N{taxi}")
476
>>> sys.getsizeof("x" * 99 + "\N{taxi}" + "x")
480
This basically offers the best of both worlds on the Python side. As well as reducing memory usage, this should also improve cache efficiency by putting values closer together in memory. In case you're wondering about the value of this, it's important to remember that part of the Supplementary Multilingual Plane is a funny little block called "Emoticons", and we all know you're not a proper application without putting "\N{face screaming in fear}"
in a few critical error logs here and there. Just be aware that you may be quadrupling the size of the string in memory by doing so.
On another Unicode related note, support for aliases has been added to the \N{...}
escape sequences. Some of these are abbreviations, such as \N{SHY}
for \N{SOFT HYPHEN}
, and some of them are previously used incorrect names for backwards compatibility where corrections have been made to the standard. In addition these aliases are also supported in unicodedata.lookup()
, and this additionally supports pre-defined sequences as well. An example of a sequence would be LATIN SMALL LETTER M WITH TILDE
which is equivalent to "m\N{COMBINING TILDE}"
. Here are some more examples:
>>> import unicodedata
>>> "\N{NBSP}" == "\N{NO-BREAK SPACE}" == "\u00A0"
True
>>> "\N{LATIN SMALL LETTER GHA}" == "\N{LATIN SMALL LETTER OI}"
True
>>> (unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
... == "\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}")
True
That’s it for this post, but we’re not done with Python 3.3 yet! Check out the following article for my tour of the remaining changes in this release, as well as some thoughts on the entire release.
As an unrelated aside, a few months ago (at time of writing!) Ian Bicking wrote a review of his main projects which makes for some interesting reading. ↩
And for some people a production release tool as well, although personally I think a slightly cleaner wrapper like shrinkwrap makes for a more supportable option. ↩
The ones named of the form __xxx__()
. ↩
This is part 3 of the “Python 2to3” series which started with Python 2to3: What’s New in 3.0. The previous article in the series was Python 2to3: What’s New in 3.1.
Another installment in my look at all the new features added to Python in each 3.x release, this one covering 3.2. There’s a lot covered including the argparse module, support for futures, changes to the GIL implementation, SNI support in SSL/TLS, and much more besides. This is my longest article ever by far! If you’re puzzled why I’m looking at releases that are years old, check out the first post in the series.
In this post I’m going to continue my examination of every Python 3.x release to date with a look at Python 3.2. I seem to remember this as a pretty big one, so there’s some possibility that this article will rival the first one in this series for length. In fact, it got so long that I also implemented “Table of Contents” support in my articles! So, grab yourself a coffee and snacks and let’s jump right in and see what hidden gems await us.
We kick off with one of my favourite Python modules, argparse
, defined in PEP 389. This is the latest in a series of modules for parsing command-line arguments, which is a topic close to my heart as I've written a lot of command-line utilities over the years. I spent a number of those years getting increasingly frustrated with the amount of boilerplate I needed to add every time for things like validating arguments and presenting help strings.
Python’s first attempt at this was the getopt
module, which was essentially just exposing the POSIX getopt()
function in Python, even offering a version that’s compatible with the GNU version. This works, and it’s handy for C programmers familiar with the API, but it makes you do most of the work of validation and such. The next option was optparse
, which did a lot more work for you and was very useful indeed.
Whilst optparse
did a lot of work of parsing options for you (e.g. --verbose
), it left any other arguments in the list for you to parse yourself. This was always slightly frustrating for me, because let’s say you expect the user to pass a list of integers, it seemed inconvenient to force them to use options for it just to take advantage of the parsing and validation the module offers. Also, more complex command-line applications like git
often have subcommands which are tedious to validate by hand as well.
The argparse
module is a replacement for optparse
which aims to address these limitations, and I think by this point we've got to something pretty comprehensive. Its usage is fairly similar to optparse
, but adds enough flexibility to parse all sorts of arguments. It also can validate the types of arguments, provide command-line help automatically and allow subcommands to be validated.
The variety of options this module provides is massive, so there's no way I'm going to attempt an exhaustive examination here. By way of illustration, I've implemented a very tiny subset of the git
command-line as a demonstration of how subcommands work:
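The listing boils down to a parser with a few global options and two subparsers; a sketch which produces help output like that shown below might look something like this (the version string and the handler lambdas are just placeholders):

import argparse

def make_parser():
    parser = argparse.ArgumentParser(prog="fakegit.py")
    parser.add_argument("--version", action="version", version="fakegit 0.1")
    parser.add_argument("-C", metavar="<path>", help="Run as if was started in PATH")
    parser.add_argument("-p", "--paginate", action="store_true",
                        help="Enable pagination of output")
    parser.add_argument("-P", "--no-pager", action="store_true",
                        help="Disable pagingation of output")

    subparsers = parser.add_subparsers(title="Subcommands",
                                       description="Valid subcommands",
                                       help="additional help")

    status = subparsers.add_parser("status", help="Show working tree status")
    status.add_argument("pathspec", metavar="<pathspec>", nargs="*",
                        help="One or more pathspecs to show")
    status.add_argument("-s", "--short", action="store_true", help="Use short format")
    status.add_argument("-z", action="store_true",
                        help="Terminate output lines with NUL instead of LF")
    status.set_defaults(handler=lambda args: print("status", args))

    log = subparsers.add_parser("log", help="Show commit logs")
    log.set_defaults(handler=lambda args: print("log", args))
    return parser

if __name__ == "__main__":
    args = make_parser().parse_args()
    if hasattr(args, "handler"):
        args.handler(args)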
You can see the command-line help generated by the class below. First up, the output of running fakegit.py --help
:
usage: fakegit.py [-h] [--version] [-C <path>] [-p] [-P] {status,log} ...
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-C <path> Run as if was started in PATH
-p, --paginate Enable pagination of output
-P, --no-pager Disable pagingation of output
Subcommands:
Valid subcommands
{status,log} additional help
status Show working tree status
log Show commit logs
The subcommands also support their own command-line help, such as fakegit.py status --help
:
usage: fakegit.py status [-h] [-s] [-z] [<pathspec> [<pathspec> ...]]
positional arguments:
<pathspec> One or more pathspecs to show
optional arguments:
-h, --help show this help message and exit
-s, --short Use short format
-z Terminate output lines with NUL instead of LF
The logging
module has acquired the ability to be configured by passing a dict
, as per PEP 391. Previously it could accept a config file in .ini
format as parsed by the configparser
module, but formats such as JSON and YAML are becoming more popular these days. To allow these to be used, logging
has allowed a dict
to be passed specifying the configuration, given that most of these formats can be trivially reconstructed into that format, as illustrated for JSON:
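A minimal sketch, assuming the configuration lives in a hypothetical logging.json file with the same structure that dictConfig() expects:

import json
import logging.config

# Parse the JSON file and hand the resulting dict straight to dictConfig().
with open("logging.json") as fd:
    logging.config.dictConfig(json.load(fd))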
When you’re packaging a decent sized application storing logging configuration in a file makes it easier to maintain the logging configuration vs. the option of hard-coding it in executable code. For example, it becomes easier to swap in a different logging configuration in different environments (e.g. pre-production and production). The fact that more popular formats can now be supported will open this flexibility to more developers.
In addition to this, the logging.basicConfig()
function now has a style
parameter where you can select which type of string formatting token to use for the format string itself. All of the following are equivalent:
>>> import logging
>>> logging.basicConfig(style='%', format="%(name)s -> %(levelname)s: %(message)s")
>>> logging.basicConfig(style='{', format="{name} -> {levelname} {message}")
>>> logging.basicConfig(style='$', format="$name -> $levelname: $message")
Also, if a log event occurs prior to configuring logging, there is a default setup of a StreamHandler
connected to sys.stderr
, which displays any message of WARNING
level or higher. If you need to fiddle with this handler for any reason, it’s available as logging.lastResort
.
Some other smaller changes:
setLevel()
as strings such as INFO
instead of integers like logging.INFO
.getChild()
method on Logger
instances now returns a logger with a suffix appended to the name. For example, logging.getLogger("foo").getChild("bar.baz")
will return the same logger as logging.getLogger("foo.bar.baz")
. This is convenient when the first level of the name is __name__
, as it often is by convention, or in cases where a parent logger is passed to some code which wants to create its own child logger from it.hasHandlers()
method has also been added to Logger
which returns True
iff this logger, or a parent to which events are propagated, has at least one configured handler.logging.setLogRecordFactory()
and a corresponding getLogRecordFactory()
have been added to allow programmers to override log record creation process.There are a number of changes in concurrency this release.
The largest change is a new concurrent.futures
module in the library, specified by PEP 3148, and it’s a pretty useful one. The intention with the new concurrent
namespace is to collect together high-level code for managing concurrency, but so far it’s only acquired the one futures
module.
The intention here is to provide what has become a standard abstraction for concurrency: a future, an object which represents the eventual result of a concurrent operation. In the Python module, the API style is deliberately decoupled from the implementation detail of what form of concurrency is used, whether it's a thread, another process or some RPC to another host. This is useful as it allows the style to be potentially changed later if necessary without invalidating the business logic around it.
The style is to construct an executor which is where the flavour of concurrency is selected. Currently the module supports two options, ThreadPoolExecutor
and ProcessPoolExecutor
. The code can then schedule jobs to the executor, which returns a Future
instance which can be used to obtain the results of the operation once it’s complete.
To exercise these in a simple example I wrote a basic password cracker, something that should benefit from parallelisation. I used PBKDF2 with SHA-256 for hashing the passwords, although only with 1000 iterations1 to keep running times reasonable on my laptop. Also, to keep things simple we assume that the password is a single dictionary word with no variations in case.
For comparison I first wrote a simple implementation which checks every word in /usr/share/dict/words
with no parallelism:
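Here's a sketch of that serial version — the salt and target hash are made up for illustration, and I've used hashlib.pbkdf2_hmac() for the key derivation (the original may have done this part differently):

import hashlib

ITERATIONS = 1000
SALT = b"0123456789abcdef"
# The hash we're trying to reverse (the word "squeamish" in this sketch).
TARGET = hashlib.pbkdf2_hmac("sha256", b"squeamish", SALT, ITERATIONS)

def check_word(word):
    """Return the word if its PBKDF2 hash matches the target, else None."""
    hashed = hashlib.pbkdf2_hmac("sha256", word.encode("utf-8"), SALT, ITERATIONS)
    return word if hashed == TARGET else None

def main():
    with open("/usr/share/dict/words") as fd:
        for line in fd:
            word = line.strip()
            if word and check_word(word) is not None:
                print("Password is: " + word)
                return
    print("Password not found")

if __name__ == "__main__":
    main()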
Here’s the output of time
running it:
python3 crack.py 257.08s user 0.25s system 99% cpu 4:17.72 total
On my modest 2016 MacBook Pro, this took 4m 17s in total, and the CPU usage figures indicated that one core was basically maxed out, as you’d expect. Then I swapped out main()
for a version that used ThreadPoolExecutor
from concurrent.futures
:
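A sketch of the threaded main(), reusing check_word() from the serial sketch above:

import concurrent.futures

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        with open("/usr/share/dict/words") as fd:
            for line in fd:
                word = line.strip()
                if word:
                    futures.append(executor.submit(check_word, word))
        # Harvest results as the workers finish each job.
        for future in concurrent.futures.as_completed(futures):
            if future.result() is not None:
                print("Password is: " + future.result())
                return
    print("Password not found")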
After creating a ThreadPoolExecutor
which can use a maximum of 8 worker threads at any time, we then need to submit jobs to the executor. We do this in a loop around reading /usr/share/dict/words
, submitting each word as a job to the executor to distribute among its workers. Once all the jobs are submitted, we then wait for them to complete and harvest the results.
Again, here’s the time
output:
python3 crack.py 506.42s user 2.50s system 680% cpu 1:14.83 total
With my laptop’s four cores, I’d expect this would run around four times as fast2 and it more or less did, allowing for some overhead scheduling the work to the threads. The total run time was 1m 14s so a little less than the expected four times faster, but not a lot. The CPU usage was around 85% of the total of all four cores, which is again roughly what I’d expect. Running in a quarter of the time seems like a pretty good deal for only four lines of additional code!
Finally, just for fun I then swapped out ThreadPoolExecutor
for ProcessPoolExecutor
, which is the same but using child processes instead of threads:
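The change amounts to swapping the executor class, something like:

def main():
    # Identical to the threaded version apart from the executor class.
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        ...  # submit jobs and harvest results exactly as before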
And the time
output with processes:
python3 crack.py 575.08s user 15.50s system 669% cpu 1:28.15 total
I didn’t expect this to make much difference to a CPU-bound task like this, provided that the hashing routine are releasing the GIL as they’re supposed to. Indeed, it was actually somewhat slower than the threaded case, taking 1m 28s to execute in total. The total user time was higher for the same amount of work, so this definitely points to some decreased efficiency rather than just differences in background load or similar. I’m assuming that the overhead of the additional IPC and associated memory copying accounts for the increased time, but this sort of thing may well be platform-dependent.
As one final flourish, I tried to reduce the inefficiencies of the multiprocess case by batching the work into larger chunks using a recipe from the itertools documentation:
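A sketch of that batched version, using the grouper() recipe from the itertools documentation and hashing a whole batch of words within each job (again, the salt, target and batch size are just illustrative):

import concurrent.futures
import hashlib
from itertools import zip_longest

ITERATIONS = 1000
SALT = b"0123456789abcdef"
TARGET = hashlib.pbkdf2_hmac("sha256", b"squeamish", SALT, ITERATIONS)

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks (from the itertools recipes)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def check_batch(words):
    """Hash a whole batch of words in one job, returning any match."""
    for word in words:
        if word is None:
            continue
        hashed = hashlib.pbkdf2_hmac("sha256", word.encode("utf-8"), SALT, ITERATIONS)
        if hashed == TARGET:
            return word
    return None

def main():
    with open("/usr/share/dict/words") as fd:
        words = [line.strip() for line in fd if line.strip()]
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(check_batch, batch)
                   for batch in grouper(words, 1000)]
        for future in concurrent.futures.as_completed(futures):
            if future.result() is not None:
                print("Password is: " + future.result())
                return
    print("Password not found")

if __name__ == "__main__":
    main()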
This definitely made some difference, bringing the time down from 1m 28s to 1m 6s. The CPU usage also indicates more of the CPU time is being spent in user space, presumably due to less IPC.
python3 crack.py 509.95s user 1.20s system 764% cpu 1:06.83 total
I suspect that the multithreaded case would also benefit from some batching, but at this point I thought I’d better draw a line under it or I’d never finish this article.
Overall, I really like the concurrent.futures
module, as it takes so much hassle out of processing things in parallel. There are still cases where the threading
module is going to be more appropriate, such as some background thread which performs periodic actions asynchronously. But for cases where you have a specific task that you want to tackle synchronously but in parallel, this module wraps up a lot of the annoying details.
I’m excited to see what else might be added to concurrent
in the future3!
Despite all the attention on concurrent.futures
this release, the threading
module has also had some attention with the addition of a new Barrier
class. This is initialised with a number of threads to wait for. As individual threads call wait()
on the barrier they are held up until all the required number of threads are waiting, at which point all are allowed to proceed simultaneously. This is a little like the join()
method, except the threads can continue to execute after the barrier.
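Here's a small sketch of the idea — three threads doing differing amounts of setup work, none of which proceeds past the barrier until the slowest arrives:

import threading
import time

NUM_THREADS = 3
barrier = threading.Barrier(NUM_THREADS)

def worker(name, delay):
    time.sleep(delay)                      # simulate varying amounts of setup work
    print(name, "waiting at the barrier")
    barrier.wait()                         # blocks until all three threads arrive
    print(name, "released")

threads = [threading.Thread(target=worker, args=("thread-%d" % i, i))
           for i in range(NUM_THREADS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()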
The Barrier
can also be initialised with a timeout
argument. If the timeout expires before the required number of threads have called wait()
then all currently waiting threads are released and a BrokenBarrierError
exception is raised from all the wait()
methods.
I can think of a few use-cases where this synchronisation primitive might come in handy, such as multiple threads all producing streams of output which need to be synchronised with each other so one of them doesn't get too far ahead of the others. For example, perhaps one thread is producing chunks of audio data and another chunks of video; you could use a barrier to ensure that neither of them gets ahead of the other.
Another small but useful change in threading
is that the Lock.acquire()
, RLock.acquire()
and Semaphore.acquire()
methods can now accept a timeout, instead of only allowing a simple choice between blocking and non-blocking as before. Also there’s been a fix to allow lock acquisitions to be interrupted by signals on pthreads platforms, which means that programs that deadlock on locks can be killed by repeated SIGINT
(as opposed to requiring SIGKILL
as they used to sometimes).
Finally, threading.RLock
has been moved from pure Python to a C implementation, which results in a 10-15x speedup when acquiring and releasing these locks.
In another change that will impact all forms of threading in CPython, the code behind the GIL has been rewritten. The new implementation aims to offer more predictable switching intervals and reduced overhead due to lock contention.
Prior to this change, the GIL was released after a fixed number of bytecode instructions had been executed. However, this is a very crude way to measure a timeslice: the time taken to execute an instruction can vary from a few nanoseconds to much longer, because not all of the expensive C functions in the library release the GIL while they operate. This meant that scheduling between threads could be very unbalanced depending on their workload.
To replace this, the new approach releases the GIL at a fixed time interval, although the GIL is still only released at an instruction boundary. The specific interval is tunable through sys.setswitchinterval()
, with the current default being 5 milliseconds. As well as being a more balanced way to share processor time among threads, this can also reduce the overhead of locks in heavily contended situations — this is because waiting for a lock which is already held by another thread can add significant overhead on some platforms (apparently OS X is particularly impacted by this).
If you want to get technical4, threads wishing to take the GIL first wait on a condition variable for it to be released, with a timeout equal to the switch interval. Hence, a waiting thread wakes up either after this interval or when the GIL is released by the holding thread, whichever comes first. At this point the requesting thread checks whether any context switches have already occurred, and if not it sets the volatile flag gil_drop_request
, shared among all threads, to indicate that it’s requesting the release of the GIL. It then continues around this loop until it gets the lock, re-requesting GIL drop after a delay every time a new thread acquires it.
The holding thread, meanwhile, attempts to release the GIL when it performs blocking operations, or otherwise every time around the eval loop it checks if gil_drop_request
is set and releases the GIL if so. In so doing, it wakes up any threads which are waiting on the GIL and relies on the OS to ensure fair scheduling among threads.
The advantage of this approach is that it provides an advisory cap on the amount of time a thread may hold the GIL, by delaying setting the gil_drop_request
flag, but also allows the eval loop to take as long as it needs to finish processing its current bytecode instruction. It also minimises overhead in the simple case where no other thread has requested the GIL.
The final change is around thread switching. Prior to Python 3.2, the GIL was released for a handful of CPU cycles to allow the OS to schedule another thread, and then it was immediately reacquired. This was efficient when the common case was that no other threads were ready to run, and meant that threads running lots of very short opcodes weren't unduly penalised, but in some cases this delay wasn't sufficient to trigger the OS to context switch to a different thread. This can cause particular problems when you have an I/O-bound thread competing with a CPU-intensive one — the OS will attempt to schedule the I/O-bound thread, but it will immediately attempt to acquire the GIL and be suspended again. Meanwhile, the CPU-bound thread will tend to cling to the GIL for longer than it should, leading to higher I/O latency.
To combat this, the new system forces a thread switch at the end of the fixed interval if any other threads are waiting on the GIL. The OS is still responsible for deciding which thread to schedule; this change just ensures that it won't be the previously running thread. It does this using a last_holder
shared variable which points to the last holder of the GIL. When a thread releases the GIL, it additionally checks if last_holder
is its own ID and if so, it waits on a condition variable for the value to change to another thread. This can’t cause a deadlock if no other threads are waiting, because in that case gil_drop_request
isn’t set and this whole operation is skipped.
Overall I’m hopeful that these changes should make a positive impact to fair scheduling in multithreaded Python applications. As much as I’m sure everyone would love to find a way to remove the GIL entirely, it doesn’t seem like that’s likely for some time to come.
There are a host of small improvements to the datetime
module to blast through.
First and foremost is that there’s now a timezone
type which implements the tzinfo
interface and can be used in simple cases of fixed offsets from UTC (i.e. no DST adjustments or the like). This means that creating a timezone-aware datetime
at a known offset from UTC is now straightforward:
>>> from datetime import datetime, timedelta, timezone
>>> # Naive datetime (no timezone attached)
>>> datetime.now()
datetime.datetime(2021, 2, 6, 15, 26, 37, 818998)
>>> # Time in UTC (happens to be my timezone also!)
>>> datetime.now(timezone.utc)
datetime.datetime(2021, 2, 6, 15, 26, 46, 488588, tzinfo=datetime.timezone.utc)
>>> # Current time in New York (UTC-5) ignoring DST
>>> datetime.now(timezone(timedelta(0, -5*3600)))
datetime.datetime(2021, 2, 6, 10, 27, 41, 764597, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400)))
Also, timedelta
objects can now be multiplied and divided by integers or floats, as well as divided by each other to determine how many of one interval fit into the other interval. This is all fairly straightforward by converting the values to a total number of seconds to perform the operations, but it’s convenient not to have to.
>>> timedelta(1, 20*60*60) * 1.5
datetime.timedelta(days=2, seconds=64800)
>>> timedelta(8, 3600) / 4
datetime.timedelta(days=2, seconds=900)
>>> timedelta(8, 3600) / timedelta(2, 900)
4.0
If you’re using Python to store information about the Late Medieval Period then you’re in luck, as datetime.date.strftime()
can now cope with dates prior to 1900. If you want to expand your research to the Dark Ages, however, you’re out of luck since it still only handles dates from 1000 onwards.
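For example:

>>> from datetime import date
>>> date(1431, 5, 30).strftime("%Y-%m-%d")
'1431-05-30'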
Also, use of two-digit years is being discouraged. Until now setting time.accept2dyear
to True
would allow you to use a 2-digit year in a time tuple and its century would be guessed. However, as of Python 3.2 using this logic will earn you a DeprecationWarning. Quite right too, 2-digit years are quite an anachronism these days.
The str.format()
method for string formatting is now joined by str.format_map()
which, as the name implies, takes a mapping type to supply arguments by name.
>>> "You must cut down the mightiest {plant} in the forest with... a {fish}!".format_map({"fish": "herring", "plant": "tree"})
'You must cut down the mightiest tree in the forest with... a herring!'
As well as a standard dict
instance, you can pass any dict
-like object and Python has plenty of these, such as ConfigParser
and the objects created by the dbm
modules.
There have also been some minor changes to formatting of numeric values as strings. Prior to this release converting a float
or complex
to string form with str()
would show fewer decimal places than repr()
. This was because the repr()
level of precision would occasionally show surprising results, and the pragmatic way to avoid this being more of an issue was to make str()
round to a lower precision.
However, as discussed in the previous article, repr()
was changed to always select the shortest equivalent representation for these types in Python 3.1. Hence, in Python 3.2 the str()
and repr()
forms of these types have been unified to the same precision.
There are a series of enhancements to decorators provided by the functools
module, plus a change to contextlib
.
Firstly, just to make the example from the previous article more pointless, there is now a functools.lru_cache()
decorator which can cache the results of a function based on its parameters. If the function is called with the same parameters, a cached result will be used if present.
This is really handy to drop in to commonly-used but slow functions for a very low effort speed boost. What’s even more useful is that you can call a cache_info()
method of the decorated function to get statistics about the cache. There’s also a cache_clear()
method if you need to invalidate the cache, although there’s unfortunately no option to clear only selected parameters.
>>> @functools.lru_cache(maxsize=10)
... def slow_func(arg):
... return arg + 1
...
>>> slow_func(100)
101
>>> slow_func(200)
201
>>> slow_func(100)
101
>>> slow_func.cache_info()
CacheInfo(hits=1, misses=2, maxsize=10, currsize=2)
Secondly, there have been some improvements to functools.wraps()
to improve introspection, such as a __wrapped__
attribute pointing back to the original callable and copying __annotations__
across to the wrapped version, if defined.
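For example, with a simple hand-rolled logging decorator:

import functools

def logged(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper

@logged
def add(x: int, y: int) -> int:
    return x + y

print(add(1, 2))               # goes via the wrapper
print(add.__wrapped__(1, 2))   # the original function, no logging
print(add.__annotations__)     # annotations copied over by functools.wraps()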
Thirdly, a new functools.total_ordering()
class decorator has been provided. This is very useful for producing classes which support all the rich comparison operators with minimal effort. If you define a class with __eq__
and __lt__
and apply the @functools.total_ordering
decorator to it, all the other rich comparison operators will be synthesized.
>>> import functools
>>> @functools.total_ordering
... class MyClass:
... def __init__(self, value):
... self.value = value
... def __lt__(self, other):
... return self.value < other.value
... def __eq__(self, other):
... return self.value == other.value
...
>>> one = MyClass(100)
>>> two = MyClass(200)
>>> one < two
True
>>> one > two
False
>>> one == two
False
>>> one != two
True
Finally, there have been some changes which mean that the contextlib.contextmanager()
decorator now results in a function which can be used both as a context manager (as previously) but now also as a function decorator. This could be pretty handy, although bear in mind if you yield a value which is normally bound in a with
statement, there's no equivalent approach for function decorators.
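Here's a small sketch of the idea, using a toy timing context manager (nothing from the standard library, just an illustration):

import contextlib
import time

@contextlib.contextmanager
def timed(label):
    # Reports how long the wrapped block (or function call) took.
    start = time.time()
    try:
        yield
    finally:
        print("%s took %.3fs" % (label, time.time() - start))

# Used as a context manager, as before...
with timed("block"):
    sum(range(1000000))

# ...or, new in 3.2, as a function decorator.
@timed("function")
def work():
    return sum(range(1000000))

work()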
Only one improvement to itertools
which is the addition of an accumulate
function. However, this has the potential to be pretty handy so I’ve given it its own section.
Passed an iterable, itertools.accumulate()
will return the cumulative sum of all elements so far. This works with any type for which the + operator is defined
:
>>> import itertools
>>> list(itertools.accumulate([1,2,3,4,5]))
[1, 3, 6, 10, 15]
>>> list(itertools.accumulate([[1,2],[3],[4,5,6]]))
[[1, 2], [1, 2, 3], [1, 2, 3, 4, 5, 6]]
For other types, you can define any binary function to combine them:
>>> import operator
>>> list(itertools.accumulate((set((1,2,3)), set((3,4,5))), func=operator.or_))
[{1, 2, 3}, {1, 2, 3, 4, 5}]
And it’s also possible to start with an initial value before anything’s added by providing the initial
argument.
The collections
module has had a few improvements.
The collections.Counter
class added in the previous release has now been extended with a subtract()
method which supports negative numbers. Previously the semantics of -=
as applied to a Counter
would never reduce a value beyond zero — it would simply be removed from the set. This is consistent with how you’d expect a counter to work:
>>> x = Counter(a=10, b=20)
>>> x -= Counter(a=5, b=30)
>>> x
Counter({'a': 5})
However, in its interpretation as a multiset, you might actually want values to go negative. If so, you can use the new subtract()
method:
>>> x = Counter(a=10, b=20)
>>> x.subtract(Counter(a=5, b=30))
>>> x
Counter({'a': 5, 'b': -10})
As demonstrated in the previous article, it’s a little inconvenient to move something to the end of the insertion order. That’s been addressed in this release with the OrderedDict.move_to_end()
method. By default this moves the item to the last position in the ordered sequence in the same way as x[key] = x.pop(key)
would but is significantly more efficient. Alternatively you can call move_to_end(key, last=False)
to move it to the first position in the sequence.
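For example:

>>> from collections import OrderedDict
>>> d = OrderedDict.fromkeys("abcde")
>>> d.move_to_end("b")
>>> "".join(d)
'acdeb'
>>> d.move_to_end("b", last=False)
>>> "".join(d)
'bacde'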
Finally, collections.deque
has two new methods, count()
and reverse()
which allow them to be used in more situations where code was designed to take a list
.
>>> import collections
>>> x = collections.deque('antidisestablishmentarianism')
>>> x.count('i')
5
>>> x.reverse()
>>> x
deque(['m', 's', 'i', 'n', 'a', 'i', 'r', 'a', 't', 'n', 'e', 'm', 'h', 's',
'i', 'l', 'b', 'a', 't', 's', 'e', 's', 'i', 'd', 'i', 't', 'n', 'a'])
The three modules email
, mailbox
and nntplib
now correctly support the str
and bytes
types that Python 3 introduced. In particular, this means that messages in mixed encodings now work correctly. These have also necessitated a number of changes in the mailbox
module, which should now work correctly.
The email
module has new functions message_from_bytes()
and message_from_binary_file()
, and classes BytesFeedParser
and BytesParser
, to allow messages read or stored in the form of bytes
to be parsed into model objects. Also, the get_payload()
method and Generator
class have been updated to properly support the Content-Transfer-Encoding
header, encoding or decoding as appropriate.
Sticking with the theme of email, imaplib
now supports upgrade of an existing connection to TLS using the new imaplib.IMAP4.starttls()
method.
The ftplib.FTP
class now supports the context manager protocol to consume socket.error
exceptions which are thrown and close the connection when done. This makes it pretty handy, but due to the way that FTP opens additional sockets, you need to be careful to close all these before the context manager exits or your application will hang. Consider the following example:
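Something along these lines shows the shape of it, with ftp.gnu.org standing in for any FTP server that has a README.MIRRORS file at the top level:

import ftplib

with ftplib.FTP("ftp.gnu.org") as ftp:
    ftp.login()                                    # anonymous login
    sock = ftp.transfercmd("RETR README.MIRRORS")  # opens a separate data socket
    while True:
        data = sock.recv(4096)
        if not data:
            break
        print(data.decode("utf-8", errors="replace"), end="")
    sock.close()      # without this, leaving the with block can hang
    ftp.voidresp()    # collect the end-of-transfer response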
Assuming that FTP site is still up, and README.MIRRORS
is still available, that should execute fine. However, if you remove that sock.close()
line then you should find it just hangs and never terminates (perhaps until the TCP connection gets terminated due to being idle).
The socket.create_connection()
function can also be used as a context manager, and swallows errors and closes the connection in the same way as the FTP
class above.
The ssl
module has seen some love with a host of small improvements. There’s a new SSLContext
class to hold persistent connection data such as settings, certificates and private keys. This allows the settings to be reused for multiple connections, and provides a wrap_socket()
method for creating a socket using the stored details.
There’s a new ssl.match_hostname()
which applies RFC-specified rules for confirming that a specified certificate matches the specified hostname. The certificate specification it expects is as returned by SSLSocket.getpeercert()
, but it’s not particularly hard to fake as shown in the session below.
>>> import ssl
>>> cert = {'subject': ((('commonName', '*.andy-pearce.com'),),)}
>>> ssl.match_hostname(cert, "www.andy-pearce.com")
>>> ssl.match_hostname(cert, "ftp.andy-pearce.com")
>>> ssl.match_hostname(cert, "www.andy-pearce.org")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/andy/.pyenv/versions/3.2.6/lib/python3.2/ssl.py", line 162, in match_hostname
% (hostname, dnsnames[0]))
ssl.CertificateError: hostname 'www.andy-pearce.org' doesn't match '*.andy-pearce.com'
This release also adds support for SNI (Server Name Indication), which is like virtual hosting but for SSL connections. This removes the longstanding issue whereby you can host as many domains on a single IP address as you like for standard HTTP, but for SSL you needed a unique IP address for each domain. This is essentially because the virtual hosting of websites is implemented by passing the HTTP Host
header, but since the SSL connection is set up prior to sending the HTTP request (by definition!) then the only thing you have to connect to is an IP address. The remote end needs to decide what certificate to send you, and since all it has to decide that is the IP address then you can’t have different certificates for different domains on the same IP. This is problematic because the certificate needs to match the domain or the browser will reject it.
SNI handles this by extending the SSL ClientHello message to include the domain. To implement this with the ssl
module in Python, you need to specify the server_hostname
parameter to SSLContext.wrap_socket()
.
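Putting those pieces together, a minimal client looks something like this sketch, with www.python.org as an arbitrary example host:

import socket
import ssl

# One SSLContext holds the settings and CA certificates, and can then be
# used to wrap as many sockets as you like.
context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
context.verify_mode = ssl.CERT_REQUIRED
context.set_default_verify_paths()      # use the platform's default CA store

sock = socket.create_connection(("www.python.org", 443))
# server_hostname puts the domain into the ClientHello (SNI), so the server
# can choose the certificate which matches that domain.
ssock = context.wrap_socket(sock, server_hostname="www.python.org")
try:
    print(ssock.cipher())
    print(ssock.getpeercert()["subject"])
finally:
    ssock.close()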
The http.client
module has been updated to use the new certificate verification processes when using a HTTPSConnection
. The request()
method is now more flexible about sending request bodies — previously it required a file object, but now it will also accept an iterable, provided that an explicit Content-Length header is sent. According to HTTP/1.1 this header shouldn't be required, since requests can be sent using chunked encoding, which doesn't require the length of the request body to be known up front. In practice, however, it's common for servers not to bother supporting chunked requests, despite this being mandated by the HTTP/1.1 standard. As a result, it's sensible to regard Content-Length as mandatory for requests with a body. HTTP/2 has its own methods of streaming data, so once that gains wide acceptance chunked encoding won't be used anyway — but given the rate of adoption so far, I wouldn't hold your breath.
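As a sketch, sending an iterable body looks something like this, with httpbin.org standing in as a convenient test server:

import http.client

conn = http.client.HTTPSConnection("httpbin.org")
chunks = [b'{"spam": ', b'"eggs"}']       # the body, supplied as an iterable
headers = {
    "Content-Type": "application/json",
    # With an iterable body we have to supply the length ourselves.
    "Content-Length": str(sum(len(chunk) for chunk in chunks)),
}
conn.request("POST", "/post", body=iter(chunks), headers=headers)
print(conn.getresponse().status)
conn.close()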
The urllib.parse
module has some changes as well, with urlparse()
now supporting IPv6 and urldefrag()
returning a collections.namedtuple
for convenience. The urlencode()
function can also now accept both str
and bytes
for the query parameter.
There have been some significant updates to the xml.etree.ElementTree
package, including the addition of the following top-level functions:
fromstringlist(): parses an XML document from a sequence of string fragments.

tostringlist(): the counterpart of fromstringlist(), generates the XML output in chunks. It doesn't make any guarantees except that joining them all together will yield the same as generating the output as a single string, but in my experience each chunk is around 8192 bytes plus whatever takes it up to the next tag boundary.

register_namespace(): registers a namespace prefix, which will then be used when serialising XML instead of an automatically generated one.
The Element
class also has a few extra methods:
Element.extend(): appends subelements from an iterable of Element instances.

Element.iterfind(): the same as Element.findall() but yields elements instead of returning a list.

Element.itertext(): similar to Element.findtext(), but iterates over text in the current element and all child elements as opposed to just returning the first match.
class also has acquired the end()
method to end the current element and doctype()
to handle a doctype declaration.
Finally, a couple of unnecessary methods have been deprecated. Instead of getchildren()
you can just use list(elem)
, and instead of getiterator()
just use Element.iter()
.
Also in 3.2 there’s a new html
module, but it only contains one function escape()
so far which will do the obvious HTML-escaping.
>>> import html
>>> html.escape("<blink> & <marquee> tags are both deprecated")
'&lt;blink&gt; &amp; &lt;marquee&gt; tags are both deprecated'
The gzip.GzipFile
class now provides a peek()
method which can read a number of bytes from the archive without advancing the read pointer. This can be very useful when implementing parsers which need to decide which function to branch into based on what's next in the file, but which want to leave those functions to read from the file themselves through a simpler interface.
The gzip
module has also added the compress()
and decompress()
methods which simply perform in-memory compression/decompression without the need to construct a GzipFile
instance. This has been a source of irritation for me in the past, so it’s great to see it finally addressed.
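For example, round-tripping a payload entirely in memory:

import gzip

data = b"Flat is better than nested. " * 100
compressed = gzip.compress(data)
print(len(data), "->", len(compressed))
assert gzip.decompress(compressed) == data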
The zipfile
module also had some improvements, with the ZipFile
class now supporting use as a context manager. Also, the ZipExtFile
object has had some performance improvements. This is the file-like object returned when you open a file within a ZIP archive using the ZipFile.open()
method. You can also wrap it in io.BufferedReader
for even better performance if you’re doing multiple smaller reads.
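Something like this shows both changes, writing a small scratch archive purely for the sake of the example:

import io
import zipfile

with zipfile.ZipFile("example.zip", "w") as zf:
    zf.writestr("hello.txt", "hello world\n" * 1000)

with zipfile.ZipFile("example.zip") as zf:
    with zf.open("hello.txt") as raw:
        # Wrapping the member in io.BufferedReader speeds up lots of
        # small reads.
        buffered = io.BufferedReader(raw)
        print(buffered.readline())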
The tarfile
module has changes, with tarfile.TarFile
also supporting use as a context manager. Also, the add()
method for adding files to the archive now supports a filter
parameter which can modify attributes of the files as they’re added, or exclude them altogether. You pass a callable using this parameter, which is called on each file as it’s added. It’s passed a TarInfo
structure which has the metainformation about the file, such as the permissions and owner. It can return a modified version of the structure (e.g. to squash all files to being owned by a specific user), or it can return None
to block the file from being added.
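A sketch of the filter parameter in use, assuming a hypothetical myproject directory to archive:

import tarfile

def sanitise(tarinfo):
    if tarinfo.name.endswith(".log"):
        return None                      # returning None excludes the file
    tarinfo.uid = tarinfo.gid = 0        # squash ownership...
    tarinfo.uname = tarinfo.gname = "root"
    return tarinfo                       # ...and add the modified entry

with tarfile.open("myproject.tar.gz", "w:gz") as tar:
    tar.add("myproject", filter=sanitise)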
Finally, the shutil
module has also grown a couple of archive-related functions, make_archive()
and unpack_archive()
. These provide a convenient high-level interface to zipping up multiple files into an archive without having to mess around with the details of the individual compression modules. It also means that the format of your archives can be altered with minimal impact on your code by changing a parameter.
It supports the common archiving formats out of the box, but there’s also a register_archive_format()
hook should you wish to add code to handle additional formats.
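A minimal sketch, with the source directory being just an example:

import shutil

# Bundle up a directory tree as logs.tar.gz; switching "gztar" for "zip" or
# "bztar" is the only change needed to alter the format.
archive = shutil.make_archive("logs", "gztar", root_dir="/var/log")
shutil.unpack_archive(archive, "restored_logs")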
There are some new functions in the math
library, some of which look pretty handy.
isfinite(): returns True iff the float argument is not a special value (e.g. NaN or infinity).

expm1(): computes e**x - 1 in a way that avoids the loss of precision you'd get from calling exp() and subtracting 1 for very small values of x.

erf() and erfc(): erf() is the Gaussian Error Function, which is useful for assessing how much of an outlier a data point is against a normal distribution. The erfc() function is simply the complement, where erfc(x) == 1 - erf(x).

gamma() and lgamma(): the gamma function is a generalisation of factorial to real numbers, so if you only need integer arguments then math.factorial() will be what you're looking for. Since the value grows so quickly, larger values will yield an OverflowError. To deal with this, the lgamma() function returns the natural logarithm of the value.

There have been a couple of changes to the way that both compiled bytecode and shared object files are stored on disk. More casual users of Python might want to skip over this section, although I would say it's always helpful to know what's going on under the hood, if only to help diagnose problems you might run into.
The previous scheme of storing .pyc
files in the same directory as the .py
files didn’t play nicely when the same source files were being used by multiple different interpreters. The interpreter would note that the file was created by another one, and replace it with its own. As the files swap back and forth, it cancels out the benefits of caching in the first place.
As a result, the name of the interpreter is now added to the .pyc
filename, and to stop these files cluttering things up too much they’ve all been moved to a __pycache__
directory.
I suspect many people will not need to care about this any further than it being another entry for the .gitignore
file. However, sometimes there can be odd effects with these compiled files, so it’s worth being aware of. For example, if a module is installed and used and then deleted, it might leave the .pyc
files behind, confusing programmers who were expecting an import error. If you do want to check for this, there’s a new __cached__
attribute of an imported module indicating the file that was loaded, in addition to the existing __file__
attribute which continues to refer to the source file. The imp
module also has some new functions which are useful for scripts that need to correlate source and compiled files for some reason, as illustrated by the session below:
>>> import mylib
>>> print(mylib.__file__)
/tmp/mylib.py
>>> print(mylib.__cached__)
/tmp/__pycache__/mylib.cpython-32.pyc
>>> import imp
>>> imp.get_tag()
'cpython-32'
>>> imp.cache_from_source("/tmp/mylib.py")
'/tmp/__pycache__/mylib.cpython-32.pyc'
>>> imp.source_from_cache("/tmp/__pycache__/mylib.cpython-32.pyc")
'/tmp/mylib.py'
There are also some corresponding changes to the py_compile
, compileall
and importlib.abc
modules which are a bit too esoteric to cover here; the documentation has you well covered. You can also find lots of details and a beautiful module loading flowchart in PEP 3147.
Similar changes have been implemented for shared object files. These are compiled against a specific ABI (Application Binary Interface), which is sensitive not only to the major Python version but also to the compilation flags that were used to compile the interpreter. As a result, being able to support the same shared object compiled against multiple ABIs is useful.
The implementation is similar to that for compiled bytecode, where .so
files acquire unique filenames based on the ABI and are collected into a shared directory pyshared
. The suffix for the current interpreter can be queried using sysconfig
:
>>> import sysconfig
>>> sysconfig.get_config_var("SOABI")
'cpython-32m-x86_64-linux-gnu'
>>> sysconfig.get_config_var("EXT_SUFFIX")
'.cpython-32m-x86_64-linux-gnu.so'
The interpreter is cpython
, 32
is the version and the letters appended indicate the compilation flags. In this example, m
corresponds to pymalloc.
If you want more details, PEP 3149 has a ton of interesting info.
The syntax of the language has been expanded to allow deletion of a variable that is free in a nested block. If that didn't make any sense, it's best explained with an example. The following code was legal in Python 2.x, but would have raised a SyntaxError
in Python 3.0 or 3.1. In Python 3.2, however, this is once again legal.
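For instance, something like this (pointless, but now legal):

def outer():
    x = "some value"
    def inner():
        print(x)    # 'x' is a free variable in this nested block
    inner()
    del x           # a SyntaxError in Python 3.0/3.1, legal again in 3.2

outer()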
So what happens if we were to call inner()
again after the del x
now? We get exactly the same results as if we hadn't assigned the local yet, which is to get a NameError
with the message free variable 'x' referenced before assignment in enclosing scope
. The following example may make this message clearer.
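For instance:

def outer():
    def inner():
        print(x)    # 'x' is free here, expected to come from outer()'s scope
    inner()         # NameError: x hasn't been assigned in outer() yet
    x = "assigned too late"

outer()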
An important example of an implicit del
is at the end of an except
block, so the following code would have raised a SyntaxError
in Python 3.0-3.1, but is now valid again:
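Here the exception variable is free in a nested function, and the implicit deletion at the end of the except block is what used to trip the parser up:

def handler():
    try:
        raise ValueError("boom")
    except ValueError as exc:
        def report():
            print("caught:", exc)   # 'exc' is a free variable of report()
        report()
    # The except block ends with an implicit 'del exc', which is a deletion
    # of a variable that is free in report() -- a SyntaxError in 3.0/3.1.

handler()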
A new ResourceWarning
has been added to detect issues such as gc.garbage
not being empty at interpreter shutdown, indicating finalisation problems with the code. It’s also raised if a file
object is destroyed before being properly closed.
This warning is silenced by default, but can be enabled by the warnings
module, or using an appropriate -W
option on the command-line. The session shown below shows the warning being triggered by destroying an unclosed file
object:
>>> import warnings
>>> warnings.filterwarnings("default")
>>> f = open("/etc/passwd", "rb")
>>> del f
<stdin>:1: ResourceWarning: unclosed file <_io.BufferedReader name='/etc/passwd'>
Note that as of Python 3.4 most of the cases that could cause garbage collection to fail have been resolved, but we have to pretend we don’t know that for now.
There have also been a range of improvements to the unittest
module. There are two new assertions, assertWarns()
and assertWarnsRegex()
, to test whether code raises appropriate warnings (e.g. DeprecationWarning
). Another new assertion assertCountEqual()
can be used to perform an order-independent comparison of two iterables — functionally this is equivalent to feeding them both into collections.Counter()
and comparing the results. There is also a new maxDiff
attribute for limiting the size of diff output when logging assertion failures.
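For instance, a couple of the new assertions in action:

import unittest
import warnings

class ExampleTests(unittest.TestCase):
    def test_warns(self):
        with self.assertWarns(DeprecationWarning):
            warnings.warn("old API", DeprecationWarning)

    def test_count_equal(self):
        # Order doesn't matter, but duplicates do.
        self.assertCountEqual([1, 2, 2, 3], [3, 2, 1, 2])

if __name__ == "__main__":
    unittest.main()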
Some of the assertion names are being tidied up. Examples include assertRegex()
being the new name for assertRegexpMatches()
and assertTrue()
replacing assert_()
. The assertDictContainsSubset()
assertion has also been deprecated because the arguments were in the wrong order, so it was never quite clear which argument was required to be a subset of which.
Finally, the command-line usage with python -m unittest
has been made more flexible, so you can specify either module names or source file paths to indicate which tests to run. There are also additional options for python -m unittest discover
for specifying which directory to search for tests, and a regex filter on the filenames to run.
Some performance tweaks are welcome to see. Firstly, the peephole optimizer is now smart enough to convert set
literals consisting of constants to frozenset
. This makes things faster in cases like this:
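Here the constant set literal is folded into a frozenset constant, so it isn't rebuilt on every call:

def is_vowel(ch):
    return ch in {"a", "e", "i", "o", "u"}

print(is_vowel("e"), is_vowel("q"))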
The Timsort algorithm used by list.sort()
and sorted()
is now faster and uses less memory when a key
function is supplied by changing the way this case is handled internally. The performance and memory consumption of json
decoding is also improved, particularly in the case where the same key is used repeatedly.
A faster substring search algorithm, which is based on the Boyer-Moore-Horspool algorithm, is used for a number of methods on str
, bytes
and bytearray
objects such as split()
, rsplit()
, splitlines()
, rfind()
and rindex()
.
Finally, int
to str
conversions now process two digits at a time to reduce the number of arithmetic operations required.
There’s a whole host of little changes which didn’t sit nicely in their own section. Strap in and prepare for the data blast!
WSGI Clarifications: The WSGI spec has been updated to clarify which string types to use (str vs. bytes) and encodings to use. This is important reading for anyone building web apps conforming to WSGI.

range Improvements: range objects now support index() and count() methods, as well as slicing and negative indices, to make them more interoperable with list and other sequences.

csv Improvements: The csv module now supports a unix_dialect output mode where all fields are quoted and lines are terminated with \n. Also, csv.DictWriter has a writeheader() method which writes a row of column headers to the output file, using the key names you provided at construction.

tempfile.TemporaryDirectory Added: The tempfile module now provides a TemporaryDirectory context manager for easy cleanup of temporary directories.

Popen() Context Managers: os.popen() and subprocess.Popen() can now act as context managers to automatically close any associated file descriptors.

configparser Always Uses Safe Parsing: configparser.SafeConfigParser has been renamed to ConfigParser to replace the old unsafe one. The default settings have also been updated to make things more predictable.

select.PIPE_BUF Added: The select module has added a PIPE_BUF constant which defines the minimum number of bytes which is guaranteed not to block when select.select() has indicated that a pipe is ready for writing.

callable() Re-introduced: The callable() builtin from Python 2.x was re-added to the language, as it's a more readable alternative to isinstance(x, collections.Callable).

ast.literal_eval() For Safer eval(): The ast module has a useful literal_eval() function which can be used to evaluate expressions more safely than the builtin eval().

reprlib.recursive_repr() Added: When writing __repr__() special methods, it's easy to forget to handle the case where a container can contain a reference to itself, which easily leads to __repr__() calling itself in an endlessly recursive loop. The reprlib module now provides a recursive_repr() decorator which will detect the recursive call and add ... to the string representation instead.

Numeric Hashing Unified: Hashing is now consistent across the numeric types, so that hash(1) == hash(1.0) == hash(1+0j).

hashlib.algorithms_available Added: The hashlib module now provides the algorithms_available set which indicates the hashing algorithms available on the current platform, as well as algorithms_guaranteed which are the algorithms guaranteed to be available on all platforms.

hasattr() Improvements: hasattr() has been fixed. This works by calling getattr() and checking whether an exception is thrown. This approach allows it to support the multiple ways in which an attribute may be provided, such as implementing __getattr__(). However, prior to this release hasattr() would catch any exception, which could mask genuine bugs. As of Python 3.2 it will only catch AttributeError, allowing any other exception to propagate out.

memoryview.release() Added: memoryview objects now have a release() method and support use as a context manager. These objects allow a zero-copy view into any object that supports the buffer protocol, which includes the builtins bytes and bytearray. Some objects may need to allocate resources in order to provide this view, particularly those provided by C/C++ extension modules. The release() method allows these resources to be freed earlier than the memoryview object itself going out of scope.

structsequence Tool Improvements: The structsequence tool has been updated so that C structures returned by the likes of os.stat() and time.gmtime() now work like namedtuple and can be used anywhere a tuple is expected.

-q Option Added: A new command-line option to the interpreter enables "quiet" mode, which suppresses the copyright and version information being displayed in interactive mode. I struggle a little to think of cases where this would matter, I'll be honest — perhaps if you're embedding the interpreter as a feature in a larger application?

Well now, I must admit that I did not expect that to be double the size of the post covering Python 3.0! If you've come here reading that whole article in one go, I must say I'm impressed. Perhaps lay off caffeine for awhile…?
Overall it feels like a really massive release, this one. Admittedly I did cover a high proportion of the details, whereas in the first article I glossed over quite a lot as some of the changes were so massive I wanted to focus on them.
Out of all that, it’s really hard to pick only a few highlights, but I’ll give it a go. As I said at the outset I love argparse
— anyone who writes command-line tools and cares about their usability should save a lot of hassle with this. Also, the concurrent.futures
module is great — I've only really started using it recently, and I love how it makes it really convenient to add parallelism in simple cases to applications where the effort might otherwise be too high to justify.
The functools.lru_cache()
and functools.total_ordering()
decorators are both great additions because they offer significant advantages with minimal coding effort, and this is the sort of feature that a language like Python should really be focusing on. It’s never going to beat C or Rust in the performance stakes, but it has real strengths in time to market, as well as the concision and elegance of code.
It’s also great to see some updates to the suite of Internet-facing modules, as having high quality implementations of these in the standard library is another great strength of Python that needs to be maintained. SSL adding support for SNI is a key improvement that can’t come too soon, as it still seems a long way off that we’ll be saying goodbye to the limited address space of IPv4.
Finally, the GIL changes are great to see. Although we’d all love to see the GIL be deprecated entirely, this is clearly a very difficult problem or it would have been addressed by now. Until someone can come up with something clever to achieve this, at least things are significantly better than they were for multithreaded Python applications.
So there we go, my longest article yet. If you have any feedback on the amount of detail that I’m putting in (either too much or too little!) then I’d love to hear from you. I recently changed my commenting system from Disqus to Hyvor which is much more privacy-focused and doesn’t require you to register an account to comment, and also has one-click feedback buttons. I find writing these articles extremely helpful for myself anyway, but it’s always nice to know if anyone else is reading them! If you’re reading this on the front-page, you can jump to the comments section of the article view using the link at the end of the article at the bottom-right.
OK, so that’s it — before I even think of looking at the Python 3.3 release notes, I’m going to go lie down in a darkened room with a damp cloth on my forehead.
In real production environments you should use many more iterations than this, a bigger salt and ideally a better key derivation function like scrypt, as defined in RFC 7914. Unfortunately that won't be in Python until 3.6. ↩
Maybe more due to hyperthreading, but my assumption was that it wouldn’t help much with a CPU-intensive task like password hashing. My results seemed to validate that assumption. ↩
Spoiler alert: using my time machine I can tell you it’s not a lot else yet, at least as of 3.10.0a5. ↩
And you know I love to get technical. ↩