This article continues the series looking at features added in each release of Python 3.x, with this one covering the move from 3.0 to 3.1. It includes the new containers OrderedDict and Counter, making modules executable as scripts, and marking unit tests as known failures. If you’re puzzled why I’m looking at releases that are years old, check out the first post in the series.
This is the 2nd of the 32 articles that currently make up the “Python 3 Releases” series.
The previous post in this series tries to explain why it’s 2021 and I appear to be posting what’s “new” in a release of Python over a decade old. In that article I trawled through what I considered the major changes included in Python 3.0, and in this one we’re moving on to release 3.1. I’m hoping I can drill into the remaining releases in a little more detail, given that the number of features added in each is rather smaller than in the jump from Python 2.x to Python 3.x.
And so, with as little further ado as I can manage, let’s get started.
Over the years many people have found use-cases for a dict where the order of insertion is maintained when you iterate through the entries. If you poked around a lot of Python codebases, you’d likely find quite a few implementations of this, but thankfully in this release it was added to the standard library in the form of collections.OrderedDict, as per PEP 372.
There are a few uses for such a class, one being the configparser module, which was also updated to use this new class. This means that the ordering of configuration items read in from a file can be preserved on write, which could be helpful if you need to manually diff them to see the changes.
One of the most common uses is to implement an LRU cache, and I’m going to use that as an example to illustrate the behaviour of OrderedDict. For anyone who’s unaware, an LRU cache is for cases where you need associative storage with a fixed maximum size, such as caching the result of some lookup. If your lookup involves a slow remote request and you’re doing this frequently then caching the result in memory can massively improve performance. However, you don’t want to exhaust memory by caching too many values, so once the cache hits some size limit you want to throw away old entries to make room for new ones. The way you do this is by discarding the least recently used (LRU) value, which should be the least useful one to cache, since in many real-world use cases recently used entries are generally the most likely to be hit again soon.
Here’s a sketch of the code (the LRUCache class name and its max_size parameter are just illustrative choices), and I’ll put the discussion below.
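from collections import OrderedDict

class LRUCache:
    """A minimal LRU cache built on OrderedDict (an illustrative sketch)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.storage = OrderedDict()

    def get(self, key):
        # pop() the entry and re-insert it, so it moves to the end of
        # the ordering and becomes the most recently used item.
        value = self.storage.pop(key)
        self.storage[key] = value
        return value

    def set(self, key, value):
        # Remove any existing entry first: simply assigning would
        # update the value but leave the item in its old position.
        self.clear(key)
        self.storage[key] = value
        # Evict least recently used entries (the front of the
        # ordering) until we're back within the size limit.
        while len(self.storage) > self.max_size:
            self.storage.popitem(last=False)

    def clear(self, key):
        # Remove the entry if present, via pop() on the OrderedDict.
        try:
            self.storage.pop(key)
        except KeyError:
            pass

    def __len__(self):
        return len(self.storage)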
As you can see, all the heavy lifting is done by the OrderedDict class itself. The first thing to note is that I implemented this as a wrapper rather than subclassing OrderedDict directly. This is to avoid annoying bugs due to self-recursive calls. For example, the implementation of the pop() method refers to self[key], and as you can see the get() method above calls pop(). Had I overridden __getitem__() myself instead of making a new get() method then the underlying pop() would have jumped straight back into my __getitem__(), thus creating a recursion loop resulting in a RecursionError exception.
The second thing to note is that to move items to the end of the LRU list in get() I’m removing the item with pop() and re-adding it. This is the only way to shuffle an item to the end of the list in Python 3.1. Thankfully in Python 3.2 a move_to_end() method was added to perform this action more gracefully, and more efficiently too, but since the conceit of this post is that Python 3.2 doesn’t exist yet I can’t possibly know that unless I’m psychic¹.
The third point of interest is that I’m also calling pop() in set() (via the call to clear()). This is unnecessary if the key doesn’t already exist in the OrderedDict, because a newly added key is always added at the end of the LRU list. However, if the key does exist then the value is updated but the position of the item in the list is not changed. This behaviour may be helpful in some use-cases for OrderedDict, but in our example we do want the item to move, so we remove and re-add it to force this. Once again, in Python 3.2 and beyond move_to_end() would be a better approach.
So given that OrderedDict supports everything dict does, but adds some ordering behaviour in some cases, should we always use it instead of dict?
Well, I wouldn’t, personally. For one thing, OrderedDict is implemented in pure Python in the CPython library, so its performance is probably going to be at least marginally lower than dict. Also, the memory consumption is probably significantly higher, as it uses a doubly-linked list to store items in access order. Since Python doesn’t have a native linked-list type it implements its own, again in pure Python, although it does at least use __slots__ to minimise the additional memory overhead of this. It also has to use weakref.proxy in a few places to avoid issues with circular references, and this is going to add a little more overhead.
All this means that although the time complexity of all the methods is no worse than dict, the time overheads in real-world usage are going to be at least somewhat higher. More notably, the memory requirements are going to be significantly higher for larger structures. As well as the obvious hit on available memory, increasing the size of in-memory structures can reduce locality of reference, which can further impact performance by reducing the benefits of the L2 and L3 caches. That said, if you build your OrderedDict in a short amount of time and iterate through the items in the original insertion order then you might actually find your locality of reference improves, on the basis that blocks of memory allocated close together in time are more likely to be physically close in memory.
When all’s said and done, though, the overheads are not likely to be significant in many common use-cases, and its implementation is almost certainly better (and safer) than you’ll manage yourself unless you spend a lot of time on it. Since a lot of software development is optimised for time to market these days, it’s definitely a very useful addition to the library. I just wouldn’t advise using it to replace dict for standard uses where you don’t actually need the ordering.
The collections module is doubly blessed in this release, as there’s also a collections.Counter class added for counting unique instances. I remember when I first came across this I was a little puzzled why they’d added it when it seemed that collections.defaultdict(int) would do the job just as well. However, this class has some features, not all of them immediately obvious, which set it apart.
The key point to note is that this container is less like a conventional dict and more like a multiset in some other languages. A good example is the elements() method, which iterates through all the items as many times as they’re counted, as if it were a true multiset.
>>> y = collections.Counter()
>>> y["two"] += 2
>>> y["three"] += 3
>>> list(y)
['two', 'three']
>>> list(y.elements())
['two', 'two', 'three', 'three', 'three']
It’s also possible to add and subtract counters, which combines the counts of their elements, dropping any items whose counts go to (or below) zero in the result. Being a form of set, they also support intersection with & and union with |.
Finally, there’s also a most_common() method which returns the top N items sorted in decreasing order of count.
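Here’s a quick demonstration of all of these, with a couple of made-up counters:

>>> a = collections.Counter(spam=5, eggs=2)
>>> b = collections.Counter(spam=1, eggs=3)
>>> a + b
Counter({'spam': 6, 'eggs': 5})
>>> a - b
Counter({'spam': 4})
>>> a & b
Counter({'eggs': 2, 'spam': 1})
>>> a | b
Counter({'spam': 5, 'eggs': 3})
>>> (a + b).most_common(1)
[('spam', 6)]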
All in all this is a simple-seeming container, but with some handy features which make it a useful choice in quite a few use-cases. Having started off questioning why it was even added, the more I’ve looked at this class the more I think it’s a bit of a hidden gem.
The itertools module has also seen some love with a few smaller changes.
First up is the combinations_with_replacement() function, which is a variant of combinations() that allows individual elements to be repeated more than once. A bit specialised, but for the cases where that’s what you need it’ll definitely be nice not having to write it yourself.
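For example:

>>> import itertools
>>> list(itertools.combinations_with_replacement("ABC", 2))
[('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'B'), ('B', 'C'), ('C', 'C')]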
Next is compress(), which takes two iterables, typically of the same length, and filters the first to only include elements where the corresponding element in the second evaluates to True.
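For example, using truthy and falsey integers as the selectors:

>>> list(itertools.compress("ABCDEF", [1, 0, 1, 0, 1, 1]))
['A', 'C', 'E', 'F']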
Finally, the count() generator now has an optional step parameter which can accept multiple types of numeric interval, including those from the fractions and decimal modules. It would have been handy to also support datetime.timedelta(), but that would probably be stretching this iterator’s purpose a little too far.
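For example, counting in steps of a third starting from a half:

>>> from fractions import Fraction
>>> from itertools import count, islice
>>> list(islice(count(Fraction(1, 2), Fraction(1, 3)), 4))
[Fraction(1, 2), Fraction(5, 6), Fraction(7, 6), Fraction(3, 2)]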
You may recall from the String Formatting section in the previous post the following snippet:
>>> # This requires locale to be set first...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'en_GB.UTF-8'
>>> # Use locale-specific number separator.
>>> "{0:n}".format(1234567)
'1,234,567'
The thousands separators are quite useful, but having to ensure a locale is set can be a bit of a pain, especially in short scripts. Hence, Python 3.1 adds a new option for a non-locale-aware thousands separator, which always uses a comma every three digits. This is enabled by putting a , prior to the precision, and can be used with any of the base-10 numeric output formats.
>>> "{0:,.4f}".format(12345678.123456)
'12,345,678.1235'
A small change, but probably quite useful for all sorts of diagnostic output where locale correctness is probably secondary to functionality.
Additionally, there’s another handy change which removes the need for explicitly numbering the arguments in format parameters if they’re referenced in the same order they’re defined. Again taking the example from the previous post, here’s the old approach and the new one compared:
>>> "My name is {1} and I'm {0} years old".format(40, "Brian")
"My name is Brian and I'm 40 years old"
>>> "My name is {} and I'm {} years old".format("Brian", 40)
"My name is Brian and I'm 40 years old"
Directories and zip files can now include a __main__.py and have this executed if run as a script. This is particularly useful for .zip files on Unix, because you can prepend the file with a shebang line which will be respected by the kernel when you attempt to execute the file (assuming the user has permission to execute it). The zip implementation that the Python interpreter uses to look into zip files is tolerant of unknown data at the start of the file, so despite this header the Python interpreter is still able to import libraries from the zip file as normal.
The net result of all this is that you can distribute a Python application, including multiple additional modules, and have it be executable without the user needing to unpack it.
The terminal session below demonstrates this, if you’re familiar enough with Linux/Unix to follow along:
$ cat __main__.py
import mymodule
mymodule.myfunc()
$ cat mymodule.py
def myfunc():
print("Hello, world")
$ zip /tmp/pytmp.zip *.py
adding: __main__.py (deflated 15%)
adding: mymodule.py (stored 0%)
$ cat > /tmp/pyexec.zip
#!/usr/bin/python3
$ cat /tmp/pytmp.zip >> /tmp/pyexec.zip
$ chmod 0755 /tmp/pyexec.zip
$ /tmp/pyexec.zip
Hello, world
The unzip utility can still deal with the archive, even helpfully pointing out the additional bytes due to the shebang line:
$ unzip -l /tmp/pyexec.zip
Archive: /tmp/pyexec.zip
warning [/tmp/pyexec.zip]: 19 extra bytes at beginning or within zipfile
(attempting to process anyway)
Length Date Time Name
--------- ---------- ----- ----
34 01-29-2021 23:10 __main__.py
41 01-29-2021 23:10 mymodule.py
--------- -------
75 2 files
There are a few changes relating to context managers in this release.
A simple change has been made to the syntax of the with statement to allow multiple context managers to be specified. Although simple, it’s rather handy in certain use-cases. The best example of this I know is when you’re parsing data from one file into another. Previously I used to use nested with statements, along these lines:
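# An illustrative example: the filenames and the upper-casing
# transformation are invented for demonstration purposes.
with open("input.txt") as in_fd:
    with open("output.txt", "w") as out_fd:
        for line in in_fd:
            out_fd.write(line.upper())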
There was a contextlib.nested() function which could be used to stack multiple context managers on one line, but it wasn’t the neatest and had some annoying quirks. Now you can simply append additional managers with a comma:
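# The same illustrative example using the new syntax.
with open("input.txt") as in_fd, open("output.txt", "w") as out_fd:
    for line in in_fd:
        out_fd.write(line.upper())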
Beautiful. I’m in serious danger of breaking into a James Blunt song here, so let’s move on quickly — neither of us wants to risk that.
The use of the with statement has also been expanded a little more, with gzip.GzipFile and bz2.BZ2File now supporting the context manager protocol. This change is more like addressing a historical omission, and frankly I’m surprised it took this long.
The unittest module got some enhancements in 3.1. The first of these is that it supports skipping tests, or classes of tests, based on arbitrary criteria, and it also allows tests to be flagged as an expected failure, which means they won’t count as failures for the purposes of determining whether the test suite passed. Both of these handy features are implemented using decorators, as in this invented example:
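import sys
import unittest

# The test names and conditions here are made up for illustration.
class SkippingExamples(unittest.TestCase):

    @unittest.skip("demonstrating unconditional skipping")
    def test_always_skipped(self):
        self.fail("this is never executed")

    @unittest.skipUnless(sys.platform.startswith("linux"), "requires Linux")
    def test_linux_only(self):
        # Runs only when the arbitrary condition above is satisfied.
        self.assertTrue(True)

    @unittest.expectedFailure
    def test_known_failure(self):
        # Recorded as an expected failure rather than failing the run.
        self.assertEqual(1, 2)

if __name__ == "__main__":
    unittest.main()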
Next up, the useful assertRaises() can now be used as a context manager. This makes it a lot easier to check whether a block of code raises a particular expected exception, particularly if that block of code is a little convoluted. A trivial invented example:
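import unittest

class AssertRaisesExample(unittest.TestCase):

    def test_division_by_zero(self):
        # The test passes only if this block raises ZeroDivisionError.
        with self.assertRaises(ZeroDivisionError):
            1 / 0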
Finally, a set of additional assertions have been added whose primary value is typically the diagnostic detail they provide on failures without the programmer having to take any extra steps. Here are some examples:
assertSetEqual()
assertDictEqual(), assertDictContainsSubset()
assertListEqual(), assertTupleEqual(), assertSequenceEqual()
assertRaisesRegexp()
assertIsNone(), assertIsNotNone()
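For instance, with an invented failing test like this one, assertListEqual() reports the first differing element and a diff of the two lists, rather than just declaring a generic mismatch:

import unittest

class DiagnosticsExample(unittest.TestCase):

    def test_lists(self):
        # On failure the message pinpoints the differing element and
        # includes a diff of the two sequences.
        self.assertListEqual([1, 2, 3], [1, 2, 4])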
There’s also a collection of small changes in assorted numeric types, including int, float and decimal.Decimal.
First up, int objects gained a bit_length() method to indicate the minimum number of bits required to store the binary representation of the number. This one seems a little esoteric, but could be handy for serialisation code that’s picking between multiple different integer representations, or that sort of thing.
>>> (255).bit_length()
8
>>> (2**63).bit_length()
64
>>> (2**63-1).bit_length()
63
>>> (5**5**5).bit_length()
7257
Sticking with int, the behaviour of round() has been updated. Previously this function would always return a float regardless of the type it was passed as input. As of Python 3.1, however, if round() is passed an int then it returns an int instead.
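For example:

>>> round(1234, -2)
1200
>>> type(round(1234, -2))
<class 'int'>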
Moving on to float, the string representation created by repr() has been made more intuitive in certain cases by using an alternative algorithm by David Gay, which selects the shortest string that still round-trips to exactly the same value. It’s outside the scope of this article to go into the full details, as floating point is hard, but suffice to say that repr() of certain float values should now seem a bit more intuitive.
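For example, under Python 3.0 the value 1.1 echoed back at the interactive prompt as 1.1000000000000001, whereas now it displays as you’d expect:

>>> 1.1
1.1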
One minor wrinkle here is that the change in representation is going to change what the interactive interpreter shows you by default, so it could break any docstrings you have using doctest, because the float values shown may now be shorter.
Continuing the numeric theme, the decimal.Decimal class can now be constructed from a float, via the new from_float() class method. It’s worth noting that this can have some surprising results, due to the inability of binary floating point to represent precisely the same numbers as decimal floats, but as long as you round off to a sensible number of significant digits then you shouldn’t run into too many surprises.
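For example, the exact value of the binary float nearest to 0.1 is rather less tidy than you might hope:

>>> from decimal import Decimal
>>> Decimal.from_float(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')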
The pickle module has had a couple of small changes. Firstly, it’s been updated for better interoperability with Python 2.x, where in many cases the same objects exist but with different names. For example, __builtin__.set in Python 2.x exists as builtins.set in Python 3.x. This unfortunately breaks compatibility with Python 3.0, but only if protocol version 2 is used — if you can assume Python 3.x only then use protocol version 3, which works fine across all Python 3.x versions (but not Python 2.x).
In other pickle news, functools.partial objects can now be serialised with the pickle module. This is still subject to the standard pickle limitation that functions are serialised by fully qualified name only, relying on the same function being importable when the data is deserialised.
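For example, here’s a round-trip of a made-up partial wrapping the built-in int():

>>> import pickle
>>> from functools import partial
>>> from_hex = partial(int, base=16)
>>> restored = pickle.loads(pickle.dumps(from_hex))
>>> restored("ff")
255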
There are a few other small points which I didn’t think deserved their own section.
string.maketrans() Deprecated
string.maketrans() has been deprecated since, rather confusingly for a function in the string module, it required bytes or bytearray objects instead of str. To avoid this confusion, str, bytes and bytearray now all have their own maketrans() and translate() methods. This is a somewhat obscure function unless you’re doing lots of character substitutions in your code, but for people who use it this definitely feels more consistent with the way other methods on these classes are accessed.
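For example, using the new str methods:

>>> table = str.maketrans("abc", "xyz")
>>> "abcabc".translate(table)
'xyzxyz'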
collections.namedtuple rename Parameter Added
collections.namedtuple now offers a rename parameter which, if true, causes invalid field names to be replaced with numeric positional names such as _0 or _1. This is useful where you don’t have control of the names and they may not be valid Python identifiers, such as constructing a namedtuple from the result column names in an SQL query where one of the columns is called COUNT(*).
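For example, with a hypothetical COUNT(*) column:

>>> from collections import namedtuple
>>> Row = namedtuple("Row", ["name", "COUNT(*)"], rename=True)
>>> Row._fields
('name', '_1')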
logging.NullHandler() Added
A NullHandler class was added to the logging module: a handler which always throws away its log messages. It’s useful for applications which don’t use logging themselves but use libraries which do, since unless you configure a handler you get irritating “No handlers could be found…” warning messages. Libraries can also add a NullHandler to their own loggers, to avoid applications having to worry about this. Indeed, as explained in the logging tutorial, this is the only handler that a library should add, since handling of log events should be left to the application to configure.
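For example, a library called mylibrary (an invented name) might do this once at import time:

import logging

# Attach a do-nothing handler so that applications which don't
# configure logging don't see the "No handlers could be found"
# warning when the library logs something.
logging.getLogger("mylibrary").addHandler(logging.NullHandler())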
sys.version_info Now a namedtuple
The sys.version_info structure is now a namedtuple, so its attributes can more helpfully be accessed by name.
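For example, on a Python 3.1 interpreter:

>>> import sys
>>> sys.version_info.major, sys.version_info.minor
(3, 1)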
nntplib and imaplib IPv6 Support
IPv6 support was added to the nntplib and imaplib modules.

importlib Added
The importlib library was added, which is a portable pure Python implementation of the import statement. This is hoped to add transparency to the import process.

io Rewritten in C
The io library was previously pure Python, but has now been rewritten in C for a 2-20x speed up.

json C Extension Added
The json module now has a C extension to improve performance.

--with-computed-gotos Added
The new --with-computed-gotos compile flag gives speedups of up to 20% on the bytecode evaluation loop, depending on the platform, which is always welcome.

int Representation Changed
int values were previously stored in base 2**15, but on 64-bit platforms are now stored in base 2**30, which significantly improves the performance of integer arithmetic on those platforms.

It’s been particularly interesting for me going through these changes so long after they were first released, because I’m realising how much I missed. Many of the smaller changes aren’t particularly impactful — I can’t see int.bit_length()
becoming my most-used function, for example — but things like multiple context managers in a single with statement, and the non-locale-dependent thousands separator, are handy little details to have up your sleeve. The versatility of collections.Counter as a multiset also makes it well worth keeping in mind for when you need it.
Overall another great batch of improvements, and it’s nice to see performance enhancements because Python 3.x as a whole was initially a bit of a jump down from Python 2.x from a performance standpoint. Also, I’m mildly surprised at the extent of the changes, as in my head 3.1 was a fairly minor release to tidy up a few small points. I’m now looking forward to going through the remainder of the 3.x releases to see what other gems I’ve been missing out on.
You can tell I’m definitely not psychic because at the end of 2019 I didn’t pile all my savings into Zoom stocks. ↩