☑ Python 2to3: What’s New in 3.1

31 Jan 2021 at 8:45PM in Software

This article continues the series looking at features added in each release of Python 3.x, with this one covering the move from 3.0 to 3.1. It includes the new containers OrderedDict and Counter, making modules executable as scripts, and marking unit tests as known failures. If you’re puzzled why I’m looking at releases that are years old, check out the first post in the series.

This is the 2nd of the 13 articles that currently make up the “Python 2to3” series.


The previous post in this series tries to explain why it’s 2021 and I appear to be posting what’s “new” in a release of Python that’s well over a decade old. In that article I trawled through what I considered the major changes included in Python 3.0, and in this one we’re moving on to release 3.1. I’m hoping I can drill into the remaining releases in a little more detail, given that the number of features added in each is smaller than in the change from Python 2.x to Python 3.x.

And so, with as little further ado as I can manage, let’s get started.

Ordered Dictionaries

Over the years many people have found use-cases for a dict where the order of insertion is maintained when you iterate through the entries. If you poked around a lot of Python codebases, you’d likely find quite a few implementations of this, but thankfully in this release it was added to the standard library in the form of collections.OrderedDict as per PEP 372.

There are a few uses for such a class, one being the configparser module which was also updated to use this new class. This means that the ordering of configuration items read in from a file can be preserved on write, which could be helpful if you need to manually diff them to see the changes.

LRU Cache Using OrderedDict

One of the most common uses is to implement an LRU cache and I’m going to use that as an example to illustrate the behaviour of OrderedDict. For anyone who’s unaware, an LRU cache is for cases where you need associative storage with a fixed maximum size, such as caching the result of some lookup. If your lookup involves a slow remote request and you’re doing this frequently then caching the result in memory can massively improve performance. However, you don’t want to exhaust memory by caching too many values, so once the cache hits some size limit you want to throw away old entries to make room for new ones. The way you do this is by discarding the least recently used (LRU) value, which is presumed to be the least useful to cache, since in many real-world use cases recently used entries are more likely to be hit again soon.

Here’s the code, and I’ll put the discussion below.

import collections

class LRUCache:

    def __init__(self, capacity):
        self._cache = collections.OrderedDict()
        self._capacity = max(0, int(capacity))

    def __iter__(self):
        """Just iterate through keys to keep code simple."""
        return iter(self._cache)

    def _trim(self):
        """Drop items as required to cap size at capacity."""
        while (len(self._cache) > self._capacity):
            self._cache.popitem(last=False)

    def get(self, key):
        """Retrieve item and, if found, move to end of LRU list."""
        value = self._cache.pop(key)
        self._cache[key] = value
        return value

    def set(self, key, value):
        """Add or update item, moving to end of LRU list."""
        self.clear(key)
        self._cache[key] = value
        self._trim()

    def clear(self, key):
        """Remove a value from the cache."""
        self._cache.pop(key, None)

    def resize(self, new_capacity):
        """Set the new capacity then re-trim."""
        self._capacity = max(0, int(new_capacity))
        self._trim()

As you can see, all the heavy lifting is done by the OrderedDict class itself. The first thing to note is that I implemented this as a wrapper rather than subclassing OrderedDict directly. This is to avoid annoying bugs due to unintended recursion. For example, the implementation of the pop() method refers to self[key], and as you can see the get() method above calls pop(). Had I overridden __getitem__() myself instead of making a new get() method then the underlying pop() would have jumped straight back into my __getitem__(), thus creating a recursion loop resulting in a RecursionError exception.

The second thing to note is that to move items to the end of the LRU list in get() I’m removing the item with pop() and re-adding it again. This is the only way to shuffle an item to the end of the list in Python 3.1. Thankfully in Python 3.2 a move_to_end() method was added to perform this action more gracefully and more efficiently too, but since the conceit of this post is that Python 3.2 doesn’t exist yet I can’t possibly know that unless I’m psychic[1].

The third point of interest is that I’m also calling pop() in set() (via the call to clear()). This is unnecessary if the key doesn’t already exist in the OrderedDict because a newly added key is always added at the end of the LRU list. However, if the key does exist then the value is updated but the position of the item in the list is not changed. This behaviour may be helpful in some use-cases for OrderedDict, but in our example we do want it to move, so we remove and re-add this to force it. Once again, in Python 3.2 and beyond move_to_end() would be a better approach.
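
To make the ordering behaviour concrete, here’s a minimal sketch with some made-up keys and values. It shows that updating an existing key leaves its position alone, that pop-and-re-add moves it to the end, and that move_to_end() (Python 3.2 onwards, so strictly cheating here) does the same thing more directly.

```python
import collections

d = collections.OrderedDict()
d["a"] = 1
d["b"] = 2
d["c"] = 3

# Updating an existing key changes the value but not the position.
d["a"] = 10
order_after_update = list(d)        # ['a', 'b', 'c']

# The Python 3.1 approach: remove the key and add it back at the end.
d["a"] = d.pop("a")
order_after_readd = list(d)         # ['b', 'c', 'a']

# Python 3.2 onwards: move_to_end() does this in one step.
d.move_to_end("b")
order_after_move = list(d)          # ['c', 'a', 'b']

# popitem(last=False) discards the oldest entry, as _trim() does above.
oldest = d.popitem(last=False)      # ('c', 3)
```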

OrderedDict Implementation

So given that OrderedDict supports everything dict does, but adds some ordering behaviour in some cases, should we always use it instead of dict?

Well, I wouldn’t, personally. For one thing, OrderedDict is implemented in pure Python in the CPython library, so its performance is probably going to be at least marginally lower than dict. Also, the memory consumption is probably significantly higher as it uses a doubly-linked list to store items in insertion order. Since Python doesn’t have a native linked-list type it implements its own, again in pure Python, although it does at least use __slots__ to minimise the additional memory overhead of this. It also has to use weakref.proxy in a few places to avoid issues with circular references, and this is going to add a little more overhead.

All this means that although the time complexity of all the methods is no worse than dict, the time overheads in real-world usage are going to be at least somewhat higher. More notably, the memory requirements are going to be significantly higher for larger structures. As well as the obvious hit on available memory, increasing the size of in-memory structures can reduce the locality of reference which can further impact performance by reducing the benefits of the L2 and L3 caches. That said, if you build your OrderedDict in a short amount of time and iterate through the items in the original insertion order then you might actually find your locality of reference improves, on the basis that blocks of memory allocated close together in time are more likely to be physically close in memory.

When all’s said and done, though, the overheads are not likely to be significant in many common use-cases and its implementation is almost certainly better (and safer) than you’ll manage yourself unless you spend a lot of time on it. Since a lot of software development these days is optimised for time to market, it’s definitely a very useful addition to the library. I just wouldn’t advise using it to replace dict for standard uses where you don’t actually need the ordering.

New Container Where It Counts

The collections module is doubly blessed in this release as there’s also a collections.Counter class added for counting unique instances. I remember when I first came across this I was a little puzzled why they’d have added this when it seemed that collections.defaultdict(int) would do the job just as well. However, this class has some features which aren’t immediately obvious which set it apart.

The key point to note is that this container is less like a conventional dict and more like a multiset in some other languages. A good example is the elements() method, which iterates through all the items as many times as they’re counted, as if it were a true multiset.

>>> y = collections.Counter()
>>> y["two"] += 2
>>> y["three"] += 3
>>> list(y)
['two', 'three']
>>> list(y.elements())
['two', 'two', 'three', 'three', 'three']

It’s also possible to add and subtract counters, which produces a new counter with the counts combined, dropping any items whose counts go to (or below) zero. Being a form of set, they also support intersection with & and union with |.

Finally, there’s also a most_common() method which returns the top N items in the set, sorted in descending order of count.
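
Here’s a short sketch of those operations, using some invented fruit counts and a throwaway sentence. Intersection keeps the minimum of each pair of counts and union keeps the maximum.

```python
import collections

a = collections.Counter({"apple": 3, "pear": 1})
b = collections.Counter({"apple": 1, "pear": 2, "plum": 1})

total = a + b     # counts are added together
diff = a - b      # counts are subtracted, and non-positive results dropped
common = a & b    # intersection: minimum of each count
either = a | b    # union: maximum of each count

# most_common(N) returns (item, count) pairs, highest count first.
words = "the cat sat on the mat with the other cat".split()
top_two = collections.Counter(words).most_common(2)
```

Note that pear and plum vanish entirely from diff, since their counts would be negative.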

All in all this is a simple-seeming container, but with some handy features which make it a useful choice in quite a few use-cases. Having started off questioning why it was even added, the more I’ve looked at this class the more I think it’s a bit of a hidden gem.

Itertools

The itertools module has also seen some love with a few smaller changes.

First up is the combinations_with_replacement() function, which is a variant of combinations() that allows individual elements to be repeated. It’s a bit specialised, but in the cases where it’s what you need it’ll definitely be nice not having to write it.

Next is compress(), which takes two iterables, typically of the same length, and filters the first iterable to only include elements where the corresponding element in the second iterable evaluates as true.

Finally, the count() generator now has an optional step parameter which can accept multiple types of numeric interval, including those from the fractions and decimal modules. It would have been handy to also support datetime.timedelta(), but that’s probably a bit too much of a stretched overload of this iterator’s purpose.
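
A quick sketch of all three follows, with inputs chosen purely for illustration. Since count() is infinite, I’m using islice() to take just the first few values.

```python
import itertools
from decimal import Decimal
from fractions import Fraction

# Like combinations(), but an element may be paired with itself.
pairs = list(itertools.combinations_with_replacement("ABC", 2))

# Keep only the items whose corresponding selector is true.
kept = list(itertools.compress("ABCDEF", [1, 0, 1, 0, 1, 1]))

# count() with a Fraction step stays exact, unlike repeated float addition.
thirds = list(itertools.islice(itertools.count(Fraction(0), Fraction(1, 3)), 4))

# The same works with Decimal.
tenths = list(itertools.islice(itertools.count(Decimal("0.0"), Decimal("0.1")), 3))
```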

String Format Enhancements

You may recall from the String Formatting section in the previous post the following snippet:

>>> # This requires locale to be set first...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'en_GB.UTF-8'
>>> # Use locale-specific number separator.
>>> "{0:n}".format(1234567)
'1,234,567'

The thousands separators are quite useful, but having to ensure a locale is set may be a bit of a pain, especially in short scripts. Hence, Python 3.1 adds a new option for a non-locale-aware thousands separator, which always uses a comma every 3 digits. This can be added by putting a , prior to the precision, and can be used with any of the base-10 numeric output formats.

>>> "{0:,.4f}".format(12345678.123456)
'12,345,678.1235'

A small change, but probably quite useful for all sorts of diagnostic output where locale correctness is probably secondary to functionality.

Additionally, there’s another handy change which removes the need for explicitly numbering the arguments in format parameters if they’re referenced in the same order they’re defined. Again taking the example from the previous post, here’s the old approach and the new one compared:

>>> "My name is {1} and I'm {0} years old".format(40, "Brian")
"My name is Brian and I'm 40 years old"
>>> "My name is {} and I'm {} years old".format("Brian", 40)
"My name is Brian and I'm 40 years old"

Executable Modules

Directories and zip files can now include a __main__.py and have this executed if run as a script. This is particularly useful for .zip files on Unix because you can prepend the file with a shebang line which will be respected by the kernel when you attempt to execute the file (assuming the user has permission to execute that file). The zip implementation that the Python interpreter uses to look into zip files is tolerant of unknown data at the start of the file, so despite this header the Python interpreter is still able to import libraries from the zip file as normal.

The net result of all this is that you can distribute a Python application, including multiple additional modules, and have it be executable without the user needing to unpack it.

The terminal session below demonstrates this, if you’re familiar enough with Linux/Unix to follow along:

$ cat __main__.py
import mymodule
mymodule.myfunc()

$ cat mymodule.py
def myfunc():
    print("Hello, world")

$ zip /tmp/pytmp.zip *.py
  adding: __main__.py (deflated 15%)
  adding: mymodule.py (stored 0%)
$ cat > /tmp/pyexec.zip
#!/usr/bin/python3
$ cat /tmp/pytmp.zip >> /tmp/pyexec.zip
$ chmod 0755 /tmp/pyexec.zip
$ /tmp/pyexec.zip
Hello, world

The unzip utility can still deal with the archive, even helpfully pointing out the additional bytes due to the shebang line:

$ unzip -l /tmp/pyexec.zip
Archive:  /tmp/pyexec.zip
warning [/tmp/pyexec.zip]:  19 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
       34  01-29-2021 23:10   __main__.py
       41  01-29-2021 23:10   mymodule.py
---------                     -------
       75                     2 files
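
The tolerance for leading data can also be seen from within Python. The following is only a sketch using the zipfile module on an in-memory archive, with a made-up one-line __main__.py, rather than a demonstration of the interpreter’s own import machinery.

```python
import io
import zipfile

# Build a small zip archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("__main__.py", "print('Hello, world')\n")

# Prepend a shebang line, as the shell session above did with cat.
data = b"#!/usr/bin/python3\n" + buf.getvalue()

# zipfile still finds the archive contents despite the extra leading bytes.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    names = zf.namelist()
    source = zf.read("__main__.py")
```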

Context Managers

There are a few changes relating to context managers in this release.

A simple change has been made to the syntax of the with statement to allow multiple context managers to be specified. Although simple, it’s rather handy in certain use-cases. The best example of this I know is when you’re parsing data from one file into another. Previously I used nested with statements like this:

def remove_comments(in_path, out_path):
    with open(in_path, "r") as in_fd:
        with open(out_path, "w") as out_fd:
            for line in in_fd:
                stripped = line.split("#", 1)[0]
                out_fd.write(stripped.rstrip() + "\n")

There was a contextlib.nested() function which could be used to stack multiple context managers on one line, but it wasn’t the neatest and had some annoying quirks. Now you can simply append additional managers with a comma:

def remove_comments(in_path, out_path):
    with open(in_path, "r") as in_fd, open(out_path, "w") as out_fd:
        for line in in_fd:
            stripped = line.split("#", 1)[0]
            out_fd.write(stripped.rstrip() + "\n")

Beautiful. I’m in serious danger of breaking into a James Blunt song here, so let’s move on quickly — neither of us wants to risk that.

The use of the with statement has also been expanded a little more with gzip.GzipFile and bz2.BZ2File now supporting the context manager protocol. This change is more like addressing a historical omission, and frankly I’m surprised it’s taken this long.
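
As a minimal sketch, here’s a round trip through gzip.GzipFile using with statements. The filename is arbitrary, and I’m using tempfile.TemporaryDirectory (a later addition to the library) just to keep the example tidy.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt.gz")

    # The file is closed automatically when the block exits.
    with gzip.GzipFile(path, "wb") as out_fd:
        out_fd.write(b"Hello, world\n")

    with gzip.GzipFile(path, "rb") as in_fd:
        content = in_fd.read()
```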

Unit Testing

The unittest module got some enhancements in 3.1. The first of these is that it supports skipping tests or classes of tests based on arbitrary criteria, and it also allows tests to be flagged as an expected failure, which means it won’t count as a failure for the purposes of determining whether the test suite passed. Both of these handy features are implemented using decorators.

import sys
import unittest

class TestLumberjack(unittest.TestCase):

    @unittest.skipIf(sys.version_info.major < 3, "Felling not supported with 2.x")
    def test_cut_down_trees(self):
        ...  # Test body omitted.

    @unittest.skip("Lunch was moved out of scope for Q2")
    def test_eat_my_lunch(self):
        ...

    # There's some sort of intermittent constipation bug, so we'll mark
    # this as a known failure until we have time to investigate.
    # Note that expectedFailure is applied without parentheses.
    @unittest.expectedFailure
    def test_go_to_the_lavatory(self):
        ...

Next up, the useful assertRaises() can now be used as a context manager. This makes it a lot easier to check whether a block of code throws a particular expected exception, particularly if that block of code is a little convoluted.

class TestGenerators(unittest.TestCase):

    def test_stop_iteration(self):
        with self.assertRaises(StopIteration):
            next(i for i in ())

Finally, a set of additional assertions have been added whose primary value is typically the diagnostic detail they provide on failures without the programmer having to take any extra steps. Here are some examples:

  • assertSetEqual()
  • assertDictEqual(), assertDictContainsSubset()
  • assertListEqual(), assertTupleEqual(), assertSequenceEqual()
  • assertRaisesRegexp()
  • assertIsNone(), assertIsNotNone()
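
To see how the skip and expected failure decorators from earlier in this section affect a test run, here’s a small self-contained sketch. The test names and bodies are invented for illustration, and I’m collecting the outcome in a plain TestResult rather than using the usual command-line runner.

```python
import sys
import unittest

class TestExample(unittest.TestCase):

    @unittest.skipIf(sys.version_info < (3,), "requires Python 3")
    def test_runs_on_python3(self):
        self.assertEqual(1 + 1, 2)

    @unittest.skip("always skipped")
    def test_skipped(self):
        self.fail("this never runs")

    @unittest.expectedFailure
    def test_known_failure(self):
        self.assertEqual(1 + 1, 3)

# Run the tests and gather the results programmatically.
suite = unittest.TestLoader().loadTestsFromTestCase(TestExample)
result = unittest.TestResult()
suite.run(result)
```

The skipped test and the expected failure are each recorded in their own list on the result, and neither stops wasSuccessful() from returning True.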

Numeric Types

There is also a collection of small changes in assorted numeric types, including int, float and decimal.Decimal.

First up, int objects gained a bit_length() method to indicate the minimum number of bits required to store the binary representation of the number. Seems a little esoteric this one, but could be handy for serialisation code that’s picking from multiple different integer representations to use, or that sort of thing.

>>> (255).bit_length()
8
>>> (2**63).bit_length()
64
>>> (2**63-1).bit_length()
63
>>> (5**5**5).bit_length()
7257

Sticking with int the behaviour of round() has been updated. Previously this function would always return a float regardless of the type it was passed as an input. As of Python 3.1, however, if round() is passed an int then it returns an int instead.
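
A tiny sketch of this, with an arbitrary value rounded to the nearest hundred:

```python
# Rounding an int yields an int...
rounded_int = round(1123, -2)

# ...whereas rounding a float with a digit count still yields a float.
rounded_float = round(1123.0, -2)
```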

Moving on to float, the string representation created by repr() in certain cases has been made more intuitive by using an alternative algorithm by David Gay. It’s outside the scope of this article to go into the full details, as floating point is hard, but suffice to say that repr() of certain values of float should now seem a bit more intuitive.

One minor wrinkle here is that the change in representation is going to change what the interactive interpreter shows you by default, so could break any docstrings you have using doctest because the float values shown may now be shorter.
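
As a rough illustration of the difference, the sketch below compares the new repr() with a 17 significant digit format, which approximates the sort of output the older approach produced. The value 1.1 is just a convenient example.

```python
x = 1.1

# The new repr() picks the shortest string that round-trips to the same float.
short_form = repr(x)

# Formatting to 17 significant digits exposes the binary approximation.
long_form = "%.17g" % x

# Both strings convert back to exactly the same float.
round_trips = float(short_form) == x and float(long_form) == x
```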

Continuing the theme of non-integer numbers, a decimal.Decimal can now be constructed from a float using the new Decimal.from_float() class method. It’s worth noting that this can have some surprising results, because binary floating point cannot represent most decimal fractions exactly, but as long as you round off to a sensible number of significant digits then you shouldn’t run into too many surprises.
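
Here’s a minimal sketch of the sort of surprise I mean, using the Decimal.from_float() class method and a couple of arbitrary values. I’m using quantize() for the rounding, which is only one of several ways to do it.

```python
from decimal import Decimal

# 0.5 is exactly representable in binary, so the conversion is clean.
exact_half = Decimal.from_float(0.5)

# 1.1 is not, so the result is the exact value of the nearest binary float.
surprising = Decimal.from_float(1.1)

# Rounding to two decimal places recovers the value you probably wanted.
rounded = surprising.quantize(Decimal("0.01"))
```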

Pickling

The pickle module has had a couple of small changes. Firstly, it’s been updated for better interoperability with Python 2.x where in many cases the same objects exist but with different names. For example, __builtin__.set in Python 2.x exists as builtins.set in Python 3.x. This unfortunately breaks compatibility with Python 3.0, but only if protocol version 2 is used. If you can assume Python 3.x only then use protocol version 3, which works across Python 3.0 and all subsequent versions (but not Python 2.x).

In other pickle news, functools.partial objects can now be serialised with the pickle module. This is still subject to the standard limitation of pickle that functions are sent by fully qualified name only, and rely on the deserialisation code to have the same function available during deserialisation.
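
As a quick sketch, here’s a partial wrapping the built-in int() surviving a round trip through pickle. I’ve picked a built-in deliberately, since it’s guaranteed to be importable by name at the receiving end.

```python
import functools
import pickle

# A partial that parses binary strings.
parse_binary = functools.partial(int, base=2)

# Serialise and deserialise it; only the reference to int is stored,
# along with the bound arguments.
restored = pickle.loads(pickle.dumps(parse_binary))
value = restored("101")
```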

Other Changes

There are a few other small points which I didn’t think deserved their own section.

string.maketrans() Deprecated
The old string.maketrans() has been deprecated since, rather confusingly for a function in the string module, it required bytes or bytearray objects instead of str. To avoid this confusion, str, bytes and bytearray now all have their own maketrans() and translate() methods. This is a somewhat obscure function unless you’re doing lots of character substitutions in your code, but for people who use it this definitely feels more consistent with the way other methods on these classes are accessed.
collections.namedtuple rename Parameter Added
collections.namedtuple now offers a rename parameter which, if true, causes invalid field names to be replaced with numeric positional names such as _0 or _1. This is useful where you don’t have control of the names and they may not be valid Python identifiers, such as constructing a namedtuple from the result column names in an SQL query where one of the columns is called COUNT(*).
logging.NullHandler() Added
The NullHandler class was added to the logging module, which is a handler which always throws away its log messages. It’s useful for applications which don’t use logging themselves, but use libraries which are written with logging support. Unless you configure a handler, you get irritating “No handlers could be found…” warning messages. Libraries can also add a NullHandler themselves, to avoid applications having to worry about this. Indeed, as explained in the logging tutorial, this is the only handler that a library should add, since handling of log events should be left to the application to configure.
sys.version_info Now namedtuple
The sys.version_info structure is now a namedtuple, so its attributes can more helpfully be accessed by name.
IPv6 Support in nntplib and imaplib
IPv6 support added to nntplib and imaplib.
importlib Added
The importlib library was added, which is a portable pure Python implementation of the import statement. The hope is that this will add transparency to the import process.
Performance Improvements
  • The new io library was previously pure Python but has now been rewritten in C, for a 2-20x speed up.
  • The json module has a C extension to improve performance.
  • Enabling the --with-computed-gotos compile flag gives speedups of up to 20% on the bytecode evaluation loop, depending on the platform, which is always welcome.
  • Decoding of UTF-8, UTF-16 and Latin-1 is 2-4x faster.
  • int was previously stored in base 2**15, but on 64-bit platforms it now uses base 2**30, which significantly improves performance of integer arithmetic on them.
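
A few of the smaller changes in this section are easier to appreciate with a quick sketch. The translation tables, the Row tuple and the "mylibrary" logger name below are all invented for illustration.

```python
import collections
import logging
import sys

# maketrans() and translate() as methods on str and bytes.
table = str.maketrans("abc", "xyz")
translated = "aabbcc".translate(table)
byte_table = bytes.maketrans(b"abc", b"xyz")
byte_translated = b"cab".translate(byte_table)

# rename=True replaces invalid field names with positional ones.
Row = collections.namedtuple("Row", ["id", "COUNT(*)"], rename=True)
row = Row(7, 42)

# A library can attach a NullHandler to its own logger so that
# applications which don't configure logging see no warnings.
library_logger = logging.getLogger("mylibrary")
library_logger.addHandler(logging.NullHandler())

# sys.version_info fields can be accessed by name as well as by index.
same_major = sys.version_info.major == sys.version_info[0]
```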

Conclusions

It’s been particularly interesting for me going through these changes so long after they were first released, because I’m realising how much I missed. Many of the smaller changes aren’t particularly impactful — I can’t see int.bit_length() is likely to become my most-used function, for example — but things like multiple context managers using a single with statement, and the non-locale-dependent thousands separator are handy little details to have up your sleeve. The versatility of collections.Counter as a multiset also makes it well worth keeping it in mind for when you need it.

Overall another great batch of improvements, and it’s nice to see performance enhancements because Python 3.x as a whole was initially a bit of a jump down from Python 2.x from a performance standpoint. Also, I’m mildly surprised at the extent of the changes, as in my head 3.1 was a fairly minor release to tidy up a few small points. I’m now looking forward to going through the remainder of the 3.x releases to see what other gems I’ve been missing out on.


  1. You can tell I’m definitely not psychic because at the end of 2019 I didn’t pile all my savings into Zoom stocks. 

Photo by David Clode on Unsplash