Python 2to3: What’s New in 3.6 - Part 2, More New Features

28 Jul 2021 at 11:10PM in Software

In this series looking at features introduced by every version of Python 3, we continue our look at the features added in Python 3.6. This second article looks at some more of the new features added to the language in this release.

This is the 13th of the 14 articles that currently make up the “Python 2to3” series.


We continue our look at the significant new features added to Python in version 3.6, which we started in the previous article in this series. In this article we’ll look at the new secrets module as well as the filesystem path protocol, local time disambiguation and a new dict implementation.

secrets Module

Security is a tricky beast, and cryptography one of the trickiest parts of it. You find yourself having to suspend any notion of common sense and go through your code with a fine-toothed comb to see whether you’re doing something that’s unwittingly making your security a million times easier to crack. Even something as simple as generating randomness is fraught with peril.

You may have heard advice to stay well clear of using the random module when it comes to any kind of cryptographic application. This advice has some truth to it, as certainly the pseudorandom generator provided by the default random.Random class is not suitable — it’s simply too predictable for that purpose.

As an aside, if you’re using any of the module-level functions in random, you’re actually using bound methods of a hidden random.Random instance, as illustrated by this excerpt:

>>> import random
>>> random.getrandbits
<built-in method getrandbits of Random object at 0x7ff642867c28>
>>> random.getrandbits.__self__
<random.Random object at 0x7ff642867c28>
>>> random.getrandbits.__self__.randrange(1, 10)
6

On most platforms there’s the alternative random.SystemRandom which uses os.urandom() to provide cryptographically sound randomness, and provides all the same functions. This is solid, but nonetheless it appears it’s sufficiently well hidden that many developers still seem to make the mistake of using the less secure source for security-related randomness.
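As a minimal sketch of what using SystemRandom directly looks like (the six-digit code is purely an illustrative use of my own, not something the module provides):

```python
import random

# SystemRandom draws on os.urandom() instead of the default pseudorandom
# generator, but offers the same methods as random.Random.
sysrand = random.SystemRandom()

# Hypothetical use: a six-digit one-time code.
one_time_code = "".join(str(sysrand.randrange(10)) for _ in range(6))

# Seeding is accepted but ignored, so results are never reproducible.
sysrand.seed(1234)
roll = sysrand.randint(1, 6)
```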

In an effort to address this, PEP 506 has added the secrets module, to more visibly expose some of the SystemRandom functionality and also add some handy utility functions of its own.

The module first exposes three useful methods of SystemRandom:

>>> import secrets
>>> secrets.choice
<bound method Random.choice of <random.SystemRandom object at 0x7ff64201b028>>
>>> secrets.randbelow
<function randbelow at 0x7ff621554bb8>
>>> secrets.randbits
<bound method SystemRandom.getrandbits of <random.SystemRandom object at 0x7ff64201b028>>

As with random, choice() and randbits() are just bound methods of a hidden global SystemRandom instance. The randbelow() function seems like it’s doing something more, but in reality it just sanity-checks that the limit is positive and then calls a private method SystemRandom._randbelow() to do the actual work[1].

The three additional functions are tailored towards generating random tokens of a specified number of bytes, and these genuinely do add some functionality, albeit as fairly thin wrappers around os.urandom() (or SystemRandom.randbytes() in more recent releases). The first, token_bytes(), essentially does the same thing, returning a bytes object directly from that underlying call.

The only added functionality is that it has a default token size if you don’t specify one, which at time of release is 32 bytes. It’s probably sensible to use this, if your code doesn’t need to specify a length, because if the need arises for longer tokens in the future then it can be increased without any application code needing to change.

There are then two further wrappers which return the token in different forms: token_hex() just hex-encodes the result of token_bytes() and returns a str containing hex digits, and token_urlsafe() does a base64.urlsafe_b64encode() of the token_bytes() result, strips any padding characters and returns the result as a str.

>>> secrets.token_bytes(16)
b'\x19o\x14\xa4\xba\xbf\xb1\x1d\xa7\x93Z\x06i\xac\xe3\xfe'
>>> secrets.token_hex(16)
'bc2843bf0c83daa64b1918fb9ccd8ad7'
>>> secrets.token_urlsafe(16)
'wQhzZ7iCmA-7FB9_LvPESw'

That’s about it for the secrets module, the only other function provided being compare_digest() which is just an alias for hmac.compare_digest(). All in all it’s mostly just exposing functionality that’s already available, and what new functionality it does provide is just convenience wrappers. But if it helps even some people improve their security, I’m all for it.
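Putting those pieces together, a sketch of issuing and checking something like a password-reset token might look as follows. The function names issue_token() and token_matches() are hypothetical helpers of my own, and a real application would also need to worry about storage and expiry.

```python
import secrets


def issue_token():
    # No length given, so the module's default token size is used.
    return secrets.token_urlsafe()


def token_matches(supplied, expected):
    # compare_digest() is designed so that the time taken doesn't reveal
    # where the first differing character is. Encoding to bytes avoids the
    # TypeError raised for non-ASCII str arguments.
    return secrets.compare_digest(supplied.encode("utf-8"),
                                  expected.encode("utf-8"))


stored = issue_token()
print(token_matches(stored, stored))    # True
print(token_matches("guess", stored))   # False
```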

Path-Like Objects

Until pathlib was added, filesystem paths were almost invariably represented as either str or bytes objects. Unfortunately this has led to developers writing code which assumes these types, which means this code can fail when passed another path-like object such as pathlib.Path and the related pathlib classes.

In an effort to address this, PEP 519 adds a new protocol for objects representing filesystem paths. Prior to this, code was expected to just call str() on a parameter that represents a path: if it was already a str then it’s left unchanged, but if it’s a pathlib.Path or similar then it will be converted to str. The problem here is that lots of objects in Python have a __str__() method, so it’s not a particularly reliable way of detecting if some entirely different object was passed, potentially masking bugs. There’s also the issue of DirEntry objects, which also represent paths but where you have to access the path attribute.

The first change is the addition of the os.PathLike abstract base class for any object that represents a path, such as pathlib.Path. To implement this interface, objects must provide a __fspath__() method which returns either str or bytes representing the string form of the path.

The second change is the addition of the os.fspath() function[2] which will return str or bytes objects unchanged, or return the result of __fspath__() on the object if defined, or raise TypeError in any other case. This allows functions to continue to support str and bytes for backwards-compatibility and convenience, but also support any new path-like object, and still reliably raise an exception if an incorrect object is passed.

>>> import os
>>> import pathlib
>>>
>>> os.fspath("/one/two/three")
'/one/two/three'
>>> os.fspath(pathlib.Path("/one/two/three"))
'/one/two/three'
>>> os.fspath(["one", "two", "three"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected str, bytes or os.PathLike object, not list
>>>
>>> class MyPathLike(os.PathLike):
...     def __init__(self):
...         self.path_items = []
...     def __fspath__(self):
...         return os.path.join(os.sep, *self.path_items)
...     def append_path_item(self, path_item):
...         self.path_items.append(path_item)
...
>>> x = MyPathLike()
>>> x.append_path_item("foo")
>>> x.append_path_item("bar")
>>> os.fspath(x)
'/foo/bar'

Finally, the builtin open() as well as all the appropriate functions in os and os.path have been updated to accept any os.PathLike, and the os.DirEntry and various pathlib classes have been updated to implement it.
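As a sketch of how your own functions might take advantage of this, here’s a hypothetical count_lines() helper (my own name, not a library function) which accepts a str, a pathlib.Path or an os.DirEntry, and rejects anything which isn’t path-like.

```python
import os
import pathlib
import tempfile


def count_lines(path):
    """Count lines in a file given as str, bytes or any os.PathLike."""
    # os.fspath() raises TypeError for unrelated objects. Without it,
    # open(42) would treat the integer as a file descriptor.
    with open(os.fspath(path)) as handle:
        return sum(1 for _ in handle)


with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "example.txt"
    target.write_text("one\ntwo\nthree\n")

    from_str = count_lines(str(target))
    from_path = count_lines(target)
    # os.scandir() yields DirEntry objects, which are also path-like.
    with os.scandir(tmp) as entries:
        from_entries = [count_lines(entry) for entry in entries]

try:
    count_lines(42)
    rejected = False
except TypeError:
    rejected = True

print(from_str, from_path, from_entries, rejected)  # 3 3 [3] True
```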

This is one of those neat updates that makes it very easy for code to do the right thing, and hopefully means that in the future the major libraries will all support paths in a way which leaves them open for programmers to make use of the improved path manipulation functionality they now have.

Local Time Disambiguation

If you’ve ever had to write code to schedule something in local time, you’ll know that DST conversions are a source of constant annoyance. Generally when time jumps forward it’s not too bad — you just have to make sure you trigger anything that was in the hour or so you just skipped. When time jumps backwards, however, things are rather more painful because you get the same time twice.

Where possible, the best approach to this is just use UTC, but you don’t always have that luxury — sometimes those pesky users actually want things scheduled in their local time, for example.

This can be handled if you’re very careful to convert everywhere, but it’s fiddly. Luckily Python programmers got a little help in 3.6 with a new way to disambiguate times which are repeated due to the clocks going back. This takes the form of a fold attribute on datetime.datetime and datetime.time objects which indicates whether this particular local time has been repeated before: 0 for the first occurrence and 1 for the second[3].

The idea is that when time jumps forward it creates a gap, but when it jumps backward it creates a fold where you repeat some of the same times. These terms were originally used by Paul Eggert of UCLA who was reporting bugs in the libc support for time zone conversion back in 1994 — you can find more details on all this in PEP 495.

You can see this illustrated below, where the code runs through half-hourly intervals around the end of DST in London in October 2021. You’ll see that fold is 1 for the second repetition of the two times which occur twice.

>>> from datetime import datetime, timedelta, timezone
>>> import dateutil.tz
>>> London = dateutil.tz.gettz("Europe/London")
>>>
>>> base_dt = datetime(2021, 10, 30, 23, 30, tzinfo=timezone.utc)
>>> for i in range(8):
...     ut = base_dt + timedelta(0, i * 30 * 60)
...     lt = ut.astimezone(London)
...     print(f"UTC:{ut.time()} London:{lt.time()} {lt.tzname()} Fold:{lt.fold}")
...
UTC:23:30:00 London:00:30:00 BST Fold:0
UTC:00:00:00 London:01:00:00 BST Fold:0
UTC:00:30:00 London:01:30:00 BST Fold:0
UTC:01:00:00 London:01:00:00 GMT Fold:1
UTC:01:30:00 London:01:30:00 GMT Fold:1
UTC:02:00:00 London:02:00:00 GMT Fold:0
UTC:02:30:00 London:02:30:00 GMT Fold:0
UTC:03:00:00 London:03:00:00 GMT Fold:0

You’ll note that I’m using dateutil.tz there instead of pytz — that’s because the latter hasn’t implemented support for fold, and probably never will since it doesn’t work in a way that’s entirely compatible with Python’s timezone handling approach. For an excellent discussion of the details of this, check out this article by Paul Ganssle where he explains it all with much more clarity than I could manage in a brief coverage here.

So all in all this might seem a little obscure for most people, and perhaps it is. But when it allows you to write a scheduler which doesn’t trigger jobs twice if they’re scheduled within a DST fold, you’ll really appreciate it being there.
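To give a flavour of how a scheduler might use this, here’s a minimal sketch which detects an ambiguous local time by checking whether the UTC offset changes with fold, which is the approach PEP 495 suggests. To keep it self-contained I’ve used a toy tzinfo subclass, ToyLondon2021, which is entirely hypothetical and only knows about the single transition shown earlier. Real code would use a proper time zone library, and would need a separate check for gaps, where the offsets also differ.

```python
from datetime import datetime, timedelta, timezone, tzinfo


class ToyLondon2021(tzinfo):
    """Hypothetical zone which only knows about the October 2021 change."""

    FOLD_START = datetime(2021, 10, 31, 1, 0)  # first time that occurs twice
    FOLD_END = datetime(2021, 10, 31, 2, 0)    # first unambiguous GMT time

    def utcoffset(self, dt):
        wall = dt.replace(tzinfo=None)
        if wall < self.FOLD_START:
            return timedelta(hours=1)   # BST
        if wall >= self.FOLD_END:
            return timedelta(0)         # GMT
        # Inside the fold, the fold attribute picks the occurrence.
        return timedelta(0) if dt.fold else timedelta(hours=1)


def is_ambiguous(dt):
    """True if the UTC offset of this wall-clock time depends on fold."""
    return dt.replace(fold=0).utcoffset() != dt.replace(fold=1).utcoffset()


def utc_occurrences(dt):
    """Return each UTC instant at which the wall-clock time dt occurs."""
    folds = (0, 1) if is_ambiguous(dt) else (0,)
    return [dt.replace(fold=f).astimezone(timezone.utc) for f in folds]


london = ToyLondon2021()
job_time = datetime(2021, 10, 31, 1, 30, tzinfo=london)
for instant in utc_occurrences(job_time):
    print(instant.time())
# 00:30:00
# 01:30:00
```

A scheduler which wants to run a job only once could then simply pick the first of the returned instants.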

New dict Implementation

That old workhorse of so much Python code, the dict class, now has a more memory-efficient representation which uses 20-25% less memory than the Python 3.5 implementation. This is always great, partly for its own sake, and partly because more compact memory layouts can improve performance due to better locality of reference.

More interestingly, however, the new implementation has the effect of preserving the original insertion order of keys, much like collections.OrderedDict does[4]. This was actually more of a side-effect of some changes to improve the memory efficiency of the structure, and the documentation warns programmers to treat this as an implementation detail, but also leaves the door open for making it official behaviour in the future[5].

This change was first implemented in PyPy, as detailed in this blog article, and it’s been moved into CPython more or less unchanged. First of all let’s look at the original dict data structure. The structures shown in these diagrams are simplified from the actual ones to best illustrate the change.

old dict implementation

You can see here the types are C types, since that’s the language in which CPython is implemented, but hopefully it’s pretty clear even if you don’t know C. The main structure is just an integer containing the number of elements in the dict and an array of a structure dict_entry. Each slot in this array holds either a single entry or is empty, and each entry contains its hash value as well as the Python objects representing the key and value.

Those familiar with hash tables will recognise this as closed hashing (aka open addressing). When an object is inserted, its hash value is mapped to one of the items in the array. If that slot is empty, the object is inserted there. If it’s not empty, the key object is compared with the key to insert using standard Python equality, and if it’s the same then the object is treated as already in the dict. If it’s different, a new slot must be found to store the object, and this is done using pseudorandom probing to try new slots in a deterministic but not linear order.

This works fine, but the array needs to be kept fairly sparse: if it becomes too full, then almost every insert will involve many probes and this will harm performance. As a result, if it becomes more than ⅔ full then the array is reallocated to a larger size to keep it sparse. This solves the performance problem, but does increase the memory footprint, since each of those unused entries still contains the memory required for all three fields of dict_entry.
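If it helps to see that in code, here’s a toy Python model of the older layout. The SparseDict class is my own simplification: it doesn’t support deletion, and the probe sequence is only loosely modelled on the perturbation scheme CPython uses, so treat it as a sketch of the idea and not the real implementation.

```python
class SparseDict:
    """Toy model of the older layout: one sparse array of whole entries."""

    def __init__(self, size=8):
        self.slots = [None] * size   # size must stay a power of two
        self.used = 0

    def _probe(self, key_hash):
        # Deterministic but non-linear sequence of slots to try.
        mask = len(self.slots) - 1
        perturb = key_hash & 0xFFFFFFFFFFFFFFFF   # treat hash as unsigned
        i = perturb & mask
        while True:
            yield i
            perturb >>= 5
            i = (i * 5 + perturb + 1) & mask

    def _find(self, key):
        # Return the slot holding key, or the empty slot where it belongs.
        key_hash = hash(key)
        for i in self._probe(key_hash):
            entry = self.slots[i]
            if entry is None or (entry[0] == key_hash and entry[1] == key):
                return i, key_hash

    def __setitem__(self, key, value):
        i, key_hash = self._find(key)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (key_hash, key, value)
        if self.used * 3 > len(self.slots) * 2:   # more than two-thirds full
            self._grow()

    def __getitem__(self, key):
        i, _ = self._find(key)
        if self.slots[i] is None:
            raise KeyError(key)
        return self.slots[i][2]

    def _grow(self):
        # Double the array and reinsert every entry.
        entries = [e for e in self.slots if e is not None]
        self.slots = [None] * (len(self.slots) * 2)
        self.used = 0
        for _, key, value in entries:
            self[key] = value


d = SparseDict()
for n in range(20):
    d[f"key{n}"] = n
empty = sum(1 for slot in d.slots if slot is None)
print(len(d.slots), empty)  # 32 12
```

Those 12 empty slots are the wasted space: in the real structure each one is still the full size of an entry.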

So now let’s take a look at the new structure introduced in Python 3.6.

new dict implementation

The dict_entry structure is unchanged, but instead of being referenced in a sparse array, it’s stored in a standard linear array of items. Since newly inserted items are appended to this array, it’s always maintained in order of first insertion. Also, since this compact array is contiguous in memory, repeated lookups of multiple entries (e.g. when iterating) can take advantage of caches to improve performance. More to the point, it doesn’t have any empty items so it takes considerably less memory than the sparse array in the original implementation.

This isn’t really a hash table any more, however, so to maintain efficient lookup there is still a sparse array. Now, however, each entry contains only the index of the item in the compact array. Furthermore, the type of the array is only just big enough for the size of the indexes — an empty dict is created using 8-bit values for these offsets, and it’s only resized once the number of items in the dict hits 256. These changes mean that even maintaining the two separate arrays, the structure is overall significantly more memory efficient than the previous one.
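Here’s a toy Python model of the compact layout, to show how the two arrays cooperate. As before, the CompactDict class is a hypothetical simplification of my own: it has no deletion support and uses a plain list for the indexes where the real structure uses the smallest integer type that fits.

```python
class CompactDict:
    """Toy model of the compact layout: sparse indexes plus dense entries."""

    EMPTY = -1

    def __init__(self, size=8):
        self.indices = [self.EMPTY] * size   # sparse, size is a power of two
        self.entries = []                    # dense, in insertion order

    def _probe(self, key_hash):
        mask = len(self.indices) - 1
        perturb = key_hash & 0xFFFFFFFFFFFFFFFF   # treat hash as unsigned
        i = perturb & mask
        while True:
            yield i
            perturb >>= 5
            i = (i * 5 + perturb + 1) & mask

    def _find(self, key):
        # Return the index slot for key, or the empty slot where it belongs.
        key_hash = hash(key)
        for slot in self._probe(key_hash):
            index = self.indices[slot]
            if index == self.EMPTY:
                return slot, key_hash
            entry_hash, entry_key, _ = self.entries[index]
            if entry_hash == key_hash and entry_key == key:
                return slot, key_hash

    def __setitem__(self, key, value):
        slot, key_hash = self._find(key)
        index = self.indices[slot]
        if index == self.EMPTY:
            # New key: append to the dense array and record its position.
            self.indices[slot] = len(self.entries)
            self.entries.append((key_hash, key, value))
            if len(self.entries) * 3 > len(self.indices) * 2:
                self._grow()
        else:
            # Existing key: overwrite in place, keeping its original position.
            self.entries[index] = (key_hash, key, value)

    def __getitem__(self, key):
        slot, _ = self._find(key)
        index = self.indices[slot]
        if index == self.EMPTY:
            raise KeyError(key)
        return self.entries[index][2]

    def __iter__(self):
        return (key for _, key, _ in self.entries)

    def _grow(self):
        # Only the small index array is rebuilt; entries stay where they are.
        self.indices = [self.EMPTY] * (len(self.indices) * 2)
        for index, (key_hash, _, _) in enumerate(self.entries):
            for slot in self._probe(key_hash):
                if self.indices[slot] == self.EMPTY:
                    self.indices[slot] = index
                    break


d = CompactDict()
for name in ["pear", "apple", "fig"]:
    d[name] = len(name)
d["pear"] = 0
print(list(d))  # ['pear', 'apple', 'fig']
```

Iteration never touches the sparse array at all, which is where the ordering guarantee comes from.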

So now that the standard dict implementation maintains original insertion order, does that mean we need never use collections.OrderedDict again? Well actually no, there are still three important differences that you should be aware of:

OrderedDict has more methods
The new dict class lacks methods such as move_to_end() that OrderedDict provides. If you need to maintain ordering other than initial insertion, such as recording least-recently used items, then this is useful functionality.
Definition of equality
The new implementation of dict hasn’t altered its definition of equality: the order of insertion is ignored, and only the keys and values are compared. Two OrderedDict instances, on the other hand, will only compare equal if all of the keys and values are the same and the order of the keys is also the same.
OrderedDict is reversible
The new dict implementation still doesn’t support iterating through keys with reversed, whereas OrderedDict does.
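A quick illustration of the equality and method differences, using nothing beyond the standard library:

```python
from collections import OrderedDict

plain_a = {"one": 1, "two": 2}
plain_b = {"two": 2, "one": 1}
ordered_a = OrderedDict([("one", 1), ("two", 2)])
ordered_b = OrderedDict([("two", 2), ("one", 1)])

print(plain_a == plain_b)       # True: order is ignored
print(ordered_a == ordered_b)   # False: order matters
print(ordered_a == plain_b)     # True: comparing with a plain dict ignores order

ordered_a.move_to_end("one")
print(list(ordered_a))             # ['two', 'one']
print(list(reversed(ordered_a)))   # ['one', 'two']
```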

Class Attribute Definition Order

PEP 520 was accepted for inclusion in Python 3.6, which specifies how the definition order of attributes of a class would be preserved. However, the changes which have made it into Python 3.6 aren’t quite as specified in the PEP, so you have to be a little careful if you’re going to read up on it.

The purpose of the PEP was to preserve the order in which class attributes were defined in the source code, and make this available in the code. The process of creating the __dict__ attribute of a class involves setting up a mapping to act as a namespace in which the assignments are made[6], then copying that into a new dict which is available as __dict__. The changes in the PEP involved using an OrderedDict for this initial namespace, and then preserving the order of the names registered in a new tuple called __definition_order__.

This was all very well, but late in the release cycle the Python developers became aware that the new dict implementation, which preserved insertion order, was also going to be included in the same release. This made the whole thing seem rather redundant, since the __dict__ attribute would be ordered anyway.

So in the end, __definition_order__ was dropped, as per this message from Guido, and the net result of all this is simply that you can rely on class attributes appearing in the __dict__ of a class in the same order in which they were defined in the code.
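A small sketch of what that means in practice, using a made-up Settings class and filtering out the double-underscore names that Python adds itself:

```python
class Settings:
    host = "localhost"
    port = 8080

    def connect(self):
        pass

    debug = False


# Names appear in the order they were defined in the class body.
defined = [name for name in Settings.__dict__ if not name.startswith("__")]
print(defined)  # ['host', 'port', 'connect', 'debug']
```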

Keyword Argument Order

In a potentially more useful change than preserving the definition order of class attributes, PEP 468 describes a change where keyword arguments collected in a function using the **kwargs mechanism now preserve the order in which they were passed.

The original suggestion for this was to use OrderedDict for kwargs, but guess what — since dict is now ordered, this change wasn’t actually required. The only difference is a guarantee that the order they’re inserted matches the order they occur in the function call.

>>> def func(**kwargs):
...     print(f"Args: {kwargs!r}")
...
>>> func(one=1, two=2, three=3)
Args: {'one': 1, 'two': 2, 'three': 3}
>>> func(two=2, three=3, one=1)
Args: {'two': 2, 'three': 3, 'one': 1}

Debugging Memory Allocators

Here’s a feature we all hope we won’t actually have much need for, but if we do we’ll be very glad it exists: you can now install debug hooks on Python’s memory allocators by defining the PYTHONMALLOC environment variable.

I suspect this will mostly be of use to those writing Python extensions in languages like C, where you interact more directly with the allocators than you do in Python code. Still, it’s worth bearing in mind, since it could also be helpful in tracking down errors in other people’s extensions too, mostly to rule out issues in your own code.

Defining this environment variable as PYTHONMALLOC=debug has the following effects:

  • Freshly allocated memory is initialised with byte 0xCB and freed memory is filled with 0xDB.
  • Violations of the memory allocator APIs are detected, such as trying to use PyObject_Free() on a block allocated with PyMem_Malloc().
  • Detects writes outside the valid portion of a buffer (underruns and overflows).
  • Verifies the GIL is held when calling the PyMem_X() and PyObject_X() allocator families.

It’s also possible to define it as PYTHONMALLOC=malloc which flips Python from using its own allocators to using the standard libc malloc() to allocate memory. This is useful when you’re running Python under tools like Valgrind or Electric Fence.
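The variable has to be set before the interpreter starts, so from within Python the simplest way to try it is to launch a child interpreter. This is just a sketch, with run_with_allocator() being a hypothetical helper of my own, and a well-behaved script will show no visible difference under the debug hooks.

```python
import os
import subprocess
import sys


def run_with_allocator(mode, code):
    """Run code in a child interpreter with PYTHONMALLOC set to mode."""
    env = dict(os.environ, PYTHONMALLOC=mode)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


# The same helper could be called with "malloc" when running under Valgrind.
debug_output = run_with_allocator("debug", "print(sum(range(10)))")
print(debug_output.strip())  # 45
```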

Other Changes

A handful of small changes which didn’t warrant their own section.

Special Methods as None
To explicitly show a lack of support for a special method, such as __iter__(), classes can now set that attribute to None. This prevents implicit fallbacks to other options, and blocks inherited behaviour.
Truncated Repeated Tracebacks
Where the same line occurs in a traceback multiple times consecutively, it’s now truncated after a few instances with the message [Previous line repeated X more times].
ModuleNotFoundError Added
There’s a new exception ModuleNotFoundError when, uh, a module cannot be found. It’s a subclass of the existing ImportError, so it shouldn’t break any existing code as long as it’s sensibly written.
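Two of these are easy to demonstrate in a few lines. In this sketch the Unordered class and the module name are both invented purely for illustration.

```python
import importlib


class Unordered:
    # Defining __getitem__() would normally let iter() fall back to the
    # old sequence protocol...
    def __getitem__(self, index):
        return index

    # ...but setting __iter__ to None blocks that fallback.
    __iter__ = None


try:
    iter(Unordered())
    iteration_blocked = False
except TypeError:
    iteration_blocked = True

# Existing handlers for ImportError still catch the new exception.
try:
    importlib.import_module("hypothetical_module_that_should_not_exist")
    caught = None
except ImportError as exc:
    caught = type(exc).__name__

print(iteration_blocked, caught)  # True ModuleNotFoundError
```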

Conclusions

So that concludes all the changes in Python 3.6 save the updates to existing modules which, as usual, I’ll cover in the next article.

The changes in this article haven’t been the most momentous, but once again there’s some useful stuff in there. I like the support for path-like objects, as I feel that separating paths from other strings is likely to nix quite a few annoying bugs that occasionally crop up with incorrect parameters. It’s also going to allow the type-hinting to be more usefully specific.

The new dict implementation is also a great little change: both more efficient and order-preserving? Monsieur, with this new implementation you’re really spoiling us![7]

The rest I’ll file under “sure, might be useful one day!” and continue my inexorable march towards Python 3.9. I’m still hoping to make it that far before 3.10 is released in October, but I’ll admit things are getting a little tight. That’s partly because I keep wasting time writing overly long “conclusions” sections, padded out with futile attempts at self-referential humour, so I’ll swiftly draw this article to a close before that happens.


  1. I’ve never known about SystemRandom._randbelow(), being a private method, but I can see why the wrapper does the validation — if you call the underlying method with a negative value, it appears to block forever, or at least a very long time. 

  2. There’s also os.fsencode() which will always return bytes, converting as necessary, and os.fsdecode() which will always return str.

  3. It’s worth noting that there are other reasons for clocks to go back than DST adjustment, but they’re very rare. 

  4. There are differences in behaviour between the new dict and OrderedDict, however, which I’ll cover in a moment. 

  5. Spoiler alert: this was declared official and permanent behaviour as of Python 3.7. 

  6. This mapping is returned by type.__prepare__(), which can be overridden in metaclasses. If you want to do that for any reason, you should be aware of the order-preserving quality now expected of such a container. 

  7. Although the comparison on quality here is extremely unfair, as you can’t pick up a new dict implementation for less than a tenner at the all-night garage. Disclaimer: this is just a weak attempt at humour based on British pop culture references, please do feel very free to disregard. 

Photo by David Clode on Unsplash