☑ Python 2to3: What’s New in 3.3 - Part 1

6 Mar 2021 at 11:11PM in Software

The fourth Python 3.x release brought another slew of great new features. So many, in fact, that I’ve split this release into two articles, of which this is the first. Highlights in this part include yield from expressions, mocking support in unittest and virtualenv support in the standard library.


The next release in this sequence of articles is Python 3.3, which was released just over 19 months after version 3.2. This one was another packed release and it contained so many features I decided to split this into two articles. In this first one we’ll be covering yield from expressions which allow generators to delegate for each other, support for mocking built in to the standard library, the venv module, and a host of diagnostic improvements.

Builtin Virtual Environments

If you’ve been using Python for a decent length of time, you’re probably familiar with the virtualenv tool written by prolific Python contributor Ian Bicking [1] around thirteen years ago. It’s the sort of utility that instantly makes you wonder how you ever managed without it, and it’s become a really key development [2] tool for many Python developers.

As an acknowledgement of its importance, the Python team pulled a subset of its functionality into the standard Python library as the new venv module, and exposed a command-line interface with the pyvenv script. This is fully detailed in PEP 405.
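As a rough sketch of the programmatic side, the venv module also exposes a create() function. The snippet below builds an environment in a temporary directory and checks for the pyvenv.cfg marker file described in PEP 405. The helper name make_env and the temporary location are purely illustrative, and pip is left out to keep things quick.

```python
import os
import tempfile
import venv


def make_env(env_dir):
    """Create a bare virtual environment in env_dir (hypothetical helper)."""
    venv.create(env_dir)
    # PEP 405 environments are marked by a pyvenv.cfg file at their root.
    return os.path.exists(os.path.join(env_dir, "pyvenv.cfg"))


with tempfile.TemporaryDirectory() as tmp:
    created = make_env(os.path.join(tmp, "demo-env"))

print("Environment created:", created)
```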

On the face of it, this might not seem to be all that important, since virtualenv already exists and does a jolly good job all round. However, I think there are a whole host of benefits which make this strategically important. First and foremost, since it’s part of the standard distribution, there’s little chance that the core Python developers will make some change that renders it incompatible on any supported platform. It can also probably benefit from internal implementation details of Python on which an external project couldn’t safely rely, which may enable greater performance and/or reliability.

Secondly, the fact that it’s installed by default means that project maintainers have a baseline option they can count on, for installation or setup scripts, or just for documentation. This will no doubt cut down on support queries from inexperienced users who wonder why this virtualenv command isn’t working.

Thirdly, this acts as a defense against the forking of the project, which is always a background concern with open source. It’s not uncommon for one popular project to be forked and taken in two divergent directions, and then suddenly project maintainers and users alike need to worry about which one they’re going with, the support efforts of communities are split, and all sorts of other annoyances. Having support in the standard library means there’s an option that can be expected to work in all cases.

In any case, regardless of whether you feel this is an important feature or just a minor tweak, it’s at least handy to have venv always available on any platform where Python is installed.

As an aside, if you’re curious about how virtualenv works then Carl Meyer presented an interesting talk on the subject, of which you can find the video and slides online.

Generator Delegation

I actually already discussed this topic fairly well in my first article in my series on coroutines in Python a few years ago. But to save you the trouble of reading all that, or the gory details in PEP 380, I’ll briefly cover it here.

This is a fairly straightforward enhancement for generators to yield control to each other, which is performed using the new yield from expression. It’s perhaps best explained with a simple example:

>>> def fun2():
...     yield from range(10)
...     yield from range(30, 20, -2)
...
>>> list(fun2())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 28, 26, 24, 22]

On the face of it this is just a simple shorthand for the loop for i in iter: yield i. However, there’s rather more to it when you consider the coroutine-like features that generators have where you can pass values into them, since these values also need to be routed directly to the delegate generator as well as the yielded values being routed back out.
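To make the difference concrete, here is a minimal sketch comparing yield from with the naive loop when values are passed in via send(). The generator names are invented for illustration.

```python
def accumulator():
    """Yield a running total of the values passed in via send()."""
    total = 0
    while True:
        value = yield total
        total += value


def delegating():
    # Values passed to send() are forwarded to accumulator() automatically.
    yield from accumulator()


def manual():
    # The naive loop only routes yielded values outwards; anything passed
    # to send() is dropped, so the inner generator receives None instead.
    for item in accumulator():
        yield item


gen = delegating()
results = [next(gen), gen.send(5), gen.send(10)]
print(results)

naive = manual()
next(naive)
try:
    naive.send(5)
except TypeError as exc:
    naive_error = exc
    print("Naive loop failed:", exc)
```

The delegating version accumulates the sent values as expected, whereas the manual loop fails as soon as the inner generator tries to add None to its total.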

There’s also an enhancement to generator return values. Previously the use of return within a generator was simply a way to terminate the generator, raising StopIteration, and it was a syntax error to provide an argument to the return statement. As of Python 3.3, however, this has been relaxed and a value is permitted. The value is returned to the caller by attaching it to the StopIteration exception, but where yield from is used this becomes the value to which the yield from expression evaluates.
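A small sketch of both routes by which a generator’s return value can be observed, using made-up generator names:

```python
def inner():
    yield 1
    yield 2
    return "inner result"


def outer():
    # The value returned by inner() becomes the value of the expression.
    result = yield from inner()
    yield "outer saw: " + result


delegated = list(outer())
print(delegated)

# Without yield from, the value is carried on the StopIteration exception.
gen = inner()
next(gen)
next(gen)
try:
    next(gen)
except StopIteration as exc:
    direct_value = exc.value
print(direct_value)
```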

This may seem a bit abstract and hard to grasp, so I’ve included an example of using these features for parsing HTTP chunk-encoded bodies. This is a format used for HTTP responses if the sender doesn’t know the size of the response up front, where the data is split into chunks of a known size and the length of a chunk is sent first followed by the data. This means the sender can keep transmitting data until it’s exhausted, and the reader can be processing it in parallel. The end of the data is indicated by an empty chunk.

This sort of message-based interpretation of data from a byte stream is always a little fiddly. It’s most efficient to read in large chunks from the socket, and in the case of a chunk header you don’t know how many bytes it’s going to be anyway, since the length is a variable number of digits. As a result, by the time you’ve read the data you need, the chances are your buffer already contains some of the next piece of data. If you want to structure your code well and split parsing the various pieces up into multiple functions, as the single responsibility principle suggests, then this means you’ve always got this odd bit of “overflow” data as the initial set to parse before reading more from the data source.

There’s also the aspect that it’s nice to decouple the parsing from the data source. For example, although you’d expect a HTTP response to generally come in from a socket object, there’ll always be someone who already has it in a string form and still wants to parse it — so why force them to jump through some hoops making their string look like a file object again, when you could just structure your code a little more elegantly to decouple the parsing and I/O?

For all of the above reasons, I think that generators make a fairly elegant solution to this issue. Take a look at the code below and then I’ll explain why it works and why I think this is potentially a useful approach.

def content_length_decoder(length, data=b""):
    """Can be used directly by requests with Content-Length header."""

    while len(data) < length:
        length -= len(data)
        data = yield data
    new_input = yield data[:length]
    return data[length:] + new_input

def chunked_decoder(data=b""):
    """Decodes HTTP bodies with chunked encoding."""

    while True:
        crlf_index = data.find(b"\r\n")
        if crlf_index < 0:
            # Loop until we have <len>CRLF chunk header.
            data += yield b""
            continue
        chunk_len = int(data[:crlf_index], 16)

        if chunk_len == 0:
            # Zero length chunk terminates body.
            return data[crlf_index+2:]

        chunk = content_length_decoder(chunk_len, data[crlf_index+2:])
        data = yield from chunk

        # Strip off trailing CRLF from end of chunk.
        while len(data) < 2:
            data += yield b""
        data = data[2:]

# This is an example of a chunk-encoded response, with the headers
# already stripped off.
body_pieces = (b"C\r\nStrange",
               b" wome\r\n1B\r\nn lying in",
               b" ponds, distribut\r\n13\r",
               b"\ning swords, is no b\r\n",
               b"20\r\nasis for a system of ",
               b"government!\r\n0\r\n\r\n")

decoder = chunked_decoder()
document = bytearray(next(decoder))
try:
    for input_piece in body_pieces:
        document += decoder.send(input_piece)
    document += decoder.send(b"")
except StopIteration as exc:
    # Generally expect this to be the final terminating CRLF, but HTTP
    # standard allows for "trailing headers" here.
    print("Trailing headers: " + repr(exc.value))
print("Document: " + repr(document))

The general idea here is that each generator parses data which is passed to it via its send() method. It processes input until its section is done, and then it returns control to the caller. Ultimately decoded data is yielded from the generators, and each one returns any unparsed input data via its StopIteration exception.

In the example above you can see how this allows content_length_decoder() to be factored out from chunked_decoder() and used to decode each chunk. This refactoring would allow a more complete implementation to reuse this same generator to decode bodies which have a Content-Length header instead of being sent in chunked encoding. Without yield from this delegation wouldn’t be possible unless orchestrated by the top-level code outside of the generators, and that breaks the abstraction.

This is just one example of using generators in this fashion which sprang to mind, and I’m sure there are better ones, but hopefully it illustrates some of the potential. Of course, there are more developments on coroutines in future versions of Python 3 which I’ll be looking at later in this series, or if you can’t wait then you can take a read through my earlier series of articles specifically on the topic of coroutines.

Unit Testing

The major change in unit testing in Python 3.3 is that the mocking library has been merged into the standard library as unittest.mock. A full overview of this library is way beyond the scope of this article, so I’ll briefly touch on the highlights with some simple examples.

The core classes are Mock and MagicMock, where MagicMock is a variation which has some additional behaviours around Python’s magic methods [4]. These classes will accept requests for any attribute or method call, and create a mock object to track accesses to them. Afterwards, your unit test can make assertions about which methods were called by the code under test, including which parameters were passed to them.

One aspect that’s perhaps not immediately obvious is that these two classes can represent more or less any object, such as functions or classes. For example, if you create a Mock instance which represents a class and then access a method on it, a child Mock object represents that method. This is possible in Python since everything comes down to attribute access at the end of the day; it just happens that calling a method queries an attribute __call__ on the object. Python’s duck-typing approach means that it doesn’t care whether it’s a genuine function that’s being called, or an object which implements __call__ such as Mock.

Here’s a short snippet which shows that without any configuration, a Mock object can be used to track calls to methods:

>>> from unittest import mock
>>> m = mock.Mock()
>>> m.any_method()
<Mock name='mock.any_method()' id='4352388752'>
>>> m.mock_calls
[call.any_method()]
>>> m.another_method(123, "hello")
<Mock name='mock.another_method()' id='4352401552'>
>>> m.mock_calls
[call.any_method(), call.another_method(123, 'hello')]

Here I’m using the mock_calls attribute, which tracks the calls made, but there are also a number of assert_X() methods which are probably more useful in the context of a unit test. They work in a very similar way to the existing assertions in unittest.
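As a brief illustration of those assertion helpers, the sketch below uses a few that I’m confident exist on Mock objects. The method names connect and send are invented for the example.

```python
from unittest import mock

m = mock.Mock()
m.connect("example.com", port=80)
m.send(b"ping")
m.send(b"pong")

# Each of these passes silently, or raises AssertionError on a mismatch.
m.connect.assert_called_once_with("example.com", port=80)
m.send.assert_called_with(b"pong")   # Checks only the most recent call.
m.send.assert_any_call(b"ping")      # Checks any call, in any position.
m.assert_has_calls([
    mock.call.connect("example.com", port=80),
    mock.call.send(b"ping"),
])

# send() was called twice, so asserting a single call fails.
try:
    m.send.assert_called_once_with(b"ping")
except AssertionError as exc:
    failure = str(exc)
    print("Assertion failed as expected:", failure)
```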

This is great for methods which return nothing and have no side-effects, but what about implementing those behaviours? Well, that’s pretty straightforward once you understand the basic structure. Let’s say you have a class and you want to add a method with a side-effect: you just create a new Mock object and assign it, under the name of the method, as an attribute of the mock that’s representing your object instance. Then you create some function which implements whatever side-effects you require, and you assign that to the special side_effect attribute of the Mock representing your method. And then you’re done:

>>> m = mock.Mock()
>>> m.mocked_method = mock.Mock()
>>> def mocked_method_side_effect(arg):
...     print("Called with " + repr(arg))
...     return arg * 2
...
>>> m.mocked_method.side_effect = mocked_method_side_effect
>>> m.mocked_method(123)
Called with 123
246

Finally, as an illustration of the MagicMock class, you can see from the snippet below that the standard Mock object refuses to auto-create magic methods, but MagicMock supports them out of the box and tracks calls to them like any other method. You can add side-effects and return values to these in the same way as any normal methods.

>>> m = mock.Mock()
>>> len(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'Mock' has no len()
>>> mm = mock.MagicMock()
>>> len(mm)
0
>>> mm[123]
<MagicMock name='mock.__getitem__()' id='4352367376'>
>>> mm.mock_calls
[call.__len__(), call.__getitem__(123)]
>>> mm.__len__.mock_calls
[call()]
>>> mm.__getitem__.mock_calls
[call(123)]

That covers the basics of creating mocks, but how about injecting them into your code under test? Well, of course sometimes you can do that yourself by passing in a mock object directly. But often you’ll need to change one of the dependencies of the code. To do this, you can use mock.patch as a decorator around your test methods to overwrite one or more dependencies with mocks. In the example below, the time.time() function is replaced by a MagicMock instance, and the return_value attribute is used to control the time reported to the code under test.

import random
import time
from unittest import mock

class StickyRandom:
    def __init__(self):
        self.last_time = 0
    def get_value(self):
        if time.time() - self.last_time > 60:
            self.last_value = random.random()
            self.last_time = time.time()
        return self.last_value

@mock.patch("time.time")
def test_sticky_random(time_mock):
    instance = StickyRandom()
    time_mock.return_value = 1000
    value1 = instance.get_value()
    time_mock.return_value = 1030
    value2 = instance.get_value()
    assert value1 == value2
    time_mock.return_value = 1090
    value3 = instance.get_value()
    assert value3 != value1

test_sticky_random()
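The same patching can also be scoped to a block of code by using mock.patch as a context manager instead of a decorator. This is a minimal sketch, with the fixed value of 1000 chosen arbitrarily:

```python
import time
from unittest import mock

# time.time is replaced only for the duration of the with block.
with mock.patch("time.time", return_value=1000) as time_mock:
    inside = time.time()

# On leaving the block the real function is restored.
outside = time.time()

print(inside, time_mock.call_count, outside != 1000)
```

This form is handy when only part of a test needs the dependency replaced.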

So that’s it for my whirlwind tour of mocking. There’s a lot more to it than I’ve covered, of course, so do take the time to read through the full documentation.

Diagnostics Changes

There are a few changes which are helpful for exception handling and introspection.

Nicer OS Exceptions

The situation around catching errors in Operating System operations has always been a bit of a mess with too many exceptions covering what are very similar operations at their heart. This can cause all sorts of annoying bugs in error handling if you try to catch the wrong exception.

For example, if you fail to os.remove() a file you get an OSError but if you fail to open() it you get an IOError. So that’s two exceptions for I/O operations right there, but if you happen to be using sockets then you need to also worry about socket.error. If you’re using select you might get select.error, but equally you might get any of the above as well.

The upshot of all this is that for any block of code that does a bunch of I/O you end up having to either catch Exception, which can hide other bugs, or catch all of the above individually.

Thankfully in Python 3.3 this situation has been resolved since these have all been collapsed into OSError as per PEP 3151. The full list that’s been rolled into this is:

  • OSError
  • IOError
  • EnvironmentError
  • WindowsError
  • mmap.error
  • socket.error
  • select.error

Never fear for your existing code, however, because the old names have all been maintained as aliases for OSError.
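A quick way to convince yourself of the aliasing is to compare the names directly. This sketch checks only the aliases available on every platform (WindowsError exists only on Windows, and mmap.error is left out for brevity), and the missing file path is just a throwaway location.

```python
import os
import select
import socket
import tempfile

aliases = {
    "IOError": IOError,
    "EnvironmentError": EnvironmentError,
    "socket.error": socket.error,
    "select.error": select.error,
}
# Every one of the old names is now the very same class object.
all_aliased = all(exc is OSError for exc in aliases.values())
print("All aliases of OSError:", all_aliased)

# Old-style handlers therefore keep working unchanged.
with tempfile.TemporaryDirectory() as tmp:
    try:
        os.remove(os.path.join(tmp, "no-such-file"))
    except IOError as exc:
        caught = exc
print(type(caught).__name__)
```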

As well as this, however, there’s another change that’s even handier. Often you need to only catch some subset of errors and allow others to pass on as true error conditions. A common example of this is where you’re doing non-blocking operations, or you’ve specified some sort of timeout, and you want to ignore those cases but still catch other errors. In these cases, you often find yourself branching on errno like this:

import errno
import socket

def reliable_send(data, sock):
    while data:
        try:
            sent = sock.send(data)
            data = data[sent:]
        except socket.error as exc:
            if exc.errno == errno.EINTR:
                continue
            else:
                raise

It’s not terrible, but it breaks the usual idiom of each error being its own exception, and makes things just that bit harder to read.

Python 3.3 to the rescue! New exception types have been added which are derivations of OSError and correspond to the more common of these error cases, so they can be caught more gracefully. The new exceptions and the equivalent errno codes are:

New Exception               Errno code(s)
BlockingIOError             EAGAIN, EALREADY, EWOULDBLOCK, EINPROGRESS
ChildProcessError           ECHILD
FileExistsError             EEXIST
FileNotFoundError           ENOENT
InterruptedError            EINTR
IsADirectoryError           EISDIR
NotADirectoryError          ENOTDIR
PermissionError             EACCES, EPERM
ProcessLookupError          ESRCH
TimeoutError                ETIMEDOUT
ConnectionError             (base class for the four exceptions below)
  BrokenPipeError           EPIPE, ESHUTDOWN
  ConnectionAbortedError    ECONNABORTED
  ConnectionRefusedError    ECONNREFUSED
  ConnectionResetError      ECONNRESET

The BlockingIOError exception also has a handy characters_written attribute, when using buffered I/O classes. This indicates how many characters were written before the filehandle became blocked.
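For comparison with the errno-based version of reliable_send() earlier, here’s a sketch of how it might look with the new exceptions. FlakySocket is a hypothetical stand-in for a real socket so that the example can run on its own. It also relies on the fact that constructing OSError with a recognised errno yields the matching subclass.

```python
import errno


class FlakySocket:
    """Hypothetical socket stand-in, interrupted on its first send()."""

    def __init__(self):
        self.sent = bytearray()
        self.interrupted = False

    def send(self, data):
        if not self.interrupted:
            self.interrupted = True
            # OSError maps EINTR to an InterruptedError instance.
            raise OSError(errno.EINTR, "simulated interrupt")
        # Accept at most four bytes per call, like a short write.
        chunk = data[:4]
        self.sent += chunk
        return len(chunk)


def reliable_send(data, sock):
    while data:
        try:
            sent = sock.send(data)
            data = data[sent:]
        except InterruptedError:
            # Replaces the explicit exc.errno == errno.EINTR check.
            continue


sock = FlakySocket()
reliable_send(b"hello world", sock)
print(bytes(sock.sent))
```

Any other OSError still propagates to the caller, just as the bare raise did in the original.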

To finish off this section, here’s a small example of how this might make code more readable. Take this code to handle a set of different errors which can occur when opening and attempting to read a particular filename:

import errno

try:
    # ...
except (IOError, OSError) as exc:
    if exc.errno == errno.EISDIR:
        print("Can't open directories")
    elif exc.errno in (errno.EPERM, errno.EACCES):
        print("Permission error")
    else:
        print("Unknown error")
except UnicodeDecodeError:
    print("Unicode error")
except Exception:
    print("Unknown error")

Particularly unpleasant here is the code duplication between handling unmatched errno codes and random other exceptions — although that’s just the duplication of a print() in this example, in reality that could become significant code duplication. With the new exceptions introduced in Python 3.3, however, this is all significantly cleaner:

try:
    # ...
except IsADirectoryError:
    print("Can't open directories")
except PermissionError:
    print("Permission error")
except UnicodeDecodeError:
    print("Unicode error")
except Exception:
    print("Unknown error")

Suppressing Exception Chaining

As we covered in the first post in this series, exceptions in Python 3 can be chained. When they are chained, the default traceback is updated to show this context, and earlier exceptions can be recovered from attributes of the latest.

You might also recall that it’s possible to explicitly chain exceptions with the syntax raise NewException() from exc. This sets the __cause__ attribute of the exception, as opposed to the __context__ attribute which records the original exception being handled if this one was raised within an existing exception handling block.

Well, Python 3.3 adds a new variant to this which can be used to suppress the display of any exceptions from __context__, which is raise NewException() from None. You can see an example of this behaviour below, which you can compare to the same example in the first post:

>>> try:
...     raise Exception("one")
... except Exception as exc1:
...     try:
...         raise Exception("two")
...     except Exception as exc2:
...         raise Exception("three") from None
...
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
Exception: three

The mechanics of how this is implemented could be a little confusing because they were changed after the feature was first implemented. The original PEP 409 specified that the default value of __cause__ should be Ellipsis, which was a pretty arbitrary choice of a singleton which wasn’t an exception, so it couldn’t be confused with a real cause; and wasn’t None, so later code could detect if it had been explicitly set to None via the raise Exception() from None idiom.

It was later decided that this was overloading the purpose of __cause__ in an inelegant fashion, however, so PEP 415 was implemented which made no change to the language features introduced by PEP 409, but changed the implementation. The rather hacky use of Ellipsis was removed and a new __suppress_context__ attribute was added. The semantics are that whenever __cause__ is set (typically with raise X from Y), __suppress_context__ is flipped to true. This applies when you set __cause__ to another exception, in which case the presumption is that it’s more useful to show than __context__ since it’s by explicit programmer choice; or using the raise X from None idiom, which is just the language syntax for setting __suppress_context__ without changing __cause__. Note that regardless of the value of __suppress_context__, the contents of the __context__ attribute are still available, and any code you write in your own exception handler is, of course, not obliged to respect __suppress_context__.
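The attributes involved can be inspected directly, as in this small sketch. The function name fail and the exception messages are arbitrary.

```python
def fail():
    try:
        raise KeyError("original")
    except KeyError:
        raise ValueError("replacement") from None


try:
    fail()
except ValueError as exc:
    caught = exc

# from None leaves __cause__ unset but flags the context as suppressed.
print(caught.__cause__)
print(caught.__suppress_context__)
# The original exception is still recorded and available to your code.
print(repr(caught.__context__))
```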

I must admit, I’m struggling to think of cases where the detail of that change would make a big difference to code you write. However, I’ve learned over the years that exception handling is one of those areas of the code you tend to test less thoroughly, and those areas are exactly where it’s helpful to have a knowledge of the details since it’s that much more likely you’ll find bugs here by code inspection rather than testing.

Introspection Improvements

Since time immemorial functions and classes have had a __name__ attribute. Well, it now has a little baby sibling, the __qualname__ attribute (PEP 3155) which indicates the full “path” of definition of this object, including any containing namespaces. The string representation has also been updated to use this new, longer, specification. The semantics are mostly fairly self-explanatory, I think, so probably best illustrated with an example:

>>> class One:
...     class Two:
...         def method(self):
...             def inner():
...                 pass
...             return inner
...
>>> One.__name__, One.__qualname__
('One', 'One')
>>> One.Two.__name__, One.Two.__qualname__
('Two', 'One.Two')
>>> One.Two.method.__name__, One.Two.method.__qualname__
('method', 'One.Two.method')
>>> inner = One.Two().method()
>>> inner.__name__, inner.__qualname__
('inner', 'One.Two.method.<locals>.inner')
>>> str(inner)
'<function One.Two.method.<locals>.inner at 0x10467b170>'

Also, there’s a new inspect.signature() function for introspection of callables (PEP 362). This returns an inspect.Signature instance which references other classes such as inspect.Parameter and allows the signature of callables to be easily introspected in code. Again, an example is probably most helpful here to give you just a flavour of what’s exposed:

>>> import inspect
>>> def myfunction(one: int, two: str = "hello", *args: str, keyword: int = None):
...     print(one, two, args, keyword)
...
>>> myfunction(123, "monty", "python", "circus")
123 monty ('python', 'circus') None
>>> inspect.signature(myfunction)
<Signature (one: int, two: str = 'hello', *args: str, keyword: int = None)>
>>> inspect.signature(myfunction).parameters["keyword"]
<Parameter "keyword: int = None">
>>> inspect.signature(myfunction).parameters["keyword"].annotation
<class 'int'>
>>> repr(inspect.signature(myfunction).parameters["keyword"].default)
'None'
>>> print("\n".join(": ".join((name, repr(param.kind)))
...     for name, param in inspect.signature(myfunction).parameters.items()))
one: <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>
two: <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>
args: <_ParameterKind.VAR_POSITIONAL: 2>
keyword: <_ParameterKind.KEYWORD_ONLY: 3>
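Beyond inspection, a Signature can also map a set of call arguments onto parameter names without actually calling the function, via its bind() method. A short sketch reusing the same function definition:

```python
import inspect


def myfunction(one: int, two: str = "hello", *args: str, keyword: int = None):
    print(one, two, args, keyword)


sig = inspect.signature(myfunction)

# bind() applies the normal argument-matching rules and records the result.
bound = sig.bind(123, "monty", "python", keyword=5)
arguments = dict(bound.arguments)
print(arguments)

# Invalid calls are rejected with TypeError, just as a real call would be.
try:
    sig.bind()
except TypeError as exc:
    bind_error = exc
    print("Rejected:", exc)
```

This can be useful in decorators which want to validate or log arguments by name.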

Finally, there’s also a new function inspect.getclosurevars() which reports the external names that a particular function refers to (nonlocals, globals and builtins) along with their current values:

>>> import inspect
>>> xxx = 999
>>> def outer():
...     aaa = 100
...     def middle():
...         bbb = 200
...         def inner():
...             ccc = 300
...             return aaa + bbb + ccc + xxx
...         return inner
...     return middle()
...
>>> inspect.getclosurevars(outer())
ClosureVars(nonlocals={'aaa': 100, 'bbb': 200}, globals={'xxx': 999}, builtins={}, unbound=set())

In a similar vein there’s also inspect.getgeneratorlocals() which dumps the current internal state of a generator. This could be very useful for diagnosing bugs in the context of the caller, particularly if you don’t own the code implementing the generator and so can’t easily add logging statements or similar:

>>> def generator(maxvalue):
...     cumulative = 0
...     for i in range(maxvalue):
...         cumulative += i
...         yield cumulative
...
>>> instance = generator(10)
>>> next(instance)
0
>>> next(instance)
1
>>> next(instance)
3
>>> next(instance)
6
>>> inspect.getgeneratorlocals(instance)
{'maxvalue': 10, 'cumulative': 6, 'i': 3}

faulthandler Module

There’s a new module in Python 3.3 called faulthandler which is used to show a Python traceback on an event like a segmentation fault. This could be very useful when developing or using C extension modules which often fail in a crash, making it very hard to tell where the problem actually occurred. Of course, you can fire up a debugger and figure out the line of code if it’s your module, but if it’s someone else’s at least this will help you figure out whether the error lies in your code or not.

You can enable this support at runtime with faulthandler.enable(), or you can pass -X faulthandler to the interpreter on the command-line, or set the PYTHONFAULTHANDLER environment variable. Note that this will install signal handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS, and SIGILL — if you’re using your own signal handlers for any of these, you’ll probably want to call faulthandler.enable() first and then make sure you chain into the earlier handler from your own.

Here’s an example of it working — for the avoidance of doubt, I triggered the handler here myself by manually sending SIGSEGV to the process:

>>> import faulthandler
>>> import time
>>> faulthandler.enable()
>>>
>>> def innerfunc():
...     time.sleep(300)
...
>>> def outerfunc():
...     innerfunc()
...
>>> outerfunc()
Fatal Python error: Segmentation fault

Current thread 0x000000011966bdc0 (most recent call first):
  File "<stdin>", line 2 in innerfunc
  File "<stdin>", line 2 in outerfunc
  File "<stdin>", line 1 in <module>
[1]    16338 segmentation fault  python3
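You don’t need a crash to see the same style of report: faulthandler.dump_traceback() writes it on demand. This sketch sends the output to a temporary file, rather than the default of sys.stderr, purely so that it can be read back and printed.

```python
import faulthandler
import tempfile


def innerfunc(stream):
    # Write the current thread's stack straight to the file descriptor.
    faulthandler.dump_traceback(file=stream, all_threads=False)


def outerfunc(stream):
    innerfunc(stream)


with tempfile.TemporaryFile() as stream:
    outerfunc(stream)
    stream.seek(0)
    report = stream.read().decode("utf-8", "replace")

print(report)
```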

Module Tracing Callbacks

There are a couple of modules which have added the ability to register callbacks for tracing purposes.

The gc module now provides an attribute callbacks which is a list of functions which will be called before and after each garbage collection pass. Each one has two parameters passed, the first is either "start" or "stop" to indicate whether this is before or after the collection pass, and the second is a dict providing details of the results.

>>> import gc
>>> def func(*args):
...     print("GC" + repr(args))
...
>>> gc.callbacks.append(func)
>>> class MyClass:
...     def __init__(self, arg):
...         self.arg = arg
...     def __del__(self):
...         pass
...
>>> x = MyClass(None)
>>> y = MyClass(x)
>>> z = MyClass(y)
>>> x.arg = z
>>> del x, y, z
>>> gc.collect()
GC('start', {'generation': 2, 'collected': 0, 'uncollectable': 0})
GC('stop', {'generation': 2, 'collected': 6, 'uncollectable': 0})
6

The sqlite3.Connection class has a method set_trace_callback() which can be used to register a callback function which will be called for every SQL statement that’s run by the backend, and it’s passed the statement as a string. Note this doesn’t just include statements passed to the execute() method of a cursor, but may include statements that the Python module itself runs, e.g. for transaction management.
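Here’s a minimal sketch of the SQLite tracing in action, collecting the statements into a list. The table name and values are arbitrary, and exactly which extra transaction-management statements appear will depend on the Python version.

```python
import sqlite3

statements = []

conn = sqlite3.connect(":memory:")
# The callback receives each statement as a string.
conn.set_trace_callback(statements.append)

conn.execute("CREATE TABLE demo (value INTEGER)")
conn.execute("INSERT INTO demo VALUES (1)")
conn.commit()

# Passing None removes the callback again.
conn.set_trace_callback(None)
conn.close()

for statement in statements:
    print(statement)
```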

Unicode Changes

With apologies to those already familiar with Unicode, a brief history lesson: Unicode was originally conceived as a 16-bit character set, which was thought to be sufficient to encode all languages in active use around the world. In 1996, however, the Unicode 2.0 standard expanded this to add 16 additional 16-bit “planes” to the set, to include scope for all characters ever used by any culture in history, plus other assorted symbols. This made it effectively a 21-bit character set [3]. The initial 16-bit set became the Basic Multilingual Plane (BMP), and the next two planes the Supplementary Multilingual Plane and Supplementary Ideographic Plane respectively.

OK, Unicode history lesson over. So what’s this got to do with Python? To understand that we need a brief Python history lesson. Python originally used 16-bit values for Unicode characters (i.e. UCS-2 encoding), which meant that it only supported characters in the BMP. In Python 2.2 support for “wide” builds was added, so by adding a particular configure flag when compiling the interpreter, it could be built to use UCS-4 instead. This had the advantage of allowing the full range of all Unicode planes, but at the expense of using 4 bytes for every character. Since most distributions would use the wide build, because they had to assume full Unicode support was necessary, this meant in Python 2.x unicode objects consisting primarily of Latin-1 were four times larger than they needed to be.

This has been the case until Python 3.3, where the implementation of PEP 393 means that the concepts of narrow and wide builds have been removed and everyone can now take advantage of the ability to access all Unicode characters. This is done by deciding whether to use 1-, 2- or 4-byte characters at runtime based on the highest ordinal codepoint used in the string. So, pure ASCII or Latin-1 strings use 1-byte characters, strings composed entirely from within the BMP use 2-byte characters and if any other planes are used then 4-byte characters are used.

In the example below you can see this illustrated.

>>> import sys
>>> # Standard ASCII has 1 byte per character plus 49 bytes overhead.
>>> sys.getsizeof("x" * 99)
148
>>> # Each new ASCII character adds 1 byte.
>>> sys.getsizeof("x" * 99 + "x")
149
>>> # Adding one BMP character expands the size of every character to
>>> # 2 bytes, plus 74 bytes overhead.
>>> sys.getsizeof("x" * 99 + "\N{bullet}")
274
>>> sys.getsizeof("x" * 99 + "\N{bullet}" + "x")
276
>>> # Moving beyond BMP expands the size of every character to 4 bytes,
>>> # plus 76 bytes overhead.
>>> sys.getsizeof("x" * 99 + "\N{taxi}")
476
>>> sys.getsizeof("x" * 99 + "\N{taxi}" + "x")
480

This basically offers the best of both worlds on the Python side. As well as reducing memory usage, this should also improve cache efficiency by putting values closer together in memory. In case you’re wondering about the value of this, it’s important to remember that part of the Supplementary Multilingual Plane is a funny little block called “Emoticons”, and we all know you’re not a proper application without putting "\N{face screaming in fear}" in a few critical error logs here and there. Just be aware that you may be quadrupling the size of the string in memory by doing so.

On another Unicode related note, support for aliases has been added to the \N{...} escape sequences. Some of these are abbreviations, such as \N{SHY} for \N{SOFT HYPHEN}, and some of them are previously used incorrect names for backwards compatibility where corrections have been made to the standard. In addition these aliases are also supported in unicodedata.lookup(), and this additionally supports pre-defined sequences as well. An example of a sequence would be LATIN SMALL LETTER M WITH TILDE which is equivalent to "m\N{COMBINING TILDE}". Here are some more examples:

>>> import unicodedata
>>> "\N{NBSP}" == "\N{NO-BREAK SPACE}" == "\u00A0"
True
>>> "\N{LATIN SMALL LETTER GHA}" == "\N{LATIN SMALL LETTER OI}"
True
>>> (unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
... == "\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}")
True

Conclusions

That’s it for this post, but we’re not done with Python 3.3 yet! Check out the following article for my tour of the remaining changes in this release, as well as some thoughts on the entire release.


  1. As an unrelated aside, a few months ago (at time of writing!) Ian Bicking wrote a review of his main projects which makes for some interesting reading. 

  2. And for some people a production release tool as well, although personally I think a slightly cleaner wrapper like shrinkwrap makes for a more supportable option. 

  3.  

  4. Those with names of the form __xxx__()

Photo by David Clode on Unsplash