☑ Python 2to3: What’s New in 3.3 - Part 2

7 Mar 2021 at 11:27AM in Software

The second of my two articles covering features added in Python 3.3, this one talks about a large number of changes to the standard library, especially in network and OS modules. I also discuss implicit namespace packages, which are a bit niche but can be useful for maintaining large families of packages.


This is the second and final article in this series looking at new features in Python 3.3, and we’ll be primarily drilling into a large number of changes to the Python libraries. There’s a lot of interesting stuff to cover on the Internet side, such as the new ipaddress module and changes to email, and also in terms of OS features, such as a slew of new POSIX functions that have been exposed.

Internet

There are a few module changes relating to networking and Internet protocols in this release.

ipaddress

There’s a new ipaddress module for storing IP addresses, as well as other related concepts like subnets and interfaces. All of the types have IPv4 and IPv6 variants, and offer some useful functionality for code to deal with IP addresses generically without needing to worry about the distinctions. The basic types are listed below.

IPv4Address & IPv6Address
Represents a single host address. The ip_address() utility function constructs the appropriate one of these from a string specification such as 192.168.0.1 or 2001:db8::1:0.
IPv4Network & IPv6Network
Represents a single subnet of addresses. The ip_network() utility function constructs one of these from a string specification such as 192.168.0.0/28 or 2001:db8::/56. One thing to note is that because this represents an IP subnet rather than any particular host, it’s an error for any of the bits to be non-zero in the host part of the network specification.
IPv4Interface & IPv6Interface
Represents a host network interface, which has both a host IP address and network giving the details of the local subnets. The ip_interface() utility function constructs this from a string specification such as 192.168.1.20/28. Note that unlike the specification passed to ip_network(), this has non-zero bits in the host part of the specification.
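
To make the host-bits rule concrete, here’s a small sketch of how ip_network() and ip_interface() treat the same specification. The strict parameter isn’t mentioned above, but it’s part of the ip_network() signature and relaxes the check by masking off the host bits:

```python
import ipaddress

# Host bits set in a network specification are rejected by default.
try:
    ipaddress.ip_network("192.168.0.25/28")
except ValueError as exc:
    print("rejected:", exc)

# strict=False masks off the host bits instead of raising.
net = ipaddress.ip_network("192.168.0.25/28", strict=False)
print(net)

# An interface keeps the host address and derives the enclosing network.
iface = ipaddress.ip_interface("192.168.0.25/28")
print(iface.ip, iface.network)
```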

The snippet below demonstrates some of the attributes of address objects:

>>> import ipaddress
>>> x = ipaddress.ip_address("2001:db8::1:0")
>>> x.packed
b' \x01\r\xb8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00'
>>> x.compressed
'2001:db8::1:0'
>>> x.exploded
'2001:0db8:0000:0000:0000:0000:0001:0000'
>>>
>>> x = ipaddress.ip_address("192.168.0.1")
>>> x.packed
b'\xc0\xa8\x00\x01'
>>> x.compressed
'192.168.0.1'
>>> x.exploded
'192.168.0.1'

This snippet illustrates a network and how it can be used to iterate over the addresses within it, as well as check for address membership in the subnet and overlaps with other subnets:

>>> x = ipaddress.ip_network("192.168.0.0/28")
>>> for addr in x:
...     print(repr(addr))
...
IPv4Address('192.168.0.0')
IPv4Address('192.168.0.1')
# ... (12 rows skipped)
IPv4Address('192.168.0.14')
IPv4Address('192.168.0.15')
>>> ipaddress.ip_address("192.168.0.2") in x
True
>>> ipaddress.ip_address("192.168.1.2") in x
False
>>> x.overlaps(ipaddress.ip_network("192.168.0.0/30"))
True
>>> x.overlaps(ipaddress.ip_network("192.168.1.0/30"))
False

And finally the interface can be queried for its address and netmask, as well as used to retrieve its specification with either a netmask or a prefix length in CIDR notation:

>>> x = ipaddress.ip_interface("192.168.0.25/28")
>>> x.network
IPv4Network('192.168.0.16/28')
>>> x.ip
IPv4Address('192.168.0.25')
>>> x.with_prefixlen
'192.168.0.25/28'
>>> x.with_netmask
'192.168.0.25/255.255.255.240'
>>> x.netmask
IPv4Address('255.255.255.240')
>>> x.is_private
True
>>> x.is_link_local
False

Having implemented a lot of this stuff manually in the past, I find it a big convenience to have it all here in the standard library.

Email

The email module has always attempted to be compliant with the various MIME RFCs [3]. The email ecosystem is a broad church, however, and sometimes it’s useful to be able to customise certain behaviours, either to work on email held in non-compliant offline mailboxes or to connect to non-compliant email servers. For these purposes the email module now has a policy framework.

The Policy object controls the behaviour of various aspects of the email module. This can be specified when constructing an instance from email.parser to parse messages, or when constructing an email.message.Message directly, or when serialising out an email using the classes in email.generator.

In fact Policy is an abstract base class which is designed to be extensible, but instances must provide at least the following properties:

Property        | Default | Meaning
max_line_length | 78      | Maximum line length, not including separators, when serialising.
linesep         | "\n"    | Character used to separate lines when serialising.
cte_type        | "7bit"  | If 8bit used with a BytesGenerator then non-ASCII may be used.
raise_on_defect | False   | Raise errors during parsing instead of adding them to defects list.

So, if you’ve ever found yourself sick of having to remember to override linesep="\r\n" in a lot of different places or similar, this new approach should be pretty handy.
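
As a minimal sketch of what that looks like in practice, the snippet below derives a customised policy once and then lets the message carry it around. The clone() method isn’t covered above, but it’s the documented way to get a modified copy of an existing policy. The message text is just made up for illustration, and I’m assuming a recent Python 3 where as_string() honours the message’s own policy:

```python
from email import message_from_string, policy

# clone() returns a modified copy; the original policy is left untouched.
crlf_strict = policy.default.clone(linesep="\r\n", raise_on_defect=True)

# The policy is attached to the message at parse time.
msg = message_from_string("Subject: hello\n\nbody text\n", policy=crlf_strict)

# Serialising picks up the line separator from the message's policy,
# so there's no need to pass linesep again here.
wire_text = msg.as_string()
print(repr(wire_text))
```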

However, one of the main motivations for introducing this system is that it allows backwards-incompatible API changes to be made in a way which enables authors to opt in to them when ready, but without breaking existing code. If you default to the compat32 policy, you get an interface and functionality which is compatible with the old pre-3.3 behaviour.

There is also an EmailPolicy, however, which introduces a mechanism for handling email headers using custom classes. This policy implements the following controls:

Property       | Default    | Meaning
refold_source  | long       | Controls whether email headers are refolded by the generator.
header_factory | See note 4 | Callable that takes name and value and returns a custom header object for that particular header.

The classes used to represent headers can implement custom behaviour and allow access to parsed details. Here’s an example using the default policy which implements the EmailPolicy with all default behaviours unchanged:

>>> from email.message import Message
>>> from email.policy import default
>>> msg = Message(policy=default)
>>> msg["To"] = "Andy Pearce <andy@andy-pearce>"
>>> type(msg["To"])
<class 'email.headerregistry._UniqueAddressHeader'>
>>> msg["To"].addresses
(Address(display_name='Andy Pearce', username='andy', domain='andy-pearce'),)
>>>
>>> import email.utils
>>> msg["Date"] = email.utils.localtime()
>>> type(msg["Date"])
<class 'email.headerregistry._UniqueDateHeader'>
>>> msg["Date"].datetime
datetime.datetime(2021, 3, 1, 17, 18, 21, 467804, tzinfo=datetime.timezone(datetime.timedelta(0), 'GMT'))
>>> print(msg)
To: Andy Pearce <andy@andy-pearce>
Date: Mon, 01 Mar 2021 17:18:21 +0000

These classes will handle aspects such as presenting Unicode representations to code, but serialising out using UTF-8 or similar encoding, so the programmer no longer has to deal with such complications, provided they select the correct policy.

On a separate email-related note, the smtpd module now also supports RFC 5321, which adds an extension framework to allow optional additions to SMTP; and RFC 1870, which offers clients the ability to pre-declare the size of messages before sending them, so that errors can be detected early rather than after sending a lot of data needlessly.

The smtplib module also has some improvements. The classes now support a source_address keyword argument to specify the source address to use for binding the outgoing socket, for servers where there are multiple potential interfaces and it’s important that a particular one is used. The SMTP class can now act as a context manager, issuing a QUIT command and disconnecting when the context exits.

FTP

Also on the Internet-related front there were a handful of small enhancements to the ftplib module.

ftplib.FTP Now Accepts source_address
This is to specify the source address to use for binding the outgoing socket, for servers where there are multiple potential interfaces and it’s important that a particular one is used.
FTP_TLS.ccc()
The FTP_TLS class, which is a subclass of FTP which adds TLS support as per RFC 4217, has now acquired a ccc() method which reverts the connection back to plaintext. Apparently, this can be useful to take advantage of firewalls that know how to handle NAT with non-secure FTP without opening fixed ports. So now you know.
FTP.mlsd()
The mlsd() method has been added to FTP objects which uses the MLSD command specified by RFC 3659. This offers a better API than FTP.nlst(), returning a generator rather than a list and including file metadata rather than just filenames. Not all FTP servers support the MLSD command, however.

Web Modules

The http, html and urllib packages also got some love in this release.

BaseHTTPRequestHandler Header Buffering
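The http.server.BaseHTTPRequestHandler server now buffers response headers and writes them all at once when end_headers() is called. A new flush_headers() method can be used to control directly when the accumulated headers are sent.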
html.parser.HTMLParser Now Parses Invalid Markup
After a large collection of bug fixes, errors are no longer raised when parsing broken markup. As a result the old strict parameter of the constructor as well as the now-unused HTMLParseError have been deprecated.
html.entities.html5 Added
This is a useful dict that maps entity names to the equivalent characters, for example html5["amp;"] == "&". This includes all the Unicode characters too. If you want the full list, take a peek at §13.5 of the HTML standard.
urllib.request.Request Method Specification
The urllib.request.Request class now has a method parameter which can specify the HTTP method to use. Previously this was decided automatically between GET and POST based on whether body data was provided, and that behaviour is still the default if the method isn’t specified.
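
A quick sketch of the last two items in that list. Nothing here touches the network, since a Request object only describes a request until it’s passed to urlopen(); the example.com URL is just a placeholder:

```python
from html.entities import html5
from urllib.request import Request

# Keys are entity names without the leading "&", here with the trailing ";".
print(html5["amp;"], html5["eacute;"], html5["rarr;"])

# Without body data the method defaults to GET; with data it becomes POST.
print(Request("http://example.com/").get_method())
print(Request("http://example.com/", data=b"x=1").get_method())

# The method parameter overrides that automatic choice.
head = Request("http://example.com/", method="HEAD")
print(head.get_method())
```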

Sockets

Support For sendmsg() and recvmsg()
These two functions provide two main additional features over traditional sends: scatter/gather interfaces to send/receive to/from multiple buffers, and the ability to send and receive ancillary data. For more details on ancillary data, see the cmsg man page.
PF_CAN Support
The socket class now supports the PF_CAN protocol family, which I don’t pretend to know much about but is an open source stack contributed by Volkswagen which bridges the Controller Area Network (CAN) standard for implementing a vehicle communications bus into the standard sockets layer. This one’s pretty niche, but it was just too cool not to mention [5].
PF_RDS Support
Another additional protocol family supported in this release is PF_RDS which is the Reliable Datagram Sockets protocol. This is a protocol developed by Oracle which offers similar interfaces to UDP but offers guaranteed in-order delivery. Unlike TCP, however, it’s still datagram-based and connectionless. You now know at least as much about RDS as I do. If anyone knows why they didn’t just use SCTP, which already seems to offer them everything they need, let me know in the comments.
PF_SYSTEM Support
We all know that new protocol families always come in threes, and the third is PF_SYSTEM. This is a macOS-specific set of protocols for communicating with kernel extensions [6].
sethostname() Added
If the current process has sufficient privilege, sethostname() updates the system hostname. On Unix systems this will generally require running as root or, in the case of Linux at least, having the CAP_SYS_ADMIN capability.
socketserver.BaseServer Actions Hook
The class now calls a service_actions() method every time around the main poll loop. In the base class this method does nothing, but derived classes can implement it to perform periodic actions. Specifically, the ForkingMixIn now uses this hook to clean up any defunct child processes.
ssl Module Random Number Generation
A couple of new OpenSSL functions are exposed for random number generation, RAND_bytes() and RAND_pseudo_bytes(). However, os.urandom() is still preferable for most applications.
ssl Module Exceptions
These are now more fine-grained, and the following new exceptions have been added for particular cases: SSLZeroReturnError, SSLWantReadError, SSLWantWriteError, SSLSyscallError and SSLEOFError.
SSLContext.load_cert_chain() Passwords
The load_cert_chain() method now accepts a password parameter for cases where the private key is encrypted. It can be a str or bytes value containing the actual password, or a callable which will return the password. If specified, this overrides OpenSSL’s default password-prompting mechanism.
ssl Supports Additional Algorithms
Some changes have been made to properly support Diffie-Hellman key exchange on all platforms. In addition, the “PLUS” variants of SCRAM are now supported, which use a technique called channel binding to prevent some person-in-the-middle attacks.
SSL Compression
SSL sockets now have a compression() method to query the current compression algorithm in use. The SSL context also now supports an OP_NO_COMPRESSION option to disable compression.
ssl Next Protocol Negotiation
A new method ssl.SSLContext.set_npn_protocols() has been added to support the Next Protocol Negotiation (NPN) extension to TLS. This allows different application-level protocols to be specified in preference order. It was originally added to support Google’s SPDY, and although SPDY is now deprecated (and superseded by HTTP/2) this extension is general in nature and still useful.
ssl Error Introspection

Instances of ssl.SSLError now have two additional attributes:

  • library is a string indicating the OpenSSL subsystem responsible for the error (e.g. SSL, X509).
  • reason is a string code indicating the reason for the error (e.g. CERTIFICATE_VERIFY_FAILED).

New Collections

A few new data structures have been added as part of this release.

SimpleNamespace

There’s a new types.SimpleNamespace type which can be used in cases where you just want to hold some attributes. It’s essentially just a thin wrapper around a dict which allows the keys to be accessed as attributes instead of being subscripted. It’s also somewhat similar to an empty class definition, except for three main advantages:

  • You can initialise attributes in the constructor, as in types.SimpleNamespace(a=1, xyz=2).
  • It provides a readable repr() which follows the usual guideline that eval(repr(x)) == x.
  • It defines an equality operator which compares by equality of attributes, like a dict, unlike the default equality of classes, which compares by the result of id().
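
Here’s a short sketch contrasting SimpleNamespace with the empty-class idiom it replaces. The Bag class and the attribute names are purely illustrative:

```python
from types import SimpleNamespace

config = SimpleNamespace(host="localhost", port=8080)
config.debug = True  # attributes can still be added after construction

print(config)
print(config == SimpleNamespace(host="localhost", port=8080, debug=True))

# vars() exposes the underlying attribute dict.
print(vars(config)["port"])


class Bag:
    """The old idiom: an empty class used only to hold attributes."""


a, b = Bag(), Bag()
a.port = b.port = 8080
# Plain instances compare by identity, so these are not equal.
print(a == b)
```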

ChainMap

There’s a new collections.ChainMap class which can group together multiple mappings to form a single unified updateable view. The class overall acts as a mapping, and read lookups are performed across each mapping in turn with the first match being returned. Updates and additions are always performed in the first mapping in the list, and note that this may mask the same key in later mappings (but it will leave the original mapping intact).

>>> import collections
>>> a = {"one": 1, "two": 2}
>>> b = {"three": 3, "four": 4}
>>> c = {"five": 5}
>>> chain = collections.ChainMap(a, b, c)
>>> chain["one"]
1
>>> chain["five"]
5
>>> chain.get("ten", "MISSING")
'MISSING'
>>> list(chain.keys())
['five', 'three', 'four', 'one', 'two']
>>> chain["one"] = 100
>>> chain["five"] = 500
>>> chain["six"] = 600
>>> list(chain.items())
[('five', 500), ('one', 100), ('three', 3), ('four', 4), ('six', 600), ('two', 2)]
>>> a
{'five': 500, 'six': 600, 'one': 100, 'two': 2}
>>> b
{'three': 3, 'four': 4}
>>> c
{'five': 5}

Operating System Features

There are a whole host of enhancements to the os, shutil and signal modules in this release which are covered below. I’ve tried to be brief, but include enough useful details for anyone who’s interested but not immediately familiar.

os Module

os.pipe2() Added
On platforms that support it, the pipe2() call is now available. This allows flags to be set on the file descriptors thus created atomically at creation. The O_NONBLOCK flag might seem the most useful, although it’s for O_CLOEXEC (close-on-exec) where the atomicity is really essential. If you open a pipe and then try to set O_CLOEXEC separately, it’s possible for a different thread to call fork() and execve() between these two, thus leaving the file descriptor open in the resultant new process (which is exactly what O_CLOEXEC is meant to avoid).
os.sendfile() Added
In a similar vein, the sendfile() system call is now also available. This allows a specified number of bytes to be copied directly between two file descriptors entirely within the kernel, which avoids the overheads of a copy to and from userspace that read() and write() would incur. This is useful for, say, static file HTTP daemons.
os.get_terminal_size() Added
Queries the specified file descriptor, or sys.stdout by default, to obtain the window size of the attached terminal. On Unix systems (at least) it probably uses the TIOCGWINSZ command with ioctl(), so if the file descriptor isn’t attached to a terminal I’d expect you’d get an OSError due to inappropriate ioctl() for the device. There’s a higher-level shutil.get_terminal_size() discussed below which handles these errors, so it’s probably best to use that in most cases.
Avoiding Symlink Races

Bugs and security vulnerabilities can result from the use of symlinks in the filesystem if you implement the pattern of first obtaining a target filename, and then opening it in a different step. This is because the target of the symlink may be changed, either accidentally or maliciously, in the meantime. To avoid this, various os functions have been enhanced to deal with file descriptors instead of filenames, which avoids this issue. This also offers improved performance.

Firstly, there’s a new os.fwalk() function which is the same as os.walk() except that it accepts a directory file descriptor via the dir_fd parameter, and instead of yielding a 3-tuple it yields a 4-tuple of (dirpath, dirnames, filenames, dir_fd), where the last element is a file descriptor for the directory dirpath. Secondly, many functions now support accepting a dir_fd parameter, and any path names specified should be relative to that directory (e.g. access(), chmod(), stat()). This is not available on all platforms, and attempting to use it when not available will raise NotImplementedError. To check support, os.supports_dir_fd is a set of the functions that support it on the current platform.

Thirdly, many of these functions also now support a follow_symlinks parameter which, if False, means they’ll operate on the symlink itself as opposed to the target of the symlink. Once again, this isn’t always available, and you risk getting NotImplementedError if you don’t check the function is in os.supports_follow_symlinks.

Finally, some functions now also support passing a file descriptor instead of a path (e.g. chdir(), chown(), stat()). Support is optional for this as well and you should check your functions are in os.supports_fd.
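
The pattern of checking the supports_* sets before using these parameters might look something like the sketch below. It assumes a POSIX platform, since it opens a directory with os.open() and creates a symlink, and the file names are made up for illustration:

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data.txt"), "w") as f:
        f.write("hello")
    os.symlink("data.txt", os.path.join(tmp, "link"))

    # Hold the directory open; lookups below are relative to this descriptor
    # rather than to a path which could be swapped out underneath us.
    dir_fd = os.open(tmp, os.O_RDONLY)
    try:
        if os.stat in os.supports_dir_fd:
            size = os.stat("data.txt", dir_fd=dir_fd).st_size
        else:
            # Fallback for platforms without dir_fd support.
            size = os.stat(os.path.join(tmp, "data.txt")).st_size

        if os.stat in os.supports_dir_fd and os.stat in os.supports_follow_symlinks:
            # Examine the symlink itself, not the file it points to.
            link_mode = os.stat("link", dir_fd=dir_fd, follow_symlinks=False).st_mode
        else:
            link_mode = os.lstat(os.path.join(tmp, "link")).st_mode
        is_link = stat.S_ISLNK(link_mode)
    finally:
        os.close(dir_fd)

print(size, is_link)
```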

os.access() With Effective IDs
There’s now an effective_ids parameter which, if True, checks access using the effective UID/GID as opposed to the real identifiers. This is platform-dependent, so check os.supports_effective_ids, which once again is a set of functions.
os.getpriority() & os.setpriority()
These underlying system calls are now also exposed, so processes can set “nice” values in the same way as with os.nice() but for other processes too.
os.replace() Added
On POSIX platforms os.rename() overwrites the destination, but on Windows it raises an error. Now there’s os.replace() which does the same thing but always overwrites the destination on all platforms.
Nanosecond Precision File Timestamps
The functions os.stat(), os.fstat() and os.lstat() now support reading timestamps with nanosecond precision, where available on the platform. The os.utime() function supports updating nanosecond timestamps.
Linux Extended Attributes Support
There is now a family of functions to support Linux extended attributes, namely os.getxattr(), os.listxattr(), os.removexattr() and os.setxattr(). These are key/value pairs that can be associated with files to attach metadata for multiple purposes, such as supporting Access Control Lists (ACLs). Support for these is platform-dependent, not just on the OS but potentially on the underlying filesystem in use as well (although most of the Linux ones seem to support them).
Linux Scheduling
On Linux (and any other supported platforms) the os module now allows access to the sched_*() family of functions which control CPU scheduling by the OS. You can find more details on the sched man page.
New POSIX Operations

Support for some additional POSIX filesystem and other operations was added in this release:

  • lockf() applies, tests or removes POSIX filesystem locks from a file.
  • pread() and pwrite() read/write from a specified offset within a current file descriptor but without changing the current file descriptor offset.
  • readv() and writev() provide scatter/gather read/write, where a single file can be read into, or written from, multiple separate buffers on the application side.
  • truncate() truncates or extends the specified path to be an exact size. If the existing file was larger, excess data is lost; if it was smaller, it’s padded with nul characters.
  • posix_fadvise() allows applications to declare an intention to use a specific access pattern on a file, to allow the filesystem to potentially make optimisations. This can be an intention for sequential access, random access, or an intention to read a particular block so it can be fetched into the cache.
  • posix_fallocate() reserves disk space for expansion of a particular file.
  • sync() flushes any filesystem caches to disk.
  • waitid() is a variant of waitpid() which allows more control over which child process state changes to wait for.
  • getgrouplist() returns the list of group IDs to which the specified username belongs.
os.times() and os.uname() Return Named Tuples
In an extension to the previous tuple return types, this allows results to be accessed by attribute name.
os.lseek() in Sparse Files
On some platforms, lseek() now supports additional options for the whence parameter, os.SEEK_HOLE and os.SEEK_DATA. These start at a specified offset and find the nearest location which either has data, or is a hole in the data. They’re only really useful in sparse files, because other files have contiguous data anyway.
stat.filemode() Added
Not strictly in the os module, but since the stat module is a companion to os.stat() I thought it most appropriate to cover here. An undocumented function tarfile.filemode() has been exposed as stat.filemode(), which converts a file mode such as 0o100755 into the string form -rwxr-xr-x.
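
To illustrate a few of the smaller additions in this section, here’s a sketch using os.replace() to swap a new version of a file into place, along with stat.filemode() and the named tuple returned by os.times(). The file names are hypothetical:

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    live = os.path.join(tmp, "settings.conf")
    staged = os.path.join(tmp, "settings.conf.new")
    with open(live, "w") as f:
        f.write("old")
    with open(staged, "w") as f:
        f.write("new")

    # Overwrites the destination on every platform, unlike os.rename().
    os.replace(staged, live)

    with open(live) as f:
        content = f.read()
    staged_still_exists = os.path.exists(staged)

# Convert a numeric mode into the familiar ls-style string.
mode_string = stat.filemode(0o100755)

# os.times() results can now be accessed by attribute name.
cpu = os.times()
print(content, staged_still_exists, mode_string, cpu.user + cpu.system)
```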

shutil & shlex Modules

shlex.quote() Added
Actually this function hasn’t been added so much as moved in from the pipes module, but it was previously undocumented. It escapes all characters in a string which might otherwise have special significance to a shell.
shutil.disk_usage() Added
Returns the total, used and free disk space values for the partition on which the specified path resides. Under the hood this seems to use os.statvfs(), but this wrapper is more convenient and also works on Windows, which doesn’t provide statvfs().
shutil.chown() Now Accepts Names
As an alternative to the numeric IDs of the user and/or group.
shutil.get_terminal_size() Added
Attempts to discern the terminal window size. If the environment variables COLUMNS and LINES are defined, they’re used. Otherwise, os.get_terminal_size() (mentioned above) is called on sys.stdout. If this fails for any reason, the fallback values passed as a parameter are returned — these default to 80x24 if not specified.
shutil.copy2() and shutil.copystat() Improvements
These now correctly duplicate nanosecond-precision timestamps, as well as extended attributes on platforms that support them.
shutil.move() Symlinks
Now handles symlinks as POSIX mv does, re-creating the symlink instead of copying the contents of the target file when moving across filesystems, as was the previous behaviour. It also now returns the destination path for convenience.
shutil.rmtree() Security
On platforms that support dir_fd in os.open() and os.unlink(), it’s now used by shutil.rmtree to avoid symlink attacks.
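
Here’s a brief sketch of three of the functions from this section. The awkward filename is invented to show why quoting matters, and the current directory is used for the disk usage query simply because it always exists:

```python
import shlex
import shutil

# A filename that would be dangerous if pasted into a shell command unquoted.
filename = "my report; rm -rf ~.txt"
command = "wc -l {}".format(shlex.quote(filename))
print(command)

# Named tuple of total, used and free bytes for the containing partition.
usage = shutil.disk_usage(".")
print(usage.total, usage.used, usage.free)

# Falls back to the supplied size when no terminal can be queried.
size = shutil.get_terminal_size(fallback=(80, 24))
print(size.columns, size.lines)
```

Splitting the quoted command with shlex.split() gives back the original filename as a single argument, which is a handy way to convince yourself that the quoting is sound.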

IPC Modules

New Functions
  • pthread_sigmask() allows querying and update of the signal mask for the current thread. If you’re interested in more details of the interactions between threads and signals, I found this article had some useful examples.
  • pthread_kill() sends a signal to a specified thread ID.
  • sigpending() is for examining the signals which are currently pending on the current thread or the process as a whole.
  • sigwait() and sigwaitinfo() both block until one of a set of signals becomes pending, with the latter returning more information about the signal which arrived.
  • sigtimedwait() is the same as sigwaitinfo() except that it only waits for a specified amount of time.
Signal Number On Wakeup FD
When using signal.set_wakeup_fd() to allow signals to wake up code waiting on file IO events (e.g. using the select module), the signal number is now written as a byte into this FD, whereas previously simply a nul byte was written regardless of which signal arrived. This allows the handler of that polling loop to determine which signal arrived, if multiple are being waited on.
OSError Replaces RuntimeError in signal
When errors occur in the functions signal.signal() and signal.siginterrupt(), they now raise OSError with an errno attribute, as opposed to a simple RuntimeError previously.
subprocess Commands Can Be bytes
Previously this was not possible on POSIX platforms.
subprocess.DEVNULL Added
This allows output to be discarded on any platform.
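
The new signal functions listed in this section fit together roughly as in the sketch below, which blocks a signal, sends it to the current thread, and then collects it synchronously rather than via a handler. This is POSIX-only, and the choice of SIGUSR1 is arbitrary:

```python
import signal
import threading

SIG = signal.SIGUSR1  # arbitrary choice for illustration

# Block the signal in this thread so it queues up instead of being delivered.
old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {SIG})
try:
    # Direct the signal at the current thread specifically.
    signal.pthread_kill(threading.get_ident(), SIG)

    # The blocked signal now shows up in the pending set.
    was_pending = SIG in signal.sigpending()

    # sigwait() consumes the pending signal and returns its number.
    received = signal.sigwait({SIG})
finally:
    # Restore whatever mask was in place beforehand.
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

print(was_pending, received)
```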

threading Module

Threading Classes Can Be Subclassed

Several of the objects in threading used to be factory functions returning instances, but are now real classes and hence may be subclassed. This change includes:

  • threading.Condition
  • threading.Semaphore
  • threading.BoundedSemaphore
  • threading.Event
  • threading.Timer
threading.Thread Constructor Accepts daemon
A daemon keyword parameter has been added to the threading.Thread constructor to override the default behaviour of inheriting this from the parent thread.
threading.get_ident() Exposed
The function _thread.get_ident() is now exposed as a supported function threading.get_ident(), which returns the thread ID of the current thread.

time Module

The time module has several new functions which are useful. The first three of these are new clocks with different properties:

time.monotonic()
Returns the (fractional) number of seconds since some unspecified reference point. The absolute value of this time isn’t useful, but it’s guaranteed to monotonically increase and it’s unaffected by any changes to system time, so it’s useful to measure the time between two events in a way which won’t be broken during DST boundaries or the system administrator changing the clock.
time.perf_counter()
As time.monotonic() but has the highest available resolution on the platform.
time.process_time()
Returns the total time spent active in the current process, including both system and user CPU time. Whilst the process is sleeping (blocked) this counter doesn’t tick up. The reference point is undefined, so only the difference between consecutive calls is valid.
time.get_clock_info()

This function returns details about the specified clock, which could be any of the options above (passed as a string) or "time" for the details of the time.time() standard system clock. The result is an object which has the following attributes:

  • adjustable is True if the clock may be changed by something external to the process (e.g. a system administrator or an NTP daemon).
  • implementation is the name of the underlying C function called to provide the timer value.
  • monotonic is True if the clock is guaranteed to never go backwards.
  • resolution is the resolution of the clock in fractional seconds.
Access To System Clocks

The time module has also exposed the following underlying system calls to query the status of various system clocks:

  • clock_getres() returns the resolution of the specified clock, in fractional seconds.
  • clock_gettime() returns the current time of the specified clock, in fractional seconds.
  • clock_settime() sets the time on the specified clock, if the process has appropriate privileges. The only clock for which that’s supported currently is CLOCK_REALTIME.

The clocks which can be specified in this release are:

  • time.CLOCK_REALTIME is the standard system clock.
  • time.CLOCK_MONOTONIC is a monotonically increasing clock since some unspecified reference point.
  • time.CLOCK_MONOTONIC_RAW provides access to the raw hardware timer that’s not subject to adjustments.
  • time.CLOCK_PROCESS_CPUTIME_ID counts CPU time on a per-process basis.
  • time.CLOCK_THREAD_CPUTIME_ID counts CPU time on a per-thread basis.
  • time.CLOCK_HIGHRES is a higher-resolution clock only available on Solaris.
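
As a small sketch of how the new clocks differ, the snippet below times a short sleep with both time.monotonic() and time.process_time(), then prints the details that time.get_clock_info() reports for each clock. The 50ms sleep is an arbitrary choice:

```python
import time

start_wall = time.monotonic()
start_cpu = time.process_time()

time.sleep(0.05)  # the process is blocked, so almost no CPU time accrues

elapsed_wall = time.monotonic() - start_wall
elapsed_cpu = time.process_time() - start_cpu
print("wall: {:.3f}s  cpu: {:.3f}s".format(elapsed_wall, elapsed_cpu))

# The properties of each clock vary by platform.
for name in ("time", "monotonic", "perf_counter", "process_time"):
    info = time.get_clock_info(name)
    print(name, info.implementation, info.monotonic, info.adjustable, info.resolution)
```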

Implicit Namespace Packages

This is a feature which is probably only of interest to a particular set of package maintainers, so I’m going to do my best not to drill into too much detail. However, there’s a certain level of context required for this to make sense — you can always skip to the next section if it gets too dull!

First I should touch on what’s a namespace package in the first place. If you’re a Python programmer, you’ll probably be aware that the basic unit of code reusability is the module [1]. Modules can be imported individually, but they can also be collected into packages, which can contain modules or other packages. In its simplest forms, a module is a single .py file and a package is a directory which contains a file called __init__.py. The contents of this script are executed when the package is imported, but the very fact of the file’s existence is what tags it as a package to Python, even if the file is empty.

So now we come to what on earth is a namespace package. Simply put, this is a logical package which presents a uniform name to be imported within Python code, but is physically split across multiple directories. For example, you may want to create a machinelearning package, which itself contains other packages like dimensionreduction, anomalydetection and clustering. For such a large domain, however, each of those packages is likely to consist of its own modules and subpackages, and have its own team of maintainers, and coordinating some common release strategy and packaging system across all those teams and repositories is going to be really painful. What you really want to do is have each team package and ship its own code independently, but still have them presented to the programmer as a uniform package. This would be a namespace package.

Python already had two approaches for doing this, one provided by setuptools and later another one provided by the pkgutil module in the standard library. Both of these rely on the namespace package providing some respective boilerplate __init__.py files to declare it as a namespace package. These are shown below for reference, but I’m not going to discuss them further because this section is about the new approach.

# The setuptools approach involves calling a function in __init__.py,
# and also requires some changes in setup.py.
__import__('pkg_resources').declare_namespace(__name__)

# The pkgutil approach just has each package add its own directory to
# the __path__ attribute for the namespace package, which defines the
# list of directories to search for modules and subpackages. This is
# more or less equivalent to a script modifying sys.path, but more
# carefully scoped to impact only the package in question.
__path__ = __import__('pkgutil').extend_path(__path__, __name__)

Both of these approaches share some issues, however. One of them is that when OS package maintainers (e.g. for Linux distributions) want somewhere to install these different things, they’d probably like to choose the same place, to keep things tidy. But this means all those packages are going to try and install an __init__.py file over the top of each other, which makes things tricky — the OS packaging system doesn’t know these files necessarily contain the same things and will generate all sorts of complaints about the conflict.

The new approach, therefore, is to make these packages implicit, where there’s no need for an __init__.py. You can just chuck some modules and/or sub-packages into a directory which is a subdirectory of something on sys.path and Python will treat that as a package and make the contents available. This is discussed in much more detail in PEP 420.
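As a minimal sketch of how this looks in practice, the snippet below builds two separate directories that each contribute one module to the same package, and then imports both. The directory names team_a and team_b and the package name mlpkg_demo are made up purely for illustration; the only real mechanics are sys.path and the import statement.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two independent "distributions", each shipping part of the same
# package. Note that neither of them creates an __init__.py file.
root = Path(tempfile.mkdtemp())
for team, module in [("team_a", "clustering"), ("team_b", "anomalies")]:
    pkg_dir = root / team / "mlpkg_demo"
    pkg_dir.mkdir(parents=True)
    (pkg_dir / (module + ".py")).write_text("NAME = {!r}\n".format(module))
    sys.path.append(str(root / team))

importlib.invalidate_caches()

# Both portions are importable under the one package name.
from mlpkg_demo import clustering, anomalies
import mlpkg_demo

print(clustering.NAME, anomalies.NAME)
# The package's __path__ lists every directory that contributes to it.
print(list(mlpkg_demo.__path__))
```

Each team could ship its own directory on its own schedule, and the programmer still just sees mlpkg_demo.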

Beyond these rather niche use-cases of mega-packages, this feature seems like it should make life a little easier creating regular packages. After all, it’s quite common that you don’t really need any setup code in __init__.py, and creating that empty file just feels messy. So if we don’t need to these days then why bother?

Well, as a result of this change it’s true that regular packages can be created without the need for __init__.py, but the old approach is still the correct way to create a regular package, and has some advantages. The primary one is that omitting __init__.py is likely to break existing tools which attempt to search for code, such as unittest, pytest and mypy to name just a few. It’s also noteworthy that if you rely on namespace packages and then someone adds something to your namespace which contains an __init__.py, this ends the search process for the package in question since Python assumes this is a regular package. This means all your other implicit namespace packages will be suddenly hidden when the clashing regular package is installed. Using __init__.py consistently everywhere avoids this problem.
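To illustrate the hiding problem, here is a small sketch with two implicit portions and one clashing regular package, all sharing a name. Everything here (shadow_demo, the directory names and the make_portion helper) is hypothetical scaffolding for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

def make_portion(dirname, module, regular=False):
    """Create <root>/<dirname>/shadow_demo/<module>.py and add it to sys.path."""
    pkg_dir = root / dirname / "shadow_demo"
    pkg_dir.mkdir(parents=True)
    (pkg_dir / (module + ".py")).write_text("")
    if regular:
        # The presence of __init__.py makes this a regular package.
        (pkg_dir / "__init__.py").write_text("")
    sys.path.append(str(root / dirname))

make_portion("portion_a", "first")
make_portion("portion_b", "second")
make_portion("clashing", "third", regular=True)
importlib.invalidate_caches()

import shadow_demo.third  # found: it lives in the regular package

try:
    import shadow_demo.first  # lives in an implicit portion
    first_visible = True
except ImportError:
    first_visible = False

print(first_visible)
```

Even though the two implicit portions appear earlier on sys.path, the regular package wins and the modules in the other directories can no longer be imported.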

Furthermore, regular packages can be imported as soon as they’re located on the path, but for namespace packages the entire path must be fully processed before the package can be created. The path entries must also be recalculated on every import, for example in case the user has added additional entries to sys.path which would contribute additional content to an existing namespace package. These factors can introduce performance issues when importing namespace packages.

There are also some more minor factors which favour regular packages which I’m including below for completeness but which I doubt will be particularly compelling for many people.

  • Namespace packages lack some features of regular packages: they’re missing a __file__ attribute and the __path__ attribute is read-only. These aren’t likely to be a major issue for anyone, unless you have some grotty code which is trying to calculate paths relative to the source files in the package or similar.
  • The setuptools.find_packages() function won’t find these new style namespace packages, although there is now a setuptools.find_namespace_packages() function which will, so it should be a fairly simple issue to modify setup.py appropriately.
  • If you’ve implemented your own import finders and loaders as per PEP 302 then these will need to be modified to support this new approach. I’m guessing this is a pretty small slice of developers, though.

As a final note, if you are having any issues with imports, I strongly recommend checking out Nick Coghlan’s excellent article Traps for the Unwary in Python’s Import System which discusses some of the most common problems you might run into.

Other Builtin Changes

There are a set of small but useful changes in some of the builtins that are worth noting.

open() Opener
There is a new opener parameter for open(): a callable which is invoked with arguments (filename, flags) and is expected to return a file descriptor, as os.open() would. This can be used, for example, to pass flags which aren’t supported by open() while still benefiting from the context manager behaviour offered by open().
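A quick sketch of an opener, in this case one that asks for owner-only permissions on newly created files. The function name private_opener and the file name are just placeholders of mine.

```python
import os
import tempfile

def private_opener(path, flags):
    # Called by open() with the path and the flags it derived from the
    # mode string; it must return a raw file descriptor. Here we request
    # owner-only permissions for any newly created file.
    return os.open(path, flags, 0o600)

target = os.path.join(tempfile.mkdtemp(), "secret.txt")

# We still get a normal file object, so the with statement works as usual.
with open(target, "w", opener=private_opener) as fd:
    fd.write("hunter2\n")

with open(target) as fd:
    contents = fd.read()
print(contents)
```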
open() Exclusively
The x mode was added for exclusive creation, failing if the file already exists. This is equivalent to passing the O_CREAT and O_EXCL flags together to open() on POSIX systems.
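For instance, something along these lines could be used for a simple lock or marker file (the file name is arbitrary):

```python
import os
import tempfile

target = os.path.join(tempfile.mkdtemp(), "lockfile")

with open(target, "x") as fd:       # doesn't exist yet, so it's created
    fd.write("first\n")

try:
    with open(target, "x") as fd:   # second attempt must fail
        fd.write("second\n")
    clobbered = True
except FileExistsError:
    clobbered = False

print(clobbered)
```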
print() Flushing
print() now has a flush keyword argument which, if set to True, flushes the output stream immediately after the output.
hash() Randomization
As of Python 3.3, a random salt is used during hashing operations by default. This improves security by making hash values less predictable between separate invocations of the interpreter, but it does mean you definitely must not rely on them being consistent if you serialise them out somewhere. I wrote a brief article about this topic around half a decade ago, as I was quite surprised at the time how serious a problem it can be.
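You can see the effect by asking fresh interpreters for the same hash. The sketch below pins the salt using the PYTHONHASHSEED environment variable so the behaviour is repeatable; the helper name is my own.

```python
import os
import subprocess
import sys

def hash_in_new_interpreter(seed):
    """Return hash('spam') as computed by a freshly started interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    output = subprocess.check_output(
        [sys.executable, "-c", "print(hash('spam'))"], env=env)
    return int(output)

# A fixed seed gives repeatable values...
same_a = hash_in_new_interpreter("1")
same_b = hash_in_new_interpreter("1")
# ...but a different seed (or the default random one) almost certainly won't.
other = hash_in_new_interpreter("2")
print(same_a == same_b, same_a == other)
```

If you need a value that is stable across runs, something from hashlib is a safer bet than hash().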
str.casefold()
str objects now have a casefold() method to return a casefolded version of the string. This is intended to be used for case-insensitive comparisons, and is a much more Unicode-friendly approach than calling upper() or lower(). A full discussion of why is outside the scope of this article, but I suggest the excellent article Truths Programmers Should Know About Case by James Bennett for an informative look at the complexities of case outside of Latin-1 languages. Spoiler: it’s harder than you think, which should always be your default assumption for any I18n issues[2].
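The classic illustration is the German sharp s, which lower() leaves alone but casefold() expands:

```python
german_a = "Straße"
german_b = "STRASSE"

print(german_a.lower() == german_b.lower())        # False
print(german_a.casefold() == german_b.casefold())  # True
```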
copy() and clear()
There are now copy() and clear() methods on both list and bytearray objects, with the obvious semantics.
range Equality
Equality comparisons have been defined on range objects based on equality of the generated values. For example, range(3, 10, 3) == range(3, 12, 3). However, bear in mind this doesn’t evaluate the actual contents so range(3) != [0, 1, 2]. Also, applying transformations such as reversed() defeats these comparisons, since the result is an iterator rather than a range.
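A few cases to make the rules concrete:

```python
print(range(3, 10, 3) == range(3, 12, 3))   # True: both yield 3, 6, 9
print(range(0) == range(5, 5))              # True: both are empty
print(range(3) == [0, 1, 2])                # False: a range is not a list
print(list(range(3)) == [0, 1, 2])          # True once materialised

# reversed() returns an iterator, which only compares by identity...
print(reversed(range(3)) == reversed(range(3)))   # False
# ...whereas slicing yields another range, so comparison still works.
print(range(3)[::-1] == range(2, -1, -1))         # True
```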
dict.setdefault() enhancement
Previously dict.setdefault() resulted in two hash lookups, one to check for an existing item and one for the insertion. Since a hash lookup can call into arbitrary Python code this meant that the operation was potentially non-atomic. This has been fixed in Python 3.3 to only perform the lookup once.
bytes Methods Taking int
The methods count(), find(), rfind(), index() and rindex() of bytes and bytearray objects now accept an integer in the range 0-255 to specify a single byte value.
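For example, searching for individual byte values without having to wrap them in a bytes object first:

```python
data = b"hello world"

print(data.count(108))              # 3 (108 is ord("l"))
print(data.find(111))               # 4 (111 is ord("o"))
print(data.rfind(111))              # 7
print(data.index(0x77))             # 6 (0x77 is ord("w"))
print(bytearray(data).rindex(108))  # 9
```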
memoryview changes
The memoryview class has a new implementation which fixes several previous ownership and lifetime issues which had led to crash reports. This release also adds a number of features, such as better support for multi-dimensional lists and more flexible slicing.
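As one small taste of the multi-dimensional support, the sketch below uses the cast() method (my choice of example, not something discussed above) to view six bytes as a 2x3 grid without copying them.

```python
flat = memoryview(bytes(range(6)))
# View the same six bytes as 2 rows of 3 columns.
grid = flat.cast("B", shape=[2, 3])

print(grid.ndim, grid.shape)   # 2 (2, 3)
print(grid.tolist())           # [[0, 1, 2], [3, 4, 5]]
```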

Other Module Changes

There were some other additional and improved modules which I’ll outline briefly below.

bz2 Rewritten

The bz2 module has been completely rewritten, adding several new features:

  • There’s a new bz2.open() function, which supports opening files in binary mode (where it operates just like the bz2.BZ2File constructor) or text mode (where it applies an io.TextIOWrapper).
  • You can now pass any file-like object to bz2.BZ2File in place of a filename.
  • Support for multi-stream inputs and outputs has been added.
  • All of the io.BufferedIOBase interface is now implemented by bz2.BZ2File, except for detach() and truncate().
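Here’s a short sketch that exercises the first three of these points using in-memory buffers. Note that I’m passing the file-like object as the first positional argument in each case.

```python
import bz2
import io

# Two independently compressed streams, simply concatenated.
payload = bz2.compress(b"first line\n") + bz2.compress(b"second line\n")

# Multi-stream input is decompressed as if it were a single stream.
print(bz2.decompress(payload))

# BZ2File accepts a file-like object, here an in-memory buffer...
with bz2.BZ2File(io.BytesIO(payload)) as binary:
    raw = binary.read()

# ...and bz2.open() in text mode layers an io.TextIOWrapper on top.
with bz2.open(io.BytesIO(payload), "rt", encoding="ascii") as text:
    lines = text.readlines()
print(lines)
```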
Abstract Base Classes Moved To collections.abc
This avoids confusion with the concrete classes provided by collections. Aliases still exist at the top-level, however, to preserve backwards-compatibility.
crypt.mksalt()
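For convenience when generating a random salt, there’s a new crypt.mksalt() function. By default it produces a salt for the strongest hashing method available on the platform, but it can also create the 2-character salt used by traditional Unix passwords.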
datetime Improvements

There are a few enhancements to the ever-useful datetime library.

  • Equality comparisons between naive and timezone-aware datetime objects used to raise TypeError, but it was decided this was inconsistent with the behaviour of other incomparable types. As of Python 3.3 this will simply return False instead. Note that other comparisons will still raise TypeError, however.
  • There’s a new datetime.timestamp() method to return an epoch timestamp representation. This is implicitly in UTC, so timezone-aware datetimes will be converted and naive datetimes will be assumed to be in the local timezone and converted using the platform’s mktime().
  • datetime.strftime() now supports years prior to 1000 CE.
  • datetime.astimezone() now assumes the system time zone if no parameters are passed.
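A brief sketch of the comparison, timestamp() and astimezone() changes, using an arbitrary date:

```python
from datetime import datetime, timezone

naive = datetime(2021, 3, 7, 11, 27)
aware = datetime(2021, 3, 7, 11, 27, tzinfo=timezone.utc)

print(naive == aware)      # False, rather than raising TypeError
try:
    naive < aware          # ordering comparisons still refuse to guess
    ordering_allowed = True
except TypeError:
    ordering_allowed = False

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(epoch.timestamp())   # 0.0

# With no argument, astimezone() converts to the system's local time zone.
local = aware.astimezone()
print(local.tzinfo is not None, local == aware)
```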
decimal Rewritten in C
There’s a new C implementation of the decimal module using the high-performance libmpdec. There are some API changes as a result which I’m not going to go into here as I think most of them only impact edge cases.
functools.lru_cache() Type Segregation
Back in an earlier article we talked about the functools.lru_cache decorator for caching function results based on the parameters. This caching was based on checking the full set of arguments for equality with previous ones specified, and if they all compared equal then the cached result would be returned instead of calling the function. In this release, there’s a new typed parameter which, if True, also enforces that the arguments are of the same type to trigger the caching behaviour. For example, calling a function with 3 and then 3.0 would return the cached value with typed=False (the default) but would call the function twice with typed=True.
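The sketch below counts how often the underlying function really runs. I’ve deliberately used a two-argument function: the current documentation warns that some types such as int and str may be cached separately even when typed is false, and a lone int argument is one such case in recent CPython versions. The make_counter helper and area function are just illustrative names.

```python
import functools

def make_counter(typed):
    """Return a cached function plus a list recording each real call."""
    calls = []

    @functools.lru_cache(maxsize=None, typed=typed)
    def area(width, height):
        calls.append((width, height))
        return width * height

    return area, calls

untyped_area, untyped_calls = make_counter(typed=False)
untyped_area(3, 4)
untyped_area(3.0, 4.0)      # equal arguments: served from the cache
print(len(untyped_calls))   # 1

typed_area, typed_calls = make_counter(typed=True)
typed_area(3, 4)
typed_area(3.0, 4.0)        # same values but float, not int: a new call
print(len(typed_calls))     # 2
```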
importlib
The mechanics of importing have been reworked so that importlib.__import__ is now used directly by __import__(). A number of other changes have had to happen behind the scenes to make this possible, but it means that the import machinery is now fully exposed as part of importlib, which is great for transparency and for any code which needs to find and import modules programmatically. I considered this a little niche to cover in detail, but the release notes have some good discussion on it.
io.TextIOWrapper Buffering Optional
The constructor of io.TextIOWrapper has a new write_through optional argument. If set to True, write() calls are guaranteed not to be buffered but will be immediately passed to the underlying binary buffer.
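A small sketch using an in-memory binary buffer, so we can peek at what has actually been written:

```python
import io

raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8", write_through=True)

text.write("héllo")
# No flush() needed: the encoded bytes have already reached the buffer.
print(raw.getvalue())
```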
itertools.accumulate() Supports Custom Function
This function, which was added in the previous release, now supports any binary function as opposed to just summing results. For example, passing func=operator.mul would give a running product of values.
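For example, a running sum, a running product and a running maximum:

```python
import itertools
import operator

values = [1, 2, 3, 4, 5]
print(list(itertools.accumulate(values)))                 # [1, 3, 6, 10, 15]
print(list(itertools.accumulate(values, operator.mul)))   # [1, 2, 6, 24, 120]
print(list(itertools.accumulate([3, 1, 4, 1, 5], max)))   # [3, 3, 4, 4, 5]
```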
logging.basicConfig() Supports Handlers
There’s now a handlers parameter on logging.basicConfig() which takes an iterable of handlers to be added to the root logger. This is probably handy for those scripts that are just large enough to be worth using logging, particularly if you consider the code might one day form the basis of a reusable module, but which aren’t big enough to mess around setting up a logging configuration file.
lzma Added
Provides LZMA compression, first used in the 7-Zip program and now primarily provided by the xz utility. This library supports the .xz file format, and also the .lzma legacy format used by earlier versions of this utility.
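A minimal round trip through both container formats might look like this:

```python
import lzma

data = b"spam and eggs " * 100

# The .xz container is the default; FORMAT_ALONE selects legacy .lzma.
xz_blob = lzma.compress(data)
legacy_blob = lzma.compress(data, format=lzma.FORMAT_ALONE)

# decompress() auto-detects which of the two formats it was given.
print(lzma.decompress(xz_blob) == data)
print(lzma.decompress(legacy_blob) == data)
print(len(data), len(xz_blob))
```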
math.log2() Added
Not just a convenient alias for math.log(x, 2), this will often be faster and/or more accurate than the existing approach, which involves the usual division of logs to convert the base.
pickle Dispatch Tables
pickle.Pickler objects now support a dispatch_table attribute which allows the pickling functions to be customised on a per-type basis for that particular pickler.
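Here’s a sketch of one pickler being told to flatten a particular class into a plain dict, without affecting any other pickler. The Temperature class and reduce_temperature function are invented for the example, and I’m setting dispatch_table as an attribute on the pickler instance.

```python
import copyreg
import io
import pickle

class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

def reduce_temperature(obj):
    # A reduction function returns (callable, args); unpickling calls
    # callable(*args). Here a Temperature is flattened to a plain dict.
    return dict, ({"celsius": obj.celsius},)

buffer = io.BytesIO()
pickler = pickle.Pickler(buffer)
# Start from a copy of the global table so other picklers are unaffected.
pickler.dispatch_table = copyreg.dispatch_table.copy()
pickler.dispatch_table[Temperature] = reduce_temperature
pickler.dump(Temperature(21.5))

restored = pickle.loads(buffer.getvalue())
print(restored)
```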
sched Improvements

The sched module, for generalised event scheduling, has had a variety of improvements made to it:

  • run() can now be passed blocking=False to execute pending events and then return without blocking. This widens the scope of applications which can use the module.
  • sched.scheduler can now be used safely in multithreaded environments.
  • The parameters to the sched.scheduler constructor now have sensible defaults.
  • enter() and enterabs() methods now no longer require the argument parameter to be specified, and also support a kwargs parameter to pass values by keyword to the callback.
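The following sketch touches on most of these: default constructor arguments, kwargs without argument, and a non-blocking run(). The record function and its labels are placeholders.

```python
import sched

fired = []

def record(label, detail=None):
    fired.append((label, detail))

scheduler = sched.scheduler()   # defaults supply the time and delay functions

# argument can be omitted entirely; kwargs passes values by keyword.
scheduler.enter(0, 1, record, kwargs={"label": "due now", "detail": "via kwargs"})
scheduler.enter(3600, 1, record, argument=("due in an hour",))

# Runs whatever is due and returns straight away rather than sleeping;
# the return value is the delay until the next pending event.
remaining = scheduler.run(blocking=False)

print(fired)
print(remaining)
```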
sys.implementation
There’s a new sys.implementation attribute which holds information about the current implementation being used. A full list of the attributes is beyond the scope of this article, but as one example sys.implementation.version is a version tuple in the same format as sys.version_info. The former contains the implementation version whereas the latter specifies the Python language version implemented. For CPython the two will be the same, since this is the reference implementation, but for cases like PyPy the two will differ. PEP 421 has more details.
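A quick way to have a poke around on whatever interpreter you happen to be running:

```python
import sys

impl = sys.implementation
print(impl.name)          # e.g. "cpython" or "pypy"
print(impl.version)       # version of the implementation itself
print(sys.version_info)   # version of the Python language it implements
print(impl.cache_tag)     # tag used in cached bytecode filenames
```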
tarfile Supports LZMA
Using the new lzma module mentioned above.
textwrap Indent Function
A new indent() function allows a prefix to be added to the lines of a given string (by default, every line that isn’t purely whitespace). This functionality has been in the textwrap.TextWrapper class for some time, but is now exposed as its own function for convenience.
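For example, quoting a message email-style. Note that whitespace-only lines are skipped unless you supply your own predicate:

```python
import textwrap

message = "first line\nsecond line\n\nafter a blank line\n"

quoted = textwrap.indent(message, "> ")
print(quoted)

# A predicate decides which lines get the prefix; this one takes them all.
quoted_all = textwrap.indent(message, "> ", predicate=lambda line: True)
print(quoted_all)
```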
xml.etree.ElementTree C Extension
This module now uses its C extension by default, so there’s no longer any need to import xml.etree.cElementTree, although that module remains for backwards compatibility.
zlib EOF
The zlib module now has a zlib.Decompress.eof attribute which is True if the end of the stream has been reached. If this is False but there is no more data, it indicates that the compressed stream has been truncated.
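A sketch of using this to spot truncated input:

```python
import zlib

compressed = zlib.compress(b"spam and eggs " * 100)

complete = zlib.decompressobj()
complete.decompress(compressed)
print(complete.eof)    # True: the end of the stream was reached

truncated = zlib.decompressobj()
truncated.decompress(compressed[:-4])   # chop off the trailing checksum
print(truncated.eof)   # False: we ran out of data before the stream ended
```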

Other Changes

As usual, there were some minor things that struck me as less critical, but I wanted to mention nonetheless.

Raw bytes literals
“Raw” str literals are written r"..." and bytes literals are b"...". Previously, combining these required br"...", but as of Python 3.3 rb"..." will also work. Rejoice in the syntax errors thus avoided.
2.x-style Unicode literals
To ease transition of Python 2 code, u"..." literals are once again supported for str objects. This has no semantic significance in Python 3 since it is the default.
Fine-Grained Import Locks
Imports used to take a global lock, which could lead to some odd effects in the presence of multiple threads importing concurrently and code being run at import time. In Python 3.3 this has been switched to a per-module lock, so imports in multiple concurrent threads are still serialised correctly whilst still allowing different modules to be imported independently. If you enjoy learning about the subtle issues one must consider when trying to make concurrency bullet-proof, you may find issue 9260 an interesting read.
Windows Launcher
On Windows the Python installer now sets up a launcher which will run .py files when double-clicked. It even checks the shebang line to determine the Python version to use, if multiple are available.
Buffer Protocol Documentation
The buffer protocol documentation has been improved significantly.
Efficient Attribute Storage
The dict implementation used for holding attributes of objects has been updated to allow it to share the memory used for the key strings between multiple instances of a class. This can save 10-20% on memory footprints on heavily object-oriented code, and increased locality also achieves some modest performance improvements of up to 10%. PEP 412 has the full details.

Conclusions

So that’s Python 3.3, and what a lot there was in it! The yield from support is handy, but really just a taster of proper coroutines that are coming in future releases with the async keyword. The venv module is a bit of a game-changer in my opinion, because now that everyone can simply rely on it being there we can do a lot better documenting and automating development and runtime setups of Python applications. Similarly the addition of unittest.mock means everyone can use the powerful mocking features it provides to enhance unit tests without having to add to their project’s development-time dependencies. Testing is something where you want to lower the barrier to it as much as you can, to encourage everyone to use it freely.

The other thing that jumped out to me about this release in particular was the sheer breadth of new POSIX functions and other operating system functionality that are now exposed. It’s always a pet peeve of mine when my favourite system calls aren’t easily exposed in Python, so I love to see these sweeping improvements.

So all in all, no massive overhauls, but a huge array of useful features. What more could you ask from a point release?


  1. This could be pure Python or an extension module in another language like C or C++, but that distinction isn’t important for this discussion. 

  2. Or if you really want the nitty gritty, feel free to peruse §3.13 of the Unicode standard. But if you do — and with sincere apologies to the authors of the Unicode standard who’ve forgotten more about international alphabets than I’ll ever know — my advice is to brew some strong coffee first. 

  3. Well, since you asked, that’s specifically RFC 2045, RFC 2046, RFC 2047, RFC 4288, RFC 4289 and RFC 2049. 

  4. The default header_factory is documented in the email.headerregistry module. 

  5. And let’s be honest, my “niche filter” is so close to the identity function that they could probably share a lawnmower. I tend to only miss out the things that apply to only around five people, three of whom don’t even use Python. 

  6. However, since KEXTs have been replaced with system extensions more recently, which run in user-space rather than in the kernel, then I don’t know whether the PF_SYSTEM protocols are going to remain relevant for very long. 

Photo by David Clode on Unsplash