☑ Chunky requests

Why have webservers been so slow to accept chunked requests?

cake slice

HTTP is, in general, a good protocol.

That is to say it’s not awful — in my experience of protocols, not being awful is a major achievement. Perhaps I’ve been unfairly biased by dealings in the past with the likes of SNMP, LDAP and almost any P2P file sharing protocol you can mention1, but it does seem like they’ve all got some major annoyance somewhere along the line.

As you may already be aware, the current version of HTTP is 1.1 and this has been in use almost ubiquitously for over a decade. One of the handy features that was introduced in 1.1 over the older version 1.0 was chunked encoding. If you’re already familiar with it, skip the next three paragraphs.

HTTP requests and responses consist of a set of headers, which define information about the request or response, and then optionally a body. In the case of a response, the body is fairly obviously the file being requested, which could be HTML, image data or anything else. In the case of a request, the body is often omitted for performing a simple GET to download a file, but when doing a POST or PUT to upload data then the body of the request typically contains the data being uploaded.

In HTTP, as in any protocol, the receiver of a message must be able to determine where the message ends. For a message with no body this is easy, as the headers follow a defined format and are terminated with a blank line. When a body is present, however, it can potentially contain any data so it’s not possible to specify a fixed terminator. Instead, it can be specified by adding a Content-Length header to the message — this indicates the number of bytes in the body, so when the receiving end has that many bytes of body data it knows the message is finished.

Sending a Content-Length isn’t always convenient, however — for example, many web pages these days are dynamically generated by server-side applications and hence the size of the response isn’t necessarily known in advance. It can be buffered up locally until it’s complete and then the size of it can be determined, a technique often called store and forward. However, this consumes additional memory on the sending side and increases the user-visible latency of the response by preventing a browser from fetching other resources referenced by the page in parallel with fetching the remainder of the page. As of HTTP/1.1, therefore, a piece-wise method of encoding data known as chunked encoding was added. In this scheme, body data is split into variable-sized chunks and each individual chunk has a short header indicating its size and then the data for that chunk. This means that only the size of each chunk need be known in advance and the sending side can use whatever chunk size is convenient2.
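To make that concrete, here's roughly what a chunked body looks like on the wire: each chunk is prefixed by its size in hexadecimal followed by CRLF, the chunk data is followed by another CRLF, and a zero-sized chunk marks the end of the body.

4\r\n
data\r\n
9\r\n
more data\r\n
0\r\n
\r\n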

So, chunked encoding is great — well, as long as it's supported by both ends, that is. If you look at §3.6.1 of the HTTP RFC, however, you'll see that supporting it is mandatory — the key phrase is:

All HTTP/1.1 applications MUST be able to receive and decode the “chunked” transfer-coding […]

So, it’s safe to assume that every client, server and library supports it, right? Well, not quite, as it turns out.

In general, support for chunked encoding of responses is pretty good. Of course, there will always be the odd homebrew library here and there that doesn’t even care about RFC-compliance, but the major HTTP clients, servers and libraries all do a reasonable job of it.

Chunk-encoded requests, on the other hand, are a totally different kettle of fish3. For reasons I've never quite understood, support for chunk-encoded requests has always been patchy, despite the fact that there's no reason at all a POST or PUT request couldn't be just as large as any response — for example, when uploading a large file. Sure, there isn't the same latency argument, but you still don't want to force the client to buffer up the whole request before sending it just for the sake of lazy programmers.

For example, the popular nginx webserver didn't support chunk-encoded requests in its core until release 1.3.9, a little more than seven months ago — admittedly there was a plugin to do it in earlier versions. Another example I came across recently was that Python's httplib module doesn't support chunked requests at all, even if the user does the chunking — this doesn't seem to have changed in the latest version at time of writing. As it happens you can still do it yourself, as I recently explained to someone in a Stack Overflow answer, but you have to take care not to give httplib enough information to add its own Content-Length header — providing both that and chunked encoding is a no-no5, although the chunk lengths should take precedence according to the RFC.
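For the curious, the workaround looks something along these lines. This is just a minimal sketch using httplib's lower-level putrequest()/send() interface rather than the exact code from that answer, and the host, path and chunking scheme here are purely illustrative:

import httplib

def send_chunked_post(host, path, chunks):
    # Using the lower-level interface means httplib never works out a
    # Content-Length for us, which must not be combined with chunking.
    conn = httplib.HTTPConnection(host)
    conn.putrequest("POST", path)
    conn.putheader("Transfer-Encoding", "chunked")
    conn.endheaders()
    for chunk in chunks:
        if chunk:
            # Each chunk: length in hex, CRLF, the data, CRLF.
            conn.send("%x\r\n%s\r\n" % (len(chunk), chunk))
    # A zero-length chunk terminates the request body.
    conn.send("0\r\n\r\n")
    return conn.getresponse()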

What really puzzles me is how such a fundamental (and mandatory!) part of the RFC can have been ignored for requests for so long? It’s almost as if these people throw their software together based on real-world use-cases and not by poring endlessly over the intricate details of the standards documents and shunning any involvement with third party implementations. I mean, what’s all this “real world” nonsense? Frankly, I think it’s simply despicable.

But on a more serious note, while I can entirely understand how people might think this sort of thing isn’t too important (and don’t even get me started on the lack of proper support for “100 Continue”6), it makes it a really serious pain when you want to write properly robust code which won’t consume huge amounts of memory even when it doesn’t know the size of a request in advance. If it was a tricky feature I could understand it, but I don’t reckon it can take more than 20 minutes to support, including unit tests. Heck, that Stack Overflow answer I wrote contains a pretty complete implementation and that took me about 5, albeit lacking tests.

So please, the next time you’re working on a HTTP client library, just take a few minutes to implement chunked requests properly. Your coding soul will shine that little bit brighter for it. Now, about that “100 Continue” support… OK, I’d better not push my luck.


  1. The notable exception being BitTorrent which stands head and shoulders above its peers. Ahaha. Ahem. 

  2. Although sending data in chunks that are too small can cause excessive overhead as this blog post illustrates. 

  3. Trust me, you don’t want a kettle of fish on your other hand, especially if it’s just boiled. Come to think of it, who cooks fish in a kettle, anyway?4 

  4. Well, OK, I’m pedantic enough to note that kettle originally derives from ketill which is the Norse word for “cauldron” and it didn’t refer to the sort of closed vessel we now think of as a “kettle” when the phrase originated. I’m always spoiling my own fun. 

  5. See §4.4 of the HTTP RFC, item 3. 

  6. Used at least by the Amazon S3 REST API

3 Jul 2013 at 2:20PM by Andy Pearce in Software  | Photo by Annie Spratt on Unsplash  | Tags: http web  |  See comments

☑ Tuning in the static

In C++ the static keyword has quite a few wrinkles that may not be immediately apparent. One of them is related to constructor order, and I briefly describe it here.

large rock

The static keyword will be familiar to most C and C++ programmers. It has various uses, but for the purposes of this post I’m going to focus on static, local variables within a function.

In C you might find code like this:

int function()
{
  static int value = 9;
  return ++value;
}

On each call this function will return int values starting at 10 and incrementing by one on each call, as value is static and hence persists between calls to the function. The following, however, is not valid in C:

int another_function()
{
  return 123 + 456;
}

int function()
{
  static int value = another_function();
  return ++value;
}

This is because in C, objects with static storage must be initialised with constant expressions1. This makes life easy for the compiler because typically it can put the initial value directly into the data segment of the binary and then just omit any initialisation of that variable when the function is called.

Not so in C++ where things are rather more complicated. Here, static variables within a function or method can be initialised with any expression that their non-static counterparts would accept and the initialisation happens on the first call to that function2.

This makes sense when you think about it, because in C++ variables can be class types and their constructors can run arbitrary code anyway. So, calling a function to get the initial value isn't really much of a leap. However, it's potentially quite a pain for the compiler and, by extension, the performance-conscious coder as well.

The reason that this might impact performance is that the compiler can no longer perform initialisation by including literals in the binary, since the values aren’t, in general, known until runtime. It now needs to track whether the static variables have been initialised in a function, and it needs to check this every time the function is called. Now I’m not sure which approach compilers take to achieve this, but it’s most likely going to add some overhead3, even if just a little. In a commonly-used function on a performance-critical data path, this could become significant.

A further complicating factor is that each static variable must be separately tracked, because any given run through the function may end up not passing the definition if it's within a conditional block. Also, objects are required4 to be destroyed in the reverse of the order in which they were constructed. Put these two together and there's quite a bit of variability — consider this small program:

#include <iostream>
#include <sstream>
#include <string>

class MyClass
{
public:
  MyClass(std::string id);
  ~MyClass();
private:
  std::string id_;
};

MyClass::MyClass(std::string id) : id_(id)
{
  std::cout << "Created " << id << std::endl;
}

MyClass::~MyClass()
{
  std::cout << "Destroyed " << id_ << std::endl;
}

void function(bool do_first)
{
  if (do_first) {
    static MyClass first("first");
  }
  static MyClass second("second");
}

int main(int argc, char *argv[])
{
  MyClass instance("zero");
  function(argc % 2 == 0);
  if (argc > 2) {
    function(true);
  }

  return 0;
}

With this code, you get a different order of construction and destruction based on the number of arguments you provide. We may skip the construction of first entirely:

$ static-order
Created zero
Created second
Destroyed zero
Destroyed second

We may construct first and second on the first call to function() and hence have them destroyed in the opposite order:

$ static-order one
Created zero
Created first
Created second
Destroyed zero
Destroyed second
Destroyed first

Or we may skip over constructing first on the first call, and then have it performed on the second, in which case we get the opposite order of destruction to the above:

$ static-order one two
Created zero
Created second
Created first
Destroyed zero
Destroyed first
Destroyed second

In all cases you'll note that zero is both constructed and destroyed first. It's constructed first because it's created at the start of main() before any calls to function(). It's destroyed first because it goes out of scope when main() returns, which happens just prior to the termination of the program, which is the point at which static objects, whether local or global, are finally destroyed.

Static variables get more complex every time you look at them — I haven’t covered the order (or lack thereof) of initialising static objects in different compilation units, and we haven’t even begun to talk about multithreaded environments yet…

Just be careful with static. Oh, and, uh, the rest of C++ too, I suppose. In fact, have you ever considered Python?


  1. See §6.7.8.4 of the C standard

  2. Incidentally, this enables a useful way to prevent the static initialisation order fiasco, but that’s another story. 

  3. Well, if you assume you can write to your own code section then there are probably ways of branching to the static initialiser and then overwriting the branch with a no-op or similar, and this would be almost zero overhead. However, I believe on Linux at least the code section is read-only at runtime which puts the kibosh on sneaky tricks like that. 

  4. See §3.6.3.1 of the C++ standard

14 Jun 2013 at 4:18PM by Andy Pearce in Software  | Photo by Frantzou Fleurine on Unsplash  | Tags: c++ c  |  See comments

☑ Just like old

If you have the luxury of migrating your Linux installation to a new hard disk before the old one packs up entirely, it’s quite easily done with standard tools.

two penguins

Recently on my Linux machine at work I started getting some concerning emails from smartd which looked like this:

Device: /dev/sda [SAT], 7 Offline uncorrectable sectors

Any errors from smartd are a cause for concern, but particularly this one. To explain, a short digression — anybody familiar with SMART and hard drives can skip the next three paragraphs.

Hard disks are split into sectors which are the smallest units which can be addressed1. Each sector corresponds to a tiny portion of the physical disk surface and any damage to the surface, such as scratches or particles of dust, may render one or more of these sectors inaccessible — these are often called bad sectors. This has been a problem since the earliest days of hard drives, so operating systems have been designed to cope with sectors that the disk reports as bad, avoiding their use for files.

Modern hard disk manufacturers have more or less accepted that some proportion of drives will have minor defects, so they reserve a small area of the disk for reallocated sectors, in addition to the stated capacity of the drive. When bad sectors are found, the drive’s firmware quietly relocates them to some of this spare space. The amount of space “wasted” by this approach is a tiny proportion of the space of the drive and saves manufacturers from having to deal with a steady stream of customers RMAing drives with a tiny proportion of bad sectors. There is a limit to the scope of this relocation, however, and when the spare space is exhausted the drive has no choice but to report the failures directly to the system2.

The net result of all this is that by the time your system is reporting bad sectors to you, your hard disk has probably already had quite a few physical defects crop up. The way hard drives work, this often means the drive may be starting to degrade and may suffer a catastrophic failure soon — this was confirmed by a large-scale study by Google a few years ago. So, by the time your operating system starts reporting disk errors, it may be just about too late to practically do anything about it. This is where SMART comes in — it’s a method of querying information from hard disks, including such items as the number of sectors which the drive has quietly reallocated for you.

The smartd daemon uses SMART to monitor your disks and watch for changes in the counters which may indicate a problem. Increases in the reallocated sector count should be watched carefully — occasionally these might be isolated instances, but if you see this number change continuously over a few days or weeks then you should assume the worst and plan for your hard disk to fail at any moment. The counter I mentioned above, the offline uncorrectable sector count, is even worse — this means that the drive encountered an error it couldn’t solve when reading or writing part of the disk. This is also a strong indicator of failure.

So, I know my hard disk is about to fail — what can I do about it? The instructions below cover my experiences on an Ubuntu 12.04 system, but the process should be similar for other distributions. Note that this is quite a low-level process which assumes a fair degree of confidence with Linux and is designed to duplicate exactly the same environment. You may find it easier to simply back up your important files and reinstall on to a fresh disk.

Since I use Linux, it turns out to be comparatively easy to migrate over to a new drive. The first step is to obtain a replacement hard disk, then power the system off, connect it up to a spare SATA socket and boot up again. At this point, you should be able to partition it with fdisk, presumably in the same way as your current drive but the only requirement is that each partition is at least big enough to hold all the current files in that part of your system. Once you’ve partitioned it, format the partitions with, for example, mke2fs and mkswap as appropriate. At this point, mount the non-swap partitions in the way that they’ll be mounted in the final system but under some root — for example, if you had just / and /home partitions then you might do:

sudo mkdir /mnt/newhdd
sudo mount /dev/sdb1 /mnt/newhdd
sudo mkdir /mnt/newhdd/home
sudo mount /dev/sdb5 /mnt/newhdd/home

Important: make sure you replace /dev/sdb with the actual device of your new disk. You can find this out using:

sudo lshw -class disk

… and looking at the logical name fields of the devices which are listed.

At this point you’re ready to start copying files over to the new hard disk. You can do this simply with rsync, but you have to provide the appropriate options to copy special files across and avoid copying external media and pseudo-filesystems:

rsync -aHAXvP --exclude="/mnt" --exclude="/lost+found" --exclude="/sys" \
      --exclude="/proc" --exclude="/run/shm" --exclude="/run/lock" / /mnt/newhdd/

You may wish to exclude other directories too — I suggest running mount and excluding anything else which is mounted with tmpfs, for example.

You can leave this running in the background — it's likely to take quite a long time for a system which has been running for a while, and it also might impact your system's performance somewhat. It's quite safe to abort and re-run another time — the rsync with those parameters will carry on exactly where it left off.

While this is going on, make sure you have an up-to-date rescue disk available, which you’ll need as part of the process. I happened to use the Ubuntu Rescue Remix CD, but any reasonable rescue or live CD is likely to work. It needs to have rsync, grub, the blkid utility and a text editor available and be able to mount all your filesystem types.

Once that command has finished, you then need to wait for a time where you’re ready to do the switch. Make sure you won’t need to be interrupted or use the PC for anything for at least half an hour. If you had to abandon the process and come back to it later, make sure you re-run the above rsync command just prior to doing the following — the idea is to make sure the two systems are as closely synchronised as possible.

When you’re ready to proceed, shut down the system, open it up and swap the two disks over. Strictly speaking you probably don’t need to swap them, but I like to keep my system disk as /dev/sda so it’s easier to remember. Just make sure you remember that they’re swapped now!

Now boot the system into the rescue CD you created earlier. Get to a shell prompt and mount your drives — let’s say that the new disk is now /dev/sda and the old one is /dev/sdb, continuing the two partition example from earlier, then you’d do something like this:

mkdir /mnt/newhdd /mnt/oldhdd
mount /dev/sda1 /mnt/newhdd
mount /dev/sdb1 /mnt/oldhdd
mount /dev/sda5 /mnt/newhdd/home
mount /dev/sdb5 /mnt/oldhdd/home

I’m assuming you’re already logged in as root — if not, you’ll need to use sudo or su as appropriate. This varies between rescue systems.

As you can see, the principle is to mount both old and new systems in the same way as they will be used. At this point you can then invoke something similar to the rsync from earlier, except with the source changed slightly. Note that you don't need all those --exclude options any more because the only things mounted should be the ones you've manually mounted yourself, which are all the partitions you actually want to copy:

rsync -aHAXvP /mnt/oldhdd/ /mnt/newhdd/

Once this final rsync has finished, you’ll need to tweak a few things on your target drive before you can boot into it. After this point do not run rsync again or you will undo the changes you’re about to make.

First, you need to update /mnt/newhdd/etc/fstab to reflect your new hard disk. If you take a look, you’ll probably find that the lines for the standard partitions start like this:

UUID=d3964aa8-f237-4b34-814b-7176719b2e42

What you need to do is replace these UUIDs with the ones from your new drive. You can find this out by running blkid which should output something like this:

/dev/sda1: UUID="b7299d50-8918-459f-9168-2a743f462658" TYPE="swap" 
/dev/sda2: LABEL="/" UUID="43f6065c-d141-4a64-afda-3e0763bbbc9a" TYPE="ext4"
/dev/sdb1: UUID="8affb32a-bb25-8fa2-8473-2adc443d1900" TYPE="swap"
/dev/sdb2: LABEL="/" UUID="d3964aa8-f237-4b34-814b-7176719b2e42" TYPE="ext4"

What you want to do is copy the UUID fields for your new disk into fstab over the top of the old ones. Be careful not to accidentally copy a stray quote or similar.

The other thing you need to do is change the grub.cfg file to refer to the new IDs. Typically this file is auto-generated, but you can use a simple search and replace to update the IDs in the old file. First grep the file for the UUID of the old root partition to make sure you're changing the right thing:

grep d3964aa8-f237-4b34-814b-7176719b2e42 /mnt/newhdd/boot/grub/grub.cfg

Then replace it with the new one, with something like this:

cp /mnt/newhdd/boot/grub/grub.cfg /mnt/newhdd/boot/grub/grub.cfg.orig
sed 's/d3964aa8-f237-4b34-814b-7176719b2e42/43f6065c-d141-4a64-afda-3e0763bbbc9a/g' \
    /mnt/newhdd/boot/grub/grub.cfg.orig > /mnt/newhdd/boot/grub/grub.cfg

As an aside, there’s probably a more graceful way of using update-grub to re-write the new configuration, but I found it a lot easier just to do the search and replace like this.

At this point you should install the grub bootloader on to the new disk’s boot sector:

grub-install --recheck --no-floppy --root-directory=/mnt/newhdd /dev/sda

Finally, you should be ready to reboot. Cross your fingers!

If your system doesn’t come back up then I suggest you use the rescue CD to fix things. Also, since you haven’t actually written anything to the old disk, you should always be able to swap the disks back and try to figure out what went wrong.

At this point you should shut your system down again, remove the old disk entirely and try booting up again. If your system came back up before but fails now, then it was probably booting off the old disk, which suggests the bootloader wasn't properly installed on the new disk.

Hopefully that’s been of some help to someone — by all means leave a comment if you have any issues with it or you think I’ve made a mistake. Good luck!


  1. Typically they’re 512 bytes, although recently drives with larger sectors have started to crop up. 

  2. Strictly speaking there’s also a small performance penalty when accessing a reallocated sector on a drive, so they’re also bad news in performance-critical servers — typically this isn’t relevant to most people, however. 

5 Jun 2013 at 6:02PM by Andy Pearce in Software  | Photo by Cara Fuller  | Tags: linux backup  |  See comments

☑ Jinja Ninja

I recently had to do a few not-quite-trivial things with the Jinja2 templating engine, and the more I use it the more I like it.

beach karate

This blog is generated using a tool called Pelican, which generates a set of static HTML files from Markdown pages and other source material. It’s a simple yet elegant tool, and you can customise its output using themes. This site uses a theme I created myself called Graphite. Of course, you’d know all this if you read the little footer at the bottom of the page1.

As it happens, Pelican themes use Jinja2, which is one of the more popular Python templating languages. Since I recently had to do some non-trivial things with the site theme here, I thought I’d post my thoughts on it — the executive summary, for anybody who’s already bored, is that I think it’s rather good.

The main thing I wanted to achieve was to reorganise the archive of old posts into one page per year, with sections for each month. To index this I wanted a top-level page which simply linked to each month of each year, with no posts listed. One thing I didn’t want to do was have to change core Pelican, since I’m trying to keep this theme suitable for anybody (even though it’s unlikely that anyone but myself will ever use it).

Pelican already had some configuration which got me part of the way there. It’s possible for it to put pages into subdirectories according to year and month, and also create index.html pages in them to provide an appropriate index. This was a great starting point, but some work was still needed since the posts were presented to the template as a simple sorted list of objects with appropriate attributes.

I wanted the year indices (such as this one) to have links to individual posts organised under headings per month. This was fairly easy to achieve by recording the date of the previous post linked and emitting a header if the month and/or year of the post about to be linked differed. Here's a snippet from the template which actually generates the links:

<h1>Archives</h1>
{% set last_date = None %}
<dl>
{% for article in dates %}
    {% if last_date != (article.date.year, article.date.month) %}
      <dt>{{ article.date|strftime("%b %Y") }}</dt>
    {% endif %}
    <dd>
      <a href="{{ SITEURL }}/{{ article.url }}"
         title="{{ article.locale_date }}: {{ article.summary|striptags|escape }}">
        {{ article.title }}
      </a>
    </dd>
    {% set last_date = (article.date.year, article.date.month) %}
{% endfor %}
</dl>

Pelican has set dates to an iterable of posts, each of which is a class instance with some appropriate attributes. You can see that setting a tracking variable last_date is simple enough, as is iterating over dates. Then we conditionally emit a <dt> tag containing the date if the current post's date differs from the previous one2. Since last_date starts at None, this will always compare unequal the first time and emit the month for the first post. Thereafter, the heading is only emitted when the month (or year) changes. This approach does, of course, assume that dates yields posts in sorted order.

The other points worth noting are the filters, which take the item on the left and transform it somehow. The strftime filter is provided by Pelican, and passes the input date and the format string parameter to strftime() in the obvious way. The striptags and escape filters are available as standard in Jinja2 — their operation should be fairly obvious from the code above.

What I like is the way that I can write fairly natural Pythonic code, referring to attributes and the like, but still have it executed in a fairly secure sandboxed environment instead of just passed to the Python interpreter, where it could cause all sorts of mischief.

Also, there are a few useful extensions to basic Python builtins, such as the ability to refer to the current loop index within a loop, and also refer to the offset from the end of the list as well, to easily identify final and penultimate items for special handling.

The other bit of Jinja2 that's quite powerful is the concept of inheritance, something that seems to have become increasingly popular in templating engines. The way it works is that one template can declare that it extends another:

{% extends "base.html" %}

The “polymorphism” aspect is handled with the ability to override “blocks”. So, perhaps base.html contains a declaration like this:

<head>
  <title>{% block title %}Andy's Blog{% endblock %}</title>
</head>

Then, a page which wanted to override the title could do so by extending the base template and simply redeclaring the replacement block:

{% extends "base.html" %}
{% block title %}Andy's Other Page{% endblock %}

Finally, there's also the ability to define macros, which are essentially parameterised snippets of markup which can be called like functions to be substituted into place with the appropriate arguments included. Here's a trivial example:

{% macro introduction(name) %}
  <p>
    Hello, my name is {{ name }}.
  </p>
{% endmacro %}

Of course, many of these features are provided by other templating engines as well, but I’ve found Jinja2 to be convenient, Pythonic and certainly fast enough for my purposes. I think it’ll be my templating engine of choice for the foreseeable future.


  1. You know, the bit that absolutely nobody ever reads. 

  2. In the original Jinja engine the ifchanged directive provided a more convenient way to do this, but it’s been removed in Jinja2 as it was apparently inefficient

1 Jun 2013 at 6:25PM by Andy Pearce in Software  | Photo by Jason Briscoe on Unsplash  | Tags: python  web html-templates  |  See comments

☑ Hooked on Github

Github’s web hooks make it surprisingly easy to write commit triggers.

github usb

I’ve been using Github for awhile now and I’ve found it to be a very handy little service. I recently discovered just how easy it is to add commit triggers to it, however.

If you look under Settings for a repository and select the Service Hooks option, you’ll see a whole slew of pre-written hooks for integrating your repository into a variety of third party services. These range from bug trackers to automatically posting messages to IRC chat rooms. If you happen to be using one of these services, things are pretty easy.

If you want to integrate with your own service, however, things are almost as easy. In this post, I’ll demonstrate how easy by presenting a simple WSGI application which can keep one or more local repositories on a server synchronised by triggering a git pull command whenever a commit is made to the origin.

Firstly, here’s the script:

import git
import json
import urlparse


class RequestError(Exception):
    pass


# Update this to include all the Github repositories you wish to watch.
REPO_MAP = {
    "repo-name": "/home/user/src/git-repo-path"
}


def handle_commit(payload):
    """Called for each commit any any watched repository."""

    try:
        # Only pay attention to commits on master.
        if payload["ref"] != 'refs/heads/master':
            return False
        # Obtain local path of repo, if found.
        repo_root = REPO_MAP.get(payload["repository"]["name"], None)
        if repo_root is None:
            return False

    except KeyError:
        raise RequestError("422 Unprocessable Entity")

    # This block performs a "git pull --ff-only" on the repository.
    repo = git.Repo(repo_root)
    repo.remotes.origin.pull(ff_only=True)
    return True


def application(environ, start_response):
    """WSGI application entry point."""

    try:
        # The Github webhook interface always sends us POSTs.
        if environ["REQUEST_METHOD"] != 'POST':
            raise RequestError("405 Method Not Allowed")

        # Extract and parse the body of the POST.
        post_data = urlparse.parse_qs(environ['wsgi.input'].read())

        # Github's webhook interface sends a single "payload" parameter
        # whose value is a JSON-encoded object.
        try:
            payload = json.loads(post_data["payload"][0])
        except (IndexError, KeyError, ValueError):
            raise RequestError("422 Unprocessable Entity")

        # If the request looks valid, pass to handle_commit() which
        # returns True if the commit was handled, False otherwise.
        if handle_commit(payload):
            start_response("200 OK", [("Content-Type", "text/plain")])
            return ["ok"]
        else:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return ["ignored ref"]

    except RequestError as e:
        start_response(str(e), [("Content-Type", "text/plain")])
        return ["request error"]

    except Exception as e:
        start_response("500 Internal Server Error",
                       [("Content-Type", "text/plain")])
        return ["unhandled exception"]

Aside from the Python standard library it also uses the GitPython library for accessing the Git repositories. Please also note that this application is a bare-bones example — it lacks important features such as logging and more graceful error-handling, and it could do with being rather more configurable, but hopefully it’s a reasonable starting point.

To use this application, update the REPO_MAP dictionary to contain all the repositories you wish to watch for updates. The key to the dictionary should be the name of the repository as specified on Github, and the value should be the full, absolute path to a checkout of that repository where the Github repository is added as the origin remote (i.e. as if created with git clone). The repository should remain checked out on the master branch.
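To get it up and running for a quick test, you could serve it with the wsgiref server from the standard library, although a real deployment would more likely sit behind something like mod_wsgi or uWSGI. This is just a sketch, and it assumes the code above lives in a hypothetical module called webhook.py and that port 8000 is reachable from Github:

# Minimal test harness for the WSGI application above.
from wsgiref.simple_server import make_server

from webhook import application  # hypothetical module containing the code above

httpd = make_server("", 8000, application)
httpd.serve_forever()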

Once you have this application up and running you’ll need to note its URL. You then need to go to the Github Service Hooks section and click on the WebHook URLs option at the top of the list. In the text box that appears on the right enter the URL of your WSGI application and hit Update settings.

Now whenever you perform a commit to the master branch of your Github repository, the web hook will trigger a git pull to keep the local repository up to date.

Primarily I’m hoping this serves as an example for other, more useful web hooks, but potentially something like this could serve as a way to keep a production website up to date. For example, if refs/heads/master in the script above is changed to refs/heads/staging and you kept the local repository always checked out on that branch, you could use it as a way to push updates to a staging server just by performing an appropriate commit on to that branch in the master repository.

Also note that the webhook interface contains a lot of rich detail which could be used to do things like update external bug trackers, update auto-generated documentation or a ton of other handy ideas. Github have a decent enough reference for the content of the POSTs your hook will receive and my sample above only scratches the surface.

16 May 2013 at 11:52AM by Andy Pearce in Software  | Photo by Brina Blum on Unsplash  | Tags: web  git python  |  See comments

☑ May Day! May Day!

Backups are a hassle, off-site ones doubly so. However, there are a few tools which make life easier — this post discusses some of them.

hard drive

You’re going to lose your files. All of them. Maybe not today, maybe not tomorrow. Maybe not even soon. The question is, will it be for the rest of your life?

When I looked up “back up” in the thesaurus it listed its synonyms as “abandon”, “fall back”, “retreat” and “withdraw”, and I’d say that’s a fair characterisation of many people when they try to back up their data. These people are making a rod for their own back, however, and one day it’ll hit them.

OK, so we need to back stuff up, we get told that all the time, usually by very smug people while we’re scrabbling around trying to recover some important report just before it’s due. So what’s the best way to go about it?

There are several elements to a successful backup solution. I’d say first and foremost among them is automation. If you need to do something manually to kick off a backup then, unless you’re inhumanly organised, you’re going to forget to do it eventually. Once you start forgetting, chances are you’re going to keep forgetting, right up until the point you need that backup. Needless to say, that’s a little late.

The second element is history — the ability to recover previous versions of files even after they’ve been deleted. Hardware failure is only one reason to restore from a backup, it’s also not implausible that you might accidentally delete a file, or perhaps accidentally delete much of its contents and save it over the original. If you don’t notice for a few days, chances are a backup solution without history will have quietly copied that broken version of the file over the top of the previous version in your backup, losing it forever.

The third element is off-site — i.e. your backups should be stored at a physically separate location to the vulnerable systems. I've heard of at least a couple of cases where people have carefully ensured they backed up data between multiple computers, only to have them all stolen one night. Or all burned in a fire. Or any of a list of other disasters. These occurrences are rare, of course, but not rare enough to rule them out.

The fourth and final element is that only you have access. You might be backing up some sensitive data, perhaps without realising it, so you want to make sure that your backups are useless to someone stealing them. Typically this is achieved by encrypting them. Actually this should be called something like “encryption” or “security” but then the list wouldn’t form the snappy acronym Ahoy1:

  • Automated
  • History
  • Off-site
  • You (have sole access)

So, how can we hit the sweet spot of all four of these goals? Because I believe that off-site backups are so important, I’m going to completely ignore software which concentrates on backing up to external hard disks or DVDs. I’m also going to ignore the ability to store additional files remotely — this is useful, but a true backup is just a copy of what you already have locally anyway. Finally, I’ll skip over the possibility of simply storing everything in the cloud to begin with, for example with services such as Google Docs or Evernote, since these options are pretty self-explanatory.

The first possibilities are a host of subscription-based services which will transparently copy files from your PC up into some remote storage somewhere. Often these are aimed at Windows users, although many also support Macs. Linux support is generally lacking. Services such as Carbonite offer unlimited storage for a fixed annual fee, although the storage is effectively limited by the size of the hard disk in your PC. Others, such as MozyHome prefer to bill you monthly based on your storage requirements. There are also services such as Jungle Disk which effectively supply software that you can use with third party cloud storage services such as Amazon S3.

These services are aimed squarely at general users and they tend to be friendly to use. They also generally keep old versions of files for 1-3 months, which is probably enough to recover from most accidental deletion and corruption. They can be a little pricey, however, typically costing anything from $5 to $10 a month (around £3-£6). This might not be too much for the peace of mind that someone’s doing the hard work for you but remember that the costs can increase as the amount you need to store goes up. Things can get even more expensive for people with multiple PCs or lots of external storage.

It’s hard to judge the security of these services — mostly these services claim to use well known forms of encryption such as Blowfish or AES and, assuming this is true, they’re pretty secure. Generally you can have more trust in a service where you provide the encryption key and where the encryption is performed client-side, although in this case you must, of course, keep the key safe as there’s no way they can recover your data without it. For those of you paying attention you’ll realise this means an off-site copy of your key as well, stored in a secure location, but it does depend how far you want to take it — there’s always a trade-off between security and convenience.

If you don’t mind doing a bit more of the work yourself, there are other options for backup which may be more economical. Firstly, if you already have PCs at multiple locations then you might be interested in the newly-released BitTorrent Sync. Many people may have already heard of the BitTorrent file-sharing protocol and this software is also from the company co-founded by Bram Cohen, the creator of the protocol. However, it has very little to do with public file-sharing, although it’s based on the same protocol under the hood. It’s more about keeping your own files duplicated across multiple devices.

You can download clients for Windows, OSX or Linux and once you’ve configured them, they sit there watching a set of directories. You do this on several machines which all link together and share changes to the files in the watched directories. As soon as you add, delete or edit a file on one machine, the sync clients will share that change across the others. Essentially it’s a bit like a private version of Dropbox.

This is a bit of a cheat in the context of this article, of course, because it doesn’t meet one of my own criteria, storing the history of files — it’s a straight sync tool. I’m still mentioning it for two reasons — firstly, it might form a useful component of another backup solution where some other component provides file history; secondly, they’re my criteria and I’ll ignore them if I want to.

Like BitTorrent, it becomes more efficient as you add more machines to the swarm and it has the ability to share links to other peers so in general you should only need to hook a new machine to one of the others in the cloud and it should learn about the rest. It’s also pretty secure as each directory is associated with a unique key and all traffic is encrypted with it — if a peer doesn’t have the key, it can’t share the files. The data at each site isn’t stored encrypted, however, so you still need to maintain physical security of each system as you’d expect. There’s also the possibility to add read-only and one-time keys for sharing files with other people, but I haven’t tried this personally.

I haven’t played with it extensively yet, but from my early experiments it seems pretty good. It’s synchronisation is fast, its memory usage is low and it seems to make good use of OS-specific features to react to file changes quickly and efficiently.

The main downside at the moment is that it’s still at quite an early stage and should be considered beta quality at best. That said, I haven’t had any problems myself. It’s also closed source which might be a turn-off for some people and it’s not yet clear whether the software will remain available for free indefinitely. It also doesn’t duplicate OS-specific meta-information such as Unix permissions which may be an issue for Linux and potentially OSX users.

On the subject of preserving Unix permissions and the like, what options exist for that? Well, there is a very handy tool called rdiff-backup which is based on the rather wonderful rsync. Like rsync it's intended to duplicate one directory somewhere else, either on the same machine or remotely via an SSH connection. Unlike rsync, however, it not only makes the destination directory a clone of the source, but it also stores reverse-diffs of the files back from that point so you can roll them back to any previous backup point.

I’ve had a lot of success using it, although you need to be fairly technical to set it up as there’s a plethora of command-line options to control what’s included and excluded from the backup, how long to keep historical versions and all sorts of other information. The flip side to this slight complexity is that it’s pretty flexible. It’s also quite efficient on space, since it only stores the differences between files that have changed as opposed to many tools which store an entire new copy of the file.

The one area where rdiff-backup falls down, however, is security — it’s fine for backing up between trusted systems, but what about putting information on cloud storage which you don’t necessarily trust? Fortunately there’s another tool based on rdiff-backup called Duplicity which I’ve only relatively recently discovered.

This is a fantastic little tool which allows you to create incremental backups. Essentially this means that the first time you do a backup, it creates a complete copy of all your files. The next time it stores the differences between the previous backup and the current state of the files, like rdiff-backup but using forward-diffs rather than reverse. This means to restore a backup you need the last full one plus all the incrementals in between.

The clever bit is that it splits your files up into chunks2 and also encrypts each chunk with a passphrase that you supply. This means you can safely deposit those chunks on any third party storage you choose without fear of them sneaking a peek at your files. Indeed, Duplicity already comes with a set of different backends for dumping files on a variety of third party storage solutions including Google Drive and Amazon S3, as well as remote SFTP and WebDAV shares.

It’s free and open source, although just like rdiff-backup it’s probably for the more technically-minded user. It also doesn’t run under Windows3. However, Windows users need not despair — it has inspired another project called Duplicati which is a reimplementation from scratch in C#. I haven’t used this at all myself, but it looks very similar to Duplicity in terms of its basic functionality, although there are some small differences which make it incompatible.

The main difference appears to be that it layers a more friendly GUI for configuring the whole thing, which probably makes it more accessible to average users. It still supports full and incremental backups, compression and encryption just as Duplicity does. It also will run on OSX and Linux with the aid of Mono, although unlike Duplicity it doesn’t currently support meta-information such as Unix permissions4, which probably makes Duplicity a more attractive option for Linux unless you really need to restore on different platforms.

Anyway, that’s probably enough of a summary for now. Whatever you do, however, if you’re not doing backups then start, unless you’re the sort of person who craves disappointment and despair. If not then you’ll definitely regret it at some point. Maybe not today- Oh wait, we’ve done that already.


  1. Everyone knows you need a catchy mnemonic when you’re trying to repackage common sense and sell it to people. 

  2. Bzipped multivolume tar archives, for the technically minded. 

  3. At least not without a lot of faff involving Cygwin and a handful of other packages. 

  4. Although there is an open issue in their tracker about implementing support for meta-information. 

1 May 2013 at 1:09PM by Andy Pearce in Software  | Photo by Patrick Lindenberg on Unsplash  | Tags: backup cloud  |  See comments

☑ Python destructor drawbacks

Python’s behaviour with regards to destructors can be a little surprising in some cases.

green python

As you learn Python, sooner or later you’ll come across the special method __del__() on classes. Many people, especially those coming from a C++ background, consider this to be the “destructor” just as they consider __init__() to be the “constructor”. Unfortunately, they’re often not quite correct on either count, and Python’s behaviour in this area can be a little quirky.

Take the following console session:

>>> class MyClass(object):
...   def __init__(self, init_dict):
...     self.my_dict = init_dict.copy()
...   def __del__(self):
...     print "Destroying MyClass instance"
...     print "Value of my_dict: %r" % (self.my_dict,)
... 
>>> instance = MyClass({1:2, 3:4})
>>> del instance
Destroying MyClass instance
Value of my_dict: {1: 2, 3: 4}

Hopefully this is all pretty straightforward. The class is constructed and __init__() takes an initial dict instance and stores a copy of it as the my_dict attribute of the MyClass instance. Once the final reference to the MyClass instance is removed (with del in this case) then it is garbage collected and the __del__() method is called, displaying the appropriate message.

However, what happens if __init__() is interrupted? In C++ if the constructor terminates by throwing an exception then the class isn’t counted as fully constructed and hence there’s no reason to invoke the destructor1. How about in Python? Consider this:

>>> try:
...   instance = MyClass([1,2,3,4])
... except Exception as e:
...   print "Caught exception: %s" % (e,)
... 
Caught exception: 'list' object has no attribute 'copy'
Destroying MyClass instance
Exception AttributeError: "'MyClass' object has no attribute 'my_dict'" in <bound method MyClass.__del__ of <__main__.MyClass object at 0x7fd309fbc450>> ignored

Here we can see that a list instead of a dict has been passed, which is going to cause an AttributeError exception in __init__() because list lacks the copy() method which is called. Here we catch the exception, but then we can see that __del__() has still been called.

Indeed, we get a further exception there because the my_dict attribute hasn’t had chance to be set by __init__() due to the earlier exception. Because __del__() methods are called in quite an odd context, exceptions thrown in them actually result in a simple error to stderr instead of being propagated. That explains the odd message about an exception being ignored which appeared above.

This is quite a gotcha of Python’s __del__() methods — in general, you can never rely on any particular piece of initialisation of the object having been performed, which does reduce their usefulness for some purposes. Of course, it’s possible to be fairly safe with judicious use of hasattr() and getattr(), or catching the relevant exceptions, but this sort of fiddliness is going to lead to tricky bugs sooner or later.
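For example, a more defensive version of the earlier __del__() might look something like the sketch below, which copes with __init__() having bailed out before my_dict was ever assigned:

class MyClass(object):
    def __init__(self, init_dict):
        self.my_dict = init_dict.copy()
    def __del__(self):
        print "Destroying MyClass instance"
        # getattr() with a default avoids an AttributeError if __init__()
        # raised before my_dict was assigned.
        print "Value of my_dict: %r" % (getattr(self, "my_dict", None),)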

This all seems a little puzzling until you realise that __del__() isn’t actually the opposite of __init__() — in fact, it’s the opposite of __new__(). Indeed, if __new__() of the base class (which is typically responsible for actually doing the allocation) fails then __del__() won’t be called, just as in C++. Of course, this doesn’t mean the appropriate thing to do is shift all your initialisation into __new__() — it just means you have to be aware of the implications of what you’re doing.

There are other gotchas of using __del__() for things like resource locking as well, primarily that it’s a little too easy for stray references to sneak out and keep an object alive longer than you expected. Consider the previous example, modified so that the exception isn’t caught:

>>> instance = MyClass([1,2,3,4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in __init__
AttributeError: 'list' object has no attribute 'copy'
>>>

Hmm, how odd — the instance can’t have been created because of the exception, and yet there’s no message from the destructor. Let’s double-check that instance wasn’t somehow created in some weird way:

>>> print instance
Destroying MyClass instance
Exception AttributeError: "'MyClass' object has no attribute 'my_dict'" in <bound method MyClass.__del__ of <__main__.MyClass object at 0x7fd309fbc2d0>> ignored
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'instance' is not defined

Isn’t that interesting! See if you can have a guess at what’s happened…

… Give up? So, it’s true that instance was never defined. That’s why when we try to print it subsequently, we get the NameError exception we can see at the end of the second example. So the only real question is why was __del__() invoked later than we expected? There must be a reference kicking around somewhere which prevented it from being garbage collected, and using gc.get_referrers() we can find out where it is:

>>> instance = MyClass([1,2,3,4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in __init__
AttributeError: 'list' object has no attribute 'copy'
>>> import sys
>>> import gc
>>> import types
>>> 
>>> for obj in gc.get_objects():
...   if isinstance(obj, MyClass):
...     for i in gc.get_referrers(obj):
...       if isinstance(i, types.FrameType):
...         print repr(i)
... 
<frame object at 0x1af19c0>
>>> sys.last_traceback.tb_next.tb_frame
<frame object at 0x1af19c0>

Because we don’t have a reference to the instance any more, we have to trawl through the gc.get_objects() output to find it, and then use gc.get_referrers() to find who has the reference. Since I happen to know the answer already, I’ve filtered it to only show the frame object — without this filtering it also includes the list returned by gc.get_objects() and calling repr() on that yields quite a long string!

We then compare this to the parent frame of sys.last_traceback and we get a match. So, the reference that still exists is from a stack frame attached to sys.last_traceback, which is the traceback of the most recent exception thrown. What happened earlier when we then attempted print instance is that this threw an exception which replaced the previous traceback (only the most recent one is kept) and this removed the final reference to the MyClass instance hence causing its __del__() method to finally be called.

Phew! I’ll never complain about C++ destructors again. As an aside, many of the uses for the __del__() method can be replaced by careful use of the context manager protocol, although this does typically require your resource management to extend over only a single function call at some level in the call stack as opposed to the lifetime of a class instance. In many cases I would argue this is actually a good thing anyway, because you should always try to minimise the time when a resource is acquired, but like anything it’s not always applicable.
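As a rough illustration of that alternative, a context manager for some hypothetical resource with acquire() and release() methods might look like the sketch below; the cleanup then runs deterministically when the with block exits, however that happens:

from contextlib import contextmanager

@contextmanager
def acquired(resource):
    resource.acquire()
    try:
        yield resource
    finally:
        # Runs on normal exit or on an exception, unlike __del__() which
        # only runs once the last reference to an object disappears.
        resource.release()

# Usage would be something like:
#     with acquired(some_lock) as lock:
#         do_something_with(lock)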

Still, if you must use __del__(), bear these quirks in mind and hopefully that’s one less debugging nightmare you’ll need to go through in future.


  1. The exception (haha) to this is when a derived class’s constructor throws an exception, then the destructor of any base classes will still be called. This makes sense because by the time the derived class constructor was called, the base class constructors have already executed fully and may need cleaning up just as if an instance of the base class was created directly. 

23 Apr 2013 at 10:48AM by Andy Pearce in Software  | Photo by Alfonso Castro on Unsplash  | Tags: python destructors  |  See comments

☑ When is a closure not a closure?

Python’s simple scoping rules occasionally hide some surprising behaviour.

closed sign

Scoping in Python is pretty simple, especially in Python 2.x. Essentially you have three scopes:

  • Local scope
  • Enclosing scope
  • Global scope

Local scope is anything defined in the same function as you. Enclosing scopes are those of the functions in which you’re defined — this only applies to functions which are lexically contained within other functions1. Global scope is anything at the module level. There’s also a special “builtin” scope outside of that, but let’s ignore that for now. Classes also have their own special sorts of scopes, but we’ll ignore that as well.

When you assign to a variable within a function, this counts as a declaration and the variable is created in the local scope2 of the function. This is unless you use the global keyword to force the variable to refer to one at module scope instead3.

When you read the value of a variable, Python starts with the local scope and attempts to look up the name there. If it’s not found, it recurses up through the enclosing scopes looking for it until it reaches the module scope (and finally the magic builtin scope). This is more or less as you’d expect if you’re used to normal lexically-scoped languages.
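For example, in this little sketch the inner function finds x in its enclosing scope before ever reaching the global one:

x = "global"

def outer():
    x = "enclosing"
    def inner():
        # No local x here, so the lookup falls back to outer()'s scope.
        print x
    inner()

outer()   # prints "enclosing"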

However, if you were paying attention you’ll notice that I specifically said that a local scope is defined by a function. In particular, constructs such as for loops do not define their own scopes — they operate entirely in the local scope of the enclosing function (or module). This has some beneficial side-effects — for example, loop counters are still available once the loop has exited, which is rather handy. It has some potential pitfalls — take this code snippet, for example4:

functions = [(lambda: i) for i in xrange(5)]
print ", ".join(str(func()) for func in functions)

So, this builds a list of functions5 and then executes each one in turn and concatenates and prints the results. Intuitively one would expect the results to be 0 1 2 3 4, but actually we get 4 4 4 4 4 — eh?

What’s happening is that each of the functions created is in a closure with the variable i in its global scope bound to the one used in the loop. However, each iteration just updates the same loop counter in the local scope of the enclosing function (or module) and so all the functions end up with a reference to the same variable i. In other words, closures in Python refer directly to the enclosing scopes, they don’t create “frozen copies” of them6.

This works fine when a closure is created by a function and then returned, because the enclosing scope is then kept alive only by the closure and inaccessible elsewhere. Further invocations of the same function will produce new scopes and different closures. In this case, though, the functions are all defined under the same scope. So when they’re evaluated, they all return the final value of i as it was when the loop terminated.

We can illustrate this by amending the example to delete the loop counter:

functions = [(lambda: i) for i in xrange(5)]
del i
print ", ".join(str(func()) for func in functions)

Now the third line raises an exception:

NameError: global name 'i' is not defined

Of course, if you use the generator expression form to defer generation of the functions until the point of invocation then everything works as you’d expect:

# This prints "0 1 2 3 4" as expected.
functions = ((lambda: i) for i in xrange(5))
print ", ".join(str(func()) for func in functions)

So, all this is quite comprehensible once you understand what’s going on, but I do wonder how many people get bitten by this sort of thing when using closures in loops.
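The other classic workaround, for what it's worth, is to bind the loop variable as a default argument, since default values are evaluated once at the point the function is defined:

# Each lambda gets its own i, frozen at definition time.
functions = [(lambda i=i: i) for i in xrange(5)]
print ", ".join(str(func()) for func in functions)   # prints "0, 1, 2, 3, 4"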

As a final note, this behaviour is the same in Python 3.x. There is a small difference with regard to scopes, namely the addition of the nonlocal keyword, which is the equivalent of global except that it allows updating the value of variables in enclosing scopes between the local and global scopes. I believe that with regards to reading the values of such variables, however, the behaviour is unchanged.


  1. Note that this is a lexical definition of enclosure, which is to say it’s to do with where the function is defined. It’s nothing to do with where the function was called from. Unlike dynamically-scoped languages, Python gives a function no access to variables defined in the scope of a calling function. 

  2. This actually extends to the entire function, which is why it’s an error to read the value of a variable assigned to later in the function even if it exists in an enclosing scope. 

  3. Or the nonlocal keywords in Python 3.x — see the note at the end of this post. 

  4. This example uses a list comprehension for concision, but the issues described would apply equally to a for loop. 

  5. Yes I’m using lambda — so sue me, it’s just an example. 

  6. Actually, once you think of closures as references to a scope rather than some sort of "freeze-frame" of the state, some things are easier to understand. For example, if two functions are defined in the same closure, updates that each of them makes to the state can be felt by the other. This is especially relevant if they use Python 3's nonlocal keyword (see the note at the end of this post). 

10 Apr 2013 at 3:41PM by Andy Pearce in Software  | Photo by Tim Mossholder on Unsplash  | Tags: python scoping  |  See comments
