☑ Validating UK Postcodes

I find myself needing to enter UK addresses on a fairly regular basis and it never fails to amaze me how poor some of the syntax checking is - basic validation of a UK postcode is really not even remotely difficult.


In these days of online shopping, online account access and, in short, online more or less everything, it’s common to have to enter your address into a web page. Or perhaps someone else’s address. This isn’t usually too taxing, especially with browsers providing auto-complete these days, but one quirk of many of these pages that I find disproportionately irritating is the appalling and inconsistent rules they apply to postcodes.

Most countries have postal codes of some sort — America has zip codes, Ireland has Eircode, Germany has its Postleitzahlen and so on. Most of these systems are fairly simple and the UK is no exception. It’s therefore a continual source of confusion to me how any web developer could possibly manage to mess things up — but they appear to do so, repeatedly, at least for UK postcodes (I can’t speak for the others). Perhaps by explaining things here I might save someone from making the same mistakes — probably not, but at least it gives me the moral high ground to rant about it, which is probably the most important thing, right?

A UK postcode is split into two halves, conventionally separated by a space. The first half is the outward code and most commonly consists of 1-2 letters followed by 1-2 digits. The letters specify the area — there are currently 121 of these in Britain [1]. The digits specify the district within the area, typically with 1 indicating the centre of a city and moving outwards. There’s also an additional complication in London (and perhaps other densely populated areas) that the district becomes too large for use, so an additional letter may be appended to the end of the outward code to further subdivide it.

The part of the postcode after the space is called the inward code and always consists of one digit, specifying a sector within the district (which typically narrows things down to a few thousand addresses), followed by two letters specifying the unit within the sector, which takes it all the way down to the low tens of addresses.

As you might have guessed the names of the two halves reflect the differing uses: the outward code is used to decide which sorting office to send post to for further processing; once it’s arrived there, the inward code is used to determine the specific address to which it should be delivered.

There are some exceptions to these rules, such as for overseas territories or the special code SAN TA1 for Father Christmas. However, unless you’re setting up the Pitcairn Island Delivery Service, you probably don’t need to worry too much about these.

Not that people writing postcode parsing code need to worry about most of this, to be honest — frankly the main take-away from all of the discussion so far is simply this regular expression:

^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$

That said, I really hope someone about to implement a postcode validator doesn’t stop here because this regular expression demonstrates three of my recurrent annoyances with postcode validators online.

Case-sensitivity
Postcodes are typically presented in upper-case, but only the most petty pedant could possibly claim that sw1a 1aa is incorrect. Don’t be needlessly fussy, allow any combination of case.
Spaces #1
Postcodes typically contain a space, but don’t be a sadist and reject postcodes without one. For bonus marks, strip out anything that isn’t an alphanumeric character before you do any other processing.
Spaces #2
Postcodes typically contain a space, so don’t be an idiot and reject postcodes that contain one or more of them, and don’t be pointlessly fussy if they’re in the wrong place either. For bonus marks, see the previous item.

If you were in Python, therefore, to avoid falling into any of these traps you could easily do the following — it’s quick and dirty and uses regular expressions, but it shows how simple the job is:

import re

NON_ALPHA_RE = re.compile('[^A-Z0-9]+')
POSTCODE_RE = re.compile('^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$')

def normalise_postcode(postcode):
    """Return a normalised postcode if valid, or None if not."""

    postcode = NON_ALPHA_RE.sub('', postcode.upper())
    postcode = postcode[:-3] + ' ' + postcode[-3:]
    if POSTCODE_RE.match(postcode):
        return postcode
    return None
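
Just to illustrate with a few made-up examples, it copes with the usual sloppiness and rejects nonsense:

print(normalise_postcode("sw1a1aa"))     # "SW1A 1AA" - case fixed, space added
print(normalise_postcode("EC1A 1BB"))    # "EC1A 1BB" - already normalised
print(normalise_postcode("not a code"))  # None - doesn't match the pattern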

Now how simple is that? Took me all of ten minutes. Next time you find a website that rejects a postcode that it could have quite easily figured out, or accepts a postcode that’s quite clearly gibberish, just have a little think about how absurdly little time the developer responsible spent on the user experience and perhaps leave some feedback to that effect.

My last annoyance with postcode validators isn’t demonstrable here, but it occurs when people try to do the right thing and validate against the actual postcode database. Unfortunately these people are often rather lackadaisical about keeping it updated so if you live somewhere which was built in the last year or two, you can often kiss goodbye to the thought of ordering anything to your home address. Not a problem in our current house, unless there’s someone somehow running off a postcode database that predates the computer by several decades [2] — but having lived at such a place in the past I can attest to it as an almost limitless source of frustration.

Therefore, don’t reject something just because it’s not in the postcode database — and just pay the damn money to keep it up to date, will you? Because no matter how good the thing you’re selling, there’s someone else selling something almost indistinguishable, and they’ll let me order it successfully online.

In fact, forget all of the above and just don’t do any postcode validation at all. Let the Royal Mail sort it out, it’s what they get paid for.


  1. The crown dependencies of Guernsey, Jersey and Isle of Man also have two-letter codes that look like postcodes, but I’m not sure if they follow precisely the same scheme. 

  2. And I wouldn’t put it past them! 

18 Aug 2015 at 6:15PM by Andy Pearce in Software  | Photo by Alex Rodriguez Santibanez  | Tags: regex text-processing  |  See comments

☑ Time zones and cron

Time zones can be tricky beasts, particularly where daylight savings time is concerned. This post discusses issues around applying them to something like the ubiquitous Unix cron daemon.


I recently had cause to look into how to schedule events to run at times specified in a particular time zone. It turns out to be quite a subtle topic, about which I’ll ramble aimlessly for a few paragraphs below.

Consider writing a scheduler rather like cron. This scheduler has to take account of time zones in its handling - so each event recurrence also has an associated time zone which is used to know when to schedule the events. This is important - a date and time alone (e.g. 2015-07-21 07:09:15) is known as “naïve” and to map that to an actual point in time you also need an associated time zone. This could be UTC, or it could be any of the country-specific offsets from UTC. Once a time zone is added then a datetime becomes “aware”.

This is important for things like cron because a user of a multiuser system may wish to set events in their own time zone, which isn’t necessarily the same as system time. A cron expression is implicitly naïve - there’s no way to turn that into actual events to trigger without adding some sort of time zone. Of course, actual cron implementations themselves probably don’t typically care about this - they’ll just trigger things in system local time and the user can have the hassle of working around it. That’s why I quite carefully said “things like cron” earlier.

Let’s say, then, that you wish to write a better cron - one which allows users to specify time zones with their events. First of all you have to deal with the people who tell you that’s a waste of time - why not just adjust all your cron jobs such that they’re specified in UTC?

Well, that’s a fairly easy argument to deal with - just come up with something where that’s a real pain to do. Let’s say you want to run a cleanup job once every quarter at the end of the month. You know that some months have 30 days and some have 31 so you’re sensible enough to schedule it on the evening of the 30th day of the month: cron won’t run it on months with fewer than 31 days if you put it on the 31st. You might use this cron specification:

00  23  30  3,6,9,12  *

To clarify, that expression will run an event at 11:00pm on the 30th day of March, June, September and December. Great, exactly what we want. Except let’s say that I’m in New York, so I’d like this to be specified in EST - but my system time is in UTC. Let’s ignore DST complications for now and assume we always want these events to trigger in UTC-5 (i.e. five hours behind UTC). So let’s adjust that cron expression from EST to UTC instead:

00  04  31  3,6,9,12  *

Well that wasn’t so hard - we just added five hours, which caused the day to roll over, so now we’re triggering in the early hours of the 31st instead. But hang on - we already knew that June and September only have 30 days, which is why we put it on the 30th to begin with. So now we have a cron job which won’t run in all the specified months.

In this particular simple case it’s not too difficult to see how to split that into two separate cron jobs to support it:

00  04  31  3,12  *
00  04  01  7,10  *

… but in the general case this is not necessarily trivial. For example, try to figure out how to transform this expression from New York time to UTC without affecting its functionality:

00  23  29  2  *

If we attempt to account for DST then this becomes even less tractable.

We’ve established, therefore, that a time-zone-aware cron daemon is the only way to handle this sort of functionality, so how could we implement one?

The simplest solution would be to record the event specifications in particular time zones, but transform them into UTC at the point of actually executing them. Let’s say you’re doing it in Python, you’d have a set of generators for the event specifications and they’d be yielding endless sequences of datetime objects in their specific time zones. You’d convert all of these to some common time zone - say, UTC - and then keep this in a sorted list of events to trigger. All fairly straightforward stuff.
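
Roughly speaking, the shape of that might look like the following sketch - the per-job generators are assumed to exist elsewhere and to yield time-zone-aware datetime objects in ascending order:

import heapq
import pytz

def to_utc(events):
    """Convert a stream of aware datetimes into UTC."""
    for event_time in events:
        yield event_time.astimezone(pytz.utc)

def merged_schedule(*event_streams):
    """Lazily merge per-job event streams into one ordered UTC stream."""
    return heapq.merge(*(to_utc(stream) for stream in event_streams))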

Except that there’s still a wrinkle and that’s around DST. The daylight savings time changes are an intrinsic part of a time zone specification, so if this new cronzone utility is going to support time zones then it’s going to have to deal gracefully with DST. The problem is that it’s not immediately obvious which is the best way this should work.

For the benefit of anyone unfamiliar with DST, it’s a rule whereby the clocks are put forward an hour during summer. Each zone has different dates between which this adjustment is applied, and these dates differ year-to-year - they’re often chosen to fall on a specific day of the week, such as a Saturday night.

When the hour jumps forward, the clock simply skips an hour - in the UK, for example, the clock ticks from 00:59:59 straight to 02:00:00. Different zones apply this at different times, but usually in the early hours of the morning.

When the hour jumps backward, the clock repeats an hour. Again in the UK the clock runs as normal until 01:59:59 ticks straight over into 01:00:00. This hour is repeated and then the clock runs on as normal.

All this offers some challenges to cronzone which probably doesn’t want to run commands twice, or skip them entirely. This isn’t a trivial problem, however - the mapping from a time zone which includes DST changes into UTC gets complicated around the points of DST change.

This is most easily illustrated with an example. In 2015 the UK clocks jumped forward in the small hours of March 29 and they’ll go back in the small hours of October 25. Let’s say you scheduled a job to run at 1:30am every day. Well, on the morning of 2015-03-29 in the UK that time simply does not exist - perhaps your job will be skipped. On the morning of 2015-10-25, however, it might run twice - the clocks run past 01:30:00 twice that day.

If your job is doing something important, that could be pretty bad. What if it’s calculating payroll information - perhaps people don’t get paid, or get paid twice. Either way, it’s not ideal.

What is there to do about such a mess? Well, the main thing is to understand the issues and make sure your test cases cover them. Once you’ve done that, it’s a case of finding a decent implementation. If you’re using Python it turns out that the pytz library is actually fairly helpful in this regard.

Let’s say you have some code to take a cron-style specification and generate Python datetime objects from it which are the times at which the events should occur. These will implicitly be naïve until we add our per-user time zone information, for which we can use the localize() method on pytz timezone objects - this attaches an appropriate time zone to a datetime instance.

The clever bit is that this handles DST translations for you, so if you then translate that to another datetime with astimezone() then it’ll fudge the times around so they still represent the wall clock time in the new time zone that corresponds to the same instant in UTC as the old one. Note that you might also need to use the normalize() method when converting directly between time zones as opposed to into UTC - this readjusts the time in case it’s crossed a DST boundary in the process of the conversion.
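
As a rough sketch of that flow, using Europe/London purely as an example zone:

import datetime
import pytz

london = pytz.timezone("Europe/London")

# A naive datetime, as generated from a cron-style specification.
naive = datetime.datetime(2015, 7, 21, 7, 9, 15)

# Attach the user's time zone, then convert to UTC for scheduling.
aware = london.localize(naive)
utc_time = aware.astimezone(pytz.utc)
print(utc_time)   # 2015-07-21 06:09:15+00:00 - BST is UTC+1 in July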

This isn’t so simple, however - we’ve already seen that converting from a time zone to UTC is not a straight one-to-one mapping around the DST changes - some times in a time zone don’t map to any UTC time (where the hour goes forward) and some times map to two UTC times (where the hour goes back).

Since pytz can’t tell exactly what you expect, it does what any decent API would do - it offers you a parameter to tell it. In this case several of the methods take the is_dst parameter which is a tri-state value which can be True, False or None. The first thing to note about this parameter is that it only applies during periods of DST change - at all other times the conversion is unambiguous and the value of this parameter has no effect.

The boolean values are straightforward - they just indicate that if there is ambiguity then assume that DST is or is not in place. For example, the UK time 2015-10-25 01:30:00 will occur twice this year, as discussed earlier, so is_dst is used to disambiguate - if it’s True then a conversion to UTC will return 00:30:00, if it’s False it will return 01:30:00.

The third value of None is the one I’m interested in - this has the effect of raising an exception if the time either doesn’t exist (NonExistentTimeError) or if the mapping is ambiguous (AmbiguousTimeError). This allows one to write a scheduler that deals with DST changes in a predictable and safe fashion.
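
Here’s a quick sketch of all three values in action, using the UK change dates mentioned above:

import datetime
import pytz

london = pytz.timezone("Europe/London")
ambiguous = datetime.datetime(2015, 10, 25, 1, 30, 0)

# Disambiguate explicitly: treat the time as BST (DST) or as GMT.
print(london.localize(ambiguous, is_dst=True).astimezone(pytz.utc))
# 2015-10-25 00:30:00+00:00
print(london.localize(ambiguous, is_dst=False).astimezone(pytz.utc))
# 2015-10-25 01:30:00+00:00

# Or ask pytz to refuse to guess.
try:
    london.localize(ambiguous, is_dst=None)
except pytz.exceptions.AmbiguousTimeError:
    print("ambiguous wall clock time")

missing = datetime.datetime(2015, 3, 29, 1, 30, 0)
try:
    london.localize(missing, is_dst=None)
except pytz.exceptions.NonExistentTimeError:
    print("this time never happened")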

The exact behaviour depends on what you think is most intuitive, but I’d be inclined to respond to the hour jumping forward by immediately running all the jobs that were skipped; and the hour jumping backward by skipping jobs which had already run.

Of course this does have some downsides. Let’s say what you want is to run something every 60 minutes regardless of DST changes, you’ll probably find that around the DST change you get something running twice or with a two-hour gap, depending on the direction of the change. It doesn’t seem to me possible to be correct in both cases - for the user who wants things scheduled based on an elapsed time, and also the user who wants things scheduled at specific times of day, not to be missed or run twice over DST changes. Probably the only solution here is to allow them to specify a time zone individually for each job - if they want the former behaviour they can just use UTC.

So that’s it - time zones are tricky beasts. Life would be a lot simpler if DST disappeared overnight, but after all these years there seems little chance of that happening in all 70-odd countries in which it’s used. In the meantime programmers can expect to have to deal with these issues from time to time, where the specific requirements are hard enough to gather without even worrying about the implementation.

Still, it would be significant progress if everyone dealing with times in code would at least educate themselves, consider the issue properly and make an informed decision about how they handle it - as we’ve already seen the “bah, let them eat UTC” approach just doesn’t cut it.

21 Jul 2015 at 7:09AM by Andy Pearce in Software  | Photo by Heather Zabriskie on Unsplash  | Tags: python time  |  See comments

☑ HTTP/2

After eighteen years there’s a new version of HTTP. Having heard comparatively little about it until now, I decided to take a quick look.


The Web is surely the most used service on the Internet by far — so much so, in fact, that I’m sure to many people the terms “Web” and “Internet” are wholly synonymous.

Perhaps one of the most remarkable things about the Web is that despite the revolutionary changes in the services offered over it, the protocol that underpins the whole thing, HTTP, has remained remarkably unchanged for almost two decades. The Web as we know it now runs more or less entirely over HTTP/1.1, the version standardised in 1997, receiving only minor tweaks since then.

However, the venerable old HTTP/1.1 is now starting to look a little creaky when delivering today’s highly dynamic and media-rich websites. In response to this, Google set a handful of its boffins to planning solutions to what it saw as some of the more serious issues — the result was a new protocol known as SPDY. This was intended to be more or less compatible with HTTP at the application level, but improve matters in the underlying layers.

In fact, SPDY has essentially formed the basis of the new HTTP/2 standard, the long-awaited replacement for HTTP/1.1 — so much so that Google has indicated that it plans to drop support for SPDY in favour of the new standard, and the other main browser vendors seem to be following suit. Since it seems to have fairly broad support, it’s probably useful to take a quick look at what it’s all about, then.

Why HTTP/2?

But before looking at HTTP/2 itself, what are the issues it’s trying to solve? The main one, as I see it, is latency — the amount of time between the first request for a page, and finally seeing the full thing loaded on screen.

The issue here is mainly that pages are typically composed of many disparate media: the main HTML file, one or more CSS files, Javascript files, images, videos, etc. With HTTP/1.1 these are all fetched with separate requests and since one connection can only handle one request at a time, this serialisation tends to hurt performance. Over conventional HTTP browsers can, and do, make multiple connections and fetch media over them concurrently — this helps, although it’s arguably rather wasteful. This is much less helpful over HTTPS, however, as the overhead of creating a secure connection is much higher — the initial connection triggers a cryptographic handshake which not only adds latency, it also puts a significant additional load on the server. Since HTTPS is increasingly becoming the standard for online services, even those which don’t deal in particularly sensitive data, this use-case is quite important.

There are other issues which add to the latency problem. HTTP headers need to be repeated with each request — this is quite wasteful as the ASCII representation of these headers is quite verbose compared to the data contained therein. Clients also need to make a good deal of progress through parsing the main HTML page to figure out which additional files need to be fetched and this adds yet more latency to the overall page load process.

Some of you may be thinking at this point that there’s an existing partial solution to some of these issues: pipelining. This is the process by which a client may send multiple requests without waiting for responses, leaving the server to respond in sequence to them as quickly as it can. Hence, as the client parses the response to the initial request, it can immediately fire off requests for the additional files it needs and this helps reduce latency by avoiding a round-trip-time for each request.

This is a useful technique but it still suffers from a significant problem known as head of line blocking. Essentially this is where a response for a large object holds up further responses on the connection — so for example, fetching a large image, without which the page could quite possibly still be usefully rendered, could hold up fetching of a CSS file which is rather more crucial.

Latency, then, is an annoying issue — so how does HTTP/2 go about solving it?

Negotiation

In essence what they’ve done is allow multiple request/response streams to be multiplexed onto a single connection. They’re still using TCP, but they’ve added a framing layer on top of that, so requests and responses are split up into frames of up to 16KB (by default). Each frame has a minimal header which indicates to which stream the frame belongs and so clients and servers can request and serve multiple resources concurrently over the same connection without suffering from the head of line blocking issue.

This is, of course, a pretty fundamental change from HTTP/1.1 and so they’ve also had to add a negotiation mechanism to allow clients and servers to agree to handle a connection over HTTP/2. The form this mechanism takes differs between HTTP and HTTPS. In the former case the existing HTTP/1.1 Upgrade header is supplied with the value h2c. There’s also a HTTP2-Settings header which must be supplied to specify the parameters of the new HTTP/2 connection, should the server accept — I’ll cover these settings in a moment. If the server supports HTTP/2 it can accept the upgrade by sending a 101 Switching Protocols response and then respond to the initial request using HTTP/2 — otherwise it can treat the request as if the header wasn’t present and respond as normal for HTTP/1.1.

For HTTPS a different mechanism is used, Application Layer Protocol Negotiation. This is an extension to TLS that was requested by the HTTP/2 working group and was published as an RFC last year. It allows a client and server to negotiate an application layer protocol to run over a TLS connection without additional round trip times. Essentially the client includes a list of the protocols it supports in its ClientHello message from which the server selects its most preferred option and indicates this choice in its ServerHello response.

However the connection was negotiated, both endpoints then send final confirmation of HTTP/2 in the form of a connection preface. For the client this takes the form of the following fixed string followed by a SETTINGS frame:

PRI * HTTP/2.0<CRLF>
<CRLF>
SM<CRLF>
<CRLF>

As an aside you may be wondering why the client is obliged to send a HTTP2-Settings header with its upgrade request if it’s then required to send a SETTINGS frame anyway — I’ve no idea myself, it seems a little wasteful to me. The RFC does allow that initial SETTINGS frame to be empty, though.

The server’s connection preface is simply a SETTINGS frame, which may also be empty. Both endpoints must then acknowledge each other’s SETTINGS frames by sending back empty SETTINGS frames with the ACK flag set. At this point the HTTP/2 connection is fully established.

The SETTINGS frames (or the HTTP2-Settings header) are used to set per-connection (not per-stream) parameters for the sending endpoint. For example, they can be used to set the maximum frame size that endpoint is prepared to accept, or disable server push (see later). A full list of the settings can be found in the RFC.

Frames

As I’ve already mentioned, all HTTP/2 communication is packetised, although of course since it’s running over a stream-based protocol only one frame is actually being transmitted on a given connection at any instant. The frame format includes a 9 byte header with the following fields:

Length [24 bits]
The size of the frame, not including the 9-byte header. This must not exceed 16KB unless otherwise negotiated with appropriate SETTINGS frames.
Type [8 bits]
An 8-bit integer indicating the type of this frame.
Flags [8 bits]
A bitfield of flags whose interpretation depends on the frame type.
Reserved [1 bit]
Protocol designers do love their reserved fields. Should always be zero in the current version of the protocol.
Stream ID [31 bits]
A large part of the purpose of HTTP/2 is to multiplex multiple concurrent requests and this is done by means of streams. This ID indicates to which stream this frame applies. Stream ID 0 is special and reserved for frames which apply to the connection as a whole as opposed to a specific stream. Stream IDs are chosen by endpoints and namespaced by the simple expedient of the client always choosing odd-numbered IDs and the server limiting itself to even-numbered IDs. In the case of an upgraded HTTP connection the initial request is automatically assumed to be stream 1.
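
Just to make that layout concrete, here’s a rough sketch - in Python, and not anything taken from the RFC itself - of unpacking the fixed 9-byte header:

def parse_frame_header(header):
    """Split the fixed 9-byte HTTP/2 frame header into its fields."""
    length = int.from_bytes(header[0:3], "big")
    frame_type = header[3]
    flags = header[4]
    # Mask off the single reserved bit at the top of the stream ID.
    stream_id = int.from_bytes(header[5:9], "big") & 0x7FFFFFFF
    return length, frame_type, flags, stream_id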

The RFC specifies a complete state machine for streams which I won’t go into in detail here. Suffice to say that there are two main ways to initiate a new stream — client request and server push. We’ll look at these in turn.

Client Requests

When the client wishes to make a new request, it creates a new stream ID (odd-numbered) and sends a HEADERS frame on it. This replaces the list of headers in a conventional HTTP request. If the headers don’t fit in a single frame then one or more CONTINUATION frames can be used to send the full set, with the final frame being indicated by an END_HEADERS flag being set.

The format of the headers themselves has also changed in HTTP/2 as a form of Huffman Encoding with a static dictionary is now used to compress them. This is actually covered in a separate RFC and I don’t want to go into it in too much detail — essentially it’s just a simple process for converting a set of headers and values into a compressed binary form. As you might expect, this scheme does allow for arbitrary headers not covered by the static dictionary, but it’s designed such that common headers take up very little space.

The block of binary data that results from the encoding process is what’s split up into chunks and placed in the HEADERS and CONTINUATION frames sent as part of making a request. One thing that initially surprised me is that the sequence of frames which make up a single header block must be transmitted sequentially — no other frames, not even from other streams, may be interleaved. The reason for this is pretty clear, however, when you realise that the process of decoding a header block is stateful — this restriction ensures that any given connection only needs one set of decoder state to be maintained, as opposed to one per stream, to keep resource requirements bounded. One consequence of this is that very large header blocks could lead to HTTP/1.1-style head of line blocking, but I very much doubt this is an issue in practice.

There are a few other things to note about the transmission of headers other than the compression. The first is that all headers are forced to lowercase for transmission — this is just to assist the compression, and since HTTP headers have always been case-insensitive this shouldn’t affect operation. The second thing to note is that the Cookie header can be broken up into separate headers, one for each constituent cookie. This reduces the amount of state that needs to be retransmitted when cookies change.

The third and final item of note is slightly more profound — the information that used to be presented in the initial request and response lines has now moved into headers. Any request is therefore expected to contain at least the following pseudo-headers:

:method
The method of the request — e.g. GET or POST.
:scheme
The URL scheme of the requested resource — e.g. http or https.
:path
The path and (optionally) query parameters portion of the requested URL.

Similarly the response also contains a pseudo-header:

:status
The status code of the response as a bare integer — there is no longer a way to include a “reason” phrase.

At this point, therefore, we’ve got a way to send a request, including headers. After this point any request body is sent via DATA frames. Due to the framing of HTTP/2 there’s no need for chunked encoding, of course, although the standard does still allow a Content-Length header. Whether or not the request contains a body, it is terminated by its final frame having the END_STREAM flag set — this puts the stream into a “half-closed” state where the request is finished and the client is awaiting a response.

The response is sent using HEADERS, CONTINUATION and DATA frames in the same way as the request, and completion of the response is similarly indicated by the final frame having END_STREAM set. After this point the stream is closed and finished with — this has echoes of the old HTTP/1.0 approach of closing the connection to indicate the end of a response, but of course doesn’t have the overhead of closing and re-opening TCP connections associated with it.

Server Push

That’s covered a standard client request, so what about this “push” mechanism that’s been added?

Well, one of the causes of latency when opening a multimedia website is the fact that the client has to scan the original requested page for links to other media files, and then send requests for this. In many cases, however, the server already knows that the client is very likely to, for example, request images that are linked within a HTML page. As a result, the overall page load time could be reduced if the server could just send these resources speculatively along with the main requested page — in essence, this is server push.

When the client makes a request the server can send PUSH_PROMISE frames on the same stream that’s being used for the DATA frames of the response. These frames (and potentially associated CONTINUATION frames) contain a header block which corresponds to a request that the server believes the client will need to make as a result of the currently requested resource. They also reference a new stream ID which is reserved by the server for sending the linked resource. At this point the client can decide if it really does want the offered resource — if so, all it has to do is wait until the server starts sending the response, consisting of the usual HEADERS, CONTINUATION and DATA frames. If it wishes to reject the push, it can do so by sending a RST_STREAM frame to close the stream.

That’s about it for server push. The only slightly fiddly detail is that the server has to send the PUSH_PROMISE frame prior to any DATA frames that reference the offered resource — this prevents any races where the client might request a resource that’s about to be pushed. The flip side to this is that a client is not permitted to request a resource whose push has been offered. This is all fairly common sense.

Prioritisation and Flow Control

The only other details of HTTP/2 that I think are interesting to note are some details about prioritisation and flow control.

Flow control is intended to prevent starvation of streams or entire connections — since many requests could be in flight on a single connection, this could become quite an important issue. Flow control in HTTP/2 is somewhat similar to the TCP protocol where each receiver has an available window which is an amount of data they’ve indicated they’re prepared to accept from the other end. This defaults to a modest 64KB. Windows can be set on a per-connection and a per-stream basis, and senders are required to ensure that they do not transmit any frames that would exceed either available window.

Each endpoint can update its advertised windows with WINDOW_UPDATE frames. On systems with a large number of resources flow control can be effectively disabled by advertising the maximum window size of 2^31-1 bytes. When deciding whether to use flow control, it’s important to remember that advertising a window size which is too small, or not sending WINDOW_UPDATE frames in a timely manner, could significantly hurt performance, particularly on high latency connections — in these days of mobile connectivity, I suspect latency is still very much an issue.

As well as simple flow control, HTTP/2 also introduces a prioritisation mechanism where one stream can be made dependent on another by sending a PRIORITY frame with the appropriate details. Initially all streams are assumed to depend on the non-existent stream 0 which effectively forms the root of a dependency tree. The intent is that these dependencies are used to prioritise the streams — where there is a decision about which stream to send/receive frames on next, endpoints should give priority to streams on which others depend over those streams that depend on them.

As well as the dependencies themselves, there’s also a weighting scheme, where resources can be shared unevenly between the dependencies of a stream. For example, perhaps images used for navigation are regarded as more important than a background texture and hence can be given a higher weight such that they download faster over a connection with limited bandwidth.

Conclusions

That’s about it for the protocol itself — so what does it all mean?

My thoughts on the new protocol are rather mixed. I think the multiplexing of streams is a nice idea and there’s definite potential for improvement when everything runs over a single secure connection. Aside from the one-time cost of implementing the protocol, I think it also has the potential to make life easier for browser vendors since they won’t have to worry so much about using ugly hacks to improve performance. There’s also good potential for the prioritisation features to really improve things over high latency and/or low bandwidth connections, such as mobile devices in poor reception — to make best use of this browsers will need to use a bit of cleverness to improve their partial rendering of pages before all the resources are fetched, but at least the potential is there.

On the flip side, I have a number of concerns. Firstly, it’s considerably more complicated, and less transparent, than standard HTTP/1.1. Any old idiot can knock together a HTTP/1.1 client library [1] but HTTP/2 definitely requires some more thought — the header compression needs someone who’s fairly clueful to implement it and although the protocol allows the potential for improved performance, a decent implementation is also required to take advantage of this. With regards to transparency it’s going to make it considerably harder to debug problems — with HTTP/1.1 all you need is tcpdump or similar to grab a transcript of the connection and you can diagnose a significant number of issues. With HTTP/2 the multiplexing and header compression are going to make this much trickier, although I would expect more advanced tools like Wireshark to have decoders developed for them fairly quickly, if they haven’t already.

Of course, transparency of this sort isn’t really relevant if you’re already running over a TLS connection, and this reminds me of another point I’d neglected to mention — although the protocol includes negotiation methods over both HTTP and HTTPS, at least the Firefox and Chrome dev teams appear to be indicating they’ll only support HTTP/2 over TLS. I’m assuming they simply feel that there’s insufficient benefit over standard HTTP, where multiple connections are opened with much less overhead. I do find this stance a little disappointing, however, as I’m sure there are still efficiencies to be had in allowing clients more explicit prioritisation and flow control of their requests.

Speaking of prioritisation, that’s my second concern with the standard — the traffic management features are quite complicated and rely heavily on servers to use them effectively. A standard webserver sending out static HTML can probably have some effective logic to, for example, push linked images and stylesheets — but this isn’t at all easy when you consider that the resources are linked in the URL space but the webserver will have to check for that content in the filesystem space. Life is especially complicated when you consider requests being served by some load-balanced cluster of servers.

But even aside from those complexities, today’s web is more dynamic than static. This means that web frameworks are likely going to need serious overhauls in order to allow their application authors the chance to take advantage of HTTP/2’s more advanced features; and said authors are similarly going to have to put in some significant work on their applications as well. Some of these facilities come “for free”, such as header compression, but not most of them by my estimation.

My third annoyance with the standard is its continued reliance on TCP. Now I can see exactly why they’ve done this — moving to something like UDP or SCTP would be a much more major shift and have made it rather harder to maintain any kind of HTTP/1.1 compatibility. But the fact remains that implementing a packet-based protocol on top of a stream-based protocol on top of a packet-based protocol is anything but elegant. I have all sorts of vague concerns about how TCP and HTTP/2 features could interact badly — for example, what happens if a client consistently chooses frame sizes that are a few bytes bigger than the path MTU? How will the TCP and HTTP/2 window sizes interact on slow connections? I’m not saying these are critical failings, I can just see potential for annoyances which will need to be worked around there [2].

My final issue with it is more in the nature of an opportunity cost. The standard resolves some very specific problems with HTTP/1.1 but totally missed the chance to, for example, add decent session management. Cookies are, and always have been, a rather ugly hack — something more natural, and more secure by design, would have been nice.

Still, don’t get me wrong, HTTP/2 definitely looks set to solve some of HTTP/1.1’s thorny performance problems and, assuming at least the major web services upgrade their sites intelligently, we should be seeing some improvements once support is rolled out more widely.


  1. And a startlingly large number of people appear to have done so. 

  2. As an aside it appears that Google have had this thought also as they’re currently working on another protocol called QUIC which replaces TCP with a UDP-based approximate equivalent which supports HTTP/2 with better performance. 

8 May 2015 at 11:52PM by Andy Pearce in Software  | Photo by Jonathan Denney on Unsplash  | Tags: http  |  See comments

☑ C++11: Library Changes: Pointers, Randoms, Wrappers and more

This is part 8 of the “C++11 Features” series which started with C++11: Move Semantics.

I’ve finally started to look into the new features in C++11 and I thought it would be useful to jot down the highlights, for myself or anyone else who’s curious. Since there’s a lot of ground to cover, I’m going to look at each item in its own post — this is the second of the final two that cover what I feel to be the most important changes to the standard template library.


Continuing on from the last post here are the remaining library changes for C++11 which I think are particularly noteworthy.

Smart pointers

The venerable old std::auto_ptr has been in the STL for some time and works tolerably well for simple cases of an RAII pointer. It has, however, some significant drawbacks:

  • It can’t be stored in containers.
  • Changes of ownership aren’t always obvious and may be unexpected.
  • It will always call delete on its owned pointer, never delete[].

As a result of this in C++11 std::auto_ptr is deprecated and has been replaced by the improved std::unique_ptr. This has similar semantics in that ownership can only exist with one instance at a time — if ownership is transferred, the pointer is set to nullptr in the old instance, as with auto_ptr.

The advantages of this class are essentially the removal of the three limitations mentioned above:

  • Can be stored in a container.
  • Move semantics are used to transfer ownership, so code will fail to compile without use of an explicit std::move, reducing the chances that the transfer could happen at times the programmer didn’t intend.
  • Support of arrays so either delete or delete[] will be called as appropriate.
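
To give a flavour of how this looks in practice, here’s a quick sketch — the Widget class is just a hypothetical placeholder:

#include <memory>
#include <vector>

// Hypothetical resource type, purely for illustration.
struct Widget
{
    void frob() {}
};

int main()
{
    std::unique_ptr<Widget> owner(new Widget());
    owner->frob();

    // Ownership must be transferred explicitly -- this line won't
    // compile without the std::move.
    std::unique_ptr<Widget> newOwner = std::move(owner);

    // Unlike std::auto_ptr, instances can live in containers.
    std::vector<std::unique_ptr<Widget>> widgets;
    widgets.push_back(std::move(newOwner));

    // The array form calls delete[] on destruction.
    std::unique_ptr<Widget[]> several(new Widget[3]);
    return 0;
}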

This improved std::unique_ptr is a fairly thin wrapper around a native pointer and actually incurs pretty minimal overhead on dereference. On the other hand, its simple semantics make it unsuitable for multithreaded use-cases — that’s where the other new smart pointer std::shared_ptr comes in.

As the name implies, std::shared_ptr allows multiple references to the same underlying pointer. The pointer itself is wrapped in some reference counting structure and the destructor of each std::shared_ptr reduces the reference count by 1, the resource being deleted when the count reaches zero. Assignment of a different pointer to a std::shared_ptr reduces the reference count on the previous pointer as well, of course. In short, it’s the obvious reference-counting semantics you’d expect.

Since one of the common use-cases of std::shared_ptr is in multithreaded applications, it’s probably no surprise that the underlying reference counting structure is thread-safe — multiple threads can all safely own their own std::shared_ptr instances which point to the same underlying resource. It’s worth remembering that the non-const methods of the std::shared_ptr themselves are not thread-safe, however, so sharing instances of std::shared_ptr itself between threads could lead to races. The semantics of the class make it easy to simply assign to a new instance when the pointer passes to a new thread, however, so it’s not hard to use correctly.

The final new smart pointer in C++11 is the std::weak_ptr. This can share ownership of a resource with std::shared_ptr instances (not std::unique_ptr), however the reference so created is not sufficient to prevent the underlying resource from being deleted. This could be useful for secondary structures which, for example, keep track of all allocated instances of a particular class, but don’t want those references to keep the instances in memory once the primary owning class has deleted it.

Because a std::weak_ptr can expire at any point due to operations in other threads, some care has to be taken when dereferencing it. It’s possible to construct a std::shared_ptr from one, which will guarantee that either it’s still extant and will remain so for the lifetime of the std::shared_ptr, or it’s no longer extant and the std::shared_ptr constructor will throw an exception (std::bad_weak_ptr). std::weak_ptr itself also has a lock() method which returns a std::shared_ptr for the underlying pointer — the behaviour in this case is quite similar except that if the resource has been freed then an empty pointer object is returned instead of an exception being thrown.
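
Here’s a brief sketch of both approaches — constructing a std::shared_ptr directly and using lock():

#include <iostream>
#include <memory>

int main()
{
    std::shared_ptr<int> strong = std::make_shared<int>(42);
    std::weak_ptr<int> weak = strong;

    // lock() returns an empty shared_ptr if the resource has gone.
    if (std::shared_ptr<int> locked = weak.lock()) {
        std::cout << "still alive: " << *locked << std::endl;
    }

    strong.reset();  // last strong reference dropped, resource freed

    // The constructor throws instead of returning an empty pointer.
    try {
        std::shared_ptr<int> fromWeak(weak);
    } catch (const std::bad_weak_ptr&) {
        std::cout << "resource has been released" << std::endl;
    }
    return 0;
}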

All quite straightforward — exactly how language features should be.

Random Numbers

The core C and C++ libraries have never had hugely comprehensive facilities for random number generation. Well, let’s face it, they’ve had rand() and srand() and that’s about your lot. For many programmers I’m sure that’s plenty, and part of me thinks that more specialised use-cases in statistics and the like perhaps belong in their own third party libraries.

That said, if you’re writing code in a commercial context where things like licences and packaging become troublesome issues to deal with, there’s definite value in not having to resort to third party code. Comparing writing code in, say, Perl and Python, I vastly prefer Python. Now partly that’s just because I like the language better, of course, but it’s also the standard library — part of a language that’s too often discounted from comparisons in my opinion [1].

Perl always seems to involve grabbing a whole raft of stuff from CPAN. Then you have the fun of figuring out where to install it so that your code can see it, whether anyone else in your organisation is already using it so that you don’t have to install multiple copies, how you handle sudden urgent security updates to the libraries while you’re in the midst of a release cycle, how you fix urgent bugs in your own copy whilst not losing the ability to merge in upstream fixes, whether your company allows you to contribute your fixes back to the upstream at all, and so on and so forth. All these issues have resolutions, of course, but the point is that it’s a pain when one of the main reasons you used a library was probably to reduce effort.

Taking all that [2] into account, and getting back to the matter in hand, I’m therefore quite glad that the C++ standards committee has decided to include some much improved random number facilities in C++11 even if I don’t necessarily think I’ll have much cause to use them in the foreseeable future.

The new facilities split random number generation into two parts:

  • The engine is responsible for actually generating pseudorandom numbers. It contains the basic algorithm and current state.
  • The distribution takes values from an engine and transforms them such that the resultant set follows some specified probability distribution.

Looking at the engine first, this is a functor which, when called, yields the next item in whatever pseudorandom sequence it’s generating. It also has a seed() method to set the initial seed state, discard() to skip ahead a specified amount in the sequence and static min() and max() methods to query the potential range of generated values.

The library provides three basic options:

std::linear_congruential_engine
Fast and with small storage requirements.
std::mersenne_twister_engine
Not quite so fast, and larger storage requirements, but with a longer non-repeating sequence and (in some ways) more randomly distributed results.
std::subtract_with_carry_engine
Fast even on architectures with only simple arithmetic instruction sets, but with larger storage requirements.

Typically I suspect the architecture will determine the choice used — in the absence of strict performance and/or memory requirements, I’d suggest the old favourite Mersenne Twister.

There are also engine adapters such as the std::discard_block_engine which discards some of the output of an underlying base engine. Presumably the intention is that these adapters can be used to improve some randomness criteria of imperfect underlying engines, but I can’t help but think that in most cases where one is so concerned about the quality of the random numbers one should probably be using a true entropy source. Still, I’m hardly an expert and I’m sure there are good reasons why these were added.

Once the underlying engine has generated a raw stream of random numbers, the distribution transforms these in such a way that the resultant stream matches some probability distribution. There are over 20 options for this that I’m aware of, and presumably the potential for many more to be defined later. A handful for flavour:

std::uniform_int_distribution
Generates integers distributed evenly across a specified range.
std::uniform_real_distribution
Generates real numbers distributed evenly across a specified range.
std::normal_distribution
Generates values that are normally distributed around specified mean and standard deviation values.
std::binomial_distribution
Generates values that are binomially distributed given the number of trials and probability of success.
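
Putting the two halves together looks something like this — the seed and distribution parameters here are entirely arbitrary:

#include <iostream>
#include <random>

int main()
{
    // The engine produces the raw pseudorandom sequence...
    std::mt19937 engine(12345);

    // ...and the distributions shape it.
    std::uniform_int_distribution<int> die(1, 6);
    std::normal_distribution<double> heights(170.0, 10.0);

    for (int i = 0; i < 3; ++i) {
        std::cout << die(engine) << " " << heights(engine) << std::endl;
    }
    return 0;
}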

Overall it seems quite comprehensive, even if I’m still not quite sure for how many people this will remove the need for more specialist third party libraries.

Wrapper References

STL containers are both flexible and convenient, but to my mind they’ve always had two weaknesses: they involve an annoying amount of copying; and you can’t store references, only pointers. C++11 provides move semantics as a solution to the first problem — and wrapper references as a solution to the second.

A wrapper reference is really just a placeholder that can be used very much like a reference, although some minor hoops must be jumped through when performing assignment. The following code snippet shows how they can be used to assign to underlying values:

#include <functional>
#include <iostream>
#include <vector>

int main()
{
    int array[5] = {1, 2, 3, 4, 5};
    std::vector<std::reference_wrapper<int>> vec;
    vec.push_back(array[0]);
    vec.push_back(array[2]);
    for (int& it : vec) {
        it = 99;
    }
    for (unsigned int i = 0; i < 5; ++i) {
        std::cout << array[i] << std::endl;
    }
    return 0;
}

The result is that array[0] and array[2] are set to 99, whereas all the other values in array remain the same.

As well as the template std::reference_wrapper there’s also the std::ref and closely related std::cref helper functions which allow a slightly more convenient syntax for creating wrapper references by using type inference.

As well as storing references in containers there are a number of other potential uses for wrapper references, one of the main ones being to instantiate template functions such that they take references to their arguments:

#include <functional>
#include <iostream>

template <class T>
void function(T a, T b)
{
    a += b;
}

int main()
{
    int x = 3;
    int y = 4;
    // Will instantiate function<int>.
    function(x, y);
    std::cout << "x=" << x << ", y=" << y << std::endl;
    // Will instantiate function<std::reference_wrapper<int>>.
    function(std::ref(x), std::ref(y));
    std::cout << "x=" << x << ", y=" << y << std::endl;
    return 0;
}

Polymorphic Function Wrappers

C++ has lots of mechanisms to make things generic, such as polymorphism, overloading and templating. One thing that’s always been difficult to do, however, is have a generic notion of a callable — i.e. something which can be invoked in a function-like manner, but which may not necessarily be a function.

There are various ways to refer to callables — function pointers, member function pointers and functors are all examples. But C++’s type-safety makes it tricky to declare something that can wrap all these up.

No longer, however — now we have std::function for just this purpose. This can wrap up any callable which matches the return and argument types of the wrapper and otherwise acts like a regular functor.

One other benefit is that it has the property of a pointer in that it can be uninitialised — in this case calling it will throw the std::bad_function_call exception. Here’s an example that demonstrates this, as well as using the wrapper with both a function and functor:

#include <functional>
#include <iostream>
#include <string>

std::string multiplyStringFunc(std::string s, unsigned int num)
{
    std::string ret;
    for (/* blank */; num > 0; --num) {
        ret.append(s);
    }
    return ret;
}

class MultiplyStringFunctor
{
  public:
    std::string operator()(std::string s, unsigned int num)
    {
        return multiplyStringFunc(s, num);
    }
};

int main()
{
    std::function<std::string(std::string, unsigned int)> wrapper;
    try {
        std::cout << wrapper("bad", 1) << std::endl;
    } catch (const std::bad_function_call&) {
        std::cout << "Called invalid function" << std::endl;
    }
    wrapper = multiplyStringFunc;
    std::cout << wrapper("hello ", 2) << std::endl;
    wrapper = MultiplyStringFunctor();
    std::cout << wrapper("world ", 3) << std::endl;
    return 0;
}

While this is all undeniably convenient, you should bear in mind that there is, unsurprisingly, an overhead to using these wrappers. As a run-time mechanism there’s no way they’ll ever be as efficient as, say, a raw templated function, where the heavy lifting is done at compile time. The Boost documentation has some discussion of performance, but I think an issue that’s at least as critical as the raw overhead itself is the opportunity cost of the compiler being unable to inline or otherwise optimise your code at compile time.

Nonetheless, such performance concerns are often not dominant in a design compared to, say, extensibility and maintainability, so it’s definitely good to know there’s now a proper callable interface class available.

Type Traits

I’m only going to skim this one because it leans heavily in the direction of template metaprogramming, in whose waters I’ve dipped just the most tentative toe. However, I think the basic mechanism is simple enough to explain, so I’ll try to illustrate with some simplistic examples and leave it as an exercise to the reader to extrapolate into, say, an entire BitTorrent tracker that runs wholly at compile time [3].

C++ has had templates for quite a long time — these are a great way to write generic code across multiple types. You make some basic assumptions about your type (e.g. it has a method foo() or supports operator +) and the compiler should let you know if your assumptions don’t hold at compile time — possibly in a fairly cryptic manner, but it should let you know.

Since it can be hard to write pleasant/optimal/correct code that works for a range of disparate types, there are also template specialisations. These are alternative implementations of the template for specific types which can override the generic implementation. So far so C++03.

This system is quite restrictive, however, in that it’s quite hard to write code that modifies its behaviour in more subtle ways based on the types it’s given. It’s also hard to write a system that will restrict the types on which you can instantiate your template. If you write code to convert between big- and little-endian, it would be nice if someone using it on a non-integral type would get a pleasant compile-time error as opposed to randomly scrambling bytes around at runtime.

This is where type traits come in handy. Essentially they’re just expressions that return information about a type, evaluated at compile-time. One of the simplest uses is in concert with C++11’s new static_assert feature, to constrain the types of templates.

#include <type_traits>

template <typename T>
T swapEndian(T value)
{
    static_assert(std::is_integral<T>::value,
                  "swapEndian() only supports integral types");
    // Implementation left as an exercise
}

As well as std::is_integral there’s also std::is_floating_point, std::is_array, std::is_pointer and more. Things like std::is_rvalue_reference could come in handy when writing complex code that attempts to support move semantics efficiently and perhaps std::is_trivially_copyable could be used to swap in some optimisations in common cases. I also think that std::is_base_of sounds very promising for enforcing interfaces.

Not all of the expressions are boolean in nature, although it has to be said that most of them are. The few exceptions include std::rank, which evaluates the number of dimensions in an array, or 0 for non-array types; and std::extent, which evaluates to the size of an array in a specified dimension, or 0 for unknown bounds or non-array types. I could imagine these could make it very convenient to implement efficient templated operations on fixed-size arrays without the hassle of tricks like null-terminating them.
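
To give a flavour, a trivial sketch using both of those on a two-dimensional array type:

#include <iostream>
#include <type_traits>

int main()
{
    typedef int Grid[4][6];
    std::cout << std::rank<Grid>::value << std::endl;       // 2 dimensions
    std::cout << std::extent<Grid, 0>::value << std::endl;  // 4
    std::cout << std::extent<Grid, 1>::value << std::endl;  // 6
    return 0;
}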

As I said, I’m just touching on the surface here, but I think it’s worth bearing these new facilities in mind even if you just plan to use fairly simplistic forms of templating.

Return Type Inference

With prolific use of templates and overloading, it can be quite difficult to implement generic classes. For example, if you want to implement map() where some functor is applied to every item in a container then you have to decide how you’re going to template it. You could just assume that the return type matches the parameter types, but this is a constraint it would be nice to do without.

C++11 has therefore introduced std::result_of which, when passed any callable, will evaluate to the return type of that callable. This is perhaps demonstrated best with an (extremely contrived!) example:

#include <type_traits>
#include <iostream>

struct PercentageConverter
{
    double operator()(int value);
    int operator()(double value);
};

double PercentageConverter::operator()(int value)
{
    return static_cast<double>(value) / 100.0;
}

int PercentageConverter::operator()(double value)
{
    return static_cast<int>(value * 100.0);
}

int main()
{
    PercentageConverter instance;
    std::result_of<PercentageConverter(int)>::type res1 =
            instance(50);
    std::cout << res1 << std::endl;
    std::result_of<PercentageConverter(double)>::type res2 =
            instance(0.15);
    std::cout << res2 << std::endl;
    return 0;
}

One of those things that’s probably incredibly useful in the few cases that you need it, but unless you’re using templates in a fairly heavy-duty manner, there’s a good chance that you can live in blissful ignorance of its existence.

Conclusion

So that’s (finally!) the end of my posts on C++11. Wow, it’s taken rather longer than I first anticipated, but hopefully it’s been useful — I know it has for me, at least.

As a final note, if you have any more C++ questions in general, I can strongly recommend both the C++ Super-FAQ and cppreference.com as great resources. Personally I’d suggest consulting them before resorting to the usual Stack Overflow search as their hit rate is great and the quality of the information is fantastic.


  1. For example, my mild dislike of Java has very little to do with the language, which I feel is quite adequate, and a lot to do with the standard library. I don’t like living in the kingdom of the nouns

  2. And let’s face it, there was rather too much of it. 

  3. Don’t worry, this is a joke. At least I hope so. 

24 Apr 2015 at 11:33AM by Andy Pearce in Software  | Photo by Annie Spratt on Unsplash  | Tags: c++  |  See comments

☑ C++11: Library Changes: Threading, Tuples, Hashing and Regexes

This is part 7 of the “C++11 Features” series which started with C++11: Move Semantics.

I’ve finally started to look into the new features in C++11 and I thought it would be useful to jot down the highlights, for myself or anyone else who’s curious. Since there’s a lot of ground to cover, I’m going to look at each item in its own post — this is the first of the final two that cover what I feel to be the most important changes to the standard template library.

child map

C++11 introduces many changes to the STL as well as the core language itself. This post is the start of a summary of what I feel to be the most important ones, with a second post to follow to cover the remainder.

Utilise core language features

I don’t want to cover this in too much detail as I feel most of the points are fairly self-evident, but as the core language has had several features added, the STL has, unsurprisingly, been updated to take advantage of them. Here are a handful of these changes, for flavour:

Move semantics
Library containers now support rvalue references and take full advantage of classes that provide move semantics. Combined with variadic templates, this also allows support for perfect forwarding; see the short sketch just after this list.
Unicode
Types that support characters have been extended to also support the new char16_t and char32_t types for UTF-16 and UTF-32 respectively.
Explicit conversions
The new ability to mark type conversion methods as explicit allows some common gotchas to be addressed.
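
To make the move semantics point a little more concrete, here’s a minimal sketch of the sort of thing the updated containers allow:

#include <string>
#include <utility>
#include <vector>

int main()
{
    std::vector<std::string> names;
    std::string name("Ada Lovelace");

    // The rvalue reference overload of push_back() moves rather than copies.
    names.push_back(std::move(name));

    // emplace_back() uses variadic templates and perfect forwarding to
    // construct the element in place from constructor arguments.
    names.emplace_back("Charles Babbage");
    return 0;
}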

Threading

C++11 contains a variety of core language improvements which support multithreaded code, such as a new memory model which allows thread-safe construction of static objects and support for thread-local storage. In tandem with these changes the STL also has some platform-independent support for threads as well.

There’s a new std::thread class which can be passed a function object and a series of arguments to execute that function in a separate thread of execution. Generic waits for completion are possible with std::thread::join() and there’s also access to the native underlying thread ID where possible.

A handful of synchronisation primitives have also been added including std::mutex, std::recursive_mutex, std::condition_variable and std::condition_variable_any. These have all the usual straightforward semantics.

As well as these primitives there are some other very handy additions. The std::lock_guard class is a simple RAII wrapper for any lockable and std::unique_lock is a useful wrapper which allows deferred locking with RAII semantics and also locking with a timeout. There’s also std::lock which allows a set of lockables to be all acquired whilst avoiding deadlock.
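
As a rough sketch of how these locking facilities fit together (the mutexes and functions here are purely illustrative):

#include <mutex>

std::mutex accountMutex;
std::mutex auditMutex;
int balance = 0;

void deposit(int amount)
{
    // RAII locking: the mutex is released when guard goes out of scope,
    // even if an exception is thrown.
    std::lock_guard<std::mutex> guard(accountMutex);
    balance += amount;
}

void auditedDeposit(int amount)
{
    // std::lock() acquires both mutexes without risking deadlock, then the
    // lock_guards adopt ownership so they're still released via RAII.
    std::lock(accountMutex, auditMutex);
    std::lock_guard<std::mutex> accountGuard(accountMutex, std::adopt_lock);
    std::lock_guard<std::mutex> auditGuard(auditMutex, std::adopt_lock);
    balance += amount;
    // ... write an audit record here ...
}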

For higher performance, and greater convenience, there’s also the templated std::atomic which allows safe, typically lock-free access to fundamental data types - there’s also a set of convenience typedefs (e.g. std::atomic_int, std::atomic_char). It’s even possible to specify the precise memory ordering constraints to use for a particular operation with these types.
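
For example, a counter shared between threads can be kept consistent without any explicit locking; a minimal sketch:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter(0);

void work()
{
    for (int i = 0; i < 100000; ++i) {
        ++counter;  // Atomic increment: no mutex, no lost updates.
    }
}

int main()
{
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    std::cout << counter.load() << std::endl;  // Always prints 200000
    return 0;
}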

Finally, some higher-level support for asynchronous operations has been added in the form of futures. A function can be made asynchronous with std::async, std::packaged_task or std::promise, all of which provide a std::future which allows the result of the operation to be obtained once it’s available. One annoyance, however, is that there’s no way to compose futures and hence wait on any one of a whole set of asynchronous operations to complete.
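
By way of illustration, a minimal use of std::async might look something like this (expensiveCalculation() is just a stand-in for real work):

#include <future>
#include <iostream>

int expensiveCalculation(int input)
{
    // Stand-in for something genuinely slow.
    return input * 2;
}

int main()
{
    // Potentially runs on another thread; the future holds the eventual result.
    std::future<int> result =
            std::async(std::launch::async, expensiveCalculation, 21);

    // ... do other useful work here ...

    std::cout << result.get() << std::endl;  // Blocks until the result is ready
    return 0;
}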

All in all, then, a not quite perfect but nonetheless comprehensive set of platform independent tools for threading C++. Frankly, it’s about time.

Update: A couple of further notes on threading that may be helpful.

Firstly, when using GCC on POSIX platforms (and possibly others) you should compile with the -pthread option. This enables any platform-specific requirements for pthreads — on my Linux box it seems to add (at least) the following options:

  • -D_REENTRANT to the pre-processor definitions, which changes the behaviour of some glibc code. For example, I think it puts errno into thread-local storage.
  • -lpthread to the list of linker flags to link in the required library code.

Secondly, remember that the behaviour of an exception propagating to the outermost scope of a thread is unchanged — in other words, if it’s not handled then it will invoke std::terminate() just as if main() had thrown it:

#include <chrono>
#include <iostream>
#include <thread>

void hello()
{
  std::cout << "hello, world" << std::endl;
  //throw "oops";
}

int main()
{
  std::thread t1(hello);
  std::this_thread::sleep_for(std::chrono::seconds(2));
  t1.join();
  std::cout << "Joined" << std::endl;
  return 0;
}

That code should run fine, but uncomment the throw statement in hello() and you’ll see what I mean.

OK, that’s probably not too surprising. However, another more subtle cause of std::terminate() is destroying a joinable std::thread object without joining it first. For example, replace the thread creation, sleep and join in main() with the following and see whether you get as far as the Joined message:

std::thread* t1 = new std::thread(hello);
std::this_thread::sleep_for(std::chrono::seconds(2));
delete t1;

Tuples

The variadic template support added in C++11 allowed the addition of quite clean support for tuples — i.e. a heterogeneous fixed-dimension list of members, a little like a struct whose members don’t have names. The types of the members are declared with template parameters, as you would expect, and then std::get can be used to obtain references to the members which can be used to both query and update them.

std::tuple<int, std::string> myTuple(123, "hello, world");
std::cout << std::get<0>(myTuple) << " :: "
          << std::get<1>(myTuple) << std::endl;
std::get<0>(myTuple) = 456;
std::get<1>(myTuple) = "goodbye, world";

C++11’s newly repurposed auto keyword and type inference can be used to declare tuples even more conveniently:

auto myTuple = std::make_tuple(123, "hello, world");

Hash Tables

One thing that I was really happy to see finally added was a standardisation of hash tables. The venerable std::map has been around for some time now, and serves as a perfectly serviceable associative array — it’s always been a little irritating, however, that it’s implemented as a tree1. This is great if you need to iterate over the elements in sorted order, but if that’s not important then the access time seems a harsh price to pay when a well-implemented hash table can average constant time access.

There are four additional types which are hash equivalents of the existing associative containers:

  • std::unordered_map (cf. std::map)
  • std::unordered_multimap (cf. std::multimap)
  • std::unordered_set (cf. std::set)
  • std::unordered_multiset (cf. std::multiset)

I find this particularly useful as associative containers can often grow quite large but if the ratio of lookups to inserts is large then the logarithmic behaviour of std::map is a bit of a turnoff. This is almost certainly going to become my default associative container, barring the requirement for sorted order iteration, which I find is rare; and the issue of the key type not being hashable, which is more likely to be the reason to prefer the ordered versions in some cases.
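
Usage is essentially the same as the ordered equivalents; a quick sketch:

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, unsigned int> ages;
    ages["Andy"] = 37;
    ages["Alice"] = 30;

    // Lookups average constant time, but iteration order is unspecified.
    auto it = ages.find("Andy");
    if (it != ages.end()) {
        std::cout << it->first << " is " << it->second << std::endl;
    }
    return 0;
}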

Speaking of types not being hashable, the STL in C++11 provides hashing for integral, floating point, pointer and std::string types — anything else needs a little help from the application. To make a type hashable, a specialisation of the std::hash templated functor must be provided for it, which will typically just decompose the structure into constituent types which are themselves already hashable. Note that to support the containers, operator==() must also be defined, since they use chaining to resolve collisions.

One gotcha is that all pointers are treated the same, which is to hash on the raw pointer value regardless of the pointer type. Without a little care this could lead to some unfortunate bugs. For example, you might think the example below would store an instance of the (somewhat awful) Person class in a std::unordered_set and then retrieve it. If you run it, however, you’ll see that the people.count() check in main() finds no match. Can you spot the issue?

#include <cstring>
#include <iostream>
#include <functional>
#include <unordered_set>

class Person
{
  public:
    Person(const char *name, unsigned int age);
    Person(const Person& other);  // deep copy, so each instance owns its own m_name
    ~Person();

    const char* getName() const { return m_name; }
    int getAge() const { return m_age; }

  private:
    char* m_name;
    unsigned int m_age;
};

Person::Person(const char* name, unsigned int age) : m_age(age)
{
    size_t nameLen = strlen(name);
    m_name = new char[nameLen + 1];
    memcpy(m_name, name, nameLen);
    m_name[nameLen] = '\0';
}

Person::Person(const Person& other) : m_age(other.m_age)
{
    // Deep copy the name so copies don't share (and double delete) the buffer
    size_t nameLen = strlen(other.m_name);
    m_name = new char[nameLen + 1];
    memcpy(m_name, other.m_name, nameLen + 1);
}

Person::~Person()
{
    delete[] m_name;
}

namespace std
{
template<>
struct hash<Person>
{
    size_t operator()(const Person& p) const
    {
        size_t const h_name = std::hash<const char*>()(p.getName());
        size_t const h_age = std::hash<int>()(p.getAge());
        return h_name ^ (h_age << 1);
    }
};    
} // namespace std

bool operator==(const Person& lhs, const Person& rhs)
{
    return (lhs.getAge() == rhs.getAge() &&
            !strcmp(lhs.getName(), rhs.getName()));
}

int main()
{
    Person andy1("Andy", 37);
    Person andy2("Andy", 37);

    std::unordered_set<Person> people;
    people.insert(andy1);
    if (people.count(andy2) > 0) {
        std::cout << "Match!" << std::endl;
    }

    return 0;
}

Did you spot the issue? The hash value is being calculated on the pointer value returned by Person::getName() and not the contents of the array. A more correct way to calculate h_name might be:

size_t h_name = 0;
for (const char* c = p.getName(); *c; ++c) {
    h_name = std::hash<char>()(*c) ^ (h_name << 1);
}

Regular Expressions

When parsing text, one potentially powerful tool that many people will be familiar with from scripting languages is regular expressions2. Some people are quite opposed to them, apparently on principle. I must admit to a healthy scepticism in some cases, but I’m with Jeff Atwood in that, if well used, they have their place.

Well now C++11 has introduced a standard regex interface. The first step is to build a std::regex object from a string specification as usual. This can then be used with std::regex_match to check whether another string is a full match, or std::regex_search to check for a substring match. The result of a successful match is a std::match_results instance which allows access to the string that matched the pattern as well as any bracketed capture groups within it.

As with std::string, the regex-related classes are templated on the underlying type used. Where std::string is templated on character types, however, classes like std::match_results are templated on iterator types. For convenience there are some predefined typedefs - for example std::match_results has four, such as std::smatch for use with std::string.
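
For a single full match, a sketch using std::regex_match and std::smatch might look something like this:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex pattern("([0-9]{4})-([0-9]{2})-([0-9]{2})");
    std::string date("2015-04-13");
    std::smatch match;

    if (std::regex_match(date, match, pattern)) {
        std::cout << "Year:  " << match[1] << std::endl;
        std::cout << "Month: " << match[2] << std::endl;
        std::cout << "Day:   " << match[3] << std::endl;
    }
    return 0;
}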

As well as the two functions to check for a single match, there’s also std::regex_iterator (and its predefined specialisations) for iterating across multiple matches. With these facilities it’s really easy to write code that’s similar to something you might see in Perl or Python:

#include <regex>
#include <string>
#include <iostream>

int main()
{
    std::regex needle("(([0-9]+\\.){3}[0-9]+)");
    std::string haystack("Router IP is 172.16.0.1 or 192.168.0.1");
    for (std::sregex_iterator it(haystack.begin(), haystack.end(), needle);
            it != std::sregex_iterator(); ++it) {
        std::cout << "Found match:" << std::endl;
        for (size_t i = 0; i < it->size(); ++i) {
            std::cout << "  Group " << i << ": " << (*it)[i] << std::endl;
        }
    }
    return 0;
}

The code above should generate the following output:

Found match:
  Group 0: 172.16.0.1
  Group 1: 172.16.0.1
  Group 2: 0.
Found match:
  Group 0: 192.168.0.1
  Group 1: 192.168.0.1
  Group 2: 0.

Something that’s probably not immediately obvious from my pattern above is that capture group zero is special — it always contains the entire matched substring. The remaining capture groups correspond to the bracketed expressions in the pattern, in order of occurrence of their opening brackets. Since I bracketed the entire pattern, capture group 1 also contains the entire matched string. Capture group 2 matches the inner bracketed expression, but the complication here is that I’ve used the {3} specifier to indicate that the group must be repeated three times. As you can see from the results, each repetition overwrites the previous one, so the result of querying the capture group is whatever matched the final time.

To round off it’s important to know what dialect of regular expression is used, as there are significant differences in functionality — actually C++11 supports a surprising six dialects, which are selected when building the std::regex:

  • ECMAScript (the default)
  • basic (POSIX basic regular expressions)
  • extended (POSIX extended regular expressions)
  • awk
  • grep
  • egrep

The last two are rather minor variations of basic and extended POSIX respectively, simply adding a newline (\n) as an alternative to the pipe character (|) for alternation.

One final note on regexes for gcc users — I had to install gcc 4.9 to get the regex example above to work; regexes aren’t properly supported by libstdc++ in any earlier version.


  1. Typically a red-black tree, to be specific. 

  2. I’m intentionally ignoring the fact that the “regular expressions” provided by most programming languages are in fact sufficiently expressive that they’re no longer regular languages in the formal definition. I’m as much of a fan of pedantry as the next software engineer, but one of the things Larry Wall and I appear to agree on is that this battle is lost

13 Apr 2015 at 7:26AM by Andy Pearce in Software  | Photo by Annie Spratt on Unsplash  | Tags: c++  |  See comments

☑ An Unhealthy Environment

Recently I’ve been writing code to spawn child processes that had to deal with the POSIX functions for querying and manipulating environment variables. I’ve only just realised how truly awful this interface is in the context of modern multi-threaded applications, and this post is simply me sharing the pain.

city smog

Many Unix, and some Windows, users will be familiar with environment variables. These are key/value strings such as USER=andy or SHELL=/bin/bash, and they form part of the global environment provided to a process by the OS. Windows has a similar concept, although it has a few subtle differences and in this post I’m only discussing the situation in POSIX.

POSIX provides various interfaces to query and set these variables. Probably the most well known of these are setenv() and getenv(), so let’s start with those.

The getenv() function is pretty modest - you pass in the name of an environment variable and it returns you a pointer to the value. Simple enough, but immediately the spidey sense starts tingling. The function returns a char* instead of a const char* for one thing, but “the application shall ensure that it does not modify the string pointed to by the getenv() function”. Well, OK, perhaps they didn’t have const in the days this function was written. They also presumably hadn’t heard of thread-safety or re-entrancy, because anything that returns a static pointer pretty clearly does neither.

The setenv() function is also fairly simple - you pass in a new variable name and value, and a flag indicating whether you’re happy for the assignment to overwrite any previous value. But the man page talks about this function modifying the contents of environ - oh yes, let’s talk about that first…

You’ll notice that neither of the functions so far has given a way to iterate through all the environment variables that are currently set. It turns out that the only POSIX-supported way to do this is to use the global environ variable provided by the library. This is similar to the argv that’s passed into main() except that, instead of an argc equivalent, the environ array is null-terminated. Things start to smell a little fishy when you realise that environ isn’t actually declared in any header files - the application itself has to include something like this, as taken from the POSIX page on environment variables:

extern char** environ;
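
For illustration, here’s a minimal sketch of walking the environment via environ on a POSIX platform, ideally done early in main() before any other threads are around to make life interesting:

#include <cstring>
#include <iostream>
#include <string>

extern char** environ;

int main()
{
    // environ is a null-terminated array of "NAME=value" strings.
    for (char** entry = environ; *entry; ++entry) {
        const char* equals = std::strchr(*entry, '=');
        if (equals) {
            std::string name(*entry, equals - *entry);
            std::cout << name << std::endl;
        }
    }
    return 0;
}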

OK, so just like argv there’s some OS-supplied storage that contains the environment. It’s not const, but hey ho, neither is argv and we seem to cope fine with just not modifying that directly. Except that the key point here is that setenv() does modify environ - the man page even explicitly states that’s how it works. Unlike argv, therefore, you can’t just treat it as effectively a read-only constant1 array and quietly ignore the fact that the compiler won’t stop you modifying it.

It gets even more crazy when you realise that, according to the man page for the exec family, it’s quite valid to replace your entire environment by assigning a whole new value to environ. You read that correctly - not updating the pointers within environ, just repointing the whole thing at your own allocated memory.

So then, when setenv() comes along and wants to modify this, how on earth can it do so? It has no idea how big an array you’ve allocated in your own code - it either has to copy the whole lot to somewhere provided by the system, or cross its fingers and hope there’s enough space.

And don’t even get me started on the memory management Pandora’s Box that is putenv()

In summary, therefore, I’ve decided that the only sensible course of action is to use environment variables as little as possible. If you must use them as opposed to command-line arguments, you should parse them away right at the beginning of main() and put them into other storage within your code, never worrying about the environment again. If you’re writing a library… Well, good luck with that - let’s hope your application doesn’t mess around with the environment too badly before you want to query it. Whatever you do, don’t update it!

It’s quite possible to work around all this brokenness, of course, as long as you can make some basic assumptions of sanity about your libraries. But it’s all just such a dirty little mess in the otherwise mostly sensible tidiness that POSIX has imposed on the various APIs that exist.

Surely there’s got to be a more sensible way to control the behaviour of applications and libraries? For example, we could have some sort of system-wide database of key/value pairs - unlike the environment it could be lovely and clean and type-safe, and all properly namespaced too. For performance reasons we could stick it in some almost unparseable binary blob. There’s no way such a system could be abused by applications, right? It would remain clean and usable, I’m sure. Now all we need is a snappy name for it - something that indicates the way that values can be registered with it. Perhaps, The Register? No, people will confuse it with the online tech site. What about The Repository? Hm, confusing with source control. I dunno, I’ll think about it some more.


  1. Yes, I’m aware there are some use-cases for modifying argv too, but I class those as unusual cases, and they also tend to be quite system-specific (assuming you want to resize the strings in the process). 

25 Mar 2015 at 7:28AM by Andy Pearce in Software  | Photo by Alex Gindin on Unsplash  | Tags: posix  environment-variables linux  |  See comments

☑ Hash Collision Fun

When dealing with data structures that involve hashing, most commonly hash tables, it’s fairly common knowledge that your choice of hash function is an important performance consideration. What’s perhaps less well known is that it can be an important security consideration too - this article briefly discusses why.

intersection traffic jam

Anyone who’s studied Computer Science at university, or has spent a few years in the software industry, will know about hash tables. Those who’ve looked deeper will know that there are other, equally powerful, uses for the same general technique. Its great strength is its performance in the general case, able to average constant time lookups where many data structures only manage logarithmic performance. The complication is that to achieve this in practice you need to put a lot of care into the hash function that you choose. However, as I discovered recently, average case performance isn’t the only consideration.

The important characteristic of a hash function is that it assigns inputs to random buckets. This should hold regardless of which inputs are chosen as this helps to ensure the average constant time performance - if multiple items hash to the same bucket, the lookup time rapidly degrades towards linear performance. This can have a very large impact on running time of the application. With a reasonable hash function this occurs very rarely in practice, but if someone knows the hash function in use then it may be possible to deliberately engineer hash collisions - this is fairly obvious when you think about it, but it wasn’t something that occurred to me until recently.

Why would someone want to do that? One reason is part of a DoS attack, where an attacker uses some supposedly limited level of access to impair usage for other users. A great example of this is an attack someone devised on the Btrfs filesystem a few years ago. Essentially this involved constructing a large number of filenames which hash to the same bucket, thus causing the number of items in that bucket to grow artificially large and extending the time taken to perform operations massively. This approach could also be used by one user with access to a shared directory to prevent another user creating a file in the same directory by creating many files which hashed to the same value.

Apparently this isn’t as rare as you might imagine - there’s even a CERT advisory which contains a long list of programming languages whose builtin hashing algorithms are prone to this sort of attack. Many of these languages have since resolved the issue - Python was one of the more recent of these, only having moved to using the cryptographic-quality SipHash function in version 3.41.

Aside from all these language features, however, which will have a lot of pressure to change due to the number of people affected, it’s a little concerning to consider how much software there is out there which has used insecure hashing in the name of performance, leaving itself open to these sorts of cunning attacks.

Just one more reason not to roll your own hashing function - but I have no doubts that people will continue to do so for a long time yet.


  1. Prior to choosing SipHash, the Python team did attempt to address the issue using the old hashing algorithm but perturbed by a pseudorandomly chosen initial value per invocation of Python. Unsurprisingly this turned out to be less than adequate. 

20 Mar 2015 at 7:36AM by Andy Pearce in Software  | Photo by Jens Herrndorff on Unsplash  | Tags: data-structures security  |  See comments

☑ Netflix Neutrality

Well, after almost a year’s downtime I’ve finally found time to get my blog up and running again and even actually write an entry. Spurred by articles on Netflix’s rampant traffic growth last month, I’ve decided to lay down my thoughts on the related topic of Net Neutrality, which may be somewhat at odds with many in the technology community. This is a fairly old topic these days, but one that I think will be relevant for some time to come yet.

netflix tv

You’ve probably heard of the online video rental service Netflix — in case you hadn’t, they’re a company who started as a flat rate DVD rental service in the US but are now best known for their online streaming offering, making their entire catalogue available for watching online for a flat monthly fee.

Well, Netflix have enjoyed some pretty significant growth. Last month, for example, it was announced that they comprised almost 35% of downstream traffic and even 10% of upstream ACKs — that’s a massive proportion of bandwidth for anyone, not least of which a company whose market cap is only around 5% of Google’s. This growth in traffic speaks to Netflix’s rampant popularity, but this success has also brought them some pretty stern opponents — primarily ISPs1.

ISPs seem rather bitter about the success of companies such as Google and Netflix. This is because the ISPs feel that these companies are only able to make their profits because of the network infrastructure that the ISPs have built out at their own expense. This is a rather debatable position, which I’ll come to in a moment, but whether justified or not in recent years the ISPs have become increasingly desperate to somehow jump on the bandwagon and monetise the rampant success of online services.

Clearly the primary source of revenue for an ISP is subscription fees — could they just charge their customers a higher flat rate? Well, increasing fees is generally unpopular, especially in areas where there’s little competition, and would most likely attract a lot of negative attention from regulators. Another possibility for making additional money is to provide their own services, but in practice these are almost invariably significantly less compelling than existing players and gain very little market traction. It shouldn’t be a surprise, since the existing providers make this their sole business — in short, they know what they’re doing.

Faced with these avenues being frustrated, ISPs have instead made moves to try to leach some of the revenue streams away from service companies (e.g. Google and Netflix). They have a trump card to play to achieve this, which is to put roadblocks between their users and the servers used by these companies such that the service gets degraded or removed completely. End users typically don’t know the reasons for this and assume the service is poor, placing their pressure on the provider of the service to fix it. Indeed, exactly this has happened to Netflix where the ISP Comcast in the US started slowing down their data — the problem got so bad that some users were totally unable to watch movies and cancelled their Netflix subscriptions.

Understandably the likes of Google and Netflix are none too happy with what they view as unfair business practices. These fears were proved somewhat justified in Netflix’s case when they were forced to cave in and start paying ISPs for unthrottled access to their customers. This turn of events is a concern to a lot of the online services companies, who feel that it’s the thin end of a wedge that’s going to leave them at the mercy of powerful ISPs. As a result, for years now they’ve been lobbying governments worldwide, but primarily in the US, to pass laws enforcing what’s typically known as Net Neutrality — put simply, the notion that nobody gets to artificially accelerate, slow down or block traffic from one particular source relative to others.

Under Net Neutrality, Comcast wouldn’t be allowed to slow down Netflix movie streaming, for example, although they would be able to, say, offer a uniformly slow service for a lower cost to consumers (i.e. “fair” throttling). As well as these companies, a lot of grassroots Internet advocates also support Net Neutrality, believing that it will help protect the Internet in its current fairly open form from undue corporate interference which could harm the quality of services now and in the future.

Now the actions of all of these parties are quite understandable from their own points of view, but frankly I believe they’re all rather misguided — for the remainder of this post, I’ll try to explain why.

Firstly, as much as the ISPs tend to be demonised in these debates, it can’t be denied that they have a potential problem. As public companies they have shareholders, and like any shareholders they want to see growth in the company and a return on their investment. If ISPs are blocked from innovating to achieve this growth then they’ll stop attracting investment; and that means they’ll be unable to afford to upgrade their infrastructure; and that means gradually degrading service for everyone — that’s a lose/lose scenario. Hence it’s totally natural to see why they’d oppose major legislative constraints on their business.

Secondly, and conversely, it is certainly a problem that ISPs are in a position where they can so easily inflict untold damage on the bottom line of companies who sell their services online. At its most extreme this could develop into a form of protection racket, where the ISPs can extract arbitrary sums of money from other companies in return for access — I’m not suggesting that it’s anything like this at present, but even the possible risk is clearly not an attractive state of affairs.

Thirdly, a lot of people seem to be putting a lot of faith into legislation to sort this out — they believe some strong Net Neutrality laws will block the ISPs from “abusing” their networks and interfering with other people’s business. But they apparently forget the atrocious record that governments have in passing laws that intersect with business, and especially technology. Those who make policy just do not understand the issues enough to make informed choices, so it becomes a battle of the lobbyists. Let us not forget that ISPs are typically well established companies with plenty of lobbyists of their own, so it’s not at all clear that some awful tangled up mess of a compromise won’t emerge at the end that doesn’t particularly please anyone except the lawyers who’ll get to wrangle over it in legal disputes for decades.

Finally, even if we could rely on government to pen some good legislation now which is fit for purpose and achieves the stated goals, how can we assess the potential for stifling future innovation? There are quite legitimate uses for traffic prioritisation — for example, a cable company might decide to deliver all of its TV services over its IP infrastructure and customers are clearly going to expect to receive this without interruption even if their neighbour is downloading a hundred movies over BitTorrent. Or perhaps a hospital decides to allow doctors to perform operations remotely via the Internet and requires the ISP to give this important traffic priority. Preventing ISPs from implementing this sort of mechanism risks harming both their service and customers in the future by preventing them from making best use of their infrastructure.

In summary, I accept there are problems to solve in the ISP business, but I don’t accept that Net Neutrality legislation is necessarily the best solution to them. So what is?

What about good old competition?

The root cause of these problems, in my mind, is not that the ISPs are able to make decisions about their own networks which affect other companies — the problem is that their consumers are, in many cases, lacking any viable alternative provider they can move to. This problem appears to be particularly acute in the US, where some areas really have no practical choice at all for high speed Internet access. This means the ISPs have more or less carte blanche to implement whatever draconian traffic management policies they like, and their customers can do very little but complain about it. Were there viable alternatives, customers could move away in droves and effectively force a reverse of the policy. For this to work adequately we could perhaps have some light legislation that forces ISPs to make public their traffic shaping measures, but that’s really just tightening up existing regulations on requiring products and services to be accurately described.

Let us not also forget that ISPs don’t just look to exploit other companies — they’ve shown a willingness to exploit their customers as well, admitting that data caps are more about extracting additional revenue than managing network congestion, which is the reason they typically cite publicly. Competition and free market economics handily resolve these sorts of issues as well, whereas Net Neutrality doesn’t say a lot about ISP charging structures, subscription fees or fairness to customers.

Of course, it’s undeniable that competition in this market is incredibly hard to engineer — a lot of basic infrastructure has to be built for even a basic service, and that’s ignoring all those difficult issues like getting permission to dig up thousands of roads. There are ways this can be done, however, such as “local loop unbundling” or LLU, where the government instead force the incumbents to open up their “last mile” infrastructure to the competition — that’s the connection to your house, and it’s the expensive part that prevents competing entities starting a business.

This might seem unfair to incumbent ISPs but let’s not forget that, for example, the US cable companies only got their local monopolies with a little help from the government in the first place. It’s also important to note that while ISPs may be jealous of the profits of service companies, they’re still large companies with a healthy bottom line — compare Comcast’s profits of $2.6bn last quarter with Netflix’s rather more modest $60m, so incurring some capex in the short term to unbundle their services isn’t going to send them bankrupt.

The use of LLU has been quite successful in countries like the UK, for example. Indeed, earlier this year the incumbent BT asked the regulator to allow it to charge more for LLU access, which is probably a pretty good sign that competition is working. Also, LLU is just one example of an approach that’s been shown to work — I’m sure there are other forms of government intervention which could encourage competition.

Overall, therefore, I would argue that although fostering competition may be the harder path, ultimately the market will be a lot healthier for everyone, not least of which the consumer, if fair market economics is allowed to assert itself instead of relying on the vagaries of government legislation to swing the balance of power around one way or another.

Whilst we’re on the subject I should also mention that this isn’t regarded as a purely legislative or policy issue by everyone — some people believe that a technical solution is a feasible alternative. The most likely candidate seems to be switching to some sort of P2P system, where users share video streams among themselves as well as from Netflix’s servers. This approach is used to implement the popular music service Spotify, for example. It’s not without its potential issues, but P2P company BitTorrent have said they think this is the best approach. Well, OK, so they’re a P2P company, of course they’d say that, right? But actually not so crazy — looks like Netflix has advertised for someone with the relevant skills already, so it seems as if they’re at least investigating this possibility.

Personally I think that’s a mixed bag. On the one hand, it would undoubtedly make it a lot harder for ISPs to selectively block or throttle Netflix’s traffic; on the other hand, ISPs like Comcast have shown in the past that they’re quite willing to take on the challenges of blocking P2P traffic and if Netflix decided to stop paying them then it’s quite possible they’d be up for the fight of doing so again. I think it’s also debatable that a catalogue as large as Netflix’s would benefit from P2P’s swarm effect — the efficiencies will tend to come when a large number of people are concurrently watching a small amount of content. This might work with popular new releases, for example, but the long tail of Netflix’s content may mean that much video just gets streamed from its servers anyway due to lack of other peers. Finally, there are complex issues surrounding use of storage on customer machines and concerns from rights holders over storing their content for longish periods on computers outside of Netflix’s control. I’m sure all these issues are quite resolvable, but it’s certainly not a simple topic.

In conclusion, then, like many issues in life it’s complicated. Parties on both sides have legitimate concerns; and parties on both sides have, at times, been quite disingenuous in their petitions to the government and the public. I think it boils down to whether we can find a way to allow the tried and tested mechanisms of the free market to take control; or whether we’re going to roll the dice and let a handful of government policy makers come up with some sort of legislation that may help or hinder and whose long-term effects are hard to predict.

I know which I’d prefer, but I think there’s only one thing that we can all be certain of — this debate is likely to rumble on for a very long time yet before it’s decided one way or another.


  1. Internet Service Providers (ISPs) are those companies that provide Internet access to the public. 

19 Dec 2014 at 7:49AM by Andy Pearce in Software  | Photo by Jens Kreuter on Unsplash  | Tags: internet net-neutrality  |  See comments
