Off to Bassett

On Good Friday Michelle, Aurora, Jackie and I visited my parents for the day.

As a bit of an experiment I’ve written a blog entry with Adobe Slate and I’m seeing how well the link embeds in my blog. So without further ado, here it is.

Off to Bassett

Wed 15 Apr 2015 at 07:13PM by Andy Pearce in Outings tagged with photos, michelle, aurora, jackie and mum-and-dad

C++11: Library Changes: Threading, Tuples, Hashing and Regexes

This is part 7 of the “C++11 Features” series which started with C++11: Move Semantics.

I’ve finally started to look into the new features in C++11 and I thought it would be useful to jot down the highlights, for myself or anyone else who’s curious. Since there’s a lot of ground to cover, I’m going to look at each item in its own post — this is the first of the final two that cover what I feel to be the most important changes to the standard template library.

C++11 introduces many changes to the STL as well as to the core language itself; this post covers the first half of the highlights, with a second post to follow for the remainder.

Utilise core language features

I don’t want to cover this in too much detail, as I feel most of the points are fairly self-evident: since the core language has gained several new features, the STL has, unsurprisingly, been updated to take advantage of them. A handful of these changes, for flavour:

Move semantics

Library containers now support rvalue references and take full advantage of classes that provide move semantics. Combined with variadic templates, this also allows support for perfect forwarding.

Unicode

Types that support characters have been extended to also support the new char16_t and char32_t types for UTF-16 and UTF-32 respectively.

Explicit conversions

The new ability to mark type conversion methods as explicit allows some common gotchas to be addressed.

Threading

C++11 contains a variety of core language improvements which support multithreaded code, such as a new memory model, thread-safe construction of static objects and support for thread-local storage. In tandem with these changes, the STL gains some platform-independent support for threads as well.

There’s a new std::thread class which can be passed a function object and a series of arguments to execute that function in a separate thread of execution. Generic waits for completion are possible with std::thread::join() and there’s also access to the native underlying thread ID where possible.

A handful of synchronisation primitives have also been added including std::mutex, std::recursive_mutex, std::condition_variable and std::condition_variable_any. These have all the usual straightforward semantics.

As well as these primitives there are some other very handy additions. The std::lock_guard class is a simple RAII wrapper for any lockable and std::unique_lock is a useful wrapper which allows deferred locking with RAII semantics and also locking with a timeout. There’s also std::lock which allows a set of lockables to be all acquired whilst avoiding deadlock.

For higher performance, and greater convenience, there’s also the templated std::atomic which allows safe, lock-free access to fundamental data types - there are also a set of convenience specialisations (e.g. std::atomic_int, std::atomic_char). It’s even possible to specify the precise ordering constraints to use for a particular operation with these types.

Finally some higher-level support for asynchronous operations has been added with support for futures. A function can be made asynchronous with std::async, std::packaged_task or std::promise, all of which return a std::future which allows the result of the operation to be obtained once available. One annoyance, however, is that there’s no way to compose futures and hence wait on any of a whole set of asynchronous operations to complete.

All in all, then, a not quite perfect but nonetheless comprehensive set of platform-independent tools for threading in C++. Frankly, it’s about time.

Update: A couple of further notes on threading that may be helpful.

Firstly, when using GCC on POSIX platforms (and possibly others) you should compile with the -pthread option. This enables any platform-specific requirements for pthreads — on my Linux box it seems to add (at least) the following options:

  • -D_REENTRANT to the pre-processor definitions, which changes the behaviour of some glibc code. For example, I think it puts errno into thread-local storage.
  • -lpthread to the list of linker flags to link in the required library code.

Secondly, remember that the behaviour of an exception propagating to the outermost scope of a thread is unchanged — in other words, if it’s not handled then it will invoke std::terminate() just as if main() had thrown it:

#include <chrono>
#include <iostream>
#include <thread>

void hello()
{
  std::cout << "hello, world" << std::endl;
  //throw "oops";
}

int main()
{
  std::thread t1(hello);
  std::this_thread::sleep_for(std::chrono::seconds(2));
  t1.join();
  std::cout << "Joined" << std::endl;
  return 0;
}

That code should run fine, but uncomment the throw statement in hello() and you’ll see what I mean.

OK, that’s probably not too surprising. However, another more subtle cause of std::terminate() is destroying a joinable std::thread object without joining it first. For example, change the three lines in main() that create, sleep and join the thread to this and see whether you get as far as the Joined message:

std::thread* t1 = new std::thread(hello);
std::this_thread::sleep_for(std::chrono::seconds(2));
delete t1;

Tuples

The variadic template support added in C++11 allowed the addition of quite clean support for tuples — i.e. a heterogeneous fixed-dimension list of members, a little like a struct whose members don’t have names. The types of the members are declared with template parameters, as you would expect, and then std::get can be used to obtain references to the members, which can be used to both query and update them.

std::tuple<int, std::string> myTuple(123, "hello, world");
std::cout << std::get<0>(myTuple) << " :: "
          << std::get<1>(myTuple) << std::endl;
std::get<0>(myTuple) = 456;
std::get<1>(myTuple) = "goodbye, world";

C++11’s newly repurposed auto keyword and type inference can be used to declare tuples even more conveniently:

auto myTuple = std::make_tuple(123, "hello, world");

Hash Tables

One thing that I was really happy to see finally added was a standardisation of hash tables. The venerable std::map has been around for some time now, and serves as a perfectly serviceable associative array — it’s always been a little irritating, however, that it’s implemented as a tree¹. This is great if you need to iterate over the elements in sorted order, but if that’s not important then the access time seems a harsh price to pay when a well-implemented hash table can average constant time access.

There are four additional types which are hash equivalents of the existing associative containers:

  • std::unordered_set (cf. std::set)
  • std::unordered_map (cf. std::map)
  • std::unordered_multiset (cf. std::multiset)
  • std::unordered_multimap (cf. std::multimap)

I find this particularly useful as associative containers can often grow quite large, and if the ratio of lookups to inserts is high then the logarithmic behaviour of std::map is a bit of a turnoff. This is almost certainly going to become my default associative container, barring the requirement for sorted-order iteration, which I find is rare; or the key type not being hashable, which is more likely to be the reason to prefer the ordered versions in some cases.

Speaking of types not being hashable, the STL in C++11 provides hashing for integral, floating point and pointer types — anything else needs a little help from the application. To make a type hashable, a specialisation of the std::hash templated functor must be provided for it, which will typically just decompose the structure into constituent types which are themselves already hashable. Note that to support the containers, operator==() must also be defined, since the containers need to compare keys which hash to the same bucket.

One gotcha is that all pointers are treated the same, which is to hash on the raw pointer value regardless of the pointer type. Without a little care this could lead to some unfortunate bugs. For example, you might think the example below would store an instance of the (somewhat awful) Person class in a std::unordered_set and then retrieve it. If you run it, however, you’ll see that the count() lookup in main() finds no match. Can you spot the issue?

#include <cstring>
#include <iostream>
#include <functional>
#include <unordered_set>

class Person
{
  public:
    Person(const char *name, unsigned int age);
    Person(const Person& other);  // deep copy, since m_name is owned
    ~Person();

    const char* getName() const { return m_name; }
    int getAge() const { return m_age; }

  private:
    char* m_name;
    unsigned int m_age;
};

Person::Person(const char* name, unsigned int age) : m_age(age)
{
    size_t nameLen = strlen(name);
    m_name = new char[nameLen + 1];
    memcpy(m_name, name, nameLen);
    m_name[nameLen] = '\0';
}

Person::Person(const Person& other)
    : Person(other.m_name, other.m_age)
{
}

Person::~Person()
{
    delete[] m_name;
}

namespace std
{
template<>
struct hash<Person>
{
    size_t operator()(const Person& p) const
    {
        size_t const h_name = std::hash<const char*>()(p.getName());
        size_t const h_age = std::hash<int>()(p.getAge());
        return h_name ^ (h_age << 1);
    }
};    
} // namespace std

bool operator==(const Person& lhs, const Person& rhs)
{
    return (lhs.getAge() == rhs.getAge() &&
            !strcmp(lhs.getName(), rhs.getName()));
}

int main()
{
    Person andy1("Andy", 37);
    Person andy2("Andy", 37);

    std::unordered_set<Person> people;
    people.insert(andy1);
    if (people.count(andy2) > 0) {
        std::cout << "Match!" << std::endl;
    }

    return 0;
}

Did you spot the issue? The hash value is being calculated on the pointer value returned by Person::getName() and not the contents of the array. A more correct way to calculate h_name might be:

size_t h_name = 0;
for (const char* c = p.getName(); *c; ++c) {
    h_name = std::hash<char>()(*c) ^ (h_name << 1);
}

Regular Expressions

When parsing text, one potentially powerful tool that many people will be familiar with from scripting languages is regular expressions². Some people are quite opposed to them, apparently on principle. I must admit to a healthy scepticism in some cases, but I’m with Jeff Atwood in thinking that, well used, they have their place.

Well now C++11 has introduced a standard regex interface. The first step is to build a std::regex object using a string specification as usual. This can be used with std::regex_match to check if another string is a full match, or std::regex_search to check for a substring match. The result of a successful match is a std::match_results instance which allows access to the string that matches the pattern as well as any bracketed capture groups within it.

As with std::string, the regex-related classes are templated on the underlying type used. Where std::string is templated on the character type, however, the regex classes are templated on string types. For convenience there are some predefined types — for example, std::match_results has four specialisations, such as std::smatch for use with std::string.

As well as the two functions to check for a single match, there’s also std::regex_iterator (and its predefined specialisations) for iterating across multiple matches. With these facilities it’s really easy to write code that’s similar to something you might see in Perl or Python:

#include <regex>
#include <string>
#include <iostream>

int main()
{
    std::regex needle("(([0-9]+\\.){3}[0-9]+)");
    std::string haystack("Router IP is 172.16.0.1 or 192.168.0.1");
    for (std::sregex_iterator it(haystack.begin(), haystack.end(), needle);
            it != std::sregex_iterator(); ++it) {
        std::cout << "Found match:" << std::endl;
        for (size_t i = 0; i < it->size(); ++i) {
            std::cout << "  Group " << i << ": " << (*it)[i] << std::endl;
        }
    }
    return 0;
}

The code above should generate the following output:

Found match:
  Group 0: 172.16.0.1
  Group 1: 172.16.0.1
  Group 2: 0.
Found match:
  Group 0: 192.168.0.1
  Group 1: 192.168.0.1
  Group 2: 0.

Something that’s probably not immediately obvious from my pattern above is that capture group zero is special — it always contains the entire matched substring. The remaining capture groups correspond to the bracketed expressions in the pattern, in order of occurrence of the opening brackets. Since I bracketed the entire pattern, capture group 1 also contains the entire matched string. Capture group 2 matches the inner bracketed expression, but the complexity here is that I’ve used the {3} specifier to indicate that the group must be repeated three times. As you can see from the results, each repetition overwrites the previous one, so the result of querying the capture group is whatever matched the final time.

To round off, it’s important to know what dialect of regular expression is used, as there are significant differences in functionality — actually C++11 supports a surprising six dialects, which are selected via a flag when building the std::regex:

  • ECMAScript (the default)
  • basic (POSIX basic regular expressions)
  • extended (POSIX extended regular expressions)
  • awk (the dialect used by the awk utility)
  • grep
  • egrep

The last two are rather minor variations of basic and extended POSIX respectively, simply adding a newline (\n) as an alternative to the pipe character (|) for alternation.

One final note on regexes for GCC users — I had to install GCC 4.9 to get the regex example above to work: earlier versions of libstdc++ provide the &lt;regex&gt; headers but not a working implementation.


  1. Typically a red-black tree, to be specific. 

  2. I’m intentionally ignoring the fact that the “regular expressions” provided by most programming languages are in fact sufficiently expressive that they’re no longer regular languages in the formal definition. I’m as much of a fan of pedantry as the next software engineer, but one of the things Larry Wall and I appear to agree on is that this battle is lost.

Mon 13 Apr 2015 at 07:26AM by Andy Pearce in Software tagged with c++

An Unhealthy Environment

Recently I’ve been writing code to spawn child processes that had to deal with the POSIX functions for querying and manipulating environment variables. I’ve only just realised how truly awful this interface is in the context of modern multi-threaded applications, and this post is simply me sharing the pain.

Many Unix, and some Windows, users will be familiar with environment variables. These are key/value strings such as USER=andy or SHELL=/bin/bash, and they form part of the global environment provided to a process by the OS. Windows has a similar concept, although it has a few subtle differences and in this post I’m only discussing the situation in POSIX.

POSIX provides various interfaces to query and set these variables. Probably the most well known of these are setenv() and getenv(), so let’s start with those.

The getenv() function is pretty modest - you pass in the name of an environment variable and it returns you a pointer to the value. Simple enough, but immediately the spidey sense starts tingling. The function returns a char* instead of a const char* for one thing, yet “the application shall ensure that it does not modify the string pointed to by the getenv() function”. Well, OK, perhaps they didn’t have const in the days this function was written. They also presumably hadn’t heard of thread-safety or re-entrancy, because anything that returns a pointer to internal static storage clearly supports neither.

The setenv() function is also fairly simple - you pass in a new variable name and value, and a flag indicating whether you’re happy for the assignment to overwrite any previous value. But the man page talks about this function modifying the contents of environ - oh yes, let’s talk about that first…

You’ll notice neither of the functions so far has given a way to iterate through all the environment variables that are currently set. It turns out that the only POSIX-supported way to do this is to use the global environ variable provided by the library. This is similar to the argv that’s passed into main() except that instead of an argc equivalent, the environ array is null-terminated. Things start to smell a little fishy when you realise that environ isn’t actually declared in any header files - the application itself has to include something like this, as taken from the POSIX page on environment variables.

extern char** environ;

OK, so just like argv there’s some OS-supplied storage that contains the environment. It’s not const, but hey ho, neither is argv and we seem to cope fine with just not modifying that directly. Except that the key point here is that setenv() does modify environ - the man page even explicitly states that’s how it works. Unlike argv, therefore, you can’t just treat it as an effectively read-only constant¹ array and quietly ignore the fact that the compiler won’t stop you modifying it.

It gets even more crazy when you realise that, according to the man page for the exec family, it’s quite valid to replace your entire environment by assigning a whole new value to environ. You read that correctly - not updating the pointers within environ, just repointing the whole thing at your own allocated memory.

So then, when setenv() comes along and wants to modify this, how on earth can it do so? It has no idea how big an array you’ve allocated in your own code - it either has to copy the whole lot to somewhere provided by the system, or cross its fingers and hope there’s enough space.

And don’t even get me started on the memory management Pandora’s Box that is putenv()

In summary, therefore, I’ve decided that the only sensible course of action is to use environment variables as little as possible. If you must use them as opposed to command-line arguments, you should parse them away right at the beginning of main() and put them into other storage within your code, never worrying about the environment again. If you’re writing a library… Well, good luck with that - let’s hope your application doesn’t mess around with the environment too badly before you want to query it. Whatever you do, don’t update it!

It’s quite possible to work around all this brokenness, of course, as long as you can make some basic assumptions of sanity about your libraries. But it’s all just such a dirty little mess in the otherwise mostly sensible tidiness that POSIX has imposed on the various APIs that exist.

Surely there’s got to be a more sensible way to control the behaviour of applications and libraries? For example, we could have some sort of system-wide database of key/value pairs - unlike the environment it could be lovely and clean and type-safe, and all properly namespaced too. For performance reasons we could stick it in some almost unparseable binary blob. There’s no way such a system could be abused by applications, right? It would remain clean and usable, I’m sure. Now all we need is a snappy name for it - something that indicates the way that values can be registered with it. Perhaps, The Register? No, people will confuse it with the online tech site. What about The Repository? Hm, confusing with source control. I dunno, I’ll think about it some more.


  1. Yes, I’m aware there are some use-cases for modifying argv too, but I class those as unusual cases, and they also tend to be quite system-specific (assuming you want to resize the strings in the process). 

Wed 25 Mar 2015 at 07:28AM by Andy Pearce in Software tagged with posix, environment-variables and linux

Hash Collision Fun

When dealing with data structures that involve hashing, most commonly hash tables, it’s fairly common knowledge that your choice of hash function is an important performance consideration. What’s perhaps less well known is that it can be an important security consideration too - this article briefly discusses why.

Anyone who’s studied Computer Science at university, or has spent a few years in the software industry, will know about hash tables. Those who’ve looked deeper will know that there are other, equally powerful, uses for the same general technique. Its great strength is its performance in the general case, able to achieve average constant-time lookups where many data structures only manage logarithmic performance. The complication is that to achieve this in practice you need to put a lot of care into the hash function that you choose. However, as I discovered recently, average case performance isn’t the only consideration.

The important characteristic of a hash function is that it distributes inputs across buckets as if at random. This should hold regardless of which inputs are chosen, as this helps to ensure the average constant-time performance - if multiple items hash to the same bucket, the lookup time rapidly degrades towards linear. This can have a very large impact on the running time of the application. With a reasonable hash function this occurs very rarely in practice, but if someone knows the hash function in use then it may be possible to deliberately engineer hash collisions - this is fairly obvious when you think about it, but it wasn’t something that occurred to me until recently.

Why would someone want to do that? One reason is part of a DoS attack, where an attacker uses some supposedly limited level of access to impair usage for other users. A great example of this is an attack someone devised on the Btrfs filesystem a few years ago. Essentially this involved constructing a large number of filenames which hash to the same bucket, thus causing the number of items in that bucket to grow artificially large and extending the time taken to perform operations massively. This approach could also be used by one user with access to a shared directory to prevent another user creating a file in the same directory by creating many files which hashed to the same value.

Apparently this isn’t as rare as you might imagine - there’s even a CERT advisory which contains a long list of programming languages whose builtin hashing algorithms are prone to this sort of attack. Many of these languages have since resolved the issue - Python was one of the more recent, only having moved to the cryptographic-quality SipHash function in version 3.4¹.

Aside from all these language features, however, which will have a lot of pressure to change due to the number of people affected, it’s a little concerning to consider how much software there is out there which has used insecure hashing in the name of performance, leaving itself open to these sorts of cunning attacks.

Just one more reason not to roll your own hashing function - but I have no doubts that people will continue to do so for a long time yet.


  1. Prior to choosing SipHash, the Python team did attempt to address the issue using the old hashing algorithm perturbed by a pseudorandomly chosen initial value per invocation of Python. Unsurprisingly this turned out to be less than adequate. 

Fri 20 Mar 2015 at 07:36AM by Andy Pearce in Software tagged with data-structures and security

Agile Government

I’m quite a fan of Agile software development, but it seems that the same approach can be used in a wide variety of other industry areas. In this post I’ll briefly describe how I discovered a very Agile-sounding approach to nursing in Holland.

Britain is currently undergoing a swathe of austerity under a Tory-led government, with significant cuts to all sorts of public services. With a national debt of £1.4tn, rising by around £100bn a year, it’s not a surprise that the government is trying to balance the books regardless of what you may think of their means of doing so.

One of the biggest budget items in the UK is, of course, the NHS; and naturally it gets some of the most scrutiny for possible savings. At the same time it’s extremely sensitive when it comes to cuts in frontline services - nobody wants to feel that their health is being put at risk for the sake of saving a little money. As a result, there’s usually a great deal of rhetoric bouncing around about how to achieve savings without affecting actual care. Typically these take the form of some sort of hand-wavy diatribe produced by throwing phrases like “cut through the red tape”, “strip out middle management”, “get rid of bureaucracy” and “back to basics” into a cup, shaking it all around and throwing it onto a page to see what sticks¹, but they are conspicuously light on concrete plans to implement such measures, or even evidence that it’s possible.

Well, I read a Guardian article about a pilot scheme in Holland which has achieved some impressive-sounding efficiencies. The Buurtzorg community nursing organisation is now supporting 7,000 nurses with only 30 back office support staff; that’s over two hundred nurses for each non-medical employee. Apparently the quality of care is better, with patients requiring 40% fewer hours of care, and nurses have less than half the absenteeism and a third less turnover than other nursing organisations.

This all sounds rather too much like the holy grail of NHS savings for which successive governments have been searching all these years to be true, so I decided to try to find out a few more details. Their US homepage has some interesting tidbits, but what really intrigued me was a report I found on the web page of The King’s Fund, a charity that works to improve policy around health care in the UK.

It’s a fairly brief bullet point summary, but here are the main points that leapt out at me:

  • Independent teams of up to 12 nurses.
  • Teams are autonomous and responsible for the whole process.
  • Assessment and care of all types of client: generalists.
  • Monitor the outcomes instead of effort.
  • Focus on activities instead of processes.

So essentially the group improved productivity and employee satisfaction by splitting the nurses into small, autonomous groups who self-organised into the optimal structure for their particular tasks and focused on whatever task they needed to carry out to achieve their objectives instead of rigidly adhering to some centralised process.

To any software engineers out there this might be starting to sound awfully familiar - specifically it really sounds rather like an Agile methodology to me. I’ve long been attracted by the benefits of Agile in software development, but it’s fascinating to see something that appears extremely similar being so gainfully employed in such a different industry. Perhaps I shouldn’t be too surprised, since Agile practices originally grew up in the automotive industry, but it does make me wonder how many sectors could be significantly improved by judicious use of these ideas.

In an oblique way it also reminds me of a TED talk by Sugata Mitra a while ago about the deficiencies of the education system. His thesis is that our current approach to education was shaped by the Victorians, who needed people who could be employed effectively as cogs in the massive global machine that was the British Empire. He further suggests that today’s technology has rendered the need for such roles largely obsolete, and instead we should be trying to produce young adults who can build on a sound technological foundation and innovate upon it.

In both cases I suspect attempts at widespread change will face an uphill struggle against those who want to cling to the old ways of doing things, regardless of evidence supporting the change. Personally I believe this is the real challenge in turning round organisations like the NHS, not any shortage of ideas for improvement. That’s why it’s so good to see real world schemes like Buurtzorg showing such massive improvement - such compelling evidence is going to be critical in pushing through change.

But I fear it’ll be a long, long road yet.


  1. Incidentally, the Daily Mail appear to use a similar approach to writing articles on the matter. 

Thu 19 Feb 2015 at 07:47AM by Andy Pearce in Politics tagged with agile and politics
