# ☑ C++11: Initialization

This is part 2 of the “C++11 Features” series which started with C++11: Move Semantics.

I’ve finally started to look into the new features in C++11 and I thought it would be useful to jot down the highlights, for myself or anyone else who’s curious. Since there’s a lot of ground to cover, I’m going to look at each item in its own post — this one covers changes to initialization of variables.

Following on from my previous post on C++11’s new features, today I’m looking at a couple of changes to the way initialization works.

## Extended initializer lists

As an extension to C, C++03 supports initialization of arrays and structures by listing values in curly brackets, even allowing such definitions to be nested:

```cpp
struct RGB {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
};

struct Pixel {
    int x;
    int y;
    struct RGB colour;
};

Pixel pixelArray[] = {{0, 0, {0xff, 0xff, 0xff}},
                      {1, 0, {0xff, 0x00, 0x00}},
                      /* ... */};
```

Since there’s no difference between a class and struct in C++ except for the default access specifier being public in a struct and private in a class, this applies equally to both. However, in C++03 it’s only permitted if the type is POD1 — in C++11 it has been extended to cover all class types.

The following example demonstrates the difference:

```cpp
#include <iostream>

struct PODClass
{
    int member1;
    float member2;
    char member3;
};

class NonPODClass
{
  public:
    NonPODClass(int arg1, float arg2);
    virtual void method();

  private:
    int member1;
    float member2;
    char member3;
};

NonPODClass::NonPODClass(int arg1, float arg2)
    : member1(arg1), member2(arg2), member3('x')
{
    std::cout << "NonPODClass constructed" << std::endl;
}

void NonPODClass::method()
{
    // ...
}

int main()
{
    PODClass podInstance = {1, 2.3, 'a'};  // Valid in C++03 and C++11
    NonPODClass cpp03InitStyle(2, 4.6);    // Valid in C++03 and C++11
    NonPODClass cpp11InitStyle = {3, 6.9}; // Only valid in C++11

    // ...
    return 0;
}
```

So far it doesn’t seem too much of a stretch, just a fancy way of calling a constructor. However, it’s also possible to override which constructor is called when an initializer list is provided, by defining a constructor that takes a std::initializer_list<>. For example, consider this class:

```cpp
#include <cstddef>
#include <initializer_list>
#include <stdexcept>

class MyIntArray
{
  public:
    MyIntArray(std::initializer_list<int> values);
    ~MyIntArray();
    int operator[](size_t index);

  private:
    size_t size;
    int *array;
};

MyIntArray::MyIntArray(std::initializer_list<int> values)
    : size(values.size()), array(NULL)
{
    array = new int[values.size()];
    int *p = array;
    for (std::initializer_list<int>::iterator it = values.begin();
         it != values.end(); ++it) {
        *(p++) = *it;
    }
}

MyIntArray::~MyIntArray()
{
    delete[] array;  // new[] must be paired with delete[]
}

int MyIntArray::operator[](size_t index)
{
    if (index >= size) {
        throw std::out_of_range("index invalid");
    }
    return array[index];
}
```

The following main() function shows how it could be used:

```cpp
#include <iostream>

int main()
{
    MyIntArray fib = {1, 1, 2, 3, 5, 8, 13, 21};
    std::cout << "5th element: " << fib[4] << std::endl;
    return 0;
}
```

All the STL container types have also been updated to support this form of initialization.

It turns out that std::initializer_list<> is just a standard type, albeit one which can only be constructed using the curly-bracket syntax. After that it can be copied, passed into functions and otherwise manipulated, although the list itself is not mutable after creation. For example, the main() function above could be modified to read:

```cpp
#include <iostream>

int main(int argc, char *argv[])
{
    // Note: assigning a braced list to an existing initializer_list
    // would leave it dangling (the backing array is a temporary), so
    // initialise both lists up front and copy the one we want.
    std::initializer_list<int> fibs = {1, 1, 2, 3, 5, 8, 13, 21};
    std::initializer_list<int> facts = {1, 1, 2, 6, 24, 120, 720, 5040};

    std::initializer_list<int> init = (argc > 1) ? fibs : facts;
    MyIntArray arr = init;
    std::cout << "5th element: " << arr[4] << std::endl;
    return 0;
}
```

## Uniform initialization

Somewhat related to extended initializer lists are some changes which ensure that all objects can be initialised with the same syntax, regardless of their type. One example of where C++03 falls down here is the most vexing parse rule, where what looks like the construction of an object is actually parsed as a function declaration. It’s also convenient to have a constructor-like syntax which works with POD types, even though they have no user-defined constructor.

In C++11 the initializer list syntax can be used like a constructor for any type by putting braces in place of brackets, like so:

```cpp
MyIntVector instance{1, 2, 3, 4};
```

When used with POD types this will work like an initializer list and when used with other class types it will invoke the appropriate constructor. The standard form of constructor invocation with round brackets is still sometimes required if a constructor taking std::initializer_list<> is defined and an alternate constructor needs to be called.

Perhaps these initialization changes aren’t all that sweeping, but it’s nice to see some more consistent behaviour between simple and complex types, and I suspect that these changes may be particularly useful when defining a class which may be templated on both POD and non-POD types.

1. Plain Old Data: a class with no constructors and no non-public data members. In fact this is a simplification — see this SO answer for more details. Note that this covers C++03; the definition of POD changed a little in C++11, which I’ll cover in a later post.

15 Jul 2013 at 1:38PM by Andy Pearce in Software  | Photo by Annie Spratt on Unsplash  | Tags: c++  |  See comments

# ☑ Passwords: You’re doing it wrong

There are few technical topics about which there’s more FUD than picking a strong password.

I’m getting a little sick of how much misinformation there is about passwords.

More or less everyone who’s been online for any length of time knows that picking a “good” password is important. After all, it’s more or less the only thing which stands between a potential attacker and unlimited access to your online services. So, what constitutes “good”?

Well, many, many services will recommend that you make your passwords at least 8 characters in length, avoid words that you’d find in a dictionary and make sure you include mixed case letters, numbers and symbols. Now, if you need a password which is both good and short then this is probably quite reasonable advice. If you picked 8 characters truly at random from all letters, numbers and symbols, that’s something like 4 × 10¹⁵ (four million billion) passwords to choose from, and that’s secure enough.

However, these passwords have a bit of a drawback — humans are bloomin’ awful at remembering things like this: 1Xm}4q3=. Hence, people tend to start with something which is a real word — humans are good at remembering words — and then apply various modifications, like converting letters to similar-looking digits. This process gives people a warm, fuzzy feeling because their password ends up with all these quirky little characters in it, but in reality something like sH33pd0g is a lot worse than just picking random characters.

Let’s say there are about 10,000 common words1 and that each letter can be one of four variants2. That gives around 6 × 10⁸ (six hundred million) possibilities. This might seem like a lot, but if someone can check a thousand passwords every second then it’ll only take them about a week to go through them all.

So, if that’s a bad way to pick memorable passwords, what’s a good way? Well, there are a few techniques which can help if you still need a short password. One of the best I know of is to pick a whole sentence and simply include the first letter of each word. Then mix up the case and swap letters for symbols as already discussed. This at least keeps the password away from being a dictionary word that can be easily guessed, although it’s still a pain to have to remember which letters have been swapped for which numbers and so on.
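As a toy sketch of the first-letters technique in Python (both the example sentence and the substitution table here are my own arbitrary choices, not a recommendation):

```python
def initials_password(sentence):
    """Build a password from the first letter of each word of a
    memorable sentence, swapping some letters for look-alike symbols."""
    subs = {"i": "1", "e": "3", "a": "@", "o": "0"}  # arbitrary mapping
    letters = [word[0].lower() for word in sentence.split()]
    return "".join(subs.get(c, c) for c in letters)

print(initials_password("the quick brown fox jumps over the lazy dog"))
# tqbfj0tld
```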

But wait, do we really need such a short password? If we make it longer, perhaps we don’t need to avoid dictionary words at all? After all, the best password is one that is both secure and memorable, so we don’t have to write it down on a Post-It note3 stuck on our computer. Fortunately the repository of all knowledge xkcd has the answer4!

As it turns out, picking four words more or less at random is much better from a security point of view and, for most people at least, is probably quite a bit easier to remember. Using the same 10,000 word estimate from earlier, picking four words entirely at random gives us 1 × 10¹⁶ (ten quadrillion, or ten million billion) possibilities. At a thousand per second this would take over 300,000 years to crack. The beauty is that you don’t need to remember anything fancy like strange numbers of symbols — just four words.
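The arithmetic behind those figures, using the same rough assumptions as the rest of the post (a 10,000-word vocabulary and 1,000 guesses per second):

```python
vocab = 10_000
combos = vocab ** 4                     # four words chosen at random: 10^16
seconds = combos / 1_000                # at 1,000 guesses per second
years = seconds / (60 * 60 * 24 * 365)
print("%d combinations, about %.0f years" % (combos, years))
```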

Is it any more memorable? Well, a random series of characters doesn’t give you anything to get to grips with — it’s a pure memory task, and that’s tough for a lot of people. However, if you’ve got something like enraged onlooker squanders dregs or scandalous aardvark replies trooper then you can immediately picture a scene to help you remember it.

So let’s stop giving people daft advice about passwords and certainly let’s remove all those irritating checks on passwords to make sure they’re “secure”, when in reality the net effect of forcing people to use all these numbers and strange symbols is more or less the opposite. Most of all, let’s make sure that our online services accept passwords of at least 128 characters so that people can pick properly good passwords, not the ones that everyone’s been browbeaten into believing are good.

As an aside, even with this scheme it’s still really important to pick words at random and that’s something humans don’t do very well. Inspired by the xkcd comic I linked earlier, this site was set up to generate random passwords. Alternatively, if you’re on a Linux system you could use a script something like this to pick one for you5:

```python
#!/usr/bin/python

import random
import string

# Build a deduplicated word list, stripped down to lowercase letters.
# (random.sample() needs a sequence, so convert the set to a list.)
with open("/usr/share/dict/words", "r") as fd:
    words = list(set("".join(c for c in line.strip().lower()
                             if c in string.ascii_lowercase)
                     for line in fd))

print("Choosing 10 phrases from dict of %d words\n" % (len(words),))
print("\n".join(" ".join(random.sample(words, 4)) for i in range(10)))
```

One final point. You might be thinking it’s going to be a lot slower to type four whole words than eight single characters, but actually it’s often almost as fast once you don’t need to worry about fiddling around with the SHIFT key and all those odd little symbols you never normally use, like the helium-filled snake (~)6 and the little gun (¬)7.

Especially on a smartphone keyboard. Let’s face it, if you’ve just upgraded your phone and are seeking help with it in some online forum, there isn’t a way to ask “how do I get the little floating snake?” without looking like a bit of an idiot — clearly this is the most important advantage of all.

1. While it’s true that the OED has something like a quarter of a million words, the average vocabulary is typically significantly smaller.

2. So o could also be O, 0 or (), say.

3. Although the dangers of writing passwords down is generally heavily overestimated by many people. I’m not saying it’s a good idea, but having a really good password on a scrap of paper in your wallet, say, is still a lot better for most average users than a really poor password that you can remember.

4. Note that Randall’s figures differ from mine somewhat and are probably rather more accurate — I was just plucking some figures in the right order of magnitude out of the air to illustrate the issues involved.

5. It’s written to run on a Linux system but about the only thing it needs that’s platform-specific is the filename of a word list.

6. Yes, yes, I know it’s called a tilde really.

7. OK, fine, it’s a logical negation symbol. Still looks like a little gun to me; or maybe hockey stick; or an allen key; or the edge of a table; a table filled with cakes… Yes, I like that one best. Hm, now I’m hungry.

11 Jul 2013 at 3:55PM by Andy Pearce in Software  | Photo by Jose Fontano on Unsplash  | Tags: security python  |  See comments

# ☑ C++11: Move Semantics

This is part 1 of the “C++11 Features” series.

I’ve finally started to look into the new features in C++11 and I thought it would be useful to jot down the highlights, for myself or anyone else who’s curious. Since there’s a lot of ground to cover, I’m going to look at each item in its own post — this one covers move semantics.

As most C++ programmers will know, a new version of the standard was approved a couple of years ago, replacing the previous C++03. This is called C++11, and was formerly known as C++0x. Since I’ve recently happened across a few Stack Overflow questions which mentioned C++11 features I thought I’d have a look over the (to me, at least) more interesting ones and jot down the highlights here for anyone who’s interested.

This post covers move semantics.

The STL tends to be very clever at enabling fairly high-level functionality whilst minimising performance impact. One of its weakest areas, however, is the fact that one often wants to initialise containers from temporary values, or return a container from a function by value, and this involves a potentially expensive copy.

This issue has been improved in C++11 with the addition of move constructors. These are the same as copy constructors in essence, but they take a non-const reference to the source and are not required to preserve the source object’s value (it need only be left safe to destroy). They are used by the compiler in cases like copying from a temporary value, where the source object is about to go out of scope anyway and hence cannot be accessed after the operation.

This allows classes to implement more efficient copy constructors by having the destination class take ownership of some underlying data directly instead of having to copy it, similar in function to things like std::vector::swap().

In C++03 this couldn’t work because references to temporary values could only ever be const:

```cpp
std::string function()
{
    return std::string("hello, world");
}

void anotherFunction()
{
    // const is required on line below or code won't compile.
    const std::string& str = function();
    std::cout << "Value: " << str << std::endl;
}
```

To enable this behaviour, C++11 adds a new type of reference known as an rvalue reference. These may only be bound to rvalues (i.e. temporary values), but unlike ordinary references to temporaries they are allowed to be non-const. They are specified using an extra ampersand as shown below:

```cpp
std::string function()
{
    return std::string("hello, world");
}

void anotherFunction()
{
    // Note extra & denoting rvalue ref, allowed to be non-const.
    std::string&& str = function();
    std::cout << "Value: " << str << std::endl;
}
```

If a function or method is overloaded with different variants which take rvalue and lvalue references then this allows code to behave more optimally when dealing with temporary values which can safely be invalidated. As well as the previously-mentioned move constructor it’s also possible to define move assignment operators in the same way.

The following trivial class which holds a block of memory shows the definition of both a move and copy constructor to illustrate how the move constructor is more efficient:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

class MemoryBlob
{
  public:
    // Standard constructor -- expensive copy from caller's buffer.
    MemoryBlob(const char *blob, size_t blobSize)
        : size(blobSize), buffer(new char[size])
    {
        memcpy(buffer, blob, size);
    }

    // Copy constructor -- expensive copy from other object's buffer.
    MemoryBlob(const MemoryBlob& other)
        : size(other.size), buffer(new char[size])
    {
        memcpy(buffer, other.buffer, size);
    }

    // C++11 move constructor -- cheap theft of other object's pointer.
    MemoryBlob(MemoryBlob&& other)
        : size(other.size), buffer(other.buffer)
    {
        other.buffer = NULL;
    }

    // Destructor -- remember delete of NULL is harmless.
    ~MemoryBlob()
    {
        delete[] buffer;
    }

    // Standard assignment operator.
    MemoryBlob& operator=(const MemoryBlob& other)
    {
        size = other.size;
        delete[] buffer;
        buffer = new char[size];
        memcpy(buffer, other.buffer, size);
        return *this;
    }

    // C++11 move assignment operator.
    MemoryBlob& operator=(MemoryBlob&& other)
    {
        size = other.size;
        delete[] buffer;
        buffer = other.buffer;
        other.buffer = NULL;
        return *this;
    }

    // Remainder of class...

  private:
    size_t size;
    char *buffer;
};

int main()
{
    std::vector<MemoryBlob> vec;
    vec.push_back(MemoryBlob("hello", 5));  // Will invoke move constructor.
    return 0;
}
```

In the example above it’s important to note how the buffer pointer of the source object gets reset to NULL during the move constructor and move assignment operator. Without this, the destructor of the source object would delete the pointer now held by the destination, causing all sorts of mischief.

As a final note, named variables are never considered rvalues — only lvalue references to them can be created. There are, however, cases where you may need to treat an lvalue reference as an rvalue. In these instances the std::move() function can be used to “cast” an lvalue reference to an rvalue version. Of course, careless use of this could cause all sorts of problems, just as with casting.

9 Jul 2013 at 11:04AM by Andy Pearce in Software  | Photo by Annie Spratt on Unsplash  | Tags: c++  |  See comments

# ☑ Chunky requests

Why have webservers been so slow to accept chunked requests?

HTTP is, in general, a good protocol.

That is to say it’s not awful — in my experience of protocols, not being awful is a major achievement. Perhaps I’ve been unfairly biased by dealings in the past with the likes of SNMP, LDAP and almost any P2P file sharing protocol you can mention1, but it does seem like they’ve all got some major annoyance somewhere along the line.

As you may already be aware, the current version of HTTP is 1.1 and this has been in use almost ubiquitously for over a decade. One of the handy features that was introduced in 1.1 over the older version 1.0 was chunked encoding. If you’re already familiar with it, skip the next three paragraphs.

HTTP requests and responses consist of a set of headers, which define information about the request or response, and then optionally a body. In the case of a response, the body is fairly obviously the file being requested, which could be HTML, image data or anything else. In the case of a request, the body is often omitted for performing a simple GET to download a file, but when doing a POST or PUT to upload data then the body of the request typically contains the data being uploaded.

In HTTP, as in any protocol, the receiver of a message must be able to determine where the message ends. For a message with no body this is easy, as the headers follow a defined format and are terminated with a blank line. When a body is present, however, it can potentially contain any data so it’s not possible to specify a fixed terminator. Instead, it can be specified by adding a Content-Length header to the message — this indicates the number of bytes in the body, so when the receiving end has that many bytes of body data it knows the message is finished.

Sending a Content-Length isn’t always convenient, however — for example, many web pages these days are dynamically generated by server-side applications and hence the size of the response isn’t necessarily known in advance. It can be buffered up locally until it’s complete and then the size of it can be determined, a technique often called store and forward. However, this consumes additional memory on the sending side and increases the user-visible latency of the response by preventing a browser from fetching other resources referenced by the page in parallel with fetching the remainder of the page. As of HTTP/1.1, therefore, a piece-wise method of encoding data known as chunked encoding was added. In this scheme, body data is split into variable-sized chunks and each individual chunk has a short header indicating its size and then the data for that chunk. This means that only the size of each chunk need be known in advance and the sending side can use whatever chunk size is convenient2.
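A minimal sketch of the wire format in Python (the chunk size of 8 is arbitrary; a real sender uses whatever size is convenient):

```python
def chunk_encode(data, chunk_size=8):
    """Encode a byte string using HTTP/1.1 chunked transfer-coding:
    each chunk is a hex length, CRLF, the data, CRLF; a zero-length
    chunk marks the end of the body."""
    out = b""
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

print(chunk_encode(b"hello, world"))
# b'8\r\nhello, w\r\n4\r\norld\r\n0\r\n\r\n'
```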

So, chunked encoding is great — well, as long as it’s supported by both ends, that is. If you look at §3.6.1 of the HTTP RFC, however, it’s mandatory to support it — the key phrase is:

> All HTTP/1.1 applications MUST be able to receive and decode the “chunked” transfer-coding […]

So, it’s safe to assume that every client, server and library supports it, right? Well, not quite, as it turns out.

In general, support for chunked encoding of responses is pretty good. Of course, there will always be the odd homebrew library here and there that doesn’t even care about RFC-compliance, but the major HTTP clients, servers and libraries all do a reasonable job of it.

Chunk-encoded requests, on the other hand, are a totally different kettle of fish3. For reasons I’ve never quite understood, support for chunk-encoded requests has always been patchy, despite the fact that a POST or PUT request may feasibly be as large as any response — for example, when uploading a large file. Sure, there isn’t the same latency argument, but you still don’t want to force the client to buffer up the whole request before sending it just for the sake of lazy programmers.

For example, the popular nginx webserver didn’t support chunk-encoded requests in its core until release 1.3.9, a little more than seven months ago — admittedly there was a plugin to do it in earlier versions. Another example I came across recently was that Python’s httplib module doesn’t support chunked requests at all, even if the user does the chunking — this doesn’t seem to have changed in the latest version at time of writing. As it happens you can still do it yourself, as I recently explained to someone in a Stack Overflow answer, but you have to take care to make sure you don’t provide enough information for httplib to add its own Content-Length header — providing both that and chunked encoding is a no-no5, although the chunk lengths should take precedence according to the RFC.
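For the curious, doing the chunking yourself looks something like the sketch below, shown with the Python 3 module name http.client (httplib's successor); the helper names are my own. The key point is to never give the library a body or a length, so it has no opportunity to emit its own Content-Length header.

```python
import http.client

def frame_chunk(chunk):
    """Wire format for one chunk: hex length, CRLF, data, CRLF."""
    return b"%x\r\n" % len(chunk) + chunk + b"\r\n"

def put_chunked(host, path, chunks):
    """Send a PUT request with a hand-rolled chunk-encoded body."""
    conn = http.client.HTTPConnection(host)
    conn.putrequest("PUT", path)
    conn.putheader("Transfer-Encoding", "chunked")
    conn.endheaders()
    for chunk in chunks:
        if chunk:                  # an empty chunk would end the body early
            conn.send(frame_chunk(chunk))
    conn.send(b"0\r\n\r\n")        # terminating zero-length chunk
    return conn.getresponse()
```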

What really puzzles me is how such a fundamental (and mandatory!) part of the RFC can have been ignored for requests for so long? It’s almost as if these people throw their software together based on real-world use-cases and not by poring endlessly over the intricate details of the standards documents and shunning any involvement with third party implementations. I mean, what’s all this “real world” nonsense? Frankly, I think it’s simply despicable.

But on a more serious note, while I can entirely understand how people might think this sort of thing isn’t too important (and don’t even get me started on the lack of proper support for “100 Continue”6), it makes it a really serious pain when you want to write properly robust code which won’t consume huge amounts of memory even when it doesn’t know the size of a request in advance. If it was a tricky feature I could understand it, but I don’t reckon it can take more than 20 minutes to support, including unit tests. Heck, that Stack Overflow answer I wrote contains a pretty complete implementation and that took me about 5, albeit lacking tests.

So please, the next time you’re working on a HTTP client library, just take a few minutes to implement chunked requests properly. Your coding soul will shine that little bit brighter for it. Now, about that “100 Continue” support… OK, I’d better not push my luck.

1. The notable exception being BitTorrent which stands head and shoulders above its peers. Ahaha. Ahem.

2. Although sending data in chunks that are too small can cause excessive overhead as this blog post illustrates.

3. Trust me, you don’t want a kettle of fish on your other hand, especially if it’s just boiled. Come to think of it, who cooks fish in a kettle, anyway?4

4. Well, OK, I’m pedantic enough to note that kettle originally derives from ketill which is the Norse word for “cauldron” and it didn’t refer to the sort of closed vessel we now think of as a “kettle” when the phrase originated. I’m always spoiling my own fun.

5. See §4.4 of the HTTP RFC item 3.

6. Used at least by the Amazon S3 REST API

3 Jul 2013 at 2:20PM by Andy Pearce in Software  | Photo by Annie Spratt on Unsplash  | Tags: http web  |  See comments

# ☑ Tuning in the static

In C++ the static keyword has quite a few wrinkles that may not be immediately apparent. One of them relates to the order of construction and destruction of function-local statics, which I briefly describe here.

The static keyword will be familiar to most C and C++ programmers. It has various uses, but for the purposes of this post I’m going to focus on static, local variables within a function.

In C you might find code like this:

```c
int function()
{
    static int value = 9;
    return ++value;
}
```

On each call this function will return int values starting at 10 and incrementing by one on each call, as value is static and hence persists between calls to the function. The following, however, is not valid in C:

```c
int another_function()
{
    return 123 + 456;
}

int function()
{
    static int value = another_function();
    return ++value;
}
```

This is because in C, objects with static storage must be initialised with constant expressions1. This makes life easy for the compiler because typically it can put the initial value directly into the data segment of the binary and then just omit any initialisation of that variable when the function is called.

Not so in C++ where things are rather more complicated. Here, static variables within a function or method can be initialised with any expression that their non-static counterparts would accept and the initialisation happens on the first call to that function2.

This makes sense when you think about it, because in C++ variables can be class types and their constructors can perform any arbitrary code anyway. So, calling a function to get the initial value isn’t really much of a leap. However, it’s potentially quite a pain for the compiler and, by extension, the performance-conscious coder as well.

The reason that this might impact performance is that the compiler can no longer perform initialisation by including literals in the binary, since the values aren’t, in general, known until runtime. It now needs to track whether the static variables have been initialised in a function, and it needs to check this every time the function is called. Now I’m not sure which approach compilers take to achieve this, but it’s most likely going to add some overhead3, even if just a little. In a commonly-used function on a performance-critical data path, this could become significant.

A further complicating factor is that each static variable must be separately tracked, because any given run through the function may not pass the definition at all if it’s within a conditional block. Also, objects are required4 to be destroyed in the reverse of the order in which they were constructed. Put these two together and there’s quite a bit of variability — consider this small program:

```cpp
#include <iostream>
#include <string>

class MyClass
{
  public:
    MyClass(std::string id);
    ~MyClass();

  private:
    std::string id_;
};

MyClass::MyClass(std::string id) : id_(id)
{
    std::cout << "Created " << id << std::endl;
}

MyClass::~MyClass()
{
    std::cout << "Destroyed " << id_ << std::endl;
}

void function(bool do_first)
{
    if (do_first) {
        static MyClass first("first");
    }
    static MyClass second("second");
}

int main(int argc, char *argv[])
{
    MyClass instance("zero");
    function(argc % 2 == 0);
    if (argc > 2) {
        function(true);
    }
    return 0;
}
```

With this code, you get a different order of construction and destruction based on the number of arguments you provide. We may skip the construction of first entirely:

```
$ static-order
Created zero
Created second
Destroyed zero
Destroyed second
```

We may construct first and second on the first call to function() and hence have them destroyed in the opposite order:

```
$ static-order one
Created zero
Created first
Created second
Destroyed zero
Destroyed second
Destroyed first
```


Or we may skip over constructing first on the first call, and then have it performed on the second, in which case we get the opposite order of destruction to the above:

```
$ static-order one two
Created zero
Created second
Created first
Destroyed zero
Destroyed first
Destroyed second
```


In all cases you’ll note that zero is both constructed and destroyed first. It’s constructed first because it’s created at the start of main() before any calls to function(). It’s destroyed first because this happens once main() goes out of scope, which happens just prior to the termination of the program which is the point at which static objects, whether local or global, go out of scope.

Static variables get more complex every time you look at them — I haven’t covered the order (or lack thereof) of initialising static objects in different compilation units, and we haven’t even begun to talk about multithreaded environments yet…

Just be careful with static. Oh, and, uh, the rest of C++ too, I suppose. In fact, have you ever considered Python?

1. See §6.7.8.4 of the C standard

2. Incidentally, this enables a useful way to prevent the static initialisation order fiasco, but that’s another story.

3. Well, if you assume you can write to your own code section then there are probably ways of branching to the static initialiser and then overwriting the branch with a no-op or similar, and this would be almost zero overhead. However, I believe on Linux at least the code section is read-only at runtime which puts the kibosh on sneaky tricks like that.

4. See §3.6.3.1 of the C++ standard

14 Jun 2013 at 4:18PM by Andy Pearce in Software  | Photo by Frantzou Fleurine on Unsplash  | Tags: c++ c  |  See comments

# ☑ Just like old

If you have the luxury of migrating your Linux installation to a new hard disk before the old one packs up entirely, it’s quite easily done with standard tools.

Recently on my Linux machine at work I started getting some concerning emails from smartd which looked like this:

```
Device: /dev/sda [SAT], 7 Offline uncorrectable sectors
```


Any errors from smartd are a cause for concern, but particularly this one. To explain, a short digression — anybody familiar with SMART and hard drives can skip the next three paragraphs.

Hard disks are split into sectors which are the smallest units which can be addressed1. Each sector corresponds to a tiny portion of the physical disk surface and any damage to the surface, such as scratches or particles of dust, may render one or more of these sectors inaccessible — these are often called bad sectors. This has been a problem since the earliest days of hard drives, so operating systems have been designed to cope with sectors that the disk reports as bad, avoiding their use for files.

Modern hard disk manufacturers have more or less accepted that some proportion of drives will have minor defects, so they reserve a small area of the disk for reallocated sectors, in addition to the stated capacity of the drive. When bad sectors are found, the drive’s firmware quietly relocates them to some of this spare space. The amount of space “wasted” by this approach is a tiny proportion of the space of the drive and saves manufacturers from having to deal with a steady stream of customers RMAing drives with a tiny proportion of bad sectors. There is a limit to the scope of this relocation, however, and when the spare space is exhausted the drive has no choice but to report the failures directly to the system2.

The net result of all this is that by the time your system is reporting bad sectors to you, your hard disk has probably already had quite a few physical defects crop up. The way hard drives work, this often means the drive is starting to degrade and may suffer a catastrophic failure soon — this was confirmed by a large-scale study by Google a few years ago. So, by the time your operating system starts reporting disk errors, it may already be too late to do anything about it. This is where SMART comes in — it’s a method of querying information from hard disks, including such items as the number of sectors which the drive has quietly reallocated for you.

The smartd daemon uses SMART to monitor your disks and watch for changes in the counters which may indicate a problem. Increases in the reallocated sector count should be watched carefully — occasionally these might be isolated instances, but if you see this number change continuously over a few days or weeks then you should assume the worst and plan for your hard disk to fail at any moment. The counter I mentioned above, the offline uncorrectable sector count, is even worse — this means that the drive encountered an error it couldn’t solve when reading or writing part of the disk. This is also a strong indicator of failure.
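If you’re curious what these counters look like on your own machine, `smartctl -A /dev/sda` prints the raw attribute table. As a rough illustration (this snippet isn’t from the original post, and the exact table layout can vary between smartmontools versions), here’s how you might pull the two counters mentioned above out of that output in Python:

```python
# Sample of the attribute table printed by "smartctl -A" -- the column
# layout below is typical, but treat it as an illustration only.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       7
"""

def smart_counters(output, names=("Reallocated_Sector_Ct", "Offline_Uncorrectable")):
    """Extract the raw values of the given SMART attributes from smartctl -A output."""
    counters = {}
    for line in output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID followed by the attribute
        # name; the raw value is the last (tenth) column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            counters[fields[1]] = int(fields[9])
    return counters

print(smart_counters(SAMPLE))
# → {'Reallocated_Sector_Ct': 12, 'Offline_Uncorrectable': 7}
```

In practice you’d feed this the output of `sudo smartctl -A /dev/sda`, since querying the drive needs root.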

So, I know my hard disk is about to fail — what can I do about it? The instructions below cover my experiences on an Ubuntu 12.04 system, but the process should be similar for other distributions. Note that this is quite a low-level process which assumes a fair degree of confidence with Linux and is designed to duplicate exactly the same environment. You may find it easier to simply back up your important files and reinstall on to a fresh disk.

Since I use Linux, it turns out to be comparatively easy to migrate over to a new drive. The first step is to obtain a replacement hard disk, then power the system off, connect it up to a spare SATA socket and boot up again. At this point, you should be able to partition it with fdisk, presumably in the same way as your current drive but the only requirement is that each partition is at least big enough to hold all the current files in that part of your system. Once you’ve partitioned it, format the partitions with, for example, mke2fs and mkswap as appropriate. At this point, mount the non-swap partitions in the way that they’ll be mounted in the final system but under some root — for example, if you had just / and /home partitions then you might do:

sudo mkdir /mnt/newhdd
sudo mount /dev/sdb1 /mnt/newhdd
sudo mkdir /mnt/newhdd/home
sudo mount /dev/sdb5 /mnt/newhdd/home


Important: make sure you replace /dev/sdb with the actual device of your new disk. You can find this out using:

sudo lshw -class disk


… and looking at the logical name fields of the devices which are listed.

At this point you’re ready to start copying files over to the new hard disk. You can do this simply with rsync, but you have to provide the appropriate options to copy special files across and avoid copying external media and pseudo-filesystems:

rsync -aHAXvP --exclude="/mnt" --exclude="/lost+found" --exclude="/sys" \
--exclude="/proc" --exclude="/run/shm" --exclude="/run/lock" / /mnt/newhdd/


You may wish to exclude other directories too — I suggest running mount and excluding anything else which is mounted with tmpfs, for example.

You can leave this running in the background — it’s likely to take quite a long time for a system which has been running for a while, and it also might impact your system’s performance somewhat. It’s quite safe to abort and re-run another time — the rsync with those parameters will carry on exactly where it left off.

While this is going on, make sure you have an up-to-date rescue disk available, which you’ll need as part of the process. I happened to use the Ubuntu Rescue Remix CD, but any reasonable rescue or live CD is likely to work. It needs to have rsync, grub, the blkid utility and a text editor available and be able to mount all your filesystem types.

Once that command has finished, you then need to wait until you’re ready to do the switch. Make sure you won’t need to be interrupted or use the PC for anything for at least half an hour. If you had to abandon the process and come back to it later, make sure you re-run the above rsync command just prior to doing the following — the idea is to make sure the two systems are as closely synchronised as possible.

When you’re ready to proceed, shut down the system, open it up and swap the two disks over. Strictly speaking you probably don’t need to swap them, but I like to keep my system disk as /dev/sda so it’s easier to remember. Just make sure you remember that they’re swapped now!

Now boot the system into the rescue CD you created earlier. Get to a shell prompt and mount your drives — let’s say that the new disk is now /dev/sda and the old one is /dev/sdb, continuing the two partition example from earlier, then you’d do something like this:

mkdir /mnt/newhdd /mnt/oldhdd
mount /dev/sda1 /mnt/newhdd
mount /dev/sdb1 /mnt/oldhdd
mount /dev/sda5 /mnt/newhdd/home
mount /dev/sdb5 /mnt/oldhdd/home


I’m assuming you’re already logged in as root — if not, you’ll need to use sudo or su as appropriate. This varies between rescue systems.

As you can see, the principle is to mount both old and new systems in the same way as they will be used. At this point you can then invoke something similar to the rsync from earlier, except with the source changed slightly. Note that you don’t need all those --exclude options any more because the only things mounted should be the ones you’ve manually mounted yourself, which are all the partitions you actually want to copy:

rsync -aHAXvP /mnt/oldhdd/ /mnt/newhdd/


Once this final rsync has finished, you’ll need to tweak a few things on your target drive before you can boot into it. After this point do not run rsync again or you will undo the changes you’re about to make.

First, you need to update /mnt/newhdd/etc/fstab to reflect your new hard disk. If you take a look, you’ll probably find that the lines for the standard partitions start like this:

UUID=d3964aa8-f237-4b34-814b-7176719b2e42


What you need to do is replace these UUIDs with the ones from your new drive. You can find this out by running blkid which should output something like this:

/dev/sda1: UUID="b7299d50-8918-459f-9168-2a743f462658" TYPE="swap"
/dev/sda2: LABEL="/" UUID="43f6065c-d141-4a64-afda-3e0763bbbc9a" TYPE="ext4"
/dev/sdb2: LABEL="/" UUID="d3964aa8-f237-4b34-814b-7176719b2e42" TYPE="ext4"


Copy the UUID fields for your new disk into fstab over the top of the old ones, being careful not to pick up a stray quote or similar.
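If you have more than a couple of partitions, a few lines of Python can do the substitutions for you. This is purely illustrative (a text editor works just as well) and reuses the UUIDs from the examples above:

```python
def replace_uuids(fstab_text, uuid_map):
    """Return fstab_text with each old UUID replaced by its new one.

    uuid_map maps the old drive's UUIDs to the corresponding new ones.
    """
    for old, new in uuid_map.items():
        fstab_text = fstab_text.replace(old, new)
    return fstab_text

fstab = "UUID=d3964aa8-f237-4b34-814b-7176719b2e42 / ext4 errors=remount-ro 0 1\n"
print(replace_uuids(fstab, {
    "d3964aa8-f237-4b34-814b-7176719b2e42": "43f6065c-d141-4a64-afda-3e0763bbbc9a",
}))
```

You’d read `/mnt/newhdd/etc/fstab`, pass its contents through this, and write the result back — after double-checking it by eye, of course.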

The other thing you need to do is change the grub.cfg file to refer to the new UUIDs. Typically this file is auto-generated, but you can use a simple search and replace to update the UUIDs in the old file. First grep the file for the UUID of the old root partition to make sure you’re changing the right thing:

grep d3964aa8-f237-4b34-814b-7176719b2e42 /mnt/newhdd/boot/grub/grub.cfg


Then replace it with the new one, with something like this:

cp /mnt/newhdd/boot/grub/grub.cfg /mnt/newhdd/boot/grub/grub.cfg.orig
sed 's/d3964aa8-f237-4b34-814b-7176719b2e42/43f6065c-d141-4a64-afda-3e0763bbbc9a/g' \
    /mnt/newhdd/boot/grub/grub.cfg.orig > /mnt/newhdd/boot/grub/grub.cfg


As an aside, there’s probably a more graceful way of using update-grub to re-write the new configuration, but I found it a lot easier just to do the search and replace like this.

At this point you should install the grub bootloader on to the new disk’s boot sector:

grub-install --recheck --no-floppy --root-directory=/mnt/newhdd /dev/sda


Finally, you should be ready to reboot. Cross your fingers!

If your system doesn’t come back up then I suggest you use the rescue CD to fix things. Also, since you haven’t actually written anything to the old disk, you should always be able to swap the disks back and try to figure out what went wrong.

At this point you should shut your system down again, remove the old disk entirely and try booting up again. If your system came back up before but fails now, it was probably booting off the old disk, which suggests the boot sector wasn’t installed on the new disk properly.

Hopefully that’s been of some help to someone — by all means leave a comment if you have any issues with it or you think I’ve made a mistake. Good luck!

1. Typically they’re 512 bytes, although recently drives with larger sectors have started to crop up.

2. Strictly speaking there’s also a small performance penalty when accessing a reallocated sector on a drive, so they’re also bad news in performance-critical servers — typically this isn’t relevant to most people, however.

5 Jun 2013 at 6:02PM by Andy Pearce in Software  | Photo by Cara Fuller  | Tags: linux backup  |  See comments

# ☑ Jinja Ninja

I recently had to do a few not-quite-trivial things with the Jinja2 templating engine, and the more I use it the more I like it.

This blog is generated using a tool called Pelican, which generates a set of static HTML files from Markdown pages and other source material. It’s a simple yet elegant tool, and you can customise its output using themes. This site uses a theme I created myself called Graphite. Of course, you’d know all this if you read the little footer at the bottom of the page1.

As it happens, Pelican themes use Jinja2, which is one of the more popular Python templating languages. Since I recently had to do some non-trivial things with the site theme here, I thought I’d post my thoughts on it — the executive summary, for anybody who’s already bored, is that I think it’s rather good.

The main thing I wanted to achieve was to reorganise the archive of old posts into one page per year, with sections for each month. To index this I wanted a top-level page which simply linked to each month of each year, with no posts listed. One thing I didn’t want to do was have to change core Pelican, since I’m trying to keep this theme suitable for anybody (even though it’s unlikely that anyone but myself will ever use it).

Pelican already had some configuration which got me part of the way there. It’s possible for it to put pages into subdirectories according to year and month, and also create index.html pages in them to provide an appropriate index. This was a great starting point, but some work was still needed since the posts were presented to the template as a simple sorted list of objects with appropriate attributes.

I wanted the year indices (such as this one) to have links to individual posts organised under headings per month. This was fairly easy to achieve by recording the date of the previous post linked and emitting a header if the month and/or year of the post about to be linked differed. Here’s a snippet from the template which actually generates the links:

<h1>Archives</h1>
{% set last_date = None %}
<dl>
{% for article in dates %}
{% if last_date != (article.date.year, article.date.month) %}
<dt>{{ article.date|strftime("%b %Y") }}</dt>
{% endif %}
<dd>
<a href="{{ SITEURL }}/{{ article.url }}"
title="{{ article.locale_date }}: {{ article.summary|striptags|escape }}">
{{ article.title }}
</a>
</dd>
{% set last_date = (article.date.year, article.date.month) %}
{% endfor %}
</dl>


Pelican sets dates to an iterable of posts, each of which is a class instance with some appropriate attributes. You can see that setting a tracking variable last_date is simple enough, as is iterating over dates. Then we conditionally emit a <dt> tag containing the date if the current post’s date differs from the previous one2. Since last_date starts at None, this will always compare unequal the first time and emit the month for the first post. Thereafter, the heading is only emitted when the month (or year) changes. This approach does, of course, assume that dates yields in sorted order.
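For comparison, the same month-grouping logic in plain Python might use itertools.groupby. This is just a sketch, assuming (as Pelican provides) that each article carries a date and the list is already date-sorted — the dicts here stand in for Pelican’s article objects:

```python
import datetime
import itertools

def group_by_month(articles):
    """Yield ((year, month), [articles...]) groups from a date-sorted list."""
    for key, group in itertools.groupby(
            articles, key=lambda a: (a["date"].year, a["date"].month)):
        yield key, list(group)

posts = [
    {"title": "A", "date": datetime.date(2013, 6, 14)},
    {"title": "B", "date": datetime.date(2013, 6, 5)},
    {"title": "C", "date": datetime.date(2013, 5, 16)},
]
for (year, month), group in group_by_month(posts):
    print(year, month, [p["title"] for p in group])
# → 2013 6 ['A', 'B']
# → 2013 5 ['C']
```

The template’s last_date trick is essentially a hand-rolled groupby, which is the natural shape when the templating language only gives you a flat loop.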

The other points worth noting are the filters, which take the item on the left and transform it somehow. The strftime filter is provided by Pelican, and passes the input date and the format string parameter to strftime() in the obvious way. The striptags and escape filters are available as standard in Jinja2 — their operation should be fairly obvious from the code above.
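For anyone unfamiliar with those two filters, a rough pure-Python approximation (Jinja2’s real implementations handle more edge cases, such as HTML comments and entities) looks like this:

```python
import html
import re

def striptags(value):
    """Remove HTML tags and collapse whitespace, like Jinja2's striptags filter."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]*>", "", value)).strip()

def escape(value):
    """Replace HTML-special characters with entities, like Jinja2's escape filter."""
    return html.escape(value)

print(striptags("<p>Some  <em>summary</em></p>"))  # → Some summary
print(escape("a < b & c"))  # → a &lt; b &amp; c
```

Chaining them as `summary|striptags|escape` thus yields plain text that’s safe to drop into an HTML attribute like title.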

What I like is the way that I can write fairly natural Pythonic code, referring to attributes and the like, but still have it executed in a fairly secure sandboxed environment instead of just passed to the Python interpreter, where it could cause all sorts of mischief.

Also, there are a few useful extensions to basic Python builtins, such as loop.index to refer to the current iteration within a loop and loop.revindex for the offset from the end of the list, to easily identify final and penultimate items for special handling.

The other bit of Jinja2 that’s quite powerful is the concept of inheritance, something that seems to have become increasingly popular in templating engines. The way it’s done here is to be able to declare that one template extends another one:

{% extends base.html %}


The “polymorphism” aspect is handled with the ability to override “blocks”. So, perhaps base.html contains a declaration like this:

<head>
<title>{% block title %}Andy's Blog{% endblock %}</title>


Then, a page which wanted to override the title could do so by extending the base template and simply redeclaring the replacement block:

{% extends base.html %}
{% block title %}Andy's Other Page{% endblock %}


Finally, there’s also the ability to define macros, which are essentially parameterised snippets of markup that can be called like functions to substitute them into place with the appropriate arguments filled in. Here’s a trivial example:

{% macro introduction(name) %}
<p>
Hello, my name is {{ name }}.
</p>
{% endmacro %}


Of course, many of these features are provided by other templating engines as well, but I’ve found Jinja2 to be convenient, Pythonic and certainly fast enough for my purposes. I think it’ll be my templating engine of choice for the foreseeable future.

1. You know, the bit that absolutely nobody ever reads.

2. In the original Jinja engine the ifchanged directive provided a more convenient way to do this, but it’s been removed in Jinja2 as it was apparently inefficient.

1 Jun 2013 at 6:25PM by Andy Pearce in Software  | Photo by Jason Briscoe on Unsplash  | Tags: python  web html-templates  |  See comments

# ☑ Hooked on Github

Github’s web hooks make it surprisingly easy to write commit triggers.

I’ve been using Github for a while now and I’ve found it to be a very handy little service. Recently, though, I discovered just how easy it is to add commit triggers to it.

If you look under Settings for a repository and select the Service Hooks option, you’ll see a whole slew of pre-written hooks for integrating your repository into a variety of third party services. These range from bug trackers to automatically posting messages to IRC chat rooms. If you happen to be using one of these services, things are pretty easy.

If you want to integrate with your own service, however, things are almost as easy. In this post I’ll demonstrate just how easy by presenting a simple WSGI application which can keep one or more local repositories on a server synchronised, triggering a git pull whenever a commit is pushed to the origin.

Firstly, here’s the script:

import git
import json
import urlparse

class RequestError(Exception):
    pass

# Update this to include all the Github repositories you wish to watch.
REPO_MAP = {
    "repo-name": "/home/user/src/git-repo-path"
}

def handle_commit(payload):
    """Called for each commit on any watched repository."""
    try:
        # Only pay attention to commits on master.
        if payload["ref"] != 'refs/heads/master':
            return False
        # Obtain local path of repo, if found.
        repo_root = REPO_MAP.get(payload["repository"]["name"], None)
        if repo_root is None:
            return False
    except KeyError:
        raise RequestError("422 Unprocessable Entity")
    # This block performs a "git pull --ff-only" on the repository.
    repo = git.Repo(repo_root)
    repo.remotes.origin.pull(ff_only=True)
    return True

def application(environ, start_response):
    """WSGI application entry point."""
    try:
        # The Github webhook interface always sends us POSTs.
        if environ["REQUEST_METHOD"] != 'POST':
            raise RequestError("405 Method Not Allowed")
        # Extract and parse the body of the POST.
        post_data = urlparse.parse_qs(environ['wsgi.input'].read())
        # Github's webhook interface sends a single "payload" parameter
        # whose value is a JSON-encoded object.
        try:
            payload = json.loads(post_data["payload"][0])
        except (IndexError, KeyError, ValueError):
            raise RequestError("422 Unprocessable Entity")
        # If the request looks valid, pass to handle_commit() which
        # returns True if the commit was handled, False otherwise.
        if handle_commit(payload):
            start_response("200 OK", [("Content-Type", "text/plain")])
            return ["ok"]
        else:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return ["ignored ref"]
    except RequestError as e:
        start_response(str(e), [("Content-Type", "text/plain")])
        return ["request error"]
    except Exception as e:
        start_response("500 Internal Server Error",
                       [("Content-Type", "text/plain")])
        return ["unhandled exception"]

Aside from the Python standard library it also uses the GitPython library for accessing the Git repositories. Please also note that this application is a bare-bones example — it lacks important features such as logging and more graceful error-handling, and it could do with being rather more configurable, but hopefully it’s a reasonable starting point.

To use this application, update the REPO_MAP dictionary to contain all the repositories you wish to watch for updates. The key to the dictionary should be the name of the repository as specified on Github, the value should be the full, absolute path to a checkout of that repository where the Github repository is added as the origin remote (i.e. as if created with git clone). The repository should remain checked out on the master branch.

Once you have this application up and running you’ll need to note its URL. You then need to go to the Github Service Hooks section and click on the WebHook URLs option at the top of the list. In the text box that appears on the right enter the URL of your WSGI application and hit Update settings.

Now whenever you push a commit to the master branch of your Github repository, the web hook will trigger a git pull to keep the local repository up to date.

Primarily I’m hoping this serves as an example for other, more useful web hooks, but potentially something like this could serve as a way to keep a production website up to date. For example, if refs/heads/master in the script above is changed to refs/heads/staging and you kept the local repository always checked out on that branch, you could use it as a way to push updates to a staging server just by performing an appropriate commit on to that branch in the master repository.
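As a sketch of that idea (the names here are hypothetical, not part of the script above), the lookup could key on both the repository name and the ref, so different branches deploy to different local checkouts:

```python
# Hypothetical extension of REPO_MAP: key on (repository name, ref) so that
# different branches of the same repository map to different checkouts.
DEPLOY_MAP = {
    ("repo-name", "refs/heads/master"): "/home/user/src/git-repo-path",
    ("repo-name", "refs/heads/staging"): "/srv/staging/git-repo-path",
}

def repo_for_commit(payload):
    """Return the local checkout to update for this payload, or None to ignore it."""
    key = (payload["repository"]["name"], payload["ref"])
    return DEPLOY_MAP.get(key)

payload = {"repository": {"name": "repo-name"}, "ref": "refs/heads/staging"}
print(repo_for_commit(payload))  # → /srv/staging/git-repo-path
```

handle_commit() would then pull whichever checkout this returns, instead of hard-coding the master-only check.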

Also note that the webhook interface contains a lot of rich detail which could be used to do things like update external bug trackers, update auto-generated documentation or a ton of other handy ideas. Github have a decent enough reference for the content of the POSTs your hook will receive and my sample above only scratches the surface.

16 May 2013 at 11:52AM by Andy Pearce in Software  | Photo by Brina Blum on Unsplash  | Tags: web  git python  |  See comments
