☑ Hooked on Github

I’ve been using Github for a while now and I’ve found it to be a very handy little service. Only recently, however, did I discover just how easy it is to add commit triggers to it.

If you look under Settings for a repository and select the Service Hooks option, you’ll see a whole slew of pre-written hooks for integrating your repository into a variety of third party services. These range from bug trackers to automatically posting messages to IRC chat rooms. If you happen to be using one of these services, things are pretty easy.

If you want to integrate with your own service, however, things are almost as easy. In this post, I’ll demonstrate how easy by presenting a simple WSGI application which can keep one or more local repositories on a server synchronised by triggering a git pull command whenever a commit is made to the origin.

Firstly, here’s the script:

import git
import json
import urlparse


class RequestError(Exception):
    pass


# Update this to include all the Github repositories you wish to watch.
REPO_MAP = {
    "repo-name": "/home/user/src/git-repo-path"
}


def handle_commit(payload):
    """Called for each commit on any watched repository."""

    try:
        # Only pay attention to commits on master.
        if payload["ref"] != 'refs/heads/master':
            return False
        # Obtain local path of repo, if found.
        repo_root = REPO_MAP.get(payload["repository"]["name"], None)
        if repo_root is None:
            return False

    except KeyError:
        raise RequestError("422 Unprocessable Entity")

    # This block performs a "git pull --ff-only" on the repository.
    repo = git.Repo(repo_root)
    repo.remotes.origin.pull(ff_only=True)
    return True


def application(environ, start_response):
    """WSGI application entry point."""

    try:
        # The Github webhook interface always sends us POSTs.
        if environ["REQUEST_METHOD"] != 'POST':
            raise RequestError("405 Method Not Allowed")

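        # Extract and parse the body of the POST. Read only as many bytes
        # as the request declares, since reading wsgi.input without a
        # length can block on some servers.
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            raise RequestError("400 Bad Request")
        post_data = urlparse.parse_qs(environ['wsgi.input'].read(length))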

        # Github's webhook interface sends a single "payload" parameter
        # whose value is a JSON-encoded object.
        try:
            payload = json.loads(post_data["payload"][0])
        except (IndexError, KeyError, ValueError):
            raise RequestError("422 Unprocessable Entity")

        # If the request looks valid, pass to handle_commit() which
        # returns True if the commit was handled, False otherwise.
        if handle_commit(payload):
            start_response("200 OK", [("Content-Type", "text/plain")])
            return ["ok"]
        else:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return ["ignored ref"]

    except RequestError as e:
        start_response(str(e), [("Content-Type", "text/plain")])
        return ["request error"]

    except Exception as e:
        start_response("500 Internal Server Error",
                       [("Content-Type", "text/plain")])
        return ["unhandled exception"]

Aside from the Python standard library it also uses the GitPython library for accessing the Git repositories. Please also note that this application is a bare-bones example - it lacks important features such as logging and more graceful error-handling, and it could do with being rather more configurable, but hopefully it’s a reasonable starting point.

To use this application, update the REPO_MAP dictionary to contain all the repositories you wish to watch for updates. Each key should be the name of the repository as specified on Github, and each value should be the full, absolute path to a checkout of that repository where the Github repository is added as the origin remote (i.e. as if created with git clone). The repository should remain checked out on the master branch.

Once you have this application up and running you’ll need to note its URL. You then need to go to the Github Service Hooks section and click on the WebHook URLs option at the top of the list. In the text box that appears on the right enter the URL of your WSGI application and hit Update settings.

Now whenever you perform a commit to the master branch of your Github repository, the web hook will trigger a git pull to keep the local repository up to date.
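Before pointing Github at a live server it can help to check the request handling locally. The sketch below is only an illustration under a few assumptions: it is written for Python 3 (so urllib.parse stands in for the urlparse module used above), it avoids GitPython entirely, build_environ() and extract_payload() are hypothetical helpers rather than part of the script, and the sample payload carries just the two fields the script reads, not everything a real webhook POST contains.

```python
import io
import json
from urllib.parse import parse_qs, urlencode


def build_environ(payload):
    """Build a minimal WSGI environ for a form-encoded webhook POST."""
    body = urlencode({"payload": json.dumps(payload)}).encode("utf-8")
    return {
        "REQUEST_METHOD": "POST",
        "CONTENT_TYPE": "application/x-www-form-urlencoded",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }


def extract_payload(environ):
    """Decode the JSON object carried in the "payload" form field."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = environ["wsgi.input"].read(length).decode("utf-8")
    return json.loads(parse_qs(raw)["payload"][0])


# Hypothetical sample: only the fields handle_commit() looks at.
sample = {"ref": "refs/heads/master", "repository": {"name": "repo-name"}}
decoded = extract_payload(build_environ(sample))
print(decoded["ref"], decoded["repository"]["name"])
```

The same environ dictionary could be passed straight to application() along with a stub start_response callable, which exercises the status codes without waiting for a real commit.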

Primarily I’m hoping this serves as an example for other, more useful web hooks, but potentially something like this could serve as a way to keep a production website up to date. For example, if refs/heads/master in the script above is changed to refs/heads/staging and you kept the local repository always checked out on that branch, you could use it as a way to push updates to a staging server just by performing an appropriate commit on to that branch in the master repository.
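One way to support both cases at once, rather than editing the hard-coded ref, might be to key the lookup on the repository name and the ref together. This is just a sketch of that idea: WATCHED and find_checkout() are hypothetical names, and the staging path is made up for illustration.

```python
# Hypothetical variant of REPO_MAP keyed on (repository name, ref).
WATCHED = {
    ("repo-name", "refs/heads/master"): "/home/user/src/git-repo-path",
    ("repo-name", "refs/heads/staging"): "/home/user/staging/git-repo-path",
}


def find_checkout(payload, watched=WATCHED):
    """Return the local checkout path for this push, or None to ignore it."""
    key = (payload["repository"]["name"], payload["ref"])
    return watched.get(key)


push = {"ref": "refs/heads/staging", "repository": {"name": "repo-name"}}
print(find_checkout(push))
```

A KeyError from a malformed payload would still need translating into a 422 response, just as handle_commit() does above.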

Also note that the webhook interface contains a lot of rich detail which could be used to do things like update external bug trackers, update auto-generated documentation or a ton of other handy ideas. Github have a decent enough reference for the content of the POSTs your hook will receive and my sample above only scratches the surface.

Thu 16 May 2013 at 11:52AM by Andy Pearce in Software. Tags: web, git, python. comments.

☑ May Day! May Day!

You’re going to lose your files. All of them. Maybe not today, maybe not tomorrow. Maybe not even soon. The question is, will it be for the rest of your life?

When I looked up “back up” in the thesaurus it listed its synonyms as “abandon”, “fall back”, “retreat” and “withdraw”, and I’d say that’s a fair characterisation of many people when they try to back up their data. These people are making a rod for their own back, however, and one day it’ll hit them.

OK, so we need to back stuff up, we get told that all the time, usually by very smug people while we’re scrabbling around trying to recover some important report just before it’s due. So what’s the best way to go about it?

There are several elements to a successful backup solution. I’d say first and foremost among them is automation. If you need to do something manually to kick off a backup then, unless you’re inhumanly organised, you’re going to forget to do it eventually. Once you start forgetting, chances are you’re going to keep forgetting, right up until the point you need that backup. Needless to say, that’s a little late.

The second element is history - the ability to recover previous versions of files even after they’ve been deleted. Hardware failure is only one reason to restore from a backup, it’s also not implausible that you might accidentally delete a file, or perhaps accidentally delete much of its contents and save it over the original. If you don’t notice for a few days, chances are a backup solution without history will have quietly copied that broken version of the file over the top of the previous version in your backup, losing it forever.

The third element is off-site - i.e. your backups should be stored at a physically separate location to the vulnerable systems. I’ve heard of at least a couple of cases where people have carefully ensured they backed up data between multiple computers, only to have them all stolen one night. Or burned in a fire. Or any of a list of other disasters. These occurrences are rare, of course, but not rare enough to rule them out.

The fourth and final element is that only you have access. You might be backing up some sensitive data, perhaps without realising it, so you want to make sure that your backups are useless to someone stealing them. Typically this is achieved by encrypting them. Actually this should be called something like “encryption” or “security” but then the list wouldn’t form the snappy acronym Ahoy[1]:

  • Automated
  • History
  • Off-site
  • You (have sole access)

So, how can we hit the sweet spot of all four of these goals? Because I believe that off-site backups are so important, I’m going to completely ignore software which concentrates on backing up to external hard disks or DVDs. I’m also going to ignore the ability to store additional files remotely - this is useful, but a true backup is just a copy of what you already have locally anyway. Finally, I’ll skip over the possibility of simply storing everything in the cloud to begin with, for example with services such as Google Docs or Evernote, since these options are pretty self-explanatory.

The first possibilities are a host of subscription-based services which will transparently copy files from your PC up into some remote storage somewhere. Often these are aimed at Windows users, although many also support Macs. Linux support is generally lacking. Services such as Carbonite offer unlimited storage for a fixed annual fee, although the storage is effectively limited by the size of the hard disk in your PC. Others, such as MozyHome, prefer to bill you monthly based on your storage requirements. There are also services such as Jungle Disk which effectively supply software that you can use with third party cloud storage services such as Amazon S3.

These services are aimed squarely at general users and they tend to be friendly to use. They also generally keep old versions of files for 1-3 months, which is probably enough to recover from most accidental deletion and corruption. They can be a little pricey, however, typically costing anything from $5 to $10 a month (around £3-£6). This might not be too much for the peace of mind that someone’s doing the hard work for you but remember that the costs can increase as the amount you need to store goes up. Things can get even more expensive for people with multiple PCs or lots of external storage.

It’s hard to judge the security of these services - mostly these services claim to use well known forms of encryption such as Blowfish or AES and, assuming this is true, they’re pretty secure. Generally you can have more trust in a service where you provide the encryption key and where the encryption is performed client-side, although in this case you must, of course, keep the key safe as there’s no way they can recover your data without it. For those of you paying attention you’ll realise this means an off-site copy of your key as well, stored in a secure location, but it does depend how far you want to take it - there’s always a trade-off between security and convenience.

If you don’t mind doing a bit more of the work yourself, there are other options for backup which may be more economical. Firstly, if you already have PCs at multiple locations then you might be interested in the newly-released BitTorrent Sync. Many people may have already heard of the BitTorrent file-sharing protocol and this software is also from the company co-founded by Bram Cohen, the creator of the protocol. However, it has very little to do with public file-sharing, although it’s based on the same protocol under the hood. It’s more about keeping your own files duplicated across multiple devices.

You can download clients for Windows, OSX or Linux and once you’ve configured them, they sit there watching a set of directories. You do this on several machines which all link together and share changes to the files in the watched directories. As soon as you add, delete or edit a file on one machine, the sync clients will share that change across the others. Essentially it’s a bit like a private version of Dropbox.

This is a bit of a cheat in the context of this article, of course, because it doesn’t meet one of my own criteria, storing the history of files - it’s a straight sync tool. I’m still mentioning it for two reasons - firstly, it might form a useful component of another backup solution where some other component provides file history; secondly, they’re my criteria and I’ll ignore them if I want to.

Like BitTorrent, it becomes more efficient as you add more machines to the swarm and it has the ability to share links to other peers so in general you should only need to hook a new machine to one of the others in the cloud and it should learn about the rest. It’s also pretty secure as each directory is associated with a unique key and all traffic is encrypted with it - if a peer doesn’t have the key, it can’t share the files. The data at each site isn’t stored encrypted, however, so you still need to maintain physical security of each system as you’d expect. There’s also the possibility to add read-only and one-time keys for sharing files with other people, but I haven’t tried this personally.

I haven’t played with it extensively yet, but from my early experiments it seems pretty good. Its synchronisation is fast, its memory usage is low and it seems to make good use of OS-specific features to react to file changes quickly and efficiently.

The main downside at the moment is that it’s still at quite an early stage and should be considered beta quality at best. That said, I haven’t had any problems myself. It’s also closed source which might be a turn-off for some people and it’s not yet clear whether the software will remain available for free indefinitely. It also doesn’t duplicate OS-specific meta-information such as Unix permissions which may be an issue for Linux and potentially OSX users.

On the subject of preserving Unix permissions and the like, what options exist for that? Well, there is a very handy tool called rdiff-backup which is based on the rather wonderful rsync. Like rsync it’s intended to duplicate one directory somewhere else, either on the same machine or remotely via an SSH connection. Unlike rsync, however, it not only makes the destination directory a clone of the source, but it also stores reverse-diffs of the files back from that point so you can roll them back to any previous backup point.

I’ve had a lot of success using it, although you need to be fairly technical to set it up as there’s a plethora of command-line options to control what’s included and excluded from the backup, how long to keep historical versions and all sorts of other information. The flip side to this slight complexity is that it’s pretty flexible. It’s also quite efficient on space, since it only stores the differences between files that have changed as opposed to many tools which store an entire new copy of the file.

The one area where rdiff-backup falls down, however, is security - it’s fine for backing up between trusted systems, but what about putting information on cloud storage which you don’t necessarily trust? Fortunately there’s another tool based on rdiff-backup called Duplicity which I’ve only relatively recently discovered.

This is a fantastic little tool which allows you to create incremental backups. Essentially this means that the first time you do a backup, it creates a complete copy of all your files. The next time it stores the differences between the previous backup and the current state of the files, like rdiff-backup but using forward-diffs rather than reverse. This means to restore a backup you need the last full one plus all the incrementals in between.
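To make the "full plus incrementals" idea concrete, here is a toy model in Python 3. It is emphatically not how Duplicity stores anything: snapshots are plain dictionaries mapping paths to contents, each incremental records whole changed files rather than binary diffs, and make_incremental() and restore() are hypothetical names of my own.

```python
def make_incremental(previous, current):
    """Record what changed between two snapshots (dicts of path -> contents)."""
    changed = {p: c for p, c in current.items() if previous.get(p) != c}
    deleted = [p for p in previous if p not in current]
    return {"changed": changed, "deleted": deleted}


def restore(full, incrementals):
    """Rebuild a state from a full backup plus every incremental since it."""
    state = dict(full)
    for inc in incrementals:
        state.update(inc["changed"])
        for path in inc["deleted"]:
            state.pop(path, None)
    return state


monday = {"report.txt": "draft", "notes.txt": "todo"}
tuesday = {"report.txt": "final", "notes.txt": "todo"}
wednesday = {"report.txt": "final"}

# Forward diffs: each one describes how to get from one backup to the next.
chain = [make_incremental(monday, tuesday), make_incremental(tuesday, wednesday)]

print(restore(monday, chain))      # Wednesday's state
print(restore(monday, chain[:1]))  # Tuesday's state
```

Skipping an incremental gives the wrong answer (applying only the second one to Monday's full backup leaves the draft report in place), which is why a forward-diff scheme needs the whole chain intact to restore the latest state.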

The clever bit is that it splits your files up into chunks[2] and also encrypts each chunk with a passphrase that you supply. This means you can safely deposit those chunks on any third party storage you choose without fear of them sneaking a peek at your files. Indeed, Duplicity already comes with a set of different backends for dumping files on a variety of third party storage solutions including Google Drive and Amazon S3, as well as remote SFTP and WebDAV shares.

It’s free and open source, although just like rdiff-backup it’s probably for the more technically-minded user. It also doesn’t run under Windows[3]. However, Windows users need not despair - it has inspired another project called Duplicati which is a reimplementation from scratch in C#. I haven’t used this at all myself, but it looks very similar to Duplicity in terms of its basic functionality, although there are some small differences which make it incompatible.

The main difference appears to be that it layers a more friendly GUI for configuring the whole thing, which probably makes it more accessible to average users. It still supports full and incremental backups, compression and encryption just as Duplicity does. It also will run on OSX and Linux with the aid of Mono, although unlike Duplicity it doesn’t currently support meta-information such as Unix permissions[4], which probably makes Duplicity a more attractive option for Linux unless you really need to restore on different platforms.

Anyway, that’s probably enough of a summary for now. Whatever you do, however, if you’re not doing backups then start, unless you’re the sort of person who craves disappointment and despair. If not then you’ll definitely regret it at some point. Maybe not today- Oh wait, we’ve done that already.


  1. Everyone knows you need a catchy mnemonic when you’re trying to repackage common sense and sell it to people. 

  2. Bzipped multivolume tar archives, for the technically minded. 

  3. At least not without a lot of faff involving Cygwin and a handful of other packages. 

  4. Although there is an open issue in their tracker about implementing support for meta-information. 

Wed 01 May 2013 at 01:09PM by Andy Pearce in Software. Tags: backup, cloud. comments.

☑ Python destructor drawbacks

As you learn Python, sooner or later you’ll come across the special method __del__() on classes. Many people, especially those coming from a C++ background, consider this to be the “destructor” just as they consider __init__() to be the “constructor”. Unfortunately, they’re often not quite correct on either count, and Python’s behaviour in this area can be a little quirky.

Take the following console session:

>>> class MyClass(object):
...   def __init__(self, init_dict):
...     self.my_dict = init_dict.copy()
...   def __del__(self):
...     print "Destroying MyClass instance"
...     print "Value of my_dict: %r" % (self.my_dict,)
... 
>>> instance = MyClass({1:2, 3:4})
>>> del instance
Destroying MyClass instance
Value of my_dict: {1: 2, 3: 4}

Hopefully this is all pretty straightforward. The class is constructed and __init__() takes an initial dict instance and stores a copy of it as the my_dict attribute of the MyClass instance. Once the final reference to the MyClass instance is removed (with del in this case) then it is garbage collected and the __del__() method is called, displaying the appropriate message.

However, what happens if __init__() is interrupted? In C++ if the constructor terminates by throwing an exception then the class isn’t counted as fully constructed and hence there’s no reason to invoke the destructor[1]. How about in Python? Consider this:

>>> try:
...   instance = MyClass([1,2,3,4])
... except Exception as e:
...   print "Caught exception: %s" % (e,)
... 
Caught exception: 'list' object has no attribute 'copy'
Destroying MyClass instance
Exception AttributeError: "'MyClass' object has no attribute 'my_dict'" in <bound method MyClass.__del__ of <__main__.MyClass object at 0x7fd309fbc450>> ignored

Here we can see that a list instead of a dict has been passed, which is going to cause an AttributeError exception in __init__() because list lacks the copy() method which is called. Here we catch the exception, but then we can see that __del__() has still been called.

Indeed, we get a further exception there because the my_dict attribute hasn’t had chance to be set by __init__() due to the earlier exception. Because __del__() methods are called in quite an odd context, exceptions thrown in them actually result in a simple error to stderr instead of being propagated. That explains the odd message about an exception being ignored which appeared above.

This is quite a gotcha of Python’s __del__() methods - in general, you can never rely on any particular piece of initialisation of the object having been performed, which does reduce their usefulness for some purposes. Of course, it’s possible to be fairly safe with judicious use of hasattr() and getattr(), or catching the relevant exceptions, but this sort of fiddliness is going to lead to tricky bugs sooner or later.
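As a rough illustration of the defensive style, here is a variant of MyClass whose __del__() uses getattr() with a default instead of assuming my_dict exists. It is a sketch for Python 3, where list has gained a copy() method, so a tuple is passed to provoke the failure instead; the events list is a hypothetical stand-in for the print statements so the order of calls is easy to inspect. The timing of the calls relies on CPython's reference counting.

```python
events = []


class MyClass(object):
    def __init__(self, init_dict):
        self.my_dict = init_dict.copy()

    def __del__(self):
        # __init__() may have failed part-way, so never assume that
        # my_dict was set.
        my_dict = getattr(self, "my_dict", None)
        if my_dict is None:
            events.append("partially initialised instance destroyed")
        else:
            events.append("destroyed with %r" % (my_dict,))


try:
    MyClass((1, 2, 3, 4))  # tuples have no copy() method
except AttributeError:
    pass

instance = MyClass({1: 2})
del instance
print(events)
```

This keeps __del__() from raising, but every attribute it touches needs the same treatment, which is exactly the fiddliness mentioned above.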

This all seems a little puzzling until you realise that __del__() isn’t actually the opposite of __init__() - in fact, it’s the opposite of __new__(). Indeed, if __new__() of the base class (which is typically responsible for actually doing the allocation) fails then __del__() won’t be called, just as in C++. Of course, this doesn’t mean the appropriate thing to do is shift all your initialisation into __new__() - it just means you have to be aware of the implications of what you’re doing.
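A small sketch of that pairing, again assuming CPython: the hypothetical Picky class below refuses to allocate when asked, raising from its own __new__() before the base class allocation happens, which is a slightly simpler case than the base class itself failing but has the same effect of no object ever existing.

```python
calls = []


class Picky(object):
    def __new__(cls, ok):
        if not ok:
            # No object has been allocated yet, so there is nothing
            # for __del__() to be called on.
            raise ValueError("refusing to allocate")
        return super(Picky, cls).__new__(cls)

    def __del__(self):
        calls.append("del")


try:
    Picky(False)
except ValueError:
    pass
after_failure = list(calls)  # snapshot taken after the failed construction

obj = Picky(True)
del obj
print(after_failure, calls)
```

Only the successfully allocated instance ever reaches __del__().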

There are other gotchas of using __del__() for things like resource locking as well, primarily that it’s a little too easy for stray references to sneak out and keep an object alive longer than you expected. Consider the previous example, modified so that the exception isn’t caught:

>>> instance = MyClass([1,2,3,4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in __init__
AttributeError: 'list' object has no attribute 'copy'
>>>

Hmm, how odd - the instance can’t have been created because of the exception, and yet there’s no message from the destructor. Let’s double-check that instance wasn’t somehow created in some weird way:

>>> print instance
Destroying MyClass instance
Exception AttributeError: "'MyClass' object has no attribute 'my_dict'" in <bound method MyClass.__del__ of <__main__.MyClass object at 0x7fd309fbc2d0>> ignored
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'instance' is not defined

Isn’t that interesting! See if you can have a guess at what’s happened…

… Give up? So, it’s true that instance was never defined. That’s why when we try to print it subsequently, we get the NameError exception we can see at the end of the second example. So the only real question is why was __del__() invoked later than we expected? There must be a reference kicking around somewhere which prevented it from being garbage collected, and using gc.get_referrers() we can find out where it is:

>>> instance = MyClass([1,2,3,4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in __init__
AttributeError: 'list' object has no attribute 'copy'
>>> import sys
>>> import gc
>>> import types
>>> 
>>> for obj in gc.get_objects():
...   if isinstance(obj, MyClass):
...     for i in gc.get_referrers(obj):
...       if isinstance(i, types.FrameType):
...         print repr(i)
... 
<frame object at 0x1af19c0>
>>> sys.last_traceback.tb_next.tb_frame
<frame object at 0x1af19c0>

Because we don’t have a reference to the instance any more, we have to trawl through the gc.get_objects() output to find it, and then use gc.get_referrers() to find who has the reference. Since I happen to know the answer already, I’ve filtered it to only show the frame object - without this filtering it also includes the list returned by gc.get_objects() and calling repr() on that yields quite a long string!

We then compare this to the frame referenced by sys.last_traceback.tb_next, which is the frame of the __init__() call where the exception was raised and which still holds self, and we get a match. So, the reference that still exists is from a stack frame attached to sys.last_traceback, which is the traceback of the most recent exception thrown. What happened earlier when we attempted print instance is that this threw an exception which replaced the previous traceback (only the most recent one is kept), and this removed the final reference to the MyClass instance, hence causing its __del__() method to finally be called.

Phew! I’ll never complain about C++ destructors again. As an aside, many of the uses for the __del__() method can be replaced by careful use of the context manager protocol, although this does typically require your resource management to extend over only a single function call at some level in the call stack as opposed to the lifetime of a class instance. In many cases I would argue this is actually a good thing anyway, because you should always try to minimise the time when a resource is acquired, but like anything it’s not always applicable.
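For comparison, a minimal sketch of the context manager approach might look like the following. ManagedResource is a hypothetical class and the events list simply records the order of calls; the point is only that __exit__() runs at a well-defined moment, even when the body of the with block raises.

```python
events = []


class ManagedResource(object):
    """Hypothetical resource whose lifetime is tied to a with block."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        events.append("acquire %s" % self.name)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("release %s" % self.name)
        return False  # do not swallow the exception


try:
    with ManagedResource("lock") as resource:
        events.append("using %s" % resource.name)
        raise RuntimeError("something went wrong")
except RuntimeError:
    events.append("error handled")

print(events)
```

The release is recorded before the exception reaches the handler, with no dependence on when the garbage collector gets around to the object.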

Still, if you must use __del__(), bear these quirks in mind and hopefully that’s one less debugging nightmare you’ll need to go through in future.


  1. The exception (haha) to this is when a derived class’s constructor throws an exception, then the destructor of any base classes will still be called. This makes sense because by the time the derived class constructor was called, the base class constructors have already executed fully and may need cleaning up just as if an instance of the base class was created directly. 

Tue 23 Apr 2013 at 10:48AM by Andy Pearce in Software. Tags: python, destructors. comments.

☑ When is a closure not a closure?

Scoping in Python is pretty simple, especially in Python 2.x. Essentially you have three scopes:

  • Local scope
  • Enclosing scope
  • Global scope

Local scope is anything defined in the same function as you. Enclosing scopes are those of the functions in which you’re defined - this only applies to functions which are lexically contained within other functions[1]. Global scope is anything at the module level. There’s also a special “builtin” scope outside of that, but let’s ignore that for now. Classes also have their own special sorts of scopes, but we’ll ignore that as well.

When you assign to a variable within a function, this counts as a declaration and the variable is created in the local scope[2] of the function. This is unless you use the global keyword to force the variable to refer to one at module scope instead[3].

When you read the value of a variable, Python starts with the local scope and attempts to look up the name there. If it’s not found, it recurses up through the enclosing scopes looking for it until it reaches the module scope (and finally the magic builtin scope). This is more or less as you’d expect if you’re used to normal lexically-scoped languages.

However, if you were paying attention you’ll notice that I specifically said that a local scope is defined by a function. In particular, constructs such as for loops do not define their own scopes - they operate entirely in the local scope of the enclosing function (or module). This has some beneficial side-effects - for example, loop counters are still available once the loop has exited, which is rather handy. It also has some potential pitfalls - take this code snippet, for example[4]:

functions = [(lambda: i) for i in xrange(5)]
print ", ".join(str(func()) for func in functions)

So, this builds a list of functions[5] and then executes each one in turn and concatenates and prints the results. Intuitively one would expect the results to be 0, 1, 2, 3, 4, but actually we get 4, 4, 4, 4, 4 - eh?

What’s happening is that each of the functions created is in a closure with the variable i in its global scope bound to the one used in the loop. However, each iteration just updates the same loop counter in the local scope of the enclosing function (or module) and so all the functions end up with a reference to the same variable i. In other words, closures in Python refer directly to the enclosing scopes, they don’t create “frozen copies” of them[6].

This works fine when a closure is created by a function and then returned, because the enclosing scope is then kept alive only by the closure and inaccessible elsewhere. Further invocations of the same function will produce new scopes and different closures. In this case, though, the functions are all defined under the same scope. So when they’re evaluated, they all return the final value of i as it was when the loop terminated.
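To illustrate the case that does work, each function can be created by a separate call to a factory function, so that every closure gets its own enclosing scope. This is a sketch with a hypothetical make_getter() helper, written so that it runs under Python 3 (hence range rather than xrange).

```python
def make_getter(value):
    # Each call creates a fresh scope holding its own "value".
    return lambda: value


functions = [make_getter(i) for i in range(5)]
print(", ".join(str(func()) for func in functions))  # prints "0, 1, 2, 3, 4"
```

Here the loop counter is only read while the list is being built, and each lambda refers to the value captured in its own call to make_getter() rather than to the shared i.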

We can illustrate this by amending the example to delete the loop counter:

functions = [(lambda: i) for i in xrange(5)]
del i
print ", ".join(str(func()) for func in functions)

Now the third line raises an exception:

NameError: global name 'i' is not defined

Of course, if you use the generator expression form to defer generation of the functions until the point of invocation then everything works as you’d expect:

# This prints "0, 1, 2, 3, 4" as expected.
functions = ((lambda: i) for i in xrange(5))
print ", ".join(str(func()) for func in functions)

So, all this is quite comprehensible once you understand what’s going on, but I do wonder how many people get bitten by this sort of thing when using closures in loops.

As a final note, this behaviour is the same in Python 3.x, with one caveat: there a list comprehension gets its own scope, so i no longer leaks into the enclosing scope and the del i in the second example would itself fail, but the functions still all return 4. The other small difference with regards to scopes is the addition of the nonlocal keyword, which is the equivalent of global except that it allows updating the value of variables in enclosing scopes which are between the local and global scopes. I believe that with regards to reading the values of such variables, however, the behaviour is unchanged.
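A minimal Python 3 sketch of nonlocal, using a hypothetical make_counter() factory, shows one inner function updating a variable in the enclosing scope and a second inner function seeing the change:

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count  # rebind count in make_counter()'s scope
        count += 1
        return count

    def current():
        return count  # reading needs no declaration

    return increment, current


increment, current = make_counter()
increment()
increment()
print(current())  # prints "2"
```

Without the nonlocal declaration the assignment in increment() would create a new local variable and the read in count += 1 would fail, as with any other read of a local before it is assigned.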


  1. Note that this is a lexical definition of enclosure, which is to say it’s to do with where the function is defined. It’s nothing to do with where the function was called from. Unlike dynamically-scoped languages, Python gives a function no access to variables defined in the scope of a calling function. 

  2. This actually extends to the entire function, which is why it’s an error to read the value of a variable assigned to later in the function even if it exists in an enclosing scope. 

  3. Or the nonlocal keyword in Python 3.x - see the note at the end of this post. 

  4. This example uses a list comprehension for concision, but the issues described would apply equally to a for loop. 

  5. Yes I’m using lambda - so sue me, it’s just an example. 

  6. Actually, once you think of closures as references to a scope rather than some sort of “freeze-frame” of the state, some things are easier to understand. For example, if two functions are defined in the same closure, updates that each of them makes to the state can be felt by the other. This is especially relevant if they use Python 3’s nonlocal keyword (see the note at the end of this post). 

Wed 10 Apr 2013 at 03:41PM by Andy Pearce in Software. Tags: python, scoping. comments.

☑ The Dark Arts of C++ Streams

I’ve never really delved into the details of C++ streams. For formatted output of builtin types they’ve always seemed less convenient than printf() and friends due to the way they mix together the format and the values. However, I recently decided it was time to figure out the basics for future reference, and I’m noting my conclusions in this blog post in case it proves a useful summary for anybody else.

A stream in C++ is a generic character-oriented serial interface to a resource. Streams may only accept input, only produce output or be available for both input and output. Reading and writing streams is achieved using the << operator, which is overloaded beyond its standard bit-shifting function to mean “write RHS to LHS stream”, and the >> operator, which is similarly overloaded to mean “read from LHS stream into RHS”.

To use streams, include the iostream header file for the basic functionality, and any additional headers for the specific streams to use - for example, file-oriented streams require the fstream header. Here’s an example of writing to a file using the stream interface[1]:

#include <iostream>
#include <fstream>

int main()
{
    std::fstream myFile;
    myFile.open("/tmp/output.txt", std::ios::out | std::ios::trunc);
    myFile << "hello, world" << std::endl;
    myFile.close();

    return 0;
}

Most of this example is pretty standard. Since streams use the standard bit-shift operators which are left-associative, the first operation performed in the third line of main() above is myFile << "hello, world". This expression also evaluates to a reference to the stream, allowing the operators to be chained to write multiple values in sequence. In this case, the std::endl manipulator pushes a newline into the output stream and also implicitly calls its flush() method.

So far so obvious. What about reading from a file? Reading into strings is fairly obvious:

#include <iostream>
#include <fstream>
#include <string>

int main()
{
    std::fstream myFile;
    std::string myString;
    myFile.open("/tmp/input.txt", std::ios::in);
    myFile >> myString;
    std::cout << "Read: " << myString << std::endl;

    return 0;
}

In this example the cout stream represents stdout, but otherwise it seems quite straightforward. However, if you run it you’ll see that the string contains only the text up to the first whitespace in the file. It turns out that this is the defined behaviour for strings: the >> operator skips any leading whitespace and then reads up to the next whitespace character. This strikes me as a little quirky but hey ho.
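
One consequence of this behaviour is that applying >> repeatedly walks through the input a word at a time. The following is a small sketch of that idea, using a std::istringstream in place of a file; the splitWords() name is just something I’ve made up for illustration:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text into whitespace-separated words by repeatedly applying >>.
std::vector<std::string> splitWords(const std::string &text)
{
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    // Each extraction skips leading whitespace, then reads up to the next
    // whitespace character; the loop ends when an extraction fails.
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

Note that runs of spaces and newlines are all treated alike, so the original layout of the text is lost.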

It’s quite possible to also read integer and other types - the example below demonstrates this as well as a file stream open for read/write and seeking:

#include <iostream>
#include <fstream>

int main()
{
    std::fstream myFile;
    int intVar;
    float floatVar;
    bool boolVar;

    myFile.open("/tmp/testfile", std::ios::in | std::ios::out | std::ios::trunc);
    myFile << 123 << " " << "1.23" << " true" << std::endl;
    myFile.seekg(0);
    myFile >> intVar >> floatVar >> std::boolalpha >> boolVar;
    std::cout << "Int=" << intVar << " Float=" << floatVar
              << " Bool=" << boolVar << std::endl;
    myFile.close();

    return 0;
}

The file is opened for both read and write here, and any existing file will be truncated to zero length upon opening. The output line is much the same as previous examples, but the input line demonstrates how the >> operator is overloaded on the destination type to parse character input into the appropriate type. The seekg() method fairly obviously seeks within the stream in a similar way to standard C file IO.

Also demonstrated here is an IO manipulator, in this case std::boolalpha, which makes the stream convert between bool values and the strings "true" and "false". It can be used on both input and output streams. The important thing to remember about manipulators like this is that they set flags on the stream which are persistent; they don’t just apply to the following value. For example, the following function will show the first bool as an integer, the next two as strings and the final one as an integer again:

void showbools(bool one, bool two, bool three, bool four)
{
    std::cout << one << ", " << std::boolalpha << two << ", " << three
              << ", " << std::noboolalpha << four << std::endl;
}

Other examples include std::setbase, which displays integers in other number bases; and std::fixed, which displays floating point values to a fixed number of decimal places, determined by the std::setprecision manipulator.

All these manipulators are really shorthand for methods such as setf() or precision() being called on the stream at the appropriate point. So, printf()-like formatting can be done, albeit in a slightly more verbose manner. Those which take arguments, such as std::setprecision, require the iomanip header to be included.
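
As a rough illustration of that equivalence, the sketch below formats the same two values first with manipulators and then with direct method calls. The function names are hypothetical and I’ve used a std::ostringstream so the output ends up in a string:

```cpp
#include <iomanip>
#include <ios>
#include <sstream>
#include <string>

// Formats a value using manipulators inserted into the stream.
std::string withManipulators(double value, int number)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(2) << value << " "
        << std::setbase(16) << number;
    return out.str();
}

// Aims for the same result by calling the stream's methods directly.
std::string withMethods(double value, int number)
{
    std::ostringstream out;
    out.setf(std::ios::fixed, std::ios::floatfield);
    out.precision(2);
    out << value << " ";
    out.setf(std::ios::hex, std::ios::basefield);
    out << number;
    return out.str();
}
```

The manipulator version is more compact, but the method version makes it clearer that the state lives in the stream and stays set until something changes it.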

So what about more basic file IO, such as reading an entire line into a string as opposed to a single word? To do this you need to avoid the stream operators and instead use other functions - for example, std::getline() will read up to a newline into a string:

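// Displays specified file to stdout with line numbers.
// Requires the <fstream>, <iomanip>, <iostream> and <string> headers.
void dumpFile(const char *filename)
{
    std::fstream myFile;
    myFile.open(filename, std::ios::in);
    unsigned int lineNum = 0;
    std::string line;
    // Testing the result of getline() stops the loop as soon as a read fails,
    // so no spurious empty line is printed at the end of the file.
    while (std::getline(myFile, line)) {
        std::cout << std::setw(3) << ++lineNum << " " << line << std::endl;
    }
    myFile.close();
}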

To instead read an arbitrary number of characters, the read() method can be used, which reads into a standard C char array. There are also overloads of the get() method which do something similar, although they stop at a delimiter, but since the read() method has only a single purpose it’s probably clearer to use that.

Reading into a std::string, however, requires another concept of C++ streams - the streambuf. This is just a generalisation of a buffer holding characters which can be sent to or received from streams. Existing streams use such a buffer to hold the characters read from and written to them, and it can be accessed via the rdbuf() method. Using this and our own std::stringstream, which is a stream wrapper around a std::string, we can read an entire file into a std::string:

// Requires the <sstream> header.
std::string readFile(std::fstream &inFile)
{
    std::stringstream buffer;
    buffer << inFile.rdbuf();
    return buffer.str();
}

However, this still doesn’t address the issue of reading n characters from the stream directly into a std::string. I’ve looked into this and frankly I don’t think it’s possible without resorting to reading into a char array, although as a result of the Stack Overflow question which I asked just now I’ve realised that this can be done into a std::string safely. The trick is to call resize() to make sure the string has enough valid space to store the result of the read and then use the non-const version of operator[] to get the address of the string’s character storage [2]. Crucially you can use neither c_str() nor data(), which both return read-only pointers (at least prior to C++17, which added a non-const overload of data()), and modifying the string through them is undefined behaviour.

Finally, I’ll very briefly cover the issue of customising a class so that it can be sent to and from streams like the built-in types. Actually this is just as simple as creating a new overload of the operator<< function (or operator>> for input) with the appropriate stream type. The example below shows a class which can output itself:

#include <iostream>

class MyIntWrapper
{
public:
  MyIntWrapper(int value) : value(value) { }
private:
  int value;

  friend std::ostream &operator<<(std::ostream &o, const MyIntWrapper &i);
};

std::ostream &operator<<(std::ostream &ost, const MyIntWrapper &instance)
{
  ost << "<MyIntWrapper: " << instance.value << ">";
  return ost;
}

int main()
{
  MyIntWrapper sample(123);
  std::cout << sample << std::endl;
  return 0;
}

The only thing to note is the friend declaration, which is required to allow the operator function to access the private data members of the class. Input operators can be overloaded in a similar way.


  1. Note that error handling has been omitted for clarity in all examples. 

  2. Implementations which use copy-on-write, such as the GNU STL, are forced to perform any required copy operations when the non-const version of operator[] is used. Interestingly, C++11 effectively forbids copy-on-write implementations of std::string which makes the whole thing rather less tricky (but also potentially slower for some use-cases, although those cases should probably be using their own classes anyway). 

Mon 08 Apr 2013 at 12:28PM by Andy Pearce in Software. Tags: c++, streams. comments.
