# ☑ Hash Collision Fun

When dealing with data structures that involve hashing, most commonly hash tables, it’s fairly common knowledge that your choice of hash function is an important performance consideration. What’s perhaps less well known is that it can be an important security consideration too - this article briefly discusses why.

Anyone who’s studied Computer Science at university, or has spent a few years in the software industry, will know about hash tables. Those who’ve looked deeper will know that there are other, equally powerful, uses for the same general technique. Its great strength is its performance in the general case, able to average constant time lookups where many data structures only manage $\log n$ performance. The complication is that to achieve this in practice you need to put a lot of care into the hash function that you choose. However, as I discovered recently, average case performance isn’t the only consideration.

The important characteristic of a hash function is that it assigns inputs to random buckets. This should hold regardless of which inputs are chosen as this helps to ensure the average constant time performance - if multiple items hash to the same bucket, the lookup time rapidly degrades towards linear performance. This can have a very large impact on running time of the application. With a reasonable hash function this occurs very rarely in practice, but if someone knows the hash function in use then it may be possible to deliberately engineer hash collisions - this is fairly obvious when you think about it, but it wasn’t something that occurred to me until recently.

Why would someone want to do that? One reason is part of a DoS attack, where an attacker uses some supposedly limited level of access to impair usage for other users. A great example of this is an attack someone devised on the Btrfs filesystem a few years ago. Essentially this involved constructing a large number of filenames which hash to the same bucket, thus causing the number of items in that bucket to grow artificially large and extending the time taken to perform operations massively. This approach could also be used by one user with access to a shared directory to prevent another user creating a file in the same directory by creating many files which hashed to the same value.

Apparently this isn’t as rare as you might imagine - there’s even a CERT advisory which contains a long list of programming languages whose builtin hashing algorithms are prone to this sort of attack. Many of these langauges have since resolved the issue - Python was one of the more recent of these, only having moved to using the cryptographic quality Siphash function in version 3.41.

Aside from all these language features, however, which will have a lot of pressure to change due to the number of people affected, it’s a little concerning to consider how much software there is out there which has used insecure hashing in the name of performance, leaving itself open to these sorts of cunning attacks.

Just one more reason not to roll your own hashing function - but I have no doubts that people will continue to do so for a long time yet.

1. Prior to choosing Siphash, the Python team did attempt to address the issue using the old hashing algorithm but peturbed by a pseudorandomly chosen initial value per invocation of Python. Unsurprisingly this turned out to be less than adequate.

20 Mar 2015 at 7:36AM by Andy Pearce in Software  | Photo by Jens Herrndorff on Unsplash  | Tags: data-structures security