☑ Uncovering Rust: Loops and Collections

17 Apr 2023 at 12:18PM in Software
 | 
Photo by Luke Hodde on Unsplash
 | 

Rust is fairly new multi-paradigm system programming language that claims to offer both high performance and strong safety guarantees, particularly around concurrency and memory allocation. As I play with the language a little, I’m using this series of blog posts to discuss some of its more unique features as I come across them. This one discusses looping constructs and standard library collections.

This is the 3rd of the 7 articles that currently make up the “Uncovering Rust” series.

rusty bolts

Having completed a couple of articles where I researched the QUIC and HTTP/3 protocols, I’m now getting to the stage of the Geta project which involves actually starting to write code. I’ve deliberately chosen a language with which I’m still only somewhat familiar, Rust, as an opportunity to become more familiar with it. In an effort to get back into the swing of it, I thought a good approach might be to resurrect this series of articles on Rust that I started a few years ago.

The two previous articles in this series, written in the heady pre-pandemic days of 2019, covered ownership, scalar types and the match construct. In this article I’ll be extending the discussion of types to common collections within the standard library, which provide more facilities than builtin tuples and arrays; but first I want to look at the loop constructs that Rust provides.

Loops and Conditionals

Let’s look at the constructs that control program flow. I’m going to assume you’ve written some sort of code before so you’re familiar with the concept of ifelse and while loops, and just focus on the aspects which are specific to Rust.

ifelse

I don’t want to dwell on ifelse, since it’s one of those things that’s very similar in many languages. The only thing I will mention is that it, along with the loop constructs, is counted as an expression and can therefore “return” values. This assignment is a simple example:

let odd_even = if x % 2 == 0 { "even" } else { "odd" };

Note that this is a fully flexible ifelse construct where each of the blocks can contain an abitrary sequence of instructions, not just a single value as in many languages’ ternary operators. For example, this is also valid:

let odd_even = if x % 2 == 0 {
    println!("EVEN");
    "even"
} else {
    println!("ODD");
    "odd"
};

One thing that’s reassuring to know is that the compiler will confirm that the return types of the multiple arms of the conditional evaluate to consistent types — if they did not, the type of odd_even in the example above would be ambiguous, and that would prevent the compiler doing further type checking on it further down.

while

Rust has three main loop constructs, which are while, loop and for. We’ll look at each of these in the next few subsections, starting with while.

The while construct in Rust is quite similar to other languages, and anyone who’s written more or less any code in any imperative language can probably figure out what this code is going to do.

let mut i = 0;
while i < 10 {
    println!("Value is {}", i);
    i += 1;
}

Also, Rust supports break and continue in these loops with the usual semantics1. Also similar to some other languages, loops can be labelled to allow breaking out of multiple levels at once.

let mut i = 0;
let mut j = 0;
'outer: while i < 10 {
    while j < 10 {
        if i * j > 45 {
            println!("{} and {}", i, j);
            break 'outer;
        }
        j += 1;
    }
    i += 1;
    j = 0;
}

The final thing to say about while is that unlike some other constructs, it does not return a value — or rather, the value it returns is always empty. This is unlike the loop construct that we’ll look at next.

loop

The next construct loop may seem superficially similar to while true {...}, and it does have the same semantics — it loops until the loop is explicitly broken out of.

However, because the semantics of the loop are intentionally simple, the compiler can be certain that the only exit points from the loop are break statements, and this allows these statements to return a value in a similar way than we saw with ifelse earlier. Of course, the language designers could have just made while true a special case instead of introducing a new keyword, but if that’s the sort of decision that upsets you then you’re probably going to have real trouble finding a language with which you’re satisified2.

Let’s see an example of this in action.

let mut n = 2;
let mut m = 2;
let (found_m, found_n) = loop {
    if m * n == 35 {
        break (m, n);
    }
    n += 1;
    m = if n > 34 { n = 2; m + 1 } else { m };
};
println!("{} x {} = 35", found_m, found_n);

Now I’m not saying this is an idiomatic, or even sensible, way to solve this particular problem — as with all the code excerpts in this article, it’s just meant to illustrate some syntactic details. But you can see the fact that the loop block is returning a value — in this case, it’s a 2-tuple of integers. In this case, of course, m and n are still in scope and could have been used directly, but that would illustrate returning a value from the loop.

Of course, you can use loop just as you would while true without returning a value, and the use of loop labels also applies equally to loop.

for and Iterators

Now we come to iteration though collections, which is something that tends to differ a little more between languages than while and ifelse do. Rust’s for construct is similar to Python’s in that it iterates over the contents of a collection, as opposed to being some syntactic sugar around a while loop as it is in C.

let favourite_things = [
    "raindrops on roses",
    "whiskers on kittens",
    "bright copper kettles",
    "warm woolen mittens",
    "brown paper packages tied up with strings"
];
for thing in favourite_things {
    println!("Favourite thing: {}", thing);
}

A collection can make itself iterable in this way by implementing the std::iter::IntoIterator trait — we haven’t looked at traits yet, but think of them as a defined interface for now. This requires providing an into_iter() method which returns another object which implements the std::iter::Iterator trait, and provides a next() method. This is somewhat similar to the use of iterators in other languages, even if the precise semantics differ somewhat.

Following on from the same definition of favourite_things above, we could instead rewrite the for loop to use the iterator directly like this.

let mut my_iter = favourite_things.into_iter();
loop {
    match my_iter.next() {
        None => break,
        Some(thing) => println!("Favourite thing: {}", thing),
    }
}

The use of match here is because next() returns an Option which becomes None once the iterator is exhausted3.

The use of into_iter() depends on the context in which you call it, but is typically expected to iterate by value. Given Rust’s ownership rules, this can lead to some surprising consequences. Consider the simple application shown below, which is similar to the example above except instead of immuatable string slices, we’ve got an array of heap-allocated String objects. Also, we try to iterate through it twice.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
fn main() {
    let favourite_things = [
        String::from("raindrops on roses"),
        String::from("whiskers on kittens"),
        String::from("bright copper kettles"),
        String::from("warm woolen mittens"),
        String::from("brown paper packages tied up with strings"),
    ];
    for thing in favourite_things {
        println!("Favourite thing: {}", thing);
    }
    for thing in favourite_things {
        println!("Still favourite thing: {}", thing);
    }
}

You might expect this doesn’t seem to be particularly contraversial, but Rust’s compiler is having none of it.

error[E0382]: use of moved value: `favourite_things`
  --> src/main.rs:12:18
   |
2  |     let favourite_things = [
   |         ---------------- move occurs because `favourite_things` has type
`[String; 5]`, which does not implement the `Copy` trait
...
9  |     for thing in favourite_things {
   |                  ---------------- `favourite_things` moved due to this
implicit call to `.into_iter()`
...
12 |     for thing in favourite_things {
   |                  ^^^^^^^^^^^^^^^^ value used here after move
   |
note: `into_iter` takes ownership of the receiver `self`,
which moves `favourite_things`
  --> /rustc/[...]/library/core/src/iter/traits/collect.rs:262:18
help: consider iterating over a slice of the `[String; 5]`'s content to avoid
moving into the `for` loop
   |
9  |     for thing in &favourite_things {
   |                  +

(I have manually edited and wrapped some lines for readability)

What’s happening here is that in the first iteration, we’re taking ownership of the String values in the array, and hence consuming them. The easiest way to solve this is just to iterate over a slice of the array. Since a slice is a type of reference, this can refer to the original items without moving them. We could just iterate over a slice of the whole array by prefixing the name with & and this borrows immutable references to the values.

 9
10
    for thing in &favourite_things {
        ...

We can also iterate over what I’m going to call proper slices (i.e. a proper subset of the elements in a collection), and we can make them mutable. To do so we’d also have to make favourite_things mutable, so we can borrow a mutable reference to its members. The updated example below illustrates these features.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
fn main() {
    let mut favourite_things = [
        String::from("raindrops on roses"),
        String::from("whiskers on kittens"),
        String::from("bright copper kettles"),
        String::from("warm woolen mittens"),
        String::from("brown paper packages tied up with strings"),
    ];
    for thing in &mut favourite_things[0..3] {
        println!("Really favourite thing: {}", thing);
        thing.push_str("!!!");
    }
    for thing in &favourite_things {
        println!("Favourite thing: {}", thing);
    }
}

As well as the into_iter() used by for loops, there are two other methods which are commonly (but not necessarily always) provided by collections to derive your own iterators:

  • iter() returns an iterator yielding immutable references to items.
  • iter_mut() returns an iterator yielding mutable references to items.

These can be used where you want to be quite certain of the type of value iterated, rather than relying on the context-sensitive behaviour of into_iter(), but whether a given collection offers them is entirely up to its author.

Collections

In the second half of this article I’m going to run through some of the common collections in the Rust standard library, and see how they compare with similar containers in other languages. The ones we’re going to look at are:

  • Vec
  • String
  • HashMap

Vec

Vectors in Rust, as in most languages, are just dynamically-sized heap-allocated arrays. Similar to arrays, they are homogenous, so each one can only store a single type, although that type can be, say, an enumeration of multiple other types. As such, they’re implemented as generics, the principles of which will be familiar to those used to many strong-typed languages. Rust uses the same angle-bracket syntax as C++, C# and Java, as opposed to the square brackets seemingly preferred by Go, Scala and Python4.

Here’s a simple example of creating and manipulating a vector.

let mut v:Vec<i32> = Vec::new();
for i in 0..100 {
    v.push(i * i);
}
let total: i32 = v.iter().sum();
println!("Total is {}", total);

You’ll note that due to the type specification of v, the compiler can deduce the type of the other use of Vec. It’s possible to specify that as well, but you need to use the slightly unusual Vec::<i32>::new() syntax, where ::<> is affectionately known as the turbofish operator. I’m sure we’ll look at that in more detail when I get around to writing an article on Rust generics.

You can reference a Vec using the usual v[i] syntax, which will panic the item doesn’t exist, or you can use the get(i) method, which returns Option<T> to check for existence more gracefully.

As in C++, vectors are always contiguous in memory and hence updating them potentially breaks code that holds references, as the values may need to be moved to a reallocated block. In C++ this is handled by invalidating iterators, but Rust handles this using its usual ownership and reference consistency rules — if you take an immutable reference to an element in a vector, and then you try to use a mutable reference to (say) call push() whilst the immutable reference is still live, the compiler will flag the error.

Similarly the borrow checker will also make sure you’re not holding any references to the contents of the vector when the vector itself goes out of scope, because that will drop every item within it.

String

Some people may debate whether a String is indeed a collection, but I would argue those people are fooling themselves. String types in any language have a decent amount of complexity around them, and I suspect those who disagree with that are simply unaware of the complexities involved, and that lack of understanding will increase their risk of creating bugs, or simply building systems with a lack of internationalisation, which should be regarded as unacceptable in today’s world.

Besides, if you look at Rust’s source you’ll see that fundamentally a String is just a Vec<u8> — and Vec must be a collection, since we just talked about it.

The core language itself only provides str which is generally referred to as a string slice. This is because it refers to a subset of a string that’s held somewhere else — in the case of string literals, that somewhere is the read-only data section of the process, and for variables it’s wherever on the heap the String is storing its content. The str itself is just a reference to the first character and a length. As with Vec earlier, that means if you have a mutable String and you’re holding a reference to a str covering some subset of it, you’ll find you can’t actually change the string for the same reasons as outlined in the previous section for Vec.

In other words, this code will fail to compile.

1
2
3
4
5
6
7
fn main() {
    let mut s = String::from("Andy");
    let sub = &s[0..3];
    s.push_str(" Pearce");
    println!("My name starts with {}", sub);
    println!("My name is {}", s);
}

However, if you swapped lines 4 and 5 then everything would be fine, because sub is not used after bring printed, so the mutable reference borrowed for the push() doesn’t conflict with anything.

As well as the push_str() method we saw above, the + operator can only be used to concatenate strings, though it’s behaviour with regards to ownership is a little fiddly. The first argument has its value moved out, and so is no longer valid aferwards. The second argument, which must be passed by reference, is unchanged. The + operator strictly takes &str as its second parameter, but if you pass &String then the compiler can coerce it.

If you’re going to be doing this a lot, however, the format!() macro is probably more convenient — this uses the same syntax as println!() but returns a String instead of printing the text.

One thing that’ll be familiar to Python programmers is that both str and String are both unicode-aware. As it happens, this is achieved using UTF-8 encoding. As a result of this, Rust doesn’t allow you to index in to get a single character of the string — this is because the nth character doesn’t necessarily correspond to the nth byte, so indexing may be slower than the programmer expects for long strings. You can take a single-character slice if you like, of course, but keep reading for the caveats.

Another aspect that Rust exposes to programmers quite plainly are the different interpretations of Unicode data. There are three different ways of looking at a UTF-8 string:

  • A sequence of u8 values.
  • A sequence of char, which are Unicode scalars.
  • A sequence of grapheme clusters, which is the closest thing to what most people would call a “letter”.

When you take a slice then you’re indexing by byte — if this slices through the middle of a multibyte UTF-8 character you will get a panic at runtime. I think this definitely has the potential to catch some people out. For an example of this, try running the following variant of “hello world” in Hindi5.

This code will panic
1
2
3
4
5
fn main() {
    let s = String::from("हैलो वर्ल्ड");
    let sub = &s[0..4];
    println!("Substring: {}", sub);
}

If you call the len() method on s it’ll tell you the string is 31 bytes long. If you call chars() it’ll return an iterator over the char values which make up the string — in this case there are 11 of those. Since some of them are diacritics, however, there are only 7 of those — to check this I had to go find a community library called unicode-segmentation because the Rust standard library, quite sensibly, doesn’t get drawn this far into the depths of Unicode processing.

Anyway, that’s about it for String — conceptually simple, but in practice there’s quite a lot to think about. I think it’s important to stress that Rust hasn’t really created this complexity, it’s simply exposed the complexity that a properly internationalised application would need to deal with anyway, potentially with a lot less language support in some cases. But for people used to those languages, it might be a bit of an unexpected learning curve.

HashMap

Now we come to our old friend the associative array, aka dictionary, aka hash table, aka HashMap. This is pretty standard stuff in more or less any language, and anyone who’s been around the block will know that no matter how many fancy data structures you learn about, a simple hash table does a good enough job in a slightly dishearteningly huge number of cases. So knowing how to use it is important.

As with Vec, HashMap uses generics — HashMap<String, i32> uses String as a key and i32 as a value. Here’s a simple code example to show inserting and retrieving values from the map.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
use std::collections::HashMap;

fn main() {
    let mut counts = HashMap::new();
    counts.insert("Rust", 3);
    counts.insert("Python", 40);
    counts.insert("C++", 13);

    let rust_articles = counts.get("Rust").copied().unwrap_or(0);
    println!("I have written {rust_articles} Rust articles");

    for (language, count) in &counts {
        println!("{language}: {count} article(s)");
    }
}

I’d hope most of that was fairly self-explanatory, but there are a couple of things to note. Firstly, the insert() method will silently overwrite any existing entry, so it can be used both to insert and update. Secondly, there are a few things worth observing going on in line 9. The get() method returns an Option<&T>, so either an immutable reference to the entry or None. Since we don’t want a reference but the actual value, I use copied(), which converts references to copies of the underlying value6. Then I use unwrap_or() to return either the unwrapped value from the Option, or 0 if the value is None.

One thing you might also notice is that there’s nothing here declaring the type of counts — this indicates the flexibility of the type deduction employed by the Rust compiler, which will have given it type HashMap<&str, i32>. Of course, you’re entirely free to declare the type yourself to force the issue — for example, you might want to make sure that i64 is used for the counters.

Another thing that’s not obvious from the example above is the issue of ownership. Values like i32 implement the Copy trait, and so values can be copied into the map. But for heap-allocated values like String then using the actual values will cause them to be moved into the HashMap, with the original variables becoming invalid. If you store them by reference, then of course references to the values will be stored, but in that case the original lifetimes of the values could cause you problems if you don’t know what you’re doing. We’ll look at the lifetimes of values in a future article.

If we want to read and update an existing entry, as opposed to simply replacing it with insert(), then the entry() functioTo n can be used instead of get(). This returns an enum called Entry which is a little like Option, but tailored for the HashMap. This provides an or_insert() method, which returns a mutable reference to the entry’s value if it’s already in the HashMap, or inserts an entry with a value provided as a parameter and returns a mutable reference to that. For the counts structure above, a new article about a language that might or might not be new could be added like this.

fn count_article(counts: HashMap<&str, i32>, langauge: &str)
{
    let count = counts.entry(language).or_insert(0);
    *count += 1;
}

I think that about wraps it up for HashMap. You might be interested to know that Rust uses SipHash as its hashing algorithm, which helps guard against some attacks I discussed in an old article. It’s not the fastest out there, buf if you decide that raw performance is more important to you than security then it’s possible to construct a HashMap that uses other hashers.

Other Collections

It’s worth noting that the standard library provides a number of other collections that I haven’t gone through here. Just to give you a flavour, here are some examples.

LinkedList
Our old friend linked list, the structure that so many Computer Science students learn how to implement, and then learn why, in the real world, it’s rarely worth implementing. Still, it does have its uses, such as threading LRU lists through other data structures, so it would be remiss not to offer it.
VecDeque
As the name implies, this is a double-ended queue, or deque. It’s implemented with a ring buffer, so the elements aren’t necessarily contiguous in memory — you can access the contiguous portions as slices, and there’s a make_contigous() method which will essentially turn it into a vector and return that as a slice.
BTreeMap
An ordered map based on the BTree data structure. This is useful for cases where ordering of elements is important, and you need to select ranges rather than just individual items by key, or you want to know the minimum or maximum items. BTrees offer a good compromise between theoretically ideal balanced binary trees, and the practical constraints of cache efficiency in real systems. Unlike the constant time(ish) access with HashMap, most operations on BTrees are $O(\log{n})$.
HashSet and BTreeSet
These are the set equivalents to the maps already discussed — essentially they’re just maps where the value is always ().
BinaryHeap
Finally we have a priority queue implemented with a binary heap. This is useful when you want to always know the maximum item, by some criteria — traditionally the “highest priority”. You can push items or peek at the maximum item in constant time, but removing the item requires a partial re-sort and is $O(\log{n})$.

Conclusions

That’s about it for this article — a fairly lightweight one this time, but even at this basic level I found a few surprises. Once again I underestimated the degree to which more or less any construct is an expression which can be evaluated for a value, and I hadn’t appreciated the full subtleties around the loop vs the while construct. Also, the reminders about the importance of ownership in Rust were useful.

In terms of the collections, I was pleased to see those offered were a sensible subset of data structures that cover all the common use-cases that I can think of. The String type had some surprising subtleties, but I’m really glad to see that Rust puts Unicode support and internationalisation front and centre of the developer’s mind — this sort of thing is really difficult to add later if you’ve built a whole application around the assumption of ASCII. Mind you, the runtime panics of splitting a UTF-8 string midway through a code point were a bit of a surprise — that’s definitely something to be very careful about when taking substrings.

Hopefully that’s been enlightening, and it was certainly useful to me as an easy re-introduction to Rust after not using it for a long time. In the upcoming articles I’m going to drill in to the important concept of lifetimes, look at the trait system, take a look at error handling, and a close look at concurrency — these will all be important in the Geta implementation. As usual, if you spot any errors, or have any other feedback, do feel free to get in touch in the comments or by email.


  1. In some langauges these terms differ, but the semantics are typically very similar. For example, Ruby and Perl have next instead of continue, and Perl also has last instead of break

  2. In particular, if strictly unnecessary constructs upset you, I really recommend you stay away from Perl

  3. You might recall we looked at Option in the previous article in this series — but given that was published nearly four years ago, I think I’d have to forgive you if you didn’t. 

  4. I have very little opinion on this, it seems like a choice that’s only really of interest to those stuck with writing the languager’s lexers and parsers. It appears that some people object to the use of angle brackets for this purpose, but I selfishly hope this view doesn’t gain too much traction or language maintainers might be tempted to change their choice. Whilst I may have no strong opinion of the original choice, I would hope most people would agree that changing it later, and braking huge amounts of code in the process, is almost invariably a terrible idea. 

  5. Full disclosure: I’ve relied on Google Translate for this, I don’t speak, read or write Hindi, and I wouldn’t know Devanagari script if it knocked on my door and said, “hello, I’m Devanagari script.” So please accept my apologies if this translation is incorrect, and I do hope some naughty Google Engineer hasn’t made me publish something rude on my blog… 

  6. There are some details and caveats about copied() and the closely related cloned(), but they’re not relevant to this example. 

The next article in the “Uncovering Rust” series is Uncovering Rust: Errors and Methods
Sun 23 Apr, 2023
17 Apr 2023 at 12:18PM in Software
 | 
Photo by Luke Hodde on Unsplash
 |