Uncovering Rust: Types and Matching

This is part 2 of the “Uncovering Rust” series which started with Uncovering Rust: References and Ownership.

Rust is a fairly new multi-paradigm systems programming language that claims to offer both high performance and strong safety guarantees, particularly around concurrency and memory allocation. As I play with the language a little, I’m using this series of blog posts to discuss some of its more unique features as I come across them. This one discusses Rust’s data types and powerful match operator.


There are a few features you expect from any mainstream imperative programming language. One of them is some support for basic builtin types, such as integers and floats. Another is some sort of structured data type, where you can assign values to named fields. Yet another is some sort of vector, array or list for sequences of values.

We’re going to start this post by looking at how these standard features manifest in Rust. Some of this will be quite familiar to programmers from C++ and similar languages, but there are a few surprises along the way and my main aim is to discuss those.

Scalar Types

Rust has builtin scalar types for integers, floats, booleans and characters.

Due to Rust’s low-level nature, you generally have to be explicit about the sizes of these. There are integral types for 8-, 16-, 32-, 64- and 128-bit values, both signed and unsigned. For example i32 is a signed 32-bit integer, u128 is an unsigned 128-bit integer. There are also architecture-dependent types isize and usize which use the native word size of the machine. These are typically used for array offsets. Floats can be f32 for single-precision and f64 for double.

One point that’s worth noting here is that Rust is a strongly typed language and won’t generally perform implicit casts for you, even for numeric types. For example, you can’t assign or compare integers with floats, or even integers of different sizes, without doing an explicit conversion. This keeps costs explicit, but it does mean programmers need to consider their types carefully; that’s no bad thing in my humble opinion.
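
To illustrate, here’s a minimal sketch of the kind of explicit conversion that’s required; the as operator performs the cast:

let count: u8 = 200;
let total: u32 = 1_000;
// let sum = total + count;         // Compile error: mismatched types.
let sum = total + count as u32;     // Fine: the widening cast is explicit.
let ratio = sum as f64 / 3.0;       // Mixing with floats also needs a cast.
println!("{} {}", sum, ratio);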

Specifically on the topic of integers it’s also worth noting that Rust will panic (terminate the execution) if you overflow your integer size, but only in a debug build. If you compile a release build, the overflow is instead allowed to wrap around. However, the clear intention is that programmers shouldn’t be relying on such tricks to write safe and portable code.
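
If you actually want wrap-around behaviour (or want to detect overflow yourself), the integer types provide explicit methods for it; a small sketch:

let x: u8 = 250;
// let y = x + 10;                  // Panics in a debug build, wraps silently in release.
let wrapped = x.wrapping_add(10);   // Always wraps: gives 4.
let checked = x.checked_add(10);    // Gives an Option: None on overflow.
println!("{} {:?}", wrapped, checked);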

Values of type bool can be true or false. Even Rust hasn’t managed to introduce anything surprising or unconventional about booleans! One point of interest is that the expression in an if statement has to be a bool. Once again there are no implicit conversions, and there is no assumption of equivalence between, say, false and 0 as there is in C++.
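
As a quick sketch of that strictness, the commented-out line below won’t compile; the condition has to be spelled out as a real bool:

let flags = 1;
// if flags { ... }         // Compile error: expected `bool`, found integer.
if flags != 0 {             // The comparison makes the intent explicit.
    println!("flag set");
}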

The final type char has a slight surprise waiting for us, which is that it has a size of four bytes and can represent any Unicode code point. It’s great to see Unicode support front and centre in the language like this, hopefully making it very difficult for people who want to assume that the world is ASCII. Those of you familiar with Unicode may also know that the concept of what constitutes a “character” may surprise those who are used to working only with ASCII, so there could be puzzled programmers out there at times. But we live in a globalised world now and there’s no longer any excuse for any self-respecting programmer to write ASCII-first code.
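
A char is therefore wide enough to hold any single code point, as this little sketch shows:

let letter = 'A';
let kanji = '語';
let emoji = '🦀';
println!("{} {} {}", letter, kanji, emoji);
println!("char is {} bytes", std::mem::size_of::<char>());   // Prints 4.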

Arrays

Rust arrays are homogeneous (each array contains values of only one type) and are of a fixed size, which must be known at compile time. They are always stored on the stack. Rust does provide a more dynamic Vec type which uses the heap and allows resizing, but I’m not going to discuss that here.

In the interests of safety, Rust requires that every element of an array be initialised when it is constructed. Because of this, it’s usually not required to specify a type, but of course there is a syntax for doing so. It’s also possible to initialise every item to the same value using a shorthand. These are all illustrated in the example below.

// These two are equivalent, due to type inference.
let numbers1 = [9, 9, 9, 9, 9];
let numbers2: [i32; 5] = [9, 9, 9, 9, 9];
let numbers3 = [9; 5];  // Repeated value shorthand.

Although the size of the array must be known at compile-time, of course the compiler can’t police your accesses to the array; for example, you may access an item based on user input. Rust does do bounds-checking at runtime, however. Discussion of how to handle runtime errors like this is a topic for another time, but the default action will be to terminate the executable immediately.
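
For instance, this sketch compiles happily but panics when it runs, because the offending index is only known at runtime:

let numbers = [9, 9, 9, 9, 9];
// Imagine this index came from user input rather than a literal.
let index: usize = "7".parse().unwrap();
// The access is bounds-checked at runtime and panics: index out of bounds.
println!("{}", numbers[index]);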

Structures and Tuples

The basic mechanics of structs in Rust work quite analogously to those in C++, aside from some minor syntactic differences. Here’s a definition to illustrate:

struct Contact {
    first_name: String,
    last_name: String,
    email: String,
    age: u8,
    business: bool,
}

To create an instance of a struct the syntax is similar, except that you provide values instead of types after the colons. After creation, the dot notation used to read and assign struct fields will also be familiar to both C++ and Python programmers:

fn main() {
    let mut contact1 = Contact {
        first_name: String::from("John"),
        last_name: String::from("Doe"),
        email: String::from("jdoe@example.com"),
        age: 21,
        business: false,
    };
    println!("Contact name is {} {}",
             contact1.first_name, contact1.last_name);
    contact1.first_name = String::from("Jane");
    println!("Contact name is {} {}",
             contact1.first_name, contact1.last_name);
}

Note that to assign to first_name we had to make contact1 mutable and that this mutability applies to the entire structure, not to each field. No surprises for C++ programmers there either.

Now there are a couple more unique features that are worth mentioning. The first of them comes when creating constructor methods. Let’s say we want to avoid having to set the business field, so we wrap it up in a function:

fn new_business_contact(first_name: String,
                        last_name: String,
                        email: String,
                        age: u8)
                        -> Contact {
    Contact {
        first_name: first_name,
        last_name: last_name,
        email: email,
        age: age,
        business: true
    }
}

However, it’s a bit tedious repeating all those field names in the body. Well, if the function parameters happen to match the field names you can use a shorthand for this:

fn new_business_contact(first_name: String,
                        last_name: String,
                        email: String,
                        age: u8)
                        -> Contact {
    Contact {
        first_name,
        last_name,
        email,
        age,
        business: true
    }
}

Another convenient syntactic trick is the struct update syntax, which can be used to create a copy of another struct with some changes:

let contact1 = Contact {
    // ... all the fields initialised as in the earlier example ...
};

let contact2 = Contact {
    first_name: String::from("John"),
    last_name: String::from("Smith"),
    ..contact1
};

This will duplicate all fields not explicitly changed. There can be a sting in this particular tail, though, due to the ownership rules. In this example, the String value from contact1.email will be moved into contact2.email and so the first instance will no longer be valid after this point.
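
To make that concrete: after the assignment above, the commented-out line in this sketch would be rejected, while the new struct remains fully usable.

println!("{}", contact2.email);       // Fine: contact2 now owns the String from contact1.
// println!("{}", contact1.email);    // Compile error: borrow of moved value.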

Finally in this section I’ll briefly talk about tuples. I’m talking about them here rather than along with other compound types because I feel they work in a very similar way to structs, just without the field names. They have a fixed size defined when they are created and this cannot change, as with an array. Unlike an array, however, they are heterogeneous: they can contain multiple different types.

One thing that might surprise Python programmers in particular, however, is that the elements of a tuple are accessed using dot notation in the same way as a struct. In a way you can think of it as a struct where the names of the fields are just automatically chosen as base-zero integers.

fn main() {
    let tup = (123, 4.56, "hello");
    println!("{} {} {}", tup.0, tup.1, tup.2);
    // Can also include explicit types for the tuple fields.
    let tup_copy: (u32, f64, &str) = tup;
}

If you want to share the definition of a tuple around in the same way as for a struct but you don’t want to give the fields names, you can use a tuple struct to do that:

struct Colour(u8, u8, u8);

fn main() {
    let purple = Colour(255, 0, 255);
    println!("R={} G={}, B={}", purple.0, purple.1, purple.2);
}

In all honesty I’m not entirely sure how useful that’ll be, but time will tell.

The final note here is that structs can also hold references, although none of the examples here utilised that. However, doing so means exercising a little more care because the original value can’t go out of scope any time before any structs with references to it. This is a topic for a future discussion on lifetimes.

Enumerations

Continuing the theme of data types that C++ offers, Rust also has enumerations, hereafter referred to as enums. Beyond the name the similarity gets very loose, however. In C++ enums are essentially a way to add textual aliases to integral values; there’s a bit of syntactic sugar to treat them as regular values, but you don’t have to dip your toes too far under the water to get them bitten by an integer.

In Rust, however, they have features that are more like a union in C++, although unlike a union they don’t rely on the programmer to know which variant is in use at any given time.

You can use them very much like a regular enum. The values defined within the enum are scoped within the namespace of the enumeration name1.

enum ContactType {
    Personal,
    Colleague,
    Vendor,
    Customer,
}

let contact1_type = ContactType::Personal;
let contact2_type = ContactType::Vendor;

However, much more powerfully than this, these variants can also have data values associated with them, and each variant can be associated with its own data type.

// We reference contacts by their email address except for
// colleagues, where we use employee number; and vendors,
// where we use supplier ID, which consists of three numbers.
enum ContactType {
    Personal(String),
    Colleague(u64),
    Vendor(u32, u32, u32),
    Customer(String)
}

let customer = ContactType::Customer(String::from("andy@example.com"));
let colleague = ContactType::Colleague(229382);
let supplier = ContactType::Vendor(23, 223, 4);

This construct is great for implementing the sort of code where you need to branch differently based on the underlying type of something. I can just hear the voices of all the object-orientation purists declaring that polymorphism is the correct solution to this problem: that everything should be exposed as an abstract method in the base class that all the derived classes implement. I wouldn’t say I disagree necessarily, but I would also say that this isn’t a clean fit in every case and polymorphism isn’t the one-size-fits-all solution it has on occasion been presented as.

Rust implements some types of polymorphism and features such as traits are a useful alternative to inheritance for code reuse, as we’ll see in a later post. But since Rust doesn’t implement true inheritance, more properly called subtype polymorphism, then I suspect this flexibility of enumerations is more important in Rust than it would be in C++.

A little further down we’ll see how to use the match operator to do this sort of switching in an elegant way, but first we’ll see one example of a pre-defined enum in Rust that’s particularly widely used.

Option

It’s a very common case that a function needs to return a value in the happy case or raise some sort of error in the less happy case. Different languages have different mechanisms for this, one of the more common in modern languages being to raise exceptions. This is particularly common in Python, where exceptions are used for a large proportion of the functionality, but it’s also quite normal in C++, where destructors and the stack unwinding process are both heavily oriented around making this a fairly safe process.

Despite its extensive support for exceptions, however, C++ is still a bit of a hybrid and it has a number of cases where its APIs still use the other primary method of returning errors, via the return value. A good example of this is the std::string::find() method which searches for a substring within the parent string. This clearly has two different classes of result: either the string is found, in which case the offset within the parent string is returned; or the string is not found, in which case the method returns the magic std::string::npos value. In other cases functions can return either a pointer for the happy case or a NULL in case of error.

Rust does not support exceptions. This is for a number of reasons, partly related to the overhead of raising exceptions and also the fact that return values make it easier for the compiler to force the programmer to handle all error cases that a function can return.

To implement these error returns in Rust, therefore, is where the Option enum comes in useful. It’s defined something like this:

enum Option<T> {
    Some(T),
    None,
}

This enum is capable of storing some type T which is a template type (generics will be discussed properly in a later post), or the single value None. This allows a function to return any value type it wishes, but also leave open the possibility of returning None for an error.
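
As a small sketch of how a function might use it, imagine a hypothetical safe_div() helper that has no sensible answer for a zero divisor:

fn safe_div(a: u32, b: u32) -> Option<u32> {
    if b == 0 {
        None            // No meaningful value to return.
    } else {
        Some(a / b)     // Wrap the happy-path result in Some.
    }
}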

That’s about all there is to say about Option, and we’ll see the idiomatic way to use it in the next section.

Matching

The final thing I’m going to talk about is the match flow control operator. This is conceptually similar to the switch statement in C++, but it’s got rather more cleverness up its sleeves.

The first thing to note about match is that unlike switch in C++ it is an expression instead of a statement. One aspect of Rust I haven’t talked about yet is that expressions may contain statements, however, so this isn’t a major obstacle. But it does mean that it’s fairly easy to use simple match expressions in assignments or as return values:

enum Direction {
    North,
    South,
    East,
    West,
}

fn get_bearing(d: Direction) -> u16 {
    match d {
        Direction::North => 0,
        Direction::East => 90,
        Direction::South => 180,
        Direction::West => 270,
    }
}

The match expression has multiple “arms” which have a pattern and a result expression. To do more than just return a value from the expression, we can wrap it in braces:

fn get_bearing(d: Direction) -> u16 {
    match d {
        Direction::North => 0,
        Direction::East => {
            println!("East is East");
            90
        },
        Direction::South => {
            println!("Due South");
            180
        },
        Direction::West => {
            println!("Go West");
            270
        },
    }
}

We can use the patterns to do more than just match specific values, though. Taking the Option type from earlier, we can use it to extract the return values from functions whilst still ensuring we handle all the error cases.

For example, the String::find() method searches for a substring and returns an Option<usize> which is None if the value wasn’t found or the offset within the string if it was found. We can use this to, say, extract the domain part from an email address:

fn get_domain(email: &String) -> &str {
    match email.find('@') {
        None => "",
        Some(x) => &email[x+1..],
    }
}

This function takes a String reference and returns a string slice representing the domain part of the email, unless the email address doesn’t contain an @ character in which case we return an empty string. I’m not going to say that the semantics of an empty string are ideal in this case, but it’s just an example.

As another example we could write a function to display the contact details for the ContactType defined earlier:

enum ContactType {
    Personal(String),
    Colleague(u64),
    Vendor(u32, u32, u32),
    Customer(String)
}

fn show_contact(contact: ContactType) {
    match contact {
        ContactType::Personal(email) => {
            println!("Personal: {}", email);
        },
        ContactType::Colleague(employee_number) => {
            println!("Colleague: {}", employee_number);
        },
        ContactType::Vendor(id1, id2, id3) => {
            println!("Vendor: {}-{}-{}", id1, id2, id3);
        },
        ContactType::Customer(email) => {
            println!("Customer: {}", email);
        },
    }
}

One aspect of match statements that isn’t immediately obvious is that they are required to be exhaustive. So, if you don’t handle every enum value, for example, then you’ll get a compile error. This is what makes things like the Option example particularly safe, as it forces handling of all errors, which is generally regarded as good practice if you’re writing robust code. This also makes perfect sense if you consider that match is an expression: if you assign the result to a variable, say, then the compiler needs something to assign, and if you hit a case that your match doesn’t handle then what’s the compiler going to do?

Of course if we’re using match for something other than an enum then handling every value would be pretty tedious. For these cases we can use the pattern _ as the default match. The example below also shows how we can match multiple patterns using | as a separator:

fn is_perfect(n: u32) -> bool {
    match n {
        6 | 28 | 496 | 8128 | 33_550_336 => true,
        _ => false
    }
}

Here we’re meeting the needs of match by covering every single case. If we removed that final default arm, the compiler wouldn’t let us get away with it:

error[E0004]: non-exhaustive patterns: `0u32..=5u32`,
`7u32..=27u32`, `29u32..=495u32` and 3 more not covered
  --> src/main.rs:10:11
   |
10 |     match n {
   |           ^ patterns `0u32..=5u32`, `7u32..=27u32`,
`29u32..=495u32` and 3 more not covered
   |
   = help: ensure that all possible cases are being handled,
possibly by adding wildcards or more match arms

But what if we really wanted to only handle a single case? It would be pretty dull if we had to have a default arm in a match then check for that value being returned and ignore it.

Let’s take the get_domain() example from earlier. Let’s say that if you find a domain, you want to use it; but if not, you have some more complicated logic to invoke to infer the domain by looking at the username. You could handle that by doing something like this:

fn get_domain(email: &String) -> &str {
    let ret = match email.find('@') {
        None => "",
        Some(x) => &email[x+1..],
    };
    if ret != "" {
        ret
    } else {
        // More complex logic goes here...
    }
}

But that’s a little clunky. Rust has a special syntax called if let for handling just a single case like this:

fn get_domain(email: &String) -> &str {
    if let Some(x) = email.find('@') {
        &email[x+1..]
    } else {
        // More complex logic goes here...
    }
}

I only recently came across this syntax and my opinions are honestly a little mixed. Whilst I find the match statements comprehensible and intuitive, this odd combination of if and let just seems unusual to me. Mind you, I suspect it’s a common enough case to be useful.

So that’s a whirlwind tour of match and Rust’s pattern-matching. It’s important to note that this is a much more powerful feature than I’ve managed to express here as we’ve only really discussed matching by literals and by enum type. In general patterns can be used in fairly creative ways to extract fields from values at the same time as matching literals, and they can even have conditional expressions added, which Rust calls match guards. These are illustrated in the (rather contrived!) example below:

struct Colour {
    red: u8,
    green: u8,
    blue: u8
}

fn classify_colour(c: Colour) {
    match c {
        Colour {red: 0, green: 0, blue: 0} => {
            println!("Black");
        },
        Colour {red: 255, green: 255, blue: 255} => {
            println!("White");
        },
        Colour {red: r, green: 0, blue: 0} => {
            println!("Red {}", r);
        },
        Colour {red: 0, green: g, blue: 0} => {
            println!("Green {}", g);
        },
        Colour {red: 0, green: 0, blue: b} => {
            println!("Blue {}", b);
        },
        Colour {red: r, green: g, blue: 0} => {
            println!("Brown {} {}", r, g);
        },
        Colour {red: r, green: 0, blue: b} => {
            println!("Purple {} {}", r, b);
        },
        Colour {red: r, green: g, blue: b} if r == b && r == g => {
            println!("Grey {}", r);
        }
        Colour {red: r, green: g, blue: b} => {
            println!("Mixed colour {}, {}, {}", r, g, b);
        }
    }
}

Hopefully most things there are fairly self-explanatory and in any case it’s just intended as an illustration of the sorts of facilities that are available. It’s also worth mentioning that the compiler does give you some help to detect if you’re masking patterns with earlier ones, but it doesn’t appear to be perfect. For example, if I moved the first two matches to the end of the list, they were both correctly flagged as unreachable. However, if I moved the pattern for white after the pattern for grey it didn’t generate a warning; I’m guessing the job of determining reachability around match guards is just too difficult to do reliably.

Conclusions

Rust’s type system certainly offers some powerful flexibility, and the pattern matching looks like a fantastic feature for pulling apart structures and matching special cases within them. The specific Option enum also looks like quite a pleasant way to implement the “value or error” case given that Rust doesn’t offer exceptions for this purpose.

My main reservation around these features is that there’s an awful lot of syntax building up here, and it’s a fine line between a good amount of expressive power and edging into Perl’s “there’s too many ways to do it” philosophy. The if let syntax in particular seems possibly excessive to me. But I’m certainly reserving judgement on that for now until I’ve had some more experience with the language.


  1. For anyone familiar with C++11, this is what you get when you declare a C++ enum with enum class MyEnum { … }

22 Jun 2019 at 8:00AM by Andy Pearce in Software  | Photo by Matt Lamers on Unsplash  | Tags: rust

Uncovering Rust: References and Ownership

This is part 1 of the “Uncovering Rust” series.

Rust is a fairly new multi-paradigm systems programming language that claims to offer both high performance and strong safety guarantees, particularly around concurrency and memory allocation. As I play with the language a little, I’m using this series of blog posts to discuss some of its more unique features as I come across them. This one talks about Rust’s ownership model.


Over the last few years I’ve become more aware of the Rust programming language. Slightly more than a decade old, it has consistently topped the Stack Overflow Developer Survey in the most loved language category for the last four years, so there’s clearly a decent core of very keen developers using it. It aims to offer performance on a par with C++ whilst considerably improving on the safety of the language, so as a long-time C++ programmer who’s all too aware of its potential for painfully opaque bugs, I thought it was definitely worth checking what Rust brings to the table.

As the first article in what I hope will become a reasonable series, I should briefly point out what these articles are not. They are certainly not meant to be a detailed discussion of Rust’s history or design principles, nor a tutorial. The official documentation and other sources already do a great job of those things.

Instead, this series is a hopefully interesting tour of some of the aspects of the language that set it apart, enough to get a flavour of it and perhaps decide if you’re interested in looking further yourself. I’m specifically going to be comparing the language to C++ and perhaps occasionally Python as the two languages with which I’m currently most familiar.

Mutability

Before I get going on the topic of this post, I feel it’s important to clarify one perhaps surprising detail of Rust to help understand the code examples below, and it is this: all variables are immutable by default. It’s possible to declare any variable mutable by prefixing with the mut keyword.
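
A quick sketch of what that looks like in practice:

let x = 5;
// x = 6;            // Compile error: cannot assign twice to immutable variable.

let mut y = 5;
y = 6;               // Fine: y was declared mutable.
println!("{} {}", x, y);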

I could imagine some people considering this a minor syntactic issue, as it just means what would be const in C++ is non-mut in Rust, and non-const in C++ is mut in Rust. So why mention it? Well, mostly to help people understand the code examples a little more easily; whilst it’s debatably not a fundamental issue, it’s also not something that’s necessarily self-evident from the syntax either.

Also, I think it’s a nice little preview of the way the language pushes you towards one of its primary goals: safety. If you forget the modifier things default to the most restrictive situation, and the compiler will prod you to add the modifier explicitly if that’s what you want. But if it isn’t what you want, you get the hint to fix a potential bug. Immutable values typically also make it much easier to take advantage of concurrency safely, but that’s a topic for a future post.

Ownership

Since one of the touted features of the language is safety around memory allocation, I’m going to start off outlining how ownership works in Rust.

Ownership is a concept that’s stressed many times throughout the Rust documentation, although in my view it’s pretty fundamental to truly understanding any language. Manipulating variables in memory is the bulk of what software does most of the time and errors around ownership are some of the most common sources of bugs across multiple languages.

In general “owning” a value in this context means that a piece of code has a responsibility to manage the memory associated with that value. This isn’t about mutability or any other concept people might feasibly regard as forms of ownership.

Just to be clear, I’m going to skip discussion of stack-allocated variables here. Management of data on the stack is generally similar in all mainstream imperative languages and generally falls out of the language scoping rules quite neatly, so I’m going to focus this discussion on the more interesting and variable topic of managing heap allocations.

In C++ ownership is a nebulous concept and left for the programmer to define. The language provides the facility to allocate memory and it’s up to the programmer to decide when it’s safe to free it. Techniques such as RAII allow a heap allocation to be tied to a particular scope, either on the stack or linked with an owning class, but this must be manually implemented by the programmer. It’s quite easy to neglect this in some case or other, and since it’s aggressively optional the compiler isn’t going to help you police yourself. As a result, memory mismanagement is a very common class of bugs in C++ code.

Higher-level languages tend to utilise different forms of garbage collection to avoid exposing the programmer to these issues. Python’s reference counting is a simple concept and covers most cases gracefully, although it adds performance overhead to many operations in the language and cyclic references complicate matters such that additional garbage collection algorithms are still required. Languages like Java with tracing garbage collectors impose less performance penalty on access than reference counting, but may be prone to spikes of sudden load when a garbage sweep is done. These systems are also often more complex to implement, especially as in the real world they’re often a hybrid of multiple techniques. This isn’t necessarily a direct concern for the programmer, as someone else has done all the hard work of implementing the algorithm, but it does inch up the risk of hitting unpredictable pathological performance behaviour. These can be the sort of intermittent bugs that we all love to hate to investigate.

All this said, Rust takes a simpler approach, which I suppose you could think of as what’s left of reference counting after a particularly aggressive assault from Ockham’s Razor.

Rust enforces three simple rules of ownership:

  1. Each value has a variable which is the owner.
  2. Each value has exactly one owner at a time.
  3. When the owner goes out of scope the value is dropped1.

I’m not going to go into detail on the scoping rules of Rust right now, although there are some interesting details that I’ll probably cover in another post. For now suffice to say that Rust is lexically scoped in a very similar way to C++ where variables are in scope from their definition until the end of the block in which they’re defined2.

This means, therefore, that because a value has only a single owner, and because the scope of that owner is well-defined and must always exit at some point, there is no way for the value not to be dropped, and so its memory cannot be leaked. Hence Rust achieves the promised memory safety with some very simple rules that can be validated at compile-time.

So there you go, you assign a variable and the value will be valid until such point as that variable goes out of scope. What could be simpler?

// Start of block.
{
    
    // String value springs into existence.
    let my_value = String::from("hello, world");
    println!("Value: {}", my_value);
    
}
// End of block, my_value out of scope, value dropped.

Moving right along

Well of course it’s not quite that simple. For example, what happens if we assign the value to another variable? I mean, that’s a pretty simple case. How hard can it be to figure out what this code will print?

fn main() {
    let my_value = String::from("hello, world");
    let another_value = my_value;
    println!("Values: {} {}", my_value, another_value);
}

The answer is: slightly harder than you might imagine. In fact the code above won’t even compile:

   Compiling sample v0.1.0 (/Users/apearce16/src/local/rust-tutorial/sample)
error[E0382]: borrow of moved value: `my_value`
 --> src/main.rs:4:31
  |
2 |     let my_value = String::from("hello, world");
  |         -------- move occurs because `my_value` has type
`std::string::String`, which does not implement the `Copy` trait
3 |     let another_value = my_value;
  |                         -------- value moved here
4 |     println!("Values: {} {}", my_value, another_value);
  |                               ^^^^^^^^ value borrowed here after move

  error: aborting due to previous error

This is because Rust implements move semantics by default on assignment. So what’s really happening in the code above is that a string value is created and ownership is assigned to the my_value variable. Then this is assigned to another_value which results in ownership being transferred to the another_value variable. At this point the my_value variable is still in scope, but it’s no longer valid.

The compiler is pretty comprehensive in explaining what’s going on here: the value is moved in the second line and then the invalidated my_value is referenced in the third line, which is what triggers the error.

This may seem unintuitive to some people, but before making any judgements you should consider the alternatives. Firstly, Rust could abandon its simple ownership rules and allow arbitrary aliasing like in C++. Except that would mean either exposing manual memory management or replacing it with a more expensive garbage collector, both of which compromise on the goals of safety and performance respectively.

Secondly, Rust could perform a deep copy of the data on the assignment, so duplicating the value and ending up with two variables each with its own copy. This is workable, but defeats the goal of performance as memory copying is pretty slow if you end up doing an awful lot of it. It also violates a basic programmer expectation that a simple action like assignment should not be expensive.

And so we’re left with the move semantics defined above. It’s worth noting, however, that this doesn’t apply to all types. Some are defined as being safe to copy: generally the simple scalar types such as integers, floats, booleans, and so on. The key property of these which make them safe is that they’re stored entirely on the stack, there’s no associated heap allocation to handle. It’s also possible to declare that new types are safe to copy by adding the Copy trait, but traits are definitely a topic for a later post.
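
So, in contrast to the String example above, this sketch is perfectly fine, because the integer is copied rather than moved:

fn main() {
    let a = 42;
    let b = a;                        // i32 implements Copy, so this is a copy, not a move.
    println!("Both still valid: {} {}", a, b);
}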

It’s also worth noting that these move semantics are not as restrictive as they might seem due to the existence of references, which I’ll talk about later in this post. First, though, it’s interesting to look at how these semantics work with functions.

Ownership in and out of functions

The ownership rules within a scope are now clear, but what about passing values into functions? In C++, for example, arguments are passed by value which means that the function essentially operates on a copy. If this value happens to be a pointer or reference then of course the original value may be modified, but as mentioned above we’re deferring discussion of references in Rust for a moment.

Argument passing would appear to suffer the same issues as the assignment example above, in that we don’t want to perform a deep copy, but neither do we want to complicate the ownership rules. So it’s probably little surprise that argument passing into functions also passes ownership in the same way as the assignment.

This code snippet will fail to compile:

fn main() {
    let s = String::from("hello");
    my_function(s);
    // Oops, s isn't valid here any more!
    println!("Value of s: {}", s);
}

fn my_function(arg: String) {
    // Ownership passes to the 'arg' parameter.
    println!("Now I own {}", s);
    // Here 'arg' goes out of scope and the String is dropped.
}

Although this may seem superficially surprising, when you really think about it argument passing is just a fancy form of assignment into a form of nested scope, so it shouldn’t be a surprise that it follows the same semantics.

The same logic applies to function return values, and this is where things could get slightly surprising for C++ programmers, who are used to returning pointers or references to stack values being a tremendous source of bugs, and returning non-referential values being a cause of potentially expensive copy operations.

In C++ when the function call ends, any pointer or reference to anything on its stack that is passed to the caller will now be invalid. These can be some pretty nasty bugs, particularly for less experienced programmers. It doesn’t help that the compiler doesn’t stop you doing this, and also that these situations often give the appearance of working correctly initially, since the stack frame of the function has often not been reused yet so the pointer still seems to point to valid data immediately after the call returns. This clearly harms the safety of the code.

If the programmer decides to resolve this issue by returning a complex class directly by value instead of by pointer or reference, then this generally entails default construction of an instance in the caller, then execution of the function and then assignment of the returned value to the instance in the caller which might involve some expensive copying. This potentially harms the performance of the code.

I’m deliberately glossing over some subtleties here around returning temporary objects, return value optimisation and move semantics in C++ which are all well outside the scope of this post on Rust. But even though solutions to these issues exist, they require significant knowledge and experience on the part of the programmer to take advantage of correctly, particularly for user-defined classes.

In Rust things are simpler: you can return a local value and ownership passes to the caller in the obvious manner.

fn main() {
    let my_value = create();
    // At this point 'my_value' owns a String.
    println!("Now I own {}", my_value);

    let another_value = transform(my_value);
    // At this point 'another_value' owns a string,
    // but 'my_value' is now invalid.
    println!("Now I own {}", another_value);
}

fn create() -> String {
    let new_str = String::from("hello, world");
    // Ownership will pass to the caller.
    new_str
}

fn transform(mut arg: String) -> String {
    // We've declared the argument mutable, which is OK
    // since ownership has passed to us. We append some
    // text to it and then return it, whereupon ownership
    // passes back to the caller.
    arg.push_str("!!!");
    arg
}

For anyone puzzled by the bare expressions at the end of create() and transform(), suffice to say for now this is an idiomatic way to return a value in Rust. The language does have a return statement, but a bare expression also works in some cases. I’ll discuss this more in a later post.

So in the case of return values, the move semantics of ownership in Rust turn out to be pretty useful: the ownership passes to the caller safely and with no need for expensive copying, since somewhere under the hood it’s just a transfer of some reference to a value on the heap. Since the rules apply everywhere it all feels quite consistent and logical.

But as logical as it is, it may seem awfully inconvenient. There are many cases we want a value to persist after it has been operated on by a function. It would be annoying to have to deep-copy an object every time, or to constantly have to return the argument to the caller as in the example above.

Fortunately Rust provides references to resolve this inconvenience.

References

In Rust references provide a way to refer to a value without actually taking ownership of it. The example below demonstrates the syntax, which is quite reminiscent of C++:

fn main() {
    let my_string = String::from("one two three");
    let num_words = count_words(&my_string);
    // 'my_string' is still valid here.
    println!("'{}' has {} words", my_string, num_words);
}

// I'm sure there are more elegant ways to implement
// this function, this is just for illustrating the point.
fn count_words(s: &String) -> usize {
    let mut words = 0;
    let mut in_word = false;
    for c in s.chars() {
        if c.is_alphanumeric() {
            if !in_word {
                words += 1;
                in_word = true;
            }
        } else {
            in_word = false;
        }
    }
    words
}

The code example above shows a value being passed by immutable reference. Note that the function signature needs to be updated to take a reference &String, but the caller must also explicitly declare the parameter to be a reference with &my_string. This is unlike C++, where there’s no explicit hint to someone reading the code in the caller that a value might be passed by reference. For immutable references (or const refs in C++ parlance) this isn’t a big deal, but I’ve always felt it’s important to know for sure whether a function might modify one of its parameters in-place, and in C++ you have to go and check the function signature every time to tell whether this is the case. This has always been one of my biggest annoyances with C++ syntax and it’s great to see it’s been addressed in Rust.

Taking a reference is rather quaintly known as borrowing in Rust. You can take as many references to a value as you like as long as they’re immutable.

fn main() {
    let mut my_value = String::from("hello, world");
    let ref1 = &my_value;
    let ref2 = &my_value;
    let ref3 = &my_value;
}

Of course, attempting to modify the value through any of these references will result in a compile error, since they’re immutable. As you’d expect it’s also possible to take mutable references:

fn main() {
    let mut my_value = String::from("world");
    prefix_hello(&mut my_value);
    println!("New value: {}", my_value);
}

fn prefix_hello(arg: &mut String) {
    arg.insert_str(0, "hello ");
}

This example also illustrates that it’s once again clear in the context of the caller that it’s specifically a mutable reference that’s being passed.

This all seems great, but there are a couple of restrictions I haven’t mentioned yet. Firstly, it’s only valid to have a single mutable reference to a value at once. If you try to create more than one you’ll get an error at compile-time. Secondly, you can’t have both an immutable and a mutable reference valid at the same time, which would also be a compile-time error.
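
Both restrictions are easy to demonstrate; neither of the commented-out lines in this sketch would compile if they were enabled:

fn main() {
    let mut my_value = String::from("hello");
    let ref1 = &mut my_value;
    // let ref2 = &mut my_value;   // Error: cannot borrow as mutable more than once.
    // let ref3 = &my_value;       // Error: cannot borrow as immutable while mutably borrowed.
    ref1.push_str(", world");
    println!("{}", ref1);
}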

The logic behind this is around safety when values are used concurrently. These rules do a good job of ruling out race conditions, as it’s not possible to have multiple references to the same object unless they’re all immutable, and if the data doesn’t change then there can’t be a race. It’s essentially a multiple readers/single writer lock.

The compiler also protects you against creating dangling references, such as returning a reference to a function’s local variable. That will fail to compile3.
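
For example, a sketch like this is rejected, because the String would be dropped at the end of the function and the returned reference would be left dangling:

// Doesn't compile: the function's return type contains a borrowed value,
// but there is nothing outside the function for it to borrow from.
fn dangling() -> &String {
    let s = String::from("hello");
    &s
}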

A slice of life

Whilst I’m talking about references anyway, it’s worth briefly mentioning slices. These are like references, but they only refer to a subset of a collection.

fn main() {
    let my_value = String::from("hello there, world");
    // String slice 'there'.
    let there = &my_value[6..11];
    println!("<<{}>>", there);
}

The example above shows a use for an immutable string slice. Actually you may not have realised it, but you’ve seen one of those earlier in this post: all string literals are in fact immutable string slices.

As with slices in most languages the syntax is a half-open interval where the first index is inclusive, the second exclusive. It’s also possible to have slices of other collections that are contiguous and it’s possible to have mutable slices as well.

fn main() {
    let mut my_list = [1,2,3,4,5];
    let slice = &mut my_list[1..3];
    slice[1] = 99;
    // [1, 2, 99, 4, 5]
    println!("{:?}", my_list);
}

As far as I’ve been able to tell so far, however, it doesn’t seem to be possible to assign to the entirety of a mutable slice to replace it. I can understand several reasons why this might not be a good idea to implement, not least of which is that it could change the size of the slice and hence necessitate moving items around in memory that aren’t even part of the slice (if you assign something of a different length). But I thought it was worth noting.

Conclusions

In this post I’ve summarised what I know so far about ownership and references in Rust and generally I think it’s shaping up to be a pretty sensible language. Of course it’s hard to say until you’ve put it to some serious use4, but I can see that there are good justifications for the quirks that I’ve discovered so far, bearing in mind the overarching goals of the language.

The ownership rules seem simple enough to keep in mind in practice, and it remains to be seen whether they will make writing non-trivial code more cumbersome than it needs to be. I like the explicit reference syntax in the caller and whilst the move semantics might seem odd at first, I think they’re simple and consistent enough to get used to pretty quickly. The fact that the compiler catches so many errors should be particularly helpful, especially as I’ve found its output format to be pleasantly detailed and particularly helpful compared to many other languages.


  1. What you would call memory being freed in C++ is referred to as a value being dropped in Rust. The meaning is more or less the same. 

  2. Spoiler alert: the scope of a variable in Rust actually extends to the last place in the block where it is referenced, not necessarily to the end of the block, but that doesn’t materially alter the discussion of ownership. 

  3. Unless you specify the value has a static lifetime but I’ll talk about lifetimes another time. 

  4. I came across Perl in 1999 and thought it was pretty cool from learning it right up until I had to try to fix bugs in the first large project I wrote in it, so it just goes to show that first impressions of programming languages are hardly infallible. 

18 Jun 2019 at 7:45PM by Andy Pearce in Software  | Photo by Matt Lamers on Unsplash  | Tags: rust

Tracing MacOS Filesystem Events

Recently I had cause to find out where a particular process is currently writing a file on MacOS and I wanted to describe how I went about it for reference.


Now I should point out at this stage that I’m very far from a MacOS expert. I know a few basics, but generally things are slick enough that I don’t tend to need to drop down to the terminal to do a lot. As a result, I’m still discovering little corners where MacOS either provides better tools than I’m used to on Linux, or has some quirky differences to how they work.

Disclaimer aside, here’s the deal. I had this process, which I knew was downloading a file. I knew it was a very large file; but I didn’t know where it was being downloaded to. I knew the end destination, but it rapidly became clear this process was downloading it to somewhere temporary, so it could presumably rename it into place later. I wanted to monitor the size of the file so I could see how far along it was, so I could figure out how long I was able to spend making a cup of tea. Important stuff.

My go-to solution to this issue on Linux would be to locate the PID, then just ls -l /proc/<pid>/fd. This lists all open filehandles for a process and they’re shown as symlinks to the open file. Handy.

MacOS, sadly, does not have /proc in any form. A little Googling around the subject did turn up something called fs_usage, however.

This is in the same vein as strace on Linux, except it’s a little more specific. I won’t go into full details, but suffice to say it logs all filesystem (and other) activities on the machine. Or you can provide a PID or process name and it’ll focus in on that.

So I ended up running something like this:

sudo fs_usage -f filesys pid 1234

This shows all filesystem events for PID 1234. The output you get looks a little like this:

18:02:54.019999  getattrlist                            /tmp/filename
18:02:54.020828  getattrlist                            /tmp/filename
18:02:54.020907  getxattr               [ 93]           /tmp/filename
18:02:54.022703  open              F=24       (R_____)  /tmp/filename
18:02:54.022706  fcntl             F=24  <GETPATH>
18:02:54.022711  close             F=24
18:02:54.025311  open              F=24       (R_____)  /tmp/filename
18:02:54.424061  write             F=24   B=0x190d

I’ve simplified and truncated the output a little, but you get the idea.

This was great and a concrete step forward. You’ll notice, however, that the trace for read() and write() calls doesn’t print the filename that’s being manipulated. That’s probably because those calls operate only on a filehandle, and this tool doesn’t want to delve into process state, it just wants to write the parameters to the call out and get on with it.

That’s fine if, as in the trace above, you’ve captured the open() call; you can use that F=24 to link up the traces and figure out which file is being updated.

If, however, you come in halfway through then that’s not a lot of help; you’d need to persuade the process to close and re-open the file on demand, and that’s pretty hairy stuff.

What we need, then, is a way to look up this file descriptor 24 into a file path.

What we need, then, is lsof.

This is a sufficiently standard Unix tool that it has its own Wikipedia page1, so I won’t go into an in-depth discussion. Suffice to say that its core competency is listing the open files of processes, and that’s exactly what we need here.

You can invoke it as lsof -p 1234 and it will show something like this:

COMMAND  PID      USER   FD   TYPE DEVICE SIZE/OFF       NODE NAME
Python  5715 myuser     cwd    DIR    1,4      256 4297892359 /Users/myuser
Python  5715 myuser     txt    REG    1,4    51744 4304043073 /System/Library/…/Python.app/Contents/MacOS/Python
Python  5715 myuser     txt    REG    1,4    52768 4304046592 /System/Library/…/lib-dynload/_locale.so
Python  5715 myuser     txt    REG    1,4    63968 4304046668 /System/Library/…/lib-dynload/readline.so
Python  5715 myuser     txt    REG    1,4   973824 4304094362 /usr/lib/dyld
Python  5715 myuser       0u   CHR   16,0 0t101338        903 /dev/ttys000
Python  5715 myuser       1u   CHR   16,0 0t101338        903 /dev/ttys000
Python  5715 myuser       2u   CHR   16,0 0t101338        903 /dev/ttys000
Python  5715 myuser       3r   REG    1,4     6804 4297372781 /private/etc/passwd

In this example you can see that my Python process has a cwd entry which corresponds to its current working directory (which it has open) as well as txt entries which correspond to the binary itself and various shared libraries. This will be populating the text segment.

Then we have four entries where the FD column reads 0u, 1u, 2u and 3r. The numbers represent the file descriptors of the open files within the process and as those familiar with Unix will recognise the first three correspond to standard input, output and error respectively. Perhaps slightly oddly the process has all three open for both read and write, as indicated by the u suffix; since these are all open on the terminal device I can only assume that this is just some quirk of the default way the OS creates the process.

The final file descriptor 3r shows that the process has file /private/etc/passwd open for reading (indicated by the r suffix), which is exactly right. This was an interactive Python process and I’d just run fd = open('/etc/passwd'). You’ll notice lsof is giving us the real absolute path name; I’d opened /etc/passwd but since on MacOS /etc is a symlink to /private/etc then the path that’s reported above is in that destination folder.

So now we have all the pieces we need: we run fs_usage to find out the file descriptors that a process is accessing and then we can map these to filenames using lsof.
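
(If I’ve remembered the flags correctly, something like lsof -a -p 1234 -d 24 should narrow the output down to just the descriptor you care about, which saves scanning the full listing.)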

Frankly there are probably slightly easier ways to solve the problem, but these are probably handy utilities for future reference so I don’t regret the path2 I took.


  1. Mind you, that doesn’t necessarily mean it’s not obscure. There are also surprisingly extensive wikipedia pages for correct toilet roll orientation and animals with fraudulent diplomas

  2. Pun very much intended, I’m afraid. Sorry about that. 

11 Jun 2019 at 8:19AM by Andy Pearce in Software  | Photo by Agence Olloweb on Unsplash  | Tags: macos debugging
