Rust Flow

Published on

Estimated reading time: .

Rust allows for a very functional style of value “flow” without sacrificing the performance of a more traditionally imperative sequence. Furthermore, the functional flow may offer more clarity about value lifetimes and error handling that the imperative sequence might obscure.

Contents

Introduction

Programming is, essentially, the construction of sequences of actions that inspect and modify data. A program can often be viewed as a set of routes that data follows, changing and being changed by the program as it moves from the start to the finish.

Various paradigms of programming have different ways of expressing the sequence of inspection and transformation. The language in which we work gives us the tools to express our thoughts, and in turn shapes the way we think. All languages have to be able to express the same basic logical processes, but the way in which they do so favors or penalizes different ways of thinking about and modeling the data and control structure of the program.

Function Sequences

The classical syntax for applying functions to data names the function first, then the data that enters it. This means that function sequences wind up written out of order, like this C example:

third(second(first(f_one, f_two), m_two), l_two);

When you read this, you process third, then second, then first, but the computer processes them in the order that they are named. Like Richard Hendricks in the HBO show Silicon Valley, function application evaluates middle-out.

This is hard to read. You have to read the middle terms, then expand to both the left and the right, in order to trace the whole expression. And if I had written that to use additional function evaluations for m_two and l_two, it would be even more confusing. Just for fun, here’s the full order of evaluation in a C expression1:

fifth(third(first(), second()), fourth());

Method Syntax

Many languages offer something called method syntax, where functions can be defined so that they are invoked with a special first argument preceding the function name, and any additional arguments after the method name, like this example in Elixir:

first |> func(second, third) |> func_two(fourth, fifth)

I chose Elixir for this sample because Elixir has a very loose concept of method scope, and any value can be piped (the |> is called pipe) into any function whose first argument is the correct type.

The above code is equivalent to this C-style form:

func_two(func(first, second, third), fourth, fifth)

Languages with a strong concept of datatypes and methods (all object-oriented languages, Rust, and thus ends my familiarity) often define methods as special functions attached to a datatype, and call them with the same operator used for data field access. For example, in C++:

class Example {
  public:
    void func(void) { /* … */ }
};
Example ex;
ex.func();

This defines a datatype called Example, which has a method called func attached to it that returns nothing and ostensibly takes nothing. In reality, C++ and the languages that follow its lead (Java, JavaScript, C♯, even Ruby) treat the “receiver” as magic, and don’t require that it be explicitly listed in the function signature. These also typically don’t even allow using C-style function syntax to call the method as func(receiver, arguments…)!2 In C++, the receiver is implicitly available inside the body of all methods, and can be explicitly accessed via the keyword this, which magically populates as a pointer to the receiver.

When methods return values, instead of void, those values can themselves receive additional methods. This leads to something called a “method chain”, where an initial value receives a method and returns a value, which receives a method and returns a value, and so on until the chain stops. A classic example in JavaScript is selecting a DOM node and then manipulating it:

$("#america")
  .css({ "border": "none" })
  .click();

Line 1 finds a DOM node; line 2 receives it, manipulates it, and returns it; and line 3 receives it and manipulates it (and returns it, but the snippet doesn’t collect the return value). The DOM node is the receiver of both .css and .click().

Rust

Rust sits midway between the object-oriented C++ family, and the data-oriented Elixir. It allows the programmer to define functions on a type and be called like methods, but it does not allow arbitrarily passing any value into any function whose first parameter matches the type (D, another object-oriented language, does allow this, and I’m very jealous).

In Rust, a method is defined using impl blocks:

struct Example;

impl Example {
  fn new() -> Self;
  fn chain(self) -> Self;
  fn borrow(&self);
  fn take(self);
}

The first function (“inherent method”) I defined here would be considered a “static method” in object-oriented languages, and require a special keyword or decorator in order to not receive an instance. Rust, by contrast, explicitly lists out every function parameter and the special self keyword indicates a receiver.

Thus, Example::new() produces an Example, but does not receive one. It has to be called as a function, by name, and cannot be called as value.new().

The chain and take methods both receive self, which means they take an Example instance directly. When an Example instance exists in the code, Rust can call it using C-style function syntax:

let e = Example::new();
let f = Example::chain(e);

or C++-style method syntax:

f.take();

Rust also distinguishes between functions that take a value, and functions that take a reference3, with the & (or &mut) prefix sigil. The function-style syntax requires explicitly writing out the borrow, while the method-style syntax makes this implicit.

let e = Example::new();
Example::borrow(&e);
e.borrow(); // equivalent to (&e).borrow()

Method Flow

Rust strongly favors (especially in Iterator use) long chains of methods that act on data. These can be thought of as assembly lines – data comes on to the line at the start, proceeds through each method in the chain, being used and changed, and then emerges as a more finished product. An Iterator example might be:

let sum: u64 = vec![1, 2, 3, 4]
  .into_iter()
  .filter(|x| x % 2 == 0)
  .map(|x| x * 2)
  .sum();

This creates a construct where each element of the Vec (line 1, right side) proceeds through iteration (line 2 turns the Vec into a source), being thrown away or preserved based on inspection (line 3), transformed (line 4), and collected (line 5, becomes line 1 left side).

Let me write this in some equivalent syntaxes, just to drive home why method chains are good things to have and use.

No chaining:

let v: Vec<u64> = vec![1, 2, 3, 4];
let i = v.into_iter();
let f = i.filter(|x| x % 2 == 0);
let m = f.map(|x| x * 2);
let sum: u64 = m.sum();

No methods:

let v = vec![1, 2, 3, 4];
let i = IntoIter::into_iter(v);
let f = Iterator::filter(i, |x| x % 2 == 0);
let m = Iterator::map(f, |x| x * 2);
let sum = Iterator::sum(m);

No intermediate variables at all:

let sum: u64 =
Iterator::sum(Iterator::map(
  Iterator::filter(
    IntoIter::into_iter(vec![1, 2, 3, 4]),
    |x| x % 2 == 0
  ),
  |x| x * 2
));

So, we can see that it is very useful to have pipelines through which data “flows”. This example only shows a single pipeline inspecting and mutating data; the data cannot mutate the pipeline – there is no easy way to apply different behaviors depending on a condition. Changing behavior based on data is a critical part of programming. The if and match constructs exist specifically to choose a segment of the program based on what the data is.

Mandatory Participation

Method chains require that each method in it return something, typically the receiver or the result of transforming the receiver.

Often, however, types have methods that just change internal state and don’t return anything! These completely break the method chain.

There is a crate called tap that provides a trait, Tap, with one function, tap(mut self, func: |&mut Self| -> _) -> Self. The trait is implemented on all types, so you can take any borrowing method and make it chaining instead, like this:

let v = vec![5, 1, 4, 2, 3]
  .tap(|v| v.sort())
  .into_iter()
  .map(|x| x * 2)
  .collect::<Vec<_>>();

The tap runs Vec::sort (which has a signature &mut Self -> ()), but takes the Vec by value and returns it as a value, for the rest of the chain to consume.

Disclosure: I maintain a fork of tap.

Carrier Types

Rust borrows from the functional family of languages in that it has types that exist soley to wrap other types and provide some additional meaning. The archetypes of this pattern in Haskell are Maybe and Either; their analogues in Rust are Option and Result.

These two types don’t exist or mean anything on their own. In fact, writing either of those names out alone fails to compile in Rust, because there is no such type as Option or Result4. Instead, they are Option<T> and Result<T, E> – programmers provide types T and E to insert into the carriers in order to make fully constructed, concrete, types.

An Option<T> does not have the methods on it that an instance of the T type does. Option<T> is its own type, and has its own methods. It and Result are useful because they can expose their interior instance of T (if it exists) and that instance can then have its own methods invoked on it.

These and other carrier types have methods on them that perform control flow manipulation while inside a method chain, and I will describe this in the next major section.

Lifting to Carrier Types

The data with which a program wants to primarily work is typically a bare value of some kind, and needs to be lifted into a carrier type before it can be used in a method chain driven by the carrier.

A bare type can be wrapped in Option<T> by using the standard library’s From or Into conversion traits: all types can be wrapped in Option to become Option<Type>::Some(instance) with val.into() or Option<_>::from(val).

An Option can be expanded into a Result of the same type and a new error type by just adding the error type, with Option<T>::ok_or(e: E) -> Result<T, E>.

Carrier Flow

The carrier types define methods for transforming the data they carry, and for being changed by that data. Rust defines control flow methods that can accept values, or accept functions that are conditionally entered depending on the interior data. This lets us recreate branching constructs entirely inside methods.

Linear Transformation

Once a bare value has been lifted into a carrier, the carrier can apply transformations to it. The carrier also knows whether or not to apply them – there is no use or sense in running a function that expects a value on the None variant of an Option, or a function that operates on the success type of a Result running on the Err variant or an error-handling function on the Ok variant.

The carriers use the same map function from Iterator. They take a function to run, and internally decide whether the carrier instance is suitable to run it or not.

type R = Result<i32, &'static str>;

let r: R = Ok(5);
r.map(|x| x as u64 * 2); // change type and value
r.map_err(ToString::to_string); // no effect

let e = R = Err("error message");
e.map(|x| x * 2); // no effect
e.map_err(ToString::ToString);

map is capable of changing the success type of Result, or the presence type of Option. The map_err function that acts on, possibly changing, the error type. It does not exist on Option.

Both of these only run if the Result instance is the correct variant. map only runs its function argument if the instance is Result::Ok, and likewise map_err only runs when it receives a Result::Err.

The above code snippet produces, respectively, the following values:

Result::Ok::<u64, String>(10);
Result::Err::<i32, String>(String { "error message "});

The value returned by the function or closure given to map is the new interior value of the carrier. This is something of which to be aware when using map to briefly inspect the value! Return it, or use the tap crate. This is also something of which to be aware when using map to apply a function that returns a carrier rather than a bare value! Improper use of map can result in deeply nested Result<Option<Result<_, _>>, _> towers, which is likely not what you want. I will cover the equivalent to Iterator::flat_map shortly.

Value vs Value-Producing Function

Rust has a dogged determination for enabling laziness in an eager language. All carrier methods that take external values, have variants of those same functions that take a function which, when evaluated, produces a value of the correct type. The difference between these takes-value and takes-function-produces-value method names is typically a suffix on those that take producers.

There is an and which takes a value and an and_then which takes a producing function; same for or and or_else, and unwrap_or and unwrap_or_else. Option has even more: ok_or and ok_or_else, and get_or_insert and get_or_insert_with.

Branching

map is useful for transforming the interior data, but the interior data can’t affect the method chain except by determining whether or not the transforms run.

There are two branching logic operations: and and or. These methods take values or functions that enter the pipeline depending on the existing state of the data, and correspond to if/else branch constructs. The and method executes only if the carrier is in a “truthy” state, and the or method executes only if the carrier state is “falsey”. For Option, the Some variant is truthy and None falsey; for Result, the truthy variant is Ok and Err is falsey.

This example demonstrates using the and_then and or_else combinators to perform control flow branching without writing explicit if/else or match constructs.

let a: Result<&'static str, &'static str> = Ok("/tmp/sample.txt");
let b: Result<File, &'static str> = a.and_then(|a_ok| {
  File::open(a_ok).map_err(|_| a_ok)
}); // Ok(File) if it opened, Err(&str) if it didn't
let c: Result<File, &'static str> = b.or_else(|b_err| {
  File::create(b_err).map_err(|_| "file creation failed")
}); // Ok(File) if it created, Err(&str) if it didn't

The above code takes an initial Result and runs it through the pipeline. Each stage inspects both the variant of the Result and the data contained within it. The first and_then is guaranteed to run, because it runs when the Result is Ok.

The closure in it takes the value inside the Ok and runs a function body. In this case, that means running File::open (which returns a Result) and replacing the io::Error failure case with a &'static str. This is necessary because each combinator can only change one type at a time. For this example, I just returned the original text, but this time inside an Err wrapper, on failure.

The or_else runs only if the and_then failed. If the and_then succeeded, then the Result is carrying a File instance already, and we don’t need to try the failure handlers. The or_else’s closure takes the value inside the Err variant – here, the same text snippet with which the pipeline started – and uses that to attempt to create a File, again replacing Err(io::Error) with Err(&'static str) via map_err on failure.

It’s important to remember the distinction between map and and_then, and between map_err and or_else. map and map_err change the interior of the carrier variant on which they work; and_then and or_else change the whole carrier.

List of Combinators and Transformers

The standard library documentation on Option and Result is an excellent resource, but for the sake of completeness I will list a short tour of the combinator and transformer methods on those carriers here.

  • map

    This runs a transformation function on the inner value of the carrier’s success variant, but leaves the failure variant unaffected.

    fn Option<T>::map<U>(self, op: impl FnOnce(T) -> U) -> Option<U>;
    fn Result<T, E>::map<U>(self, op: impl FnOnce(T) -> U) -> Result<U, E>;
  • map_err

    Implemented only on Result, this runs the transformation function on the failure variant but leaves the success variant unaffected.

    fn Result<T, E>::map_err<F>(self, op: impl FnOnce(E) -> F) -> Result<T, F>;

To save space, I am not going to write out the signatures or descriptions for the lazy versions of eager functions. The only difference is that the lazy versions take a function to produce the value that the eager versions take immediately.

  • map_or/map_or_else

    Implemented only on Option, this serves as a shorthand for .map().or().unwrap(). It runs the transformation function on Some or replaces None with the fallback value, and the final value (produced by op or provided in def) is returned without the Option wrapping it.

    fn Option<T>::map_or<U>(self, def: U, op: impl FnOnce(T) -> U) -> U;
  • and/and_then

    This replaces a success variant with other, but leaves the failure variant untouched. The function executed by and_then receives the inner data value of the success variant as its parameter.

    and unconditionally replaces its success value, so it can change its success type, but the error type produced by it might come from either the first carrier or the second, so both error sources must have the same type.

    fn Option<T>::and<U>(self, other: Option<U>) -> Option<U>;
    fn Result<T, E>::and<U>(self, other: Result<U, E>) -> Result<U, E>;
  • or/or_else

    This replaces a failure variant with other, but leaves the success variant untouched. The function executed by or_else receives the inner data value of the failure variant as its parameter. (None has no inner value, so the function on Option has no input parameter.)

    The success output of or might come from either of the two carriers the method received, so both success values must have the same type. The first error variant is discarded, however, so the second carrier’s error type is unrestricted.

    fn Option<T>::or(self, other: Option<T>) -> Option<T>;
    fn Result<T, E>::or<F>(self, other: Result<T, F>) -> Result<T, F>;
  • get_or_insert/get_or_insert_with

    This is an Option-only function that is equivalent in logical behavior to or/or_else, except rather than returning an Option to continue the pipeline, this returns a mutable borrow of the interior value. This is useful as a termination method rather than as a pipeline method, but it follows the same principles so I wanted to include it here all the same.

    fn Option<T>::get_or_insert(&mut self, other: T) -> &mut T;

Complex decision trees result in complexly nested closures, but they would also result in complexly nested if/else structures if rewritten in the imperative style. The advantage here is that the value produced by the branch is automatically available for use in continuing the method chain, just as branches in imperative style should have a unified state available after they conclude.

I want to stress how cool this is. A multi-stage pipeline can be expressed as a sequence of method calls that directly name what they’re doing: “do this, and then when it succeeds, do this next thing” or “do this, or else if that failed, do this other thing instead”. The or functions should do their best to provide a success value so that the pipeline can continue, but if they fail, the pipeline is still able to continue.

Lowering to Interior Types

When the pipeline is completed, you are left with a carrier type. The carriers are not all that interesting once you’re done processing them; you want the interior data back out.

Result can evaporate entirely with either unwrap() -> T to produce success, or unwrap_err to produce the failure. Note that these will crash your thread if the carrier is the wrong variant! They ask the carrier for something it does not have. unwrap_or and unwrap_or_else and unwrap_or_default permit a guaranteed unwrap by substituting a fallback value, running a fallback function, or using the Default impl, on error.

Result can also step down to Option with .ok() -> Option<T>, producing Some(t) from success and None from failure, and .err() -> Option<E>, producing Some(e) from error and None from success.

Option has unwrap, unwrap_or, and unwrap_or_else. unwrap will panic on None, while the others provide a fallback value if the Option was empty.

Composition

Carrier composition works best not only when being extended “horizontally”, that is, with method chains, and also when extended “vertically”, by calling functions that return carriers. A common pattern in libraries is to have deep function stacks that return the same carrier (Option or Result) and use the ? operator (governed by the still-unstable Try trait) to immediately punt failure or continue working and return a success carrier when their work is done.

The ? operator serves as a very pleasant bridge between imperative and carrier styles, by producing successful interior values in the current scope or returning failed interior values. It is tricky to use deep in nested closures, though, so keep some care with its use and the general fractal complexity of your code.

The following snippets of code are identical ways of finding a file, reading from it, and using the contents:

fn grab(path: &str) -> Option<i32> {
  let mut f = if let Ok(f) = File::open(path) {
    f
  } else {
    File::open("/path/to/fallback").ok()?
  };
  let mut s = String::new();
  match f.read_to_string(&mut s) {
    Ok(_) => {},
    Err(_) => return None,
  }
  match s.trim().parse() {
    Ok(n) => Some(n),
    Err(_) => None
  }
}

Line 2 attempts to open the given path. If that fails, line 5 attempts to open a fallback path. If that fails, the function bails, otherwise it now has a file handle.

Line 8 attempts to read the contents of the file. This might succeed, in which case nothing needs to happen, or it might fail, in which case the function bails.

Line 12 attempts to parse the contents of the file, returning the value parsed or nothing.

There’s nothing wrong with writing in this style! It’s perfectly performant, it’s clear about what is going on, and the type system ensures that we wrote the code such that every let name = binding either has a value, or the function returned failure.

But there is a lot of noise in the if let/else and match structures.

fn grab(path: &str) -> Option<i32> {
  File::open(path)
    .or_else(|_| File::open("/tmp/foo.txt"))
    .ok()
    .and_then(|mut f| {
      let mut s = String::new();
      f.read_to_string(&mut s)
        .ok().map(|_| s)
    })
    .and_then(|s| s.parse().ok())
}

This is shorter, has less extraneous line noise (lines 9 and 14 of the first sample are completely useless to us!), and displays exactly the sequence of events and behavior we want.

Line 2 opens the path. If that didn’t work, line 3 tries another path. Note that .or(File::open("…")) would unconditionally open the other file, then close it if the first file opened. Line 4 throws away any error.

Line 5 turns the file into a string, by passing the file handle into a closure that allocates a string and tries to read the file into it. That closure must return Option<String>, so .ok() downgrades the Result to an Option and then .map replaces the byte count of read_to_string with the string into which it read, then closure returns that.

Line 10 then attempts a parse, and if the parse succeeds, we get a Some, and if not, a None! Overall, this is much tidier than the first example. We also gain the advantage of having the combinator method names tell us what is happening very closely to the way we would write the function body out in plain English:

  1. Open the file, or else if that fails, open a different file.
  2. And then, try to read the file to a string. If that worked, keep the string.
  3. And then try to parse the string and let us know how it went.

Open, or else open, transform by reading, and then parse. Nice and clear.

Code Generation

Here’s an ugly secret: there’s not really any such thing as eliminating work. What the compiler wants done, must be done. Those match and if let/else statements in the last example? They still have to exist. The compiler refuses to let you get away with not checking the variants of carriers just because you used nifty methods. Those methods are in the standard library, not in the compiler. They’re not magic. You can write your own carrier types and your own combinator methods and they’ll work exactly the same.

Those methods all have a match self { … } branch in them. They all check for Ok vs Err or Some vs None and conditionally do things. The reason the two definitions of fn grab that I wrote have the same behavior is because they expand to the exact same code.

You might be thinking that it really hurts to repeatedly check a value for the exact same condition! The map method has to check the variant on the carrier, but the imperative version we wrote knows that if execution reaches that line, then things are fine, and has no check.

So what’s the advantage of writing a different style of code if it makes our programs worse?

It doesn’t.

rustc is well aware of the method pipeline idioms in Iterator and Result. Because it knows that the combinators only do work on one branch, and have immediate returns on the other, it can inline our method calls and then combine branch bodies that have logically identical or related conditions. The redundant “return if failed or do work if succeeded” and “do nothing if succeeded or do handling if failed” checks can be reordered and combined to something much more like this:

fn grab(path: &str) -> Option<i32> {
  let (mut fail: bool, mut file: File) = File::open(path);
  if fail {
    (fail, file) = File::open(fallback);
  }
  if fail { return None };
  let mut s = String::new();
  fail, _ = file.read_to_string(&mut s);
  if fail { return None };
  let (fail, num) = s.parse();
  if fail { return None };
  return Some(num);
}

Because of a Weird Quirk5 of modern CPUs, if flag { return; } statements are effectively zero-cost. By arranging the code like this, the instruction stream in the CPU doesn’t have to jump around; it just goes down the line and in the event that the failure flag becomes true, it can exit the function cheaply.

The one major downside of using carrier types (that, interestingly, throwing languages like Swift can avoid) is the necessity of moving the interior values around to be in the right slot for the carrier’s discriminant.

Rust has field-reordering optimizations in play that allow the compiler to move enum discriminants around as it chooses. As of this writing, I don’t believe Rust is able to do much optimization in the way of separating the discriminant far enough from the value slot that it can completely avoid register or stack rearranging between successive manipulations, but the possibility is certainly there.

One idea I’ve seen kicked around is to officially “bless” the Option and Result enums in the compiler (like how core::nonzero works) and teach the compiler to represent the carrier types, in applicable situations such as method chains and deep call stacks, as not tuples of (discriminant, value) but as a register or stack slot for the value only, and use a CPU status register or a special-treatment value register or stack slot for the discriminants.

Ultimately, Rust’s code generation on Iterator and Result is good enough that in the vast majority of cases, you don’t need to think about microoptimizations like obsessively linear code or branch deduplication. You’ll have much more prominent performance costs doing other work before you need to worry about these things.

Conclusion

Rust strongly favors using method chains to concisely express intent and program flow. It provides types and methods capable of producing control-flow effects by using method calls rather than branch and loop constructs, and these can be used to great effect to make your source code easier to read and modify.

Furthermore, by abstracting the details of the control flow and byte representation below the visible types, this style of writing frees the compiler to make powerful optimization choices without sacrificing flexibility and clarity at the surface level.

The threading pattern (methods that take self and return Self) and the converter pattern mean that you can create pipeline sequences on any type, not just the carrier enums or Iterator type adapters. Use of the tap crate and std::convert’s conversion traits means that these patterns can be applied to a wide variety of types, even those that didn’t explicitly design their API to accomodate them.

And most importantly, the conclusion common to almost all Rust posts:

Code clarity, type surety, and performance, are not mutually exclusive! We can have all three.

Thanks for reading this 5,600-word monstrosity. You’ve reached the end! You’re free! 6

Next, on Insufficiently Magical:


  1. Annoyingly, the order of evaluation of sibling arguments to a function is implementation-defined, and cannot be assumed constant or reliable. It can be observed to differ between Clang and GCC, for instance.

  2. I write enough asides in the main body. The way you indirectly invoke a class method on a class pointer in C++ is unpleasant. Here’s how it looks:

      class Example { public: void foo(); };
      void (Example::* method)(void) = &Example::foo;
      Example ex;
      ex.*method(); // calls ex.foo();
      Example* ex = new Example();
      ex->*method(); // calls (*ex).foo();

    Those are not “instance calls (deref method-pointer)”. .* and ->* are two more operators in the language, and classes can override them.

  3. Work is in progress to allow arbitrary receiver types, not just raw values and references, as long as the receiver’s outer type implements Deref to the main type. This will allow defining methods like:

      struct Example;
      impl Example {
        fn value(self);
        fn borrow(&self);
        fn borrowmut(&mut self);
        fn ownptr(Box<Self>);
        fn rc(Rc<Self>);
        fn arc(Arc<Self>);
        fn borrowarcmut(&mut Arc<Self>);
      }

    Each method can only be called on objects that match the receiver type, and in the function body, the self keyword will have the fully specified type without any Derefs towards the impl type applied by the compiler.

  4. Expressing only the base name of a generic type requires a language feature called “higher-kinded types” and Rust just does not have it. It’s a hard problem to run solely in the compiler, and Rust does not have the compiler available at runtime via an interpreter or a reflection system.

  5. It turns out, programs follow a really useful heuristic of “if you haven’t done it before, you probably won’t do it now; if you have done it before, you will probably do it again”. That is, conditions to jump backwards in code (loops) are considered likely to be taken, but conditions to jump forwards in code (if stacks) are considered unlikely. Thus, where the compiler can reörder the code to have an extremely linear main path and use if to handle the edge cases, the CPU will rip through the main path with little cost, and only if the if condition turns out to be true will it pay the jump penalty.

  6. You’ve reached the end of the bonus content! You’re really free!