On Safety, and How Rust Can Help

The Rust community issued a call for blog posts about Rust’s potential in 2018 and this is my response. It’s not very much about Rust.

One of my unabashed background goals at work is to introduce my company to Rust. I mentioned it in my interview, roughly ten minutes into the first impression any of the folks who eventually hired me had of me, and I have continued to talk about its potential in our use case since then. I am still very new at this job (at time of writing, thirteen months and two projects), and so I have not had the time or opportunities to build significant Rust projects for production use.

For full context, dear readers, I should mention that I am, in various aspects of my life, a carpenter and cabinetmaker, an American Red Cross lifeguard, a Boy Scouts of America lifeguard instructor and aquatics director, a Professional Association of Diving Instructors rescue diver, and most recently a satellite software engineer at Space Dynamics Laboratory.

So I think about safety a lot, and I will be drawing on these experiences for parts of this article.

Because I’m incapable of writing in the short form (even when I was on Twitter, a platform designed for brevity, I routinely ran insufferably long), this piece is going to be a journey. I encourage you to jump to the Rust bits if you want to cut down your reading time.

I’ll first talk about my opinions about safety as an abstract concept and my experiences with practicing safety in various environments; I’ll then talk about my experiences programming at work in C and Ruby; then I’ll draw upon these to talk about what Rust can currently offer my work and what I believe it still needs in order for it to strengthen its utility and efficacy in our contexts.

What Does Safety Mean

Contractual Obligations

The process of establishing a “safe” environment means establishing a contract between the environment and the participant about what kinds of activities and behavior are permitted, what work the environment must undertake for the participant, and what constraints the participant must obey for the environment.

For example, when I open a swimming area at Scout camp, I have a contract which I must fulfill for my campers. This contract includes:

  • cleaning the swimming area of debris, holes, plants, animals
  • training my staff
  • monitoring water conditions
  • maintaining support equipment
  • staffing the area
  • informing the campers of restrictions and freedoms
  • not abusing my power

And furthermore, my campers have a contract which they must fulfill for me. This contract includes:

  • not being alone
  • obeying my instructions
  • not harming each other

Note that my obligations as the environment provider outweigh the obligations of the campers as the participants. This is true of all systems which aim to provide a safe environment as a service: the provider specializes in knowing and implementing the obligations that make something safe, and the consumer should not have to specialize in knowing and doing these things.

Well-Defined Rules

In my capacity as a lifeguard, aquatics instructor, and rescue diver, I have ready access to a plethora of instructional, standards, and reference material with which to inform and guide my actions as a safety provider. As a satellite software engineer, I have similar materials:

  • the language reference, such as the C standard
  • compiler manuals
  • hardware reference manuals
  • MISRA and NASA programming rules
  • system documentation, including OS manuals and manpages

These all serve to establish rules about what the pre-existing environment provides to me, and what is left for me to provide to my clients. For example, the VxWorks operating system manual informs me that it will handle blocking I/O operations for me, and that my code need not worry about these details.

For a negative example, the C standard does not inform me that it will ensure that a dereference operation in source code is always applied to a valid item.

And for a mixed example, the C standard informs me that dereferencing NULL is Undefined Behavior and must not occur, the compiler manual informs me that NULL is defined as address 0x0 on this architecture, and my architecture’s memory map informs me that address 0x0 is a perfectly valid cell in which to store information of any kind, and furthermore, the MMU knows this and will not raise an exception when dereferencing it. So … I have an interesting time at work.

Where the rules are well-established and known, the environment can consider itself safe in that aspect. Where the rules are ambiguous, or left to the client to handle, the environment is unsafe, and any safety must be actively performed by the client.

Transparency

Secrets are directly antithetical to safety.

This is an admittedly interesting position for me to take, as I work in a position and environment that require secrecy, and I am even currently in the process of receiving a federal security clearance. Allow me to clarify my position.

For anyone present in an environment, that environment cannot both withhold information from them and also guarantee safety. The rules about permission to enter an environment are an orthogonal concern which I will address farther down the page.

When I operate a swimming area at scout camp, I proactively inform all swimmers of the rules and responsibilities that govern the behavior of my staff, myself, and them. I cannot simply say, “hop in!” and then blow my whistle later and say they’ve broken rules. This behavior would be unsafe on my part, as it means that the swimmers will not have any concept of safe boundaries and now safety is dependent on my vigilance to actively spot unsafe behavior and halt it, after it has already occurred!

Similarly, it is unsafe for a computing environment to withhold rules and information about permissible and impermissible behavior from application code, especially if the environment does not actively enforce its rules before violations occur. Let me illustrate this with a bug I am currently facing at work:

The operating system allocates various regions of memory in the map for various purposes, and sets up the MMU to guard them. (There is no virtual memory here; the MMU exists simply to deal with the fact that the address space is discontiguous and heterogeneous.) It then spawns user code which does not know of these regions, and permits that user code to access raw memory.

As I mentioned above, *0 is valid on this system, because the MMU knows that 0x0 points to a real word in a real memory core, and the operating system has for some reason not instructed the MMU to trap it.

*0x1000 is NOT valid: the operating system reserves the region 0x1000 up to 0x80_000 for its own use. This is defined somewhere in the OS compilation settings, I know not where – which is frustrating, because I need to change it. That memory needs to be used for other things, and be accessible from application code.

Having a non-trapping dereference of zero and a trapping dereference of non-zero are fine; I can know about this and act around them. Here’s the bug:

memmove((void *)0x1000, buffer, length);

succeeds, in the worst possible sense.

The MMU traps, the offending process is killed by the operating system, and life goes on.

But the contents of buffer are still written into this supposedly reserved memory region.

When the operating system is not proactively transparent and informative about what resources are allowed and denied to application code, the environment is unsafe. The environment saw my unsafe behavior and blew its whistle to stop me, but only after the unsafety had occurred.

Controlled Access

A feature common to all areas which purport to offer safety is that they have restrictions on entering and exiting the area. Since safety is actively enforced by agents of the environment, those agents must have as full awareness of the environment as they can, and free entry or exit diverges their awareness from reality.

In a waterfront, this means checking in and out at the buddy board, where a supervisor can ensure that swimmers maintain the precondition of “nobody swims alone” and the postcondition of “nobody emerges alone.” It means ensuring that everyone who comes in has passed a test of ability and everyone knows where they’re allowed to be and where they are not allowed.

At my work, we all sign into and out of buildings so that the security office can maintain a working knowledge of information containment, and so the building manager can enforce that everyone who is physically present in a building has the authorization to be there. Some rooms are sensitive, like the lab in which I do hardware testing, and the safety of that particular room is dependent on access controls established at the boundary, such as “do they know how to open the door” and “do they have the appropriate personal protective equipment to approach the bench”.

At API boundaries such as peripheral drivers, syscalls, the FFI sites of a language, or the exposed symbols of a crate, it is the responsibility of the environment to ensure that data entering the boundary from outside is in a state that is sensible and reasonable to continue, and that data leaving the boundary from inside is safe to depart.

In Rust’s case, this means marking FFI-exported symbols as unsafe, because they are exposed to untrusted and untrustworthy input, and it means ensuring that all data passed into or out of an FFI symbol is manually and thoroughly checked for sanity. At present, the compiler does very little to help in this regard and I believe more can be done (for instance, making Option<*T> representationally equivalent to *T and requiring that FFI functions which receive pointers receive the Option and check it for Some(ptr) before proceeding), but having this fixed boundary is a good start.

It is still possible for safe Rust code to be induced into failure when foreign data is introduced, but this can almost always be constrained to cases of deliberate malice on the client’s part, which is nearly impossible to defend against while remaining functional. Rust could maximally ensure the safety of, for example, slice iteration by refusing to even look at slices handed to it from outside, but this would make Rust unusable in many situations, and in those situations its absence would be a net negative to system safety.

Foreign data can lie to Rust: consider the common C API pattern of decoupled pointers and lengths; passing in a valid pointer with a length that does not match the real allocation boundaries will induce unsafe state. This risk can be mitigated in various ways, such as by forbidding entry of foreign data which refers to allocations and only allowing entry of immediate values, but ultimately this is a safety/utility tradeoff that must be made by the environment owners and then guarded to the best of their ability in practice.
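
To make the boundary concrete, here is a minimal sketch of the kind of checking I mean at an FFI entry point. The function, its name, and the checksum it computes are all hypothetical; the point is that the environment verifies what it can at the door and makes the remaining trust explicit.

use std::slice;

// Hypothetical entry point exported to C. The caller promises that ptr
// points to len readable bytes; Rust can verify the null and zero cases,
// but it has to take the length on faith.
#[no_mangle]
pub unsafe extern "C" fn checksum_buffer(ptr: *const u8, len: usize) -> u32 {
    // Guard the parts of the contract that are visible at the boundary.
    if ptr.is_null() || len == 0 {
        return 0;
    }
    // The remaining trust is explicit: if len overruns the real allocation,
    // the unsafety originated on the foreign side of the boundary.
    let bytes = slice::from_raw_parts(ptr, len);
    bytes.iter().fold(0u32, |acc, &b| acc.wrapping_add(u32::from(b)))
}

The unsafe on the signature is the whistle blown before entry rather than after: nothing on the Rust side can call this function without explicitly acknowledging the contract.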

We often say at Scout camp that the safest waterfront or range is the one that’s torn down, followed by the one that’s completely empty; below them is one that’s well maintained, staffed by competent workers, and visited by competent campers, and everything gets worse from there. Manish often says that the safest program is the one that doesn’t compile. Unfortunately, these ideal states are also utterly useless, and so environment providers have to figure out how to reconcile the often-contradictory concerns of safety and utility and strike the balance that they feel is right.

Convenience

For better or for worse, humans are creatures of convenience. Given an exclusive choice between the right way and the easy way, we tend to choose the easy way. Therefore, safety must be easy and unsafety must be hard. Consumers must have easy access to rules and information, and it must be easy to obey the rules and know the information. It must also be difficult to violate the rules, or to purposefully remain ignorant. The environment is obligated to exist in such a manner, because consumers have no interest in doing work that doesn’t help them get what they want, even if that work is “right.”

Habits

Learning is a repetitive process.

Learning is a repetitive process.

Learning is a repetitive process.

If it is to be learned, it must be repeated, and what is repeated, will be learned.

I cannot pick up a power tool without checking its trigger and safety mechanism, its power supply, and its bit, because these are habits that have been instilled in me since childhood. Furthermore, I have directly seen the consequences of failure.

I cannot look at water without evaluating its visibility, turbulence, temperature, and occupants, because these are habits that have been instilled in me at work and at Camp School (yes, that’s a real thing and I have a real certification from it). Furthermore, I am aware of, and thankfully have never seen in truth, the consequences of failure.

I cannot put on a SCUBA rig without triple-checking the tank pressure, the valve seals, clearing the regulators, ensuring the BCD inflates and deflates, checking the air purity, etc., because these are habits that have been instilled in me in training and in practice. Furthermore, I have directly experienced the consequences of failure.

I cannot write pointer dereferences without an if (ptr == NULL) { abort(); } guard; I cannot write malloc() without immediately typing free() and then writing the use logic in between them; I cannot write int a; without immediately assigning to it, etc. etc., because these are habits that have been instilled in me at school and at work. Furthermore, I have directly experienced the consequences of failure.

I didn’t come to any of these things by instinct! When I was five, I didn’t intrinsically know how to handle a drill or a saw or a planer correctly! I never knew how to actively judge water safety or dive gear safety just by showing up! And I certainly didn’t show up to my programming classes knowing how to dodge all the pitfalls. I have a very clear memory of my intro C++ course’s final project segfaulting on exit because I messed up the order of destructors and returned into a deleted object.

Humans do what they have done before; we must be trained in the correct and safe behavior and made to do it repeatedly, and then we will continue to do it after the training wheels are removed.

I am safe because of my habits. I have dedicated active effort to building them and now I don’t have to actively exert myself in familiar environments because the knowledge of boundaries and rules and obligations is baked into me.

I am unsafe because of my habits. I ski without a helmet. I hike without a buddy. I drive with one hand. I have done these things many times in the past and now they persist, and will require active and conscious effort to change. When dad taught me to ski, we didn’t rent a helmet. When I moved to Utah, I didn’t bring any friends, and now I wander the mountains on my own. I’ve been driving since I was ten years old and have become lax since I’ve never been in an accident. I recognize that these are unsafe behaviors, and I continue to do them, because it’s habit, and because they’re easy, and because changing them would take effort and I don’t currently have incentive to make that effort worthwhile.

How Does Safety Happen

Through experience.

Through pain.

We find something we want to do, some area we want to explore, and we go for it. We bring preconceived notions about what to do or what not to do, and then the unknown dangers hurt us. We write them down, and we turn them into knowledge, and we turn that knowledge into rules and procedures and best practices. We teach it, and eventually, the rate of harm goes down. Ideally to zero, but never in reality, at least not yet.

We learn not to touch hot stoves, by touching one hot stove.

We learn not to swim in muddy water, because someone did and got stuck and couldn’t be seen and drowned.

We learn not to blindly dereference pointers, because sometimes when you do that your program crashes.

Then we write up rules: metal that glows is not to be touched, water in which you can’t see a white Frisbee at eight feet is not to be swum, pointers returned from functions are not to be used unchecked.

Then we write up procedures: how to touch metal, how to verify aquatic safety, how to program in a way that doesn’t get segfaulted or use-after-freed or double-freed or data-raced.

Then we teach them.

We let kids touch a stove that is painfully hot but not damagingly so (or at least, my parents did). We make new SCUBA divers experience what running out of air feels like, by shutting off their tanks underwater and seeing how they handle the truly horrifying sensation of moving their ribs and having nothing happen. We have new programmers use a variable and then watch their program die in mysterious ways. “This is bad,” we say, “and now that you know what you must avoid, here is how to avoid it, and because you have experienced it, you will.”

In an ideal world, we will learn how to be safe in dangerous environments without having to directly, personally, experience the pain of letting the danger hit us. And we’re getting there: I learned to drive safely without ever having been in a collision, for example. But for now, really internalizing the reasons for safety protocols requires, or at least is greatly aided by, the experience of not having them.

Working in C and Ruby

I have worked on two different projects so far, BioSentinel and DHFR. For BioSentinel, I wrote a kernel driver in C, which included basic MPSC queue, ring buffer, and linked list implementations because libc doesn’t have those. For DHFR, I wrote Ruby scripts to automate the operations room. (I also wrote a Rust tool, but we never got to use it due to Unforeseen Circumstances.)

My experiences on both projects were educational, and I ran into all the common problems you would expect and some that you might not. I have had logic errors during byte reinterpretation and bit fiddling, I have used one equals sign where I meant two and two where I meant one and if accepts them both, I have had race conditions that kill data and/or leak memory, I have had alignment and endianness faults and switch fallthroughs and macro errors and union failures and kernel API misuses.

Company policy is to compile with -Wall -Werror at all times, as is right and just, but when your task is to purposefully do really weird things in the name of performance, since you’re targeting a 35 MHz processor and you have to keep a shallow call stack because it’s SPARC and register flushes are brutal, well, there’s only so much you can do before your targeting computer can’t help you anymore and it’s time to use the Force, Luke.

But enough about languages that are terrible; let’s talk about Ruby. (My apologies to Gary Bernhardt.)

The Ruby framework we use has a very large stringly-typed API surface and no concept of ahead-of-time checking, so typos don’t get reported until they’re executed, which is guaranteed to be at least once while your client from JPL is in the room on launch night wondering why a nine-month-old employee is responsible for making contact with his satellite and why the procedure that’s supposed to do so is throwing an error on the absurdly large wall display.

Ahem.

Unexpected nil and the propagation thereof; strings used to access custom types, which at best raises an error and at worst doesn’t; slow performance; a type system that will smile and nod when you tell it what’s coming down the network and then do what it sees fit anyway… I could go on.

Ordinarily, Ruby doesn’t permit hairy unsafe behavior like C does, which is really nice, if expensive. I say ordinarily because this Ruby is responsible for turning user input into binary message packets that have a very strict set of correctness requirements, and sending an incorrect packet to a satellite on orbit is Considered Harmful. It does its best to limit incorrectness, but the checking is all opt-in, not opt-out, and from experience I feel very strongly that in is the wrong way to opt when it comes to safety. Out is also the wrong way to opt, because safety shouldn’t be optional, but like I said: sometimes you need the computer to let you do your thing, and you should at least have to explicitly tell it so.

Let’s Talk About Rust (Finally)

When I talk about Rust at work, I lead with safety. It’s very important to me personally, as I’m sure you have noticed by now; it’s also very important to my company and to our clients. We take the matter very seriously.

The fact that Rust can statically guarantee absence of data races, resource release failure (well, not leaks, and for us leaks are unsafe no matter what Gankra says, but other than that), use of null, and silent errors is really compelling.

The fact that Rust has known soundness holes and non-thoroughly proven output safety is less so. Now, that may be a bit hypocritical of a position for us to take since we fly C and C++ code, but there is a lot of institutional knowledge, safeguards, practices, and experience surrounding the C spec, the MISRA and NASA guidelines, and GCC, that LLVM and Rust don’t have yet, and they have to compete with those facts.

Remember, it’s not enough to be just as good as the competition; Rust has to be actively better than fifty years of accumulated experience and familiarity and awareness and institutionalized habits and tools. We already have an imperfect language; we already have a build system that can target a proprietary operating system and esoteric processors; we already have sanitizers and lints and checks.

Rust isn’t going to enter my company’s toolbox anytime soon if it doesn’t have a good story for these points, or if it isn’t so resoundingly advantageous that we’re willing to accept the effort of bringing it into the fold.

Now, because I’d feel very rude if I didn’t offer some positive points to show that I’m being more than a demanding and unsatisfied customer…

Fluff

Everybody to whom I’ve shown Cargo has loved it. Now that it’s grown the ability to use private repositories, there’s only one remaining pain point for our use case.

Everybody to whom I’ve shown Rustdoc has loved it unconditionally. We use Doxygen and it’s bearable at best.

Tests as first-class citizens of the language are wildly popular. We have a culture of unit and integration testing that takes a lot of work to maintain, because C and C++ have difficult stories there and client code often doesn’t live up to our standards. I’ve shown a couple of people my workflow of cargo watch -s "cargo check && cargo test && cargo doc" and it’s blown their minds.
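
For anyone who hasn’t seen what first-class means here, a minimal sketch (the function and test are made up for illustration): the test lives in the same file as the code, behind an attribute the toolchain already understands, and cargo test finds, compiles, and runs it with no external harness or registration step.

pub fn word_to_be_bytes(word: u32) -> [u8; 4] {
    [
        (word >> 24) as u8,
        (word >> 16) as u8,
        (word >> 8) as u8,
        word as u8,
    ]
}

#[cfg(test)]
mod tests {
    use super::*;

    // cargo test discovers and runs this automatically; non-test builds never compile it.
    #[test]
    fn splits_in_network_order() {
        assert_eq!(word_to_be_bytes(0xDEAD_BEEF), [0xDE, 0xAD, 0xBE, 0xEF]);
    }
}

Code examples in doc comments get the same treatment from cargo test, which is a large part of why that check-test-doc loop feels so seamless.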

Rust’s ability to do binary FFI is a mandatory criterion for anyone to even look at it; the fact that it integrates flawlessly into Visual Studio’s debugger is really nice.

So What Do We Need From Rust

I can summarize this in two words: integrate mrustc.

All the excitement and energy in the community for targeting WASM? The existing emscripten and asmjs targets? Yeah. Those are good.

Do that for C.

A nice starting point would be to teach rustc to emit a fully-contained LLIR bundle that includes everything necessary to create an artifact with LLVM. We should be able to specify LLIR as a target and staticlib as an artifact type, and get an LLIR bundle that LLVM can consume on its own to emit a finished archive.

(If rustc already knows how to do that, then teach me, because I sure don’t, and haven’t been able to find literature on the topic.)

I say starting point, because LLVM is nowhere in our toolbox at work. GCC is. ICC is. These compilers target, natively or with vendor patches, the operating systems and processors that we use. LLVM doesn’t do so very well at this time.

If rustc could transpile to C, then all of our other problems would go away.

Final compilation to the weird targets we use? It’s C, and we have a C compiler.

Integration in our CMake suites? It’s C.

Source distribution to clients who are accustomed to receiving C? It’s C.

It doesn’t matter that C is a hilariously unsafe language; C generated from Rust source is only ever going to be C that avoids the traps.

…Okay, Seriously Now

I recognize that Rust-to-C transpilation is probably not going to be a topic of significant traction, so here are some slightly more real points.

  • Cargo needs to play nicely in other build systems.

    I’ve written CMake scripts that are capable of using it, but it’s a black box, and CMake can’t use Cargo intelligently. Or I am not good enough at CMake to make it do so, but I shouldn’t have to be a CMake wizard just for Rust source to be on par with C source, should I?

  • Rust needs a solid embedded story.

    Integration of Xargo is, to my knowledge, probably going to be 90% of what we’d want there.

    The last 10% is going to be defining new targets and compiling libcore and libstd for them.

  • Artifact sizes.

    Satellites aren’t large.

  • Compilation times.

    I trust I don’t need to expound on this. I’m aware of the progress being made on incremental compilation and I’m excited to see how that plays out.

  • Const generics, const fn.

    We have a lot of C++ templates that are parameterized on integers, and a lot of compile-time computation. That can be faked in a build script, but, c’mon.

  • Close soundness holes.

    The most safety-critical niches of industry are, paradoxically, going to have the most uses of unsafe and esoteric code. I’m sure the Rust team, especially everyone who’s contributed to libcore, is aware of how this works.

    If there’s a way to make Rust fail its guarantees, we will run into it.

    We have tools and literature for a formally-verifiable subset of C, and we strive to stay within that domain wherever possible. Rust is absolutely going to need some story there; even if the compiler is never formally verified (GCC sure isn’t!), there is going to have to be at least some literature on formally, provably, sound Rust code. Rust already has demonstrably safe checking and guarantees in the language; it’s doing phenomenally well there! I need that put into language I can give to my managers, technical and not, about why Rust is the good business decision that it is.

  • Allocators.

    We will never, ever, ship a satellite that has two allocators on it. We need an easy, correct way to strip jemalloc and just use the system allocator; a sketch of the direction I mean follows this list. Stabilizing, or at least exposing more of, an allocator API would also be very nice. We do a lot of work with unsized types that is currently easy, though not pleasant, to do with C’s allocator API, and that I at least have not figured out how to replicate in Rust without essentially becoming libcore. Maybe this just needs more literature?
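
As promised just above: a minimal sketch of the direction I mean, assuming a toolchain where the global_allocator attribute and std::alloc::System are available. One allocator on the spacecraft, and it belongs to the operating system.

use std::alloc::System;

// Route every heap allocation in this program through the platform's own
// allocator, so the shipped binary carries exactly one allocator.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // This Vec, and every other heap allocation, now goes through the
    // system malloc/free rather than a bundled allocator.
    let telemetry = vec![0u8; 256];
    assert_eq!(telemetry.len(), 256);
}

The attribute is the entire configuration, which is about the level of ceremony we can justify.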

That’s all that I’ve been able to conceive so far, so let me wrap up. According to my Communications Merit Badge (I’m very Boy Scout), it is now time for…

More Fluff

Result and Option are very well received, especially now that enums are more aggressively optimized.

Seriously, the C FFI story is really good.

I’ve played with Serde a little, and since ser/des is a staple of our work, being able to delegate that to the machine is a dream come true, ESPECIALLY if we’re able to use the same process on the ground as in flight.
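
A minimal sketch of the kind of thing I mean, with a made-up housekeeping packet and the bincode crate standing in for a real wire format (this assumes serde with its derive feature, and bincode, as dependencies):

use serde::{Deserialize, Serialize};

// Hypothetical housekeeping packet; the fields are illustrative, not a real layout.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Housekeeping {
    met_seconds: u32,
    battery_millivolts: u16,
    mode: u8,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let packet = Housekeeping {
        met_seconds: 86_400,
        battery_millivolts: 7_400,
        mode: 2,
    };

    // The same derived code runs on the ground and in flight software, so the
    // two ends of the link cannot drift apart by hand-editing one of them.
    let bytes = bincode::serialize(&packet)?;
    let decoded: Housekeeping = bincode::deserialize(&bytes)?;
    assert_eq!(packet, decoded);
    Ok(())
}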

An asset on orbit that can only be contacted every 90 minutes is the epitome of an asynchronous, I/O-bound system. I’m very excited to see Rust making this area of programming accessible to our niche.

Conclusion

Safety is complex and critically important. Rust does a stellar job of providing safe environments and making explicit where it fails, or chooses not, to do so. It has excellent potential and actual value for software in all domains, but especially in mine, and I very much want it to succeed there. To do so has challenges and requirements, which I hope I have correctly identified and reasonably communicated.

I don’t expect any, much less all, of this to happen in 2018, but you know what they say about journeys of a thousand miles (or in my case, 300 kilometers):

They begin by looking at a map, making a plan, and facing in a useful direction.

Then by taking a step.