Over the last 12 months, we’ve been working on building a SQL compiler, semantic analyzer, and data build system. It’s called SDF, and it’s written entirely in Rust. Our initial decision to adopt Rust was not without reservations -- despite all the hype around Rust's features (the great "Rewrite in Rust" movement had long become a meme by the time we hopped on the bandwagon), the language is also well known for having a "steep learning curve". While we were a team of seasoned engineers, none of us had any prior production experience with Rust. For an early stage startup, iterating fast is literally a matter of life and death, and if the bottleneck to productivity turned out to be our chosen programming language, we would be regretting our decision for a very long time indeed.
Looking back from this one-year mark, we're happy to report that not only did we not die on the proverbial hill of learning Rust, but the (much hyped) benefits of the language started materializing into tangible returns far sooner than we anticipated. While there is undeniably an upfront investment to learn Rust, by this point it has easily paid for itself many, many times over. Rust makes certain tasks easy that would have been exceedingly challenging in other languages. More importantly, our experience with Rust has turned out to be every bit as fun as we thought it would be, and perhaps even increasingly so as we progress further on this journey :)
In this two-part blog series, we'd like to add our own perspective to what is now surely an overdone topic, in the hopes that this may prove helpful to other rustaceans.
We’re going to focus on three main areas:
Language -- how we managed to quickly build MVPs and hit milestones while learning the intricacies of Rust itself
Tooling -- how we leverage and build on Rust's excellent tooling ecosystem to further enhance team velocity
And finally, miscellaneous tricks we learned along the way that helped keep our project scalable.
We will cover the first topic in this article, and the remainder in part two. Let’s go. 🦀
Language
Fighting the Borrow Checker
The first major obstacle one must overcome on the road to productivity, and indeed the primary contributor to Rust's famed "learning curve", is the borrow checker itself. The process has been aptly termed "fighting the borrow checker", where the "fighting" involves copious amounts of angry cussing along with some amount of hair-pulling. Eventually there comes a point where the concept of compiler-enforced ownership sinks in and becomes part of your mental model; the whole thing "clicks", and you realize that "fighting the borrow checker" was nothing more than Rust surfacing the inherent complexities of software engineering upfront, at compile time. From then on, the borrow checker becomes a formidable ally instead of a frustrating adversary seemingly trying to thwart you at every turn.
But such revelations only come much later, with time and experience, and in our early days of developing in Rust, under real pressure to build a working product, the borrow checker was very much an adversary we had to overcome. What worked for us in that situation was doing the simplest and most straightforward thing -- `Clone`. Yes, whenever the borrow checker screeched "lifetime does not live long enough!!", we would just `clone()` and `clone()` until the compiler stopped complaining, and then `clone()`d some more. As a result, our early codebase was riddled with `clone()`s. But hey, it worked! By adopting this simple yet effective strategy, we were able to keep the project moving without agonizing over Rust's ownership semantics too much.
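To make the strategy concrete, here is a minimal sketch (the `Table` type and function are illustrative, not SDF's actual code) of the kind of place where we would simply clone owned data rather than fight over lifetimes:

```rust
// Hypothetical example: a resolver needs column names from a table.
#[derive(Clone)]
struct Table {
    name: String,
    columns: Vec<String>,
}

// Returning owned Strings (built via clones of the inputs) sidesteps any
// argument with the borrow checker about how long `table` must live.
fn qualified_columns(table: &Table) -> Vec<String> {
    table
        .columns
        .iter()
        .map(|c| format!("{}.{}", table.name, c))
        .collect()
}

fn main() {
    let t = Table {
        name: "orders".to_string(),
        columns: vec!["id".to_string(), "total".to_string()],
    };
    // Cloning the whole table gives this scope its own copy to do with
    // as it pleases -- no lifetime gymnastics required.
    let snapshot = t.clone();
    assert_eq!(qualified_columns(&snapshot), vec!["orders.id", "orders.total"]);
}
```

The clones are a deliberate tradeoff: a little extra copying in exchange for never blocking on an ownership puzzle.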
You may now be wondering whether the performance of our program suffered because of this. (It would be tempting here to default to the clichéd, and frequently misapplied, adage that "premature optimization is the root of all evil", but we'll spare you that common retort and tell the rest of our story :) ) As we gained proficiency with the borrow checker, we did incrementally refactor away the many unnecessary clones -- perhaps hundreds of them -- by adopting better ownership models in our data structures and function signatures.
And guess how many of those instances actually resulted in tangible real-world performance gains? 0. Yes, zilch, nada -- all those `clone`s added up to no more than a minuscule overhead in real-world workloads. This was true even for the many instances where we were so sure that cloning was bad as to leave a `// TODO: refactor this clone` note in the code! Imagine our disappointment when we meticulously cleaned up such TODOs, only to measure a net zero in performance gains…
So there you have it -- if you ever find yourself "fighting" the Rust borrow checker, keep calm, clone, and move on. And if you feel guilty about cloning too much -- don't, because in all likelihood, the clones are not the root cause of your performance problems.
Chasing Performance
This brings us to everyone's favorite topic -- performance. After all, if you were drawn here by the keyword "Rust", then it's probably a safe bet that "building fast programs" is one of the top things on your mind :)
We'll start off by saying that making your program run fast is hard. Unlike "fighting the borrow checker", there's really no fast and easy rule to follow when it comes to building performant applications. And no, if your expectation is that "if I build it in Rust then it will be fast", you may be setting yourself up for a quick disappointment -- as far as our story goes, we've hit performance bottlenecks in multiple production scenarios. One major complicating factor is that "performance" in practice is often not the simple, fixed, objective criterion many would like to believe, but a highly scenario-dependent one -- how your program performs can vary drastically from one type of workload to the next. And ultimately, which type of workload "matters more" very much comes down to subjective opinion.
With that out of the way, we'll talk about types of optimizations that did yield significant results for us:
Squashing asymptotic complexity
Maximizing parallelism
Shunning synchronization locks (to maximize the parallelism)
Discovering the right memory allocator (unexpected, but massively impactful)
Each of these optimizations alone at some point improved our real world performance by 2x ~ 20x, depending on the type of workload and the number of CPU cores in the system -- the margin of difference between "it’s alright" and "holy sh** that’s fast!!".
On Asymptotic Complexity - and why understanding your workload is crucial ⛑️
In the last section, we described how sanitizing redundant cloning had little to no impact on real-world performance. In asymptotic terms, that simply means optimizing away a small constant factor usually isn't worth the effort.
However, when it comes to super-linear run time, small costs will absolutely dominate on larger inputs. Here you are faced with the cold, hard theory of computational complexity, and no amount of compiler magic is going to save you. In practice, pinpointing such performance bottlenecks is probably 80% of the work. Once the offending logic is found, the solution is usually self-evident -- it's going to be either "very straightforward" or "theoretically impossible". The latter may sound like some ominous doomsday scenario, but it rarely is. Recall that performance is actually a subjective criterion, and all subjective criteria can be "molded". In practice, "theoretically impossible" just means relaxing the right constraint, or sacrificing less important workloads to make the important ones run faster. In all likelihood, "theoretically impossible" might end up being even less actual work than "very straightforward"!
Back to the story of SDF. At one point, we were choking on some benign-looking queries. After a quick investigation, we found that those queries were operating on a large number of tables, each with more than 10k columns. Since our column resolution logic was using a linear scan to find matching columns, the overall compilation was quadratic in the size of the input schemas. By using a lookup table for column resolution, we observed an average 3x speedup on such queries. Note again the very scenario-dependent nature of performance here -- the speedup was observed only on queries over very large schemas; the lookup table had no effect on (the far more common) queries over normal-sized schemas (say, <100 columns). Had the large schemas never shown up in our workloads, there would have been no need to optimize -- the simple linear scan would have worked completely fine.
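The shape of that fix can be sketched as follows (function names and the schema representation are illustrative, not SDF's actual resolver):

```rust
use std::collections::HashMap;

// Linear scan: O(n) per lookup, so resolving all n columns of a query
// against an n-column schema is O(n^2) overall.
fn resolve_linear(schema: &[String], name: &str) -> Option<usize> {
    schema.iter().position(|c| c == name)
}

// Build a HashMap index once: each subsequent lookup is O(1) on average,
// so resolving n columns drops to O(n) overall.
fn build_index(schema: &[String]) -> HashMap<&str, usize> {
    schema
        .iter()
        .enumerate()
        .map(|(i, c)| (c.as_str(), i))
        .collect()
}

fn main() {
    let schema: Vec<String> = (0..10_000).map(|i| format!("col_{i}")).collect();
    let index = build_index(&schema);
    // Both strategies agree on the answer; only the asymptotics differ.
    assert_eq!(resolve_linear(&schema, "col_9999"), Some(9999));
    assert_eq!(index.get("col_9999").copied(), Some(9999));
}
```

On a 100-column schema the index buys you essentially nothing; on a 10k-column schema it is the difference between quadratic and linear compilation.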
Parallelization - and why compilers ❤️ many core architectures
We built SDF in the beginning as a single-threaded application. This kept things simple and allowed us to focus on rapidly building up basic functionality. But we always knew that parallelization had to come at some point, given the "embarrassingly parallelizable" nature of compilation. Perhaps we were even quietly dreading the exercise in the back of our minds -- we were of course aware of Rust's well-hyped reputation for "safe and easy" concurrency, but remained somewhat skeptical given our past experiences building concurrent applications.
That point came roughly 5 months into the project -- when it became clear that we would not be able to scale to a full production warehouse on a single CPU core -- and that's when we truly appreciated the power of Rust for the first time. As things turned out, Rust's effectiveness in this domain is one of those rare cases where the hype matches the reality. Gone were the countless hours sunk into tracking down weird segmentation faults, intermittent bugs from race conditions, and deadlocks. Instead, once we hammered our code enough to make it pass the compiler, things mostly just worked! It should be noted that it did take a lot of hammering to get our parallelized code to pass the compiler -- SDF at that point was already far from a toy prototype -- but all in all it was orders of magnitude less time and stress than debugging concurrency bugs at runtime.
As for how we got it to pass the compiler? Mainly by putting things in `Arc`s/`RwLock`s, and, you guessed it, by copious amounts of `clone`s (see Fighting the Borrow Checker). And that's the beauty of Rust's concurrency model -- there is no separate "thread safety" mechanism in Rust; it's all the same underlying object ownership model. In other words, it turns out memory safety is also thread safety!
Parallelizing gave us an instant 10~20x speed boost on the end-to-end workload (on a consumer-grade 16-core CPU); no other optimization came close.
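The basic pattern looks something like the following sketch (the `compile` function and catalog shape are made up for illustration): share read-only state behind an `Arc`, spawn a thread per independent unit of work, and join the results.

```rust
use std::sync::Arc;
use std::thread;

// Illustrative stand-in for compiling one independent query.
fn compile(catalog: &[String], query: usize) -> String {
    format!("{} against {} tables", query, catalog.len())
}

fn main() {
    // Read-only catalog shared across threads behind an Arc.
    let catalog: Arc<Vec<String>> = Arc::new(vec!["orders".into(), "users".into()]);

    let handles: Vec<_> = (0..4)
        .map(|q| {
            // Arc::clone is a cheap reference-count bump, not a deep copy.
            let catalog = Arc::clone(&catalog);
            thread::spawn(move || compile(&catalog, q))
        })
        .collect();

    let results: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(results.len(), 4);
}
```

The compiler enforces that everything crossing a thread boundary is actually safe to send; get it past `rustc` and the data races are simply gone.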
Synchronization locks - and why you should avoid them in Rust 🙅‍♂️
An immediately related topic to parallelization is that of locking. Locks are perhaps the go-to mechanism to safely introduce parallelism into an existing code base, especially for people coming from an imperative-paradigm background. A neat thing about Rust is that the type system tells you which objects need to be locked, via the `Send`/`Sync` traits, which makes the whole exercise a lot less stressful.
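A small sketch of how those marker traits surface in practice (the helper function is our own illustration, not a standard API): anything you hand to another thread must be `Send`, and anything you share must be `Sync`, and the compiler checks both for you.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical helper that makes the compiler's Send + Sync check explicit.
fn assert_send_sync<T: Send + Sync>(_: &T) {}

fn main() {
    // Arc<Mutex<T>> is Send + Sync, so it may be shared across threads.
    let shared = Arc::new(Mutex::new(0u32));
    assert_send_sync(&shared);

    // By contrast, the line below would be a *compile-time* error,
    // because Rc is neither Send nor Sync:
    // assert_send_sync(&std::rc::Rc::new(0));
}
```

This is the same ownership model doing double duty: the type system, not the programmer, decides which objects need synchronization.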
But there is a dark side to locking. For one, locks have a relatively high constant overhead on each invocation, often involving some mix of busy spinning and calls into OS primitives. More importantly, locks absolutely wreck the potential parallel factor of your program. In a sense, locking is "anti-parallelization" -- adding locks spoils the hard-earned fruits of your parallelization labor. In the extreme case, locks can reduce your parallel factor to ~1, making your parallel code merely an illusion (much like Python programs). Furthermore, it's just really hard to reason about how much any given lock actually hurts your parallel factor: sometimes it doesn't matter much, but some inconspicuous lock can make all the difference.
Your best bet is therefore to avoid locks whenever possible. Here we found that the functional way of thinking really helps -- instead of having workers share resources, clone a "privately-owned" copy of all the required resources for each unit of work, then aggregate the results from all the workers at the end. The tradeoff here is some additional cloning (which, as we mentioned, is usually very cheap) before starting the workers, in exchange for no locking while the workers are running.
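A minimal sketch of that clone-per-worker pattern (the "config" and the per-unit work are placeholders for illustration): each worker owns its inputs outright, runs lock-free, and the results are only combined after every worker finishes.

```rust
use std::thread;

fn main() {
    let shared_config = vec!["opt_a".to_string(), "opt_b".to_string()];
    let inputs: Vec<u64> = (1..=8).collect();

    let results: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|&n| {
                // Each worker gets its own private copy of the config up
                // front -- no Mutex or RwLock needed while it runs.
                let config = shared_config.clone();
                s.spawn(move || n * config.len() as u64)
            })
            .collect();
        // Aggregate only after all workers are done.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // 2 * (1 + 2 + ... + 8) = 72
    assert_eq!(results.iter().sum::<u64>(), 72);
}
```

The cloning happens once, before any worker starts; the hot parallel section touches no shared mutable state at all.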
The Memory Allocator - and a lesson on hidden 🥷 performance
MUSL is a lightweight C standard library implementation that supports fully static linking. It allows you to create an isolated, self-contained binary that works across almost any Linux distribution, and all of our releases are built against it. But one day, one of our customers reported a massive performance discrepancy in SDF between a 10-core M1 MacBook Pro and their 64-core x86 Linux server: the server was not much faster.
To the Apple fanboys: M1s are good, but they’re not that good.
It turns out that MUSL ships its own global memory allocator, and in heavily multithreaded situations this allocator performs extremely poorly. The gap has lessened over time as we've made other performance improvements, but it is still 2-3x depending on the workload. To read more, check out Andy Grove's more extensive writeup on the matter and how it impacted DataFusion's performance.
## COMPILE TIME 16K QUERIES ##
## 7950X3D, 16 cores ##
## AVG 500 columns per query ##
## Date: 03/10/2024 ##
+-----------+-------------+
| JEMALLOC | NO JEMALLOC |
+-----------+-------------+
| 0m19.088s | 0m40.706s |
+-----------+-------------+
Switching to a modern global allocator like jemalloc not only solved the issue, but provided performance improvements on every SKU we offer! Add the `jemallocator` crate to your `Cargo.toml` and declare it as your program's global allocator, as below.
// jemalloc used to be Rust's default allocator until circa November 2018.
// Here we explicitly opt back into it to avoid the abysmal MUSL allocator.
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;
Fearless Refactors
An often underappreciated benefit of using a strongly-typed language like Rust in a fast-paced startup environment is how much it helps with refactoring. In the Parallelization section we briefly touched on this from the perspective of adding parallelism, but the point extends far beyond that specific context. By now, aside from parallelization, SDF has undergone perhaps three or four broad, sweeping refactors touching almost every part of the codebase. Each was spurred by some key objective we had to hit at the time; all were accomplished within a matter of days and without causing noticeable regressions in existing workflows.
It may seem counter-intuitive to claim that a language with stronger type restrictions results in faster development cycles, but such has been our experience with Rust. There are two related aspects here.
First is the fact that it's generally improbable that you can design a "perfect" system from the get-go. And even if you somehow can, you can't foresee the future -- software requirements change and evolve over time, what may be a "perfect" design for yesterday's use case may not be able to handle tomorrow's scenario. It is therefore better to embrace (large) refactors as a regular and necessary part of the software development cycle, rather than something that is "abnormal" and to be avoided.
The other perhaps more subtle aspect, is the mentality shift that comes as a result of low-stress refactors -- because we know that better abstractions can be easily added later, it frees us to be much more bold in experimenting with new features! In Rust we are actually far more comfortable doing the "quick and dirty" thing to rapidly implement some new feature, knowing that it can all be easily cleaned up into better abstractions later when the requirements become better understood.
In contrast, when working with dynamically typed languages, one is generally more inclined to try to "get it right" the first time, in the hopes of avoiding nightmare refactors later (or, much worse, "deprecation" and "legacy systems"), which often leads to overthinking and inaction. Simply put, Rust enables us to adopt a "get things done now, add abstractions later" approach to evolving our product and meeting our clients' needs.
Closing Thoughts
We’ve gotten immense value out of the Rust ecosystem. It’s not a perfect language, and the community is still small. But for developing complex systems, Rust and its ecosystem remove an extraordinary amount of mental burden, allowing you to focus on what matters: delivering an exceptional product.
Our team built their careers on the development of compilers & programming languages. We can confidently say that we have benefitted greatly from buying into this ecosystem. Rust has allowed us to move faster, more safely, and with greater confidence than we would have with any other language.
Next time we will dive into how we leverage and build on Rust’s excellent tooling ecosystem and miscellaneous tricks we learned along the way.
Interested in contributing to this series? If you have any other tips, tricks, or resources to share, please do so in the comments below. Or, if you’re thinking of adopting Rust for your project, we’d be happy to talk, share our experiences, and grow the ecosystem.