
Engineering Principles: Back of the Envelope Calculations


Let’s say I give you the following problem statement: you have to design a database-migration protocol for a customer. The following product requirements are given as input:

  • The customer has 100 GB of data.
  • The customer can accept a 1-minute downtime.

How would we build a solution for this? Let’s say we have a simple, single-node database that already exists. How should we proceed?
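
Before considering any design, the requirements alone pin down one number worth writing out: the sustained throughput the migration has to achieve. A minimal back-of-the-envelope sketch (Python used purely as a calculator here):

```python
# What sustained end-to-end throughput do the requirements imply?
data_bytes = 100e9        # 100 GB of customer data
downtime_budget_s = 60    # at most 1 minute of downtime

required_bps = data_bytes / downtime_budget_s
print(f"{required_bps / 1e9:.2f} GB/s")  # ~1.67 GB/s, before any protocol overhead
```

Whatever protocol we pick has to sustain roughly 1.7 GB/s end to end during the downtime window.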

The Expensive Way #

One way, obviously, is to implement the migration protocol that does the following:

  1. Snapshot the database contents and turn the database off.
  2. Copy the snapshot to another machine.
  3. Bring the database up again, rehydrating it from the snapshot.
πŸ‘‰πŸΎ We could just implement it and see how it goes: but that’s a perilous execution strategy.

Database migration protocols are complex, with several edge cases, and getting things correct can be difficult. Even after spending a lot of time on the details and implementing a correct protocol, the solution may still fail to meet the requirement of a maximum 1-minute downtime.

If your intuition says we could implement something quick and dirty and measure it, you’re on the right track. We can push the idea of ‘quick and dirty’ to its limit: distill the problem into simpler, representative components and, once they are identified, measure their performance with minimal to no code.

Distill & Identify Components #

Let’s assume that the database is an in-memory database (say Redis). One way to distill the problem down into skeletal components is to measure:

  1. End-to-end time for snapshotting the database to disk.
  2. End-to-end time for copying the snapshot to another machine.
  3. End-to-end time for restarting the database on the other machine using the copied snapshot.

Let’s focus only on (2) - End-to-end time for copying the snapshot to another machine. We can measure the inter-machine network bandwidth to understand how quickly the snapshot can be copied. We would also need to know the disk bandwidth available to each machine for an accurate picture.
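
Before running anything, it helps to write down how those measurements will combine. The snapshot copy is a pipeline (read from the source disk, push across the network, write to the destination disk), and it can go no faster than its slowest stage. A small sketch, with hypothetical numbers plugged in for illustration:

```python
def estimated_copy_time_s(snapshot_bytes: float,
                          src_disk_read_bps: float,
                          network_bps: float,
                          dst_disk_write_bps: float) -> float:
    """Lower-bound estimate: the copy is limited by its slowest stage."""
    bottleneck_bps = min(src_disk_read_bps, network_bps, dst_disk_write_bps)
    return snapshot_bytes / bottleneck_bps

# Made-up inputs; the real ones come from the measurements described below.
print(estimated_copy_time_s(100e9, 2e9, 1.25e9, 2e9))  # 80.0 seconds
```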

πŸ‘‰πŸΎ We can perform these measurements without writing code using tools such as iperf and fio. This could be done in a single afternoon. Combining the Measurements

Perform & Interpret Measurements #

Once we perform the measurements, we will learn two things:

  1. We may discover that disk bandwidth is lower than network bandwidth (a common scenario). Let’s say we measure a disk bandwidth of 500 MB/s and a network bandwidth of 1,250 MB/s (essentially a 10-gigabit connection).
  2. 500 MB/s × 60 seconds = 30 GB, which is well short of 100 GB (see the quick check below).
πŸ‘‰πŸΎ In other words, the disk is the bottleneck in the system - and unless we have a faster disk, there’s no way we can do this with 1-minute downtime.

What we have seen here is an application of the back-of-the-envelope technique. The example here is elementary but highlights the critical aspects of problem-solving.

Negotiate Requirements #

One obvious (often ignored) way to proceed is to reconsider the requirements. In this case, an excellent question to ask the customer is, “What is the worst thing that can happen if the downtime is more than 1 minute?”

If the answer is something like “We lose $100M for every minute of downtime.”, it is evident that the downtime is unacceptable. If it is something like “The daily report runner job will run a minute late”, it sounds like there is some wiggle room in the requirements.

πŸ‘‰πŸΎ Note that changing the maximum downtime from one to two minutes is a 100% increase. This gives the system design a wider berth.

Work Around Bottlenecks #

πŸ‘‰πŸΎ Back-of-the-envelope calculations surface bottlenecks.

Once the bottleneck is known, you can mitigate the situation by trading the bottlenecked resource for another resource. In this example, we need to move the same amount of data through a bandwidth-limited path, so the natural trade is CPU for bandwidth: compression. However, there is no such thing as a free lunch: compression consumes CPU, and if the protocol cannot get the compute capacity it needs, compression will not speed up data transmission.
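
To make that trade-off concrete, here is a rough model. The compression ratio and compressor throughput below are illustrative assumptions, not measurements; in practice they would come from running the chosen codec over a sample of the actual data:

```python
# Assumes the data crosses the bottleneck (the disk, in our example) in
# compressed form, e.g. the snapshot is written and shipped compressed.
raw_bytes = 100e9
bottleneck_bps = 500e6      # the disk, from the earlier measurement
compress_ratio = 3.0        # assumption: 3:1 on this data set
compressor_bps = 800e6      # assumption: raw bytes/s one core can compress

effective_bps = min(compressor_bps, bottleneck_bps * compress_ratio)
print(raw_bytes / effective_bps)  # 125.0 s: better than 200 s, still over the 60 s budget
```

With these made-up numbers the CPU, not the disk, becomes the new bottleneck, which is exactly the trade described above.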

Risks & Gotchas #

Obviously, back-of-the-envelope calculations are not perfect. We need to recognize a few limitations.

Deeply Understand the Measurement #

A deep understanding of the thing being measured is essential. For example, when measuring disk bandwidth, it is pretty easy to run the experiment incorrectly and inadvertently measure the page cache rather than the disk.

πŸ‘‰πŸΎ Superficial understanding of the experiment can lead to a false start.

Bounding Values #

Tools such as fio and iperf are designed to saturate the thing they measure; a first-cut protocol implementation will not be as efficient. With back-of-the-envelope calculations, we are often estimating bounds. In this example, we measured a lower bound on transfer time: we cannot move the same amount of data in less time. Equivalently, we measured an upper bound on data size: we cannot move more data in the same fixed time. The first-cut protocol will likely be worse on both counts.

πŸ‘‰πŸΎ The first-principles-based skeletonization is a potentially lossy simplification.

Northstar for Performance #

πŸ‘‰πŸΎ Back-of-the-envelope calculations provide us the Northstar for performance.

Once the theoretical performance bounds are known and we have decided on a system design, we can proceed with the implementation. Thanks to the back-of-the-envelope calculations, our expectations for post-implementation performance are clear. Initial performance will likely be sub-optimal, but knowing the theoretical peak lets us keep bisecting the implementation until the next bottleneck is revealed.
