Engineering Principles: Back of the Envelope Calculations
Let’s say I give you the following problem statement: you have to design a database-migration protocol for a customer. As input, the following product requirements are given:
- The customer has 100 GB of data.
- The customer can accept a 1-minute downtime.
How would we build a solution for this? Let’s say we have a simple, single-node database that already exists. How should we proceed?
The Expensive Way #
One way, obviously, is to implement a migration protocol that does the following (a minimal sketch follows the list):
- Snapshot the database contents and turn the database off.
- Copy the snapshot to another machine.
- Bring the database up again, rehydrating it from the snapshot.
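To make the cost of this approach concrete, here is a minimal sketch (not a real implementation). The three steps are passed in as callables; snapshot_and_stop, copy_snapshot, and restore_and_start are hypothetical stand-ins for the actual database and transfer tooling. The point is that the entire window counts as downtime.

```python
import time

def migrate(snapshot_and_stop, copy_snapshot, restore_and_start, budget_s=60):
    """Run the naive migration and check it against the downtime budget."""
    t0 = time.monotonic()
    snapshot = snapshot_and_stop()    # 1. snapshot the contents, turn the DB off
    copy_snapshot(snapshot)           # 2. copy the snapshot to the other machine
    restore_and_start(snapshot)       # 3. bring the DB up, rehydrated from the snapshot
    downtime = time.monotonic() - t0
    if downtime > budget_s:
        raise RuntimeError(f"downtime budget exceeded: {downtime:.1f}s > {budget_s}s")
    return downtime
```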
Database migration protocols are complex, with several edge cases, and getting things right can be difficult. Even if one spends a lot of time on the details and implements a proper protocol, the solution may still not fit the requirement of a maximum 1-minute downtime.
If your intuition says we could implement something quick and dirty and measure it: you’re on the right track. We can push the idea of ‘quick-and-dirty’ to the limit. We can distill the problem into simpler representative components. Once identified, we can measure their performance with minimal to no code.
Distill & Identify Components #
Let’s assume that the database is an in-memory database (say Redis). One way to distill the problem down into skeletal components is to measure:
- End-to-end time for snapshotting the database to disk.
- End-to-end time for copying the snapshot to another machine.
- End-to-end time for restarting the database on the other machine using the copied snapshot.
Let’s focus only on (2) - End-to-end time for copying the snapshot to another machine. We can measure the inter-machine network bandwidth to understand how quickly the snapshot can be copied. We would also need to know the disk bandwidth available to each machine for an accurate picture.
Both can be measured quickly with off-the-shelf tools such as iperf and fio. This could be done in a single afternoon.
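As a rough sketch of how that afternoon might look, the following assumes iperf3 is already running in server mode (iperf3 -s) on the target machine and fio is installed locally; the host name and test-file path are placeholders, and the flags are the common ones rather than a tuned benchmark.

```python
import subprocess

# Network bandwidth between the two machines (JSON output for easy parsing).
subprocess.run(["iperf3", "-c", "target-host", "-t", "10", "-J"], check=True)

# Sequential write bandwidth of the local disk, bypassing the page cache.
subprocess.run([
    "fio", "--name=seqwrite", "--rw=write", "--bs=1M", "--size=4G",
    "--filename=/data/fio-test", "--direct=1",
], check=True)
```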
Perform & Interpret Measurements #
Once we perform the measurements, we will learn two things:
- We may discover that disk bandwidth is lower than network bandwidth (a common scenario). Let’s say we find that the disk bandwidth is 500 megabytes/second and the network is 1250 megabytes/second (essentially a 10-gigabit connection).
- At the bottleneck rate, 500 megabytes/second × 60 seconds = 30 gigabytes, well short of the 100 gigabytes we need to move (the arithmetic is spelled out in the sketch below).
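As a tiny script, using the assumed measurements from above:

```python
# Back-of-the-envelope check: can 100 GB cross the slowest link in 60 seconds?
data_gb       = 100
disk_mb_per_s = 500     # measured with fio
net_mb_per_s  = 1250    # measured with iperf (a 10-gigabit link)

bottleneck_mb_per_s = min(disk_mb_per_s, net_mb_per_s)    # disk wins: 500 MB/s
movable_gb = bottleneck_mb_per_s * 60 / 1000              # 30 GB in one minute
needed_s   = data_gb * 1000 / bottleneck_mb_per_s         # 200 s for 100 GB

print(f"{movable_gb:.0f} GB fits in a minute; the full copy needs {needed_s:.0f} s")
```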
What we have seen here is an application of the back-of-the-envelope technique. The example here is elementary but highlights the critical aspects of problem-solving.
Negotiate Requirements #
One obvious (often ignored) way to proceed is to reconsider the requirements. In this case, an excellent question to ask the customer is, “What is the worst thing that can happen if the downtime is more than 1 minute?”
If the answer is something like “We lose $100M for every minute of downtime.”, it is evident that the downtime is unacceptable. If it is something like “The daily report runner job will run a minute late”, it sounds like there is some wiggle room in the requirements.
Work Around Bottlenecks #
Once the bottleneck is known, you can mitigate the situation by trading the bottlenecked resource for another resource. In this example, we need to push more data through the same bandwidth in the same time. A go-to solution in this case is compression. However, there’s no such thing as a free lunch: compression consumes CPU. Compression is not feasible if the protocol cannot get the compute capacity it needs to speed up data transmission.
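A rough way to evaluate the trade-off is to compress a representative sample of the data and compare the CPU time spent against the transfer time saved. The sketch below is illustrative only: sample.bin is a hypothetical sample file, 500 MB/s is the bottleneck bandwidth from the earlier measurements, and stdlib zlib stands in for whatever codec the protocol would actually use.

```python
import time
import zlib

sample_path   = "sample.bin"   # hypothetical sample of the customer's data
link_mb_per_s = 500            # bottleneck bandwidth from the measurements

data = open(sample_path, "rb").read()
t0 = time.monotonic()
compressed = zlib.compress(data, level=1)   # fast setting; CPU time is the price
compress_s = time.monotonic() - t0

raw_s        = len(data) / (link_mb_per_s * 1e6)
compressed_s = len(compressed) / (link_mb_per_s * 1e6) + compress_s

# Compression only helps if the CPU time spent is smaller than the transfer
# time it saves (and the spare CPU is actually available during migration).
print(f"{len(data) / len(compressed):.2f}x smaller, "
      f"{raw_s:.2f}s raw vs {compressed_s:.2f}s compressed")
```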
Risks & Gotchas #
Obviously, back-of-the-envelope calculations are not perfect. We need to recognize a few limitations.
Deeply Understand the Measurement #
A deep understanding of the thing being measured is essential. For example, when measuring disk bandwidth, it is easy to run the experiment incorrectly and unknowingly measure the page cache rather than the disk, as the sketch below illustrates.
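Timing plain write() calls mostly measures how fast the kernel can buffer data in RAM; forcing the data out with fsync() gives a much more honest number. The path and size below are placeholders.

```python
import os
import time

buf = os.urandom(1 << 20)     # 1 MiB of incompressible data
n   = 1024                    # write 1 GiB in total

t0 = time.monotonic()
with open("/data/bw-test", "wb") as f:
    for _ in range(n):
        f.write(buf)          # without the fsync below, this mostly hits the page cache
    f.flush()
    os.fsync(f.fileno())      # include the flush to disk in the measurement
elapsed = time.monotonic() - t0

print(f"~{n / elapsed:.0f} MiB/s including fsync")
```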
Bounding Values #
Tools such as fio and iperf are designed to saturate the thing they measure. The first-cut protocol implementation will probably not be as optimal. With back-of-the-envelope calculations, we are often estimating bounds. In this example, we measured the lower bound on data-transfer time: we can’t transfer the same amount of data in less time. Equivalently, we measured the upper bound on data size: we can’t move more data in the same fixed time. The first-cut protocol will likely be worse on both of these measures.
Northstar for Performance #
Once the theoretical performance bounds are known and we decide on the system design, we can proceed with implementation. Thanks to the back-of-the-envelope calculations, our expectations from the post-implementation performance are clear. Initially, performance will likely be sub-optimal. Knowing peak theoretical performance allows us to bisect the implementation until a new bottleneck is revealed.
Additional Resources #
An excellent video explaining the technique in general: