~/writing/cannot-average-percentiles
You cannot average percentiles
Finding an IMAP server's real capacity means driving load from many hosts at once. The hard part isn't generating the load. It's combining their latencies into one honest P99 without quietly lying about it.
How many operations a second can an IMAP server actually sustain before it stops keeping up? Sounds simple. It isn't, and the difficulty isn't in producing the load.
One client can't produce enough of it. A single machine saturates its own CPU and network long before it stresses a real mail server, so you generate load from many hosts at once. Dovecot's imaptest is the honest way to drive each one: a full IMAP session, login, list, select, fetch, store, append, logout, not just a tight APPEND loop. Run it across a fleet and you have plenty of load. Now you have to combine what each host measured, and that is where it gets statistically ugly.
Finding the knee
The number worth reporting isn't peak throughput. It's the rate at which the server stops keeping up. Step the offered rate up, 1k/s, 2k, 4k, and watch the gap open between what you asked for and what actually got delivered. The knee is the first rate where delivered throughput drops below some fixed fraction of target, say 90 percent. Below it the server tracks demand; above it, asking for more buys you errors and longer tails, not more work done. The exact threshold is arbitrary, but pick one and hold it, so "where it falls over" becomes a number you can plot and compare between runs instead of a gut call.
The trap: percentiles don't average
Every host measures its own latencies. The tempting way to get a cluster-wide P99 is to take each host's P99 and average them. That number is wrong, and wrong in a way that looks plausible enough to ship.
A percentile is a property of a distribution, not something you can take the mean of. The P99 of the combined latencies of fifty machines is generally not the average of their fifty P99s. Say one host drew the accounts that hash to a slow backend: its P99 is 400 ms while everyone else sits at 90. Average the per-host P99s and that host is one vote in fifty, smeared into nothing. But its slow requests are real requests a real user would have made. They belong in the tail. Averaging the percentiles erases exactly the tail you were trying to measure.
The mean of percentiles is not a percentile of anything
P99(A ∪ B) is not (P99(A) + P99(B)) / 2. You can't rebuild a percentile of the union from per-group percentiles, and definitely not from per-group means. An honest cluster percentile needs the actual combined distribution, or a faithful sample of it. There is no shortcut that keeps the tail, and the tail is the whole report.
The fix: sample, then pool
You need the real distribution across all hosts, but shipping every latency from every machine running flat out for a minute is a lot of data to move. So have each host keep its raw samples and return a uniform random subsample, a few thousand points per phase. Concatenate those subsamples into one pool and compute the percentiles from the pool.
Uniform sampling is what makes it honest. Because each host samples its own latencies uniformly, the pool is itself a uniform sample of the true cluster-wide distribution, tail included. The slow host's 400 ms requests appear in the pool in proportion to how many there actually were. A P99 over the pool is a real estimate of the real cluster P99, not an average of summaries that already threw the distribution away.
Keep the phases apart, too:
- Dial: time to complete the TLS handshake, per connection.
- Login: time for the IMAP LOGIN. Often bimodal: a warm auth cache answers in microseconds, a cache miss pays the full bcrypt cost. Blended into everything else, the two populations average into a meaningless middle. On its own, the cliff is obvious.
- Append and fetch: the actual mailbox work.
A single "latency" number hides all of it. The bcrypt cliff only shows up when login is measured on its own.
Make the load resemble real users
Honest aggregation is wasted on a caricature workload. Two things matter most. Put the same account on several hosts at once: real users hit one mailbox from a phone, a laptop, and webmail at the same time, and that concurrency is what stresses the server's per-user locking. And open more than one session per account, because a single connection's synchronous request-and-reply hides the server's real concurrency ceiling.
Two failure modes are worth designing out before they pollute the numbers. If every host dials at the same instant you measure your own thundering herd at t=0, not steady-state capacity, so stagger the starts across a window. And when a connection errors and reconnects, jitter the backoff; otherwise one server hiccup resynchronizes every client onto the same beat and you have built a self-inflicted spike.
Count errors by kind, not just total
When requests fail, group them by normalized message instead of reporting a bare count. "A hundred errors" could be a capacity limit or a server bug. "Mostly login timeouts" and "mostly server errors on append" point at completely different problems, and you cannot tell which from the total alone.
A load test is only as honest as its worst aggregation step. You can generate flawless load and still publish a fiction if you average the percentiles at the end. Keep the distribution, sample it uniformly, pool the samples, and let the tail speak for itself.