Why you cannot average percentiles for p99 latency

How many operations a second can an IMAP server sustain before it stops keeping up? The hard part isn’t producing the load.

One client can’t produce enough of it. A single machine saturates its own CPU and network long before it stresses a real mail server, so you generate load from many hosts at once. Dovecot’s imaptest is the honest way to drive each one: a full IMAP session, login, list, select, fetch, store, append, logout, not just a tight APPEND loop. Run it across a fleet and you have plenty of load. Now you have to combine what each host measured, and that is where it gets statistically ugly.

Finding the knee

The number worth reporting is the rate at which the server stops keeping up, not peak throughput. Step the offered rate up, 1k/s, 2k, 4k, and watch the gap open between what you asked for and what got delivered. The knee is the first rate where delivered throughput drops below some fixed fraction of target, say 90 percent. Below it the server tracks demand; above it, asking for more buys you errors and longer tails, not more work done. The exact threshold is arbitrary, but pick one and hold it, so “where it falls over” becomes a number you can plot and compare between runs instead of a gut call.

The trap: percentiles don’t average

Every host measures its own latencies. The tempting way to get a cluster-wide P99 is to take each host’s P99 and average them. That number is wrong, and plausible enough to ship.

A percentile is a property of a distribution, not something you can take the mean of. The P99 of the combined latencies of fifty machines is generally not the average of their fifty P99s. Say one host drew the accounts that hash to a slow backend: its P99 is 400 ms while everyone else sits at 90. Average the per-host P99s and that host is one vote in fifty, smeared into nothing. But its slow requests are real requests a real user would have made. They belong in the tail. Averaging the percentiles erases the tail you were trying to measure.

The mean of percentiles is not a percentile of anything

P99(A ∪ B) is not (P99(A) + P99(B)) / 2. You can’t rebuild a percentile of the union from per-group percentiles, and definitely not from per-group means. An honest cluster percentile needs the combined distribution, or a faithful sample of it. No shortcut keeps the tail.

The fix: sample, then pool

You need the real distribution across all hosts, but shipping every latency from every machine running flat out for a minute is a lot of data to move. So have each host keep its raw samples and return a uniform random subsample, a few thousand points per phase. Concatenate those subsamples into one pool and compute the percentiles from the pool.

Uniform sampling makes it honest. Because each host samples its own latencies uniformly, the pool is itself a uniform sample of the cluster-wide distribution, tail included. The slow host’s 400 ms requests appear in the pool in proportion to how many there were. A P99 over the pool estimates the cluster P99, instead of averaging summaries that already threw the distribution away.

Keep the phases apart, too:

Dial: time to complete the TLS handshake, per connection.
Login: time for the IMAP LOGIN. Often bimodal: a warm auth cache answers in microseconds, a cache miss pays the full bcrypt cost. Blended into everything else, the two populations average into a meaningless middle. Measured on its own, the cliff is obvious.
Append and fetch: the actual mailbox work.

A single “latency” number hides all of it. The bcrypt cliff only shows up when login is measured on its own.

Make the load resemble real users

Honest aggregation is wasted on a caricature workload. Put the same account on several hosts at once: real users hit one mailbox from a phone, a laptop, and webmail at the same time, and that concurrency stresses the server’s per-user locking. Open more than one session per account too, since a single connection’s synchronous request-and-reply hides the server’s concurrency ceiling.

Design out two failure modes before they pollute the numbers. If every host dials at the same instant you measure your own thundering herd at t=0, not steady-state capacity, so stagger the starts across a window. When a connection errors and reconnects, jitter the backoff; otherwise one server hiccup resynchronizes every client onto the same beat and you’ve built a self-inflicted spike.

Count errors by kind, not just total

When requests fail, group them by normalized message instead of reporting a bare count. “A hundred errors” could be a capacity limit or a server bug. “Mostly login timeouts” and “mostly server errors on append” point at completely different problems, and you cannot tell which from the total alone.

You can generate flawless load and still publish a fiction if you average the percentiles at the end. Keep the distribution, sample it uniformly, and pool the samples.

You cannot average percentiles

Finding the knee

The trap: percentiles don’t average

The fix: sample, then pool

Make the load resemble real users

Count errors by kind, not just total

Keyboard shortcuts

Finding the knee#

The trap: percentiles don’t average#

The fix: sample, then pool#

Make the load resemble real users#

Count errors by kind, not just total#

Finding the knee

The trap: percentiles don’t average

The fix: sample, then pool

Make the load resemble real users

Count errors by kind, not just total