All writing

~/writing/ceph-read-amplification

Systems debugging
9 min read

What is putting those bytes on the wire?

The storage interface on a production cluster was pulling 4 to 12x more than the public interface where client requests arrive. Getting to a defensible number meant distrusting everything, including my own profiler, which was off by 30x.

On the front-end nodes of a production cluster I help run, the private storage interface carrying CephFS traffic was receiving 4 to 12 times more bytes than the public interface where client requests arrive. These machines exist to accept inbound requests and write to storage, if anything. Instead the storage network was pulling down a torrent of reads.

What is putting those bytes on the wire? I wanted a number I could defend when someone who knows the kernel better than I do comes to poke at it.

This is how I got it, including the part where my own instrument was off by a factor of thirty and I trusted the counter instead.

Pick the one number that cannot lie

Pick what you trust before you start collecting data, because the moment you have data you’ll be tempted to trust whatever was easiest to collect.

On a CephFS mount, the kernel client keeps a metrics block in debugfs:

the only number I trusted
cat /sys/kernel/debug/ceph/*/metrics
# read total_sz: bytes actually pulled from the OSDs
# write total_sz: bytes actually pushed to the OSDs

read total_sz is bytes the client physically fetched from the object storage daemons (OSDs). Not bytes an application asked for, not bytes that came back from cache. It’s incremented where bytes leave the OSD and hit the client, so it’s immune to cache hits, async timing, and burstiness. If another number disagrees with this counter, the other number is wrong until I can prove otherwise.

Everything else (process names, file paths, which PID was reading) I treated as attribution. Attribution is a hypothesis about who caused the bytes, not the measurement of the bytes. Keeping those two separate is the whole methodology.

The cache trap

The obvious place to hook is ceph_read_iter, the VFS entry point for a read on a Ceph file. Probe it, sum the bytes, attribute to the process. I ran a sanity check by md5sum-ing a shared map file I knew was already cached on the node, and watched the probe report 203 MiB of reads.

That nearly went into the report as load. It has the shape of a finding: big file, real process, fat byte count. But ceph_read_iter counts cache hits. Those 203 MiB never touched the OSDs, and the debugfs counter barely moved during the md5sum. I’d measured VFS demand and called it wire traffic.

Demand is not wire

A read served from page cache is real I/O to the application and zero bytes on the storage network. If you measure at the VFS layer you are measuring what the application wanted, not what the cluster had to deliver. Those two numbers can differ by orders of magnitude on a warm cache.

The fix was to measure fetch instead of demand. I separated two things the first probe had smeared together: ceph_read_iter for what the application requested, and the readahead path for what the client actually prefetched off the OSDs. With both, I could compute a per-file cache-hit percentage, and the cached shared map dropped out of the “load” column. It reads plenty at the application layer and puts almost nothing on the wire.

The amplifier

With fetch measured instead of demand, the central number came out of one file:

the answer was sitting in sysfs
cat /sys/class/bdi/ceph*/read_ahead_kb
# 8192

8 MB readahead on every CephFS mount. When a process reads a few bytes through a file descriptor, the kernel fetches up to 8 MB ahead on the theory that you’ll want the next bytes soon. For a workload streaming large files sequentially, that’s a gift. For a workload poking at tiny files, every byte-sized read drags up to 8 MB off the OSDs.

That’s the amplification. Measured against actual read operations it came out to roughly 147 KiB pulled per read, about 13 times the bytes the application asked for. It explains the shape of the whole graph:

Read amplification
~13x
~147 KiB pulled per read op
Lifetime read:write
31:1
906 GB read vs 29 GB written
Private RX:TX
~19:1
storage iface receives, barely sends

Reads dominate writes structurally, not because of a spike. These nodes’ storage job is per-entity config lookups, not bulk writes. Lifetime ratio near 31 to 1: 906 GB read against 29 GB written. The private interface receives about 19 times what it transmits. The storage network was answering a flood of tiny config lookups, each one inflated 13x by readahead.

Why a front-end node reads shared storage at all

Why do these nodes touch CephFS on the read path at all? A node that only forwarded requests wouldn’t. But these don’t just forward, they make a per-request decision, and that decision needs configuration. Every inbound request has to read a handful of small per-entity config files off shared storage before it can act. That’s roughly 0.65 MiB pulled per request, none of it bulk payload. These nodes never read large objects off CephFS; they read tiny per-entity config files, and the 8 MB readahead inflates each byte-sized config read into a large wire transfer.

Broken down by what the per-request path actually touches:

reads per inbound request, by source
identity / auth lookups           ~19-22%
limits / policy data              ~22%
per-entity rules                  ~19%
periodically refreshed shared map ~7%   (mostly cache hits)

The shared map is the file that nearly fooled me in the cache trap. Read constantly, served almost entirely from cache, so it shows up large in demand and small on the wire. With the cache-hit percentage in hand, it sat in the right column this time.

When the heavy reads are not the workload at all

A second pattern in the data didn’t fit “per-request config lookups,” and it would have looked like noise if I’d averaged across all the nodes.

Some nodes were reading 6 to 11 MiB/s, about 4.7 MiB per request, an order of magnitude over the config-lookup baseline. Others were light. The split didn’t track request volume; it tracked migration state. A background replication job bulk-reads full data objects on the nodes where it runs, and the heavy nodes were the ones still being migrated. The node that had finished cutting over was light.

That gave a falsifiable prediction: once the remaining clusters finish cutting over, their read rate drops to the light baseline. Not “should improve.” Drops to a specific baseline I’d already measured on the node that was done.

The 30x lie in my own tool

To do attribution properly I’d written a custom profiler, bpftrace feeding a small Go aggregator, that hooked ceph_readahead and read the readahead window size out of the readahead_control structure passed to it:

the probe that was wrong
kprobe:ceph_readahead
  $rac = (struct readahead_control *)arg0;
  $pages = $rac->_nr_pages;     // window size, in pages
  // attribute $pages * 4096 to the current process

Reasonable on its face. But this kernel uses the netfs layer, and netfs expands the readahead window after ceph_readahead is entered, in ceph_netfs_expand_readahead. So _nr_pages at entry to ceph_readahead is the pre-expansion size. My probe was capturing the window before the kernel grew it, undercounting by about 30x the bytes that actually hit the OSDs.

I knew it was wrong because it disagreed with the debugfs counter. The counter said one thing, my profiler said about 1/30th of that, and the rule I set at the start is that the counter wins. My instrument was wrong until I could prove it right, and I couldn’t.

The fix was to anchor the probe deeper, on the core mm function page_cache_ra_unbounded, which runs after the window is finalized. It works where the old hook didn’t for two reasons. It’s present in this kernel’s BTF, so bpftrace can attach to it and read its arguments by type. And its readahead_control argument carries the file pointer, which lets me reach the inode and superblock and filter to CephFS by the superblock magic:

the probe that reconciled
kprobe:page_cache_ra_unbounded
  $rac = (struct readahead_control *)arg0;
  $sb = $rac->file->f_inode->i_sb;
  if ($sb->s_magic == 0x00c36400) {        // CEPH_SUPER_MAGIC
      $bytes = arg2 * 4096;                 // post-expansion nr_to_read
      // now attribute, knowing the window is final
  }

After moving the anchor, the profiler’s total reconciled to about 170 MB over 6 minutes, roughly 471 KB/s, matching the debugfs counter delta exactly. Then a three-way comparison: the debugfs counter, /proc/net/dev on the storage interface, and the time-series database we already had all agreed within 6 percent. The attribution was now trustworthy, because the total it summed to was the number I already trusted.

Your instrument is a suspect

When your custom tool disagrees with a counter you have independent reasons to trust, assume your tool is wrong, not the counter. Don’t reconcile a 30x discrepancy by averaging or adding a fudge factor. It’s a defect in the part you built, and it stays one until the tool reproduces the trusted number on its own.

The reason traces lie about who

One more trap, because it’s why attribution is hard even after the byte count is correct.

I’d considered attributing reads by tracing openat, matching opened paths to the processes that opened them. That produces confident, plausible, wrong answers. Tracing openat only catches files opened during the trace window, but real processes read through descriptors opened earlier or inherited from a parent. The service forks worker processes that inherit open fds onto shared storage, so the file an openat trace never saw is the file being hammered.

The process that shows up in ceph_readahead is the one whose read triggered the prefetch, not necessarily the one that opened the file. A tiny read through an inherited descriptor pulls 8 MB, and the byte count lands on whoever issued that read. Attribution by openat alone would have produced a clean, wrong story about who was responsible. The byte count anchored on the trusted counter is correct regardless; the who is the part that requires care.

What the investigation was

No patch at the end of this one. The 8 MB readahead is arguably correct for a filesystem designed to stream large objects; the right fix is a per-mount tuning conversation, not a kernel change, and that conversation is now quantified instead of a guess. The deliverable was the explanation: the storage network’s receive load is per-entity config lookups amplified roughly 13x by readahead, plus a migration job that disappears on cutover, with a prediction for what happens when it does.

The explanation wasn’t the hard part. The hard part was refusing to ship the cache-trap reading, the openat attribution, and my own profiler when it disagreed with a counter I had reasons to believe. Anchor on a number that can’t lie, treat process and path attribution as a hypothesis rather than the measurement, and separate what the application demanded from what the cluster delivered. When your instrument disagrees with the anchor, the instrument is wrong until it reproduces the anchor on its own.