A throwaway Ceph cluster in Docker and tmpfs

Most of the time I don’t want a Ceph cluster. I want the endpoint: an S3 URL with creds, a CephFS mount, or an RBD device, so I can point app code at real RADOS. The cluster is overhead I put up with to get there.

The normal way to get this locally is Rook: spin up a k8s cluster with minikube or kind, install the operator chart, write a CephCluster CR, carve out PVCs for the OSDs, wait a few minutes for it to converge. That’s the right tool when you actually want a Ceph-cluster-shaped thing, with replication and failure injection and the whole operator lifecycle. It’s a lot of machinery when all you wanted was an S3 endpoint to point a test suite at.

So I built the small version. One Docker container, one CLI verb, up in about thirty seconds, every byte it writes in RAM. Launch, run your tests, destroy. Reboot the box and there’s no trace it existed, because nothing was ever on disk.

Everything in tmpfs

The constraint: no persistent state. Not “cleaned up on exit,” never written to disk in the first place. My /tmp is tmpfs, 62.8 GiB of it, so that’s where the cluster lives.

BlueStore wants a block device and I don’t have a spare one, so the playground makes a sparse file in tmpfs, wraps it in a loop device, and hands BlueStore that.

the OSD backing store, in RAM

# Sparse file in tmpfs becomes the OSD block device via a loopback.
truncate -s 8G /tmp/cephplayground/<name>/osd0.img
losetup --find --show /tmp/cephplayground/<name>/osd0.img
# Every S3 object, CephFS file, and RBD block ends up in RAM
# through that loop device into that file.
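
A minimal sketch of guarding that step, assuming GNU coreutils (stat, truncate). The helper names are hypothetical and not taken from the playground. It refuses to create the image unless the target directory sits on tmpfs, and it exposes the apparent size and allocated blocks so the sparseness can be checked.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative, not from the playground.

# Print the type of the filesystem that holds a path (GNU stat).
fs_type_of() {
  stat -f -c %T "$1"
}

# Create a sparse OSD image of the given size, but only on tmpfs.
make_sparse_image() {
  local dir=$1 size=$2 img
  if [ "$(fs_type_of "$dir")" != "tmpfs" ]; then
    echo "refusing: $dir is not on tmpfs" >&2
    return 1
  fi
  img="$dir/osd0.img"
  truncate -s "$size" "$img" && echo "$img"
}

# Apparent size in bytes, and blocks actually allocated so far.
apparent_bytes() { stat -c %s "$1"; }
allocated_blocks() { stat -c %b "$1"; }
```

The image path printed by make_sparse_image is what would then be handed to losetup --find --show, which needs root.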

Inside the container the daemons’ runtime dirs are tmpfs too: /var/lib/ceph (mon db, mgr db, MDS journal, OSD bookkeeping), plus /etc/ceph, /run/ceph, /var/log/ceph, /tmp. The only thing that touches my SSD is the read-only quay.io/ceph/ceph image layer Docker already cached. Reboot and it’s all gone.
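
As a rough sketch of how those mounts could be requested, assuming plain docker run and its --tmpfs flag. The helper name is hypothetical, and the playground's real invocation may differ; it would also need the loop device and an entrypoint, which are left out here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit one --tmpfs flag per runtime directory
# listed in the text.
tmpfs_flags() {
  local dir
  for dir in /var/lib/ceph /etc/ceph /run/ceph /var/log/ceph /tmp; do
    printf -- '--tmpfs %s\n' "$dir"
  done
}

# Usage sketch (not run here); the image tag is the one used as an
# example elsewhere in the text:
#   docker run -d $(tmpfs_flags) quay.io/ceph/ceph:v18
```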

State versus data

It’s tempting to call the mon and MDS bookkeeping “state” and the objects “data,” as if only one needed to be throwaway. Same requirement. If the mon db survives a reboot but the OSD doesn’t, the cluster remembers a map of storage that isn’t there anymore. Both live in tmpfs or neither does.

No cephadm, no nested containers

The other call was to skip cephadm. It’s how you’re supposed to bootstrap modern Ceph, but it orchestrates a container per daemon. Do that inside a container and you’ve got a container runtime inside a container, which I didn’t want.

So the entrypoint starts the daemons directly: a mon, a mgr, one OSD, then optionally an MDS for CephFS and a RADOS gateway for S3, all plain processes in the one container. A --services flag (default rgw,cephfs,rbd) picks which optional ones come up. When CephFS or RBD is on, the container flips to host networking so a client on the host can reach the mon, MDS and OSD directly; S3-only keeps the simpler port-forward.
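
The networking decision can be sketched as a small function. This is an illustration under assumptions, not the playground's code: the function name is hypothetical, and "bridge" stands for Docker's default network with a published port.

```shell
#!/usr/bin/env bash
# Hypothetical: pick a Docker network mode from a --services value.
# CephFS or RBD needs host networking; S3-only can stay on the default
# network behind a port-forward.
network_mode_for() {
  local services=${1:-rgw,cephfs,rbd} svc mode=bridge
  local IFS=,
  for svc in $services; do
    case $svc in
      rgw) ;;
      cephfs|rbd) mode=host ;;
      *) echo "unknown service: $svc" >&2; return 2 ;;
    esac
  done
  echo "$mode"
}
```

Called with no argument it falls back to the default service list from the text and therefore prints host.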

That left the one job cephadm normally does for you: creating the OSD. That’s where the version sweep fell over.

The Quincy launch that hung

I wanted this working on more than one Ceph release, so I ran a sweep: launch each major, probe it, destroy it. The by-hand way to bring up an OSD is ceph-volume raw prepare pointed at the loop device. Fine on the newer releases. On v17, Quincy, the launch hung.

The container log scrolled an error about “no LV,” which sent me into LVM. Dead end: the playground doesn’t use LVM, it hands BlueStore a raw loop device.

The real cause: ceph-volume raw prepare is buggy on Quincy when the target is a loop device. The “no LV” line is rollback noise it prints while unwinding a prepare that never should’ve started. Not an LVM problem, not a config problem, just the tool on that release against that device.

the misdirection

[cephplayground] preparing OSD on /dev/cephplay-osd0
[cephplayground] raw prepare with explicit OSD id failed; retrying fresh-cluster prepare
# "no LV found" scrolls here. There is no LV. There was never going
# to be an LV. The message is describing the rollback, not the cause.

The fix was to stop using ceph-volume for this. The OSD bootstrap doesn’t need it. You can register a new OSD and lay down a BlueStore fs by hand, and that path behaves the same from v16 through v20.

manual BlueStore, identical across four majors

# Register the OSD and mkfs BlueStore directly onto the loop device.
osd_uuid=$(uuidgen)
osd_id=$(ceph osd new "$osd_uuid")
ceph-osd -i "$osd_id" --mkkey
ceph-osd -i "$osd_id" --mkfs --osd-uuid "$osd_uuid"
# No ceph-volume, no LVM, no per-release surprises.

Dropping ceph-volume raw prepare for a manual BlueStore mkfs got the whole sweep passing.
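
The snippet above does not show the OSD data directory. In a conventional layout, BlueStore finds its device through a symlink named block inside /var/lib/ceph/osd/ceph-<id>. A hedged sketch of that step, which would run before the mkfs, follows; the helper name is hypothetical, and whether the playground does exactly this is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the OSD data directory and point its
# "block" entry at the device. root defaults to the conventional
# location and can be overridden.
prepare_osd_dir() {
  local osd_id=$1 device=$2 root=${3:-/var/lib/ceph/osd}
  local dir="$root/ceph-$osd_id"
  mkdir -p "$dir" || return 1
  ln -sfn "$device" "$dir/block" || return 1
  echo "$dir"
}

# Usage sketch (not run here):
#   prepare_osd_dir "$osd_id" /dev/cephplay-osd0
```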

Ceph majors working: v16 to v20 (Pacific through Squid)

Time to ready: ~30 s (one container, one verb)

SSD writes for state: none (tmpfs all the way down)

One smaller wrinkle I guarded instead of fixing: the RGW realm bootstrap. Older releases want you to create the realm, zonegroup and zone yourself; v19 and up auto-create the defaults, so the explicit call turns into an error. It’s gated behind a radosgw-admin zonegroup get probe: manual setup where that’s needed, no-op where it isn’t. The version knob stays --image quay.io/ceph/ceph:v18 instead of a dedicated flag, because a --ceph-version flag would be sugar over the same string and I’d be chasing upstream tag renames forever.
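
A sketch of what such a gate might look like. The function name, the RGW_ADMIN override and the realm, zonegroup and zone names ("default") are assumptions for illustration; only the zonegroup get probe comes from the text.

```shell
#!/usr/bin/env bash
# Hypothetical gate. RGW_ADMIN can be overridden, for example in tests.
ensure_rgw_zone() {
  local admin=${RGW_ADMIN:-radosgw-admin}
  # Probe: if a zonegroup already exists, the release created the
  # defaults itself and there is nothing to do.
  if "$admin" zonegroup get >/dev/null 2>&1; then
    echo "zonegroup present, skipping"
    return 0
  fi
  # Otherwise create realm, zonegroup and zone, then commit the period.
  "$admin" realm create --rgw-realm=default --default &&
  "$admin" zonegroup create --rgw-zonegroup=default --master --default &&
  "$admin" zone create --rgw-zonegroup=default --rgw-zone=default \
    --master --default &&
  "$admin" period update --commit
}
```

Probing for the zonegroup rather than comparing version numbers keeps the gate independent of how upstream names its image tags.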

The client side was the fiddly part

The cluster side was easy. RGW is just HTTP, so handing someone an S3 endpoint means printing AWS_ENDPOINT_URL and the access/secret keys for a pre-made user. CephFS and RBD aren’t that polite. A client needs a keyring, a ceph.conf, and an actual mount or map, and kernel mounts want root and matching kmods.

So env prints everything a client needs: the conf and keyring paths, plus a ready-to-paste mount line, and you do the mount yourself with the userspace tools. It writes per-service scoped keyrings instead of handing out client.admin: client.cephplay-fs can touch the filesystem, client.cephplay-rbd can touch the pool, and neither can administer the cluster.
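
One way such scoped identities could be created, using ceph fs authorize and the built-in rbd capability profiles. This is a sketch, not the playground's code: the function name, the CEPH_BIN override and the filesystem and pool arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. CEPH_BIN can be overridden, for example with
# echo, to print the commands instead of running them.
create_scoped_clients() {
  local fs_name=$1 pool=$2 ceph=${CEPH_BIN:-ceph}
  # CephFS client: read/write on the filesystem root, nothing else.
  "$ceph" fs authorize "$fs_name" client.cephplay-fs / rw || return 1
  # RBD client: the rbd profiles, limited to a single pool.
  "$ceph" auth get-or-create client.cephplay-rbd \
    mon 'profile rbd' osd "profile rbd pool=$pool"
}

# Usage sketch (not run here); "playfs" is a made-up filesystem name:
#   create_scoped_clients playfs rbd
```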

what the user actually does

# CephFS over FUSE, no kernel module required:
ceph-fuse --id cephplay-fs --conf "$CEPHPLAY_CONF" /mnt/play

# RBD over NBD, same idea (prints the /dev/nbdN device it mapped):
sudo rbd-nbd map rbd/play --id cephplay-rbd --conf "$CEPHPLAY_CONF"

The end-to-end test matters, because “the daemons came up” is not the same as “a client can use it.” RGW answered 200 and a bucket round-tripped. CephFS mounted over ceph-fuse with the scoped keyring, and a file written through it read back. RBD was the strict one: map over rbd-nbd, mkfs.ext4, mount, write a file, unmap, remap, check the file’s still there. The unmap-and-remap proves the block landed in the backing store and wasn’t sitting in a client cache.
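
The write-then-verify half of that check can be sketched with two small helpers that work on any mounted directory, so the unmap and remap can happen between them. The helper names and the marker file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a round-trip check on a mounted directory.

# Write a unique token to a marker file, flush it, and print the token.
write_marker() {
  local dir=$1 token
  token="cephplay-$$-$(date +%s)"
  printf '%s\n' "$token" > "$dir/marker" && sync && echo "$token"
}

# Succeed only if the marker file still holds the expected token.
verify_marker() {
  local dir=$1 want=$2
  [ "$(cat "$dir/marker" 2>/dev/null)" = "$want" ]
}

# Usage sketch for the RBD case (needs root and rbd-nbd, not run here):
#   token=$(write_marker /mnt/play)
#   umount /mnt/play && rbd-nbd unmap "$dev"
#   dev=$(rbd-nbd map rbd/play --id cephplay-rbd --conf "$CEPHPLAY_CONF")
#   mount "$dev" /mnt/play && verify_marker /mnt/play "$token"
```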

The devices that look like leftovers but are not

After an RBD test, /dev/nbd0 through /dev/nbd15 stick around, all size zero, like the playground failed to clean up. It didn’t. Those nodes are what the kernel nbd module makes when it loads (nbds_max defaults to 16), and rbd-nbd map auto-loads it. The slots sit idle until the next reboot or a modprobe -r nbd. Same as the pre-allocated /dev/loop* nodes.
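
To confirm that the nodes are idle slots and not leaked mappings, the sizes under sysfs can be checked. A small sketch follows; the function name is hypothetical, and the sysfs root is a parameter so the logic can be tried against a fake tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list nbd devices that have no image attached.
# sys_block defaults to the real sysfs path.
idle_nbd_devices() {
  local sys_block=${1:-/sys/block} d
  for d in "$sys_block"/nbd*; do
    [ -r "$d/size" ] || continue
    # size is reported in 512-byte sectors; 0 means nothing is mapped.
    if [ "$(cat "$d/size")" -eq 0 ]; then
      echo "/dev/${d##*/}"
    fi
  done
}
```

If every nbd node shows up in that list after a test run, nothing is still mapped.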

What it is for

This isn’t a production deployment. One OSD, no replication, no failure domains, the whole cluster a single point of failure by design, all in volatile memory. It’s a real RADOS endpoint with the real RGW, MDS and RBD surfaces, cheap enough to spin up and tear down inside a test run, that leaves nothing behind.

Rook’s still the right answer when you want a cluster you can break and watch heal. When you want an S3 URL in thirty seconds and your SSD untouched, this is the smaller tool.

The code is on GitHub.
