All writing

~/writing/ts2phc-go-frozen-clock

Precision timing
8 min read

The frozen clock that nobody noticed

A GPS-disciplined NTP server stopped being disciplined, and every monitor stayed green. The giveaway was a freq_ppb value that repeated byte-for-byte across three seconds, which a live servo never does.

A Grafana dashboard had been stale since 2pm. Not down, not erroring: stale. Lines flat and plausible, panels green, NTP pool monitors quiet. Every check said there was no problem.

The clock had stopped knowing the time hours earlier and had been wrong ever since. I caught it because I was staring at a number that should never repeat.

This is about ts2phc-go, a Go reimplementation I wrote of linuxptp’s ts2phc. Its job is narrow: discipline a NIC’s PHC to the PPS signal off a GPS receiver. On a server in the public NTP pool, that PHC is the foundation. chrony reads it as a local reference and serves stratum 1 to whoever asks. If the PHC drifts or freezes, everyone downstream does too, and nobody finds out, because a frozen clock that still ticks is indistinguishable from a working one to anything that only asks it the time.

The number that should not repeat

The servo in ts2phc-go logs its state every second. One of the fields is freq_ppb, the frequency correction it’s currently applying to the PHC oscillator in parts per billion. A disciplining servo is always nudging this. The oscillator drifts with temperature, the loop chases it, the correction wobbles in the last few digits. A live servo doesn’t report the same float twice in a row, because the physical world doesn’t hand it the same offset twice in a row.

I was scrolling the logs and saw this:

servo log, 3 seconds apart
14:02:51  offset       4 ns  freq_ppb 17139.674132458233  ...
14:02:52  offset       4 ns  freq_ppb 17139.674132458233  ...
14:02:54  offset       4 ns  freq_ppb 17139.674132458233  ...

17139.674132458233, byte for byte identical, across a three-second gap. A clock holding steady still jitters in the low digits. This one had stopped computing. The servo was printing the last value it ever calculated, over and over, like a stopped watch propped up to look like it was running.

Stable is not frozen

A repeated measurement isn’t a stable measurement. A live servo’s freq_ppb drifts in its least significant digits on every sample, because the physical offset it chases never repeats exactly. A byte-identical float across two samples means the clock is dead and nobody pulled the plug.

Why every monitor stayed green

The failure was upstream of the servo. gpsd’s NTP feed had died. The main loop in ts2phc-go reads an offset from the GPS source each iteration, and when the source has nothing to give, the loop hits a continue. So the servo never sampled a new offset, never updated, and the PHC stopped being disciplined. It coasted on its own oscillator from 2pm onward, drifting freely, while the servo recited its last freq_ppb.

chrony, downstream, kept reading that PHC as its reference and kept advertising stratum 1 to the pool. From chrony’s point of view nothing changed. The PHC still returned a time. The time was wrong and getting wronger, but chrony had no way to know that, because it reads the PHC as a local reference with no independent cross-check. A frozen-but-plausible PHC is indistinguishable from a live one to anything that only asks it what time it is. You have to ask it something it can’t fake: whether it’s still being told the time by something outside itself.

The external pool monitors couldn’t see it either. They measure offset against the consensus, and a clock that froze while it was already correct stays correct enough, for a while, to pass. By the time the drift tripped a threshold the damage to downstream trust was already done.

The shape of this failure

The dangerous failures in timekeeping aren’t the loud ones. A clock that crashes gets noticed. A clock that keeps serving and stops updating takes out everything that trusts it, because every check designed to catch a broken clock checks for one that stopped responding, not one that stopped thinking.

Recovery was straightforward once I knew where to look. Restart gpsd, wait for a 3D fix (it came back on 23 to 25 satellites), the source started handing the loop fresh offsets again, the servo woke up, and downstream chrony relocked to roughly +9 ns at about +/-19 ns. The fix was a process restart. Finding it was noticing a float.

How I got here: reading the structs against the headers

I trusted that freq_ppb tell because I’d spent a lot of time inside this code, checking what it reported against what the hardware was doing. That habit made the frozen value jump out.

When I first wrote ts2phc-go I read my Go structs line by line against the UAPI headers in the kernel tree rather than trusting that they looked right. They didn’t.

The ClockCaps struct was 76 bytes: eight int32 fields plus an [11]int32 reserved area. But PTP_CLOCK_GETCAPS encodes an 80-byte copy in its ioctl number. The kernel doesn’t care what size your struct is. It copies the number of bytes the ioctl says, so on every startup the kernel was writing four bytes past the end of my struct. It hadn’t crashed yet. That’s the worst kind of bug, the one corrupting memory without a word.

ptp_caps.go
// PTP_CLOCK_GETCAPS encodes an 80-byte copy. This struct was 76.
// The kernel writes the size the ioctl says, not the size you allocated.
type ClockCaps struct {
	MaxAdj    int32
	NChannels int32
	NAlarm    int32
	NExtTs    int32
	NPerOut   int32
	PPS       int32 // was swapped with NPins below
	NPins     int32
	CrossTs   int32
	AdjPhase  int32
	MaxPhaseAdj int32 // was missing entirely
	Reserved  [11]int32
}

While I was in there I found two more. The pps and n_pins fields were swapped relative to the UAPI definition, so I was reading each as the other. And max_phase_adj was missing from my struct entirely, which is part of why the size was off.

Then a separate one. PTP_EXTTS_REQUEST2, the v2 external-timestamp request, was encoded with ioctl number 10. Number 10 is PTP_CLOCK_GETCAPS2. The v2 request is number 11. So every time the code asked for a v2 external-timestamp request it was issuing a getcaps, the kernel returned ENOTTY, and the code fell back to v1. It worked. It just lost PTP_STRICT_FLAGS every time and nobody noticed, because falling back to v1 is perfectly functional.

The bug I left in on purpose

I left that v2 ioctl-number bug, the one that made the strict request fall back to v1, unfixed on the production NIC.

Making the v2 request fire had no upside for this deployment. The v2 path enables a strict single-edge mode. This design doesn’t want that. It wants both edges plus a filter (more on that next), a deliberately different strategy. Fixing the bug would have, at best, changed nothing useful, and at worst changed the behavior of the NIC serving the pool in production, for no benefit. The fall-back to v1 was the behavior I wanted, arrived at by accident. Knowing which bugs not to fix is a skill too.

The both-edge filter, which is the actual point

The reason this design wants both edges is the NIC. The Intel i210 delivers both the rising and the falling edge of the PPS to the timestamp FIFO. A naive reader that only expects one edge per second gets two, interleaves them, and computes garbage. An earlier audit of the raw pulse stream had flagged “pulses” of 458 ms, which is the gap between a rising edge of one second and a falling edge of the previous one being misread as a single interval.

So the headline feature of ts2phc-go is a dynamic edge filter that handles both edges, locks onto the real second boundary, and rejects the noise. When it locks it logs what it sees:

edge filter lock
edge filter locked, pulse width 9.999994ms

A 9.999994 ms pulse width is the real PPS pulse off this receiver. Once the filter’s locked there are none of the 458 ms garbage intervals the naive audit had flagged, because the filter knows which edge is which and only disciplines against the boundary that matters.

The 37-second clue

One more measurement from early on. The first time the servo ran end to end, it reported an offset of 37000000084 ns. That’s 37 seconds and change. A servo that’s 37 seconds off isn’t a tuning problem, it’s a servo measuring two different time scales.

The PHC is disciplined to TAI by ptp4l. NMEA, from the GPS, reports UTC. TAI has been exactly 37 seconds ahead of UTC since 2017. The C ts2phc handles this by reading leap-seconds.list. My fix added the 37-second offset directly in the NMEA source, and the offset collapsed from 37 seconds to low tens of nanoseconds. The remaining 84 ns was the real offset, underneath a 37-second labeling error.

I matched other points of hygiene against the C version once I knew to look: draining the EXTTS FIFO on startup so the servo doesn’t begin by chewing on stale kernel-buffered timestamps, configuring the SDP pin correctly so the i210 routes the PPS where the code expects it, and getting the edge handling right rather than assuming one edge per second.

Frozen freq_ppb
17139.674132458233
byte-identical across a 3s gap
Recovery offset
+9 ns
downstream chrony relock, +/-19 ns
Re-lock time
~1 s
after the production swap

When a measurement becomes a memory

A frozen servo looks exactly like a working one unless you have an independent check.

The PHC couldn’t tell anyone. It had no way to distinguish “I’m disciplined and correct” from “I’m coasting on a value I computed at 2pm,” because nothing asked it the question that matters, which isn’t what time is it but how recently were you told. chrony asked the first question and got a confident answer. Nobody asked the second until I noticed a float that refused to change.

So now the loop’s failure to sample is loud instead of silent, and the thing I trust most is the byte-identical number, the sign that a measurement has stopped being a measurement.

The code is on GitHub.