The $20 stratum 1 that chrony kept benching
I built a GPS-disciplined NTP server on a $20 ESP32 with single-nanosecond internal timing. Then chrony refused to use it. The fix was not in the clock; it was in how I served the time.
How good a clock can you build for twenty dollars?
The obvious answer is “not very.” A real stratum 1 reference clock is a rubidium standard, an OCXO holdover box, or a sawtooth-corrected PTP grandmaster in a rack with a roof antenna. Those cost real money. An ESP32, a u-blox GPS module, and a WIZnet W5500 Ethernet chip cost about twenty dollars together. The conventional wisdom is that a microcontroller is fine for a “good enough” hobby clock and won’t stand next to serious hardware.
I didn’t like that answer, so I measured it. The cheap clock was, internally, one of the best clocks on my network. And it was useless anyway: I’d built an excellent clock and a mediocre NTP server, and those aren’t the same thing.
What “stratum 1” actually has to mean
A stratum 1 server is one that gets its time directly from a reference source rather than from another NTP server. In practice that means a GPS receiver. The GPS module spits out two things: an NMEA sentence over UART that says “it is currently 14:02:53 UTC,” and a pulse-per-second (PPS) line that goes high at the instant each second begins. The NMEA tells you which second it is; the PPS tells you when that second starts. You need both.
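As a rough sketch of how the two signals combine: the NMEA sentence supplies the label for the second, and the PPS edge supplies the counter value at which that second began. The `second_mark_t` type, the `utc_now_ns` helper, and the 32-bit counter width are all assumptions made for illustration, not names from the actual firmware.

```c
#include <stdint.h>

// Hypothetical record of the most recent second boundary.
typedef struct {
    uint64_t utc_sec;     // which second it is: parsed from the NMEA sentence
    uint32_t edge_ticks;  // when it started: local counter value at the PPS edge
} second_mark_t;

// Current UTC in nanoseconds, given the local counter now and its rate in Hz.
// Assumes less than one full counter wrap has elapsed since the edge.
static uint64_t utc_now_ns(const second_mark_t *m, uint32_t now_ticks,
                           uint32_t tick_hz) {
    uint32_t since_edge = now_ticks - m->edge_ticks;  // modulo 2^32
    uint64_t frac_ns = ((uint64_t)since_edge * 1000000000ull) / tick_hz;
    return m->utc_sec * 1000000000ull + frac_ns;
}
```

Without the NMEA half, the result is a precise fraction of an unknown second; without the PPS half, it is a known second with no precise start.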
The naive way to read PPS is to wire it to a GPIO pin and take an interrupt on the rising edge. But a GPIO interrupt on a general-purpose CPU isn’t punctual. The interrupt has to be dispatched, the handler scheduled, and other interrupts can be in flight ahead of it. On an ESP32 that costs one to ten µs of jitter, and jitter isn’t a fixed offset, so you can’t calibrate it out.
The ESP32 has a way around this that almost nobody uses for clocks. The MCPWM peripheral has a hardware capture unit. You point it at a GPIO, and when the edge arrives the hardware latches a free-running timer into a register, in silicon, with no software in the path. The CPU reads the latched value later. The edge is timestamped before the interrupt even exists.
// MCPWM capture: the hardware latches the timer on the PPS edge.
// The ISR runs later and just reads what silicon already recorded.
static bool IRAM_ATTR pps_capture_cb(mcpwm_cap_channel_handle_t ch,
                                     const mcpwm_capture_event_data_t *ed,
                                     void *user) {
    BaseType_t hp_task_woken = pdFALSE;
    // ed->cap_value was latched in hardware at the instant of the edge.
    // No dispatch jitter, no scheduling jitter. This is the whole trick.
    g_last_pps_ticks = ed->cap_value;
    g_pps_event = true;
    return hp_task_woken == pdTRUE;
}

The APB clock runs at 80 MHz, so each tick is 12.5 ns. That’s the capture resolution. Then comes what the servo did with clean edges to discipline against.
11 ns of RMS jitter on a 12.5 ns tick means the servo is averaging below the quantization of its own clock, which is what a well-behaved PLL should do. Zero rejected pulses over 94,650 means the PPS line was clean and the outlier rejection never fired. By every internal measurement, the clock was solid.
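To make the tick arithmetic concrete, here is a minimal sketch of turning two consecutive captures into a per-second interval error. It assumes the capture value is a 32-bit counter, which at 80 MHz wraps roughly every 53.7 seconds, so the subtraction has to survive a wrap. The function names are illustrative only.

```c
#include <stdint.h>

#define CAP_TICK_HZ 80000000u  // APB clock from the text: 80 MHz, 12.5 ns per tick

// Ticks elapsed between two captures. Unsigned subtraction is modulo 2^32,
// so the result stays correct across a single counter wrap.
static uint32_t cap_delta_ticks(uint32_t prev, uint32_t cur) {
    return cur - prev;
}

// Deviation of one PPS interval from the nominal 80,000,000 ticks, in ns.
// Positive means the local counter ran fast relative to GPS.
// Integer math: 12.5 ns per tick is 25/2, truncated toward zero.
static int64_t pps_interval_error_ns(uint32_t prev, uint32_t cur) {
    int64_t err_ticks = (int64_t)cap_delta_ticks(prev, cur) - (int64_t)CAP_TICK_HZ;
    return (err_ticks * 25) / 2;
}
```

A stream of these errors is the kind of input a servo would filter and average; the servo itself is not shown here.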
Then I asked chrony what it thought.
The benching
I had three GPS-disciplined boxes on the bench, each with its own receiver, deliberately not synced to each other. I wanted an independent chrony instance to watch all three and tell me which one it trusted. One was an Intel i210 NIC doing hardware PTP timestamping off a u-blox NEO-M9N. One was a BeagleBone with a GPS PPS feed. The third was my $20 ESP32.
chrony marks each source with a symbol. ^* is the selected source, the one it’s actually steering the clock to. ^+ is a good candidate it would fall back to. ^- means benched: a source it has measured, doesn’t trust enough to use, and has set aside.
My clock came up ^-. Benched. The expensive i210 grandmaster was ^*.
The conventional wisdom says a twenty dollar microcontroller isn’t a grandmaster. But the numbers pointed somewhere else. The monitoring host sat on the ESP32’s own segment, 192.168.69.0/24, while the grandmaster and the other reference clock were a routed hop away on 192.168.0.0/24. From where chrony was watching, the ESP32 was the closest clock on the network and should have had the shortest, most stable round trip of the three. Instead chrony measured its round trip at 1.88 ms against 0.19 ms for the others: roughly ten times slower than peers a subnet further away.
The tell
A topologically closer source should have a shorter round trip. When the closest clock measures the worst delay, that’s not noise you average away. It’s a defect, in the part of the system you control.
The problem wasn’t the GPS or the clock. It was the NTP server, specifically the timestamps it put into packets.
How NTP actually computes who is right
Every NTP exchange has four timestamps. The client stamps when it sends the request (t1). The server stamps when it receives that request (t2) and when it sends its reply (t3). The client stamps when the reply arrives (t4). From those four numbers, the client computes two things:
offset = ((t2 - t1) + (t3 - t4)) / 2
delay  = (t4 - t1) - (t3 - t2)

The whole thing rests on t2 and t3 being honest about when the packet crossed the wire. If your server stamps t2 a few hundred µs after the packet arrived, and stamps t3 well before it leaves, the math sees phantom delay that was never on the network. chrony can’t tell real network latency from a server lying about its own timestamps. It sees a source with two milliseconds of slop and benches it.
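The two formulas are small enough to run by hand. The sketch below implements them exactly as written above; the `ntp_exchange_t` type and function names are made up for illustration, and timestamps are plain nanoseconds on a shared timeline rather than real NTP 64-bit values.

```c
#include <stdint.h>

// Four NTP timestamps in nanoseconds on a common illustrative timeline.
typedef struct {
    int64_t t1;  // client transmit
    int64_t t2;  // server receive
    int64_t t3;  // server transmit
    int64_t t4;  // client receive
} ntp_exchange_t;

// offset = ((t2 - t1) + (t3 - t4)) / 2
static int64_t ntp_offset_ns(const ntp_exchange_t *x) {
    return ((x->t2 - x->t1) + (x->t3 - x->t4)) / 2;
}

// delay = (t4 - t1) - (t3 - t2)
static int64_t ntp_delay_ns(const ntp_exchange_t *x) {
    return (x->t4 - x->t1) - (x->t3 - x->t2);
}
```

Take an illustrative exchange with perfectly aligned clocks, 50 µs of real latency each way, and 100 µs of server processing: t1 = 0, t2 = 50,000, t3 = 150,000, t4 = 200,000 ns gives an offset of 0 and a delay of 100,000 ns. Now stamp t2 300 µs late (an assumed figure) and t3 636 µs before the reply actually leaves: t2 = 350,000, t3 = 400,000, t4 = 1,086,000. The same functions report 1,036,000 ns of delay and an offset of -168,000 ns, even though the network latency is unchanged.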
So I stopped trusting my own timestamps and measured where they came from.
Receive: stamp it in hardware, again
The same move as PPS, applied to packet arrival. The original code stamped t2 in the UDP receive path, in software, after the network stack had handed the packet up. The W5500 has an INTn line that asserts the moment a packet lands in its buffer. I wired that to a GPIO, took the interrupt, and stamped t2 there, as early as I could physically get.
// INTn from the W5500 asserts when a packet hits the socket buffer.
// Stamp the receive instant here, not up in the UDP read path.
static void IRAM_ATTR w5500_int_isr(void *arg) {
    uint64_t now = now_ns_disciplined();
    g_rx_hw_ts = now;       // becomes t2
    g_rx_irq_count++;       // 1:1 with served requests, for auditing
    xSemaphoreGiveFromISR(g_rx_sem, NULL);
}

I added a counter, ntp_rx_irq_total, incremented on every interrupt, to prove it fired once per served request rather than assume it. Receive jitter dropped from a sigma of 26 µs to 2.6 µs, making the ESP32’s receive timestamp the tightest of all three clocks from that vantage, grandmaster included.
The two things blocking between t2 and t3
Receive was honest now. The rtt was still bad, so time was being lost between receiving the request and sending the reply. I instrumented the gap and found two stalls.
The first was an ARP prime. Before sending a reply, the code called a helper that, on a cold ARP cache, busy-waited one to two hundred milliseconds resolving the client’s MAC. The second was subtler: the main loop called gps->loop() to service the UART at the top of every iteration, and that call could block up to 100 ms draining the GPS, right before the NTP poll. A request could wait on the GPS before the server ever looked at it.
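One common way to keep a UART drain from starving the NTP poll is to give it a hard byte budget per loop iteration and never wait for data. This is a generic sketch of that pattern, not necessarily how this server removed the stall; `try_read_fn`, `gps_service_bounded`, and the `fake_uart_t` host-side stand-in are all hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical non-blocking byte source: returns true and stores one byte
// if data is already buffered, false immediately otherwise.
typedef bool (*try_read_fn)(void *ctx, uint8_t *out);

// Service the GPS for at most max_bytes, then return so the caller can
// poll NTP. Returns the number of bytes consumed. Never waits for data.
static size_t gps_service_bounded(try_read_fn try_read, void *ctx,
                                  uint8_t *dst, size_t max_bytes) {
    size_t n = 0;
    while (n < max_bytes && try_read(ctx, &dst[n])) {
        n++;
    }
    return n;
}

// Host-side stand-in for a UART FIFO, only so the sketch can be
// exercised off-target.
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos;
} fake_uart_t;

static bool fake_try_read(void *ctx, uint8_t *out) {
    fake_uart_t *u = (fake_uart_t *)ctx;
    if (u->pos >= u->len) {
        return false;
    }
    *out = u->data[u->pos++];
    return true;
}
```

The worst-case time per call is then bounded by the budget rather than by how much the GPS has queued, and leftover bytes are picked up on the next iteration.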
Neither is a clock problem. Both are a “you wrote a blocking server and then measured it serving the time it was blocking on” problem.
Transmit: you cannot stamp t3 after you send
The last problem was hard to fix correctly. You want t3 to be the instant the reply hits the wire. But t3 is written into the packet, so you have to know the transmit time before you transmit. You’re writing a timestamp for an event that hasn’t happened yet.
I measured the send. w5k_sendto was blocking ~636 µs pushing the packet to the W5500 over SPI. I’d predicted ~560 µs from the SPI clock and packet size; measuring 636 confirmed I understood the path. So t3 was being written 636 µs before the packet left, every time, and that error went straight into every client’s offset calculation.
The fix is to predict the send time and pre-correct t3 by it, using an EWMA of the measured send duration so it adapts to conditions instead of being a magic constant.
// t3 is written INTO the packet, so it must be stamped before egress.
// Pre-correct by the predicted on-wire moment using an EWMA of measured
// SPI send time, then split the write so the loop never blocks on it.
uint64_t predicted_tx = now_ns_disciplined() + g_send_ewma_ns;
pkt->transmit_ts = ns_to_ntp64(predicted_tx);
uint64_t t_before = now_ns_disciplined();
w5500_sendto_nonblocking(sock, pkt, sizeof(*pkt), client);
uint64_t measured = now_ns_disciplined() - t_before;
// Adapt the prediction toward what actually happened. 1/8 gain.
g_send_ewma_ns += ((int64_t)measured - (int64_t)g_send_ewma_ns) >> 3;

The non-blocking split write shaved roughly 1.4 milliseconds off each request. The EWMA correction took the residual transmit error out of the offset.
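To see how quickly a 1/8-gain EWMA adapts, here is a small standalone sketch of the same update rule. It uses division by 8 instead of a right shift, because right-shifting a negative signed value is implementation-defined in C; the two differ only in how they round negative steps. The helper names are illustrative.

```c
#include <stdint.h>

// One EWMA step with gain 1/8: move an eighth of the way toward the sample.
static int64_t ewma_step_ns(int64_t ewma_ns, int64_t measured_ns) {
    return ewma_ns + (measured_ns - ewma_ns) / 8;
}

// Feed the same measurement n times and return the resulting estimate.
static int64_t ewma_after(int64_t start_ns, int64_t measured_ns, int n) {
    int64_t e = start_ns;
    for (int i = 0; i < n; i++) {
        e = ewma_step_ns(e, measured_ns);
    }
    return e;
}
```

Seeded with the 560 µs prediction and fed the 636 µs measurement, the first update lands on 569.5 µs. With integer division the estimate then settles a few nanoseconds short of the measured value, since steps smaller than 8 ns round to zero.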
The result
With receive stamped in hardware, the two stalls removed, and transmit pre-corrected and non-blocking, I asked chrony again.
The twenty dollar ESP32 went from benched to selected: the lowest root distance of the three, sub-microsecond jitter, a clean ^*.
Then I checked whether that was real or a trick of where I was standing. The judge sat on the ESP32’s own segment, with the grandmaster a routed hop away on the other subnet. From that judge the ESP32’s round trip measures about 2 µs and the grandmaster’s about 260 µs, purely because one is local and the other crosses a router. That asymmetry flatters the ESP32, so “selected ahead of the grandmaster” is partly about topology, not clocks.

So I measured again from the grandmaster’s own segment, where the roles invert. There chrony’s verdict flips: it selects the grandmaster and leaves the ESP32 out of its combine entirely as ^-, with a wider root distance and roughly three times the jitter from that seat. Line the two GPS clocks up directly, where the measuring host’s own clock error cancels, and they agree to within about 50 to 90 µs. Even that spread is the routing asymmetry shifting with the vantage, not the clocks disagreeing.
So I’m keeping the claim narrow. The ESP32 was no longer benched. To a client on its own segment, the twenty dollar box was the tightest and most trusted source available, ahead of a grandmaster that cost many times more and sat a subnet away. That is not a claim to be more accurate than a grandmaster; measured from the grandmaster’s segment, it isn’t. It is a claim to be in that class from where the client sits, which is all I needed.
This still isn’t the best clock in the world. The tier above it is real: GPSDO and OCXO holdover boxes that keep time through GPS outages, and sawtooth-corrected grandmasters with hardware an ESP32 doesn’t have. It’s best in its own bracket. Among microcontroller-based stratum 1 servers, the combination of MCPWM capture for PPS, hardware receive timestamping off the W5500 INTn, and a self-calibrating transmit correction is, as far as I can measure, best in class. And the hardware timestamping that makes it work costs nothing extra. It was in the silicon all along.
The part I will not compromise on
A stratum 1 server has to be honest about when it doesn’t know the time. If the GPS loses lock, the worst behavior is to keep serving stratum 1 with stale time, because every client downstream will believe it. So the server checks lock on every request: PPS pulses arriving and an NMEA fix less than 1.5 seconds old, or it doesn’t claim to be synced. The moment it isn’t certain, it advertises stratum 16 with the leap indicator set to alarm, which is NTP for “do not use me.” A clock that lies once is worse than no clock at all.
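The gating rule can be sketched in a few lines. The 1.5 second NMEA limit and the stratum 16 plus alarm response come from the description above; the PPS age limit, the millisecond tick source, and all names here are assumptions for illustration. The first byte of an NTP header packs the leap indicator (2 bits), version (3 bits), and mode (3 bits).

```c
#include <stdbool.h>
#include <stdint.h>

#define NTP_LI_NONE      0u     // no leap warning
#define NTP_LI_ALARM     3u     // clock not synchronized
#define NTP_VERSION      4u
#define NTP_MODE_SERVER  4u
#define NMEA_MAX_AGE_MS  1500u  // from the text: fix must be under 1.5 s old
#define PPS_MAX_AGE_MS   1500u  // assumption: same window for "pulses arriving"

typedef struct {
    uint8_t li_vn_mode;  // first byte of the NTP header
    uint8_t stratum;     // second byte of the NTP header
} ntp_health_t;

// True only if both a recent PPS edge and a recent NMEA fix have been seen.
// Unsigned subtraction keeps the ages correct across a tick-counter wrap.
static bool gps_locked(uint32_t now_ms, uint32_t last_pps_ms,
                       uint32_t last_nmea_ms) {
    return (now_ms - last_pps_ms) < PPS_MAX_AGE_MS &&
           (now_ms - last_nmea_ms) < NMEA_MAX_AGE_MS;
}

// Header fields to advertise: stratum 1 when locked, otherwise
// stratum 16 with the leap indicator set to alarm.
static ntp_health_t ntp_health(bool locked) {
    ntp_health_t h;
    uint8_t li = locked ? NTP_LI_NONE : NTP_LI_ALARM;
    h.li_vn_mode = (uint8_t)((li << 6) | (NTP_VERSION << 3) | NTP_MODE_SERVER);
    h.stratum = locked ? 1 : 16;
    return h;
}
```

Evaluating this per request, rather than on a timer, means no reply can carry a stratum 1 claim that was only true a few seconds earlier.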
The expensive part was never the hardware. It was refusing to believe my own clock until the independent monitor, watching from across the network, agreed with it.
The code is on GitHub.