As of block #2630942, one PoT slot is 1.006823858 seconds on average:
| block number | slot     | timestamp (approximate, ms) |
|--------------|----------|-----------------------------|
| 1            | 150      | 1730910108773               |
| 2630942      | 15866834 | 1746885065456               |
The cumulative drift during just over 6 months is ~30 hours.
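For reference, here is a small Rust sketch that reproduces these numbers from the two data points in the table (nothing is assumed beyond the values above):

```rust
fn main() {
    // Data points from the table above (timestamps in ms).
    let (slot_a, ts_a): (u64, u64) = (150, 1_730_910_108_773);
    let (slot_b, ts_b): (u64, u64) = (15_866_834, 1_746_885_065_456);

    let elapsed_ms = (ts_b - ts_a) as f64;
    let slots = (slot_b - slot_a) as f64;

    // Average wall-clock time per PoT slot.
    let avg_slot_s = elapsed_ms / slots / 1000.0;
    // Drift relative to the target of exactly 1 second per slot.
    let drift_hours = (elapsed_ms / 1000.0 - slots) / 3600.0;

    println!("average slot: ~{:.5} s", avg_slot_s); // ~1.00682 s
    println!("cumulative drift: ~{:.1} hours", drift_hours); // ~30 hours
}
```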
All of this likely indicates that Timekeepers on the network are not running at the expected 6.2 GHz in practice, or are not running at 6.2 GHz reliably/consistently.
Does this account for interruptions, hangs, and similar issues?
(For example, a slot that is late because it was blocked by another task/thread/process/the kernel, or all timekeepers being down, or even drift in the electricity supply frequency or voltage?)
This is not really variance; it is a permanent drift measured as end-to-end time.
If there had been a computer running perfectly at 6.2 GHz for all that time since mainnet started, it would have produced the same number of slots as the difference between the timestamps of the corresponding blocks (in seconds).
The timekeeper loop is designed to be as tight as possible and almost certainly isn’t anywhere near a 0.6% delay. It does have a few instructions outside of AES itself, but considering that AES is a specialized module inside the CPU, it is very likely that the extra computation is almost “free” in that context. It should be close to a perfect compute-bound infinite loop that works almost entirely with registers and is pinned to a fixed CPU core.
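Not the actual implementation, but a minimal sketch of what such a loop looks like on x86-64 with AES-NI; the seed, key, and iteration count here are made-up placeholders, and the real code (key schedule, checkpointing, slot bookkeeping) lives in the subspace repo:

```rust
// Illustrative sketch only: a latency-bound chain of hardware AES rounds with
// essentially nothing else in the hot loop.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "aes")]
unsafe fn pot_chain_sketch(
    mut state: core::arch::x86_64::__m128i,
    key: core::arch::x86_64::__m128i,
    iterations: u64,
) -> core::arch::x86_64::__m128i {
    use core::arch::x86_64::_mm_aesenc_si128;
    for _ in 0..iterations {
        // Each round's input is the previous round's output, so the chain is
        // strictly sequential; only trivial loop bookkeeping runs alongside it.
        state = _mm_aesenc_si128(state, key);
    }
    state
}

#[cfg(target_arch = "x86_64")]
fn main() {
    use core::arch::x86_64::_mm_set1_epi8;
    if is_x86_feature_detected!("aes") {
        // Placeholder seed/key and a made-up iteration count; the real values
        // come from chain state and the PoT parameters.
        let out = unsafe { pot_chain_sketch(_mm_set1_epi8(1), _mm_set1_epi8(2), 1_000_000) };
        std::hint::black_box(out);
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```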
If one timekeeper somewhere is offline but another is running at 6.2 GHz, it doesn’t really make a lot of difference. A restart will introduce some permanent drift, but it should be on the order of less than two seconds relative to the fastest timekeeper out there (and will be reduced further if someone implements Optimistic Timekeeper reorg · Issue #1977 · autonomys/subspace · GitHub). It can’t explain a 30-hour difference, not even close (30 hours / 2 seconds = 54,000 restarts of the fastest timekeeper).
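The same bound, spelled out (assuming, per the above, at most ~2 seconds of drift per restart):

```rust
fn main() {
    // ~30 hours of observed drift (from the table above), in seconds.
    let observed_drift_s = 30.0 * 3600.0;
    // Generous upper bound on drift introduced by one restart of the fastest timekeeper.
    let max_drift_per_restart_s = 2.0;
    let restarts_needed = observed_drift_s / max_drift_per_restart_s;
    println!("restarts needed to explain the drift: {}", restarts_needed); // 54000
}
```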
Someone should probably do a deeper analysis of blocks and see at which times the drift was larger than usual. I recall our timekeepers spent quite some time offline, which probably explains some of this.
I would have expected the drift to be several orders of magnitude lower; 30 hours is a lot.
Another explanation for this might be PoT reorgs. They shouldn’t happen with a good networking situation, but if for some reason a timekeeper doesn’t receive a block within 15 seconds, it will keep building a PoT chain that is no longer valid and will have to reset back to the block’s slot + 15 slots once it receives it; see the entropy injection section of the protocol specification.
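Roughly, the reset a late block forces looks like this (sketch only; the real entropy-injection rules are in the protocol specification, and the 15-slot delay and the example numbers are taken from or made up for the description above):

```rust
/// Sketch: how much work the timekeeper has to redo when a block arrives late.
/// `block_slot` is the slot the late block was produced in; any PoT output built
/// beyond `block_slot + 15` was computed without that block's entropy injection
/// and has to be thrown away and recomputed.
fn wasted_slots_on_late_block(current_slot: u64, block_slot: u64) -> u64 {
    let entropy_injection_delay_slots = 15;
    let reset_to = block_slot + entropy_injection_delay_slots;
    current_slot.saturating_sub(reset_to)
}

fn main() {
    // Hypothetical numbers: the block for slot 1_000 arrives 40 slots late.
    println!("{}", wasted_slots_on_late_block(1_040, 1_000)); // 25 slots to redo
}
```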
Just an FYI, my timekeeper has been at 6.1 GHz for about 6 weeks now. Before that it was at 6.0 GHz for another two weeks. My downtime has been less than 10 minutes during that time.
If we can run timekeepers at 6.1 GHz all the time, we should update the number of PoT iterations to match, or else our slots will remain slightly longer than 1 second, as the rough numbers below illustrate.
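A quick sanity check on the rescaling (the iteration count here is a made-up placeholder, not the mainnet value):

```rust
fn main() {
    // Hypothetical per-slot iteration count calibrated for 6.2 GHz; the real
    // mainnet value lives in the chain parameters, this is just a placeholder.
    let iterations_for_6_2_ghz: u64 = 200_000_000;

    // Slot length if the CPU actually sustains only 6.1 GHz with the old count:
    let slot_at_6_1 = 6.2 / 6.1; // ~1.0164 s, i.e. slightly longer than 1 second

    // Iteration count that would give 1-second slots at 6.1 GHz:
    let iterations_for_6_1_ghz = (iterations_for_6_2_ghz as f64 * 6.1 / 6.2).round() as u64;

    println!("slot length at 6.1 GHz with the old count: {:.4} s", slot_at_6_1);
    println!("rescaled iteration count for 6.1 GHz: {}", iterations_for_6_1_ghz);
}
```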
The code for updating PoT iterations was tricky, so it is best to try the change on a testnet first. I have done a bunch of testing and IIRC we also tried it on one of the testnets, but it is one of those rare and thus less-tested procedures.
This seems like something worth benchmarking before we make any changes. It is likely that the extra instructions are cheap due to pipelining, but other factors like pipeline stalls, interrupts, or temperature or voltage sags could be causing slightly slower effective clocks than expected.
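A rough way to look for that kind of jitter, as a sketch: time a strictly sequential dependency chain (a simple rotate/xor stand-in here, not the real AES loop) over several runs and watch for run-to-run variation, which would point at interrupts, throttling or clock instability rather than the loop body itself:

```rust
use std::time::Instant;

fn main() {
    // Stand-in for the AES chain: each step depends on the previous one, so the
    // measurement is latency-bound like the real PoT loop. This is NOT the real
    // PoT function, just a probe for stalls and throttling.
    const ITERS: u64 = 1_000_000_000;
    let mut x = 0x9e37_79b9_7f4a_7c15u64;
    for run in 0..5 {
        let start = Instant::now();
        for i in 0..ITERS {
            x = x.rotate_left(7) ^ i;
        }
        let elapsed = start.elapsed();
        let ns_per_iter = elapsed.as_nanos() as f64 / ITERS as f64;
        println!("run {}: {:.3} ns/iteration", run, ns_per_iter);
    }
    std::hint::black_box(x);
}
```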
We expect the CPU to run at 6.2 GHz; that is what we calculated the number of PoT iterations for. It either runs at that speed or it doesn’t. Voltage doesn’t really matter here, it just needs to be stable enough to work (see Timekeeper RPC and an app · Issue #2326 · autonomys/subspace · GitHub for a plan to make use of unstable timekeepers in order to crank the frequency even higher).
Yes, the extra instructions are almost free due to the superscalar nature of the CPU, with multiple instances of different kinds of execution units and fairly long CPU pipelines.
We pin the timekeeper thread to a single CPU core, so if everything works as expected, there should be virtually nothing else running on that core most of the time. The kernel should be smart enough to use other CPU cores for whatever it needs to do, and as long as that is the case, there should be no interruptions or pipeline stalls, because we run a very small piece of perfectly predictable code that uses a few registers almost exclusively. Only once a second does it do a quick check, and even less frequently it does entropy injection.
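For reference, pinning a thread to one core looks roughly like this; a sketch using the `core_affinity` crate rather than the node’s actual code, with an arbitrary choice of core:

```rust
// Sketch only; the real node has its own thread/affinity handling.
// Cargo.toml: core_affinity = "0.8"
fn main() {
    let core_ids = core_affinity::get_core_ids().expect("failed to enumerate CPU cores");
    // Arbitrary choice for the sketch: pin to the last enumerated core.
    let core = *core_ids.last().expect("no cores found");

    let handle = std::thread::spawn(move || {
        if !core_affinity::set_for_current(core) {
            eprintln!("failed to pin thread to core {:?}", core);
        }
        // The tight PoT loop would run here, touching almost nothing but registers,
        // so the kernel can schedule everything else on the remaining cores.
    });
    handle.join().unwrap();
}
```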