The fastest Linux timestamps
hmpcabral.com | 61 points by hmpc a day ago
If you do this, please be aware that there is absolutely no guarantee that you will not observe time going backwards. You probably will not have one thread ask for the time twice in a row and get results that are out of order, but you can have thread 1 ask for the time and do a store-release and then have thread 2 do a load-acquire, observe thread 1’s write, and ask for the time, and thread 2’s time may be earlier than thread 1’s. This is because RDTSC by itself does not respect x86’s memory order — it does not act like a load.
source: I wrote a bunch of this code and I’ve tested it fairly extensively.
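To make the failure mode concrete, here's a minimal sketch of the pattern described above (hypothetical, using C11 atomics and the compiler's __rdtsc intrinsic with plain RDTSC and no fence):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <x86intrin.h>

    static _Atomic uint64_t published;  /* 0 = not yet published */

    /* Thread 1: read the TSC, then publish the value with a store-release. */
    void thread1(void)
    {
        uint64_t t1 = __rdtsc();        /* plain RDTSC, no fence */
        atomic_store_explicit(&published, t1, memory_order_release);
    }

    /* Thread 2: wait until the value is visible, then read the TSC itself.
       Because plain RDTSC is not ordered like a load, it can execute before
       the load-acquire, so t2 may come out earlier than t1. */
    void thread2(void)
    {
        uint64_t t1;
        while ((t1 = atomic_load_explicit(&published, memory_order_acquire)) == 0)
            ;  /* spin until thread 1 publishes */
        uint64_t t2 = __rdtsc();
        (void)t1; (void)t2;             /* t2 >= t1 is NOT guaranteed */
    }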
This is explicitly called out in the post as well as the Intel instruction manual. Every codebase I've ever seen that reads the TSC either issues an LFENCE or uses RDTSCP.
In my benchmarks RDTSCP has a slight advantage, despite its higher latency on paper, because it doesn't fully serialise the instruction stream (later instructions can start executing, unlike with LFENCE). Whether the ECX clobber cancels out that advantage will depend on the situation.
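For reference, the two fenced patterns look roughly like this (a sketch using the GCC/Clang intrinsics from <x86intrin.h>; on some older AMD parts LFENCE is not serialising by default, which is where the vendor-dependent caveat comes from):

    #include <stdint.h>
    #include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence */

    /* LFENCE + RDTSC: the fence keeps the TSC read (and everything after
       the fence) from starting until earlier instructions have completed. */
    static inline uint64_t tsc_read_lfence(void)
    {
        _mm_lfence();
        return __rdtsc();
    }

    /* RDTSCP: waits for all prior instructions to execute, but later
       instructions may begin before the read completes. Writes
       IA32_TSC_AUX into ECX, hence the extra 'aux' output. */
    static inline uint64_t tsc_read_rdtscp(void)
    {
        unsigned int aux;
        return __rdtscp(&aux);
    }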
Do you know if the Rust quanta / fastant crates have this problem? I feel like they don't, but I haven't actually dug into the implementation. The reason I think not is that, at least in the case of quanta, the clock value can be broadcast from a single clock-maintainer thread. But even when it's using plain rdtsc, the docs say it upholds monotonicity barring kernel/virtualization bugs:
https://docs.rs/quanta/latest/quanta/struct.Instant.html
So I think it’s possible to do this correctly?
If it’s calling clock_gettime, it should be fine. If it uses RDTSCP, it should be fine (assuming your system actually has synchronized TSCs, and there is a long history of this failing). If it uses the sadly vendor-dependent magic incantation involving LFENCE or MFENCE, it should be fine. If it does plain RDTSC, it may not be fine.
(I have no special insight into what Intel and AMD CPUs do under the hood, but my best guess has always been that they are implemented by ucode that has no dependencies on anything in the register file except whatever might be internal to the ucode for the instruction itself. And the dispatch logic will cheerfully schedule it as such, including moving it before loads that precede it in the instruction stream. Since RDTSC itself isn't a load, the magic that makes all loads be acquires does not apply. RDTSCP is probably a heavily pessimized version that waits for earlier loads to actually happen. The really nice hypothetical version where RDTSC "loads" a virtual loadable register in the coherency domain and can be speculated just like a real load is probably too complex to be worth implementing.)
You can still do better by not doing any TSC to nanosecond conversion. Instead, you inject clock adjustments into your log stream allowing you to convert when you decode the logs rather than when you generate the logs.
Using the final version, you would just make the cache refresh() function emit the clock adjustment log entry instead of actually caching anything. Then, any later log entry's TSC would implicitly be relative to that clock adjustment entry when you decode the log. Worst case, you would need to persist every clock adjustment entry even when sampling, but that would still only be on the order of a few KB/s, and you could still drop adjustment entries with no non-adjustment entries between them.
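As a rough sketch (the struct name and fields here are made up, not from the post): refresh() appends something like a clock_adj record to the log, ordinary entries carry only a raw TSC, and the decoder does the TSC-to-nanosecond conversion offline:

    #include <stdint.h>

    /* Hypothetical clock adjustment record: pins a TSC value to wall time
       and carries the current tick-to-nanosecond ratio. */
    struct clock_adj { uint64_t tsc; uint64_t wall_ns; double ns_per_tick; };

    /* Decode-time conversion: each entry's raw TSC is interpreted against
       the most recent clock_adj record preceding it in the log stream. */
    static uint64_t decode_entry_ns(const struct clock_adj *adj, uint64_t entry_tsc)
    {
        return adj->wall_ns
             + (uint64_t)((double)(entry_tsc - adj->tsc) * adj->ns_per_tick);
    }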
Yep, I should probably have mentioned this option for completeness. In practice, however, it only saves you a few cycles/nanos and adds more complexity and failure points downstream.
Related reading is absl's linear approximation of gettime from cycle counters, which I thought was a neat trick: https://github.com/abseil/abseil-cpp/blob/351086314d46e73d43...
Thanks for sharing this! I hadn't seen it before, I'll definitely add a mention in the post.
This approach is a smart way to get the same precision as the kernel without tying it to the vDSO implementation as in the post, so presumably someone at Google was worried about very similar issues. I also considered an implicit automatic refresh, but ultimately for my purposes controlling when to take the hit was more important than the slight increase in API complexity introduced by an explicit call.
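Roughly, the trick is to fit a linear model from cycles to nanoseconds and only occasionally re-anchor it against the real clock; a minimal sketch (not absl's actual code, names are illustrative):

    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>

    /* Hypothetical calibration state: a linear cycles-to-nanoseconds model. */
    struct calib { uint64_t base_cycles; uint64_t base_ns; double ns_per_cycle; };

    /* Slow path: re-anchor against CLOCK_MONOTONIC and re-estimate the slope
       from the drift since the previous anchor (needs a rough initial slope). */
    static void refresh(struct calib *c)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        uint64_t now_ns = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
        uint64_t now_cycles = __rdtsc();
        if (c->base_cycles != 0 && now_cycles > c->base_cycles)
            c->ns_per_cycle = (double)(now_ns - c->base_ns)
                            / (double)(now_cycles - c->base_cycles);
        c->base_cycles = now_cycles;
        c->base_ns = now_ns;
    }

    /* Fast path: one TSC read plus a multiply-add, no syscall or vDSO call. */
    static uint64_t approx_now_ns(const struct calib *c)
    {
        return c->base_ns
             + (uint64_t)((double)(__rdtsc() - c->base_cycles) * c->ns_per_cycle);
    }

A real implementation also has to make the shared calibration state safe for concurrent readers; the sketch above ignores that entirely.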
If you can get away from measuring in the seconds domain, just sticking to timecounter values really simplifies things. Embed a single reference timestamp pair (seconds domain <-> ticks) in your trace and do post-facto conversion in your analysis tool.
Great read. A question: what is the status of this problem on other architectures such as ARM and RISC-V? Would the analysis and solution be the same? E.g., does ARM have an invariant TSC equivalent?
RISC-V has mtime. It is somewhat implementation-defined, but it should be a single hardware timer shared by all harts. The Zicntr extension defines the rdtime pseudo-instruction to access it from userspace.
AArch64 has the cntvct_el0 system register, which can be read from userspace.
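For illustration, both counters can be read from C with a line of inline asm (a minimal sketch; frequency handling and memory-ordering details are left out):

    #include <stdint.h>

    #if defined(__aarch64__)
    /* AArch64: cntvct_el0 is the virtual counter; cntfrq_el0 gives its
       frequency. An ISB is commonly issued first so the read is not
       speculated ahead of earlier instructions. */
    static inline uint64_t counter_read(void)
    {
        uint64_t v;
        __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v) :: "memory");
        return v;
    }
    #elif defined(__riscv) && (__riscv_xlen == 64)
    /* RISC-V (RV64): rdtime reads the time CSR defined by Zicntr,
       a view of the shared mtime counter. */
    static inline uint64_t counter_read(void)
    {
        uint64_t v;
        __asm__ volatile("rdtime %0" : "=r"(v));
        return v;
    }
    #endif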
I can beat this by not trying to wrap a trace span around something that only takes 100ns.
If the thing of interest just runs on the CPU briefly, tracing is not what you want. You want a profiler that only runs when you're looking at it. Distributed tracing is for things that can go wrong and take uncertain amounts of time.
You might've misunderstood the requirements. The time scale was 1-10 micros per component; 100 ns was the overhead per span we were aiming for.
In this case distributed tracing absolutely was the right choice. These were not simple computational tasks. The components were highly stateful and interconnected both on- and cross-host. Between this and the timescale, as well as the volume of events and the dollar-value impact of each potential failure (of which there were many), we needed real-time analysis capabilities, not a profiler.
I guess my skepticism about the application colored my reading of the rest of it. If it had only said you needed it to be faster, that would have been easier for a simpleton like me.