Discussion
Optimizing a Lock-Free Ring Buffer
kristianp: This is in C++, other languages have different atomic primitives.
jitl: Really? Pretty much all atomics i’ve used have load, store of various integer sizes. I wrote a ring buffer in Go that’s very similar to the final design here using similar atomics.https://pkg.go.dev/sync/atomic#Int64
wat10000: They generally map directly to concepts in the CPU architecture. On many architectures, load/store instructions are already guaranteed to be atomic as long as the address is properly aligned, so atomic load/store is just a load/store. Non-relaxed ordering may emit a variant load/store instruction or a separate barrier instruction. Compare-exchange will usually emit a compare and swap, or load-linked/store-conditional sequence. Things like atomic add/subtract often map to single instructions, or might be implemented as a compare-exchange in a loop.The exact syntax and naming will of course differ, but any language that exposes low-level atomics at all is going to provide a pretty similar set of operations.
amluto: Huh? Other languages that compile to machine code and offer control over struct layout and access to the machine’s atomic will work the same way.Sure, C++ has a particular way of describing atomics in a cross-platform way, but the actual hardware operations are not specific to the language.
dalvrosa: Yeah, different languages will have different syntaxes and ways of using atomicsBut at the hardware level all are kindof the same
JonChesterfield: It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.Also doesn't have fences on the store, has extra branches that shouldn't be there, and is written in really stylistically weird c++.Maybe an llm that likes a different language more, copying a broken implementation off github? Mostly commenting because the initial replies are "best" and "lol", though I sympathise with one of those.
dalvrosa: Sorry, but that's not actually true. There are no data races, the atomics prevent that (note that there are only one consumer and one producer)Regarding the style, it follows the "almost always auto" idea from Herb Sutter
dalvrosa: Nice one, thanks for sharing. Do you wanna share the ring buffer code itself?
sanufar: Super fun, def gonna try this on my own time later
dalvrosa: Feel free to share your findings
dalvrosa: From 12M ops/s to 305 M ops/s on a lock-free ring buffer.In this post, I walk you step by step through implementing a single-producer single-consumer queue from scratch.This pattern is widely used to share data between threads in the lowest-latency environments.
loeg: Your blog footer mentions that code samples are GPL unless otherwise noted. You don't seem to note otherwise in the article, so -- do you consider these snippets GPL licensed?
loeg: > It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.Are we reading the same code? The stores are clearly after value accesses.> Also doesn't have fences on the store?? It uses acquire/release semantics seemingly correctly. Explicit fences are not required.
erickpintor: Great post!Would you mind expanding on the correctness guarantees enforced by the atomic semantics used? Are they ensuring two threads can't push to the same slot nor pop the same value from the ring? These type of atomic coordination usually comes from CAS or atomic increment calls, which I'm not seeing, thus I'm interested in hearing your take on it.
secondcoming: If you enforce that the buffer size is a power of 2 you just use a mask to do the if (next_head == buffer.size()) next_head = 0; part
dalvrosa: Indeed that's true. That extra constraint enables further optimizationIt's mentioned in the post, but worth reiterating!
blacklion: JVM has almost the same (C++ was modeled after JVM ones, with some subtle fixes).
kevincox: Random idea: If you have a known sentinel value for empty could you avoid the reader needing to read the writer's index? Just try to read, if it is empty the queue is empty, otherwise take the item and put an empty value there. Similarly for writing you can check the value, if it isn't empty the queue is full.It seems that in this case as you get contention the faster end will slow down (as it is consuming what the other end just read) and this will naturally create a small buffer and run at good speeds.The hard part is probably that sentinel and ensuring that it can be set/cleared atomically. On Rust you can do `Option<T>` to get a sentinel for any type (and it very often doesn't take any space) but I don't think there is an API to atomically set/clear that flag. (Technically I think this is always possible because the sentinel that Option picks will always be small even if the T is very large, but I don't think there is an API for this.)
loeg: Yeah, or you could put a generation number in each slot adjacent to T and a read will only be valid if the slot's generation number == the last one observed + 1, for example. But ultimately the reader and writer still need to coordinate here, so we're just shifting the coordination cache line from the writer's index to the slot.
ramon156: Something to add to this; if you're focussing on these low-level optimizations, make sure the device this code runs on is actually tuned.A lot of people focus on the code and then assume the device in question is only there to run it. There's so much you can tweak. I don't always measure it, but last time I saw at least a 20% improvement in Network throughput just by tweaking a few things on the machine.
erickpintor: I see you replied on comment below with:> note that there are only one consumer and one producerThat clarify things as you don't need multi-thread coordination on reads or writes if assuming single producer and single consumer.
dalvrosa: Exactly, that's right
JonChesterfield: Push:buffer_[head] = value;head_.store(next_head, std::memory_order_release);return true;There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The aq/rel applies to the index, not to the unrelated non-atomic buffer located near the index.
judofyr: This is just wrong. See https://en.cppreference.com/w/cpp/atomic/memory_order.html. Emphasis mine:> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below).
dalvrosa: Thanks! That's not ensured, optimizations are only valid due to the constraints- One single producer thread- One single consumer thread- Fixed buffer capacitySo to answer> Are they ensuring two threads can't push to the same slot nor pop the same value from the ring?No need for this usecase :)
kevincox: I think the key difference is that they only need to coordinate when the reader and writer are close together. If that slows one end down they naturally spread apart. So you don't lose throughput, only a little latency in the contested case.
loeg: > I think the key difference is that they only need to coordinate when the reader and writer are close together.This was already the case with the cached index design at the end of the article, though. (Which doesn't require extra space or extra atomic stores.)
kevincox: That's a good point. They are very similar. I guess the sentinel design in theory doesn't need to synchronize at all as long as there is a decent buffer between them. But the cached design synchronizes less commonly the more space there is which sounds like it would be very similar in practice. The sentinel design might also have a few thrashing issues when the reader and writer are on the same page which would probably be a bit less of an issue with the cached index design.