Keep in mind that your counter is not monotonically increasing: it can go back in time. If a thread reads the counter and then hits a context switch, it can take quite some time before it gets back on a core. It will then add 1 to a very old value and write that back, overwriting the increments other threads made in the meantime.

There is some overlap between opaque and memory_order_relaxed, but there is at least one fundamental difference: memory_order_relaxed is coherent and opaque is not.

I would also include a benchmark that uses opaque only. Opaque doesn't order surrounding loads and stores, so it restricts the compiler and the CPU less. It probably won't be noticeable in this particular benchmark, but once you do more than just increment a counter, it might make a difference. The difference might also be more visible on ARM than on x86, because ARM has a more relaxed memory model.

If you want to optimize for the writing side, I would really use some kind of striped counter like LongAdder, so that threads don't need to contend for the same cache line. And if the counter is purely for progress indication and performance is critical, I would explicitly give each thread its own counter on its own cache line and do all accesses with opaque for optimal performance. The burden is then shifted completely to the reader, which has to sum the per-thread counters.
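
To make the per-thread counter idea concrete, here is a rough sketch, assuming Java 9+ and VarHandle. The class name StripedProgressCounter, the writer-index scheme and the stride of 16 longs (128 bytes, which I am assuming is enough to keep slots on separate cache lines) are my own illustrative choices, not anything from your code. Each slot has exactly one writer, so a separate opaque read and opaque write is enough and no atomic read-modify-write is needed.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Hypothetical sketch: one padded slot per writer thread, opaque access only. */
final class StripedProgressCounter {
    // Assumption: 16 longs (128 bytes) between slots avoids false sharing.
    private static final int STRIDE = 16;
    private static final VarHandle SLOTS =
            MethodHandles.arrayElementVarHandle(long[].class);

    private final int writers;
    private final long[] slots;

    StripedProgressCounter(int writers) {
        this.writers = writers;
        // One extra stride up front keeps the first slot away from the array header.
        this.slots = new long[(writers + 1) * STRIDE];
    }

    private static int indexOf(int writerIndex) {
        return (writerIndex + 1) * STRIDE;
    }

    /** Must only be called by the single thread that owns writerIndex. */
    void increment(int writerIndex) {
        int i = indexOf(writerIndex);
        // Single writer per slot: nobody else can change the value in between.
        long current = (long) SLOTS.getOpaque(slots, i);
        SLOTS.setOpaque(slots, i, current + 1);
    }

    /** Reader side: an approximate, possibly stale total, not a snapshot. */
    long sum() {
        long total = 0;
        for (int w = 0; w < writers; w++) {
            total += (long) SLOTS.getOpaque(slots, indexOf(w));
        }
        return total;
    }
}
```

The writers never touch each other's cache lines and never issue a fence or an atomic instruction. The reader pays for that: it has to walk all slots, and what it gets is only good enough for something like a progress bar.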