Here are some explanations of atomics and shared memory so that you can understand the context.
Many operations that we think of as a single logical operation are in fact composed of multiple smaller operations. Say you have a web server serving requests with many threads, and you want to count how many requests you receive. You do this by incrementing a value in memory by 1 for every request: in C, `(*request_counter)++`. We think of "incrementing a value in memory by 1" as a single logical operation, but it is actually composed of three operations: read the value, increment it, and write it back. Two threads can interleave those steps like this:
```
Thread 1        | Thread 2
Read value (0)  | Read value (0)
Increment (1)   | Increment (1)
Write back (1)  | Write back (1)
```
The counter now records you have received one request when you actually received two.
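To make the failure concrete, here is a small plain-C sketch (not the actual server code; the function name is illustrative) that replays the interleaving from the table step by step in a single thread, with the two "threads" as local variables:

```c
#include <assert.h>

/* Replay the bad interleaving from the table above: both "threads" read
   the counter before either writes back, so one increment is lost. */
static int lost_update_demo(void) {
    int counter = 0;
    int t1 = counter;  /* Thread 1: read value (0) */
    int t2 = counter;  /* Thread 2: read value (0) */
    t1 = t1 + 1;       /* Thread 1: increment (1)  */
    t2 = t2 + 1;       /* Thread 2: increment (1)  */
    counter = t1;      /* Thread 1: write back (1) */
    counter = t2;      /* Thread 2: write back (1), overwriting thread 1's update */
    return counter;    /* 1, even though two requests arrived */
}
```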
To fix this, we make the single logical operation a single physical operation ("atomic", in the sense of "indivisible"). It now becomes a single instruction that the hardware guarantees will always be indivisible.
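On the CPU side, C11 exposes exactly this as `atomic_fetch_add`, which compiles down to a single indivisible read-modify-write instruction (e.g. LOCK XADD on x86). A minimal host-side sketch, with illustrative names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A shared request counter. atomic_fetch_add is a single indivisible
   read-modify-write, so increments from many threads can no longer
   lose updates the way the plain (*request_counter)++ did. */
static _Atomic uint32_t request_counter = 0;

/* Called once per request, from any thread. Returns the count before
   the increment, like atomic_fetch_add itself. */
static uint32_t count_request(void) {
    return atomic_fetch_add(&request_counter, 1);
}
```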
The most flexible atomic instruction is compare-and-swap (CAS): it atomically replaces the value at a memory location with a new value only if it currently equals an expected value, and returns the value it found. With CAS we can build a lock:

```c
while (compare_and_swap(memory_location=lock_ptr, compare=0, replace_with=1) == 1) {} // take lock
// ... write value to end of array ...
*lock_ptr = 0; // release lock
```

This will try to set the lock value to 1 over and over until it succeeds. This is also known as a spinlock³. We can implement any operation we want on top of this; for example, we can implement our own increment instead of using a hardware increment instruction:
```c
int old = *request_counter; // load the value initially
int assumed;
do {
    assumed = old;
    int incremented = assumed + 1;
    // compare_and_swap returns the value it found before the attempt.
    // If that is the value we assumed, we succeeded and exit the loop.
    // Otherwise, retry with the freshly observed value.
    old = compare_and_swap(memory_location=request_counter, compare=assumed, replace_with=incremented);
} while (assumed != old);
```
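The same loop ported to C11's `atomic_compare_exchange_weak` (a host-side sketch; note that the C11 call writes the value it actually observed back into its expected-value argument, which replaces the pseudocode's explicit `old = compare_and_swap(...)` step):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* CAS-loop increment. On failure, atomic_compare_exchange_weak stores
   the value it actually found into `old`, so the next attempt uses the
   freshly observed value. Returns the value before our increment. */
static uint32_t cas_increment(_Atomic uint32_t *counter) {
    uint32_t old = atomic_load(counter);  /* load the value initially */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        /* another thread (or a spurious failure) beat us; retry */
    }
    return old;
}
```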
Because CAS is so flexible, we can implement anything we want as the operation -- atomic multiplication, atomic log-sum-exp, atomic XOR-then-shiftleft etc.
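A sketch of that flexibility in C11 (the helper names are made up for illustration): any pure function of the old value can be made atomic with the same read/compute/CAS/retry shape.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Make any pure function of the old value atomic: read, compute f(old),
   CAS it in, and retry if another thread interfered. Returns the old value. */
static uint32_t atomic_apply(_Atomic uint32_t *p, uint32_t (*f)(uint32_t)) {
    uint32_t old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, f(old))) {
        /* `old` was refreshed with the current value; try again */
    }
    return old;
}

/* e.g. the XOR-then-shiftleft from the text, as an ordinary function */
static uint32_t xor_then_shl(uint32_t x) {
    return (x ^ 0xDEADBEEFu) << 1;
}
```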
In general, the more localized and limited communication is, the more efficient it is to implement in hardware. Therefore NVIDIA hardware is optimized to provide several levels of communication of different flexibility so your programs can use the least expensive communication possible.
In CUDA, programs execute in units of what NVIDIA calls a "warp"⁴, which is analogous to a CPU thread. Each warp has 32 "threads", analogous to SIMD lanes. GPUs have many registers (256KB per SM), which allows them to implement an aggressive form of hyperthreading where up to 64 warps can be resident at the same time. This makes it easy to mask the latency of expensive memory accesses, because while one warp is waiting for data, other warps can be computing. Communication between threads in a warp can be done using warp shuffle functions.
The next level up in the hierarchy is that multiple warps can be grouped into a "CTA" or "block", which is guaranteed to run entirely on one SM. An SM consists of 4 SMSPs plus 128KB of "shared memory", which is like a very fast, addressable L1 cache. Multiple warps can communicate and work together through shared memory. For example, in a matrix multiplication kernel, many warps do different computations on the same data, reducing memory bandwidth.
In order for the warps in a block to communicate together they often need atomics: to track the maximum value they've seen, to calculate a sum, to keep a list or queue.
Here is the list of shared memory atomic SASS instructions⁵:

ATOMS.ADD
Atomically add a u32/i32.

ATOMS.POPC.INC
Atomically increment a u32/i32. It's very fast (<10 cycles) and never suffers from contention.

ATOMS.{MIN,MAX}
Atomically min/max a u32/i32.

ATOMS.{XOR,OR,AND}
Atomically apply 32-bit wide bitwise operations.

ATOMS.CAS.{32,64,128}
Atomically compare-and-swap a 32/64/128-bit value. These can be generated with user code⁷.

ATOMS.CAST.SPIN.{32,64}
Atomically compare-and-swap a 32/64-bit value. These are only generated by ptxas in loops (where you SPIN) to implement atomics not supported natively, for example atomicAdd((float *)ptr, 15):

```
.LOOPSTART:
  LD R2, [R0]                           // Load current value from memory and put it in R2
  FADD R3, R2, 15                       // Add 15 to R2 and put it in R3
  ATOMS.CAST.SPIN R3, [R0], R2, R3      // Attempt to replace the value at *ptr with R3 if *ptr==R2
  ISETP.EQ.U32.AND P0⁶, PT, R3, 0x1, PT // Did the CAS succeed?
  @!P0 BRA `(.LOOPSTART)                // If the CAS failed, loop back
```
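What ptxas is doing here can be mimicked in plain C11 on the host (a sketch: CAS on the float's bit pattern stands in for ATOMS.CAST.SPIN, and the function name is made up):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Emulated float atomicAdd, mirroring the SASS loop: load the current
   bits, do the FADD, then CAS the new bits in; if another thread changed
   the value in between, loop back and try again. Returns the old value. */
static float cas_float_add(_Atomic uint32_t *addr, float val) {
    uint32_t old = atomic_load(addr);               /* LD R2, [R0] */
    for (;;) {
        float f, sum;
        uint32_t new_bits;
        memcpy(&f, &old, sizeof f);
        sum = f + val;                              /* FADD R3, R2, val */
        memcpy(&new_bits, &sum, sizeof new_bits);
        /* ATOMS.CAST.SPIN: replace *addr with new_bits iff *addr == old */
        if (atomic_compare_exchange_weak(addr, &old, new_bits))
            return f;                               /* CAS succeeded */
        /* CAS failed: `old` now holds the fresh bits; BRA .LOOPSTART */
    }
}
```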
This has several implications:
Here are some graphs to substantiate these claims. This is unfortunately incomplete. The original code for all of these is here, in blog.ipynb.
First, the headline numbers. This shows the number of cycles per atomicAdd() from each of 256 threads to a varying number of values. The X axis is the average number of threads trying to write to a particular address and the Y axis is the number of cycles; both are on a log scale. We can see that under heavy contention, float32 atomicAdd()s are about 50x slower than int32 atomicAdd()s. Even under no contention (contending threads = 1), int32 atomicAdd()s are about 10x faster than float32 atomicAdd()s.
A note on a common pattern: the lines stay closely log-linear (a 2x reduction in contention means a ~2x reduction in runtime) until a kink, after which the time almost stops decreasing (here at 32 contending threads for int32 and 4 for float32). Often this kink comes from bank conflicts.
Some operations that by default are expensive CAS loops can be converted into a single hardware operation or a short series of hardware operations. Notably: float32 max can be implemented with int max, 64-bit AND/OR/XOR can be implemented with two 32-bit operations, and int64 add can be implemented with two int32 adds. The easiest and most useful of these is float32 max. Graphed below is float32 max with CAS vs float32 max with int max.
We can see int32 max is about 10x faster and has better contention behavior. Implementations of 64-bit add/AND/OR/XOR in terms of 32-bit operations are left to the reader :p
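For the curious, the float-max-as-int-max trick works because IEEE-754 floats were designed so their bit patterns sort almost like integers. Below is one standard order-preserving mapping (the "flip" used in float radix sorts; a sketch that ignores NaNs, and not necessarily the exact variant used in the benchmark), under which unsigned integer max equals float max:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bits to a uint32_t whose *unsigned* order matches float
   order: flip just the sign bit for non-negative floats (so they sort
   above all negatives), and flip every bit for negative floats (whose
   magnitude order is reversed). Then unsigned max on keys == float max. */
static uint32_t float_order_key(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```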
This is an expanded version of the first graph, with two new lines: "add manual", which uses a CAS loop in CUDA, and "add nochange", which sets the destination value of the atomicAdd() to +INF, so the atomics never change the value (x + INF == INF).
As we suspected, "nochange" improves performance significantly under high contention. This is because we are performing a CAS with the same value before and after.
Manual is almost the same speed but, confusingly, sometimes faster than atomicAdd(). My only guess at why is that atomicAdd()'s generated loop is not optimized for high contention (which would be reasonable).
3: There are good reasons why spinlocks are bad on the CPU (cf. Linus) that don't apply on the GPU, because there's no scheduler.

4: Because it was the first "parallel thread technology".

5: I'm ignoring ATOMS.ARRIVE (AKA mbarrier.arrive.shared) because it doesn't seem important. Otherwise, we can be pretty confident about this list because of Kuter Dinel's good work -- he listed all the modifiers for ATOMS, the generic instruction for "Atomic Operation on Shared Memory". There are two CAS variants: plain ATOMS.CAS (32/64/128), and one that spins.

6: "Predicate register 0". Predicate registers are special registers that hold a boolean for every lane in a warp, which then enables you to execute an instruction only for lanes where the predicate is true/false. The ISETP sets P0 to a value indicating whether the CAS succeeded, then @!P0 BRA `(.LOOPSTART) makes all the lanes where the CAS failed start over.

7: Specifically with atomicCAS(), which is atom.shared.cas in PTX.

(NOTE: mention __shfl_sync, mention random index stuff too)