Critique of “Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis”

By Sophia Wisdom
Concerning: Aaron Jarmusch & Sunita Chandrasekaran,
“Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis”,
available at https://arxiv.org/pdf/2512.02189

Original email to Dr. Chandrasekaran

From: sophia wisdom (email redacted)
To: Sunita Chandrasekaran (email redacted)

Dear Dr. Chandrasekaran,

I recently saw your paper "Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis" with Aaron Jarmusch at https://arxiv.org/pdf/2512.02189. The paper is awful and makes no sense whatsoever.

Take subsection A in section V, "Tensor Memory (TMEM)". The first paragraph is somewhat confused but at least has the semblance of sense. The second paragraph is either describing something else or was written by someone who doesn't know what TMEM is. The paper claims that TMEM achieves lower latency when loading from global memory -- this is not possible. TMEM only has instructions for moving data to and from registers and shared memory: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-instructions-tcgen05-cp. The paper goes on to say that this is faster than the conventional `ld.global` path, but `ld.global` has not been the conventional path for loading matmul operands since Ampere was released five years ago.

The paper also says "On Hopper, the standard pipeline for tensor operations relied on cp.async.bulk.tensor.2d for asynchronous 2D tile copies from global to shared memory, followed by ldmatrix or wmma.load to stage operands into registers before MMA execution". This is wrong: the Hopper `wgmma` instruction takes `A` from either registers or shared memory and `B` from shared memory, so operands do not need to be staged into registers at all.

The paper later says "This represents a significant departure from Hopper’s 32×32 optimal tile size". Not only is that not the optimal tile size, it's not even a possible tile size. The list of shapes supported by the `wgmma` instruction is here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shape, and it does not include 32x32 (M is fixed at 64).

"Blackwell’s TMEM operates with independent address generation units that pre-fetch tensor tiles directly into staging buffers" -- this makes no sense whatsoever. "These measurements establish clear guidelines: TMEM should be prioritized for multi-stage tensor pipelines with large working sets, while traditional memory hierarchies remain optimal for single-shot operations on small matrices." This makes no sense either, because the Hopper instructions cannot be used on Blackwell. You would have to fall back to Ampere-era instructions, and if you did you would face something like a 10x slowdown.

Table V is an instruction latency table. One line shows a latency of 11.4 cycles for the `tcgen05.mma` instruction with shape `m256n256k16` on float16. It also says this is per-warp, which I think is false too, but let's ignore that and assume it's per-SM. A matmul of that shape is 2 × 256 × 256 × 16 ≈ 2.1 million flops, so 11.4 cycles works out to 183,960 flops/cycle. The B200 has a boost clock of 1965 MHz, but let's assume an operating clock of 1500 MHz. With 148 SMs, that gives 183,960 × 1,500,000,000 × 148 ≈ 40 quadrillion flops/s. In fact the B200 is marketed as achieving 2.25 quadrillion flops/s, and it can probably only sustain about 70% of that.

There are surely other things wrong with the paper, but I think this should be enough. You should take down the paper.

Best,
Sophia Wisdom
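Checking two of the email's claims

The 32x32 tile-size claim can be checked directly against the `wgmma` shape list linked in the email. Below is a minimal Python sketch; the enumeration (M fixed at 64, K fixed at 16 for f16 inputs, N running from 8 to 256 in steps of 8) is my reading of the PTX ISA table, so treat it as an assumption rather than an authoritative encoding of the docs.

```python
# Shapes accepted by wgmma.mma_async for f16 inputs, per my reading of
# https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shape
# Assumption: M is fixed at 64, K at 16, and N ranges over multiples of 8 up to 256.
valid_f16_shapes = {(64, n, 16) for n in range(8, 257, 8)}

print((32, 32, 16) in valid_f16_shapes)  # False -- no 32x32 tile exists
print((64, 32, 16) in valid_f16_shapes)  # True
```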
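The throughput arithmetic at the end of the email is also easy to reproduce. A minimal sketch, assuming the paper's 11.4-cycle figure is per-SM, a 1500 MHz operating clock, and 148 SMs:

```python
# Throughput implied by Table V's reported latency for tcgen05.mma
# m256n256k16 (float16), under the assumptions stated above.
M, N, K = 256, 256, 16
latency_cycles = 11.4          # paper's reported latency
clock_hz = 1_500_000_000       # assumed operating clock (boost is 1965 MHz)
num_sms = 148                  # B200 SM count

flops_per_mma = 2 * M * N * K                     # one mul + one add per MAC
flops_per_cycle = flops_per_mma / latency_cycles  # ~183,960
implied_flops = flops_per_cycle * clock_hz * num_sms

marketed_flops = 2.25e15       # B200's marketed peak, per the email

print(f"{flops_per_cycle:,.0f} flops/cycle")                   # 183,961
print(f"{implied_flops:.2e} flops/s")                          # ~4.08e16
print(f"{implied_flops / marketed_flops:.1f}x marketed peak")  # ~18.2x
```

The implied throughput is roughly eighteen times the card's marketed peak, which is the contradiction the email points out.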