What is the access speed of tensor memory compared to shared memory?

In CC 10.0 (Blackwell), tensor cores can load their inputs from either shared memory or tensor memory. The latter is a new type of on-chip memory (see the PTX ISA 8.8 documentation).

Registers can be stored to and loaded from tensor memory via PTX (subject to specific access patterns).
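
To give a sense of what "specific access patterns" can mean, here is a minimal Python sketch of one restriction as I read it in the PTX ISA description of `tcgen05.ld`/`tcgen05.st`: tensor memory has 128 lanes, and each warp of a warpgroup can only touch the 32-lane chunk selected by its warp index modulo 4. The function name `accessible_lanes` and the constants are illustrative only, not part of any API, and the exact layout should be checked against the PTX documentation.

```python
# Assumed layout (from my reading of the PTX ISA docs, not verified on hardware):
# 128 tensor-memory lanes, split into 4 equal chunks, one per warp of a warpgroup.
TMEM_LANES = 128
WARPS_PER_WARPGROUP = 4


def accessible_lanes(warp_id: int) -> range:
    """Return the tensor-memory lanes a warp may access (hypothetical helper)."""
    chunk = TMEM_LANES // WARPS_PER_WARPGROUP          # 32 lanes per warp
    start = (warp_id % WARPS_PER_WARPGROUP) * chunk    # chunk chosen by warp_id % 4
    return range(start, start + chunk)


if __name__ == "__main__":
    for w in range(4):
        lanes = accessible_lanes(w)
        print(f"warp {w}: lanes {lanes.start}..{lanes.stop - 1}")
```

So code that wants to use tensor memory as a scratchpad would need each warp to stay within its own lane chunk, which is much stricter than shared memory, where any thread can address any byte.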

How fast are tensor memory accesses compared to shared memory accesses?
Can non-tensor-core code that is limited by shared memory speed also benefit from using tensor memory instead, assuming its access patterns fit the restrictions of the load/store instructions?

rs277 June 15, 2025, 7:28pm 2

Reading through this GTC presentation, it seems reasonable to think of TMEM as a less flexible duplicate of the register file, with performance at a similar level:

“New memory on each SM; same size as the Register File: 256 KB.”
“TMEM addresses can NOT be dereferenced!”

Not sure if this comment affects what you have in mind:
“Used for Tensor Core (TC) ops. SIMT operations not supported on TMEM.”
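
The 256 KB figure quoted above can be sanity-checked with a little arithmetic. This sketch assumes the tensor memory layout given in the PTX ISA documentation (128 lanes × 512 columns of 32-bit cells) and the usual 64K 32-bit registers per SM; the function names are hypothetical.

```python
# Assumed dimensions: 128 lanes x 512 columns of 32-bit cells (PTX ISA docs),
# and 64K 32-bit registers per SM (CUDA programming guide).
TMEM_LANES = 128
TMEM_COLUMNS = 512
CELL_BYTES = 4
REGISTERS_PER_SM = 64 * 1024
REGISTER_BYTES = 4


def tmem_bytes() -> int:
    """Total tensor memory per SM under the assumed layout."""
    return TMEM_LANES * TMEM_COLUMNS * CELL_BYTES


def register_file_bytes() -> int:
    """Total register file per SM under the assumed register count."""
    return REGISTERS_PER_SM * REGISTER_BYTES


if __name__ == "__main__":
    print(f"TMEM:          {tmem_bytes() // 1024} KB")
    print(f"Register file: {register_file_bytes() // 1024} KB")
```

Matching capacity says nothing about bandwidth or latency by itself, so this only supports the "same size" part of the comparison, not the speed part.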