Imagine that you’re loading data by performing the following operation in CUDA:
float2 is a 2-member float tuple and occupies 8 bytes. The operation appears to perform two random memory accesses, on arrays p2 and p1, both of which are of type float2 as well.
An SM on this GPU [1] executes this operation 4 warps (of 32 threads each) at a time. So, on each execution, the size of the coalesced memory read in one go is:
4 warps × 32 threads × 8 bytes = 1024 bytes
That happens to match the GPU’s memory page size of 1024 bytes. Since the number of bytes per read cycle is a multiple of the memory page size, this particular example’s access pattern is optimal with regard to memory access (up to 13x speed improvement) [2]. This is because every byte read from a page is actually used.
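The arithmetic above, and the alignment caveat raised in the second footnote, can be sketched in a few lines of host-side C++. This is only an illustration under stated assumptions: `Float2` is a stand-in for CUDA's `float2`, the constants are the figures quoted in the text rather than values queried from hardware, and the helper names (`bytes_per_read`, `fills_whole_pages`, `is_page_aligned`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for CUDA's float2 so the sketch compiles without the CUDA headers.
struct Float2 { float x, y; };
static_assert(sizeof(Float2) == 8, "Float2 is expected to occupy 8 bytes");

// Figures quoted in the text; assumptions, not values queried from a device.
constexpr std::size_t kThreadsPerWarp = 32;
constexpr std::size_t kWarpsPerIssue  = 4;
constexpr std::size_t kPageBytes      = 1024;

// Bytes requested when every thread in the group loads one element of type T.
template <typename T>
constexpr std::size_t bytes_per_read() {
    return kWarpsPerIssue * kThreadsPerWarp * sizeof(T);
}

// True when a read size is a whole number of pages.
constexpr bool fills_whole_pages(std::size_t bytes) {
    return bytes % kPageBytes == 0;
}

// True when a pointer starts exactly on a page boundary.
inline bool is_page_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kPageBytes == 0;
}

// 4 warps x 32 threads x 8 bytes = 1024 bytes, i.e. exactly one page.
static_assert(bytes_per_read<Float2>() == 1024, "one page per read");
static_assert(fills_whole_pages(bytes_per_read<Float2>()), "whole pages only");
```

By the same formula, a plain `float` element would give 4 × 32 × 4 = 512 bytes per read, which is only half a page. The `is_page_aligned` check is there because a page-sized read only covers a single page when the read also starts on a page boundary.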
Conclusion
If you carefully design your program so that each read lines up with the memory page size, even seemingly random memory access can be optimal.
Footnotes
[1] At least based on an Nvidia A100’s spec.
[2] Is this necessarily true? It assumes that the array pointer is aligned with the page, which itself implies that extra memory at the tail end of the page is not utilized. This is likely not the case in real-world settings, which adds to the complexity of optimal kernel design.