Imagine that you’re loading data by performing the following operation in CUDA:

float2 dp = p2[index] - p1[index];

float2 is a two-member float tuple and occupies 8 bytes. The operation appears to perform two random memory accesses, one into each of the arrays p2 and p1, both of which are arrays of float2 as well.
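
For concreteness, here is a minimal sketch of a kernel built around that line (the kernel name, array names, and launch configuration are illustrative assumptions, not from the original). Note that plain CUDA does not define operator- for float2, so the subtraction is written per component:

    // Hypothetical kernel: each thread computes the displacement between
    // corresponding elements of two float2 arrays. Because index is derived
    // from the thread ID, adjacent threads read adjacent 8-byte elements,
    // so each warp's loads coalesce into contiguous memory transactions.
    __global__ void displacement(const float2* p1, const float2* p2,
                                 float2* dps, int n) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < n) {
            float2 dp;
            dp.x = p2[index].x - p1[index].x;  // per-component subtraction;
            dp.y = p2[index].y - p1[index].y;  // float2 has no built-in operator-
            dps[index] = dp;
        }
    }

A launch such as displacement<<<(n + 255) / 256, 256>>>(p1, p2, dps, n) would hand each warp a contiguous 256-byte slice of each array (32 threads × 8 bytes).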

The GPU’s SM executes this operation 4 warps (of 32 threads each) at a time¹. So, upon each execution, the size of the coalesced memory read from each array in one go is:

    4 warps × 32 threads/warp × 8 bytes/thread = 1024 bytes

That happens to correspond to the GPU’s memory page size of 1024 bytes. Since the number of bytes per read cycle is a multiple of the memory page size, this particular example’s access pattern is optimal with regard to memory access (up to 13x speed improvements)². This is because all of the memory read from a single page is used.
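
Footnote 2 questions whether the arrays actually start on a page boundary. As a quick sanity check, here is a minimal host-side sketch; the variable names are illustrative, and the 1024-byte page size is taken from the text above rather than queried from the hardware:

    #include <cstdint>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float2* p1 = nullptr;
        const size_t n = 1 << 20;
        // cudaMalloc guarantees alignment of at least 256 bytes, but not
        // necessarily alignment to the 1024-byte page size assumed here.
        cudaMalloc(&p1, n * sizeof(float2));

        const uintptr_t addr = reinterpret_cast<uintptr_t>(p1);
        printf("p1 %s 1024-byte aligned\n",
               (addr % 1024 == 0) ? "is" : "is NOT");

        cudaFree(p1);
        return 0;
    }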

Conclusion

By carefully designing how your program reads data, seemingly random memory accesses can in fact be fully coalesced and optimal.


Footnotes

  1. At least based on an Nvidia A100’s spec ↩

  2. Is this necessarily true? It assumes that each array pointer is aligned with a page boundary, which itself implies that extra memory at the tail end of the final page is not utilized. This likely is not the case in real-world settings, which adds to the complexity of optimal kernel design. ↩