Imagine that you’re loading data by performing the following operation in CUDA:

float2 dp = p2[index] - p1[index];

float2 is a two-member float tuple and occupies 8 bytes. The operation appears to perform two random memory accesses, one into each of the arrays p2 and p1, both of which are arrays of float2 as well.
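
For concreteness, here is a minimal sketch of a kernel built around that line (the kernel name, array names, and launch configuration are illustrative assumptions, not from the original). Note that plain CUDA does not define operator- for float2, so the subtraction is written per component:

    // Hypothetical kernel: each thread computes the displacement between
    // corresponding elements of two float2 arrays. Because index is derived
    // from the thread ID, adjacent threads read adjacent 8-byte elements,
    // so each warp's loads coalesce into contiguous memory transactions.
    __global__ void displacement(const float2* p1, const float2* p2,
                                 float2* dps, int n) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < n) {
            float2 dp;
            dp.x = p2[index].x - p1[index].x;  // per-component subtraction;
            dp.y = p2[index].y - p1[index].y;  // float2 has no built-in operator-
            dps[index] = dp;
        }
    }

A launch such as displacement<<<(n + 255) / 256, 256>>>(p1, p2, dps, n) would hand each warp a contiguous 256-byte slice of each array (32 threads × 8 bytes).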

The GPU’s SM executes this operation 4 warps (of 32 threads each) at a time¹. So, upon each execution, the size of the coalesced memory read from each array in one go is:

    4 warps × 32 threads/warp × 8 bytes/thread = 1024 bytes

That happens to correspond to the GPU’s memory page size of 1024 bytes. Since the number of bytes per read cycle is a multiple of the memory page size, this particular example’s access pattern is optimal with regard to memory access (up to 13x speed improvements)². This is because all of the memory read from a single page is used.
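
Footnote 2 questions whether the arrays actually start on a page boundary. As a quick sanity check, here is a minimal host-side sketch; the variable names are illustrative, and the 1024-byte page size is taken from the text above rather than queried from the hardware:

    #include <cstdint>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float2* p1 = nullptr;
        const size_t n = 1 << 20;
        // cudaMalloc guarantees alignment of at least 256 bytes, but not
        // necessarily alignment to the 1024-byte page size assumed here.
        cudaMalloc(&p1, n * sizeof(float2));

        const uintptr_t addr = reinterpret_cast<uintptr_t>(p1);
        printf("p1 %s 1024-byte aligned\n",
               (addr % 1024 == 0) ? "is" : "is NOT");

        cudaFree(p1);
        return 0;
    }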

Conclusion

By carefully designing how your program reads data, seemingly random memory accesses can in fact be fully coalesced and optimal.


Footnotes

  1. At least based on an Nvidia A100’s spec ↩

  2. Is this necessarily true? It assumes that each array pointer is aligned with a page boundary, which itself implies that extra memory at the tail end of the final page is not utilized. This likely is not the case in real-world settings, which adds to the complexity of optimal kernel design. ↩