Occupancy

Definition

Occupancy is how many blocks of work you can pack onto an SM (streaming multiprocessor) at once.

Occupancy is the most powerful tool for tuning your GPU program.¹

Footprint of a block

Each block has its own “footprint”, which consists of the following components:

Component	Description
Block size	The number of threads in the block
Shared memory	Memory shared by all threads in the block
Total² number of registers	Working space for each thread,³ allocated per thread.⁴ The number of registers each thread needs depends on the program’s complexity⁵ and is determined by the CUDA compiler.

Each SM has a fixed “budget” of these resources, which limits how many blocks you can fit onto it.

By shrinking the footprint of your block, you can increase the number of blocks that fit onto an SM.
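The relationship between footprint and budget can be sketched as plain host-side arithmetic: each resource caps the block count on its own, and the tightest cap wins. This is a simplified sketch, not how the hardware or the CUDA runtime actually computes it. The names `BlockFootprint`, `SmBudget`, `blocks_per_sm`, and `kExampleBudget` are hypothetical, the budget values are illustrative assumptions rather than the limits of any particular GPU, and the sketch ignores the allocation granularities that real hardware applies to registers and shared memory.

```cpp
#include <algorithm>
#include <cassert>

// The per-block resource demand (the "footprint"). Hypothetical type.
struct BlockFootprint {
    int threads_per_block;
    int registers_per_thread;
    int shared_mem_bytes;

    // Total registers = registers/thread x threads/block.
    int total_registers() const {
        return registers_per_thread * threads_per_block;
    }
};

// The per-SM resource limits (the "budget"). Hypothetical type.
struct SmBudget {
    int max_threads;
    int max_registers;
    int max_shared_mem_bytes;
    int max_blocks;
};

// ASSUMPTION: illustrative numbers only; query your own device for real limits.
constexpr SmBudget kExampleBudget{2048, 65536, 160 * 1024, 32};

// Each resource limits the block count independently; the smallest limit wins.
// Integer division rounds down because a block never spans multiple SMs,
// so a block that only partly fits does not count.
int blocks_per_sm(const BlockFootprint& block, const SmBudget& sm) {
    assert(block.threads_per_block > 0);
    assert(block.registers_per_thread > 0);
    assert(block.shared_mem_bytes >= 0);

    const int by_threads = sm.max_threads / block.threads_per_block;
    const int by_registers = sm.max_registers / block.total_registers();
    // A block that uses no shared memory is not limited by it.
    const int by_shared_mem = block.shared_mem_bytes > 0
        ? sm.max_shared_mem_bytes / block.shared_mem_bytes
        : sm.max_blocks;

    return std::min({by_threads, by_registers, by_shared_mem, sm.max_blocks});
}
```

Under these assumed limits, `blocks_per_sm({256, 100, 48 * 1024}, kExampleBudget)` returns 2, with registers as the bottleneck. Trimming the same block to 64 registers per thread and 16 KiB of shared memory raises the result to 4. For real kernels, the CUDA runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the actual figure for a given device.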

A fundamental assumption behind all of this is:

NOTE

A block never spans multiple SMs.

CUDA architect Stephen Jones starts with this in mind before designing his programs. ↩
Registers/thread × threads/block ↩
Akin to RAM for a thread. Unlike on a CPU, this “RAM” does not rely on a cache but lives in actual registers, which give direct access to data. That matters because memory performance is so critical for GPUs. ↩
A hundred registers per thread is rather common for a CUDA program. ↩
For example, multiplication and division need a lot of working space, so they take up more registers than simpler operations. ↩
Rarely reached, so can be mostly ignored ↩