A linear layer is a fundamental building block of many neural network architectures. It linearly transforms its input, and can be followed by a nonlinear activation function to form a feedforward layer. Stacking multiple feedforward layers gives rise to the powerful mapping functions that neural networks use to learn complex patterns from data.
A linear layer is defined as

$$
Y = XW + b
$$

where $X \in \mathbb{R}^{M \times K}$ is the input, $W \in \mathbb{R}^{K \times N}$ is the weight matrix, $b \in \mathbb{R}^{N}$ is the bias (broadcast across the $M$ rows), and $Y \in \mathbb{R}^{M \times N}$ is the output.
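As a minimal sketch (the sizes, the random data, and the use of NumPy here are illustrative choices, not part of the definition), the layer is just a matrix multiply followed by a broadcast bias add:

```python
import numpy as np

M, K, N = 32, 128, 64          # illustrative sizes: rows, input features, output features

X = np.random.randn(M, K)      # input
W = np.random.randn(K, N)      # weight matrix
b = np.random.randn(N)         # bias, broadcast across the M rows

Y = X @ W + b                  # linear layer: (M, K) @ (K, N) + (N,) -> (M, N)
assert Y.shape == (M, N)
```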
Even an operation as simple as this can have different arithmetic intensities, depending on whether optimizations like kernel fusion are applied.
It is worth comparing the results without and with kernel fusion.
Unoptimized intensity; without kernel fusion
Here are the steps performed in a naive linear-layer implementation, assuming each element occupies $s$ bytes (e.g. $s = 4$ for fp32).
Step | Work (FLOPs) | Memory Traffic (Bytes) |
---|---|---|
Load $X$ | $0$ | $sMK$ |
Load $W$ | $0$ | $sKN$ |
Perform matmul $XW$ | $2MKN$¹ | $0$ |
Save matmul result $XW$ | $0$ | $sMN$ |
Load $XW$ | $0$ | $sMN$ |
Load $b$ | $0$ | $sN$ |
Broadcast $b$ | N/A | Implementation-dependent² |
Element-wise sum ($XW + b$) | $MN$ | $0$ |
Save result $Y$ | $0$ | $sMN$ |
Note
Expressing the arithmetic intensity of the above in terms of the matrix dimensions $M$, $K$, and $N$ isn't particularly insightful on its own. However, we can explore it more visually later below.
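To put rough numbers on it in the meantime, here is a small Python sketch that tallies the FLOPs and bytes from the table above; the helper name, the default of 4 bytes per element (fp32), and the example sizes are my own assumptions for illustration:

```python
def unfused_intensity(M, K, N, bytes_per_element=4):
    """Arithmetic intensity (FLOPs/byte) of the unfused linear layer,
    tallied from the table above. Assumes fp32 elements by default."""
    flops = 2 * M * K * N + M * N          # matmul + element-wise bias add
    bytes_moved = bytes_per_element * (
        M * K      # load X
        + K * N    # load W
        + M * N    # save the matmul result
        + M * N    # load the matmul result back in
        + N        # load b
        + M * N    # save the final result
    )
    return flops / bytes_moved

print(unfused_intensity(M=512, K=1024, N=1024))   # ≈ 85 FLOPs/byte for these sizes
```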
A kernel involves loading data into registers, performing computations, and saving the results back to memory.
There are two kernels in an unoptimized linear layer: one for the matmul and another for the element-wise summation. From the table above, it becomes apparent that a naive implementation is inefficient due to redundant data transfers: the intermediate result $XW$ is saved at the end of the matmul kernel, only to be loaded back in by the element-wise summation kernel.
This is where the kernel fusion technique comes into play. The intuition is to avoid redundant memory transfers by performing as much work as possible on data that is already loaded, i.e. by fusing the kernels.
Tip
Kernel fusion is most beneficial when there are many element-wise computations on the same set of data.
A feedforward layer—a linear layer followed by an element-wise nonlinear activation like ReLU—would benefit even more from fusion ⚡!
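As a hedged illustration in PyTorch (an assumed framework choice; whether the bias add actually runs fused inside the matmul kernel depends on the backend and hardware), the two formulations of the same linear layer might look like this:

```python
import torch

M, K, N = 512, 1024, 1024
X, W, b = torch.randn(M, K), torch.randn(K, N), torch.randn(N)

# Unfused: the intermediate X @ W is materialised in memory,
# then read back in for the element-wise bias add.
tmp = X @ W
Y_unfused = tmp + b

# Single-call formulation: addmm expresses the matmul and bias add together,
# giving the backend the opportunity to apply the bias in the matmul epilogue.
Y_fused = torch.addmm(b, X, W)

print(torch.allclose(Y_unfused, Y_fused, rtol=1e-4, atol=1e-4))  # same result either way
```

In practice, compilers such as torch.compile can perform this kind of fusion automatically, including folding element-wise activations like ReLU into the preceding kernel.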
Optimized intensity, using kernel fusion
The steps are similar to the example above, just without the redundant data transfers for the intermediate result $XW$.
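Continuing the earlier sketch under the same assumptions (4-byte elements by default, illustrative sizes), the fused version simply drops the save and reload of the intermediate $XW$:

```python
def fused_intensity(M, K, N, bytes_per_element=4):
    """Arithmetic intensity of the fused linear layer: the FLOPs are unchanged,
    but the intermediate X @ W is never saved to or reloaded from memory."""
    flops = 2 * M * K * N + M * N
    bytes_moved = bytes_per_element * (M * K + K * N + N + M * N)  # load X, W, b; save Y
    return flops / bytes_moved

print(fused_intensity(M=512, K=1024, N=1024))   # ≈ 128 FLOPs/byte for these sizes
```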
Comparing the results
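As a rough sketch of such a comparison, one could reuse the two helper functions from the sketches above and sweep the matrix sizes (square matrices here are an arbitrary choice):

```python
for size in (256, 512, 1024, 2048, 4096):
    u = unfused_intensity(M=size, K=size, N=size)
    f = fused_intensity(M=size, K=size, N=size)
    print(f"{size:>5}: unfused {u:7.1f} FLOPs/byte | fused {f:7.1f} FLOPs/byte | ratio {f / u:.2f}x")
```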
Footnotes
- A matrix multiplication can be thought of as a series of dot products between the rows of one matrix and the columns of the other. The factor of 2 arises from performing $K$ multiplications, followed by summing the resulting terms, in each dot-product operation. ↩
- I may be wrong here, but an optimized implementation would incur zero byte transfers, as the broadcasting is virtual (e.g. handled through strides rather than materialised copies). ↩