A linear layer is a fundamental building block of many neural network architectures. It linearly transforms its input, and can be followed by a nonlinear activation function to form a feedforward layer. Stacking multiple feedforward layers gives rise to the powerful mapping functions that neural networks use to learn complex patterns from data.
A linear layer is defined as

$$
Y = XW + b
$$

where $X \in \mathbb{R}^{M \times K}$ is the input, $W \in \mathbb{R}^{K \times N}$ is the weight matrix, $b \in \mathbb{R}^{N}$ is the bias (broadcast across the $M$ rows), and $Y \in \mathbb{R}^{M \times N}$ is the output.
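As a minimal sketch (the sizes, the random data, and the use of NumPy here are illustrative choices, not part of the definition), the layer is just a matrix multiply followed by a broadcast bias add:

```python
import numpy as np

M, K, N = 32, 128, 64          # illustrative sizes: rows, input features, output features

X = np.random.randn(M, K)      # input
W = np.random.randn(K, N)      # weight matrix
b = np.random.randn(N)         # bias, broadcast across the M rows

Y = X @ W + b                  # linear layer: (M, K) @ (K, N) + (N,) -> (M, N)
assert Y.shape == (M, N)
```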
Even an operation as simple as this can have different arithmetic intensities, depending on whether optimizations like kernel fusion are applied.
It is worth comparing the results without and with kernel fusion.
Unoptimized intensity; without kernel fusion
Here are the steps performed in a naive linear-layer implementation, assuming each element occupies $s$ bytes (e.g. $s = 4$ for fp32).
Step | Work (FLOPs) | Memory Traffic (Bytes) |
---|---|---|
Load $X$ | $0$ | $sMK$ |
Load $W$ | $0$ | $sKN$ |
Perform matmul $XW$ | $2MKN$¹ | $0$ |
Save matmul result $XW$ | $0$ | $sMN$ |
Load $XW$ | $0$ | $sMN$ |
Load $b$ | $0$ | $sN$ |
Broadcast $b$ | N/A | Implementation-dependent² |
Element-wise sum ($XW + b$) | $MN$ | $0$ |
Save result $Y$ | $0$ | $sMN$ |
Note
Expressing the arithmetic intensity of the above in terms of the matrix dimensions $M$, $K$, and $N$ isn't particularly insightful on its own. However, we can explore it more visually later below.
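To put rough numbers on it in the meantime, here is a small Python sketch that tallies the FLOPs and bytes from the table above; the helper name, the default of 4 bytes per element (fp32), and the example sizes are my own assumptions for illustration:

```python
def unfused_intensity(M, K, N, bytes_per_element=4):
    """Arithmetic intensity (FLOPs/byte) of the unfused linear layer,
    tallied from the table above. Assumes fp32 elements by default."""
    flops = 2 * M * K * N + M * N          # matmul + element-wise bias add
    bytes_moved = bytes_per_element * (
        M * K      # load X
        + K * N    # load W
        + M * N    # save the matmul result
        + M * N    # load the matmul result back in
        + N        # load b
        + M * N    # save the final result
    )
    return flops / bytes_moved

print(unfused_intensity(M=512, K=1024, N=1024))   # ≈ 85 FLOPs/byte for these sizes
```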
A kernel involves loading data into registers, performing computations, and saving the results back to memory.
There are two kernels in an unoptimized linear layer: one for the matmul and another for the element-wise summation. From the table above, it becomes apparent that a naive implementation is inefficient due to redundant data transfers: the intermediate result $XW$ is saved at the end of the matmul kernel, only to be loaded back in by the element-wise summation kernel.
This is where the kernel fusion technique comes into play. The intuition is to avoid redundant memory transfers by performing as much work as possible on data that is already loaded, i.e. by fusing the kernels.
Tip
Kernel fusion is most beneficial when there are many element-wise computations on the same set of data.
A feedforward layer—a linear layer followed by an element-wise nonlinear activation like ReLU—would benefit even more from fusion ⚡!
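As a hedged illustration in PyTorch (an assumed framework choice; whether the bias add actually runs fused inside the matmul kernel depends on the backend and hardware), the two formulations of the same linear layer might look like this:

```python
import torch

M, K, N = 512, 1024, 1024
X, W, b = torch.randn(M, K), torch.randn(K, N), torch.randn(N)

# Unfused: the intermediate X @ W is materialised in memory,
# then read back in for the element-wise bias add.
tmp = X @ W
Y_unfused = tmp + b

# Single-call formulation: addmm expresses the matmul and bias add together,
# giving the backend the opportunity to apply the bias in the matmul epilogue.
Y_fused = torch.addmm(b, X, W)

print(torch.allclose(Y_unfused, Y_fused, rtol=1e-4, atol=1e-4))  # same result either way
```

In practice, compilers such as torch.compile can perform this kind of fusion automatically, including folding element-wise activations like ReLU into the preceding kernel.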
Optimized intensity, using kernel fusion
The steps are similar to the example above, just without the redundant data transfers for the intermediate result $XW$.
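Continuing the earlier sketch under the same assumptions (4-byte elements by default, illustrative sizes), the fused version simply drops the save and reload of the intermediate $XW$:

```python
def fused_intensity(M, K, N, bytes_per_element=4):
    """Arithmetic intensity of the fused linear layer: the FLOPs are unchanged,
    but the intermediate X @ W is never saved to or reloaded from memory."""
    flops = 2 * M * K * N + M * N
    bytes_moved = bytes_per_element * (M * K + K * N + N + M * N)  # load X, W, b; save Y
    return flops / bytes_moved

print(fused_intensity(M=512, K=1024, N=1024))   # ≈ 128 FLOPs/byte for these sizes
```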
Comparing the results
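As a rough sketch of such a comparison, one could reuse the two helper functions from the sketches above and sweep the matrix sizes (square matrices here are an arbitrary choice):

```python
for size in (256, 512, 1024, 2048, 4096):
    u = unfused_intensity(M=size, K=size, N=size)
    f = fused_intensity(M=size, K=size, N=size)
    print(f"{size:>5}: unfused {u:7.1f} FLOPs/byte | fused {f:7.1f} FLOPs/byte | ratio {f / u:.2f}x")
```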
Footnotes
- A matrix multiplication can be thought of as a series of dot products between the rows of one matrix and the columns of the other. The factor of 2 arises from performing $K$ multiplications, followed by summing the resulting terms, in each dot-product operation. ↩
- I may be wrong here, but an optimized implementation would incur zero byte transfers, as the broadcasting is virtual (e.g. handled through strides rather than materialised copies). ↩