Related: Ridge point
Given my recent exploration of Arithmetic Intensity, I was curious what the ridge point (peak compute throughput divided by peak memory bandwidth, in FLOPs per byte) works out to for leading Nvidia GPUs, so here are the numbers:
| Datatype | A100 (PCIe)¹ | A100 (SXM)¹ | H100 (PCIe) | H100 (SXM) |
|---|---|---|---|---|
| FP64 | 5 | 4 | 17 | 10 |
| FP64 (Tensor Core) | 10 | 9 | 33 | 20 |
| FP32 | 10 | 9 | 33 | 20 |
| TF32 (Tensor Core) | 80 | 76 | 494 | 295 |
| BF16 (Tensor Core) | 161 | 153 | 989 | 590 |
| FP16 (Tensor Core) | 161 | 153 | 989 | 590 |
| FP8 (Tensor Core) | 322 | 306 | 1979 | 1181 |
| INT8 (Tensor Core) | 322 | 306 | 1979 | 1181 |
Ridge points (in FLOPs per byte) for Nvidia A100 and H100 GPUs, across different interconnect configurations
These were derived from the A100 and H100 spec sheets.
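As a quick sanity check on a single cell: the A100 (PCIe) FP16 Tensor Core entry works out to 312 TFLOPS ÷ 1.935 TB/s ≈ 161 FLOPs per byte, which matches the table above.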
Perhaps it would be more useful to map this to the commercially available VMs of cloud providers, but a quick look at AWS's spec sheet shows different parameters, especially for their H100s, which clock in at 1000 TFLOPS FP16 instead of 1513 TFLOPS (roughly 33% below peak performance). I might create a table for this in the future.
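As a rough sketch of what that would do to the ridge point, taking the AWS-quoted FP16 figure at face value and assuming the SXM memory bandwidth is unchanged at 3.35 TB/s:

```python
# Rough sketch: how the FP16 (SXM) ridge point shifts if we use the AWS-quoted
# throughput instead of the spec-sheet figure. The 3.35 TB/s bandwidth is an
# assumption carried over from the H100 SXM spec sheet.
aws_fp16_tflops = 1000     # TFLOPS, as quoted in the paragraph above
spec_fp16_tflops = 1979    # TFLOPS, H100 SXM spec-sheet figure used in the table
sxm_bandwidth_tbps = 3.35  # TB/s

print(spec_fp16_tflops / sxm_bandwidth_tbps)  # ~590 FLOPs/byte (matches the table)
print(aws_fp16_tflops / sxm_bandwidth_tbps)   # ~298 FLOPs/byte
```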
Reproducing Code
```python
import pandas as pd

# Peak throughput in TFLOPS (TOPS for INT8), from the A100 and H100 spec sheets.
flops_df = pd.DataFrame(
    [
        dict(gpu='A100', fp64=9.7, fp64tc=19.5, fp32=19.5, tf32=156, bf16=312, fp16=312, i8=624),
        dict(gpu='H100', fp64=34, fp64tc=67, fp32=67, tf32=989, bf16=1979, fp16=1979, i8=3958),
    ]
).set_index('gpu').T

# Memory bandwidth in TB/s, so the ratios below come out in FLOPs per byte.
memory_bandwidth_df = pd.DataFrame(
    [
        dict(gpu='A100', pcie=1.935, sxm=2.039),  # assuming 80GB HBM2e specs
        dict(gpu='H100', pcie=2, sxm=3.35),
    ]
).set_index('gpu').T

ridge_point_pcie_df = flops_df / memory_bandwidth_df.loc['pcie']
ridge_point_sxm_df = flops_df / memory_bandwidth_df.loc['sxm']

# Convert columns into a multi-index column levelled by <GPU>-<interconnect>
ridge_point_pcie_df.columns = pd.MultiIndex.from_product([['pcie'], flops_df.columns])
ridge_point_sxm_df.columns = pd.MultiIndex.from_product([['sxm'], flops_df.columns])
ridge_point_df = pd.concat([ridge_point_pcie_df, ridge_point_sxm_df], axis=1)
ridge_point_df.columns = ridge_point_df.columns.swaplevel()
ridge_point_df = ridge_point_df.sort_index(axis=1)
ridge_point_df.astype(int)
```
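If you want the markdown table itself (an assumption on my part about how the table above was rendered), pandas can emit one directly; `DataFrame.to_markdown()` needs the optional `tabulate` package:

```python
# Render the ridge-point table as markdown (requires the optional `tabulate` package).
print(ridge_point_df.astype(int).to_markdown())
```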