Related: Ridge point


I was curious about what these numbers look like for leading Nvidia GPUs, given my recent exploration of Arithmetic Intensity, so here they are:

| Datatype | A100 (PCIe)¹ | A100 (SXM)¹ | H100 (PCIe) | H100 (SXM) |
| --- | --- | --- | --- | --- |
| FP64 | 5 | 4 | 17 | 10 |
| FP64 (Tensor Core) | 10 | 9 | 33 | 20 |
| FP32 | 10 | 9 | 33 | 20 |
| TF32 (Tensor Core) | 80 | 76 | 494 | 295 |
| BF16 (Tensor Core) | 161 | 153 | 989 | 590 |
| FP16 (Tensor Core) | 161 | 153 | 989 | 590 |
| FP8 (Tensor Core) | 322 | 306 | 1979 | 1181 |
| INT8 (Tensor Core) | 322 | 306 | 1979 | 1181 |

Ridge points, in FLOPs per byte, for Nvidia A100 and H100 GPUs across different configurations

These were derived from the A100 and H100 spec sheets.
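For a quick spot check, each entry is simply peak compute divided by peak memory bandwidth, so the units cancel down to FLOPs per byte. Taking the A100 SXM FP64 cell as an example, using the spec-sheet figures that also appear in the code below:

9.7 TFLOPS / 2.039 TB/s ≈ 4.76 FLOPs/byte

which truncates to the 4 shown in the table.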

Perhaps it would be more effective to map this to commercially available VMs from cloud providers, but a quick look at AWS' spec sheet shows different parameters, especially for their H100s, which clock in at 1000 TFLOPs FP16 instead of 1513 TFLOPs FP16 (33% below max performance). I might create a table for this in the future.
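As a rough illustration of what that difference does to the ridge point (and assuming, on my part, that these VMs keep the H100 SXM's 3.35 TB/s memory bandwidth, which AWS does not list alongside the FLOPs figure), the advertised 1000 TFLOPs FP16 would give roughly 1000 / 3.35 ≈ 298 FLOPs/byte, down from the 590 in the table above.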


Reproducing Code

import pandas as pd
 
# Peak throughput per datatype in TFLOPS (TOPS for i8), from the Nvidia spec sheets;
# the i8 figures also back the FP8 row of the table above
flops_df = pd.DataFrame(
    [
        dict(gpu='A100', fp64=9.7, fp64tc=19.5, fp32=19.5, tf32=156, bf16=312, fp16=312, i8=624),
        dict(gpu='H100', fp64=34, fp64tc=67, fp32=67, tf32=989, bf16=1979, fp16=1979, i8=3958),
    ]
).set_index('gpu').T
 
# Peak memory bandwidth in TB/s, per interconnect
memory_bandwidth_df = pd.DataFrame(
    [
        dict(gpu='A100', pcie=1.935, sxm=2.039),  # assuming 80GB HBM2e specs
        dict(gpu='H100', pcie=2, sxm=3.35),
    ]
).set_index('gpu').T
 
# Ridge point = peak compute / peak bandwidth; TFLOPS / (TB/s) = FLOPs per byte
ridge_point_pcie_df = flops_df / memory_bandwidth_df.loc['pcie']
ridge_point_sxm_df = flops_df / memory_bandwidth_df.loc['sxm']
 
# Convert the columns into a multi-index, levelled by <GPU>-<interconnect>
ridge_point_pcie_df.columns = pd.MultiIndex.from_product([['pcie'], flops_df.columns])
ridge_point_sxm_df.columns = pd.MultiIndex.from_product([['sxm'], flops_df.columns])
 
ridge_point_df = pd.concat([ridge_point_pcie_df, ridge_point_sxm_df], axis=1)
ridge_point_df.columns = ridge_point_df.columns.swaplevel()
ridge_point_df = ridge_point_df.sort_index(axis=1)
 
ridge_point_df.astype(int)  # truncates toward zero, e.g. 4.76 -> 4
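To sanity-check a single cell against the table (before the astype(int) cast), the multi-indexed frame can be queried directly, for example:

ridge_point_df.loc['fp16', ('H100', 'sxm')]  # ≈ 590.7, shown as 590 in the table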

Footnotes

  1. Assuming 80GB HBM2e memory bandwidths.