The main APIs lack documentation, and its README is out of date with respect to its calibration API.

Dynamic Quantization

  1. Train a model
  2. Run quanto.quantize(model, weights=quanto.qint8, activations=quanto.qint8)
  3. (Optional; but recommended) Calibrate the input and output activations with some samples
  4. Run quanto.freeze(model) to replace the float weights with their quantized counterparts

Thoughts

  • The API is much more straightforward than the manual PyTorch workflow — you only need to think in terms of quantize, Calibration, and freeze
  • You don’t have to rewrite your model into a quantization-aware architecture
  • The resulting weights are already in the target dtype, whereas the PyTorch way displays the quantized weights in their dequantized form
  • This library can go down to lower bit widths like int4 and even int2¹
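To make the dtype point concrete, here is an illustrative plain-Python sketch (no library; the helper names are mine) of symmetric int8 quantization, showing the difference between storing the integer codes and storing their dequantized float form:

```python
def quantize_int8(weights):
    """Map floats to int8 codes plus a shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.03]
codes, scale = quantize_int8(weights)
# `codes` are small integers in [-128, 127] — this is what gets stored
approx = dequantize(codes, scale)
# `approx` is the dequantized float view of the same values
```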

Easier to use, but slower

Although the API is simpler, it appears to be slower than quantizing the regular PyTorch way.

See benchmarking results.

Footnotes

  1. In my test, int4 still worked on MNIST, but going down to int2 led to really bad performance (TODO: move these results to a table) ↩