The main APIs lack documentation, and its README is out of date with respect to its calibration API.

Dynamic Quantization

  1. Train a model
  2. Run quanto.quantize(model, weights=quanto.qint8, activations=quanto.qint8)
  3. (Optional; but recommended) Calibrate the input and output activations with some samples
  4. Run quanto.freeze(model) to replace the float weights with their quantized counterparts

Thoughts

  • The API is much more straightforward than the manual PyTorch workflow — you only need to think in terms of quantize, Calibration, and freeze
  • You don’t have to rewrite your model into a quantization-aware architecture
  • The resulting weights are already in the target dtype, whereas the PyTorch way displays the quantized weights in their dequantized form
  • This library can go down to lower bit widths like int4 and even int2¹
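To make the dtype point concrete, here is an illustrative plain-Python sketch (no library; the helper names are mine) of symmetric int8 quantization, showing the difference between storing the integer codes and storing their dequantized float form:

```python
def quantize_int8(weights):
    """Map floats to int8 codes plus a shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.03]
codes, scale = quantize_int8(weights)
# `codes` are small integers in [-128, 127] — this is what gets stored
approx = dequantize(codes, scale)
# `approx` is the dequantized float view of the same values
```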

Easier to use, but slower

Although the API is simpler, it appears to be slower than quantizing the regular PyTorch way.

See benchmarking results.

Footnotes

  1. In my test, int4 still worked on MNIST, but going down to int2 led to really bad performance (TODO: move these results to a table) ↩