The main APIs lack documentation. Also, its README is not up to date with respect to its calibration API.
Dynamic Quantization
- Train a model
- Run `quanto.quantize(model, weights=quanto.qint8, activations=quanto.qint8)`
- (Optional, but recommended) Calibrate the input and output activations with some samples
- Run `freeze(model)` to obtain the quantized weights
Thoughts
- The API is much more straightforward than the manual PyTorch workflow: you only need to think in terms of `quantize`, `Calibration`, and `freeze`
- You don’t have to create a quantization-aware version of your model’s architecture
- The resulting weights are already stored in the target dtype, whereas the PyTorch way displays the quantized weights in their dequantized form
- This library can go down to lower bit-width types like int4 and even int2
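To make the dtype point above concrete, here is symmetric per-tensor int8 quantization in plain Python. The helper names and the example weights are illustrative, not quanto's internals: a shared scale maps each float to an 8-bit integer code, and dequantization multiplies the codes back by the scale.

```python
def quantize_int8(weights):
    """Map floats to int8 codes with a shared scale: q = round(w / scale)."""
    scale = max(abs(w) for w in weights) / 127  # 127 = max int8 magnitude
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats: w ≈ q * scale."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
print(q)  # int8 codes: [50, -127, 3, 100]
print(dequantize(q, scale))  # approximately the original floats
```

A quantized-storage library keeps the integer codes `q` (one byte each); a dequantized view shows `q * scale` floats instead.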
Easier to use, but slower
Although the API is simpler, it seems to perform worse in terms of speed than regular PyTorch quantization.
See benchmarking results.