- The quantized values are normalized to the range -1 to 1; QLoRA uses the 4-bit NormalFloat (NF4) data type together with double quantization.
- QLoRA reduces the memory footprint of model parameters, gradients, optimizer states (e.g., the two moment estimates kept by Adam-style optimizers), and activations.
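The normalization idea behind NF4 can be sketched as block-wise absmax quantization: each block of weights is scaled into [-1, 1] by its largest absolute value, then each value is snapped to the nearest entry of a 16-entry codebook. This is a minimal illustration, not the real implementation; in particular, the uniform 16-level grid below is a stand-in, since the actual NF4 levels are non-uniform quantiles of a normal distribution.

```python
import numpy as np

def quantize_block(weights, levels):
    """Absmax-normalize a block into [-1, 1], then snap each value
    to the nearest entry in a 16-level codebook."""
    scale = np.max(np.abs(weights))   # per-block quantization constant
    normalized = weights / scale      # now lies in [-1, 1]
    # index of the nearest codebook level for each value
    idx = np.abs(normalized[:, None] - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_block(idx, scale, levels):
    """Reconstruct approximate weights from indices and the block scale."""
    return levels[idx] * scale

# Illustrative uniform 16-level codebook (a stand-in for the true
# non-uniform NF4 quantile levels).
levels = np.linspace(-1.0, 1.0, 16)

block = np.random.randn(64).astype(np.float32)  # one 64-weight block
idx, scale = quantize_block(block, levels)
recovered = dequantize_block(idx, scale, levels)
```

Double quantization then quantizes the per-block `scale` constants themselves, saving a further fraction of a bit per parameter.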
- Model quantization reduces the precision of model parameters, which shrinks model size and speeds up inference while largely preserving the model's accuracy. Useful tools and libraries for model quantization include TensorFlow Lite and PyTorch.
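As a concrete example of the PyTorch tooling mentioned above, dynamic quantization converts the weights of selected layers to int8 while quantizing activations on the fly at inference time. This is a minimal sketch using PyTorch's built-in `quantize_dynamic` on a toy model:

```python
import torch
import torch.nn as nn

# A small float32 model to quantize (an illustrative toy network).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: Linear weights are stored as int8, and
# activations are quantized dynamically during inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
out = quantized(x)  # inference runs with int8 weight kernels
```

The output shape and semantics match the float model; only the storage and compute precision of the `Linear` layers change.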
- Common model quantization techniques include:
- Uniform quantization
- Non-uniform quantization
- Weight clustering
- Pruning and quantization
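The first technique in the list, uniform quantization, maps floats onto evenly spaced integer levels via a scale and zero point. A minimal sketch of 8-bit uniform (affine) quantization and its inverse, with illustrative helper names:

```python
import numpy as np

def uniform_quantize(x, num_bits=8):
    """Map float values onto 2**num_bits evenly spaced integer levels."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)       # step size between levels
    zero_point = qmin - round(float(x.min()) / scale) # integer level for 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def uniform_dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer levels."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale, zp = uniform_quantize(x)
x_hat = uniform_dequantize(q, scale, zp)  # close to x, within ~scale
```

Non-uniform quantization replaces the even spacing with levels matched to the weight distribution (as NF4 does), and weight clustering goes further by learning the levels with k-means.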