What is Quantization?
Compressing AI models to use less memory and run faster with minimal quality loss.
Definition
Quantization reduces the precision of model weights from 32-bit or 16-bit floating point numbers to lower-precision formats (8-bit, 4-bit, or even 2-bit). This dramatically reduces memory usage and increases inference speed, making it possible to run large models on consumer hardware with minimal quality degradation.
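The core idea can be shown with a minimal sketch of symmetric 8-bit quantization: each float weight is divided by a scale factor and rounded to an integer in [-127, 127], then multiplied back by the scale at use time. This is an illustrative toy, not any particular library's implementation; real quantizers typically work per-channel or per-block and use more sophisticated calibration.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (toy sketch)."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each int8 weight is 1 byte instead of 4 (FP32) or 2 (FP16);
# rounding error per weight is bounded by about scale / 2.
print(np.max(np.abs(w - w_hat)))
```

The stored model then keeps only the int8 tensor plus one float scale, which is where the memory savings come from.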
💡 Example
Running Llama 3 70B at 16-bit precision requires about 140GB of VRAM (multiple expensive GPUs). With 4-bit quantization, it fits in roughly 35GB, runnable on a single high-end consumer GPU. The quality loss is typically 1-3% on benchmarks.
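The numbers above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. A back-of-envelope helper (ignoring activations, KV cache, and quantization metadata such as scale factors):

```python
def vram_gb(n_params, bits_per_weight):
    """Approximate weight memory in GB; ignores activations and overhead."""
    return n_params * bits_per_weight / 8 / 1e9

n = 70e9  # Llama 3 70B parameter count
print(vram_gb(n, 16))  # 140.0 GB at FP16
print(vram_gb(n, 4))   # 35.0 GB at 4-bit
```

Real deployments need some headroom beyond this estimate for activations and the KV cache.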
Related concepts
A type of AI trained on massive text datasets to understand and generate human language.
An efficient method to fine-tune AI models using much less compute and memory.
The process of running a trained AI model to generate predictions or outputs.
AI models with publicly available weights that anyone can download and run.