Architecture

What is Quantization?

Compressing AI models to use less memory and run faster with minimal quality loss.

Definition

Quantization reduces the precision of model weights from 32-bit or 16-bit floating-point numbers to lower-precision formats (8-bit, 4-bit, or even 2-bit). This dramatically reduces memory usage and increases inference speed, making it possible to run large models on consumer hardware with minimal quality degradation.
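The core idea can be shown in a few lines. This is a minimal sketch of symmetric 8-bit quantization of a single weight tensor; the tensor values and the per-tensor scaling scheme are illustrative assumptions, not the exact method any particular library uses.

```python
import numpy as np

# Hypothetical weight tensor standing in for one layer of a model.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)

# Symmetric 8-bit quantization: map [-max|w|, +max|w|] onto integers [-127, 127].
scale = np.max(np.abs(weights)) / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At inference time, dequantize to approximate the original values.
dequant = q.astype(np.float32) * scale

# Each weight now costs 1 byte instead of 4, and the round-trip
# error is bounded by half a quantization step.
max_err = np.max(np.abs(weights - dequant))
```

Each float32 weight (4 bytes) is stored as an int8 (1 byte) plus one shared scale factor, which is where the memory savings come from.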

💡 Example

Running Llama 3 70B at 16-bit precision requires about 140GB of VRAM (multiple expensive GPUs). With 4-bit quantization, it fits in about 35GB, runnable on a single high-end consumer GPU. The quality loss is typically 1-3% on standard benchmarks.
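The VRAM figures above follow directly from the parameter count. A quick back-of-the-envelope check (weights only, ignoring activations and KV cache, which add overhead in practice):

```python
PARAMS = 70e9  # Llama 3 70B parameter count

def weight_gigabytes(params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (decimal)."""
    return params * bits_per_weight / 8 / 1e9

fp16_gb = weight_gigabytes(PARAMS, 16)  # 16-bit: 140 GB
int4_gb = weight_gigabytes(PARAMS, 4)   # 4-bit:   35 GB
print(fp16_gb, int4_gb)
```

Halving the bits halves the memory, so going from 16-bit to 4-bit yields the 4x reduction cited above.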

Related concepts

LLM (Large Language Model)

A type of AI trained on massive text datasets to understand and generate human language.

LoRA (Low-Rank Adaptation)

An efficient method to fine-tune AI models using much less compute and memory.

Inference

The process of running a trained AI model to generate predictions or outputs.

Open-Source AI

AI models with publicly available weights that anyone can download and run.


Explore AI tools

Find tools that use quantization in practice.
