
What is Quantization?

Last updated May 2026

Compressing AI models to use less memory and run faster with minimal quality loss.

Definition

Quantization reduces the precision of a model's weights from 32-bit or 16-bit floating-point numbers to lower-precision formats (8-bit, 4-bit, or even 2-bit). This cuts memory usage roughly in proportion to the bit width and usually speeds up inference, making it possible to run large models on consumer hardware, often with only a small loss in quality.
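
To make the definition concrete, here is a minimal sketch of symmetric "absmax" 8-bit quantization using NumPy. It assumes one scale for the whole tensor and uses random numbers as a stand-in for real weights; the function names are made up for this illustration and do not come from any particular library.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 codes using a single absmax scale (illustrative)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Synthetic stand-in for a slice of real model weights (assumption).
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print("float32 bytes:", weights.nbytes)  # 4 bytes per weight
print("int8 bytes:   ", q.nbytes)        # 1 byte per weight
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The stored codes take a quarter of the space of the float32 originals, and each restored weight differs from the original by at most about half a quantization step.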

💡 Example

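Running Llama 3 70B with 16-bit weights requires about 140GB of VRAM for the weights alone (multiple expensive GPUs). With 4-bit quantization, the weights shrink to about 35GB. That is still more than the 24GB of a typical high-end consumer GPU, but it fits across two such cards or in a machine with enough unified memory. The quality loss is typically 1-3% on benchmarks, though it varies with the quantization method and the task.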
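
The numbers in this example follow from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. A small sketch, assuming roughly 70 billion parameters and counting the weights only (the helper name is hypothetical):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9  # approximate parameter count of a 70B model (assumption)

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params_70b, bits):.0f} GB")
# 16-bit: 140 GB
#  8-bit: 70 GB
#  4-bit: 35 GB
```

Actual usage at inference time is higher, because activations and the attention cache need memory on top of the weights.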

Related concepts

LLM (Large Language Model)

A type of AI trained on massive text datasets to understand and generate human language.

LoRA (Low-Rank Adaptation)

An efficient method to fine-tune AI models using much less compute and memory.

Inference

The process of running a trained AI model to generate predictions or outputs.


Why this matters

Quantization compresses AI models so they can run on smaller hardware: your laptop instead of a data center. This is critical for local AI tools like Ollama and llamafile, where you run models privately without sending data to cloud APIs.

Real-world example

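A full Llama 3.1 70B model at 16-bit precision needs ~140GB of memory. Quantized to 4-bit (Q4), it fits in ~40GB, runnable on a high-end desktop with enough RAM or VRAM. The figure is above the 35GB that exactly 4 bits per weight would give, because practical 4-bit formats also store scaling factors alongside the weights. You trade some quality for massive efficiency gains. For many everyday tasks, the quality difference is small.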
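
A rough sketch of where a figure like ~40GB can come from. It assumes a simple block-wise layout in which every 32 weights share one 16-bit scale; real Q4 variants differ in their details, so treat the block size and scale width here as illustrative assumptions.

```python
def effective_bits_per_weight(code_bits: int, block_size: int, scale_bits: int = 16) -> float:
    """Average bits per weight when each block of weights shares one stored scale."""
    return code_bits + scale_bits / block_size

# Hypothetical layout: 4-bit codes, blocks of 32 weights, one 16-bit scale per block.
bpw = effective_bits_per_weight(code_bits=4, block_size=32)

num_params = 70e9  # approximate parameter count (assumption)
size_gb = num_params * bpw / 8 / 1e9

print(f"{bpw} bits per weight")  # 4.5 bits per weight
print(f"{size_gb:.1f} GB")       # 39.4 GB
```

The per-block scales add half a bit per weight in this layout, which accounts for most of the gap between 35GB and ~40GB.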

See it in action

Open-Source AI

AI models with publicly available weights that anyone can download and run.


How does quantization affect AI model quality?

Quantization reduces model file size and speeds up inference by using lower-precision numbers, but introduces small accuracy losses. Going from 16-bit to 8-bit typically has minimal quality impact. More aggressive 4-bit quantization can cause noticeable degradation, especially on complex reasoning tasks.
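
One way to see why fewer bits cost more accuracy is to measure the round-trip error of the weights themselves. The sketch below uses synthetic weights and a simple symmetric quantizer written for this illustration; weight error is only a proxy, and it does not translate directly into benchmark scores.

```python
import numpy as np

def quantization_rmse(weights: np.ndarray, bits: int) -> float:
    """RMS round-trip error of symmetric absmax quantization at `bits` bits."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = float(np.max(np.abs(weights))) / levels
    restored = np.clip(np.round(weights / scale), -levels, levels) * scale
    return float(np.sqrt(np.mean((weights - restored) ** 2)))

rng = np.random.default_rng(0)
# Synthetic stand-in for real model weights (assumption).
weights = rng.normal(0.0, 0.02, size=100_000)

errors = {bits: quantization_rmse(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.6f}")
```

Each halving of the bit width leaves far fewer representable values, so the error grows quickly. Practical methods reduce it with per-block scales and calibration data.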

When should you use a quantized model?

Use quantized models when running AI locally on consumer hardware, when inference speed matters more than maximum accuracy, or when deploying to edge devices with limited memory. For production applications requiring the highest quality, full-precision models on powerful GPUs are preferable.

What quantization formats are commonly used?

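Common options include GGUF, GPTQ, AWQ, and bitsandbytes. GGUF is the file format used by llama.cpp, which runs on CPUs and can also offload layers to a GPU; it is the most popular choice for running models locally on personal computers. GPTQ and AWQ are quantization methods commonly used for GPU-accelerated deployment, and bitsandbytes is a library that quantizes weights to 8-bit or 4-bit when a model is loaded onto a GPU.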
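
If you are unsure whether a downloaded file is GGUF, you can check its header. This is a small sketch assuming the common little-endian layout, where the file starts with the four bytes `GGUF` followed by a 32-bit version number; the function name is hypothetical.

```python
import struct
from typing import Optional

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file

def read_gguf_version(path: str) -> Optional[int]:
    """Return the GGUF version number, or None if the file does not look like GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumes a little-endian unsigned 32-bit version field right after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("model.gguf")` on a local file (the path is a placeholder) returns an integer for GGUF files and `None` for anything else, such as a safetensors checkpoint.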