What is Inference?

Last updated May 2026

The process of running a trained AI model to generate predictions or outputs.

Definition

Inference is the process of using a trained AI model to generate outputs from new inputs. When you send a prompt to ChatGPT and receive a response, that is inference. The compute and memory that inference consumes, and the latency it adds, are major factors in AI API pricing and deployment decisions. Faster inference means quicker responses and, when the speedup comes from using hardware more efficiently, lower cost per request.

💡 Example

Every time you press "Send" in ChatGPT, the underlying model (GPT-4, for example) performs inference: it processes your tokens through its neural network layers to generate a response. API pricing reflects the compute cost of this inference.
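The generation loop described above can be sketched with a toy stand-in for a model. This is only an illustration: `TOY_MODEL`, its scores, and the `generate` helper are hypothetical, and a real LLM computes next-token scores with a neural network rather than a lookup table. The point is the shape of inference: the weights stay fixed, and the model is called once per generated token.

```python
# Toy stand-in for a trained model: fixed "weights" that map the current
# token to scores for each possible next token. All values are made up.
TOY_MODEL = {
    "hello": {"world": 0.8, "<end>": 0.2},
    "world": {"again": 0.6, "<end>": 0.4},
    "again": {"<end>": 0.9, "hello": 0.1},
}


def generate(model, prompt_tokens, max_tokens=10):
    """Run inference: repeatedly pick the highest-scoring next token."""
    token = prompt_tokens[-1]  # this toy model only looks at the last token
    output = []
    for _ in range(max_tokens):
        scores = model[token]                # stands in for the forward pass
        token = max(scores, key=scores.get)  # greedy decoding
        if token == "<end>":
            break
        output.append(token)
    return output


print(generate(TOY_MODEL, ["hello"]))
```

Nothing in `TOY_MODEL` changes while `generate` runs, which is the key difference from training: inference only reads the learned values.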

Related concepts

LLM (Large Language Model)

A type of AI trained on massive text datasets to understand and generate human language.

Token

The basic unit of text that AI models process — roughly 4 characters or 0.75 words.

Why this matters

Every question you ask ChatGPT triggers inference. Inference speed and cost are among the main factors that determine how AI tools are priced and how responsive they feel: faster inference means snappier tools.

Real-world example

Groq achieves inference speeds of 500+ tokens/second, roughly 10x faster than standard GPT-4o responses. Differences like this are why some tools feel instant while others lag. Together AI and Replicate also compete on inference speed and cost for developers.
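To see what a throughput gap like this means for waiting time, here is a rough back-of-the-envelope sketch. The `generation_seconds` helper is hypothetical, the 50 tokens/second figure is simply the 500 tokens/second number divided by the "roughly 10x" ratio above, and the estimate ignores network latency and the time before the first token arrives.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimate how long it takes to stream a response of a given length."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Assumed response length of 500 tokens, used only for illustration.
fast = generation_seconds(500, 500.0)  # throughput quoted above
slow = generation_seconds(500, 50.0)   # assumed baseline, 10x slower

print(f"fast: {fast:.1f}s, slow: {slow:.1f}s")
```

Under these assumptions, the same 500-token answer takes about one second on the faster service and about ten seconds on the slower one.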

API (Application Programming Interface)

A way for developers to programmatically access AI models in their own applications.

Why does inference speed matter for AI tools?

Inference speed determines how quickly an AI tool responds to your requests. Faster inference means shorter wait times for chat responses, quicker image generation, and more practical real-time applications like voice assistants and coding autocomplete.

What affects the cost of AI inference?

Inference cost depends on model size, hardware requirements (GPU type and quantity), input and output token counts, and optimization techniques used. Smaller or quantized models are cheaper to run. Providers pass these costs to users through per-token pricing or subscription tiers.
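As a rough illustration of per-token pricing, the sketch below estimates the cost of a single request. The `request_cost` helper and the prices are hypothetical and do not reflect any provider's actual rates. Charging different rates for input and output tokens is also an assumption made for this example.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one request in dollars under per-token pricing."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical prices: $2.00 per million input tokens, $8.00 per million output.
cost = request_cost(1_000, 500, 2.00, 8.00)
print(f"${cost:.4f} per request")
```

With these made-up prices, a request with 1,000 input tokens and 500 output tokens costs a little over half a cent.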

What is the difference between inference and training in AI?

Training is the up-front process of teaching a model by processing massive datasets, which requires enormous compute resources. Inference is the ongoing process of using the trained model to generate responses. It is far cheaper per request, but the total adds up at high usage volumes.
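A small sketch makes the "adds up" point concrete by asking how many requests it takes for cumulative inference spend to match an up-front training bill. Both dollar figures and the `requests_to_match_training` helper are hypothetical, chosen only to show the arithmetic.

```python
import math


def requests_to_match_training(training_cost: float, cost_per_request: float) -> int:
    """Number of requests at which total inference spend reaches training cost."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return math.ceil(training_cost / cost_per_request)


# Hypothetical figures: $1,000,000 to train, $0.25 per inference request.
break_even = requests_to_match_training(1_000_000.0, 0.25)
print(f"{break_even:,} requests")
```

A popular service can pass a break-even point like this quickly, which is why providers invest in making each request cheaper.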