Lepton AI
Freemium · Cloud-native AI inference platform with per-minute compute billing, now integrated with NVIDIA DGX Cloud
What is Lepton AI?
Lepton AI is a cloud-native platform for deploying, serving, and scaling AI models with a developer-first focus on simplicity and fast iteration. Founded by former Alibaba AI researchers, it was acquired by NVIDIA in March 2025 and integrated into NVIDIA's DGX Cloud Lepton offering, which unifies Lepton's managed inference tooling with NVIDIA's global GPU supply across multiple cloud providers, giving developers access to NVIDIA Blackwell, H100, H200, and A100 GPUs through a single Lepton interface.

Lepton's core product is a serverless-style inference platform: deploy any open-weight model from Hugging Face, a custom PyTorch model, or one of Lepton's pre-deployed endpoints, and get a production-ready API in minutes. Pricing is usage-based and billed per minute of actual compute, with no idle charges when your endpoint scales to zero.

For pre-deployed LLM inference, Lepton offers aggressive per-token rates: Llama 3.2 3B at $0.03 per million tokens and Llama 3.1 8B at $0.07 per million tokens make it one of the cheapest inference options on the market. Storage for models, data, and logs is charged at $0.153 per GB per month. Where Lepton differentiates is flexibility: you can self-deploy any custom model, use it for non-LLM workloads, and mix inference with batch training jobs on the same platform.
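To make the workflow concrete, here is a minimal sketch of calling a pre-deployed endpoint. It assumes the endpoint speaks the OpenAI-compatible chat completions protocol, as Lepton's pre-deployed LLM endpoints do; the base URL and model id below are illustrative, so substitute the values shown in your Lepton dashboard.

```python
import os

from openai import OpenAI

# Pre-deployed Lepton endpoints expose an OpenAI-compatible API.
# The base URL and model id are illustrative; use the values from
# your Lepton dashboard and a real LEPTON_API_TOKEN.
client = OpenAI(
    base_url="https://llama3-1-8b.lepton.run/api/v1",  # hypothetical URL
    api_key=os.environ["LEPTON_API_TOKEN"],
)

resp = client.chat.completions.create(
    model="llama3-1-8b",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize Lepton AI in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```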
⚡ Quick Verdict
Best for: Developers who want cheap pay-per-compute inference with the flexibility to deploy custom models
Not ideal for: Teams that need the largest model catalog or the fastest LPU-style inference
Pricing: From $0.03 per million tokens · Compute billed per minute
Free tier: Starter credits available for new users
Standout: Cheap pay-per-token pricing plus flexible custom model hosting on NVIDIA infrastructure
Main limitation: Smaller catalog than Together AI or OpenRouter
Bottom line: Lepton scores 4.3/5 — a strong pick when you want low-cost Llama inference plus the option to deploy custom models, backed by NVIDIA's GPU supply.
Pricing
Pre-deployed LLM inference (per-token): Llama 3.2 3B at $0.03 per million tokens · Llama 3.1 8B at $0.07 per million tokens · Larger models priced proportionally. Among the cheapest per-token rates on the market.
Custom model deployment — Pay-per-compute: Billed by the minute for actual GPU usage. No idle charges when your endpoint scales to zero. Supports NVIDIA Blackwell, H100, H200, A100, and A10G GPUs.
Storage: $0.153 per GB per month for models, data, and logs stored on the Lepton platform.
DGX Cloud Lepton integration: Following NVIDIA's 2025 acquisition, Lepton is unified with NVIDIA DGX Cloud's global GPU supply across multiple cloud providers.
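As a back-of-envelope illustration of how these rates combine, here is a small cost sketch. The per-token and storage rates come from the figures above; the per-minute GPU rate is a placeholder, since actual compute rates vary by GPU type and are listed in the Lepton dashboard.

```python
# Rates from this page; the GPU rate is a PLACEHOLDER, not a published price.
TOKEN_RATE_PER_M = {"llama-3.2-3b": 0.03, "llama-3.1-8b": 0.07}  # $/1M tokens
STORAGE_RATE = 0.153      # $/GB-month
GPU_RATE_PER_MIN = 0.05   # hypothetical $/minute; varies by GPU type

def monthly_cost(tokens_m: float, model: str, gpu_minutes: float, storage_gb: float) -> float:
    """Estimate a month's bill: per-token inference + per-minute compute + storage."""
    return (
        tokens_m * TOKEN_RATE_PER_M[model]
        + gpu_minutes * GPU_RATE_PER_MIN
        + storage_gb * STORAGE_RATE
    )

# Example: 500M tokens on Llama 3.1 8B, 1,200 GPU-minutes of custom
# deployment, and 50 GB of stored models and logs.
print(f"${monthly_cost(500, 'llama-3.1-8b', 1200, 50):.2f}")  # $102.65
```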
Key Features
- Cloud-native inference platform with per-minute billing
- Pre-deployed Llama and other open LLMs from $0.03 per million tokens
- Custom model deployment from Hugging Face or PyTorch
- Scale-to-zero for cost efficiency
- NVIDIA DGX Cloud Lepton integration
- Access to Blackwell, H100, H200, A100 GPUs
- Supports LLM, vision, audio, and custom models
- Python SDK and REST API
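To show the REST side of the last item, here is a plain-HTTP sketch hitting the same OpenAI-compatible chat completions route with requests; the URL and model id are again illustrative.

```python
import os

import requests

# Plain-HTTP equivalent of the SDK call: pre-deployed endpoints serve an
# OpenAI-compatible REST interface. URL and model id are illustrative.
url = "https://llama3-2-3b.lepton.run/api/v1/chat/completions"  # hypothetical
headers = {"Authorization": f"Bearer {os.environ['LEPTON_API_TOKEN']}"}
payload = {
    "model": "llama3-2-3b",  # hypothetical model id
    "messages": [{"role": "user", "content": "Hello over REST"}],
    "max_tokens": 64,
}

r = requests.post(url, json=payload, headers=headers, timeout=30)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
```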
Pros & Cons
Pros
- Very cheap per-token pricing for pre-deployed Llama models
- NVIDIA backing means reliable GPU supply and new hardware access
- Flexible — serve any open-weight or custom model, not just a catalog
- Scale-to-zero saves money for bursty workloads
Cons
- Smaller ecosystem than Together AI or Replicate
- Less polished docs than AWS Bedrock or Vertex AI
- NVIDIA acquisition created some product direction uncertainty
FAQ
Is Lepton still independent after the NVIDIA acquisition?
Lepton was acquired by NVIDIA in March 2025 and integrated into the DGX Cloud Lepton product line. The platform continues to operate under the Lepton brand and remains available to external developers at lepton.ai, now with deeper access to NVIDIA GPU capacity across multiple cloud providers.
How does Lepton pricing compare to Together AI?
Lepton's pre-deployed Llama 3.2 3B at $0.03 per million tokens is cheaper than Together AI's equivalent tier, and Llama 3.1 8B at $0.07 per million is also competitive. For custom model hosting, Lepton's per-minute compute billing is simpler than Together's per-second pricing on dedicated endpoints.
Can I deploy my own fine-tuned model?
Yes. Lepton supports deploying any open-weight model from Hugging Face or a custom PyTorch model via its Python SDK. You package your model code, push it to Lepton's platform, and get a production endpoint with automatic scaling, making it a strong choice for teams that have fine-tuned Llama variants and need managed hosting.
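For a sense of what that looks like in code, here is a minimal sketch following the leptonai SDK's Photon pattern. The class structure and @Photon.handler decorator follow the SDK's documented style, but the model repo, dependency list, and handler shape here are illustrative, so verify the details against the current Lepton docs.

```python
from leptonai.photon import Photon

class FineTunedLlama(Photon):
    """Sketch of a custom model wrapper using the Photon pattern;
    the Hugging Face repo below is hypothetical."""

    # pip dependencies Lepton installs into the deployment image
    requirement_dependency = ["transformers", "torch"]

    def init(self):
        # Runs once per replica at startup: load the fine-tuned weights.
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="your-org/llama-3.1-8b-finetuned",  # hypothetical repo
        )

    @Photon.handler
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        # Exposed as an HTTP route on the deployed endpoint.
        out = self.pipe(prompt, max_new_tokens=max_new_tokens)
        return out[0]["generated_text"]
```

Deployment then goes through the lep CLI (lep photon create and lep photon run in the documented flow), after which the handler is reachable over HTTPS with your Lepton token; exact flags are worth confirming against the current docs.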
Does Lepton scale to zero?
Yes. Custom deployments on Lepton can scale to zero when idle, meaning you only pay for compute while requests are actually being processed. That is a major cost advantage for bursty or low-traffic workloads compared to always-on dedicated endpoints on AWS SageMaker or Azure ML.
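A rough illustration of the saving, using a hypothetical per-minute GPU rate (actual rates vary by GPU type):

```python
# Hypothetical $0.05/min GPU rate; an endpoint that serves ~2 hours of
# actual traffic per day, compared over a 30-day month.
rate = 0.05                          # $/min, placeholder rate
scale_to_zero = 2 * 60 * rate * 30   # pay only for active minutes
always_on = 24 * 60 * rate * 30      # dedicated endpoint, 24/7
print(scale_to_zero, always_on)      # 180.0 vs 2160.0 dollars/month
```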
What GPUs are available?
As part of NVIDIA DGX Cloud Lepton, the platform provides access to NVIDIA Blackwell, H200, H100, A100 80GB, and A10G GPUs across multiple cloud regions. Availability of the newest Blackwell GPUs is tighter, but Lepton's direct NVIDIA supply gives it an edge over third-party providers.
Is Lepton good for non-LLM workloads?
Yes — this is one of Lepton's strengths compared to LLM-only providers. Lepton can host vision models, speech models, recommendation systems, and custom PyTorch pipelines. For teams with mixed AI workloads, Lepton simplifies infrastructure by consolidating on one platform.
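As a hedged illustration of the non-LLM case, the same Photon pattern from the fine-tuning answer above can wrap a vision model; the checkpoint choice and handler shape are illustrative.

```python
from leptonai.photon import Photon

class ImageClassifier(Photon):
    # Same Photon pattern as the LLM sketch; only the pipeline differs.
    requirement_dependency = ["transformers", "torch", "pillow"]

    def init(self):
        from transformers import pipeline
        # Illustrative checkpoint; any Hugging Face vision model works the same way.
        self.clf = pipeline("image-classification", model="google/vit-base-patch16-224")

    @Photon.handler
    def classify(self, image_url: str) -> list:
        # transformers pipelines accept image URLs directly; returns top labels.
        return self.clf(image_url)
```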
📋 Good to know
Getting started: Sign up at lepton.ai, top up credits, and deploy a pre-built endpoint from the catalog or push a custom model via the Python SDK.
Data privacy: Lepton does not train on your data. Enterprise agreements are available for regulated workloads.
Upgrade path: Move from pre-deployed endpoints to custom model hosting when you need fine-tuned weights.
Learning curve: Moderate. The Python SDK is straightforward, but custom deployment involves more concepts than pure inference APIs.