Lepton AI
Freemium · Cloud-native AI inference platform with per-minute compute billing, now integrated with NVIDIA DGX Cloud
What is Lepton AI?
Lepton AI is a cloud-native platform for deploying, serving, and scaling AI models with a developer-first focus on simplicity and fast iteration. Founded by former Alibaba AI researchers, it was acquired by NVIDIA in March 2025 and integrated into NVIDIA's DGX Cloud Lepton offering, which unifies Lepton's managed inference tooling with NVIDIA's global GPU supply across multiple cloud providers, giving developers access to NVIDIA Blackwell, H100, H200, and A100 GPUs through a single Lepton interface.

Lepton's core product is a serverless-style inference platform: deploy any open-weight model from Hugging Face, a custom PyTorch model, or one of Lepton's pre-deployed endpoints, and get a production-ready API in minutes. Pricing is usage-based and billed per minute of actual compute, with no idle charges when your endpoint scales to zero.

For pre-deployed LLM inference, Lepton offers aggressive per-token rates: Llama 3.2 3B at $0.03 per million tokens and Llama 3.1 8B at $0.07 per million tokens make it one of the cheapest inference options on the market. Storage for models, data, and logs is charged at $0.153 per GB per month. Where Lepton differentiates is flexibility: you can self-deploy any custom model, use it for non-LLM workloads, and mix inference with batch training jobs on the same platform.
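To make the workflow concrete, here is a minimal sketch of calling a pre-deployed endpoint. It assumes the endpoint speaks the OpenAI-compatible chat completions protocol, as Lepton's pre-deployed LLM endpoints do; the base URL and model id below are illustrative, so substitute the values shown in your Lepton dashboard.

```python
import os

from openai import OpenAI

# Pre-deployed Lepton endpoints expose an OpenAI-compatible API.
# The base URL and model id are illustrative; use the values from
# your Lepton dashboard and a real LEPTON_API_TOKEN.
client = OpenAI(
    base_url="https://llama3-1-8b.lepton.run/api/v1",  # hypothetical URL
    api_key=os.environ["LEPTON_API_TOKEN"],
)

resp = client.chat.completions.create(
    model="llama3-1-8b",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize Lepton AI in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```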
⚡ Quick Verdict
Best for: Developers who want cheap pay-per-compute inference with the flexibility to deploy custom models
Not ideal for: Teams that need the largest model catalog or the fastest LPU-style inference
Pricing: From $0.03 per million tokens · Compute billed per minute
Free tier: Starter credits available for new users
Standout: Cheap pay-per-token pricing plus flexible custom model hosting on NVIDIA infrastructure
Main limitation: Smaller catalog than Together AI or OpenRouter
Bottom line: Lepton scores 4.3/5 — a strong pick when you want low-cost Llama inference plus the option to deploy custom models, backed by NVIDIA's GPU supply.
Pricing
Pre-deployed LLM inference (per-token): Llama 3.2 3B at $0.03 per million tokens · Llama 3.1 8B at $0.07 per million tokens · Larger models priced proportionally. Among the cheapest per-token rates on the market.
Custom model deployment — Pay-per-compute: Billed by the minute for actual GPU usage. No idle charges when your endpoint scales to zero. Supports NVIDIA Blackwell, H100, H200, A100, and A10G GPUs.
Storage: $0.153 per GB per month for models, data, and logs stored on the Lepton platform.
DGX Cloud Lepton integration: Following NVIDIA's 2025 acquisition, Lepton is unified with NVIDIA DGX Cloud's global GPU supply across multiple cloud providers.
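As a back-of-envelope illustration of how these rates combine, here is a small cost sketch. The per-token and storage rates come from the figures above; the per-minute GPU rate is a placeholder, since actual compute rates vary by GPU type and are listed in the Lepton dashboard.

```python
# Rates from this page; the GPU rate is a PLACEHOLDER, not a published price.
TOKEN_RATE_PER_M = {"llama-3.2-3b": 0.03, "llama-3.1-8b": 0.07}  # $/1M tokens
STORAGE_RATE = 0.153      # $/GB-month
GPU_RATE_PER_MIN = 0.05   # hypothetical $/minute; varies by GPU type

def monthly_cost(tokens_m: float, model: str, gpu_minutes: float, storage_gb: float) -> float:
    """Estimate a month's bill: per-token inference + per-minute compute + storage."""
    return (
        tokens_m * TOKEN_RATE_PER_M[model]
        + gpu_minutes * GPU_RATE_PER_MIN
        + storage_gb * STORAGE_RATE
    )

# Example: 500M tokens on Llama 3.1 8B, 1,200 GPU-minutes of custom
# deployment, and 50 GB of stored models and logs.
print(f"${monthly_cost(500, 'llama-3.1-8b', 1200, 50):.2f}")  # $102.65
```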
Key Features
- Cloud-native inference platform with per-minute billing
- Pre-deployed Llama and other open LLMs from $0.03 per million tokens
- Custom model deployment from Hugging Face or PyTorch
- Scale-to-zero for cost efficiency
- NVIDIA DGX Cloud Lepton integration
- Access to Blackwell, H100, H200, A100 GPUs
- Supports LLM, vision, audio, and custom models
- Python SDK and REST API
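To show the REST side of the last item, here is a plain-HTTP sketch hitting the same OpenAI-compatible chat completions route with requests; the URL and model id are again illustrative.

```python
import os

import requests

# Plain-HTTP equivalent of the SDK call: pre-deployed endpoints serve an
# OpenAI-compatible REST interface. URL and model id are illustrative.
url = "https://llama3-2-3b.lepton.run/api/v1/chat/completions"  # hypothetical
headers = {"Authorization": f"Bearer {os.environ['LEPTON_API_TOKEN']}"}
payload = {
    "model": "llama3-2-3b",  # hypothetical model id
    "messages": [{"role": "user", "content": "Hello over REST"}],
    "max_tokens": 64,
}

r = requests.post(url, json=payload, headers=headers, timeout=30)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
```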
Pros & Cons
Pros
- Very cheap per-token pricing for pre-deployed Llama models
- NVIDIA backing means reliable GPU supply and new hardware access
- Flexible — serve any open-weight or custom model, not just a catalog
- Scale-to-zero saves money for bursty workloads
Cons
- Smaller ecosystem than Together AI or Replicate
- Less polished docs than AWS Bedrock or Vertex AI
- NVIDIA acquisition created some product direction uncertainty
FAQ
Is Lepton still independent after the NVIDIA acquisition?
Lepton was acquired by NVIDIA in March 2025 and integrated into the DGX Cloud Lepton product line. The platform continues to operate under the Lepton brand and remains available to external developers at lepton.ai, now with deeper access to NVIDIA GPU capacity across multiple cloud providers.
How does Lepton pricing compare to Together AI?
Lepton's pre-deployed Llama 3.2 3B at $0.03 per million tokens is cheaper than Together AI's equivalent tier, and Llama 3.1 8B at $0.07 per million is also competitive. For custom model hosting, Lepton's per-minute compute billing is simpler than Together's per-second pricing on dedicated endpoints.
Can I deploy my own fine-tuned model?
Yes. Lepton supports deploying any open-weight model from Hugging Face or a custom PyTorch model via its Python SDK. You package your model code, push it to Lepton's platform, and get a production endpoint with automatic scaling, making it a strong choice for teams that have fine-tuned Llama variants and need managed hosting.
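For a sense of what that looks like in code, here is a minimal sketch following the leptonai SDK's Photon pattern. The class structure and @Photon.handler decorator follow the SDK's documented style, but the model repo, dependency list, and handler shape here are illustrative, so verify the details against the current Lepton docs.

```python
from leptonai.photon import Photon

class FineTunedLlama(Photon):
    """Sketch of a custom model wrapper using the Photon pattern;
    the Hugging Face repo below is hypothetical."""

    # pip dependencies Lepton installs into the deployment image
    requirement_dependency = ["transformers", "torch"]

    def init(self):
        # Runs once per replica at startup: load the fine-tuned weights.
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="your-org/llama-3.1-8b-finetuned",  # hypothetical repo
        )

    @Photon.handler
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        # Exposed as an HTTP route on the deployed endpoint.
        out = self.pipe(prompt, max_new_tokens=max_new_tokens)
        return out[0]["generated_text"]
```

Deployment then goes through the lep CLI (lep photon create and lep photon run in the documented flow), after which the handler is reachable over HTTPS with your Lepton token; exact flags are worth confirming against the current docs.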
Does Lepton scale to zero?
Yes. Custom deployments on Lepton can scale to zero when idle, meaning you only pay for compute while requests are actually being processed. That is a major cost advantage for bursty or low-traffic workloads compared to always-on dedicated endpoints on AWS SageMaker or Azure ML.
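A rough illustration of the saving, using a hypothetical per-minute GPU rate (actual rates vary by GPU type):

```python
# Hypothetical $0.05/min GPU rate; an endpoint that serves ~2 hours of
# actual traffic per day, compared over a 30-day month.
rate = 0.05                          # $/min, placeholder rate
scale_to_zero = 2 * 60 * rate * 30   # pay only for active minutes
always_on = 24 * 60 * rate * 30      # dedicated endpoint, 24/7
print(scale_to_zero, always_on)      # 180.0 vs 2160.0 dollars/month
```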
What GPUs are available?
As part of NVIDIA DGX Cloud Lepton, the platform provides access to NVIDIA Blackwell, H200, H100, A100 80GB, and A10G GPUs across multiple cloud regions. Availability of the newest Blackwell GPUs is tighter, but Lepton's direct NVIDIA supply gives it an edge over third-party providers.
Is Lepton good for non-LLM workloads?
Yes — this is one of Lepton's strengths compared to LLM-only providers. Lepton can host vision models, speech models, recommendation systems, and custom PyTorch pipelines. For teams with mixed AI workloads, Lepton simplifies infrastructure by consolidating on one platform.
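As a hedged illustration of the non-LLM case, the same Photon pattern from the fine-tuning answer above can wrap a vision model; the checkpoint choice and handler shape are illustrative.

```python
from leptonai.photon import Photon

class ImageClassifier(Photon):
    # Same Photon pattern as the LLM sketch; only the pipeline differs.
    requirement_dependency = ["transformers", "torch", "pillow"]

    def init(self):
        from transformers import pipeline
        # Illustrative checkpoint; any Hugging Face vision model works the same way.
        self.clf = pipeline("image-classification", model="google/vit-base-patch16-224")

    @Photon.handler
    def classify(self, image_url: str) -> list:
        # transformers pipelines accept image URLs directly; returns top labels.
        return self.clf(image_url)
```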
📋 Good to know
Getting started: Sign up at lepton.ai, top up credits, and deploy a pre-built endpoint from the catalog or push a custom model via the Python SDK.
Data privacy: Lepton does not train on your data. Enterprise agreements are available for regulated workloads.
Upgrade path: Move from pre-deployed endpoints to custom model hosting when you need fine-tuned weights.
Learning curve: Moderate. The Python SDK is straightforward, but custom deployment involves more concepts than pure inference APIs.