Guide
Best LLM API & Inference Platforms for 2026
If you are shipping an AI feature in production, the model is only half the decision. The other half is the inference platform that actually serves it: the API that turns a prompt into tokens, the hardware that decides whether the first token lands in 80 milliseconds or 8 seconds, and the pricing meter that determines whether your unit economics survive contact with real traffic. The best LLM API for a chatbot demo is rarely the best one for a high-throughput RAG pipeline, a fine-tuned classifier, or a video model that needs an H100 for ninety seconds at a time.
This guide ranks ten cloud inference platforms developers call from production code. These are not desktop apps you run on your own laptop — if that is what you want, see our companion guide to the best local LLM tools. These are hosted, usage-based services with OpenAI-compatible endpoints, generous free credits to start, and per-token or per-GPU-hour billing. We verified every price below against each vendor's official pricing page in June 2026, and where a vendor uses dynamic or marketplace pricing, we say so plainly rather than inventing a number.
TL;DR — the quick picks
- Best overall: Together AI — The widest catalog of open models with strong serverless prices, dedicated endpoints, fine-tuning, and GPU clusters under one roof — the safest default for most teams.
- Fastest: Groq — Custom LPU hardware delivers 500–1,000 tokens/sec on Llama and gpt-oss models, with Llama 3.1 8B at just $0.05/$0.08 per 1M tokens.
- Most models: OpenRouter — One OpenAI-compatible API routes to 400+ models across 60+ providers at pass-through pricing, with automatic failover and 25+ free models.
- Best for RAG: Cohere — Enterprise-grade Command, Embed 4, and Rerank models purpose-built for retrieval, with private deployment and strong multilingual support.
- Run any model: Replicate — Per-second billing on any open model — text, image, or video — with thousands of community models and a one-line deploy path.
Top picks at a glance
Together AI
The widest catalog of open models with strong serverless prices, dedicated endpoints, fine-tuning, and GPU clusters under one roof — the safest default for most teams.
Read review →Groq
Custom LPU hardware delivers 500–1,000 tokens/sec on Llama and gpt-oss models, with Llama 3.1 8B at just $0.05/$0.08 per 1M tokens.
Read review →OpenRouter
One OpenAI-compatible API routes to 400+ models across 60+ providers at pass-through pricing, with automatic failover and 25+ free models.
Read review →Cohere
Enterprise-grade Command, Embed 4, and Rerank models purpose-built for retrieval, with private deployment and strong multilingual support.
Read review →Replicate
Per-second billing on any open model — text, image, or video — with thousands of community models and a one-line deploy path.
Read review →How we ranked them
We score every tool with our 8-parameter framework and verify pricing on each vendor's official page (last checked June 2026). Rankings are independent and never paid for.
The state of the market in 2026
The hosted-inference market in 2026 has split into two clear camps, and picking the right one starts with knowing which camp your workload belongs to. The first camp is token-metered open-model inference: providers like Groq, Together AI, Fireworks, Novita, and Nebius run open-weight models (Llama, DeepSeek, Qwen, GLM, gpt-oss) on their own clusters and bill you per million tokens, often a fraction of a cent for an 8B model. Competition here has been brutal in your favor — Llama 3.1 8B now serves for roughly $0.02–$0.05 per million input tokens at the cheapest providers, an order of magnitude below 2024 prices.
The second camp is compute and aggregation: Replicate and the GPU marketplaces (Anyscale, DGX Cloud Lepton, Nebius clusters) rent you raw GPU-seconds for any model — including image and video — while OpenRouter sits on top of everyone as a single OpenAI-compatible router across 400+ models. The headline trend is hardware specialization: Groq's LPU and similar custom silicon now deliver 500–1,000 tokens per second on mid-size models, fast enough that latency, not raw cost, is increasingly the differentiator. Free credits remain the norm for getting started, and OpenAI-compatible APIs have become table stakes, which means switching providers is now mostly a base-URL change.
1. Groq — Best for fastest inference (LPU hardware)
Note: Open models on custom LPU silicon; 500–1,000 tokens/sec; batch and prompt caching at 50% off · Pricing: Llama 3.1 8B Instant $0.05 in / $0.08 out per 1M; Llama 3.3 70B $0.59 / $0.79; gpt-oss 120B $0.15 / $0.60; Llama 4 Scout $0.11 / $0.34 per 1M tokens · Free API access to start (complimentary developer tier on signup)
Groq sells one thing better than anyone else: speed. Its inference runs on a custom Language Processing Unit (LPU) rather than general-purpose GPUs, and the result is throughput that routinely lands between 500 and 1,000 tokens per second on mid-size open models — fast enough that a streamed response feels instantaneous and agentic loops that chain many calls stop being painful. For real-time chat, voice agents, and any workload where time-to-first-token is the user-facing metric, Groq is the clearest pick on this list.
Pricing is aggressive and transparent. Llama 3.1 8B Instant runs $0.05 per million input tokens and $0.08 output; Llama 3.3 70B Versatile is $0.59 / $0.79; OpenAI's open gpt-oss 120B is $0.15 / $0.60; and Llama 4 Scout sits at $0.11 / $0.34. Cached input tokens are billed at half rate, and an asynchronous Batch API knocks 50% off for non-urgent jobs. Built-in tools — web search at $5 per 1,000 requests, code execution at $0.18/hour, Whisper transcription at $0.04/hour — round out the platform without forcing you to a separate vendor.
The trade-off is breadth. Groq curates a tight roster of open models rather than offering everything, and the very newest frontier-scale checkpoints arrive later (some, like Minimax M2.5, are enterprise-only). If your application needs an exotic model or a proprietary frontier model, you will pair Groq with a broader provider. But for serving popular open models at the lowest possible latency, nothing here matches it.
Pros
- Industry-leading inference speed (500–1,000 tok/sec) via custom LPU hardware
- Very low per-token pricing on popular open models
- OpenAI-compatible API — drop-in for most existing code
- Free developer access to start, plus 50% batch and cached-input discounts
- Built-in web search, code execution, and speech-to-text
Cons
- Smaller curated model catalog than aggregators or Together
- Newest frontier-scale models arrive later or are enterprise-gated
- No native fine-tuning — inference only
Ideal for: Teams building latency-sensitive chat, voice, or agentic features on popular open models who care more about speed than catalog breadth.
2. Together AI — Best overall for open models
Note: 200+ open models, serverless + dedicated + GPU clusters; fine-tuning from ~$0.48/1M training tokens; image and video models too · Pricing: Llama 3.3 70B $1.04 in / $1.04 out per 1M; gpt-oss 120B $0.15 / $0.60; DeepSeek V4 Pro $2.10 / $4.40; dedicated 1×H100 $6.49/hr; on-demand H100 cluster $5.49/hr · Free credits to start on the serverless API
Together AI is the most complete open-model platform on this list, and that completeness is exactly why it is our overall pick. Where most providers do one thing — fast serverless inference, or GPU rental, or fine-tuning — Together does all of them under a single account and API. You can prototype against serverless endpoints, graduate hot paths to dedicated reserved capacity, fine-tune a model on your own data, and rent bare H100 or B200 clusters for training, without ever leaving the platform or learning a second SDK.
The serverless catalog spans 200+ models. Llama 3.3 70B runs $1.04 per million tokens in and out; OpenAI's gpt-oss 120B is $0.15 / $0.60; DeepSeek V4 Pro is $2.10 / $4.40; and lighter models like Qwen3.5 9B drop to $0.17 / $0.25. Need guaranteed throughput? Dedicated endpoints start at $6.49/hour for a single H100. Need raw compute? On-demand GPU clusters run $5.49/hour for H100, $6.79 for H200, and $9.95 for B200, with reserved rates lower. Fine-tuning is priced by model size, from roughly $0.48 per million training tokens for sub-16B models. Together even serves FLUX image generation and Kling/Veo/Sora video, so multimodal teams are covered.
The breadth is also the only real caveat: Together is a platform, not a single endpoint, so the surface area to learn is larger than a focused provider like Groq. Per-token prices on some models are mid-pack rather than rock-bottom — you pay a little for the one-stop convenience. But for the majority of teams that want flexibility to start cheap and scale into dedicated or fine-tuned infrastructure without re-platforming, Together is the safest default.
Pros
- Broadest open-model catalog (200+ models) with OpenAI-compatible API
- Serverless, dedicated endpoints, and bare GPU clusters under one account
- Built-in fine-tuning priced by model size, plus image and video models
- Transparent hourly GPU pricing (H100 $5.49, B200 $9.95) with reserved discounts
- Free credits to start, smooth path from prototype to production scale
Cons
- Larger platform surface area than a single-purpose API
- Per-token prices on some models are mid-pack, not the cheapest
- Heavy GPU-cluster usage gets expensive without reserved commitments
Ideal for: Teams that want one platform for serverless inference, dedicated capacity, fine-tuning, and GPU rental across the widest open-model catalog.
Visit Together AI →Full review
3. Fireworks AI — Best for production speed & fine-tuning
Note: Open models optimized for low-latency production serving; LoRA + full-parameter fine-tuning; cached input and batch at 50% · Pricing: Serverless by size: <4B $0.10, 4–16B $0.20, >16B $0.90 per 1M tokens; DeepSeek V4 Flash $0.14 / $0.28; on-demand H100/H200 $7.00/hr, B200 $10.00/hr; fine-tuning from $0.50/1M tokens · $1 in free credits on new accounts
Fireworks AI targets the team that has outgrown a prototype and needs production-grade serving with predictable performance. Its serverless inference is tuned for low latency and high throughput, and it leans into a clean, size-tiered pricing model that makes budgeting straightforward: text models under 4B parameters cost $0.10 per million tokens, 4–16B models $0.20, and anything above 16B $0.90, regardless of which specific model you pick within a tier. Newer flagship models get explicit prices — DeepSeek V4 Flash, for instance, is $0.14 / $0.28 per million.
Where Fireworks pulls ahead is fine-tuning. It supports both LoRA and full-parameter tuning across model sizes, priced per million training tokens — LoRA SFT starts at $0.50 for sub-16B models and scales up to $10 for models over 300B, with DPO variants available. That, combined with on-demand GPUs (H100/H200 at $7.00/hour, B200 at $10.00, B300 at $12.00), lets you customize a model and then serve it on dedicated hardware without changing vendors. Embeddings are cheap too, starting at $0.008 per million tokens, and cached input plus batch inference are both billed at 50%.
The catch is the modest $1 free-credit allowance — enough to test, not to run a pilot — so you will be entering a card sooner than on platforms with richer free tiers. And like Groq, Fireworks is focused on open models, so proprietary frontier models live elsewhere. For production serving plus serious fine-tuning of open models, though, it is one of the strongest options here.
Pros
- Production-tuned serving with low latency and high throughput
- Clean size-tiered token pricing ($0.10 / $0.20 / $0.90) is easy to budget
- Strong fine-tuning: LoRA and full-parameter, priced per training token
- On-demand H100/H200/B200 GPUs and cheap embeddings from $0.008/1M
- Cached input and batch inference both 50% off; OpenAI-compatible API
Cons
- Only $1 in free credits — minimal for a real pilot
- Open-model focus; no proprietary frontier models
- Size-tiered pricing can overcharge small but premium models
Ideal for: Production teams that need reliable low-latency serving of open models plus first-class LoRA or full fine-tuning under one roof.
Visit Fireworks AI →Full review
4. OpenRouter — Best for multi-model access via one API
Note: OpenAI-compatible router across every major closed and open model with automatic failover and provider load-balancing · Pricing: No markup on provider pricing — you pay exactly the provider's rate; 5.5% fee on pay-as-you-go credits; 400+ models, 60+ providers; enterprise bulk discounts · Free plan with 25+ free models, 4 providers, 50 requests/day
OpenRouter solves a different problem from the inference providers above: instead of serving models itself, it gives you one OpenAI-compatible endpoint that routes to all of them. Through a single API key you can call GPT, Claude, Gemini, Llama, DeepSeek, Qwen, Mistral and hundreds more — 400+ models across 60+ providers — switching between them by changing a single string in your request. For teams that want to A/B different models, avoid vendor lock-in, or fail over automatically when one provider has an outage, this is the most pragmatic way to ship.
The pricing model is refreshingly honest: OpenRouter does not mark up provider rates. The price you see in the catalog is exactly what the underlying provider charges, and OpenRouter's revenue comes from a 5.5% platform fee on pay-as-you-go credit usage. The free plan gives you access to 25+ free models, 4 providers, and 50 requests per day — genuinely useful for prototyping — while paid usage unlocks the full 400+ model catalog with no minimums and no lock-in. Automatic provider load-balancing and failover are built in, so a single model can be served from whichever upstream is healthiest and cheapest.
The trade-off is that you are adding a hop. For absolute lowest latency or the rock-bottom cost of a specific open model, calling a specialist like Groq or Together directly will edge out routing through OpenRouter plus its 5.5% fee. But the operational leverage — one integration, every model, built-in resilience — is worth that small premium for most product teams, especially early on.
Pros
- One OpenAI-compatible API for 400+ models across 60+ providers
- No markup on provider pricing — only a transparent 5.5% platform fee
- Automatic failover and provider load-balancing for resilience
- Genuinely useful free tier (25+ free models, 50 requests/day)
- Eliminates vendor lock-in and makes model A/B testing trivial
Cons
- Adds a routing hop — slightly higher latency than calling a provider directly
- 5.5% fee on top of provider prices for cost-sensitive open-model serving
- You depend on a third party for billing and uptime across all models
Ideal for: Product teams that want to call closed and open models through one API, test models against each other, and avoid lock-in.
5. Replicate — Best for running any model (incl. image/video)
Note: Thousands of community + official models (text, image, video, audio); push your own with Cog; scale-to-zero per-second billing · Pricing: Hardware: CPU $0.36/hr, T4 $0.81/hr, L40S $3.51/hr, A100 80GB $5.04/hr, H100 $5.49/hr (billed per second); models e.g. FLUX 1.1 Pro $0.04/image, Wan 2.1 video 720p $0.25/sec · No standing free tier; pay-as-you-go per second of compute
Replicate is the platform for running any model, not just chat LLMs. Its catalog spans thousands of community and official models across every modality — language models, FLUX and Ideogram image generation, Wan and Kling video, transcription, upscalers, and more — each callable through a uniform API. If your product needs an image generator today and a video model next month, Replicate lets you ship both without standing up GPU infrastructure or learning a new SDK for each.
Billing is per second of compute, which is ideal for bursty or mixed workloads. Hardware ranges from CPU at $0.36/hour through Nvidia T4 ($0.81), L40S ($3.51), A100 80GB ($5.04), and H100 ($5.49), up to 8×H100 nodes — all metered by the second so you pay only while a prediction runs and nothing while idle. Popular models also expose simple per-run pricing: FLUX 1.1 Pro at $0.04 per image, FLUX Dev at $0.025, Ideogram V3 at $0.09, and Wan 2.1 video at $0.09–$0.25 per second of output depending on resolution. You can push your own model with Replicate's Cog packaging tool and get the same scale-to-zero economics.
The trade-offs are latency and cost ceiling. Cold starts can add seconds when a model spins up from zero, which matters for interactive UX, and for steady high-volume text inference a token-metered specialist will be cheaper than per-second GPU billing. There is also no standing free tier. But for breadth of model types and the freedom to deploy literally anything, Replicate is unmatched here.
Pros
- Run thousands of models across text, image, video, and audio via one API
- Per-second compute billing — pay only while a prediction runs, scale to zero
- Simple per-run pricing on popular models (FLUX, Ideogram, Wan video)
- Push and host your own custom models with the Cog packaging tool
- Transparent hardware rates from CPU to 8×H100 nodes
Cons
- Cold starts add latency when models spin up from zero
- Per-second GPU billing costs more than token-metered serving at high volume
- No standing free tier — you pay from the first prediction
Ideal for: Builders who need image, video, audio, or long-tail open models — not just chat — and want to deploy anything without managing GPUs.
6. Cohere — Best for enterprise RAG & embeddings
Note: Enterprise-grade generation, embedding, and rerank models for retrieval; private/VPC and on-prem deployment via cloud partners · Pricing: Command A $2.50 in / $10 out per 1M; Command R+ (08-2024) $2.50 / $10; Command R $0.50 / $1.50; Embed 4 $0.12/1M text ($0.47/1M image tokens); Rerank priced per search · Free trial API keys (rate-limited) for evaluation
Cohere is the enterprise specialist on this list, and its center of gravity is retrieval-augmented generation rather than consumer chat. Where general providers give you a generation model and leave embeddings and reranking to you, Cohere ships all three pieces of a production RAG stack as first-class, tightly integrated APIs — the Command family for generation, Embed for vectorization, and Rerank for relevance — with the security posture (private deployment, VPC, on-prem via cloud partners) that regulated industries require.
Command A, Cohere's flagship, is a 111B-parameter open-weights model with a 256k context window priced at $2.50 per million input tokens and $10 output, tuned for agentic, multilingual, and enterprise workloads; the lighter Command R serves at $0.50 / $1.50 for higher-volume tasks. The retrieval models are the real draw: Embed 4 costs just $0.12 per million text tokens (and $0.47 per million image tokens for multimodal embeddings), and Rerank is billed per search — a single query against up to 100 documents — making it cheap to bolt high-quality reranking onto an existing vector search. Multilingual quality across 100+ languages is a long-standing Cohere strength.
Cohere is not the cheapest or fastest for plain text generation, and its catalog is deliberately narrow — you will not find a hundred community models here. It is built for one job. But if you are standing up enterprise search, a knowledge assistant, or any RAG system where embedding and rerank quality and data governance matter more than raw token price, Cohere is the most purpose-built option on this list.
Pros
- Complete RAG stack — generation, Embed 4, and Rerank — as integrated APIs
- Very cheap, high-quality embeddings ($0.12/1M text) and per-search reranking
- Strong multilingual support across 100+ languages
- Private/VPC and on-prem deployment for regulated enterprises
- Command A is open-weights with a 256k context window
Cons
- Not the cheapest or fastest for plain text generation
- Deliberately narrow catalog — no long tail of community models
- Best value only realized when you use the full retrieval stack
Ideal for: Enterprises building RAG, semantic search, or knowledge assistants that need top-tier embeddings, reranking, and strict data governance.
7. Novita AI — Best for affordable GPU inference
Note: Cost-optimized open-model serving plus on-demand GPU instances; OpenAI-compatible API · Pricing: Llama 3.1 8B $0.02 in / $0.05 out per 1M; Llama 3.3 70B $0.135 / $0.40; Qwen3 Coder 30B $0.07 / $0.27; DeepSeek V4 Pro $1.60 / $3.20 per 1M tokens · Free credits to start; a few permanently free models
Novita AI competes on one axis above all: price. Its token rates for popular open models are among the lowest you will find anywhere — Llama 3.1 8B Instruct at $0.02 per million input tokens and $0.05 output, Llama 3.3 70B at $0.135 / $0.40, Qwen3 Coder 30B at $0.07 / $0.27. For high-volume, cost-sensitive workloads — bulk classification, summarization pipelines, embedding-adjacent batch jobs — those numbers can cut an inference bill dramatically versus mid-pack providers, while still exposing a familiar OpenAI-compatible API.
Beyond the cheap LLM endpoints, Novita is a broader GPU platform. It offers on-demand GPU instances for teams that want to run their own models, and a stable of media-generation APIs (Kling, Vidu, Seedance and others) for image and video, priced per generation. New users get free credits to start, and a handful of models are permanently free, so evaluation costs nothing. Premium and frontier-scale models are available too — DeepSeek V4 Pro at $1.60 / $3.20, GLM-5.1 at $1.38 / $4.40 — if you need them, though the value proposition is strongest at the budget end.
The considerations are the usual ones for a value-first provider: it is less of a household name than Together or Fireworks, so you are weighing brand maturity and ecosystem against savings, and the very newest models may land later than on the flagship platforms. But if your priority is squeezing the most inference out of every dollar on mainstream open models, Novita is the most aggressive option here.
Pros
- Among the lowest token prices anywhere (Llama 3.1 8B at $0.02/$0.05)
- On-demand GPU instances plus image and video generation APIs
- OpenAI-compatible endpoints — easy migration from other providers
- Free credits to start and a few permanently free models
- Covers budget through frontier-scale models in one catalog
Cons
- Less established brand and ecosystem than Together or Fireworks
- Newest models can arrive later than on flagship platforms
- GPU instance details and SLAs are lighter than enterprise clouds
Ideal for: Cost-sensitive teams running high volumes of mainstream open-model inference who want the lowest possible per-token price.
8. Lepton AI — Best for fast deployment
Note: Unified GPU marketplace aggregating many cloud providers; NVIDIA handles billing; deploy models and endpoints across regions · Pricing: Marketplace sample rates: H100 SXM ~$5.01/hr on-demand (~$2.91 spot); H200 ~$5.92/hr (~$3.31 spot); B200 ~$9.36/hr — dynamic, set by upstream providers, billed per minute · Varies by upstream provider; trial credits via NVIDIA
Lepton AI was acquired by NVIDIA and now operates as NVIDIA DGX Cloud Lepton, a unified GPU marketplace built for developers who want to deploy fast across many providers without negotiating contracts with each one. The pitch is speed-to-deploy: rather than serving models on one company's hardware, Lepton acts as a broker that allocates compute from a curated network of clouds — AWS, Nebius, Together AI, Mistral, Scaleway, CoreWeave, Crusoe, Lambda and more — while NVIDIA handles a single billing relationship. You get one console and API to launch endpoints on Blackwell and Hopper GPUs wherever capacity is available.
Because it is a marketplace, pricing is dynamic and set by the upstream providers rather than a single published rate card. Representative on-demand sample rates seen on the platform put H100 SXM around $5.01/hour (roughly $2.91 spot), H200 around $5.92/hour ($3.31 spot), and B200 around $9.36/hour, with per-minute billing. Those numbers move with supply, region, and provider, so treat them as indicative rather than fixed — always confirm the live rate in the console before committing. Trial credits are typically available through NVIDIA to get started.
The strength is reach and convenience: one integration that taps a global pool of GPUs and routes around scarcity. The caveats are transparency and consistency — published marketplace rates bundle the underlying provider price with brokerage overhead, so they are not always directly comparable to going to a single neocloud, and you should verify current pricing yourself. For teams that value fast, flexible deployment across providers over a single fixed price, it is a compelling aggregator.
Pros
- Single console and API to deploy across many GPU clouds at once
- Routes around capacity scarcity — access Blackwell/Hopper wherever available
- NVIDIA handles unified billing across all upstream providers
- Per-minute billing with both on-demand and spot options
- Fast deployment without per-provider contracts
Cons
- Dynamic marketplace pricing — no single fixed rate card; verify live rates
- Published rates bundle provider cost plus brokerage overhead
- Consistency and SLAs depend on the chosen upstream provider
Ideal for: Teams that want to deploy GPU workloads fast across multiple clouds from one console, prioritizing flexibility and capacity reach over a fixed price.
9. Anyscale — Best for Ray-based scaling
Note: Managed Ray platform for distributed training, batch inference, and serving at scale; BYOC to run in your own cloud · Pricing: Hosted compute (markup over cloud): CPU $0.0135/hr, T4 $0.5682/hr, L4 $0.9542/hr, A10G $1.3635/hr, A100 $4.9591/hr; H/B/GB families contact sales; committed-use discounts · $100 in starting credits for new users
Anyscale is the odd one out here in the best possible way: it is not a token-metered inference endpoint but the managed home of Ray, the open-source framework for distributed Python and AI workloads. If your problem is scale — distributed training across many GPUs, large-scale batch inference over millions of records, or serving a custom model with fine-grained control over autoscaling — Anyscale gives you the orchestration layer that token APIs deliberately abstract away. You write Ray code; Anyscale runs it efficiently across a cluster you control.
Pricing is usage-based compute, billed by the hour, as a managed markup over raw cloud GPUs. Published hosted rates include CPU at $0.0135/hour, Nvidia T4 at $0.5682, L4 at $0.9542, A10G at $1.3635, and A100 at $4.9591, with H100/Blackwell and GB families quoted on request and committed-use discounts for steady workloads. New users get $100 in starting credits — the most generous trial allowance on this list — and template projects launch for just a few dollars. Crucially, Anyscale supports Bring-Your-Own-Cloud (BYOC), so you can run on your own AWS/GCP/Azure account or existing GPU reservations while still getting the managed Ray experience.
The flip side is that Anyscale is a platform for engineers, not a one-line API. There is no simple per-token endpoint — you are writing and operating distributed jobs, which is overkill if all you need is to call Llama behind a REST API. But for ML teams that have outgrown single-node inference and need to train, batch-process, or serve at genuine scale, Anyscale is purpose-built and the natural production home for Ray.
Pros
- Managed Ray — best-in-class for distributed training and batch inference
- Most generous free trial here: $100 in starting credits
- Bring-Your-Own-Cloud to run on your own GPU reservations
- Transparent hourly compute pricing with committed-use discounts
- Fine-grained control over autoscaling and large-scale serving
Cons
- Not a simple per-token API — you write and operate distributed jobs
- Overkill for teams that just need a REST endpoint to call a model
- Frontier GPU pricing (H100/Blackwell) is contact-sales only
Ideal for: ML engineering teams that need distributed training, large-scale batch inference, or custom serving at scale via the Ray framework.
10. Nebius AI — Best for European cloud + GPU
Note: European AI cloud: token-metered inference (AI Studio / Token Factory) plus low-cost GPU clusters; EU data residency · Pricing: AI Studio tokens e.g. Llama 3.1 8B ~$0.02/1M, Llama 3.1 70B $0.25 in / $0.75 out per 1M; AI Cloud GPUs: H100 $3.85/hr, H200 $4.50/hr, B200 $7.15/hr on-demand (preemptible from $2.15/hr) · Promo-code free credits; $25 minimum first payment
Nebius is the European answer to the AI-cloud question, and for teams with data-residency or GDPR requirements that is its decisive advantage. It is a full-stack AI cloud headquartered and operated in Europe, offering both token-metered inference through its AI Studio (Token Factory) product and raw GPU clusters through AI Cloud — so the same vendor can serve your Llama endpoint and rent you the H100s to fine-tune it, all within EU jurisdiction.
On the inference side, AI Studio prices are competitive with the cheapest global providers: Llama 3.1 8B runs around $0.02 per million tokens, Llama 3.1 70B is $0.25 input / $0.75 output, and the catalog spans DeepSeek, Qwen, Mistral and gpt-oss with fast variants for latency-sensitive work. On the infrastructure side, AI Cloud GPU pricing is genuinely low: H100 at $3.85/GPU-hour on-demand (and just $2.15 preemptible), H200 at $4.50 ($2.45 preemptible), and B200 at $7.15 ($3.95 preemptible), with up to 35% savings on multi-month reservations. New users can redeem promo-code credits, and the minimum first payment is a modest $25.
The considerations are reach and ecosystem maturity. Nebius's footprint is Europe-centric, so latency for users far outside the region may trail a globally distributed provider, and its community ecosystem is smaller than the US incumbents. But if EU data residency is a hard requirement — or you simply want some of the lowest published H100/H200 GPU-hour rates on the market — Nebius is the strongest pick on this list.
Pros
- European cloud with EU data residency — strong for GDPR-bound teams
- Both token-metered inference (AI Studio) and raw GPU clusters (AI Cloud)
- Some of the lowest published GPU rates: H100 $3.85/hr (preemptible $2.15)
- Competitive token prices (Llama 3.1 8B ~$0.02/1M) with fast variants
- Up to 35% savings on multi-month reservations; low $25 entry
Cons
- Europe-centric footprint — higher latency for far-flung users
- Smaller community ecosystem than US incumbents
- Token catalog narrower than aggregators like OpenRouter
Ideal for: European teams (or anyone needing EU data residency) that want token inference and low-cost H100/H200 GPUs from one compliant cloud.
Compared side by side
| # | Tool | Type | Score | Entry price | Best for |
|---|---|---|---|---|---|
| 1 | Groq | Inference API | 4.7 | Pay-as-you-go, per-token | fastest inference (LPU hardware) |
| 2 | Together AI | Inference API | 4.3 | Pay-as-you-go, per-token; dedicated endpoints and GPU clusters hourly | overall for open models |
| 3 | Fireworks AI | Inference API | 4.4 | Pay-as-you-go, per-token; on-demand GPUs hourly | production speed & fine-tuning |
| 4 | OpenRouter | API gateway / router | 4.5 | Pay-as-you-go: pass-through model prices + 5.5% platform fee | multi-model access via one API |
| 5 | Replicate | Model hosting / inference | 4.3 | Per-second hardware billing or per-run model pricing | running any model (incl. image/video) |
| 6 | Cohere | Inference API | 4.7 | Pay-as-you-go, per-token / per-search | enterprise RAG & embeddings |
| 7 | Novita AI | Inference API | 4.3 | Pay-as-you-go, per-token; GPU instances available | affordable GPU inference |
| 8 | Lepton AI | GPU cloud / inference (NVIDIA DGX Cloud Lepton) | 4.3 | Marketplace GPU pricing (per-minute), on-demand or spot | fast deployment |
| 9 | Anyscale | Compute platform (Ray) | 4.3 | Pay-as-you-go compute, hosted or BYOC; usage-based hourly | Ray-based scaling |
| 10 | Nebius AI | AI cloud / inference | 4.3 | Per-token (AI Studio) and per-GPU-hour (AI Cloud) | European cloud + GPU |
Pricing snapshot (verified June 2026)
- Groq — Free API access to start (complimentary developer tier on signup); Llama 3.1 8B Instant $0.05 in / $0.08 out per 1M; Llama 3.3 70B $0.59 / $0.79; gpt-oss 120B $0.15 / $0.60; Llama 4 Scout $0.11 / $0.34 per 1M tokens.
- Together AI — Free credits to start on the serverless API; Llama 3.3 70B $1.04 in / $1.04 out per 1M; gpt-oss 120B $0.15 / $0.60; DeepSeek V4 Pro $2.10 / $4.40; dedicated 1×H100 $6.49/hr; on-demand H100 cluster $5.49/hr.
- Fireworks AI — $1 in free credits on new accounts; Serverless by size: <4B $0.10, 4–16B $0.20, >16B $0.90 per 1M tokens; DeepSeek V4 Flash $0.14 / $0.28; on-demand H100/H200 $7.00/hr, B200 $10.00/hr; fine-tuning from $0.50/1M tokens.
- OpenRouter — Free plan with 25+ free models, 4 providers, 50 requests/day; No markup on provider pricing — you pay exactly the provider's rate; 5.5% fee on pay-as-you-go credits; 400+ models, 60+ providers; enterprise bulk discounts.
- Replicate — No standing free tier; pay-as-you-go per second of compute; Hardware: CPU $0.36/hr, T4 $0.81/hr, L40S $3.51/hr, A100 80GB $5.04/hr, H100 $5.49/hr (billed per second); models e.g. FLUX 1.1 Pro $0.04/image, Wan 2.1 video 720p $0.25/sec.
- Cohere — Free trial API keys (rate-limited) for evaluation; Command A $2.50 in / $10 out per 1M; Command R+ (08-2024) $2.50 / $10; Command R $0.50 / $1.50; Embed 4 $0.12/1M text ($0.47/1M image tokens); Rerank priced per search.
- Novita AI — Free credits to start; a few permanently free models; Llama 3.1 8B $0.02 in / $0.05 out per 1M; Llama 3.3 70B $0.135 / $0.40; Qwen3 Coder 30B $0.07 / $0.27; DeepSeek V4 Pro $1.60 / $3.20 per 1M tokens.
- Lepton AI — Varies by upstream provider; trial credits via NVIDIA; Marketplace sample rates: H100 SXM ~$5.01/hr on-demand (~$2.91 spot); H200 ~$5.92/hr (~$3.31 spot); B200 ~$9.36/hr — dynamic, set by upstream providers, billed per minute.
- Anyscale — $100 in starting credits for new users; Hosted compute (markup over cloud): CPU $0.0135/hr, T4 $0.5682/hr, L4 $0.9542/hr, A10G $1.3635/hr, A100 $4.9591/hr; H/B/GB families contact sales; committed-use discounts.
- Nebius AI — Promo-code free credits; $25 minimum first payment; AI Studio tokens e.g. Llama 3.1 8B ~$0.02/1M, Llama 3.1 70B $0.25 in / $0.75 out per 1M; AI Cloud GPUs: H100 $3.85/hr, H200 $4.50/hr, B200 $7.15/hr on-demand (preemptible from $2.15/hr).
How to choose
Choosing an LLM inference platform comes down to four questions. Answer them honestly about your actual workload and the shortlist narrows fast.
Token pricing vs GPU-hour pricing
The single biggest fork is how you are billed. Token-metered providers (Groq, Together, Fireworks, Novita, Nebius AI Studio) charge per million tokens and are almost always cheaper and simpler for standard chat, RAG, and text pipelines — you pay only for the text you actually process, with no idle cost. GPU-hour platforms (Replicate, Anyscale, Nebius AI Cloud, DGX Cloud Lepton) rent you the hardware and make sense when you are running a custom or fine-tuned model, doing large-scale training, or serving image and video models that token meters do not cover. A rough rule: if a token-metered endpoint exists for your model, start there; reach for GPU-hour billing only when you need control the API does not give you, or steady high utilization makes dedicated hardware cheaper than per-call rates.
Latency and throughput
Cost is not the only meter. For anything user-facing — chat, voice, autocomplete, agents — time-to-first-token and tokens-per-second determine how the product feels. This is where specialized hardware earns its keep: Groq's LPU delivers 500–1,000 tokens/sec on mid-size models, fast enough to change what is possible in an interactive UI. Watch for cold starts too: scale-to-zero platforms like Replicate save money but can add seconds when a model spins up, which is fine for batch jobs and painful for real-time UX. Match the latency profile to the use case, not just the price.
Model choice and lock-in
Decide whether you want one provider's curated catalog or access to everything. A focused provider gives you the lowest latency and cost for the models it serves; an aggregator like OpenRouter gives you 400+ models and closed frontier models through one API at the cost of a small fee and an extra hop. The good news is that OpenAI-compatible endpoints are now near-universal, so switching providers is often just a base-URL change — keep your integration provider-agnostic and you preserve the freedom to chase price and quality as the market moves.
Data privacy and residency
For regulated or enterprise workloads, where your data is processed can outrank price. Check each provider's data-retention and training policy, whether private/VPC or on-prem deployment is available (Cohere and Anyscale BYOC both support this), and where the compute physically runs. Nebius is purpose-built for EU data residency. If your prompts contain sensitive or proprietary data, confirm in writing that it is not retained or used for training.
A note on self-hosting
If maximum privacy, zero per-call cost, or full offline control matters more than convenience, running models on your own hardware is the alternative to every option above. That is a different toolset — desktop and local runtimes like Ollama, LM Studio, and Jan — covered in our guide to the best local LLM tools. For most production teams the math favors a hosted API (no GPU capex, instant scale, someone else on call), but self-hosting is the right call when data can never leave your premises or your volume is large and steady enough to amortize owned hardware.
Frequently asked questions
What is an LLM API?
An LLM API is a hosted endpoint that lets your application send a prompt over the internet and receive a model-generated response, without you owning or operating any GPUs. You authenticate with an API key, POST your messages, and pay per token used (or per second of compute). Most modern providers expose an OpenAI-compatible interface, so the same client code works across vendors with only a base-URL and key change. This is the standard way to add language-model features — chat, summarization, classification, RAG — to a product in production.
Which is the cheapest LLM API?
For mainstream open models, Novita AI and Nebius AI Studio are among the cheapest, with Llama 3.1 8B around $0.02 per million input tokens and roughly $0.05 output. Groq is also extremely competitive at $0.05/$0.08 for the same model while adding industry-leading speed. The cheapest choice depends on the specific model and your input/output ratio, so compare the exact model you plan to use — and remember an aggregator like OpenRouter passes through provider prices at-cost plus a 5.5% fee, letting you shop rates across providers from one API.
Which LLM API has the fastest inference?
Groq is the speed leader thanks to its custom LPU (Language Processing Unit) hardware, delivering roughly 500–1,000 tokens per second on mid-size open models like Llama 3.1 8B and gpt-oss — fast enough that streamed responses feel instant. Fireworks AI is also tuned for low-latency production serving. For interactive products where time-to-first-token drives the user experience, Groq is the default recommendation; for raw GPU control over a custom model, dedicated endpoints on Together or Fireworks let you tune throughput yourself.
What does OpenAI-compatible mean and why does it matter?
OpenAI-compatible means a provider exposes the same request and response format as OpenAI's API, so any code or SDK written for OpenAI works by changing only the base URL and API key. Groq, Together, Fireworks, OpenRouter, Novita, and Nebius all offer this. It matters because it eliminates lock-in: you can switch providers, A/B test models across vendors, or fail over to a backup in minutes rather than rewriting your integration. Building against this standard keeps you free to chase the best price and performance as the market shifts.
Do these LLM API platforms have free tiers?
Most offer free credits or a free tier to start. Groq gives complimentary developer API access; Together and Novita provide free starting credits; Fireworks includes $1 in credits; Anyscale is the most generous with $100 in starting credits; and OpenRouter has a genuine free plan with 25+ free models and 50 requests per day. Replicate has no standing free tier (you pay per second from the first run), and Nebius uses promo-code credits with a $25 minimum first payment. Free allowances are enough to evaluate; expect to add a payment method for any real pilot.
Can I fine-tune models on these platforms?
Yes, several support fine-tuning. Fireworks offers both LoRA and full-parameter fine-tuning priced per million training tokens (LoRA SFT from $0.50 for sub-16B models), and Together provides fine-tuning across model sizes from roughly $0.48 per million training tokens, then lets you serve the result on dedicated endpoints. Anyscale, via Ray, is built for large-scale distributed training of custom models. Pure inference routers like OpenRouter and speed-focused Groq do not fine-tune — they serve existing models. If customization is core to your product, Together or Fireworks are the strongest fits.
Open models or closed models — which should I use?
Open-weight models (Llama, DeepSeek, Qwen, Mistral, gpt-oss) are dramatically cheaper, can be fine-tuned and self-hosted, and now rival closed models on many tasks — making them the default for cost-sensitive or customizable workloads, and the focus of providers like Together, Groq, Fireworks, and Novita. Closed frontier models (GPT, Claude, Gemini) can still lead on the hardest reasoning and agentic tasks. If you want both without juggling vendors, OpenRouter exposes open and closed models through a single API, letting you route each request to whichever model fits the job and budget.
How do rate limits work on LLM APIs?
Providers cap how many requests and tokens you can send per minute or per day, usually rising as you spend more or move from a trial to a paid key. Free tiers are tightly limited — OpenRouter's free plan, for example, allows 50 requests per day — while paid usage unlocks far higher ceilings, and dedicated endpoints (Together, Fireworks) give you reserved capacity with no shared-pool throttling. For production, check the published limits for your tier, build in retry-with-backoff for 429 responses, and request a limit increase or dedicated capacity before a launch that will spike traffic.
Is my data private when I use an LLM API?
It depends on the provider's policy, so verify before sending sensitive data. Reputable inference providers generally do not train on your API inputs by default, but retention windows and terms vary — read each vendor's data policy. For stricter requirements, Cohere offers private/VPC and on-prem deployment, Anyscale's Bring-Your-Own-Cloud runs jobs in your own account, and Nebius provides EU data residency for GDPR-bound teams. If your prompts contain regulated or proprietary information, confirm in writing that data is not retained or used for training, or self-host.
What is the difference between an LLM API and an API gateway like OpenRouter?
An LLM API (Groq, Together, Fireworks, Novita) serves models on its own infrastructure and bills you for that inference. An API gateway or router like OpenRouter serves no models itself — it provides one OpenAI-compatible endpoint that forwards your request to whichever underlying provider hosts the model you chose, across 400+ models and 60+ providers. The gateway adds value through unified access, automatic failover, and load-balancing, charging a small fee (OpenRouter's is 5.5%) on top of pass-through provider pricing. Use a direct provider for lowest latency on one model; use a gateway for breadth and resilience.
Can these platforms run image and video models, not just text?
Some can. Replicate is the standout for multimodal breadth, hosting thousands of image, video, and audio models (FLUX, Ideogram, Wan, Kling) billed per second or per run alongside text. Together also serves image generation (FLUX) and video (Kling, Veo, Sora), and Novita offers image and video generation APIs. Text-focused providers like Groq, Cohere, and OpenRouter concentrate on language models. If your product needs generative images or video, Replicate or Together let you cover text and media from one account.
When should I use a hosted LLM API instead of self-hosting?
Use a hosted API when you want to ship fast without buying or operating GPUs, need to scale instantly with traffic, and value having the provider handle uptime, updates, and new models — which describes most production teams. Self-hosting with tools like Ollama, LM Studio, or Jan makes sense when data can never leave your premises, you need fully offline operation, or your volume is large and steady enough that owned hardware beats per-call pricing. The honest answer for most: start with a hosted API, and revisit self-hosting only if privacy mandates or sustained scale change the math.
What GPU-hour rates can I expect for renting compute?
Published on-demand rates in mid-2026 cluster around $5–$5.50/hour for an Nvidia H100 (Together $5.49, Replicate $5.49, Fireworks $7.00), with Nebius among the lowest at $3.85/hour on-demand and $2.15 preemptible. H200 runs roughly $4.50–$6.79/hour and B200 around $7.15–$10/hour depending on provider. Marketplace platforms like DGX Cloud Lepton show dynamic rates (sample H100 ~$5/hr, with spot lower) set by upstream clouds. Reserved or committed-use contracts cut these meaningfully — Nebius and Together both advertise sizable multi-month discounts — so commit only once your utilization is predictable.