Cohere Rerank
Paid. Cohere's Rerank API for boosting RAG retrieval quality, priced at $2 per 1,000 search requests.
What is Cohere Rerank?
Cohere Rerank is a specialized API product from Cohere designed to dramatically improve retrieval quality in RAG (retrieval-augmented generation) pipelines. In a typical RAG setup, you embed documents into a vector database, retrieve the top 20-100 most relevant documents for a user query using vector similarity, and pass them to an LLM as context. The problem: vector similarity is a noisy signal, and the top 5 documents by cosine similarity are often not the truly most relevant ones.

Rerank solves this by taking your top-K vector search results and re-scoring them with a specialized cross-encoder model that reads each query-document pair together and outputs a much more accurate relevance score. Teams that adopt Rerank routinely see 10-30% improvements in end-to-end RAG answer quality without changing their embedding model or LLM.

Pricing is simple: $2 per 1,000 search requests, where each search is one query paired with up to 100 documents of under 500 tokens each. Documents longer than 500 tokens are automatically chunked.

As of 2026, Cohere offers Rerank 3.5 as the production default and Rerank 4 Fast as a newer, faster variant with strong accuracy for latency-sensitive use cases. Rerank is available via the Cohere API directly, on AWS Bedrock and Azure AI, and through aggregators like OpenRouter. It is one of the cheapest ways to meaningfully improve RAG quality.
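The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not Cohere's SDK: `rerank_fn` stands in for the real reranker, and `toy_scorer` is a deliberately naive term-overlap heuristic used only so the sketch runs on its own.

```python
# Sketch of the two-stage retrieval flow: vector search produces
# candidates, a reranker re-scores them, and only the best survive.
# rerank_fn is a stand-in for a real reranker such as Cohere's API.

def rerank_pipeline(query, candidates, rerank_fn, top_n=5):
    """Re-score vector-search candidates and keep the best top_n."""
    scored = [(doc, rerank_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]

def toy_scorer(query, doc):
    """Toy relevance scorer: fraction of query terms appearing in the
    document. A real cross-encoder reads the full query-document pair."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

docs = [
    "pricing for the rerank endpoint",
    "how to reset your password",
    "rerank endpoint pricing is two dollars per thousand searches",
]
best = rerank_pipeline("rerank pricing", docs, toy_scorer, top_n=2)
```

In production, `toy_scorer` would be replaced by a call to the Rerank API over the top 20-100 vector search hits, and `best` would become the LLM context.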
⚡ Quick Verdict
Best for: Any team running a RAG pipeline that wants a major quality boost without switching embedding models or LLMs
Not for: Apps that don't use RAG or don't need to rank more than a few documents
Price: $2 per 1,000 search requests
Free tier: Limited-rate trial key available
Biggest strength: Drop-in RAG quality improvement at predictable per-search pricing
Biggest weakness: Adds latency and cost to every retrieval call
Bottom line: Cohere Rerank scores 4.4/5 — the highest ROI improvement you can make to an existing RAG pipeline. Use Rerank 3.5 for production quality, Rerank 4 Fast when latency matters most.
Pricing
Rerank 3.5 — $2 per 1,000 search requests: The standard Rerank model with strong multilingual support. One search equals one query with up to 100 documents under 500 tokens each. Documents longer than 500 tokens are automatically split into chunks, and each chunk counts toward the document total.
Rerank 4 Fast: A newer variant optimized for lower latency with competitive accuracy. Pricing is comparable to Rerank 3.5; available via the Cohere API and OpenRouter.
Availability: Cohere API, AWS Bedrock, Azure AI Studio, and third-party gateways like OpenRouter. Identical per-search pricing across platforms.
Free trial tier: Cohere offers a limited-rate free trial key for development and testing.
Key Features
- $2 per 1,000 search requests — simple unit pricing
- One search = query + up to 100 documents
- Automatic document chunking for long texts
- Rerank 3.5 (production default) and Rerank 4 Fast
- Multilingual support across 100+ languages
- Drop-in improvement for any RAG pipeline
- Available on Cohere API, AWS Bedrock, Azure AI
- Works with any embedding model and vector database
Pros & Cons
Pros
- Cheapest way to meaningfully improve RAG answer quality
- Works as a drop-in upgrade without changing embeddings or LLM
- Simple per-search pricing is easy to forecast
- Multilingual support across 100+ languages
Cons
- No free tier beyond limited trial
- Adds one extra API call per query to your RAG pipeline
- Long documents count as multiple billing units after chunking
FAQ
How much does Cohere Rerank actually improve RAG quality?
Most teams see 10-30% improvements in end-to-end answer quality metrics when adding Rerank between their vector retrieval step and their LLM context assembly. The exact gains depend on your domain, query complexity, and embedding model quality — noisier retrieval pipelines benefit the most.
What counts as one search?
One search request equals one query paired with up to 100 documents where each document is under 500 tokens (roughly 2,000 characters). If any of your documents are longer than 500 tokens, Cohere automatically splits them into multiple chunks, and each chunk counts as an additional document toward the 100-document limit.
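The chunking rule makes billing easy to estimate in code. The sketch below applies the terms stated above ($2 per 1,000 searches, 500-token chunks, 100 documents per search); the function names and the idea of passing in precomputed token counts are illustrative choices, not part of Cohere's API.

```python
import math

PRICE_PER_1000_SEARCHES = 2.00   # USD, per the pricing above
TOKENS_PER_CHUNK = 500           # documents longer than this are split
MAX_DOCS_PER_SEARCH = 100        # chunks count toward this limit

def billing_units(doc_token_counts):
    """How many of the 100 per-search document slots a document list
    uses. Each 500-token chunk of a long document consumes one slot."""
    return sum(math.ceil(t / TOKENS_PER_CHUNK) for t in doc_token_counts)

def estimated_cost(num_searches):
    """Cost in USD for a given number of search requests."""
    return num_searches * PRICE_PER_1000_SEARCHES / 1000

# Three 400-token docs plus one 1,200-token doc: the long doc splits
# into 3 chunks, so this search uses 6 of its 100 document slots.
units = billing_units([400, 400, 400, 1200])
within_limit = units <= MAX_DOCS_PER_SEARCH
cost = estimated_cost(50_000)   # 50k searches at $2 per 1,000
```

A pipeline doing 50,000 reranked queries a month would therefore cost $100 for the rerank step, regardless of how the documents chunk, since billing is per search request.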
How does Rerank compare to better embeddings?
Rerank and embedding quality both affect RAG performance, but they address different problems. Better embeddings help the initial retrieval step. Rerank then re-scores that top-K using a more powerful cross-encoder model. In practice, using a solid embedding model plus Rerank beats using an excellent embedding model alone, and is almost always cheaper.
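The architectural difference can be made concrete with toy stand-ins. Nothing here is a real model: `embed` fakes a bi-encoder with bag-of-words sets, and `cross_score` fakes a cross-encoder with substring matching. The point is structural: the bi-encoder scores two independently precomputed representations, while the cross-encoder reads the raw query-document pair together.

```python
def embed(text):
    """Bi-encoder stand-in: a bag-of-words set computed per text,
    independently. Real embeddings are dense vectors stored in a DB."""
    return set(text.lower().split())

def bi_encoder_score(query_vec, doc_vec):
    """Similarity of two precomputed representations (Jaccard here,
    cosine in a real system). The model never sees both texts jointly,
    so it can only compare whatever each representation preserved."""
    union = query_vec | doc_vec
    return len(query_vec & doc_vec) / len(union) if union else 0.0

def cross_score(query, doc):
    """Cross-encoder stand-in: reads the raw pair together (here, a
    crude substring check per query term), so nothing is lost to a
    fixed-size representation. Accurate but must run once per pair."""
    q_terms = query.lower().split()
    d = doc.lower()
    return sum(1.0 for t in q_terms if t in d) / max(len(q_terms), 1)

q = "rerank api pricing"
doc = "pricing details for the rerank api"
bi = bi_encoder_score(embed(q), embed(doc))   # diluted by extra doc terms
cross = cross_score(q, doc)                   # full query coverage found
```

This is also why rerankers are used on a short candidate list rather than the whole corpus: the pairwise scoring that makes them accurate is too expensive to run over millions of documents.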
Does Rerank work with any vector database?
Yes. Rerank is completely model-agnostic and works as a post-processing step after any vector database retrieval — Pinecone, Weaviate, Qdrant, Chroma, pgvector, Elasticsearch, OpenSearch, or even simple numpy cosine similarity. No changes to your vector database are required.
What languages does Rerank support?
Cohere Rerank 3.5 supports 100+ languages including English, Spanish, French, German, Portuguese, Italian, Dutch, Chinese, Japanese, Korean, Arabic, Hindi, and dozens of others. For multilingual RAG pipelines where queries and documents may be in different languages, Rerank is one of the strongest cross-lingual reranking options available.
How is this different from the main Cohere tool page?
This page focuses specifically on Rerank — Cohere's RAG-focused reranking API. The main Cohere review covers Cohere as a company, including Command R+ LLMs, Embed, classification APIs, and enterprise features.
📋 Good to know
Getting started: Sign up at cohere.com/api, get a free trial key, and call the /v1/rerank endpoint with a query and a list of documents.
Data privacy: Cohere does not train on customer API data. SOC 2 Type II, and HIPAA-eligible on AWS Bedrock.
Model choice: Start with Rerank 3.5 for production; test Rerank 4 Fast when latency matters.
Integration effort: Very low. A single API call accepts a query plus documents and returns scored indices.
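The quick-start above amounts to one JSON POST. The sketch below assembles a request body without sending it; the field names (`query`, `documents`, `top_n`) follow Cohere's documented rerank API, but the model id `"rerank-v3.5"` and the exact endpoint path are assumptions here, so check Cohere's API reference for current values.

```python
import json

# Assumed endpoint path; verify against Cohere's current API reference.
API_URL = "https://api.cohere.com/v1/rerank"

payload = {
    "model": "rerank-v3.5",   # assumed model id for Rerank 3.5
    "query": "What does a rerank search request cost?",
    "documents": [
        "Rerank is billed at $2 per 1,000 search requests.",
        "How to rotate your API key.",
    ],
    "top_n": 1,   # return only the single best-scoring document
}

# Serialized request body; send as a POST with an
# "Authorization: Bearer <API key>" header, e.g. via urllib.request
# or the cohere Python SDK.
body = json.dumps(payload)
```

The response contains relevance scores and the indices of the winning documents, which is why integration stays at "one extra API call" between retrieval and LLM context assembly.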