Rerankers are models that reorder an initial list of search results by scoring each (query, document) pair for relevance, so the best matches appear at the top.
Who this is for: search, recommendation, and RAG teams who already retrieve candidates fast (BM25/vector search) but need higher precision and better top‑K results.
When to use rerankers
Use a reranker when you need better relevance in the top results and can afford scoring a shortlist:
- You have many “almost relevant” candidates and need better ordering in the top 5–20.
- You run vector search or BM25 and want a stronger semantic “final judge.”
- Your users type long, specific, or ambiguous queries (policy, legal, medical, technical docs).
- You’re building RAG and need the context to be the most relevant (not just “kind of related”).
- You want to optimize for business goals (CTR, add-to-cart, resolution rate) while keeping search quality.
When not to use rerankers
Avoid or postpone reranking when the tradeoffs are not worth it:
- Your system can’t tolerate extra latency (or you can’t run it close to users).
- Your first-stage retrieval already returns excellent top‑K (reranking yields small gains).
- You need to rank millions of items in real time (rerankers are for shortlists, not full-corpus scoring).
- You don’t have enough text per item (e.g., extremely short titles only) and can’t enrich content.
- You have no way to measure relevance outcomes (no feedback loop, no evaluation set).
Signals you need a reranker (symptoms)
- Users say “search is bad,” but logs show they do click—just not on the first results.
- Low satisfaction despite non-empty results: results exist, but users keep reformulating the query.
- High pogo-sticking: users click, return immediately, and click another result.
- RAG answers are confident but wrong because retrieved context is “nearby,” not correct.
- Your vector search retrieves semantically similar items but misses intent and constraints (e.g., “from Brazil to USA” vs “from USA to Brazil”).
How rerankers work (retrieve‑then‑rerank)
Step-by-step pipeline
- User query arrives (text).
- Initial retrieval fetches N candidates fast (often N = 50–1000) using:
  - Lexical retrieval (e.g., BM25)
  - Vector search (bi-encoders / embeddings)
- Reranking scores each (query, candidate) pair with a more accurate model.
- Final ranking returns top K (often K = 5–20).
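The pipeline above can be sketched in a few lines. This is a minimal toy: the scoring functions are simple stand-ins (token overlap, phrase matching), where a real system would use BM25/vector search for the first stage and a Transformer cross-encoder for the second.

```python
# Minimal retrieve-then-rerank sketch with toy scoring functions.

def retrieve(query, corpus, n=4):
    """Cheap first stage: rank all docs by token overlap, keep top N."""
    q_tokens = set(query.lower().split())
    scored = [(len(q_tokens & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:n]]

def rerank(query, candidates, k=2):
    """More careful second stage: reward exact phrase containment on top
    of overlap (a toy proxy for cross-encoder accuracy)."""
    def score(doc):
        overlap = len(set(query.lower().split()) & set(doc.lower().split()))
        phrase_bonus = 2 if query.lower() in doc.lower() else 0
        return overlap + phrase_bonus
    return sorted(candidates, key=score, reverse=True)[:k]

corpus = [
    "shipping rates from Brazil to USA",
    "shipping rates from USA to Brazil",
    "customs duties overview",
    "international shipping insurance",
]
query = "shipping rates from Brazil to USA"
candidates = retrieve(query, corpus, n=3)   # fast, recall-oriented
top_k = rerank(query, candidates, k=1)      # slower, precision-oriented
print(top_k[0])  # the exact-match document wins
```

The structure is the important part: a recall-oriented first pass over the whole corpus, then a precision-oriented second pass over a small shortlist.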
Why reranking is more accurate
Most modern rerankers are Transformer cross-encoders. They read query and document together, allowing “early interaction” between tokens. This usually improves precision, especially for:
- negation (“not”, “without”)
- constraints (“under $1000”, “good battery”)
- directionality (“Brazil → USA”)
- domain terminology (internal doc names, SKUs, acronyms)
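A contrived example of the directionality point: an order-blind bag-of-words similarity (roughly what a naive embedding comparison degrades to) cannot tell "Brazil to USA" from "USA to Brazil", while a scorer that looks at the query and document jointly can. Both scorers here are toys; real cross-encoders learn these interactions from data.

```python
# Toy illustration of why joint (query, document) scoring can capture
# directionality that order-blind similarity misses.

def bow_similarity(query, doc):
    """Order-blind: identical token sets get identical scores."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def joint_score(query, doc):
    """Order-aware: also rewards matching token bigrams, so
    'brazil to usa' and 'usa to brazil' score differently."""
    def bigrams(text):
        toks = text.lower().split()
        return set(zip(toks, toks[1:]))
    return bow_similarity(query, doc) + len(bigrams(query) & bigrams(doc))

q = "flights from Brazil to USA"
d1 = "flights from Brazil to USA"
d2 = "flights from USA to Brazil"
print(bow_similarity(q, d1) == bow_similarity(q, d2))  # True: can't tell apart
print(joint_score(q, d1) > joint_score(q, d2))         # True: direction matters
```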
Key features to look for
- Cross-encoder reranking (highest accuracy, higher cost)
- Multilingual support (if queries/docs are not English-only)
- Max input length (token limit impacts long documents)
- Batching support (throughput improvements)
- Distilled/smaller variants (lower latency at acceptable quality)
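On batching: grouping (query, candidate) pairs into fixed-size batches amortizes per-call overhead and is usually how throughput gains are realized. A sketch of the pattern, with a toy overlap scorer standing in for a model's batched forward pass:

```python
# Batched scoring pattern: split pairs into batches, score each batch
# in one call, concatenate the results in order.

def score_in_batches(pairs, score_batch, batch_size=32):
    scores = []
    for i in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[i:i + batch_size]))
    return scores

def score_batch(batch):
    """Toy batch scorer: token overlap per (query, doc) pair."""
    return [len(set(q.split()) & set(d.split())) for q, d in batch]

pairs = [("red shoes", "red running shoes"), ("red shoes", "blue hats")]
print(score_in_batches(pairs, score_batch, batch_size=1))  # [2, 0]
```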
Integrations (common architectures)
Rerankers usually sit behind a search stack or RAG stack:
- Search: BM25 + vector search + reranker + business rules
- RAG: retriever + reranker + context builder + LLM generator
- Recs: candidate generation + reranker + personalization signals
Typical tools involved:
- Elasticsearch/OpenSearch, Vespa, Solr
- Vector DBs (or vector-enabled search engines)
- LLM/RAG frameworks and custom middleware
Limitations and tradeoffs
- Latency: scoring N candidates adds time; cross-encoders can be expensive.
- Cost: more compute per query than retrieval-only approaches.
- Token limits: long documents may need chunking; chunking changes relevance behavior.
- Evaluation complexity: you need relevance data (explicit labels or strong proxies) to tune properly.
Metrics and how to measure
Offline ranking quality (recommended)
- NDCG@K: rewards correct ordering near the top.
- MRR@K: focuses on how quickly the first relevant result appears.
- Recall@K (for stage 1): ensures retrieval gives reranker good candidates.
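Both offline metrics are short enough to implement directly. Given a ranked list's graded relevance labels (best ordering known), NDCG@K rewards putting high-relevance items early with a logarithmic discount, and MRR@K is the reciprocal rank of the first relevant hit:

```python
import math

def ndcg_at_k(relevances, k):
    """relevances: graded relevance of results in ranked order.
    DCG uses a log2 position discount; NDCG normalizes by the ideal ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr_at_k(relevances, k):
    """Reciprocal rank of the first relevant (rel > 0) result in the top K."""
    for i, r in enumerate(relevances[:k]):
        if r > 0:
            return 1.0 / (i + 1)
    return 0.0

print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0 (already perfectly ordered)
print(mrr_at_k([0, 1, 0], k=3))      # 0.5 (first relevant hit at rank 2)
```

Compare these per-query and averaged over an evaluation set, with and without the reranker, before trusting online metrics.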
Online product metrics (what stakeholders care about)
- Search CTR (but watch for misleading clicks)
- Reformulation rate (lower is better)
- Time to first useful click / time to resolution
- Conversion rate (e-commerce) or ticket deflection (support portals)
- RAG answer quality: human eval, win-rate tests, citation correctness
Measurement tip: isolate reranking in an A/B test. Keep retrieval fixed, change only reranking.
Common mistakes (and fixes)
- Mistake: Reranking too many candidates (e.g., 5,000+)
  Fix: Reduce N (e.g., 100–300) or add a cheaper pre-filter step.
- Mistake: Reranking raw long documents
  Fix: Chunk documents and rerank chunks; then roll up to document-level results.
- Mistake: No baseline evaluation
  Fix: Start with offline NDCG/MRR on a labeled set or curated query list.
- Mistake: Treating reranker score as the only signal
  Fix: Combine reranker score with business signals (freshness, availability, authority) via a final rank fusion rule.
- Mistake: Latency surprises in production
  Fix: Batch scoring, cache frequent queries, and run inference closer to users (edge).
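The chunk-then-roll-up fix is worth spelling out: score each chunk against the query, then rank documents by their best chunk's score (max roll-up is one common choice; sum or mean are alternatives). The chunk scorer below is a toy overlap function standing in for a real reranker.

```python
# Chunk-level reranking rolled up to document-level results via max score.

def chunk_score(query, chunk):
    """Toy stand-in for a reranker's (query, chunk) relevance score."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def rerank_documents(query, doc_chunks, k=2):
    """doc_chunks: {doc_id: [chunk, ...]}. Rank docs by best chunk."""
    doc_scores = {
        doc_id: max(chunk_score(query, c) for c in chunks)
        for doc_id, chunks in doc_chunks.items()
    }
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]

docs = {
    "returns-policy": ["how to return items", "refund timelines and fees"],
    "shipping-guide": ["shipping rates by region", "tracking your package"],
}
print(rerank_documents("refund fees", docs, k=1))  # ['returns-policy']
```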
How this applies in practice
Practical design choices
- Candidate size (N): 50–300 is common for cross-encoder reranking.
- Document representation: title + key fields often outperform full raw text.
- Chunking strategy: semantic chunks (or section-based) usually rank better than fixed-size chunks.
- Hybrid retrieval: BM25 + vector search increases candidate diversity before reranking.
- Personalization: add user/context features after reranking (or rerank personalized candidates).
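For the hybrid-retrieval choice above, reciprocal rank fusion (RRF) is one common, training-free way to merge BM25 and vector result lists into a single candidate set before reranking: each document scores the sum of 1/(k + rank) across the lists it appears in, with k commonly set to 60.

```python
# Reciprocal rank fusion: merge several ranked lists into one candidate
# set. Documents ranked well by multiple retrievers float to the top.

def rrf(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
print(rrf([bm25_hits, vector_hits]))
# doc_b leads: it ranks high in both lists
```

The fused list then goes to the reranker, which does the fine-grained ordering.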
Example flow for a docs search or RAG
- Retrieve 200 chunks with hybrid search (BM25 + vectors).
- Rerank the 200 chunks with a cross-encoder.
- Use top 8–12 chunks as LLM context.
- Track citation correctness and answer win-rate.
How to implement on Azion
If you want to reduce reranker latency and keep responses fast globally, run reranking inference closer to users:
- Azion AI Inference: https://www.azion.com/en/products/ai-inference/
- Reranker model docs (example): https://www.azion.com/en/documentation/products/ai/ai-inference/models/baai-bge-reranker-v2-m3/
- Edge-native architecture guidance: https://www.azion.com/en/documentation/architectures/edge-application/edge-native-applications/
Typical setup: retrieve candidates in your search layer → send top N to edge inference → return reranked top K.
Pricing (what affects cost)
Reranker cost usually scales with:
- Queries per second (QPS)
- Candidates reranked per query (N)
- Model size (larger Transformer = slower/more expensive)
- Average document length (more tokens = more compute)
- Caching and batching (can reduce cost materially)
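These factors multiply, so a back-of-the-envelope model is useful for capacity planning. All numbers below are illustrative assumptions, not real pricing:

```python
# Rough monthly cost model for the factors above: QPS x candidates per
# query x tokens per pair x per-token price, discounted by cache hits.

def monthly_rerank_cost(qps, candidates_per_query, avg_tokens_per_pair,
                        price_per_million_tokens, cache_hit_rate=0.0):
    seconds_per_month = 30 * 24 * 3600
    queries = qps * seconds_per_month * (1 - cache_hit_rate)
    tokens = queries * candidates_per_query * avg_tokens_per_pair
    return tokens / 1_000_000 * price_per_million_tokens

# Assumed: 10 QPS, 200 candidates, ~300 tokens per (query, doc) pair,
# $0.01 per million tokens, 30% cache hit rate.
cost = monthly_rerank_cost(10, 200, 300, 0.01, cache_hit_rate=0.3)
print(round(cost, 2))
```

Note how halving N or raising the cache hit rate cuts the bill linearly, which is why N tuning and caching are the first levers to pull.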
Mini FAQ
What’s the difference between a retriever and a reranker? A retriever finds candidates fast (high recall). A reranker reorders a shortlist with higher accuracy (high precision).
Is BM25 still useful if I use a reranker? Yes. BM25 is often a strong first-stage retriever (or part of hybrid retrieval) that feeds good candidates into the reranker.
How many documents should I rerank? Start with 100–300 for cross-encoders and tune based on latency and NDCG/MRR gains.
Do rerankers help RAG reduce hallucinations? They can, because better-ranked context improves grounding. They don’t guarantee correctness, but they reduce “wrong context” failures.
Should I fine-tune a reranker? Fine-tuning can help in specialized domains (internal docs, legal, healthcare). Start with a strong pretrained reranker, then fine-tune if you have labeled data or reliable implicit feedback.
Docs
- Artificial Intelligence: https://www.azion.com/en/learning/ai/what-is-artificial-intelligence/
- Vector search: https://www.azion.com/en/learning/ai/what-is-vector-search/
- Embeddings: https://www.azion.com/en/learning/ai/what-are-embeddings-and-vectors/
- RAG: https://www.azion.com/en/learning/ai/what-is-rag/
- Latency: https://www.azion.com/en/learning/performance/what-is-latency/
- Edge computing: https://www.azion.com/en/learning/cdn/edge-computing-evolution-of-cdn/