Gökçe Akçıl
#rag #llms #retrieval #evaluation

Optimizing RAG Pipelines

A practical guide to improving RAG pipelines through better retrieval, chunking, reranking, evaluation, and operational design.

April 15, 2026

Executive Summary


Retrieval-Augmented Generation is often introduced as a simple pattern: retrieve relevant context, pass it to a language model, get a grounded answer.

That description is useful, but incomplete. In real systems, RAG quality is not determined by the language model alone. It is determined by the weakest part of the retrieval chain: document parsing, chunking, metadata, embedding choice, query transformation, ranking, context assembly, evaluation, and monitoring.

A good RAG pipeline is therefore not a prompt trick. It is an information retrieval system with a generative interface.

The original RAG formulation by Lewis et al. connected parametric memory in a sequence-to-sequence model with non-parametric memory from a dense vector index. That idea remains the foundation: keep factual knowledge outside the model when the knowledge changes, is domain-specific, or must be auditable. The modern RAG stack has expanded around this core with hybrid search, reranking, long-context models, self-reflection, corrective retrieval, graph-based retrieval, and automated evaluation. The 2025-2026 research direction is especially clear: RAG is moving toward robustness, attribution, answerability, and retrieval-aware evaluation rather than simple top-k retrieval demos.

The practical question is not whether to use RAG. The practical question is: where does the pipeline lose evidence?


The Real Failure Modes

Most RAG failures look like generation failures, but begin earlier.

Symptom | Likely Cause | Fix Direction
The answer sounds plausible but is wrong | Retrieved context is irrelevant or incomplete | Improve retrieval, reranking, and citation checks
The right document exists but is not retrieved | Chunking, metadata, or embedding mismatch | Revisit ingestion and query strategy
The answer ignores retrieved evidence | Context is too long, noisy, or badly ordered | Compress, rerank, and place strongest evidence first
The system answers when it should refuse | No confidence threshold or abstention policy | Add retrieval score gates and answerability checks
Evaluation looks good offline but fails in use | Test set does not match real queries | Build query logs, adversarial cases, and regression suites

This is why RAG optimization should start before the model call.


1. Treat Ingestion as Model Work

In many teams, ingestion is treated as plumbing. That is a mistake. The ingestion layer defines what the model can know.

A strong ingestion pipeline should preserve:

  • document hierarchy
  • section titles
  • tables and captions
  • page numbers or source locations
  • timestamps and version identifiers
  • access permissions
  • semantic metadata such as topic, product, jurisdiction, or document type

Flat text extraction is rarely enough. A paragraph from a policy document means something different when it appears under "Exceptions", "Definitions", or "Eligibility". If that structure disappears during ingestion, retrieval becomes blind.

For technical and legal domains, I prefer storing both the raw source and the retrieval representation. The embedding should not become the only memory of the document. If the source text is sensitive, a split-storage pattern can help: embeddings and metadata live in the vector layer, while the original text is encrypted or separately permissioned.

The retrieval index should be reproducible. Every chunk should be traceable back to:

  • source document
  • parser version
  • chunking strategy
  • embedding model
  • ingestion timestamp

Without this, evaluation becomes archaeology.
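The traceability fields above can be captured in a small record stored alongside each vector. A minimal sketch in Python; the field names and sample values are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkRecord:
    """A retrieval chunk plus the provenance needed to reproduce it."""
    chunk_id: str
    text: str
    source_doc: str          # source document identifier
    parser_version: str      # which parser produced the text
    chunking_strategy: str   # e.g. "semantic-overlap-v2"
    embedding_model: str     # model used to embed this chunk
    ingested_at: str         # ISO-8601 ingestion timestamp


chunk = ChunkRecord(
    chunk_id="policy-7#c12",
    text="Refunds are issued within 14 days of approval.",
    source_doc="policy-7.pdf",
    parser_version="parser-1.4.0",
    chunking_strategy="semantic-overlap-v2",
    embedding_model="bge-m3",
    ingested_at="2026-04-01T09:30:00Z",
)

# The record serializes cleanly, so it can be logged next to the vector.
print(asdict(chunk)["source_doc"])
```

Because the record is immutable and serializable, the same provenance can travel through logs, evaluation runs, and reindexing jobs.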


2. Chunking Is a Bias, Not a Detail

Chunking defines the unit of retrieval. Too small, and the model receives fragments without enough context. Too large, and retrieval becomes less precise, more expensive, and more vulnerable to irrelevant text.

There is no universal chunk size. The right chunk depends on the document type and query distribution.

Corpus Type | Better Starting Point
API documentation | Section-level chunks with headings preserved
Legal or policy text | Clause/section chunks with parent hierarchy
Research papers | Paragraph chunks plus abstract/section metadata
Support tickets | Whole ticket or issue-thread chunks
Tables | Table-aware extraction, not plain row concatenation

A useful baseline is semantic chunking with overlap, but overlap should be treated carefully. More overlap can improve recall, but it also increases index size, duplicate evidence, and context noise.
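As a concrete baseline, a sliding-window chunker with overlap can be written in a few lines. This sketch splits on whitespace for simplicity; a real pipeline would use the embedding model's own tokenizer, and the sizes here are starting points, not recommendations:

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based windows with a fixed overlap.

    Overlap improves recall at chunk boundaries, at the cost of a
    larger index and duplicated evidence in the context window.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_with_overlap(doc, chunk_size=200, overlap=40)
print(len(chunks))  # 500 words, step of 160 -> windows at 0, 160, 320
```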

Recent work around late chunking argues for embedding longer document context before splitting token representations into chunks. The motivation is important: traditional chunking can remove useful surrounding context before the embedding model sees it. Whether late chunking is worth adopting depends on your embedding stack, but the principle is broadly useful: do not destroy context too early.

My default rule: chunk for retrieval, but preserve hierarchy for interpretation.


3. Dense Retrieval Is Not Enough

Dense embeddings are powerful because they retrieve by meaning rather than exact terms. But exact terms still matter.

A user asking for MLA-C01, Form 1099-K, Article 23, FastAPI UploadFile, or an internal product code may not be well served by semantic similarity alone. Dense retrieval can miss rare identifiers, acronyms, numbers, and quoted phrases.

This is why hybrid retrieval is often a better default:

  • dense search for semantic similarity
  • sparse/BM25 search for lexical precision
  • metadata filtering for scope control
  • reranking for final ordering

Embedding models such as BGE-M3 also reflect this direction by supporting multiple retrieval modes across dense, sparse, and multi-vector retrieval. The broader lesson is not that one model solves retrieval. The lesson is that retrieval is multi-signal.

A practical first-stage retrieval pattern:

  1. Normalize the query.
  2. Apply mandatory metadata filters.
  3. Retrieve candidates from dense search.
  4. Retrieve candidates from BM25 or sparse search.
  5. Merge and deduplicate.
  6. Rerank the top candidates.
  7. Apply confidence and diversity checks.

This is usually more reliable than asking a single vector search call to do everything.
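Steps 3 through 5 of the pattern above are often merged with reciprocal rank fusion (RRF), which combines ranked lists from dense and sparse retrieval without having to compare their incompatible raw scores. A minimal sketch, with toy document IDs:

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked candidate lists into one ordering.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by multiple retrievers rise to the top. The
    constant k dampens the advantage of a single first-place hit.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]   # semantic neighbours
sparse = ["doc_c", "doc_a", "doc_d"]  # BM25 / lexical matches
merged = reciprocal_rank_fusion([dense, sparse])
print(merged[0])  # doc_a appears high in both lists, so it wins
```

Note how doc_a, ranked well by both retrievers, beats doc_c even though doc_c tops the sparse list; that agreement signal is exactly what fusion rewards.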


4. Rewrite the Query, But Keep the Original

Users do not write perfect search queries. They write symptoms, fragments, assumptions, and incomplete questions.

Query transformation can help:

  • expand acronyms
  • rewrite vague questions into domain terms
  • decompose multi-hop questions
  • generate hypothetical answers for retrieval
  • create multiple subqueries for different evidence paths

HyDE is a useful example: it generates a hypothetical document or answer, embeds that generated text, and uses it for retrieval. The generated text does not need to be factually correct; it only needs to move the query into a better retrieval neighborhood.

But query rewriting introduces risk. If the rewrite changes the intent, retrieval improves in the wrong direction. For this reason, the original query should remain visible in logs and evaluation. I also prefer retrieving from both the original query and transformed query, then letting reranking decide.

For domain systems, query rewriting should be measurable. Do not add it because it feels intelligent. Add it because recall improves on a representative evaluation set.
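Retrieving from both query forms and deduplicating before reranking can be sketched as follows. The `retrieve` callable and the `toy_retrieve` corpus below are stand-ins for any real first-stage retriever, kept tiny so the example runs on its own:

```python
from typing import Callable


def dual_query_retrieve(
    original: str,
    rewritten: str,
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Union candidates from both query forms, preserving order.

    The original query is retrieved first, so a bad rewrite can never
    remove evidence the user's own words would have found.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for query in (original, rewritten):
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Toy term-overlap retriever over a three-document corpus.
corpus = {
    "t1": "error uploading file with FastAPI UploadFile",
    "t2": "multipart form data size limits",
    "t3": "holiday calendar",
}


def toy_retrieve(query: str) -> list[str]:
    terms = set(query.lower().split())
    hits = [(len(terms & set(text.lower().split())), doc_id)
            for doc_id, text in corpus.items()]
    return [doc_id for overlap, doc_id in sorted(hits, reverse=True) if overlap > 0]


candidates = dual_query_retrieve(
    original="upload error",
    rewritten="FastAPI UploadFile multipart error",
    retrieve=toy_retrieve,
)
print(candidates)  # rewrite adds t2 without displacing the original hit
```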


5. Reranking Is Often the Highest-Return Upgrade

First-stage retrieval is optimized for speed. Reranking is optimized for precision.

A cross-encoder or reranker examines the query and candidate text together, then scores relevance more accurately than pure vector similarity. This is especially useful when the top 20 retrieved chunks contain the answer, but the top 5 are noisy.

A common pattern:

Stage | Candidate Count | Purpose
Dense + sparse retrieval | 50-100 | High recall
Reranking | 10-20 | High precision
Context assembly | 4-8 | Evidence for generation

The goal is not to maximize retrieved chunks. The goal is to maximize useful evidence per token.

Reranking also makes retrieval more explainable. When a system fails, it is easier to inspect whether the right evidence was retrieved but ranked too low, or never retrieved at all.
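The funnel above reduces to a small second-stage function. In this sketch the cross-encoder is replaced by a term-overlap scorer purely so the example is self-contained; in practice you would call a real reranker model on (query, candidate) pairs:

```python
def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Rescore candidates against the query and keep the best top_k.

    Stand-in scorer: fraction of query terms present in the candidate.
    A real system would use a cross-encoder that reads the query and
    the candidate together.
    """
    q_terms = set(query.lower().split())

    def score(text: str) -> float:
        return len(q_terms & set(text.lower().split())) / max(len(q_terms), 1)

    return sorted(candidates, key=score, reverse=True)[:top_k]


candidates = [
    "shipping times for international orders",
    "refund policy for damaged items",
    "refund timelines and approval steps",
    "office holiday calendar",
]
top = rerank("refund policy steps", candidates, top_k=2)
print(top)  # the two refund-related chunks survive the cut
```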


6. Long Context Does Not Remove Retrieval Design

Long-context models are useful, but they do not eliminate RAG engineering.

The "Lost in the Middle" paper showed that language models can struggle to use relevant information when it appears in the middle of long contexts. This matters for RAG: dumping more chunks into the prompt can make the answer worse, not better.

Context assembly should be deliberate:

  • put the strongest evidence early
  • remove near-duplicates
  • include source identifiers
  • group related chunks together
  • avoid mixing conflicting versions without warning
  • reserve room for the actual answer

A larger context window increases capacity. It does not guarantee attention, relevance, or faithfulness.

This is where compression becomes useful. Instead of passing every retrieved chunk, compress evidence into a smaller representation while preserving citations. The compression step should be conservative: remove noise, not nuance.
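A minimal assembly step that orders by rerank score, drops near-duplicates, and keeps source identifiers might look like this; the Jaccard threshold and chunk budget are illustrative choices to tune per corpus:

```python
def assemble_context(chunks: list[tuple[str, str, float]],
                     max_chunks: int = 4,
                     dedupe_threshold: float = 0.8) -> str:
    """Build a cited context block: strongest evidence first, near-duplicates removed.

    chunks: (source_id, text, score) triples from the reranker.
    Two chunks count as near-duplicates when their word-level Jaccard
    similarity exceeds dedupe_threshold.
    """
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    kept: list[tuple[str, str]] = []
    for source_id, text, _score in sorted(chunks, key=lambda c: c[2], reverse=True):
        if len(kept) >= max_chunks:
            break
        if all(jaccard(text, prev) <= dedupe_threshold for _, prev in kept):
            kept.append((source_id, text))
    return "\n\n".join(f"[{sid}] {text}" for sid, text in kept)


context = assemble_context([
    ("doc1#3", "Refunds are processed within 14 days.", 0.91),
    ("doc1#3b", "Refunds are processed within 14 days.", 0.88),  # duplicate, dropped
    ("doc2#1", "Approval requires a claim form.", 0.75),
])
print(context)
```

Keeping the `[source_id]` prefix on every chunk is what later makes citation checks and claim verification possible.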


7. Corrective and Self-Reflective RAG

Modern RAG systems increasingly include a control loop around retrieval.

Self-RAG introduced a framework where the model learns to retrieve, generate, and critique its own output using reflection signals. Corrective RAG takes a different angle: evaluate the quality of retrieved documents, then decide whether to use them, refine retrieval, or search elsewhere.

In production terms, these ideas point to a useful architecture:

  1. Retrieve evidence.
  2. Grade evidence relevance.
  3. Decide whether the question is answerable.
  4. Generate with citations.
  5. Verify whether claims are supported by retrieved context.
  6. Refuse or ask for clarification when evidence is insufficient.

This is not just academic decoration. It changes the product behavior. A mature RAG system should know when it does not know.

The challenge is cost and latency. Every extra grader, verifier, or retrieval pass adds time. The right implementation is usually conditional: simple queries take the fast path; uncertain queries trigger deeper retrieval and verification.
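In code, the conditional path can start as a simple gate on graded evidence. The thresholds below are assumptions to be tuned on an evaluation set, not recommended values:

```python
def decide_path(evidence_scores: list[float],
                answer_threshold: float = 0.6,
                deep_threshold: float = 0.35) -> str:
    """Route a query based on graded evidence quality.

    - strong evidence   -> fast path: answer with citations
    - middling evidence -> deep path: refine retrieval, verify claims
    - weak/no evidence  -> refuse or ask for clarification
    """
    best = max(evidence_scores, default=0.0)
    if best >= answer_threshold:
        return "answer"
    if best >= deep_threshold:
        return "deep_retrieval"
    return "refuse"


print(decide_path([0.82, 0.41]))  # strong top hit -> fast path
print(decide_path([0.48, 0.30]))  # uncertain -> deeper retrieval + verification
print(decide_path([]))            # no evidence -> refuse
```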


8. Graph and Hierarchical Retrieval

Some corpora are not best represented as isolated chunks.

Research papers cite other papers. Legal documents refer to articles, clauses, and exceptions. Enterprise knowledge bases contain product hierarchies, ownership boundaries, and process dependencies. In these cases, retrieval may need structure.

RAPTOR proposed building a tree of summaries over documents, allowing retrieval at different abstraction levels. GraphRAG uses graph structure and community summaries to support broader, global questions over a corpus.

These approaches are especially useful when the question is not "find the paragraph" but "explain the relationship".

Examples:

  • What are the main themes across these reports?
  • Which policies conflict with this operational rule?
  • How does this technical component affect downstream services?
  • What changed between two versions of a document set?

Graph or hierarchical retrieval is not necessary for every RAG system. It adds ingestion complexity. But for multi-document reasoning, it can provide a better retrieval substrate than independent chunks.
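To make the idea concrete, here is a toy hierarchical index in the spirit of RAPTOR: leaf chunks plus a summary layer, searched together so that a broad question can match the summary while a narrow one matches a leaf. Term-overlap scoring stands in for embeddings, and all documents are invented:

```python
# Toy hierarchical index: leaf chunks (level 0) plus parent summaries (level 1).
index = [
    {"id": "s1", "level": 1, "text": "summary: quarterly reports discuss rising support costs"},
    {"id": "c1", "level": 0, "text": "Q1 report: support tickets increased 12 percent"},
    {"id": "c2", "level": 0, "text": "Q2 report: hiring paused in the support team"},
]


def search(query: str, top_k: int = 2) -> list[str]:
    """Score every node, leaf or summary, by query-term overlap."""
    q = set(query.lower().split())
    scored = sorted(
        index,
        key=lambda node: len(q & set(node["text"].lower().split())),
        reverse=True,
    )
    return [node["id"] for node in scored[:top_k]]


print(search("main themes across reports"))  # broad question -> summary node ranks first
print(search("Q2 hiring support team"))      # narrow question -> leaf chunk ranks first
```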


9. Evaluation Must Be Built Into the Pipeline

RAG evaluation should not be a final manual check. It should be part of development.

At minimum, evaluate four layers:

Layer | Question
Retrieval recall | Did we retrieve the evidence needed to answer?
Ranking quality | Was the best evidence near the top?
Faithfulness | Are generated claims supported by context?
Answer usefulness | Is the final answer correct, complete, and usable?

RAGAS popularized reference-free and reference-based metrics for RAG evaluation, including faithfulness, answer relevancy, context precision, and context recall. These metrics are not a substitute for human judgment, but they are very useful for regression testing.

A practical evaluation set should include:

  • straightforward factual questions
  • multi-hop questions
  • ambiguous questions
  • unanswerable questions
  • adversarial wording
  • version-sensitive questions
  • queries using acronyms and exact identifiers

The unanswerable set is important. Many RAG systems are evaluated only on questions that have answers. That hides hallucination risk.

For each test query, store:

  • expected answer or rubric
  • required source document IDs
  • acceptable citations
  • failure notes
  • query category

Then run the same suite whenever you change chunking, embeddings, reranking, prompts, or model versions.
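Against records like these, the retrieval-recall layer of the regression suite reduces to recall@k over the required document IDs. A sketch with the live retriever stubbed out by hypothetical stored results:

```python
def recall_at_k(retrieved: list[str], required: set[str], k: int = 10) -> float:
    """Fraction of required source documents found in the top-k results."""
    if not required:
        return 1.0  # unanswerable queries have no required evidence
    return len(required & set(retrieved[:k])) / len(required)


# Hypothetical stored test queries: (query, required doc IDs, category).
suite = [
    ("refund window", {"policy-7"}, "factual"),
    ("conflicting clauses", {"policy-7", "policy-9"}, "multi-hop"),
    ("capital of Atlantis", set(), "unanswerable"),
]

# Stubbed retrieval results keyed by query, standing in for the live pipeline.
runs = {
    "refund window": ["policy-7", "policy-2"],
    "conflicting clauses": ["policy-9", "faq-1"],
    "capital of Atlantis": [],
}

for query, required, category in suite:
    r = recall_at_k(runs[query], required, k=10)
    print(f"{category:12s} recall@10 = {r:.2f}")
```

Reporting recall per query category, rather than one global number, is what reveals that multi-hop or version-sensitive queries are regressing while easy factual ones still pass.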


10. Monitor RAG Like a System, Not a Demo

A RAG pipeline has operational signals that should be monitored:

  • retrieval latency
  • generation latency
  • empty retrieval rate
  • low-confidence answer rate
  • refusal rate
  • citation coverage
  • top-k score distribution
  • token usage
  • user feedback
  • document ingestion failures
  • stale index age

The most useful signal is often drift in retrieval scores. If average similarity scores drop after a document update or embedding change, the system may still respond, but quality may degrade silently.
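A lightweight drift check compares the mean of recent top-k scores against a baseline window. The alert threshold and the sample scores below are illustrative and should be calibrated per corpus:

```python
from statistics import mean


def score_drift(baseline: list[float], recent: list[float],
                max_drop: float = 0.05) -> tuple[float, bool]:
    """Return (drop, alert) comparing mean retrieval scores.

    drop is how far the recent mean fell below the baseline mean;
    alert fires when the drop exceeds max_drop.
    """
    drop = mean(baseline) - mean(recent)
    return drop, drop > max_drop


# Top-1 similarity scores before and after a re-embedding (invented values).
baseline = [0.82, 0.79, 0.85, 0.80]
recent = [0.74, 0.71, 0.73, 0.70]
drop, alert = score_drift(baseline, recent)
print(f"drop={drop:.3f} alert={alert}")
```

The point of the gate is that the system keeps answering either way; without the alert, the quality loss would be invisible in ordinary request logs.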

For sensitive domains, log enough to debug failures without leaking private content. Store query IDs, retrieved document IDs, scores, model versions, and decision traces. Avoid storing raw user input when privacy requirements prohibit it.

RAG observability should answer three questions:

  1. What evidence was retrieved?
  2. Why was it selected?
  3. Which generated claims depend on it?

If those questions cannot be answered, the system is difficult to trust.


A Practical Optimization Order

When a RAG system is underperforming, I would not start by changing the LLM.

I would use this order:

  1. Inspect failed queries manually.
  2. Confirm whether the answer exists in the corpus.
  3. Check parsing and chunk boundaries.
  4. Add metadata filters where scope matters.
  5. Move from dense-only to hybrid retrieval.
  6. Add reranking.
  7. Improve context ordering and deduplication.
  8. Add answerability checks.
  9. Build a regression evaluation set.
  10. Only then compare generator models.

The language model is the most visible component, but retrieval quality sets the ceiling.


Reference Architecture

A reliable RAG pipeline can be designed as a series of gates:

Step | Output | Failure Gate
Ingestion | Clean chunks with metadata | Reject malformed documents
Indexing | Versioned dense/sparse indexes | Track embedding and parser versions
Query processing | Original + transformed queries | Preserve original intent
Retrieval | Candidate evidence | Detect empty or weak retrieval
Reranking | Ordered evidence | Remove irrelevant candidates
Context assembly | Compact cited context | Limit duplicates and stale versions
Generation | Draft answer | Require citations for factual claims
Verification | Supported answer | Refuse unsupported claims
Monitoring | Logs and metrics | Alert on drift and failures

This makes RAG less magical and more testable.


Closing Thought

RAG is best understood as a discipline of controlled evidence flow.

The system retrieves evidence, ranks it, compresses it, exposes it to a model, and then asks the model to answer within the boundaries of that evidence. Each step can improve quality. Each step can also introduce failure.

The strongest RAG systems are not the ones with the longest prompts or the largest models. They are the ones where evidence is traceable, retrieval is measurable, and uncertainty is allowed to surface.

That is the difference between a convincing demo and a system that can be trusted.


References

  • Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
  • Gao et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE).
  • Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts.
  • Asai et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.
  • Es et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation.
  • Yan et al. (2024). Corrective Retrieval Augmented Generation.
  • Sarthi et al. (2024). RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval.
  • Edge et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization.
  • Chen et al. (2024). M3-Embedding (BGE-M3): Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings.


About Gökçe Akçıl

AI/ML Engineer and Senior Software Engineer with 11+ years of experience specializing in end-to-end ML pipelines and large language models. M.Sc. in Artificial Intelligence.