
RAG Is Just a Search Engine Bolted to a Language Model
Something I realized today. RAG is essentially your personal Google for AI. Let me explain.
The fancy LLMs know a ton of stuff about the world. Where Napoleon was born. What makes the RB26 engine unique. They can even program your toaster. But they have zero clue about your PDF document. Or your corporate wiki. Or the 10,000 pages in your company’s document archive. They’ve never seen any of that.
AI sounds complex and smart (on purpose) but in reality it’s fairly simple. If you want your model to know something that it does not know yet — the most effective way is to dump it into the prompt. Just shove the relevant text in there before asking your question.
The problem? You can’t dump 10,000 documents into a prompt. Context windows are finite. And even if they weren’t — the model gets worse at finding answers as the haystack grows. More text in = more confusion, not more knowledge.
This is where RAG comes in. For each question, it finds the 3-8 most relevant pieces across all your documents and injects only those into the prompt. The model reads that evidence and generates an answer grounded in your actual data — not its hazy parametric memory.
That’s it. The concept is dead simple. Everything else — embeddings, vector stores, chunking strategies — is just engineering to make that “find the right pieces” step fast, accurate, and scalable.
You might ask: why not just fine-tune the model on your documents? Fair question. Three reasons:
Fine-tuning bakes facts into weights. When you fine-tune, knowledge gets distributed across billions of parameters. There’s no traceable path from an output back to a source document. You can’t audit which doc caused an answer. You can’t update one fact without retraining the whole thing. You can’t restrict access per-user. And the model can still hallucinate, because generating from weights is probabilistic prediction, not a database lookup.
RAG keeps retrieval explicit and decoupled. You can see exactly which documents were used. You can update docs without touching the model. You can apply access control at retrieval time. You can cite sources because you know what the model was looking at. And critically — retrieval and generation are independently tunable.
They solve different problems. Fine-tuning teaches the model a style or skill — how to answer, what tone to use, what format to follow. RAG gives it knowledge — what the facts actually are, right now, today. One shapes behavior. The other supplies evidence. The best systems use both.
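To make the “access control at retrieval time” point concrete, here is a minimal sketch. The Chunk class, the allowed_groups field, and the visible_chunks helper are all made up for illustration; a real system would usually push this filter down into the vector store’s own metadata filtering.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset  # hypothetical field: groups permitted to read this chunk

def visible_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]

chunks = [
    Chunk("Q3 revenue grew 12%.", "finance/q3.pdf", frozenset({"finance"})),
    Chunk("VPN setup steps.", "wiki/vpn.md", frozenset({"finance", "engineering"})),
]

# An engineer never gets the finance-only chunk, so it can never reach the prompt.
engineer_view = visible_chunks(chunks, ["engineering"])
print([c.source for c in engineer_view])
```

The filter runs before anything reaches the prompt, which is exactly what fine-tuned weights can’t offer: what the model never sees, it can’t leak.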
The Two Pipelines
RAG splits into two completely separate processes that share one data structure: the index.
Indexing runs offline — you process your documents once (and incrementally as they change) to build a searchable structure. This is your investment upfront.
Generation runs online — for each user question, you query that structure, retrieve relevant text, and feed it to the model. This is what happens at query time.
The only contract between these two pipelines is text. The indexing side produces text chunks and stores them (with vectors as finding aids). The generation side consumes text chunks by stuffing them into a prompt. They never share weights. They never talk in vector space. The LLM has no idea that vectors were involved — it just sees reference text and a question.
This means you can swap either side independently. Change your embedding model, retune your chunking — the LLM doesn’t know. Switch from Claude to GPT to Llama — retrieval doesn’t care. If answers suck, you can diagnose: “am I retrieving the wrong chunks?” (retrieval problem) or “am I retrieving the right chunks but the model ignores them?” (generation problem). Clean separation of concerns.
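One way to test the retrieval half in isolation is a hit rate over a small hand-labeled set of questions. This is a sketch under assumptions: the hit_rate_at_k name, the chunk IDs, and the labels are invented here, and the metric is just one reasonable choice.

```python
def hit_rate_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of questions whose top-k results contain at least one relevant chunk."""
    if not retrieved_ids:
        return 0.0
    hits = 0
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        if set(retrieved[:k]) & set(relevant):
            hits += 1
    return hits / len(retrieved_ids)

# One list of retrieved chunk IDs per question, best match first (made-up data).
retrieved = [["c7", "c2", "c9"], ["c1", "c4", "c3"]]
# Hand-labeled chunk IDs that actually answer each question (made-up data).
relevant = [{"c2"}, {"c8"}]

rate = hit_rate_at_k(retrieved, relevant, k=3)
print(rate)
```

If this number is low, the problem is retrieval and no amount of prompt tweaking will fix it. If it is high and answers are still bad, look at the generation side.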
The two pipelines meet at the vector store, which is just a database optimized for one very specific query: “give me the things most similar to this.”
Query Pipeline
When a user asks a question, you need to find semantically relevant stuff across your documents and dump it into the prompt. The difficult part: how do you find things that mean the same thing across millions of pages? Not just matching exact words (grep won’t work here), but finding text that answers your question even if it uses completely different words.
Computers have no idea what anything means. They only understand numbers. So here’s the trick: convert text into numbers. Specifically, convert each chunk of text into a vector — a long series of numbers (think: 1024 floats) that represents its meaning in a mathematical space.
Why does this work? Because embedding models are trained to place semantically similar text close together. During training, the model sees millions of (question, answer) pairs and learns: “these should be nearby in vector space.” So after training, “How much does storage cost?” and a document chunk saying “S3 pricing is $0.023/GB” end up as vectors pointing in roughly the same direction — even though they share almost no words.
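“Pointing in roughly the same direction” is usually measured with cosine similarity. Here is a small sketch; the three-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from an embedding model, not from hand-typed numbers).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for real embeddings.
question = [0.9, 0.1, 0.3]          # "How much does storage cost?"
pricing_chunk = [0.8, 0.2, 0.4]     # "S3 pricing is $0.023/GB"
unrelated_chunk = [-0.1, 0.9, -0.5]

sim_pricing = cosine_similarity(question, pricing_chunk)
sim_unrelated = cosine_similarity(question, unrelated_chunk)
print(sim_pricing > sim_unrelated)
```

The retrieval step is nothing more than this comparison repeated against every stored chunk, with the highest scores winning.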
One rule within the retrieval side: your question and your document chunks must go through the same embedding model. They live in the same vector space — otherwise measuring distance between them is meaningless. (This is separate from the “swap anything” point earlier. The embedding model and the generative LLM are different things — you swap them independently. But if you swap your embedding model, you re-embed all your chunks too. They’re a matched pair.)
And here’s what makes it practical: finding similar vectors is stupid cheap. If you have a vector V1, finding the vectors closest to V1 among millions is a solved problem: approximate nearest-neighbor indexes (HNSW graphs, for example) do it in milliseconds, not hours, trading a little accuracy for a lot of speed. This is what vector databases and libraries do (OpenSearch, pgvector, Pinecone, FAISS). You give it a vector, it returns the nearest neighbors. Fast.
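For intuition, here is the brute-force version of “give me the nearest neighbors”. It is a sketch, not how a vector database works internally: it compares the query against every stored vector, which is exactly the linear scan that index structures like HNSW exist to avoid. The function names and toy vectors are assumptions.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(query, vectors, k):
    """Exact top-k by cosine similarity. Cost grows linearly with the number of vectors."""
    scored = ((cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items())
    return heapq.nlargest(k, scored)

# Made-up 2-D vectors keyed by chunk ID.
vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
top = nearest_neighbors([1.0, 0.1], vectors, k=2)
print([chunk_id for _, chunk_id in top])
```

This is fine for a few thousand chunks. At millions, you hand the same question to an index that skips most of the comparisons.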
But semantic search alone isn’t enough. What if the user asks about “error code XJ-4021”? The embedding model has no idea what that arbitrary string means. But keyword search (BM25 — basically a smarter grep) finds it instantly because it matches exact tokens. This is why production systems run hybrid search — semantic and keyword in parallel. Semantic catches meaning; keyword catches exact terms. Results get merged.
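How the two result lists get merged is a design choice. One common option is reciprocal rank fusion (RRF), sketched below. This is my pick for illustration, not something the pipeline above requires; the chunk IDs are made up, and k=60 is just the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk IDs; items ranked high in several lists win."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

semantic = ["c3", "c1", "c7"]  # best-first results from vector search (made up)
keyword = ["c9", "c3", "c1"]   # best-first results from keyword search (made up)

merged = reciprocal_rank_fusion([semantic, keyword])
print(merged)
```

RRF only looks at rank positions, so you don’t have to reconcile cosine similarities with BM25 scores, which live on completely different scales.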
Then: reranking. Retrieval is fast but blunt: it encodes your query and each chunk independently. It asks “are these in the same topic area?” not “does this chunk actually answer this specific question?” A reranker takes each (query, chunk) as a pair, processes them jointly, and outputs a precise relevance score. You retrieve broadly (50 candidates) and rerank tightly (keep top 5). Retrieval gets you in the right neighborhood. Reranking finds the exact house.
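The retrieve-broadly, rerank-tightly shape looks roughly like this. The scoring function is passed in because a real reranker is a model you would call here; toy_score is a word-overlap stand-in invented for this sketch, not a real reranker.

```python
def retrieve_then_rerank(query, candidates, score_pair, keep=5):
    """Score every (query, chunk) pair jointly, then keep the best few."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]

def toy_score(query, chunk):
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

# Pretend these came back from the broad retrieval step.
candidates = [
    "storage pricing overview",
    "how to reset your password",
    "how much does storage cost per gb",
]
top = retrieve_then_rerank("how much does storage cost", candidates, toy_score, keep=2)
print(top[0])
```

The expensive scorer only ever sees the handful of candidates retrieval already found, which is why you can afford a slow, precise model at this step.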
Finally, you take those top chunks — as plain text — stuff them into the prompt alongside the user’s question, and hand it to the LLM. The model doesn’t know vectors were involved. It just sees reference text and a question. Reading comprehension. Generate answer.
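The “stuff them into the prompt” step is plain string formatting. A minimal sketch follows; the instruction wording, the dict keys, and the numbering scheme are all assumptions you would tune for your own model.

```python
def build_prompt(question, chunks):
    """chunks: list of dicts with 'text' and 'source' keys (a hypothetical schema)."""
    context = "\n\n".join(
        f"[{i}] ({chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the reference text below. "
        "Cite sources by their bracketed number. "
        "If the answer is not in the references, say so.\n\n"
        f"References:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How much does storage cost?",
    [{"text": "S3 pricing is $0.023/GB.", "source": "pricing.md"}],
)
print(prompt)
```

Notice there is not a single vector in there. By this point it is all text.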
Indexing
RAG will not work if you don’t load any data into your vector database. By now, you should have an idea of how this works. But a few things are worth calling out, because this is where most RAG systems silently fail.
Step 1: Parsing. Your documents aren’t text — they’re PDFs (bags of positioned glyphs), DOCX files (zipped XML), HTML (90% boilerplate). Before anything else, you extract clean text. This is unsexy but critical. Garbage in, garbage out. A PDF with bad extraction will produce chunks of garbled nonsense, and no amount of fancy embedding will save you.
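For the HTML case, here is a rough sketch using only Python’s standard library. The list of tags to skip is my assumption about what counts as boilerplate, and real pages usually need a sturdier extractor. PDFs and DOCX need dedicated parsers and are not covered here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside boilerplate tags."""
    SKIP = {"script", "style", "nav", "footer"}  # assumed boilerplate tags

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = "<html><nav>Home | Docs</nav><p>S3 pricing is $0.023/GB.</p><script>track()</script></html>"
text = html_to_text(page)
print(text)
```

Whatever parser you use, read a sample of its output before embedding anything. It is the cheapest quality check in the whole pipeline.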
Step 2: Chunking. You split parsed text into segments — typically 256 to 1024 tokens each. Why not just embed the whole document? Two reasons:
- Embedding models have token limits (many older ones cap at 512 tokens; some newer ones handle 8192 or more)
- Precision — a 50-page document as one vector is useless. You need granular pieces so retrieval can pinpoint which part answers the question
The tradeoff: too big = noise (you retrieve a chapter when you need a sentence). Too small = lost context (the chunk says “the service described above costs $50/month” but “the service described above” is in a different chunk). A common mitigation: overlap. 10-20% overlap between consecutive chunks preserves context at boundaries, though it won’t rescue a reference that points several chunks back.
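A sliding window with overlap can be sketched in a few lines. To stay dependency-free this version counts whitespace-separated words, which is only a rough stand-in for model tokens; the function name and the tiny sizes are chosen for readability, not as recommendations.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into word windows; each window repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(20))
pieces = chunk_words(sample, chunk_size=8, overlap=2)
for piece in pieces:
    print(piece)
```

In practice you would count tokens with the embedding model’s own tokenizer and prefer to split on sentence or section boundaries, but the windowing logic stays the same.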
Step 3: Embedding. Each chunk goes through the embedding model → out comes a vector. Same model you’ll use at query time — remember, they must share the same vector space. This is the expensive compute step. If you change your embedding model later, you re-run this for every chunk. That’s why it’s a deliberate choice, not something you swap casually.
Step 4: Store. Vectors go into the vector database alongside metadata — source file path, page number, timestamp, author, document title. The metadata travels with every chunk. It enables two things: filtering at query time (“only search docs from 2024”) and citation in the answer (“this came from page 14 of the Q3 report”). Without metadata, you have a pile of anonymous chunks with no way to trace answers back to sources.
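Here is a sketch of what a stored record might look like and how the metadata earns its keep. The Record fields and helper names are hypothetical, and a plain list stands in for the vector database; real stores expose their own filter syntax.

```python
from dataclasses import dataclass

@dataclass
class Record:
    vector: list   # the embedding, used only for finding the chunk
    text: str      # what actually goes into the prompt
    source: str
    page: int
    year: int

def filter_by_year(records, year):
    """Narrow the search space before any similarity search runs."""
    return [r for r in records if r.year == year]

def cite(record):
    """Turn metadata into a human-readable citation."""
    return f"{record.source}, page {record.page}"

# Made-up records with toy 2-D vectors.
store = [
    Record([0.1, 0.9], "Revenue grew 12%.", "q3_report.pdf", 14, 2024),
    Record([0.8, 0.2], "Revenue grew 4%.", "q3_report_2022.pdf", 9, 2022),
]

recent = filter_by_year(store, 2024)
print(cite(recent[0]))
```

The vector finds the chunk, the text answers the question, and the metadata tells you where it came from. Drop any one of the three and the system gets noticeably worse.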
That’s it. This runs once per document (and incrementally as docs change). The cost is upfront. After indexing, your query pipeline can search across everything in milliseconds.
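For the “incrementally as docs change” part, one simple approach is to hash each document’s text and re-index only what changed. This is a sketch under assumptions: the function names and the path-to-hash bookkeeping are invented here, and it does not handle deleted documents.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(documents, indexed_hashes):
    """documents: {path: current text}; indexed_hashes: {path: hash stored at last indexing run}."""
    return sorted(
        path for path, text in documents.items()
        if indexed_hashes.get(path) != content_hash(text)
    )

# State saved from the previous run (made-up data).
previous = {"a.md": content_hash("old text"), "b.md": content_hash("same text")}
# What is on disk now.
current = {"a.md": "new text", "b.md": "same text", "c.md": "brand new doc"}

changed = docs_to_reindex(current, previous)
print(changed)
```

Only the changed and new files go back through parsing, chunking, and embedding, which keeps the expensive step proportional to what actually moved.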
The Whole Point
RAG isn’t magic. It’s a search engine bolted onto a language model. The search finds what’s relevant. The model turns it into an answer. Everything in between — embeddings, vector stores, chunking, reranking — is plumbing to make that search fast and accurate at scale.
If your answers suck, it’s one of two things: you’re retrieving the wrong chunks, or the model is ignoring the right ones. Diagnose which half is broken. Fix that half. That’s it.
The concept fits in one sentence: find the right context, stuff it in the prompt, let the model read. The rest is engineering. Good engineering — but engineering, not wizardry.