LLM, RAG: Who Are They and How They Actually Talk

You walk into a large library. Every book, paper, article, research note, forum post, and documentation page that humanity has produced is stored here, read and internalized by the librarian.
You find the desk clerk at the entrance. You ask: "Who won the last FIFA World Cup, and what are the fixtures for the next one?"
The clerk takes your question to the librarian. The librarian, who has read everything in the building, gives you an answer from memory. Instantly. Confidently.
"Argentina won the 2022 World Cup in Qatar. As for 2026, the draw happened in December 2025, but I am not certain which teams qualified after that. Some qualifying matches finished in early 2026."
Half right. Half uncertain. In production, uncertainty is almost as bad as wrong.
The library was organized and indexed, and the librarian internalized everything as of December 31, 2025. Today is May 2026. Qualifying completed in March 2026. The confirmed group fixtures, the final team list, the opening match details: all of that exists, just not in this library.
That librarian is the LLM (Large Language Model): a system trained on vast amounts of text, up to a fixed point in time, to understand language, reason over it, and generate coherent responses.
The obvious fix is to ask the librarian to keep learning: read everything published since training ended, internalize it, and update the knowledge. But retraining is like re-indexing every book in that multi-storey library from scratch every time something changes. The compute cost runs into millions of dollars. The time makes it impractical for anything that changes faster than once a year. The time and cost involved are mind-boggling.
Turns out there is a better approach. A much cheaper one. Engineers, being engineers, worked around it.
The New Information Room
Next to the main library, there is a smaller room. Call it the new information room. Everything published after the library was indexed lives here: recent documents, current data, your internal knowledge base, anything that post-dates the training cutoff.
When you ask a question, the desk clerk does not go straight to the librarian. The clerk first checks the new information room, pulls the relevant pages, and hands them to the librarian along with your question.
The librarian now has two sources: everything internalized during training, and the fresh pages just retrieved. It draws on both and synthesizes an answer.
The new information room has a document: "The 2026 FIFA World Cup opens on June 11, 2026, with Mexico vs South Africa at Estadio Azteca. Argentina, the defending champion, is drawn in Group J alongside Algeria, Austria, and Jordan."
The desk clerk finds it, hands it to the librarian, and the librarian answers fully and correctly.
No retraining. No re-indexing. Just a smarter fetch step before every response.
The new information room is the store. RAG (Retrieval-Augmented Generation) is the process: retrieve relevant documents from the store, augment the prompt with them, and generate a grounded response. The room is a component. RAG is the pattern that uses it.
Meet the Cast
Every character here maps to a real component in the architecture.
The library stacks are the LLM's training data: the frozen corpus of everything the model learned during training. Vast. Static. Ends at a cutoff date.
The librarian is the LLM itself. Does not fetch. Does not search. Reads everything handed to it and synthesizes a coherent response. The intelligence lives here.
The desk clerk is the retrieval engine: the component that searches the new information room and returns relevant documents. Runs fast. Does not reason. Does not generate.
The new information room is the RAG store: documents loaded after the library was indexed, organized by meaning rather than by topic or alphabetical order. More on how that works shortly.
The user is you. You ask. You receive. For now, you are evaluating the answer yourself. There is a better way, and we will get to it on Day 4.
How a Request Flows
User --> Desk Clerk --> New Information Room --> Librarian --> Answer
You ask the desk clerk. The clerk searches the new information room and brings back the relevant pages, the retrieved context. Those pages go to the librarian alongside your original question. The librarian reads both, draws on the library stacks for general knowledge, and produces an answer grounded in current information.
Simple on the surface. The depth is in how each step actually works.
Chunking: Preparing Documents for the New Information Room
Before any document lands in the new information room, it gets cut into pieces called chunks: passages of text sized for retrieval, roughly paragraph-length, typically a few hundred characters.
Each chunk is stored and retrieved independently. When the desk clerk searches the room, it does not retrieve whole documents. It retrieves the specific chunks most relevant to your question.
Why not store whole documents?
Chunk too large: the relevant sentence is buried inside three irrelevant paragraphs. All of it lands in the context section, and the librarian has to work through noise to find the signal.
Chunk too small: the relevant sentence arrives without the surrounding explanation. The librarian has the fact but not the context to use it correctly.
Two broad approaches:
Fixed-size chunking splits every N characters with overlap to avoid cutting mid-thought. Fast, predictable, and easy to implement. Its limits show quickly: chunks that cut mid-sentence, context split across retrieval boundaries.
Semantic chunking splits at natural meaning boundaries: sentences, paragraphs, topic shifts. More expensive to compute, but it produces more coherent chunks and better retrieval quality.
In rag-api, I used fixed-size chunking (500 characters per chunk, 50 character overlap). In agentic-ai-platform, I moved to semantic chunking via LlamaIndex's SentenceSplitter (512 characters, 64 overlap). The retrieval quality difference was measurable. Day 3 has the numbers.
Embeddings and Semantic Search: How the New Information Room Finds What Is Relevant
The new information room is not organized alphabetically or by topic. You cannot browse by category. You search it by meaning, and meaning in technical terms is called semantics. Searching by meaning is called semantic search. Here is how it actually works.
Every piece of text — your question and every chunk stored in the room — gets converted into a list of numbers called an embedding (a numerical representation of meaning in high-dimensional space, typically over a thousand numbers per piece of text). The conversion is done by a separate model called an embedding model. In my implementation, I used OpenAI's text-embedding-3-small.
The key property: similar meaning produces similar numbers. "Who won the FIFA World Cup in 2022?" and "Argentina lifted the World Cup trophy in Qatar" produce very similar embeddings, even though they share almost no words. They are about the same event, and the numbers reflect that.
Similarity between embeddings is measured using cosine similarity: the angle between two vectors in high-dimensional space.
cos(θ) = (A · B) / (|A| × |B|)
Where A is the query embedding and B is a document chunk embedding. The smaller the angle, the higher the score, and the closer the meaning.
The query goes through the same embedding conversion as every chunk. Cosine similarity of any vector against itself is always 1.00, making it the reference point against which everything else is scored.
Query: "FIFA 2026 World Cup opening match and fixtures"
embedding score: 1.00 (reference)
Chunk A: "The 2026 World Cup opens June 11, Mexico vs South
Africa at Estadio Azteca. Argentina in Group J
with Algeria, Austria, and Jordan." score: 0.94
Chunk B: "Argentina won the 2022 World Cup in Qatar,
defeating France on penalties in the final." score: 0.76
Chunk C: "Estadio Azteca is in Mexico City with a
seating capacity of 87,000." score: 0.41
Chunk D: "The Thames is the longest river in England." score: 0.03
The desk clerk brings back Chunks A and B. C and D stay on the shelf.
Notice that Chunk B scores 0.76 even though it is about the 2022 tournament, not 2026. Semantic search finds related meaning, not just exact topic matches. The librarian receives both and synthesizes a complete answer covering past and present.
One practical note: the desk clerk does not compare your query against every single chunk by brute force. At scale, that would be far too slow. Vector databases use approximate search algorithms like HNSW (Hierarchical Navigable Small World) that find the closest matches in logarithmic time, trading a small, tunable amount of precision for a large gain in speed.
The Vector Database: Filing System of the New Information Room
Something has to store all those embeddings and run those similarity searches efficiently. That is the vector database: the filing system inside the new information room. It stores each chunk alongside its embedding and returns the highest-scoring chunks in milliseconds when the desk clerk arrives with a query.
In my implementation, I used ChromaDB: a local, persistent vector database. Simple to set up, no managed infrastructure required, and the right choice for a single-service build. Day 3 goes deeper on how the vector database is built and queried, and why the storage choice matters at scale.
The Prompt: Your API Request Payload
You already know how to make an API call. Sending a prompt to an LLM is one. Same verb, same headers, same JSON body. The only difference is what goes inside.
Here is a REST request you have probably written a hundred times:
POST /api/search HTTP/1.1
Content-Type: application/json
Authorization: Bearer {token}
{
"query": "FIFA 2026 fixtures and opening match"
}
Here is an LLM request:
POST /v1/messages HTTP/1.1
Content-Type: application/json
Authorization: Bearer {api-key}
{
"model": "gpt-4o",
"system": "[ your standing instructions ]",
"context": "[ retrieved chunks go here ]",
"user": "[ the question ]"
}
Same structure. The body just has a specific contract the LLM API expects.
The exact format varies by provider. OpenAI uses a messages array with role fields, and Anthropic structures it slightly differently. But every LLM API expects these three concepts in some form. The contract is the same even when the schema differs.
The three slots:
System prompt: standing instructions to the librarian. How to behave, what tone to use, what to refuse, what to prioritize. Set once per session. Think of it as your API's base configuration.
Context: the pages the desk clerk retrieved from the new information room. This is what RAG injects. Without it, the librarian only has the frozen stacks. With it, every request can include current, specific, and relevant information.
User message: your actual question. The request body.
The assembled whole is the augmented prompt. This is what the LLM actually processes.
Context engineering is the discipline of deciding what goes into the context section: which documents, how many, in what order, and how recent. It sounds like a detail. It is where RAG systems quietly fail in production. The numbers from my build make that concrete, and they are not flattering. We will get to them in Day 3.
What the Librarian Actually Receives
Here is that same structure, populated with real content from the FIFA example:
POST /v1/messages HTTP/1.1
Content-Type: application/json
Authorization: Bearer {api-key}
{
"model": "gpt-4o",
"system": "You are a helpful assistant. Use the
provided context for current information,
and your own knowledge for everything else.
If you are unsure, say so.",
"context": "Argentina won the 2022 FIFA World Cup in
Qatar, defeating France on penalties.
The 2026 FIFA World Cup opens June 11 with
Mexico vs South Africa at Estadio Azteca.
The tournament runs through July 19 across
the USA, Canada, and Mexico. Argentina, the
defending champion, is in Group J alongside
Algeria, Austria, and Jordan.",
"user": "Who won the last World Cup, and what are
the 2026 fixtures and opening match details?"
}
The librarian reads all three sections. It draws on its training knowledge for general context and uses the retrieved pages for everything current. It produces: "Argentina won the 2022 World Cup in Qatar. The 2026 tournament opens June 11 with Mexico vs South Africa at Estadio Azteca. Argentina, the defending champion, faces Algeria, Austria, and Jordan in Group J."
Grounded. Current. Complete. No retraining required.
For anything current, the quality of that answer depended on what landed in the context section. If the desk clerk had retrieved Chunk D about the Thames instead, the librarian would have had nothing useful for the 2026 question, even though its training knowledge handled the 2022 answer just fine. The generation can be working perfectly while the retrieval is broken. That is the most important thing to understand about RAG in production.
One Character Still Missing
Right now, you are the one deciding whether the answer is good enough. You read the response, evaluate it, and decide whether to ask a follow-up. You are managing the loop.
What if you had an assistant instead? You hand them the task and describe what a good answer looks like, then step back. The assistant handles the queuing, the follow-up questions, the evaluation, and the loop. They come back only when the answer meets the bar you set.
That assistant is an Agent. That is Day 4.
The Cast, Mapped
| Library | Technical Reality |
|---|---|
| Library stacks | LLM training data, frozen at cutoff |
| Librarian | LLM: synthesizes, does not fetch |
| Desk clerk | Retrieval engine: searches, does not reason |
| New information room | RAG store: documents added after training |
| Pages fetched | Retrieved chunks: injected into context section |
| Cutting documents into pieces | Chunking: the unit of retrieval |
| Filing system inside the room | Vector database: stores and searches embeddings |
| Converting text to numbers | Embedding: numerical representation of meaning |
| Finding pages by meaning | Semantic search: cosine similarity over embeddings |
| System prompt | Standing instructions: base configuration |
| Context section | Retrieved chunks in the augmented prompt |
| User message | Your question: the request body |
| Augmented prompt | System plus context plus user: the full LLM payload |
| Context engineering | Deciding what goes in the context section |
| Assistant | Agent: runs the loop so you do not have to (Day 4) |
In the next article, we get out of theory and into code. Three repos, each one adding a capability the previous one lacked, each one teaching something the next one needed. Starting with the simplest possible thing: just talk to the LLM, and see exactly what it cannot do on its own.
Next up: Chat, Search, RAG: Building From Scratch — (coming soon)
Previously: From AI-Integrated Systems to AI Platform Architect
Part of the series: I Built an Agentic AI Platform. Here's What I Learned.
All code lives at github.com/manisundaram



