AI Architecture Articles | Under the hood

RAG is Not a Database

June 11, 2026 · 3 min read

Software Architect

Most discussions about Retrieval-Augmented Generation (RAG) frame it as a way to “connect LLMs to data.” That framing is incomplete — and in large-scale systems, misleading.

RAG is not a database layer. It is a runtime context construction system for a frozen model.

Understanding this distinction is critical for anyone designing production AI systems.

RAG does not store knowledge

A common misconception is that RAG acts like a knowledge store.

That is not accurate.

Instead:

Source data lives in external systems (databases, documents, APIs)
Vector indexes store semantic representations, not ground truth
Retrieval does not return facts; it returns the fragments whose representations score as most similar to the query

RAG does not guarantee correctness. It does not enforce consistency. It does not maintain a canonical state of knowledge.

RAG is a context assembly pipeline, not a query engine

Traditional databases:

Deterministic queries
Structured schema
Exact retrieval guarantees

RAG systems:

Approximate semantic retrieval
Probabilistic ranking
Context window assembly for an LLM

The output of RAG is not a “result set.” It is a prompt-ready context bundle.
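A minimal sketch of that difference, assuming a toy bag-of-words embedding and a whitespace word count in place of a real embedding model and tokenizer. The names `embed`, `retrieve`, and `build_context` are hypothetical and not taken from any library.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Rank chunks by similarity to the query; this is ranking, not exact lookup."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_context(query, chunks, k=2, token_budget=30):
    """Assemble a prompt-ready bundle: top-k chunks that fit a token budget."""
    selected, used = [], 0
    for chunk in retrieve(query, chunks, k):
        cost = len(chunk.split())  # crude token count (assumption)
        if used + cost > token_budget:
            continue
        selected.append(chunk)
        used += cost
    context = "\n".join(f"- {c}" for c in selected)
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refunds are processed within 14 days of the return request.",
    "The office cafeteria opens at 8 am on weekdays.",
    "Return requests must include the original order number.",
]
prompt = build_context("how long do refunds take", chunks)
```

With this toy embedding the cafeteria chunk still lands in the bundle, because top-k returns the k best-scoring chunks even when a score is zero. Rephrasing the query as "refund timeline" scores zero against every chunk, which is the phrasing sensitivity a real system has to observe and tune for.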

Retrieval is not reasoning or verification

Retrieved chunks are:

Semantically relevant
Not necessarily correct
Not validated against a source of truth at retrieval time

The LLM becomes the reasoning layer that interprets this context.

This introduces a key architectural reality:

RAG does not reduce hallucinations by itself — it only changes the input surface area.

RAG is constrained by the context window

Unlike databases, RAG operates under a hard constraint:

Limited tokens
Limited attention capacity
Competing relevance signals

This forces system-level decisions:

Chunking strategy
Embedding granularity
Ranking and filtering logic

These are not data concerns — they are cognitive load management decisions for the model.
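One of those decisions, chunking, can be sketched as fixed-size windows with overlap. This is only an illustration: it assumes whitespace-separated words stand in for tokens, and `chunk_words` is a hypothetical helper, not a library function.

```python
def chunk_words(text, size=8, overlap=2):
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(text, size=8, overlap=2)
```

Larger windows keep more surrounding context per chunk but consume more of the token budget; overlap trades some duplication for a lower chance of cutting a sentence's meaning in half.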

RAG is a probabilistic system, not a deterministic one

Unlike traditional data systems:

Retrieval is not guaranteed complete
Ranking is heuristic
Similarity search is approximate
Results vary with embeddings and query phrasing

This makes RAG inherently:

Non-deterministic
Sensitive to configuration
Difficult to reason about without observability

System design complexity shifts to the edges

Once you understand RAG correctly, a key shift happens:

The complexity moves away from the model and into the system:

Chunking strategy becomes critical
Embedding model choice becomes architectural
Retrieval ranking becomes a relevance system
Prompt construction becomes a control surface

In other words:

RAG systems are not “LLM integrations” — they are retrieval + reasoning pipelines.

RAG is not the intelligence layer — it is the context layer

A useful mental model:

LLM = reasoning engine (frozen function)
RAG = context shaping system
Orchestration layer = control logic

RAG does not make the system intelligent. It determines what the model is allowed to see before it reasons.

Implications for enterprise architecture

Treating RAG as a database abstraction leads to predictable failures:

Over-reliance on embeddings as “truth”
Poor chunk design leading to lost context
Inconsistent retrieval quality across use cases
Unexpected hallucinations due to missing context rather than model failure

Instead, production systems should treat RAG as:

A context engineering layer
A relevance filtering system
A probabilistic pre-processing stage for reasoning

Final takeaway

RAG is often described as a bridge between data and LLMs.

A more accurate description is:

RAG is a probabilistic context construction system that shapes what a frozen model can reason about at runtime.

The model provides intelligence. The system determines what intelligence can see.

How LLM Works Under the Hood

June 9, 2026 · 7 min read

Jitender Sharma

Software Architect

Most discussions about LLMs focus on prompts, tools, and frameworks. However, few explain how the model actually works under the hood and why that matters when building real systems.

This is a 20,000-ft view of the LLM lifecycle in four stages.

The big picture: one model, four stages.

A model's whole life is just four stages. The shape and vocabulary are fixed first; training only fills in the values, and inference is read-only and never learns.

Stage	What happens	Key ideas
Before	Decide the blueprint	Architecture dials set the shape, tokenizer builds the vocabulary, and parameter count is fixed.
During	Fill in the values	Random weights become meaningful through training: a four-step loop run hundreds of thousands to millions of times.
Alignment	Make it helpful	Show good examples (SFT) and teach which answers are better (RLHF/DPO).
After	Run it, read-only	Weights are frozen (no learning); inference traverses the model geometry one token at a time.

TAKEAWAY

Shape + vocabulary are fixed first. Training only fills the values. Inference never learns.

Stage 1 - Before training

Two human decisions are baked in before any gradient is computed.

Architecture dials - hidden size, layers, heads, FFN width, vocab size.
Tokenizer vocabulary - the integer alphabet the model reads and writes.

A "7B" model is 7B because of these dials — training never grows it, and most parameters live in the FFN, not attention.

The Architecture dials

Hyperparameter	Example	Description
hidden_size(D)	4096	How much "thinking space" the model has for each word or idea at a given moment.
num_layers(L)	32	How many rounds of refinement - 32 editors in a row.
num_heads(H)	32	A panel of specialists, each spotting a different pattern.
head_dim(D_h)	128	The size of each specialist's notebook.
ffn_hidden(D_ff)	16,384	The knowledge bank — where most facts are stored (~4*D).
vocab_size(V)	32,000	The size of the model's dictionary: the building blocks it uses to read and write language.

TAKEAWAY

The model is fully sized and described before it sees a single token.
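As a rough check, the example dials in the table can be turned into a parameter count. The sketch below assumes a plain decoder-only transformer with a two-matrix FFN, untied input and output embeddings, and no biases or normalization weights; real architectures differ in these details, and `estimate_params` is a hypothetical helper.

```python
def estimate_params(hidden_size, num_layers, ffn_hidden, vocab_size, tied_embeddings=False):
    """Rough parameter count for a decoder-only transformer (biases and norms ignored)."""
    attention = 4 * hidden_size * hidden_size   # Q, K, V, and output projections
    ffn = 2 * hidden_size * ffn_hidden          # up- and down-projection (two-matrix FFN assumed)
    embeddings = vocab_size * hidden_size       # token embedding table
    if not tied_embeddings:
        embeddings *= 2                         # separate output (unembedding) matrix
    return {
        "attention": attention * num_layers,
        "ffn": ffn * num_layers,
        "embeddings": embeddings,
        "total": (attention + ffn) * num_layers + embeddings,
    }

counts = estimate_params(hidden_size=4096, num_layers=32, ffn_hidden=16384, vocab_size=32000)
```

Under these assumptions the total comes to roughly 6.7 billion parameters, with the FFN blocks holding about twice as many as attention. Nothing in the calculation depends on training data, only on the dials.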

Stage 2 - During training

Learning is one four-step loop, repeated hundreds of thousands to millions of times.

Forward Pass - Predicts what comes next in a sequence, based on previous tokens.
Loss - How wrong was our prediction?
Backpropagation - Calculate how much each weight contributed to the error, and in which direction.
Optimizer step - Update every weight by a small amount in the direction that reduces the error.

note

The only thing learned here is next-token prediction: the statistical relationship between tokens given their surrounding context. Pre-training delivers language and knowledge; it does not shape behavior (following instructions, being helpful, staying safe). That comes later, in alignment.
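The four-step loop can be sketched on a toy scale. The example below is not a transformer: it assumes a tiny bigram model (one row of logits per previous token) trained with plain SGD on a repeated five-token sentence, purely to show forward pass, loss, backpropagation, and optimizer step in order.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "down", "."]
V = len(vocab)
data = [0, 1, 2, 3, 4] * 20   # token ids for "the cat sat down ." repeated
# Model: one row of logits per previous token, initialised with small random values.
W = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(prev, target, lr=0.5):
    probs = softmax(W[prev])                    # 1. forward pass
    loss = -math.log(probs[target])             # 2. loss (cross-entropy)
    grad = [p - (1.0 if i == target else 0.0)   # 3. backpropagation: d(loss)/d(logit)
            for i, p in enumerate(probs)]
    for i in range(V):                          # 4. optimizer step (plain SGD)
        W[prev][i] -= lr * grad[i]
    return loss

def epoch():
    """Run the loop once over every (previous token, next token) pair."""
    losses = [train_step(p, t) for p, t in zip(data, data[1:])]
    return sum(losses) / len(losses)

first = epoch()
for _ in range(20):
    last = epoch()
```

The average loss falls from the first pass to the last, and the row for "the" ends up putting most of its probability on "cat". The shape of `W` never changes; only its values do.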

From random numbers to learned meaning

Before training (random)	After training (meaning)
Every weight is a random number	Every weight holds a learned value
Output is gibberish	Output is fluent, coherent text
No grammar, facts, or reasoning	Grammar, facts, and reasoning emerge
Structure exists, meaning doesn't	Same structure — now full of meaning

TAKEAWAY

Learning is the same four-step loop, running hundreds of thousands to millions of times, turning random numbers into meaning.

The roles that emerge after training

Components start as random numbers with no predefined purpose. Over hundreds of thousands to millions of training steps, gradient descent gradually shapes them into specialized roles, learned from data rather than explicitly designed.

Component	Role it settles into
Embeddings	What tokens mean (lexical meaning)
Attention	How tokens relate — routes relevant context
FFNs	Transformation / "thinking" — most parameters and reasoning
LayerNorm	Keep signals stable and usable
Depth (layers)	Progressive refinement of understanding

TAKEAWAY

No one designs these roles; they emerge from training.

Stage 3 - Alignment

A raw pre-trained model is a brilliant autocomplete, not yet a helpful assistant. Alignment is a thin, cheap layer on top of pre-training that shapes behavior.

	Main training	Polish (alignment)
Data	Trillions of words	Thousands to millions of examples
Length (cost)	Weeks/months, huge cost	Short, cheap
What it does	Teaches knowledge	Shapes behavior

SFT - show it good (prompt, response) examples.
RLHF/DPO - teach it which answer is better.
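To make "teach it which answer is better" concrete, here is a sketch of the DPO loss for a single preference pair. It assumes the four inputs are summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model; `dpo_loss` and its argument names are illustrative.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities of each response."""
    chosen_margin = policy_chosen - ref_chosen        # how much more the policy likes the chosen answer
    rejected_margin = policy_rejected - ref_rejected  # same for the rejected answer
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))              # equals -log(sigmoid(logits))

neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to the reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # policy now favors the chosen answer more
```

The loss drops as the model raises the chosen answer's likelihood relative to the rejected one, measured against the reference. No new facts enter through this signal, only a preference between answers.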

TAKEAWAY

Alignment turns a raw model into a helpful assistant — it shapes behavior; it doesn't add new knowledge.

Stage 4 - After training

Once training stops, weights are frozen — no learning, no gradients. The model is a fixed function f(tokens) -> next token probabilities.

During inference, the model has no memory of what was asked or answered before. Each request starts fresh, so any conversation history must be sent again as part of the input.
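A toy illustration of the fixed function and the decoding loop around it. `FROZEN_WEIGHTS` is a hypothetical lookup table standing in for trained weights, and the toy model conditions only on the last token, which a real model does not.

```python
FROZEN_WEIGHTS = {   # hypothetical next-token table standing in for trained weights
    "hello": {"world": 0.9, "there": 0.1},
    "world": {"<eos>": 1.0},
    "there": {"<eos>": 1.0},
}

def model(tokens):
    """Fixed function: tokens in, next-token probabilities out. Nothing is written back."""
    return FROZEN_WEIGHTS.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=5):
    """Greedy decoding: call the frozen function once per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

first = generate(["hello"])
second = generate(["hello"])
```

Both calls return the same tokens because the only state is the token list the caller passes in. Anything that looks like memory in a chat product is the surrounding system resending history.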

TAKEAWAY

Training builds the geometry. Inference just navigates it one token at a time.

The Mental Model most people get wrong

LLM ≠ continuously learning system
LLM ≠ dynamic knowledge base
LLM ≠ autonomous agent

What this means for Enterprise Systems

Understanding how LLMs actually work leads to a critical shift in how we design AI systems. The model itself is not "the system" — it's a fixed component inside a larger architecture.

1. Why RAG is required

LLMs do not have access to fresh or private data. Their knowledge is fixed at training time.

To make them useful in the enterprise:

Connect them to internal data sources
Inject context at runtime

This is why Retrieval-Augmented Generation (RAG) becomes a foundational pattern.

2. Why agents/orchestration are external

LLMs are:

Stateless
Reactive
Single-step predictors

They cannot:

Execute workflows
Maintain long-running state
Coordinate systems

This is why agentic systems and orchestration layers exist outside the model.

important

The intelligence is in the model and the control is in the system design.

3. Why governance is outside the model

You cannot "patch" behavior inside a trained model in real time. Enterprise systems must implement:

Guardrails
Validation layers
Monitoring and evaluation
Policy enforcement

All of these sit around the model, not inside it.
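A minimal sketch of what "around the model" can look like in code. Everything here is hypothetical: `fake_model` stands in for a real model call, and the single blocked pattern is a placeholder policy.

```python
import re

BLOCKED_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]   # hypothetical policy: SSN-like strings

def fake_model(prompt):
    """Stand-in for a real model call; always returns the same text."""
    return "The customer's ID is 123-45-6789."

def guarded_call(prompt, model=fake_model, max_prompt_chars=2000):
    # Input guardrail: reject oversized prompts before spending tokens.
    if len(prompt) > max_prompt_chars:
        return {"ok": False, "violations": 0, "text": ""}
    text = model(prompt)
    # Output validation: count policy violations, then redact them.
    violations = sum(len(p.findall(text)) for p in BLOCKED_PATTERNS)
    for p in BLOCKED_PATTERNS:
        text = p.sub("[REDACTED]", text)
    # Monitoring hook: return the count so callers can log and evaluate it.
    return {"ok": True, "violations": violations, "text": text}

result = guarded_call("What is the customer's ID?")
```

The model is never modified. Changing the policy means changing `BLOCKED_PATTERNS` or the wrapper, which can be done at deployment time without retraining.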

4. Why inference cost dominates

Training is:

One-time
Expensive but amortized

Inference is:

Continuous
Scales with usage

important

For enterprise systems: Cost = traffic × tokens × latency requirements (request volume, tokens processed per request, and how much headroom the latency target demands).
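A back-of-the-envelope version of the traffic and token terms, as a sketch. The prices are placeholders rather than any vendor's rates, and latency is left out because it tends to show up indirectly, through model choice and provisioning, rather than as a line item.

```python
def daily_inference_cost(requests_per_day, input_tokens, output_tokens,
                         price_in_per_1k, price_out_per_1k):
    """Estimate daily spend; prices are per 1,000 tokens and are placeholders."""
    per_request = ((input_tokens / 1000) * price_in_per_1k
                   + (output_tokens / 1000) * price_out_per_1k)
    return requests_per_day * per_request

# Same traffic and output length; the second case adds 3,000 tokens of retrieved context.
baseline = daily_inference_cost(10_000, input_tokens=500, output_tokens=200,
                                price_in_per_1k=0.001, price_out_per_1k=0.002)
with_rag = daily_inference_cost(10_000, input_tokens=3_500, output_tokens=200,
                                price_in_per_1k=0.001, price_out_per_1k=0.002)
```

With these placeholder numbers the retrieved context alone raises daily spend more than fourfold at identical traffic, which is why context size belongs in the cost model from the start.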

5. Why scale and cost must be designed upfront

Because LLMs don't learn in production, every interaction requires:

Full inference execution
Token processing (input + output)
External system calls (RAG/agents)

This means:

Cost scales with usage, not with training
Latency compounds across system layers
Poor design = rapidly compounding cost

In real systems, if not handled correctly:

RAG increases token usage
Agents introduce multiple-step execution
Orchestration adds round trips
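The effect of multi-step execution on token usage can be sketched with simple arithmetic. This assumes the simplest agent loop, where every step resends the original prompt plus all earlier step outputs and nothing is cached or summarized; `agent_input_tokens` is a hypothetical helper.

```python
def agent_input_tokens(base_prompt, tokens_per_step, steps):
    """Total input tokens when every step resends the prompt plus all earlier step outputs."""
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context             # the whole context is processed again
        context += tokens_per_step   # this step's output joins the context
    return total

single = agent_input_tokens(base_prompt=1_000, tokens_per_step=500, steps=1)
ten = agent_input_tokens(base_prompt=1_000, tokens_per_step=500, steps=10)
```

Under these assumptions ten steps process 32,500 input tokens rather than ten times the single-step 1,000, so input tokens grow faster than the step count. Context trimming, summarization, and caching are the usual levers for flattening that curve.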

important

Training is a one-off capital cost; inference is the ongoing operational cost. Without careful design, AI systems become unpredictable and expensive at scale.

Final Takeaway

The model provides intelligence and the system provides control.

Modern AI architecture is not “LLM design.” It is “system design around a frozen model.”

Traffic × Tokens × Latency

TAKEAWAY

Treat the LLM as a frozen dependency; engineer everything else around it.
