2 posts tagged with "LLM"

Hallucination Is a Design Problem, Not a Model Problem

· 4 min read
Jitender Sharma
Software Architect

Every time a model invents a citation, the conversation jumps to "which model hallucinates less?" and everyone focuses on picking the model that hallucinates least. That's the wrong question. The model did exactly what it was built to do.

The thing that will actually decide whether your AI system is trustworthy is the architecture you wrap around the model – grounding, retrieval, validation, and an explicit path to "I don't know".

A hallucination isn't a bug the next checkpoint will patch. It's the expected behavior of a frozen, probabilistic next-token predictor asked a question it has no grounded answer for. Treating it as a model defect means you keep waiting for a fix that isn't coming. Treating it as a design problem means you can actually solve it today.

THE CLAIM

Hallucination is not the model failing. It's the model succeeding at the wrong objective – fluent continuation – in a system that never gave it the right one: grounded truth.

Why the model was never going to save you

A trained model is a frozen function: f(tokens) -> next-token probabilities. It has no live knowledge, no source of truth, and no built-in concept of “I don't actually know this”. Three properties make hallucinations structural, not accidental:

| Property of the model | Consequence |
| --- | --- |
| Frozen at training time | No access to fresh, private or post-cutoff facts - it fills gaps from priors |
| Optimized for fluency, not truth | The objective was plausible next token, never verified fact |
| No native abstention | “Confidently wrong” scores the same as confident and right unless the system checks |

So when you ask something outside what it learned, it doesn't error out - it produces the most statistically plausible continuation. That continuation is often fluent, well-formatted, and wrong. The model isn't broken. It's doing precisely what next-token prediction does.
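A minimal sketch of why there is no built-in "I don't know": the last step of next-token prediction turns scores into a probability distribution, and a distribution always has a most likely entry. The candidate strings and scores below are made up for illustration; they are not taken from any real model.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical continuations and scores for a prompt the model has no
# grounded answer for. Every option is a guess; none is verified.
candidates = ["plausible citation A", "plausible citation B", "plausible citation C"]
logits = [2.1, 1.9, 1.7]

probs = softmax(logits)
best = max(range(len(candidates)), key=lambda i: probs[i])

# The probabilities always sum to 1 and argmax always returns something:
# abstention is not an outcome unless the surrounding system adds one.
print(candidates[best], round(probs[best], 2))
```

Nothing in this step can signal "no answer". A near-tie between guesses still produces a winner, which is why the check has to live outside the model.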

The model invents a citation because inventing a plausible continuation is the only thing it was ever built to do - truth was never in its objective, so it has to be in your architecture.

note

A bigger or newer model shifts where the cliff is, not whether there is a cliff. You're buying a lower hallucination rate, not a guarantee. Rates don't survive contact with a regulator, an auditor, or a customer who was given a fake policy number.

Why this is a design problem (the enterprise lens)

If the model can't be the source of truth, the system has to be, and it always has been. That reframes hallucinations from "model quality" to "system design" - and design is something you control.

  • Grounding is an architecture choice, not a model feature. RAG exists precisely because the model's knowledge is frozen. Inject the right context at runtime and the model is continuing from facts instead of inventing from priors. No retrieval layer = you've delegated truth to a frozen function and hoped.
  • Validation lives outside the model. Guardrails, schema/grounding checks, and citation verification sit around the model - you can't patch behaviors inside frozen weights in real time. The system decides what's allowed to reach the user, not the model.
  • "I don't know" must be an engineered path. Models don't volunteer abstention. Confidence thresholds, retrieval-coverage checks, and explicit fallbacks are what turn a confident guess into an honest "I can't answer that from the sources I have."
  • Cost and governance ride on this. An ungrounded answer in a bank, a hospital, or a legal workflow isn't a quality blip - it's a liability. Design decides whether a wrong answer is impossible to surface or merely cheap to retry.
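The validation and "I don't know" bullets can be sketched as a gate that sits between the model and the user. This is an illustration under stated assumptions, not a production verifier: `Draft`, `release_or_abstain`, the score scale, and the `min_score` threshold are all hypothetical names and values.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    cited_ids: list  # ids of the sources the model claims to rely on

ABSTAIN = "I can't answer that from the sources I have."

def release_or_abstain(draft, retrieved, min_score=0.5):
    """Decide outside the model whether a draft may reach the user.

    retrieved maps source id -> retrieval score in [0, 1] (assumed scale).
    """
    # Retrieval-coverage check: keep only sources that are relevant enough.
    supported = {sid for sid, score in retrieved.items() if score >= min_score}
    if not supported:
        return ABSTAIN
    # Citation verification: every cited id must be a source we retrieved.
    if not draft.cited_ids or any(c not in supported for c in draft.cited_ids):
        return ABSTAIN
    return draft.text

retrieved = {"policy-12": 0.82, "faq-3": 0.31}
grounded = Draft("Refunds take 5 days [policy-12].", ["policy-12"])
invented = Draft("Refunds take 2 days [policy-99].", ["policy-99"])

print(release_or_abstain(grounded, retrieved))
print(release_or_abstain(invented, retrieved))
```

The gate only proves that a citation points at a retrieved source. Checking that the source actually supports the claim is a separate, harder step.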
important

The intelligence is in the model. The truth is in the system. If your architecture has no component that owns "is this actually true and supported?", then nothing does - and the model will happily fill the silence.

Different answers ≠ Hallucinations

There is no 100% surety – and that’s the whole point

What “designing for it” actually looks like

How to actually build the verifier

Where I actually land

How an LLM Works Under the Hood

· 7 min read
Jitender Sharma
Software Architect

Most discussions about LLMs focus on prompts, tools, and frameworks. However, few explain how the model actually works under the hood and why that matters when building real systems.

This is a 20,000-ft view of the LLM lifecycle in four stages.

The big picture: one model, four stages.

A model's whole life is just four stages. The shape and vocabulary are fixed first; training only fills in the values, and inference is read-only and never learns.


| Stage | What happens | Key ideas |
| --- | --- | --- |
| Before | Decide the blueprint | Architecture dials set the shape, tokenizer builds the vocabulary, and parameter count is fixed. |
| During | Fill in the values | Random weights become meaningful through training: a four-step loop run hundreds of thousands to millions of times. |
| Alignment | Make it helpful | Show good examples (SFT) and teach which answers are better (RLHF/DPO). |
| After | Run it, read-only | Weights are frozen (no learning); inference traverses the model geometry one token at a time. |

TAKEAWAY

Shape + vocabulary are fixed first. Training only fills the values. Inference never learns.

Stage 1 - Before training

Two human decisions are baked in before any gradient is computed.

  • Architecture dials - hidden size, layers, heads, FFN width, vocab size.
  • Tokenizer vocabulary - the integer alphabet the model reads and writes.

A "7B" model is 7B because of these dials — training never grows it, and most parameters live in the FFN, not attention.

The Architecture dials

| Hyperparameter | Example | Description |
| --- | --- | --- |
| hidden_size (D) | 4,096 | How much "thinking space" the model has for each word or idea at a given moment. |
| num_layers (L) | 32 | How many rounds of refinement - 32 editors in a row. |
| num_heads (H) | 32 | A panel of specialists, each spotting a different pattern. |
| head_dim (D_h) | 128 | The size of each specialist's notebook. |
| ffn_hidden (D_ff) | 16,384 | The knowledge bank - where most facts are stored (~4*D). |
| vocab_size (V) | 32,000 | The size of the model's dictionary - the building blocks it uses to read and write language. |

TAKEAWAY

The model is fully sized and described before it sees a single token.
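As a rough check, the parameter count can be computed from the example dials alone. The sketch assumes a plain decoder-only transformer with four attention projections and a two-matrix FFN per layer, tied input/output embeddings, and no biases or LayerNorm weights; real architectures (gated FFNs, grouped-query attention) will differ. `estimate_params` is a hypothetical helper, not a library function.

```python
def estimate_params(D, L, D_ff, V, tied_embeddings=True):
    """Rough parameter count for a plain decoder-only transformer."""
    attention = 4 * D * D          # Q, K, V and output projections
    ffn = 2 * D * D_ff             # up- and down-projection
    embeddings = V * D * (1 if tied_embeddings else 2)
    return {
        "attention": L * attention,
        "ffn": L * ffn,
        "embeddings": embeddings,
        "total": L * (attention + ffn) + embeddings,
    }

counts = estimate_params(D=4096, L=32, D_ff=16_384, V=32_000)
print(f"total ~ {counts['total'] / 1e9:.2f}B")
print(f"FFN share ~ {counts['ffn'] / counts['total']:.0%}")
```

With these dials the estimate lands at about 6.6B parameters, roughly two thirds of them in the FFN blocks. Every term comes from a dial chosen before training, which is why training never changes the size.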

Stage 2 - During training

Learning is one four-step loop, repeated hundreds of thousands to millions of times.

  1. Forward pass - Predict what comes next in a sequence, based on previous tokens.
  2. Loss - How wrong was the prediction?
  3. Backpropagation - Calculate how much each weight contributed to the error.
  4. Optimizer step - Update every weight, slightly adjusting each one.
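The four steps can be sketched on a toy next-token predictor in pure Python. The assumptions are loud ones: the "model" is a single table of logits, so backpropagation collapses to one gradient formula, and the weights start at zero for reproducibility where a real model starts from random values. The corpus and learning rate are invented for illustration.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy corpus as token ids (hypothetical): the pattern 0 -> 1 -> 2 -> 0 ...
tokens = [0, 1, 2, 0, 1, 2, 0, 1, 2]
vocab_size = 3
# weights[prev][nxt] is the logit for "nxt follows prev"; start uninformed.
weights = [[0.0] * vocab_size for _ in range(vocab_size)]
learning_rate = 0.5

def train_epoch():
    total_loss = 0.0
    for prev, target in zip(tokens, tokens[1:]):
        probs = softmax(weights[prev])              # 1. forward pass
        total_loss += -math.log(probs[target])      # 2. loss (cross-entropy)
        grads = [p - (1.0 if i == target else 0.0)  # 3. gradient of the loss
                 for i, p in enumerate(probs)]      #    with respect to each logit
        for i, g in enumerate(grads):               # 4. optimizer step (plain SGD)
            weights[prev][i] -= learning_rate * g
    return total_loss / (len(tokens) - 1)

losses = [train_epoch() for _ in range(20)]
print(f"first epoch loss {losses[0]:.3f}, last epoch loss {losses[-1]:.3f}")
```

The loop never tells the model any rule about the sequence. It only nudges numbers to make the observed next token more likely, and the pattern ends up stored in the weights.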
note

The only thing learned here is next-token prediction: the statistical relationship between tokens given their surrounding context. Pre-training delivers language and knowledge; it does not shape behavior (following instructions, being helpful, staying safe). That comes later, in alignment.

From random numbers to learned meaning

| Before training (random) | After training (meaning) |
| --- | --- |
| Every weight is a random number | Every weight holds a learned value |
| Output is gibberish | Output is fluent, coherent text |
| No grammar, facts, or reasoning | Grammar, facts, and reasoning emerge |
| Structure exists, meaning doesn't | Same structure - now full of meaning |

TAKEAWAY

Learning is the same four-step loop, running hundreds of thousands to millions of times, turning random numbers into meaning.

The roles that emerge after training

Components start as random numbers with no predefined purpose. After hundreds of thousands to millions of training steps, gradient descent gradually shapes them into specialized roles, learned through experience rather than explicitly designed.

| Component | Role it settles into |
| --- | --- |
| Embeddings | What tokens mean (lexical meaning) |
| Attention | How tokens relate - routes relevant context |
| FFNs | Transformation / "thinking" - most parameters and reasoning |
| LayerNorm | Keep signals stable and usable |
| Depth (layers) | Progressive refinement of understanding |

TAKEAWAY

No one designs these roles; training gradually shapes each component into a specialist.

Stage 3 - Alignment

A raw pre-trained model is a brilliant autocomplete, not yet a helpful assistant. Alignment is a thin, cheap layer on top of pre-training that shapes behavior.

| | Main training | Polish (alignment) |
| --- | --- | --- |
| Data | Trillions of words | Thousands to millions of examples |
| Length (cost) | Weeks/months, huge cost | Short, cheap |
| What it does | Teaches knowledge | Shapes behavior |

  • SFT - show it good (prompt, response) examples.
  • RLHF/DPO - teach it which answer is better.
TAKEAWAY

Alignment turns a raw model into a helpful assistant — it shapes behavior; it doesn't add new knowledge.

Stage 4 - After training

Once training stops, weights are frozen: no learning, no gradients. The model is a fixed function f(tokens) -> next-token probabilities.

During inference, the model has no memory of what was asked or answered before — each request starts fresh.
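A small sketch of what "read-only" and "starts fresh" mean in practice. The lookup table stands in for frozen weights and is purely hypothetical; a real model returns probabilities and decoding usually samples from them, whereas this toy is deterministic.

```python
# Hypothetical frozen "weights": a fixed next-token lookup table.
FROZEN_TABLE = {"the": "model", "model": "is", "is": "frozen", "frozen": "<eos>"}

def next_token(tokens):
    """Pure function of its input: no hidden state, no learning."""
    return FROZEN_TABLE.get(tokens[-1], "<eos>")

def generate(prompt, max_new_tokens=8):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)   # one token per step
        if token == "<eos>":
            break
        tokens.append(token)         # the output is fed back in as input
    return tokens

first = generate(["the"])
second = generate(["the"])   # same input, same output: nothing was learned
print(first)

# Conversation "memory" is the caller's job: resend the history each time.
history = first + ["model"]
print(generate(history))
```

Anything that looks like memory in a chat product is the application resending earlier turns as part of the next input.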

TAKEAWAY

Training builds the geometry. Inference just navigates it one token at a time.

The Mental Model most people get wrong

  • LLM ≠ continuously learning system
  • LLM ≠ dynamic knowledge base
  • LLM ≠ autonomous agent

What this means for Enterprise Systems

Understanding how LLMs actually work leads to a critical shift in how we design AI systems. The model itself is not "the system" — it's a fixed component inside a larger architecture.

1. Why RAG is required

LLMs do not have access to fresh and private data. Their knowledge is fixed at training time.

To make them useful in the enterprise:

  • Connect them to internal data sources
  • Inject context at runtime

This is why Retrieval-Augmented Generation (RAG) becomes a foundational pattern.
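"Inject context at runtime" can be sketched in a few lines. Everything here is a stand-in: the in-memory `DOCUMENTS` dict replaces a real search index or vector store, word overlap replaces a real relevance score, and the prompt wording is one possible phrasing rather than a recommended template.

```python
import re

# Hypothetical internal documents; a real system would query a search
# index or vector store instead of this in-memory dict.
DOCUMENTS = {
    "leave-policy": "Employees accrue 1.5 days of paid leave per month.",
    "expense-policy": "Expenses above 500 USD need manager approval.",
}

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, top_k=1):
    """Rank documents by word overlap with the question (a toy scorer)."""
    q = words(question)
    scored = sorted(
        ((len(q & words(body)), doc_id) for doc_id, body in DOCUMENTS.items()),
        reverse=True,
    )
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]

def build_prompt(question):
    """Inject retrieved context into the prompt at request time."""
    doc_ids = retrieve(question)
    context = "\n".join(f"[{d}] {DOCUMENTS[d]}" for d in doc_ids)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say you don't know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

print(build_prompt("How many days of paid leave do employees accrue?"))
```

The model's weights never change. What changes is the text it is asked to continue from, and that text is assembled by the system on every request.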

2. Why agents/orchestration are external

LLMs are:

  • Stateless
  • Reactive
  • Single-step predictors

They cannot:

  • Execute workflows
  • Maintain long-running state
  • Coordinate systems

This is why agentic systems and orchestration layers exist outside the model.
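A minimal sketch of that division of labor: the model only proposes the next step, while the loop, the state, and the tool calls belong to ordinary code. `stub_model` is a hard-coded stand-in for an LLM call, and the tool names and order data are invented for illustration.

```python
def stub_model(state):
    """Stand-in for an LLM call (hypothetical): proposes only the next step."""
    if "order" not in state:
        return ("lookup_order", state["order_id"])
    if "refund" not in state:
        return ("issue_refund", state["order"]["amount"])
    return ("done", None)

# Tools are ordinary code owned by the system, not by the model.
TOOLS = {
    "lookup_order": lambda order_id: {"id": order_id, "amount": 40},
    "issue_refund": lambda amount: f"refunded {amount}",
}
RESULT_KEY = {"lookup_order": "order", "issue_refund": "refund"}

def run_workflow(order_id, max_steps=5):
    state = {"order_id": order_id}          # long-running state lives here
    for _ in range(max_steps):              # the loop lives here too
        action, argument = stub_model(state)
        if action == "done":
            return state
        state[RESULT_KEY[action]] = TOOLS[action](argument)
    raise RuntimeError("workflow did not finish within max_steps")

print(run_workflow("A-17"))
```

The `max_steps` cap is the orchestrator's decision, not the model's, which is the sense in which control sits in the system design.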

important

The intelligence is in the model and the control is in the system design.

3. Why governance is outside the model

You cannot "patch" behavior inside a trained model in real time. Enterprise systems must implement:

  • Guardrails
  • Validation layers
  • Monitoring and evaluation
  • Policy enforcement

All of these sit around the model, not inside it.

4. Why inference cost dominates

Training is:

  • One-time
  • Expensive but amortized

Inference is:

  • Continuous
  • Scales with usage
important

For enterprise systems: Cost = traffic * tokens * latency requirements
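A back-of-the-envelope version of that formula, as a sketch. The per-token prices and traffic numbers are illustrative placeholders, not real vendor pricing, and latency requirements are assumed to show up indirectly through the unit price you pay for faster serving.

```python
def monthly_inference_cost(requests_per_day, input_tokens, output_tokens,
                           usd_per_1k_input, usd_per_1k_output, days=30):
    """Traffic x tokens x unit price. Prices are caller-supplied assumptions."""
    per_request = ((input_tokens / 1000) * usd_per_1k_input
                   + (output_tokens / 1000) * usd_per_1k_output)
    return requests_per_day * days * per_request

# All numbers below are illustrative, not real vendor pricing.
cost = monthly_inference_cost(
    requests_per_day=10_000, input_tokens=2_000, output_tokens=500,
    usd_per_1k_input=0.001, usd_per_1k_output=0.002,
)
print(f"~${cost:,.0f} per month")
```

Doubling traffic or doubling prompt length doubles the bill, so both are design inputs rather than afterthoughts.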

5. Why scale and cost must be designed upfront

Because LLMs don't learn in production, every interaction requires:

  • Full inference execution
  • Token processing (input+output)
  • External system calls (RAG/agents)

This means:

  • Cost scales with usage, not with training
  • Latency compounds across system layers
  • Poor design = costs that multiply with every added layer

In real systems, if not handled correctly:

  • RAG increases token usage
  • Agents introduce multi-step execution
  • Orchestration adds round trips
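The three effects above compound, which a short sketch makes visible. It assumes the simplest possible design: every agent step is a sequential LLM call, and retrieval runs again on every step. `request_profile` and all the numbers are hypothetical.

```python
def request_profile(base_tokens, llm_latency_s, rag_tokens=0,
                    rag_latency_s=0.0, agent_steps=1):
    """Tokens and latency for one user request under a sequential design."""
    tokens_per_call = base_tokens + rag_tokens      # RAG enlarges every prompt
    return {
        "llm_calls": agent_steps,
        "tokens": agent_steps * tokens_per_call,
        "latency_s": agent_steps * (llm_latency_s + rag_latency_s),
    }

plain = request_profile(base_tokens=500, llm_latency_s=1.0)
agentic_rag = request_profile(base_tokens=500, llm_latency_s=1.0,
                              rag_tokens=1_500, rag_latency_s=0.2,
                              agent_steps=4)
print(plain)
print(agentic_rag)
print(agentic_rag["tokens"] / plain["tokens"])
```

In this example a four-step agent with retrieval uses sixteen times the tokens of a single plain call. Caching retrieved context or running independent steps in parallel would change the numbers, and those are exactly the design decisions to make upfront.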
important

Training is a one-off capital cost; inference is the ongoing operational cost. Without careful design, AI systems become unpredictable and expensive at scale.

Final Takeaway

The model provides intelligence and the system provides control.

Modern AI architecture is not “LLM design”. It is “system design around a frozen model”.

Traffic × Tokens × Latency

TAKEAWAY

Treat the LLM as a frozen dependency; engineer everything else around it.