Jitender Sharma - Architects Handbook Blog

AI Observability In Enterprise

2026-06-18T00:00:00.000Z

Everyone says "monitor your AI in production". Almost nobody draws the system that does it. "Add Observability" is a slogan until you can say exactly what gets captured, where it lands, how long it lives, and who reads it.

This is an architecture breakdown - capture in the request path, fan-out into purpose-built storage tiers, and four very different consumers reading off them. The headline: AI observability isn't one thing. It's five signals with five retention policies feeding four jobs, and the regulator-facing ones look nothing like the dashboard-facing ones.

THE CLAIM

AI observability is not "a dashboard". It's a capture-and-retention architecture: each signal (logs, metrics, traces, raw prompts, audit records) has a different consumer, a different retention window, and a different blast radius if you get it wrong.

The whole system on one page

Read it left to right: capture -> store -> consumer. The rest of this piece is just the reasoning behind each arrow.

This isn't only for AI

The capture -> store -> consume backbone here isn't AI-specific. Swap the agentic app / RAG service node for a microservice, a VM-hosted app, or a COTS product and the skeleton is unchanged: emit OTel signals, fan them out to tiers with deliberate retention, feed operational / SLO / audit consumers. Only two boxes are AI-specific: the raw prompt/response store and the drift detector. Drop those and you're left with a perfectly standard service-observability architecture. So you don't need a different observability stack for non-agentic systems; you just need fewer arrows on the same one.

1. Capture lives in the request path - and that's the hard constraint

The app - an agent, a RAG service, any LLM system - emits three OpenTelemetry signals (logs, metrics, traces) through an OTel SDK into an OTel Collector that sits in the hot path. Two design consequences fall out of that immediately:

Instrumentation is not free. Every signal you emit costs latency and money on the request path. That's why the boring signals (metrics) are cheap and always-on, while the expensive ones (traces, raw payloads) are sampled or gated.
The Collector is the control point. Routing, sampling, redaction, and fan-out happen once, in the Collector - not scattered across app code. This is where you strip PII before it ever reaches a long-lived store.
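To make the "control point" idea concrete, here is a minimal sketch of the two decisions the Collector makes before anything is persisted: redact, then sample. It is plain Python rather than real Collector configuration, and the regex patterns, the `process` function, and the record shape (`signal`, `trace_id`, `body`) are all assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical patterns; a real deployment needs a reviewed PII policy.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # account- or card-like numbers


def redact(text: str) -> str:
    """Mask obvious PII before the record is persisted anywhere."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < sample_rate * 10_000


def process(record: dict, sample_rate: float = 0.1):
    """Return the record to forward, or None if sampling drops it."""
    if record["signal"] == "trace" and not keep_trace(record["trace_id"], sample_rate):
        return None
    # Redaction applies to every signal that survives sampling.
    return {**record, "body": redact(record["body"])}
```

Keeping both steps in one function mirrors the point above: the app emits, and a single choke point decides what is allowed to reach a long-lived store.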

note

Using vendor-neutral OpenTelemetry at the capture layer is the decision that keeps your backends swappable. The signals are standardized; where they land is a routing config, not a rewrite.

2. Five signals, five storage tiers, five retention policies

This is the part most "monitoring" setups collapse into one bucket - and it's exactly where AI systems differ from ordinary services. Retention is a governance decision, not a storage default.

Signal	Store	Retention	Why this window
Structured logs	Log store	90 d	Operational debugging; cheap to keep short, noisy to keep long
Metrics	Time-series DB (TSDB)	13 mo	Trend + year-over-year comparison; tiny per-point cost
Sampled traces	Trace store	90 d	Latency/causality debugging; full traces are expensive, so sample
Raw prompt/response	Restricted store	Encrypted, 90 d	Sensitive content - quality/drift analysis, tightly access-controlled
Audit record	Audit log	Immutable, 7 y	Compliance evidence - must survive, must not be editable

The two dotted arrows in the diagram matter. Raw prompt/response and audit records are not routine telemetry - they are sensitive, governed signals. One is encrypted and short-lived; the other is immutable and kept for years. Treating either like a normal log is how you end up with PII in a debug dashboard or a compliance gap at audit time.
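One way to keep retention deliberate is to write the table down as data rather than leave it as per-store defaults. The sketch below does that; the `Tier` class, the key names, and the `may_delete` helper are hypothetical, and the month and year lengths are approximations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Tier:
    store: str
    retention: timedelta
    encrypted: bool = False
    immutable: bool = False


# Windows taken from the table; 30-day months and 365-day years are approximations.
TIERS = {
    "logs": Tier("log store", timedelta(days=90)),
    "metrics": Tier("tsdb", timedelta(days=13 * 30)),
    "traces": Tier("trace store", timedelta(days=90)),
    "raw_prompts": Tier("restricted store", timedelta(days=90), encrypted=True),
    "audit": Tier("audit log", timedelta(days=7 * 365), immutable=True),
}


def may_delete(signal: str, created: datetime, now: datetime) -> bool:
    """A record may be purged only after its tier's retention window has passed."""
    return now - created > TIERS[signal].retention
```

With the policy in one place, "why is this 90 days?" has an answer someone can review, and the raw-prompt and audit tiers cannot silently inherit each other's settings.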

important

If your "observability" stores everything in one tier with one retention setting, you have made a governance decision by accident. The raw-prompt store and the audit log have opposite requirements (short + erasable vs long + immutable), and conflating them fails both.

3. Four consumers, four different questions

Storage isn't the point; the questions you can answer are. Each consumer reads a different tier.

Dashboards (logs + metrics + traces) - what is the system doing right now? The operational view.
SLO + burn-rate alerts (metrics) - are we spending our error budget too fast? Pages a human before users feel it.
Drift detector (traces + raw prompts + embeddings) - is the input distribution moving away from what we tested and, for RAG, is the retrieval corpus drifting too? This is the AI-specific one; model quality erodes silently as the world changes.
Regulatory replay (audit log) - can we reconstruct exactly what the system did, months later, for someone who wasn't there? The immutable trail.
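For the SLO consumer, "spending the error budget too fast" has a simple arithmetic form. This is a minimal sketch using the common definition of burn rate (observed error rate divided by the error budget); the function names and the paging threshold are assumptions, not values from this architecture.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int, slo_target: float, threshold: float) -> bool:
    """Page a human when the budget is burning faster than the chosen threshold."""
    return burn_rate(errors, requests, slo_target) >= threshold
```

With a 99.9% SLO, 10 failures in 1,000 requests is a 1% error rate against a 0.1% budget, so the budget is burning about ten times faster than planned. This only needs the metrics tier, which is why it can run always-on.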

The split is the insight: operational health, model-quality erosion, and provable accountability are three different jobs. A latency dashboard tells you nothing about drift. A drift detector can't satisfy an auditor. You need all three, fed by the right tiers.
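A drift detector can start much simpler than it sounds. The sketch below compares the centroid of a baseline set of prompt embeddings with the centroid of a recent window; it is one deliberately crude approach under assumed inputs (lists of equal-length float vectors) and a hypothetical threshold, not the method used by any particular tool.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """0.0 for vectors pointing the same way, 1.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drifted(baseline, live, threshold=0.2):
    """Flag drift when the live centroid has moved away from the baseline centroid."""
    return cosine_distance(centroid(baseline), centroid(live)) > threshold
```

The threshold is exactly the kind of domain knowledge no tool ships with: it has to come from what "normal" variation looks like for your traffic.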

Why this is an architecture problem, not a tooling purchase

You can buy a dashboard. You cannot buy the decisions in this diagram.

What to sample (traces, raw payloads) vs always capture (metrics): a latency/cost trade-off.
Where redaction happens (the Collector, before persistence): a privacy boundary.
Which tier is immutable (the audit log): a compliance commitment you design in, not bolt on.
What "healthy" means (the SLOs and drift thresholds): domain knowledge no tool ships with.
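"Immutable" is a design commitment, and one common way to make it checkable is to hash-chain the audit records so that any later edit is detectable. The sketch below illustrates that idea with the standard library only; the record layout and function names are assumptions, and a real audit tier would also rely on storage-level controls such as write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first record


def append_record(log: list, payload: dict) -> None:
    """Append a record whose hash covers its payload and the previous hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
    log.append({"prev": prev, "payload": payload, "hash": digest})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

This does not prevent tampering by itself; it turns tampering into something a replay can prove or rule out, which is what an auditor actually asks for.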

note

This is the same thesis as "Hallucination Is a System Design Problem": reliability lives in the system around the model. Observability is how you measure that reliability: groundedness, unsupported-claim rate, and drift become metrics you log the way you'd log latency.

The precise position

Most teams stand up a metrics dashboard, call it "AI observability," and move on. That covers exactly one of the four consumers above, and not the two that regulators and quality erosion will eventually make you care about.

The architecture that actually holds up captures five signals with deliberate retention, redacts at the collector and feeds four distinct consumers: operational, budget, drift and audit. The diagram isn't decoration; it's the set of decisions you will be asked to defend.

TAKEAWAY

"Monitor your AI" is a slogan. Capture five signals, route them to tiers with deliberate retention, and feed four consumers: dashboards, SLO alerts, drift detection, and regulatory replay. That's the system; everything else is a dashboard pretending to be a strategy.

Hallucination Is a System Design Problem, Not a Model Problem

2026-06-16T00:00:00.000Z

Every time a model invents a citation, the conversation jumps to "which model hallucinates less?". That's the wrong question. The model did exactly what it was built to do. Everyone's focused on picking the model that hallucinates least.

The thing that will actually decide whether your AI system is trustworthy is the architecture you wrap around the model – grounding, retrieval, validation, and an explicit path to "I don't know".

A hallucination isn't a bug the next checkpoint will patch. It's the expected behavior of a frozen, probabilistic next-token predictor asked a question it has no grounded answer for. Treating it as a model defect means you keep waiting for a fix that isn't coming. Treating it as a design problem means you can actually solve it today.

THE CLAIM

Hallucination is not the model failing. It's the model succeeding at the wrong objective – fluent continuation – in a system that never gave it the right one: grounded truth.

Why the model was never going to save you

A trained model is a frozen function: f(tokens) -> next-token probabilities. It has no live knowledge, no source of truth, and no built-in concept of “I don't actually know this”. Three properties make hallucinations structural, not accidental:

Property of the model	Consequence
Frozen at training time	No access to fresh, private or post-cutoff facts - it fills gaps from priors
Optimized for fluency, not truth	The objective was plausible next token, never verified fact
No native abstention	"Confidently wrong" scores the same as "confidently right" unless the system checks

So when you ask something outside what it learned, it doesn't error out - it produces the most statistically plausible continuation. That continuation is often fluent, well-formatted, and wrong. The model isn't broken. It's doing precisely what next-token prediction does.

The model invents a citation because inventing a plausible continuation is the only thing it was ever built to do - truth was never in its objective, so it has to be in your architecture.

note

A bigger or newer model shifts where the cliff is, not whether there is a cliff. You're buying a lower hallucination rate, not a guarantee. Rates don't survive contact with a regulator, an auditor, or a customer who was given a fake policy number.

Why this is a design problem (the enterprise lens)

If the model can't be the source of truth, the system has to be. That reframes hallucinations from "model quality" to "system design" - and design is something you control.

Grounding is an architecture choice, not a model feature. RAG exists precisely because the model's knowledge is frozen. Inject the right context at runtime and the model is continuing from facts instead of inventing from priors. No retrieval layer = you've delegated truth to a frozen function and hoped.
Validation lives outside the model. Guardrails, schema/grounding checks, and citation verification sit around the model - you can't patch behaviors inside frozen weights in real time. The system decides what's allowed to reach the user, not the model.
"I don't know" must be an engineered path. Models don't volunteer abstention. Confidence thresholds, retrieval-coverage checks, and explicit fallbacks are what turn a confident guess into an honest "I can't answer that from the sources I have."
Cost and governance ride on this. An ungrounded answer in a bank, a hospital, or a legal workflow isn't a quality blip - it's liability. Design decides whether a wrong answer is impossible to surface or merely cheap to retry.

important

The intelligence is in the model. The truth is in the system. If your architecture has no component that owns "is this actually true and supported?", then nothing does - and the model will happily fill the silence.

Non-determinism is not hallucination

The objection we hear most is that the model gives a different answer every time. It's actually the strongest argument for the design framing, not against it, but it bundles two different things together.

Different answers ≠ Hallucinations

	Non-determinism	Hallucination
What it is	Different wording for the same question	A confident false claim
Cause	Sampling (temperature, top-p) picks among probable tokens	No grounded fact, so it continues from priors
Your control	Yes - set `temperature=0`	Only via grounding + verification

The model never stores "an answer". At each step it produces a probability distribution over the next token, then samples from it. At temperature > 0 you are rolling a weighted die for every token - hence different phrasings. Set temperature = 0 (greedy decoding) and it becomes near-deterministic: same input -> same output.

(near, because floating-point rounding and GPU batching cause tiny variations - an engineering detail, not the core issue.)
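The sampling step is small enough to show directly. This is a toy sketch of temperature sampling over a hand-written list of logits; `next_token` and the example logits are illustrative, and real decoders add top-p, top-k, and other details omitted here.

```python
import math
import random


def next_token(logits, temperature, rng):
    """Pick a token index from raw scores; temperature 0 means greedy decoding."""
    if temperature == 0:
        # Greedy: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # The weighted die: likelier tokens win more often, but not always.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Called repeatedly with `temperature=0`, this returns the same index every time; with `temperature=1.0` the result varies from call to call even though the scores never change. Neither setting says anything about whether the top-scoring token is true.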

So "different answers each time" is a knob you control, not a measure of reliability: a model at temperature 0 can still be consistently wrong.

There is no 100% surety – and that’s the whole point

Grounding does not guarantee a correct answer. It shifts the probability mass. Without context, the most-probable continuation comes from fuzzy training priors (high risk). With the right context in the prompt, the most-probable continuation becomes "paraphrase what's in front of me" (much lower risk). You move from maybe ~70% to 95% - never to 100%.

So where does the surety come from? Not the model - a separate verifier. The thing that generates the answer must not be the thing that decides it's trustworthy. A grounded model gives you a good draft - 95%; design decides what happens to the other 5%, whether it silently reaches your user or gets caught and blocked.

note

You can't make a frozen, sampling-based function promise truth - so reliability has to be engineered around it. The model's lack of a guarantee is the reason design exists, not a reason to wait for a better model.

What “designing for it” actually looks like

Those four principles become one concrete pipeline. You don't eliminate hallucinations by hoping - you box them in with layers, each one catching what the last let through.

Retrieve before you generate - give the model facts to continue from, not a blank page.
Constrain the output - structural formats, required citations, schema validation.
Verify against the source - does every claim trace back to retrieved evidence?
Make abstention first-class - "no grounded answer" is a valid, designed outcome, not a failure.
Observe in production - log groundedness and unsupported-claim rates the way you'd log latency. Hallucination is a measurable system metric, not a vibe.
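The five steps above can be sketched as a single function that wires them together. Everything here is a placeholder: `retrieve`, `generate`, `verify`, and `log` are hypothetical callables you would supply, and the abstention message is just an example.

```python
from typing import Callable, List, Optional

ABSTAIN = "I can't answer that from the sources I have."


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, List[str]], bool],
    log: Optional[Callable[[dict], None]] = None,
) -> str:
    passages = retrieve(question)  # 1. retrieve before you generate
    if not passages:
        # 4. abstention is a designed outcome, not an exception
        outcome, reply = "abstained_no_context", ABSTAIN
    else:
        draft = generate(question, passages)  # 2. output constraints live in here
        if verify(draft, passages):  # 3. verify against the source
            outcome, reply = "grounded", draft
        else:
            outcome, reply = "abstained_unverified", ABSTAIN
    if log is not None:
        log({"question": question, "outcome": outcome})  # 5. observe in production
    return reply
```

The model only ever appears inside `generate`. Whether its draft reaches the user is decided by code around it, and every outcome, including the abstentions, is logged as a countable event.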

How to actually build the verifier

"Add a verifier" is easy to say. The trap is building one that just re-asks the same model "are you sure?" - it'll rationalize its own output. A good verifier follows two rules and one ordering.

Rule 1 - independent from the generator. The thing that checks the answer must not be the thing that wrote it. Use deterministic code, a retrieval system, or a separate model call that sees only the claim + the source - never the original reasoning.

Rule 2 - verify atomic claims, not paragraphs. "Mostly right" hides one wrong clause. Decompose the answer into individual facts and check each one against evidence.

The ordering - cheapest, most deterministic checks first, expensive models last, on the residue only:

Layer	Mechanism	Catches	Cost
1. Structural	JSON schema, constrained decoding	No citations, malformed output	~Free
2. Deterministic Facts	Exact/fuzzy match against source	Invented numbers, IDs, dates, quotes	~Free
3. Grounding (NLI)	Small entailment model per claim	Unsupported or contradicted claims	Cheap
4. LLM-as-judge	Separate model	Nuanced cases the rest can't settle	Expensive
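As an illustration of the ordering, the sketch below implements only the two near-free layers and stops at the first failure. The answer shape (`text`, `citations`), the number-matching rule, and the reason strings are assumptions made for this example; layers 3 and 4 are left out.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def structural_check(answer: dict) -> bool:
    """Layer 1: the output has text and at least one citation."""
    return bool(answer.get("text")) and bool(answer.get("citations"))


def numbers_supported(text: str, source: str) -> bool:
    """Layer 2: every number in the answer appears verbatim in the source."""
    source_numbers = set(NUMBER.findall(source))
    return all(n in source_numbers for n in NUMBER.findall(text))


def verify(answer: dict, source: str):
    """Run the cheap layers in order and stop at the first failure."""
    if not structural_check(answer):
        return False, "structural"
    if not numbers_supported(answer["text"], source):
        return False, "deterministic_facts"
    # Only what survives would go on to the NLI and LLM-as-judge layers.
    return True, "passed_cheap_layers"
```

Note that neither layer calls a model, so neither can rationalize the generator's output. An invented fee or date is rejected by a string comparison before any expensive check runs.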

The verifier doesn't make the system perfect. It converts a silent, confident, wrong answer into a caught-and-blocked one - turning an unbounded risk into a measurable error rate with a fallback. That conversion is exactly what you can put in front of an auditor.

Where I actually land

I'm not saying models don't matter, or that one model is as good as another. Picking a stronger model genuinely lowers the baseline rate.

I am saying: a better model reduces hallucinations; only better design lets you bound and govern them. If your reliability strategy is "wait for the next model," you've outsourced your most important architectural decision to someone else's release schedule - and you still won't be able to promise an auditor anything.

Stop asking "which model hallucinates the least?" Start asking "what in the system owns the truth, and what happens when it doesn't have an answer?"

TAKEAWAY

Hallucination is the model doing its job inside a system that forgot to do its own. Engineer grounding, validation, and abstention around the frozen model - that's where reliability is actually built.

How LLMs Work Under the Hood

2026-06-09T00:00:00.000Z

Most discussions about LLMs focus on prompts, tools, and frameworks. However, few explain how the model actually works under the hood and why that matters when building real systems.

This is a 20,000-ft view of the LLM lifecycle in four stages.

The big picture: one model, four stages.

A model's whole life is just four stages. The shape and vocabulary are fixed first; training only fills in the values, and inference is read-only and never learns.

Stage	What happens	Key ideas
Before	Decide the blueprint	Architecture dials set the shape, tokenizer builds the vocabulary, and parameter count is fixed.
During	Fill in the values	Random weights become meaningful through training: a four-step loop run hundreds of thousands to millions of times.
Alignment	Make it helpful	Show good examples (SFT) and teach which answers are better (RLHF/DPO).
After	Run it, read-only	Weights are frozen (no learning); inference traverses the model geometry one token at a time.

TAKEAWAY

Shape + vocabulary are fixed first. Training only fills the values. Inference never learns.

Stage 1 - Before training

Two human decisions are baked in before any gradient is computed.

Architecture dials - hidden size, layers, heads, FFN width, vocab size.
Tokenizer vocabulary - the integer alphabet the model reads and writes.

A "7B" model is 7B because of these dials — training never grows it, and most parameters live in the FFN, not attention.

The Architecture dials

Hyperparameter	Example	Description
hidden_size(D)	4096	How much "thinking space" the model has for each word or idea at a given moment.
num_layers(L)	32	How many rounds of refinement - 32 editors in a row.
num_heads(H)	32	A panel of specialists, each spotting a different pattern.
head_dim(D_h)	128	The size of each specialist's notebook.
ffn_hidden(D_ff)	16,384	The knowledge bank — where most facts are stored (~4*D).
vocab_size(V)	32,000	The size of the model's dictionary - the building blocks it uses to read and write language.
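The dials in the table are enough to estimate the model's size before any training happens. The sketch below uses a simplified count: four attention projection matrices and a two-matrix FFN per layer, separate input and output embedding tables, and no biases or normalization weights. Those simplifications are assumptions; real architectures differ in the details.

```python
def estimate_parameters(hidden_size, num_layers, ffn_hidden, vocab_size):
    """Rough weight count for a decoder-only transformer (biases and norms ignored)."""
    attention = 4 * hidden_size * hidden_size   # Q, K, V and output projections
    ffn = 2 * hidden_size * ffn_hidden          # up and down projections
    embeddings = 2 * vocab_size * hidden_size   # input table + separate output head
    return {
        "attention": num_layers * attention,
        "ffn": num_layers * ffn,
        "embeddings": embeddings,
        "total": num_layers * (attention + ffn) + embeddings,
    }


# Values from the table above.
counts = estimate_parameters(hidden_size=4096, num_layers=32,
                             ffn_hidden=16_384, vocab_size=32_000)
```

With these values the estimate comes to about 6.7 billion weights, of which roughly 64% sit in the FFN blocks and 32% in attention. That is where the "7B" label and the "most parameters live in the FFN" claim come from.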

TAKEAWAY

The model is fully sized and described before it sees a single token.

Stage 2 - During training

Learning is one four-step loop, repeated hundreds of thousands to millions of times.

Forward pass - Predict what comes next in a sequence, based on the previous tokens.
Loss - Measure how wrong the prediction was.
Backpropagation - Calculate how much each weight contributed to the error.
Optimizer step - Update every weight, adjusting each one slightly.
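The same four steps can be seen in a toy with a single weight. This is not an LLM: it fits `y = w * x` with squared error and plain gradient descent, and the `train` function, learning rate, and data are made up for illustration. The loop structure is the part that carries over.

```python
def train(data, lr=0.05, steps=200):
    """Fit y = w * x by repeating forward, loss, backprop, optimizer step."""
    w = 0.0  # stands in for "random weights" before training
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            pred = w * x            # 1. forward pass: make a prediction
            error = pred - y        # 2. loss is error ** 2: how wrong was it?
            grad += 2 * error * x   # 3. backpropagation: d(loss)/dw for this example
        w -= lr * grad / len(data)  # 4. optimizer step: nudge the weight slightly
    return w


learned = train([(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)])
```

After 200 passes `learned` is very close to 3.0, the value that makes every prediction right. An LLM runs the same loop with billions of weights and next-token prediction as the loss.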

note

The only thing learned here is next-token prediction: the statistical relationship between tokens given their surrounding context. Pre-training delivers language and knowledge; it does not shape behavior (following instructions, being helpful, staying safe). That comes later, in alignment.

From random numbers to learned meaning

Before training (random)	After training (meaning)
Every weight is a random number	Every weight holds a learned value
Output is gibberish	Output is fluent, coherent text
No grammar, facts, or reasoning	Grammar, facts, and reasoning emerge
Structure exists, meaning doesn't	Same structure — now full of meaning

TAKEAWAY

Learning is the same four-step loop, running hundreds of thousands to millions of times, turning random numbers into meaning.

The roles that emerge after training

Components start as random numbers with no predefined purpose. Over hundreds of thousands to millions of training steps, gradient descent gradually shapes them into specialized roles, learned through experience rather than explicitly designed.

Component	Role it settles into
Embeddings	What tokens mean (lexical meaning)
Attention	How tokens relate — routes relevant context
FFNs	Transformation / "thinking" — most parameters and reasoning
LayerNorm	Keep signals stable and usable
Depth (layers)	Progressive refinement of understanding

TAKEAWAY

No one designs these roles; they emerge gradually from training.

Stage 3 - Alignment

A raw pre-trained model is a brilliant autocomplete, not yet a helpful assistant. Alignment is a thin, cheap layer on top of pre-training that shapes behavior.

	Main training	Polish (alignment)
Data	Trillions of words	Thousands to millions of examples
Length (cost)	Weeks/months, huge cost	Short, cheap
What it does	Teaches knowledge	Shapes behavior

SFT - show it good (prompt, response) examples.
RLHF/DPO - teach it which answer is better.

TAKEAWAY

Alignment turns a raw model into a helpful assistant — it shapes behavior; it doesn't add new knowledge.

Stage 4 - After training

Once training stops, weights are frozen: no learning, no gradients. The model is a fixed function f(tokens) -> next-token probabilities.

During inference, the model has no memory of what was asked or answered before — each request starts fresh.
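A toy makes the "fixed function, one token at a time" idea visible. In the sketch below a small lookup table stands in for the frozen weights and decoding is greedy; the table contents, `next_token_probs`, and `generate` are all invented for illustration and bear no resemblance to a real model's internals.

```python
# Stand-in for frozen weights: a fixed table that is only ever read.
FROZEN_WEIGHTS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"<eos>": 1.0},
}


def next_token_probs(tokens):
    """f(tokens) -> next-token probabilities; here only the last token matters."""
    return FROZEN_WEIGHTS.get(tokens[-1], {"<eos>": 1.0})


def generate(prompt, max_new_tokens=10):
    """Greedy decoding: append the most probable token until end-of-sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens
```

Nothing in `generate` writes to the table, so calling it a million times teaches it nothing. Anything that looks like memory across requests has to be passed back in through `prompt` by the surrounding system.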

TAKEAWAY

Training builds the geometry. Inference just navigates it one token at a time.

The Mental Model most people get wrong

LLM ≠ continuously learning system
LLM ≠ dynamic knowledge base
LLM ≠ autonomous agent

What this means for Enterprise Systems

Understanding how LLMs actually work leads to a critical shift in how we design AI systems. The model itself is not "the system" — it's a fixed component inside a larger architecture.

1. Why RAG is required

LLMs do not have access to fresh and private data. Their knowledge is fixed at training time.

To make them useful in enterprise:

Connect them to internal data sources
Inject context at runtime

This is why Retrieval-Augmented Generation (RAG) becomes a foundational pattern.
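"Inject context at runtime" can be shown in a few lines. The sketch below ranks documents by naive word overlap and pastes the best ones into the prompt; production systems typically use embedding search instead, and the function names and prompt wording here are assumptions.

```python
def retrieve(question, documents, k=2):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Put retrieved facts in front of the model so it continues from them."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The model's weights are untouched; only its input changes. That is why fresh or private knowledge can be added without retraining, and also why the quality of retrieval bounds the quality of the answer.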

2. Why agents/orchestration are external

LLMs are:

Stateless
Reactive
Single-step predictors

They cannot:

Execute workflows
Maintain long-running state
Coordinate systems

This is why agentic systems and orchestration layers exist outside the model.

important

The intelligence is in the model and the control is in the system design.

3. Why governance is outside the model

You cannot "patch" behavior inside a trained model in real time. Enterprise systems must implement:

Guardrails
Validation layers
Monitoring and evaluation
Policy enforcement

All of these sit around the model, not inside it.

4. Why inference cost dominates

Training is:

One-time
Expensive but amortized

Inference is:

Continuous
Scales with usage

important

For enterprise systems: Cost = traffic * tokens * latency requirements

5. Why scale and cost must be designed upfront

Because LLMs don't learn in production, every interaction requires:

Full inference execution
Token processing (input+output)
External system calls (RAG / agents)

This means:

Cost scales with usage, not with training
Latency compounds across system layers
Poor design = multiplicative cost growth

In real systems, if not handled correctly:

RAG increases token usage
Agents introduce multi-step execution
Orchestration adds round trips
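A back-of-the-envelope estimator shows how these effects stack. The sketch below covers only the traffic and token terms (latency is not modelled); the `daily_cost` function, the prices, and the traffic figures are hypothetical and chosen purely for illustration.

```python
def daily_cost(requests, input_tokens, output_tokens,
               price_in_per_1k, price_out_per_1k,
               rag_context_tokens=0, agent_steps=1):
    """Estimated daily spend; every model call pays for its full input and output."""
    per_call = ((input_tokens + rag_context_tokens) / 1000 * price_in_per_1k
                + output_tokens / 1000 * price_out_per_1k)
    return requests * agent_steps * per_call


# Hypothetical prices and traffic, for illustration only.
plain = daily_cost(10_000, 500, 300, price_in_per_1k=0.001, price_out_per_1k=0.002)
with_rag_and_agent = daily_cost(10_000, 500, 300, 0.001, 0.002,
                                rag_context_tokens=2_000, agent_steps=3)
```

Under these made-up numbers the same traffic costs more than eight times as much once 2,000 tokens of retrieved context and three agent steps are added. The user count did not change; the design did.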

important

Training is a one-off capital cost; inference is the ongoing operational cost. Without careful design, AI systems become unpredictable and expensive at scale.

Final Takeaway

The model provides intelligence and the system provides control.

Modern AI architecture is not "LLM design". It is "system design around a frozen model".

Traffic × Tokens × Latency

TAKEAWAY

Treat the LLM as a frozen dependency; engineer everything else around it.
