Jitender Sharma Blog

RAG Is Not a Database

2026-06-27T00:00:00.000Z

A team ships RAG, passes the demo, and three weeks later a user retrieves a document they were never allowed to see. The vector store did its job. The architecture was never there to stop it.

I see the same root cause every time: teams ask which vector database to buy before they have defined what "retrieval" means in their system. That question assumes RAG is a data layer: ingest documents, embed them, query at runtime, paste chunks into a prompt. Storage solved, problem solved.

It is not. A vector index is one component in a context construction pipeline that runs on every user request. Identity, freshness, ranking, abstention, and attribution all decide whether the model answers from evidence or invents from fluency. The database does not do that work. The architecture around it does.

This is an architecture breakdown of what RAG actually is in production.

THE CLAIM

RAG is not a database. It is runtime context construction: a governed pipeline that assembles the right evidence, for the right principal, at query time, before inference begins.

Treating RAG as storage leads teams to optimize embedding models and chunk sizes while skipping the layers that decide whether the answer is grounded: who may see which documents, which chunks survive ranking, and what happens when retrieval returns nothing worth citing.

Why the database mental model fails

The database framing is seductive because it maps to familiar CRUD workflows. Ingest PDFs. Chunk. Embed. Store. Query. Ship.

Production RAG does not look like that. At query time the system must:

Scope retrieval to identity (not every user sees every chunk)
Retrieve candidates (often hybrid: lexical + vector + metadata filters)
Rank and filter (relevance is not cosine similarity alone)
Pack context (budget tokens, dedupe, attribute sources)
Decide whether to answer (abstain when evidence is thin)

None of those steps live inside the vector store. The store holds vectors and metadata. The pipeline owns truth boundaries.
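
As a rough illustration of the steps above, here is a minimal Python sketch. It assumes candidates have already been retrieved and scored; the `Chunk` type, the `allowed_roles` field, the score threshold, and the sample corpus are all hypothetical names chosen for this example, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_roles: frozenset  # ACL metadata stored next to the vector (assumed)
    score: float              # relevance score from a hypothetical re-ranker


def build_context(chunks, caller_roles, min_score=0.5, max_chunks=3):
    """Scope, rank, pack, and decide for one request."""
    # 1. Scope to identity: drop anything the caller may not see.
    visible = [c for c in chunks if c.allowed_roles & caller_roles]
    # 2. Rank and filter: keep only evidence above the threshold, best first.
    ranked = sorted(
        (c for c in visible if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    # 3. Pack: bound the context the model is allowed to reason over.
    pack = ranked[:max_chunks]
    # 4. Decide: abstain when nothing survives.
    if not pack:
        return {"decision": "abstain", "sources": []}
    return {"decision": "answer", "sources": [c.doc_id for c in pack]}


corpus = [
    Chunk("hr-policy", "Leave policy ...", frozenset({"employee", "hr"}), 0.82),
    Chunk("salary-bands", "Band 5 pays ...", frozenset({"hr"}), 0.91),
    Chunk("cafeteria", "Lunch menu ...", frozenset({"employee"}), 0.12),
]

employee_view = build_context(corpus, frozenset({"employee"}))
contractor_view = build_context(corpus, frozenset({"contractor"}))
```

In this toy run the employee gets an answer sourced only from `hr-policy` (the higher-scoring `salary-bands` chunk is filtered by role before ranking), and the contractor gets an abstention. None of that logic is something the store does on its own.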

Database mental model	RAG as context construction
Primary job	Persist and return stored records
Success metric	Query latency, index size
Identity	Often ignored until audit
Failure mode	Empty result set
Ops focus	Reindex when docs change
Who owns quality	Data engineering

The gap shows up in regulated environments first. An auditor does not ask which vector DB you picked. They ask: who retrieved what, under which policy, and what did the model see? A database answer does not satisfy that question. A pipeline with identity-scoped retrieval, ranked context packs, and structured attribution does.

What actually runs at query time

RAG is not "fetch top-k chunks." It is a short-lived assembly line that produces a context pack: the bounded input the model is allowed to reason over.

Four boundaries, one request:

① Ingress: bind the question to a principal. Retrieval without identity is a data leak waiting for production traffic.
② Retrieval: candidate generation, not final context. Hybrid search and ACL filters shrink the candidate set before ranking spends compute.
③ Rank & pack: re-ranking is where most quality wins hide. Token budgeting and deduplication turn "top-k blobs" into a coherent evidence pack.
④ Inference: the model reasons over the pack. Citation and abstention are system outcomes, not prompt wishes.
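
To make the "rank & pack" boundary concrete, here is one possible packing routine, sketched in Python. It is an assumption-heavy toy: tokens are approximated by whitespace word counts (a real system would use the model's tokenizer), duplicates are detected by exact content hash only, and the source ids are invented.

```python
import hashlib


def pack_context(ranked_chunks, token_budget):
    """Greedy packing: dedupe by content hash, respect the token budget.

    ranked_chunks: list of (source_id, text) tuples, best first.
    """
    seen = set()
    pack, used = [], 0
    for source_id, text in ranked_chunks:
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical chunk already packed
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # does not fit; a smaller chunk further down still might
        seen.add(digest)
        pack.append({"source": source_id, "text": text, "tokens": cost})
        used += cost
    return pack, used


ranked = [
    ("policy-v3#2", "Wires above the limit need approval"),
    ("policy-v2#2", "wires above the limit need approval"),  # duplicate content
    ("faq#9", "Approval is granted by a supervisor"),
    ("handbook#1", "A very long chunk that will not fit in what is left of the budget"),
]
pack, used = pack_context(ranked, token_budget=14)
```

Every packed entry keeps its `source`, so attribution travels with the text instead of being reconstructed after the fact.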

The storage boundary

The vector index stores candidates. It does not store truth.

Truth is the outcome of the full pipeline: scoped retrieval, ranked evidence, attributed context, and an explicit decision to answer or abstain. Optimizing the index without designing these layers is how teams ship fluent wrong answers at scale.

Demo vs production

Layer	Demo default	Production default
Identity	Single shared index	Per-principal ACL on every retrieval path
Retrieval	Vector top-k	Hybrid search + metadata filters + freshness rules
Ranking	Skipped ("similarity is enough")	Re-ranker + score thresholds + dedupe
Context pack	Concatenate chunks	Token budget, source attribution, versioned templates
Output	Model free-text	Cite sources or abstain; log what entered the pack
Change	Re-embed when someone notices drift	Eval gate on index updates; replay for regulators

The demo path works in a notebook. The production path is what survives the first compliance review.

What this looks like when it breaks

Teams living in the database mental model do not announce it. They ship features that look like RAG until production traffic arrives. Three symptoms show up first:

Leakage. A user retrieves chunks from documents their role should never see. The vector store returned a valid result. The pipeline never bound retrieval to identity.
Confident wrong cites. The model answers with footnotes, and the sources do not support the claim. Cosine similarity passed; ranking and score thresholds never ran.
No replay story. An auditor asks what the model saw on March 12. The team has index stats and prompt logs, not the assembled context pack.
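
For the replay symptom, one option is to log the assembled context pack itself as a hashed record. The sketch below is illustrative only: the field names (`request_id`, `template_version`, `doc_version`) are assumptions, and a production system would write the record to an append-only store rather than return a dict.

```python
import hashlib
import json


def context_pack_record(request_id, principal, pack, template_version):
    """Build a replayable record of exactly what entered the prompt."""
    body = {
        "request_id": request_id,
        "principal": principal,
        "template_version": template_version,
        "chunks": [
            {"source": c["source"], "doc_version": c["doc_version"], "text": c["text"]}
            for c in pack
        ],
    }
    # Canonical JSON so the same pack always produces the same digest.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return body


sample_pack = [
    {"source": "policy#2", "doc_version": "v3", "text": "Wires above the limit need approval"},
]
record = context_pack_record("req-001", "officer-123", sample_pack, "tmpl-v1")
```

With a record like this, "what did the model see on March 12" becomes a lookup by request id, and the digest lets a reviewer confirm the stored pack was not altered.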

Two failure modes get conflated: empty retrieval (nothing worth citing) and wrong retrieval (something plausible, not true). The first needs abstention. The second needs ranking, eval, and attribution. A database framing treats both as "bad query results." A pipeline framing treats them as distinct design problems.

Indexing is not where RAG quality is won. Teams spend months on chunking and embedding, then ship vector top-k at query time. Offline work is necessary. Scoped, ranked, attributable retrieval at query time is what production runs on.

The procurement reframe

Wrong question: "Which vector database?" Right questions:

Identity: how does each retrieval path bind to the caller's claims?
Audit: what gets logged in the context pack for replay?
Abstention: when evidence falls below threshold, do you stop or guess?

Freshness and scope answer to the pipeline too. Stale embeddings, document versions, who-may-see-what-today, sources spread across CRM, tickets, and policy engines: none of that lives in one datastore. Which is the whole point.

Where I actually land

I'm not saying vector stores don't matter, or that chunking is optional. You need storage. You need indexing. The mistake is stopping there.

The teams that ship trustworthy RAG treat the index as input to a pipeline, not the product. They design identity binding, ranking thresholds, context-pack logging, and abstention before they debate embedding dimensions. Those are the layers an auditor, a regulator, and a customer who acted on a wrong answer will actually hold you to.

Stop asking "which vector database?" Start asking "what assembles evidence for this principal, on this request, and what do we do when that assembly fails?"

TAKEAWAY

RAG is not a database. It is runtime context construction scoped to identity, ranked for relevance, packed for the model, and auditable end to end.

In a demo, retrieval is a query. In production, retrieval is architecture.

Policy-Governed Agent Runtime

2026-06-25T00:00:00.000Z

Enterprise teams, especially in banking and other regulated industries, are connecting agents to operational tools: payment rails, core banking APIs, KYC workflows, trade settlement. Most production designs still leave undefined where token, identity, and policy state live during execution. The failure mode is not "the model misbehaved." It is "we cannot prove who authorized what, with which policy, before money moved."

This is an architecture breakdown of runtime trust boundaries. The LLM operates on conversation and tool schemas only. The Identity Provider owns claims. The Policy Engine (PDP) returns verdicts. The Policy Enforcement Point (PEP) gates every tool invocation and forwards only what policy allows.

THE CLAIM

Proposal is not permission. An agent proposes tool calls. Governance decides whether they run. Policy in the prompt or the weights is not enforcement: it's a suggestion the model may ignore.

In a Policy-Governed Agent Runtime (PGAR), the token and policies stay out of the LLM. The model proposes. The PEP enforces. The PDP decides. Governance lives on the execution path, not in the system message: the same separation banks already enforce between a teller's screen and the authorization engine behind a wire transfer.

The whole system on one page

Five trust boundaries. Token and policy never cross the LLM boundary (③).

① Ingress (API Gateway + Identity Provider): receives the request, validates the token, issues claims
② Agentic App: holds the session and token; never sends either to the LLM
③ LLM: gets conversation + tool schemas only; proposes a tool call
④ Policy layer (PEP + PDP): receives the proposal + token; PDP returns a verdict
⑤ Downstream (Payment Hub): PEP calls only on Allow; service re-authorizes

Read it across those five boundaries: ingress → agentic app → LLM proposes → PEP asks PDP → downstream executes. Most agent security stops at ② and never builds a real ④ or ⑤ re-auth. The rest of this piece walks that path: why prompt guardrails and per-API auth fail, then one wire request traced end to end with the contracts underneath.

Why prompt guardrails aren't authorization

Proposal is not permission. Most production agents today are prompt-governed: rules in the system message, hope in the middle, tools at the end. That works until someone asks a regulator, a security team, or a compliance officer to explain why an agent initiated a $47,500 wire without four-eyes approval, or released a payment to a beneficiary that failed sanctions screening.

	Prompt-based guardrails	PGAR
Where policy lives	System prompt / fine-tuned behavior	PDP, enforced by PEP, outside the model
Enforcement	Probabilistic: model may comply	Deterministic: PEP blocks or allows on PDP verdict
Token handling	Often in context or env-injected	Agentic App + PEP only; never in LLM input
Auditability	"The model was told not to"	Structured PEP/PDP decision log per proposal
Prompt injection resistance	Weak: attacker rewrites the "rules"	Strong: attacker cannot see or rewrite PDP rules
Failure mode	Silent violation	Explicit deny or step-up
Regulatory posture	Hard to defend under model-risk or operational-resilience scrutiny	Verdict chain, policy version, and immutable audit: the artifacts examiners ask for

In banking terms: prompt guardrails are like posting "do not exceed transaction limits" on the break-room wall. PGAR is the core authorization engine that actually holds or releases the payment.

AUTHORIZATION ≠ PROMPTING

Prompt guardrails shape behavior: tone, format, abstention. They are not a substitute for authorization. PGAR owns the layer guardrails were never built to hold.

Why per-API authorization isn't enough

We already authorize every REST call. Isn't that enough?

In microservices, deterministic code calls authorized APIs. Agents insert a probabilistic orchestrator: the LLM proposes tool calls (what, in what order, with what arguments) before any request leaves the runtime. Per-API auth decides whether POST /wires may run; it does not govern whether the agent should have proposed that wire for $47,500 without four-eyes attestation, or whether a multi-step chain (lookup → validate → initiate) satisfies compound policy across amount limits, sanctions context, and approval state. API access logs show that a call succeeded; they do not record which policy version allowed the proposal before side effects. PGAR does not replace downstream re-auth: Payment Hub still checks the token. The PEP governs the proposal mile between model output and API invocation, and writes the verdict chain examiners expect.

Prediction vs. truth on the execution path

Regulated systems need both, in different places. The LLM is a predictor: it infers intent, sequences tool calls, and drafts user-facing language. That is appropriate work for a probabilistic model. Authorization is not prediction. Whether $47,500 exceeds a $25,000 limit, whether a beneficiary cleared sanctions, whether four-eyes attestation is present: these are boolean facts evaluated against policy, not continuations the model might get right most of the time.

Task	Who owns it	Why
Parse "send wire to Acme for INV-8842"	LLM (proposal)	Intent and phrasing: prediction is fine
Decide if officer may initiate wire	PDP (verdict)	Entitlement: must be deterministic
Compare amount to `wire.auto_approved`	PDP (verdict)	Limit check: arithmetic, not fluency
Screen beneficiary against sanctions	Payment Hub + PDP	External truth: the model has no source
Record who approved before funds move	PEP (audit)	Evidence: cannot be inferred

Put limits, entitlements, and sanctions in the prompt and you have delegated truth to a predictor. PGAR keeps prediction upstream (what to propose) and truth on the execution path (whether it may run). Treating "the model usually respects the rules" as authorization evidence fails model-risk and operational-resilience review: not because the model is bad, but because examiners require replayable verdicts, not plausible behavior.

PREDICTION VS. TRUTH

Intelligence in the LLM. Truth in the PDP. Never conflate proposal with permission on the path that moves money, data, or regulatory scope. This is the same thesis as "Hallucination" is a design problem: reliability and control live in the system around the model. PGAR is what that looks like when the system needs to authorize actions, not just validate answers.

Corporate wire: one request through five boundaries

The overview diagram shows five trust boundaries; the sequence below walks every hop inside them.

User says: "Send $47,500 to Acme Supplies for invoice INV-8842: use our operating account." The LLM sees three tool schemas (lookup_beneficiary, validate_payment, initiate_wire) with no authority attached. This request exercises all three: look up the payee, validate the payment (limits, sanctions, cut-off), initiate the wire. The PDP watches three risk triggers the model never sees: amount above auto-approval limit (STEP-UP), scope or entitlement violation (DENY), and sanctions or high-risk corridor hit (DENY or STEP-UP).

When something goes wrong, or during a scheduled review, compliance and regulators do not ask "what did the model intend?" They ask:

WHAT EXAMINERS ASK

Which policy version decided? Every PEP log must carry pgar.payments.wire/v3 (or equivalent), not "the system prompt from Tuesday."
Was execution blocked until attestation? Proof that STEP-UP fired and ALLOW came only after supervisor four-eyes, not after the model "felt confident."
Can you replay the verdict chain without model logs? Subject, action, resource, context, verdict: immutable, before side effects. Chat transcripts are discovery; PEP/PDP records are evidence.

Prompt-governed agents struggle on all three. PGAR is built to answer them by construction.

Step-up is a PDP verdict, not a model feature. The model was never wrong for proposing $47,500: it was never given the $25,000 auto-approval limit. The PEP surfaces STEP-UP, the Agentic App owns four-eyes approval UX, and only a subsequent Allow reaches the Payment Hub. That attestation is what lands in the compliance archive: who authorized the exception, against which policy version, before a single dollar moved.

When the PDP says DENY

EXPLICIT DENY, NOT SILENT VIOLATION

The sequence above walks STEP-UP. The same architecture handles the case regulators care about most. App forwards validate_payment to the PEP. Payment Hub returns sanctions_status: hit. PEP asks the PDP; verdict is DENY. The flow stops: initiate_wire is never proposed to execution, no amount argument, no model override, no "we told it not to." The audit record shows DENY with policy version and redacted context before any funds move. Prompt-governed agents fail silently here: the model may still propose the wire, or explain around the block in fluent language, with no immutable evidence that authorization was refused. PGAR turns that into an explicit, replayable block: the failure mode AML and sanctions examiners expect when a control trips.

From diagram to contracts

If you can answer examiner questions from PEP/PDP logs alone, you have PGAR. If you need the chat transcript, you don't. The sequence diagram is the story; these payloads are the contracts that make the verdict chain replayable.

Every PDP evaluation uses the same four-field shape: who (subject), what (action), on what (resource), under what conditions (context).

1. Token and claims stay in the Agentic App

The token never enters the LLM request. It stays in the session and attaches to every Agentic App → PEP → Payment Hub call.

{
  "token": "eyJhbGciOiJSUzI1NiIs...",
  "claims": {
    "iss": "https://idp.bank.example",
    "sub": "officer-123",
    "email": "jitender@bank.example",
    "act": { "sub": "officer-123" },
    "sct": { "type": "access" },
    "roles": ["corporate_banking_officer", "payments_initiator"],
    "emt_iat": 1718812800,
    "emt_exp": 1718899200,
    "emts": {
      "payments.lookup": true,
      "payments.validate": true,
      "payments.wire.initiate": true
    },
    "limits": {
      "wire.auto_approved": 25000,
      "wire.above_requires": "supervisor_four_eyes"
    },
    "portfolio_accounts": ["acct-operating-4412"],
    "iat": 1718812800,
    "exp": 1718816400
  }
}

2. What the LLM actually sees

This is the payload that crosses the Agentic App → LLM boundary. Notice what's missing.

{
  "messages": [
    { "role": "user", "content": "Send $47,500 to Acme Supplies for invoice INV-8842: use our operating account." }
  ],
  "tools": [
    {
      "name": "lookup_beneficiary",
      "parameters": { "payee_name": "string", "invoice_ref": "string" }
    },
    {
      "name": "validate_payment",
      "parameters": { "beneficiary_id": "string", "amount": "number", "source_account": "string", "reference": "string" }
    },
    {
      "name": "initiate_wire",
      "parameters": { "beneficiary_id": "string", "amount": "number", "source_account": "string", "reference": "string" }
    }
  ]
}

No Authorization header. No roles, emts, or limits. No policy text.

THE PGAR TEST

If any of those appear in your LLM payload, you don't have PGAR. You have prompt governance.
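
The test above can be approximated mechanically. The following Python sketch walks an outbound LLM payload and reports any key that looks like identity or policy state. The `FORBIDDEN_KEYS` set is an assumption drawn from the claims payload in this piece; it checks key names only and would not catch policy text pasted into a message body.

```python
FORBIDDEN_KEYS = {"token", "authorization", "claims", "roles", "emts", "limits"}


def leaked_keys(payload, path=""):
    """Return paths of forbidden keys found anywhere in a JSON-like payload."""
    found = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}" if path else key
            if key.lower() in FORBIDDEN_KEYS:
                found.append(here)
            found.extend(leaked_keys(value, here))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            found.extend(leaked_keys(item, f"{path}[{index}]"))
    return found


clean_payload = {
    "messages": [{"role": "user", "content": "Send $47,500 to Acme Supplies"}],
    "tools": [{"name": "initiate_wire", "parameters": {"amount": "number"}}],
}
leaky_payload = {
    "messages": clean_payload["messages"],
    "metadata": {"roles": ["payments_initiator"], "limits": {"wire.auto_approved": 25000}},
}

clean_result = leaked_keys(clean_payload)
leaky_result = leaked_keys(leaky_payload)
```

A check like this fits naturally in a unit test or an outbound middleware on the Agentic App → LLM boundary, so a regression fails loudly instead of shipping.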

3. What the PEP sends to the PDP

The PEP doesn't send natural language to the PDP. It maps session claims into subject, adds the tool proposal as action, resource, and context, and calls the PDP.

Field	Source	Carries
subject	Session `claims` (see above)	Who: same identity, roles, entitlements, and limits held in the Agentic App
action	Tool proposal name	What: `lookup_beneficiary`, `validate_payment`, `initiate_wire`
resource	Tool proposal target	On what: beneficiary, source account, wire payment
context	Proposal + runtime state	Conditions: `amount`, `sanctions_status`, `approval`

{
  "subject": {
    "iss": "https://idp.bank.example",
    "sub": "officer-123",
    "email": "jitender@bank.example",
    "act": { "sub": "officer-123" },
    "sct": { "type": "access" },
    "roles": ["corporate_banking_officer", "payments_initiator"],
    "emt_iat": 1718812800,
    "emt_exp": 1718899200,
    "emts": {
      "payments.lookup": true,
      "payments.validate": true,
      "payments.wire.initiate": true
    },
    "limits": {
      "wire.auto_approved": 25000,
      "wire.above_requires": "supervisor_four_eyes"
    },
    "portfolio_accounts": ["acct-operating-4412"],
    "iat": 1718812800,
    "exp": 1718816400
  },
  "action": "initiate_wire",
  "resource": {
    "type": "wire_payment",
    "beneficiary_id": "bene-acme-441",
    "source_account": "acct-operating-4412",
    "reference": "INV-8842"
  },
  "context": {
    "amount": 47500,
    "sanctions_status": "clear",
    "approval": null
  }
}

After step-up, the same request returns with context.approval set to { "type": "supervisor_four_eyes", "attestation_id": "apr-9f2c" }, and the PDP re-evaluates.
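
A minimal sketch of that mapping in Python, for the `initiate_wire` case only. The shape of the tool proposal (`name` plus `arguments`) and the `runtime_state` dict are assumptions made for illustration; the output fields follow the request shown above.

```python
def build_pdp_request(claims, proposal, runtime_state):
    """Map session claims and a tool proposal into subject/action/resource/context."""
    args = proposal["arguments"]
    return {
        "subject": dict(claims),            # copied from the session, never from the LLM
        "action": proposal["name"],
        "resource": {
            "type": "wire_payment",          # hard-coded: this sketch covers initiate_wire only
            "beneficiary_id": args["beneficiary_id"],
            "source_account": args["source_account"],
            "reference": args["reference"],
        },
        "context": {
            "amount": args["amount"],
            "sanctions_status": runtime_state["sanctions_status"],
            "approval": runtime_state.get("approval"),  # None until step-up completes
        },
    }


claims = {"sub": "officer-123", "limits": {"wire.auto_approved": 25000}}
proposal = {
    "name": "initiate_wire",
    "arguments": {
        "beneficiary_id": "bene-acme-441",
        "amount": 47500,
        "source_account": "acct-operating-4412",
        "reference": "INV-8842",
    },
}
pdp_request = build_pdp_request(claims, proposal, {"sanctions_status": "clear"})
```

The point of the design is visible in the function signature: the model contributes only `proposal`; identity and runtime facts arrive from elsewhere.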

4. Policy rules: three verdicts, no fourth option

The PDP runs one policy surface. Three outcomes only: ALLOW, DENY, STEP_UP. No "the model said it was fine."

{
  "policy_id": "pgar.payments.wire",
  "default_decision": "DENY",
  "rules": [
    {
      "decision": "DENY",
      "when": {
        "action": ["initiate_wire", "validate_payment"],
        "subject.emts.payments.wire.initiate": false
      }
    },
    {
      "decision": "DENY",
      "when": {
        "action": ["validate_payment", "initiate_wire"],
        "context.sanctions_status": "hit"
      }
    },
    {
      "decision": "ALLOW",
      "when": {
        "action": "lookup_beneficiary",
        "subject.emts.payments.lookup": true
      }
    },
    {
      "decision": "ALLOW",
      "when": {
        "action": "validate_payment",
        "subject.emts.payments.validate": true,
        "context.sanctions_status": "clear"
      }
    },
    {
      "decision": "ALLOW",
      "when": {
        "action": "initiate_wire",
        "context.amount.lte": "subject.limits.wire.auto_approved",
        "context.sanctions_status": "clear"
      }
    },
    {
      "decision": "STEP_UP",
      "when": {
        "action": "initiate_wire",
        "context.amount.gt": "subject.limits.wire.auto_approved",
        "context.approval": null
      }
    },
    {
      "decision": "ALLOW",
      "when": {
        "action": "initiate_wire",
        "context.amount.gt": "subject.limits.wire.auto_approved",
        "context.approval.present": true,
        "context.sanctions_status": "clear"
      }
    }
  ]
}

OPA, Cedar, your IAM PDP, or an internal rules engine can implement this surface: the requirement is structured input, deterministic output, evaluated by the PDP, not natural-language policy in a system prompt.
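
To show what "structured input, deterministic output" can look like, here is a hand-written Python rendering of the rules above, evaluated top to bottom with DENY as the default. It is a sketch, not a policy engine: the function name is hypothetical, and it treats a missing entitlement as not granted, which is an assumption slightly stricter than the JSON.

```python
def evaluate(request):
    """Return ALLOW, DENY, or STEP_UP for one subject/action/resource/context request."""
    subject, action, ctx = request["subject"], request["action"], request["context"]
    emts = subject.get("emts", {})
    limit = subject.get("limits", {}).get("wire.auto_approved", 0)
    sanctions = ctx.get("sanctions_status")

    # Deny rules run first.
    if action in ("initiate_wire", "validate_payment"):
        if not emts.get("payments.wire.initiate", False):
            return "DENY"
        if sanctions == "hit":
            return "DENY"

    if action == "lookup_beneficiary" and emts.get("payments.lookup"):
        return "ALLOW"
    if action == "validate_payment" and emts.get("payments.validate") and sanctions == "clear":
        return "ALLOW"
    if action == "initiate_wire":
        amount = ctx["amount"]
        if amount <= limit and sanctions == "clear":
            return "ALLOW"
        if amount > limit and ctx.get("approval") is None:
            return "STEP_UP"
        if amount > limit and ctx.get("approval") is not None and sanctions == "clear":
            return "ALLOW"
    return "DENY"  # default decision


officer = {
    "emts": {"payments.lookup": True, "payments.validate": True, "payments.wire.initiate": True},
    "limits": {"wire.auto_approved": 25000},
}


def wire(amount, sanctions="clear", approval=None):
    """Helper that builds an initiate_wire request for the sample officer."""
    return {
        "subject": officer,
        "action": "initiate_wire",
        "context": {"amount": amount, "sanctions_status": sanctions, "approval": approval},
    }
```

Calling `evaluate(wire(47500))` yields `STEP_UP`, and the same call with an approval attached yields `ALLOW`. The same inputs always produce the same verdict, which is the property a prompt cannot offer.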

5. The PEP: structural, not conventional

The PEP sits between proposal and execution. The Agentic App cannot call the Payment Hub directly: every path goes through the PEP, which runs the same four steps on every proposal:

Receive the input: the tool proposal (initiate_wire to bene-acme-441 for $47,500, no approval yet), the bearer token, and the subject's claims.
Assemble and ask the PDP: map proposal and claims into the subject/action/resource/context request and call the PDP. Here the PDP returns STEP_UP, reason wire_above_auto_approved.
Write the audit record: every verdict is logged with the subject, action, resource, redacted context, the policy version that decided it (pgar.payments.wire/v3), and the verdict itself: immutable, before any side effect.

DECISION FIRST, EXECUTION SECOND

This is the record operational-resilience and model-risk reviewers expect: the verdict is logged before any side effect, with no retroactive narrative.

Act on the verdict: only ALLOW reaches the Payment Hub; STEP_UP returns a step-up-required response to the Agentic App; DENY returns a refusal. In this case the PEP responds not executed, step-up required. The wire never touched the payment rail.

STRUCTURAL ENFORCEMENT

If the Agentic App can call downstream services without passing through the PEP, you don't have enforcement: you have a suggestion. The choke point must be structural, not conventional.
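
The four steps can be sketched as a single choke-point class. This is a simplified Python illustration under stated assumptions: the PDP and Payment Hub are plain callables, a list stands in for an immutable audit store, and "redaction" is reduced to logging only the context key names. All class and function names are hypothetical.

```python
class PolicyEnforcementPoint:
    """Single choke point between tool proposals and downstream services."""

    def __init__(self, pdp, downstream, audit_log, policy_version):
        self._pdp = pdp                # callable: request -> verdict string
        self._downstream = downstream  # callable: request -> result
        self._audit_log = audit_log    # list standing in for an append-only store
        self._policy_version = policy_version

    def handle(self, request):
        verdict = self._pdp(request)
        # Decision first: the record is written before any side effect.
        self._audit_log.append({
            "subject": request["subject"]["sub"],
            "action": request["action"],
            "resource": request["resource"],
            "context_keys": sorted(request["context"]),  # stand-in for redaction
            "policy_version": self._policy_version,
            "verdict": verdict,
        })
        if verdict == "ALLOW":
            return {"executed": True, "result": self._downstream(request)}
        if verdict == "STEP_UP":
            return {"executed": False, "status": "step_up_required"}
        return {"executed": False, "status": "denied"}  # anything else fails closed


audit_log, hub_calls = [], []


def stub_pdp(request):
    return "ALLOW" if request["context"].get("approval") else "STEP_UP"


def stub_payment_hub(request):
    hub_calls.append(len(audit_log))  # audit records that existed at call time
    return "wire-accepted"


pep = PolicyEnforcementPoint(stub_pdp, stub_payment_hub, audit_log, "pgar.payments.wire/v3")
request = {
    "subject": {"sub": "officer-123"},
    "action": "initiate_wire",
    "resource": {"type": "wire_payment"},
    "context": {"amount": 47500, "approval": None},
}
first = pep.handle(request)
request["context"]["approval"] = {"type": "supervisor_four_eyes", "attestation_id": "apr-9f2c"}
second = pep.handle(request)
```

In this run the first call stops at step-up without touching the stub Payment Hub, and the second reaches it only after its own audit record exists. The structural part is outside the code: nothing else in the deployment should hold credentials or a network route to the downstream service.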

Why this is an architecture problem, not a sprint item

You can buy an agent framework in an afternoon. You cannot buy the boundary decisions PGAR requires: those are architecture commitments someone will have to defend to security, finance, internal audit, and regulators.

WHO MUST DEFEND THIS

In a bank, that conversation happens with model-risk management, second-line compliance, and the teams who already own payment authorization: not only with the squad shipping the chatbot.

Engineering thinks…	Architecture decides…
"Put the wire limit in the system prompt"	Where policy is evaluated: in the PDP, deterministically, on structured input at the PEP
"The model will learn to respect the rules"	What the LLM is allowed to see: schemas yes, credentials and entitlements no
"We'll add auth later"	Whether every path to downstream services goes through the PEP or some paths bypass it
"Identity is the IdP team's problem"	How claims flow to the PDP without ever reaching the LLM
"Logging is a nice-to-have"	Which PEP/PDP decisions are immutable audit events vs. sampled debug traces
"One team owns the agent"	Who owns the gateway, identity, policy, and service boundaries: four different stakeholders, often four different lines of defense in a regulated firm

PGAR is the control surface for actions in a governed agent stack: intelligence stays in the LLM, control in the PEP + PDP, and every verdict is an audit-grade event, not a sampled trace.

TAKEAWAY

Proposal is not permission. The LLM proposes. The Agentic App holds the token. The PDP decides. The PEP enforces. Downstream services re-authorize. That is PGAR: governance as architecture, not as a paragraph in the system prompt.

If the Agentic App can reach downstream without the PEP, you have a demo, not governed production.

AI Observability In Enterprise

2026-06-18T00:00:00.000Z

Everyone says "monitor your AI in production." Almost nobody draws the system that does it. "Add observability" is a slogan until you can say exactly what gets captured, where it lands, how long it lives, and who reads it.

This is an architecture breakdown - capture in the request path, fan-out into purpose-built storage tiers, and four very different consumers reading off them. The headline: AI observability isn't one thing. It's five signals with five retention policies feeding four jobs, and the regulator-facing ones look nothing like the dashboard-facing ones.

THE CLAIM

AI observability is not "a dashboard". It's a capture-and-retention architecture: each signal (logs, metrics, traces, raw prompts, audit records) has a different consumer, a different retention window, and a different blast radius if you get it wrong.

The whole system on one page

Read it left to right: capture -> store -> consumer. The rest of this piece is just the reasoning behind each arrow.

This isn't only for AI

The capture->store->consume backbone here isn't AI-specific. Swap the agentic app / RAG service node for a microservice, a VM-hosted app, or a COTS product and the skeleton is unchanged: emit OTel signals, fan them out to tiers with deliberate retention, feed operational / SLO / audit consumers. Only two boxes are AI-specific: the raw prompt/response store and the drift detector. Drop those and you're left with a perfectly standard service-observability architecture. So you don't need a different observability stack for non-agentic systems; you just need fewer arrows on the same one.

1. Capture lives in the request path - and that's the hard constraint

The app (an agent, a RAG service, any LLM system) emits five signals through an OTel SDK into an OTel Collector on the hot path: logs, metrics, traces (standard OpenTelemetry) plus raw prompt/response and audit records (governed, AI-specific). Two design consequences fall out immediately:

Instrumentation is not free. Every signal you emit costs latency and money on the request path. That's why the boring signals (metrics) are cheap and always-on, while the expensive ones (traces, raw payloads) are sampled or gated.
The Collector is the control point. Routing, sampling, redaction, and fan-out happen once, in the Collector - not scattered across app code. This is where you strip PII before it ever reaches a long-lived store.

note

Using vendor-neutral OpenTelemetry at the capture layer is the decision that keeps your backends swappable. The signals are standardized; where they land is a routing config, not a rewrite.
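
To illustrate the redaction step in isolation, here is a small Python sketch of the kind of scrubbing that would run before a payload reaches a long-lived store. It is not Collector configuration; in a real deployment this logic would live in the Collector's processing stage. The two patterns (emails and long digit runs) are illustrative assumptions and nowhere near a complete PII policy.

```python
import re

# Illustrative patterns only; real PII detection needs a far broader rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # account-number-like runs of 9+ digits


def redact(text):
    """Replace emails and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


redacted = redact("Contact jane@bank.example about account 123456789012, ref INV-8842")
```

Short identifiers such as `INV-8842` pass through untouched, which keeps the record useful for debugging while the sensitive values never land in storage.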

2. Five signals, five storage tiers, five retention policies

This is the part most "monitoring" setups collapse into one bucket - and it's exactly where AI systems differ from ordinary services. Retention is a governance decision, not a storage default.

Signal	Store	Retention	Why this window
Structured Logs	Log store	30 d	Operational debugging; cheap to keep short, noisy to keep long
Metrics	Time Series DB (TSDB)	13 mo	Trend + year-over-year comparison, tiny per-point cost
Sampled Traces	Trace store	30 d	Latency/causality debugging; full traces are expensive, so sample
Raw prompt/response	Restricted store	encrypted, 90 d	Sensitive content: quality/drift analysis, tightly access-controlled
Audit record	Audit log	immutable, 7 y	Compliance evidence: must survive, must not be editable

The two dotted arrows in the diagram matter. Raw prompt/response and audit records are not routine telemetry - they are sensitive, governed signals. One is encrypted and short-lived; the other is immutable and kept for years. Treating either like a normal log is how you end up with PII in a debug dashboard or a compliance gap at audit time.
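
The table can be written down as an explicit routing map, so that retention is a reviewed value rather than a storage default. The Python sketch below mirrors the table; the store names and the `Tier` type are hypothetical, and 13 months is approximated as 395 days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    store: str
    retention_days: int
    encrypted: bool = False
    immutable: bool = False


TIERS = {
    "log": Tier("log_store", 30),
    "metric": Tier("tsdb", 395),  # 13 months, approximated in days
    "trace": Tier("trace_store", 30),
    "raw_prompt": Tier("restricted_store", 90, encrypted=True),
    "audit": Tier("audit_log", 7 * 365, immutable=True),
}


def route(signal_type):
    """Return the tier for a signal type; unknown types are rejected, not defaulted."""
    try:
        return TIERS[signal_type]
    except KeyError:
        raise ValueError(f"unknown signal type: {signal_type!r}") from None
```

Rejecting unknown signal types is a deliberate choice in this sketch: a new signal has to be given a tier and a retention window before it can be stored anywhere.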

important

If your "observability" stores everything in one tier with one retention setting, you have made a governance decision by accident. The raw-prompt store and the audit log have opposite requirements (short + erasable vs. long + immutable), and conflating them fails both.

3. Four consumers, four different questions

Storage isn't the point; the questions you can answer are. Each consumer reads a different tier.

Dashboards (logs + metrics + traces) - what is the system doing right now? The operational view.
SLO + burn-rate alerts (metrics) - are we spending our error budget too fast? Pages a human before users feel it.
Drift detector (traces + raw prompts + embeddings) - is the input distribution moving away from what we tested - and for RAG, is the retrieval corpus drifting too? This is the AI-specific one; model quality erodes silently as the world changes.
Regulatory replay (audit log) - can we reconstruct exactly what the system did, months later, for someone who wasn't there? The immutable trail.

The split is the insight: operational health, model-quality erosion, and provable accountability are three different jobs. A latency dashboard tells you nothing about drift. A drift detector can't satisfy an auditor. You need all three, fed by the right tiers.
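
For the SLO consumer, burn rate is commonly defined as the observed error rate divided by the error budget (one minus the SLO target). A minimal Python sketch, with made-up request counts:

```python
def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly on schedule;
    higher values mean it will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


# 50 failed requests out of 10,000 against a 99.9% availability target.
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
```

Here the rate comes out at roughly 5, meaning the budget is burning about five times faster than the SLO allows. What counts as an "error" for an LLM system (timeouts, ungrounded answers, refused tool calls) is a definition the team has to supply.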

Why this is an architecture problem, not a tooling purchase

You can buy a dashboard. You cannot buy the decisions in this diagram.

What to sample (traces, raw payloads) vs always capture (metrics): a latency/cost trade-off.
Where redaction happens (the Collector, before persistence): a privacy boundary.
Which tier is immutable (the audit log): a compliance commitment you design in, not bolt on.
What "healthy" means (the SLOs and drift thresholds): domain knowledge no tool ships with.
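
As one example of a drift threshold, the sketch below compares the mean embedding of a baseline window with the mean embedding of a recent window and flags drift when their cosine distance exceeds a cutoff. This is a deliberately simple technique chosen for illustration, not the only or best drift measure; the 0.1 threshold and the tiny two-dimensional vectors are made up.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def drifted(baseline, current, threshold=0.1):
    """True when the two windows' centroids are further apart than the threshold."""
    return cosine_distance(centroid(baseline), centroid(current)) > threshold


baseline_window = [[1.0, 0.0], [0.9, 0.1]]
shifted_window = [[0.0, 1.0], [0.1, 0.9]]
```

The threshold is the part no tool ships with: it has to be set from what "normal" variation looks like for this system's traffic.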

note

This is the same thesis as "Hallucination" is a design problem: reliability lives in the system around the model. Observability is how you measure that reliability: groundedness, unsupported-claim rate and drift become metrics you log the way you'd log latency.

The precise position

Most teams stand up a metrics dashboard, call it "AI observability," and move on. That covers exactly one of the four consumers above, and not the two that regulators and quality erosion will eventually make you care about.

The architecture that actually holds up captures five signals with deliberate retention, redacts at the collector and feeds four distinct consumers: operational, budget, drift and audit. The diagram isn't decoration; it's the set of decisions you will be asked to defend.

TAKEAWAY

"Monitor your AI" is a slogan. Capture five signals, route them to tiers with deliberate retention, and feed four consumers: dashboards, SLO alerts, drift detection, and regulatory replay. That's the system; everything else is a dashboard pretending to be a strategy.

Hallucination Is a System Design Problem

2026-06-16T00:00:00.000Z

Every time a model invents a citation, the conversation jumps to "which model hallucinates less?" Everyone's focused on picking the model that hallucinates least. That's the wrong question: the model did exactly what it was built to do.

The thing that will actually decide whether your AI system is trustworthy is the architecture you wrap around the model – grounding, retrieval, validation, and an explicit path to "I don't know".

A hallucination isn't a bug the next checkpoint will patch. It's the expected behavior of a frozen, probabilistic next-token predictor asked a question it has no grounded answer for. Treating it as a model defect means you keep waiting for a fix that isn't coming. Treating it as a design problem means you can actually solve it today.

THE CLAIM

Hallucination is not the model failing. It's the model succeeding at the wrong objective – fluent continuation – in a system that never gave it the right one: grounded truth.

Why the model was never going to save you

A trained model is a frozen function: f(tokens) -> next-token probabilities. It has no live knowledge, no source of truth, and no built-in concept of “I don't actually know this”. Three properties make hallucinations structural, not accidental:

Property of the model	Consequence
Frozen at training time	No access to fresh, private or post-cutoff facts - it fills gaps from priors
Optimized for fluency, not truth	The objective was plausible next token, never verified fact
No native abstention	“Confidently wrong” scores the same as “confidently right” unless the system checks

So when you ask something outside what it learned, it doesn't error out - it produces the most statistically plausible continuation. That continuation is often fluent, well-formatted, and wrong. The model isn't broken. It's doing precisely what next-token prediction does.

The model invents a citation because inventing a plausible continuation is the only thing it was ever built to do - truth was never in its objective, so it has to be in your architecture.

note

A bigger or newer model shifts where the cliff is, not whether there is one. You're buying a lower hallucination rate, not a guarantee. Rates don't survive contact with a regulator, an auditor, or a customer who was given a fake policy number.

Why this is a design problem (the enterprise lens)

If the model can't be the source of truth, the system has to be. That reframes hallucinations from "model quality" to "system design" - and design is something you control.

Grounding is an architecture choice, not a model feature. RAG exists precisely because the model's knowledge is frozen. Inject the right context at runtime and the model is continuing from facts instead of inventing from priors. No retrieval layer = you've delegated truth to a frozen function and hoped.
Validation lives outside the model. Guardrails, schema/grounding checks, and citation verification sit around the model - you can't patch behavior inside frozen weights in real time. The system decides what's allowed to reach the user, not the model.
"I don't know" must be an engineered path. Models don't volunteer abstention. Confidence thresholds, retrieval-coverage checks, and explicit fallbacks are what turn a confident guess into an honest "I can't answer that from the sources I have."
Cost and governance ride on this. An ungrounded answer in a bank, a hospital, or a legal workflow isn't a quality blip - it's liability. Design decides whether a wrong answer is impossible to surface or merely cheap to retry.

important

The intelligence is in the model. The truth is in the system. If your architecture has no component that owns "is this actually true and supported?", then nothing does - and the model will happily fill the silence.

Non-determinism is not hallucination

The objection we hear most is "the model gives a different answer every time, so how can you trust it?" It's the strongest argument for the design framing, not against it, but it bundles two different things together.

Different answers ≠ Hallucinations

	Non-determinism	Hallucination
What it is	Different wording for the same question	A confident false claim
Cause	Sampling (temperature, top-p) picks among probable tokens	No grounded fact, so it continues from priors
Your control	Yes - set `temperature=0`	Only via grounding + verification

The model never stores "an answer". At each step it produces a probability distribution over the next token, then samples from it. At temperature > 0 you are rolling a weighted die for every token - hence different phrasings. Set temperature = 0 (greedy decoding) and it becomes near-deterministic: same input -> same output.

(Near, because floating-point rounding and GPU batching cause tiny variations - an engineering detail, not the core issue.)

So "different answers each time" is a knob you control, and turning it down is not proof the model is reliable: a deterministic answer can still be a false one.
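A toy version of that decoding step makes the knob visible. This is a sketch with invented scores and a hypothetical `next_token` helper, not any real model's decoder: it applies a softmax with temperature and samples, or takes the top token outright when temperature is 0.

```python
import math
import random

def next_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token from raw scores; temperature 0 means greedy decoding."""
    if temperature == 0:
        return max(logits, key=logits.get)  # always the single most probable token
    # Softmax with temperature: lower values sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Toy scores, invented for illustration.
logits = {"Paris": 2.0, "Lyon": 1.0, "Rome": 0.5}
rng = random.Random(0)

greedy = {next_token(logits, 0.0, rng) for _ in range(20)}
sampled = {next_token(logits, 1.0, rng) for _ in range(200)}
print(greedy)   # one token, every time
print(sampled)  # several different tokens across runs
```

Notice what neither mode does: check whether the chosen token is true. Temperature controls variety, not correctness.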

There is no 100% surety – and that’s the whole point

Grounding does not guarantee a correct answer. It shifts the probability mass. Without context, the most-probable continuation comes from fuzzy training priors (high risk). With the right context in the prompt, the most-probable continuation becomes "paraphrase what's in front of me" (much lower risk). As a rough illustration, you move from maybe 70% to 95% - never to 100%.

So where does the surety come from? Not the model - a separate verifier. The thing that generates the answer must not be the thing that decides it's trustworthy. A grounded model gives you a good draft - 95%; design decides what happens to the other 5%, whether it silently reaches your user or gets caught and blocked.

note

You can't make a frozen, sampling-based function promise truth - so reliability has to be engineered around it. The model's lack of a guarantee is the reason design exists, not a reason to wait for a better model.

What “designing for it” actually looks like

Those four principles become one concrete pipeline. You don't eliminate hallucinations by hoping - you box them in with layers, each one catching what the last let through.

Retrieve before you generate - give the model facts to continue from, not a blank page.
Constrain the output - structural formats, required citations, schema validation.
Verify against the source - does every claim trace back to retrieved evidence?
Make abstention first-class - "no grounded answer" is a valid, designed outcome, not a failure.
Observe in production - log groundedness and unsupported-claim rates the way you'd log latency. Hallucination is a measurable system metric, not a vibe.
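The first and fourth layers can be sketched in a few lines. Everything here is hypothetical scaffolding: `Chunk`, `Result`, `answer_question`, the `min_score` threshold, and the two fake callables are invented for illustration, and the constraint and verification layers are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Chunk:
    source: str
    text: str
    score: float  # retrieval relevance in [0, 1]; the scale is an assumption

@dataclass
class Result:
    answer: Optional[str]  # None means the system abstained
    sources: list[str]
    reason: str

def answer_question(
    question: str,
    retrieve: Callable[[str], list[Chunk]],
    generate: Callable[[str, list[Chunk]], str],
    min_score: float = 0.6,  # hypothetical threshold, tune on your own data
) -> Result:
    evidence = [c for c in retrieve(question) if c.score >= min_score]
    if not evidence:
        # Abstention is a designed outcome: the model is never called.
        return Result(None, [], "no grounded evidence")
    draft = generate(question, evidence)
    return Result(draft, [c.source for c in evidence], "grounded")

def fake_retrieve(question: str) -> list[Chunk]:
    if "refund" in question:
        return [Chunk("policy.pdf", "Refunds are accepted within 30 days.", 0.9)]
    return [Chunk("faq.md", "Office hours are 9 to 5.", 0.2)]

def fake_generate(question: str, evidence: list[Chunk]) -> str:
    return f"According to {evidence[0].source}: {evidence[0].text}"

grounded = answer_question("What is the refund window?", fake_retrieve, fake_generate)
abstained = answer_question("Who won the cup?", fake_retrieve, fake_generate)
print(grounded.answer)
print(abstained.answer, abstained.reason)
```

The design choice worth copying is that abstention is a return value with a reason, so it can be logged and counted like any other outcome.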

How to actually build the verifier

"Add a verifier" is easy to say. The trap is building one that just re-asks the same model "are you sure?" - it'll rationalize its own output. A good verifier follows two rules and one ordering.

Rule 1 - independent from the generator. The thing that checks the answer must not be the thing that wrote it. Use deterministic code, a retrieval system, or a separate model call that sees only the claim + the source - never the original reasoning.

Rule 2 - verify atomic claims, not paragraphs. "Mostly right" hides one wrong clause. Decompose the answer into individual facts and check each one against evidence.

The ordering - cheapest, most deterministic checks first, expensive models last, on the residue only:

Layer	Mechanism	Catches	Cost
1. Structural	JSON schema, constrained decoding	No citations, malformed output	~Free
2. Deterministic Facts	Exact/fuzzy match against source	Invented numbers, IDs, dates, quotes	~Free
3. Grounding (NLI)	Small entailment model per claim	Unsupported or contradicted claims	Cheap
4. LLM-as-judge	Separate model	Nuanced cases the rest can't settle	Expensive

The verifier doesn't make the system perfect. It converts a silent, confident, wrong answer into a caught-and-blocked one - turning an unbounded risk into a measurable error rate with a fallback. That conversion is exactly what you can put in front of an auditor.
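The two "~Free" layers from the table are easy to sketch. This is a minimal illustration under stated assumptions: the answer is a dict with `text` and `citations` keys (a hypothetical schema), and the deterministic check only looks at numbers, so it would not notice a correct number attached to the wrong fact.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def structural_check(answer: dict) -> bool:
    """Layer 1: the output must carry text and at least one citation."""
    return bool(answer.get("text")) and bool(answer.get("citations"))

def ungrounded_numbers(text: str, source: str) -> list[str]:
    """Layer 2: return every number in the answer that is absent from the source."""
    found = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(text) if n not in found]

def verify(answer: dict, source: str) -> tuple[bool, str]:
    """Run the cheap checks first and stop at the first failure."""
    if not structural_check(answer):
        return False, "blocked: missing text or citations"
    invented = ungrounded_numbers(answer["text"], source)
    if invented:
        return False, f"blocked: numbers not in source: {invented}"
    # Layers 3 and 4 (entailment model, LLM judge) would run here, on what remains.
    return True, "passed cheap checks"

source = "Refunds are accepted within 30 days of purchase."
good = {"text": "You can get a refund within 30 days.", "citations": ["policy.pdf"]}
bad = {"text": "You can get a refund within 90 days.", "citations": ["policy.pdf"]}
print(verify(good, source))
print(verify(bad, source))
print(verify({"text": "Refunds take 30 days."}, source))
```

None of these checks call the generator, which is Rule 1 in practice: the code sees only the claim and the source.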

Where I actually land

I'm not saying models don't matter, or that one model is as good as another. Picking a stronger model genuinely lowers the baseline rate.

I am saying: a better model reduces hallucinations; only better design lets you bound and govern them. If your reliability strategy is "wait for the next model," you've outsourced your most important architectural decision to someone else's release schedule - and you still won't be able to promise an auditor anything.

Stop asking "which model hallucinates the least?" Start asking "what in the system owns the truth, and what happens when it doesn't have an answer?"

TAKEAWAY

Hallucination is the model doing its job inside a system that forgot to do its own. Engineer grounding, validation, and abstention around the frozen model - that's where reliability is actually built.

How LLM Works Under the Hood

2026-06-09T00:00:00.000Z

Most discussions about LLMs focus on prompts, tools, and frameworks. However, few explain how the model actually works under the hood and why that matters when building real systems.

This is a 20,000-ft view of the LLM lifecycle in four stages.

The big picture: one model, four stages.

A model's whole life is just four stages. The shape and vocabulary are fixed first; training only fills in the values, and inference is read-only and never learns.

Stage	What happens	Key ideas
Before	Decide the blueprint	Architecture dials set the shape, tokenizer builds the vocabulary, and parameter count is fixed.
During	Fill in the values	Random weights become meaningful through training: a four-step loop run hundreds of thousands to millions of times.
Alignment	Make it helpful	Show good examples (SFT) and teach which answers are better (RLHF/DPO).
After	Run it, read-only	Weights are frozen (no learning); inference traverses the model geometry one token at a time.

TAKEAWAY

Shape + vocabulary are fixed first. Training only fills the values. Inference never learns.

Stage 1 - Before training

Two human decisions are baked in before any gradient is computed.

Architecture dials - hidden size, layers, heads, FFN width, vocab size.
Tokenizer vocabulary - the integer alphabet the model reads and writes.

A "7B" model is 7B because of these dials. Training never grows it, and most parameters live in the FFN, not attention.

The Architecture dials

Hyperparameter	Example	Description
hidden_size(D)	4096	How much "thinking space" the model has for each word or idea at a given moment.
num_layers(L)	32	How many rounds of refinement - 32 editors in a row.
num_heads(H)	32	A panel of specialists, each spotting a different pattern.
head_dim(D_h)	128	The size of each specialist's notebook.
ffn_hidden(D_ff)	16,384	The knowledge bank, where most facts are stored (~4*D).
vocab_size(V)	32,000	The size of the model's dictionary, the building blocks it uses to read and write language.
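The dials in the table are enough to do the arithmetic yourself. This is a back-of-the-envelope sketch with assumptions spelled out: a plain two-matrix FFN, an output head that is not tied to the input embeddings, and biases and normalization weights ignored. Real 7B-class models differ in these details, so treat the result as an order-of-magnitude check.

```python
# Dials from the table above.
D, L, D_ff, V = 4096, 32, 16_384, 32_000

attention_per_layer = 4 * D * D   # Q, K, V and output projections
ffn_per_layer = 2 * D * D_ff      # up- and down-projection (two-matrix FFN assumed)
embeddings = 2 * V * D            # input table plus an untied output head (assumed)

total = L * (attention_per_layer + ffn_per_layer) + embeddings
ffn_share = L * ffn_per_layer / total

print(f"total parameters: {total / 1e9:.2f}B")  # 6.70B
print(f"FFN share: {ffn_share:.0%}")            # 64%
```

Under these assumptions the count lands just under 7B, with roughly two thirds of it in the FFN blocks, and none of it depends on a single training step.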

TAKEAWAY

The model is fully sized and described before it sees a single token.

Stage 2 - During training

Learning is one four-step loop, repeated hundreds of thousands to millions of times.

Forward pass - Predict what comes next in a sequence, based on the previous tokens.
Loss - How wrong was the prediction?
Backpropagation - Calculate how much each weight contributed to the error, and in which direction.
Optimizer step - Update every weight, nudging each one slightly to reduce the error.
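Here is the same four-step loop at toy scale. It is a sketch, not how a transformer is trained: the "model" is a hypothetical bigram table with one weight per (previous token, next token) pair, the gradient is written out by hand, and the optimizer is plain gradient descent with an arbitrary learning rate.

```python
import math
import random

text = "abababababab"
vocab = sorted(set(text))               # ['a', 'b']
ids = [vocab.index(ch) for ch in text]
V = len(vocab)
pairs = list(zip(ids[:-1], ids[1:]))    # (previous token, next token) examples

rng = random.Random(0)
# One weight per (previous, next) pair, starting as small random numbers.
W = [[rng.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]
lr = 0.5                                # arbitrary learning rate for the toy

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

losses = []
for step in range(50):
    grad = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for prev, nxt in pairs:
        probs = softmax(W[prev])                  # 1. forward pass
        loss += -math.log(probs[nxt])             # 2. loss (cross-entropy)
        for j in range(V):                        # 3. backpropagation
            grad[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
    for i in range(V):                            # 4. optimizer step
        for j in range(V):
            W[i][j] -= lr * grad[i][j] / len(pairs)
    losses.append(loss / len(pairs))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The structure of `W` never changes during the loop; only the numbers inside it do.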

note

The only thing learned here is next-token prediction: the statistical relationship between tokens given their surrounding context. Pre-training delivers language and knowledge; it does not shape behavior (following instructions, being helpful, staying safe). That comes later, in alignment.

From random numbers to learned meaning

Before training (random)	After training (meaning)
Every weight is a random number	Every weight holds a learned value
Output is gibberish	Output is fluent, coherent text
No grammar, facts, or reasoning	Grammar, facts, and reasoning emerge
Structure exists, meaning doesn't	Same structure: now full of meaning

TAKEAWAY

Learning is the same four-step loop, running hundreds of thousands to millions of times, turning random numbers into meaning.

The roles that emerge after training

Components start as random numbers with no predefined purpose. Over hundreds of thousands to millions of training steps, gradient descent gradually shapes them into specialized roles, learned through experience, not explicitly designed.

Component	Role it settles into
Embeddings	What tokens mean (lexical meaning)
Attention	How tokens relate: routes relevant context
FFNs	Transformation / "thinking": where most of the parameters and reasoning live
LayerNorm	Keep signals stable and usable
Depth (layers)	Progressive refinement of understanding

TAKEAWAY

No one designs these roles; they emerge from training.

Stage 3 - Alignment

A raw pre-trained model is a brilliant autocomplete, not yet a helpful assistant. Alignment is a thin, cheap layer on top of pre-training that shapes behavior.

	Main training	Polish (alignment)
Data	Trillions of words	Thousands to millions of examples
Length (cost)	Weeks/months, huge cost	Short, cheap
What it does	Teaches knowledge	Shapes behavior

SFT - show it good (prompt, response) examples.
RLHF/DPO - teach it which answer is better.

TAKEAWAY

Alignment turns a raw model into a helpful assistant: it shapes behavior; it doesn't add new knowledge.

Stage 4 - After training

Once training stops, weights are frozen: no learning, no gradients. The model is a fixed function f(tokens) -> next-token probabilities.

During inference, the model has no memory of what was asked or answered before: each request starts fresh.
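A toy stand-in shows what "starts fresh" means for the application. `frozen_model` below is a hypothetical placeholder for f(tokens), returning a canned string rather than probabilities; the only point is that its output depends on nothing except the input it is handed.

```python
def frozen_model(tokens: tuple[str, ...]) -> str:
    """Stand-in for f(tokens): the output depends only on the input it is given."""
    if "Ada" in tokens:
        return "Your name is Ada."
    return "I don't know your name."

# Two independent requests: nothing carries over from the first to the second.
first = frozen_model(("My", "name", "is", "Ada"))
second = frozen_model(("What", "is", "my", "name", "?"))

# "Memory" is the application resending the transcript as part of the next input.
history = ("My", "name", "is", "Ada")
with_history = frozen_model(history + ("What", "is", "my", "name", "?"))

print(second)
print(with_history)
```

Conversation memory, in other words, is something the surrounding system builds and pays for in tokens on every request.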

TAKEAWAY

Training builds the geometry. Inference just navigates it one token at a time.

The Mental Model most people get wrong

LLM ≠ continuously learning system
LLM ≠ dynamic knowledge base
LLM ≠ autonomous agent

What this means for Enterprise Systems

Understanding how LLMs actually work leads to a critical shift in how we design AI systems. The model itself is not "the system". It's a fixed component inside a larger architecture.

1. Why RAG is required

LLMs do not have access to fresh or private data. Their knowledge is fixed at training time.

To make them useful in enterprise:

Connect them to internal data sources
Inject context at runtime

This is why Retrieval-Augmented Generation (RAG) becomes a foundational pattern.

2. Why agents/orchestration are external

LLMs are:

Stateless
Reactive
Single-step predictors

They cannot:

Execute workflows
Maintain long-running state
Coordinate systems

This is why agentic systems and orchestration layers exist outside the model.

important

The intelligence is in the model and the control is in the system design.

3. Why governance is outside the model

You cannot "patch" behavior inside a trained model in real time. Enterprise systems must implement:

Guardrails
Validation layers
Monitoring and evaluation
Policy enforcement

All of these sit around the model, not inside it.

4. Why inference cost dominates

Training is:

One-time
Expensive but amortized

Inference is:

Continuous
Scales with usage

important

For enterprise systems: Cost = traffic * tokens * latency requirements

5. Why scale and cost must be designed upfront

Because LLMs don't learn in production, every interaction requires:

Full inference execution
Token processing (input+output)
External system calls (RAG/agents)

This means:

Cost scales with usage, not with training
Latency compounds across system layers
Poor design = multiplicative cost growth

In real systems, if not handled correctly:

RAG increases token usage
Agents introduce multi-step execution
Orchestration adds round trips
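A rough estimator makes the multiplication concrete. All numbers here are invented: the `Workload` fields, the traffic, the token counts, and the per-1k-token prices are placeholders rather than any vendor's rates, and latency is not modelled at all.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    requests_per_day: int
    prompt_tokens: int    # user question plus system prompt
    context_tokens: int   # retrieved chunks added by RAG
    output_tokens: int
    model_calls: int      # 1 for a single call; more for agent loops

def daily_cost(w: Workload, usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Daily spend in USD; prices are placeholders, not real rates."""
    input_tokens = (w.prompt_tokens + w.context_tokens) * w.model_calls
    output_tokens = w.output_tokens * w.model_calls
    per_request = (input_tokens * usd_per_1k_input + output_tokens * usd_per_1k_output) / 1000
    return w.requests_per_day * per_request

plain = Workload(10_000, 200, 0, 300, 1)
rag = Workload(10_000, 200, 3_000, 300, 1)
agent = Workload(10_000, 200, 3_000, 300, 4)

for name, w in [("plain", plain), ("rag", rag), ("agent", agent)]:
    print(name, round(daily_cost(w, 0.001, 0.002), 2))
```

With these made-up inputs, adding retrieved context multiplies daily spend by about 4.75, and running four model calls per request multiplies it by four again. The traffic did not change; the design did.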

important

Training is a one-off capital cost; inference is the ongoing operational cost. Without careful design, AI systems become unpredictable and expensive at scale.

Final Takeaway

The model provides intelligence and the system provides control.

Modern AI architecture is not “LLM design”. It is “system design around a frozen model”.

Traffic × Tokens × Latency

TAKEAWAY

Treat the LLM as a frozen dependency; engineer everything else around it.
