<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://jitendersharma.dev/blogs</id>
    <title>Jitender Sharma - Architects Handbook Blog</title>
    <updated>2026-06-18T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://jitendersharma.dev/blogs"/>
    <subtitle>Jitender Sharma - Architects Handbook Blog</subtitle>
    <icon>https://jitendersharma.dev/img/favicon.ico</icon>
    <entry>
        <title type="html"><![CDATA[AI Observability In Enterprise]]></title>
        <id>https://jitendersharma.dev/blogs/ai-observability-in-enterprise</id>
        <link href="https://jitendersharma.dev/blogs/ai-observability-in-enterprise"/>
        <updated>2026-06-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[AI Observability In Enterprise]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="AI Observability In Enterprise" src="https://jitendersharma.dev/assets/images/ai-observability-in-enterprise-55eeff32f54218da0d324483661c8048.png" width="1536" height="1024" class="img_ev3q"></p>
<p>Everyone says "monitor your AI in production". Almost nobody draws the system that does it. "Add Observability" is a slogan until you can say <strong>exactly what gets captured, where it lands, how long it lives, and who reads it.</strong></p>
<p>This is an <strong>architecture breakdown</strong> - capture in the request path, fan-out into purpose-built storage tiers, and four very different consumers reading off them. The headline: AI observability isn't one thing. It's <strong>five signals with five retention policies feeding four jobs</strong>, and the regulator-facing ones look nothing like the dashboard-facing ones.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>THE CLAIM</div><div class="admonitionContent_BuS1"><p>AI observability is not "a dashboard". It's a <strong>capture-and-retention architecture</strong>: each signal (logs, metrics, traces, raw prompts, audit records) has a different consumer, a different retention window, and a different blast radius if you get it wrong.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-whole-system-on-one-page">The whole system on one page<a href="https://jitendersharma.dev/blogs/ai-observability-in-enterprise#the-whole-system-on-one-page" class="hash-link" aria-label="Direct link to The whole system on one page" title="Direct link to The whole system on one page" translate="no">​</a></h2>
<!-- -->
<p>Read it left to right: <strong>capture -&gt; store -&gt; consumer</strong>. The rest of this piece is just the reasoning behind each arrow.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>This isn't only for AI</div><div class="admonitionContent_BuS1"><p>The <code>capture-&gt;store-&gt;consume</code> backbone here isn't AI-specific. Swap the <strong>Agentic app / RAG service</strong> node for a microservice, a VM-hosted app, or a COTS product and the skeleton is unchanged: emit OTel signals, fan them out to tiers with deliberate retention, feed operational / SLO / audit consumers. Only <strong>two boxes are the AI-specific part</strong>:
the <em>raw prompt/response</em> store and the <em>drift detector</em>. Drop those and you're left with a perfectly standard service-observability architecture. So you don't need a different observability stack for non-agentic systems; you just need fewer arrows on the same one.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-capture-lives-in-the-request-path---and-thats-the-hard-constraint">1. Capture lives in the request path - and that's the hard constraint<a href="https://jitendersharma.dev/blogs/ai-observability-in-enterprise#1-capture-lives-in-the-request-path---and-thats-the-hard-constraint" class="hash-link" aria-label="Direct link to 1. Capture lives in the request path - and that's the hard constraint" title="Direct link to 1. Capture lives in the request path - and that's the hard constraint" translate="no">​</a></h2>
<p>The app (an agent, a RAG service, any LLM system) emits <strong>three OpenTelemetry signals</strong> (logs, metrics, traces) through an OTel SDK into an <strong>OTel Collector</strong> that sits in the hot path. Two design consequences fall out of that immediately:</p>
<ul>
<li class=""><strong>Instrumentation is not free.</strong> Every signal you emit costs latency and money on the request path. That's why the boring signals (metrics) are cheap and always-on, while the expensive ones (traces, raw payloads) are <strong>sampled</strong> or <strong>gated</strong>.</li>
<li class=""><strong>The Collector is the control point.</strong> Routing, sampling, redaction, and fan-out happen <em>once</em>, in the Collector - not scattered across app code. This is where you strip PII before it ever reaches a long-lived store.</li>
</ul>
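<p>As a rough illustration of "redact once, then fan out", here is a minimal Python sketch of that control point. It is not the OpenTelemetry Collector API (a real Collector is driven by its own processor and exporter configuration). The function names <code>redact</code>, <code>keep_trace</code>, <code>route</code> and <code>process</code>, the tier names, the e-mail-only PII pattern, and the 10% sample rate are all assumptions made for illustration.</p>

```python
import hashlib
import re

# Assumption: only e-mail addresses are treated as PII in this sketch.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask e-mail addresses before the record reaches any store."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def keep_trace(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministic sampling: the same trace_id always gets the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sample_rate * 10_000


def route(record: dict) -> list[str]:
    """Return the (hypothetical) tier names a record should be written to."""
    kind = record["kind"]
    if kind == "metric":
        return ["tsdb"]                  # cheap, always on
    if kind == "log":
        return ["log_store"]
    if kind == "trace":
        return ["trace_store"] if keep_trace(record["trace_id"]) else []
    if kind == "prompt":
        return ["restricted_store"]      # gated, encrypted tier
    if kind == "audit":
        return ["audit_log"]             # never sampled
    raise ValueError(f"unknown signal kind: {kind}")


def process(record: dict) -> tuple[dict, list[str]]:
    """Redact first, then decide the fan-out, so no tier sees the raw body."""
    cleaned = dict(record)
    if "body" in cleaned:
        cleaned["body"] = redact(cleaned["body"])
    return cleaned, route(cleaned)
```

<p>The point of the shape, rather than the details: redaction and sampling live in one function that every record passes through, so changing a privacy rule is one edit instead of a hunt through application code.</p>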
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Using vendor-neutral <strong>OpenTelemetry</strong> at the capture layer is the decision that keeps your backends swappable. The signals are standardized; where they land is a routing config, not a rewrite.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-five-signal-five-storage-tiers-five-retention-policies">2. Five signals, five storage tiers, five retention policies<a href="https://jitendersharma.dev/blogs/ai-observability-in-enterprise#2-five-signal-five-storage-tiers-five-retention-policies" class="hash-link" aria-label="Direct link to 2. Five signals, five storage tiers, five retention policies" title="Direct link to 2. Five signals, five storage tiers, five retention policies" translate="no">​</a></h2>
<p>This is the part most "monitoring" setups collapse into one bucket - and it's exactly where AI systems differ from ordinary services. <strong>Retention is a governance decision, not a storage default</strong>.</p>
<table><thead><tr><th><strong>Signal</strong></th><th><strong>Store</strong></th><th><strong>Retention</strong></th><th><strong>Why this window</strong></th></tr></thead><tbody><tr><td><strong>Structured logs</strong></td><td>Log store</td><td><strong>90 d</strong></td><td>Operational debugging; cheap to keep short, noisy to keep long</td></tr><tr><td><strong>Metrics</strong></td><td>Time-series DB (TSDB)</td><td><strong>13 mo</strong></td><td>Trend + year-over-year comparison; tiny per-point cost</td></tr><tr><td><strong>Sampled traces</strong></td><td>Trace store</td><td><strong>90 d</strong></td><td>Latency/causality debugging; full traces are expensive, so sample</td></tr><tr><td><strong>Raw prompt/response</strong></td><td>Restricted store</td><td><strong>encrypted, 90 d</strong></td><td>Sensitive content - quality/drift analysis, tightly access-controlled</td></tr><tr><td><strong>Audit record</strong></td><td>Audit log</td><td><strong>immutable, 7 y</strong></td><td>Compliance evidence - must survive, must not be editable</td></tr></tbody></table>
<p>The two dotted arrows in the diagram matter. <strong>Raw prompt/response</strong> and <strong>audit records</strong> are not routine telemetry - they are <strong>sensitive, governed</strong> signals. One is encrypted and short-lived; the other is immutable and kept for years. Treating either like a normal log is how you end up with PII in a debug dashboard or a compliance gap at audit time.</p>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>If your "observability" stores everything in one tier with one retention setting, you have made a governance decision by accident. The raw-prompt store and the audit log have <strong>opposite</strong> requirements (<em>short + erasable vs long + immutable</em>), and conflating them fails both.</p></div></div>
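<p>One way to keep that decision deliberate is to write the retention table down as data and check it. The sketch below is a minimal Python example under stated assumptions: the <code>Tier</code> class, the <code>POLICY</code> mapping, the store names and the <code>governance_violations</code> helper are hypothetical, and months and years are approximated in days.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    store: str
    retention_days: int
    encrypted: bool = False
    immutable: bool = False


# Windows copied from the retention table; 13 months and 7 years are
# approximated in days (an assumption of this sketch).
POLICY = {
    "structured_logs": Tier("log_store", 90),
    "metrics": Tier("tsdb", 13 * 30),
    "sampled_traces": Tier("trace_store", 90),
    "raw_prompt_response": Tier("restricted_store", 90, encrypted=True),
    "audit_record": Tier("audit_log", 7 * 365, immutable=True),
}


def governance_violations(policy: dict[str, Tier]) -> list[str]:
    """Flag the accidental decisions: shared stores and mismatched guarantees."""
    problems = []
    stores = [tier.store for tier in policy.values()]
    if len(set(stores)) != len(stores):
        problems.append("two signals share one store")
    raw, audit = policy["raw_prompt_response"], policy["audit_record"]
    if not raw.encrypted or raw.immutable:
        problems.append("raw prompts must be encrypted and erasable")
    if not audit.immutable:
        problems.append("audit records must be immutable")
    return problems
```

<p>Run as a check in CI, a function like this turns "someone pointed the audit exporter at the log store" from a finding at audit time into a failed build.</p>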
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-four-consumers-four-different-questions">3. Four consumers, four different questions<a href="https://jitendersharma.dev/blogs/ai-observability-in-enterprise#3-four-consumers-four-different-questions" class="hash-link" aria-label="Direct link to 3. Four consumers, four different questions" title="Direct link to 3. Four consumers, four different questions" translate="no">​</a></h2>
<p>Storage isn't the point; the questions you can answer are. Each consumer reads a different tier.</p>
<ul>
<li class=""><strong>Dashboards</strong> (logs + metrics + traces) - <em>what is the system doing right now</em>? The operational view.</li>
<li class=""><strong>SLO + burn-rate alerts</strong> (metrics) - <em>are we spending our error budget too fast?</em> Pages a human before users feel it.</li>
<li class=""><strong>Drift detector</strong> (traces + raw prompts + embeddings) - <em>is the input distribution moving away from what we tested - and, for RAG, is the retrieval corpus drifting too</em>? This is the AI-specific one; model quality erodes silently as the world changes.</li>
<li class=""><strong>Regulatory replay</strong> (audit log) - <em>can we reconstruct exactly what the system did, months later, for someone who wasn't there?</em> The immutable trail.</li>
</ul>
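<p>To make the burn-rate question concrete: burn rate is the observed error rate divided by the error budget the SLO allows. The Python sketch below assumes a simple request-based SLO; the function names are hypothetical, and the default threshold of 14.4 is a commonly cited paging value for a one-hour window against a 30-day budget, used here as an assumption rather than a recommendation.</p>

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target      # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(long_window: float, short_window: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the problem
    is real, the short window shows it is still happening."""
    return long_window >= threshold and short_window >= threshold
```

<p>With a 99.9% SLO, a 2% error rate is a burn rate of roughly 20: the month's budget would be gone in about a day and a half, which is why this consumer pages a human instead of drawing a chart.</p>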
<!-- -->
<p>The split is the insight: <strong>operational health, model-quality erosion, and provable accountability are three different jobs.</strong> A latency dashboard tells you nothing about drift. A drift detector can't satisfy an auditor. You need all three, fed by the right tiers.</p>
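<p>For a sense of why a latency dashboard can't see drift, here is a deliberately crude drift check in Python: compare the centroid of baseline prompt embeddings with the centroid of live ones. Everything here is an assumption for illustration, including the function names, the centroid comparison itself (production detectors typically use richer distribution tests), and the 0.2 threshold.</p>

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / norm


def drifted(baseline: list[list[float]], live: list[list[float]], threshold: float = 0.2) -> bool:
    """True when the live centroid has moved further than the threshold."""
    return cosine_distance(centroid(baseline), centroid(live)) > threshold
```

<p>Nothing in this check looks at latency or error codes, which is the point: the service can be fast and green while the inputs quietly move away from what was tested.</p>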
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-is-an-architecture-problem-not-a-tooling-purchase">Why this is an architecture problem, not a tooling purchase<a href="https://jitendersharma.dev/blogs/ai-observability-in-enterprise#why-this-is-an-architecture-problem-not-a-tooling-purchase" class="hash-link" aria-label="Direct link to Why this is an architecture problem, not a tooling purchase" title="Direct link to Why this is an architecture problem, not a tooling purchase" translate="no">​</a></h2>
<p>You can buy a dashboard. You cannot buy the <strong>decisions</strong> in this diagram.</p>
<ul>
<li class=""><strong>What to sample</strong> (traces, raw payloads) vs <strong>always capture</strong> (metrics): a latency/cost trade-off.</li>
<li class=""><strong>Where redaction happens</strong> (the Collector, before persistence): a privacy boundary.</li>
<li class=""><strong>Which tier is immutable</strong> (the audit log): a compliance commitment you design in, not bolt on.</li>
<li class=""><strong>What "healthy" means</strong> (the SLOs and drift thresholds): domain knowledge no tool ships with.</li>
</ul>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>This is the same thesis as <a class="" href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem">"Hallucination" is a design problem:</a> reliability lives in the <strong>system around the model.</strong> Observability is how you <em>measure</em> that reliability: groundedness, unsupported-claim rate and drift become metrics you log the way you'd log latency.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-precise-position">The precise position<a href="https://jitendersharma.dev/blogs/ai-observability-in-enterprise#the-precise-position" class="hash-link" aria-label="Direct link to The precise position" title="Direct link to The precise position" translate="no">​</a></h2>
<p>Most teams stand up a metrics dashboard, call it "AI observability," and move on. That covers exactly one of the four consumers above, and not the two that regulators and quality erosion will eventually make you care about.</p>
<p>The architecture that actually holds up captures <strong>five signals with deliberate retention</strong>, redacts <strong>at the collector</strong> and feeds <strong>four distinct consumers</strong>: operational, budget, drift and audit. The diagram isn't decoration; it's the set of decisions you will be asked to defend.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>"Monitor your AI" is a slogan. <strong>Capture five signals, route them to tiers with deliberate retention, and feed four consumers: dashboards, SLO alerts, drift detection, and regulatory replay.</strong> That's the system; everything else is a dashboard pretending to be a strategy.</p></div></div>]]></content>
        <author>
            <name>Jitender Sharma</name>
        </author>
        <category label="ai" term="ai"/>
        <category label="Observability" term="Observability"/>
        <category label="Architecture" term="Architecture"/>
        <category label="Platform, Integration &amp; Governance" term="Platform, Integration &amp; Governance"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Hallucinations Is a System Design Problem, Not a Model Problem]]></title>
        <id>https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem</id>
        <link href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem"/>
        <updated>2026-06-16T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Hallucinations Is a System Design Problem, Not a Model Problem]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Hallucinations Is a System Design Problem, Not a Model Problem" src="https://jitendersharma.dev/assets/images/hallucinations-69a7823f450757ca5783b0de28369679.png" width="1254" height="705" class="img_ev3q"></p>
<p>Every time a model invents a citation, the conversation jumps to "which model hallucinates less?". That's the wrong question. The model did exactly what it was built to do. Everyone's focused on <strong>picking the model that hallucinates least</strong>.</p>
<p>The thing that will actually decide whether your AI system is trustworthy is <strong>the architecture you wrap around the model</strong> – grounding, retrieval, validation, and an explicit path to "I don't know".</p>
<p>A hallucination isn't a bug the next checkpoint will patch. It's the <strong>expected behavior</strong> of a frozen, probabilistic next-token predictor asked a question it has no grounded answer for. Treating it as a model defect means you keep waiting for a fix that isn't coming. Treating it as a design problem means you can actually solve it today.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span> THE CLAIM</div><div class="admonitionContent_BuS1"><p>Hallucination is not the model failing. It's the model succeeding at the wrong objective – fluent continuation – in a system that never gave it the right one: grounded truth.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-the-model-was-never-going-to-save-you">Why the model was never going to save you<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#why-the-model-was-never-going-to-save-you" class="hash-link" aria-label="Direct link to Why the model was never going to save you" title="Direct link to Why the model was never going to save you" translate="no">​</a></h2>
<p>A trained model is a <strong>frozen function</strong>: <code>f(tokens) -&gt; next-token probabilities</code>. It has no live knowledge, no source of truth, and no built-in concept of “I don't actually know this”. Three properties make hallucinations structural, not accidental:</p>
<table><thead><tr><th>Property of the model</th><th>Consequence</th></tr></thead><tbody><tr><td><strong>Frozen at training time</strong></td><td>No access to fresh, private or post-cutoff facts - it fills gaps from priors</td></tr><tr><td><strong>Optimized for fluency, not truth</strong></td><td>The objective was plausible next token, never verified fact</td></tr><tr><td><strong>No native abstention</strong></td><td>“Confidently wrong” scores the same as confident and right unless the system checks</td></tr></tbody></table>
<p>So when you ask something outside what it learned, it doesn't error out - it produces the most statistically plausible continuation. That continuation is often fluent, well-formatted, and wrong. The model isn't broken. It's doing precisely what next-token prediction does.</p>
<p>The model invents a citation because inventing a plausible continuation is the only thing it was ever built to do - truth was never in its objective, so it has to be in your architecture.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>A bigger or newer model shifts where the cliff is, not whether there is one. You're buying a lower hallucination rate, not a guarantee. Rates don't survive contact with a regulator, an auditor, or a customer who was given a fake policy number.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-is-a-design-problem-the-enterprise-lens">Why this is a design problem (the enterprise lens)<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#why-this-is-a-design-problem-the-enterprise-lens" class="hash-link" aria-label="Direct link to Why this is a design problem (the enterprise lens)" title="Direct link to Why this is a design problem (the enterprise lens)" translate="no">​</a></h2>
<p>If the model can't be the source of truth, <strong>the system has to be</strong>. That reframes hallucinations from "model quality" to "system design" - and design is something you control.</p>
<ul>
<li class=""><strong>Grounding is an architecture choice, not a model feature</strong>. RAG exists precisely because the model's knowledge is frozen. Inject the right context at runtime and the model is <em>continuing from facts</em> instead of <em>inventing from priors</em>. No retrieval layer = you've delegated truth to a frozen function and hoped.</li>
<li class=""><strong>Validation lives outside the model</strong>. Guardrails, schema/grounding checks, and citation verification sit <em>around</em> the model - you can't patch behaviors inside frozen weights in real time. The system decides what's allowed to reach the user, not the model.</li>
<li class=""><strong>"I don't know" must be an engineered path</strong>. Models don't volunteer abstention. Confidence thresholds, retrieval-coverage checks, and explicit fallbacks are what turn a confident guess into an honest "I can't answer that from the sources I have."</li>
<li class=""><strong>Cost and governance ride on this</strong>. An ungrounded answer in a bank, a hospital, or a legal workflow isn't a quality blip - it's liability. Design decides whether a wrong answer is impossible to surface or merely cheap to retry.</li>
</ul>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>The <strong>intelligence</strong> is in the model. The <strong>truth</strong> is in the system. If your architecture has no component that owns "is this actually true and supported?", then nothing does - and the model will happily fill the silence.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="non-determinism-is-not-hallucination">Non-determinism is not hallucination<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#non-determinism-is-not-hallucination" class="hash-link" aria-label="Direct link to Non-determinism is not hallucination" title="Direct link to Non-determinism is not hallucination" translate="no">​</a></h2>
<p>The objection we hear most is that the model gives a different answer every time, so it can't be trusted. It's actually the strongest argument for the design framing, not against it - but it bundles two different things together.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="different-answers--hallucinations">Different answers ≠ Hallucinations<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#different-answers--hallucinations" class="hash-link" aria-label="Direct link to Different answers ≠ Hallucinations" title="Direct link to Different answers ≠ Hallucinations" translate="no">​</a></h3>
<table><thead><tr><th></th><th><strong>Non-determinism</strong></th><th><strong>Hallucination</strong></th></tr></thead><tbody><tr><td>What it is</td><td>Different wording for the same question</td><td>A <em>confident false claim</em></td></tr><tr><td>Cause</td><td><strong>Sampling</strong> (temperature, top-p) picks among probable tokens</td><td>No grounded fact, so it continues from priors</td></tr><tr><td>Your control</td><td>Yes - set <code>temperature=0</code></td><td>Only via grounding + verification</td></tr></tbody></table>
<p>The model never stores "an answer". At each step it produces a <strong>probability distribution</strong> over the next token, then <em>samples</em> from it. At <code>temperature &gt; 0</code> you are rolling a weighted die for every token - hence different phrasings. Set <code>temperature = 0</code> (greedy decoding) and it becomes <strong>near-deterministic</strong>: same input -&gt; same output.</p>
<br>
<p><code>(near, because floating-point rounding and GPU batching cause tiny variations - an engineering detail, not the core issue.)</code></p>
<br>
<p>So "different answers each time" is a <strong>knob you control</strong>, not proof the model is unreliable.</p>
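<p>The knob is easy to see in a toy decoder. This Python sketch uses made-up logits for three candidate tokens; the function names and values are illustrative assumptions, not any model provider's API. Temperature rescales the logits before the softmax, and <code>temperature = 0</code> is treated as greedy decoding (take the single most likely token).</p>

```python
import math
import random


def next_token_probs(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax with temperature; a lower temperature sharpens the distribution."""
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())          # subtract the max for numerical stability
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def pick_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Greedy at temperature 0, weighted sampling otherwise."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = next_token_probs(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
```

<p>Note what the knob does not change: if the highest-scoring token is wrong, greedy decoding returns the same wrong token every time. Determinism and correctness are separate properties.</p>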
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="there-is-no-100-surety--and-thats-the-whole-point">There is no 100% surety – and that’s the whole point<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#there-is-no-100-surety--and-thats-the-whole-point" class="hash-link" aria-label="Direct link to There is no 100% surety – and that’s the whole point" title="Direct link to There is no 100% surety – and that’s the whole point" translate="no">​</a></h3>
<p>Grounding does not guarantee a correct answer. It shifts the probability mass. Without context, the most-probable continuation comes from <em>fuzzy</em> training priors (high risk). With the right context in the prompt, the most-probable continuation becomes <em>"paraphrase what's in front of me"</em> (much lower risk). You move from maybe ~70% to ~95% - <strong>never to 100%</strong>.</p>
<br>
<p>So where does the surety come from? <strong>Not the model - a separate verifier</strong>. The thing that generates the answer must not be the thing that decides it's trustworthy. A grounded model gives you a good draft - 95%; design decides what happens to the other 5%, whether it silently reaches your user or gets caught and blocked.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>You can't make a frozen, sampling-based function promise truth - so reliability <strong>has to</strong> be engineered around it. The model's lack of a guarantee is the reason design exists, not a reason to wait for a better model.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-designing-for-it-actually-looks-like">What “designing for it” actually looks like<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#what-designing-for-it-actually-looks-like" class="hash-link" aria-label="Direct link to What “designing for it” actually looks like" title="Direct link to What “designing for it” actually looks like" translate="no">​</a></h2>
<p>Those four principles become one concrete pipeline. You don't eliminate hallucinations by hoping - you <strong>box them in</strong> with layers, each one catching what the last let through.</p>
<!-- -->
<ul>
<li class=""><strong>Retrieve before you generate</strong> - give the model facts to continue from, not a blank page.</li>
<li class=""><strong>Constrain the output</strong> - structured formats, required citations, schema validation.</li>
<li class=""><strong>Verify against the source</strong> - does every claim trace back to retrieved evidence?</li>
<li class=""><strong>Make abstention first-class</strong> - "no grounded answer" is a valid, designed outcome, not a failure.</li>
<li class=""><strong>Observe in production</strong> - log groundedness and unsupported-claim rates the way you'd log latency. Hallucination is a measurable system metric, not a vibe.</li>
</ul>
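<p>The first four layers can be sketched as a single function in which abstention is an ordinary return value. This is a minimal Python illustration, not a reference implementation: <code>retrieve</code>, <code>generate</code> and <code>verify</code> are hypothetical callables you would supply, and <code>min_passages</code> is an assumed stand-in for a real retrieval-coverage check. Production logging is left out.</p>

```python
from typing import Callable

ABSTAIN = "I can't answer that from the sources I have."


def answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str, list[str]], bool],
    min_passages: int = 1,
) -> str:
    """Retrieve, generate, verify; abstain whenever a layer cannot vouch for the draft."""
    passages = retrieve(question)
    if len(passages) < min_passages:     # nothing to ground on: do not generate
        return ABSTAIN
    draft = generate(question, passages)
    if not verify(draft, passages):      # the generator never judges its own draft
        return ABSTAIN
    return draft
```

<p>Because the fallback is a designed return path rather than an exception, the rate at which it fires is itself something you can count and chart.</p>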
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-to-actually-build-the-verifier">How to actually build the verifier<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#how-to-actually-build-the-verifier" class="hash-link" aria-label="Direct link to How to actually build the verifier" title="Direct link to How to actually build the verifier" translate="no">​</a></h2>
<p>"Add a verifier" is easy to say. The trap is building one that just re-asks the same model "are you sure?" - it'll rationalize its own output. A good verifier follows two rules and one ordering.</p>
<p><strong>Rule 1 - independent from the generator.</strong> The thing that <em>checks</em> the answer must not be the thing that wrote it. Use deterministic code, a retrieval system, or a <em>separate</em> model call that sees only the claim + the source - never the original reasoning.</p>
<p><strong>Rule 2 - verify atomic claims, not paragraphs.</strong> "Mostly right" hides one wrong clause. Decompose the answer into individual facts and check each one against evidence.</p>
<p><strong>The ordering - cheapest, most deterministic checks first, expensive models last, on the residue only:</strong></p>
<!-- -->
<table><thead><tr><th><strong>Layer</strong></th><th><strong>Mechanism</strong></th><th><strong>Catches</strong></th><th><strong>Cost</strong></th></tr></thead><tbody><tr><td><strong>1. Structural</strong></td><td>JSON schema, constrained decoding</td><td>No citations, malformed output</td><td>~Free</td></tr><tr><td><strong>2. Deterministic Facts</strong></td><td>Exact/fuzzy match against source</td><td>Invented numbers, IDs, dates, quotes</td><td>~Free</td></tr><tr><td><strong>3. Grounding (NLI)</strong></td><td>Small entailment model per claim</td><td>Unsupported or contradicted claims</td><td>Cheap</td></tr><tr><td><strong>4. LLM-as-judge</strong></td><td><em>Separate</em> model</td><td>Nuanced cases the rest can't settle</td><td>Expensive</td></tr></tbody></table>
<p>The verifier doesn't make the system perfect. It converts a <em>silent, confident, wrong answer</em> into a caught-and-blocked one - turning an unbounded risk into a <strong>measurable error rate with a fallback</strong>. That conversion is exactly what you can put in front of an auditor.</p>
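<p>To make the ordering concrete, here is a minimal sketch of a layered verifier, not a production implementation. It assumes the generator returns JSON shaped like <code>{"claims": ["...", "..."]}</code>; that shape, the <code>Verdict</code> type, and the <code>model_check</code> callback (standing in for an NLI model or a separate judge) are all hypothetical names chosen for illustration.</p>

```python
import json
import re
from dataclasses import dataclass

NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass
class Verdict:
    passed: bool
    layer: str
    reason: str


def verify(raw_answer, source, model_check=None):
    # Layer 1 - structural: the answer must parse and carry a list of claims.
    try:
        claims = json.loads(raw_answer)["claims"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return Verdict(False, "structural", "malformed output or no claims")
    if not isinstance(claims, list) or not claims:
        return Verdict(False, "structural", "empty claim list")
    if not all(isinstance(claim, str) for claim in claims):
        return Verdict(False, "structural", "claims must be strings")

    # Layer 2 - deterministic facts: every number in a claim must occur in the source.
    source_numbers = set(NUMBER.findall(source))
    for claim in claims:
        invented = [n for n in NUMBER.findall(claim) if n not in source_numbers]
        if invented:
            return Verdict(False, "deterministic", f"numbers not in source: {invented}")

    # Layers 3-4 - model-based checks run last, one atomic claim at a time,
    # and only on answers the cheap layers could not reject.
    if model_check is not None:
        for claim in claims:
            if not model_check(claim, source):
                return Verdict(False, "model", f"unsupported claim: {claim}")

    return Verdict(True, "all", "every claim passed")
```

<p>An answer with an invented number is rejected at the deterministic layer, so the expensive <code>model_check</code> is never called for it. A failed verdict is where the fallback hooks in: abstain, retry, or route to a human.</p>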
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-i-actually-land">Where I actually land<a href="https://jitendersharma.dev/blogs/hallucinations-is-a-system-design-problem-not-model-problem#where-i-actually-land" class="hash-link" aria-label="Direct link to Where I actually land" title="Direct link to Where I actually land" translate="no">​</a></h2>
<p>To be clear: I'm not saying models don't matter, or that one model is as good as another. Picking a stronger model genuinely lowers the baseline hallucination rate.</p>
<br>
<p>I am saying: a better model <strong>reduces</strong> hallucinations; only better <strong>design</strong> lets you <strong>bound and govern</strong> them. If your reliability strategy is "wait for the next model," you've outsourced your most important architectural decision to someone else's release schedule - and you still won't be able to promise an auditor anything.</p>
<br>
<p>Stop asking "which model hallucinates the least?" Start asking <strong>"what in the system owns the truth, and what happens when it doesn't have an answer?"</strong></p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Hallucination is the model doing its job inside a system that forgot to do its own. Engineer grounding, validation, and abstention around the frozen model - that's where reliability is actually built.</p></div></div>]]></content>
        <author>
            <name>Jitender Sharma</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="Hallucinations" term="Hallucinations"/>
        <category label="Point of View" term="Point of View"/>
        <category label="AI Systems" term="AI Systems"/>
        <category label="System Architecture" term="System Architecture"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[How LLM Works Under the Hood]]></title>
        <id>https://jitendersharma.dev/blogs/how-llm-works-under-the-hood</id>
        <link href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood"/>
        <updated>2026-06-09T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How LLM Works Under the Hood]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="How LLM Works Under the Hood" src="https://jitendersharma.dev/assets/images/transformer-caadd220223ee8d122785364571f04c8.png" width="1030" height="579" class="img_ev3q"></p>
<p>Most discussions about LLMs focus on prompts, tools, and frameworks. However, few explain how the model actually works under the hood and why that matters when building real systems.</p>
<p>This is a 20,000-ft view of the LLM lifecycle in four stages.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-big-picture-one-model-four-stages">The big picture: one model, four stages.<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#the-big-picture-one-model-four-stages" class="hash-link" aria-label="Direct link to The big picture: one model, four stages." title="Direct link to The big picture: one model, four stages." translate="no">​</a></h2>
<p>A model's whole life is just four stages. The shape and vocabulary are fixed first; training only fills in the values, and inference is read-only and never learns.</p>
<!-- -->
<br>
<table><thead><tr><th>Stage</th><th>What happens</th><th>Key ideas</th></tr></thead><tbody><tr><td>Before</td><td>Decide the blueprint</td><td>Architecture dials set the shape, tokenizer builds the vocabulary, and parameter count is fixed.</td></tr><tr><td>During</td><td>Fill in the values</td><td>Random weights become meaningful through training: a four-step loop run hundreds of thousands to millions of times.</td></tr><tr><td>Alignment</td><td>Make it helpful</td><td>Show good examples (SFT) and teach which answers are better (RLHF/DPO).</td></tr><tr><td>After</td><td>Run it, read-only</td><td>Weights are frozen (no learning); inference traverses the model geometry one token at a time.</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Shape + vocabulary are fixed first. Training only fills the values. Inference never learns.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-1---before-training">Stage 1 - Before training<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#stage-1---before-training" class="hash-link" aria-label="Direct link to Stage 1 - Before training" title="Direct link to Stage 1 - Before training" translate="no">​</a></h2>
<p>Two human decisions are baked in before any gradient is computed.</p>
<ul>
<li class=""><strong>Architecture dials</strong> - hidden size, layers, heads, FFN width, vocab size.</li>
<li class=""><strong>Tokenizer vocabulary</strong> - the integer alphabet the model reads and writes.</li>
</ul>
<p>A "7B" model is 7B because of these dials — training never grows it, and most parameters live in the FFN, not attention.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-dials">The Architecture dials<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#the-architecture-dials" class="hash-link" aria-label="Direct link to The Architecture dials" title="Direct link to The Architecture dials" translate="no">​</a></h3>
<table><thead><tr><th>Hyperparameter</th><th>Example</th><th>Description</th></tr></thead><tbody><tr><td>hidden_size (D)</td><td>4,096</td><td>How much "thinking space" the model has for each word or idea at a given moment.</td></tr><tr><td>num_layers (L)</td><td>32</td><td>How many rounds of refinement - 32 editors in a row.</td></tr><tr><td>num_heads (H)</td><td>32</td><td>A panel of specialists, each spotting a different pattern.</td></tr><tr><td>head_dim (D_h)</td><td>128</td><td>The size of each specialist's notebook.</td></tr><tr><td>ffn_hidden (D_ff)</td><td>16,384</td><td>The knowledge bank - where most facts are stored (~4 × D).</td></tr><tr><td>vocab_size (V)</td><td>32,000</td><td>The size of the model's dictionary - the building blocks it uses to read and write language.</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>The model is fully sized and described before it sees a single token.</p></div></div>
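<p>You can check the "7B" claim with back-of-envelope arithmetic. The sketch below is an approximation under stated assumptions: a classic two-matrix FFN, four D × D attention projections per layer, an embedding table shared with the output head, and no biases or LayerNorm parameters. Real models differ in these details, and the function name <code>transformer_param_count</code> is made up for this example. Note that <code>num_heads × head_dim</code> = 32 × 128 = 4,096 = D, so the head split does not change the count.</p>

```python
def transformer_param_count(hidden_size, num_layers, ffn_hidden, vocab_size):
    """Rough parameter count from the architecture dials (biases and norms ignored)."""
    attention = 4 * hidden_size * hidden_size   # Q, K, V and output projections
    ffn = 2 * hidden_size * ffn_hidden          # up-projection and down-projection
    embeddings = vocab_size * hidden_size       # token table, assumed tied with the output head
    return {
        "attention": attention * num_layers,
        "ffn": ffn * num_layers,
        "embeddings": embeddings,
        "total": (attention + ffn) * num_layers + embeddings,
    }


# The example values from the table above.
counts = transformer_param_count(
    hidden_size=4096, num_layers=32, ffn_hidden=16384, vocab_size=32000
)
ffn_share = counts["ffn"] / counts["total"]
```

<p>With the table's values this comes to 6,573,522,944 parameters, roughly 6.6B, of which about 65% sit in the FFN blocks. Nothing in the formula depends on training data: the size is fixed by the dials alone.</p>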
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-2---during-training">Stage 2 - During training<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#stage-2---during-training" class="hash-link" aria-label="Direct link to Stage 2 - During training" title="Direct link to Stage 2 - During training" translate="no">​</a></h2>
<p>Learning is one four-step loop, repeated hundreds of thousands to millions of times.</p>
<!-- -->
<ol>
<li class=""><strong>Forward Pass</strong> - Predicts what comes next in a sequence, based on previous tokens.</li>
<li class=""><strong>Loss</strong> - How wrong was our prediction?</li>
<li class=""><strong>Backpropagation</strong> - Calculate how much each weight contributed to the error.</li>
<li class=""><strong>Optimizer step</strong> - Update every weight, nudging each one slightly in the direction that reduces the error.</li>
</ol>
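<p>The four steps fit in a few lines once you shrink the model. The sketch below is a toy, not how a real LLM is trained: it assumes a bigram model (one row of logits per current token) and plain SGD, with hand-written gradients instead of an autograd library. The names <code>train_bigram</code> and <code>softmax</code> are illustrative.</p>

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(tokens, vocab_size, steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Random start: weights[current_token] holds the logits for the next token.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)]
               for _ in range(vocab_size)]
    pairs = list(zip(tokens[:-1], tokens[1:]))
    losses = []
    for step in range(steps):
        current, target = pairs[step % len(pairs)]
        # 1. Forward pass: predict a distribution over the next token.
        probs = softmax(weights[current])
        # 2. Loss: cross-entropy, i.e. how surprised we were by the true token.
        losses.append(-math.log(probs[target]))
        # 3. Backpropagation: for softmax + cross-entropy the gradient with
        #    respect to each logit is (probability - one_hot(target)).
        grads = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
        # 4. Optimizer step: move each weight slightly against its gradient.
        for j in range(vocab_size):
            weights[current][j] -= lr * grads[j]
    return weights, losses


weights, losses = train_bigram([0, 1, 2] * 10, vocab_size=3)
```

<p>On the repeating sequence 0, 1, 2 the loss falls as the loop runs and the row for token 0 ends up favoring token 1. Same structure before and after; only the values changed.</p>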
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The only thing learned here is the <strong>next-token prediction</strong> — the statistical relationship between tokens given their surrounding context.
Pre-training delivers language and knowledge; it does not shape behavior (following instructions, being helpful, staying safe). That comes later, in alignment.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-random-numbers-to-learned-meaning">From random numbers to learned meaning<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#from-random-numbers-to-learned-meaning" class="hash-link" aria-label="Direct link to From random numbers to learned meaning" title="Direct link to From random numbers to learned meaning" translate="no">​</a></h3>
<table><thead><tr><th>Before training (random)</th><th>After training (meaning)</th></tr></thead><tbody><tr><td>Every weight is a random number</td><td>Every weight holds a learned value</td></tr><tr><td>Output is gibberish</td><td>Output is fluent, coherent text</td></tr><tr><td>No grammar, facts, or reasoning</td><td>Grammar, facts, and reasoning emerge</td></tr><tr><td>Structure exists, meaning doesn't</td><td>Same structure — now full of meaning</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Learning is the same four-step loop, running hundreds of thousands to millions of times, turning random numbers into meaning.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-roles-that-emerge-after-training">The roles that emerge after training<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#the-roles-that-emerge-after-training" class="hash-link" aria-label="Direct link to The roles that emerge after training" title="Direct link to The roles that emerge after training" translate="no">​</a></h3>
<p>Components start as random numbers with no predefined purpose. Over hundreds of thousands to millions of training steps, gradient descent gradually shapes them into specialized roles, learned through experience rather than explicitly designed.</p>
<table><thead><tr><th>Component</th><th>Role it settles into</th></tr></thead><tbody><tr><td>Embeddings</td><td>What tokens mean (lexical meaning)</td></tr><tr><td>Attention</td><td>How tokens relate — routes relevant context</td></tr><tr><td>FFNs</td><td>Transformation / "thinking" — most parameters and reasoning</td></tr><tr><td>LayerNorm</td><td>Keep signals stable and usable</td></tr><tr><td>Depth (layers)</td><td>Progressive refinement of understanding</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>No one designs these roles; they emerge from training.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-3---alignment">Stage 3 - Alignment<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#stage-3---alignment" class="hash-link" aria-label="Direct link to Stage 3 - Alignment" title="Direct link to Stage 3 - Alignment" translate="no">​</a></h2>
<p>A raw pre-trained model is a brilliant autocomplete, not yet a helpful assistant. Alignment is a thin, cheap layer on top of pre-training that shapes behavior.</p>
<table><thead><tr><th></th><th>Main training</th><th>Polish (alignment)</th></tr></thead><tbody><tr><td>Data</td><td>Trillions of words</td><td>Thousands to millions of examples</td></tr><tr><td>Length (cost)</td><td>Weeks/months, huge cost</td><td>Short, cheap</td></tr><tr><td>What it does</td><td>Teaches knowledge</td><td>Shapes behavior</td></tr></tbody></table>
<ul>
<li class=""><strong>SFT</strong> - show it good (prompt, response) examples.</li>
<li class=""><strong>RLHF/DPO</strong> - teach it which answer is better.</li>
</ul>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Alignment turns a raw model into a helpful assistant — it shapes behavior; it doesn't add new knowledge.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-4---after-training">Stage 4 - After training<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#stage-4---after-training" class="hash-link" aria-label="Direct link to Stage 4 - After training" title="Direct link to Stage 4 - After training" translate="no">​</a></h2>
<p>Once training stops, <strong>weights are frozen</strong> — no learning, no gradients. The model is a fixed function <code>f(tokens) -&gt; next token probabilities</code>.</p>
<!-- -->
<p>During inference, the model has <strong>no memory</strong> of what was asked or answered before — each request starts fresh.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Training builds the geometry. Inference just navigates it one token at a time.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-mental-model-most-people-get-wrong">The Mental Model most people get wrong<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#the-mental-model-most-people-get-wrong" class="hash-link" aria-label="Direct link to The Mental Model most people get wrong" title="Direct link to The Mental Model most people get wrong" translate="no">​</a></h2>
<ul>
<li class="">LLM ≠ continuously learning system</li>
<li class="">LLM ≠ dynamic knowledge base</li>
<li class="">LLM ≠ autonomous agent</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-this-means-for-enterprise-systems">What this means for Enterprise Systems<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#what-this-means-for-enterprise-systems" class="hash-link" aria-label="Direct link to What this means for Enterprise Systems" title="Direct link to What this means for Enterprise Systems" translate="no">​</a></h2>
<p>Understanding how LLMs actually work leads to a critical shift in how we design AI systems. The model itself is not "the system" — it's a <strong>fixed component inside a larger architecture</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-why-rag-is-required">1. Why RAG is required<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#1-why-rag-is-required" class="hash-link" aria-label="Direct link to 1. Why RAG is required" title="Direct link to 1. Why RAG is required" translate="no">​</a></h3>
<p>LLMs do not have access to fresh or private data. Their knowledge is fixed at training time.</p>
<p><strong>To make them useful in the enterprise:</strong></p>
<ul>
<li class="">Connect them to internal data sources</li>
<li class="">Inject context at runtime</li>
</ul>
<p>This is why <strong>Retrieval-Augmented Generation (RAG)</strong> becomes a foundational pattern.</p>
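<p>Stripped to its core, the pattern is "look up, then paste into the prompt". The sketch below is deliberately naive and rests on assumptions: retrieval is plain word overlap rather than embeddings or a vector store, and <code>retrieve</code>, <code>build_prompt</code>, and the prompt wording are hypothetical names and text for illustration only.</p>

```python
import re


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by how many words they share with the question."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words.intersection(words(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents, top_k=2):
    """Inject the retrieved passages into the prompt at request time."""
    passages = retrieve(question, documents, top_k)
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

<p>The model's weights never change. The fresh or private knowledge lives in <code>documents</code> and reaches the model only through the prompt, which is also why every retrieved passage shows up on the token bill.</p>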
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-why-agentsorchestration-are-external">2. Why agents/orchestration are external<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#2-why-agentsorchestration-are-external" class="hash-link" aria-label="Direct link to 2. Why agents/orchestration are external" title="Direct link to 2. Why agents/orchestration are external" translate="no">​</a></h3>
<p>LLMs are:</p>
<ul>
<li class="">Stateless</li>
<li class="">Reactive</li>
<li class="">Single-step predictors</li>
</ul>
<p>They cannot:</p>
<ul>
<li class="">Execute workflows</li>
<li class="">Maintain long-running state</li>
<li class="">Coordinate systems</li>
</ul>
<p>This is why <strong>agentic systems and orchestration layers exist outside the model</strong>.</p>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>The intelligence is in the model and the <strong>control</strong> is in the system design.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-why-governance-is-outside-the-model">3. Why governance is outside the model<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#3-why-governance-is-outside-the-model" class="hash-link" aria-label="Direct link to 3. Why governance is outside the model" title="Direct link to 3. Why governance is outside the model" translate="no">​</a></h3>
<p>You cannot "patch" behavior inside a trained model in real time. Enterprise systems must implement:</p>
<ul>
<li class="">Guardrails</li>
<li class="">Validation layers</li>
<li class="">Monitoring and evaluation</li>
<li class="">Policy enforcement</li>
</ul>
<p>All of these sit <strong>around the model, not inside it</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-why-inference-cost-dominates">4. Why inference cost dominates<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#4-why-inference-cost-dominates" class="hash-link" aria-label="Direct link to 4. Why inference cost dominates" title="Direct link to 4. Why inference cost dominates" translate="no">​</a></h3>
<p>Training is:</p>
<ul>
<li class="">One-time</li>
<li class="">Expensive but amortized</li>
</ul>
<p>Inference is:</p>
<ul>
<li class="">Continuous</li>
<li class="">Scales with usage</li>
</ul>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>For enterprise systems:
Cost = traffic * tokens * latency requirements</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-why-scale-and-cost-must-be-designed-upfront">5. Why scale and cost must be designed upfront<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#5-why-scale-and-cost-must-be-designed-upfront" class="hash-link" aria-label="Direct link to 5. Why scale and cost must be designed upfront" title="Direct link to 5. Why scale and cost must be designed upfront" translate="no">​</a></h3>
<p>Because LLMs don't learn in production, every interaction requires:</p>
<ul>
<li class="">Full inference execution</li>
<li class="">Token processing (input + output)</li>
<li class="">External system calls (RAG/agents)</li>
</ul>
<p>This means:</p>
<ul>
<li class="">Cost scales with usage, not with training</li>
<li class="">Latency compounds across system layers</li>
<li class="">Poor design = multiplicative cost growth</li>
</ul>
<p>In real systems, if not handled correctly:</p>
<ul>
<li class="">RAG increases token usage</li>
<li class="">Agents introduce multi-step execution</li>
<li class="">Orchestration adds round trips</li>
</ul>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>Training is a <strong>one-off capital cost</strong>; inference is the <strong>ongoing operational cost</strong>. Also, without careful design, AI systems become <strong>unpredictable and expensive at scale</strong></p></div></div>
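<p>A rough cost model shows how quickly these factors multiply. Everything numeric here is an assumption: the per-1K-token prices are placeholder values, not any vendor's price list, and <code>monthly_token_cost</code> is a made-up helper. Latency is left out because it drives capacity and model choice rather than the per-token bill.</p>

```python
def monthly_token_cost(
    requests_per_day,
    input_tokens,
    output_tokens,
    llm_calls_per_request=1,   # agents and orchestration raise this
    retrieved_tokens=0,        # RAG context added to every call
    price_in_per_1k=0.001,     # hypothetical price per 1K input tokens
    price_out_per_1k=0.002,    # hypothetical price per 1K output tokens
    days=30,
):
    tokens_in = (input_tokens + retrieved_tokens) * llm_calls_per_request
    tokens_out = output_tokens * llm_calls_per_request
    per_request = (
        tokens_in / 1000 * price_in_per_1k
        + tokens_out / 1000 * price_out_per_1k
    )
    return per_request * requests_per_day * days


# Same traffic, two designs: a single plain call vs. a 4-step agent with RAG context.
baseline = monthly_token_cost(10_000, input_tokens=500, output_tokens=200)
agentic = monthly_token_cost(
    10_000, input_tokens=500, output_tokens=200,
    llm_calls_per_request=4, retrieved_tokens=1_500,
)
```

<p>In this made-up scenario the agentic design costs a little over ten times the baseline for identical traffic. The multiplier comes entirely from system design choices, not from the model.</p>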
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-takeaway">Final Takeaway<a href="https://jitendersharma.dev/blogs/how-llm-works-under-the-hood#final-takeaway" class="hash-link" aria-label="Direct link to Final Takeaway" title="Direct link to Final Takeaway" translate="no">​</a></h2>
<p>The model provides intelligence and the system provides control.</p>
<p>Modern AI architecture is not “LLM design”. It is “system design around a frozen model”.</p>
<p>Cost = Traffic × Tokens × Latency requirements</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Treat the LLM as a frozen dependency; engineer everything else around it.</p></div></div>]]></content>
        <author>
            <name>Jitender Sharma</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="Explainer" term="Explainer"/>
        <category label="AI Systems" term="AI Systems"/>
    </entry>
</feed>