<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://jitendersharma.dev/blog</id>
    <title>Under the hood Blog</title>
    <updated>2026-06-11T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://jitendersharma.dev/blog"/>
    <subtitle>Under the hood Blog</subtitle>
    <icon>https://jitendersharma.dev/img/favicon.ico</icon>
    <entry>
        <title type="html"><![CDATA[RAG is Not a Database]]></title>
        <id>https://jitendersharma.dev/blog/rag-is-not-a-database</id>
        <link href="https://jitendersharma.dev/blog/rag-is-not-a-database"/>
        <updated>2026-06-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Most discussions about Retrieval-Augmented Generation (RAG) frame it as a way to “connect LLMs to data.”]]></summary>
        <content type="html"><![CDATA[<p>Most discussions about Retrieval-Augmented Generation (RAG) frame it as a way to “connect LLMs to data.”
That framing is incomplete — and in large-scale systems, misleading.</p>
<p>RAG is not a database layer. It is a runtime context construction system for a frozen model.</p>
<p>Understanding this distinction is critical for anyone designing production AI systems.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="rag-does-not-store-knowledge">RAG does not store knowledge<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#rag-does-not-store-knowledge" class="hash-link" aria-label="Direct link to RAG does not store knowledge" title="Direct link to RAG does not store knowledge" translate="no">​</a></h2>
<p>A common misconception is that RAG acts like a knowledge store.</p>
<p>That is not accurate.</p>
<p>Instead:</p>
<ul>
<li class="">Source data lives in external systems (databases, documents, APIs).</li>
<li class="">Vector indexes store semantic representations, not ground truth.</li>
<li class="">Retrieval does not return facts; it returns relevant fragments of representation.</li>
</ul>
<p>RAG does not guarantee correctness.
It does not enforce consistency.
It does not maintain a canonical state of knowledge.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="rag-is-a-context-assembly-pipeline-not-a-query-engine">RAG is a context assembly pipeline, not a query engine<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#rag-is-a-context-assembly-pipeline-not-a-query-engine" class="hash-link" aria-label="Direct link to RAG is a context assembly pipeline, not a query engine" title="Direct link to RAG is a context assembly pipeline, not a query engine" translate="no">​</a></h2>
<p>Traditional databases:</p>
<ul>
<li class="">Deterministic queries</li>
<li class="">Structured schema</li>
<li class="">Exact retrieval guarantees</li>
</ul>
<p>RAG systems:</p>
<ul>
<li class="">Approximate semantic retrieval</li>
<li class="">Probabilistic ranking</li>
<li class="">Context window assembly for an LLM</li>
</ul>
<p>The output of RAG is not a “result set.”
It is a prompt-ready context bundle.</p>
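<p>A minimal sketch of that difference, assuming a toy in-memory index with hand-written stand-in embeddings (the data, <code>INDEX</code> and <code>build_context</code> are hypothetical, not part of any library). The output is a ranked bundle of text for a prompt, not a result set:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy index: (chunk text, stand-in embedding). Hypothetical data for illustration.
INDEX = [
    ("Refunds are processed within 5 business days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.1, 0.9, 0.0]),
    ("Refund requests need an order number.", [0.8, 0.2, 0.1]),
]

def build_context(query_vec, k=2):
    """Rank chunks by similarity and return a prompt-ready context bundle."""
    ranked = sorted(INDEX, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    top = [text for text, _ in ranked[:k]]
    return "Context:\n" + "\n".join("- " + t for t in top)

# The query vector stands in for an embedded question about refunds.
bundle = build_context([1.0, 0.0, 0.0])
print(bundle)
```

<p>Nothing here checks whether the chunks are true. Ranking only decides which text reaches the model.</p>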
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="retrieval-is-not-reasoning-or-verification">Retrieval is not reasoning or verification<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#retrieval-is-not-reasoning-or-verification" class="hash-link" aria-label="Direct link to Retrieval is not reasoning or verification" title="Direct link to Retrieval is not reasoning or verification" translate="no">​</a></h2>
<p>Retrieved chunks are:</p>
<ul>
<li class="">Semantically relevant</li>
<li class="">Not necessarily correct</li>
<li class="">Not validated against a source of truth at retrieval time</li>
</ul>
<p>The LLM becomes the reasoning layer that interprets this context.</p>
<p>This introduces a key architectural reality:</p>
<p>RAG does not reduce hallucinations by itself — it only changes the input surface area.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="rag-is-constrained-by-the-context-window">RAG is constrained by the context window<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#rag-is-constrained-by-the-context-window" class="hash-link" aria-label="Direct link to RAG is constrained by the context window" title="Direct link to RAG is constrained by the context window" translate="no">​</a></h2>
<p>Unlike databases, RAG operates under a hard constraint:</p>
<ul>
<li class="">Limited tokens</li>
<li class="">Limited attention capacity</li>
<li class="">Competing relevance signals</li>
</ul>
<p>This forces system-level decisions:</p>
<ul>
<li class="">Chunking strategy</li>
<li class="">Embedding granularity</li>
<li class="">Ranking and filtering logic</li>
</ul>
<p>These are not data concerns — they are cognitive load management decisions for the model.</p>
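<p>One way to picture the constraint is a greedy packer that fits ranked chunks into a fixed token budget. This is a sketch under stated assumptions: <code>estimate_tokens</code> counts whitespace-separated words as a crude stand-in for a real tokenizer, and the chunks are assumed to arrive sorted best-first.</p>

```python
def estimate_tokens(text):
    """Crude estimate: one token per whitespace-separated word (an assumption)."""
    return len(text.split())

def pack_chunks(ranked_chunks, budget):
    """Greedily keep the highest-ranked chunks that still fit in the token budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:  # assumed already sorted best-first
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # skip any chunk that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept, used

chunks = [
    "Refunds are processed within five business days",
    "Refund requests need an order number and a receipt",
    "Office hours are nine to five",
]
kept, used = pack_chunks(chunks, budget=14)
```

<p>With a budget of 14, the second-ranked chunk is dropped because it does not fit, while a less relevant but shorter chunk gets in. Chunk size and budget shape what the model sees as much as relevance does.</p>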
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="rag-is-a-probabilistic-system-not-a-deterministic-one">RAG is a probabilistic system, not a deterministic one<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#rag-is-a-probabilistic-system-not-a-deterministic-one" class="hash-link" aria-label="Direct link to RAG is a probabilistic system, not a deterministic one" title="Direct link to RAG is a probabilistic system, not a deterministic one" translate="no">​</a></h2>
<p>Unlike traditional data systems:</p>
<ul>
<li class="">Retrieval is not guaranteed complete</li>
<li class="">Ranking is heuristic</li>
<li class="">Similarity search is approximate</li>
<li class="">Results vary with embeddings and query phrasing</li>
</ul>
<p>This makes RAG inherently:</p>
<ul>
<li class="">Non-deterministic</li>
<li class="">Sensitive to configuration</li>
<li class="">Difficult to reason about without observability</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="system-design-complexity-shifts-to-the-edges">System design complexity shifts to the edges<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#system-design-complexity-shifts-to-the-edges" class="hash-link" aria-label="Direct link to System design complexity shifts to the edges" title="Direct link to System design complexity shifts to the edges" translate="no">​</a></h2>
<p>Once you understand RAG correctly, a key shift happens: the complexity moves away from the model and into the system.</p>
<ul>
<li class="">Chunking strategy becomes critical</li>
<li class="">Embedding model choice becomes architectural</li>
<li class="">Retrieval ranking becomes a relevance system</li>
<li class="">Prompt construction becomes a control surface</li>
</ul>
<p>In other words:</p>
<p>RAG systems are not “LLM integrations” — they are retrieval + reasoning pipelines.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="rag-is-not-the-intelligence-layer--it-is-the-context-layer">RAG is not the intelligence layer — it is the context layer<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#rag-is-not-the-intelligence-layer--it-is-the-context-layer" class="hash-link" aria-label="Direct link to RAG is not the intelligence layer — it is the context layer" title="Direct link to RAG is not the intelligence layer — it is the context layer" translate="no">​</a></h2>
<p>A useful mental model:</p>
<ul>
<li class="">LLM = reasoning engine (frozen function)</li>
<li class="">RAG = context shaping system</li>
<li class="">Orchestration layer = control logic</li>
</ul>
<p>RAG does not make the system intelligent.
It determines what the model is allowed to see before it reasons.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="implications-for-enterprise-architecture">Implications for enterprise architecture<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#implications-for-enterprise-architecture" class="hash-link" aria-label="Direct link to Implications for enterprise architecture" title="Direct link to Implications for enterprise architecture" translate="no">​</a></h2>
<p>Treating RAG as a database abstraction leads to predictable failures:</p>
<ul>
<li class="">Over-reliance on embeddings as “truth”</li>
<li class="">Poor chunk design leading to lost context</li>
<li class="">Inconsistent retrieval quality across use cases</li>
<li class="">Unexpected hallucinations due to missing context rather than model failure</li>
</ul>
<p>Instead, production systems should treat RAG as:</p>
<ul>
<li class="">A context engineering layer</li>
<li class="">A relevance filtering system</li>
<li class="">A probabilistic pre-processing stage for reasoning</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-takeaway">Final takeaway<a href="https://jitendersharma.dev/blog/rag-is-not-a-database#final-takeaway" class="hash-link" aria-label="Direct link to Final takeaway" title="Direct link to Final takeaway" translate="no">​</a></h2>
<p>RAG is often described as a bridge between data and LLMs.</p>
<p>A more accurate description is:</p>
<p>RAG is a probabilistic context construction system that shapes what a frozen model can reason about at runtime.</p>
<p>The model provides intelligence.
The system determines what intelligence can see.</p>]]></content>
        <author>
            <name>Jitender Sharma</name>
        </author>
        <category label="RAG" term="RAG"/>
        <category label="RAG is not a database" term="RAG is not a database"/>
        <category label="vector db" term="vector db"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[How LLM Works Under the Hood]]></title>
        <id>https://jitendersharma.dev/blog/how-llm-works-under-the-hood</id>
        <link href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood"/>
        <updated>2026-06-09T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Most discussions about LLMs focus on prompts, tools, and frameworks. However, few explain how the model actually works under the hood and why that matters when building real systems.]]></summary>
        <content type="html"><![CDATA[<p>Most discussions about LLMs focus on prompts, tools, and frameworks. However, few explain how the model actually works under the hood and why that matters when building real systems.</p>
<p>This is a 20,000-ft view of the LLM lifecycle in four stages.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-big-picture-one-model-four-stages">The big picture: one model, four stages.<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#the-big-picture-one-model-four-stages" class="hash-link" aria-label="Direct link to The big picture: one model, four stages." title="Direct link to The big picture: one model, four stages." translate="no">​</a></h2>
<p>A model's whole life is just four stages. The shape and vocabulary are fixed first; training only fills in the values, and inference is read-only and never learns.</p>
<br>
<table><thead><tr><th>Stage</th><th>What happens</th><th>Key ideas</th></tr></thead><tbody><tr><td>Before</td><td>Decide the blueprint</td><td>Architecture dials set the shape, tokenizer builds the vocabulary, and parameter count is fixed.</td></tr><tr><td>During</td><td>Fill in the values</td><td>Random weights become meaningful through training: a four-step loop run hundreds of thousands to millions of times.</td></tr><tr><td>Alignment</td><td>Make it helpful</td><td>Show good examples (SFT) and teach which answers are better (RLHF/DPO).</td></tr><tr><td>After</td><td>Run it, read-only</td><td>Weights are frozen (no learning); inference traverses the model geometry one token at a time.</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Shape + vocabulary are fixed first. Training only fills the values. Inference never learns.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-1---before-training">Stage 1 - Before training<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#stage-1---before-training" class="hash-link" aria-label="Direct link to Stage 1 - Before training" title="Direct link to Stage 1 - Before training" translate="no">​</a></h2>
<p>Two human decisions are baked in before any gradient is computed.</p>
<ul>
<li class=""><strong>Architecture dials</strong> - hidden size, layers, heads, FFN width, vocab size.</li>
<li class=""><strong>Tokenizer vocabulary</strong> - the integer alphabet the model reads and writes.</li>
</ul>
<p>A "7B" model is 7B because of these dials — training never grows it, and most parameters live in the FFN, not attention.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-dials">The Architecture dials<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#the-architecture-dials" class="hash-link" aria-label="Direct link to The Architecture dials" title="Direct link to The Architecture dials" translate="no">​</a></h3>
<table><thead><tr><th>Hyperparameter</th><th>Example</th><th>Description</th></tr></thead><tbody><tr><td>hidden_size (D)</td><td>4,096</td><td>How much "thinking space" the model has for each word or idea at a given moment.</td></tr><tr><td>num_layers (L)</td><td>32</td><td>How many rounds of refinement - 32 editors in a row.</td></tr><tr><td>num_heads (H)</td><td>32</td><td>A panel of specialists, each spotting a different pattern.</td></tr><tr><td>head_dim (D_h)</td><td>128</td><td>The size of each specialist's notebook (D / H).</td></tr><tr><td>ffn_hidden (D_ff)</td><td>16,384</td><td>The knowledge bank, where most facts are stored (about 4 × D).</td></tr><tr><td>vocab_size (V)</td><td>32,000</td><td>The size of the model's dictionary: the building blocks it uses to read and write language.</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>The model is fully sized and described before it sees a single token.</p></div></div>
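<p>The dials are enough to estimate the parameter count before any training. The arithmetic below is a rough sketch using the example values from the table. It assumes a plain two-matrix FFN, a single embedding table shared with the output head, and it ignores biases and normalization weights, so real architectures will differ somewhat.</p>

```python
# Example values from the architecture table.
D, L, D_ff, V = 4096, 32, 16_384, 32_000

attn_per_layer = 4 * D * D    # Q, K, V and output projections
ffn_per_layer = 2 * D * D_ff  # up- and down-projection (two-matrix FFN assumed)
embedding = V * D             # token embedding table (output head assumed tied)

total = L * (attn_per_layer + ffn_per_layer) + embedding
ffn_share = L * ffn_per_layer / total

print(f"{total / 1e9:.2f}B parameters")  # 6.57B parameters
print(f"FFN share: {ffn_share:.0%}")     # FFN share: 65%
```

<p>Under these assumptions the total lands near the "7B" label, and roughly two thirds of the parameters sit in the FFN blocks rather than in attention.</p>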
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-2---during-training">Stage 2 - During training<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#stage-2---during-training" class="hash-link" aria-label="Direct link to Stage 2 - During training" title="Direct link to Stage 2 - During training" translate="no">​</a></h2>
<p>Learning is one four-step loop, repeated hundreds of thousands to millions of times.</p>
<ol>
<li class=""><strong>Forward Pass</strong> - Predicts what comes next in a sequence, based on previous tokens.</li>
<li class=""><strong>Loss</strong> - How wrong was our prediction?</li>
<li class=""><strong>Backpropagation</strong> - Calculate how much each weight contributed to the error.</li>
<li class=""><strong>Optimizer step</strong> - Update every weight, nudging each one slightly in the direction that reduces the error.</li>
</ol>
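<p>The four steps can be sketched on a toy next-token model in plain Python. This is an illustration, not how production training code looks: the "model" is a hypothetical 3-token bigram table, the optimizer is plain SGD, and the weights start at zero instead of random values so the run is reproducible.</p>

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

VOCAB = 3
tokens = [0, 1, 2] * 20  # toy corpus: 0 -> 1 -> 2 -> 0 ...
# One row of logits per previous token; zeros instead of random (assumption).
weights = [[0.0] * VOCAB for _ in range(VOCAB)]
lr = 0.5

def train_epoch():
    """Run the four-step loop once over every (previous, next) token pair."""
    total_loss = 0.0
    pairs = list(zip(tokens, tokens[1:]))
    for prev, target in pairs:
        probs = softmax(weights[prev])              # 1. forward pass
        total_loss += -math.log(probs[target])      # 2. loss (cross-entropy)
        grad = [p - (1.0 if i == target else 0.0)   # 3. backpropagation
                for i, p in enumerate(probs)]
        for i in range(VOCAB):                      # 4. optimizer step (SGD)
            weights[prev][i] -= lr * grad[i]
    return total_loss / len(pairs)

first = train_epoch()
for _ in range(20):
    last = train_epoch()
print(f"average loss: {first:.3f} to {last:.3f}")
```

<p>The shape of <code>weights</code> never changes during the loop. Only the values move, which is the point of the "fill in the values" framing.</p>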
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The only objective at this stage is <strong>next-token prediction</strong>: learning the statistical relationship between a token and the tokens that precede it.
Pre-training delivers language and knowledge; it does not shape behavior (following instructions, being helpful, staying safe). That comes later, in alignment.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-random-numbers-to-learned-meaning">From random numbers to learned meaning<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#from-random-numbers-to-learned-meaning" class="hash-link" aria-label="Direct link to From random numbers to learned meaning" title="Direct link to From random numbers to learned meaning" translate="no">​</a></h3>
<table><thead><tr><th>Before training (random)</th><th>After training (meaning)</th></tr></thead><tbody><tr><td>Every weight is a random number</td><td>Every weight holds a learned value</td></tr><tr><td>Output is gibberish</td><td>Output is fluent, coherent text</td></tr><tr><td>No grammar, facts, or reasoning</td><td>Grammar, facts, and reasoning emerge</td></tr><tr><td>Structure exists, meaning doesn't</td><td>Same structure — now full of meaning</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Learning is the same four-step loop, running hundreds of thousands to millions of times, turning random numbers into meaning.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-roles-that-emerge-after-training">The roles that emerge after training<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#the-roles-that-emerge-after-training" class="hash-link" aria-label="Direct link to The roles that emerge after training" title="Direct link to The roles that emerge after training" translate="no">​</a></h3>
<p>Components start as random numbers with no predefined purpose. Over hundreds of thousands to millions of training steps, gradient descent gradually shapes them into specialized roles, learned through experience rather than explicitly designed.</p>
<table><thead><tr><th>Component</th><th>Role it settles into</th></tr></thead><tbody><tr><td>Embeddings</td><td>What tokens mean (lexical meaning)</td></tr><tr><td>Attention</td><td>How tokens relate — routes relevant context</td></tr><tr><td>FFNs</td><td>Transformation / "thinking" — most parameters and reasoning</td></tr><tr><td>LayerNorm</td><td>Keep signals stable and usable</td></tr><tr><td>Depth (layers)</td><td>Progressive refinement of understanding</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>No one designs these roles; they emerge from training.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-3---alignment">Stage 3 - Alignment<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#stage-3---alignment" class="hash-link" aria-label="Direct link to Stage 3 - Alignment" title="Direct link to Stage 3 - Alignment" translate="no">​</a></h2>
<p>A raw pre-trained model is a brilliant autocomplete, not yet a helpful assistant. Alignment is a thin, cheap layer on top of pre-training that shapes behavior.</p>
<table><thead><tr><th></th><th>Main training</th><th>Polish (alignment)</th></tr></thead><tbody><tr><td>Data</td><td>Trillions of words</td><td>Thousands to millions of examples</td></tr><tr><td>Length (cost)</td><td>Weeks/months, huge cost</td><td>Short, cheap</td></tr><tr><td>What it does</td><td>Teaches knowledge</td><td>Shapes behavior</td></tr></tbody></table>
<ul>
<li class=""><strong>SFT</strong> - show it good (prompt, response) examples.</li>
<li class=""><strong>RLHF/DPO</strong> - teach it which answer is better.</li>
</ul>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Alignment turns a raw model into a helpful assistant — it shapes behavior; it doesn't add new knowledge.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="stage-4---after-training">Stage 4 - After training<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#stage-4---after-training" class="hash-link" aria-label="Direct link to Stage 4 - After training" title="Direct link to Stage 4 - After training" translate="no">​</a></h2>
<p>Once training stops, <strong>weights are frozen</strong> — no learning, no gradients. The model is a fixed function <code>f(tokens) -&gt; next token probabilities</code>.</p>
<p>During inference, the model has <strong>no memory</strong> of what was asked or answered before — each request starts fresh.</p>
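<p>A toy sketch of what "fixed function, no memory" means in practice. The lookup table <code>FROZEN</code> is a hypothetical stand-in for trained weights, and the toy picks one token deterministically, whereas a real model returns a probability distribution that is then sampled.</p>

```python
# Frozen "weights": a fixed next-token table standing in for a trained model.
FROZEN = {"the": "cat", "cat": "sat", "sat": "down"}

def next_token(tokens):
    """Fixed function: the same input tokens always give the same next token."""
    return FROZEN.get(tokens[-1], "[end]")  # the toy looks only at the last token

def generate(prompt, max_new=5):
    """Autoregressive loop: append one token at a time; FROZEN is never written to."""
    tokens = list(prompt)
    for _ in range(max_new):
        tok = next_token(tokens)
        if tok == "[end]":
            break
        tokens.append(tok)
    return tokens

out = generate(["the"])
again = generate(["the"])  # a second request: nothing was learned from the first
```

<p>Any continuity between requests, such as chat history, has to be passed in again by the caller as part of the prompt.</p>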
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Training builds the geometry. Inference just navigates it one token at a time.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-mental-model-most-people-get-wrong">The Mental Model most people get wrong<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#the-mental-model-most-people-get-wrong" class="hash-link" aria-label="Direct link to The Mental Model most people get wrong" title="Direct link to The Mental Model most people get wrong" translate="no">​</a></h2>
<ul>
<li class="">LLM ≠ continuously learning system</li>
<li class="">LLM ≠ dynamic knowledge base</li>
<li class="">LLM ≠ autonomous agent</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-this-means-for-enterprise-systems">What this means for Enterprise Systems<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#what-this-means-for-enterprise-systems" class="hash-link" aria-label="Direct link to What this means for Enterprise Systems" title="Direct link to What this means for Enterprise Systems" translate="no">​</a></h2>
<p>Understanding how LLMs actually work leads to a critical shift in how we design AI systems. The model itself is not "the system" — it's a <strong>fixed component inside a larger architecture</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-why-rag-is-required">1. Why RAG is required<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#1-why-rag-is-required" class="hash-link" aria-label="Direct link to 1. Why RAG is required" title="Direct link to 1. Why RAG is required" translate="no">​</a></h3>
<p>LLMs do not have access to fresh or private data. Their knowledge is fixed at training time.</p>
<p><strong>To make them useful in enterprise:</strong></p>
<ul>
<li class="">Connect them to internal data sources</li>
<li class="">Inject context at runtime</li>
</ul>
<p>This is why <strong>Retrieval-Augmented Generation (RAG)</strong> becomes a foundational pattern.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-why-agentsorchestration-are-external">2. Why agents/orchestration are external<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#2-why-agentsorchestration-are-external" class="hash-link" aria-label="Direct link to 2. Why agents/orchestration are external" title="Direct link to 2. Why agents/orchestration are external" translate="no">​</a></h3>
<p>LLMs are:</p>
<ul>
<li class="">Stateless</li>
<li class="">Reactive</li>
<li class="">Single-step predictors</li>
</ul>
<p>They cannot:</p>
<ul>
<li class="">Execute workflows</li>
<li class="">Maintain long-running state</li>
<li class="">Coordinate systems</li>
</ul>
<p>This is why <strong>agentic systems and orchestration layers exist outside the model</strong>.</p>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>The intelligence is in the model and the <strong>control</strong> is in the system design.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-why-governance-is-outside-the-model">3. Why governance is outside the model<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#3-why-governance-is-outside-the-model" class="hash-link" aria-label="Direct link to 3. Why governance is outside the model" title="Direct link to 3. Why governance is outside the model" translate="no">​</a></h3>
<p>You cannot "patch" behavior inside a trained model in real time. Enterprise systems must implement:</p>
<ul>
<li class="">Guardrails</li>
<li class="">Validation layers</li>
<li class="">Monitoring and evaluation</li>
<li class="">Policy enforcement</li>
</ul>
<p>All of these sit <strong>around the model, not inside it</strong>.</p>
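<p>A minimal sketch of a validation layer wrapped around a model call. Everything here is hypothetical: <code>fake_model</code> stands in for a real inference call, and the single regex is a placeholder policy, not a complete guardrail.</p>

```python
import re

# Hypothetical policy: never return strings shaped like a US social security number.
BLOCKED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def fake_model(prompt):
    """Stand-in for a frozen model; a real system would call an inference API here."""
    return "Customer SSN is 123-45-6789." if "ssn" in prompt.lower() else "All good."

def guarded_call(prompt, model=fake_model):
    """Apply the policy check outside the model, after it has answered."""
    answer = model(prompt)
    if BLOCKED.search(answer):
        return {"ok": False, "answer": "[withheld by policy]"}
    return {"ok": True, "answer": answer}

blocked = guarded_call("What is the SSN on file?")
allowed = guarded_call("Is the account active?")
```

<p>The policy can be changed and redeployed without touching the model, which is the practical reason governance lives in the surrounding system.</p>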
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-why-inference-cost-dominates">4. Why inference cost dominates<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#4-why-inference-cost-dominates" class="hash-link" aria-label="Direct link to 4. Why inference cost dominates" title="Direct link to 4. Why inference cost dominates" translate="no">​</a></h3>
<p>Training is:</p>
<ul>
<li class="">One-time</li>
<li class="">Expensive but amortized</li>
</ul>
<p>Inference is:</p>
<ul>
<li class="">Continuous</li>
<li class="">Scales with usage</li>
</ul>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>For enterprise systems:
Cost = traffic * tokens * latency requirements</p></div></div>
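<p>A back-of-the-envelope version of the traffic and tokens part of that formula. The prices and traffic figures are placeholders chosen for easy arithmetic, not real vendor pricing, and latency requirements are not modeled here.</p>

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 price_in_per_1k, price_out_per_1k, days=30):
    """Estimate inference spend as traffic x tokens x price (placeholder prices)."""
    per_request = ((input_tokens / 1000) * price_in_per_1k
                   + (output_tokens / 1000) * price_out_per_1k)
    return requests_per_day * days * per_request

# Same traffic, with and without 3,000 extra tokens of retrieved context per request.
base = monthly_cost(10_000, 500, 200, price_in_per_1k=0.001, price_out_per_1k=0.002)
with_rag = monthly_cost(10_000, 3_500, 200, price_in_per_1k=0.001, price_out_per_1k=0.002)
print(f"base: {base:.2f}, with RAG context: {with_rag:.2f}")
```

<p>With these made-up numbers the retrieved context, not the answer, dominates the bill. That is why context size is a cost decision as well as a quality decision.</p>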
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-why-scale-and-cost-must-be-designed-upfront">5. Why scale and cost must be designed upfront<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#5-why-scale-and-cost-must-be-designed-upfront" class="hash-link" aria-label="Direct link to 5. Why scale and cost must be designed upfront" title="Direct link to 5. Why scale and cost must be designed upfront" translate="no">​</a></h3>
<p>Because LLMs don't learn in production, every interaction requires:</p>
<ul>
<li class="">Full inference execution</li>
<li class="">Token processing (input+output)</li>
<li class="">External system calls (RAG/agents)</li>
</ul>
<p>This means:</p>
<ul>
<li class="">Cost scales with usage, not with training</li>
<li class="">Latency compounds across system layers</li>
<li class="">Poor design = costs that multiply with every added step</li>
</ul>
<p>In real systems, if not handled correctly:</p>
<ul>
<li class="">RAG increases token usage</li>
<li class="">Agents introduce multi-step execution</li>
<li class="">Orchestration adds round trips</li>
</ul>
<div class="theme-admonition theme-admonition-important admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>important</div><div class="admonitionContent_BuS1"><p>Training is a <strong>one-off capital cost</strong>; inference is the <strong>ongoing operational cost</strong>. Without careful design, AI systems become <strong>unpredictable and expensive at scale</strong>.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-takeaway">Final Takeaway<a href="https://jitendersharma.dev/blog/how-llm-works-under-the-hood#final-takeaway" class="hash-link" aria-label="Direct link to Final Takeaway" title="Direct link to Final Takeaway" translate="no">​</a></h2>
<p>The model provides intelligence and the system provides control.</p>
<p>Modern AI architecture is not “LLM design.” It is “system design around a frozen model.”</p>
<p>Cost is driven by traffic × tokens × latency requirements.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TAKEAWAY</div><div class="admonitionContent_BuS1"><p>Treat the LLM as a frozen dependency; engineer everything else around it.</p></div></div>]]></content>
        <author>
            <name>Jitender Sharma</name>
        </author>
        <category label="How LLM Works Under the Hood" term="How LLM Works Under the Hood"/>
        <category label="LLM" term="LLM"/>
        <category label="LLM Four Stages" term="LLM Four Stages"/>
    </entry>
</feed>