Further reading (external)

Curated third-party writing on LLM and agent evaluation, complementary to the Eval Framework Blueprint series. These pieces are by other practitioners, not by this site.

HOW TO USE THIS PAGE

Begin with the "Start here" section for the full picture, then jump to the section that matches what you are building. The Series map at the bottom ties each external resource to a page in this series.

Start here

| Resource | Author / source | Why read it |
| --- | --- | --- |
| LLM Evaluation Framework for RAG and AI Agents | Fluence | Operating-model view: golden sets, human + judge, RAG + agents, gates, prod monitoring, regression loop |
| A complete guide to RAG evaluation | Evidently AI | Dev vs stress vs prod vs regression; retrieval vs generation; synthetic data; monitoring |
| Evals Skills for Coding Agents | Hamel Husain | Product evals from 50+ companies: error analysis, judge design, calibration, RAG split, synthetic data |
| hamelsmu/evals-skills | Hamel (open source) | Actionable skills: eval-audit, error-analysis, validate-evaluator, evaluate-rag, synthetic generation |

Hamel's broader eval writing: hamel.dev.

Production pipelines & CI gates

| Resource | Why read it |
| --- | --- |
| LLM Evaluation in Production (every deploy) | Deep CI walkthrough: golden set, RAGAS, faithfulness/relevance, GitHub Actions gate |
| LLM Evals: Build a Production Regression Suite | Golden dataset + scorers + runner + required CI gate; tiered scores, override rules |
| LLM regression testing | Grow golden sets from production failures; baseline comparison vs absolute thresholds |
| How to Build an LLM Evaluation Pipeline for CI/CD | PR vs staging vs nightly tiers; deterministic + model-based scoring |
| Building a Production LLM Evaluation Harness in Pytest | Flake-aware multi-sample scoring, cost caps, regression vs baseline |
| LLM CI/CD | Judge gates + golden regression + canary; runnable harness patterns |
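
Several of the resources above contrast gating on an absolute threshold with gating on a drop relative to a stored baseline. The sketch below combines both checks to make the difference concrete. It is a minimal illustration, not code from any of the linked articles: `regression_gate`, `GateResult`, the metric names, and the `floor` and `max_drop` defaults are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def regression_gate(
    current: dict[str, float],
    baseline: dict[str, float],
    floor: float = 0.80,      # assumed absolute minimum for every metric
    max_drop: float = 0.03,   # assumed tolerated drop versus the baseline run
) -> GateResult:
    """Fail if a metric is below the floor or regressed too far from baseline."""
    reasons: list[str] = []
    for metric, score in current.items():
        # Absolute check: catches a suite that was never good enough.
        if score < floor:
            reasons.append(f"{metric}: {score:.2f} is below the floor {floor:.2f}")
        # Relative check: catches a regression that still clears the floor.
        base = baseline.get(metric)
        if base is not None and base - score > max_drop:
            reasons.append(f"{metric}: dropped {base - score:.2f} from baseline {base:.2f}")
    return GateResult(passed=not reasons, reasons=reasons)


if __name__ == "__main__":
    result = regression_gate(
        current={"faithfulness": 0.85, "relevance": 0.91},
        baseline={"faithfulness": 0.90, "relevance": 0.90},
    )
    print(result.passed)
    for reason in result.reasons:
        print(reason)
```

In this example faithfulness clears the absolute floor but still fails the gate because it fell 0.05 from the baseline, which is the case an absolute threshold alone would miss.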

Agents (Tool, Action, multi-step)

| Resource | Why read it |
| --- | --- |
| 12-metric framework from 100+ deployments | Retrieval + generation + tool selection, execution, multi-step coherence + prod cost/latency |
| Production-Ready LLM Agents: Offline Evaluation | Routing, judge, RAG pillars; CI + governance for agent offline eval |
| Policy-Governed Agent Runtime | On this site: deterministic Action plane (PEP/PDP) alongside agent eval |

LLM-as-judge & human calibration

| Resource | Why read it |
| --- | --- |
| How to Calibrate an LLM Judge | Gold set → agreement → inspect disagreements → rubric fixes; Cohen's κ; common judge biases |
| LLM as a Judge in Production (2026 playbook) | Position bias, factuality vs tone, when judge is inappropriate, κ targets |
| Calibrate LLM-as-Judge with Human Corrections | Human corrections → few-shot calibration → track agreement (LangSmith Align Evals) |
| Efficient Inference for Noisy LLM-as-a-Judge | Academic: debiasing judge noise, PPI-style estimators when judge is imperfect |
| rusty-llm-jury | TPR/TNR + Rogan–Gladen correction for true pass rate when judge is noisy |
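
Two quantities recur in the resources above: Cohen's κ for human/judge agreement, and the Rogan–Gladen correction for estimating a true pass rate from a noisy judge. The sketch below shows the standard formulas for both in plain Python. It is an illustration only and does not reproduce the API of rusty-llm-jury or any other linked tool; the function names and sample labels are hypothetical.

```python
def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def rogan_gladen(observed_pass_rate: float, tpr: float, tnr: float) -> float:
    """Estimate the true pass rate from the judge's observed pass rate.

    tpr and tnr are the judge's true positive and true negative rates,
    measured against human labels on a gold set.
    """
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction is undefined")
    estimate = (observed_pass_rate + tnr - 1) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


if __name__ == "__main__":
    human = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical gold labels (1 = pass)
    judge = [1, 1, 0, 0, 0, 1, 1, 0]  # hypothetical judge verdicts
    print(cohens_kappa(human, judge))
    print(rogan_gladen(observed_pass_rate=0.69, tpr=0.9, tnr=0.8))
```

The correction matters most when the judge is lenient or strict: a judge with a TPR of 0.9 and a TNR of 0.8 that reports a 69% pass rate implies a true pass rate of about 70%, and the gap grows as TPR and TNR fall.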

RAG: retrieval vs generation (Data vs Context)

| Resource | Why read it |
| --- | --- |
| Evidently RAG guide | Best single external piece on splitting retrieval and generation |
| Alok's production eval post | Explicit rule: retrieval metrics vs generation metrics on different schedules |
| evaluate-rag skill | Short, opinionated RAG eval workflow (Hamel) |
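
To illustrate why retrieval can be scored separately from generation, the sketch below computes two common retrieval-only metrics, recall@k and reciprocal rank, from document IDs alone, with no model output involved. It is a generic example rather than code from the guides above; the function names and document IDs are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document to compute recall")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    retrieved = ["d3", "d7", "d1", "d9"]  # hypothetical ranked retriever output
    relevant = ["d1", "d4"]               # hypothetical gold relevance labels
    print(recall_at_k(retrieved, relevant, k=3))
    print(reciprocal_rank(retrieved, relevant))
```

Because these metrics need only gold relevance labels, they are deterministic and cheap to run, whereas generation metrics such as faithfulness typically need a judge model or human review.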

Online / dynamic eval & drift

| Resource | Why read it |
| --- | --- |
| Fluence framework | Prod monitoring feeding golden sets |
| Evidently RAG guide (production monitoring) | Reference-free metrics on live queries |
| 12-metric TDS article | Cost, P99, production health alongside quality |

Synthetic data & golden sets

| Resource | Why read it |
| --- | --- |
| generate-synthetic-data skill | Dimension-based tuples; human review before gating (Hamel) |
| Evidently RAG guide | Synthetic test sets for dev and adversarial stages |
| Coverge regression testing | Coverage-driven curation + incident → golden case |

Tool docs (implementation)

Useful once you know what to measure:

| Tool | Docs | Fits these planes |
| --- | --- | --- |
| RAGAS | Faithfulness, context precision/recall, answer relevance | Context, Reasoning, Outcome |
| DeepEval | Pytest-style evals, CI-friendly | All planes via custom metrics |
| LangSmith | Traces, datasets, online eval, human review | Harness, online, human |
| Langfuse | Traces, datasets, scoring, production monitoring | Harness, online, human |
| Promptfoo | PR evals, pairwise comparisons | CI gate, pairwise eval |

Suggested reading order (external only)

  1. Fluence or Evidently RAG — full picture
  2. Hamel evals-skills blog + skim GitHub skills
  3. Alok production eval or metacto regression suite — CI wiring
  4. TDS 12-metric or offline agents — if you ship agents
  5. AI/TLDR judge calibration — before trusting judge in gates

Series map

