Further reading (external)

Curated third-party writing on LLM and agent evaluation that complements the Eval Framework Blueprint series. Unless a row says otherwise, these resources are by other practitioners, not this site.

How to use this page

Begin with the "Start here" section for the full picture, then jump to the section that matches what you are building. The "Series map" at the bottom ties each external resource to a page in this series.

Start here

Resource	Author / source	Why read it
LLM Evaluation Framework for RAG and AI Agents	Fluence	Operating-model view: golden sets, human + judge, RAG + agents, gates, prod monitoring, regression loop
A complete guide to RAG evaluation	Evidently AI	Dev vs stress vs prod vs regression; retrieval vs generation; synthetic data; monitoring
Evals Skills for Coding Agents	Hamel Husain	Product evals from 50+ companies: error analysis, judge design, calibration, RAG split, synthetic data
hamelsmu/evals-skills	Hamel (open source)	Actionable skills: `eval-audit`, `error-analysis`, `validate-evaluator`, `evaluate-rag`, synthetic generation

Hamel's broader eval writing: hamel.dev.

Production pipelines & CI gates

Resource	Why read it
LLM Evaluation in Production (every deploy)	Deep CI walkthrough: golden set, RAGAS, faithfulness/relevance, GitHub Actions gate
LLM Evals: Build a Production Regression Suite	Golden dataset + scorers + runner + required CI gate; tiered scores, override rules
LLM regression testing	Grow golden sets from production failures; baseline comparison vs absolute thresholds
How to Build an LLM Evaluation Pipeline for CI/CD	PR vs staging vs nightly tiers; deterministic + model-based scoring
Building a Production LLM Evaluation Harness in Pytest	Flake-aware multi-sample scoring, cost caps, regression vs baseline
LLM CI/CD	Judge gates + golden regression + canary; runnable harness patterns
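Several of the resources above contrast absolute thresholds with comparison against a baseline. As a rough illustration only, the sketch below combines both checks in one gate. The names `check_gate`, `GateResult`, `floors`, and `max_drop`, and the 0.02 default tolerance, are hypothetical and not taken from any of the linked articles.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reasons: list[str]


def check_gate(
    current: dict[str, float],
    baseline: dict[str, float],
    floors: dict[str, float],
    max_drop: float = 0.02,  # assumed tolerance; tune per metric in practice
) -> GateResult:
    """Fail if a metric is below its absolute floor or regressed vs baseline."""
    reasons: list[str] = []
    for metric, score in current.items():
        # Absolute threshold: the score must not fall below a fixed floor.
        floor = floors.get(metric)
        if floor is not None and score < floor:
            reasons.append(f"{metric}: {score:.3f} is below floor {floor:.3f}")
        # Baseline comparison: the score must not drop more than max_drop.
        base = baseline.get(metric)
        if base is not None and base - score > max_drop:
            reasons.append(f"{metric}: dropped {base - score:.3f} vs baseline")
    return GateResult(passed=not reasons, reasons=reasons)


if __name__ == "__main__":
    result = check_gate(
        current={"faithfulness": 0.80},
        baseline={"faithfulness": 0.91},
        floors={"faithfulness": 0.85},
    )
    print(result.passed)
    for reason in result.reasons:
        print(reason)
```

In a CI job, a non-empty `reasons` list would typically translate to a non-zero exit code. How strict each check should be is a project decision that the linked posts discuss in more depth.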

Agents (Tool, Action, multi-step)

Resource	Why read it
12-metric framework from 100+ deployments	Retrieval + generation + tool selection, execution, multi-step coherence + prod cost/latency
Production-Ready LLM Agents: Offline Evaluation	Routing, judge, RAG pillars; CI + governance for agent offline eval
Policy-Governed Agent Runtime	On this site — deterministic Action plane (PEP/PDP) alongside agent eval

LLM-as-judge & human calibration

Resource	Why read it
How to Calibrate an LLM Judge	Gold set → agreement → inspect disagreements → rubric fixes; Cohen's κ; common judge biases
LLM as a Judge in Production (2026 playbook)	Position bias, factuality vs tone, when judge is inappropriate, κ targets
Calibrate LLM-as-Judge with Human Corrections	Human corrections → few-shot calibration → track agreement (LangSmith Align Evals)
Efficient Inference for Noisy LLM-as-a-Judge	Academic: debiasing judge noise, PPI-style estimators when judge is imperfect
rusty-llm-jury	TPR/TNR + Rogan–Gladen correction for true pass rate when judge is noisy
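Two quantities named in the table above, Cohen's κ and the Rogan–Gladen correction, are short enough to sketch directly. This is a minimal standard-library illustration of the textbook formulas, not the implementation used by any linked tool. The function names and the clipping to [0, 1] are choices made here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Agreement between two label sequences, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def rogan_gladen(observed_pass_rate: float, tpr: float, tnr: float) -> float:
    """Estimate the true pass rate from a noisy judge's observed pass rate.

    tpr: judge's true positive rate (sensitivity) measured on a gold set.
    tnr: judge's true negative rate (specificity) measured on a gold set.
    """
    denom = tpr + tnr - 1.0
    if denom <= 0.0:
        raise ValueError("judge is no better than chance (tpr + tnr <= 1)")
    estimate = (observed_pass_rate + tnr - 1.0) / denom
    # The raw estimate can leave [0, 1] on small samples; clip it here.
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    human_labels = [1, 1, 0, 0]
    judge_labels = [1, 0, 0, 0]
    print(cohens_kappa(human_labels, judge_labels))
    print(rogan_gladen(observed_pass_rate=0.8, tpr=0.9, tnr=0.8))
```

The correction is only as good as the TPR and TNR estimates, which come from a human-labeled gold set. The resources above cover how to build that set and how to put confidence intervals around the corrected rate.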

RAG: retrieval vs generation (Data vs Context)

Resource	Why read it
Evidently RAG guide	Best single external piece on splitting retrieval and generation
Alok's production eval post	Explicit rule: retrieval metrics vs generation metrics on different schedules
evaluate-rag skill	Short, opinionated RAG eval workflow (Hamel)

Online / dynamic eval & drift

Resource	Why read it
Fluence framework	Prod monitoring feeding golden sets
Evidently RAG guide — production monitoring	Reference-free metrics on live queries
12-metric TDS article	Cost, P99, production health alongside quality

Synthetic data & golden sets

Resource	Why read it
generate-synthetic-data skill	Dimension-based tuples; human review before gating (Hamel)
Evidently RAG guide	Synthetic test sets for dev and adversarial stages
Coverge regression testing	Coverage-driven curation + incident → golden case

Tool docs (implementation)

Useful once you know what to measure:

Tool	Docs	Fits these planes
RAGAS	Faithfulness, context precision/recall, answer relevance	Context, Reasoning, Outcome
DeepEval	Pytest-style evals, CI-friendly	All planes via custom metrics
LangSmith	Traces, datasets, online eval, human review	Harness, online, human
Langfuse	Traces, datasets, scoring, production monitoring	Harness, online, human
Promptfoo	PR evals, pairwise comparisons	CI gate, pairwise eval

Series map

This series	Strongest external complement
Eval Engineering (executive insight, coming soon)	Fluence framework
Eval Framework Blueprint	Hamel eval-audit skill + metacto regression suite
Golden Datasets	Coverge + bigthings.cloud pipeline
Human Review	LangChain human corrections + AI/TLDR calibration
LLM-as-Judge	Alatirok playbook + validate-evaluator skill
Online & Dynamic Eval	Evidently prod monitoring + TDS production metrics
Synthetic Generation	generate-synthetic-data skill
Context / Data	Evidently RAG guide
Action / Tool	TDS 12-metric + Policy-Governed Agent Runtime

