Further reading (external)
Curated third-party writing on LLM and agent evaluation, complementary to the Eval Framework Blueprint series. These pieces are written by other practitioners, not by this site.
How to use this page
Start with "Start here" for the full picture, then jump to the section that matches what you are building. The "Series map" at the bottom ties each external resource to a page in this series.
Start here
| Resource | Author / source | Why read it |
|---|---|---|
| LLM Evaluation Framework for RAG and AI Agents | Fluence | Operating-model view: golden sets, human + judge, RAG + agents, gates, prod monitoring, regression loop |
| A complete guide to RAG evaluation | Evidently AI | Dev vs stress vs prod vs regression; retrieval vs generation; synthetic data; monitoring |
| Evals Skills for Coding Agents | Hamel Husain | Product evals from 50+ companies: error analysis, judge design, calibration, RAG split, synthetic data |
| hamelsmu/evals-skills | Hamel (open source) | Actionable skills: eval-audit, error-analysis, validate-evaluator, evaluate-rag, synthetic generation |
Hamel's broader eval writing: hamel.dev.
Production pipelines & CI gates
| Resource | Why read it |
|---|---|
| LLM Evaluation in Production (every deploy) | Deep CI walkthrough: golden set, RAGAS, faithfulness/relevance, GitHub Actions gate |
| LLM Evals: Build a Production Regression Suite | Golden dataset + scorers + runner + required CI gate; tiered scores, override rules |
| LLM regression testing | Grow golden sets from production failures; baseline comparison vs absolute thresholds |
| How to Build an LLM Evaluation Pipeline for CI/CD | PR vs staging vs nightly tiers; deterministic + model-based scoring |
| Building a Production LLM Evaluation Harness in Pytest | Flake-aware multi-sample scoring, cost caps, regression vs baseline |
| LLM CI/CD | Judge gates + golden regression + canary; runnable harness patterns |
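Several of the resources above contrast gating on absolute thresholds with gating on regression against a baseline. As a rough illustration only (not taken from any of the linked posts), the two checks can be combined in one gate function. The metric names, the floor values, and the `max_drop` tolerance below are all hypothetical placeholders.

```python
def gate(current, baseline, floors, max_drop=0.02):
    """Return a list of failure messages; an empty list means the gate passes.

    current  -- metric name -> score for this run
    baseline -- metric name -> score from the last accepted run
    floors   -- metric name -> absolute minimum score (optional per metric)
    max_drop -- largest tolerated drop versus baseline (hypothetical default)
    """
    failures = []
    for metric, score in current.items():
        # Absolute threshold: fail if the score is below a fixed floor.
        floor = floors.get(metric)
        if floor is not None and score < floor:
            failures.append(f"{metric}: {score:.3f} below floor {floor:.3f}")

        # Baseline comparison: fail if the score regressed too far.
        base = baseline.get(metric)
        if base is not None and base - score > max_drop:
            failures.append(
                f"{metric}: dropped {base - score:.3f} from baseline {base:.3f}"
            )
    return failures


# Illustrative scores only; these numbers are made up for the example.
current = {"faithfulness": 0.90, "relevance": 0.70}
baseline = {"faithfulness": 0.91, "relevance": 0.80}
floors = {"faithfulness": 0.85}

failures = gate(current, baseline, floors)
```

In a CI job, a non-empty `failures` list would typically translate into a non-zero exit code. Keeping both checks lets a floor catch slow drift that never trips the per-run regression tolerance.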
Agents (Tool, Action, multi-step)
| Resource | Why read it |
|---|---|
| 12-metric framework from 100+ deployments | Retrieval + generation + tool selection, execution, multi-step coherence + prod cost/latency |
| Production-Ready LLM Agents: Offline Evaluation | Routing, judge, RAG pillars; CI + governance for agent offline eval |
| Policy-Governed Agent Runtime | On this site — deterministic Action plane (PEP/PDP) alongside agent eval |
LLM-as-judge & human calibration
| Resource | Why read it |
|---|---|
| How to Calibrate an LLM Judge | Gold set → agreement → inspect disagreements → rubric fixes; Cohen's κ; common judge biases |
| LLM as a Judge in Production (2026 playbook) | Position bias, factuality vs tone, when judge is inappropriate, κ targets |
| Calibrate LLM-as-Judge with Human Corrections | Human corrections → few-shot calibration → track agreement (LangSmith Align Evals) |
| Efficient Inference for Noisy LLM-as-a-Judge | Academic: debiasing judge noise, PPI-style estimators when judge is imperfect |
| rusty-llm-jury | TPR/TNR + Rogan–Gladen correction for true pass rate when judge is noisy |
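Two quantities recur in the resources above: Cohen's κ for judge-human agreement, and the Rogan–Gladen correction for estimating a true pass rate from a noisy judge. The sketch below shows the standard textbook formulas in plain Python. It does not reproduce the API of rusty-llm-jury or any other linked tool, and the function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def rogan_gladen(observed_pass_rate, tpr, tnr):
    """Estimate the true pass rate from a judge with known TPR and TNR."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    corrected = (observed_pass_rate + tnr - 1) / denom
    # Sampling noise can push the estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, corrected))


# Illustrative labels only (1 = pass, 0 = fail).
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
true_rate = rogan_gladen(observed_pass_rate=0.8, tpr=0.9, tnr=0.8)
```

The TPR and TNR fed into the correction would come from scoring the judge against a human-labeled gold set, which is the same set used to compute κ.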
RAG: retrieval vs generation (Data vs Context)
| Resource | Why read it |
|---|---|
| Evidently RAG guide | Best single external piece on splitting retrieval and generation |
| Alok's production eval post | Explicit rule: retrieval metrics vs generation metrics on different schedules |
| evaluate-rag skill | Short, opinionated RAG eval workflow (Hamel) |
Online / dynamic eval & drift
| Resource | Why read it |
|---|---|
| Fluence framework | Prod monitoring feeding golden sets |
| Evidently RAG guide — production monitoring | Reference-free metrics on live queries |
| 12-metric TDS article | Cost, P99, production health alongside quality |
Synthetic data & golden sets
| Resource | Why read it |
|---|---|
| generate-synthetic-data skill | Dimension-based tuples; human review before gating (Hamel) |
| Evidently RAG guide | Synthetic test sets for dev and adversarial stages |
| Coverge regression testing | Coverage-driven curation + incident → golden case |
Tool docs (implementation)
Useful once you know what to measure:
| Tool | What the docs cover | Fits these planes |
|---|---|---|
| RAGAS | Faithfulness, context precision/recall, answer relevance | Context, Reasoning, Outcome |
| DeepEval | Pytest-style evals, CI-friendly | All planes via custom metrics |
| LangSmith | Traces, datasets, online eval, human review | Harness, online, human |
| Langfuse | Traces, datasets, scoring, production monitoring | Harness, online, human |
| Promptfoo | PR evals, pairwise comparisons | CI gate, pairwise eval |
Suggested reading order (external only)
- Fluence or Evidently RAG — full picture
- Hamel evals-skills blog + skim GitHub skills
- Alok production eval or metacto regression suite — CI wiring
- TDS 12-metric or offline agents — if you ship agents
- AI/TLDR judge calibration — before trusting judge in gates
Series map
| This series | Strongest external complement |
|---|---|
| Eval Engineering (executive insight, coming soon) | Fluence framework |
| Eval Framework Blueprint | Hamel eval-audit skill + metacto regression suite |
| Golden Datasets | Coverge + bigthings.cloud pipeline |
| Human Review | LangChain human corrections + AI/TLDR calibration |
| LLM-as-Judge | Alatirok playbook + validate-evaluator skill |
| Online & Dynamic Eval | Evidently prod monitoring + TDS production metrics |
| Synthetic Generation | generate-synthetic-data skill |
| Context / Data | Evidently RAG guide |
| Action / Tool | TDS 12-metric + Policy-Governed Agent Runtime |