Agent Engineering Lab · 12 modules

Twelve labs, one engineering story

Each lab is a focused playground for one concept in production AI agent engineering. Seven share the canonical credit-order-eligibility scenario at /tools/agent-lab, each rendered as a different lens and deep-linked from its card below. Four have dedicated routes because the material needed its own surface area. One is documentation only because its topic is operational, not visual.

Foundations
Messages, schemas, tools, the loop itself, and retrieval.
5 labs
Retrieval
Why a single retrieval signal is rarely enough.
1 lab
- Lab 06 · dedicated
  Hybrid search and reranking
  See why a single retrieval signal is not enough.
  BM25, pseudo-vector, RRF hybrid, and a toy reranker side by side.
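To make the "RRF hybrid" idea concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The document ids and the two input rankings are invented for illustration, and `k=60` is only the commonly used default constant; nothing here is taken from the lab's own implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every doc it contains,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword ranker and a vector ranker.
bm25_ranking = ["policy", "limits", "faq"]
vector_ranking = ["limits", "orders", "policy"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)
```

A document that ranks well in both lists ("limits") overtakes one that tops only a single list, which is the point of combining signals before a reranker sees the candidates.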
Tooling
Tool discovery and the workflow / agent boundary.
2 labs
- Lab 07 · dedicated
  MCP-style tool protocol
  Discover tools through a standard manifest.
  Tool registry, JSON-schema inputs, permission badges, simulated handshake.
- Lab 08 · dedicated
  Workflow vs free-form agent
  Compare a fixed pipeline to an autonomous loop.
  Six-step deterministic workflow next to the free-form agent runner.
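The tool registry described in Lab 07 can be sketched roughly as follows. This is an illustration under assumptions, not the lab's code: the `Tool` and `ToolRegistry` classes, the `permission` field, and the `get_credit_limit` tool are all hypothetical, and the argument check covers only a tiny subset of JSON Schema.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Tiny subset of JSON Schema types; a real server would use a full validator.
_JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict[str, Any]  # JSON-Schema-shaped dict
    permission: str               # hypothetical badge, e.g. "read" or "write"
    handler: Callable[[dict[str, Any]], Any]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def list_tools(self) -> list[dict[str, Any]]:
        """Manifest a client sees at discovery time: metadata only, no handlers."""
        return [
            {
                "name": t.name,
                "description": t.description,
                "inputSchema": t.input_schema,
                "permission": t.permission,
            }
            for t in self._tools.values()
        ]

    def call(self, name: str, args: dict[str, Any]) -> Any:
        tool = self._tools[name]
        schema = tool.input_schema
        for key in schema.get("required", []):
            if key not in args:
                raise ValueError(f"missing required argument: {key}")
        for key, value in args.items():
            spec = schema.get("properties", {}).get(key)
            if spec is None:
                raise ValueError(f"unexpected argument: {key}")
            expected = _JSON_TYPES.get(spec.get("type"))
            if expected is not None and not isinstance(value, expected):
                raise ValueError(f"argument {key} should be {spec['type']}")
        return tool.handler(args)


registry = ToolRegistry()
registry.register(Tool(
    name="get_credit_limit",  # hypothetical tool for the credit-order scenario
    description="Look up a customer's credit limit.",
    input_schema={
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
    permission="read",
    handler=lambda args: {"customer_id": args["customer_id"], "limit": 500},
))

manifest = registry.list_tools()
result = registry.call("get_credit_limit", {"customer_id": "c-1"})
```

Keeping discovery (`list_tools`) separate from execution (`call`) lets a client inspect names, schemas, and permissions before it is allowed to invoke anything.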
Quality
Replay, score, and regression-test agent behavior.
1 lab
- Lab 09 · canonical lab
  Evaluation harness
  Replay every case and score the agent.
  Evals lens: 8 canonical cases, per-assertion drill-down, metric strip.
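As a rough picture of what "replay every case and score the agent" involves, the sketch below runs a stub agent over two made-up cases and reports per-assertion results plus two summary metrics. The `Case` shape, the stub agent, the assertion labels, and the metric names are assumptions for illustration; the lab's 8 canonical cases are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    request: dict
    assertions: dict[str, Callable[[dict], bool]]  # label -> check on agent output


def run_evals(agent: Callable[[dict], dict], cases: list[Case]):
    """Replay each case through the agent and score every assertion."""
    results = []
    for case in cases:
        output = agent(case.request)
        checks = {label: bool(check(output)) for label, check in case.assertions.items()}
        results.append({"case": case.name, "checks": checks, "passed": all(checks.values())})

    total_assertions = sum(len(r["checks"]) for r in results)
    passed_assertions = sum(sum(r["checks"].values()) for r in results)
    metrics = {
        "case_pass_rate": sum(r["passed"] for r in results) / len(results),
        "assertion_pass_rate": passed_assertions / total_assertions,
    }
    return results, metrics


def stub_agent(request: dict) -> dict:
    """Stand-in for a real agent run: decides eligibility from the request alone."""
    eligible = request["amount"] <= request["credit_limit"]
    return {
        "decision": "eligible" if eligible else "ineligible",
        "tool_calls": ["get_credit_limit"],
    }


cases = [
    Case(
        name="within_limit",
        request={"amount": 100, "credit_limit": 500},
        assertions={
            "decision_is_eligible": lambda out: out["decision"] == "eligible",
            "looked_up_limit": lambda out: "get_credit_limit" in out["tool_calls"],
        },
    ),
    Case(
        name="over_limit",
        request={"amount": 900, "credit_limit": 500},
        assertions={
            "decision_is_ineligible": lambda out: out["decision"] == "ineligible",
            # Fails on purpose: the stub agent never escalates.
            "escalated": lambda out: "escalate" in out["tool_calls"],
        },
    ),
]

results, metrics = run_evals(stub_agent, cases)
```

Keeping the per-assertion booleans alongside the summary rates is what makes a drill-down possible: a failed case points at the specific check that broke, which also makes the same cases usable as regression tests.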
Safety
Human approval and policy outside the model.
2 labs
- Lab 10 · canonical lab
  Human-in-the-loop
  Classify actions by risk and approve / reject.
  Built into every run: write actions pause at the approval gate; rejection is first-class.
- Lab 11 · canonical lab
  Permissions and audit trails
  Keep policy outside the model and log every decision.
  Policy module + trace timeline: every iteration is a typed, inspectable event.
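The two Safety cards describe a pattern that can be sketched in a few lines: policy is plain data outside the model, write actions pause for approval, and every decision is logged as a typed event. The action names, the `Gate` class, the decision labels, and the lambda standing in for a human approver are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable


class Risk(Enum):
    READ = "read"
    WRITE = "write"


# Policy is plain data kept outside the model (action names are hypothetical).
POLICY = {"get_credit_limit": Risk.READ, "place_order": Risk.WRITE}


@dataclass
class AuditEvent:
    action: str
    risk: str
    decision: str  # "auto_allowed", "approved", "rejected", or "denied"


@dataclass
class Gate:
    approver: Callable[[str, dict], bool]  # stands in for a human reviewer
    log: list[AuditEvent] = field(default_factory=list)

    def execute(self, action: str, args: dict, handler: Callable[[dict], Any]) -> dict:
        risk = POLICY.get(action)
        if risk is None:
            # Unknown actions are denied by default and still logged.
            self.log.append(AuditEvent(action, "unknown", "denied"))
            return {"status": "denied"}
        if risk is Risk.WRITE:
            if not self.approver(action, args):
                # Rejection is a normal outcome, not an error.
                self.log.append(AuditEvent(action, risk.value, "rejected"))
                return {"status": "rejected"}
            decision = "approved"
        else:
            decision = "auto_allowed"
        self.log.append(AuditEvent(action, risk.value, decision))
        return {"status": "ok", "result": handler(args)}


placed = []
gate = Gate(approver=lambda action, args: args["amount"] <= 100)

read = gate.execute("get_credit_limit", {"customer_id": "c-1"}, lambda a: 500)
small = gate.execute("place_order", {"amount": 50}, lambda a: placed.append(a) or "placed")
large = gate.execute("place_order", {"amount": 900}, lambda a: placed.append(a) or "placed")
```

Because the gate returns a structured result for a rejection rather than raising, the agent loop can continue and report the outcome, and the log holds one event per decision regardless of how it went.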
Operations
What production deployment and observability look like.
1 lab
- Lab 12 · docs
  Observability and deployment
  Understand what production operation looks like.
  Operations doc: trace replay, cost / latency budgets, Cloudflare deployment notes.

Why this layout: Labs 2, 3, 4, 5, 9, 10, and 11 are different lenses on a single, complete agent run, so they share one route and the canonical scenario; each card deep-links to the right lens via a query parameter. Labs 1, 6, 7, and 8 introduce concepts the canonical run does not exercise (the bare LLM protocol, alternative retrieval signals, the MCP tool manifest, and a deterministic-workflow comparison), so they live on sibling routes. Lab 12 is a short operations doc because production observability and deployment are infrastructure topics, not a single visual demo.
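The deep-linking scheme described above could look something like the sketch below. Only the route `/tools/agent-lab` and the set of canonical lab numbers come from this page; the query parameter name `lens` and every lens slug are assumptions, since the real values are not given here.

```python
from urllib.parse import urlencode

CANONICAL_ROUTE = "/tools/agent-lab"  # route named on this page

# Hypothetical lens slugs keyed by lab number; the real query values are not given.
CANONICAL_LENSES = {
    2: "structured-outputs",
    3: "tool-calling",
    4: "agent-loop",
    5: "rag",
    9: "evals",
    10: "approval",
    11: "audit",
}


def card_href(lab_number: int) -> str:
    """Build the deep link for a card whose lab is a lens on the canonical run."""
    lens = CANONICAL_LENSES.get(lab_number)
    if lens is None:
        # Labs 1, 6, 7, 8 and 12 are not lenses; their routes are not sketched here.
        raise ValueError(f"lab {lab_number} is not a lens on the canonical run")
    # "lens" is an assumed parameter name.
    return f"{CANONICAL_ROUTE}?{urlencode({'lens': lens})}"


evals_link = card_href(9)
```

One shared route with a query parameter means all seven lenses render the same underlying run, so a trace seen in one lens matches what the others show.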

Foundations
- LLM API fundamentals
- Structured outputs
- Tool calling
- Agent loop
- RAG
Retrieval
- Hybrid search and reranking
Tooling
- MCP-style tool protocol
- Workflow vs free-form agent
Quality
- Evaluation harness
Safety
- Human-in-the-loop
- Permissions and audit trails
Operations
- Observability and deployment