Eval metrics
Built-ins

| Kind | Args | Notes |
|---|---|---|
| `exactMatch` | `{ expected }` | JSON-deep equality |
| `regex` | `{ pattern }` | Stringified output is matched against the pattern |
| `jsonSchemaValid` | `{ schema? }` | Standard-Schema validation; without `schema`, asserts JSON-serializable |
| `toolCallContains` | `{ tool }` | Asserts the named tool was called at least once |
| `costUnder` | `{ usd }` | `ctx.costUsd <= usd` |
| `latencyUnder` | `{ ms }` | `ctx.latencyMs <= ms` |
| `tokensUnder` | `{ tokens }` | `ctx.tokensTotal <= tokens` |
| `customJudge` | `{ rubric }` | Function-based; `rubric` returns `MetricResult` |
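
As a rough illustration of the semantics in the table, the sketch below evaluates three of the built-in kinds against a captured output. It is not the library's implementation: `EvalContext`, `Assertion`, `evaluate`, `binary`, and `deepEqual` are hypothetical names introduced here, and the use of `JSON.stringify` for the "stringified output" is an assumption. Only the kind names, their args, `ctx.costUsd`, and the `MetricResult` fields come from this page.

```typescript
// Hypothetical sketch of the table's semantics; not the library's API.
interface MetricResult { passed: boolean; score: number; reason?: string }
interface EvalContext { costUsd: number }

type Assertion =
  | { kind: "exactMatch"; expected: unknown }
  | { kind: "regex"; pattern: string }
  | { kind: "costUnder"; usd: number };

// Structural equality over JSON-like values (objects, arrays, primitives).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ka = Object.keys(a as object);
  const kb = Object.keys(b as object);
  if (ka.length !== kb.length) return false;
  return ka.every((k) =>
    Object.prototype.hasOwnProperty.call(b, k) &&
    deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
  );
}

// Binary metrics report a score of exactly 0 or 1.
const binary = (passed: boolean, reason: string): MetricResult =>
  ({ passed, score: passed ? 1 : 0, reason });

function evaluate(a: Assertion, output: unknown, ctx: EvalContext): MetricResult {
  switch (a.kind) {
    case "exactMatch":
      return binary(deepEqual(output, a.expected), "deep equality check");
    case "regex": {
      // Assumption: non-string outputs are stringified with JSON.stringify.
      const text = typeof output === "string" ? output : JSON.stringify(output);
      return binary(new RegExp(a.pattern).test(text), `pattern ${a.pattern}`);
    }
    case "costUnder":
      return binary(ctx.costUsd <= a.usd, `cost ${ctx.costUsd} vs limit ${a.usd}`);
  }
}

const ctx: EvalContext = { costUsd: 0.04 };
const exact = evaluate({ kind: "exactMatch", expected: { a: [1, 2] } }, { a: [1, 2] }, ctx);
const matched = evaluate({ kind: "regex", pattern: "^summary:" }, "summary: ok", ctx);
const tooCostly = evaluate({ kind: "costUnder", usd: 0.01 }, "anything", ctx);
```

Modelling the assertions as a discriminated union on `kind` lets the compiler check that each kind receives the args listed in the table.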
Result shape

    interface MetricResult {
      passed: boolean;
      score: number;
      reason?: string;
    }

`score` is in [0, 1]. Binary metrics report 0 or 1. Custom rubrics MAY interpolate.
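
To make the score contract concrete, here is a minimal sketch of a rubric that interpolates: it scores the fraction of required keywords found in the output and clamps the result to [0, 1]. `keywordCoverage`, `clamp01`, and the 0.8 pass cutoff are illustrative assumptions and not part of the library; only the `MetricResult` fields are taken from this page.

```typescript
interface MetricResult { passed: boolean; score: number; reason?: string }

// Keep any computed score inside the documented [0, 1] range.
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Hypothetical rubric: score = fraction of required keywords present (case-insensitive).
function keywordCoverage(output: string, required: string[], passAt = 0.8): MetricResult {
  if (required.length === 0) {
    return { passed: true, score: 1, reason: "no keywords required" };
  }
  const text = output.toLowerCase();
  const hits = required.filter((k) => text.includes(k.toLowerCase())).length;
  const score = clamp01(hits / required.length);
  return {
    passed: score >= passAt,
    score,
    reason: `${hits}/${required.length} keywords present`,
  };
}

// Two of the four keywords appear, so this yields an intermediate score.
const partial = keywordCoverage(
  "Deploy finished; rollback not needed.",
  ["deploy", "rollback", "incident", "owner"],
);
```

Here `partial.score` is an intermediate value and `partial.passed` is false, whereas a binary metric would only ever report 0 or 1.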
Suite shape

    defineSuite({
      id: "summarizer.suite",
      controller: agent,
      threshold: 0.8,
      hiddenGolden: { refs: ["s3://…/golden/*"], threshold: 0.85 },
      loadGoldenScenarios: async (refs) => fetchGolden(refs),
      scenarios: [
        {
          id: "happy.5-messages",
          input: { since: new Date() },
          assertions: [
            { kind: "jsonSchemaValid" },
            { kind: "toolCallContains", tool: "slack.search" },
            { kind: "costUnder", usd: 0.10 },
            { kind: "latencyUnder", ms: 8000 },
          ],
        },
      ],
    });

Custom judge example

    {
      kind: "customJudge",
      rubric: async (output, ctx) => {
        const score = await llmJudgeScore(output);
        return { passed: score >= 0.8, score, reason: `LLM judge: ${score}` };
      },
    }

Adapter ecosystem (v1+)

- `@ahamie/eval-promptfoo`
- `@ahamie/eval-inspect-ai`
- `@ahamie/eval-braintrust`
- `@ahamie/eval-langsmith`