Agent Evals
Agent evals let you test runnable Daita agents with deterministic assertions, data-operation inspectors, baselines, artifacts, skill/plugin checks, and optional structured LLM judges.
#Overview
Agent evals are a developer-preview system for testing whether a Daita agent is working correctly and whether its behavior changed over time. An eval suite loads a runnable agent, runs one or more prompts, evaluates expectations, and writes structured artifacts for humans, CI, dashboards, and coding agents.
Use evals to answer questions like:
- Did the agent produce the expected answer?
- Did it use the right tool, skill, or plugin?
- Did it avoid unsafe SQL or forbidden data operations?
- Did latency, cost, tokens, or tool paths regress?
- Did repeat runs stay stable?
- Did an optional LLM judge agree that the answer met qualitative criteria?
Evals are designed to be deterministic first. LLM judges are optional and should be used when rules cannot capture the quality you care about.
#Quick Start
Create an eval suite YAML file:
```yaml
name: sales-agent-evals
version: 1

agent:
  factory: 'myapp.agents:create_sales_agent'
  kwargs:
    model: gpt-4o-mini

defaults:
  runs: 2
  max_iterations: 8

cases:
  - id: top-products
    prompt: What were the top 5 products by revenue?
    expectations:
      answer:
        contains: ['Widget A']
        numeric:
          - label: revenue
            expected: 12840.50
            tolerance: 0.01
      tools:
        required: ['sqlite_query']
        max_calls: 4
      sql:
        read_only: true
        require_limit: true
        must_include: ['SUM', 'GROUP BY']
        must_not_include: ['DELETE', 'DROP']
      skills:
        required: ['schema_discovery']
        max_errors: 0
      plugins:
        required: ['sqlite']
        forbidden: ['s3']
        max_latency_ms: 3000
      budgets:
        max_tokens: 8000
        max_latency_ms: 15000
      stability:
        require_same_tools: true
        max_answer_variants: 1
```

Run it from Python:
```python
import asyncio

from daita.evals import EvalSuite
from daita.evals.reporters import render_pretty


async def main():
    report = await EvalSuite.from_file("evals/sales-agent.yaml").run()
    print(render_pretty(report))


asyncio.run(main())
```

The daita eval CLI command is planned. Use the Python API while evals are in developer preview.
#Agent Factory Contract
Eval suites load agents through a factory path:
```yaml
agent:
  factory: 'myapp.agents:create_agent'
  kwargs:
    model: gpt-4o-mini
```

Factory rules:
- Use module:function format.
- The factory may be sync or async.
- kwargs are passed directly to the factory.
- The returned object must provide run(prompt, ...).
Preferred runnable behavior:
```python
result = await agent.run(prompt, detailed=True)
```

If detailed=True is not supported, evals fall back to run(prompt) and wrap the string response. Optional start() and stop() hooks are called around the suite when present.
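A minimal factory and agent that satisfy this contract might look like the sketch below. The SalesAgent class, its detailed-result shape, and the answer text are illustrative assumptions; only the contract itself (a module:function factory, kwargs, run(), and the optional start()/stop() hooks) comes from this page.

```python
# myapp/agents.py -- illustrative sketch; the class name and result shape are assumptions.


class SalesAgent:
    def __init__(self, model: str):
        self.model = model

    async def start(self):
        # Optional hook: called once before the suite runs, if present.
        pass

    async def run(self, prompt: str, detailed: bool = False):
        # Return a plain string, or a richer result when detailed=True.
        # The detailed-result shape shown here is a placeholder, not a documented schema.
        answer = f"[{self.model}] answer to: {prompt}"
        return {"answer": answer} if detailed else answer

    async def stop(self):
        # Optional hook: called once after the suite finishes, if present.
        pass


def create_sales_agent(model: str = "gpt-4o-mini") -> SalesAgent:
    # Referenced from suite YAML as factory: 'myapp.agents:create_sales_agent'.
    return SalesAgent(model=model)
```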
#What Evals Can Check
#Answer Expectations
```yaml
expectations:
  answer:
    equals: 'Widget A'
    contains: ['Widget A']
    not_contains: ['I cannot verify']
    regex: ["revenue: \\$?[0-9,.]+"]
    numeric:
      - label: revenue
        expected: 12840.50
        tolerance: 0.01
```

#Tool Expectations
```yaml
expectations:
  tools:
    required: ['sqlite_query']
    forbidden: ['sqlite_execute']
    max_calls: 4
```

#SQL Expectations
```yaml
expectations:
  sql:
    read_only: true
    require_limit: true
    required_tables: ['sales']
    forbidden_tables: ['users_pii']
    must_include: ['SUM', 'GROUP BY']
    must_not_include: ['DELETE', 'DROP', 'SELECT *']
    max_rows_returned: 100
```
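As a rough mental model, these fields assert properties of the SQL that the agent's tools actually ran. The sketch below only illustrates the kind of check each field implies; the framework's own SQL inspector does the real analysis, and the exact matching rules here are assumptions.

```python
# Illustration only: the kind of property each SQL expectation asserts about a
# captured query. The framework's inspector, not this sketch, does the real work.
import re

WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE")


def meets_expectations(sql: str) -> bool:
    upper = sql.upper()
    return (
        not any(re.search(rf"\b{kw}\b", upper) for kw in WRITE_KEYWORDS)       # read_only: true
        and re.search(r"\bLIMIT\b", upper) is not None                          # require_limit: true
        and all(token in upper for token in ("SUM", "GROUP BY"))                # must_include
        and not any(token in upper for token in ("DELETE", "DROP", "SELECT *"))  # must_not_include
    )


print(meets_expectations(
    "SELECT product, SUM(revenue) FROM sales GROUP BY product ORDER BY 2 DESC LIMIT 5"
))  # True
```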
#Data Operation Expectations
Non-SQL inspectors normalize tool activity across files, APIs, storage, vector search, workflows, and generic tools.
```yaml
expectations:
  operations:
    required_categories: ['file', 'api', 'vector']
    forbidden_categories: ['workflow']
    max_write_operations: 0
    max_delete_operations: 0
  files:
    required_read: ['sales.csv']
    forbidden_read: ['secrets.xlsx']
  api:
    required_methods: ['GET']
    forbidden_methods: ['POST', 'DELETE']
    required_hosts: ['api.example.com']
  storage:
    required_buckets: ['analytics']
    forbidden_write: true
  vector:
    max_top_k: 10
    required_filters: ['tenant_id']
```

#Skill and Plugin Expectations
Skill/plugin checks use normalized execution spans when the agent result or trace exposes them.
```yaml
expectations:
  skills:
    required: ['math_reasoning']
    forbidden: ['web_search']
    max_calls: 2
    max_latency_ms: 1000
    max_errors: 0
  plugins:
    required: ['sqlite']
    forbidden: ['s3']
    max_calls: 4
    max_latency_ms: 3000
    max_errors: 0
```

#Budget and Stability Expectations
```yaml
defaults:
  runs: 3

cases:
  - id: stable-answer
    prompt: Answer consistently.
    expectations:
      budgets:
        max_tokens: 8000
        max_cost: 0.05
        max_latency_ms: 15000
        max_iterations: 8
      stability:
        require_same_tools: true
        max_answer_variants: 1
        max_cost_delta_pct: 25
        max_latency_delta_pct: 50
        max_token_delta_pct: 25
```

#Datasets
Suites can expand JSONL, JSON, or YAML records into cases:
```yaml
name: support-agent-evals

agent:
  factory: 'myapp.agents:create_support_agent'

dataset:
  path: 'evals/support-cases.jsonl'
  input_field: 'prompt'
  id_field: 'id'
  expected_field: 'expected'
  case_template:
    expectations:
      tools:
        required: ['search_docs']
```

Example JSONL record:
```json
{ "id": "refund-policy", "prompt": "What is the refund window?", "expected": { "contains": ["30 days"] } }
```
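If you generate datasets programmatically, a small script can emit records in this shape. The field names below mirror the dataset mapping above (id, prompt, expected); the output path and the second record are placeholders for illustration.

```python
# Hypothetical helper: writes a JSONL dataset whose fields match the mapping above.
import json
from pathlib import Path

records = [
    {"id": "refund-policy", "prompt": "What is the refund window?",
     "expected": {"contains": ["30 days"]}},
    {"id": "shipping-time", "prompt": "How long does standard shipping take?",
     "expected": {"contains": ["business days"]}},  # placeholder record
]

path = Path("evals/support-cases.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")
```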
#Baselines
Baselines let you compare a new eval run against a known-good report:
```python
baseline_report = await EvalSuite.from_file("evals/sales-agent.yaml").run(
    record_baseline=True,
    baseline_path=".daita/evals/baselines/sales-agent.json",
)

report = await EvalSuite.from_file("evals/sales-agent.yaml").run(
    compare_baseline=True,
    baseline_path=".daita/evals/baselines/sales-agent.json",
)
```

Regression policies can fail on new failures, score drops, cost increases, latency increases, or tool-sequence changes.
```yaml
baselines:
  fail_on:
    new_failures: true
    score_drop_gt: 0.01
    cost_increase_pct_gt: 25
    latency_increase_pct_gt: 50
    tool_sequence_changed: true
```

#LLM Judges
Use an LLM judge when deterministic checks are not enough. Judges support structured criteria, required criteria, weights, pass thresholds, artifacts, and cost accounting.
```yaml
expectations:
  judge:
    provider: openai
    model: gpt-4o-mini
    pass_score: 0.8
    require_all_criteria_pass: true
    include_tool_outputs: true
    criteria:
      - id: direct_answer
        description: The answer directly answers the user's question.
        required: true
        weight: 0.4
      - id: grounded
        description: The answer only uses facts present in tool outputs.
        required: true
        weight: 0.4
      - id: concise
        description: The answer avoids unrelated explanation.
        required: false
        weight: 0.2
```

By default, judges run only after deterministic assertions pass. Set run_when: always to judge failed deterministic runs too.
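To see how pass_score and weights can interact, here is a small worked example for the criteria above. It assumes the overall score is the weight-normalized sum of passed criteria, which is an assumption about the scoring rule rather than a documented guarantee; require_all_criteria_pass additionally fails the run if any required criterion fails, regardless of score.

```python
# Worked example under an assumed scoring rule: score = passed weight / total weight.
criteria = [
    {"id": "direct_answer", "weight": 0.4, "required": True,  "passed": True},
    {"id": "grounded",      "weight": 0.4, "required": True,  "passed": True},
    {"id": "concise",       "weight": 0.2, "required": False, "passed": False},
]

total = sum(c["weight"] for c in criteria)
score = sum(c["weight"] for c in criteria if c["passed"]) / total  # 0.8

required_ok = all(c["passed"] for c in criteria if c["required"])  # True
verdict = score >= 0.8 and required_ok                             # passes at pass_score: 0.8
print(score, verdict)
```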
#Artifacts
Each eval run writes a stable artifact directory:
```
.daita/evals/runs/<run_id>/
  report.json
  summary.md
  junit.xml
  cases/
    <case_id>/
      case.json
      run-001.json
      diff.json
```

Judge-enabled runs also write judge input and output artifacts. Baseline comparisons write baseline-comparison.json.
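Because every run writes a junit.xml, CI can gate on the artifact without depending on the report object's in-memory shape. A minimal sketch, assuming the default output directory and standard JUnit XML attributes:

```python
# Sketch: fail a CI step if the newest run's junit.xml records any failures.
# Assumes the default .daita/evals/runs output directory and standard JUnit XML.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

runs = sorted(Path(".daita/evals/runs").iterdir(), key=lambda p: p.stat().st_mtime)
root = ET.parse(runs[-1] / "junit.xml").getroot()

failures = sum(int(suite.get("failures", 0)) for suite in root.iter("testsuite"))
errors = sum(int(suite.get("errors", 0)) for suite in root.iter("testsuite"))

if failures or errors:
    sys.exit(f"eval regressions: {failures} failures, {errors} errors")
```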
Artifact privacy controls:
```yaml
artifacts:
  output_dir: '.daita/evals/runs'
  max_chars: 50000
  include_full_answers: true
  include_tool_outputs: false
  redact_patterns:
    - 'sk-[A-Za-z0-9]+'
```
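Before trusting a redact pattern with real secrets, it is worth checking it against a sample of what your tools emit. The standalone check below uses only the pattern shown above; how the framework applies redaction internally is up to the framework, so treat the replacement text as illustrative.

```python
# Standalone check of the redact pattern above; the replacement text is illustrative.
import re

pattern = re.compile(r"sk-[A-Za-z0-9]+")
sample = "Called OpenAI with api_key=sk-abc123DEF456 and got 200 OK"

print(pattern.sub("[REDACTED]", sample))
# Called OpenAI with api_key=[REDACTED] and got 200 OK
```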
#Failure Output
Pretty output is designed to be readable in terminals and CI logs:
```
Daita Eval: live-expected-failure-evals
Run: 2026-05-05T00-55-41Z  Agent: LiveEvalSkillPlugin  Model: gpt-4o-mini

Summary
  Cases: 0 passed / 1 failed / 0 warned
  Runs:  0 passed / 1 failed
  Score: 0.0%
  Cost:  $0.0008
  Time:  2.7s

Cases
  FAILED forbidden-execution-path  0/1 runs  $0.0008  2.7s  405 tokens
    tools:   multiply
    skills:  math_reasoning.plan 35ms
    plugins: calculator.multiply 12ms

Failures
  FAIL forbidden-execution-path run-001
    - forbidden_skill_called: Forbidden skill was used: math_reasoning.
    - forbidden_plugin_called: Forbidden plugin was used: calculator.
```

The same failure details are available in report.json with stable failure codes, assertion paths, observed values, expected values, artifact paths, and fix hints when available.
#Next Steps
- Agent - Build runnable agents to evaluate
- Tools - Define tool surfaces that evals can inspect
- Skills - Package reusable capabilities and evaluate skill usage
- Tracing - Understand execution traces and plugin spans
- Data Assertions - Enforce row-level data quality inside tools