## [0.18.0] - 2026-05-05
This release introduces the Agent Evaluation System: a new `daita.evals` package for testing runnable Daita agents with deterministic checks, structured artifacts, baselines, optional LLM judges, and skill/plugin execution visibility.
### Added

- **Agent Evaluation Engine (`daita/evals/`)**: New public eval API for running suites against any compatible Daita agent or custom runnable:

```python
from daita.evals import EvalSuite

report = await EvalSuite.from_file("evals/sales-agent.yaml").run()
```

Eval suites load agents through a Python factory path, support sync or async factories, call optional `start()` and `stop()` lifecycle hooks, and run prompts through `run(prompt, detailed=True)` when available.
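The runnable contract described above can be sketched as a plain Python class; `EchoAgent`, `make_agent`, and the factory path in the comment are illustrative names, not part of the `daita` API:

```python
# Minimal sketch of a custom runnable the eval engine could load through a
# factory path such as "myproject.evals:make_agent" (path is hypothetical).
class EchoAgent:
    """A runnable exposing the optional lifecycle hooks the eval runner calls."""

    def __init__(self):
        self.started = False

    async def start(self):
        # Optional hook: acquire connections, warm caches, etc.
        self.started = True

    async def stop(self):
        # Optional hook: release resources after the suite finishes.
        self.started = False

    async def run(self, prompt, detailed=False):
        # The runner passes detailed=True when the agent supports it.
        answer = f"echo: {prompt}"
        return {"answer": answer, "tools": []} if detailed else answer


def make_agent():
    # A sync factory; the loader also accepts async factories.
    return EchoAgent()
```

A suite config would then point its agent factory path at `make_agent`.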
- **YAML and JSON Eval Suite Configuration**: Eval configs now support suite defaults, per-case overrides, datasets, case templates, artifacts, baselines, and structured expectations. Cases can run once or multiple times using the default sequential same-agent mode.
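As orientation, a suite file might look like the sketch below; every key name here is an assumption about the schema rather than a documented contract:

```yaml
# Hypothetical suite layout: suite defaults plus a per-case override.
suite: sales-agent
agent:
  factory: myproject.evals:make_agent   # sync or async factory path
defaults:
  repeat: 1                             # suite-level default
cases:
  - id: greeting
    prompt: "Say hello to a new lead."
    repeat: 3                           # per-case override; sequential, same agent
    expectations:
      final_answer:
        contains: "hello"
```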
- **Deterministic Assertion System**: New assertion modules cover:
  - final-answer equality, containment, regex, and numeric tolerances
  - required and forbidden tools
  - token, cost, latency, and iteration budgets
  - SQL read-only checks, required `LIMIT` clauses, required/forbidden tables, row-count limits, and SQL shape checks
  - repeat-run stability for answer variants, tool sequences, cost, latency, and token deltas
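To make the assertion categories concrete, a combined expectations block might look like this sketch (all key names are assumptions, not the exact schema):

```yaml
expectations:
  final_answer:
    contains: "revenue"        # equality/containment/regex also available
    numeric:
      value: 1250.0
      tolerance: 0.5
  tools:
    required: ['run_sql']
    forbidden: ['delete_table']
  budgets:
    max_tokens: 4000
    max_cost_usd: 0.05
    max_latency_ms: 8000
    max_iterations: 6
  sql:
    read_only: true
    require_limit: true
    forbidden_tables: ['users_pii']
    max_rows: 100
  stability:                   # checked across repeat runs
    max_answer_variants: 1
    max_cost_delta_pct: 20
```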
- **Data Operation Inspectors**: Eval runs now normalize tool activity into data operations, so suites can check non-SQL behavior across files, APIs, storage systems, vector search, workflows, and generic tools.
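Normalized data operations could then be asserted on directly; the sketch below uses hypothetical field names for operation kinds and limits:

```yaml
expectations:
  data_operations:
    required:
      - kind: file_read        # e.g. the agent must actually open the file
    forbidden:
      - kind: storage_write    # no writes to storage systems during the eval
    max_calls: 5               # cap on total normalized operations
```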
- **Skill and Plugin Evaluation**: Eval evidence now includes `ExecutionSpan` records for skills, plugins, tools, and workflows. Suites can require or forbid skills/plugins, enforce call limits, check latency budgets, and fail on skill/plugin errors. Example expectations:

```yaml
expectations:
  skills:
    required: ['schema_discovery']
    max_errors: 0
  plugins:
    required: ['sqlite']
    forbidden: ['s3']
    max_latency_ms: 3000
```
- **Structured Eval Artifacts**: Eval runs write a stable artifact set for developers, CI, dashboards, and coding agents:
  - `report.json`
  - `summary.md`
  - `junit.xml`
  - per-case `case.json`
  - per-run `run-XXX.json`
  - repeat-run `diff.json`
  - judge input/output artifacts
  - baseline comparison artifacts

  Artifact privacy controls support truncation, redaction, full-answer inclusion, and tool-output inclusion.
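Because `junit.xml` follows the standard JUnit shape (`<testsuite>` elements carrying a `failures` attribute), a CI gate over the artifact needs nothing daita-specific; the artifact path in the usage note is illustrative:

```python
import xml.etree.ElementTree as ET


def eval_failures(junit_path: str) -> int:
    """Sum reported failures across all <testsuite> elements in a JUnit file."""
    root = ET.parse(junit_path).getroot()
    # iter() yields the root itself when the file's root is a single <testsuite>,
    # and all nested <testsuite> children when the root is <testsuites>.
    return sum(int(suite.get("failures", "0")) for suite in root.iter("testsuite"))
```

A pipeline could fail the build whenever `eval_failures("evals/out/junit.xml") > 0` (output directory assumed for illustration).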
- **Dataset-Driven Eval Cases**: Suites can expand JSONL, JSON, or YAML datasets into eval cases using configurable input, ID, expected, and metadata fields. Case templates let teams apply shared expectations to every dataset record.
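A dataset expansion might be declared as in this sketch; the field-mapping keys and the sample record are assumptions:

```yaml
# Each JSONL record such as
#   {"qid": "q1", "question": "Total Q3 revenue?", "answer": "1250", "category": "finance"}
# becomes one eval case.
datasets:
  - path: evals/data/questions.jsonl
    input_field: question      # maps to the case prompt
    id_field: qid
    expected_field: answer
    metadata_fields: ['category']
    case_template:             # shared expectations for every record
      expectations:
        budgets:
          max_latency_ms: 10000
```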
- **Baselines and Regression Comparison**: Eval reports can be recorded as baselines and compared against later runs. Regression policies can flag new failures, score drops, cost increases, latency increases, and tool-sequence changes.
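A regression policy might be wired up as below; the key names are assumptions about the config schema:

```yaml
baseline:
  record_to: evals/baselines/sales-agent.json   # where this run is saved
  compare_to: evals/baselines/sales-agent.json  # baseline to compare against
  fail_on:
    new_failures: true
    score_drop_pct: 5
    cost_increase_pct: 25
    latency_increase_pct: 50
    tool_sequence_change: true
```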
- **Optional Structured LLM Judges**: LLM judges can evaluate cases when deterministic assertions are not enough. Judges support structured criteria, required criteria, weighted scores, score thresholds, judge artifacts, and latency, token, and cost accounting.
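A judge section might look like the following sketch; the model name, criteria, and field names are all assumptions:

```yaml
judge:
  model: gpt-4o-mini
  threshold: 0.8               # minimum weighted score for a pass
  criteria:
    - name: correctness
      weight: 0.6
      required: true           # failing this criterion fails the case outright
    - name: tone
      weight: 0.4
```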
- **Eval Reporters**: Added pretty terminal output, Markdown summaries, canonical JSON reports, and JUnit XML rendering. Pretty output includes case-level cost, latency, tokens, tool paths, skill/plugin spans, stability summaries, judge summaries, and grouped failures.
- **Eval Test Coverage**: Added focused unit tests plus live integration scenarios for plain agents, tool-calling agents, SQLite data agents, non-SQL data operations, skill/plugin spans, expected eval failures, repeat runs, multi-case suites, datasets, baselines, mock judges, live OpenAI judges, and artifact contracts.