v0.18.0
May 5, 2026

Agent Evaluation System


## [0.18.0] - 2026-05-05

This release introduces the Agent Evaluation System: a new daita.evals package for testing runnable Daita agents with deterministic checks, structured artifacts, baselines, optional LLM judges, and skill/plugin execution visibility.

### Added

  • Agent Evaluation Engine (daita/evals/)

    New public eval API for running suites against any compatible Daita agent or custom runnable:

    ```python
    from daita.evals import EvalSuite

    report = await EvalSuite.from_file("evals/sales-agent.yaml").run()
    ```

    Eval suites load agents through a Python factory path, support sync or async factories, call optional start() and stop() lifecycle hooks, and run prompts through run(prompt, detailed=True) when available.
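
    A minimal sketch of what a compatible agent and factory might look like, given the protocol described above (the module, class, and factory names here are illustrative, not part of the package):

    ```python
    # Illustrative module, e.g. my_project/agents.py -- not shipped with daita.
    # A suite's agent factory path would point at build_sales_agent below.

    class SalesAgent:
        """Toy agent exposing the hooks the eval runner can use."""

        async def start(self) -> None:
            # Optional lifecycle hook called before the suite runs.
            ...

        async def stop(self) -> None:
            # Optional lifecycle hook called after the suite finishes.
            ...

        async def run(self, prompt: str, detailed: bool = True) -> dict:
            # With detailed=True the runner expects structured evidence
            # (answer plus tool/usage details) rather than a bare string.
            return {"answer": f"echo: {prompt}", "tool_calls": []}

    def build_sales_agent() -> SalesAgent:
        # Factories may be sync or async; the eval engine handles both.
        return SalesAgent()
    ```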

  • YAML and JSON Eval Suite Configuration

    Eval configs now support suite defaults, per-case overrides, datasets, case templates, artifacts, baselines, and structured expectations. Cases can run once or be repeated multiple times; by default, repeat runs execute sequentially against the same agent instance.
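
    A condensed sketch of how such a suite might be laid out (key names are illustrative, not a verbatim schema):

    ```yaml
    # evals/sales-agent.yaml -- illustrative layout
    suite: sales-agent
    agent:
      factory: my_project.agents:build_sales_agent   # hypothetical factory path
    defaults:                      # suite-wide defaults
      repeat: 1
      expectations:
        budgets:
          max_cost_usd: 0.05
    cases:
      - id: monthly-revenue
        prompt: "What was total revenue last month?"
        repeat: 3                  # repeats run sequentially against the same agent
        expectations:              # per-case override merged over the defaults
          final_answer:
            contains: "revenue"
    ```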

  • Deterministic Assertion System

    New assertion modules cover (see the combined sketch after this list):

    • final-answer equality, containment, regex, and numeric tolerances
    • required and forbidden tools
    • token, cost, latency, and iteration budgets
    • SQL read-only checks, required limits, required/forbidden tables, row-count limits, and SQL shape checks
    • repeat-run stability for answer variants, tool sequences, cost, latency, and token deltas
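
    A combined sketch of how these assertions might appear in a case's expectations block (key names are illustrative and may differ from the shipped schema):

    ```yaml
    expectations:
      final_answer:
        contains: "total revenue"
        regex: '\$[0-9,]+'
        numeric:
          value: 48210.75
          tolerance: 0.01
      tools:
        required: ['execute_sql']
        forbidden: ['send_email']
      budgets:
        max_tokens: 20000
        max_cost_usd: 0.05
        max_latency_ms: 15000
        max_iterations: 8
      sql:
        read_only: true
        required_tables: ['orders']
        forbidden_tables: ['raw_pii']
        require_limit: true
        max_rows: 1000
      stability:                    # checked across repeat runs
        max_answer_variants: 1
        max_cost_delta_pct: 20
    ```
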
  • Data Operation Inspectors

    Eval runs now normalize tool activity into data operations so suites can check non-SQL behavior across files, APIs, storage systems, vector search, workflows, and generic tools.
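
    For example, a suite might constrain file and vector-search activity with expectations along these lines (the operation names and keys are illustrative, not a verbatim schema):

    ```yaml
    expectations:
      data_operations:
        required: ['file.read', 'vector.search']
        forbidden: ['storage.delete']
        max_calls: 10
    ```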

  • Skill and Plugin Evaluation

    Eval evidence now includes ExecutionSpan records for skills, plugins, tools, and workflows. Suites can require or forbid skills/plugins, enforce call limits, check latency budgets, and fail on skill/plugin errors.

    Example expectations:

    ```yaml
    expectations:
      skills:
        required: ['schema_discovery']
        max_errors: 0
      plugins:
        required: ['sqlite']
        forbidden: ['s3']
        max_latency_ms: 3000
    ```

  • Structured Eval Artifacts

    Eval runs write a stable artifact set for developers, CI, dashboards, and coding agents:

    • report.json
    • summary.md
    • junit.xml
    • per-case case.json
    • per-run run-XXX.json
    • repeat-run diff.json
    • judge input/output artifacts
    • baseline comparison artifacts

    Artifact privacy controls support truncation, redaction, full-answer inclusion, and tool-output inclusion.
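
    A sketch of what artifact privacy controls could look like in a suite config (keys are illustrative):

    ```yaml
    artifacts:
      include_full_answers: false      # keep long answers out of report.json
      include_tool_outputs: false
      truncate_chars: 2000
      redact_patterns: ['api_key', 'password']
    ```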

  • Dataset-Driven Eval Cases

    Suites can expand JSONL, JSON, or YAML datasets into eval cases using configurable input, ID, expected, and metadata fields. Case templates let teams apply shared expectations to every dataset record.
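
    A dataset-driven suite might map record fields and apply a shared template roughly like this (keys are illustrative):

    ```yaml
    dataset:
      path: datasets/sales-questions.jsonl
      input_field: question
      id_field: id
      expected_field: answer
      metadata_fields: ['region', 'difficulty']
    case_template:                      # applied to every expanded record
      expectations:
        tools:
          required: ['execute_sql']
    ```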

  • Baselines and Regression Comparison

    Eval reports can be recorded as baselines and compared against later runs. Regression policies support new failures, score drops, cost increases, latency increases, and tool-sequence changes.
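
    A regression policy against a recorded baseline could be expressed along these lines (keys are illustrative):

    ```yaml
    baseline:
      path: evals/baselines/sales-agent.json
      fail_on:
        new_failures: true
        score_drop_pct: 5
        cost_increase_pct: 25
        latency_increase_pct: 50
        tool_sequence_changes: true
    ```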

  • Optional Structured LLM Judges

    LLM judges can evaluate cases when deterministic assertions are not enough. Judges support structured criteria, required criteria, weighted scores, score thresholds, judge artifacts, and latency, token, and cost accounting.
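
    A structured judge configuration might look roughly like this (criteria names and keys are illustrative):

    ```yaml
    judge:
      model: gpt-4o-mini               # any supported judge model
      threshold: 0.8                   # minimum weighted score to pass
      criteria:
        - name: correctness
          weight: 0.6
          required: true               # failing a required criterion fails the case
        - name: clarity
          weight: 0.4
    ```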

  • Eval Reporters

    Added pretty terminal output, Markdown summaries, canonical JSON reports, and JUnit XML rendering. Pretty output includes case-level cost, latency, tokens, tool paths, skill/plugin spans, stability summaries, judge summaries, and grouped failures.

  • Eval Test Coverage

    Added focused unit tests plus live integration scenarios for plain agents, tool-calling agents, SQLite data agents, non-SQL data operations, skill/plugin spans, expected eval failures, repeat runs, multi-case suites, datasets, baselines, mock judges, live OpenAI judges, and artifact contracts.