v0.18.0
May 5, 2026

Agent Evaluation System


## [0.18.0] - 2026-05-05

This release introduces the Agent Evaluation System: a new daita.evals package for testing runnable Daita agents with deterministic checks, structured artifacts, baselines, optional LLM judges, and skill/plugin execution visibility.

### Added

  • Agent Evaluation Engine (daita/evals/)

    New public eval API for running suites against any compatible Daita agent or custom runnable:

    ```python
    from daita.evals import EvalSuite

    report = await EvalSuite.from_file("evals/sales-agent.yaml").run()
    ```

    Eval suites load agents through a Python factory path, support sync or async factories, call optional start() and stop() lifecycle hooks, and run prompts through run(prompt, detailed=True) when available.
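
    A minimal sketch of what a compatible agent and factory might look like, given the protocol described above (the module, class, and factory names here are illustrative, not part of the package):

    ```python
    # Illustrative module, e.g. my_project/agents.py -- not shipped with daita.
    # A suite's agent factory path would point at build_sales_agent below.

    class SalesAgent:
        """Toy agent exposing the hooks the eval runner can use."""

        async def start(self) -> None:
            # Optional lifecycle hook called before the suite runs.
            ...

        async def stop(self) -> None:
            # Optional lifecycle hook called after the suite finishes.
            ...

        async def run(self, prompt: str, detailed: bool = True) -> dict:
            # With detailed=True the runner expects structured evidence
            # (answer plus tool/usage details) rather than a bare string.
            return {"answer": f"echo: {prompt}", "tool_calls": []}

    def build_sales_agent() -> SalesAgent:
        # Factories may be sync or async; the eval engine handles both.
        return SalesAgent()
    ```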

  • YAML and JSON Eval Suite Configuration

    Eval configs now support suite defaults, per-case overrides, datasets, case templates, artifacts, baselines, and structured expectations. Cases can run once or be repeated multiple times; by default, repeat runs execute sequentially against the same agent instance.
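
    A condensed sketch of how such a suite might be laid out (key names are illustrative, not a verbatim schema):

    ```yaml
    # evals/sales-agent.yaml -- illustrative layout
    suite: sales-agent
    agent:
      factory: my_project.agents:build_sales_agent   # hypothetical factory path
    defaults:                      # suite-wide defaults
      repeat: 1
      expectations:
        budgets:
          max_cost_usd: 0.05
    cases:
      - id: monthly-revenue
        prompt: "What was total revenue last month?"
        repeat: 3                  # repeats run sequentially against the same agent
        expectations:              # per-case override merged over the defaults
          final_answer:
            contains: "revenue"
    ```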

  • Deterministic Assertion System

    New assertion modules cover (see the combined sketch after this list):

    • final-answer equality, containment, regex, and numeric tolerances
    • required and forbidden tools
    • token, cost, latency, and iteration budgets
    • SQL read-only checks, required limits, required/forbidden tables, row-count limits, and SQL shape checks
    • repeat-run stability for answer variants, tool sequences, cost, latency, and token deltas
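
    A combined sketch of how these assertions might appear in a case's expectations block (key names are illustrative and may differ from the shipped schema):

    ```yaml
    expectations:
      final_answer:
        contains: "total revenue"
        regex: '\$[0-9,]+'
        numeric:
          value: 48210.75
          tolerance: 0.01
      tools:
        required: ['execute_sql']
        forbidden: ['send_email']
      budgets:
        max_tokens: 20000
        max_cost_usd: 0.05
        max_latency_ms: 15000
        max_iterations: 8
      sql:
        read_only: true
        required_tables: ['orders']
        forbidden_tables: ['raw_pii']
        require_limit: true
        max_rows: 1000
      stability:                    # checked across repeat runs
        max_answer_variants: 1
        max_cost_delta_pct: 20
    ```
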
  • Data Operation Inspectors

    Eval runs now normalize tool activity into data operations so suites can check non-SQL behavior across files, APIs, storage systems, vector search, workflows, and generic tools.
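
    For example, a suite might constrain file and vector-search activity with expectations along these lines (the operation names and keys are illustrative, not a verbatim schema):

    ```yaml
    expectations:
      data_operations:
        required: ['file.read', 'vector.search']
        forbidden: ['storage.delete']
        max_calls: 10
    ```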

  • Skill and Plugin Evaluation

    Eval evidence now includes ExecutionSpan records for skills, plugins, tools, and workflows. Suites can require or forbid skills/plugins, enforce call limits, check latency budgets, and fail on skill/plugin errors.

    Example expectations:

    ```yaml
    expectations:
      skills:
        required: ['schema_discovery']
        max_errors: 0
      plugins:
        required: ['sqlite']
        forbidden: ['s3']
        max_latency_ms: 3000
    ```

  • Structured Eval Artifacts

    Eval runs write a stable artifact set for developers, CI, dashboards, and coding agents:

    • report.json
    • summary.md
    • junit.xml
    • per-case case.json
    • per-run run-XXX.json
    • repeat-run diff.json
    • judge input/output artifacts
    • baseline comparison artifacts

    Artifact privacy controls support truncation, redaction, full-answer inclusion, and tool-output inclusion.
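
    A sketch of what artifact privacy controls could look like in a suite config (keys are illustrative):

    ```yaml
    artifacts:
      include_full_answers: false      # keep long answers out of report.json
      include_tool_outputs: false
      truncate_chars: 2000
      redact_patterns: ['api_key', 'password']
    ```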

  • Dataset-Driven Eval Cases

    Suites can expand JSONL, JSON, or YAML datasets into eval cases using configurable input, ID, expected, and metadata fields. Case templates let teams apply shared expectations to every dataset record.
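
    A dataset-driven suite might map record fields and apply a shared template roughly like this (keys are illustrative):

    ```yaml
    dataset:
      path: datasets/sales-questions.jsonl
      input_field: question
      id_field: id
      expected_field: answer
      metadata_fields: ['region', 'difficulty']
    case_template:                      # applied to every expanded record
      expectations:
        tools:
          required: ['execute_sql']
    ```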

  • Baselines and Regression Comparison

    Eval reports can be recorded as baselines and compared against later runs. Regression policies support new failures, score drops, cost increases, latency increases, and tool-sequence changes.
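
    A regression policy against a recorded baseline could be expressed along these lines (keys are illustrative):

    ```yaml
    baseline:
      path: evals/baselines/sales-agent.json
      fail_on:
        new_failures: true
        score_drop_pct: 5
        cost_increase_pct: 25
        latency_increase_pct: 50
        tool_sequence_changes: true
    ```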

  • Optional Structured LLM Judges

    LLM judges can evaluate cases when deterministic assertions are not enough. Judges support structured criteria, required criteria, weighted scores, score thresholds, judge artifacts, and latency, token, and cost accounting.
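
    A structured judge configuration might look roughly like this (criteria names and keys are illustrative):

    ```yaml
    judge:
      model: gpt-4o-mini               # any supported judge model
      threshold: 0.8                   # minimum weighted score to pass
      criteria:
        - name: correctness
          weight: 0.6
          required: true               # failing a required criterion fails the case
        - name: clarity
          weight: 0.4
    ```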

  • Eval Reporters

    Added pretty terminal output, Markdown summaries, canonical JSON reports, and JUnit XML rendering. Pretty output includes case-level cost, latency, tokens, tool paths, skill/plugin spans, stability summaries, judge summaries, and grouped failures.

  • Eval Test Coverage

    Added focused unit tests plus live integration scenarios for plain agents, tool-calling agents, SQLite data agents, non-SQL data operations, skill/plugin spans, expected eval failures, repeat runs, multi-case suites, datasets, baselines, mock judges, live OpenAI judges, and artifact contracts.