Agent Evals
Agent evals let you test runnable Daita agents with deterministic assertions, data-operation inspectors, baselines, artifacts, skill/plugin checks, and optional structured LLM judges.
#Overview
Agent evals are a developer-preview system for testing whether a Daita agent is working correctly and whether its behavior changed over time. An eval suite loads a runnable agent, runs one or more prompts, evaluates expectations, and writes structured artifacts for humans, CI, dashboards, and coding agents.
Use evals to answer questions like:
- Did the agent produce the expected answer?
- Did it use the right tool, skill, or plugin?
- Did it avoid unsafe SQL or forbidden data operations?
- Did latency, cost, tokens, or tool paths regress?
- Did repeat runs stay stable?
- Did an optional LLM judge agree that the answer met qualitative criteria?
Evals are designed to be deterministic first. LLM judges are optional and should be used when rules cannot capture the quality you care about.
#Quick Start
Create an eval suite YAML file:
```yaml
name: sales-agent-evals
version: 1

agent:
  factory: 'myapp.agents:create_sales_agent'
  kwargs:
    model: gpt-4o-mini

defaults:
  runs: 2
  max_iterations: 8

cases:
  - id: top-products
    prompt: What were the top 5 products by revenue?
    expectations:
      answer:
        contains: ['Widget A']
        numeric:
          - label: revenue
            expected: 12840.50
            tolerance: 0.01
      tools:
        required: ['sqlite_query']
        max_calls: 4
      sql:
        read_only: true
        require_limit: true
        must_include: ['SUM', 'GROUP BY']
        must_not_include: ['DELETE', 'DROP']
      skills:
        required: ['schema_discovery']
        max_errors: 0
      plugins:
        required: ['sqlite']
        forbidden: ['s3']
        max_latency_ms: 3000
      budgets:
        max_tokens: 8000
        max_latency_ms: 15000
      stability:
        require_same_tools: true
        max_answer_variants: 1
```

Run it from Python:
```python
import asyncio

from daita.evals import EvalSuite
from daita.evals.reporters import render_pretty


async def main():
    report = await EvalSuite.from_file("evals/sales-agent.yaml").run()
    print(render_pretty(report))


asyncio.run(main())
```

The daita eval CLI command is planned. Use the Python API while evals are in developer preview.
#Agent Factory Contract
Eval suites load agents through a factory path:
```yaml
agent:
  factory: 'myapp.agents:create_agent'
  kwargs:
    model: gpt-4o-mini
```

Factory rules:
- Use module:function format.
- The factory may be sync or async.
- kwargs are passed directly to the factory.
- The returned object must provide run(prompt, ...).
Preferred runnable behavior:
```python
result = await agent.run(prompt, detailed=True)
```

If detailed=True is not supported, evals fall back to run(prompt) and wrap the string response. Optional start() and stop() hooks are called around the suite when present.
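A minimal factory and agent that satisfy this contract might look like the sketch below. The SalesAgent class, its detailed-result shape, and the answer text are illustrative assumptions; only the contract itself (a module:function factory, kwargs, run(), and the optional start()/stop() hooks) comes from this page.

```python
# myapp/agents.py -- illustrative sketch; the class name and result shape are assumptions.


class SalesAgent:
    def __init__(self, model: str):
        self.model = model

    async def start(self):
        # Optional hook: called once before the suite runs, if present.
        pass

    async def run(self, prompt: str, detailed: bool = False):
        # Return a plain string, or a richer result when detailed=True.
        # The detailed-result shape shown here is a placeholder, not a documented schema.
        answer = f"[{self.model}] answer to: {prompt}"
        return {"answer": answer} if detailed else answer

    async def stop(self):
        # Optional hook: called once after the suite finishes, if present.
        pass


def create_sales_agent(model: str = "gpt-4o-mini") -> SalesAgent:
    # Referenced from suite YAML as factory: 'myapp.agents:create_sales_agent'.
    return SalesAgent(model=model)
```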
#What Evals Can Check
#Answer Expectations
```yaml
expectations:
  answer:
    equals: 'Widget A'
    contains: ['Widget A']
    not_contains: ['I cannot verify']
    regex: ["revenue: \\$?[0-9,.]+"]
    numeric:
      - label: revenue
        expected: 12840.50
        tolerance: 0.01
```

#Tool Expectations
```yaml
expectations:
  tools:
    required: ['sqlite_query']
    forbidden: ['sqlite_execute']
    max_calls: 4
```

#SQL Expectations
```yaml
expectations:
  sql:
    read_only: true
    require_limit: true
    required_tables: ['sales']
    forbidden_tables: ['users_pii']
    must_include: ['SUM', 'GROUP BY']
    must_not_include: ['DELETE', 'DROP', 'SELECT *']
    max_rows_returned: 100
```
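As a rough mental model, these fields assert properties of the SQL that the agent's tools actually ran. The sketch below only illustrates the kind of check each field implies; the framework's own SQL inspector does the real analysis, and the exact matching rules here are assumptions.

```python
# Illustration only: the kind of property each SQL expectation asserts about a
# captured query. The framework's inspector, not this sketch, does the real work.
import re

WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE")


def meets_expectations(sql: str) -> bool:
    upper = sql.upper()
    return (
        not any(re.search(rf"\b{kw}\b", upper) for kw in WRITE_KEYWORDS)       # read_only: true
        and re.search(r"\bLIMIT\b", upper) is not None                          # require_limit: true
        and all(token in upper for token in ("SUM", "GROUP BY"))                # must_include
        and not any(token in upper for token in ("DELETE", "DROP", "SELECT *"))  # must_not_include
    )


print(meets_expectations(
    "SELECT product, SUM(revenue) FROM sales GROUP BY product ORDER BY 2 DESC LIMIT 5"
))  # True
```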
#Data Operation Expectations
Non-SQL inspectors normalize tool activity across files, APIs, storage, vector search, workflows, and generic tools.
```yaml
expectations:
  operations:
    required_categories: ['file', 'api', 'vector']
    forbidden_categories: ['workflow']
    max_write_operations: 0
    max_delete_operations: 0
  files:
    required_read: ['sales.csv']
    forbidden_read: ['secrets.xlsx']
  api:
    required_methods: ['GET']
    forbidden_methods: ['POST', 'DELETE']
    required_hosts: ['api.example.com']
  storage:
    required_buckets: ['analytics']
    forbidden_write: true
  vector:
    max_top_k: 10
    required_filters: ['tenant_id']
```

#Skill and Plugin Expectations
Skill/plugin checks use normalized execution spans when the agent result or trace exposes them.
```yaml
expectations:
  skills:
    required: ['math_reasoning']
    forbidden: ['web_search']
    max_calls: 2
    max_latency_ms: 1000
    max_errors: 0
  plugins:
    required: ['sqlite']
    forbidden: ['s3']
    max_calls: 4
    max_latency_ms: 3000
    max_errors: 0
```

#Budget and Stability Expectations
```yaml
defaults:
  runs: 3

cases:
  - id: stable-answer
    prompt: Answer consistently.
    expectations:
      budgets:
        max_tokens: 8000
        max_cost: 0.05
        max_latency_ms: 15000
        max_iterations: 8
      stability:
        require_same_tools: true
        max_answer_variants: 1
        max_cost_delta_pct: 25
        max_latency_delta_pct: 50
        max_token_delta_pct: 25
```

#Datasets
Suites can expand JSONL, JSON, or YAML records into cases:
```yaml
name: support-agent-evals

agent:
  factory: 'myapp.agents:create_support_agent'

dataset:
  path: 'evals/support-cases.jsonl'
  input_field: 'prompt'
  id_field: 'id'
  expected_field: 'expected'
  case_template:
    expectations:
      tools:
        required: ['search_docs']
```

Example JSONL record:
```json
{ "id": "refund-policy", "prompt": "What is the refund window?", "expected": { "contains": ["30 days"] } }
```
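If you generate datasets programmatically, a small script can emit records in this shape. The field names below mirror the dataset mapping above (id, prompt, expected); the output path and the second record are placeholders for illustration.

```python
# Hypothetical helper: writes a JSONL dataset whose fields match the mapping above.
import json
from pathlib import Path

records = [
    {"id": "refund-policy", "prompt": "What is the refund window?",
     "expected": {"contains": ["30 days"]}},
    {"id": "shipping-time", "prompt": "How long does standard shipping take?",
     "expected": {"contains": ["business days"]}},  # placeholder record
]

path = Path("evals/support-cases.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")
```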
#Baselines
Baselines let you compare a new eval run against a known-good report:
```python
baseline_report = await EvalSuite.from_file("evals/sales-agent.yaml").run(
    record_baseline=True,
    baseline_path=".daita/evals/baselines/sales-agent.json",
)

report = await EvalSuite.from_file("evals/sales-agent.yaml").run(
    compare_baseline=True,
    baseline_path=".daita/evals/baselines/sales-agent.json",
)
```

Regression policies can fail on new failures, score drops, cost increases, latency increases, or tool-sequence changes.
```yaml
baselines:
  fail_on:
    new_failures: true
    score_drop_gt: 0.01
    cost_increase_pct_gt: 25
    latency_increase_pct_gt: 50
    tool_sequence_changed: true
```

#LLM Judges
Use an LLM judge when deterministic checks are not enough. Judges support structured criteria, required criteria, weights, pass thresholds, artifacts, and cost accounting.
```yaml
expectations:
  judge:
    provider: openai
    model: gpt-4o-mini
    pass_score: 0.8
    require_all_criteria_pass: true
    include_tool_outputs: true
    criteria:
      - id: direct_answer
        description: The answer directly answers the user's question.
        required: true
        weight: 0.4
      - id: grounded
        description: The answer only uses facts present in tool outputs.
        required: true
        weight: 0.4
      - id: concise
        description: The answer avoids unrelated explanation.
        required: false
        weight: 0.2
```

By default, judges run only after deterministic assertions pass. Set run_when: always to judge failed deterministic runs too.
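To see how pass_score and weights can interact, here is a small worked example for the criteria above. It assumes the overall score is the weight-normalized sum of passed criteria, which is an assumption about the scoring rule rather than a documented guarantee; require_all_criteria_pass additionally fails the run if any required criterion fails, regardless of score.

```python
# Worked example under an assumed scoring rule: score = passed weight / total weight.
criteria = [
    {"id": "direct_answer", "weight": 0.4, "required": True,  "passed": True},
    {"id": "grounded",      "weight": 0.4, "required": True,  "passed": True},
    {"id": "concise",       "weight": 0.2, "required": False, "passed": False},
]

total = sum(c["weight"] for c in criteria)
score = sum(c["weight"] for c in criteria if c["passed"]) / total  # 0.8

required_ok = all(c["passed"] for c in criteria if c["required"])  # True
verdict = score >= 0.8 and required_ok                             # passes at pass_score: 0.8
print(score, verdict)
```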
#Artifacts
Each eval run writes a stable artifact directory:
```
.daita/evals/runs/<run_id>/
  report.json
  summary.md
  junit.xml
  cases/
    <case_id>/
      case.json
      run-001.json
      diff.json
```

Judge-enabled runs also write judge input and output artifacts. Baseline comparisons write baseline-comparison.json.
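Because every run writes a junit.xml, CI can gate on the artifact without depending on the report object's in-memory shape. A minimal sketch, assuming the default output directory and standard JUnit XML attributes:

```python
# Sketch: fail a CI step if the newest run's junit.xml records any failures.
# Assumes the default .daita/evals/runs output directory and standard JUnit XML.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

runs = sorted(Path(".daita/evals/runs").iterdir(), key=lambda p: p.stat().st_mtime)
root = ET.parse(runs[-1] / "junit.xml").getroot()

failures = sum(int(suite.get("failures", 0)) for suite in root.iter("testsuite"))
errors = sum(int(suite.get("errors", 0)) for suite in root.iter("testsuite"))

if failures or errors:
    sys.exit(f"eval regressions: {failures} failures, {errors} errors")
```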
Artifact privacy controls:
```yaml
artifacts:
  output_dir: '.daita/evals/runs'
  max_chars: 50000
  include_full_answers: true
  include_tool_outputs: false
  redact_patterns:
    - 'sk-[A-Za-z0-9]+'
```
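Before trusting a redact pattern with real secrets, it is worth checking it against a sample of what your tools emit. The standalone check below uses only the pattern shown above; how the framework applies redaction internally is up to the framework, so treat the replacement text as illustrative.

```python
# Standalone check of the redact pattern above; the replacement text is illustrative.
import re

pattern = re.compile(r"sk-[A-Za-z0-9]+")
sample = "Called OpenAI with api_key=sk-abc123DEF456 and got 200 OK"

print(pattern.sub("[REDACTED]", sample))
# Called OpenAI with api_key=[REDACTED] and got 200 OK
```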
#Failure Output
Pretty output is designed to be readable in terminals and CI logs:
```
Daita Eval: live-expected-failure-evals
Run: 2026-05-05T00-55-41Z  Agent: LiveEvalSkillPlugin  Model: gpt-4o-mini

Summary
  Cases: 0 passed / 1 failed / 0 warned
  Runs:  0 passed / 1 failed
  Score: 0.0%
  Cost:  $0.0008
  Time:  2.7s

Cases
  FAILED forbidden-execution-path  0/1 runs  $0.0008  2.7s  405 tokens
    tools:   multiply
    skills:  math_reasoning.plan 35ms
    plugins: calculator.multiply 12ms

Failures
  FAIL forbidden-execution-path run-001
    - forbidden_skill_called: Forbidden skill was used: math_reasoning.
    - forbidden_plugin_called: Forbidden plugin was used: calculator.
```

The same failure details are available in report.json with stable failure codes, assertion paths, observed values, expected values, artifact paths, and fix hints when available.
#Next Steps
- Agent - Build runnable agents to evaluate
- Tools - Define tool surfaces that evals can inspect
- Skills - Package reusable capabilities and evaluate skill usage
- Tracing - Understand execution traces and plugin spans
- Data Assertions - Enforce row-level data quality inside tools