
# DataQuality Plugin

Analytical data quality for agents — statistical profiling, anomaly detection, freshness checks, and consolidated quality reporting on top of any database plugin.

## Installation

No additional packages are required beyond your database plugin. For scipy-based z-score anomaly detection, optionally install:

```bash
pip install scipy
```

## Quick Start

```python
from daita import Agent
from daita.plugins import postgresql, data_quality

db = postgresql(host="localhost", database="analytics")
dq = data_quality(db=db)

agent = Agent(
    name="Quality Checker",
    prompt="You are a data quality analyst. Profile tables and flag issues.",
    tools=[db, dq]
)

await agent.start()
result = await agent.run("Profile the orders table and flag any anomalies")
```

## Configuration

```python
data_quality(
    db=None,          # Any BaseDatabasePlugin instance — required at execution time
    backend=None,     # Optional graph backend for persisting reports (auto-selected if None)
    thresholds=None,  # Anomaly detection sensitivity (see below)
)
```

## Parameters

- `db` (`BaseDatabasePlugin`): The database plugin to run quality checks against. Required when tools are called — can be omitted at construction and provided later.
- `backend`: Optional graph backend for persisting quality reports as stable nodes. Auto-selected at agent start if not provided.
- `thresholds` (`dict`): Anomaly detection thresholds. Defaults: `{"z_score": 3.0, "iqr_multiplier": 1.5}`

## Usage

### Column Profiling

Profile null rates, cardinality, and min/max/avg per column:

```python
from daita.plugins import sqlite, data_quality

async with sqlite(path="app.db") as db:
    dq = data_quality(db=db)
    report = await dq.profile(db, "orders")
    # Returns per-column stats: null_rate, cardinality, min, max, avg
```
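Assuming the report is a per-column mapping of stats (the exact return shape may differ — check the plugin's output), one way to surface problem columns from it might look like:

```python
# Hypothetical report shape: {column_name: {"null_rate": ..., "cardinality": ...}}
def columns_with_high_nulls(report: dict, max_null_rate: float = 0.1) -> list[str]:
    """Return column names whose null rate exceeds the given threshold."""
    return [
        col for col, stats in report.items()
        if stats.get("null_rate", 0.0) > max_null_rate
    ]

sample = {
    "id":    {"null_rate": 0.0,  "cardinality": 1000},
    "email": {"null_rate": 0.25, "cardinality": 740},
}
print(columns_with_high_nulls(sample))  # → ['email']
```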

### Anomaly Detection

Detect statistical outliers in a numeric column:

```python
from daita.plugins import sqlite, data_quality

async with sqlite(path="app.db") as db:
    dq = data_quality(db=db)
    result = await dq.detect_anomaly(db, "transactions", "amount")
    # Returns rows where amount is a statistical outlier
```

Uses numpy by default; scipy z-scores if installed. Thresholds are configurable:

```python
dq = data_quality(db=db, thresholds={"z_score": 2.5, "iqr_multiplier": 1.5})
```
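To make the two thresholds concrete, here is a minimal stdlib sketch of how z-score and IQR outlier tests typically work. This is illustrative only; the plugin's internal logic may differ:

```python
import statistics

def zscore_outliers(values, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

def iqr_outliers(values, multiplier=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lo or v > hi]

amounts = [10, 12, 11, 13, 12, 11, 500]
print(iqr_outliers(amounts))  # → [500]
```

A lower `z_score` or `iqr_multiplier` widens the net (more rows flagged); a higher value flags only extreme outliers.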

### Freshness Checks

Validate that a timestamp column is within an expected recency window:

```python
from daita.plugins import sqlite, data_quality

async with sqlite(path="app.db") as db:
    dq = data_quality(db=db)
    result = await dq.check_freshness(
        db, "events", "created_at",
        expected_interval_hours=24,
    )
    # Returns staleness info; is_fresh=False if data is older than 24 hours
```

### Quality Report

Generate a consolidated quality report across all columns and persist it:

```python
from daita.plugins import sqlite, data_quality

async with sqlite(path="app.db") as db:
    dq = data_quality(db=db)
    report = await dq.report(db, "orders")
    # Returns profiling + completeness score, persisted as a stable graph node
```
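One plausible way a completeness score could be derived from per-column null rates (illustrative; the plugin's actual scoring formula is not documented here):

```python
def completeness_score(null_rates: dict[str, float]) -> float:
    """Average fraction of non-null values across columns, in [0, 1]."""
    if not null_rates:
        return 1.0
    return 1.0 - sum(null_rates.values()) / len(null_rates)

print(round(completeness_score({"id": 0.0, "email": 0.2, "phone": 0.4}), 2))  # → 0.8
```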

## Using with Agents

```python
from daita import Agent
from daita.plugins import postgresql, data_quality
import os

db = postgresql(
    host=os.getenv("DB_HOST"),
    database=os.getenv("DB_NAME"),
    username=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
dq = data_quality(db=db)

agent = Agent(
    name="Quality Monitor",
    prompt="You are a data quality monitor. Profile tables, detect anomalies, and report issues.",
    tools=[db, dq]
)

await agent.start()
result = await agent.run("""
    1. Profile the transactions table
    2. Check freshness of the events table (expect data within last 6 hours)
    3. Detect anomalies in the revenue column
    4. Generate a full quality report
""")
await agent.stop()
```

## Available Tools

| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `dq_profile` | Column-level profiling (null rates, cardinality, min/max/avg) | `table` (required) |
| `dq_detect_anomaly` | Statistical outlier detection on a numeric column | `table`, `column` (required) |
| `dq_check_freshness` | Validates a timestamp column is within a recency window | `table`, `timestamp_column` (required); `expected_interval_hours` (optional, default 24) |
| `dq_report` | Consolidated quality report, persisted as a graph node | `table` (required) |

## Dialect Support

DataQuality works with all SQL database plugins:

| Plugin | Supported |
|--------|-----------|
| SQLite | Yes |
| PostgreSQL | Yes |
| MySQL | Yes |
| Snowflake | Yes |

Column discovery uses `pragma_table_info` for SQLite and `information_schema.columns` for all other dialects.
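As a sketch, column discovery along those lines might branch on the dialect like this (hypothetical helper; the plugin's real query text and placeholder style may differ):

```python
def column_discovery_query(dialect: str, table: str) -> tuple[str, tuple]:
    """Return (sql, params) for listing a table's columns per dialect."""
    if dialect == "sqlite":
        # SQLite exposes table metadata via the pragma_table_info() table function
        return "SELECT name FROM pragma_table_info(?)", (table,)
    # PostgreSQL, MySQL, and Snowflake all support the ANSI information_schema
    return (
        "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
        (table,),
    )

sql, params = column_discovery_query("sqlite", "orders")
print(sql)  # → SELECT name FROM pragma_table_info(?)
```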

## Combining with ItemAssertion

For enforcement at query time (rather than analytical profiling), use `ItemAssertion` with `query_checked()` directly on the database plugin:

```python
from daita.plugins import postgresql
from daita import ItemAssertion

async with postgresql(host="localhost", database="app") as db:
    rows = await db.query_checked(
        "SELECT * FROM orders",
        assertions=[
            ItemAssertion(lambda r: r["total"] > 0, "Order total must be positive"),
            ItemAssertion(lambda r: r["status"] in ("pending", "shipped", "delivered"), "Invalid status"),
        ],
    )
```

DataQualityPlugin is best for analytical quality checks run by an agent. ItemAssertion is best for enforcement — asserting guarantees at the point of data consumption.

## Error Handling

```python
from daita.plugins import postgresql, data_quality
from daita import DataQualityError

db = postgresql(host="localhost", database="app")
dq = data_quality(db=db)

try:
    report = await dq.profile(db, "orders")
except ValueError as e:
    # No db configured
    print(f"Configuration error: {e}")
except DataQualityError as e:
    print(f"Quality violations: {e.violations}")
```

## Next Steps