A Practical Guide to Data Quality Checks

Data pipelines that run successfully but produce wrong data are worse than pipelines that fail loudly. Quality checks are your last line of defense before bad rows reach dashboards, ML models, and regulatory reports.

DataXPipe treats checks as first-class pipeline citizens: they live in the spec, ship with generated test harnesses, and post structured results to the Catalog API. This guide covers how to design checks that are fast, meaningful, and actionable.

The check lifecycle

Every check in DataXPipe follows the same lifecycle:

Declared in the pipeline spec with an ID, SQL file, and severity.
Generated into checks/ and tests/ directories with HTTP posting logic.
Executed by Airflow after transforms complete (or independently in CI).
Recorded in the Catalog via POST /checks/results.
Queried by operators, the app UI, and downstream alerting systems.

Understanding this flow helps you decide where checks belong—inline in the DAG versus scheduled separately versus gated in CI.

Severity levels that teams actually use

DataXPipe supports at least two severity tiers in generated scaffolds:

Severity	Behavior	Example
`error`	Fail the pipeline run; block downstream marts	`NOT NULL` on primary key columns
`warn`	Record failure but allow run to succeed	Freshness slightly outside SLA

Reserve error for conditions that make downstream data misleading if consumed. Use warn for degraded but interpretable data—late arrivals, minor row-count drift, optional field sparsity.

Avoid a third unofficial tier (“info”) in production check suites. Too many low-severity checks create alert fatigue and get ignored when a real error fires.

The essential check trio

Start every production table with these three checks before adding domain-specific rules.

1. Schema check

Verify expected columns exist with correct types. Schema drift from upstream ingestion is the fastest path to broken BI extracts.

-- checks/chk_schema.sql
-- Validates required columns exist (adapter-specific syntax)
SELECT COUNT(*) AS violations
FROM (
  SELECT 1
  FROM information_schema.columns
  WHERE table_schema = 'analytics'
    AND table_name = 'orders_clean'
    AND column_name IN ('order_id', 'customer_id', 'order_ts', 'amount_cents')
  HAVING COUNT(DISTINCT column_name) < 4
) AS missing;

A result of violations > 0 means the table schema no longer matches the contract.

2. Not-null check on grain columns

Define the grain of your table (usually primary or natural keys) and enforce it:

-- checks/chk_not_null_order_id.sql
SELECT COUNT(*) AS violations
FROM analytics.orders_clean
WHERE order_id IS NULL;

Run similar checks on every column that defines uniqueness or join integrity.

3. Freshness check

Stale data is the most common production incident for batch pipelines:

-- checks/chk_freshness.sql
SELECT COUNT(*) AS violations
FROM analytics.orders_clean
WHERE MAX(order_ts) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 26 HOUR);

Set thresholds from your SLA plus buffer. A daily 6 AM pipeline with warn at 26 hours and error at 48 hours gives operators time to investigate before executives notice.

Domain-specific checks worth adding

Once the trio is green, layer checks that encode business invariants:

Referential integrity. Orphan foreign keys after joins indicate upstream quality issues or incorrect join logic:

SELECT COUNT(*) AS violations
FROM analytics.orders_clean o
LEFT JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

Row-count bounds. Sudden 10x growth or near-zero rows often signals duplicate ingestion or failed filters:

SELECT CASE
  WHEN cnt = 0 THEN 1
  WHEN cnt > 10000000 THEN 1
  ELSE 0
END AS violations
FROM (SELECT COUNT(*) AS cnt FROM analytics.orders_clean);

Distribution checks. For financial amounts, flag negative totals or values beyond reasonable percentiles—these catch unit conversion bugs early.

Declaring checks in the spec

Checks belong in the pipeline spec alongside transforms:

checks:
  - id: chk_not_null_order_id
    sql: checks/chk_not_null_order_id.sql
    severity: error

  - id: chk_freshness
    sql: checks/chk_freshness.sql
    severity: warn
    threshold_hours: 26

  - id: chk_schema
    sql: checks/chk_schema.sql
    severity: error

The generator creates runnable scripts in tests/ that:

Execute the SQL against the target warehouse.
Interpret zero rows (or zero violations) as pass.
POST JSON results to the Catalog.

Example result payload:

{
  "pipeline": "orders_sync",
  "check_id": "chk_not_null_order_id",
  "status": "pass",
  "violations": 0,
  "executed_at": "2025-05-15T06:42:11Z"
}

Running checks outside Airflow

Checks are not tied to the scheduler. Run them in CI after integration tests against a staging warehouse:

$env:DATAXPIPE_URL = "https://catalog.staging.example.com"
$env:DATAXPIPE_API_KEY = "ci-key"
python generated/orders_sync/tests/test_chk_not_null_order_id.py

This pattern catches regressions before DAG deployment and is especially valuable when multiple teams contribute SQL transforms.

Alerting on check results

Query the Catalog for recent failures:

curl "http://localhost:8000/checks/results?pipeline=orders_sync&status=fail"

Effective alerting rules:

error failure → page on-call immediately with pipeline name, check ID, violation count, and link to the failed run.
warn failure → Slack notification; auto-escalate to page if the same check fails three consecutive runs.
Pass after prior fail → send recovery notification so teams know data is trustworthy again.

Correlate check failures with lineage queries to identify whether the root cause is upstream ingestion or the transform itself.

Performance considerations

Checks run against production-scale tables. Keep them fast:

Filter to recent partitions when checking freshness or row counts on large tables.
Avoid SELECT * scans—aggregate early (COUNT, MAX, MIN).
Use warehouse-native metadata (information_schema, INFORMATION_SCHEMA.COLUMNS) for schema checks instead of sampling rows.
Set statement timeouts via DATAXPIPE_TIMEOUT_SECONDS so a stuck check does not block the DAG indefinitely.

Target sub-minute execution for the essential trio; domain checks can run longer but should stay under your Airflow task timeout budget.

Testing your checks

The generated tests/test_chk_*.py files are executable specifications. Extend them with fixture data in staging:

Load a known-bad row (null key) into staging.
Run the check script; assert it posts status: fail.
Remove the bad row; assert recovery posts status: pass.

Treat check SQL with the same review rigor as transform SQL—bugs in checks (wrong threshold, inverted logic) create false confidence.

Maturity model

Stage	Focus	Catalog integration
1	Schema + not-null on critical marts	Manual result review
2	Freshness + row-count bounds on all marts	Automated Slack alerts
3	Referential integrity + domain invariants	PagerDuty on `error` fails
4	CI-gated checks before deploy	Trend dashboards in app UI

Most teams reach stage 2 within a quarter. Stages 3–4 require organizational buy-in but deliver the highest ROI during incidents.

Summary

Checks are only valuable when they are trusted, fast, and wired to action. Declare them in your pipeline spec, start with schema/not-null/freshness, post results to the Catalog on every run, and alert on error severity without hesitation.

Ready to implement? Begin with Getting Started with DataXPipe to generate your first check suite, then use lineage best practices to trace failures upstream when checks fire.