A Practical Guide to Data Quality Checks
Design SQL-based checks with appropriate severity levels, integrate results with the DataXPipe Catalog, and build alerting workflows that catch issues before stakeholders do.
- data-quality
- checks
- observability
Data pipelines that run successfully but produce wrong data are worse than pipelines that fail loudly. Quality checks are your last line of defense before bad rows reach dashboards, ML models, and regulatory reports.
DataXPipe treats checks as first-class pipeline citizens: they live in the spec, ship with generated test harnesses, and post structured results to the Catalog API. This guide covers how to design checks that are fast, meaningful, and actionable.
The check lifecycle
Every check in DataXPipe follows the same lifecycle:
- Declared in the pipeline spec with an ID, SQL file, and severity.
- Generated into
checks/andtests/directories with HTTP posting logic. - Executed by Airflow after transforms complete (or independently in CI).
- Recorded in the Catalog via
POST /checks/results. - Queried by operators, the app UI, and downstream alerting systems.
Understanding this flow helps you decide where checks belong—inline in the DAG versus scheduled separately versus gated in CI.
Severity levels that teams actually use
DataXPipe supports at least two severity tiers in generated scaffolds:
| Severity | Behavior | Example |
|---|---|---|
error | Fail the pipeline run; block downstream marts | NOT NULL on primary key columns |
warn | Record failure but allow run to succeed | Freshness slightly outside SLA |
Reserve error for conditions that make downstream data misleading if consumed. Use warn for degraded but interpretable data—late arrivals, minor row-count drift, optional field sparsity.
Avoid a third unofficial tier (“info”) in production check suites. Too many low-severity checks create alert fatigue and get ignored when a real error fires.
The essential check trio
Start every production table with these three checks before adding domain-specific rules.
1. Schema check
Verify expected columns exist with correct types. Schema drift from upstream ingestion is the fastest path to broken BI extracts.
-- checks/chk_schema.sql
-- Validates required columns exist (adapter-specific syntax)
SELECT COUNT(*) AS violations
FROM (
SELECT 1
FROM information_schema.columns
WHERE table_schema = 'analytics'
AND table_name = 'orders_clean'
AND column_name IN ('order_id', 'customer_id', 'order_ts', 'amount_cents')
HAVING COUNT(DISTINCT column_name) < 4
) AS missing;
A result of violations > 0 means the table schema no longer matches the contract.
2. Not-null check on grain columns
Define the grain of your table (usually primary or natural keys) and enforce it:
-- checks/chk_not_null_order_id.sql
SELECT COUNT(*) AS violations
FROM analytics.orders_clean
WHERE order_id IS NULL;
Run similar checks on every column that defines uniqueness or join integrity.
3. Freshness check
Stale data is the most common production incident for batch pipelines:
-- checks/chk_freshness.sql
SELECT COUNT(*) AS violations
FROM analytics.orders_clean
WHERE MAX(order_ts) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 26 HOUR);
Set thresholds from your SLA plus buffer. A daily 6 AM pipeline with warn at 26 hours and error at 48 hours gives operators time to investigate before executives notice.
Domain-specific checks worth adding
Once the trio is green, layer checks that encode business invariants:
Referential integrity. Orphan foreign keys after joins indicate upstream quality issues or incorrect join logic:
SELECT COUNT(*) AS violations
FROM analytics.orders_clean o
LEFT JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
Row-count bounds. Sudden 10x growth or near-zero rows often signals duplicate ingestion or failed filters:
SELECT CASE
WHEN cnt = 0 THEN 1
WHEN cnt > 10000000 THEN 1
ELSE 0
END AS violations
FROM (SELECT COUNT(*) AS cnt FROM analytics.orders_clean);
Distribution checks. For financial amounts, flag negative totals or values beyond reasonable percentiles—these catch unit conversion bugs early.
Declaring checks in the spec
Checks belong in the pipeline spec alongside transforms:
checks:
- id: chk_not_null_order_id
sql: checks/chk_not_null_order_id.sql
severity: error
- id: chk_freshness
sql: checks/chk_freshness.sql
severity: warn
threshold_hours: 26
- id: chk_schema
sql: checks/chk_schema.sql
severity: error
The generator creates runnable scripts in tests/ that:
- Execute the SQL against the target warehouse.
- Interpret zero rows (or zero violations) as pass.
- POST JSON results to the Catalog.
Example result payload:
{
"pipeline": "orders_sync",
"check_id": "chk_not_null_order_id",
"status": "pass",
"violations": 0,
"executed_at": "2025-05-15T06:42:11Z"
}
Running checks outside Airflow
Checks are not tied to the scheduler. Run them in CI after integration tests against a staging warehouse:
$env:DATAXPIPE_URL = "https://catalog.staging.example.com"
$env:DATAXPIPE_API_KEY = "ci-key"
python generated/orders_sync/tests/test_chk_not_null_order_id.py
This pattern catches regressions before DAG deployment and is especially valuable when multiple teams contribute SQL transforms.
Alerting on check results
Query the Catalog for recent failures:
curl "http://localhost:8000/checks/results?pipeline=orders_sync&status=fail"
Effective alerting rules:
errorfailure → page on-call immediately with pipeline name, check ID, violation count, and link to the failed run.warnfailure → Slack notification; auto-escalate to page if the same check fails three consecutive runs.- Pass after prior fail → send recovery notification so teams know data is trustworthy again.
Correlate check failures with lineage queries to identify whether the root cause is upstream ingestion or the transform itself.
Performance considerations
Checks run against production-scale tables. Keep them fast:
- Filter to recent partitions when checking freshness or row counts on large tables.
- Avoid
SELECT *scans—aggregate early (COUNT,MAX,MIN). - Use warehouse-native metadata (
information_schema,INFORMATION_SCHEMA.COLUMNS) for schema checks instead of sampling rows. - Set statement timeouts via
DATAXPIPE_TIMEOUT_SECONDSso a stuck check does not block the DAG indefinitely.
Target sub-minute execution for the essential trio; domain checks can run longer but should stay under your Airflow task timeout budget.
Testing your checks
The generated tests/test_chk_*.py files are executable specifications. Extend them with fixture data in staging:
- Load a known-bad row (null key) into staging.
- Run the check script; assert it posts
status: fail. - Remove the bad row; assert recovery posts
status: pass.
Treat check SQL with the same review rigor as transform SQL—bugs in checks (wrong threshold, inverted logic) create false confidence.
Maturity model
| Stage | Focus | Catalog integration |
|---|---|---|
| 1 | Schema + not-null on critical marts | Manual result review |
| 2 | Freshness + row-count bounds on all marts | Automated Slack alerts |
| 3 | Referential integrity + domain invariants | PagerDuty on error fails |
| 4 | CI-gated checks before deploy | Trend dashboards in app UI |
Most teams reach stage 2 within a quarter. Stages 3–4 require organizational buy-in but deliver the highest ROI during incidents.
Summary
Checks are only valuable when they are trusted, fast, and wired to action. Declare them in your pipeline spec, start with schema/not-null/freshness, post results to the Catalog on every run, and alert on error severity without hesitation.
Ready to implement? Begin with Getting Started with DataXPipe to generate your first check suite, then use lineage best practices to trace failures upstream when checks fire.