Monitoring Pipeline Runs in DataXPipe
Track run status, correlate check failures, set up alerting, and use Catalog APIs and Prometheus metrics to observe pipeline health in production.
Pipeline success in Airflow does not guarantee trustworthy data. DataXPipe records every run lifecycle event and check result in the Catalog, giving operators a unified observability layer beyond task logs. This guide covers run monitoring, alerting patterns, and metrics integration.
Run lifecycle in the Catalog
Generated DAGs POST run records at key lifecycle points:
running → (transforms + checks) → success | failed
Each record includes pipeline ID, timestamps, status, and optional error context. Query runs via API or the product UI at app.dataxpipe.com.
# Recent runs for a pipeline
curl "https://api.dataxpipe.com/api/v1/pipelines/orders_sync/runs" `
-H "X-API-KEY: dxp_your_key"
# Single run detail
curl "https://api.dataxpipe.com/api/v1/runs/run-abc123" `
-H "X-API-KEY: dxp_your_key"
Run records link to check results via run_id, enabling drill-down from a failed run to specific quality violations.
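For scripted monitoring, the same two calls can be made from Python. This is a minimal sketch using only the standard library. The base URL, paths, and X-API-KEY header are taken from the curl examples above; the helper names and the assumption that responses are JSON are illustrative, not part of a documented client.

```python
import json
import urllib.request

API_BASE = "https://api.dataxpipe.com/api/v1"


def recent_runs_path(pipeline_id):
    """Path for the recent runs of one pipeline."""
    return f"/pipelines/{pipeline_id}/runs"


def run_detail_path(run_id):
    """Path for a single run record."""
    return f"/runs/{run_id}"


def build_request(path, api_key):
    """Build an authenticated GET request for a Catalog API path."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        headers={"X-API-KEY": api_key},
        method="GET",
    )


def fetch_json(path, api_key, timeout=10):
    """Send the request and decode the body (assumes a JSON response)."""
    request = build_request(path, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json(recent_runs_path("orders_sync"), "dxp_your_key")` mirrors the first curl command. Keeping request construction separate from the network call makes the URL and header logic easy to test without hitting the API.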
The monitoring stack
| Layer | Source | What it tells you |
|---|---|---|
| Orchestration | Airflow task logs | Task-level failures, retries |
| Catalog runs | GET /runs/ | Pipeline-level success/failure |
| Check results | GET /checks/results | Data quality pass/fail |
| Lineage | GET /lineage/{id} | Blast radius of failures |
| Metrics | GET /metrics | Request latency, error rates |
Airflow tells you whether tasks executed. The Catalog tells you whether data is trustworthy.
Alerting rules
Effective on-call alerting combines run status with check severity:
Tier 1: Pipeline failure
Trigger when run status is failed:
curl "https://api.dataxpipe.com/api/v1/pipelines/orders_sync/runs?status=failed&limit=1" `
-H "X-API-KEY: dxp_your_key"
Page on-call with pipeline name, run ID, and Airflow log link.
Tier 2: Check failure (error severity)
Query failed checks with error severity. These indicate that the data itself is wrong or misleading:
curl "https://api.dataxpipe.com/api/v1/checks/results?pipeline=orders_sync&status=fail&severity=error" `
-H "X-API-KEY: dxp_your_key"
Include violation count and failure sample rows in the alert payload.
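One way to assemble that payload is sketched below. The response schema for check results is not shown in this guide, so every key read from `result` (`check_name`, `violation_count`, `sample_rows`, and so on) is a hypothetical field name; adjust them to match what your Catalog returns.

```python
def build_check_alert(result, max_samples=5):
    """Summarize one failed check result for an alert message.

    All keys read from `result` are assumed field names, not a documented schema.
    """
    # Cap the sample rows so the alert stays small enough for a pager message.
    samples = list(result.get("sample_rows", []))[:max_samples]
    return {
        "pipeline": result.get("pipeline"),
        "run_id": result.get("run_id"),
        "check": result.get("check_name"),
        "severity": result.get("severity"),
        "violation_count": result.get("violation_count", 0),
        "sample_rows": samples,
    }
```

Capping the number of sample rows keeps the alert readable; the full set remains available through the check results endpoint.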
Tier 3: Check failure (warn severity)
Route to Slack, not the pager. Auto-escalate if the same check fails in three consecutive runs:
warn fail × 3 → page on-call
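The routing and escalation rules for check failures can be expressed as a small pure function. This is a sketch under stated assumptions: the `fail` status and `error`/`warn` severities come from the queries above, while the function names and the `"page"`/`"slack"` route labels are illustrative.

```python
def consecutive_failures(statuses):
    """Count trailing 'fail' results; `statuses` is ordered oldest to newest."""
    count = 0
    for status in reversed(statuses):
        if status != "fail":
            break
        count += 1
    return count


def route_check_failure(severity, statuses, escalate_after=3):
    """Return 'page', 'slack', or None for the latest result of one check."""
    # Nothing to route if the most recent run did not fail this check.
    if not statuses or statuses[-1] != "fail":
        return None
    # Error-severity failures always page.
    if severity == "error":
        return "page"
    # Warn-severity failures page only after enough consecutive failures.
    if consecutive_failures(statuses) >= escalate_after:
        return "page"
    return "slack"
```

For example, `route_check_failure("warn", ["pass", "fail", "fail"])` stays in Slack, and a third consecutive failure moves it to the pager.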
Recovery notifications
When a check transitions from fail to pass, notify stakeholders that data is trustworthy again. Silent recovery leaves teams uncertain.
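Detecting that transition only requires comparing each result with the one before it. The sketch below assumes you already have a per-check history as `(run_id, status)` pairs ordered oldest to newest; that shape and the function name are illustrative.

```python
def recovery_events(history):
    """Return the run IDs where a check moved from 'fail' to 'pass'.

    `history` is a list of (run_id, status) tuples, oldest first.
    """
    events = []
    # Walk adjacent pairs: (previous result, current result).
    for (_, previous), (run_id, current) in zip(history, history[1:]):
        if previous == "fail" and current == "pass":
            events.append(run_id)
    return events
```

Only the first passing run after a failure is reported, so stakeholders get one recovery notice rather than one per healthy run.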
Correlating runs with lineage
When a check on orders_enriched fails, query lineage to find the root cause:
curl "https://api.dataxpipe.com/api/v1/lineage/orders_enriched" `
-H "X-API-KEY: dxp_your_key"
If upstream clean_orders also failed checks, fix ingestion first. If only the transform failed, focus on SQL logic.
This pattern shortens time to resolution because it narrows the investigation to a single stage before anyone starts searching the warehouse by hand.
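The upstream-versus-transform decision can be sketched as follows. The lineage response format is not shown in this guide, so the function assumes you have already extracted the upstream dataset names into a list and collected the datasets with failed checks into a set; both inputs and the function name are hypothetical.

```python
def triage(upstream, failed_datasets):
    """Decide where to look first after a check failure.

    `upstream` lists the dataset names feeding the failing dataset;
    `failed_datasets` is the set of datasets with failed checks in this run.
    """
    failed_upstream = [name for name in upstream if name in failed_datasets]
    if failed_upstream:
        # An upstream dataset is also failing: fix ingestion before the transform.
        return {"focus": "ingestion", "fix_first": failed_upstream}
    # Upstream looks healthy: the problem is likely in the transform's SQL logic.
    return {"focus": "transform", "fix_first": []}
```

With `upstream=["clean_orders"]` and both datasets failing, the result points at ingestion; if only orders_enriched failed, it points at the transform.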
Prometheus metrics
Enable metrics on the Catalog API:
$env:DATAXPIPE_ENABLE_METRICS = "true"
pip install prometheus_client
Scrape GET /metrics for:
- HTTP request count and duration by endpoint
- Active database connections
- Check execution latency
Grafana dashboards combining Catalog metrics with Airflow exporter metrics provide end-to-end SLO tracking.
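For a quick check outside Grafana, the text returned by GET /metrics can be read with a few lines of Python. This is a simplified sketch of parsing the Prometheus text exposition format, not a full parser, and the metric names used in the usage note (`http_requests_total`, `db_connections_active`) are hypothetical; check the actual /metrics output for the names your Catalog exposes.

```python
def sum_metric(exposition_text, metric_name):
    """Sum all samples of one metric across its label combinations."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Labeled sample: name{labels} value [timestamp]
            name, _, rest = line.partition("{")
            _, _, tail = rest.rpartition("}")
        else:
            # Unlabeled sample: name value [timestamp]
            name, _, tail = line.partition(" ")
        if name.strip() != metric_name:
            continue
        fields = tail.split()
        if fields:
            total += float(fields[0])
    return total
```

For instance, `sum_metric(body, "http_requests_total")` adds up request counts across all endpoints and status codes in a scraped body. For production dashboards, a Prometheus server scraping the endpoint remains the better fit.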
Incident workflow
Recommended triage sequence:
- Alert fires on run or check failure
- Open run detail in product UI; note run_id
- Query check results for violation counts and samples
- Query lineage upstream to isolate source vs transform
- Fix root cause; trigger manual rerun
- Confirm recovery check passes; close incident
Document this workflow in your team runbook.
See the data quality checks guide.