Monitoring Pipeline Runs in DataXPipe

Pipeline success in Airflow does not guarantee trustworthy data. DataXPipe records every run lifecycle event and check result in the Catalog, giving operators a unified observability layer beyond task logs. This guide covers run monitoring, alerting patterns, and metrics integration.

Run lifecycle in the Catalog

Generated DAGs POST run records at key lifecycle points:

running → (transforms + checks) → success | failed

Each record includes pipeline ID, timestamps, status, and optional error context. Query runs via API or the product UI at app.dataxpipe.com.

# Recent runs for a pipeline
curl "https://api.dataxpipe.com/api/v1/pipelines/orders_sync/runs" `
  -H "X-API-KEY: dxp_your_key"

# Single run detail
curl "https://api.dataxpipe.com/api/v1/runs/run-abc123" `
  -H "X-API-KEY: dxp_your_key"

Run records link to check results via run_id, enabling drill-down from a failed run to specific quality violations.
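The drill-down from a run to its failed checks can be sketched in Python. This is a minimal illustration, not the Catalog's documented client: the helper names (build_request, fetch_json, failed_checks_for_run) and the JSON field names "id", "run_id", and "status" are assumptions, since the response schema is not shown in this guide.

```python
import json
import urllib.request
from typing import Dict, List

BASE_URL = "https://api.dataxpipe.com/api/v1"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for a Catalog API path."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"X-API-KEY": api_key},
        method="GET",
    )


def fetch_json(path: str, api_key: str):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(build_request(path, api_key), timeout=10) as resp:
        return json.load(resp)


def failed_checks_for_run(run: Dict, check_results: List[Dict]) -> List[Dict]:
    """Return the failed check results that belong to the given run.

    Assumes each run carries an "id" and each check result carries
    "run_id" and "status"; these field names are illustrative.
    """
    return [
        c for c in check_results
        if c.get("run_id") == run.get("id") and c.get("status") == "fail"
    ]


# Offline usage with hand-written sample data (no network call is made).
sample_run = {"id": "run-abc123", "status": "failed"}
sample_checks = [
    {"run_id": "run-abc123", "check": "not_null_order_id", "status": "fail"},
    {"run_id": "run-abc123", "check": "row_count", "status": "pass"},
    {"run_id": "run-zzz999", "check": "row_count", "status": "fail"},
]
violations = failed_checks_for_run(sample_run, sample_checks)
```

In a live setup, fetch_json("/runs/run-abc123", key) would supply the run and the check results endpoint would supply the list; adjust the field names to whatever the API actually returns.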

The monitoring stack

Layer	Source	What it tells you
Orchestration	Airflow task logs	Task-level failures, retries
Catalog runs	`GET /runs/`	Pipeline-level success/failure
Check results	`GET /checks/results`	Data quality pass/fail
Lineage	`GET /lineage/{id}`	Blast radius of failures
Metrics	`GET /metrics`	Request latency, error rates

Airflow tells you whether tasks executed. The Catalog tells you whether data is trustworthy.

Alerting rules

Effective on-call alerting combines run status with check severity:

Tier 1: Pipeline failure

Trigger when run status is failed:

curl "https://api.dataxpipe.com/api/v1/pipelines/orders_sync/runs?status=failed&limit=1" `
  -H "X-API-KEY: dxp_your_key"

Page on-call with pipeline name, run ID, and Airflow log link.

Tier 2: Check failure (error severity)

Query failed checks with error severity. These indicate that the pipeline produced misleading data, even if the run itself succeeded:

curl "https://api.dataxpipe.com/api/v1/checks/results?pipeline=orders_sync&status=fail&severity=error" `
  -H "X-API-KEY: dxp_your_key"

Include the violation count and a sample of failing rows in the alert payload.

Tier 3: Check failure (warn severity)

Route to Slack, not the pager. Auto-escalate if the same check fails in three consecutive runs:

warn fail × 3 → page on-call
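The three tiers can be expressed as one routing function. The sketch below is an assumption-laden illustration: CheckResult, route_alert, the return values "page", "slack", and "none", and the shape of the status history are all hypothetical names chosen for this example. The history is assumed to list the check's status per run, oldest first, including the current run.

```python
from dataclasses import dataclass
from typing import List, Optional

ESCALATION_THRESHOLD = 3  # consecutive warn failures before paging


@dataclass
class CheckResult:
    name: str
    status: str    # "pass" or "fail"
    severity: str  # "error" or "warn"


def consecutive_failures(history: List[str]) -> int:
    """Count trailing "fail" entries in a status history (oldest first)."""
    count = 0
    for status in reversed(history):
        if status != "fail":
            break
        count += 1
    return count


def route_alert(run_status: str, check: Optional[CheckResult], history: List[str]) -> str:
    """Return "page", "slack", or "none" for a run and one of its checks."""
    # Tier 1: the run itself failed.
    if run_status == "failed":
        return "page"
    if check is None or check.status != "fail":
        return "none"
    # Tier 2: error-severity check failure.
    if check.severity == "error":
        return "page"
    # Tier 3: warn-severity failure, escalated after repeated failures.
    if check.severity == "warn":
        if consecutive_failures(history) >= ESCALATION_THRESHOLD:
            return "page"
        return "slack"
    return "none"


warn_check = CheckResult("freshness", "fail", "warn")
first_warn = route_alert("success", warn_check, ["pass", "fail"])
third_warn = route_alert("success", warn_check, ["fail", "fail", "fail"])
```

Keeping the routing decision in a pure function makes the escalation rule easy to unit test separately from the code that calls the Catalog API or the paging service.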

Recovery notifications

When a check transitions from fail to pass, notify stakeholders that the data is trustworthy again. Silent recovery leaves teams unsure whether they can resume using the data.
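Detecting a recovery amounts to comparing check statuses between two runs. A minimal sketch, assuming the check results of each run have already been reduced to a mapping from check name to status (the function name recovered_checks and that mapping are illustrative, not part of the Catalog API):

```python
from typing import Dict, List


def recovered_checks(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return names of checks that failed in the previous run and pass now."""
    return sorted(
        name
        for name, status in current.items()
        if status == "pass" and previous.get(name) == "fail"
    )


previous_run = {"not_null_order_id": "fail", "row_count": "pass", "freshness": "fail"}
current_run = {"not_null_order_id": "pass", "row_count": "pass", "freshness": "fail"}
to_notify = recovered_checks(previous_run, current_run)
```

Checks that appear only in the current run are ignored, since there is no earlier failure to recover from.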

Correlating runs with lineage

When a check on orders_enriched fails, query lineage to determine whether the root cause lies upstream or in the transform itself:

curl "https://api.dataxpipe.com/api/v1/lineage/orders_enriched" `
  -H "X-API-KEY: dxp_your_key"

If upstream clean_orders also failed checks, fix ingestion first. If only the transform failed, focus on SQL logic.
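The decision rule above can be written down as a small function. This sketch assumes the lineage response has already been flattened into a list of upstream dataset names and that the latest check status per dataset is available as a mapping; both shapes, and the name classify_root_cause, are hypothetical because the lineage response format is not shown here.

```python
from typing import Dict, List, Tuple


def classify_root_cause(
    upstream: List[str], check_status: Dict[str, str]
) -> Tuple[str, List[str]]:
    """Decide where to look first after a downstream check failure.

    Returns ("upstream", [failing datasets]) when any upstream dataset
    also failed its checks, otherwise ("transform", []).
    """
    failing = [name for name in upstream if check_status.get(name) == "fail"]
    if failing:
        return ("upstream", failing)
    return ("transform", [])


# orders_enriched failed; clean_orders is its upstream dataset.
status = {"clean_orders": "fail", "orders_enriched": "fail"}
where, datasets = classify_root_cause(["clean_orders"], status)
```

A result of "upstream" points at ingestion, while "transform" suggests reviewing the SQL logic of the failing dataset.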

Because lineage narrows the search to a single stage, this pattern can shorten mean time to resolution compared with investigating the warehouse manually.

Prometheus metrics

Enable metrics on the Catalog API:

$env:DATAXPIPE_ENABLE_METRICS = "true"
pip install prometheus_client

Scrape GET /metrics for:

HTTP request count and duration by endpoint
Active database connections
Check execution latency

Grafana dashboards combining Catalog metrics with Airflow exporter metrics provide end-to-end SLO tracking.
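For a quick look at the scraped output without a full Prometheus setup, the text returned by GET /metrics can be parsed directly. The sketch below is a simplified parser for the Prometheus text exposition format: it skips comment lines and assumes samples carry no trailing timestamp. The metric name http_requests_total and its endpoint and status labels are assumptions for illustration; check the actual /metrics output for the real names.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse exposition-format lines into {"name{labels}": value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def error_ratio(samples: Dict[str, float], metric: str = "http_requests_total") -> float:
    """Share of requests with a 5xx status label (hypothetical label name)."""
    prefix = metric + "{"
    total = sum(v for k, v in samples.items() if k.startswith(prefix))
    errors = sum(
        v for k, v in samples.items() if k.startswith(prefix) and 'status="5' in k
    )
    return errors / total if total else 0.0


SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/runs",status="200"} 95
http_requests_total{endpoint="/runs",status="500"} 5
"""
ratio = error_ratio(parse_metrics(SAMPLE))
```

For production dashboards, let Prometheus scrape the endpoint and compute rates in queries; a hand-rolled parser like this is only useful for ad hoc checks.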

Incident workflow

Recommended triage sequence:

1. Alert fires on run or check failure
2. Open run detail in product UI; note run_id
3. Query check results for violation counts and samples
4. Query lineage upstream to isolate source vs transform
5. Fix root cause; trigger manual rerun
6. Confirm recovery check passes; close incident

Document this workflow in your team runbook.

See the data quality checks guide.