Data Quality

Monitoring Pipeline Runs in DataXPipe

Track run status, correlate check failures, set up alerting, and use Catalog APIs and Prometheus metrics to observe pipeline health in production.

DataXPipe Team
  • monitoring
  • observability
  • runs

Pipeline success in Airflow does not guarantee trustworthy data. DataXPipe records every run lifecycle event and check result in the Catalog, giving operators a unified observability layer beyond task logs. This guide covers run monitoring, alerting patterns, and metrics integration.

Run lifecycle in the Catalog

Generated DAGs POST run records at key lifecycle points:

running → (transforms + checks) → success | failed

Each record includes pipeline ID, timestamps, status, and optional error context. Query runs via API or the product UI at app.dataxpipe.com.

# Recent runs for a pipeline
curl "https://api.dataxpipe.com/api/v1/pipelines/orders_sync/runs" `
  -H "X-API-KEY: dxp_your_key"

# Single run detail
curl "https://api.dataxpipe.com/api/v1/runs/run-abc123" `
  -H "X-API-KEY: dxp_your_key"

Run records link to check results via run_id, enabling drill-down from a failed run to specific quality violations.
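The same queries can be sketched in Python with only the standard library. The endpoint paths and the X-API-KEY header come from the commands above. The response shape (a JSON list of run objects with run_id and status fields) and the helper names are assumptions made for illustration, so adjust them to the payload your Catalog actually returns. The fetch parameter is a hypothetical injection point that lets the logic run without network access.

```python
import json
import urllib.request

BASE_URL = "https://api.dataxpipe.com/api/v1"


def build_request(path, api_key):
    """Build an authenticated GET request for a Catalog API path."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"X-API-KEY": api_key},
        method="GET",
    )


def http_fetch(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def recent_runs(pipeline_id, api_key, fetch=http_fetch):
    """Return recent runs for a pipeline.

    Assumption: the endpoint returns a JSON list of run objects.
    """
    return fetch(build_request(f"/pipelines/{pipeline_id}/runs", api_key))


def failed_run_ids(runs):
    """Collect run IDs whose status is 'failed' (field names are assumed)."""
    return [run["run_id"] for run in runs if run.get("status") == "failed"]
```

Calling failed_run_ids(recent_runs("orders_sync", "dxp_your_key")) would then give the run IDs to drill into for check results.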

The monitoring stack

Layer          Source               What it tells you
Orchestration  Airflow task logs    Task-level failures, retries
Catalog runs   GET /runs/           Pipeline-level success/failure
Check results  GET /checks/results  Data quality pass/fail
Lineage        GET /lineage/{id}    Blast radius of failures
Metrics        GET /metrics         Request latency, error rates

Airflow tells you whether tasks executed. The Catalog tells you whether data is trustworthy.

Alerting rules

Effective on-call alerting combines run status with check severity:

Tier 1: Pipeline failure

Trigger when run status is failed:

curl "https://api.dataxpipe.com/api/v1/pipelines/orders_sync/runs?status=failed&limit=1" `
  -H "X-API-KEY: dxp_your_key"

Page on-call with pipeline name, run ID, and Airflow log link.

Tier 2: Check failure (error severity)

Query failed checks with error severity, which indicate misleading data:

curl "https://api.dataxpipe.com/api/v1/checks/results?pipeline=orders_sync&status=fail&severity=error" `
  -H "X-API-KEY: dxp_your_key"

Include violation count and failure sample rows in the alert payload.

Tier 3: Check failure (warn severity)

Route to Slack, not pager. Auto-escalate if the same check fails in three consecutive runs:

warn fail × 3 → page on-call
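The three tiers can be expressed as one routing function. This is a minimal sketch, not part of DataXPipe: the function name, the "page" and "slack" channel labels, and the (status, severity) tuple representation of a check's history are all hypothetical, chosen only to make the rules concrete.

```python
def route_alert(run_status, check_history, escalate_after=3):
    """Pick an alert channel for the latest run of one check.

    check_history is a list of (status, severity) tuples for a single
    check, oldest first, e.g. [("pass", "warn"), ("fail", "warn")].
    Returns "page", "slack", or None when no alert is needed.
    """
    # Tier 1: the pipeline run itself failed.
    if run_status == "failed":
        return "page"
    if not check_history:
        return None

    status, severity = check_history[-1]
    if status != "fail":
        return None
    # Tier 2: error-severity check failure.
    if severity == "error":
        return "page"

    # Tier 3: warn-severity failure. Count the unbroken run of failures
    # ending at the latest result and escalate once it reaches the limit.
    streak = 0
    for past_status, _ in reversed(check_history):
        if past_status != "fail":
            break
        streak += 1
    return "page" if streak >= escalate_after else "slack"
```

Keeping the streak count tied to consecutive failures means a single passing run resets the escalation, so an intermittently failing warn check stays in Slack.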

Recovery notifications

When a check transitions from fail to pass, notify stakeholders that data is trustworthy again. Silent recovery leaves teams uncertain.
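Detecting a recovery means comparing the latest check statuses with those from the previous run. The sketch below assumes you have already reduced each run's check results to a dictionary mapping check name to "pass" or "fail". Both that shape and the function name are illustrative, not a DataXPipe API.

```python
def recovered_checks(previous, current):
    """Return the names of checks that moved from fail to pass.

    previous and current map check name -> status for two
    consecutive runs of the same pipeline.
    """
    return sorted(
        name
        for name, status in current.items()
        if status == "pass" and previous.get(name) == "fail"
    )
```

A check that appears only in the current run is not reported, because there is no earlier failure to recover from.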

Correlating runs with lineage

When a check on orders_enriched fails, query lineage to determine the root cause:

curl "https://api.dataxpipe.com/api/v1/lineage/orders_enriched" `
  -H "X-API-KEY: dxp_your_key"

If upstream clean_orders also failed checks, fix ingestion first. If only the transform failed, focus on SQL logic.

This pattern shortens mean time to resolution because lineage narrows the search to the failing upstream node, instead of requiring manual investigation across the warehouse.
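The upstream-versus-transform decision can be sketched as a small function. It assumes you have already extracted the upstream node IDs from the lineage response and the latest check status per node from the check results. The lineage payload format is not shown in this guide, so those inputs and the function names are hypothetical.

```python
def failing_upstreams(upstream_ids, check_status):
    """Return upstream nodes whose latest check status is 'fail'."""
    return [node for node in upstream_ids if check_status.get(node) == "fail"]


def triage(upstream_ids, check_status):
    """Suggest where to look first for a failing downstream node."""
    failing = failing_upstreams(upstream_ids, check_status)
    if failing:
        # Bad input data: fix ingestion before touching the transform.
        return "fix upstream first: " + ", ".join(failing)
    # Upstream data passed its checks, so the transform is the suspect.
    return "inspect transform SQL"
```

For example, triage(["clean_orders"], {"clean_orders": "fail"}) points at ingestion, while a passing clean_orders points at the SQL logic in orders_enriched.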

Prometheus metrics

Enable metrics on the Catalog API:

$env:DATAXPIPE_ENABLE_METRICS = "true"
pip install prometheus_client

Scrape GET /metrics for:

  • HTTP request count and duration by endpoint
  • Active database connections
  • Check execution latency

Grafana dashboards combining Catalog metrics with Airflow exporter metrics provide end-to-end SLO tracking.

Incident workflow

Recommended triage sequence:

  1. Alert fires on run or check failure
  2. Open run detail in product UI; note run_id
  3. Query check results for violation counts and samples
  4. Query lineage upstream to isolate source vs transform
  5. Fix root cause; trigger manual rerun
  6. Confirm recovery check passes; close incident

Document this workflow in your team runbook.

See the data quality checks guide.