Declarative vs Imperative Pipeline Definitions

Every data team eventually faces the same architectural question: should pipelines be defined as declarative specifications (what the pipeline does) or imperative code (how to orchestrate each step)? DataXPipe bets on declarative specs. This article explains why—and when imperative code still belongs in the stack.

The three-artifact problem

Imperative pipelines typically spread knowledge across:

Orchestration code — Airflow DAG Python files with task dependencies
Transform logic — SQL files or dbt models
Documentation — Confluence pages describing lineage and SLAs

When a column is renamed in the warehouse, engineers update the SQL but forget the DAG docstring and the lineage wiki. Stakeholders discover the drift when a dashboard breaks.

Declarative specs collapse these into one validated artifact:

pipeline:
  name: orders_sync
  description: Incremental sync of raw orders
  owner: data-platform@example.com
  environment: production
  schedule: "0 6 * * *"

sources:
  - id: raw_orders
    type: postgres
    connection_ref: pg-raw
    object: ecommerce.orders
    load_mode: incremental
    watermark_column: updated_at

transforms:
  - id: orders_clean
    type: sql
    code_ref: transforms/orders_clean.sql
    inputs: [raw_orders]

targets:
  - id: clean_orders
    type: bigquery
    connection_ref: bq-analytics
    object: analytics.orders_clean
    write_mode: merge
    primary_key: [order_id]

One file. One validation step. One generator run produces DAGs, checks, and metadata.

What declarative specs excel at

Concern	Declarative advantage
Lineage	Inputs/outputs declared explicitly; no SQL parsing
Catalog registration	`pipeline.json` generated automatically
Check attachment	Checks live in spec, ship with pipeline
Onboarding	New engineers read YAML, not 400 lines of DAG code
CI validation	JSON Schema catches errors before deploy
Version control	Spec diffs are reviewable; DAG codegen is deterministic
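To illustrate the lineage row, here is a minimal sketch of how edges can be read straight from declared inputs. It assumes the spec has already been parsed into a Python dict using the field names from the orders_sync example; the function name `lineage_edges` is hypothetical and not part of DataXPipe.

```python
from typing import Dict, List, Tuple

# Spec fragment mirroring the orders_sync example, already parsed into a dict.
SPEC = {
    "sources": [{"id": "raw_orders", "object": "ecommerce.orders"}],
    "transforms": [
        {"id": "orders_clean", "type": "sql", "inputs": ["raw_orders"]},
    ],
}


def lineage_edges(spec: Dict) -> List[Tuple[str, str]]:
    """Return (upstream_id, downstream_id) pairs from declared transform inputs."""
    edges = []
    for transform in spec.get("transforms", []):
        for upstream in transform.get("inputs", []):
            edges.append((upstream, transform["id"]))
    return edges


if __name__ == "__main__":
    print(lineage_edges(SPEC))
```

No SQL is opened or parsed here: the edges come entirely from the `inputs` lists, which is the point of declaring them.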

DataXPipe validates specs against specs/spec_schema.json plus semantic rules (unique IDs, valid connection refs). Invalid specs fail in CI, not at 6 AM in production.
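The two semantic rules named above could be checked roughly as follows. This is a sketch under stated assumptions, not DataXPipe's actual validator: the function name, the error strings, and the choice to enforce ID uniqueness across all three sections (rather than per section) are assumptions, and the structural schema check is presumed to have run first so every node has an `id`.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical registry contents and a deliberately broken spec for the demo.
KNOWN_CONNECTIONS = {"pg-raw", "bq-analytics"}
BAD_SPEC = {
    "sources": [{"id": "raw_orders", "connection_ref": "pg-raw"}],
    "transforms": [{"id": "raw_orders", "inputs": ["raw_orders"]}],
    "targets": [{"id": "clean_orders", "connection_ref": "bq-typo"}],
}


def semantic_errors(spec: Dict, known_connections: Set[str]) -> List[str]:
    """Collect semantic problems that a structural schema check would not catch."""
    errors = []

    # Gather every declared node across the three sections.
    nodes = [
        node
        for section in ("sources", "transforms", "targets")
        for node in spec.get(section, [])
    ]

    # Rule 1 (assumed scope): IDs must be unique across the whole spec.
    counts = Counter(node["id"] for node in nodes)
    for node_id, count in sorted(counts.items()):
        if count > 1:
            errors.append(f"duplicate id: {node_id}")

    # Rule 2: every connection_ref must point at a registered connection.
    for node in nodes:
        ref = node.get("connection_ref")
        if ref is not None and ref not in known_connections:
            errors.append(f"{node['id']}: unknown connection_ref {ref!r}")

    return errors


if __name__ == "__main__":
    for problem in semantic_errors(BAD_SPEC, KNOWN_CONNECTIONS):
        print(problem)
```

A CI job can fail the build whenever the returned list is non-empty, which keeps the failure in review rather than in the scheduled run.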

Where imperative code still wins

Declarative specs are not a replacement for all pipeline logic:

Complex control flow. Dynamic task generation, branching on runtime variables, and sensor-driven triggers are awkward to express in YAML. Keep these in Airflow: use the generated DAG as a base and extend it sparingly with @task-decorated functions.

One-off backfills. Ad-hoc replays with custom date logic are faster to write as imperative scripts than as spec amendments.
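As an illustration of the kind of throwaway script meant here, the sketch below splits a replay range into fixed-size windows. The names `backfill_windows` and `replay` are hypothetical, and `replay` only prints; a real script would trigger whatever mechanism the team uses to rerun the pipeline for a date range.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def backfill_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield half-open [window_start, window_end) ranges covering start..end."""
    if days < 1:
        raise ValueError("days must be at least 1")
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(cursor + timedelta(days=days), end)
        yield cursor, window_end
        cursor = window_end


def replay(window_start: date, window_end: date) -> None:
    # Placeholder: a real script would rerun the pipeline for this range.
    print(f"replaying {window_start} to {window_end}")


if __name__ == "__main__":
    for window in backfill_windows(date(2024, 1, 1), date(2024, 1, 10), days=4):
        replay(*window)
```

Logic like this is a few lines of code but would need new spec fields to express declaratively, which is why it tends to stay imperative.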

Non-standard operators. Proprietary ingestion tools without generator support need wrapper operators in Python.

The best teams use declarative specs for steady-state production pipelines and imperative code for exceptions and glue.

Migration path from imperative DAGs

Teams with existing Airflow DAGs can migrate incrementally:

1. Extract metadata: document the sources, transforms, targets, and schedule of each existing DAG.
2. Write the spec: model the happy path in YAML; ignore edge-case branching initially.
3. Generate and diff: compare the generated DAG against the legacy DAG.
4. Run in parallel: execute the generated DAG in staging and validate that its Catalog run events match those of the legacy DAG.
5. Cut over: replace the legacy DAG in production and archive the imperative version.

Do not attempt a big-bang rewrite. Migrate high-churn pipelines first, where the benefits of a spec (lineage, checks) most clearly outweigh the migration cost.
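The "generate and diff" step can be approximated by comparing task sets and dependency edges. The sketch below assumes each DAG has been summarised by hand or by a script as a `{task: [upstream tasks]}` mapping; that representation, the task names, and the function names are all hypothetical rather than a DataXPipe or Airflow export format.

```python
from typing import Dict, List, Set, Tuple

Dag = Dict[str, List[str]]

# Hypothetical summaries: the legacy DAG has an extra notification task.
LEGACY: Dag = {
    "raw_orders": [],
    "orders_clean": ["raw_orders"],
    "clean_orders": ["orders_clean"],
    "notify_slack": ["clean_orders"],
}
GENERATED: Dag = {
    "raw_orders": [],
    "orders_clean": ["raw_orders"],
    "clean_orders": ["orders_clean"],
}


def to_edges(dag: Dag) -> Set[Tuple[str, str]]:
    """Flatten a {task: [upstream, ...]} mapping into (upstream, task) pairs."""
    return {(upstream, task) for task, upstreams in dag.items() for upstream in upstreams}


def diff_dags(legacy: Dag, generated: Dag) -> Dict[str, list]:
    """Report tasks and dependencies that differ between two DAG summaries."""
    legacy_edges, generated_edges = to_edges(legacy), to_edges(generated)
    return {
        "tasks_only_in_legacy": sorted(set(legacy) - set(generated)),
        "tasks_only_in_generated": sorted(set(generated) - set(legacy)),
        "edges_only_in_legacy": sorted(legacy_edges - generated_edges),
        "edges_only_in_generated": sorted(generated_edges - legacy_edges),
    }


if __name__ == "__main__":
    for key, value in diff_dags(LEGACY, GENERATED).items():
        print(key, value)
```

Anything reported as legacy-only is a candidate for either a spec extension or a small imperative add-on, and is worth resolving before the parallel run.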

Avoiding spec bloat

Declarative does not mean “put everything in YAML.” Anti-patterns:

Embedding 200-line SQL blocks in spec files (use code_ref instead)
Encoding business logic in params objects when it belongs in SQL
Duplicating connection credentials in specs (use connection_ref)

Keep specs structural. Keep transforms in SQL files. Keep credentials in the Catalog connections registry.
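Two of these anti-patterns are easy to flag mechanically. The following lint is a rough sketch only: the inline `code` field, the secret-looking key names, and the 20-line threshold are assumptions made for illustration and are not taken from the DataXPipe schema.

```python
from typing import Dict, List

MAX_INLINE_SQL_LINES = 20  # arbitrary threshold for this sketch
SECRET_KEYS = {"password", "secret", "token", "api_key"}  # hypothetical key names


def bloat_warnings(spec: Dict) -> List[str]:
    """Flag long inline SQL and credential-like keys in a parsed spec."""
    warnings = []
    for section in ("sources", "transforms", "targets"):
        for node in spec.get(section, []):
            node_id = node.get("id", "<missing id>")

            # Credentials: any secret-looking key at the node's top level.
            leaked = sorted(SECRET_KEYS & set(node))
            if leaked:
                warnings.append(
                    f"{node_id}: credentials in spec ({', '.join(leaked)}); use connection_ref"
                )

            # Inline SQL: assumes a hypothetical `code` field holding SQL text.
            code = node.get("code")
            if isinstance(code, str) and len(code.splitlines()) > MAX_INLINE_SQL_LINES:
                warnings.append(f"{node_id}: inline SQL is long; use code_ref")
    return warnings


if __name__ == "__main__":
    bloated = {
        "sources": [{"id": "raw_orders", "password": "hunter2"}],
        "transforms": [{"id": "orders_clean", "code": "\n".join(["SELECT 1"] * 25)}],
    }
    for warning in bloat_warnings(bloated):
        print(warning)
```

Business logic hidden in params objects is harder to detect automatically and is better caught in review.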

Measuring success

Teams that adopt declarative specs typically report:

Faster incident triage (lineage queryable in minutes)
Fewer “undocumented pipeline” audit findings
Shorter onboarding for data engineers joining mid-quarter

Track Catalog registration coverage: the percentage of production tables with a registered pipeline spec.
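The coverage metric is a simple ratio. A minimal sketch, assuming the two table lists can be exported from the warehouse and the Catalog respectively (the function name and the table names are made up for the example):

```python
from typing import Iterable


def registration_coverage(
    production_tables: Iterable[str], registered_tables: Iterable[str]
) -> float:
    """Percentage of production tables that have a registered pipeline spec."""
    production = set(production_tables)
    if not production:
        return 0.0  # avoid dividing by zero when nothing is in production
    # Registered tables outside production do not count toward coverage.
    covered = production & set(registered_tables)
    return 100.0 * len(covered) / len(production)


if __name__ == "__main__":
    production = [
        "analytics.orders_clean",
        "analytics.customers",
        "analytics.refunds",
        "analytics.sessions",
    ]
    registered = ["analytics.orders_clean", "analytics.refunds", "staging.tmp"]
    print(f"{registration_coverage(production, registered):.1f}%")
```

Reporting this number per team or per schema shows where migration effort is still needed.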

Start with “Getting Started with DataXPipe.”