YAML Spec Validation Deep Dive

Every DataXPipe pipeline begins as a YAML spec validated against specs/spec_schema.json and additional semantic rules. Validation catches structural errors before artifact generation and Catalog registration. This deep dive explains the validation layers and how to integrate them into CI.

Validation layers

DataXPipe applies two validation passes:

JSON Schema validation — Required fields, types, and structure
Semantic validation — Cross-field rules (unique IDs, valid refs, lineage consistency)

Both must pass before python -m generator.run_example (or python -m generator.generate with your own spec path) produces artifacts.
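The gating described above can be sketched in a few lines of Python. This is an illustration only, not the generator's actual code: run_passes, schema_pass, and semantic_pass are hypothetical names, the two passes are tiny stand-ins for the real checks, and stopping at the first failing pass is an assumption (semantic rules are easier to evaluate on a structurally valid spec).

```python
from typing import Callable

# A validation pass takes the parsed spec and returns a list of error strings.
ValidationPass = Callable[[dict], list[str]]


def run_passes(spec: dict, passes: list[ValidationPass]) -> list[str]:
    """Run each pass in order and stop at the first one that reports errors."""
    for check in passes:
        errors = check(spec)
        if errors:
            return errors
    return []


def schema_pass(spec: dict) -> list[str]:
    # Stand-in for JSON Schema validation: only checks that "pipeline" is a mapping.
    if not isinstance(spec.get("pipeline"), dict):
        return ["pipeline: required object is missing"]
    return []


def semantic_pass(spec: dict) -> list[str]:
    # Stand-in for cross-field rules: source IDs must be unique.
    ids = [source.get("id") for source in spec.get("sources", [])]
    dupes = sorted({i for i in ids if i is not None and ids.count(i) > 1})
    return [f"sources: duplicate id '{d}'" for d in dupes]
```

An empty list from run_passes(spec, [schema_pass, semantic_pass]) would be the signal that generation may proceed.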

Required top-level structure

The schema requires a pipeline object with these fields:

pipeline:
  name: orders_sync          # required, unique identifier
  description: Sync orders   # required
  owner: team@example.com    # required
  environment: production    # required
  schedule: "0 6 * * *"      # required, cron expression
  tags: [ecommerce]          # optional

Missing any required field fails validation with a schema error pointing to the exact path.
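To make the "exact path" idea concrete, here is a hand-rolled sketch that reports missing required pipeline fields as dotted paths. The real check is driven by specs/spec_schema.json; the function name missing_pipeline_fields and the path format are assumptions made for illustration.

```python
# Required fields taken from the pipeline example above; "tags" is optional.
REQUIRED_PIPELINE_FIELDS = ("name", "description", "owner", "environment", "schedule")


def missing_pipeline_fields(spec: dict) -> list[str]:
    """Return dotted paths for required pipeline fields that are absent."""
    pipeline = spec.get("pipeline")
    if not isinstance(pipeline, dict):
        # The whole object is missing, so report it rather than each field.
        return ["pipeline"]
    return [
        f"pipeline.{field}"
        for field in REQUIRED_PIPELINE_FIELDS
        if field not in pipeline
    ]
```

For a spec that declares only name and owner, this would report pipeline.description, pipeline.environment, and pipeline.schedule.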

Sources, transforms, and targets

Each section has its own required fields:

sources:
  - id: raw_orders                    # required, unique
    type: postgres                    # required
    connection_ref: pg-raw            # required, must exist in connections
    object: ecommerce.orders          # required
    load_mode: incremental            # required
    watermark_column: updated_at      # required for incremental

transforms:
  - id: orders_clean                  # required, unique
    type: sql                         # required
    code_ref: transforms/orders_clean.sql  # required
    inputs: [raw_orders]              # required, must reference valid IDs

targets:
  - id: clean_orders                  # required, unique
    type: bigquery                    # required
    connection_ref: bq-analytics      # required
    object: analytics.orders_clean    # required
    write_mode: merge                 # required
    primary_key: [order_id]           # required for merge mode
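Two of the requirements above are conditional: watermark_column applies only to incremental sources, and primary_key only to merge targets. A minimal sketch of those two rules follows, assuming the spec has already been parsed into a Python dict. The function name and the error wording are hypothetical, not the generator's output.

```python
def conditional_field_errors(spec: dict) -> list[str]:
    """Check fields that are required only for certain modes."""
    errors = []
    for i, source in enumerate(spec.get("sources", [])):
        # Incremental loads need a column to track progress.
        if source.get("load_mode") == "incremental" and not source.get("watermark_column"):
            errors.append(
                f"sources[{i}].watermark_column: required when load_mode is incremental"
            )
    for i, target in enumerate(spec.get("targets", [])):
        # Merge writes need a key to match existing rows.
        if target.get("write_mode") == "merge" and not target.get("primary_key"):
            errors.append(
                f"targets[{i}].primary_key: required when write_mode is merge"
            )
    return errors
```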

Checks declaration

Checks reference targets by ID:

checks:
  - id: chk_freshness                 # required, unique
    type: freshness                   # required
    target: clean_orders              # required, must match target id
    max_delay_minutes: 1560           # optional, freshness-specific

Depending on your spec version, you can instead declare SQL-based checks with type: sql and a code_ref.

Every connection_ref must resolve either to an entry in the spec's inline connections block or to a Catalog-registered connection at deploy time.
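The connection_ref rule can be sketched as a set lookup. This document does not show the shape of the connections block, so the sketch assumes it is a list of entries that each carry an id. The Catalog side is represented by a plain set of names passed in by the caller. Both are assumptions, as is the function name.

```python
def unresolved_connection_refs(spec: dict, catalog_connections: set[str]) -> list[str]:
    """List connection_ref values that match neither inline nor Catalog connections."""
    # Assumed shape: connections is a list of mappings with an "id" key.
    inline = {c["id"] for c in spec.get("connections", []) if "id" in c}
    known = inline | catalog_connections
    unresolved = []
    for section in ("sources", "targets"):
        for entry in spec.get(section, []):
            ref = entry.get("connection_ref")
            if ref is not None and ref not in known:
                unresolved.append(f"{section}.{entry.get('id')}: {ref}")
    return unresolved
```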

Running validation locally

Validate before generating:

python -m generator.validate specs/orders_sync.yaml

On success, generate artifacts:

python -m generator.run_example
# or with explicit spec path:
python -m generator.generate specs/orders_sync.yaml

Validation errors print JSON Schema paths such as pipeline.schedule: '' is not valid. Fix the cited field and re-run.

CI integration

Add a validation step to your GitHub Actions workflow:

- name: Validate pipeline specs
  run: |
    for spec in specs/*.yaml; do
      python -m generator.validate "$spec"
    done

Fail the build on validation errors. Optionally diff generated metadata/lineage.json against the previous commit to catch unintended lineage changes.
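One way to approach the optional lineage diff is to compare edge sets from the old and new files. The layout of metadata/lineage.json is not described here, so this sketch assumes a top-level "edges" list of objects with "from" and "to" keys. Adjust the accessors to the real layout. The function names are hypothetical.

```python
def edge_set(lineage: dict) -> set[tuple[str, str]]:
    """Collect (from, to) pairs; assumes an "edges" list with "from"/"to" keys."""
    return {(edge["from"], edge["to"]) for edge in lineage.get("edges", [])}


def diff_lineage(old: dict, new: dict) -> dict[str, list[tuple[str, str]]]:
    """Report edges that appear only in the new or only in the old lineage."""
    before, after = edge_set(old), edge_set(new)
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }
```

In CI, both inputs could be loaded with json.load, for example the freshly generated file and the copy from the previous commit. A non-empty "removed" list is a reasonable trigger for manual review.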

Common validation errors

Error	Cause	Fix
Duplicate ID	Two sources share `id: orders`	Rename to unique IDs
Unknown input	Transform references `raw_order` (typo)	Match exact source/target ID
Missing watermark	`load_mode: incremental` without column	Add `watermark_column`
Invalid cron	`schedule: daily`	Use cron: `"0 6 * * *"`
Unresolved connection_ref	Ref not in connections block	Add connection or register in Catalog
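For the Invalid cron row, a quick shape check is often enough to catch values like daily before the validator runs. The sketch below is not the generator's rule. It only confirms five whitespace-separated fields built from digits and the characters * / , -, so it neither checks value ranges nor accepts names such as MON. The function name looks_like_cron is hypothetical.

```python
import re

# Characters allowed in a purely numeric cron field: digits, *, /, comma, hyphen.
_CRON_FIELD = re.compile(r"^[0-9*/,\-]+$")


def looks_like_cron(schedule: str) -> bool:
    """Return True if the string has the shape of a five-field cron expression."""
    fields = schedule.split()
    return len(fields) == 5 and all(_CRON_FIELD.match(field) for field in fields)
```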

Semantic rules beyond schema

The generator enforces rules JSON Schema cannot express:

Lineage consistency — Every transform input must exist as a source ID or prior transform/target output
Target uniqueness — One producer per target ID unless explicitly versioned
Check target validity — Checks must reference existing target IDs

These rules prevent silent lineage gaps in generated metadata/lineage.json.
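Two of these rules, lineage consistency and check target validity, can be sketched as follows. The sketch assumes transforms are evaluated in declaration order and that an input may name a source or an earlier transform. It leaves out the target-output case and the versioning exception, and its name and error wording are illustrative only.

```python
def semantic_errors(spec: dict) -> list[str]:
    """Check transform inputs and check targets against declared IDs."""
    errors = []

    # Lineage consistency: inputs must name a source or an earlier transform.
    available = {source["id"] for source in spec.get("sources", [])}
    for transform in spec.get("transforms", []):
        for name in transform.get("inputs", []):
            if name not in available:
                errors.append(
                    f"transform '{transform['id']}': unknown input '{name}'"
                )
        # Later transforms may consume this transform's output.
        available.add(transform["id"])

    # Check target validity: every check must point at a declared target.
    target_ids = {target["id"] for target in spec.get("targets", [])}
    for check in spec.get("checks", []):
        if check.get("target") not in target_ids:
            errors.append(
                f"check '{check['id']}': unknown target '{check.get('target')}'"
            )
    return errors
```

With the raw_order typo from the error table, this reports an unknown input on the transform that references it.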

Validation → generation → registration flow

YAML spec
  → validate (schema + semantic)
  → generate (DAG, SQL, checks, metadata)
  → POST pipeline.json to Catalog
  → deploy DAG to Airflow

Skipping validation in CI is the most common cause of broken lineage and failed Catalog registration in production.

See Getting Started with DataXPipe for a complete first-pipeline walkthrough.