YAML Spec Validation Deep Dive
Understand DataXPipe JSON Schema validation, semantic rules, CI integration, and common spec errors that block artifact generation and catalog registration.
- yaml
- validation
- getting-started
Every DataXPipe pipeline begins as a YAML spec validated against specs/spec_schema.json and additional semantic rules. Validation catches structural and cross-field errors before artifact generation and Catalog registration. This deep dive explains the validation layers and how to integrate them into CI.
Validation layers
DataXPipe applies two validation passes:
- JSON Schema validation — Required fields, types, and structure
- Semantic validation — Cross-field rules (unique IDs, valid refs, lineage consistency)
Both passes must succeed before python -m generator.run_example (or python -m generator.generate with your own spec path) produces artifacts.
Required top-level structure
The schema requires a pipeline object with these fields:
pipeline:
  name: orders_sync          # required, unique identifier
  description: Sync orders   # required
  owner: team@example.com    # required
  environment: production    # required
  schedule: "0 6 * * *"      # required, cron expression
  tags: [ecommerce]          # optional
Missing any required field fails validation with a schema error pointing to the exact path.
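To make the required-field rule concrete, here is a minimal sketch of a check that reports missing pipeline fields as dotted paths. It is a hand-rolled stand-in for illustration only, not DataXPipe's schema validator: the function name and the choice to treat an empty string as missing are assumptions.

```python
# Hypothetical stand-in for the schema pass: it only checks the required
# pipeline fields listed above and reports them as dotted paths.
REQUIRED_PIPELINE_FIELDS = ("name", "description", "owner", "environment", "schedule")


def missing_pipeline_fields(spec: dict) -> list[str]:
    """Return dotted paths of required pipeline fields that are absent or empty."""
    pipeline = spec.get("pipeline")
    if not isinstance(pipeline, dict):
        return ["pipeline"]
    return [
        f"pipeline.{field}"
        for field in REQUIRED_PIPELINE_FIELDS
        # Assumption: an empty string counts as missing.
        if pipeline.get(field) in (None, "")
    ]


spec = {"pipeline": {"name": "orders_sync", "owner": "team@example.com", "schedule": ""}}
print(missing_pipeline_fields(spec))
```

The real validator reports paths from specs/spec_schema.json, so its messages will differ from this sketch.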
Sources, transforms, and targets
Each section has its own required fields:
sources:
  - id: raw_orders                        # required, unique
    type: postgres                        # required
    connection_ref: pg-raw                # required, must exist in connections
    object: ecommerce.orders              # required
    load_mode: incremental                # required
    watermark_column: updated_at          # required for incremental
transforms:
  - id: orders_clean                      # required, unique
    type: sql                             # required
    code_ref: transforms/orders_clean.sql # required
    inputs: [raw_orders]                  # required, must reference valid IDs
targets:
  - id: clean_orders                      # required, unique
    type: bigquery                        # required
    connection_ref: bq-analytics          # required
    object: analytics.orders_clean        # required
    write_mode: merge                     # required
    primary_key: [order_id]               # required for merge mode
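Two of the fields above are only conditionally required: watermark_column for incremental loads and primary_key for merge writes. The sketch below shows one way such rules could be checked on an already-parsed spec. The function name and message wording are hypothetical and do not reflect the generator's actual output.

```python
def conditional_field_errors(spec: dict) -> list[str]:
    """Check fields that are required only for certain modes (illustrative)."""
    errors = []
    for i, source in enumerate(spec.get("sources", [])):
        # Incremental loads need a watermark column to know where to resume.
        if source.get("load_mode") == "incremental" and not source.get("watermark_column"):
            errors.append(f"sources[{i}].watermark_column: required when load_mode is incremental")
    for i, target in enumerate(spec.get("targets", [])):
        # Merge writes need a key to match incoming rows against existing ones.
        if target.get("write_mode") == "merge" and not target.get("primary_key"):
            errors.append(f"targets[{i}].primary_key: required when write_mode is merge")
    return errors


spec = {
    "sources": [{"id": "raw_orders", "load_mode": "incremental"}],
    "targets": [{"id": "clean_orders", "write_mode": "merge", "primary_key": ["order_id"]}],
}
print(conditional_field_errors(spec))
```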
Checks declaration
Checks reference targets by ID:
checks:
  - id: chk_freshness          # required, unique
    type: freshness            # required
    target: clean_orders       # required, must match target id
    max_delay_minutes: 1560    # optional, freshness-specific
Depending on your spec version, you can instead declare SQL-based checks with type: sql and a code_ref.
Every connection_ref must resolve to an inline connections entry or a Catalog-registered connection at deploy time.
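As a rough illustration of connection_ref resolution, the sketch below collects every connection_ref from sources and targets and reports those not found in a set of known connection names. How that set is built (from the inline connections block, the Catalog, or both) is left to the caller, because the layout of those structures is not shown here. The function name is hypothetical.

```python
def unresolved_connection_refs(spec: dict, known_connections: set[str]) -> list[str]:
    """List connection_ref values that do not appear in known_connections."""
    unresolved = []
    for section in ("sources", "targets"):
        for entry in spec.get(section, []):
            ref = entry.get("connection_ref")
            if ref not in known_connections:
                unresolved.append(f"{section}.{entry.get('id')}: {ref}")
    return unresolved


spec = {
    "sources": [{"id": "raw_orders", "connection_ref": "pg-raw"}],
    "targets": [{"id": "clean_orders", "connection_ref": "bq-analytics"}],
}
# Only pg-raw is known in this example, so the target's ref is reported.
print(unresolved_connection_refs(spec, {"pg-raw"}))
```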
Running validation locally
Validate before generating:
python -m generator.validate specs/orders_sync.yaml
On success, generate artifacts:
python -m generator.run_example
# or with explicit spec path:
python -m generator.generate specs/orders_sync.yaml
Validation errors print JSON Schema paths such as pipeline.schedule: '' is not valid. Fix the cited field and re-run.
CI integration
Add a validation step to your GitHub Actions workflow:
- name: Validate pipeline specs
  run: |
    for spec in specs/*.yaml; do
      python -m generator.validate "$spec"
    done
Fail the build on validation errors. Optionally diff generated metadata/lineage.json against the previous commit to catch unintended lineage changes.
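The optional lineage diff could look something like the sketch below. The layout of metadata/lineage.json is not documented here, so the code assumes a hypothetical shape: a top-level "edges" list whose items have "from" and "to" keys. Adjust the parsing to match the file your generator actually emits.

```python
import json


def lineage_edges(lineage_json: str) -> set[tuple[str, str]]:
    """Parse lineage JSON into a set of (upstream, downstream) pairs.

    Assumption: the document has an "edges" list of {"from": ..., "to": ...}.
    """
    doc = json.loads(lineage_json)
    return {(edge["from"], edge["to"]) for edge in doc.get("edges", [])}


def diff_lineage(old_json: str, new_json: str) -> dict[str, list[tuple[str, str]]]:
    """Return edges added and removed between two lineage documents."""
    old, new = lineage_edges(old_json), lineage_edges(new_json)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


old = json.dumps({"edges": [
    {"from": "raw_orders", "to": "orders_clean"},
    {"from": "orders_clean", "to": "clean_orders"},
]})
new = json.dumps({"edges": [
    {"from": "raw_orders", "to": "orders_clean"},
    {"from": "orders_clean", "to": "orders_archive"},
]})
print(diff_lineage(old, new))
```

In CI, the previous version of the file can be read with git show HEAD~1:metadata/lineage.json, provided the generated file is committed at that path.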
Common validation errors
| Error | Cause | Fix |
|---|---|---|
| Duplicate ID | Two sources share id: orders | Rename to unique IDs |
| Unknown input | Transform references raw_order (typo) | Match exact source/target ID |
| Missing watermark | load_mode: incremental without column | Add watermark_column |
| Invalid cron | schedule: daily | Use a cron expression, e.g. schedule: "0 6 * * *" |
| Unresolved connection_ref | Ref not in connections block | Add connection or register in Catalog |
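For the invalid cron row, a quick pre-check can catch values like daily before the validator runs. The sketch below only tests for the common five-field shape and does not verify value ranges. Whether DataXPipe also accepts month or weekday names, or macros such as @daily, is not stated here, so this check rejects them by assumption.

```python
import re

# One cron field made of digits, '*', '/', ',' and '-' (shape check only).
_CRON_FIELD = re.compile(r"^[\d*/,\-]+$")


def looks_like_cron(expr: str) -> bool:
    """Return True if expr has five whitespace-separated cron-like fields."""
    fields = expr.split()
    return len(fields) == 5 and all(_CRON_FIELD.match(field) for field in fields)


print(looks_like_cron("0 6 * * *"))
print(looks_like_cron("daily"))
```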
Semantic rules beyond schema
The generator enforces rules JSON Schema cannot express:
- Lineage consistency — Every transform input must exist as a source ID or prior transform/target output
- Target uniqueness — One producer per target ID unless explicitly versioned
- Check target validity — Checks must reference existing target IDs
These rules prevent silent lineage gaps in generated metadata/lineage.json.
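The rules above can be sketched as a single pass over a parsed spec. This is an illustrative simplification, not the generator's implementation: it treats IDs as unique per section, accepts only sources and earlier transforms as inputs, and covers target uniqueness only as a duplicate-ID check, since the versioning mechanism is not described here.

```python
def semantic_errors(spec: dict) -> list[str]:
    """Illustrative cross-field checks on a parsed spec."""
    errors = []
    transforms = spec.get("transforms", [])
    ids_by_section = {
        "sources": [s["id"] for s in spec.get("sources", [])],
        "transforms": [t["id"] for t in transforms],
        "targets": [t["id"] for t in spec.get("targets", [])],
    }

    # Unique IDs (assumption: uniqueness is enforced per section).
    for section, ids in ids_by_section.items():
        seen = set()
        for id_ in ids:
            if id_ in seen:
                errors.append(f"duplicate id in {section}: {id_}")
            seen.add(id_)

    # Lineage consistency: each input must be a source or an earlier transform.
    available = set(ids_by_section["sources"])
    for transform in transforms:
        for name in transform.get("inputs", []):
            if name not in available:
                errors.append(f"transform {transform['id']}: unknown input {name}")
        available.add(transform["id"])

    # Check target validity: checks must point at a declared target.
    target_ids = set(ids_by_section["targets"])
    for check in spec.get("checks", []):
        if check.get("target") not in target_ids:
            errors.append(f"check {check['id']}: unknown target {check.get('target')}")
    return errors


spec = {
    "sources": [{"id": "raw_orders"}],
    "transforms": [{"id": "orders_clean", "inputs": ["raw_order"]}],  # typo
    "targets": [{"id": "clean_orders"}],
    "checks": [{"id": "chk_freshness", "target": "clean_order"}],  # typo
}
for error in semantic_errors(spec):
    print(error)
```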
Validation → generation → registration flow
YAML spec
→ validate (schema + semantic)
→ generate (DAG, SQL, checks, metadata)
→ POST pipeline.json to Catalog
→ deploy DAG to Airflow
Skipping validation in CI is the most common cause of broken lineage and failed Catalog registration in production.
See Getting Started with DataXPipe for a complete first-pipeline walkthrough.