New: AI & text-to-SQL on your own SupersetBook a demo

Data Strategy18 Apr 2026

Migrating from Azure Synapse to a Modern Lakehouse Architecture

Step-by-step guide to migrating from Azure Synapse to a lakehouse architecture using Iceberg, dbt, and Superset for modern analytics.

DTD23 Team

15 minutes read

Migrating from Azure Synapse to a Modern Lakehouse Architecture

Azure Synapse Analytics has been a workhorse for many enterprises, but its proprietary T-SQL dialect, licensing costs, and architectural constraints increasingly feel like friction when you're trying to move fast. If you're evaluating a migration to a modern lakehouse architecture, you're not alone—and the good news is that the path forward is clearer than it's ever been.

This guide walks you through a production-grade migration from Azure Synapse to an open lakehouse stack built on Apache Iceberg, dbt, and managed analytics. We'll cover the technical playbook, common pitfalls, and how to keep your analytics running without downtime.

Understanding the Lakehouse vs. Data Warehouse Paradigm Shift

Before you start migrating, you need to understand what you're migrating to. A lakehouse is fundamentally different from the traditional data warehouse model that Azure Synapse inherited.

Azure Synapse Analytics is an integrated analytics service that combines SQL data warehousing, big data analytics, and data integration into a single platform. It stores data in proprietary formats optimized for MPP (massively parallel processing) queries, which means your data is locked into Synapse's schema and storage layer.

A lakehouse, by contrast, stores data in open formats—typically Parquet or ORC files—on object storage like S3, Azure Blob Storage, or GCS. This separation of storage and compute means you can:

Run multiple query engines (Spark, Trino, DuckDB, Presto) against the same data without replication
Version and time-travel your data using tools like Apache Iceberg or Delta Lake
Avoid vendor lock-in by using open standards and open-source tools
Scale compute independently from storage, paying only for what you use

The lakehouse model is not just a technical shift—it's an economic and organizational one. You're trading the simplicity of a single managed service for flexibility, cost control, and the ability to integrate best-of-breed tools into your data stack.

Why Migrate Away from Azure Synapse?

Not every organization should migrate. But if you recognize these pain points, a lakehouse migration likely makes sense:

Cost at Scale: Synapse charges per DWU (Data Warehouse Unit) hour, and those costs compound as you scale. A lakehouse on object storage costs significantly less per terabyte, especially if you have cold data or variable query patterns.

T-SQL Dialect Lock-in: Synapse uses a modified SQL dialect that doesn't map cleanly to standard SQL or other platforms. If you want to use Trino, Spark, or DuckDB for specific workloads, you'll need to rewrite queries.

Data Format Constraints: Synapse stores data in proprietary clustered columnstore indexes. If you want to share data with Spark, machine learning pipelines, or other tools, you need to export and reformat.

Governance and Lineage: Modern lakehouse tools like dbt provide explicit lineage, version control, and testing capabilities that Synapse's stored procedures and ETL jobs lack.

Organizational Scale: As your data team grows, you want engineers to write SQL and Python against standard formats, not learn Synapse-specific patterns.

If your organization is already migrating from a data warehouse to the lakehouse model elsewhere in your stack, Synapse becomes an island of legacy infrastructure that slows down data teams.

The Modern Lakehouse Stack: Iceberg, dbt, and Analytics

Before diving into the migration playbook, here's the stack we're targeting:

Apache Iceberg: An open table format built on top of object storage that provides ACID transactions, schema evolution, time-travel queries, and hidden partitioning. Unlike Delta Lake (which is Databricks-proprietary), Iceberg is vendor-neutral and supported by Spark, Trino, Flink, and other engines.

dbt (Data Build Tool): A workflow orchestration and transformation framework that lets you define data models in SQL, version control them in Git, run them on a schedule or event-driven basis, and test data quality. dbt is the modern replacement for Synapse stored procedures and ETL jobs.

Apache Superset: An open-source business intelligence platform that connects to your lakehouse query engine (Spark, Trino) and lets you build dashboards, run ad-hoc queries, and embed analytics into applications. Unlike Synapse's limited BI capabilities, Superset is designed for self-serve analytics and can be deployed as a managed service on D23, which handles infrastructure, AI-powered query generation, and expert consulting.

This stack is open, modular, and designed for teams that need production-grade analytics without the platform overhead.

Phase 1: Assessment and Planning

The first phase of any migration is understanding what you're moving and building a cutover plan.

Inventory Your Synapse Workload

Start by documenting:

Tables and schemas: How many tables, what data types, what sizes? Use SELECT * FROM INFORMATION_SCHEMA.TABLES to get a baseline.
Stored procedures and ETL logic: Synapse likely has T-SQL procedures, COPY statements, and scheduled jobs. These need to be rewritten as dbt models and Python scripts.
Indexes and constraints: Synapse uses clustered columnstore indexes and constraints. Iceberg doesn't need these—the table format handles optimization—but you should document them to understand query patterns.
External tables: If you're using Synapse's PolyBase to query data in Azure Blob Storage or other sources, you're already partially on a lakehouse path. Document these for replication.
User access and security: Who queries Synapse, how do they authenticate, and what row-level or column-level security policies exist? You'll need to replicate these in your new platform.
Query patterns and SLAs: What's the typical query volume, latency requirements, and concurrency? This informs your choice of compute engine and cluster sizing.

Choose Your Compute Engine

You have several options for running queries against your Iceberg lakehouse:

Apache Spark: Best for large-scale batch processing, machine learning, and complex transformations. Spark is the default choice if you're running dbt on a lakehouse. Overkill for interactive BI queries but excellent for data engineering.

Trino (formerly Presto): A federated query engine that can query multiple data sources (Iceberg, S3, Postgres, etc.) with a single SQL interface. Excellent for BI and ad-hoc analytics. Faster than Spark for interactive queries. Migrating from Azure Synapse to Trino is a common pattern for teams that want a drop-in Synapse replacement.

DuckDB: An embedded SQL database that's incredibly fast for analytical queries on local or remote Parquet files. Great for small-to-medium queries and data exploration, but not ideal for multi-user, high-concurrency workloads.

Databricks: A managed Spark platform with Lakehouse-as-a-Service. If you want a fully managed experience similar to Synapse but on a lakehouse architecture, migrating from Azure Synapse to Databricks is a popular choice.

For most teams, we recommend Spark for transformation (dbt) and Trino for analytics. This gives you the flexibility of Spark for complex ETL and the speed of Trino for BI queries.

Define Your Target Architecture

Your lakehouse architecture should look something like this:

Data Sources (Synapse, APIs, logs) 
    ↓
Data Lake (S3/Blob Storage)
    ↓
Iceberg Tables (Bronze/Silver/Gold)
    ↓
dbt Transformations (version-controlled, tested)
    ↓
Query Engine (Spark, Trino, DuckDB)
    ↓
Analytics & BI (Superset, custom apps)

The "Bronze/Silver/Gold" layering is a best practice:

Bronze: Raw data ingested from sources, minimally transformed, preserving all fields.
Silver: Cleaned, deduplicated, validated data ready for analysis. Business logic starts here.
Gold: Aggregated, business-ready tables for dashboards, reporting, and machine learning.

This layering keeps your transformations modular, testable, and maintainable.

Phase 2: Data Migration and Schema Translation

Moving data from Synapse to Iceberg requires careful planning around data types, performance, and validation.

Translating Synapse Data Types to Iceberg

Synapse uses T-SQL data types that don't map 1:1 to Iceberg/Parquet. Here's a reference:

Synapse Type	Iceberg/Parquet Type	Notes
`BIGINT`	`LONG`	Direct mapping
`INT`	`INT`	Direct mapping
`DECIMAL(p,s)`	`DECIMAL(p,s)`	Preserve precision
`FLOAT`	`DOUBLE`	Use DOUBLE for IEEE 754 compliance
`VARCHAR(MAX)`	`STRING`	No length limit in Parquet
`DATETIME2`	`TIMESTAMP`	Convert to UTC; handle timezone carefully
`UNIQUEIDENTIFIER`	`STRING`	Store GUIDs as strings
`BIT`	`BOOLEAN`	Direct mapping
`VARBINARY(MAX)`	`BINARY`	Use sparingly; prefer columnar formats

When you export data from Synapse, use CTAS (Create Table As Select) to extract to Parquet files on Blob Storage, then use Spark or Iceberg's Spark API to create Iceberg tables:

-- In Synapse, export to Parquet
COPY (SELECT * FROM dbo.customers) 
TO 'https://yourstorage.blob.core.windows.net/data/customers/'
WITH (
    FILE_FORMAT = 'PARQUET',
    OVERWRITE = 'TRUE'
);

Then in Spark:

from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("synapse-migration").getOrCreate()
 
# Read Parquet files from Blob Storage
df = spark.read.parquet("abfss://[email protected]/customers/")
 
# Write as Iceberg table
df.writeTo("catalog.database.customers").create()

Handling Large Tables and Partitioning

For large tables (>100 GB), you'll want to partition your Iceberg tables to avoid scanning the entire dataset on every query. Synapse uses clustered columnstore indexes; Iceberg uses hidden partitioning, which is more flexible.

Choose your partition key based on query patterns:

Date-based partitioning (e.g., PARTITION BY YEAR(created_at), MONTH(created_at)): Good for time-series data, logs, and events.
Region or tenant partitioning: Good for multi-tenant systems.
Categorical partitioning: Good for high-cardinality fields with uneven distribution.

Iceberg's hidden partitioning means you don't need to include the partition column in your WHERE clause—Iceberg's query planner automatically prunes partitions. This is a major advantage over Delta Lake, which requires explicit partition handling.

Validating Data Integrity

Before you cutover, validate that your migrated data matches the source:

-- Count validation
SELECT COUNT(*) FROM synapse_table;
SELECT COUNT(*) FROM iceberg_table;
 
-- Checksum validation (using MD5 or SHA256 on key columns)
SELECT 
    MD5(CAST(CONCAT(id, email, created_at) AS STRING)) as row_hash
FROM synapse_table
ORDER BY id
LIMIT 1000;
 
SELECT 
    MD5(CAST(CONCAT(id, email, created_at) AS STRING)) as row_hash
FROM iceberg_table
ORDER BY id
LIMIT 1000;

For critical tables, run a full row-by-row comparison using a tool like dbt's equality test or a custom validation script.

Phase 3: Rewriting ETL as dbt Transformations

This is where the real work happens. Synapse's stored procedures and SSIS packages need to be rewritten as dbt models.

Converting Stored Procedures to dbt Models

A typical Synapse stored procedure looks like this:

CREATE PROCEDURE dbo.sp_load_customer_summary AS
BEGIN
    DELETE FROM dbo.customer_summary;
    
    INSERT INTO dbo.customer_summary
    SELECT 
        c.customer_id,
        c.customer_name,
        COUNT(o.order_id) as total_orders,
        SUM(o.order_amount) as total_spent
    FROM dbo.customers c
    LEFT JOIN dbo.orders o ON c.customer_id = o.customer_id
    WHERE c.is_active = 1
    GROUP BY c.customer_id, c.customer_name;
END;

In dbt, this becomes a model file (models/marts/customer_summary.sql):

{{
  config(
    materialized='table',
    indexes=[{'columns': ['customer_id']}]
  )
}}
 
WITH customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
    WHERE is_active = 1
),
 
orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
 
summary AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        COUNT(o.order_id) as total_orders,
        COALESCE(SUM(o.order_amount), 0) as total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name
)
 
SELECT * FROM summary

The advantages of dbt over stored procedures:

Version control: Your transformation logic lives in Git, with full history and code review.
Modularity: Use ref() to reference other models, creating a DAG (directed acyclic graph) of dependencies.
Testing: Built-in tests for null values, uniqueness, referential integrity, and custom validations.
Documentation: Auto-generate documentation from your models and tests.
Incremental builds: Process only new or changed data instead of full refreshes.

Handling Complex Logic and Incremental Loads

For complex procedures with multiple steps, break them into multiple dbt models. For incremental loads (where you only process new data since the last run), use dbt's incremental materialization:

{{
  config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='fail'
  )
}}
 
SELECT 
    order_id,
    customer_id,
    order_date,
    order_amount
FROM {{ source('raw', 'orders') }}
 
{% if execute %}
  {% if this.exists %}
    WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
  {% endif %}
{% endif %}

This model will do a full refresh on the first run, then only insert new rows on subsequent runs—dramatically reducing compute costs and runtime.

Scheduling dbt Runs

Replace Synapse's scheduled jobs with dbt's orchestration:

dbt Cloud: Managed dbt with built-in scheduling, logging, and alerting.
Airflow: Open-source orchestration platform with dbt integration (via dbt-airflow or cosmos).
GitHub Actions: Simple cron-based scheduling for small workflows.
Cron + dbt CLI: DIY approach; works but lacks observability.

For most teams, dbt Cloud is the easiest path—it handles credential management, provides a UI for viewing runs, and integrates with Slack for alerts.

Phase 4: Building Analytics on Superset

Once your data is in the lakehouse and transformations are running via dbt, you need a BI layer to replace Synapse's limited reporting capabilities.

D23 is a managed Apache Superset platform purpose-built for teams migrating from Synapse, Looker, or Tableau. Unlike Synapse's embedded Power BI integration, Superset is open-source, API-first, and designed for self-serve analytics.

Connecting Superset to Your Lakehouse

Superset connects to your query engine via a database driver. If you're using Trino, the connection string looks like:

trino://user:password@trino-coordinator:8080/iceberg/default

If you're using Spark (via Databricks or open-source), use the Spark SQL driver:

spark://user:token@your-spark-cluster:7077/default

Once connected, Superset can:

Discover tables and columns from your Iceberg catalog automatically.
Build dashboards by dragging and dropping charts onto a canvas.
Write ad-hoc SQL queries against your lakehouse without needing to know dbt or Spark internals.
Embed analytics into your product via iframes or the Superset API.
Generate SQL from natural language using AI (if you're using D23's managed service with MCP integration).

Migrating Synapse Reports to Superset Dashboards

For each Synapse report, you'll create an equivalent Superset dashboard:

Identify the underlying query: Extract the SQL from your Synapse report or stored procedure.
Rewrite for your query engine: Adapt the SQL to Trino or Spark syntax (usually minimal changes).
Create a Superset dataset: In Superset, create a new dataset pointing to your Iceberg table or a custom SQL query.
Build visualizations: Create charts (bar, line, pie, table, etc.) on top of the dataset.
Assemble the dashboard: Drag charts onto a dashboard, add filters, and configure drill-downs.
Test and validate: Verify that numbers match the original Synapse report.

Superset's filter and drill-down capabilities are powerful and often exceed what Synapse's Power BI integration offers. You can create cascading filters (e.g., select a region, then see only customers in that region), drill-downs to detail data, and cross-filter multiple charts.

Leveraging AI for Query Generation

One of the biggest productivity gains from D23 is AI-powered query generation. Instead of manually writing SQL, users can describe what they want in natural language, and the AI generates the query.

For example:

User: "Show me revenue by product category for the last 12 months, broken down by month."

AI: Generates the SQL:

SELECT 
    DATE_TRUNC('month', order_date) as month,
    product_category,
    SUM(order_amount) as revenue
FROM iceberg.gold.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY DATE_TRUNC('month', order_date), product_category
ORDER BY month, product_category

This is a game-changer for self-serve analytics, especially for non-technical users who previously relied on analysts to write queries.

Phase 5: Cutover and Validation

Once everything is built and tested, you're ready to cutover from Synapse to the lakehouse.

Parallel Running

For critical systems, run both Synapse and the lakehouse in parallel for a period (typically 1-2 weeks):

Week 1: Run both systems; compare results daily.
Week 2: Switch read-only traffic to the lakehouse; keep Synapse as a fallback.
Day 1 of Week 3: Switch all traffic; decommission Synapse.

During parallel running, monitor:

Query latency: Are queries faster or slower on the lakehouse? (They should be comparable or faster.)
Data freshness: Is dbt running on schedule? Are dashboards updating on time?
Accuracy: Do numbers in Superset match Synapse reports?
User adoption: Are teams able to use Superset without extensive training?

Rollback Plan

Have a rollback plan in case something goes wrong:

Keep Synapse data and schemas intact for at least 30 days post-migration.
Document the cutover process so you can reverse it if needed.
Have a communication plan for notifying users if you need to roll back.

In practice, rollbacks are rare if you've done thorough parallel testing. But having the option reduces anxiety and makes stakeholders more comfortable with the migration.

Post-Cutover Optimization

After cutover, spend time optimizing:

Query performance: Use Trino's EXPLAIN or Spark's EXPLAIN EXTENDED to understand slow queries. Add partitioning or materialized views as needed.
Cost optimization: Monitor cloud storage and compute costs. Identify unused tables and archive them.
Data quality: Set up dbt tests and data quality checks to catch issues early.
Documentation: Update your data dictionary and onboard new users.

Common Pitfalls and How to Avoid Them

Pitfall 1: Underestimating Schema Evolution

Iceberg handles schema evolution beautifully—you can add, drop, or rename columns without rewriting the entire table. But your dbt models need to account for this.

Solution: Use dbt's source() function to define your raw data sources and document expected schema. Use on_schema_change='fail' in your dbt config to catch unexpected changes.

Pitfall 2: Ignoring Query Performance

A query that runs in 2 seconds on Synapse might take 30 seconds on Trino if it's not optimized for the lakehouse.

Solution: Use EXPLAIN to understand query plans. Add partitioning and clustering to your Iceberg tables. Consider materialized views for expensive aggregations.

Pitfall 3: Losing Access Control

Synapse has row-level security (RLS) and column-level security (CLS) built in. Iceberg doesn't—you need to implement these in your BI layer or query engine.

Solution: Implement RLS in Superset using dataset-level filters. Or use Trino's row-level access control (RLAC) if you need enforcement at the query engine level.

Pitfall 4: Neglecting Cost Optimization

A lakehouse can be cheaper than Synapse, but only if you're intentional about it. Unoptimized queries can scan massive amounts of data and rack up cloud storage costs.

Solution: Monitor query patterns. Use table statistics and partitioning to reduce scans. Archive cold data to cheaper storage tiers. Use incremental dbt models to avoid full refreshes.

Pitfall 5: Underestimating the Learning Curve

Your team knows Synapse and T-SQL. Moving to a lakehouse with Spark, Trino, dbt, and Superset requires learning new tools and patterns.

Solution: Invest in training. Hire or consult with experts who've done this before. Use managed services like D23 that include expert consulting and support.

Real-World Migration Example

Let's walk through a simplified example: migrating a customer analytics workload from Synapse to a lakehouse.

Current state (Synapse):

Raw customer data in dbo.raw_customers (100M rows)
Aggregated customer metrics in dbo.customer_metrics (refreshed nightly via stored procedure)
Power BI dashboard pulling from dbo.customer_metrics

Target state (Lakehouse):

Raw customer data in Iceberg table bronze.customers (partitioned by ingestion date)
Cleaned customer data in silver.customers (dbt model)
Aggregated metrics in gold.customer_metrics (dbt model, incremental)
Superset dashboard connected to gold.customer_metrics

Migration steps:

Export raw data from Synapse to Parquet:

COPY (SELECT * FROM dbo.raw_customers) 
TO 'https://yourstorage.blob.core.windows.net/data/raw_customers/'
WITH (FILE_FORMAT = 'PARQUET');

Create Iceberg tables in Spark:

df = spark.read.parquet("abfss://[email protected]/raw_customers/")
df.writeTo("iceberg.bronze.customers").create()

Write dbt models:
- models/staging/stg_customers.sql: Clean and deduplicate raw data
- models/marts/customer_metrics.sql: Aggregate by customer

Test and validate:

# dbt tests
- name: customer_metrics
  columns:
    - name: customer_id
      tests:
        - unique
        - not_null

Create Superset dashboard:
- Connect to iceberg.gold.customer_metrics
- Build visualizations (customer count, revenue, churn rate, etc.)
- Add filters for date range and customer segment
Cutover:
- Run dbt in production (via dbt Cloud or Airflow)
- Point Superset to the lakehouse
- Decommission Synapse after 30 days of parallel running

Evaluating Managed Services vs. DIY

You can build this stack yourself, but there are trade-offs.

DIY Approach:

Pros: Full control, minimal costs, no vendor lock-in
Cons: Requires deep expertise, ongoing maintenance burden, slower time-to-value

Managed Services:

Pros: Faster deployment, expert support, built-in best practices, less operational overhead
Cons: Less control, potential vendor lock-in, higher per-user costs

For the query engine (Spark, Trino), managed options include Databricks and Dremio, which handle infrastructure and optimization for you.

For analytics and BI, D23 is a managed Apache Superset platform that includes infrastructure, AI-powered query generation, and expert consulting. This is particularly valuable if your team is new to open-source BI tools.

For data transformation (dbt), dbt Cloud is the managed option, but you can also run dbt on your own infrastructure using Airflow or GitHub Actions.

Conclusion: The Path Forward

Migrating from Azure Synapse to a modern lakehouse is a significant undertaking, but the benefits—lower costs, better flexibility, open standards, and faster time-to-insight—make it worth the effort.

The key to a successful migration is:

Plan thoroughly: Inventory your workload, choose your target architecture, and build a realistic timeline.
Migrate data carefully: Validate data integrity at every step. Use partitioning and compression to optimize storage.
Rewrite transformations as dbt: Break complex logic into modular, testable models. Version control everything.
Build analytics on Superset: Leverage open-source BI to replace Synapse's limited reporting capabilities. Consider managed services like D23 to accelerate deployment.
Run in parallel: Test thoroughly before switching all traffic. Have a rollback plan.
Optimize post-cutover: Monitor costs, query performance, and data quality. Invest in team training.

The lakehouse is not just a technical shift—it's a move toward a more flexible, cost-effective, and team-friendly data stack. If you're ready to move beyond Synapse's constraints, the path is clear, and the tools are mature. The question is not whether to migrate, but when.

For teams looking to accelerate their migration and avoid common pitfalls, consulting with experts who've done this before—whether through D23's managed platform or independent data engineering firms—can be the difference between a smooth transition and a months-long slog.

Your lakehouse is waiting. The data is ready to move. Let's go.