Step-by-step guide to migrating from Azure Synapse to a lakehouse architecture using Iceberg, dbt, and Superset for modern analytics.
Azure Synapse Analytics has been a workhorse for many enterprises, but its proprietary T-SQL dialect, licensing costs, and architectural constraints increasingly feel like friction when you're trying to move fast. If you're evaluating a migration to a modern lakehouse architecture, you're not alone—and the good news is that the path forward is clearer than it's ever been.
This guide walks you through a production-grade migration from Azure Synapse to an open lakehouse stack built on Apache Iceberg, dbt, and managed analytics. We'll cover the technical playbook, common pitfalls, and how to keep your analytics running without downtime.
Before you start migrating, you need to understand what you're migrating to. A lakehouse is fundamentally different from the traditional data warehouse model that Azure Synapse inherited.
Azure Synapse Analytics is an integrated analytics service that combines SQL data warehousing, big data analytics, and data integration into a single platform. It stores data in proprietary formats optimized for MPP (massively parallel processing) queries, which means your data is locked into Synapse's schema and storage layer.
A lakehouse, by contrast, stores data in open formats—typically Parquet or ORC files—on object storage like S3, Azure Blob Storage, or GCS. This separation of storage and compute means you can:
The lakehouse model is not just a technical shift—it's an economic and organizational one. You're trading the simplicity of a single managed service for flexibility, cost control, and the ability to integrate best-of-breed tools into your data stack.
Not every organization should migrate. But if you recognize these pain points, a lakehouse migration likely makes sense:
Cost at Scale: Synapse charges per DWU (Data Warehouse Unit) hour, and those costs compound as you scale. A lakehouse on object storage costs significantly less per terabyte, especially if you have cold data or variable query patterns.
T-SQL Dialect Lock-in: Synapse uses a modified SQL dialect that doesn't map cleanly to standard SQL or other platforms. If you want to use Trino, Spark, or DuckDB for specific workloads, you'll need to rewrite queries.
Data Format Constraints: Synapse stores data in proprietary clustered columnstore indexes. If you want to share data with Spark, machine learning pipelines, or other tools, you need to export and reformat.
Governance and Lineage: Modern lakehouse tools like dbt provide explicit lineage, version control, and testing capabilities that Synapse's stored procedures and ETL jobs lack.
Organizational Scale: As your data team grows, you want engineers to write SQL and Python against standard formats, not learn Synapse-specific patterns.
If your organization is already migrating from a data warehouse to the lakehouse model elsewhere in your stack, Synapse becomes an island of legacy infrastructure that slows down data teams.
Before diving into the migration playbook, here's the stack we're targeting:
Apache Iceberg: An open table format built on top of object storage that provides ACID transactions, schema evolution, time-travel queries, and hidden partitioning. Unlike Delta Lake (which is Databricks-proprietary), Iceberg is vendor-neutral and supported by Spark, Trino, Flink, and other engines.
dbt (Data Build Tool): A workflow orchestration and transformation framework that lets you define data models in SQL, version control them in Git, run them on a schedule or event-driven basis, and test data quality. dbt is the modern replacement for Synapse stored procedures and ETL jobs.
Apache Superset: An open-source business intelligence platform that connects to your lakehouse query engine (Spark, Trino) and lets you build dashboards, run ad-hoc queries, and embed analytics into applications. Unlike Synapse's limited BI capabilities, Superset is designed for self-serve analytics and can be deployed as a managed service on D23, which handles infrastructure, AI-powered query generation, and expert consulting.
This stack is open, modular, and designed for teams that need production-grade analytics without the platform overhead.
The first phase of any migration is understanding what you're moving and building a cutover plan.
Start by documenting:
SELECT * FROM INFORMATION_SCHEMA.TABLES to get a baseline.You have several options for running queries against your Iceberg lakehouse:
Apache Spark: Best for large-scale batch processing, machine learning, and complex transformations. Spark is the default choice if you're running dbt on a lakehouse. Overkill for interactive BI queries but excellent for data engineering.
Trino (formerly Presto): A federated query engine that can query multiple data sources (Iceberg, S3, Postgres, etc.) with a single SQL interface. Excellent for BI and ad-hoc analytics. Faster than Spark for interactive queries. Migrating from Azure Synapse to Trino is a common pattern for teams that want a drop-in Synapse replacement.
DuckDB: An embedded SQL database that's incredibly fast for analytical queries on local or remote Parquet files. Great for small-to-medium queries and data exploration, but not ideal for multi-user, high-concurrency workloads.
Databricks: A managed Spark platform with Lakehouse-as-a-Service. If you want a fully managed experience similar to Synapse but on a lakehouse architecture, migrating from Azure Synapse to Databricks is a popular choice.
For most teams, we recommend Spark for transformation (dbt) and Trino for analytics. This gives you the flexibility of Spark for complex ETL and the speed of Trino for BI queries.
Your lakehouse architecture should look something like this:
Data Sources (Synapse, APIs, logs)
↓
Data Lake (S3/Blob Storage)
↓
Iceberg Tables (Bronze/Silver/Gold)
↓
dbt Transformations (version-controlled, tested)
↓
Query Engine (Spark, Trino, DuckDB)
↓
Analytics & BI (Superset, custom apps)
The "Bronze/Silver/Gold" layering is a best practice:
This layering keeps your transformations modular, testable, and maintainable.
Moving data from Synapse to Iceberg requires careful planning around data types, performance, and validation.
Synapse uses T-SQL data types that don't map 1:1 to Iceberg/Parquet. Here's a reference:
| Synapse Type | Iceberg/Parquet Type | Notes |
|---|---|---|
BIGINT | LONG | Direct mapping |
INT | INT | Direct mapping |
DECIMAL(p,s) | DECIMAL(p,s) | Preserve precision |
FLOAT | DOUBLE | Use DOUBLE for IEEE 754 compliance |
VARCHAR(MAX) | STRING | No length limit in Parquet |
DATETIME2 | TIMESTAMP | Convert to UTC; handle timezone carefully |
UNIQUEIDENTIFIER | STRING | Store GUIDs as strings |
BIT | BOOLEAN | Direct mapping |
VARBINARY(MAX) | BINARY | Use sparingly; prefer columnar formats |
When you export data from Synapse, use CTAS (Create Table As Select) to extract to Parquet files on Blob Storage, then use Spark or Iceberg's Spark API to create Iceberg tables:
-- In Synapse, export to Parquet
COPY (SELECT * FROM dbo.customers)
TO 'https://yourstorage.blob.core.windows.net/data/customers/'
WITH (
FILE_FORMAT = 'PARQUET',
OVERWRITE = 'TRUE'
);Then in Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("synapse-migration").getOrCreate()
# Read Parquet files from Blob Storage
df = spark.read.parquet("abfss://[email protected]/customers/")
# Write as Iceberg table
df.writeTo("catalog.database.customers").create()For large tables (>100 GB), you'll want to partition your Iceberg tables to avoid scanning the entire dataset on every query. Synapse uses clustered columnstore indexes; Iceberg uses hidden partitioning, which is more flexible.
Choose your partition key based on query patterns:
PARTITION BY YEAR(created_at), MONTH(created_at)): Good for time-series data, logs, and events.Iceberg's hidden partitioning means you don't need to include the partition column in your WHERE clause—Iceberg's query planner automatically prunes partitions. This is a major advantage over Delta Lake, which requires explicit partition handling.
Before you cutover, validate that your migrated data matches the source:
-- Count validation
SELECT COUNT(*) FROM synapse_table;
SELECT COUNT(*) FROM iceberg_table;
-- Checksum validation (using MD5 or SHA256 on key columns)
SELECT
MD5(CAST(CONCAT(id, email, created_at) AS STRING)) as row_hash
FROM synapse_table
ORDER BY id
LIMIT 1000;
SELECT
MD5(CAST(CONCAT(id, email, created_at) AS STRING)) as row_hash
FROM iceberg_table
ORDER BY id
LIMIT 1000;For critical tables, run a full row-by-row comparison using a tool like dbt's equality test or a custom validation script.
This is where the real work happens. Synapse's stored procedures and SSIS packages need to be rewritten as dbt models.
A typical Synapse stored procedure looks like this:
CREATE PROCEDURE dbo.sp_load_customer_summary AS
BEGIN
DELETE FROM dbo.customer_summary;
INSERT INTO dbo.customer_summary
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) as total_orders,
SUM(o.order_amount) as total_spent
FROM dbo.customers c
LEFT JOIN dbo.orders o ON c.customer_id = o.customer_id
WHERE c.is_active = 1
GROUP BY c.customer_id, c.customer_name;
END;In dbt, this becomes a model file (models/marts/customer_summary.sql):
{{
config(
materialized='table',
indexes=[{'columns': ['customer_id']}]
)
}}
WITH customers AS (
SELECT * FROM {{ ref('stg_customers') }}
WHERE is_active = 1
),
orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
summary AS (
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) as total_orders,
COALESCE(SUM(o.order_amount), 0) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name
)
SELECT * FROM summaryThe advantages of dbt over stored procedures:
ref() to reference other models, creating a DAG (directed acyclic graph) of dependencies.For complex procedures with multiple steps, break them into multiple dbt models. For incremental loads (where you only process new data since the last run), use dbt's incremental materialization:
{{
config(
materialized='incremental',
unique_key='order_id',
on_schema_change='fail'
)
}}
SELECT
order_id,
customer_id,
order_date,
order_amount
FROM {{ source('raw', 'orders') }}
{% if execute %}
{% if this.exists %}
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
{% endif %}This model will do a full refresh on the first run, then only insert new rows on subsequent runs—dramatically reducing compute costs and runtime.
Replace Synapse's scheduled jobs with dbt's orchestration:
dbt-airflow or cosmos).For most teams, dbt Cloud is the easiest path—it handles credential management, provides a UI for viewing runs, and integrates with Slack for alerts.
Once your data is in the lakehouse and transformations are running via dbt, you need a BI layer to replace Synapse's limited reporting capabilities.
D23 is a managed Apache Superset platform purpose-built for teams migrating from Synapse, Looker, or Tableau. Unlike Synapse's embedded Power BI integration, Superset is open-source, API-first, and designed for self-serve analytics.
Superset connects to your query engine via a database driver. If you're using Trino, the connection string looks like:
trino://user:password@trino-coordinator:8080/iceberg/default
If you're using Spark (via Databricks or open-source), use the Spark SQL driver:
spark://user:token@your-spark-cluster:7077/default
Once connected, Superset can:
For each Synapse report, you'll create an equivalent Superset dashboard:
Superset's filter and drill-down capabilities are powerful and often exceed what Synapse's Power BI integration offers. You can create cascading filters (e.g., select a region, then see only customers in that region), drill-downs to detail data, and cross-filter multiple charts.
One of the biggest productivity gains from D23 is AI-powered query generation. Instead of manually writing SQL, users can describe what they want in natural language, and the AI generates the query.
For example:
User: "Show me revenue by product category for the last 12 months, broken down by month."
AI: Generates the SQL:
SELECT
DATE_TRUNC('month', order_date) as month,
product_category,
SUM(order_amount) as revenue
FROM iceberg.gold.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY DATE_TRUNC('month', order_date), product_category
ORDER BY month, product_categoryThis is a game-changer for self-serve analytics, especially for non-technical users who previously relied on analysts to write queries.
Once everything is built and tested, you're ready to cutover from Synapse to the lakehouse.
For critical systems, run both Synapse and the lakehouse in parallel for a period (typically 1-2 weeks):
During parallel running, monitor:
Have a rollback plan in case something goes wrong:
In practice, rollbacks are rare if you've done thorough parallel testing. But having the option reduces anxiety and makes stakeholders more comfortable with the migration.
After cutover, spend time optimizing:
Iceberg handles schema evolution beautifully—you can add, drop, or rename columns without rewriting the entire table. But your dbt models need to account for this.
Solution: Use dbt's source() function to define your raw data sources and document expected schema. Use on_schema_change='fail' in your dbt config to catch unexpected changes.
A query that runs in 2 seconds on Synapse might take 30 seconds on Trino if it's not optimized for the lakehouse.
Solution: Use EXPLAIN to understand query plans. Add partitioning and clustering to your Iceberg tables. Consider materialized views for expensive aggregations.
Synapse has row-level security (RLS) and column-level security (CLS) built in. Iceberg doesn't—you need to implement these in your BI layer or query engine.
Solution: Implement RLS in Superset using dataset-level filters. Or use Trino's row-level access control (RLAC) if you need enforcement at the query engine level.
A lakehouse can be cheaper than Synapse, but only if you're intentional about it. Unoptimized queries can scan massive amounts of data and rack up cloud storage costs.
Solution: Monitor query patterns. Use table statistics and partitioning to reduce scans. Archive cold data to cheaper storage tiers. Use incremental dbt models to avoid full refreshes.
Your team knows Synapse and T-SQL. Moving to a lakehouse with Spark, Trino, dbt, and Superset requires learning new tools and patterns.
Solution: Invest in training. Hire or consult with experts who've done this before. Use managed services like D23 that include expert consulting and support.
Let's walk through a simplified example: migrating a customer analytics workload from Synapse to a lakehouse.
Current state (Synapse):
dbo.raw_customers (100M rows)dbo.customer_metrics (refreshed nightly via stored procedure)dbo.customer_metricsTarget state (Lakehouse):
bronze.customers (partitioned by ingestion date)silver.customers (dbt model)gold.customer_metrics (dbt model, incremental)gold.customer_metricsMigration steps:
Export raw data from Synapse to Parquet:
COPY (SELECT * FROM dbo.raw_customers)
TO 'https://yourstorage.blob.core.windows.net/data/raw_customers/'
WITH (FILE_FORMAT = 'PARQUET');Create Iceberg tables in Spark:
df = spark.read.parquet("abfss://[email protected]/raw_customers/")
df.writeTo("iceberg.bronze.customers").create()Write dbt models:
models/staging/stg_customers.sql: Clean and deduplicate raw datamodels/marts/customer_metrics.sql: Aggregate by customerTest and validate:
# dbt tests
- name: customer_metrics
columns:
- name: customer_id
tests:
- unique
- not_nullCreate Superset dashboard:
iceberg.gold.customer_metricsCutover:
You can build this stack yourself, but there are trade-offs.
DIY Approach:
Managed Services:
For the query engine (Spark, Trino), managed options include Databricks and Dremio, which handle infrastructure and optimization for you.
For analytics and BI, D23 is a managed Apache Superset platform that includes infrastructure, AI-powered query generation, and expert consulting. This is particularly valuable if your team is new to open-source BI tools.
For data transformation (dbt), dbt Cloud is the managed option, but you can also run dbt on your own infrastructure using Airflow or GitHub Actions.
Migrating from Azure Synapse to a modern lakehouse is a significant undertaking, but the benefits—lower costs, better flexibility, open standards, and faster time-to-insight—make it worth the effort.
The key to a successful migration is:
The lakehouse is not just a technical shift—it's a move toward a more flexible, cost-effective, and team-friendly data stack. If you're ready to move beyond Synapse's constraints, the path is clear, and the tools are mature. The question is not whether to migrate, but when.
For teams looking to accelerate their migration and avoid common pitfalls, consulting with experts who've done this before—whether through D23's managed platform or independent data engineering firms—can be the difference between a smooth transition and a months-long slog.
Your lakehouse is waiting. The data is ready to move. Let's go.