Learn how to connect Azure Data Factory pipelines to Apache Superset dashboards. Step-by-step integration guide for production analytics.
If you're running data pipelines on Azure, you need dashboards that don't lag behind your infrastructure. Azure Data Factory (ADF) orchestrates your ETL workflows; Apache Superset visualizes the results. Connecting them creates a unified analytics stack that's fast, cost-effective, and doesn't lock you into expensive proprietary platforms.
Most teams choose this combination for three reasons: first, ADF's visual pipeline builder and native Azure integrations handle complex data workflows without custom code; second, Apache Superset's open-source foundation and managed hosting options give you control over costs and deployment; third, the combination avoids vendor lock-in while delivering enterprise-grade analytics performance.
This guide walks you through the complete integration—from configuring ADF pipelines to querying their outputs in Superset dashboards, with real examples you can adapt to your data model.
Azure Data Factory is an orchestration and ETL engine. It reads data from sources (SQL databases, data lakes, APIs), transforms it, and writes results to a target—typically Azure SQL Database, Azure Synapse, or Azure Data Lake Storage. Apache Superset is a visualization and BI layer that queries these target databases directly.
The integration flow is straightforward:
This architecture decouples pipeline orchestration from analytics presentation. ADF handles "when and how to move data"; Superset handles "how to explore and present it."
Looker, Tableau, and Power BI bundle orchestration, transformation, and visualization. That bundling works until you need to:
With ADF + Superset, your transformation logic lives in ADF (reusable, auditable, version-controlled); your analytics layer is open-source (portable, customizable, cost-predictable). D23's managed Superset platform removes the operational burden of self-hosting while preserving this flexibility.
Before Superset can query anything, ADF must write clean data to a target database. Start by creating a Linked Service—ADF's way of storing connection credentials securely.
Step 1: Create a Linked Service to Your Target Database
In the ADF portal, navigate to Manage > Linked Services and create a new connection to Azure SQL Database (or Synapse Analytics, depending on scale).
Linked Service Name: AzureSqlDb_Analytics
Server: your-server.database.windows.net
Database: analytics_db
Authentication: SQL Authentication (or Managed Identity for better security)
Username: sqladmin
Password: [secure password stored in Key Vault]
Use Azure Key Vault integration to store credentials—never hardcode them. This ensures your connection string stays secure and rotatable.
Step 2: Define Output Datasets
Create a dataset that represents the table where ADF will write analytics data. For example, a sales_summary table that aggregates daily transactions.
Dataset Name: SalesSummaryOutput
Linked Service: AzureSqlDb_Analytics
Table Name: dbo.sales_summary
Schema: dbo
Columns: date, region, product, revenue, units_sold, margin
Define the schema explicitly so Superset knows what columns to expect. This prevents surprises when dashboards query stale or missing columns.
Your ADF pipeline orchestrates the full flow: extract raw data, apply transformations, write to the analytics database.
Example Pipeline: Daily Sales Aggregation
Copy Activity (Extract): Pull raw sales transactions from your operational database or data lake.
Data Flow Activity (Transform): Use ADF's visual Data Flow to aggregate by date, region, and product. Calculate rolling metrics like 7-day average revenue.
sales_summary tableStored Procedure Activity (Validation): Run a SQL stored procedure to validate row counts, check for nulls, and log pipeline execution metadata.
usp_validate_sales_summaryThis three-step pattern (extract → transform → validate) ensures data quality and gives you audit trails for compliance.
Monitoring and Alerts
Configure Azure Data Factory's monitoring capabilities to alert you when pipelines fail. Set up alerts for:
These alerts prevent silent failures where Superset dashboards show stale data without anyone knowing.
Superset needs a direct connection to your analytics database. Navigate to Settings > Database Connections and add a new database.
Configuration for Azure SQL Database
Superset uses SQLAlchemy for database connections. For Azure SQL, the connection string follows this pattern:
mssql+pyodbc://username:[email protected]/database_name?driver=ODBC+Driver+17+for+SQL+Server
Break this down:
driver=ODBC Driver 17 for SQL Server)For production, use Managed Identity or Azure AD authentication instead of SQL credentials. This eliminates password rotation headaches and aligns with Azure security best practices.
Testing the Connection
Once you've entered the connection string, click Test Connection. Superset will:
If the test fails, common culprits are:
ODBC Driver 17 for SQL Server on the Superset serverOnce the database connection is live, create datasets that map to your ADF outputs. A dataset is Superset's abstraction layer—it can be a table or a SQL query.
Dataset 1: Direct Table Reference
Dataset Name: Sales Summary
Database: AzureSqlDb_Analytics
Table: dbo.sales_summary
Columns Exposed: date, region, product, revenue, units_sold, margin
Metrics: SUM(revenue), SUM(units_sold), AVG(margin)
Filters: date >= DATE_TRUNC('month', CURRENT_DATE)
This dataset directly queries the sales_summary table that ADF populates. Superset automatically detects columns and suggests metrics based on data types.
Dataset 2: Custom SQL Query
For more complex aggregations, write a SQL query that runs every time a user accesses the dataset:
SELECT
DATE_TRUNC('week', date) AS week,
region,
SUM(revenue) AS weekly_revenue,
SUM(units_sold) AS weekly_units,
AVG(margin) AS avg_margin,
COUNT(DISTINCT product) AS product_count
FROM dbo.sales_summary
WHERE date >= DATEADD(month, -12, CAST(GETDATE() AS DATE))
GROUP BY DATE_TRUNC('week', date), region
ORDER BY week DESC, regionThis query aggregates daily data to weekly, filters to the last 12 months, and pre-calculates metrics. Superset caches the result based on your cache TTL, so repeated dashboard loads don't re-run the expensive query.
Setting Cache Policy
Configure caching to balance freshness and performance:
Cache TTL: 3600 seconds (1 hour)
Warehouse Query Timeout: 300 seconds (5 minutes)
Max Rows: 100,000
For real-time dashboards (e.g., monitoring sales as they happen), use a 5-minute TTL. For executive summaries (refreshed daily by ADF), use 24-hour TTL. This prevents dashboard slowness from expensive queries while keeping data reasonably fresh.
With datasets configured, build a dashboard that visualizes your ADF outputs.
Dashboard Structure: Sales Analytics
KPI Cards (Top Row)
Time Series Chart (Middle)
Breakdown Charts (Bottom)
Filters (Left Sidebar)
Each chart is a Superset "slice"—a visualization of a dataset with specific dimensions, metrics, and filters. Combine slices on a dashboard to tell a story.
Superset dashboards are interactive by default, but you can enhance this:
Cross-Filter Setup
Make the region pie chart a filter source. When a user clicks a region, all other charts on the dashboard filter to that region:
regionNow clicking "North America" in the pie chart instantly filters the time series to show only North American revenue.
Drill-Down Capability
For the product bar chart, enable drill-down so users can click a product to see weekly trends:
product → week → regionThis layered exploration is powerful for root-cause analysis without needing separate dashboards.
If you're building a product or internal portal, embed Superset dashboards directly:
<iframe
src="https://superset.yourdomain.com/superset/dashboard/sales-analytics/?embed=true"
width="100%"
height="800"
frameborder="0"
allow="fullscreen"
></iframe>With D23's managed Superset service, embedding is simplified—no self-hosting complexity, built-in security, and automatic scaling. Your product users see fresh dashboards powered by your ADF pipelines without knowing about the infrastructure underneath.
If your ADF pipeline runs every hour (instead of daily), Superset dashboards stay near real-time. However, full-table refreshes become expensive at scale.
Solution: Incremental Loads
Modify your ADF pipeline to load only new or changed records:
modified_date > @LastLoadTimeMERGE INTO dbo.sales_summary AS target
USING dbo.sales_summary_staging AS source
ON target.date = source.date
AND target.region = source.region
AND target.product = source.product
WHEN MATCHED THEN
UPDATE SET revenue = source.revenue, units_sold = source.units_sold
WHEN NOT MATCHED THEN
INSERT (date, region, product, revenue, units_sold)
VALUES (source.date, source.region, source.product, source.revenue, source.units_sold);This pattern reduces ADF execution time from 30 minutes (full refresh) to 5 minutes (incremental), enabling dashboards to refresh every 15 minutes instead of daily.
Superset's SQL Lab lets analysts write custom queries, but non-technical users need simpler tools. D23's AI-powered text-to-SQL feature converts natural language to SQL:
User Input: "Show me revenue by region for the last quarter, sorted highest to lowest"
Generated SQL:
SELECT
region,
SUM(revenue) AS total_revenue
FROM dbo.sales_summary
WHERE date >= DATEADD(quarter, -1, CAST(GETDATE() AS DATE))
GROUP BY region
ORDER BY total_revenue DESC
This democratizes analytics—marketing teams, sales leaders, and finance analysts can explore ADF data without SQL knowledge or waiting for analysts to build custom dashboards.
As your ADF pipelines and Superset dashboards grow, tracking lineage becomes critical. Which dashboards depend on which ADF outputs? Who owns which dataset?
DataHub's Superset integration automatically ingests your Superset metadata:
Configure DataHub to ingest Superset metadata hourly. This gives your data governance team a single source of truth for all analytics assets, integrated with your ADF lineage data.
Firewall Rules
Your Superset instance (whether self-hosted or managed via D23) needs network access to Azure SQL Database:
Superset Server IP → Allow inbound on port 1433Alternatively, use Azure Private Link to keep traffic internal to your VNet:
Superset → Private Endpoint → Azure SQL (no internet exposure)
This is essential for regulated industries (healthcare, finance) where data can't traverse the public internet.
Azure AD Authentication
Instead of SQL usernames and passwords, use Azure AD:
Linked Service Auth: Managed Identity
Superset Connection: Azure AD Token (automatic refresh)
This eliminates credential management—Azure handles authentication via your identity provider. When an employee leaves, their access revokes automatically.
If your ADF outputs include sensitive data (e.g., employee salaries, customer PII), Superset's RLS ensures users only see rows they're authorized for.
Example: Sales Dashboard with RLS
Your sales team should only see their own region's data. Configure RLS:
Sales_North_Americaregion = 'North America'Now when a North America sales rep views the dashboard, all charts automatically filter to region = 'North America', even if they try to modify the SQL query.
Superset logs all dashboard views, query executions, and data exports. Monitor these logs for:
Configure Superset to send logs to Azure Monitor or your SIEM:
LOG_LEVEL: INFO
LOG_FORMAT: JSON
LOG_DESTINATION: Azure Event Hubs
Set up alerts: if a user exports >10,000 rows, notify your security team immediately.
As your ADF outputs grow to millions of rows, Superset queries slow down. Optimize with:
1. Database Indexes
Your analytics database should have indexes on frequently filtered columns:
CREATE INDEX idx_sales_summary_date ON dbo.sales_summary(date);
CREATE INDEX idx_sales_summary_region ON dbo.sales_summary(region);
CREATE INDEX idx_sales_summary_date_region ON dbo.sales_summary(date, region);Composite indexes (multiple columns) are especially valuable for queries that filter by both date and region.
2. Materialized Views
For expensive aggregations that power multiple dashboards, create a materialized view in your database:
CREATE MATERIALIZED VIEW dbo.sales_summary_weekly AS
SELECT
DATE_TRUNC('week', date) AS week,
region,
product,
SUM(revenue) AS weekly_revenue,
SUM(units_sold) AS weekly_units
FROM dbo.sales_summary
GROUP BY DATE_TRUNC('week', date), region, product;
CREATE INDEX idx_weekly_week ON dbo.sales_summary_weekly(week);Point Superset datasets to the materialized view instead of running the aggregation on-demand. Refresh the view nightly via ADF.
3. Superset Query Caching
Configure Superset's result cache (Redis) to store query results:
RESULTS_BACKEND: redis
RESULTS_BACKEND_USE_MSGPACK: true
CACHE_DEFAULT_TIMEOUT: 3600
When multiple users view the same dashboard within the cache TTL, Superset serves cached results instead of re-querying the database. This dramatically reduces database load.
Set up monitoring to catch performance degradation early:
Metrics to Track
Use Azure Monitor or Datadog to visualize these metrics:
Alert Rule: If P95 Query Latency > 10 seconds for 5 minutes, page on-call
Alert Rule: If ADF Pipeline fails, send Slack notification to #data-team
Symptom: Dashboard shows data from yesterday, but ADF pipeline ran this morning.
Root Causes:
Solutions:
docker-compose restart superset (if self-hosted)Symptom: Dashboard takes 30+ seconds to load; users complain about sluggishness.
Root Causes:
Solutions:
Symptom: "Failed to connect to database" when adding the connection.
Root Causes:
Solutions:
# Install ODBC driver (Ubuntu/Debian)
sudo apt-get install odbc-mssql
# Verify connectivity from Superset server
sqlcmd -S server.database.windows.net -U username -P password -d database_name
# Check firewall rules in Azure Portal
# Ensure Superset server IP is whitelistedinserted_date, updated_date, and data_source columns for traceability.Integrating Azure Data Factory with Apache Superset creates a powerful, cost-effective analytics stack. ADF orchestrates your data pipelines reliably; Superset provides fast, interactive dashboards without the overhead of proprietary BI platforms.
The architecture scales from small teams (one ADF pipeline, five dashboards) to enterprises (hundreds of pipelines, thousands of dashboards). Whether you're self-hosting Superset or using D23's managed platform, the integration patterns remain the same: clean data from ADF, query it in Superset, empower your teams.
Start with a single ADF pipeline and dashboard. Validate the pattern works for your data model. Then expand incrementally—add more pipelines, more dashboards, more users. Use the monitoring and troubleshooting techniques in this guide to keep everything running smoothly.
The result: analytics that move as fast as your business, without the cost or complexity of legacy BI platforms. Your data teams focus on insights, not infrastructure. Your executives get dashboards they trust. Your product teams embed analytics directly into user experiences.
That's the promise of ADF + Superset, and it's achievable today.