New: AI & text-to-SQL on your own SupersetBook a demo

Apache Superset18 Apr 2026

Azure Data Factory + Apache Superset: A Practical Integration Guide

Learn how to connect Azure Data Factory pipelines to Apache Superset dashboards. Step-by-step integration guide for production analytics.

DTD23 Team

13 minutes read

Why Azure Data Factory and Apache Superset Together?

If you're running data pipelines on Azure, you need dashboards that don't lag behind your infrastructure. Azure Data Factory (ADF) orchestrates your ETL workflows; Apache Superset visualizes the results. Connecting them creates a unified analytics stack that's fast, cost-effective, and doesn't lock you into expensive proprietary platforms.

Most teams choose this combination for three reasons: first, ADF's visual pipeline builder and native Azure integrations handle complex data workflows without custom code; second, Apache Superset's open-source foundation and managed hosting options give you control over costs and deployment; third, the combination avoids vendor lock-in while delivering enterprise-grade analytics performance.

This guide walks you through the complete integration—from configuring ADF pipelines to querying their outputs in Superset dashboards, with real examples you can adapt to your data model.

Understanding the Architecture

How ADF and Superset Communicate

Azure Data Factory is an orchestration and ETL engine. It reads data from sources (SQL databases, data lakes, APIs), transforms it, and writes results to a target—typically Azure SQL Database, Azure Synapse, or Azure Data Lake Storage. Apache Superset is a visualization and BI layer that queries these target databases directly.

The integration flow is straightforward:

ADF Pipeline Runs: Your ADF pipeline executes on a schedule (hourly, daily, or on-demand), extracting, transforming, and loading data into a database or data warehouse.
Data Lands in Target Store: Cleaned, aggregated, or enriched data arrives in your analytics database—usually Azure SQL Database or Synapse Analytics.
Superset Queries the Target: Superset connects directly to that database, builds datasets from tables or SQL queries, and renders dashboards in real-time.
Users Explore: Analysts, executives, and embedded users interact with dashboards, drill down, and export reports—all powered by fresh ADF outputs.

This architecture decouples pipeline orchestration from analytics presentation. ADF handles "when and how to move data"; Superset handles "how to explore and present it."

Why This Beats All-in-One Platforms

Looker, Tableau, and Power BI bundle orchestration, transformation, and visualization. That bundling works until you need to:

Reuse ADF pipelines for non-analytics workflows (ML training, operational reporting, data warehousing)
Embed analytics into your product without paying per-seat licensing
Avoid vendor lock-in for your transformation logic
Scale dashboards across teams without platform overhead

With ADF + Superset, your transformation logic lives in ADF (reusable, auditable, version-controlled); your analytics layer is open-source (portable, customizable, cost-predictable). D23's managed Superset platform removes the operational burden of self-hosting while preserving this flexibility.

Setting Up Azure Data Factory for Analytics Output

Configuring ADF Linked Services and Datasets

Before Superset can query anything, ADF must write clean data to a target database. Start by creating a Linked Service—ADF's way of storing connection credentials securely.

Step 1: Create a Linked Service to Your Target Database

In the ADF portal, navigate to Manage > Linked Services and create a new connection to Azure SQL Database (or Synapse Analytics, depending on scale).

Linked Service Name: AzureSqlDb_Analytics
Server: your-server.database.windows.net
Database: analytics_db
Authentication: SQL Authentication (or Managed Identity for better security)
Username: sqladmin
Password: [secure password stored in Key Vault]

Use Azure Key Vault integration to store credentials—never hardcode them. This ensures your connection string stays secure and rotatable.

Step 2: Define Output Datasets

Create a dataset that represents the table where ADF will write analytics data. For example, a sales_summary table that aggregates daily transactions.

Dataset Name: SalesSummaryOutput
Linked Service: AzureSqlDb_Analytics
Table Name: dbo.sales_summary
Schema: dbo
Columns: date, region, product, revenue, units_sold, margin

Define the schema explicitly so Superset knows what columns to expect. This prevents surprises when dashboards query stale or missing columns.

Building the ADF Pipeline

Your ADF pipeline orchestrates the full flow: extract raw data, apply transformations, write to the analytics database.

Example Pipeline: Daily Sales Aggregation

Copy Activity (Extract): Pull raw sales transactions from your operational database or data lake.
- Source: Azure SQL or Blob Storage
- Destination: Staging table in analytics_db
- Frequency: Daily at 2 AM UTC
Data Flow Activity (Transform): Use ADF's visual Data Flow to aggregate by date, region, and product. Calculate rolling metrics like 7-day average revenue.
- Input: Staging table
- Transformations: Group by, window functions, derived columns
- Output: sales_summary table
Stored Procedure Activity (Validation): Run a SQL stored procedure to validate row counts, check for nulls, and log pipeline execution metadata.
- Stored Proc: usp_validate_sales_summary
- Parameters: @RowCount, @ExecutionTime

This three-step pattern (extract → transform → validate) ensures data quality and gives you audit trails for compliance.

Monitoring and Alerts

Configure Azure Data Factory's monitoring capabilities to alert you when pipelines fail. Set up alerts for:

Pipeline failure (immediate notification)
Slow runs (if daily load takes >30 minutes, investigate)
Data quality issues (if validation stored proc fails)

These alerts prevent silent failures where Superset dashboards show stale data without anyone knowing.

Connecting Apache Superset to Your ADF Outputs

Adding the Database Connection

Superset needs a direct connection to your analytics database. Navigate to Settings > Database Connections and add a new database.

Configuration for Azure SQL Database

Superset uses SQLAlchemy for database connections. For Azure SQL, the connection string follows this pattern:

mssql+pyodbc://username:[email protected]/database_name?driver=ODBC+Driver+17+for+SQL+Server

Break this down:

mssql+pyodbc: SQLAlchemy dialect for SQL Server via ODBC
username:password: SQL authentication (or use Azure AD with driver=ODBC Driver 17 for SQL Server)
server.database.windows.net: Your Azure SQL Server hostname
database_name: The analytics database where ADF writes data
driver=ODBC+Driver+17+for+SQL+Server: Required ODBC driver

For production, use Managed Identity or Azure AD authentication instead of SQL credentials. This eliminates password rotation headaches and aligns with Azure security best practices.

Testing the Connection

Once you've entered the connection string, click Test Connection. Superset will:

Validate the ODBC driver is installed
Attempt to connect to your database
Verify the user has SELECT permissions on target tables
Report any SSL/TLS certificate issues

If the test fails, common culprits are:

Firewall rules: Your Superset instance's IP isn't whitelisted on Azure SQL
Missing ODBC driver: Install ODBC Driver 17 for SQL Server on the Superset server
Expired credentials: Check that your Key Vault secret hasn't rotated

Configuring Datasets in Superset

Once the database connection is live, create datasets that map to your ADF outputs. A dataset is Superset's abstraction layer—it can be a table or a SQL query.

Dataset 1: Direct Table Reference

Dataset Name: Sales Summary
Database: AzureSqlDb_Analytics
Table: dbo.sales_summary
Columns Exposed: date, region, product, revenue, units_sold, margin
Metrics: SUM(revenue), SUM(units_sold), AVG(margin)
Filters: date >= DATE_TRUNC('month', CURRENT_DATE)

This dataset directly queries the sales_summary table that ADF populates. Superset automatically detects columns and suggests metrics based on data types.

Dataset 2: Custom SQL Query

For more complex aggregations, write a SQL query that runs every time a user accesses the dataset:

SELECT
  DATE_TRUNC('week', date) AS week,
  region,
  SUM(revenue) AS weekly_revenue,
  SUM(units_sold) AS weekly_units,
  AVG(margin) AS avg_margin,
  COUNT(DISTINCT product) AS product_count
FROM dbo.sales_summary
WHERE date >= DATEADD(month, -12, CAST(GETDATE() AS DATE))
GROUP BY DATE_TRUNC('week', date), region
ORDER BY week DESC, region

This query aggregates daily data to weekly, filters to the last 12 months, and pre-calculates metrics. Superset caches the result based on your cache TTL, so repeated dashboard loads don't re-run the expensive query.

Setting Cache Policy

Configure caching to balance freshness and performance:

Cache TTL: 3600 seconds (1 hour)
Warehouse Query Timeout: 300 seconds (5 minutes)
Max Rows: 100,000

For real-time dashboards (e.g., monitoring sales as they happen), use a 5-minute TTL. For executive summaries (refreshed daily by ADF), use 24-hour TTL. This prevents dashboard slowness from expensive queries while keeping data reasonably fresh.

Building Dashboards from ADF Data

Creating Your First Dashboard

With datasets configured, build a dashboard that visualizes your ADF outputs.

Dashboard Structure: Sales Analytics

KPI Cards (Top Row)
- Total Revenue (Last 30 Days)
- Total Units Sold (Last 30 Days)
- Average Margin %
- YoY Growth %
Time Series Chart (Middle)
- Revenue by Week (last 12 months)
- Stacked area chart showing contribution by region
Breakdown Charts (Bottom)
- Revenue by Product (bar chart, top 10)
- Revenue by Region (pie chart or treemap)
- Margin Distribution (histogram)
Filters (Left Sidebar)
- Date range picker (defaults to last 30 days)
- Region multi-select (all regions selected by default)
- Product multi-select (all products selected)

Each chart is a Superset "slice"—a visualization of a dataset with specific dimensions, metrics, and filters. Combine slices on a dashboard to tell a story.

Configuring Interactivity

Superset dashboards are interactive by default, but you can enhance this:

Cross-Filter Setup

Make the region pie chart a filter source. When a user clicks a region, all other charts on the dashboard filter to that region:

Click the pie chart → Edit Chart
Under Interactions, enable Cross Filter
Select target charts (time series, product breakdown, etc.)
Choose the filter column: region

Now clicking "North America" in the pie chart instantly filters the time series to show only North American revenue.

Drill-Down Capability

For the product bar chart, enable drill-down so users can click a product to see weekly trends:

Edit Chart → Drill Down
Set drill-down columns: product → week → region
Users click "Product A" → see weekly revenue for Product A → click a week → see regional breakdown

This layered exploration is powerful for root-cause analysis without needing separate dashboards.

Embedding Dashboards

If you're building a product or internal portal, embed Superset dashboards directly:

<iframe
  src="https://superset.yourdomain.com/superset/dashboard/sales-analytics/?embed=true"
  width="100%"
  height="800"
  frameborder="0"
  allow="fullscreen"
></iframe>

With D23's managed Superset service, embedding is simplified—no self-hosting complexity, built-in security, and automatic scaling. Your product users see fresh dashboards powered by your ADF pipelines without knowing about the infrastructure underneath.

Advanced Integration Patterns

Real-Time Dashboards with Incremental Loads

If your ADF pipeline runs every hour (instead of daily), Superset dashboards stay near real-time. However, full-table refreshes become expensive at scale.

Solution: Incremental Loads

Modify your ADF pipeline to load only new or changed records:

Source Query: Extract records where modified_date > @LastLoadTime
Staging Table: Write incremental records to a staging table
Merge Activity: Use a SQL Merge statement to upsert into the main analytics table

MERGE INTO dbo.sales_summary AS target
USING dbo.sales_summary_staging AS source
ON target.date = source.date
  AND target.region = source.region
  AND target.product = source.product
WHEN MATCHED THEN
  UPDATE SET revenue = source.revenue, units_sold = source.units_sold
WHEN NOT MATCHED THEN
  INSERT (date, region, product, revenue, units_sold)
  VALUES (source.date, source.region, source.product, source.revenue, source.units_sold);

This pattern reduces ADF execution time from 30 minutes (full refresh) to 5 minutes (incremental), enabling dashboards to refresh every 15 minutes instead of daily.

Text-to-SQL for Self-Serve Exploration

Superset's SQL Lab lets analysts write custom queries, but non-technical users need simpler tools. D23's AI-powered text-to-SQL feature converts natural language to SQL:

User Input: "Show me revenue by region for the last quarter, sorted highest to lowest"

Generated SQL:
SELECT
  region,
  SUM(revenue) AS total_revenue
FROM dbo.sales_summary
WHERE date >= DATEADD(quarter, -1, CAST(GETDATE() AS DATE))
GROUP BY region
ORDER BY total_revenue DESC

This democratizes analytics—marketing teams, sales leaders, and finance analysts can explore ADF data without SQL knowledge or waiting for analysts to build custom dashboards.

Metadata Management with DataHub Integration

As your ADF pipelines and Superset dashboards grow, tracking lineage becomes critical. Which dashboards depend on which ADF outputs? Who owns which dataset?

DataHub's Superset integration automatically ingests your Superset metadata:

BI Entities: Dashboards, charts, datasets
Lineage: Which datasets feed which charts
Ownership: Who created and maintains each asset
Governance: Data quality rules, PII classifications

Configure DataHub to ingest Superset metadata hourly. This gives your data governance team a single source of truth for all analytics assets, integrated with your ADF lineage data.

Security and Access Control

Network Security

Firewall Rules

Your Superset instance (whether self-hosted or managed via D23) needs network access to Azure SQL Database:

In Azure Portal, navigate to your SQL Server → Firewalls and Virtual Networks
Add a rule: Superset Server IP → Allow inbound on port 1433
For managed services, add the provider's IP range (D23 provides this during setup)

Alternatively, use Azure Private Link to keep traffic internal to your VNet:

Superset → Private Endpoint → Azure SQL (no internet exposure)

This is essential for regulated industries (healthcare, finance) where data can't traverse the public internet.

Azure AD Authentication

Instead of SQL usernames and passwords, use Azure AD:

Linked Service Auth: Managed Identity
Superset Connection: Azure AD Token (automatic refresh)

This eliminates credential management—Azure handles authentication via your identity provider. When an employee leaves, their access revokes automatically.

Row-Level Security (RLS)

If your ADF outputs include sensitive data (e.g., employee salaries, customer PII), Superset's RLS ensures users only see rows they're authorized for.

Example: Sales Dashboard with RLS

Your sales team should only see their own region's data. Configure RLS:

Create a Superset user role: Sales_North_America
Define a filter: region = 'North America'
Assign the filter to the role
Assign sales reps in North America to the role

Now when a North America sales rep views the dashboard, all charts automatically filter to region = 'North America', even if they try to modify the SQL query.

Audit Logging

Superset logs all dashboard views, query executions, and data exports. Monitor these logs for:

Unusual query patterns (someone querying all customer records)
Excessive exports (potential data theft)
Failed authentication attempts (brute force attacks)

Configure Superset to send logs to Azure Monitor or your SIEM:

LOG_LEVEL: INFO
LOG_FORMAT: JSON
LOG_DESTINATION: Azure Event Hubs

Set up alerts: if a user exports >10,000 rows, notify your security team immediately.

Performance Optimization

Query Performance Tuning

As your ADF outputs grow to millions of rows, Superset queries slow down. Optimize with:

1. Database Indexes

Your analytics database should have indexes on frequently filtered columns:

CREATE INDEX idx_sales_summary_date ON dbo.sales_summary(date);
CREATE INDEX idx_sales_summary_region ON dbo.sales_summary(region);
CREATE INDEX idx_sales_summary_date_region ON dbo.sales_summary(date, region);

Composite indexes (multiple columns) are especially valuable for queries that filter by both date and region.

2. Materialized Views

For expensive aggregations that power multiple dashboards, create a materialized view in your database:

CREATE MATERIALIZED VIEW dbo.sales_summary_weekly AS
SELECT
  DATE_TRUNC('week', date) AS week,
  region,
  product,
  SUM(revenue) AS weekly_revenue,
  SUM(units_sold) AS weekly_units
FROM dbo.sales_summary
GROUP BY DATE_TRUNC('week', date), region, product;
 
CREATE INDEX idx_weekly_week ON dbo.sales_summary_weekly(week);

Point Superset datasets to the materialized view instead of running the aggregation on-demand. Refresh the view nightly via ADF.

3. Superset Query Caching

Configure Superset's result cache (Redis) to store query results:

RESULTS_BACKEND: redis
RESULTS_BACKEND_USE_MSGPACK: true
CACHE_DEFAULT_TIMEOUT: 3600

When multiple users view the same dashboard within the cache TTL, Superset serves cached results instead of re-querying the database. This dramatically reduces database load.

Monitoring and Alerting

Set up monitoring to catch performance degradation early:

Metrics to Track

Query Latency: P95 query time (target: <5 seconds for dashboards)
Database CPU: Alert if >80% sustained
Cache Hit Rate: Target >70% for dashboards
ADF Pipeline Duration: Alert if daily load exceeds SLA (e.g., >30 minutes)

Use Azure Monitor or Datadog to visualize these metrics:

Alert Rule: If P95 Query Latency > 10 seconds for 5 minutes, page on-call
Alert Rule: If ADF Pipeline fails, send Slack notification to #data-team

Troubleshooting Common Issues

Stale Data in Dashboards

Symptom: Dashboard shows data from yesterday, but ADF pipeline ran this morning.

Root Causes:

Cache TTL too long: Superset is serving a cached query result
ADF pipeline failed silently: Check ADF monitor for failed runs
Database connection stale: Superset's connection pool timed out

Solutions:

Manually refresh the dashboard cache: Dashboard Settings > Clear Cache
Check ADF pipeline run history: Navigate to Monitor > Pipeline Runs, look for failed or delayed runs
Restart Superset's connection pool: docker-compose restart superset (if self-hosted)
Verify ADF alerts are configured to notify you of failures

Slow Dashboard Load Times

Symptom: Dashboard takes 30+ seconds to load; users complain about sluggishness.

Root Causes:

Too many charts on one dashboard (each chart is a separate query)
Unindexed database columns in filter queries
Superset server under-resourced (insufficient CPU/memory)

Solutions:

Reduce dashboard complexity: Split into multiple focused dashboards (one for KPIs, one for details)
Add database indexes on filter columns (as described in Performance Optimization section)
Scale Superset: Increase CPU/memory allocation or add more Superset replicas
Use materialized views for complex aggregations

Connection String Errors

Symptom: "Failed to connect to database" when adding the connection.

Root Causes:

ODBC driver not installed on Superset server
Firewall blocking port 1433
Incorrect server hostname or credentials

Solutions:

# Install ODBC driver (Ubuntu/Debian)
sudo apt-get install odbc-mssql
 
# Verify connectivity from Superset server
sqlcmd -S server.database.windows.net -U username -P password -d database_name
 
# Check firewall rules in Azure Portal
# Ensure Superset server IP is whitelisted

Best Practices Summary

ADF Pipeline Design

Idempotency: Pipelines should produce the same output regardless of how many times they run. Use MERGE or DELETE+INSERT patterns, not APPEND.
Audit Columns: Include inserted_date, updated_date, and data_source columns for traceability.
Data Quality Checks: Validate row counts, check for nulls, and alert on anomalies.
Incremental Loads: Use watermark tables to load only new data, reducing ADF execution time.

Superset Dashboard Design

Performance First: Optimize for sub-5-second load times. Use filters to reduce result sets.
Clear Hierarchy: Top-level KPIs, then breakdowns, then detail tables.
Consistent Formatting: Use the same color palette and number formats across dashboards.
Drill-Down Capability: Enable users to explore from summary to detail without leaving the dashboard.

Security and Governance

Least Privilege: Grant users the minimum data access required for their role.
Audit Everything: Log all queries, exports, and dashboard views.
Encrypt in Transit and at Rest: Use TLS for connections; enable Azure SQL Transparent Data Encryption.
Regular Reviews: Quarterly audit of user permissions and dashboard access.

Conclusion

Integrating Azure Data Factory with Apache Superset creates a powerful, cost-effective analytics stack. ADF orchestrates your data pipelines reliably; Superset provides fast, interactive dashboards without the overhead of proprietary BI platforms.

The architecture scales from small teams (one ADF pipeline, five dashboards) to enterprises (hundreds of pipelines, thousands of dashboards). Whether you're self-hosting Superset or using D23's managed platform, the integration patterns remain the same: clean data from ADF, query it in Superset, empower your teams.

Start with a single ADF pipeline and dashboard. Validate the pattern works for your data model. Then expand incrementally—add more pipelines, more dashboards, more users. Use the monitoring and troubleshooting techniques in this guide to keep everything running smoothly.

The result: analytics that move as fast as your business, without the cost or complexity of legacy BI platforms. Your data teams focus on insights, not infrastructure. Your executives get dashboards they trust. Your product teams embed analytics directly into user experiences.

That's the promise of ADF + Superset, and it's achievable today.