Learn how D23 executes zero-downtime Apache Superset upgrades using blue-green deployments, schema migrations, and rollback strategies for production analytics.
When you're running Apache Superset in production—especially as a managed service supporting dozens of teams and hundreds of dashboards—upgrades aren't optional maintenance tasks. They're critical operational events that require precision planning, tested procedures, and the ability to roll back instantly if something breaks.
At D23, we've built a zero-downtime upgrade strategy that keeps dashboards running, queries executing, and your analytics infrastructure available 24/7. This article walks through exactly how we do it: the architecture decisions, the deployment patterns, the database migration strategies, and the safety nets we've put in place.
If you're evaluating managed Apache Superset as an alternative to Looker or Tableau, or if you're running Superset yourself and want to understand production-grade upgrade patterns, this deep-dive will give you the concrete operational knowledge you need.
Before we dig into the technical implementation, let's be clear about what's at stake. Apache Superset upgrades aren't like patching a test environment. When you're running embedded analytics, self-serve BI dashboards, or KPI reporting infrastructure that teams depend on daily, downtime isn't just an inconvenience—it breaks workflows, delays decisions, and erodes confidence in your analytics platform.
Consider a typical scenario: you've got 50 dashboards embedded in your product. Users are checking conversion funnels, revenue trends, and customer cohorts. Your data team is running ad-hoc queries against your data warehouse. A critical Superset security patch is released, and you need to upgrade within days. If you take the traditional approach—stop the service, run migrations, restart—you're looking at 15 minutes to an hour of complete unavailability. In that window, embedded dashboards go blank, API calls fail, and your team loses visibility into business metrics.
The financial impact depends on your business, but for SaaS companies, ecommerce platforms, and data-driven organizations, even 30 minutes of analytics downtime can cost thousands of dollars in lost visibility and delayed decisions.
D23's approach eliminates this entirely. We execute upgrades while dashboards stay live, queries continue to execute, and users never see a service interruption. Here's how.
Zero-downtime upgrades start with architecture. If your Superset deployment is tightly coupled to a single server, database, or cache layer, you can't upgrade without stopping everything. That's why the first principle of our infrastructure is strict separation of concerns.
Our Superset deployment consists of three independent layers:
Application Layer (Stateless): Superset web servers and query executors run in containers with no local state. A user's session isn't pinned to a specific server. If a container goes down, the load balancer routes traffic to another. This is critical because it means we can drain traffic from old containers, spin up new ones with upgraded code, and retire the old ones—all without losing a single request.
Data Layer (Persistent): PostgreSQL (or your chosen database) stores dashboards, users, saved queries, and metadata. This layer never stops during an upgrade. We use read replicas and connection pooling to ensure database availability remains constant.
Cache Layer (Distributed): Redis handles query result caching, session storage, and temporary data. Like the database, this runs independently and survives application upgrades. We use Redis Sentinel for automatic failover, so even if a cache node fails, the system recovers without manual intervention.
This three-layer architecture means an upgrade touches only the stateless application layer. The data and cache layers keep running, serving requests from old application instances until they're fully drained.
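To make the stateless idea concrete, here is a minimal configuration sketch rather than our production config. Superset reads its settings from a Python file (`superset_config.py`), and `SQLALCHEMY_DATABASE_URI`, `CACHE_CONFIG`, and `DATA_CACHE_CONFIG` are standard Superset settings. The environment variable names, default URLs, key prefixes, and timeouts below are assumptions chosen for illustration.

```python
# superset_config.py (sketch): keep all state out of the application containers.
import os

# Metadata database (the persistent data layer).
# The environment variable names and defaults here are hypothetical.
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SUPERSET_DB_URI", "postgresql://superset:superset@db:5432/superset"
)

REDIS_URL = os.environ.get("SUPERSET_REDIS_URL", "redis://redis:6379/0")


def redis_cache(prefix: str, timeout_seconds: int) -> dict:
    """Build a Flask-Caching config dict that points at the shared Redis."""
    return {
        "CACHE_TYPE": "RedisCache",
        "CACHE_REDIS_URL": REDIS_URL,
        "CACHE_KEY_PREFIX": prefix,
        "CACHE_DEFAULT_TIMEOUT": timeout_seconds,
    }


# Metadata cache and chart-data cache both live in Redis, not in the container,
# so blue and green containers can come and go without losing cached state.
CACHE_CONFIG = redis_cache("superset_meta_", 300)
DATA_CACHE_CONFIG = redis_cache("superset_data_", 3600)
```

Because every value comes from the environment or from a shared service, the same image can run as blue or green with nothing stored on the container's disk. Server-side session storage in Redis needs additional settings that this sketch leaves out.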
The core technique we use is called blue-green deployment. Here's the concept: instead of upgrading in place, you run two complete, identical production environments side by side. One is "blue" (current), one is "green" (new). You upgrade green while blue serves all traffic. Once green is fully tested and healthy, you flip traffic over. If something goes wrong, you flip back instantly.
For Superset, the process looks like this:
Phase 1: Prepare Green Environment
We provision new Superset containers with the upgraded version. These containers connect to the same PostgreSQL database and Redis cache as the blue environment. They run schema migrations (more on that below) in a controlled, testable way. The green environment is fully operational but receives zero traffic.
Phase 2: Smoke Testing
Before we route any production traffic, we run automated tests against green:
If any test fails, green is torn down and we investigate. Blue continues serving all traffic unaffected.
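A smoke-test runner can be quite small. The sketch below is illustrative and not our actual test suite: it requests a few paths on the green environment and reports which ones failed. `/health` is Superset's built-in health endpoint, while the set of checks, the function names, and the injectable `fetch` parameter are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple
from urllib.request import urlopen


def http_get(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch a URL and return (status_code, body_text)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_smoke_tests(
    base_url: str,
    checks: Dict[str, Callable[[int, str], bool]],
    fetch: Callable[[str], Tuple[int, str]] = http_get,
) -> List[str]:
    """Run each check against base_url + path and return the failed paths."""
    failures = []
    for path, is_ok in checks.items():
        try:
            status, body = fetch(base_url + path)
            if not is_ok(status, body):
                failures.append(path)
        except Exception:
            # Network errors and HTTP error responses both count as failures.
            failures.append(path)
    return failures


# Hypothetical check set for illustration.
GREEN_CHECKS = {
    "/health": lambda status, body: status == 200 and "OK" in body,
    "/login/": lambda status, body: status == 200,
}
```

Calling `run_smoke_tests("http://green.internal:8088", GREEN_CHECKS)` (a made-up hostname) returns an empty list when every check passes. Anything else means green gets torn down before it sees real traffic.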
Phase 3: Gradual Traffic Shift
Once green passes smoke tests, we don't flip 100% of traffic immediately. Instead, we use a load balancer (Nginx with custom routing logic) to shift traffic to green gradually. We start with 5% of requests, monitor error rates and latency, then move to 10%, 25%, 50%, and finally 100%.
This gradual shift is crucial. If there's a subtle bug that only manifests under production load or with specific data patterns, we catch it while 95% of traffic still flows through blue. We can roll back without affecting most users.
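The staged shift can be sketched as a small control loop. This is a simplified illustration, not our routing logic: `set_green_weight` and `green_is_healthy` are hypothetical callbacks standing in for "update the load balancer" and "check error rates and latency", and a real rollout would also wait and observe at each stage before moving on.

```python
from typing import Callable, Sequence

STAGES = (5, 10, 25, 50, 100)  # percent of traffic sent to green


def shift_traffic(
    set_green_weight: Callable[[int], None],
    green_is_healthy: Callable[[], bool],
    stages: Sequence[int] = STAGES,
) -> bool:
    """Walk through the stages; on a failed health check, return all traffic to blue.

    Returns True if green ends up with 100% of traffic, False if we rolled back.
    """
    for percent in stages:
        set_green_weight(percent)
        if not green_is_healthy():
            set_green_weight(0)  # roll back: blue takes all traffic again
            return False
    return True
```

The key property is that a failure at any stage leads to the same action, sending the weight back to zero, so rollback never depends on how far the rollout got.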
Phase 4: Complete Cutover
Once green has handled 100% of traffic for a period without issues, the cutover is complete. Blue stays in standby for the rollback window described next and is formally decommissioned after that.
Phase 5: Instant Rollback (If Needed)
For 24 hours after cutover, we keep blue running in standby mode. If a critical issue emerges—say, a dashboard rendering bug that only appears in a specific configuration—we can flip traffic back to blue in seconds. This gives us a safety net without requiring a full re-upgrade.
Closely related patterns are well documented in the Kubernetes deployment documentation, which describes rolling updates in detail. We implement blue-green using container orchestration, but the principles apply whether you're using Kubernetes, Docker Compose, or traditional VMs.
Blue-green deployment works smoothly for application code, but databases are trickier. Here's the problem: Superset upgrades often include schema changes. A new version might add columns, create indexes, or restructure tables. You can't run two versions of the application against incompatible database schemas simultaneously.
Our solution uses a principle called backward-compatible migrations. Here's how it works:
Step 1: Additive-Only Migrations
When we upgrade Superset, we ensure database changes are additive. We add new columns, but we don't remove old ones immediately. We create new indexes without dropping old ones. This way, both blue (old code) and green (new code) can read and write to the same database schema.
For example, if an upgrade adds a query_timeout_seconds column to the queries table:
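A minimal sketch of such an additive change, using Python's built-in `sqlite3` so it runs anywhere. Superset's real migrations are Alembic scripts applied with `superset db upgrade`, so treat this only as a demonstration of the principle. The simplified `queries` table and the inserted rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (id INTEGER PRIMARY KEY, sql TEXT NOT NULL)")

# Blue (old code) writes a row before the migration.
conn.execute("INSERT INTO queries (sql) VALUES (?)", ("SELECT 1",))

# Additive migration: one new nullable column, nothing dropped or renamed.
conn.execute("ALTER TABLE queries ADD COLUMN query_timeout_seconds INTEGER")

# Blue keeps working unchanged because it never mentions the new column.
conn.execute("INSERT INTO queries (sql) VALUES (?)", ("SELECT 2",))

# Green (new code) can start using the column right away.
conn.execute(
    "INSERT INTO queries (sql, query_timeout_seconds) VALUES (?, ?)",
    ("SELECT 3", 300),
)

rows = conn.execute(
    "SELECT sql, query_timeout_seconds FROM queries ORDER BY id"
).fetchall()
# Rows written by old code simply carry NULL in the new column.
```

The new column has to be nullable (or have a default). Otherwise inserts from the old code, which does not know about the column, would start failing the moment the migration runs.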
Step 2: Dual-Write During Transition
During the gradual traffic shift, green instances write to both the old and new columns wherever a change supersedes an existing column. This keeps the data consistent for both versions: blue can still read the old columns, even for rows that green wrote.
Step 3: Cleanup After Cutover
Once we've been running 100% on green for a period, we run cleanup migrations: dropping unused columns, removing deprecated indexes, and optimizing the schema. This happens after blue is decommissioned, so there's no risk of incompatibility.
This approach requires careful planning. Before we upgrade, we review the Superset release notes and identify schema changes. We test migrations against a production-like copy of the database. We measure migration time and plan for it. The official Superset upgrade documentation provides migration scripts, but in production environments, we always run them in a staging environment first.
During an upgrade, queries that are already executing should complete without interruption. This requires careful connection management.
Superset uses a connection pool to the data warehouse (Snowflake, BigQuery, PostgreSQL, etc.). When we upgrade, we don't immediately close all connections. Instead:
Drain New Connections: New requests are routed to green instances running the new application code, while blue instances stop accepting new connections from the load balancer.
Let Existing Queries Complete: Old instances keep running, serving queries that are already in flight. A user who started a 5-minute query before the upgrade completes that query on the old instance.
Graceful Shutdown: Once all in-flight queries complete (or reach a timeout), the old instance shuts down cleanly.
This is called a "drain and replace" pattern. It ensures no query is interrupted mid-execution. For long-running queries (common in data exploration), this is critical.
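The drain step can be sketched in a few lines. This is an illustration of the pattern rather than how Superset or our orchestration implements it: the `DrainableInstance` class and its method names are hypothetical, and in practice the web server and orchestrator usually provide this behavior through graceful-shutdown settings.

```python
import threading
import time


class DrainableInstance:
    """Tracks in-flight queries so an instance shuts down only when idle."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False

    def try_start_query(self) -> bool:
        """Admit a query unless the instance is draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def finish_query(self) -> None:
        """Mark one in-flight query as completed."""
        with self._lock:
            self._in_flight -= 1

    def drain(self, timeout_seconds: float, poll_seconds: float = 0.01) -> bool:
        """Stop admitting work, then wait for in-flight queries to finish.

        Returns True if the instance went idle before the timeout.
        """
        with self._lock:
            self._draining = True
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_seconds)
        with self._lock:
            return self._in_flight == 0
```

The timeout matters as much as the waiting: a return value of `False` tells the caller that queries were still running, so it can decide whether to extend the window or terminate them.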
We also use connection pooling with PgBouncer for PostgreSQL connections, which allows us to maintain connection limits while supporting many concurrent users. This prevents connection exhaustion during the upgrade window.
Superset caches query results aggressively. A dashboard with 10 charts might execute 10 queries, but if those queries are cached, the dashboard loads in milliseconds instead of seconds.
During an upgrade, we need to handle caching carefully. Here's our approach:
Preserve Cache Across Versions
We use Redis for caching, and Redis persists data independently of the Superset application. When we upgrade, the cache survives. Green instances can read cached results from blue's execution.
This has a subtle benefit: dashboards load faster immediately after upgrade because the cache is warm. Users don't experience the "cold cache" slowdown that typically follows a deployment.
Invalidate Cache for Changed Queries
If an upgrade changes how queries are executed (e.g., a new optimization that changes the query plan), we need to invalidate the cache for affected queries. We do this by tagging cache entries with a version number. When we upgrade, we bump the version for specific query types, invalidating old entries.
This prevents stale results from being served by new code that might interpret them differently.
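One simple way to implement version tagging is to put the version into the cache key itself. The sketch below assumes that approach; the query type names, the `CACHE_VERSIONS` mapping, and the key layout are all hypothetical and are not Superset's internal key format.

```python
import hashlib

# Hypothetical per-query-type versions; bump one to invalidate that type's entries.
CACHE_VERSIONS = {"chart_data": 1, "sql_lab": 1}


def cache_key(query_type: str, sql: str, versions: dict = CACHE_VERSIONS) -> str:
    """Build a cache key that changes whenever the query type's version changes."""
    digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()[:16]
    return f"{query_type}:v{versions[query_type]}:{digest}"
```

With this design nothing is deleted at upgrade time. Bumping `chart_data` from 1 to 2 means new code looks up different keys, and the old entries are never read again and expire through their normal TTL. Entries for other query types keep their keys and stay warm.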
Monitor Cache Hit Rates
During the gradual traffic shift, we monitor cache hit rates on both blue and green. If green has significantly lower hit rates, it suggests a problem. We investigate before shifting more traffic.
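The comparison itself is simple arithmetic. As a rough sketch, with a made-up tolerance of 10 percentage points standing in for "significantly lower":

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


def green_cache_looks_degraded(
    blue: tuple, green: tuple, max_drop: float = 0.10
) -> bool:
    """True if green's hit rate is more than max_drop below blue's.

    blue and green are (hits, misses) pairs; max_drop is a hypothetical threshold.
    """
    return hit_rate(*blue) - hit_rate(*green) > max_drop
```

For example, if blue serves 90 hits and 10 misses while green serves 70 hits and 30 misses over the same window, the hit rates are 0.9 and 0.7, and the 0.2 gap would pause the rollout under this threshold.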
You can't execute safe upgrades without visibility. We instrument every stage of the upgrade with metrics and logs.
Real-Time Dashboards
During an upgrade, we're watching:
We display these metrics on a dedicated dashboard that the on-call engineer watches throughout the upgrade. If any metric goes red, we have a runbook for immediate rollback.
Distributed Tracing
We use distributed tracing, instrumenting Superset with OpenTelemetry, to follow individual requests through the system. If a user reports that a dashboard is slow after an upgrade, we can trace that request, see exactly which services it touched, and identify the bottleneck.
Alerting Thresholds
We set specific thresholds that trigger automatic rollback:
If green triggers any of these conditions, we automatically flip traffic back to blue and page the on-call engineer.
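A threshold check of this kind might look like the sketch below. The specific numbers are illustrative assumptions, not our production thresholds, and the metric names are hypothetical. The point is only that the rollback decision is a pure function of a few measurements, which makes it easy to test and automate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; not the thresholds used in production.
    max_error_rate: float = 0.01        # fraction of failed requests
    max_p95_latency_ms: float = 2000.0
    max_pool_utilization: float = 0.90  # fraction of connections in use


@dataclass(frozen=True)
class Metrics:
    error_rate: float
    p95_latency_ms: float
    pool_utilization: float


def rollback_reasons(m: Metrics, t: Thresholds = Thresholds()) -> List[str]:
    """Return the breached thresholds; an empty list means green stays in service."""
    reasons = []
    if m.error_rate > t.max_error_rate:
        reasons.append("error_rate")
    if m.p95_latency_ms > t.max_p95_latency_ms:
        reasons.append("p95_latency_ms")
    if m.pool_utilization > t.max_pool_utilization:
        reasons.append("pool_utilization")
    return reasons
```

Returning the list of reasons rather than a bare boolean means the page sent to the on-call engineer can say which metric tripped the rollback.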
Zero-downtime upgrades in production are only possible because we've already tested everything in a staging environment that mirrors production exactly.
Our staging setup includes:
Only after staging passes all tests do we schedule the production upgrade.
Despite careful planning, sometimes things go wrong. A subtle bug emerges under production load. A third-party integration breaks. A query optimization causes unexpected results.
We have multiple rollback paths:
Immediate Rollback (First 24 Hours)
We keep the blue environment running in standby for 24 hours after cutover. If we discover a critical issue, we flip traffic back to blue in about 30 seconds. Because the old code is already running, there's no startup delay.
Full Rollback (After 24 Hours)
After blue is decommissioned, we keep a backup of the old Superset container image and a database backup from just before the upgrade. If we discover a critical issue, we:
This takes a few minutes but is fully automated. We test it monthly in staging to ensure it works.
Partial Rollback
For non-critical issues, we might roll back specific features or dashboards rather than the entire upgrade. For example, if a new visualization type has a bug, we might disable it for specific users while we fix it.
Technical infrastructure is only half the story. The other half is process and communication.
Before every upgrade, we:
This communication builds confidence. Teams trust that upgrades are planned, tested, and safe.
Let's walk through a concrete example. Suppose Superset 3.1 is released with a performance improvement and a security patch. Here's how we'd execute the upgrade:
T-7 Days: We review the release notes, identify schema changes (adding a feature_flags table), and test migrations in staging.
T-3 Days: We run load tests in staging, simulating 500 concurrent users and 1,000 dashboard loads. We verify performance improves.
T-1 Day: We prepare green infrastructure, build containers with Superset 3.1, and run smoke tests.
T-0 (Upgrade Day, 2 AM UTC): We execute the upgrade during our lowest-traffic window.
Total downtime: zero. Total time: 2 hours. Users: completely unaware anything happened.
If you're evaluating managed Superset alternatives, this upgrade strategy is worth understanding. Other platforms, including Preset (a commercial managed Superset service), Looker, Tableau, and Power BI, handle upgrades differently.
Preset offers cloud hosting of Superset, but their upgrade strategy varies by plan. Looker and Tableau are proprietary platforms that handle upgrades automatically but with less transparency into the process. Power BI upgrades are frequent but can impact performance.
D23's approach is different because we're transparent about our process, we prioritize zero downtime, and we give you control. You can see exactly how we upgrade, understand the trade-offs, and have confidence in your analytics infrastructure.
If you're running Superset yourself, you can implement these same patterns. The Kubernetes deployment documentation describes rolling updates, a related strategy that replaces instances incrementally instead of switching traffic between two complete environments. The Docker Compose documentation shows how to manage multi-container applications. And the official Superset upgrade guide provides the migration scripts you need.
But implementing this requires significant operational expertise. You need to understand containerization, orchestration, database migration patterns, and distributed systems. You need to build monitoring and alerting. You need to test thoroughly. This is why many organizations choose managed services—the operational burden is substantial.
Beyond just maintaining availability, we optimize performance during upgrades. New versions of Superset often include query optimizations, caching improvements, and UI enhancements that make dashboards faster.
We follow the best practices outlined by CelerData for dashboard optimization, including load balancing strategies and caching configurations. During the gradual traffic shift, we monitor whether green actually delivers the performance improvements the upgrade promises.
If green is faster, we can shift traffic more aggressively because users benefit immediately. If green is slower, we investigate before proceeding.
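That decision can be expressed as a comparison of tail latencies. The sketch below is one possible formulation, not our actual rule: it compares 95th-percentile latency between the two environments, and the 5% tolerance and the "proceed" middle outcome are assumptions added so that measurement noise alone does not block a rollout.

```python
import statistics
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile of a latency sample (needs at least two data points)."""
    return statistics.quantiles(latencies_ms, n=100)[94]


def next_action(
    blue_ms: Sequence[float],
    green_ms: Sequence[float],
    tolerance: float = 0.05,
) -> str:
    """Compare p95 latency and suggest how to continue the traffic shift."""
    blue_p95, green_p95 = p95(blue_ms), p95(green_ms)
    if green_p95 <= blue_p95:
        return "accelerate"   # green is at least as fast as blue
    if green_p95 <= blue_p95 * (1 + tolerance):
        return "proceed"      # within the hypothetical noise tolerance
    return "investigate"      # green is meaningfully slower
```

Comparing percentiles rather than averages is deliberate: a regression that affects only the slowest dashboards can leave the mean almost unchanged while making the tail much worse.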
We also follow Preset's guidance on optimizing Superset dashboards, which covers query optimization and caching strategies that become especially important during upgrades when schema changes might affect query plans.
Upgrades often include security patches. We prioritize these above all else, which is why we maintain the ability to upgrade quickly and safely.
Before upgrading, we review the security advisory, understand the risk, and assess whether we need to upgrade immediately (critical vulnerability) or can schedule it normally (low-risk patch).
For critical vulnerabilities, we might execute an emergency upgrade outside normal windows. Our zero-downtime process means we can do this without impacting users.
We follow security best practices for Superset deployments as outlined by enterprise deployment experts, including containerization with Docker for security isolation and Kubernetes for secure orchestration.
Every upgrade is an opportunity to improve the process. After each upgrade, we conduct a post-mortem:
Over time, this continuous improvement makes upgrades faster and safer. Our first managed upgrade took 4 hours and required constant monitoring. Now, routine upgrades take 2 hours and are largely automated.
If you're running Superset yourself and want to implement zero-downtime upgrades, here's the priority order:
Containerize Everything: Use Docker Compose or Kubernetes to run Superset, PostgreSQL, and Redis as containers. This makes blue-green deployment possible.
Separate Stateless and Stateful Components: Ensure your Superset application layer has no local state. Move sessions to Redis and configuration to environment variables.
Set Up Load Balancing: Use Nginx or a cloud load balancer to distribute traffic across multiple Superset instances.
Implement Database Migration Testing: Before any production upgrade, run migrations against a production-like database copy.
Build Monitoring: Instrument error rates, latency, cache hit rates, and connection pool utilization. Set up alerts.
Create Runbooks: Document the exact steps for upgrade, traffic shift, and rollback. Test them regularly.
Test in Staging: Mirror production exactly, run load tests, and verify the upgrade process works before touching production.
This is a significant undertaking, which is why many organizations prefer managed services. But if you have the engineering capacity, it's absolutely doable.
Zero-downtime upgrades aren't a nice-to-have feature. They're a fundamental requirement for production analytics infrastructure. Teams depend on dashboards, queries, and API endpoints staying available. Downtime erodes trust and slows decision-making.
At D23, we've invested heavily in the infrastructure, processes, and expertise to make zero-downtime upgrades routine. We use blue-green deployments, backward-compatible database migrations, gradual traffic shifts, comprehensive monitoring, and tested rollback procedures.
The result is that you can upgrade Apache Superset confidently, knowing that your analytics infrastructure will remain available, performant, and reliable. Whether you're evaluating D23's managed Superset service, running Superset yourself, or comparing options with Looker, Tableau, or other BI platforms, understanding upgrade strategy is crucial.
Zero downtime is possible. It requires planning, testing, and operational discipline. But it's absolutely worth it.