Define realistic embedded analytics SLAs for availability, latency, and freshness. Learn what to promise and how to deliver without overcommitting.
When you embed analytics into your product, you're making a promise to your customers. That promise isn't just about pretty dashboards or clever queries—it's about uptime, speed, and data freshness. The moment you put analytics in the critical path of your customer's workflow, you've entered SLA territory.
The problem: most teams building embedded analytics don't think about SLAs until something breaks. By then, you're scrambling to explain why a dashboard went dark, why queries take 45 seconds, or why yesterday's numbers don't match today's. This article walks you through what embedded analytics SLAs actually are, why they matter, and how to set them realistically without crippling your infrastructure.
An SLA—Service Level Agreement—is a contract between you and your customer about what they can expect from your service. In the context of embedded analytics, that means commitments around three core dimensions: availability (is the dashboard up?), latency (how fast are queries?), and freshness (how recent is the data?).
Embedded analytics differs fundamentally from standalone BI tools. When a customer opens Tableau or Looker, they know they're using a BI tool. They tolerate occasional downtime, refresh delays, and slow queries. But when analytics are embedded directly into your product (say, a revenue dashboard in your SaaS platform or a performance report in your mobile app), customers don't think of it as "BI." They think of it as part of your core product. They expect it to work like the rest of your application.
That's the core tension: embedded analytics sit at the intersection of operational systems and analytical systems. Operational systems are built for speed and reliability. Analytical systems are built for flexibility and complex queries. Your SLA needs to reflect that reality.
When you're using managed Apache Superset or building on open-source BI, you have direct control over infrastructure, caching, and query optimization. That's powerful—it means you can make deliberate trade-offs. But it also means the SLA is on you.
Availability is the simplest pillar to understand but the hardest to get right. It answers the question: "Is the analytics dashboard accessible right now?"
Availability is usually expressed as a percentage over a time period. "99.9% availability" means the service can be down for about 43 minutes per month. "99.99%" means about 4 minutes per month. For embedded analytics, most teams aim for 99.5% to 99.9%.
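To make those percentages concrete, here is a minimal sketch that converts an availability target into a downtime budget. The function name and the 30-day month are assumptions made for illustration.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period of `days` days."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.5, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% availability -> {budget:.1f} minutes per 30 days")
```

Running the numbers for each candidate target before you commit makes it easier to see whether your on-call process can realistically stay inside the budget.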
Here's what matters:
The infrastructure stack. If your analytics platform depends on a single database, a single application server, and a single network path, every one of them has to be up for the dashboard to be up. If each of those three components has 99% availability, the combined availability is roughly 99% × 99% × 99% ≈ 97%. Three two-nines components add up to less than two nines. This is why managed platforms often outperform self-hosted setups: they distribute load, implement redundancy, and fail over automatically.
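The arithmetic behind chained and redundant components can be sketched in a few lines. This assumes component failures are independent, which real systems only approximate; the function names are illustrative.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up."""
    return prod(components)


def redundant_availability(replica: float, replicas: int) -> float:
    """Availability when any one of `replicas` identical copies is enough."""
    return 1 - (1 - replica) ** replicas


# Database, application server, network path at 99% each.
chain = serial_availability([0.99, 0.99, 0.99])
print(f"three 99% components in series: {chain:.4%}")

# Two independent 99% replicas behind automatic failover.
pair = redundant_availability(0.99, 2)
print(f"two 99% replicas in parallel:   {pair:.4%}")
```

Under the independence assumption, adding a second replica of a weak component improves the total far more than polishing a single instance.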
Scheduled maintenance. Most SLAs exclude scheduled maintenance windows. If you say "99.9% availability excluding scheduled maintenance," you're buying yourself maintenance windows. A typical SLA might allow 4 hours per month of scheduled downtime. Be explicit about when maintenance happens. Sunday 2 AM UTC might work for your US-based customers but devastate your Asia-Pacific users.
What counts as "down." Does a single customer seeing a 500 error count as downtime for the entire service? Or only if 10% of customers can't access dashboards? Most SLAs define a threshold—typically, the service is considered down if more than 5% of requests fail or if a specific region becomes unreachable. Be specific in your SLA.
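One way to turn a "down" definition into a number is to judge each one-minute window separately. The sketch below assumes you already collect per-minute request and failure counts; the 5% threshold comes from the example above, and the function name is hypothetical.

```python
def availability_from_windows(
    windows: list[tuple[int, int]],
    failure_threshold: float = 0.05,
) -> float:
    """Fraction of windows counted as "up".

    Each window is (total_requests, failed_requests). A window is "down"
    when more than `failure_threshold` of its requests failed. Windows
    with no traffic are counted as up (an assumption; pick your own rule).
    """
    if not windows:
        return 1.0
    down = sum(
        1
        for total, failed in windows
        if total > 0 and failed / total > failure_threshold
    )
    return 1 - down / len(windows)


# 100 one-minute windows: 98 clean, one at 6% failures, one at exactly 5%.
sample = [(100, 0)] * 98 + [(100, 6), (100, 5)]
print(f"availability: {availability_from_windows(sample):.2%}")
```

Note that the window at exactly 5% is still "up" here because the rule says "more than 5%". Edge cases like this are worth spelling out in the SLA text itself.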
Graceful degradation. In practice, outages are rarely a clean binary of up or down. More often, you have partial degradation: some queries run fast, others time out. Some dashboards load, others don't. Your SLA should account for this. You might commit to "95% of queries complete within 30 seconds" rather than "all queries complete within 30 seconds." This is more realistic and more defensible.
For embedded analytics specifically, availability often matters more than in standalone BI because it's part of your core product experience. A Looker dashboard going down for an hour is annoying. Your embedded revenue dashboard going down for an hour might cost you customer trust.
Latency is how long it takes for a query to return results. In embedded analytics, latency directly impacts user experience. A dashboard that takes 15 seconds to load feels broken, even if it technically works.
Latency SLAs are usually expressed as percentiles. "p50 latency < 2 seconds" means 50% of queries finish in under 2 seconds. "p95 latency < 10 seconds" means 95% of queries finish in under 10 seconds. "p99 latency < 30 seconds" means 99% of queries finish in under 30 seconds; only the slowest 1% may take longer.
Why percentiles? Because averages lie. If 99 queries take 1 second and 1 query takes 100 seconds, the average is 1.99 seconds. But your customer sees that 100-second query and thinks your system is broken. Percentiles force you to think about the tail.
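The example above is easy to reproduce. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring tools may interpolate and report slightly different values.

```python
import math
from statistics import mean


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 99 queries at 1 second, 1 query at 100 seconds.
latencies = [1.0] * 99 + [100.0]

print(f"mean: {mean(latencies):.2f}s")
for p in (50, 95, 99, 100):
    print(f"p{p}: {percentile(latencies, p):.1f}s")
```

With only 100 samples, a single outlier sits above p99 and shows up only in the maximum (p100). That is a reminder to track the worst case alongside your SLA percentile, and to compute percentiles over enough traffic for the tail to be visible.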
Latency depends on several factors:
Query complexity. A simple "count of events today" query might run in 100ms. A query joining five tables, filtering by 20 conditions, and aggregating across billions of rows might take 30 seconds. Your SLA needs to account for this range. You might commit to different latencies for different dashboard types: "standard dashboards < 5 seconds, custom reports < 30 seconds."
Data volume. As your customers' data grows, queries slow down. A query that runs in 2 seconds on 10 million rows might take 20 seconds on 1 billion rows. This is why it's critical to understand your customers' data volumes when setting SLAs. If you promise "all queries < 5 seconds" but your customer has 50 billion events, you're setting yourself up for failure.
Caching strategy. This is where embedded analytics shine. Unlike ad-hoc BI tools where every query is unique, embedded dashboards often show the same visualizations repeatedly. You can pre-compute results, cache them, and serve cached results instantly. A well-designed caching layer can reduce p95 latency from 20 seconds to 2 seconds. But caching introduces staleness—which brings us to freshness.
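A time-to-live (TTL) cache makes the latency-versus-staleness trade explicit: the TTL you choose is the maximum staleness the cache can add. This is a minimal in-process sketch with hypothetical names, not how any particular BI platform implements caching. A production setup would more likely use a shared cache such as Redis.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache query results for at most `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # fast path: cached result, possibly stale by up to the TTL
        value = compute()  # slow path: run the real query
        self._store[key] = (now, value)
        return value


def run_revenue_query() -> int:
    """Stand-in for an expensive warehouse query."""
    return 42


cache = TTLCache(ttl_seconds=300)
print(cache.get_or_compute("revenue_by_day", run_revenue_query))
print(cache.get_or_compute("revenue_by_day", run_revenue_query))  # served from cache
```

With a 300-second TTL, the freshness you can promise for this dashboard is your pipeline lag plus up to five minutes.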
Concurrency. When multiple customers query simultaneously, database load increases and latency degrades. Your SLA should state the load at which a latency target applies: normal load (e.g., "p95 < 5 seconds at 100 concurrent users") and, if you commit to it, peak load (e.g., "p95 < 10 seconds at 1000 concurrent users").
For embedded analytics, latency SLAs are critical. Users expect embedded experiences to feel snappy. If your embedded dashboard takes 10 seconds to load, users will perceive your entire product as slow, even if the rest of your application is fast.
Freshness answers: "How old is the data in this dashboard?" It's measured as the time between when an event occurs and when it appears in analytics.
Freshness is often the most contentious SLA dimension because it trades off directly against latency and cost. Real-time data (less than 1 second of lag) is expensive. It requires streaming infrastructure, complex event processing, and careful orchestration. Near-real-time (less than 5 minutes of lag) is more reasonable. Daily batches are cheap but stale.
Freshness depends on your data pipeline architecture:
Batch ETL. Data is extracted, transformed, and loaded on a schedule, typically daily or hourly. Freshness is determined by how often you run the job. If you run daily at midnight UTC, data is freshest right after the load and up to 24 hours old just before the next run. This is simple and cheap but stale. Best practices for reliable pipelines emphasize that batch freshness is predictable: you know exactly when data will refresh.
Streaming ingestion. Events flow into your data warehouse in real-time or near-real-time. Freshness is seconds or minutes. This is more expensive (streaming infrastructure, schema management, exactly-once semantics) but much fresher. Building SLAs for real-time dashboards with AI-ETL provides detailed guidance on committing to real-time freshness.
Hybrid approaches. Many teams use a combination: real-time streaming for critical metrics (revenue, user activity) and daily batches for less critical data (customer demographics, historical trends). Your SLA can reflect this: "core metrics updated every 5 minutes, supporting data updated daily."
Freshness also depends on your analytics platform. If you're using managed Apache Superset, you control when data refreshes. You can implement smart caching that serves fresh data for recent time periods and cached data for historical periods. You can refresh different datasets on different schedules.
Here's the key insight: freshness, latency, and cost form a triangle. Pick two, and the third suffers. Real-time + fast = expensive. Real-time + cheap = slow. Fast + cheap = stale. Your SLA should reflect this trade-off explicitly.
Now that you understand the three pillars, how do you actually set targets? The answer depends on your customers, your infrastructure, and your business model.
Different customers have different requirements. A venture capital firm tracking portfolio performance doesn't need real-time data—daily or weekly updates are fine. A SaaS platform showing customers their usage metrics needs data fresh within the hour. A trading platform needs sub-second latency.
Before setting SLAs, ask your customers:
You'll likely get a range of answers. That's okay. You can tier your SLAs: "standard tier: 99.5% availability, p95 latency < 10 seconds, daily data refresh. Premium tier: 99.9% availability, p95 latency < 5 seconds, hourly data refresh."
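Tiers are easier to enforce when they live in code or config rather than only in a contract. The sketch below encodes the two example tiers above and compares measured values against them. The class and field names are hypothetical, and "daily" and "hourly" are expressed as a maximum data age in hours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTier:
    name: str
    availability_pct: float    # minimum monthly availability
    p95_latency_s: float       # maximum p95 query latency
    max_data_age_hours: float  # maximum age of the newest data


TIERS = {
    "standard": SlaTier("standard", 99.5, 10.0, 24.0),
    "premium": SlaTier("premium", 99.9, 5.0, 1.0),
}


def evaluate(tier: SlaTier, availability_pct: float, p95_s: float, data_age_hours: float) -> dict[str, bool]:
    """Return a pass/fail flag per SLA dimension."""
    return {
        "availability": availability_pct >= tier.availability_pct,
        "latency": p95_s <= tier.p95_latency_s,
        "freshness": data_age_hours <= tier.max_data_age_hours,
    }


# Same measurements, judged against both tiers.
measured = dict(availability_pct=99.7, p95_s=6.2, data_age_hours=3.0)
for tier in TIERS.values():
    print(tier.name, evaluate(tier, **measured))
```

Evaluating the same measurements against every tier shows quickly which tier you could honestly sell today.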
Look at what Looker, Tableau, Power BI, and other competitors promise. Most traditional BI platforms don't publish detailed SLAs—they're vague about latency and freshness. That's because they can't control these dimensions; they depend on customer infrastructure.
Managed platforms like Preset (the commercial Superset offering) typically commit to 99.5% availability. Most cloud BI platforms commit to 99.9% for premium tiers.
For latency, Looker and Tableau don't typically commit to specific numbers—they say "it depends on your data." That's honest but unhelpful. Managed platforms are more specific because they control the infrastructure.
For freshness, traditional BI platforms don't commit to anything. They assume you'll sync your data warehouse on your own schedule. Managed platforms can be more specific because they often manage the data pipeline.
Your SLA is only as good as your infrastructure. Before committing to anything, map out your actual capabilities:
Availability. What's your current uptime? If you're running on a single database server, your availability is probably 99% at best. If you're running on managed cloud infrastructure with multi-region failover, you might hit 99.99%. Be honest about what you can actually deliver.
Latency. Run load tests. How fast do queries actually run at peak load? If p95 is currently 8 seconds, don't promise 5 seconds. Promise 8 seconds, then work on optimization.
Freshness. What's your current data pipeline? If you're running daily batch jobs, you can't promise hourly freshness without major changes. Understand the cost of each improvement: moving from daily to hourly might require 3x infrastructure investment. Moving from hourly to real-time might require 10x.
A useful framework is the SLA ladder: start conservative, then improve. This is especially important for new products.
Year 1: 99% availability, p95 latency < 15 seconds, daily data refresh. You're learning, your infrastructure is simple, your customer base is small.
Year 2: 99.5% availability, p95 latency < 10 seconds, 6-hourly data refresh. You've optimized your database, implemented caching, added redundancy.
Year 3: 99.9% availability, p95 latency < 5 seconds, hourly data refresh. You've invested in multi-region infrastructure, sophisticated query optimization, streaming data pipelines.
This ladder gives you room to grow without overpromising. It also gives you a roadmap for infrastructure investment.
Setting an SLA is one thing. Measuring it is another. You need visibility into whether you're meeting your commitments.
Availability monitoring is straightforward: ping your dashboard endpoint every 30 seconds from multiple geographic locations. If it responds with a 200 status, it's up. If it doesn't, it's down.
But this is too simplistic for embedded analytics. You need to monitor:
A dashboard might return a 200 status but show stale data or failed queries. That's not really "up."
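A health check that looks past the status code might be sketched as follows. It assumes your synthetic probe can report how many charts failed to render and how old the newest data is; the `ProbeResult` fields and the 60-minute limit are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status_code: int         # HTTP status of the dashboard request
    failed_charts: int       # charts whose queries errored or timed out
    data_age_minutes: float  # age of the newest row shown


def is_healthy(probe: ProbeResult, max_data_age_minutes: float = 60.0) -> bool:
    """Count the dashboard as "up" only if it loads, renders, and is fresh."""
    return (
        probe.status_code == 200
        and probe.failed_charts == 0
        and probe.data_age_minutes <= max_data_age_minutes
    )


probes = [
    ProbeResult(200, 0, 12.0),   # healthy
    ProbeResult(200, 2, 12.0),   # page loads, but two charts failed
    ProbeResult(200, 0, 480.0),  # page loads, but data is eight hours old
    ProbeResult(503, 0, 0.0),    # hard failure
]
for probe in probes:
    print(probe, "->", "up" if is_healthy(probe) else "down")
```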
For latency, you need to track actual query performance in production. Instrument your query layer to record:
Aggregate this data to compute percentiles. "p95 latency is currently 8 seconds, up from 5 seconds yesterday" tells you something is wrong.
Visualize latency over time. Create alerts: "if p95 latency exceeds 10 seconds for 5 minutes, alert on-call engineer."
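The alert rule quoted above can be sketched as a check over per-minute p95 values. It assumes you already compute one p95 value per minute; the function name is illustrative, and most monitoring systems offer an equivalent "for 5 minutes" condition out of the box.

```python
def should_alert(
    p95_by_minute: list[float],
    threshold_s: float = 10.0,
    sustained_minutes: int = 5,
) -> bool:
    """True when the last `sustained_minutes` p95 values all exceed the threshold."""
    if len(p95_by_minute) < sustained_minutes:
        return False  # not enough history to call it sustained
    recent = p95_by_minute[-sustained_minutes:]
    return all(value > threshold_s for value in recent)


spike = [4.0, 5.0, 12.0, 4.5, 5.0, 4.8, 5.1]          # one bad minute
sustained = [4.0, 5.0, 11.0, 12.5, 13.0, 11.2, 14.0]  # five bad minutes in a row

print("spike:", should_alert(spike))
print("sustained:", should_alert(sustained))
```

Requiring the breach to be sustained keeps a single slow minute from paging someone, at the cost of reacting a few minutes later.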
Freshness monitoring is often overlooked. You need to track:
Implement data freshness checks: "verify that today's data is present by 9 AM UTC." If the check fails, alert.
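A deadline-style freshness check might look like the sketch below. It assumes "today's data is present" means the newest loaded row carries today's UTC date, and that you can query that timestamp from your warehouse. Both datetimes are expected to be timezone-aware.

```python
from datetime import datetime, time, timezone


def check_daily_freshness(
    latest_loaded_at: datetime,
    now: datetime,
    deadline_utc: time = time(9, 0),
) -> str:
    """Return "ok", "pending" (before the deadline), or "breach"."""
    latest = latest_loaded_at.astimezone(timezone.utc)
    current = now.astimezone(timezone.utc)
    if latest.date() >= current.date():
        return "ok"
    return "breach" if current.time() >= deadline_utc else "pending"


now = datetime(2024, 1, 2, 9, 30, tzinfo=timezone.utc)
yesterday_row = datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)
today_row = datetime(2024, 1, 2, 8, 0, tzinfo=timezone.utc)

print(check_daily_freshness(yesterday_row, now))
print(check_daily_freshness(today_row, now))
```

Run it on a schedule shortly after the deadline and alert on "breach".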
Make your SLA metrics visible to customers. Many platforms publish a status page showing current availability, latency, and freshness. This builds trust and reduces support burden—customers can see that you're tracking these metrics seriously.
Include historical data: "availability last 30 days: 99.92%." This shows you're consistently meeting commitments.
Eventually, you'll breach an SLA. A query times out. A dashboard goes down. Data doesn't refresh on schedule. What then?
Have a clear process:
Many SLAs include credits for breaches. "If we miss 99.5% availability in a month, we'll credit 10% of your monthly fee."
Credits incentivize you to take SLAs seriously. They also compensate customers for the impact of your failure.
Be specific about credit calculation:
Credits are usually capped at 100% of monthly fees (you can't owe more than the customer paid).
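A credit schedule is simple enough to write down as code, which also removes ambiguity at invoice time. In the sketch below, only the first step (below 99.5% earns a 10% credit) and the 100% cap come from the text above; the deeper steps are hypothetical placeholders.

```python
# (availability threshold in %, credit in % of monthly fee), applied when
# measured availability falls below the threshold.
CREDIT_SCHEDULE = [
    (99.5, 10),  # from the example above
    (99.0, 25),  # hypothetical
    (95.0, 50),  # hypothetical
]
MAX_CREDIT_PCT = 100  # credits never exceed the monthly fee


def credit_pct(measured_availability_pct: float) -> int:
    """Largest credit whose threshold was missed, capped at MAX_CREDIT_PCT."""
    earned = [
        pct
        for threshold, pct in CREDIT_SCHEDULE
        if measured_availability_pct < threshold
    ]
    return min(max(earned, default=0), MAX_CREDIT_PCT)


def credit_amount(measured_availability_pct: float, monthly_fee: float) -> float:
    return monthly_fee * credit_pct(measured_availability_pct) / 100


for measured in (99.7, 99.2, 98.0, 90.0):
    print(f"{measured}% -> {credit_pct(measured)}% credit")
```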
When you breach an SLA, communicate clearly:
This transparency builds trust. Customers understand that systems fail; they respect teams that handle failures well.
Different use cases have different SLA requirements. Here are some examples:
Executives check dashboards daily or weekly. They don't need real-time data. They do need high availability and numbers they can trust, because they act on what they see.
Recommended SLA:
Executive dashboards are often high-stakes. A wrong number in a board meeting is expensive. Prioritize accuracy and availability over freshness.
Operational dashboards show current system state: server status, customer activity, revenue, etc. Teams rely on them to make decisions. They need fresher data.
Recommended SLA:
Operational dashboards are often in the critical path of incident response. If your operations dashboard goes down during an outage, you've made the situation worse.
Customers see these dashboards regularly. They expect them to work like any other part of your product. They need good availability and reasonable latency.
Recommended SLA:
Customer-facing analytics are part of your product experience. A slow or broken dashboard reflects poorly on your entire product.
Users run ad-hoc queries, exploring data. Query complexity varies wildly. Availability and latency expectations are lower.
Recommended SLA:
Self-service analytics are less mission-critical. Users understand that complex queries take time. You have more flexibility here.
When you serve multiple customers, one customer's heavy query shouldn't impact another customer's performance. This requires:
Your SLA might be: "p95 latency < 5 seconds for standard queries, subject to per-customer concurrency limits."
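A per-customer concurrency limit can be sketched as a small in-process limiter. This is an illustration with hypothetical names: it works within a single application process, and a multi-server deployment would need a shared counter or database-level resource controls instead.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class TenantBusy(Exception):
    """Raised when a tenant already has its maximum number of queries running."""


class TenantLimiter:
    def __init__(self, max_concurrent: int):
        self._max = max_concurrent
        self._lock = threading.Lock()
        self._active: dict[str, int] = defaultdict(int)

    @contextmanager
    def slot(self, tenant_id: str):
        """Hold one query slot for `tenant_id` for the duration of the block."""
        with self._lock:
            if self._active[tenant_id] >= self._max:
                raise TenantBusy(tenant_id)
            self._active[tenant_id] += 1
        try:
            yield
        finally:
            with self._lock:
                self._active[tenant_id] -= 1


limiter = TenantLimiter(max_concurrent=1)
with limiter.slot("acme"):
    with limiter.slot("globex"):  # a different tenant is unaffected
        print("acme and globex are both running a query")
    try:
        with limiter.slot("acme"):  # acme is already at its limit
            pass
    except TenantBusy as exc:
        print(f"rejected extra query for {exc}")
```

Rejecting (or queueing) the extra query keeps one tenant's burst from consuming the capacity that other tenants' latency targets depend on.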
Some analytics workloads are seasonal. A retail company's analytics are heavy during holiday season. A school's analytics are heavy during registration periods.
Your SLA might account for this: "99.5% availability during normal periods, 99% during peak periods (defined as Q4 for retail, August-September for education)."
Beyond availability, latency, and freshness, consider data quality. Data SLAs for reliable pipelines emphasize that data completeness and accuracy are as important as freshness.
You might commit to: "99.9% of data is complete and accurate within 24 hours of collection."
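A completeness commitment like this one needs a concrete measurement. One simple approach, sketched below, compares the number of rows that arrived and passed validation against the number the source system reports. Where the expected count comes from is an assumption you would need to settle for your own pipeline.

```python
def completeness_pct(expected_rows: int, loaded_valid_rows: int) -> float:
    """Share of expected rows that were loaded and passed validation."""
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    # Cap at 100%: duplicates should be caught by validation, not counted as extra.
    return 100 * min(loaded_valid_rows, expected_rows) / expected_rows


def meets_quality_sla(expected_rows: int, loaded_valid_rows: int, target_pct: float = 99.9) -> bool:
    return completeness_pct(expected_rows, loaded_valid_rows) >= target_pct


print(completeness_pct(1_000_000, 999_500), meets_quality_sla(1_000_000, 999_500))
print(completeness_pct(1_000_000, 998_000), meets_quality_sla(1_000_000, 998_000))
```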
This requires data validation, anomaly detection, and data quality monitoring.
Your analytics depend on upstream systems: data warehouses, APIs, ETL tools. If your data warehouse is down, your analytics are down. But you didn't cause the outage.
Most SLAs exclude dependency failures: "99.9% availability, excluding outages of third-party services like Snowflake or BigQuery."
But you should still monitor and communicate dependency issues. If your SLA is being breached because of a dependency, customers need to know.
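Exclusions change the number you report, so it helps to compute both views: what customers experienced and what the SLA counts. The sketch below assumes each outage is logged with a duration and a cause label. The cause names are hypothetical, and excluded minutes are removed from both the downtime and the measurement period, which is one common convention.

```python
EXCLUDED_CAUSES = frozenset({"scheduled_maintenance", "third_party"})


def availability_report(
    total_minutes: int,
    outages: list[tuple[int, str]],
) -> dict[str, float]:
    """Return experienced and SLA-counted availability, both in percent.

    `outages` is a list of (duration_minutes, cause).
    """
    all_down = sum(minutes for minutes, _ in outages)
    excluded = sum(minutes for minutes, cause in outages if cause in EXCLUDED_CAUSES)
    counted = all_down - excluded
    eligible = total_minutes - excluded
    return {
        "experienced_pct": 100 * (total_minutes - all_down) / total_minutes,
        "sla_pct": 100 * (eligible - counted) / eligible,
    }


month = 30 * 24 * 60
outages = [
    (240, "scheduled_maintenance"),
    (60, "third_party"),
    (30, "internal"),
]
report = availability_report(month, outages)
print(f"experienced: {report['experienced_pct']:.2f}%")
print(f"SLA-counted: {report['sla_pct']:.2f}%")
```

Publishing both figures keeps the exclusions honest: the SLA number shows whether you met the contract, and the experienced number shows what customers actually lived through.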
Your SLA is only valuable if customers know about it. Include it in:
Be clear about what's included and excluded:
This clarity prevents misunderstandings and sets realistic expectations.
SLAs aren't just legal documents. They're commitments that shape how your team works.
Internalize SLAs in your engineering culture:
Different customers have different SLA requirements. Understanding how SLAs depend on trustworthy analytics and BI emphasizes that SLAs should align with business impact.
A customer paying $10k/month might require 99.9% availability. A customer paying $1k/month might accept 99%. Your SLA structure should reflect this.
Embedded analytics SLAs aren't just operational commitments—they're product strategy. They communicate what you value: availability, speed, or freshness. They shape your infrastructure decisions. They influence your pricing.
Start conservative. Measure relentlessly. Improve systematically. Over time, you'll build embedded analytics that customers trust.
When you're ready to implement embedded analytics with strong SLA foundations, D23 provides managed Apache Superset with the infrastructure and expertise to meet ambitious SLA targets. Whether you're building executive dashboards, operational analytics, or customer-facing analytics, we help you set realistic SLAs and deliver on them consistently.
The key is being honest about what you can deliver, measuring whether you're delivering it, and continuously improving. That's how you build embedded analytics that customers rely on.