Learn how to build multi-agent data discovery systems spanning multiple warehouses. Architecture, patterns, and implementation guide for analytics leaders.
Your organization doesn't live in a single data warehouse anymore. You've got Snowflake handling transactional data, BigQuery running analytics workloads, and a data lake in S3 collecting raw events. Your finance team needs to cross-reference metrics from three different sources. Your product team wants to explore customer behavior without waiting for a data analyst to write a custom query. Meanwhile, your platform is growing fast enough that you're embedding analytics directly into your product—and your customers want to discover and query data across all your systems without knowing SQL.
This is the orchestrating data discovery agents problem. It's not about moving all your data into one place (that's often impractical and expensive). It's about creating a coordinated layer of AI agents that can intelligently navigate, understand, and query across your distributed data infrastructure without requiring a PhD in your schema.
Traditional BI tools like Looker and Tableau assume a relatively static, well-documented data model sitting in one place. They work well when your data architecture is simple. But at scale—especially when you're embedding analytics into your product or managing complex data for portfolio companies—you need something more flexible. You need agents that can discover what data exists, understand relationships across systems, route queries intelligently, and handle the operational complexity of multi-warehouse environments.
This is where orchestrated data discovery agents become essential. Unlike a single monolithic BI tool, a multi-agent system can be built on Apache Superset with AI capabilities, API-first architecture, and the flexibility to integrate with your actual data infrastructure. Let's explore how to design, build, and operate these systems.
Before diving into architecture, let's define what we're actually building.
A data discovery agent is an AI-powered system component that can autonomously explore, catalog, and query data sources. Unlike a static data dictionary, an agent actively understands your schema, can reason about relationships between tables, and can respond to natural language requests. Think of it as a data analyst who knows every table in your warehouse and can instantly answer "what's our monthly churn rate?" without you having to specify exactly which tables to join.
Orchestration means coordinating multiple agents across different data sources, ensuring they work together coherently, and managing the complexity of routing queries to the right system. In a multi-warehouse environment, orchestration is the difference between having five disconnected agents and having a unified data discovery experience.
The core challenge is this: when a user asks a question, which agent should handle it? How do you ensure consistency when the same metric might exist in multiple warehouses with different definitions? How do you manage authentication, rate limiting, and cost when agents are querying across multiple cloud providers?
According to recent enterprise architecture research, orchestrating agents and data requires a blueprint architecture with centralized data registry for discovery across multi-modal data. This isn't theoretical—it's the pattern that's emerging as organizations scale from single-warehouse BI to distributed analytics.
A production-grade multi-warehouse data discovery system has several layers:
Every agent needs to know what data exists. In a single warehouse, this is straightforward—you query the information schema. Across multiple warehouses, you need a centralized registry.
This registry should contain:
The registry isn't static. It needs to be continuously updated as schemas change, new tables are added, and data quality metrics shift. This is where data orchestration platforms coordinate flows across data warehouses and multi-system environments—they maintain the ground truth about what data exists and where.
In practice, you might implement this as a combination of:
When a user or application requests data, the dispatch layer decides which agent(s) should handle the query.
This isn't a simple lookup. It requires reasoning about:
A dispatch router might look like:
User Query → Parse Intent → Identify Required Datasets → Check Availability & Permissions → Select Optimal Warehouse(s) → Route to Agent(s) → Aggregate Results
This is similar to NVIDIA's multi-agent warehouse AI command layer architecture, which orchestrates specialized agents across warehouse systems for real-time intelligence. Each agent is specialized for a particular warehouse or data domain, and the command layer routes work intelligently.
Once routed to the right warehouse, the agent needs to translate the user's intent into SQL (or the native query language).
This is where text-to-SQL capabilities shine. An AI model that's been fine-tuned on your specific schema can understand natural language like "show me our top 10 customers by revenue this quarter" and generate the correct SQL:
SELECT
customer_id,
customer_name,
SUM(order_value) as total_revenue
FROM orders
WHERE order_date >= DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY customer_id, customer_name
ORDER BY total_revenue DESC
LIMIT 10The key is that the model understands your schema, your naming conventions, and your business logic. A generic LLM will fail. You need domain-specific fine-tuning.
Query execution in a multi-warehouse environment requires:
When you're querying across multiple warehouses, you might get results from several agents. You need to:
Caching is particularly important in multi-warehouse setups. A query that costs $5 to run in BigQuery might cost $50 if you re-run it for every user. Intelligent caching—based on query patterns, user roles, and data freshness requirements—can dramatically reduce costs.
Now let's talk about how to actually build this.
D23 is a managed Apache Superset platform designed specifically for this kind of distributed analytics. Rather than building a custom orchestration layer from scratch, you can leverage Superset's native capabilities:
When you add AI capabilities (text-to-SQL, natural language query generation), you get the foundation for intelligent data discovery. Pair this with MCP (Model Context Protocol) integration—which allows you to extend Superset's capabilities with custom tools and data sources—and you have a platform ready for multi-warehouse orchestration.
Start by building your data registry. This is typically a separate service that maintains metadata about all your warehouses and datasets.
A minimal registry API might expose:
GET /warehouses
[
{
"id": "snowflake-prod",
"type": "snowflake",
"region": "us-east-1",
"cost_per_query_estimate": 0.05,
"freshness_sla_minutes": 15
},
{
"id": "bigquery-prod",
"type": "bigquery",
"region": "us",
"cost_per_query_estimate": 0.10,
"freshness_sla_minutes": 5
}
]GET /datasets
[
{
"id": "customers",
"warehouse_id": "snowflake-prod",
"schema": "public",
"table_name": "customers",
"row_count": 50000,
"columns": [
{
"name": "customer_id",
"type": "integer",
"description": "Unique customer identifier"
},
{
"name": "created_at",
"type": "timestamp",
"description": "Account creation date"
}
],
"last_updated": "2025-01-15T10:30:00Z"
}
]GET /metrics
[
{
"id": "monthly_revenue",
"definition": "SUM(order_value) WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)",
"warehouse_ids": ["snowflake-prod", "bigquery-prod"],
"primary_warehouse": "snowflake-prod",
"description": "Total revenue for current month"
}
]This registry becomes the single source of truth for your agents. When an agent needs to answer a question, it queries this registry to understand what data is available.
The router sits between user intent and warehouse execution. It needs to be intelligent about routing decisions.
Here's a simplified router logic:
In practice, this might be implemented as:
def route_query(user_query, user_id, freshness_requirement):
# Parse intent
intent = parse_natural_language(user_query)
required_datasets = identify_datasets(intent)
# Find candidate warehouses
candidates = registry.find_warehouses_with_datasets(required_datasets)
# Filter by permissions
accessible = [w for w in candidates if user_has_access(user_id, w)]
# Filter by freshness
fresh_enough = [w for w in accessible if w.freshness_sla >= freshness_requirement]
# Pick the cheapest
selected = min(fresh_enough, key=lambda w: w.cost_estimate)
return selectedThis is a simplified version. A production system would also consider query complexity, warehouse load, and historical performance.
The magic of data discovery agents is their ability to translate natural language to SQL. But generic LLMs are terrible at this—they hallucinate table names, make up column aliases, and produce syntactically invalid SQL.
You need to fine-tune a model on your specific schema. Here's the approach:
A prompt for text-to-SQL might look like:
You are a SQL expert. Given the following schema and a natural language question,
generate the correct SQL query.
Schema:
- customers (customer_id INT, name VARCHAR, created_at TIMESTAMP, status VARCHAR)
- orders (order_id INT, customer_id INT, order_value DECIMAL, order_date DATE)
- order_items (order_id INT, product_id INT, quantity INT, unit_price DECIMAL)
Question: What's our top 5 customers by total spending this year?
SQL:
SELECT
c.customer_id,
c.name,
SUM(o.order_value) as total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE YEAR(o.order_date) = YEAR(CURRENT_DATE)
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC
LIMIT 5
This approach is far more reliable than using a generic LLM. According to Meta's engineering work on evolving data warehouses with multi-agent systems, domain-specific training is critical for production reliability.
One warehouse (the "hub") is the source of truth for most queries. Other warehouses (the "spokes") are used for specific purposes: real-time data, specialized analytics, compliance.
The router preferentially routes to the hub unless the query specifically requires data from a spoke.
Pros: Simple, predictable, easy to manage consistency Cons: Hub becomes a bottleneck, doesn't leverage distributed compute
Each warehouse is equally important. The router decides which warehouse to query based on data location and cost.
The challenge: ensuring consistency when the same metric might be calculated differently in different warehouses.
Pros: Leverages distributed compute, scales better Cons: Complex to manage consistency, requires careful governance
Data is organized by freshness and cost. Real-time data lives in a fast, expensive warehouse (like Firestore or a real-time OLAP engine). Slightly stale data lives in a cheaper warehouse (like S3 with Athena). Historical data lives in cold storage.
The router selects the appropriate tier based on freshness requirements.
Pros: Cost-optimized, meets diverse freshness requirements Cons: Requires careful data movement orchestration
For most organizations, a hybrid approach works best. Start with a hub-and-spoke model, then gradually move toward federation as your data architecture matures.
When the same metric exists in multiple warehouses, they often have different definitions. Your Snowflake warehouse might calculate monthly revenue as of the last day of the month. Your BigQuery warehouse might calculate it as of today.
Solve this with:
When you're querying across multiple warehouses, the slowest warehouse determines your overall latency. A query that takes 2 seconds in Snowflake and 15 seconds in BigQuery will feel slow to your users.
Solutions:
Multi-warehouse queries can get expensive fast. A single exploratory query that hits three warehouses could cost $50 or more.
Manage costs with:
If you're embedding analytics into your product, multi-warehouse orchestration becomes even more important. Your customers might have data spread across multiple systems, and they'll expect your embedded analytics to "just work."
D23's API-first architecture is purpose-built for this. You can:
For example, a venture capital firm using D23 could:
All without requiring each company to move their data or standardize their schema.
Once you've built a multi-warehouse orchestration system, you need to operate it. This means monitoring:
Build dashboards for each of these. Make them visible to your team. When something goes wrong, you want to know immediately.
As you mature your orchestration system, you might move toward a Retrieval-Augmented Generation (RAG) approach. Instead of routing a query to a single warehouse, you:
Snowflake's multi-agent RAG system with Gen2 warehouses and Cortex is a good reference implementation. The key insight is that RAG allows agents to be more intelligent about what data to query, rather than relying on a simple router.
For example, if a user asks "how does our churn rate compare to our competitors," a RAG system could:
This is significantly more powerful than a simple router, but also more complex to implement.
The shift from single-warehouse BI tools to orchestrated multi-agent data discovery is fundamental. It reflects how modern data infrastructure actually works: distributed, heterogeneous, and constantly evolving.
Traditional tools like Looker and Tableau assume a stable, well-documented data model. They work great when you have one warehouse. But when you're managing multiple data sources, embedding analytics into your product, or supporting portfolio companies with different data architectures, you need something more flexible.
D23's managed Apache Superset platform combined with intelligent agent orchestration gives you that flexibility. You get the power of open-source BI (no vendor lock-in, full customization) without the operational overhead.
Start simple: build a data registry, implement a basic router, add text-to-SQL for one warehouse. As you mature, add more warehouses, implement caching, move toward RAG. The architecture scales with your needs.
The future of analytics isn't about moving all your data into one place. It's about building intelligent systems that can navigate your actual data architecture, understand your business logic, and answer questions without requiring a PhD in your schema. That's what orchestrated data discovery agents deliver.
For organizations scaling analytics across multiple warehouses—whether you're embedding BI into your product, consolidating portfolio companies, or managing complex data infrastructure—this is the pattern to understand and implement.