New: AI & text-to-SQL on your own SupersetBook a demo

Private Equity18 Apr 2026

The PE Portfolio Data Lake: One Architecture, Twenty Acquisitions

Build a unified data lake across PE portfolio companies. Architecture, consolidation strategies, and analytics for multi-acquisition environments.

DTD23 Team

14 minutes read

The PE Portfolio Data Lake: One Architecture, Twenty Acquisitions

Private equity firms acquire companies at scale. Within five years, a single fund might own fifteen, twenty, or thirty portfolio companies—each with its own ERP, CRM, billing system, and data infrastructure. By the time you need a real-time view of cash flow, unit economics, or gross margin across the portfolio, you're staring at a fragmented mess: spreadsheets, API calls to incompatible systems, and analytics teams drowning in ETL work.

The solution isn't better dashboards or more consultants. It's a unified data lake—one canonical architecture that ingests, standardizes, and surfaces data from every acquisition without rebuilding your infrastructure every time you close a deal.

This article walks through the reference architecture for PE-wide data consolidation. We'll cover why a data lake beats traditional data warehouses for portfolio companies, how to design it for speed and scale, and how to layer analytics and AI on top so your teams actually use it.

Why PE Firms Need a Different Data Architecture

Traditional data warehouses—Snowflake, Redshift, BigQuery—work well for a single, stable organization. You define schemas upfront, load clean data, and query it. PE environments break that assumption.

Each acquired company arrives with:

Different data models. One uses SAP, another NetSuite, another custom-built systems. Their definitions of "revenue," "customer," and "cost" don't align.
Different maturity levels. Some have APIs; others export CSV files weekly. Some have real-time transaction logs; others have batch processes that run at midnight.
Different compliance requirements. Healthcare acquisitions need HIPAA isolation. Financial services need SOX controls. Real estate needs different audit trails.
Different growth trajectories. You acquire a startup with 100GB of data, then a mature company with 50TB. Your architecture must handle 1000x growth without redesign.

A data lakehouse—which combines the flexibility of a data lake with the structure of a warehouse—solves this. You ingest raw data from any source, apply schema-on-read logic, and let different teams consume it at different levels of maturity.

As detailed in Supercharging Private Equity Portfolio Returns, PE firms using lakehouse architecture gain visibility and standardization across portfolio companies while maintaining the flexibility to onboard new acquisitions quickly.

The Core Problem: Data Fragmentation at Scale

Let's make this concrete. A PE fund owns:

Company A: SaaS platform with PostgreSQL, Kafka event streams, and a modern data stack (dbt, Airflow).
Company B: Manufacturing firm with SAP, legacy EDI systems, and nightly batch files to FTP servers.
Company C: Newly acquired startup with Stripe, Shopify, Mixpanel, and no data infrastructure—just SQL access to a Postgres replica.
Company D: Mature e-commerce business with a custom data warehouse, Tableau, and 15 years of accumulated schema drift.

Your CFO needs:

Monthly P&L consolidation across all four companies by the 5th of each month.
Weekly cash flow forecasts updated every Monday.
Real-time KPI dashboards for operational due diligence.
Cohort analysis comparing customer acquisition costs and retention across portfolio companies.

Without a unified data layer, you're building custom ETL for each company, managing incompatible schemas, and waiting for data engineers to manually reconcile numbers every month. With a portfolio data lake, you ingest everything into a central repository, apply transformations once, and let analytics teams self-serve.

Designing the PE Portfolio Data Lake: A Reference Architecture

Here's the architecture that works at scale:

Layer 1: Ingestion (The Front Door)

Data arrives from multiple sources, each with different characteristics. Your ingestion layer must handle:

APIs (Salesforce, NetSuite, Stripe, Shopify, HubSpot).
Databases (PostgreSQL, MySQL, SQL Server, SAP).
Files (CSV, JSON, Parquet, Excel sheets).
Event streams (Kafka, AWS Kinesis, Segment).
Data warehouses (Snowflake, Redshift, BigQuery from existing portfolio companies).

Best practice: Use an open-source orchestration tool like Apache Airflow or Dagster to manage connectors. Building a Modern Data Lake Using Open Source Tools details how to construct data lakes with tools like Delta Lake for ACID transactions and schema enforcement.

For each source, define:

Frequency: Real-time (Kafka), hourly, daily, or on-demand.
Extraction method: Full refresh, incremental (CDC), or event-based.
Landing zone: Raw data stored as-is in cloud object storage (S3, GCS, Azure Blob).

Example pipeline for Company A (SaaS with Kafka):

Kafka topic (user_events) → Spark streaming job → Delta Lake (raw/company_a/events)

Example pipeline for Company B (SAP with nightly batch):

SAP SFTP export → Airflow task → Cloud storage → Delta Lake (raw/company_b/sales_orders)

The key: Land everything raw. Don't transform on ingestion. You'll discover schema changes, data quality issues, and new use cases later—keep optionality.

Layer 2: Storage (The Foundation)

Store raw data in a data lake format that supports ACID transactions, schema evolution, and time-travel queries. Delta Lake (open source) or Apache Iceberg are the standard choices.

Organize your storage by company and domain:

s3://portfolio-data-lake/
  raw/
    company_a/
      events/
      customers/
      transactions/
    company_b/
      sales_orders/
      inventory/
    company_c/
      stripe_charges/
      shopify_orders/
  transformed/
    company_a/
      fct_revenue/
      dim_customer/
    company_b/
      fct_cost_of_goods/
      dim_supplier/
    consolidated/
      fct_portfolio_revenue/
      dim_company/

Each layer has a purpose:

Raw: Immutable, append-only, schema-as-landed. Retention: 7 years (regulatory requirement for PE).
Transformed: Cleaned, deduplicated, conformed schemas. Built with dbt or similar. Retention: 3–5 years.
Consolidated: Cross-company metrics, standardized KPIs. Retention: 10 years (audit trail).

As explained in Data Lake Explained: Architecture and Examples, proper data lake governance requires clear separation of raw and processed data, with metadata management to track lineage and quality.

Layer 3: Transformation (The Logic)

Use dbt (Data Build Tool) to define transformations in SQL. Each portfolio company gets its own dbt project, but they all write to a shared consolidated schema.

Example dbt models:

Company A's revenue model:

select
  date_trunc('month', order_date) as month,
  customer_id,
  sum(amount) as revenue,
  'Company A' as company_name
from {{ ref('stg_company_a_orders') }}
group by 1, 2, 3

Company B's revenue model (different schema):

select
  date_trunc('month', invoice_date) as month,
  sold_to_customer_id,
  sum(invoice_amount) as revenue,
  'Company B' as company_name
from {{ ref('stg_company_b_sales_orders') }}
group by 1, 2, 3

Consolidated revenue model:

select * from {{ ref('company_a_monthly_revenue') }}
union all
select * from {{ ref('company_b_monthly_revenue') }}
union all
select * from {{ ref('company_c_monthly_revenue') }}

The power: Each company's data engineers own their transformations. The consolidated layer is maintained by a central data team. New acquisitions plug in without touching existing logic.

As noted in Data Analytics in Private Equity: Driving Value Creation, transformation with dbt and real-time dashboards enable PE teams to track KPIs and value creation across portfolio companies.

Layer 4: Metadata and Governance (The Guardrails)

With data from twenty companies, you need metadata to answer: "Where did this number come from? Who can access it? Is it fresh?"

Implement:

Data catalog: Tool like Collibra or open-source alternatives (DataHub) that tracks tables, columns, lineage, and ownership.
Access control: Role-based access (RBAC). Company A's employees see only Company A's data. Finance sees consolidated data. Compliance sees everything.
Data quality: dbt tests (freshness, uniqueness, referential integrity) and custom data quality checks (anomaly detection).
Change data capture (CDC): Track who changed what, when. Essential for audit trails.

What Well-Designed Data Lake Architecture Looks Like outlines best practices for data governance with layers for storage, processing, analytics, and consumption.

Layering Analytics and BI on Top

Once your data lake is built, the analytics layer is straightforward. You need:

A semantic layer that defines business metrics (ARR, CAC, Gross Margin) once, and lets every tool query them consistently.
Self-serve BI so analysts don't wait for data engineers.
AI-powered analytics for text-to-SQL and anomaly detection.

This is where D23 fits into the PE data stack. D23 is a managed Apache Superset platform that layers on top of your data lake, providing:

Embedded dashboards for portfolio companies (each company sees its own data, plus benchmarks against peers).
Text-to-SQL powered by LLMs, so non-technical users can ask "What's our churn rate this month?" and get instant answers.
API-first architecture for embedding analytics into internal tools and investor portals.
MCP server integration for AI agents that autonomously generate reports and alerts.

Instead of building dashboards in Looker or Tableau—which require licensing per portfolio company and custom development for each new acquisition—you deploy once and onboard new companies in days.

Example workflow:

Company E is acquired. Data lands in your lake within a week.
dbt models are written to conform Company E's schema to your consolidated layer.
Superset dashboards are auto-generated from your semantic layer (no manual dashboard building).
Company E's team logs into a self-serve portal and explores their data.
Finance team sees Company E in consolidated P&L dashboards automatically.

No Looker licensing negotiation. No "we need custom training on Tableau." Just data and dashboards.

Real-World Example: A $2B PE Fund

Let's walk through a concrete scenario.

The portfolio:

18 portfolio companies across software, healthcare, manufacturing, and real estate.
$2B in aggregate revenue.
Ranging from $10M to $500M revenue each.
Acquired over 8 years (oldest systems are legacy; newest are cloud-native).

The problem:

Monthly close takes 15 days (waiting for manual data pulls and reconciliation).
Operational due diligence teams spend 40% of their time on data requests.
No real-time visibility into cash flow, customer churn, or unit economics.
Each portfolio company uses different BI tools (Tableau, Looker, Power BI, Qlik)—no standardization.

The solution:

Month 1-2: Ingestion layer

Set up Airflow to ingest data from:

12 ERP systems (SAP, NetSuite, Oracle, custom) via API or database connectors.
8 CRM systems (Salesforce, HubSpot, custom) via API.
6 billing systems (Stripe, Zuora, custom) via API.
Event streams from 4 SaaS companies (Kafka).

All data lands in S3 as Parquet files, organized by company and domain. Raw data layer is live in 6 weeks.

Month 3-4: Transformation layer

Data engineering team (3 people) writes dbt models:

Staging models (clean, deduplicate, type-cast each company's raw data).
Intermediate models (customer, product, revenue dimensions).
Fact tables (daily revenue, monthly churn, quarterly cash flow).
Consolidated models (portfolio-wide P&L, cohort analysis, benchmarking).

Each company's data engineers contribute their own staging and intermediate models. Central team owns consolidated layer. dbt tests ensure data quality. Transformations run nightly; critical metrics update hourly.

Month 5-6: Analytics layer

Deploy D23 (Apache Superset) on top of the data lake. Create dashboards:

Portfolio overview: Aggregate revenue, EBITDA, cash flow, headcount across all companies.
Company detail: Each company sees its own P&L, customer metrics, operational KPIs, plus benchmarks against peer companies.
Investor reporting: LP dashboards with fund performance, portfolio company valuations, and value creation metrics.
Operational due diligence: Real-time alerts on churn spikes, cash flow misses, or unit economics deterioration.

Deploy Superset's text-to-SQL feature, so CFO can ask: "Which portfolio companies have declining ARR?" and get instant answers without SQL knowledge.

Month 7+: Ongoing operations

New acquisition: Data lands in lake. dbt models written (2–3 weeks). Dashboards auto-populated. Company onboarded.
New metric: Finance wants to track CAC payback period. Data engineer writes dbt model. Metric appears in Superset automatically. No dashboard redesign.
AI-powered alerts: MCP server monitors portfolio KPIs. If cash flow forecast drops >10%, it sends alerts to CFO and triggers investigation workflow.

Results:

Monthly close: 5 days (3x faster).
Data freshness: Real-time for transactional data, hourly for analytics.
Onboarding time: New portfolio company goes from data to dashboards in 3 weeks (vs. 3 months with traditional BI).
Cost: $500K/year for infrastructure + D23 managed Superset (vs. $2M+/year for Looker or Tableau across 18 companies).
Self-serve adoption: 80% of portfolio company teams use dashboards without support (vs. 20% with legacy BI).

Key Design Decisions

When building your PE portfolio data lake, make these choices early:

1. Cloud Provider and Storage Format

Options:

AWS (S3 + Spark) with Delta Lake or Apache Iceberg.
Google Cloud (GCS + BigQuery) with native support for schema-on-read.
Azure (ADLS + Synapse) with managed Delta Lake.

Recommendation: AWS + Delta Lake if you want maximum flexibility and open-source control. BigQuery if you want managed simplicity and fast SQL queries out of the box.

Why: Delta Lake gives you ACID transactions (data consistency), schema evolution (new columns without rewriting tables), and time-travel queries (audit trails). BigQuery gives you SQL performance and built-in ML features, but locks you into Google.

2. Transformation Tool

Options:

dbt (SQL-based, version control, testing, lineage).
Apache Spark (flexible, handles large-scale transformations).
Dataflow/Beam (streaming transformations).

Recommendation: dbt for 90% of PE use cases. It's SQL, easy to version control, and lets data engineers own their transformations. Use Spark for heavy computations (cohort analysis, machine learning feature engineering).

3. Analytics Platform

Options:

Looker (powerful, expensive, requires licensing per company).
Tableau (beautiful dashboards, steep learning curve, licensing per user).
Power BI (Microsoft ecosystem, cheaper, less flexible).
Metabase (open source, simpler, fewer features).
Apache Superset via D23 (open source, API-first, text-to-SQL, designed for embedded analytics).

Recommendation for PE: D23 (managed Superset). Why? You get one platform across all portfolio companies. Text-to-SQL means CFOs and operators can self-serve. API-first architecture lets you embed analytics into investor portals. Pricing scales with data volume, not user count, so adding acquisitions doesn't explode costs. As detailed on D23's homepage, it's built specifically for teams that need production-grade analytics without platform overhead.

4. Access Control Strategy

Pattern 1: Company isolation

Each portfolio company sees only its own data.
Finance and PE team see consolidated data.
Compliance and audit see everything.

Pattern 2: Role-based access

Finance teams see P&L and cash flow.
Operators see operational KPIs.
Sales teams see customer metrics.

Pattern 3: Hybrid

Company A's sales team sees Company A's data + anonymized benchmarks from Company B and C.
Encourages healthy competition.

Recommendation: Start with Pattern 1 (company isolation). Add benchmarking dashboards later as a separate layer. This prevents data leaks and keeps teams focused.

5. Freshness SLA

Define upfront:

Real-time: Transactional data (orders, payments). Update within 5 minutes.
Near-real-time: Operational metrics (daily revenue, active users). Update hourly.
Daily: Analytical metrics (monthly churn, cohort analysis). Update nightly.
Weekly: Strategic metrics (fund performance, value creation). Update weekly.

Don't aim for real-time everywhere. It's expensive and often unnecessary. A CFO doesn't need real-time P&L; monthly is fine. An operator needs real-time order volume; daily is not.

Common Pitfalls and How to Avoid Them

Pitfall 1: Over-Engineering the Ingestion Layer

Problem: You spend 6 months building a "perfect" CDC pipeline for every source, trying to capture every schema change in real-time.

Solution: Start simple. Daily batch ingestion from APIs. Weekly files from legacy systems. You can optimize later. Get data flowing first.

Pitfall 2: Trying to Standardize Everything Upfront

Problem: You spend months defining a "golden schema" for customer data, trying to force all portfolio companies into it. Then you acquire a company with a completely different business model, and your schema breaks.

Solution: Use schema-on-read. Land raw data as-is. Transform at the point of consumption. Each company's data engineers own their staging layer. Consolidation happens at the semantic layer (Superset), not the storage layer.

Pitfall 3: Building Dashboards for Every Question

Problem: Finance asks for a dashboard. Operators ask for a dashboard. HR asks for a dashboard. You end up maintaining 50 dashboards, and half are stale.

Solution: Build a semantic layer (dbt models) that defines metrics once. Let people query it with SQL or text-to-SQL (Superset). You maintain the data; they self-serve the analysis.

Pitfall 4: Ignoring Data Quality

Problem: You build a beautiful dashboard, but the numbers don't match what the company reports internally. Teams lose trust and go back to spreadsheets.

Solution: Implement dbt tests from day one. Test for:

Freshness (is the data recent?).
Uniqueness (no duplicate customer IDs).
Referential integrity (every order has a valid customer).
Business logic (revenue is positive, churn rate is 0–100%).

Monitor data quality continuously. Alert on anomalies.

Pitfall 5: Treating the Data Lake as a Data Warehouse

Problem: You ingest data, then immediately apply heavy transformations, losing the raw data. When you discover a data quality issue or need to reprocess, you're stuck.

Solution: Keep raw data forever (or at least 7 years for PE audit trails). Transformations are applied on top, never destructively. Use immutable storage (Delta Lake, Iceberg) so you can time-travel if needed.

Governance and Compliance for PE

Private equity has unique compliance requirements:

Financial Reporting (SOX, GAAP)

Your data lake must support:

Audit trails: Every change to financial data is logged with who, what, when.
Version control: You can reconstruct historical P&L at any point in time.
Reconciliation: Automated checks that consolidated P&L matches source systems.

Implementation: Use Delta Lake's audit logs. Store dbt run results and model versions. Implement reconciliation tests in dbt.

If any portfolio company handles customer personal data:

Data residency: Some jurisdictions require data to stay in-country.
Right to deletion: You must be able to purge a customer's data on request.
Data minimization: Only collect what you need.

Implementation: Tag sensitive columns in your data catalog. Implement row-level security (RLS) so only authorized teams see personal data. Use encryption at rest and in transit. For right-to-deletion, store personal data separately from analytical data.

Portfolio Company Isolation

Ensure Company A's data is never visible to Company B:

Network isolation: Separate VPCs or security groups.
Database-level access control: Row-level security by company ID.
Audit logging: Track all data access.

Implementation: Use Superset's row-level security feature. Tag dashboards and datasets by company. Implement fine-grained access control in your data lake (S3 policies, BigQuery IAM).

Measuring Success

After 6 months, measure:

Time-to-Insight

Before: How long from "I need to know X" to "I have the answer?" (typically 2–5 days)
After: Target <1 hour for most questions, <1 day for complex analysis.

Data Freshness

Before: Monthly P&L closes on day 15.
After: Target real-time transaction data, daily analytics, weekly strategic metrics.

Self-Serve Adoption

Before: 10% of team uses BI tools; 90% ask data engineers for reports.
After: Target 70%+ self-serve adoption.

Cost per Portfolio Company

Before: $100K–$200K/year per company (Looker licensing, Tableau, custom BI development).
After: Target $20K–$40K/year per company (data lake infrastructure + D23 managed Superset).

Onboarding Speed

Before: 3–6 months to integrate a new acquisition's data and dashboards.
After: Target 3–4 weeks.

Moving Forward

Building a PE portfolio data lake is a 6–12 month project, but it pays dividends immediately. You'll close faster, understand your portfolio better, and scale without rebuilding infrastructure every time you acquire a company.

The architecture is straightforward:

Ingest everything raw from every source.
Store immutably in a data lake (Delta Lake or Iceberg).
Transform once with dbt, owned by company data teams.
Consolidate at the semantic layer (dbt models).
Analyze with a modern BI platform (Superset via D23) that supports text-to-SQL and embedded analytics.
Govern with metadata, access control, and data quality tests.

The result: One architecture that scales to twenty acquisitions, real-time visibility into your portfolio, and analytics that actually get used.

As explored in Data lakehouses: Fueling innovation with machine learning, modern data lakehouse architectures provide the flexibility and performance needed for complex analytics environments—exactly what PE firms need to manage diverse portfolio companies.

The time to build is now. Every month you wait is a month of manual reporting, missed insights, and slow decision-making. Start with one ingestion pipeline. Add transformations. Deploy dashboards. Measure impact. Scale.

Your next acquisition will be grateful.

The PE Portfolio Data Lake: One Architecture, Twenty Acquisitions

The PE Portfolio Data Lake: One Architecture, Twenty Acquisitions

Why PE Firms Need a Different Data Architecture

The Core Problem: Data Fragmentation at Scale

Designing the PE Portfolio Data Lake: A Reference Architecture

Layer 1: Ingestion (The Front Door)

Layer 2: Storage (The Foundation)

Layer 3: Transformation (The Logic)

Layer 4: Metadata and Governance (The Guardrails)

Layering Analytics and BI on Top

Real-World Example: A $2B PE Fund

Key Design Decisions

1. Cloud Provider and Storage Format

2. Transformation Tool

3. Analytics Platform

4. Access Control Strategy

5. Freshness SLA

Common Pitfalls and How to Avoid Them

Pitfall 1: Over-Engineering the Ingestion Layer

Pitfall 2: Trying to Standardize Everything Upfront

Pitfall 3: Building Dashboards for Every Question

Pitfall 4: Ignoring Data Quality

Pitfall 5: Treating the Data Lake as a Data Warehouse

Governance and Compliance for PE

Financial Reporting (SOX, GAAP)

Data Privacy (GDPR, CCPA, HIPAA)

Portfolio Company Isolation

Measuring Success

Time-to-Insight

Data Freshness

Self-Serve Adoption

Cost per Portfolio Company

Onboarding Speed

Moving Forward