Build a unified data lake across PE portfolio companies. Architecture, consolidation strategies, and analytics for multi-acquisition environments.
Private equity firms acquire companies at scale. Within five years, a single fund might own fifteen, twenty, or thirty portfolio companies—each with its own ERP, CRM, billing system, and data infrastructure. By the time you need a real-time view of cash flow, unit economics, or gross margin across the portfolio, you're staring at a fragmented mess: spreadsheets, API calls to incompatible systems, and analytics teams drowning in ETL work.
The solution isn't better dashboards or more consultants. It's a unified data lake—one canonical architecture that ingests, standardizes, and surfaces data from every acquisition without rebuilding your infrastructure every time you close a deal.
This article walks through the reference architecture for PE-wide data consolidation. We'll cover why a data lake beats traditional data warehouses for portfolio companies, how to design it for speed and scale, and how to layer analytics and AI on top so your teams actually use it.
Traditional data warehouses—Snowflake, Redshift, BigQuery—work well for a single, stable organization. You define schemas upfront, load clean data, and query it. PE environments break that assumption.
Each acquired company arrives with:
A data lakehouse—which combines the flexibility of a data lake with the structure of a warehouse—solves this. You ingest raw data from any source, apply schema-on-read logic, and let different teams consume it at different levels of maturity.
As detailed in Supercharging Private Equity Portfolio Returns, PE firms using lakehouse architecture gain visibility and standardization across portfolio companies while maintaining the flexibility to onboard new acquisitions quickly.
Let's make this concrete. A PE fund owns:
Your CFO needs:
Without a unified data layer, you're building custom ETL for each company, managing incompatible schemas, and waiting for data engineers to manually reconcile numbers every month. With a portfolio data lake, you ingest everything into a central repository, apply transformations once, and let analytics teams self-serve.
Here's the architecture that works at scale:
Data arrives from multiple sources, each with different characteristics. Your ingestion layer must handle:
Best practice: Use an open-source orchestration tool like Apache Airflow or Dagster to manage connectors. Building a Modern Data Lake Using Open Source Tools details how to construct data lakes with tools like Delta Lake for ACID transactions and schema enforcement.
For each source, define:
Example pipeline for Company A (SaaS with Kafka):
Kafka topic (user_events) → Spark streaming job → Delta Lake (raw/company_a/events)
Example pipeline for Company B (SAP with nightly batch):
SAP SFTP export → Airflow task → Cloud storage → Delta Lake (raw/company_b/sales_orders)
The key: Land everything raw. Don't transform on ingestion. You'll discover schema changes, data quality issues, and new use cases later—keep optionality.
Store raw data in a data lake format that supports ACID transactions, schema evolution, and time-travel queries. Delta Lake (open source) or Apache Iceberg are the standard choices.
Organize your storage by company and domain:
s3://portfolio-data-lake/
raw/
company_a/
events/
customers/
transactions/
company_b/
sales_orders/
inventory/
company_c/
stripe_charges/
shopify_orders/
transformed/
company_a/
fct_revenue/
dim_customer/
company_b/
fct_cost_of_goods/
dim_supplier/
consolidated/
fct_portfolio_revenue/
dim_company/
Each layer has a purpose:
As explained in Data Lake Explained: Architecture and Examples, proper data lake governance requires clear separation of raw and processed data, with metadata management to track lineage and quality.
Use dbt (Data Build Tool) to define transformations in SQL. Each portfolio company gets its own dbt project, but they all write to a shared consolidated schema.
Example dbt models:
Company A's revenue model:
select
date_trunc('month', order_date) as month,
customer_id,
sum(amount) as revenue,
'Company A' as company_name
from {{ ref('stg_company_a_orders') }}
group by 1, 2, 3Company B's revenue model (different schema):
select
date_trunc('month', invoice_date) as month,
sold_to_customer_id,
sum(invoice_amount) as revenue,
'Company B' as company_name
from {{ ref('stg_company_b_sales_orders') }}
group by 1, 2, 3Consolidated revenue model:
select * from {{ ref('company_a_monthly_revenue') }}
union all
select * from {{ ref('company_b_monthly_revenue') }}
union all
select * from {{ ref('company_c_monthly_revenue') }}The power: Each company's data engineers own their transformations. The consolidated layer is maintained by a central data team. New acquisitions plug in without touching existing logic.
As noted in Data Analytics in Private Equity: Driving Value Creation, transformation with dbt and real-time dashboards enable PE teams to track KPIs and value creation across portfolio companies.
With data from twenty companies, you need metadata to answer: "Where did this number come from? Who can access it? Is it fresh?"
Implement:
What Well-Designed Data Lake Architecture Looks Like outlines best practices for data governance with layers for storage, processing, analytics, and consumption.
Once your data lake is built, the analytics layer is straightforward. You need:
This is where D23 fits into the PE data stack. D23 is a managed Apache Superset platform that layers on top of your data lake, providing:
Instead of building dashboards in Looker or Tableau—which require licensing per portfolio company and custom development for each new acquisition—you deploy once and onboard new companies in days.
Example workflow:
No Looker licensing negotiation. No "we need custom training on Tableau." Just data and dashboards.
Let's walk through a concrete scenario.
The portfolio:
The problem:
The solution:
Month 1-2: Ingestion layer
Set up Airflow to ingest data from:
All data lands in S3 as Parquet files, organized by company and domain. Raw data layer is live in 6 weeks.
Month 3-4: Transformation layer
Data engineering team (3 people) writes dbt models:
Each company's data engineers contribute their own staging and intermediate models. Central team owns consolidated layer. dbt tests ensure data quality. Transformations run nightly; critical metrics update hourly.
Month 5-6: Analytics layer
Deploy D23 (Apache Superset) on top of the data lake. Create dashboards:
Deploy Superset's text-to-SQL feature, so CFO can ask: "Which portfolio companies have declining ARR?" and get instant answers without SQL knowledge.
Month 7+: Ongoing operations
Results:
When building your PE portfolio data lake, make these choices early:
Options:
Recommendation: AWS + Delta Lake if you want maximum flexibility and open-source control. BigQuery if you want managed simplicity and fast SQL queries out of the box.
Why: Delta Lake gives you ACID transactions (data consistency), schema evolution (new columns without rewriting tables), and time-travel queries (audit trails). BigQuery gives you SQL performance and built-in ML features, but locks you into Google.
Options:
Recommendation: dbt for 90% of PE use cases. It's SQL, easy to version control, and lets data engineers own their transformations. Use Spark for heavy computations (cohort analysis, machine learning feature engineering).
Options:
Recommendation for PE: D23 (managed Superset). Why? You get one platform across all portfolio companies. Text-to-SQL means CFOs and operators can self-serve. API-first architecture lets you embed analytics into investor portals. Pricing scales with data volume, not user count, so adding acquisitions doesn't explode costs. As detailed on D23's homepage, it's built specifically for teams that need production-grade analytics without platform overhead.
Pattern 1: Company isolation
Pattern 2: Role-based access
Pattern 3: Hybrid
Recommendation: Start with Pattern 1 (company isolation). Add benchmarking dashboards later as a separate layer. This prevents data leaks and keeps teams focused.
Define upfront:
Don't aim for real-time everywhere. It's expensive and often unnecessary. A CFO doesn't need real-time P&L; monthly is fine. An operator needs real-time order volume; daily is not.
Problem: You spend 6 months building a "perfect" CDC pipeline for every source, trying to capture every schema change in real-time.
Solution: Start simple. Daily batch ingestion from APIs. Weekly files from legacy systems. You can optimize later. Get data flowing first.
Problem: You spend months defining a "golden schema" for customer data, trying to force all portfolio companies into it. Then you acquire a company with a completely different business model, and your schema breaks.
Solution: Use schema-on-read. Land raw data as-is. Transform at the point of consumption. Each company's data engineers own their staging layer. Consolidation happens at the semantic layer (Superset), not the storage layer.
Problem: Finance asks for a dashboard. Operators ask for a dashboard. HR asks for a dashboard. You end up maintaining 50 dashboards, and half are stale.
Solution: Build a semantic layer (dbt models) that defines metrics once. Let people query it with SQL or text-to-SQL (Superset). You maintain the data; they self-serve the analysis.
Problem: You build a beautiful dashboard, but the numbers don't match what the company reports internally. Teams lose trust and go back to spreadsheets.
Solution: Implement dbt tests from day one. Test for:
Monitor data quality continuously. Alert on anomalies.
Problem: You ingest data, then immediately apply heavy transformations, losing the raw data. When you discover a data quality issue or need to reprocess, you're stuck.
Solution: Keep raw data forever (or at least 7 years for PE audit trails). Transformations are applied on top, never destructively. Use immutable storage (Delta Lake, Iceberg) so you can time-travel if needed.
Private equity has unique compliance requirements:
Your data lake must support:
Implementation: Use Delta Lake's audit logs. Store dbt run results and model versions. Implement reconciliation tests in dbt.
If any portfolio company handles customer personal data:
Implementation: Tag sensitive columns in your data catalog. Implement row-level security (RLS) so only authorized teams see personal data. Use encryption at rest and in transit. For right-to-deletion, store personal data separately from analytical data.
Ensure Company A's data is never visible to Company B:
Implementation: Use Superset's row-level security feature. Tag dashboards and datasets by company. Implement fine-grained access control in your data lake (S3 policies, BigQuery IAM).
After 6 months, measure:
Building a PE portfolio data lake is a 6–12 month project, but it pays dividends immediately. You'll close faster, understand your portfolio better, and scale without rebuilding infrastructure every time you acquire a company.
The architecture is straightforward:
The result: One architecture that scales to twenty acquisitions, real-time visibility into your portfolio, and analytics that actually get used.
As explored in Data lakehouses: Fueling innovation with machine learning, modern data lakehouse architectures provide the flexibility and performance needed for complex analytics environments—exactly what PE firms need to manage diverse portfolio companies.
The time to build is now. Every month you wait is a month of manual reporting, missed insights, and slow decision-making. Start with one ingestion pipeline. Add transformations. Deploy dashboards. Measure impact. Scale.
Your next acquisition will be grateful.