New: AI & text-to-SQL on your own SupersetBook a demo

Data Strategy18 Apr 2026

Google Cloud Storage + Iceberg: An Open Lakehouse on GCP

Build a governed, open lakehouse on GCP using Cloud Storage and Apache Iceberg. Learn architecture, query patterns, and integration with Superset for analytics.

DTD23 Team

11 minutes read

Understanding the Open Lakehouse Architecture

A lakehouse combines the low-cost storage of data lakes with the structured query capabilities and ACID transactions of data warehouses. Unlike proprietary solutions that lock you into a single vendor's storage format, an open lakehouse uses Apache Iceberg—an open-source table format that sits on top of cloud object storage like Google Cloud Storage (GCS)—to give you portability, performance, and control.

The core insight is simple: your data should not be hostage to your analytics platform. By building on Iceberg and GCS, you can query the same data lake from Apache Spark, Trino, Presto, DuckDB, and other engines without vendor lock-in. When you layer D23's managed Apache Superset on top, you get production-grade self-serve BI and embedded analytics without the overhead of managing Superset infrastructure yourself.

This architecture matters because it separates concerns: storage and table format (Iceberg on GCS) from compute (Spark, Trino, BigQuery) from analytics UI (Superset). You can scale each independently, optimize costs, and switch tools without rebuilding your data foundation.

What Is Apache Iceberg and Why It Matters

Apache Iceberg is an open table format designed specifically for large analytic datasets. Think of it as a contract between storage and compute: it defines how data is organized, how schemas evolve, and how transactions work—without tying you to a specific query engine or cloud provider.

Traditional data lakes store raw files (Parquet, ORC) in object storage, but lack critical features:

No ACID transactions: Multiple writers can corrupt data or create inconsistent snapshots
Schema drift: Columns can be added or removed without coordination
No time travel: You cannot easily query historical versions of data
Slow metadata: Listing billions of small files to plan a query is expensive

Iceberg solves these problems through a metadata layer. Every table has a manifest file that tracks which data files belong to the table, their schemas, partition information, and statistics. When you write data, Iceberg updates the manifest atomically. When you read, Iceberg prunes files based on partition and column statistics before touching any data.

The result: your analytics queries run faster, your data pipelines are more reliable, and your data engineers can evolve schemas without breaking downstream consumers. On Google Cloud Storage, this means you get the cost advantages of object storage (pennies per terabyte per month) with the reliability and performance of a managed warehouse.

Designing Your GCS + Iceberg Lakehouse

A production lakehouse on GCP follows a logical architecture with distinct layers:

Storage Layer: Google Cloud Storage buckets hold all data files (Parquet format, typically) and Iceberg metadata. You organize buckets by environment (dev, staging, prod) and by data domain (finance, product, customer). This separation makes access control, cost tracking, and disaster recovery simpler.

Table Metadata Layer: Iceberg metadata (manifests, snapshots, schemas) lives in the same GCS bucket or a separate metadata bucket. Tools like Google Cloud's BigLake can manage this metadata and provide a unified query interface across your Iceberg tables.

Compute Layer: Multiple query engines can read the same Iceberg tables. Apache Spark is common for ETL and batch analytics; Trino for interactive queries; BigQuery for SQL analytics at scale. Each engine independently interprets the Iceberg metadata and reads only the files it needs.

Analytics & BI Layer: D23's managed Superset connects to your compute layer (Spark, Trino, or BigQuery) via SQL, letting your analysts and product teams build dashboards and explore data without writing code. Superset's native support for text-to-SQL and API-first architecture means you can embed analytics directly into your product.

This layering is crucial. Your data engineers own the storage and table format; your analytics engineers own compute configuration; your analysts own the BI layer. Changes in one layer do not cascade downward.

Setting Up Iceberg Tables on Google Cloud Storage

Creating an Iceberg table on GCS requires three components: a GCS bucket, a metadata catalog (like Hive metastore or Iceberg REST catalog), and a compute engine (Spark or Trino). Here's the pattern:

Step 1: Create a GCS Bucket

Start with a bucket following Google Cloud naming conventions (globally unique, lowercase, hyphens). Enable versioning if you want point-in-time recovery of metadata, and set a lifecycle policy to archive old Iceberg snapshots after 30 days (unless you need extended time travel).

gsutil mb -c STANDARD gs://my-lakehouse-prod
gsutil versioning set on gs://my-lakehouse-prod

Step 2: Configure a Metadata Catalog

Iceberg tables need a catalog—a service that tracks table metadata and enforces schema evolution. You have two main options:

Hive Metastore: A traditional approach, compatible with Spark and Trino. Requires running a Hive metastore service (on Dataproc or self-hosted). Simple but not serverless.
Iceberg REST Catalog: A newer approach, decoupling the catalog from any specific compute engine. Google Cloud and partners offer managed REST catalogs; Iceberg's reference implementation is open source.

For production, a managed REST catalog (like those offered by Starburst or Google) is simpler because you do not manage the catalog infrastructure. For cost-sensitive setups, a Hive metastore on Dataproc works well.

Step 3: Create an Iceberg Table

Using Spark on Dataproc, create a table with Iceberg format:

CREATE TABLE my_lakehouse.events (
  event_id STRING,
  user_id STRING,
  event_timestamp TIMESTAMP,
  event_properties MAP<STRING, STRING>
)
USING ICEBERG
PARTITIONED BY (DATE(event_timestamp))
LOCATION 'gs://my-lakehouse-prod/events';

The PARTITIONED BY clause tells Iceberg to organize data by date, which prunes files during query planning. The LOCATION points to your GCS bucket. Iceberg automatically creates the metadata directory structure.

Step 4: Insert Data

Once the table exists, write data using Spark SQL or your ETL tool:

INSERT INTO my_lakehouse.events
SELECT * FROM raw_events WHERE processed_at >= CURRENT_DATE;

Iceberg handles the write atomically. If the job fails halfway, the table is not corrupted; the next run simply retries.

Querying Iceberg Tables with Multiple Engines

The power of Iceberg emerges when you query the same table from different engines. Your Spark ETL pipeline writes to an Iceberg table; your data analysts query it from Trino via Superset; your ML team reads it from BigQuery. All see the same consistent data.

Querying from Spark

Spark has native Iceberg support. Configure your Spark cluster to use Iceberg:

spark = SparkSession.builder \
    .config("spark.sql.catalog.my_lakehouse", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_lakehouse.type", "hive") \
    .config("spark.sql.catalog.my_lakehouse.warehouse", "gs://my-lakehouse-prod") \
    .getOrCreate()
 
df = spark.sql("SELECT * FROM my_lakehouse.events WHERE event_timestamp > '2024-01-01'")

Querying from Trino

Trino's Iceberg connector lets you query Iceberg tables on GCS without Spark:

SELECT event_id, COUNT(*) as event_count
FROM iceberg.my_lakehouse.events
WHERE event_timestamp >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY event_id;

Trino reads the Iceberg metadata directly from GCS, prunes partitions, and streams results back. This is fast and cost-effective for interactive queries because you pay only for the data scanned, not for a standing warehouse.

Querying from BigQuery

BigLake is Google's answer to querying open formats on Cloud Storage. BigLake tables reference Iceberg data on GCS and give you BigQuery's SQL engine:

CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.events`
USING ICEBERG
LOCATION = 'gs://my-lakehouse-prod/events';
 
SELECT user_id, COUNT(*) as sessions
FROM `my-project.my_dataset.events`
WHERE DATE(event_timestamp) = CURRENT_DATE()
GROUP BY user_id;

BigQuery's query optimizer understands Iceberg metadata, pushes down predicates, and reads only required files. You get BigQuery's performance and SQL dialect without duplicating data.

Integrating Superset for Self-Serve Analytics

Once your Iceberg lakehouse is queryable, the next step is giving your team self-serve access without writing SQL. This is where D23's managed Superset shines.

Superset connects to your compute layer (Spark, Trino, or BigQuery) via JDBC or HTTP drivers. You define datasets—logical groupings of tables or SQL queries—and let analysts build dashboards on top.

Setting Up Superset Connections

In D23, create a database connection to your Iceberg compute layer:

For Trino: Point to your Trino coordinator node, specify the Iceberg catalog, and test the connection.
For BigQuery: Use BigQuery's Python API credentials (service account key) and authenticate.
For Spark: Use Spark Thrift Server (a long-running SQL endpoint on your Dataproc cluster).

Once connected, Superset introspects your tables and schemas, discovering all Iceberg tables automatically.

Building Datasets and Dashboards

Create a dataset for your events table:

Dataset Name: Events
Table: my_lakehouse.events
Columns: event_id, user_id, event_timestamp, event_properties
Granularity: event_timestamp (daily)

Superset infers data types and suggests aggregations. Analysts then build charts:

Time series: Events per day over the last 30 days (line chart)
Breakdown: Top 10 users by event count (bar chart)
Funnel: User progression through event types (sankey)

All queries are pushed down to your compute layer. Superset does not copy data; it translates UI interactions into SQL, executes against Iceberg, and renders results.

Embedding Analytics in Your Product

If you are a product company, D23's embedded analytics lets you embed Superset dashboards directly into your app. Your customers see real-time analytics without leaving your product. Superset's API-first architecture means you can:

Embed dashboards with row-level security (RLS) so each customer sees only their data
Trigger dashboard refreshes programmatically
Integrate with your user authentication system
Track analytics usage and engagement

Governance, Performance, and Cost Optimization

A lakehouse at scale requires governance and optimization. Here are the key considerations:

Data Governance

Google Cloud's Dataplex integrates with Iceberg tables to enforce data governance policies. You define data zones (raw, curated, analytics), assign owners, and enforce tagging:

Raw zone: Data as ingested, no transformations
Curated zone: Cleaned, deduplicated, validated data
Analytics zone: Aggregated, denormalized tables for BI

Dataplex discovers Iceberg tables, catalogs their lineage, and enforces access policies. When an analyst queries a table in Superset, Dataplex ensures they have permission and logs the access.

Performance Tuning

Iceberg performance depends on three factors: partition strategy, file size, and statistics.

Partition Strategy: Partition by date or low-cardinality columns (region, product type). Avoid over-partitioning (e.g., by minute) because it creates too many small files.
File Size: Iceberg targets 128 MB to 1 GB per file. Smaller files slow queries; larger files waste storage on partial reads. Use Iceberg's compaction to merge small files periodically.
Statistics: Iceberg tracks min/max values and null counts per column and file. When you query with filters, Iceberg prunes files that cannot match. Ensure your ETL writes statistics by using Spark's Iceberg writer.

Cost Optimization

GCS + Iceberg is cost-effective compared to proprietary warehouses, but optimization matters:

Storage: GCS Standard is $0.020 per GB/month; Archive is $0.004. Use lifecycle policies to move old snapshots to Archive after 30 days.
Compute: Pay only for query execution. Trino on Dataproc costs ~$0.10/hour for a small cluster. BigQuery costs $6.25 per TB scanned. For your 100 TB lakehouse, scanning 10% per month costs ~$625.
Egress: GCS to Compute (Dataproc, BigQuery) within the same region is free. Cross-region egress costs $0.01/GB. Keep compute and storage in the same region.

Schema Evolution and Time Travel

One of Iceberg's killer features is schema evolution without rewriting data. Add a column, drop a column, change a type—Iceberg tracks the change in metadata without touching existing files.

ALTER TABLE my_lakehouse.events ADD COLUMN event_version INT DEFAULT 1;

When you query, Iceberg automatically includes the new column. Existing files return NULL for the new column; new files include the value.

Time travel lets you query historical snapshots:

SELECT * FROM my_lakehouse.events
VERSION AS OF 12345  -- Query snapshot 12345
WHERE event_timestamp > '2024-01-01';

This is invaluable for auditing, debugging data quality issues, and recovering from accidental deletes.

Comparing to Traditional Approaches

Why build a lakehouse instead of using BigQuery directly or a proprietary data warehouse?

BigQuery Native: BigQuery is excellent for SQL analytics, but stores data in Google's proprietary format. If you want to query the same data from Spark (for ML), Trino (for real-time dashboards), or DuckDB (for local analysis), you must export and duplicate. BigQuery's storage costs are also higher (~$0.06/GB/month for active storage, $0.013 for long-term).

Proprietary Data Warehouses (Snowflake, Redshift, Teradata): These are powerful but vendor-locked. If you want to switch tools or query from multiple engines, you must replicate data. Iceberg + GCS avoids this lock-in.

Traditional Data Lakes (Parquet/ORC on S3): These are cheap but lack schema governance, ACID transactions, and efficient querying. Iceberg layers governance on top without the cost penalty.

Iceberg + GCS: Combines low storage costs, open format portability, ACID transactions, and multi-engine query support. You pay for storage (cheap) and compute (you choose). No vendor lock-in.

Real-World Example: E-Commerce Analytics Lakehouse

Consider an e-commerce company with 10 TB of events (orders, clicks, page views) ingested daily via Kafka. They want to:

Run nightly batch analytics (Spark) to compute daily metrics
Let analysts query events in real-time (Trino + Superset)
Feed ML pipelines without duplicating data (Spark + DuckDB)
Maintain audit trails for compliance

Architecture:

Ingestion: Kafka → Spark Streaming → Iceberg on GCS (raw zone)
Transformation: Spark ETL → Iceberg (curated zone) with aggregations
Analytics: Trino → Superset dashboards for real-time exploration
ML: Spark + DuckDB read from Iceberg directly
Governance: Dataplex catalogs tables, enforces access policies

Costs:

Storage: 10 TB × $0.020/month = $200/month
Compute (Dataproc): 2 clusters × 4 nodes × $0.10/hour × 730 hours = ~$5,800/month
Trino (interactive): On-demand Dataproc cluster = ~$1,000/month
Total: ~$7,000/month for a multi-engine, governed, audit-ready analytics platform

A comparable Snowflake setup (10 TB on-demand) would cost ~$15,000+/month. A Looker license for 20 analysts adds another $10,000+/month.

Getting Started: A Practical Roadmap

If you are considering an open lakehouse, here is a pragmatic path:

Phase 1 (Weeks 1–2): Set up a proof of concept

Create a GCS bucket
Deploy a Dataproc cluster with Spark and Iceberg
Create one Iceberg table with sample data
Query from Spark and Trino

Phase 2 (Weeks 3–4): Integrate with analytics

Connect D23's Superset to your Trino cluster
Build a few exploratory dashboards
Measure query latency and cost

Phase 3 (Months 2–3): Operationalize

Set up automated ETL pipelines (Spark jobs on Dataproc)
Implement governance policies (Dataplex)
Migrate key datasets to Iceberg
Train analysts on Superset

Phase 4 (Months 4+): Optimize and expand

Tune partition strategies based on query patterns
Implement compaction to manage file sizes
Add ML pipelines (Spark MLlib, Vertex AI)
Expand embedded analytics if you are a product company

Key Takeaways

Building an open lakehouse on GCS and Iceberg is not just a technical choice—it is a strategic one. You get:

Portability: Query the same data from Spark, Trino, BigQuery, or DuckDB
Cost Efficiency: Pay for storage ($0.020/GB/month) and compute separately
Governance: Iceberg's metadata layer enables schema evolution, time travel, and audit trails
Flexibility: Add analytics tools (like D23's Superset) without re-architecting
No Vendor Lock-In: Your data is yours; your table format is open source

Whether you are a startup building your first analytics platform or an enterprise consolidating disparate data warehouses, Iceberg + GCS is a modern, scalable foundation. Combined with D23's managed Superset for self-serve BI and embedded analytics, you have a complete, production-grade analytics stack without the overhead of managing infrastructure or dealing with vendor lock-in.

The lakehouse paradigm is no longer a buzzword—it is the pragmatic choice for data-driven organizations that want cost efficiency, flexibility, and control.