Build a governed, open lakehouse on GCP using Cloud Storage and Apache Iceberg. Learn architecture, query patterns, and integration with Superset for analytics.
A lakehouse combines the low-cost storage of data lakes with the structured query capabilities and ACID transactions of data warehouses. Unlike proprietary solutions that lock you into a single vendor's storage format, an open lakehouse uses Apache Iceberg—an open-source table format that sits on top of cloud object storage like Google Cloud Storage (GCS)—to give you portability, performance, and control.
The core insight is simple: your data should not be hostage to your analytics platform. By building on Iceberg and GCS, you can query the same data lake from Apache Spark, Trino, Presto, DuckDB, and other engines without vendor lock-in. When you layer D23's managed Apache Superset on top, you get production-grade self-serve BI and embedded analytics without the overhead of managing Superset infrastructure yourself.
This architecture matters because it separates concerns: storage and table format (Iceberg on GCS) from compute (Spark, Trino, BigQuery) from analytics UI (Superset). You can scale each independently, optimize costs, and switch tools without rebuilding your data foundation.
Apache Iceberg is an open table format designed specifically for large analytic datasets. Think of it as a contract between storage and compute: it defines how data is organized, how schemas evolve, and how transactions work—without tying you to a specific query engine or cloud provider.
Traditional data lakes store raw files (Parquet, ORC) in object storage, but lack critical features:
Iceberg solves these problems through a metadata layer. Every table has a manifest file that tracks which data files belong to the table, their schemas, partition information, and statistics. When you write data, Iceberg updates the manifest atomically. When you read, Iceberg prunes files based on partition and column statistics before touching any data.
The result: your analytics queries run faster, your data pipelines are more reliable, and your data engineers can evolve schemas without breaking downstream consumers. On Google Cloud Storage, this means you get the cost advantages of object storage (pennies per terabyte per month) with the reliability and performance of a managed warehouse.
A production lakehouse on GCP follows a logical architecture with distinct layers:
Storage Layer: Google Cloud Storage buckets hold all data files (Parquet format, typically) and Iceberg metadata. You organize buckets by environment (dev, staging, prod) and by data domain (finance, product, customer). This separation makes access control, cost tracking, and disaster recovery simpler.
Table Metadata Layer: Iceberg metadata (manifests, snapshots, schemas) lives in the same GCS bucket or a separate metadata bucket. Tools like Google Cloud's BigLake can manage this metadata and provide a unified query interface across your Iceberg tables.
Compute Layer: Multiple query engines can read the same Iceberg tables. Apache Spark is common for ETL and batch analytics; Trino for interactive queries; BigQuery for SQL analytics at scale. Each engine independently interprets the Iceberg metadata and reads only the files it needs.
Analytics & BI Layer: D23's managed Superset connects to your compute layer (Spark, Trino, or BigQuery) via SQL, letting your analysts and product teams build dashboards and explore data without writing code. Superset's native support for text-to-SQL and API-first architecture means you can embed analytics directly into your product.
This layering is crucial. Your data engineers own the storage and table format; your analytics engineers own compute configuration; your analysts own the BI layer. Changes in one layer do not cascade downward.
Creating an Iceberg table on GCS requires three components: a GCS bucket, a metadata catalog (like Hive metastore or Iceberg REST catalog), and a compute engine (Spark or Trino). Here's the pattern:
Step 1: Create a GCS Bucket
Start with a bucket following Google Cloud naming conventions (globally unique, lowercase, hyphens). Enable versioning if you want point-in-time recovery of metadata, and set a lifecycle policy to archive old Iceberg snapshots after 30 days (unless you need extended time travel).
gsutil mb -c STANDARD gs://my-lakehouse-prod
gsutil versioning set on gs://my-lakehouse-prod
Step 2: Configure a Metadata Catalog
Iceberg tables need a catalog—a service that tracks table metadata and enforces schema evolution. You have two main options:
For production, a managed REST catalog (like those offered by Starburst or Google) is simpler because you do not manage the catalog infrastructure. For cost-sensitive setups, a Hive metastore on Dataproc works well.
Step 3: Create an Iceberg Table
Using Spark on Dataproc, create a table with Iceberg format:
CREATE TABLE my_lakehouse.events (
event_id STRING,
user_id STRING,
event_timestamp TIMESTAMP,
event_properties MAP<STRING, STRING>
)
USING ICEBERG
PARTITIONED BY (DATE(event_timestamp))
LOCATION 'gs://my-lakehouse-prod/events';The PARTITIONED BY clause tells Iceberg to organize data by date, which prunes files during query planning. The LOCATION points to your GCS bucket. Iceberg automatically creates the metadata directory structure.
Step 4: Insert Data
Once the table exists, write data using Spark SQL or your ETL tool:
INSERT INTO my_lakehouse.events
SELECT * FROM raw_events WHERE processed_at >= CURRENT_DATE;Iceberg handles the write atomically. If the job fails halfway, the table is not corrupted; the next run simply retries.
The power of Iceberg emerges when you query the same table from different engines. Your Spark ETL pipeline writes to an Iceberg table; your data analysts query it from Trino via Superset; your ML team reads it from BigQuery. All see the same consistent data.
Querying from Spark
Spark has native Iceberg support. Configure your Spark cluster to use Iceberg:
spark = SparkSession.builder \
.config("spark.sql.catalog.my_lakehouse", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.my_lakehouse.type", "hive") \
.config("spark.sql.catalog.my_lakehouse.warehouse", "gs://my-lakehouse-prod") \
.getOrCreate()
df = spark.sql("SELECT * FROM my_lakehouse.events WHERE event_timestamp > '2024-01-01'")Querying from Trino
Trino's Iceberg connector lets you query Iceberg tables on GCS without Spark:
SELECT event_id, COUNT(*) as event_count
FROM iceberg.my_lakehouse.events
WHERE event_timestamp >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY event_id;Trino reads the Iceberg metadata directly from GCS, prunes partitions, and streams results back. This is fast and cost-effective for interactive queries because you pay only for the data scanned, not for a standing warehouse.
Querying from BigQuery
BigLake is Google's answer to querying open formats on Cloud Storage. BigLake tables reference Iceberg data on GCS and give you BigQuery's SQL engine:
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.events`
USING ICEBERG
LOCATION = 'gs://my-lakehouse-prod/events';
SELECT user_id, COUNT(*) as sessions
FROM `my-project.my_dataset.events`
WHERE DATE(event_timestamp) = CURRENT_DATE()
GROUP BY user_id;BigQuery's query optimizer understands Iceberg metadata, pushes down predicates, and reads only required files. You get BigQuery's performance and SQL dialect without duplicating data.
Once your Iceberg lakehouse is queryable, the next step is giving your team self-serve access without writing SQL. This is where D23's managed Superset shines.
Superset connects to your compute layer (Spark, Trino, or BigQuery) via JDBC or HTTP drivers. You define datasets—logical groupings of tables or SQL queries—and let analysts build dashboards on top.
Setting Up Superset Connections
In D23, create a database connection to your Iceberg compute layer:
Once connected, Superset introspects your tables and schemas, discovering all Iceberg tables automatically.
Building Datasets and Dashboards
Create a dataset for your events table:
Dataset Name: Events
Table: my_lakehouse.events
Columns: event_id, user_id, event_timestamp, event_properties
Granularity: event_timestamp (daily)
Superset infers data types and suggests aggregations. Analysts then build charts:
All queries are pushed down to your compute layer. Superset does not copy data; it translates UI interactions into SQL, executes against Iceberg, and renders results.
Embedding Analytics in Your Product
If you are a product company, D23's embedded analytics lets you embed Superset dashboards directly into your app. Your customers see real-time analytics without leaving your product. Superset's API-first architecture means you can:
A lakehouse at scale requires governance and optimization. Here are the key considerations:
Data Governance
Google Cloud's Dataplex integrates with Iceberg tables to enforce data governance policies. You define data zones (raw, curated, analytics), assign owners, and enforce tagging:
Dataplex discovers Iceberg tables, catalogs their lineage, and enforces access policies. When an analyst queries a table in Superset, Dataplex ensures they have permission and logs the access.
Performance Tuning
Iceberg performance depends on three factors: partition strategy, file size, and statistics.
Cost Optimization
GCS + Iceberg is cost-effective compared to proprietary warehouses, but optimization matters:
One of Iceberg's killer features is schema evolution without rewriting data. Add a column, drop a column, change a type—Iceberg tracks the change in metadata without touching existing files.
ALTER TABLE my_lakehouse.events ADD COLUMN event_version INT DEFAULT 1;When you query, Iceberg automatically includes the new column. Existing files return NULL for the new column; new files include the value.
Time travel lets you query historical snapshots:
SELECT * FROM my_lakehouse.events
VERSION AS OF 12345 -- Query snapshot 12345
WHERE event_timestamp > '2024-01-01';This is invaluable for auditing, debugging data quality issues, and recovering from accidental deletes.
Why build a lakehouse instead of using BigQuery directly or a proprietary data warehouse?
BigQuery Native: BigQuery is excellent for SQL analytics, but stores data in Google's proprietary format. If you want to query the same data from Spark (for ML), Trino (for real-time dashboards), or DuckDB (for local analysis), you must export and duplicate. BigQuery's storage costs are also higher (~$0.06/GB/month for active storage, $0.013 for long-term).
Proprietary Data Warehouses (Snowflake, Redshift, Teradata): These are powerful but vendor-locked. If you want to switch tools or query from multiple engines, you must replicate data. Iceberg + GCS avoids this lock-in.
Traditional Data Lakes (Parquet/ORC on S3): These are cheap but lack schema governance, ACID transactions, and efficient querying. Iceberg layers governance on top without the cost penalty.
Iceberg + GCS: Combines low storage costs, open format portability, ACID transactions, and multi-engine query support. You pay for storage (cheap) and compute (you choose). No vendor lock-in.
Consider an e-commerce company with 10 TB of events (orders, clicks, page views) ingested daily via Kafka. They want to:
Architecture:
Costs:
A comparable Snowflake setup (10 TB on-demand) would cost ~$15,000+/month. A Looker license for 20 analysts adds another $10,000+/month.
If you are considering an open lakehouse, here is a pragmatic path:
Phase 1 (Weeks 1–2): Set up a proof of concept
Phase 2 (Weeks 3–4): Integrate with analytics
Phase 3 (Months 2–3): Operationalize
Phase 4 (Months 4+): Optimize and expand
Building an open lakehouse on GCS and Iceberg is not just a technical choice—it is a strategic one. You get:
Whether you are a startup building your first analytics platform or an enterprise consolidating disparate data warehouses, Iceberg + GCS is a modern, scalable foundation. Combined with D23's managed Superset for self-serve BI and embedded analytics, you have a complete, production-grade analytics stack without the overhead of managing infrastructure or dealing with vendor lock-in.
The lakehouse paradigm is no longer a buzzword—it is the pragmatic choice for data-driven organizations that want cost efficiency, flexibility, and control.