What Is a Data Lakehouse?

A data lakehouse is an architecture that stores data in open file formats on cheap object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage), while adding a structured table layer on top that gives you transactions, schema enforcement, and fast query performance. You get the cost profile of a data lake and the reliability of a data warehouse in a single system.

The term was first formalized in a 2020 paper from UC Berkeley and Databricks, which argued that the traditional two-tier approach (raw lake for ML, separate warehouse for BI) was causing real problems: data duplication, staleness, high cost, and brittle pipelines. The lakehouse collapses those tiers into one.

The Three Layers

Every lakehouse has three functional layers. Understanding what each one does makes the rest of the architecture click.

```mermaid
graph TD
  A["Source Systems (Databases, APIs, Streams)"] --> B["Storage Layer: Object Storage S3 / GCS / ADLS"]
  B --> C["Table Format Layer: Apache Iceberg — Metadata + Manifests + Data Files"]
  C --> D["Catalog Layer: Apache Polaris · AWS Glue · Project Nessie"]
  D --> E["Query Engines: Dremio · Spark · Trino · Athena · Snowflake"]
  E --> F["Consumers: BI Tools · AI Agents · Data Science · Applications"]
```

Storage Layer

Object storage holds the actual data in columnar file formats, primarily Apache Parquet. It is cheap, durable, and serverless. You pay for what you store, not for idle compute.
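Why columnar formats like Parquet matter is easiest to see with a toy example. The sketch below is pure Python, not actual Parquet (which adds row groups, encodings, and compression); it only illustrates the layout difference that lets a query skip columns it doesn't need.

```python
# Row-oriented data as it arrives from a source system.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 33.2},
]

# Columnar layout: one contiguous array per column, as Parquet stores it.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query like SELECT SUM(amount) touches only the "amount" column and
# never reads the others -- the core I/O win of columnar formats.
total = sum(columns["amount"])
print(round(total, 1))  # 228.7
```

In a real lakehouse the engine applies the same idea at file scale: it reads only the Parquet column chunks the query references, which is a large part of why scans over object storage stay fast.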

Table Format Layer

This is where the lakehouse diverges from a plain data lake. A table format sits between the raw files and the query engines, tracking exactly which files belong to which table, what the schema is, and what changed in each transaction. Apache Iceberg is the dominant open table format for this layer. It provides ACID transactions, schema evolution, time travel, and a consistent view of data across every engine that reads from the table.
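The snapshot mechanism behind ACID transactions and time travel can be sketched in a few lines. This is a deliberately simplified model with hypothetical names, not Iceberg's actual implementation (which persists metadata JSON, manifest lists, and manifest files on object storage), but the commit/scan behavior follows the same principle.

```python
# Toy model of a snapshot-based table format.
class Table:
    def __init__(self):
        self.snapshots = []  # each snapshot: an immutable list of data files

    def commit(self, added_files):
        # A commit never rewrites history: it appends a new snapshot that
        # references the previous snapshot's files plus the new ones.
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + list(added_files))

    def scan(self, snapshot_id=None):
        # Readers plan against exactly one snapshot, so they see a
        # consistent view even while writers commit concurrently.
        # Passing an older snapshot_id is time travel.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

t = Table()
t.commit(["data/file-a.parquet"])
t.commit(["data/file-b.parquet"])
print(t.scan())               # both files: the current table state
print(t.scan(snapshot_id=0))  # only file-a: the table as of commit 1
```

Because every engine resolves the same current snapshot before reading, they all agree on the table's contents, which is what "a consistent view across every engine" means in practice.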

Catalog Layer

The catalog maps table names to their metadata locations. Catalogs like Apache Polaris, AWS Glue, and Project Nessie expose the Iceberg REST Catalog API, so any compatible engine can connect to any catalog using the same standard interface.
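At its core, a catalog is a lookup from a table name to the location of that table's current metadata file. The sketch below models that lookup with an in-memory dict and hypothetical paths; real catalogs such as Polaris or Glue expose the same operation over the Iceberg REST Catalog API.

```python
# Toy catalog: namespaced table name -> current metadata file location.
catalog = {}

def register_table(name, metadata_location):
    catalog[name] = metadata_location

def load_table(name):
    # An engine asks the catalog "where is this table's current metadata?"
    # then reads everything else (schema, snapshots, data files) directly
    # from object storage starting at that location.
    return catalog[name]

register_table("sales.orders", "s3://lake/sales/orders/metadata/v3.json")
print(load_table("sales.orders"))
```

Because the catalog holds only this small pointer (and arbitrates commits to it), swapping query engines doesn't require moving any data: every engine resolves the same pointer and reads the same files.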

Data Lakehouse vs Data Lake vs Data Warehouse

```mermaid
graph LR
  subgraph DW["Data Warehouse"]
    DW1["Proprietary storage"]
    DW2["Schema-on-write"]
    DW3["Fast SQL queries"]
    DW4["High cost at scale"]
  end
  subgraph DL["Data Lake"]
    DL1["Open object storage"]
    DL2["Schema-on-read"]
    DL3["No ACID guarantees"]
    DL4["Low storage cost"]
  end
  subgraph LH["Data Lakehouse"]
    LH1["Open object storage"]
    LH2["Schema enforcement via table format"]
    LH3["Fast SQL + ML + AI in one place"]
    LH4["Low cost with warehouse reliability"]
  end
```
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
| --- | --- | --- | --- |
| Storage format | Open (raw files) | Proprietary | Open (Parquet + table format) |
| ACID transactions | No | Yes | Yes (via Iceberg) |
| Schema enforcement | Read-time only | Write-time | Write-time + evolution |
| Time travel | No | Limited | Yes (snapshot history) |
| Multi-engine access | Yes (raw files) | No (proprietary API) | Yes (REST Catalog standard) |
| ML / AI workloads | Yes | Difficult | Yes |
| Storage cost | Low | High | Low |
| SQL query speed | Slow | Fast | Fast (with optimization) |
| Vendor lock-in | Low | High | Low (open standards) |

For a deeper comparison, see the full comparison guide.

Open Table Formats

Three formats dominate the lakehouse table layer today: Apache Iceberg (broadest multi-engine support), Delta Lake (Databricks-primary), and Apache Hudi (key-based upsert pipelines). For a full side-by-side breakdown, see the open table format comparison.

When a Data Lakehouse Is the Right Choice

A lakehouse makes sense when you need more than one of the following from the same data: SQL analytics, machine learning training data, real-time streaming, or AI agent access, and you want them without paying for two separate systems or copying data between them. If your workloads are purely SQL-based and your data volume is moderate, a managed warehouse may be simpler. A lakehouse is also the right choice when vendor independence matters: the data lives in open formats on storage you control, so no single engine can lock you in.

The Agentic Lakehouse

The latest evolution adds a governed AI access layer on top. AI agents can query your Iceberg tables through a semantic layer that translates business questions into SQL, executes them against governed data, and returns results the agent can reason over. See the Agentic Lakehouse guide.
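One way to picture the semantic layer is as a registry of governed metric definitions the agent selects from by name, rather than writing raw SQL itself. The sketch below uses entirely hypothetical metric names and SQL; it only illustrates the translation step, not any particular product's API.

```python
# Toy semantic layer: governed business metrics mapped to vetted SQL.
METRICS = {
    "monthly_revenue": (
        "SELECT date_trunc('month', order_date) AS month, SUM(amount) "
        "FROM sales.orders GROUP BY 1"
    ),
}

def resolve_metric(metric):
    # The agent asks for a metric by name; the layer returns the governed
    # SQL a query engine would run against the Iceberg tables. Unknown
    # metrics are rejected instead of letting the agent improvise SQL.
    if metric not in METRICS:
        raise KeyError(f"unknown metric: {metric}")
    return METRICS[metric]

print(resolve_metric("monthly_revenue"))
```

The governance benefit comes from the indirection: the agent reasons over results, while the SQL it triggers is limited to definitions a human has reviewed.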

Frequently Asked Questions

Is a data lakehouse the same as a data lake?

No. A data lake is raw file storage with no table semantics, no transactions, and no consistent query interface. A data lakehouse adds a table format layer that gives you ACID guarantees, schema enforcement, and fast query planning on top of the same low-cost object storage.

Do I need Apache Iceberg to build a data lakehouse?

You need a table format. Apache Iceberg is the most widely supported choice, with native support from every major cloud provider and query engine. Delta Lake and Apache Hudi are alternatives, but Iceberg has the broadest multi-engine write support and an open catalog standard.

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.