What Is a Data Lakehouse?
A data lakehouse is an architecture that stores data in open file formats on cheap object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage), while adding a structured table layer on top that gives you transactions, schema enforcement, and fast query performance. You get the cost profile of a data lake and the reliability of a data warehouse in a single system.
The term was formalized in a 2021 CIDR paper from researchers at Databricks, UC Berkeley, and Stanford, which argued that the traditional two-tier approach (raw lake for ML, separate warehouse for BI) was causing real problems: data duplication, staleness, high cost, and brittle pipelines. The lakehouse collapses those tiers into one.
The Three Layers
Every lakehouse has three functional layers. Understanding what each one does makes the rest of the architecture click.
Storage Layer
Object storage holds the actual data in columnar file formats, primarily Apache Parquet. It is cheap, durable, and serverless. You pay for what you store, not for idle compute.
Table Format Layer
This is where the lakehouse diverges from a plain data lake. A table format sits between the raw files and the query engines, tracking exactly which files belong to which table, what the schema is, and what changed in each transaction. Apache Iceberg is the dominant open table format for this layer. It provides ACID transactions, schema evolution, time travel, and a consistent view of data across every engine that reads from the table.
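The idea of a table format tracking files per transaction can be sketched in a few lines of plain Python. This toy model is not the Iceberg metadata layout itself, just an illustration: each commit produces an immutable snapshot listing exactly which data files are live, and time travel is simply reading an older snapshot.

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    data_files: tuple  # frozen set of data file paths live at this commit

@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, data_files):
        # Every write creates a new snapshot; old ones are never mutated.
        snap = Snapshot(len(self.snapshots) + 1, tuple(data_files))
        self.snapshots.append(snap)
        return snap

    def current(self):
        return self.snapshots[-1]

    def as_of(self, snapshot_id):
        # "Time travel": read the file list as of an earlier commit.
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)

t = Table()
t.commit(["data-0001.parquet"])
t.commit(["data-0001.parquet", "data-0002.parquet"])

print(t.current().data_files)  # latest view of the table
print(t.as_of(1).data_files)   # the table as it existed at snapshot 1
```

Because snapshots are immutable and swapped in atomically, every engine reading the table sees a consistent file list, which is where the ACID guarantees come from.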
Catalog Layer
The catalog maps table names to their metadata locations. Catalogs like Apache Polaris, AWS Glue, and Project Nessie expose the Iceberg REST Catalog API, so any compatible engine can connect to any catalog using the same standard interface.
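The catalog's job is simple to state in code. In the sketch below a plain dict stands in for the catalog service, and the paths are hypothetical; a real REST catalog would answer the same lookup over HTTP via the Iceberg REST Catalog API's table-load endpoint.

```python
# Toy catalog: resolve a (namespace, table) name to the location of the
# table's current metadata file. Every compatible engine starts a read
# by asking the catalog this one question. Paths are hypothetical.
catalog = {
    ("analytics", "orders"):
        "s3://warehouse/analytics/orders/metadata/v7.metadata.json",
}

def load_table(namespace: str, table: str) -> str:
    """Return the metadata location an engine would begin reading from."""
    return catalog[(namespace, table)]

print(load_table("analytics", "orders"))
```

Because the lookup interface is standardized, Spark, Trino, DuckDB, or any other engine can resolve the same table name to the same metadata without engine-specific glue.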
Data Lakehouse vs Data Lake vs Data Warehouse
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage format | Open (raw files) | Proprietary | Open (Parquet + table format) |
| ACID transactions | No | Yes | Yes (via Iceberg) |
| Schema enforcement | Read-time only | Write-time | Write-time + evolution |
| Time travel | No | Limited | Yes (snapshot history) |
| Multi-engine access | Yes (raw files) | No (proprietary API) | Yes (REST Catalog standard) |
| ML / AI workloads | Yes | Difficult | Yes |
| Storage cost | Low | High | Low |
| SQL query speed | Slow | Fast | Fast (with optimization) |
| Vendor lock-in | Low | High | Low (open standards) |
For a deeper comparison, see the full comparison guide.
Open Table Formats
Three formats dominate the lakehouse table layer today: Apache Iceberg (broadest multi-engine support), Delta Lake (Databricks-primary), and Apache Hudi (key-based upsert pipelines). For a full side-by-side breakdown, see the open table format comparison.
When a Data Lakehouse Is the Right Choice
A lakehouse makes sense when you need more than one of the following from the same data: SQL analytics, machine learning training data, real-time streaming, or AI agent access. It serves those workloads without paying for two separate systems or copying data between them, and its open formats help when vendor independence matters. If your workloads are purely SQL-based and your data volume is moderate, a managed warehouse may be simpler.
The Agentic Lakehouse
The latest evolution adds a governed AI access layer on top. AI agents can query your Iceberg tables through a semantic layer that translates business questions into SQL, executes them against governed data, and returns results the agent can reason over. See the Agentic Lakehouse guide.
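One way to picture the semantic layer's governance role is as a mapping from vetted business metrics to pre-approved SQL, so the agent never composes raw queries against the tables itself. The metric names and schema below are entirely hypothetical, a sketch of the pattern rather than any product's API.

```python
# Toy semantic layer: only governed, pre-vetted metrics are exposed,
# each bound to SQL that has been reviewed against the Iceberg tables.
# Metric names and the orders schema here are hypothetical.
METRICS = {
    "revenue by region": "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "order count": "SELECT COUNT(*) FROM orders",
}

def to_sql(question: str) -> str:
    """Translate a business question into governed SQL, or refuse."""
    key = question.lower().strip("?")
    if key not in METRICS:
        raise ValueError(f"No governed metric for: {question!r}")
    return METRICS[key]

print(to_sql("Revenue by region?"))
```

The refusal path is the point: anything outside the governed metric set is rejected rather than improvised, which is what makes agent access to the lakehouse auditable.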
Frequently Asked Questions
Is a data lakehouse the same as a data lake?
No. A data lake is raw file storage with no table semantics, no transactions, and no consistent query interface. A data lakehouse adds a table format layer that gives you ACID guarantees, schema enforcement, and fast query planning on top of the same low-cost object storage.
Do I need Apache Iceberg to build a data lakehouse?
You need a table format. Apache Iceberg is the most widely supported choice, with native support from every major cloud provider and query engine. Delta Lake and Apache Hudi are alternatives, but Iceberg has the broadest multi-engine write support and an open catalog standard.