
What Is an Agentic Lakehouse?

An agentic lakehouse is a data lakehouse extended with the infrastructure required for AI agents to safely query and act on enterprise data. The word "agentic" means the architecture has four properties that LLM-based agents require: governed access, trustworthy execution, contextual metadata, and open interoperability.

Why Standard Lakehouses Are Not Enough for AI Agents

A standard lakehouse assumes a human in the loop: column meanings live in docs and wikis, authorization stops at table-level RBAC, and the audit trail is a plain query log. An autonomous agent has none of that implicit context and operates at machine speed, so each of those informal conventions must become an explicit, enforced layer.

The Four Required Layers

```mermaid
graph TD
    A["AI Agent (LLM + tool-calling)"]
    A --> B["Semantic Layer: Business context, table descriptions, metric definitions, column meanings"]
    B --> C["Governed Query Layer: Authentication, RBAC, credential vending, row/column masking, audit logging"]
    C --> D["Iceberg Table Layer: ACID snapshots, schema evolution, time travel, immutable history"]
    D --> E["Object Storage: Parquet files in S3 / GCS / ADLS"]
```

Layer 1: The Semantic Layer

The semantic layer maps raw column names and table names to meanings an LLM can use correctly. When an agent asks about "quarterly revenue," the semantic layer tells it that revenue means SUM(total) WHERE status IN ('SHIPPED', 'DELIVERED') on the orders table, and that cancelled orders must be excluded.
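One way to make such a definition machine-readable is to store the metric's aggregation and its mandatory business filter together, so the agent composes them instead of guessing. This is a minimal sketch; the names (`MetricDefinition`, `to_sql`) are illustrative, not a specific product's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """A business definition an agent can read before writing SQL."""
    name: str
    description: str
    table: str
    expression: str       # the governed aggregation
    required_filter: str  # rows the business rule includes

    def to_sql(self, extra_where: str = "") -> str:
        where = self.required_filter
        if extra_where:
            where = f"({where}) AND ({extra_where})"
        return f"SELECT {self.expression} FROM {self.table} WHERE {where}"

REVENUE = MetricDefinition(
    name="revenue",
    description="Sum of order totals for shipped or delivered orders; "
                "cancelled orders are excluded.",
    table="orders",
    expression="SUM(total)",
    required_filter="status IN ('SHIPPED', 'DELIVERED')",
)

# An agent asked for quarterly revenue adds only its time filter;
# the exclusion of cancelled orders comes from the definition itself.
sql = REVENUE.to_sql("order_date >= DATE '2024-10-01'")
```

Because the required filter is part of the definition, every query derived from it applies the same business rule, whichever agent framework generated the SQL.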

Layer 2: The Governed Query Layer

This layer handles authentication, authorization, and policy enforcement: role-based access control, row and column masking, and audit logging. The catalog (Apache Polaris) evaluates these policies and vends temporary, scoped storage credentials that allow access only to the files the requesting principal is authorized to read.
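Conceptually, credential vending is an RBAC check followed by issuing a short-lived token scoped to one table's file prefix. The sketch below is illustrative only; Polaris's actual REST API and grant model are not shown, and the names (`GRANTS`, `vend_credentials`) are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class VendedCredential:
    principal: str
    allowed_prefixes: tuple   # object-store paths this token may read
    expires_at: datetime

# Hypothetical policy table: which storage prefixes each principal may read.
GRANTS = {
    "analytics-agent": (
        "s3://lake/analytics/orders/",
        "s3://lake/analytics/customers/",
    ),
}

def vend_credentials(principal: str, table_prefix: str,
                     ttl_minutes: int = 15) -> VendedCredential:
    """Check RBAC, then issue a short-lived credential scoped to one table."""
    if table_prefix not in GRANTS.get(principal, ()):
        raise PermissionError(f"{principal} is not authorized for {table_prefix}")
    return VendedCredential(
        principal=principal,
        allowed_prefixes=(table_prefix,),  # scoped down to the single table
        expires_at=datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes),
    )
```

The key property is that the agent never holds long-lived storage keys: even a misbehaving agent can only read the files its vended token covers, and only until the token expires.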

Layer 3: The Iceberg Table Layer

Apache Iceberg provides immutable snapshots (so results are reproducible), time travel (so you can reconstruct what data the agent saw at query time), schema history, and ACID guarantees (so agents do not see partial writes).
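Time travel works because every commit produces an immutable snapshot with a timestamp; querying "as of" a moment means resolving that moment to the latest snapshot committed at or before it. A minimal sketch of that resolution logic (the `Snapshot` class here is a stand-in, not Iceberg's metadata classes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int  # epoch millis of the commit

def snapshot_as_of(snapshots, timestamp_ms):
    """Latest snapshot committed at or before the given timestamp."""
    eligible = [s for s in snapshots if s.committed_at_ms <= timestamp_ms]
    if not eligible:
        return None  # the table did not exist yet at that time
    return max(eligible, key=lambda s: s.committed_at_ms)

history = [
    Snapshot(101, 1_700_000_000_000),
    Snapshot(102, 1_700_000_600_000),
    Snapshot(103, 1_700_001_200_000),
]

# Reconstruct exactly what an agent saw at a past moment
# (here, between the commits of snapshots 102 and 103):
seen = snapshot_as_of(history, 1_700_000_900_000)
```

Logging the resolved snapshot ID alongside each agent query is what makes results reproducible: re-running the same SQL against the same snapshot returns the same rows, regardless of writes that happened since.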

Layer 4: Object Storage

The bottom layer is open Parquet files in object storage you control. Because the data is not locked in a proprietary format, agents built on any framework can connect to the same underlying data without format conversion.

How a Typical Agent Query Flows

```mermaid
sequenceDiagram
    participant U as User
    participant A as AI Agent (LLM)
    participant SL as Semantic Layer
    participant QE as Query Engine
    participant Cat as Catalog (Apache Polaris)
    participant S3 as Iceberg / Object Storage
    U->>A: "Which customers churned this quarter?"
    A->>SL: Fetch schema + business context for analytics tables
    SL-->>A: Table descriptions, metric definitions, filter rules
    A->>QE: Execute SQL (NL2SQL generated from context)
    QE->>Cat: Load table, get credentials
    Cat-->>QE: Vended S3 credentials (scoped to authorized files)
    QE->>S3: Read Iceberg Parquet files (pruned to relevant partitions)
    S3-->>QE: Data
    QE-->>A: Result set
    A-->>U: "47 customers who purchased in Q3 did not purchase in Q4"
```
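The same flow can be sketched as three calls: fetch context, generate SQL, execute under a governed identity. Every function below is an illustrative stub (not a real framework's API); the point is the ordering and the fact that the agent's identity travels with the query:

```python
def fetch_context(table: str) -> dict:
    # Semantic layer: the business rules the LLM needs for NL2SQL.
    return {"table": table,
            "churn_rule": "purchased_q3 = TRUE AND purchased_q4 = FALSE"}

def generate_sql(question: str, context: dict) -> str:
    # Stand-in for the LLM's NL2SQL step, grounded in the fetched context.
    return (f"SELECT customer_id FROM {context['table']} "
            f"WHERE {context['churn_rule']}")

def execute(sql: str, principal: str) -> list:
    # Query engine: the catalog authorizes the principal, vends scoped
    # credentials, and the engine reads only the permitted Iceberg files.
    if principal != "analytics-agent":
        raise PermissionError("catalog rejects unknown principals")
    return ["cust_17", "cust_42"]  # illustrative result set

context = fetch_context("analytics.orders")
sql = generate_sql("Which customers churned this quarter?", context)
rows = execute(sql, principal="analytics-agent")
```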

Agentic Lakehouse vs Standard Lakehouse

| Property | Standard Lakehouse | Agentic Lakehouse |
| --- | --- | --- |
| Primary consumers | Human analysts, BI tools | AI agents + human analysts |
| Query interface | SQL editors, BI connectors | SQL + MCP + natural language |
| Semantic context | Optional (docs, wikis) | Required (machine-readable semantic layer) |
| Authorization model | Table-level RBAC | Per-agent RBAC + row/column masking + credential vending |
| Auditability | Query logs | Query logs + snapshot ID + agent identity |
| Write safety | Manual review | WAP pattern + automated validation before publish |
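The WAP (write-audit-publish) row deserves a concrete shape: stage the write where readers cannot see it, run validations, and publish atomically only if every check passes. The sketch below models the table as a plain dict with `main` and `staging` keys; in Iceberg this corresponds to writing to a staging branch and fast-forwarding `main` after the audit, but the function and check names here are illustrative:

```python
def write_audit_publish(new_rows, validations, table):
    """WAP: stage the write, audit it, publish only if all checks pass."""
    table["staging"] = table["main"] + list(new_rows)   # 1. write (unpublished)
    failures = [name for name, check in validations.items()
                if not check(table["staging"])]          # 2. audit
    if failures:
        table.pop("staging")                             # discard the staged write
        return failures
    table["main"] = table.pop("staging")                 # 3. publish atomically
    return []

table = {"main": [{"id": 1, "total": 30}]}
validations = {
    "no_negative_totals": lambda rows: all(r["total"] >= 0 for r in rows),
    "ids_unique": lambda rows: len({r["id"] for r in rows}) == len(rows),
}

ok = write_audit_publish([{"id": 2, "total": 50}], validations, table)   # passes
bad = write_audit_publish([{"id": 3, "total": -5}], validations, table)  # rejected
```

For agent-generated writes this is the safety net: a bad write never reaches `main`, and the failed check names can be fed back to the agent for a retry.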

Go Deeper

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.