
Apache Iceberg Explained

Apache Iceberg is an open table format for large analytic tables stored in object storage. Created at Netflix in 2017, it graduated to a top-level Apache project in 2020 and is now the most widely adopted open table format across cloud providers, query engines, and data platforms.

Iceberg solves the core problem with Hive-style data lakes: there was no reliable way to know which files belonged to a table at any given moment, making concurrent writes dangerous, schema changes painful, and consistent reads nearly impossible at scale. Iceberg replaces folder-based tracking with a proper metadata system.

The Core Abstraction: Snapshots and Manifests

```mermaid
graph TD
    A["Catalog (table name to metadata pointer)"]
    A --> B["Table Metadata JSON (schema, partition spec, snapshot list)"]
    B --> C["Manifest List (one entry per manifest, with partition summaries)"]
    C --> D1["Manifest File (one entry per data file, with column stats)"]
    C --> D2["Manifest File (one entry per data file, with column stats)"]
    D1 --> E1["Data File (Parquet)"]
    D1 --> E2["Data File (Parquet)"]
    D2 --> E3["Data File (Parquet)"]
```

When a query arrives, the engine reads the catalog to find the current metadata file, reads the manifest list and manifest files to determine which data files match the query filters using per-file column statistics, and only then reads the actual Parquet data. A query filtering on a date column may skip 99% of the data files without ever opening them.
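The pruning step can be sketched as follows. This is an illustrative toy, not Iceberg's actual manifest format: the `DataFileEntry` class and field names are invented, but the mechanism is the same, a file's min/max statistics decide whether it is read at all.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative sketch: per-file min/max column stats (as Iceberg stores in
# manifest entries) let the planner skip data files without opening them.
@dataclass
class DataFileEntry:
    path: str
    min_order_date: date
    max_order_date: date

manifest = [
    DataFileEntry("s3://lake/orders/f1.parquet", date(2024, 1, 1), date(2024, 1, 31)),
    DataFileEntry("s3://lake/orders/f2.parquet", date(2024, 2, 1), date(2024, 2, 29)),
    DataFileEntry("s3://lake/orders/f3.parquet", date(2024, 3, 1), date(2024, 3, 31)),
]

def plan_scan(manifest, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter range."""
    return [f.path for f in manifest if f.max_order_date >= lo and f.min_order_date <= hi]

# Filter: order_date in mid-February 2024 -> only f2 needs to be read.
print(plan_scan(manifest, date(2024, 2, 10), date(2024, 2, 20)))
```
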

Key Capabilities

ACID Transactions

Iceberg uses optimistic concurrency control. A writer reads the current table state, makes changes, and atomically swaps the metadata pointer to a new snapshot. Readers always see a complete, consistent snapshot. Concurrent writers retry if a conflict occurs.
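A minimal sketch of that read-modify-swap loop, assuming a toy in-memory catalog (the class and method names are invented for illustration; real catalogs expose this as an atomic pointer swap or conditional update):

```python
import threading

# Toy optimistic concurrency: the catalog holds one pointer to the current
# metadata version; a commit succeeds only if the pointer has not moved
# since the writer read it (compare-and-swap).
class Catalog:
    def __init__(self):
        self._lock = threading.Lock()
        self.metadata_version = 1

    def commit(self, expected, new):
        """Atomically swap the metadata pointer; fail if another writer won."""
        with self._lock:
            if self.metadata_version != expected:
                return False  # conflict: caller must re-read and retry
            self.metadata_version = new
            return True

def write_with_retry(catalog, max_retries=5):
    for _ in range(max_retries):
        seen = catalog.metadata_version   # 1. read current state
        new = seen + 1                    # 2. stage new snapshot / metadata
        if catalog.commit(seen, new):     # 3. atomic pointer swap
            return new
    raise RuntimeError("too many conflicting commits")

cat = Catalog()
print(write_with_retry(cat))  # -> 2
```

Because data and metadata files are immutable, a failed commit leaves no partial state visible: the loser simply re-reads and retries.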

Schema Evolution Without Data Rewrites

Iceberg assigns every column a numeric ID rather than relying on column names. You can rename, add, reorder, or drop columns without touching a single data file. Old files are still read correctly because the column IDs are stable.
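A toy illustration of why renames are free (this is not Iceberg's actual file layout, just the indirection it relies on): data files reference stable column IDs, and only the table schema maps names to IDs.

```python
# Sketch: data files are keyed by stable column ID; the schema maps
# human-readable names to IDs. Illustrative names and layout.
schema = {"customer": 1, "order_date": 2}                            # name -> column ID
data_file = {1: ["alice", "bob"], 2: ["2024-01-01", "2024-01-02"]}   # ID -> values

def read_column(name, schema, data_file):
    return data_file[schema[name]]

# A rename touches only the schema; the old data file is still readable.
schema = {"cust_name": 1, "order_date": 2}
print(read_column("cust_name", schema, data_file))  # -> ['alice', 'bob']
```
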

Hidden Partitioning

You declare a partition transform, such as days(order_date), and Iceberg applies it at write time and uses it for pruning at read time, without ever exposing partition columns to query authors. Users write filters on real data columns; pruning happens automatically.
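The days transform can be sketched in a few lines. This is a simplified model (Iceberg's days transform does map a date to days since the Unix epoch, but the routing and pruning shown here are illustrative):

```python
from datetime import date

# Sketch of hidden partitioning: days(order_date) maps each row to a
# partition value (days since epoch) at write time; at read time the same
# transform turns a filter on order_date into partition pruning.
EPOCH = date(1970, 1, 1)

def days_transform(d: date) -> int:
    return (d - EPOCH).days

# Write side: rows are routed to partitions the user never sees.
rows = [date(2024, 2, 1), date(2024, 2, 1), date(2024, 2, 2)]
partitions = {days_transform(d) for d in rows}

# Read side: a filter on the real column prunes whole partitions.
wanted = days_transform(date(2024, 2, 2))
print([p for p in partitions if p == wanted])
```

Because the transform is part of table metadata, it can also evolve (e.g., from daily to hourly) without rewriting old data.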

Time Travel and Rollback

Every commit creates an immutable snapshot. You can query any past snapshot by timestamp or snapshot ID, and roll back a table to any previous state with a single metadata operation; no data is rewritten on rollback.
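Selecting a snapshot "as of" a timestamp reduces to a lookup over the commit log kept in table metadata. A minimal sketch, with invented timestamps and snapshot IDs:

```python
import bisect

# Sketch of time travel: every commit appends an immutable snapshot to the
# table metadata; querying "as of" a timestamp means picking the latest
# snapshot at or before that time.
snapshots = [  # (commit_timestamp, snapshot_id), in commit order
    (100, "snap-a"),
    (200, "snap-b"),
    (300, "snap-c"),
]

def snapshot_as_of(ts):
    i = bisect.bisect_right([t for t, _ in snapshots], ts) - 1
    if i < 0:
        raise ValueError("no snapshot at or before this timestamp")
    return snapshots[i][1]

print(snapshot_as_of(250))  # -> snap-b

# Rollback is just moving the current pointer; no data files change.
current = snapshot_as_of(250)
```
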

How a Write Commit Works

```mermaid
sequenceDiagram
    participant W as Writer (Spark / Flink / etc.)
    participant Cat as Catalog
    participant S3 as Object Storage
    W->>Cat: Load table (get metadata location)
    Cat-->>W: metadata.json v4
    W->>S3: Write new Parquet data files
    W->>S3: Write manifest file
    W->>S3: Write manifest list (new snapshot)
    W->>S3: Write new metadata.json v5
    W->>Cat: Commit: swap pointer from v4 to v5
    Cat-->>W: Success (or 409 Conflict, retry)
```

Engine Support

| Engine | Read | Write | Notes |
|---|---|---|---|
| Apache Spark | Yes | Yes | Most complete integration |
| Apache Flink | Yes | Yes | Streaming sink with exactly-once |
| Trino | Yes | Yes | Full DML support |
| Dremio | Yes | Yes | Native with AI Semantic Layer |
| AWS Athena | Yes | Yes | Native via AWS Glue catalog |
| Google BigQuery | Yes | Yes | BigLake managed Iceberg tables |
| Snowflake | Yes | Yes | Iceberg tables + Open Catalog |
| DuckDB | Yes | Partial | Via iceberg extension |
| PyIceberg | Yes | Yes | Python-native client library |

Where Iceberg Fits

```mermaid
graph LR
    A["Ingestion (Kafka, CDC, batch ETL)"] --> B["Iceberg Tables (Bronze / Silver / Gold)"]
    B --> C["Catalog (Apache Polaris / Glue / Nessie)"]
    C --> D1["Analytics (Dremio, Trino, Athena)"]
    C --> D2["ML / AI (Spark, PyIceberg, DuckDB)"]
    C --> D3["AI Agents (MCP, LangChain, Dremio AI Agent)"]
    D1 --> E["BI Tools (Superset, Tableau, Power BI)"]
```

The Catalog and REST API

Catalogs are how engines find tables. Iceberg defines a standard REST Catalog API that any catalog can implement. Apache Polaris (co-created by Dremio and Snowflake) is the reference implementation; other implementations include Project Nessie, AWS Glue, and Snowflake Open Catalog.
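As a concrete (illustrative) example of what pointing a client at a REST catalog looks like, PyIceberg reads catalog settings from a `.pyiceberg.yaml` file; the URI, credential, and warehouse values below are placeholders, not a real endpoint:

```yaml
# .pyiceberg.yaml -- minimal sketch of a REST catalog entry (placeholder values)
catalog:
  my_rest_catalog:
    type: rest
    uri: https://polaris.example.com/api/catalog
    credential: <client-id>:<client-secret>
    warehouse: my_warehouse
```

Because the REST API is standard, swapping catalogs is largely a matter of changing this configuration rather than the engine code.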

Go Deeper

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.