Skip to content

Using Iceberg Metadata Tables to Determine When Compaction Is Needed

Published: at 09:00 AM

Using Iceberg Metadata Tables to Determine When Compaction Is Needed

Scheduling compaction at fixed intervals is better than not optimizing at all—but it can still lead to unnecessary compute spend or delayed maintenance. A smarter approach is to dynamically trigger compaction based on real-time metadata signals.

Apache Iceberg makes this possible with its powerful system of metadata tables, which expose granular details about files, snapshots, and manifests.

In this post, we’ll explore how to query these tables to:

What Are Iceberg Metadata Tables?

Every Iceberg table automatically maintains a set of virtual tables that expose its internals. The most relevant for optimization include:

These tables can be queried like any other SQL table, making it easy to introspect your table’s health.

1. Detecting Small Files with the files Table

To identify partitions suffering from small file syndrome:

SELECT
  partition,
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) AS avg_size_bytes
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) > 10 AND AVG(file_size_in_bytes) < 134217728; -- 128 MB

You can use this to:

2. Finding Fragmented or Stale Manifests

Bloated metadata can come from too many or inefficient manifest files. Use the manifests table to explore:

SELECT
  COUNT(*) AS manifest_count,
  AVG(added_data_files_count) AS avg_files_per_manifest
FROM my_table.manifests;

Low averages can indicate fragmented manifests that are good candidates for rewriting.

3. Tracking Snapshot Volume and Velocity

To see if snapshots are accumulating too fast (and increasing metadata overhead):

SELECT
  COUNT(*) AS snapshot_count,
  MIN(committed_at) AS first_snapshot,
  MAX(committed_at) AS latest_snapshot
FROM my_table.snapshots;

You can also inspect how many files each snapshot adds or removes to identify noisy patterns from ingestion jobs.

4. Building a Health Score

By combining file count, file size, manifest count, and snapshot frequency, you can compute a “table health score”:

-- Example: High file count + small average size = poor health
WITH file_stats AS (
  SELECT COUNT(*) AS total_files, AVG(file_size_in_bytes) AS avg_file_size
  FROM my_table.files
),
manifest_stats AS (
  SELECT COUNT(*) AS total_manifests
  FROM my_table.manifests
)
SELECT
  total_files,
  avg_file_size,
  total_manifests,
  CASE
    WHEN avg_file_size < 67108864 AND total_files > 1000 THEN 'Needs compaction'
    ELSE 'Healthy'
  END AS status
FROM file_stats, manifest_stats;

5. Triggering Compaction Automatically

Once you identify problematic patterns, you can wire up your orchestration layer to act:

Benefits of Metadata-Driven Optimization

Summary

Apache Iceberg gives you visibility and control over your tables through metadata tables. By tapping into this metadata:

In the next post, we’ll dive into partition evolution and layout pitfalls, and how to avoid undermining your compaction and clustering strategies when schemas or partitions change.