Skip to content

Agentic Analytics and the Semantic Layer

Agentic analytics is what happens when the bridge between business questions and data answers is an AI agent rather than a human analyst. A user asks "which product categories grew fastest this quarter?" and an agent generates SQL, runs it against Iceberg tables, and returns a business-language answer. No analyst writes the query, no BI dashboard has to be pre-built.

Making this reliable requires more than pointing a language model at a database. The semantic layer is the difference between an agentic analytics system that is useful and one that gives confidently wrong answers.

The NL2SQL Pipeline

flowchart LR Q["User question in plain English"] SL["Semantic Layer: Table descriptions, column meanings, metric definitions, filter rules"] LLM["LLM: Generates SQL grounded in semantic context"] ENG["Query Engine (Dremio / Trino / Athena): Executes against Iceberg"] ICE["Iceberg Tables via Apache Polaris catalog"] ANS["LLM: Synthesizes result into business-language answer"] Q --> SL --> LLM --> ENG --> ICE --> ENG --> ANS

Why Raw Schemas Are Not Enough

Consider a table with columns: ord_id, cust, tot, dt, st, rgn. An LLM seeing this schema does not know that tot excludes tax, that st is an order status where CANCELLED orders must be excluded from revenue, or that dt is stored in UTC. Without that context, the LLM writes syntactically valid SQL that returns the wrong answer.

What the Semantic Layer Provides

ElementExampleWhy agents need it
Table description"Canonical order table. Every customer order since 2020-01-01. Excludes test orders."Agent knows when to use this table vs others
Column descriptions"tot: net order revenue in USD after discounts, before tax and shipping"Agent uses the right column for revenue
Metric definitions"revenue = SUM(tot) WHERE st IN ('SHIPPED','DELIVERED')"Agent applies correct filters without guessing
Relationship declarations"cust references customers.customer_id"Agent writes correct JOIN conditions
Business rules"Never include CANCELLED orders in any revenue metric"Agent applies domain logic, not just SQL logic
Timezone / unit notes"dt is stored in UTC"Agent converts correctly for local-time questions

Dremio as a Platform-Managed Semantic Layer

Dremio Virtual Datasets wrap raw Iceberg tables with a clean schema, column descriptions in plain language, and enforced business logic. Dremio's AI Semantic Layer makes these descriptions available to the AI agent at query time, so the LLM generating SQL sees business context rather than raw column names. The metadata lives in Dremio, versioned and governed alongside the data.

Common Failure Modes

FailureCauseFix
Wrong revenue numberLLM included CANCELLED ordersBusiness rule in semantic layer
Join returns duplicatesLLM guessed wrong join keyRelationship declaration in semantic layer
Wrong date rangeLLM used local timezone, data is UTCTimezone note in column description
Hallucinated numbersLLM answered from training data, not live dataForce agent to always execute SQL, never answer from memory
Unauthorized table accessNo access control on query engineCatalog RBAC (Polaris) + MCP server authorization

Go Deeper

๐Ÿ“š Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.