
Iceberg v3 Gets Native Geo Types. It’s More Than a Format Upgrade


Introduction

Geospatial data touches nearly every industry, and until recently, the open lakehouse had no native way to handle it.

Snowflake recently announced Iceberg v3 support with native geometry and geography types. It’s the first major engine to ship the geospatial extensions to the Iceberg spec. These types are now part of the open standard, available to every engine in the ecosystem.

In their Iceberg v3 post, the Snowflake team called out where the geo work came from:

“A special mention to the entire Wherobots team, which implemented geospatial support on its own fork of Iceberg before offering its expertise to the Iceberg community, providing leadership and implementing the feature for the Iceberg project.”

The work started in 2022, the year Wherobots was founded. This post is the story behind it: how production experience became an open standard, and what it means for the teams and tools building on spatial data.

How Geospatial Data Worked in Iceberg Before v3

Until Iceberg v3, geospatial columns did not exist as a concept in the table format. Engineers stored geometry as opaque binary blobs: Well-Known Binary (WKB) bytes in a binary column. The Iceberg catalog had no way to know the column contained spatial data.

The practical consequences:

  • No CRS metadata. The coordinate reference system did not live in the column. Engineers tracked it in documentation or external conventions.
  • No bounding-box statistics. A query engine had no way to skip files based on spatial predicates without reading the geometry itself.
  • No cross-engine portability. Every team needing spatial analytics on a lakehouse built custom encoding pipelines, engine-specific readers, and schema conventions living outside the format.

It worked, but it was fragile. And it did not travel across engines.
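To make the fragility concrete, here is a minimal sketch of the pre-v3 pattern: hand-rolled Well-Known Binary encoding into an opaque binary column, with the CRS carried only by convention. All names here are illustrative.

```python
import struct

# Pre-v3 pattern (illustrative): a point serialized to Well-Known Binary
# and stored as opaque bytes. Iceberg sees only "binary" -- no CRS, no
# bounding box, no spatial semantics.
def point_to_wkb(x: float, y: float) -> bytes:
    # byte order (1 = little-endian), geometry type (1 = Point), then x, y
    return struct.pack("<BIdd", 1, 1, x, y)

def wkb_to_point(blob: bytes):
    order, gtype, x, y = struct.unpack("<BIdd", blob)
    assert order == 1 and gtype == 1, "only little-endian WKB points handled"
    return x, y

blob = point_to_wkb(-122.33, 47.61)
CRS_BY_CONVENTION = "EPSG:4326"  # lived in docs or naming, not in the column
```

Every reader of the table had to know, out of band, that this column is WKB and which CRS applies; nothing in the format could check or enforce it.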

Havasu: Production System First

Havasu was our answer to that problem: a spatial lakehouse extension built on an Iceberg fork. Not a proof of concept. A production system, running real customer workloads since 2022, where we could pressure-test the design decisions that would eventually become a standard:

  • Spatial indexing integrated into table metadata
  • Bounding-box statistics at the manifest level, enabling file-group pruning on spatial predicates
  • CRS propagation through the schema, so coordinate reference systems traveled with the data
  • Geometry column encoding with explicit format annotations (WKB, EWKB, WKT, GeoJSON)

That production experience shaped our GeoLake research, which formalized the core design principles: unambiguous CRS representation, efficient geometry encoding in columnar storage, and bounding-box statistics that let spatial predicates be evaluated before any geometry data is read.
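The bounding-box idea is simple enough to sketch: if each data file's manifest entry carries a bbox, a spatial predicate can discard whole files before any geometry is decoded. The file names and statistics below are hypothetical.

```python
# Illustrative file pruning on bounding-box statistics. Each entry maps a
# data file to hypothetical (xmin, ymin, xmax, ymax) manifest stats.
FILE_BBOXES = {
    "data/file_a.parquet": (-125.0, 45.0, -120.0, 49.0),
    "data/file_b.parquet": (10.0, 50.0, 15.0, 55.0),
    "data/file_c.parquet": (-123.5, 46.5, -121.0, 48.5),
}

def intersects(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def prune(query_bbox):
    # Only files whose bbox can intersect the query need to be scanned;
    # the rest are skipped without reading a single geometry value.
    return sorted(f for f, bb in FILE_BBOXES.items()
                  if intersects(bb, query_bbox))

# A query over the Seattle area skips the European file entirely.
seattle = (-123.0, 47.0, -122.0, 48.0)
```

This is the same shape of pruning engines already do with min/max statistics on numeric columns, extended to two dimensions.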

Then we contributed the design, the implementation experience, and the lessons learned upstream to Apache Parquet and Apache Iceberg, so the broader ecosystem could build on it.

Many of the design decisions validated in Havasu are now part of the Iceberg v3 spec. Bounding-box statistics, CRS propagation, and geometry encoding all made the transition from production system to open standard.

Before and After: Geometry in the Iceberg Schema

To understand why this matters, consider the difference concretely.

Before (Iceberg v1/v2): A geometry column in the Iceberg schema looks like this:

{ "id": 5, "name": "geom", "type": "binary" }

The catalog sees raw bytes. No spatial semantics. An engine reading this table has no way to know this column contains geometries, what CRS they’re in, or how to prune files spatially, unless it relies on out-of-band conventions like GeoParquet file-level metadata.

After (Iceberg v3): The same column becomes:

{ "id": 5, "name": "geom", "type": "geometry(srid:4326)" }

Now the CRS is a property of the type. The Parquet files carry GEOMETRY logical type annotations that any conforming engine can recognize. Bounding-box statistics in the manifest use a compact coordinate encoding, so engines can skip entire file groups that fall outside a query’s spatial bounds, before reading a single geometry value.

Spatial pruning moves to the format level. This is the architecture we designed and validated in Havasu, now available to any engine that reads Iceberg v3.
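As a small illustration, a reader can now recover both the spatial kind and the CRS straight from the type string in the schema. This is a sketch: the parser handles only the `geometry(srid:4326)`-style shape shown in the example above.

```python
import re

# Illustrative parser for the v3-style geo type strings shown above.
GEO_TYPE = re.compile(r"(geometry|geography)\(srid:(\d+)\)")

def parse_column_type(type_str):
    """Return (kind, srid) for a geo-typed column, or None otherwise."""
    m = GEO_TYPE.fullmatch(type_str)
    return (m.group(1), int(m.group(2))) if m else None
```

With v1/v2, the same lookup on `"binary"` yields nothing: the spatial semantics simply were not in the schema.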

Making It Native: Apache Parquet and Iceberg

Making geospatial a first-class type in the open lakehouse required coordinated changes to two foundational Apache projects.

On the Parquet side, the PR to add GEOMETRY and GEOGRAPHY logical types to the format spec drew over 400 comments across months of design review: encoding formats, CRS semantics, edge-interpolation behavior, edge-case handling. Jia Yu (Wherobots Co-Founder and Apache Sedona PMC Chair) and Kristin Cowalcijk (Apache Sedona PMC member) drove core design decisions from the start.

On the Iceberg side, the geo type spec defined how Iceberg represents spatial columns in the catalog, stores bounding-box metadata, and handles spatial partitioning, with another 240+ comments of cross-community design work. The core API and implementation, authored by Kristin, followed with bounding-box types, geospatial predicates, and Parquet geo read/write.

Over a year of coordinated work across both Apache communities. The result: geometry and geography as native primitive types, from the storage format through the table format – with the same level of spec support that timestamps, decimals, and every other type have always had.

The First Wave of Adoption

Snowflake’s announcement is a milestone, not the finish line. It’s the first major engine to ship v3 geo types. It won’t be the last.

Iceberg v3 adoption is accelerating broadly. AWS Glue shipped v3 support at re:Invent 2025. Dremio followed with GA support in their cloud platform. As more engines adopt v3, spatial columns will interoperate the same way every other column type already does. The infrastructure barrier that kept geospatial data siloed from mainstream analytics is coming down.

The end-to-end Parquet geo read/write path for Iceberg is under active review in the Apache Iceberg project. It connects v3 schema types to properly annotated Parquet files with GEOMETRY logical types. Once merged, any Iceberg-compatible engine will produce and consume fully v3-native geo tables.

What This Means for Spatial Data Interoperability

Apache Iceberg is increasingly how organizations query data across engines, and native geo types mean spatial data now travels the same way every other column type does.

A geo-typed table written from one engine is readable by another with correct CRS, bounding-box stats, and spatial pruning intact. No proprietary connectors. No out-of-band conventions.

Cross-engine portability also makes spatial data accessible to AI. Tools and agents working with physical-world data need consistent access across engines and catalogs. When the type contract lives in the format, access does not require per-system special-casing.

Ahead of the Standard

Wherobots Cloud is where teams run spatial ETL, large-scale analytics, and geospatial data engineering on Iceberg today, fully managed. As the upstream v3 pipeline completes, Wherobots Cloud will be the first to produce fully v3-native geo tables. The natural result of being the team that designed the spec and has the longest production track record on it.

Apache Sedona reads and writes Iceberg tables with geospatial columns today, using the Havasu encoding that preceded the v3 spec. This means Sedona users have had production-grade spatial lakehouse capabilities since before the standard was finalized, and the migration path to v3-native types is straightforward as upstream support matures.

WherobotsDB and Sedona also go beyond what the spec requires: distributed CRS-aware computation, automatic transformation across datasets in different projections, and support for CRS formats beyond SRID integers (WKT, PROJ strings, grid-based datum shifts). Most engines that adopt v3 will read the CRS tag. WherobotsDB and Sedona compute with it at scale.


Where Wherobots Goes Beyond the v3 Spec

The v3 geo types are the foundation. The next layer is making the full pipeline seamless: spatial partitioning strategies, advanced spatial indexing beyond bounding boxes, and tighter integration between spatial predicates and query planning across engines.

These are problems we’ve been working on in Havasu for years. Now that the standard exists, that work can happen in the open, and the entire ecosystem benefits.

The Spatial AI Coding Assistant, including the Wherobots MCP Server, VS Code Extension, and CLI, lets developers and AI agents work with spatial data through natural language. A developer describes an analysis problem, and the tools find relevant datasets, generate spatial queries, and execute them on Wherobots Cloud. Native geo types in Iceberg make this work cleanly: when the type, CRS, and spatial statistics live in the format, an agent queries physical-world data the same way it queries any other column, across any engine reading Iceberg.

If you’re building your spatial data architecture on Iceberg, we’d like to talk.