In our previous blog post, we announced Apache Iceberg and Parquet’s support for spatial data types and discussed their significance. These enhancements significantly improve the economics of using geospatial data in end solutions: over time, organizations will be able to build higher-value, lower-cost products faster. Today, we take a closer look at these GEO data types in Iceberg (collectively, Iceberg GEO), exploring their design, key features, and implementation considerations. In an upcoming blog post, we will also demonstrate how to use these new Iceberg features with Apache Sedona and Wherobots.
The foundation for Iceberg GEO can be traced back to early 2022 with the launch of the GeoParquet project, which aimed to standardize spatial data exchange across different vendors in the cloud-native and big data ecosystem. This initiative was a crucial first step toward unifying spatial data formats in modern data platforms. Its history, and how GeoParquet will become just Parquet, is described in this GeoParquet community blog post.
As interest in spatial data support for lakehouses grew, early design explorations took shape through projects like GeoLake. Wherobots later expanded on these concepts, developing Havasu in 2023, a production-ready extension of Iceberg designed to support geometry, geography, and raster data in cloud data lakehouses. Havasu’s geometry encoding was inspired by GeoParquet, while its storage format remained fully Parquet-compatible, inheriting Iceberg’s key features such as ACID transactions, schema evolution, time travel, and data versioning.
Recognizing the value and need to bring native spatial support to Iceberg, the cloud native spatial community ultimately decided that integrating these features directly into Iceberg would be more beneficial than maintaining a separate extension. This direction would ensure that spatial data was well supported by Iceberg without modifications.
In early 2024, the spatial and Iceberg communities formally initiated discussions on adding GEO data types to Iceberg, using Havasu’s design as a reference. The collaboration involved many community members at Wherobots, CARTO, Planet, Apple, Databricks, Snowflake, and others. Through joint efforts and extensive discussions, the proposal was refined and successfully merged.
The Iceberg GEO proposal introduces two spatial data types: geometry and geography. This distinction addresses the varying levels of spatial data processing support across different engines. Some primarily work with geometry, while others emphasize geography. The primary distinction between these types is how edges between points are interpolated. All other aspects of the Iceberg GEO proposal, such as encoding and bounds, apply to both unless explicitly stated otherwise.
The geometry type represents spatial objects in a planar space using Cartesian geometry, assuming all calculations, including distance and area measurements, are performed on a flat surface. It is best suited for applications requiring high-precision, local-scale spatial operations, such as urban planning, country-level modeling, and traffic engineering. Geometry uses planar edge interpolation, where edges between vertices are treated as straight lines.
The geography type represents spatial objects on the ellipsoidal surface of the Earth, making it more appropriate for global-scale applications such as satellite tracking, aviation navigation, and long-distance routing. Unlike geometry, geography accounts for the Earth’s curvature, ensuring that spatial operations like distance and area calculations reflect real-world geography. It requires non-planar edge interpolation algorithms, which define how edges between points behave on a curved surface.
Since the Earth is not flat, different interpolation methods impact the accuracy of spatial operations. The community identified six primary interpolation algorithms: Planar, Spherical, Vincenty, Thomas, Andoyer, and Karney. Taking Planar and Spherical as examples: planar interpolation assumes straight-line edges in a Cartesian plane and is used in the Geometry type, while spherical interpolation models edges as geodesic curves (great-circle arcs) on a sphere and could be used in the Geography type.
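To make the difference between planar and spherical interpolation concrete, here is a minimal Python sketch that computes the midpoint of the same edge both ways. The function names are hypothetical and are not part of Iceberg or any library, and the sphere is assumed to be a perfect unit sphere (the ellipsoidal methods such as Vincenty or Karney are not shown).

```python
import math


def planar_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint of a straight edge in the lon/lat plane (planar interpolation)."""
    return ((lon1 + lon2) / 2.0, (lat1 + lat2) / 2.0)


def spherical_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint of a great-circle edge on a unit sphere (spherical interpolation).

    Assumes the two endpoints are not antipodal.
    """

    def to_xyz(lon, lat):
        # Convert degrees to a 3D unit vector.
        lon_r, lat_r = math.radians(lon), math.radians(lat)
        return (
            math.cos(lat_r) * math.cos(lon_r),
            math.cos(lat_r) * math.sin(lon_r),
            math.sin(lat_r),
        )

    ax, ay, az = to_xyz(lon1, lat1)
    bx, by, bz = to_xyz(lon2, lat2)
    # The sum of the two unit vectors points at the great-circle midpoint.
    mx, my, mz = ax + bx, ay + by, az + bz
    lon = math.degrees(math.atan2(my, mx))
    lat = math.degrees(math.atan2(mz, math.hypot(mx, my)))
    return (lon, lat)


# Same edge, two interpretations: both endpoints sit at 50 degrees north.
planar = planar_midpoint(-120.0, 50.0, -60.0, 50.0)
spherical = spherical_midpoint(-120.0, 50.0, -60.0, 50.0)
print(planar, spherical)
```

For this edge the planar midpoint stays at latitude 50, while the spherical midpoint bulges several degrees toward the pole. The same effect is why an edge under non-planar interpolation can extend beyond the latitude range of its own vertices.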
Details of all interpolation methods can be found in this paper. The Iceberg Geography type requires implementations to explicitly specify which non-planar interpolation algorithm is used, ensuring consistency in spatial computations across different engines.
Both Iceberg GEO types follow the OGC Simple Feature Access v1.2.1 data model, supporting geometric objects such as points, polygons, line strings, and geometry collections. Both use the ISO Well-Known Binary (WKB) format for encoding, which supports higher-dimensional geometries (Z and M values) but does not include a Spatial Reference Identifier (SRID). A more detailed comparison of WKB variants can be found in the GEOS library documentation.
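As an illustration of what ISO WKB looks like on the wire, the following sketch encodes and decodes a single point using only Python's standard library. The helper names are hypothetical; the layout shown (one byte-order byte, a 4-byte geometry type, then 8-byte doubles, with 1000 added to the type code for Z coordinates) follows the ISO WKB convention, and only points are handled here.

```python
import struct

WKB_POINT = 1       # ISO WKB type code for a 2D point
WKB_POINT_Z = 1001  # ISO WKB adds 1000 to the type code for Z coordinates


def encode_point_iso_wkb(x, y, z=None):
    """Encode a point as little-endian ISO WKB. Note there is no SRID field."""
    if z is None:
        return struct.pack("<BIdd", 1, WKB_POINT, x, y)
    return struct.pack("<BIddd", 1, WKB_POINT_Z, x, y, z)


def decode_point_iso_wkb(buf):
    """Decode a 2D or Z point from ISO WKB and return its coordinates."""
    # First byte is the byte order: 1 = little-endian, 0 = big-endian.
    endian = "<" if buf[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", buf, 1)
    if geom_type == WKB_POINT:
        return struct.unpack_from(endian + "dd", buf, 5)
    if geom_type == WKB_POINT_Z:
        return struct.unpack_from(endian + "ddd", buf, 5)
    raise ValueError(f"unsupported geometry type in this sketch: {geom_type}")


# Longitude first (X), then latitude (Y).
wkb = encode_point_iso_wkb(-122.33, 47.61)
print(wkb.hex())
print(decode_point_iso_wkb(wkb))
```

Because the encoded bytes carry no SRID, the coordinate reference system has to come from somewhere else, which is why it is attached to the column type rather than to each value.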
To ensure consistency across spatial tools, Iceberg GEO enforces a longitude-latitude (X, Y) coordinate order, aligning with standards used in GeoPandas, Apache Sedona, and Google Maps.
Both geometry and geography types support CRS definitions using either a SRID (e.g., srid:4326) or a PROJJSON string (e.g., projjson: {…}), which provides a self-contained CRS definition. SRIDs are storage efficient for well-known coordinate systems, while PROJJSON allows detailed CRS specifications. A minor difference between the two types is that Geometry allows any CRS, whereas Geography only allows a geographic CRS, because non-planar edge interpolation only makes sense for geographic (longitude/latitude) coordinates.
Iceberg GEO extends Iceberg’s lower and upper bounds statistics for spatial data by defining bounds based on the westernmost, easternmost, northernmost, and southernmost extents of spatial objects. In principle, these longitude and latitude bounds define the bounding box of a data file, so query predicates can be checked against it to optimize spatial filtering operations like ST_Intersects. However, certain complexities must be considered.
This bounding method is necessary for handling objects that cross the antimeridian (±180° longitude), where the lower longitude bound may be greater than the upper bound. For example, the territorial waters of Fiji span both hemispheres, with points at (179°E, 18°S), (179°W, 18°S), (179°W, 16°S), and (179°E, 16°S). A naive min/max longitude calculation would incorrectly conclude that the bounding box extends from 179°W (-179°) to 179°E (179°), nearly covering the entire globe. Instead, Iceberg GEO identifies 179°E as the westernmost point and 179°W as the easternmost, ensuring accurate query filtering and optimization. Additionally, for the non-planar edge interpolation used in the Geography type, a shape’s bounding box may not always be defined by its vertices, requiring a more precise bounding approach.
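The Fiji example can be sketched in a few lines of Python. This is one possible heuristic, not the method prescribed by the Iceberg specification: it assumes the object covers the narrowest longitude interval that contains all of its vertices, and it ignores the effect of curved edges. The function name is hypothetical.

```python
def longitude_bounds(lons):
    """Return (west, east) for the narrowest longitude interval covering lons.

    A result with west > east means the interval crosses the antimeridian.
    Assumption: the object occupies the narrowest covering interval.
    """
    unique = sorted(set(lons))
    if len(unique) == 1:
        return (unique[0], unique[0])
    best_gap = -1.0
    west, east = unique[0], unique[-1]
    for i, cur in enumerate(unique):
        nxt = unique[(i + 1) % len(unique)]
        # Eastward distance from cur to nxt, wrapping around at 360 degrees.
        gap = (nxt - cur) % 360.0
        if gap > best_gap:
            # The largest empty gap lies outside the box: the box starts
            # right after the gap (west) and ends right before it (east).
            best_gap = gap
            east, west = cur, nxt
    return (west, east)


fiji_lons = [179.0, -179.0, -179.0, 179.0]
naive = (min(fiji_lons), max(fiji_lons))  # (-179.0, 179.0): almost the whole globe
wrapped = longitude_bounds(fiji_lons)     # (179.0, -179.0): a 2-degree-wide box
print(naive, wrapped)
```

The wrapped result has a lower bound greater than its upper bound, which is exactly the signal a reader of the bounds needs in order to treat the interval as crossing ±180°.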
For geography types, longitude bounds must fall within [-180, 180], while latitude bounds must be within [-90, 90].
Iceberg GEO is natively supported by Iceberg’s table operations, allowing spatial data to be stored, modified, and queried efficiently. When defining a table schema, users can specify geometry or geography columns with optional CRS parameters.
For data manipulation, Iceberg GEO supports inserting, updating, and deleting spatial data using formats such as WKB and WKT. Queries can leverage spatial functions like ST_Intersects, ST_Contains, and ST_Distance, enabling efficient spatial filtering and analysis. Iceberg’s manifest metadata optimizes query execution by pruning unnecessary data files, significantly improving performance for large datasets.
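The pruning idea can be illustrated with a small, self-contained Python sketch. It is not Iceberg's actual planner code: the `FileBounds` class, the file names, and the bounds are all hypothetical, and real manifests store bounds in Iceberg's own metadata format. The sketch only shows how a query bounding box can be tested against per-file bounds, including bounds that wrap across the antimeridian.

```python
from dataclasses import dataclass


@dataclass
class FileBounds:
    """Hypothetical per-file spatial bounds, as a manifest might record them."""
    path: str
    west: float
    south: float
    east: float
    north: float


def lon_ranges(west, east):
    """Split a possibly wrapped longitude interval into ordinary ranges."""
    if west <= east:
        return [(west, east)]
    # west > east: the interval crosses the antimeridian.
    return [(west, 180.0), (-180.0, east)]


def may_intersect(f, q_west, q_south, q_east, q_north):
    """True if the file's bounding box could overlap the query box."""
    if f.north < q_south or f.south > q_north:
        return False
    for fw, fe in lon_ranges(f.west, f.east):
        for qw, qe in lon_ranges(q_west, q_east):
            if fw <= qe and qw <= fe:
                return True
    return False


def prune(files, query):
    """Keep only the files whose bounds may intersect the query box."""
    return [f.path for f in files if may_intersect(f, *query)]


files = [
    FileBounds("europe.parquet", -10.0, 35.0, 30.0, 60.0),
    FileBounds("fiji.parquet", 179.0, -18.0, -179.0, -16.0),
]
# Query boxes are (west, south, east, north).
print(prune(files, (2.0, 48.0, 3.0, 49.0)))
print(prune(files, (-179.5, -17.5, -179.2, -16.5)))
```

Files that fail the test never need to be opened, which is where the savings on large datasets come from; files that pass may still contain no matching rows, so the engine evaluates the exact spatial predicate afterward.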
Iceberg GEO does not define a specific behavior for Z-ordering spatial objects, but engines can implement custom solutions. A common approach is to compute spatial indices such as H3 or S2 and use them for Z-order clustering. This helps preserve spatial locality in storage, improving query performance by reducing unnecessary scans.
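As a rough sketch of the clustering idea, the Python below derives a sort key by interleaving the bits of a quantized longitude and latitude (a Morton, or Z-order, code). This is a deliberately simple stand-in for an H3 or S2 cell ID, not a use of either library, and the function names and the 16-bit resolution are assumptions made for illustration.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code


def morton_key(lon, lat, bits=16):
    """Quantize lon/lat to a grid and return a Z-order sort key."""
    max_cell = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * max_cell)
    y = int((lat + 90.0) / 180.0 * max_cell)
    return interleave_bits(x, y, bits)


# Sorting rows by the key tends to place nearby points near each other,
# so rows written in this order land in the same files more often.
cities = [
    ("Paris", 2.35, 48.86),
    ("Sydney", 151.21, -33.87),
    ("London", -0.13, 51.51),
    ("Melbourne", 144.96, -37.81),
]
ordered = [name for name, lon, lat in
           sorted(cities, key=lambda c: morton_key(c[1], c[2]))]
print(ordered)
```

A Z-order curve has known discontinuities at cell boundaries, so locality is approximate rather than guaranteed; hierarchical indices such as H3 or S2 are common choices in practice for that reason.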
Compaction in Iceberg consolidates small files to improve performance. For spatial data, compaction can leverage Z-ordering to group spatially related objects together, enhancing data locality and reducing read overhead. During compaction, Iceberg needs to recalculate the lower and upper bounds for spatial objects to maintain accurate spatial statistics, ensuring that query pruning remains effective.
Iceberg GEO is fully compatible with the Iceberg REST catalog, which stores metadata in JSON format and allows seamless integration across multiple compute engines. For catalogs like Apache Polaris, AWS Glue, and Hive Metastore, additional changes may be required to recognize the new geometry and geography types. However, since Iceberg GEO follows Iceberg’s existing metadata structures, the effort required for adaptation is minimal.
With the geo types now merged into Apache Iceberg, Wherobots will soon begin assisting customers in migrating all Havasu-Iceberg tables to native Iceberg tables. This transition will streamline spatial data management while ensuring full compatibility with the Apache Iceberg ecosystem.
Wherobots is the best compute engine for processing spatial data, makes using Iceberg very easy, and has been tuned for working efficiently with Iceberg tables. Our technologies continue to enhance cost performance and data governance, ensuring the best possible experience for spatial data workloads.
If you want to get started with our Iceberg-enabled Spatial Intelligence Cloud and begin taking advantage of the benefits of Iceberg GEO, sign up for a Wherobots pro account on the AWS Marketplace, which includes $400 in compute credits. We host regular getting-started sessions, and past sessions can be viewed on our Wherobots YouTube channel. As we mentioned upfront, expect to see additional content along with demonstrations on our blog moving forward.