Connect your AI coding assistants to the physical world with Wherobots MCP and CLI Learn More

Spatial Data in Apache Iceberg: Optimizations and Management That Matter

Authors

Spatial data in Apache Iceberg needs different optimization than tabular data. A geometry column has no natural sort order, so unsorted files carry wide, overlapping bounding boxes and query planners cannot prune them… At all… This behaviour turns a selective spatial filter into a full table scan. A second problem compounds it: one oversized geometry inflates a whole file’s bounding box, so even well-sorted tables lose pruning when large and small shapes share a file. Optimizing Iceberg for spatial data means addressing both, then tuning file size, partitioning, and snapshot retention to the realities of geometry.

Jonathan Saring recently published an excellent guide on optimizing Apache Iceberg for analytics workloads covering the four-stage query planning pipeline, file sizing, compaction sequencing, delete-file economics, and metadata lifecycle management. It’s outstanding and if you’ve not read it, start there. It’s one of the more complete discussions of the subject I’ve come across, I’m gonna build directly on that foundation for the rest of this post, so again start here.

Now, what it doesn’t cover (because it isn’t its focus) is what happens when your data has a geometry column (Iceberg got geometry support as of v3!!). Geospatial data breaks several of the assumptions that make standard Iceberg optimization work (as others for typical data science, but that’s for a future post), introduces failure paths that don’t exist in the tabular world, and requires a few decisions that have no equivalent in conventional Iceberg guidance. That’s what this post is about.


How Iceberg Plans a Query (and Why Geometry Breaks It)

To understand why geometry requires different treatment, it helps to have the query planning pipeline clearly in mind. Saring’s article lays this out well, so I’ll regurgitate and summarize the stage, then focus on where spatial data diverges.

Before your query engine reads a single Parquet byte, Iceberg executes a metadata-only pipeline:

Stage 1: Snapshot resolution. The engine contacts the catalog to locate the current metadata file and resolve the active snapshot. Constant-time, O(1) catalog calls regardless of table size. If you are using a catalog service (Glue, REST (This is what we use at Wherobots), Nessie, etc) this process is handled at that layer. If you are using Hadoop or a file system approach, this step simply evaluates the version-hint.txt file to resolve the snapshot.

Stage 2: Manifest-list pruning. The snapshot points to a manifest list. Each entry carries partition-level summary statistics—min and max values—that let the engine skip entire manifest files before reading any data. This is the highest-leverage pruning stage; production measurements show it eliminating 99%+ of partitions before any per-file work begins.

Stage 3: Per-file data skipping. For surviving manifests, the planner reads per-file column statistics—min, max, null count—for each data file. Files whose column ranges can’t overlap the query predicate are skipped. This is where sort order pays off: sorted files have tight bounds, unsorted files have wide overlapping ranges that can’t be pruned.

Stage 4: Row-group filtering. Inside each surviving Parquet file, row-group statistics and Bloom filters perform one final pass. Tight sort order within files makes row-group boundaries meaningful.

Geometry breaks stages 2 and 3. A geometry column doesn’t have a natural total ordering. There’s no inherent way to sort shapes in two-dimensional space the way you sort integers or timestamps. An unsorted file containing geometries spread across a region has a bounding box spanning that entire region. A WHERE ST_Intersects(geom, ...) predicate for any subregion cannot eliminate files or rows. Without spatial ordering, the planner touches every file. Every query becomes a full table scan at the data-file level regardless of how selective your spatial filter is.

Several options for spatial sorting are discussed below.


Spatial Sorting in Apache Iceberg: Z-Order, Hilbert, S2, and H3

Z-Order on the Geometry Column

Iceberg’s native ZORDER BY applies a Morton code (Z-order) curve by interleaving the binary representations of latitude and longitude. It’s the lowest-friction option—no schema changes and no pre-computed columns since IcebergV3 already stores the geometry in the WKB (Well Known Binary)… Just a compaction call.

CALL wherobots.system.rewrite_data_files(
  table => 'wherobots.basemap.roads',
  strategy => 'sort',
  sort_order => 'zorder(geom)'
);

Z-order’s weakness is boundary discontinuities: geometries that are physically adjacent can end up far apart in Z-order if they straddle a quadrant boundary. For continental or global datasets, this meaningfully reduces pruning effectiveness in certain regions. As Saring notes for tabular Z-order, it provides balanced multi-dimensional skipping but is less tight per axis than a pure single-column sort—the same tradeoff applies spatially. For spatial data with a dominant query window, prefer Hilbert.

Hilbert Curve on the Geometry Column

The Hilbert curve doesn’t have Z-order’s boundary problem, and is why Wherobots implements this directly with easy to use syntax). Neighboring points on the earth stay neighbors along the Hilbert index, which means per-file bounding boxes are tighter and pruning is more effective for range queries, bounding box filters, and point-in-polygon lookups.

-- Havasu / WherobotsDB spatial index syntax
CREATE SPATIAL INDEX FOR wherobots.basemap.buildings
USING hilbert(geom, 12);


The resolution parameter controls the trade-off between locality preservation and index cardinality. For national-scale datasets, resolution 10–14 covers most practical use cases. Below 10 you lose locality within cells; above 14 you hit diminishing returns.

For datasets where spatial range queries dominate, Hilbert on the geometry column is the right default—the spatial equivalent of sorting a fact table on its primary filter column.

S2 and H3 Index Columns

S2 (Google’s spherical geometry library) and H3 (Uber’s hexagonal grid) take a different approach: pre-compute a discrete cell identifier and sort or partition on that integer column. This has a meaningful advantage over sorting on geometry directly since the column statistics work perfectly on integers, so min/max pruning at stage 3 functions without any geometry-specific logic. I frequently see these tessellation used as mapping or reporting units, prefect for analytic workloads.

from sedona.sql.st_functions import ST_S2CellIDs, ST_H3CellIDs

# Add an S2 index column at level 13 (~1km cells)
df = df.withColumn("s2_index", ST_S2CellIDs("geom", 13).getItem(0))

df.sortWithinPartitions("s2_index") \
  .writeTo("wherobots.basemap.poi") \
  .tableProperty("write.distribution-mode", "range") \
  .createOrReplace()

H3 has additional advantages: uniform cell area at each resolution and a clean hierarchy, which makes it well-suited for zoom-level-aware queries and aggregation pipelines. If downstream consumers already work with H3 for analytics or ML feature extraction, sorting on H3 gives you alignment between storage layout and query patterns at no extra cost.

Practical choice: If you’re doing heavy H3 analytics, add an H3 column and sort on it. If you want zero-configuration improvement with no schema changes, Hilbert on the geometry column is a strong default. Z-order is appropriate when your spatial access patterns span multiple dimensions simultaneously—similar to the tabular case where Z-order beats single-column sort when two dimensions are equally important in predicates.


The Cost of Re-Sorting a Spatial Iceberg Table: Treat It Like Schema Design

Here’s something the general Iceberg optimization literature glosses over because it matters less for tabular data: changing your sort order is expensive, and for spatial tables it’s very expensive.

For a tabular fact table, resorting on a new column is a full data rewrite—costly, but tractable. You compact the table with the new sort order and the old layout is gone. The only real question is whether you can afford the compute time and the brief period of stale statistics.

For a spatial table, resorting is functionally equivalent to rebuilding a spatial index from scratch. Every geometry is re-evaluated, re-projected onto the new curve, and re-sorted. For a national-scale polygon dataset—say, a building footprint table with 500 million rows and non-trivial geometry complexity—a full resort can take hours and cost real money. If you chose H3 resolution 7 as your sort key and your access patterns turn out to favor resolution 5, fixing that is not a quick compaction job.

This has a practical implication: choose your spatial sort strategy the same way you choose a schema—deliberately, with input from downstream consumers, and with the expectation that changing it later is a significant operation.

Before committing, run the analysis recommended for tabular sort keys: pull representative query patterns, understand the dominant spatial filter resolution, and prototype. With Iceberg’s branch support, you can test a resort on a table branch before committing to the full rewrite:

-- Create a branch for testing
ALTER TABLE wherobots.basemap.buildings
CREATE BRANCH hilbert_test;

-- Resort a sample partition on the branch
CALL wherobots.system.rewrite_data_files(
  table => 'wherobots.basemap.buildings',
  branch => 'hilbert_test',
  where => 'h3_region = 617700169518678015',  -- single partition
  strategy => 'sort',
  sort_order => 'hilbert(geom, 12)'
);

-- Compare bytes scanned and planning time against production branch
-- before committing to the full table rewrite

A few additional realities:

Resolution changes require full rewrites. Switching from H3 resolution 7 to resolution 5 isn’t an incremental operation—every row needs a new cell ID computed and re-sorted. If you’re using an H3 column as your sort key, make sure the resolution is appropriate for your finest-grained query window before you write the first row.

Mixed sort strategies don’t compact cleanly. If you have some partitions sorted by H3 and others by Hilbert (from a strategy migration), compaction across partitions won’t produce a coherent global sort order. A full resort is the only clean path.

Compaction cadence affects sort decay. Streaming ingest writes small, unsorted files. Even after initial sorting, continuous ingest degrades the spatial locality of files over time as new micro-batch files land with arbitrary geometry ordering. Budget for regular spatial sort compaction—not just binpack compaction—on tables that receive continuous updates.

-- Schedule sort compaction for streaming-ingest spatial tables
-- more frequently than you would for tabular tables
CALL wherobots.system.rewrite_data_files(
  table => 'wherobots.analytics.gps_events',
  strategy => 'sort',
  sort_order => 'zorder(geom)',
  options => map(
    'rewrite-all', 'false',           -- only rewrite files below size target
    'target-file-size-bytes', '134217728'  -- 128 MB
  )
);

File Size for Spatial Iceberg Tables: Where Geometry Breaks Standard Guidance

The standard Iceberg file sizing guidance targets 128–512 MB per Parquet file. That guidance assumes relatively uniform row sizes. Geometry columns violate this assumption dramatically.

A table of GPS points has geometry values of roughly 21–28 bytes per row. Building footprints might average 500 bytes per polygon. A table with mixed geometry types—parcel boundaries, road centerlines, administrative areas—can have row sizes varying by three orders of magnitude. A file that “should” be 256 MB by row count might actually be 2 GB if those rows contain dense multi-polygons.

The same principle that makes small files expensive for tabular analytics—each file requires at least one GET to read its Parquet footer during planning, plus additional GETs for column chunks—applies here. But the inverse problem is also real for spatial: oversized files from complex geometry reduce parallelism and create memory pressure during spatial joins.

In practice:

Profile your geometry complexity distribution before setting compaction targets. ST_NumPoints gives you vertex counts; use percentiles to understand your tail:

sedona.sql("""
SELECT
  PERCENTILE(ST_NumPoints(geom), 0.5)  AS p50_vertices,
  PERCENTILE(ST_NumPoints(geom), 0.95) AS p95_vertices,
  PERCENTILE(ST_NumPoints(geom), 0.99) AS p99_vertices,
  AVG(ST_NumPoints(geom))              AS avg_vertices
FROM wherobots.basemap.parcels
""").show()

Reduce the byte target from the Iceberg default for tables with complex or heterogeneous geometry—128–256 MB is more appropriate than 512 MB for dense polygon datasets. Monitor actual Parquet file sizes post-write rather than relying on Spark’s write estimates, which don’t account for geometry serialization overhead.

For raster tables, tile size via RS_TileExplode is the equivalent decision: it determines the granularity at which spatial filter pushdown operates and the parallelism available during processing.


Partitioning Spatial Data in Apache Iceberg

The core partitioning principle from the general Iceberg playbook applies directly: partition by the column that appears in the WHERE clause of the majority of your queries. For spatial data, that is almost always some form of spatial or thematic bucket, e.g. a coarse S2 cell, an H3 region, a country or state code, or a UTM zone identifier. Remember it’s best to use low cardinality columns as too many partitions can have negative impacts.

# Partition a national POI table by H3 resolution 3 (~regional level)
df.withColumn("h3_region", ST_H3CellIDs("geom", 3).getItem(0)) \
  .writeTo("wherobots.analytics.poi") \
  .partitionedBy("h3_region") \
  .createOrReplace()

For data that is both spatially and temporally filtered, like satellite-derived products, observation data, and event telemetry with coordinates for example, compound partitioning on time and space is the right call:

CREATE TABLE wherobots.analytics.ndvi_observations (
  observation_date DATE,
  geom             GEOMETRY,
  ndvi             DOUBLE,
  h3_region        BIGINT
)
USING iceberg
PARTITIONED BY (months(observation_date), h3_region);

The over-partitioning trap is especially acute for spatial data. Fine-grained spatial partitioning—H3 resolution 8 instead of 3—produces enormous partition counts with very few files per partition, increasing manifest scanning overhead without proportional pruning benefit. Prefer coarse partitioning with fine-grained sort within partitions over fine-grained partitioning alone.


Why Large Geometries Break Bounding Box Pruning

This is the failure mode that breaks stage 3 pruning most severely, and the one that has no parallel in tabular data.

Iceberg stores per-column min/max statistics in each data file’s manifest entry. For a geometry column, this translates to a bounding box—the minimum and maximum X and Y coordinates across all geometries in that file. The query planner uses this to skip files whose bounding box doesn’t intersect the query’s spatial filter.

The problem: a single large geometry can inflate the bounding box of an entire file. A national boundary polygon, a major river system, a large administrative region—any of these has a bounding box spanning hundreds or thousands of kilometers. If that geometry lands in a file alongside small, local geometries, the file can no longer be skipped for almost any spatial query over the region. The planner sees a file with a bounding box covering six states and has to open it for every query touching those states, even queries looking for points in a single city.

This is not a theoretical edge case.

The Fix: Isolate Large Geometries with a BFG Partition Flag

A simple approach is to write large and small geometries to separate tables and union them at query time. Don’t. A UNION ALL across two differently-partitioned tables causes Spark to reconcile partition counts across both sides, which either triggers an expensive shuffle or produces a badly skewed stage depending on how your engine handles it. Keep everything in one table.

Instead, classify geometries by bounding box extent before writing and add an is_bfg partition column—a simple integer flag that segregates Big Freaking Geometries into their own partition. The BFG partition loses pruning effectiveness at the file level, but it’s small, fast to full-scan, and critically, it contains the bbox metadata damage so normal geometry partitions stay clean.

from sedona.sql.st_functions import ST_Area, ST_Envelope
from pyspark.sql.functions import col, lit, when

# BFG_THRESHOLD is relative to your typical query window, not an absolute value.
# For a national dataset queried at city scale, 0.01 degrees² is a reasonable start.
# Document this as a table-level constant — don't let it become a magic number.
BFG_THRESHOLD = 0.01  # degrees²

df = df.withColumn("bbox_area", ST_Area(ST_Envelope("geom"))) \
       .withColumn(
           "is_bfg",
           when(col("bbox_area") >= lit(BFG_THRESHOLD), lit(1)).otherwise(lit(0))
       )

# Sort within partitions by spatial index — apply to both partitions.
# BFG sort still helps at stage 4 row-group level even if stage 3 file pruning is weak.
df.sortWithinPartitions("is_bfg", "s2_index") \
  .writeTo("wherobots.basemap.admin") \
  .partitionedBy("is_bfg", "h3_region") \
  .createOrReplace()

At query time, nothing special is required. The query planner handles both partitions transparently—spatial predicates that eliminate the BFG partition entirely will do so via partition pruning, and queries that touch it will full-scan a small, bounded set of files.

SELECT * FROM wherobots.basemap.admin
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((...))'))

A few things to keep in mind once is_bfg is in production:
The threshold is load-bearing and should be defined relative to your typical query window. A geometry that qualifies as BFG in a city-scale dataset is completely normal in a continental one. If the threshold is wrong for a given dataset, geometries get miscategorized silently and pruning degrades without any obvious error signal. Make it a documented constant at minimum; a table property if your pipeline supports it.

The flag is sticky through compaction and updates. If geometries change over time—administrative boundaries being simplified, coastlines refined—a shape that started as BFG may no longer qualify after an edit. Budget for a periodic reclassification job that uses MERGE INTO to move geometries between partitions when their bbox area crosses the threshold in either direction.

The BFG partition still needs a spatial sort. It’s tempting to skip it since file-level pruning is already compromised there, but for global datasets where the BFG partition can grow to include every national boundary, major river system, and ocean polygon, an unsorted BFG partition becomes a meaningful full scan. Hilbert or S2 sort within the BFG partition still tightens row-group statistics at stage 4 and is worth the compaction cost.

The principle throughout: any geometry whose bounding box spans more than a few percent of your typical query window is a pruning liability—isolate it, don’t let it contaminate the metadata of the geometries around it.


Snapshot Retention: Matching Policy to Data Change Rate

Every Iceberg table operation creates a new snapshot. Every. Single. One. Just like traditional RDBMS where we periodically manage our tables, we need to do the same for our snapshots. Saring’s article recommends retaining snapshots for your audit and time-travel SLA—typically 7–30 days for analytics tables, then expiring aggressively. That’s the right default for event-driven data. But geospatial tables span a much wider range of change rates, and the right retention policy differs significantly between them.

It helps to think in three categories.

Fast-Changing Layers: Streaming and Event Data

GPS telemetry, real-time traffic observations, sensor feeds, and other continuously-updated spatial tables behave like any high-ingest tabular table. Default Iceberg retention (7–30 days) is appropriate. The priority here is aggressive snapshot expiration to prevent manifest list bloat and unblock garbage collection of old data files.

-- GPS telemetry: expire quickly, keep only operational window
ALTER TABLE wherobots.analytics.gps_events
SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000',    -- 7 days
  'history.expire.min-snapshots-to-keep' = '5'
);

The compaction concern for fast-changing spatial tables is sort decay. Streaming ingest writes small, unsorted files. Even if the table was initially sorted by Hilbert or H3, continuous micro-batch appends degrade spatial locality over time. Fast-changing spatial tables need more frequent sort compaction than their tabular equivalents—not just binpack consolidation.

Moderate-Cadence Layers: POI, Addresses, Administrative Units

Point-of-interest datasets, address databases, and delivery zone boundaries typically update weekly to monthly. These warrant a medium retention window—enough to support rollback and short-horizon time travel, but not indefinite.

-- Weekly-updated POI layer: 60-day retention
ALTER TABLE wherobots.analytics.poi
SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '5184000000',   -- 60 days
  'history.expire.min-snapshots-to-keep' = '12'          -- ~3 months of weekly updates
);

Slow-Changing Base Layers: Road Networks, Land Cover, Imagery Products

Road networks update quarterly. Administrative boundaries update annually, sometimes less. Land cover classifications might be refreshed once per product release cycle. And their historical versions remain analytically relevant far longer than their update frequency suggests. A transportation analyst querying 2022 road data against 2024 trip records needs the 2022 snapshot to still exist. Default expiration will quietly delete it.

-- Road network updated quarterly: retain at least 2 years
ALTER TABLE wherobots.basemap.road_network
SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '63072000000',  -- 2 years
  'history.expire.min-snapshots-to-keep' = '8'           -- ~2 years of quarterly updates
);

For base layers, named snapshot tags are essential. Age-based expiration and named references are independent—a tagged snapshot won’t be deleted by expiration policy. Tag releases at the time of publication:

ALTER TABLE wherobots.basemap.road_network
CREATE TAG `2024-Q4-release`
AS OF VERSION 42;

This lets analysts and pipelines refer to a specific basemap vintage by name rather than timestamp, without depending on a snapshot being young enough to survive the expiration window.

Decouple compaction from expiration, and sequence them correctly. As Saring’s maintenance ordering makes clear: expire snapshots first, remove orphan files, then compact, then rewrite manifests, then compute statistics. Compacting before expiration wastes compute on files about to be garbage-collected. For slow-changing base layers, this sequence is especially important because the update windows are so infrequent—a mis-sequenced maintenance run on a quarterly table has a three-month blast radius.

Consider separating canonical from delta for frequently-edited layers. If your basemap has an infrequently-updated authoritative version plus a more rapidly-updated change layer (new construction, edit accumulation, QA outputs), store them in separate tables with separate retention policies. The canonical base gets long retention; the delta layer gets standard retention. Queries union at read time. This approach also alleviates the large geometry problem: large administrative geometries in the canonical base are isolated from point and line edits in the delta.


Apache Iceberg Spatial Data: Configuration Reference

ConcernRecommendationNotes
Sort orderHilbert on geom, or H3/S2 integer columnH3/S2 if downstream analytics already uses those indexes; Z-order for multi-dimensional access patterns
ResortingTreat like schema redesign; prototype on a branch firstResolution changes require full rewrites; budget accordingly
File size targetProfile geometry complexity; reduce byte target for complex polygons128–256 MB for dense polygon datasets; monitor actual Parquet sizes post-write
PartitioningCoarse spatial bucket (H3 r3–r4, state, UTM zone) + temporal if applicableDon’t over-partition; coarse partition + fine sort outperforms fine-grained partitioning alone
Large geometriesSeparate file or separate table; never mix with small geometriesThreshold is relative to your typical query window
Retention: fast-changing7–30 days; expire aggressively; prioritize sort compaction frequencySort decay from streaming ingest is the primary spatial concern
Retention: moderate60–90 days; keep enough for rollback and short time-travelWeekly-to-monthly update cadence
Retention: slow-changing1–2 years; use named tags for versioned releasesDecouple compaction and expiration schedules per table
Maintenance orderExpire → orphan cleanup → compact → rewrite manifests → compute statsSame sequence as tabular; cadences differ sharply by layer type

Spatial Layout Sets the Performance Ceiling

The performance ceiling of a spatial Iceberg lake is not the query engine—it’s the physical state of the tables. That’s true for tabular data too, but spatial adds a dimension that makes the penalties for getting it wrong steeper and less visible. A poorly-sorted tabular table is slow. A spatial table with unsorted geometry and mixed large geometries produces full-scan behavior even for highly selective spatial filters, and does it silently—the query doesn’t fail, it just reads everything.

Choose your spatial sort strategy deliberately, because resorting at scale is expensive. Profile geometry complexity before setting file size targets. Isolate large geometries before they poison your pruning metadata. And set snapshot retention per table type—because the data that barely changes is often the data someone needs most when querying the past.


This post builds on Jonathan Saring’s Iceberg Lake for Analytics Data: A Guide, published in ITNEXT. The four-stage planning pipeline, file sizing principles, and maintenance sequencing described there are the foundation for the spatial-specific extensions covered here. Spatial implementations use Wherobots’ Havasu spatial lakehouse, built on Apache Iceberg and Apache Sedona.

Try Wherobots for Free