Streaming Spatial Data into Wherobots with Spark Structured Streaming

Posted on March 18, 2026 (updated March 19, 2026) by Daniel Smith

Real-time Spatial Pipelines Shouldn't Be This Hard (But They Were)

I've been doing geospatial work for over twenty years now. I've hand-rolled ETL pipelines, babysat cron jobs, and debugged more coordinate system mismatches than a person should reasonably endure in one lifetime. So when someone says "streaming spatial data," my first reaction used to be something between a deep sigh and a nervous laugh.

Here's the thing: streaming tabular data with Spark Structured Streaming is a well-trodden path. There are tutorials everywhere. But streaming spatial data, where you need to create geometry objects, apply spatial business rules, and land the results in a format that actually understands what a Point is (pun definitely intended); that's where most guides quietly end.

This post walks through a complete, working pipeline that takes raw telemetry data (fleet tracking, asset monitoring, IoT sensors, etc.) and streams it into a Wherobots-managed Iceberg table with full Sedona geometry support. No hand-waving. Real code. The kind of thing you can run on Wherobots Cloud this afternoon.

What We're Building

The architecture is intentionally straightforward. Four components, one direction, no backflips required:

- Data Source — Parquet files landing in an S3 bucket. Each file contains asset tracking events: lat/lon/altitude, speed, heading, operator, asset ID, and status. Think of this as the output from any fleet management system, IoT gateway, or sensor network.
- Streaming Ingest Job — A Spark Structured Streaming job running on Wherobots that watches the S3 path, picks up new Parquet files as they arrive, converts raw coordinates into Sedona geometry objects, applies business rules (like a speeding flag), and writes each micro-batch to an Iceberg table in the Wherobots catalog.
- Viewer Notebook — A Wherobots notebook that reads the catalog table and gives you stats, batch-level detail, a bar chart, and an interactive map colored by speeding status. The "so what" layer.
- Downstream Application Layer — Once the data is prepared and ready for consumption, it can be fed into a downstream application or database that serves the data to other consumers: a map, or another data processing system.

The generator that produces the Parquet files is a separate concern; any system that drops Parquet into S3 works. We're focused on the Wherobots side: how you consume, transform, and catalog streaming spatial data.

The Source Schema

The source data uses a flat schema — no geometry column yet. That's deliberate. Most real-world telemetry arrives as plain numbers (latitude, longitude, altitude), not as WKB or WKT geometry strings. The spatial enrichment happens during ingest.

| Column | Type | Description |
| --- | --- | --- |
| event_id | STRING | Unique event identifier |
| timestamp | TIMESTAMP | Event time |
| x | DOUBLE | Longitude |
| y | DOUBLE | Latitude |
| z | DOUBLE | Altitude (meters) |
| speed_mph | DOUBLE | Speed in miles per hour |
| heading | DOUBLE | Compass heading (degrees) |
| operator_name | STRING | Fleet operator |
| asset_id | STRING | Individual asset identifier |
| asset_type | STRING | Asset category (e.g., truck, drone) |
| status | STRING | Operational status |

Eleven columns. No geometry. That's what Sedona is for.

Setting Up the Streaming Ingest

SedonaContext Initialization

The first thing the ingest job does is create a SedonaContext. On Wherobots Cloud, Spark is already pre-initialized — you just need to wrap it with Sedona capabilities:

```python
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
```

Ensuring the Catalog Schema and Table Exist

Before the stream starts, we make sure the target schema and table exist.
This is idempotent — safe to run every time the job starts:

```python
sedona.sql("CREATE SCHEMA IF NOT EXISTS wherobots.streaming_demo")

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.streaming_demo.asset_tracks (
    event_id      STRING,
    timestamp     TIMESTAMP,
    x             DOUBLE,
    y             DOUBLE,
    z             DOUBLE,
    speed_mph     DOUBLE,
    heading       DOUBLE,
    operator_name STRING,
    asset_id      STRING,
    asset_type    STRING,
    status        STRING,
    geometry      GEOMETRY,
    is_speeding   BOOLEAN,
    source_file   STRING,
    batch_id      LONG,
    job_run_id    STRING
)
""")
```

One thing to call out here: the table has more columns than the source. The source Parquet has 11 fields; the table has 16. The extra five (geometry, is_speeding, source_file, batch_id, job_run_id) are all derived during ingest. This is the whole point: we're enriching raw telemetry into spatially aware, business-rule-tagged, lineage-tracked records.

The Spatial Transform

This is where the real work happens, and honestly it's surprisingly compact. Here's the transform that runs on every micro-batch:

```python
transformed = raw.selectExpr(
    "event_id", "timestamp", "x", "y", "z",
    "speed_mph", "heading", "operator_name",
    "asset_id", "asset_type", "status",
    "ST_SetSRID(ST_PointZM(x, y, z, unix_timestamp(timestamp)), 4326) AS geometry",
    "CASE WHEN speed_mph > 65 THEN true ELSE false END AS is_speeding",
    "input_file_name() AS source_file",
)
```

Three derived columns in one selectExpr. Let's break them down.

Creating Geometry with ST_PointZM

```python
"ST_SetSRID(ST_PointZM(x, y, z, unix_timestamp(timestamp)), 4326) AS geometry"
```

ST_PointZM creates a 4D point geometry:

- X = longitude
- Y = latitude
- Z = altitude in meters
- M = the "measure" dimension — here we encode the unix timestamp

This is a pattern I've come to appreciate. By encoding time as the M value, each point carries its full spatiotemporal context in the geometry itself. If you later need to compute distances or trajectories, the temporal ordering is baked right into the geometry.
We wrap the whole thing in ST_SetSRID(..., 4326) to declare it's in WGS 84.

Business Rule: The Speeding Flag

```python
"CASE WHEN speed_mph > 65 THEN true ELSE false END AS is_speeding"
```

Nothing fancy. Speed over 65 mph? You're speeding. This is computed once, during ingest, and stored in the catalog. Downstream consumers (dashboards, notebooks, APIs) read it directly — no need to recompute. This is the "silver layer" principle applied to streaming: apply your business rules at ingest time so the data is analysis-ready when it lands.

Business Rule: Geofence Violation Detection

Here's where things get spatial. The is_speeding flag is a simple column-level rule — no external data needed. But what about spatial business rules that depend on other geometry? Things like "is this asset inside a school zone buffer?" or "did this truck enter a restricted area?"

That's where the in_fence column comes in. During each micro-batch, we left-join the asset points against a table of geofence buffer polygons using ST_Intersects. If a point intersects any buffer, in_fence = true.

Why This Has to Happen Inside foreachBatch

Your first instinct might be to add the geofence check in the selectExpr alongside the spatial transform. I tried that. Spark won't let you join a streaming DataFrame against a catalog table directly — the streaming DF is "unresolved" and the join planner doesn't know what to do with it. The foreachBatch callback is the escape hatch: by the time your function is called, the DataFrame is fully materialized (static), so you can join it against anything.

You might also try createOrReplaceTempView so you can write the join as SQL. That works locally, but on Wherobots Cloud it fails with TABLE_OR_VIEW_NOT_FOUND because the internal DROP that precedes the CREATE hits the catalog resolver. The DataFrame API join avoids this entirely.
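To make the M-value pattern concrete, here's a minimal pure-Python sketch (no Spark required) of what the transform computes for a single row. The names enrich_row and SPEED_LIMIT_MPH are illustrative, not part of the actual job; the tuple stands in for the Sedona geometry object.

```python
from datetime import datetime, timezone

SPEED_LIMIT_MPH = 65  # hypothetical constant mirroring the 65 mph rule above

def enrich_row(x, y, z, event_time, speed_mph):
    """Build the 4D (X, Y, Z, M) coordinate and the speeding flag for one row.

    M carries the unix timestamp, so temporal ordering is recoverable
    from the geometry alone -- the same idea as ST_PointZM above.
    """
    m = int(event_time.replace(tzinfo=timezone.utc).timestamp())
    geometry = (x, y, z, m)          # stand-in for the Sedona point geometry
    is_speeding = speed_mph > SPEED_LIMIT_MPH
    return geometry, is_speeding

geom, speeding = enrich_row(-93.6, 41.6, 290.0,
                            datetime(2026, 3, 18, 12, 0, 0), 72.5)
```

Sorting a trajectory's points by the fourth tuple element recovers the event order, which is exactly what the M dimension buys you.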
Loading the Geofence Table: Broadcast + Cache

The geofence table is small (typically dozens to a few hundred buffer polygons) and it rarely changes while the stream is running. Re-reading it from the catalog on every micro-batch would be wasteful. Instead, we load it once at startup and cache it so Spark doesn't re-evaluate it:

```python
geofence_df = (
    spark.table(GEOFENCE_TABLE)
    .select(F.col("geometry").alias("fence_geom"))
    .cache()
)
geofence_df.count()  # materialize the cache
```

At the join site, F.broadcast() tells Spark to use a broadcast hash join instead of a shuffle join: the small geofence DataFrame gets shipped to every executor and held in memory, so the join against each micro-batch is local and fast. The .cache() ensures the table read only happens once, and the .count() call forces materialization so the cache is warm before the first batch arrives.

One tradeoff: if you update the geofence table while the stream is running, the stream won't see the changes until you restart. For a table that changes rarely (new school zones don't appear every 30 seconds), that's a fine deal.

The Spatial Join Pattern

```python
# geofence_df is the cached DataFrame loaded at startup.
# F.broadcast() here — at the join site — so Spark sees it as
# part of the join plan and uses a broadcast hash join (no shuffle).
df = (
    df.alias("a")
    .join(
        F.broadcast(geofence_df).alias("b"),
        F.expr("ST_Intersects(a.geometry, b.fence_geom)"),
        "left",
    )
    .select("a.*", F.expr("b.fence_geom IS NOT NULL AS in_fence"))
)
```

A few things to note:

- Alias the geometry column on the geofence side to fence_geom. Both tables have a geometry column; without the alias, Spark can't resolve the join condition.
- Left join ensures every asset point survives, even if it doesn't intersect any fence. Non-matching points get in_fence = false.
- The IS NOT NULL check on fence_geom is what converts the join result into a boolean flag.
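Conceptually, the left join plus the IS NOT NULL check reduces to "tag each point with whether it falls inside any fence, keeping every point." Here's a minimal pure-Python sketch of that semantics using a ray-casting point-in-polygon test and made-up coordinates; it's an illustration of the pattern, not how Sedona evaluates ST_Intersects.

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Edge crosses the horizontal ray from (px, py)?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def tag_in_fence(points, fences):
    """Left-join semantics: every point survives; in_fence is True only
    when the point falls inside at least one fence polygon."""
    return [
        {**p, "in_fence": any(point_in_polygon(p["x"], p["y"], f) for f in fences)}
        for p in points
    ]

# Illustrative data: one square geofence centered on the origin.
fence = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
points = [{"x": 0.0, "y": 0.0}, {"x": 5.0, "y": 5.0}]
tagged = tag_in_fence(points, [fence])
```

Note that the point outside the fence is not dropped; it simply carries in_fence = False, mirroring the left join above.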
File-Level Lineage with input_file_name()

```python
"input_file_name() AS source_file"
```

input_file_name() is a Spark SQL function that resolves to the path of the source file for each row. It has to be called in the selectExpr during the read phase — before foreachBatch materializes the DataFrame — because the file metadata is lost after materialization.

This gives you row-level lineage back to the exact Parquet file that produced each record. Useful for debugging, auditing, and answering the question "where did this data come from?"

The foreachBatch Writer

The write side uses Spark's foreachBatch pattern, which gives you a regular DataFrame and a batch ID for each micro-batch:

```python
def write_batch(df, batch_id):
    count = df.count()
    if count > 0:
        df = (
            df.withColumn("batch_id", F.lit(batch_id).cast("long"))
              .withColumn("job_run_id", F.lit(JOB_RUN_ID))
        )
        df.writeTo(CATALOG_TABLE).append()
```

Adding Batch and Job Run Lineage

Two columns are added inside foreachBatch because they can't be computed earlier:

- batch_id — Spark's micro-batch identifier. Resets to 0 on each job run.
- job_run_id — Pulled from the Wherobots environment variable WBC_LABEL_product_instance_id. This is the unique ID for each Wherobots Job Run, which means you can trace every record back to the specific job execution that produced it.

Together, batch_id + job_run_id give you globally unique batch identification across job runs. Since batch_id resets to 0 on every restart, job_run_id is what makes cross-run analysis possible.

The Wherobots-Idiomatic Write Pattern

```python
df.writeTo(CATALOG_TABLE).append()
```

Not df.write.format("iceberg").mode("append").save(path). The .writeTo().append() pattern uses the catalog-managed table reference, which means Wherobots handles the underlying storage location, metadata, and Iceberg commit lifecycle. It's cleaner, and it's how catalog-native writes are meant to work.
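A quick sketch of why the pair matters: composing the two IDs yields a key that stays unique across restarts, even though each run starts counting batches from zero again. global_batch_key is a hypothetical helper for illustration, not part of the job above.

```python
import os

def global_batch_key(batch_id, job_run_id=None):
    """Combine the per-run batch_id (which resets to 0 on restart) with the
    job-run ID to form a globally unique batch identifier."""
    run = job_run_id or os.environ.get("WBC_LABEL_product_instance_id", "local-dev")
    return f"{run}:{batch_id}"

# Two different job runs can both emit a batch 0; the composite keys differ.
key_a = global_batch_key(0, job_run_id="run-a")
key_b = global_batch_key(0, job_run_id="run-b")
```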
Wiring It All Together

The streaming query connects the read, transform, and write stages:

```python
query = (
    transformed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", CHECKPOINT_PATH)
    .trigger(processingTime="30 seconds")
    .start()
)
```

The checkpoint location (an S3 path) tracks which files have already been processed. If the job restarts, it picks up exactly where it left off — no duplicate processing, no missed files. The trigger interval of 30 seconds means Spark checks for new files twice a minute.

Viewing the Results

Once data starts landing in the catalog, a Wherobots notebook can read it directly:

```python
df = sedona.table("wherobots.streaming_demo.asset_tracks")
```

That's it. No path management, no format specification, no schema declaration. The catalog knows what the table is, where it lives, and what the schema looks like.

Ingest Stats and Lineage

The viewer notebook computes summary statistics including lineage metrics:

```python
stats = df.agg(
    F.count("*").alias("total_records"),
    F.countDistinct("asset_id").alias("unique_assets"),
    F.countDistinct("batch_id").alias("unique_batches"),
    F.countDistinct("job_run_id").alias("unique_job_runs"),
    F.countDistinct("source_file").alias("unique_source_files"),
    F.min("timestamp").alias("earliest_event"),
    F.max("timestamp").alias("latest_event"),
    F.round(F.avg("speed_mph"), 1).alias("avg_speed_mph"),
)
```

Batch-Level Detail

Because we captured batch_id, job_run_id, and source_file during ingest, we can inspect exactly what landed in each batch:

```python
df.groupBy("job_run_id", "batch_id").agg(
    F.count("*").alias("records"),
    F.countDistinct("source_file").alias("files_in_batch"),
    F.min("timestamp").alias("first_event"),
    F.max("timestamp").alias("last_event"),
).orderBy("job_run_id", "batch_id").show(50, truncate=False)
```

Records per Batch — Bar Chart

A quick matplotlib chart shows the volume of each micro-batch, which is useful for understanding whether your trigger interval and maxFilesPerTrigger settings are well-tuned:

```python
import matplotlib.pyplot as plt

batch_pdf = (
    df.groupBy("batch_id")
    .agg(F.count("*").alias("records"))
    .orderBy("batch_id")
    .toPandas()
)

fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(batch_pdf["batch_id"], batch_pdf["records"], color="steelblue")
ax.set_xlabel("Batch ID")
ax.set_ylabel("Records")
ax.set_title("Records per Batch")
plt.tight_layout()
plt.show()
```

The Map: PyDeck with Speeding Visualization

The payoff. A PyDeck map over Iowa with CARTO's dark-matter basemap, two stacked layers, and a four-color scheme that encodes both is_speeding and in_fence at a glance:

- Blue — normal: not speeding, not in a geofence
- Red — speeding only
- Orange — geofence violation only
- Magenta — speeding and in a geofence

Points that hit a geofence get a larger radius (24px, or 30px if also speeding) so they pop out of the sea of blue dots.

```python
def assign_color(row):
    if row["is_speeding"] and row["in_fence"]:
        return (200, 30, 200)   # magenta — both violations
    elif row["in_fence"]:
        return (255, 140, 0)    # orange — geofence only
    elif row["is_speeding"]:
        return (220, 50, 50)    # red — speeding only
    else:
        return (30, 100, 220)   # blue — normal

sample_pdf["radius"] = sample_pdf.apply(
    lambda r: 30 if (r["is_speeding"] and r["in_fence"])
    else 24 if r["in_fence"]
    else 12,
    axis=1,
)
```

The map itself uses two layers stacked together. Underneath, a PolygonLayer renders the geofence buffer polygons as semi-transparent orange regions, so you can see the zones that triggered violations.
On top, the ScatterplotLayer plots every asset point with the four-color scheme:

```python
import pydeck as pdk

geofence_layer = pdk.Layer(
    "PolygonLayer",
    data=geofence_pdf,
    get_polygon="coordinates",
    get_fill_color=[255, 140, 0, 40],
    get_line_color=[255, 140, 0, 160],
    get_line_width=2,
    pickable=True,
)

points_layer = pdk.Layer(
    "ScatterplotLayer",
    data=sample_pdf,
    get_position=["x", "y"],
    get_fill_color=["color_r", "color_g", "color_b", 180],
    get_radius="radius",
    pickable=True,
    auto_highlight=True,
)

deck = pdk.Deck(
    layers=[geofence_layer, points_layer],
    initial_view_state=pdk.ViewState(latitude=41.9, longitude=-93.4, zoom=6.5),
    tooltip={
        "text": "{asset_id} ({asset_type})\n{operator_name}\nSpeed: {speed_mph} mph\n"
                "Speeding: {is_speeding}\nIn Fence: {in_fence}\nBatch: {batch_id}"
    },
    map_style="https://basemaps.cartocdn.com/gl/dark-matter-gl-style/style.json",
)
deck.show()
```

The sample is capped at 50,000 points for rendering performance. And because both is_speeding and in_fence are already in the catalog, we're just reading them — not recomputing anything in the notebook. That's the whole point of doing the enrichment at ingest time.

Why This Pattern Matters

Let me step back and talk about why this approach is worth your time, beyond the "hey, cool map" factor.

Spatial Enrichment at Ingest, Not at Query Time

In a lot of organizations, raw coordinates sit in tables as plain doubles, and every analyst who needs geometry has to create it themselves. That means everyone is writing their own ST_Point calls, hoping they got the coordinate order right, and probably not setting the SRID. Baking the geometry creation into the ingest pipeline means it's done once, correctly, and everyone downstream gets a proper GEOMETRY column with SRID 4326.

Business Rules as First-Class Columns

The is_speeding flag is one example, but we went further.
The in_fence column demonstrates the same principle with a spatial join: during each micro-batch, every point is checked against a table of geofence buffer polygons using ST_Intersects. Speed violations, geofence containment, and proximity to restricted areas are all computed during ingest using Sedona on Wherobots and stored as columns in the Iceberg table. Downstream consumers don't need to know the logic. They just filter on a boolean.

Lineage Without Extra Infrastructure

source_file, batch_id, and job_run_id give you three levels of traceability — file, batch, and job run — without deploying a separate lineage system. When someone asks "where did this record come from?", you can answer with a SQL query against the same table.

Iceberg: Geometry as a Native Type

This is the part that would have made 2010-me cry tears of joy. The GEOMETRY column in an Iceberg table is a native type (as of Iceberg v3), not a serialized blob or a WKT string. Spatial predicates can push down to the storage layer. You can query the table with ST_Contains, ST_DWithin, ST_Intersects — the full Sedona function catalog — and the Iceberg metadata helps prune partitions that can't possibly match. That's not something you get from storing WKT in a STRING column in a regular Parquet table.

Catalog-Managed Tables, Not Path-Managed Files

sedona.table("wherobots.streaming_demo.asset_tracks") is all a downstream consumer needs. No S3 paths, no Parquet glob patterns, no format hints. The Wherobots catalog manages the table lifecycle: schema evolution, ACID transactions, snapshot isolation, time travel. If you've ever debugged a pipeline that broke because someone renamed an S3 prefix, you understand why this matters.

Key Takeaways

- Spark Structured Streaming works natively on Wherobots. You can use readStream / writeStream with foreachBatch just like you would in any Spark environment — but with Sedona spatial functions and Iceberg tables available out of the box.
- Convert coordinates to geometry at ingest time. Use ST_PointZM (or ST_Point, ST_PointZ depending on your data) with ST_SetSRID to create proper geometry objects. Don't make every downstream consumer do this.
- Apply business rules during ingest. Flags like is_speeding, geofence containment, proximity alerts — compute them once and store them as columns. The silver layer principle applied to streaming.
- Capture lineage in the stream. input_file_name() for file-level tracing, batch_id from foreachBatch, and WBC_LABEL_product_instance_id for job-run-level tracing. Three columns, zero additional infrastructure.
- Write with .writeTo().append(), not .write.format().save(). Use the catalog-managed write pattern. Let Wherobots handle the storage details.

About the Author

Daniel Smith is a Solution Architect at Wherobots with over two decades of experience in geospatial technology. He builds demos, helps customers, breaks things, fixes them, and writes about what he learns from time to time. When he's not wrestling with coordinate reference systems or touching things he shouldn't, he's probably skating or convincing his kids that geography is, in fact, the coolest subject. You can find him on LinkedIn.

Ready to stream spatial data into your own Wherobots catalog? Get Started with Wherobots
Raster Processing at Scale: The Out-of-Database Architecture Behind WherobotsDB

Posted on March 10, 2026 by Pranav Toggi

Introduction

Raster data (satellite imagery, elevation models, sensor grids) is critical to understanding the physical world and, increasingly, to powering AI. The challenge most data teams face is processing it at scale. Processing raster data at scale requires an architecture that avoids loading entire files into memory. WherobotsDB solves this with an out-of-database approach that fetches pixel data on demand, enabling terabyte-scale processing without the memory overhead of traditional raster engines.

WherobotsDB extends open-source Apache Sedona with capabilities and performance optimizations purpose-built for preparing physical-world data for AI at scale, while maintaining full API compatibility. Existing Sedona workloads run without code changes. This post covers how WherobotsDB handles the full raster lifecycle: scalable processing architecture, raster math, coordinate reference systems, vector-raster hybrid workflows, and planetary-scale inference.

"With Wherobots, we were able to merge 15+ complex vector datasets in minutes and run high-resolution ML inference on raster imagery at a fraction of the cost of our legacy stack. The combination of speed, scalability, and ease of integration has boosted our engineering productivity and will accelerate how quickly we can deliver new geospatial data products to market." — Rashmit Singh, CTO, SatSure

What Is Out-of-Database Raster Architecture?

At the foundation of Wherobots raster capabilities is an out-of-database raster architecture – which makes it far easier to process raster imagery in an embarrassingly parallel fashion. Instead of loading entire raster files into memory, only metadata is stored and pixel data is fetched on-demand.
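As a conceptual sketch only (the class and method names below are invented for illustration, not WherobotsDB internals), an out-of-DB raster can be modeled as metadata plus a lazy pixel fetcher that materializes just the window an operation asks for:

```python
class OutDbRaster:
    """Toy model of an out-of-database raster: holds only metadata; pixel
    windows are fetched lazily, and fetches are counted so the I/O is visible."""

    def __init__(self, url, width, height, fetch_window):
        self.url = url
        self.width = width
        self.height = height
        self._fetch_window = fetch_window  # in practice: an HTTP range read
        self.windows_fetched = 0

    def read_window(self, x0, y0, x1, y1):
        """Materialize only the requested window, never the whole raster."""
        self.windows_fetched += 1
        return self._fetch_window(x0, y0, x1, y1)

# Fake remote source: pixel value = x + y, generated on demand.
raster = OutDbRaster(
    "s3://bucket/scene.tif", 10_000, 10_000,
    lambda x0, y0, x1, y1: [[x + y for x in range(x0, x1)] for y in range(y0, y1)],
)
window = raster.read_window(0, 0, 2, 2)  # a 2x2 window, not 10,000 x 10,000 pixels
```

The point of the sketch: constructing the raster object costs almost nothing, and only the operation that touches pixels triggers a (small, window-sized) read.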
This means teams can process terabyte-scale imagery collections — statewide mosaics, multi-year satellite archives, continental elevation models — with the same interface they use for vector data. Operations like zonal statistics, clipping, masking, filtering, and raster algebra scale to datasets that would overwhelm in-memory approaches.

| Capability | Apache Sedona | WherobotsDB | Notes |
| --- | --- | --- | --- |
| Out-DB Raster Support | ◔ | ● | Creates lightweight raster references; pixel data loaded only when needed |
| Intelligent Caching Layer | ○ | ● | Minimizes repeated remote reads for frequently-accessed rasters |
| Optimized Shuffle Operations | ○ | ● | Data movement handles only metadata — orders of magnitude faster than full rasters |
| On-Demand Materialization | ○ | ● | Selectively convert external rasters to in-database format when needed |
| Automatic Metadata Optimization | ○ | ● | Pre-loads metadata and intelligently repartitions for optimal parallelism |
| Cloud-Optimized GeoTIFF Support | ◐ | ● | Native COG support with tile-based partial reads from cloud storage |

How Out-DB Architecture Transforms Raster Operations

The out-of-database architecture fundamentally changes how raster operations execute:

| Operation | Apache Sedona (In-DB Only) | WherobotsDB |
| --- | --- | --- |
| Data Movement (Shuffle) | Serializes all pixel data across executors | Serializes only metadata (~KB vs GB) |
| Tiling Operations | Copies pixel data into each tile | Memory-efficient tiling without data duplication |
| Clipping | Full pixel-by-pixel processing | Optimized processing paths for common operations |
| Zonal Statistics | Processes entire raster regardless of region size | Materializes only zonal pixels, optimizing I/O based on region of interest |
| Raster Loading | Loads entire raster into memory at read time | Lazy loading: metadata on-demand, pixels only when accessed |
| Resource Management | Standard memory lifecycle | Intelligent caching layer with disk caching for remote files |

Key benefits:

- On-Demand Data Access: Instead of loading entire raster files into memory, WherobotsDB fetches pixel data only when an operation requires it, reducing memory overhead and
enabling processing at terabyte scale.
- Memory-Efficient Tiling: Tiles share references to the underlying file with different spatial bounds — enabling massive parallelism without memory overhead.
- Smart I/O Reduction: Operations optimize I/O based on the region of interest, and spatial filter push-down skips irrelevant raster files entirely.

Cloud-Optimized GeoTIFFs (COGs) are GeoTIFF files structured so that only the specific byte ranges needed for a given operation are fetched from remote storage, rather than downloading the entire file. Combined with on-demand loading, this architecture minimizes both memory footprint and network I/O.

What Raster Capabilities Does WherobotsDB Include?

Building on this foundation, WherobotsDB includes enhanced raster functions that enable satellite imagery, elevation models, and sensor data analysis directly in SQL alongside traditional vector operations.

| Capability | Apache Sedona | WherobotsDB | Notes |
| --- | --- | --- | --- |
| Raster to Vector Conversion | ○ | ● | Convert raster regions to vector polygons for hybrid vector-raster analysis workflows |
| Multi-Band Tile Processing | ○ | ● | Align, stack, and tile rasters from different sources, CRS, and resolutions for distributed multi-source analysis |
| Zonal Statistics | ◕ | ● | Both support the full statistics suite; WherobotsDB's Out-DB architecture materializes only zonal pixels, enabling scalability across millions of zones |
| Custom Raster Algebra | ◕ | ● | Flexible map algebra expressions with near-native execution performance |
| Spatial Filter Push-down for Rasters | ○ | ● | Uses bounding boxes to skip irrelevant raster files, dramatically reducing I/O for selective queries |

These capabilities fall into two categories:

- Transforming raster data for hybrid workflows – Raster to Vector Conversion, Multi-Band Tile Processing.
- Analyzing raster data in place – Zonal Statistics, Custom Raster Algebra, Spatial Filter Push-down.

Raster to Vector Conversion converts contiguous raster regions with the same pixel value into vector polygons.
Essential for workflows that need to analyze raster-derived features (flood extents, land cover classifications, building footprints from DSM) using vector spatial operations like overlay, buffering, or spatial joins.

Multi-Band Tile Processing solves one of the most common friction points in raster analysis: combining data from different sources. Satellite imagery from different sensors, time periods, or providers typically arrives in different coordinate reference systems, resolutions, and data types. WherobotsDB aligns and stacks these into a unified multi-band raster, then tiles it into spatial chunks for distributed processing — all in a single operation. This enables workflows like change detection across multi-temporal composites, fusing Sentinel-2 optical bands with elevation data, or building analysis-ready multi-spectral stacks, without manual reprojection or resampling steps.

Zonal Statistics computes aggregate statistics (count, sum, mean, median, mode, stddev, variance, min, max) for raster pixels falling within vector zones. Both Apache Sedona and WherobotsDB support zonal statistics — the differentiator is scale. WherobotsDB's out-of-database architecture materializes only the pixels within each zone rather than processing the entire raster, making it practical to run zonal statistics across millions of zones on terabyte-scale imagery.

Custom Raster Algebra executes user-defined raster algebra expressions with near-native execution performance. It supports complex multi-band calculations, conditional logic, and neighborhood operations — enabling workflows like computing NDVI (Normalized Difference Vegetation Index, a measure of vegetation density derived from red and near-infrared bands) from satellite imagery, or applying threshold-based classification across large imagery collections.

Spatial Filter Push-down for Rasters uses bounding boxes to skip irrelevant raster files, dramatically reducing I/O for selective queries.
When a catalog contains thousands of scenes but only a few intersect your area of interest, irrelevant files are eliminated before any processing begins.

Because all of these functions are built on the out-of-database architecture, they inherit the same scalability characteristics described above — lazy loading, selective pixel materialization, and intelligent caching, without additional configuration.

What Is RasterFlow and How Does It Work?

Recently, Wherobots added an entirely new inference and perception engine for planetary-scale image processing, extending the raster lifecycle beyond analysis into AI. RasterFlow enables teams to run computer vision models against large-scale raster datasets: it handles everything from preparing imagery, mosaicking, and removing edge effects across tiles to executing distributed model inference and converting predictions into vector geometries, all within Wherobots Cloud.

RasterFlow's outputs are stored as vectorized results in Apache Iceberg tables — an open table format for large-scale analytic datasets — or as predictions within Zarr (a cloud-native format for chunked, compressed multi-dimensional arrays) or COGs, which can be seamlessly analyzed using the full suite of spatial operations in WherobotsDB. This creates end-to-end raster workflows — from raw imagery through model inference to spatial analytics, without moving data between systems or building custom infrastructure.

RasterFlow supports both popular open-source geospatial AI models and custom PyTorch models, and can generate embeddings from geospatial foundation models. It is currently available to select customers in private preview. If you're interested in RasterFlow, join our upcoming session to see it in action.

What Comes Next: Query Performance and Spatial Analytics

Raster processing is not only a first-class capability in WherobotsDB; it's also one part of a broader set of spatial data processing advances we've built beyond open-source Sedona.
Vector and raster workloads both benefit from the same query performance optimizations under the hood: spatial relationship acceleration, automatic join optimization, dynamic data redistribution, and a vectorized GeoParquet reader. Queries that require careful tuning with self-managed Sedona run optimally out of the box with WherobotsDB.

In the next post in this series, we'll go deep on query performance and spatial analytics: how WherobotsDB accelerates spatial joins, range queries, and analytical functions across both vector and raster data types.

Get Started with Wherobots Access Now
Introducing SedonaDB and SpatialBench for Apache Sedona

Posted on September 24, 2025 (updated December 10, 2025) by Damian

Our role at Wherobots, and as leaders in the Apache Sedona community, is to help more developers, organizations, and AI systems positively transform the physical world using spatial data. To make the scale of transformation we envision possible, we've had to address significant bottlenecks in how data is stored and queried. We're excited to celebrate the availability of SedonaDB and SpatialBench for Apache Sedona. Together they represent the next phase in our plan to accelerate innovation with spatial data and bridge the intelligence gap between AI and the physical world.

Intro to SedonaDB: A modern query engine that gets spatial right

SedonaDB is the first open-source, single-node analytical database engine that treats spatial data as a first-class citizen. Most analytical query engines already support general-purpose operations: filtering, joins, aggregations, and APIs for SQL or Python. But when it comes to operating on spatial data, those same engines fall short: support for geometry and geography types, coordinate reference systems (CRS), spatial joins, and raster or vector operations is missing. The workaround is to bolt on an extension like PostGIS (PostgreSQL), DuckDB Spatial (DuckDB), or SedonaSpark (Spark). While powerful, extensions inherit the limits, costs, and complexities of their host systems, require extra setup and tuning, and can force builders to develop around performance and usability gaps instead of developing their ideas.

SedonaDB is different. It's for builders solving problems with physical-world data. Written in Rust, it's lightweight, blazing fast, and spatial-native. Out of the box, it provides:

- Full support for spatial types, joins, CRS, and functions on top of industry-standard query operations.
Query optimizations, indexing, and data pruning features under the hood that make spatial operations just work with high performance. Pythonic and SQL interfaces familiar to developers, plus APIs for R and Rust. Flexibility to run in single-machine environments on local files or data lakes. SedonaDB uses Apache Arrow and Apache DataFusion, and provides everything you need from a modern vectorized query engine. But it delivers the unique ability to also run high performance spatial workloads easily, without requiring extensions. Read the announcement on the Apache Sedona blog to dive in and roll up your sleeves. What led to SedonaDB? In 2020, Apache Sedona was incubated to address a significant support gap in distributed geospatial data processing. Since then, Sedona has enabled companies like Uber, Amazon Last Mile Delivery, JB Hunt, and thousands of others with geographically distributed operations or interests to build and run more efficient and effective physical operations at scale. It is widely used today to bring geospatial processing support to Apache Spark, Apache Flink, and also Snowflake. But distributed systems aren’t for everyone or the right fit for every use case, and we could do more to drive innovation in lower-scale scenarios. Accelerating innovation Many ideas are bootstrapped in no-to-low cost environments where iteration cycles are fast and low risk. There’s a lot that a developer can do today using a laptop or a single virtual machine, with modern software and LLMs—without adding a dependency that adds unwanted cost and complexity to the innovation cycle. Once their ideas are viable, they may not even require a distributed compute environment in production like Spark, or one that is “fully managed” by a vendor. So the next step was pretty clear. We had to make it easier for builders to use spatial data in no-to-low cost environments so they can iterate and positively transform the physical world, faster. 
We also decided to address these challenges through open-source software to maximize accessibility. Making development easier If you look around the ecosystem, you’ll notice a pattern: to get the analytical support you need for geospatial data, you deploy an analytics engine without the spatial analytics support you want, and then you bolt on what you need via an extension. Extensions are great and they serve a purpose very well. After all, SedonaSpark is an extension! But that doesn’t mean the combination of engine + extension is ideal. It requires additional setup and management, can require tuning to achieve a reasonable performance, and the underlying engine may end up becoming a bottleneck. Additionally, the development experience around the engine may be overly complex or lack support for the language you prefer, and the engine itself might introduce compute, cost, and other overhead. Working from the root causes of these challenges, along with the desire to drive more innovation, our next step became obvious. We needed to create a query engine that aids spatial data solutions development out of the box with popular pythonic and SQL interfaces and is optimized for single-machine environments. But was there enough value created by a spatial-first query engine compared to general purpose query engines with spatial extensions? Optimizing for spatial data = a better future Spatial data is no longer a minor class of data. It’s everywhere, the rate at which it’s being generated is growing every day, and its use cases span numerous industries. It streams from devices, vehicles, satellites, and drones, and derivatives from this data inform automation and decision-making across business, government, and research. The solutions being developed with it are transforming how organizations operate in the physical world. 
Innovation is happening today with this data despite the friction above, but the pace of this innovation can be accelerated by a query engine with internals intentionally designed to help developers realize the full potential of this data. This engine is SedonaDB, and it’s backed by an open-source community (Apache Sedona) that is committed to solving physical-world challenges through data and technology. Intro to SpatialBench: The first standard for spatial query performance “Without standards, there can be no improvement” – Taiichi Ohno. This statement from the founder of the Toyota Production System captures why we built SpatialBench. There was no standard way of measuring spatial query performance, so progress couldn’t be easily quantified or query engines objectively compared on this dimension. We built SpatialBench to establish that first standard. The initial release supports 12 representative queries, ranging from simple to complex workloads, and includes a data generator for scale factors 1, 10, 100, and 1000. We hope this framework and its future versions will guide innovation that leads to a greater understanding of the physical world. We also used SpatialBench to benchmark SedonaDB, DuckDB (with its spatial extension), and GeoPandas at scale factors 1 and 10. Those results are published here. Next Steps Get Started with SedonaDB: Try it out and contribute to the roadmap. Use SpatialBench: Measure spatial query performance using a consistent standard. Watch the webinar (hosted by CNG): We’ve walked through SedonaDB and SpatialBench, and introduced Wherobots’ Startup Accelerator Program.
Raster Spatial Joins at Scale: Google Earth Engine and BigQuery vs Apache Sedona and Wherobots Posted on July 31, 2025 (updated March 9, 2026) by Matt Forrest If you are deciding between Google Earth Engine + BigQuery and Apache Sedona + Wherobots for large-scale geospatial analysis, here is the short answer: Wherobots completed a statewide zonal statistics analysis across every building in Texas in 3 minutes 28 seconds. BigQuery timed out on the same query. For a Dallas metro area test using the same datasets, Apache Sedona finished in 37 seconds vs BigQuery’s 8 minutes 24 seconds, making it 13x faster. And before BigQuery could even run the spatial join, filtering the global dataset down to a single bounding box took 58 minutes and 50 seconds. This post breaks down why that gap exists, what it means architecturally, and when each platform actually makes sense for your workload. Key Takeaways Here is what we found:

- Apache Sedona via Wherobots is 13x faster than BigQuery for large-scale zonal statistics on a metro-area dataset
- BigQuery with Google Earth Engine could not complete a state-scale spatial join without timing out; Wherobots finished the same job in under 4 minutes
- Apache Sedona reads directly from cloud storage (S3, STAC) with no data migration required; BigQuery requires importing data into Google Cloud first
- Apache Sedona supports 300+ spatial functions; BigQuery offers 71, with ST_REGIONSTATS as the only raster-vector function
- For production pipelines at scale, Wherobots wins on speed, flexibility, and cost of data movement. Google Earth Engine remains useful for exploratory, one-off analysis within its own dataset catalog

What is Zonal Statistics Analysis and Why Does Scale Matter? A zonal statistics analysis is one kind of result you can produce from a spatial join between a raster dataset and a vector dataset.
This analysis can return things like:

- sum: Sum of the pixel values in a zone
- mean: Arithmetic mean of those values
- min or max: Limit values of the pixels in the zones

For example, we could use vector boundaries of postal codes to find the average elevation from a raster. Imagine that you want to take a study area, maybe a specific bounding box or metro area, and understand the dominant land classification in that area. In this case, each grid square in the raster might have a value like “water” or “trees” and you want to find the classification with the maximum count in each of your vector areas. At a small scale this kind of analysis is relatively easy for one-off work. There are a few engines that allow you to do this:

- Shapely, GeoPandas, and Rasterio, three commonly used Python libraries that are often combined to do this.
- PostGIS, which includes spatial SQL functions to work with raster and vector data, extending PostgreSQL with spatial functionality.
- Google Earth Engine, which includes the ability to perform zonal statistics and has now added support for performing calculations with vector data inside BigQuery.
- Apache Sedona, which provides a functional layer to perform distributed spatial queries using Apache Spark and other engines.

However, there are plenty of use cases where you do want to do this at large scale:

- Understand the land classification across thousands, hundreds of thousands, or even millions of individual buildings or properties.
- Analyze climate or weather data in small or large scale areas over time.
- Get up-to-date fire risk using global or country-specific data.
- Calculate and update flood risk across thousands of properties by analyzing the height above nearest drainage points.

These data sets have the potential to add massive value to analytical workflows for many industries and sectors. Unfortunately, the Python and PostGIS approaches are difficult to scale effectively.
While they work up to a point, they run on a single compute instance and therefore will only scale as far as the computing power and memory of that machine. So how can you do this at scale and without disrupting your existing architecture? Let’s look at the two cloud options, Apache Sedona and Google BigQuery. How Google Earth Engine + BigQuery Handles Large-Scale Spatial Joins (And Where It Struggles) Apache Sedona and Google Earth Engine both provide solutions when you need to process larger amounts of data. Google Earth Engine, originally created in 2009, generally stores its raster data in multi-resolution pyramids that are not directly exposed to the user but are accessed via a proprietary library. Google Earth Engine also stores some vector data, distributed in a similar format via the same libraries. In Google Earth Engine you can access planetary scale data and analyze it for small and large scale problems without the need to download the data on your computer. In recent years, Google has started to expand the capabilities for Earth Engine to integrate with the Google Cloud Suite of products, most notably BigQuery. BigQuery is an online analytical processing (or OLAP) data warehouse that allows you to store massive amounts of long-form data and analyze it efficiently using distributed computing. It uses a proprietary storage format that allows it to quickly distribute workloads across data that’s stored in the Google Cloud architecture. The downside of this proprietary format is that you must import your data into BigQuery before leveraging the distributed computing resources that make it powerful. BigQuery makes it possible to calculate zonal statistics using Earth Engine data, but unfortunately that geospatial data does not reside in the same proprietary, fast format. (This differs from the approach with Wherobots and Apache Sedona outlined later in the post.)
Recently, BigQuery added the function ST_REGIONSTATS that allows you to directly integrate BigQuery SQL with datasets from Earth Engine. All in all, this workflow is great for traditional tabular analytics if you already have or are planning to migrate data into BigQuery. However, if you want to leverage open formats, Apache Sedona and Wherobots allow you to do that at scale with open core tooling. How Apache Sedona + Wherobots Handles the Same Problem Apache Sedona is a distributed computing framework that brings spatial functionality into the Apache Spark ecosystem. It allows you to work with large-scale datasets, both vector and raster, and combine them into new datasets using zonal statistical functions. It uses an underlying distributed computing framework that allows you to distribute work in parallel, allowing you to scale up to planetary scale computational workloads. A key to Sedona’s scalability is that it allows you to leave the data where it sits. Many exabytes of raster data are already in an accessible storage format on the web and Sedona lets you leave that data there without spending the time or money needed to copy it yourself. This means that there are no proprietary formats, no data ingestion, and no extra steps to actually start working with this data.
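Before comparing platforms, it’s worth pinning down exactly what a zonal statistic computes. Here is a minimal pure-Python sketch, illustrative only: real engines like Sedona, PostGIS, and Earth Engine operate on georeferenced rasters and vector geometries, not explicit pixel index sets.

```python
# Toy zonal statistics: aggregate raster pixels that fall inside each zone.
# Illustrative only -- real engines use georeferenced rasters and vector
# geometries rather than hand-built sets of (row, col) pixel coordinates.

def zonal_stats(raster, zones):
    """raster: 2-D list of pixel values; zones: dict name -> set of (row, col)."""
    results = {}
    for name, cells in zones.items():
        values = [raster[r][c] for (r, c) in cells]
        results[name] = {
            "count": len(values),
            "sum": sum(values),
            "mean": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
    return results

# A 3x3 "elevation" raster and two zones (think: two postal-code areas).
raster = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
zones = {
    "north": {(0, 0), (0, 1), (0, 2)},
    "south": {(2, 0), (2, 1), (2, 2)},
}
stats = zonal_stats(raster, zones)
print(stats["north"]["mean"])  # 20.0
print(stats["south"]["max"])   # 90
```

The hard part at scale is not this arithmetic; it is deciding which pixels belong to which zone across billions of pixels and millions of zones, which is exactly the spatial join the rest of this post benchmarks.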
Platform Comparison: Apache Sedona + Wherobots vs Google Earth Engine + BigQuery Here’s how the two approaches compare across key technical and operational dimensions:

| | Apache Sedona & Wherobots | Google Earth Engine & BigQuery |
|---|---|---|
| Open formats | ✅ Works with standard formats like Cloud-Optimized GeoTIFF; no vendor lock-in | ⚠️ Data needs to be ingested into BigQuery; internal proprietary storage for vector and raster data |
| No ETL needed | ✅ Reads directly from remote raster sources like S3 or STAC endpoints without importing | ❌ Only works with supported Google Earth Engine datasets and data stored in Google Cloud Storage |
| Scalable | ✅ Distributed compute + open-source stack enables scaling from local to cloud | ✅ Good for one-off, small to mid-scale analyses |
| Spatial function support | ✅ Over 300 raster and vector functions supported in Apache Sedona | ⚠️ ST_REGIONSTATS is the only raster/vector function; 71 total spatial functions in BigQuery |
| Cloud-native | ✅ Fully supported cloud environment in AWS (GCP coming soon) | ✅ Integrated with Google Cloud; easy to use if your stack is already in GCP |

To scale zonal statistics queries even more, Wherobots only materializes the values or pixels of raster data it needs. Not only does this save time on the computational workload, but it also eliminates a significant amount of moving data back and forth, which adds time and cost in other systems. Benchmark Results: Wherobots vs Google Earth Engine + BigQuery Using ST_RegionStats in BigQuery via Google Earth Engine To compare the two systems I used a common open dataset to analyze land classification near buildings in Texas. This analysis can help determine if a building is in a built-up urban environment or if it is an area with different natural resources. I used the WorldCover dataset from the European Space Agency, which contains land cover data across the entire globe. The data is in AWS S3 storage as an open dataset here, as well as Google Earth Engine.
For the buildings, I used the Overture Maps dataset, which contains building footprints for the entire globe. Overture data is in both the Wherobots Open Data Catalog and the BigQuery Open Data project. Initially, I tried to compare the query speed for every building in Texas. Wherobots finished in about 3½ minutes, but BigQuery continued to time out after several attempts. To scale back my test, I focused on buildings in a small area around Dallas, Texas. Each building was buffered by 30 meters to ensure that we are looking at the area around the buildings, not just the footprint. To make this query work, I first needed to enable Earth Engine in BigQuery in the Google Cloud console, including:

- Activating the Earth Engine API
- Adding in required permissions for my project
- Locating and subscribing to the required Raster dataset via the Analytics Hub

Below is the code that I used to run this for the Overture Maps data in BigQuery.

```sql
WITH texas AS (
    SELECT st_buffer(geometry, 30) AS geometry, id
    FROM `PROJECT.DATASET.overture_texas`
    WHERE st_intersects(
        geometry,
        st_geogfromtext('POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))')
    )
)
SELECT
    t.geometry,
    t.id,
    ST_REGIONSTATS(
        t.geometry,
        (SELECT assets.image.href FROM `PROJECT.DATASET.landcover`),
        'Map'
    ).mean AS mean
FROM texas AS t
ORDER BY mean DESC;
```

You can use the existing Overture Maps data in BigQuery; however, I recommend making a new table with just the data you need for the analysis. Using a WHERE statement to filter the global Overture Maps table to the small bounding box took 58 minutes and 50 seconds. You can view the complete guide for doing this in GCP here.
Using Apache Sedona and Wherobots With Apache Sedona and Wherobots, we can efficiently load the ESA WorldCover data from the AWS S3 open data bucket using built-in functions to read a Spatio-Temporal Asset Catalog (or STAC) endpoint, which provides an API for reading large collections of raster data. Launch Notebook

```python
stac_df = sedona.read.format("stac").load(
    "https://services.terrascope.be/stac/collections/urn:eop:VITO:ESA_WorldCover_10m_2021_AWS_V2"
)
stac_df.printSchema()
```

Alternatively, we can load this directly from the S3 data source. The retile option automatically retiles the input, which sets up nicely for a tiled out-of-database raster DataFrame that we can persist as an Iceberg table to use in the future:

```python
esa = sedona.read.format("raster") \
    .option("retile", "true") \
    .load("s3://esa-worldcover/v200/2021/map/*.tif*")
esa.createOrReplaceTempView("esa_outdb_rasters")

YOUR_CATALOG_NAME = 'my_catalog'

# Persist as a raster table in Iceberg
sedona.sql(f"""
CREATE OR REPLACE TABLE wherobots.{YOUR_CATALOG_NAME}.esa_world_cover AS
SELECT * FROM esa_outdb_rasters
""")
# ✅ Out-DB raster table created!
```

From here I can set up my zonal stats query using the function RS_ZonalStatsAll, which returns a STRUCT of all the stats rather than the single selected statistic BigQuery returns. The values included are:

- count: Count of the pixels.
- sum: Sum of the pixel values.
- mean: Arithmetic mean.
- median: Median.
- mode: Mode.
- stddev: Standard deviation.
- variance: Variance.
- min: Minimum value of the zone.
- max: Maximum value of the zone.

```python
zonal_stats_df = sedona.sql(f"""
SELECT
    p.id,
    RS_ZonalStatsAll(r.rast, p.buffer, 1) AS stats
FROM wherobots.{YOUR_CATALOG_NAME}.texas_buildings p
JOIN wherobots.{YOUR_CATALOG_NAME}.esa_world_cover r
    ON RS_Intersects(r.rast, p.buffer)
""")
# ✅ Zonal stats computed!
```

With this query, you could choose to write the results to an Iceberg table, or you could choose to persist them to files.
In this case, I chose to persist these to files for our test.

```python
zonal_stats_df.write \
    .format("geoparquet") \
    .mode("overwrite") \
    .save(user_uri + "/results")
```

And that’s it. You don’t need to move any data, download anything, or spin up any additional services apart from the notebook to run this process. Results The performance gap between the two platforms was significant at both scales tested. Test 1: All of Texas The first test that I ran tried to intersect all of the Overture Maps buildings in the state of Texas against this global raster dataset. With Wherobots, I ran the iteration seven times, writing the complete dataset to disk each time. The results for the iterations were: 3 minutes 28 seconds (average for each iteration) ± 25.9 s per iteration (mean ± std. dev. of 7 runs, 1 loop each) BigQuery was unable to finish the test without timing out. Test 2: One city For this test, I limited the query to a bounding box in the area of Dallas shown below.

POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))

For this smaller area, the BigQuery computation ran in 8 minutes and 24 seconds. Below are the results in Wherobots using the smaller area: 37.4 seconds (average over 7 iterations) ± 1.78 s per iteration (mean ± std. dev. of 7 runs, 1 loop each) Performance Summary

| Area | BigQuery | Wherobots |
|---|---|---|
| State of Texas | N/A (timed out) | 3m 28s |
| Dallas, Texas metro area | 8m 24s | 37s |

When to Use Apache Sedona + Wherobots vs Google Earth Engine The benchmark results make the performance case clear. But there are architectural reasons to consider Sedona and Wherobots beyond raw speed. Performance and scale Sedona scales horizontally using distributed compute across Apache Spark, with strategic partitioning and adjustable runtime size to handle metro, national, or global workloads. Apache Sedona supports 300+ raster and vector functions, compared to BigQuery’s single raster-vector function, ST_REGIONSTATS.
That function coverage matters when building full analytical pipelines, not just one-off queries. Wherobots includes job management for scheduled, repeatable runs, which matters when your raster datasets change over time, such as fire risk, flood risk, or land cover updates. Architecture and flexibility Sedona reads directly from remote cloud storage like AWS S3 or STAC endpoints without copying data into a proprietary system. BigQuery requires data to live inside Google Cloud before its distributed compute kicks in. You are not limited to Earth Engine’s dataset catalog. Any open dataset accessible via cloud storage works with Sedona without ingestion or reformatting. The stack is open core. You can run Apache Sedona locally to prototype, then move to Wherobots for production without rewriting your analysis. For teams evaluating Google Earth Engine + BigQuery against Apache Sedona + Wherobots for production geospatial workloads, the benchmark is clear. At metro scale, Sedona is 13x faster. At state scale, BigQuery could not finish the job. If your data lives outside Google Cloud, or you need more than one raster-vector function, or you need repeatable scheduled pipelines, Wherobots is the stronger architectural choice. Google Earth Engine remains a capable tool for exploratory analysis within its own catalog. Try the interactive notebook Get Started
Wherobots, Sedona, and GeoPandas are better together Posted on April 15, 2025 (updated August 18, 2025) by Jia Yu This post shows you how to run computations on geospatial data with Wherobots, Sedona, and GeoPandas, explains how the engines execute computations differently, and describes how you can use these technologies for an amazing workflow. Wherobots notebooks include the Sedona and GeoPandas engines, so you can easily access either library when running computations. The post will also show you how to write Sedona code with GeoPandas syntax using the Sedona GeoPandas API. Wherobots is great for quickly running computations on large datasets in parallel. GeoPandas is familiar to lots of geospatial programmers and is interoperable with a lot of libraries, so both computation engines can be useful for an analysis. Let’s start with an example analysis that shows how to compute building centroids with either Sedona or GeoPandas. Wherobots and GeoPandas example computation Here’s how to read a GeoParquet file into a Sedona DataFrame.

```python
path = 's3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet'
buildings = sedona.read.format("geoparquet").load(path)
```

Let’s inspect a few rows of data.

```python
buildings.select("BUILD_ID", "height_val", "geom").limit(5).show()
```

```
+--------+----------+--------------------+
|BUILD_ID|height_val|                geom|
+--------+----------+--------------------+
| 2080161|       0.0|MULTIPOLYGON (((-...|
| 2080159|       0.0|MULTIPOLYGON (((-...|
| 2080158|       0.0|MULTIPOLYGON (((-...|
| 7776189|      8.49|MULTIPOLYGON (((-...|
| 7776214|     10.49|MULTIPOLYGON (((-...|
+--------+----------+--------------------+
```

Let’s filter out all the rows with a zero height_val and compute the centroid for each building.
```python
(buildings
    .where(col("height_val") != 0.0)
    .withColumn("centroid", ST_Centroid(col("geom")))
    .select("BUILD_ID", "height_val", "centroid")
    .limit(5)
    .show(truncate=False))
```

```
+--------+----------+---------------------------------------------+
|BUILD_ID|height_val|centroid                                     |
+--------+----------+---------------------------------------------+
|7776189 |8.49      |POINT (-73.9244121583531 40.876054867139366) |
|7776214 |10.49     |POINT (-73.92479319267416 40.87608655645236) |
|7740167 |37.44     |POINT (-73.93514476461972 40.81269938665691) |
|7740151 |37.27     |POINT (-73.93622071995519 40.81267611885931) |
|7740289 |36.19     |POINT (-73.93705258498942 40.813024939805196)|
+--------+----------+---------------------------------------------+
```

You could also convert this buildings dataset to a GeoPandas DataFrame and compute the centroid for each building. Here’s how we will run the computation:

1. Use Sedona to read and filter the data
2. Convert from Sedona to GeoPandas
3. Use GeoPandas to compute the centroids

Here’s how to filter the DataFrame and grab a few columns with Sedona:

```python
sedona_df = (
    buildings
    .where(col("height_val") != 0.0)
    .select("BUILD_ID", "height_val", "geom")
)
```

Now convert the DataFrame to a pandas DataFrame:

```python
pandas_df = sedona_df.toPandas()
```

Convert the pandas DataFrame to a GeoPandas DataFrame:

```python
import geopandas as gpd

geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geom")
```

Compute the centroids with GeoPandas and view the contents of the GeoPandas DataFrame:

```python
geopandas_df["centroid"] = geopandas_df.geometry.centroid
print(geopandas_df[["BUILD_ID", "height_val", "centroid"]].head(5))
```

```
   BUILD_ID  height_val                    centroid
0   7776189    8.490000  POINT (-73.92441 40.87605)
1   7776214   10.490000  POINT (-73.92479 40.87609)
2   7740167   37.439999   POINT (-73.93514 40.8127)
3   7740151   37.270000  POINT (-73.93622 40.81268)
4   7740289   36.189999  POINT (-73.93705 40.81302)
```

Read an Iceberg table into a GeoPandas DataFrame Suppose you have an Iceberg
table named local.db.icetable and would like to analyze the data with GeoPandas. GeoPandas does not have an Iceberg reader yet, so let’s read the table with Sedona and then convert it to a GeoPandas DataFrame. Here’s how to create the Sedona DataFrame: sedona_df = sedona.table("local.db.icetable") Now let’s convert the DataFrame to a GeoPandas DataFrame: import geopandas as gpd pandas_df = sedona_df.toPandas() geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geometry") print(geopandas_df) --------------------------- id geometry 0 a LINESTRING (1 3, 3 1) 1 d LINESTRING (2 7, 4 9) 2 e LINESTRING (6 7, 6 9) 3 f LINESTRING (8 7, 9 9) You can only convert smaller DataFrames from Sedona DataFrames to GeoPandas DataFrames. Let’s take a look at the architectural differences between the engines to learn more. How Sedona and GeoPandas execute computations differently Sedona executes computations on a distributed cluster, so it can quickly run computations on large datasets. GeoPandas runs computations on a single node, so it’s not as scalable, but it can be quick for small datasets and is interoperable with libraries in the pandas ecosystem. GeoPandas computations can be distributed with Sedona GeoPandas or dask-geopandas, and we will describe those technologies in the sections that follow. Let’s take a look at the differences in more detail. Single-node execution with GeoPandas vs. Sedona cluster computing GeoPandas runs computations on a single computer and processes data using only a single node. This is fine for small datasets, but it cannot scale beyond the resources of a single machine. Sedona runs on many machines working in unison as a cluster. The cluster has a driver node responsible for orchestration and worker nodes that process the data in parallel. When a computation runs on a single node, only one machine reads the data and executes the query. When a computation is run on a cluster, all worker nodes read data in parallel and execute the query. 
As you can imagine, running on a cluster is faster and more scalable for large datasets. Lazy Sedona execution vs. eager computation with GeoPandas Wherobots is a lazy compute engine, and GeoPandas is an eager compute engine. Lazy compute engines build up query execution graphs but don’t run the underlying computations until they need to produce a result. Let’s take a look at a line of Sedona code that reads a file:

```python
df = sedona.read.format("geojson").load("overture_buildings.geojson")
```

This line doesn’t read the underlying file. It waits until the user requests a result from the data before doing the work—it’s lazy. This is good because it can inspect and optimize the queries before running them. Let’s compare this with GeoPandas code that reads in data to a DataFrame:

```python
overture_buildings = gpd.read_file("overture_buildings.geojson")
```

This GeoPandas code runs instantly and eagerly loads all the data into an in-memory DataFrame. Eager execution can be beneficial, especially when the data is small and when the DataFrame is repeatedly queried. However, it has drawbacks for larger-than-memory datasets. Out-of-memory datasets with GeoPandas GeoPandas eagerly loads data into memory. When the machine does not have enough memory to load all the data, it will error out with an out-of-memory exception. In practice, GeoPandas works well with dataset sizes much smaller than the total memory on the machine. Wherobots is designed to run on datasets larger than a given machine’s memory. It processes data incrementally and does not load it into memory all at once, allowing it to process out-of-memory datasets. Let’s now dive into how the computations are executed. Sedona vs. GeoPandas: Query optimizations Wherobots uses a query optimizer to ensure the code executes efficiently. For example, suppose a query requires three columns of a 10-column dataset stored in GeoParquet.
The query optimizer will intelligently select only the needed three columns rather than reading all 10 columns of data. GeoPandas doesn’t have a query optimizer, so users must manually specify the optimizations they want to be applied. For example, when a GeoPandas user queries a GeoParquet file, they need to specify the column pruning and row-group filtering logic manually. The query optimizer provides a better developer experience and better performance, too. Sedona vs. GeoPandas: Syntax differences Sedona supports Python, SQL, Java, Scala, and R interfaces. GeoPandas only supports a Python API. The Wherobots Python syntax is different from the GeoPandas Python syntax. Sedona is syntactically similar to PySpark, and GeoPandas is syntactically similar to Pandas. Sedona vs. GeoPandas: Library ecosystem Sedona and GeoPandas are compatible with different libraries. For example, suppose you would like to read a large geospatial dataset stored in Apache Iceberg, perform some data munging, and then display the results with matplotlib. Sedona is compatible with Iceberg, and GeoPandas is interoperable with matplotlib, so here is how you can run this computation:

1. Read the Iceberg table with Sedona
2. Perform data wrangling with Sedona and reduce the dataset size
3. Convert to GeoPandas and plot with matplotlib

Sedona and GeoPandas are interoperable with different libraries, and that’s one of the main reasons why it’s nice to have access to both. Sedona GeoPandas API The Sedona GeoPandas API is identical to the GeoPandas API, but executes computations with Sedona in a distributed manner. It’s a great way for developers who are familiar with GeoPandas syntax to write code that’s executed with Sedona.
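As a sketch of what that looks like in practice: the module path `sedona.geopandas` and the `read_parquet` entry point below are assumptions, not verified API names, so check the Apache Sedona documentation for the exact import; the intent is simply that familiar GeoPandas-style calls execute as distributed Sedona jobs.

```python
# Hedged sketch -- assumes the Sedona GeoPandas API exposes GeoPandas-style
# entry points under `sedona.geopandas`; verify names against the Sedona docs.
import sedona.geopandas as sgpd

# Reads are executed by Sedona (distributed, lazy) rather than on one node.
buildings = sgpd.read_parquet(
    "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
)

# The GeoPandas syntax carries over unchanged.
nonzero = buildings[buildings["height_val"] != 0.0]
centroids = nonzero.geometry.centroid
print(centroids.head(5))
```

The payoff is that code like this needs no `toPandas()` conversion step: the DataFrame never has to fit on a single machine.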
Dask-GeoPandas overview The dask-geopandas library is another way to parallelize GeoPandas computations to leverage all cores of a given machine or distribute computations to a cluster. Here’s a good summary of dask-geopandas: “Since GeoPandas is an extension to the pandas DataFrame, the same way Dask scales pandas can also be applied to GeoPandas”. A deep dive into the differences between Sedona and dask-geopandas is beyond the scope of this post, which focuses on Sedona and GeoPandas, but we’re excited to investigate this in a future post! Converting from Sedona to GeoPandas You can convert a Wherobots DataFrame to a GeoPandas DataFrame, as you can see in this code snippet:

```python
pandas_df = sedona_df.toPandas()
geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geom")
```

When you convert from Sedona DataFrames to GeoPandas, you load all the data into the driver node and process it on a single machine. Converting from a Sedona DataFrame to a GeoPandas DataFrame is not recommended if the dataset is large. It’s better to convert from Sedona to GeoPandas once a dataset has been pre-processed and reduced in size. Once the data is small, you can convert it to a GeoPandas DataFrame and enjoy the interoperability offered by the GeoPandas ecosystem. Converting from GeoPandas to a Sedona DataFrame You can also convert from a GeoPandas DataFrame to a Sedona DataFrame. Here’s an example of how to read a FlatGeoBuf file into a GeoPandas DataFrame and then convert it to a Sedona DataFrame:

```python
geopandas_df = gpd.read_file("overture_buildings.fgb")
overture_buildings_df = sedona.createDataFrame(geopandas_df)
```

Sedona doesn’t have a FlatGeoBuf reader yet, so this is a good way to read FlatGeoBuf files into Sedona DataFrames!
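Because `toPandas()` collects the entire distributed DataFrame onto the driver, it can help to make that size assumption explicit before converting. The helper below is a sketch, not part of Sedona or Wherobots; `to_geopandas_safely` and its `max_rows` threshold are hypothetical names chosen for illustration.

```python
# Illustrative guard -- NOT a Sedona/Wherobots API. `max_rows` is an arbitrary
# placeholder; pick a threshold that fits your driver node's memory.
def to_geopandas_safely(sedona_df, geometry="geom", max_rows=1_000_000):
    """Convert a distributed DataFrame to GeoPandas only if it is small."""
    n = sedona_df.count()  # cheap relative to collecting every row
    if n > max_rows:
        raise ValueError(
            f"{n} rows exceeds max_rows={max_rows}; "
            "filter or aggregate the DataFrame before converting"
        )
    import geopandas as gpd  # imported only when the conversion proceeds
    return gpd.GeoDataFrame(sedona_df.toPandas(), geometry=geometry)
```

Any DataFrame exposing `count()` and `toPandas()` works here, so the same guard applies to Spark and Sedona DataFrames alike.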
Visualizing Iceberg Data with Sedona Wherobots, GeoPandas, and Folium Let’s illustrate how to leverage Sedona’s computational power to process large Iceberg tables, reduce the data to a manageable size, and then visualize the results with GeoPandas and Folium. This workflow showcases the strength of combining these tools for geospatial data analysis. First, we utilize Wherobots Sedona to efficiently read and process our extensive Iceberg data:

buildings_joined = sedona.sql("""
WITH dry_state_roi AS (
    SELECT * FROM wherobots_open_data.overture_2025_03_19_1.divisions_division_area
    WHERE country = 'US' AND subtype = 'region'
)
SELECT Count(building.id) AS buildings_count,
       roi.region AS region,
       roi.geometry AS region_geometry
FROM wherobots_open_data.overture_2025_03_19_1.buildings_building building
JOIN dry_state_roi roi
  ON ST_Intersects(building.geometry, roi.geometry)
GROUP BY roi.region, roi.geometry
""")

Here, Sedona’s ability to handle large datasets shines as it quickly aggregates building counts by region. This reduction in data size is crucial, as it prepares the data for the next step with GeoPandas. Next, we convert the Sedona DataFrame to a GeoPandas DataFrame:

import geopandas as gpd

buildings_pdf = buildings_joined.toPandas()
buildings_gpdf = gpd.GeoDataFrame(buildings_pdf, geometry="region_geometry")

Now, we move to visualization using Folium, a library that integrates seamlessly with GeoPandas:

import folium

m = folium.Map([43, -100], zoom_start=4)
folium.Choropleth(
    geo_data=buildings_gpdf.to_json(),
    data=buildings_gpdf,
    columns=["region", "buildings_count"],
    key_on="feature.properties.region",
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Building Count",
).add_to(m)
m

While Sedona offers other graphing libraries, Folium’s direct compatibility with GeoPandas simplifies the visualization process. This code generates an interactive choropleth map, clearly displaying building density across US regions.
This example illustrates the power of Wherobots Sedona for rapid, large-scale data reduction, followed by GeoPandas and Folium for effective visualization. This combination allows for efficient analysis and presentation of complex geospatial data. Comparing Sedona Wherobots and GeoPandas performance for a given query Let’s compare query performance at different scale factors using Sedona Wherobots and GeoPandas. The GeoPandas query is run on a machine with 12 cores and 96GB of RAM. The Wherobots Sedona query is run on a cluster of machines with a total of 24 cores and 96GB of RAM. Both setups have the same total RAM. Important note: This comparison is for GeoPandas vs. Sedona Wherobots. Dask-geopandas is not included in this analysis; it would be a more direct comparison because it is also a parallel compute engine. The purpose of this comparison is to find the performance inflection point, not to suggest that GeoPandas and Sedona Wherobots are equivalent. We tested spatial joins on differently sized datasets with GeoPandas and Sedona Wherobots. Overture buildings & Postal codes join details Let’s start with a simple spatial join query that counts the number of Overture Buildings located within postal code zones, by joining the Overture Buildings dataset with a postal codes dataset. Here is the query:

WITH t AS (
    SELECT * FROM overture_buildings buildings
    JOIN postal_codes zones
      ON ST_Intersects(buildings.geometry, zones.geometry)
)
SELECT COUNT(1) FROM t

Let’s perform this query with different dataset sizes:

Size      | overture_buildings | postal_codes
Small     | 1,000,000          | 154,452
Medium    | 10,000,000         | 154,452
Large     | 50,000,000         | 154,452
Full size | 706,967,095        | 154,452

Here are the benchmarking results for each scale factor:

Size      | Wherobots execution time | GeoPandas execution time | Output count
Small     | 8.43s                    | 13.3s                    | 362,042
Medium    | 12.8s                    | 1m 6s                    | 3,099,105
Large     | 31.6s                    | –                        | 15,170,494
Full size | 5m 42s                   | –                        | 295,732,669

‘–’ = kernel died due to OOM (Out Of Memory). Wherobots is faster at all scale factors.
GeoPandas cannot run the query on the large dataset because it’s too much data to fit into memory. Here is the GeoPandas code to run the query:

import geopandas as gpd

overture_buildings = gpd.read_parquet("path/to/overture_buildings/")
postal_codes = gpd.read_parquet("path/to/postal_codes/")
t = overture_buildings.sindex.query(postal_codes["geometry"], predicate="intersects")
count = t.shape[1]  # for an array of query geometries, sindex.query returns a 2 x n array of index pairs

Here’s how to run the code with Sedona:

df = sedona.read.format("geoparquet").load("s3://wherobots-examples/data/geopandas_blog/1M/overture-buildings/")
df.createOrReplaceTempView("overture_buildings")
df = sedona.read.format("geoparquet").load("s3://wherobots-examples/data/geopandas_blog/postal-code/")
df.createOrReplaceTempView("postal_codes")
sedona.sql("""
WITH t AS (
    SELECT * FROM overture_buildings buildings
    JOIN postal_codes zones
      ON ST_Intersects(buildings.geometry, zones.geometry)
)
SELECT COUNT(1) FROM t
""").show()

Accessing the datasets For those interested in replicating the analyses and experiments discussed in this blog post, the datasets are available in a public Amazon S3 bucket. Here’s how you can access them:

s3://wherobots-examples/data/geopandas_blog/
├── 1M/
│   └── overture-buildings/
├── 10M/
│   └── overture-buildings/
├── 50M/
│   └── overture-buildings/
└── postal-code/

Each xM/ directory (e.g., 1M/, 10M/, 50M/) contains datasets scaled to different sizes, allowing for performance comparisons and scalability testing. The postal-code/ directory contains the postal code dataset used in the spatial join examples. Conclusion Sedona and GeoPandas are both great tools for analyzing spatial data. GeoPandas has many advantages: Interoperability with lots of libraries in the pandas ecosystem Built-in matplotlib support for graphing Long-standing project with a good community and strong technical leadership Easy installation GeoPandas is an excellent option if you have a small dataset and are familiar with pandas. Wherobots and Sedona are better for larger datasets.
Wherobots has Python and SQL APIs and is a better fit for users who prefer writing SQL. But as you’ve seen in this post, Sedona and GeoPandas are better together. Combined, the technologies give you access to a larger ecosystem of libraries and the ability to run computations on both large and small datasets.
Benefits of Apache Iceberg for geospatial data analysis

Posted on April 3, 2025 (updated August 28, 2025) by Ben Pruden

Apache Iceberg is an open table format that recently added support for geometry data columns, which is game changing for geospatial data users both big and small. Today, spatial data users have a problem once they need to scale beyond about a million features: traditional file formats and row-oriented databases perform poorly compared to non-spatial equivalents like Parquet; some solutions only work well for data that comfortably fits in memory; and other solutions only work well for data that does not change. Geometry in Iceberg allows for the best of both worlds: it is built on Parquet, offers lightning-fast reads, and scales to larger-than-memory datasets. It offers developer-friendly DML operations (insert, update, merge, and delete) that spatial users might expect from a database, in a spatial file format. Beyond that, it implements extra features like versioning and time travel to allow queries against current or historical data. Let’s take a look at these features!

Reliable geospatial transactions with Iceberg

Iceberg provides reliable transactions, so data operations either complete fully and successfully or do not run at all. Data lakes don’t support transactions, which can cause issues: if you’re writing to a data lake and the compute engine errors in the middle of the write, the partially written files will corrupt the data lake and require manual cleanup; data lakes are unreadable while DML operations are running (e.g. while files are being written); and data lakes don’t provide concurrency protection. Developers who usually work with databases are often surprised by these data lake limitations because databases have supported transactions for decades. Lakehouse storage systems significantly improve data lakes with reliable transactions.
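The “fully and successfully or do not run at all” property can be illustrated with a classic single-file pattern in plain Python: stage the data in a temporary file, then atomically rename it into place. This is a stdlib sketch, not Iceberg internals (Iceberg gets the same effect by atomically swapping a pointer to a new metadata file), but it shows why a crash mid-write never leaves readers with a half-written table.

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """All-or-nothing write: stage to a temp file, then atomically
    rename it into place. Readers see the old file or the new one,
    never a partial write."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # the atomic "commit" step
    except BaseException:
        os.unlink(tmp)         # a failed write leaves no partial file
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "table.bin")
    atomic_write(target, b"row1\nrow2\n")
    assert open(target, "rb").read() == b"row1\nrow2\n"
```

The key point is that the rename is the commit: until it happens, readers still see the previous version in full.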
Create an Iceberg table with geometry data

Let’s create an Iceberg table and append some linestrings to it. (The Python snippets below assume from pyspark.sql.functions import col and that ST_GeomFromText has been imported from Sedona, e.g. from sedona.sql.st_constructors import ST_GeomFromText.) You can start by creating the Iceberg table:

CREATE TABLE local.db.icetable (id string, geometry geometry)
USING iceberg
TBLPROPERTIES('format-version'='3');

Append lines a and b to the table:

df = sedona.createDataFrame([
    ("a", 'LINESTRING(1.0 3.0,3.0 1.0)'),
    ("b", 'LINESTRING(2.0 5.0,6.0 1.0)'),
], ["id", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")

Now append lines c and d to the table:

df = sedona.createDataFrame([
    ("c", 'LINESTRING(7.0 4.0,9.0 2.0)'),
    ("d", 'LINESTRING(2.0 7.0,4.0 9.0)'),
], ["id", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")

Finally append lines e and f to the table:

df = sedona.createDataFrame([
    ("e", 'LINESTRING(6.0 7.0,6.0 9.0)'),
    ("f", 'LINESTRING(8.0 7.0,9.0 9.0)'),
], ["id", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")

Check the content of the table:

sedona.sql("SELECT * FROM local.db.icetable;").show(truncate=False)

+---+---------------------+
|id |geometry             |
+---+---------------------+
|e  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
|a  |LINESTRING (1 3, 3 1)|
|b  |LINESTRING (2 5, 6 1)|
|c  |LINESTRING (7 4, 9 2)|
|d  |LINESTRING (2 7, 4 9)|
+---+---------------------+

Geospatial delete operations with Iceberg

Iceberg makes it easy to delete rows of data in a table based on a predicate. Suppose you’d like to delete all the linestrings in the table that cross a given polygon. Here’s how:
polygon = "POLYGON((3.0 2.0, 3.0 5.0, 8.0 5.0, 8.0 2.0, 3.0 2.0))" sql = f"DELETE FROM local.db.icetable WHERE ST_Crosses(geometry, ST_GeomFromWKT('{polygon}'))" sedona.sql(sql) Check the table to see that linestrings b and c are deleted from the table. sedona.sql("SELECT * FROM local.db.icetable;").show(truncate=False) +---+---------------------+ |id |geometry | +---+---------------------+ |e |LINESTRING (6 7, 6 9)| |f |LINESTRING (8 7, 9 9)| |d |LINESTRING (2 7, 4 9)| |a |LINESTRING (1 3, 3 1)| +---+---------------------+ Iceberg delete operations are much better than what’s offered by data lakes. Data lakes don’t allow you to delete rows in a table. Instead, you need to filter and overwrite the whole table or run a tedious manual process to identify the files you need to rewrite. This process is error-prone, requires downtime for the data lake table, and could lead to irreversible data loss (if a file is accidentally deleted and not backed up, for example). This Iceberg geospatial delete operation does not require downtime and is a reliable transaction. Geospatial time travel with Iceberg Iceberg also allows for time travel between different versions of a table. 
Let’s see all the current versions of the Iceberg table:

sql = "SELECT snapshot_id, committed_at, operation FROM local.db.icetable.snapshots;"
sedona.sql(sql).show(truncate=False)

+-------------------+-----------------------+---------+
|snapshot_id        |committed_at           |operation|
+-------------------+-----------------------+---------+
|1978905766678632864|2025-03-13 13:54:46.057|append   |
|8271841988321279736|2025-03-13 13:55:03.642|append   |
|9187521890414340610|2025-03-13 13:55:19.286|append   |
|1785310483219777508|2025-03-13 13:55:33.908|delete   |
+-------------------+-----------------------+---------+

Let’s check the contents of the table before the delete operation was run:

sql = "SELECT * FROM local.db.icetable FOR SYSTEM_VERSION AS OF 9187521890414340610;"
sedona.sql(sql).show(truncate=False)

+---+---------------------+
|id |geometry             |
+---+---------------------+
|c  |LINESTRING (7 4, 9 2)|
|d  |LINESTRING (2 7, 4 9)|
|a  |LINESTRING (1 3, 3 1)|
|b  |LINESTRING (2 5, 6 1)|
|e  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
+---+---------------------+

And now let’s check the table state after the first append operation:

sql = "SELECT * FROM local.db.icetable FOR SYSTEM_VERSION AS OF 1978905766678632864;"
sedona.sql(sql).show(truncate=False)

+---+---------------------+
|id |geometry             |
+---+---------------------+
|a  |LINESTRING (1 3, 3 1)|
|b  |LINESTRING (2 5, 6 1)|
+---+---------------------+

A new table version is created every time an Iceberg transaction is completed. Suppose you have a geospatial table and append new data daily. You’re training a model on the data and are surprised that the model’s conclusions changed last week. You’re unsure if your geospatial model provides different results because the data or model code has changed. With time travel, you can run the latest model code on an earlier data version to help analyze the root cause.
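The snapshot mechanics behind these versioned reads can be sketched in a few lines of plain Python. The class and method names below are invented for illustration, not Iceberg's implementation: every committed operation stores an immutable copy of the table state, and an "AS OF" read simply looks up an older snapshot.

```python
class SnapshotTable:
    """Toy table where each commit produces an immutable snapshot,
    so any earlier version can still be read."""

    def __init__(self):
        self.snapshots = []   # list of (snapshot_id, rows)
        self._next_id = 1

    def commit(self, rows):
        """Record a new immutable snapshot and return its id."""
        sid = self._next_id
        self.snapshots.append((sid, tuple(rows)))
        self._next_id += 1
        return sid

    def current(self):
        return self.snapshots[-1][1]

    def as_of(self, snapshot_id):
        """Time travel: read the table as of an earlier snapshot."""
        for sid, rows in self.snapshots:
            if sid == snapshot_id:
                return rows
        raise KeyError(snapshot_id)

t = SnapshotTable()
v1 = t.commit([("a", "LINESTRING (1 3, 3 1)"), ("b", "LINESTRING (2 5, 6 1)")])
v2 = t.commit([r for r in t.current() if r[0] != "b"])  # a "delete" commit
print(len(t.current()), len(t.as_of(v1)))  # 1 2
```

Deleting a row only affects the newest snapshot; the pre-delete state remains queryable until its snapshot expires.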
Time travel is commonly helpful in the following scenarios: Regulatory audit purposes Model training on different versions of a table Isolating specific versions of a table while iterating on model code

Iceberg lets you delete expired snapshots, which limits your ability to time travel. Therefore, make sure you set each table’s retention properties correctly.

Geospatial upsert operations with Iceberg

Iceberg also supports upsert operations, which update or insert rows in a table in a single command. Upserts are especially useful for tables that receive frequent incremental updates. Here are the current contents of the Iceberg table:

+---+---------------------+
|id |geometry             |
+---+---------------------+
|d  |LINESTRING (2 7, 4 9)|
|e  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
|a  |LINESTRING (1 3, 3 1)|
+---+---------------------+

Let’s perform an upsert with the following data:

+---+---------------------+
|id |geometry             |
+---+---------------------+
|d  |LINESTRING (2 7, 4 9)|  # duplicate
|e  |LINESTRING (7 7, 6 9)|  # updated geometry
|z  |LINESTRING (6 7, 6 9)|  # new data
+---+---------------------+

Here’s how the upsert should behave: new data should be appended, existing data should be updated, and duplicate data should be ignored. Here’s the code to execute this operation (it assumes the incoming rows have been loaded into a DataFrame and registered as a temporary view named source, e.g. with source_df.createOrReplaceTempView("source")):

merge_sql = """
MERGE INTO local.db.icetable target
USING source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.geometry = source.geometry
WHEN NOT MATCHED THEN INSERT (id, geometry) VALUES (source.id, source.geometry)
"""
sedona.sql(merge_sql)

Here are the contents of the table after running this operation:

+---+---------------------+
|id |geometry             |
+---+---------------------+
|a  |LINESTRING (1 3, 3 1)|
|d  |LINESTRING (2 7, 4 9)|
|e  |LINESTRING (7 7, 6 9)|
|z  |LINESTRING (6 7, 6 9)|
|f  |LINESTRING (8 7, 9 9)|
+---+---------------------+

The MERGE command has many other practical applications for geospatial tables.
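The MERGE semantics above (update on match, insert on no match, and effectively a no-op for rows that are identical in both) can be sketched with a plain-Python dict model. The data mirrors the example table, but the function itself is an invented illustration, not Iceberg code:

```python
def merge(target: dict, source: dict) -> dict:
    """Upsert: matched keys take the source value, unmatched keys are
    inserted; a row identical in both is effectively unchanged."""
    result = dict(target)
    result.update(source)
    return result

target = {"a": "LINESTRING (1 3, 3 1)", "d": "LINESTRING (2 7, 4 9)",
          "e": "LINESTRING (6 7, 6 9)", "f": "LINESTRING (8 7, 9 9)"}
source = {"d": "LINESTRING (2 7, 4 9)",   # duplicate -> unchanged
          "e": "LINESTRING (7 7, 6 9)",   # matched   -> updated
          "z": "LINESTRING (6 7, 6 9)"}   # new       -> inserted
merged = merge(target, source)
print(sorted(merged))   # ['a', 'd', 'e', 'f', 'z']
print(merged["e"])      # LINESTRING (7 7, 6 9)
```

The dict key plays the role of the ON target.id = source.id join condition in the MERGE statement.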
Geospatial schema enforcement with Iceberg

Iceberg supports schema enforcement, prohibiting appends whose schema doesn’t match the table. It will error if you try to append a DataFrame with a mismatched schema to an Iceberg table. Let’s create a DataFrame with a different schema than the table:

df = sedona.createDataFrame([
    ("x", 2, 'LINESTRING(8.0 8.0,3.0 3.0)'),
    ("y", 3, 'LINESTRING(5.0 5.0,1.0 1.0)'),
], ["id", "num", "geometry"])
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))

Now try to append the DataFrame to the table:

df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")

Here is the error:

AnalysisException: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `local`.`db`.`icetable`, the reason is too many data columns: Table columns: `id`, `geometry`. Data columns: `id`, `num`, `geometry`.

The append operation is prohibited because the DataFrame schema differs from the Iceberg table schema. Data lakes don’t have built-in schema enforcement, so you can append data with a mismatched schema, which can corrupt a table or force developers to use special syntax when reading it. Schema enforcement is a nice feature that protects the integrity of your data tables.

Geospatial schema evolution with Iceberg

Iceberg also allows you to evolve a table’s schema when you want to add or remove columns. Add a num column to the Iceberg table:

sql = "ALTER TABLE local.db.icetable ADD COLUMNS (num INTEGER);"
sedona.sql(sql)

Now try to append the DataFrame with id, num, and geometry columns to the Iceberg table:

df.write.format("iceberg").mode("append").saveAsTable("local.db.icetable")

The append operation now works since the schema was evolved.
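Both behaviors, rejecting a mismatched append and accepting it after the schema is evolved, can be modeled in a few lines of plain Python. This is a toy class with invented names, not Iceberg's implementation:

```python
class TinyTable:
    """Toy model of schema enforcement and schema evolution."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.rows = []

    def append(self, columns, rows):
        """Schema enforcement: reject appends with mismatched columns."""
        if list(columns) != self.columns:
            raise ValueError(f"schema mismatch: {columns} vs {self.columns}")
        self.rows.extend(rows)

    def add_column(self, name, default=None):
        """Schema evolution: existing rows get a default value."""
        self.columns.append(name)
        self.rows = [r + (default,) for r in self.rows]

t = TinyTable(["id", "geometry"])
t.append(["id", "geometry"], [("a", "LINESTRING (1 3, 3 1)")])
try:
    t.append(["id", "num", "geometry"], [("x", 2, "LINESTRING (8 8, 3 3)")])
except ValueError:
    print("append rejected")  # enforcement in action
t.add_column("num")
t.append(["id", "geometry", "num"], [("x", "LINESTRING (8 8, 3 3)", 2)])
print(t.columns)  # ['id', 'geometry', 'num']
```

As in the Iceberg example, rows written before the evolution carry NULL (here, None) for the new column.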
Take a look at the contents of the table now:

sedona.sql("SELECT * FROM local.db.icetable;").show(truncate=False)

+---+---------------------+----+
|id |geometry             |num |
+---+---------------------+----+
|e  |LINESTRING (6 7, 6 9)|NULL|
|f  |LINESTRING (8 7, 9 9)|NULL|
|d  |LINESTRING (2 7, 4 9)|NULL|
|x  |LINESTRING (8 8, 3 3)|2   |
|y  |LINESTRING (5 5, 1 1)|3   |
|a  |LINESTRING (1 3, 3 1)|NULL|
+---+---------------------+----+

You cannot evolve the schema of a data lake. Developers must manually specify it when they read the data, or the engine must infer it. It’s problematic for engines to infer the schema because it can be slow if all the data is read, or wrong if only a subset of the data is sampled to determine the schema.

Efficient geospatial file listing operations with Iceberg

When an engine queries a data lake, it must perform a file listing operation, read all the metadata from the file footers, and then execute the query. For large datasets, file listing operations can be relatively slow, and they take even longer if the dataset uses deeply nested Hive-style partitioning or lives in a cloud object store. File listing operations are faster with Lakehouse storage systems like Iceberg: the engine gets all the file paths and file-level metadata directly from the table metadata, which is much faster than performing a separate file listing operation.

Small file compaction of geospatial files with Iceberg

Query engines don’t perform as efficiently when there are many small files. A table composed of ten 1 GB files usually performs much better than a table with 10,000 1 MB files. Iceberg supports small file compaction, automatically rearranging small files into bigger files and eliminating the small file problem. You can compact the small files in an Iceberg table using the rewrite_data_files stored procedure.
CALL system.rewrite_data_files(table => 'local.db.icetable')

Data lakes don’t support compaction, so you must manually compact the small files, which requires table downtime and is error-prone.

Iceberg merge-on-read operations

Iceberg supports both copy-on-write and merge-on-read for DML operations. Copy-on-write operations immediately rewrite all affected files, which is slow. Lakehouse storage systems store data in immutable Parquet files, so rewriting entire files is necessary for some DML operations. Merge-on-read implements DML operations differently: rather than rewriting the underlying data files, merge-on-read operations output deletion vector files with the diffs. Subsequent reads are slightly slower because the data files and deletion vector files must be “merged,” but this strategy dramatically reduces the runtime of DML operations. Merge-on-read generally offers a better set of tradeoffs than copy-on-write. Iceberg supports merge-on-read, and data lakes do not. Here’s how to set the write modes for an Iceberg table, enabling merge-on-read for deletes in this example:

ALTER TABLE local.db.icetable SET TBLPROPERTIES (
    'write.delete.mode'='merge-on-read',
    'write.update.mode'='copy-on-write',
    'write.merge.mode'='copy-on-write'
);

Conclusion

Lakehouses bring many new features to the geospatial data community. An Iceberg table is almost always a better option than a Parquet data lake because it offers many advantages. If you already have Parquet tables, converting them into Iceberg tables is straightforward. It’s great to see how the geospatial data community collaborated closely with the Iceberg community to update the Iceberg spec to support geometry columns. The community can now start investing in spatial tables in Lakehouses, which are more performant and reliable in production applications.
Here are some other blog posts in case you’d like to learn more: Iceberg GEO: Technical Insights and Implementation Strategies Apache Iceberg and Parquet now support GEO Stay up-to-date with the latest on Wherobots, Apache Sedona, and geospatial data by subscribing to the Spatial Intelligence Newsletter.
The Spatial Intelligence Newsletter: Map Matching, Spatial Joins, ML for EO, Cloud-Native Geospatial and More Posted on March 14, 2025 by Tiffany Huynh 👋 Welcome back to the latest edition of the Spatial Intelligence Newsletter! We’ve been busy brewing up some exciting things here at Wherobots, so we have plenty of new updates and content to share! Latest Content Don’t Let Messy GPS Slow You Down: The Fastest Way to Clean Up Messy GPS Data – And Save Money Raw GPS data is messy. 😵💫 Noisy signals, lost connections, and inaccuracies make it hard to extract valuable insights. Imagine using your GPS to get to your location, only to find it telling you to drive over water instead of the road (personally, I’ve even had the map tell me to walk on water 🌊🚶🏻♀️). Wherobots’ map matching corrects trajectories by aligning them with real-world road networks (❌ no more walking on water!), all while delivering unmatched accuracy and performance (and saving money!). Apache Iceberg and Parquet now support GEO – A Huge Step Forward for Cloud-Native Geo Geospatial data has long been treated as a second-class citizen: the innovations that modernized today’s data ecosystem largely left geospatial data behind. But that’s no longer the case. Thanks to the efforts of the Apache Iceberg and Parquet communities, both Iceberg and Parquet now support geometry and geography (collectively the GEO) data types! 🎉 What does this mean? With native geospatial data type support in Apache Iceberg and Parquet, you can seamlessly run query and processing engines like Wherobots, DuckDB, Apache Sedona, Apache Spark, Databricks, Snowflake, and BigQuery on your data, all while benefitting from faster queries and lower storage costs from Parquet-formatted data.
💨 Exploring design and key features to enhance spatial data workloads with Iceberg GEO With Apache Iceberg and Parquet now supporting GEO types, the economics of utilizing geospatial data in end solutions improve. This advancement allows organizations to create higher-value, lower-cost products and achieve faster results over time. Let’s take a closer look at these GEO data types in Iceberg, exploring their design, key features, and implementation considerations. Learn how leveraging these features with Apache Sedona and Wherobots can enhance cost performance and data governance, ensuring the best possible experience for spatial data workloads. 📈 Optimizing Earth Observation Models for Production with ML Model Extension What are the challenges of applying AI to geospatial problems? 🤖 Join panel speakers from Wherobots, Radiant Earth, CRIM and Terradue as they discuss how this challenge led to the development of an open, portable solution for describing computer vision models trained on overhead imagery. Learn about the MLM STAC Extension, its use cases, and why model developers should adopt it, along with Raster Inference – a serverless computer vision solution that extracts valuable insights from aerial imagery. 🌎 Getting Started With Wherobots Interested in getting started with Wherobots, but unsure of where to begin? Here are some helpful resources. 👇 Wherobots 101: Mastering Scalable Geospatial Data Processing Want to take your geospatial analytics to the next level? Whether you’re just starting out or already working with spatial data, learn how to leverage valuable tools and workflows in Wherobots Cloud to analyze, visualize and interpret geospatial datasets. From setting up your account to mastering advanced analytics, this session is a helpful guide to set you up for success! Wherobots 102: Reading and Processing Cloud Native Geospatial Data Learn how to efficiently load, manage and analyze raster and vector data in Wherobots’ hosted environment.
Whether you’re working with massive geospatial datasets or looking for optimized workflows to write and query GeoParquet and Cloud-Optimized GeoTIFFs (COGs), this video will equip you with the tools and techniques to scale your geospatial analysis. Working with Foursquare Places Data Which neighborhood in San Francisco has the most coffee shops? Dive into the Foursquare Open Places dataset, a free and open dataset providing 100M+ global places of interest, with our latest tutorial. ☕ You’ll be able to query using Spatial SQL, subset the data for a specific region, search for specific businesses or places, and aggregate locations by geography. By the end of this tutorial, you’ll have a choropleth map showing the number of coffee shops, sorted by neighborhood. Apache Sedona Community Sedona Success Story: Optimizing ETL pipelines at scale with Comcast 🚀 Is scaling your ETL pipeline a priority? Discover how Comcast successfully achieved this by using Apache Sedona, all while boosting productivity and improving the quality of their network operations. 🌐 Learn how Apache Sedona reduces vendor lock-in. Understand why it outperforms tools like GeoPandas and PostGIS. See how it improves the ability of the Xfinity network team to optimize their network operations through a global view of performance quality and degradation. Find out how it integrates seamlessly with Apache Spark and other distributed engines. O’Reilly: Cloud Native Geospatial Analytics with Apache Sedona – Navigating Large-Scale Spatial Data We know that handling large-scale spatial data can be daunting, which is why we’ve designed this guide to simplify geospatial data. This will help boost your spatial analytics expertise and transform the way you work with geospatial data! 💪 Our newest chapter, focusing on vector data analysis using spatial SQL, is now available. If you’ve already accessed the previous chapters, be sure to check your inbox (on a separate email) for the latest one! 
📧 Engage with the Community Through Sedona Office Hours We host monthly office hours to bring you the latest news and updates to Apache Sedona. Mark your calendars for the next one. Even if you can’t make it, we’ll send you the recording and slides to make sure you don’t miss anything that might be helpful to you. 🤝 Upcoming Events Spatial Joins at Scale: Unlocking Advanced Geospatial Analytics If you’ve ever struggled with Spatial Joins (you know who you are), then this is the one to join (pun intended, courtesy of Matt Forrest 😎)! Learn how to seamlessly integrate Python and Wherobots to perform advanced spatial joins and analyses on geospatial data. Gain practical skills and best practices for processing and visualizing spatial data at scale. Don’t miss this opportunity to boost your spatial analytics expertise and transform how you work with geospatial data. Fireside Chat with Overture Maps and Dotlas on Cloud-Native Geospatial: More Than Just Big Data How is cloud-native geospatial reshaping the way organizations interact with spatial data? ☁️🌎 It prioritizes flexibility, changes how data consumers connect, removes friction, and unlocks new possibilities. Join us, alongside Amy Rose from the Overture Maps Foundation and Eshwaran Venka from Dotlas, as we explore how modern approaches enable scalability across various compute infrastructures, eliminate the need to move massive datasets, and allow users to work with data wherever they are—whether locally or in the cloud. Hear about where geospatial technology is headed. This is a conversation you definitely don’t want to miss! Getting Started 🆓 Getting started with Wherobots is easy. If you haven’t already, create a free account and dive in. If you’re looking to take your geospatial analytics to the next level—whether it’s full access to open datasets, map matching, or raster inference—try the Pro tier for free.
This Month in Apache Sedona: November Edition Posted on November 18, 2024 (updated February 28, 2025) by Tiffany Huynh Apache Sedona has reached ⭐ 2M+ downloads ⭐ in the past month! What’s New in Apache Sedona 1.7.0 Our newest features include KNN Join, GeoStats and DataFrame-based readers. Learn more about the latest release in our most recent office hour. You can access the presentation slides here and the kNN Join slides here. Be sure to save the date for the next office hour to stay up-to-date with the latest releases! Latest Content What is Apache Sedona? Spatial data should be treated as a first class citizen. Here’s an overview of Apache Sedona and some of its common use cases. Learn more. Introducing GeoStats A machine learning and statistical toolbox for WherobotsAI and Apache Sedona users. Learn about its use cases and what challenges it helps solve. Read more. O’Reilly: Cloud-Native Geospatial Analytics with Apache Sedona Ideal for developers, data scientists and engineers, this hands-on guide provides practical solutions for challenges in working with various types of geospatial data. Get access here. Past Event Spatial Data Science Conference Wherobots attended the Spatial Data Science Conference in New York last month. Head of Marketing Ben Pruden and Senior Solution Architect Daniel Smith led a workshop on insurance risk analysis using Wherobots and CARTO. We also took the stage with GeoPostcodes to showcase how Apache Sedona on Wherobots Cloud can be leveraged to better understand population data, reducing processing times from weeks to days. During the meetup, we enjoyed some drinks and light bites while having a great time connecting with members of the Apache Sedona Community. We’re grateful for the opportunity to meet and learn from others in the geospatial data space, hearing about their use cases and how they’re leveraging these technologies.
Upcoming Events Cloud Native Geospatial Meetup: GIS Week – November 19 Hear from an amazing lineup of speakers from the Google Maps and Google Earth teams, former ESRI, and Wherobots. Topics include kNN Join, the origins of GIS Day, the history and future of cloud-native geospatial technology, and current work within the open source community. More details here. AWS re:Invent – December 2-6 Expo booth (all week): Swing by booth #1865 to learn more about Wherobots and Apache Sedona. Grab some free SWAG and participate in our trivia to win some exciting raffle prizes! GeoParty (December 4): Join us for a geoparty at the Venetian Pool Deck! Enjoy drinks, light snacks, good music and networking with other professionals in the field. Space is limited and you must register to attend. RSVP here. Session: Extract insights from satellite imagery at scale with WherobotsAI (December 5, 12:30 PM–12:50 PM) at Venetian Hall B Expo, Theater 4. Learn how WherobotsAI Raster Inference enables data platform and science teams to analyze our planet with satellite imagery faster, more reliably, and with a zero carbon footprint—using SQL and Python. This fully managed, high-performance, carbon-neutral planetary-scale computer vision solution makes AI/ML on satellite imagery accessible to most developers and data scientists. Connect at re:Invent: If you’ll be at re:Invent and would like to set up a time to meet, feel free to email us at reinvent@wherobots.com and we can find a time to connect. Apache Sedona Office Hour – December 10 We host monthly office hours as a way to engage with the community, share the latest updates and releases, along with future plans. If you’re working on something exciting with Apache Sedona, we’d love to hear about it. Save the date for the next office hour.
How to shift Apache Sedona on Spark workloads to WherobotsDB

Posted on November 6, 2024 (updated July 28, 2025) by Jia Yu

Wherobots customers are realizing up to a 20x performance increase and significant cost savings by shifting their Apache Sedona workloads into Wherobots. This guide shows you how easy it is to migrate Apache Sedona workloads into WherobotsDB, and focuses on best practices for Apache Sedona migrations from Amazon EMR, AWS Glue, and Databricks.

What You'll Achieve

By following this guide, you'll be able to:

Lift and shift existing Apache Sedona workloads: Whether your Sedona jobs currently run in AWS Glue, Databricks, or EMR, we'll show you how to move them to WherobotsDB.

Orchestrate job runs with Airflow: We've made it easy to redirect your existing code and execute it seamlessly on WherobotsDB using Airflow for job orchestration.

Getting Started

This guide assumes you've already decided to migrate to Wherobots. We'll focus on the technical steps of moving your workloads so you can get up and running quickly.

Storage Integration

Wherobots makes it easy to run the models and scripts you already have in your public or private Amazon S3 buckets using Wherobots' secure S3 storage integration. For step-by-step configuration instructions, see S3 Storage integration and SAML Single Sign On (SSO) setup in the official Wherobots documentation.

Initializing the Sedona Context within WherobotsDB

When you're ready to dive into spatial data analysis within WherobotsDB, your first order of business is creating a Sedona context object. This object acts as your gateway to the capabilities of the Wherobots Cloud ecosystem, enabling you to leverage its extensive spatial functions and tools. To ensure a smooth start, double-check that your Sedona environment is correctly configured within your Wherobots notebook.
Pay close attention to the sedona and spark variables used to initialize the Sedona environment, ensuring they match your existing setup and preferences. This will help you avoid potential hiccups and ensure a seamless transition into spatial analysis with WherobotsDB.

```python
from sedona.spark import *

config = (
    SedonaContext.builder()
    # add your Sedona/Spark configurations here in this format:
    # .config("<sedona-spark-config-key>", "<sedona-spark-config-value>")
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```

This configuration provides the foundation for using Sedona's spatial functions within WherobotsDB, empowering you to perform advanced geospatial analysis with ease and efficiency.

Move Your Business Logic

Workload migration can be daunting and disruptive. Fortunately, WherobotsDB is built on Apache Sedona and is 100% code-compatible, so you can migrate your workloads seamlessly. You'll find all the familiar functions, joins, and features of Sedona, performance-enhanced by WherobotsDB. Follow the steps below to transfer your business logic to WherobotsDB:

1. Identify your logic
2. Validate functionality
3. Create a Python script
4. Redeploy code in your S3 bucket

This diagram illustrates how your Sedona workloads will be integrated within the Wherobots ecosystem:

Identify Your Logic

Start by identifying an obvious component of your spatial workflow to migrate to WherobotsDB. Ensure you have all the supporting elements required for its functionality. This approach will streamline your transition to the WherobotsDB ecosystem.

Validate Functionality

After identifying the business logic you intend to shift, validate its functionality within the WherobotsDB environment to ensure it performs as expected. This validation ensures that your spatial operations, data transformations, and analytical processes produce the same accurate results you rely on. Test your code using WherobotsDB notebooks.
Start by selecting a runtime for your notebook that aligns with the demands of your workload. Then transfer your business logic into the notebook environment. Execute your code and carefully validate the outputs, paying close attention to data counts and consistency with your expected results. This validation ensures that your logic functions correctly within WherobotsDB.

Create a Python Script

With your validated code ready, it's time to package it into a Python script. This simply means creating a .py file and organizing your code. This step makes your logic portable and easily executed within the WherobotsDB environment.

Redeploy Code in Your S3 Bucket

Now that your business logic is neatly packaged in a Python script, you need to make it accessible to Wherobots Airflow. You can upload it to an S3 bucket that's integrated with your Wherobots environment, or upload it to our Managed Storage, a secure and integrated storage solution that keeps your code readily available for execution within the Wherobots ecosystem. Click here to learn how to upload to Managed Storage.

Execute Sedona Code with the Wherobots Airflow Operator

Wherobots provides an Airflow operator called WherobotsRunOperator to simplify the integration of your code with the Job Runs API. This operator, designed for Apache Airflow, allows you to trigger Wherobots runs within your Airflow workflows. Before running your script, you'll need to establish a connection to Wherobots in the Airflow server and retrieve the S3 URI of your uploaded Python file. This URI serves as a reference to your code's location, enabling the Wherobots Airflow operator to access and execute it.
Here's an example of how to use the WherobotsRunOperator to execute your Sedona code on WherobotsDB:

```python
import datetime

import pendulum
from airflow import DAG
from airflow_providers_wherobots.operators.run import WherobotsRunOperator
from wherobots.db.runtime import Runtime

with DAG(
    dag_id="test_run_operator",
    schedule="@once",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
) as test_run_dag:
    operator = WherobotsRunOperator(
        task_id="analysis_task",
        name="airflow_run_operator",
        runtime=Runtime.TINY,
        run_python={
            "uri": "S3-URI-PATH-TO-YOUR-FILE",
            "args": "test_run=True",
        },
        dag=test_run_dag,
        poll_logs=True,
    )
```

In this example, the WherobotsRunOperator takes the S3 URI of your Python file and executes it on a specified runtime environment (Runtime.TINY). You can configure Airflow to run your code on a schedule, pass arguments to your code, and monitor the execution logs.

By using the WherobotsRunOperator and the Job Runs API, you can seamlessly integrate your existing Sedona code into WherobotsDB and take advantage of its powerful geospatial capabilities. This approach ensures a smooth transition and lets you focus on your spatial data analysis without worrying about infrastructure management or complex configurations. To learn more about the WherobotsRunOperator and its capabilities, refer to the Wherobots documentation.

Alternative to the Airflow Operator

If you don't use Airflow, the Wherobots Job Runs API provides a convenient way to execute your code directly on WherobotsDB.

Conclusion

Migrating your spatial data workflows doesn't have to be a complex endeavor. With this guide, you can easily transition from Apache Sedona on Spark, EMR, or Databricks and leverage WherobotsDB on Wherobots Cloud, a cloud-native data processing solution designed to make you more productive and accelerate your spatial workloads. Ready to simplify your spatial data analysis? Get started with Wherobots today.
Introducing kNN Join for Wherobots and Apache Sedona

Posted on November 5, 2024 (updated February 18, 2026) by Ben Pruden

TL;DR: WherobotsDB and Apache Sedona 1.7.0 introduce two types of kNN Join for large-scale geospatial workloads. The Exact kNN Join (ST_KNN) returns precise nearest neighbors and is best when accuracy is critical. The Approximate kNN Join (ST_AKNN) trades roughly 3-7% accuracy for significantly faster query times and lower cost. Both outperform PostGIS at scale, with PostGIS unable to complete joins at the 10M x 1B dataset size.

We are excited to introduce the k-Nearest Neighbors Join (kNN Join) for WherobotsDB and Apache Sedona 1.7.0. With the kNN Join, you can efficiently find the closest entities to your points of interest from datasets at scale. We've released two types of kNN Joins: the Exact kNN Join and the Approximate kNN Join.

With the Exact kNN Join you get precise results, but the additional precision adds cost and time to the query. The Exact kNN Join is useful when precise object proximity is critical. It is now available in WherobotsDB and will be available in Apache Sedona 1.7.0.

The Approximate kNN Join delivers results faster and at a lower cost than the Exact kNN Join, and is measured to be ~3-7% less accurate than the Exact kNN Join across varying data sizes (more info below). The Approximate kNN Join, available now in WherobotsDB, is useful when performance matters more than accuracy, such as in testing workloads or when query speed matters more than precision.

Click here to launch this interactive notebook.

When Is the kNN Join Useful?

The kNN Join allows you to quickly find a chosen number (k) of nearest entities (objects) to your points of interest (queries).

When Should You Use the Exact kNN Join

The Exact kNN Join efficiently identifies the exact k closest objects to your queries of interest. This makes it a highly useful algorithm when precise object proximity is crucial.
Common applications of the Exact kNN Join at scale include:

Reverse geocoding: Ride share services can use the Exact kNN Join to map high volumes of user tracking coordinates to street addresses for smooth rider pickups.

Identifying nearest transit options: Map services can use the Exact kNN Join to accurately locate (or simulate) the closest public transit options for many people simultaneously.

Analyzing the distribution of public services: City planners can use the Exact kNN Join to identify geographic areas without public services nearby to inform future development plans.

Efficient freight routing: Shipping and cargo trucks can use the Exact kNN Join to locate the exact locations of the nearest refueling stations, ramps, marine life, etc. to create efficient routes and reduce time and cost.

When Should You Use the Approximate kNN Join

The Approximate kNN Join trades accuracy for higher query speed. It will find a set of k close neighbors (but not necessarily the closest neighbors) for your query set. Some common applications of the Approximate kNN Join include:

Local search: Map services use the Approximate kNN Join to help users find places of interest within a certain distance of their geolocation.

Image search: Geospatial modeling teams leverage Approximate kNN to find images for geographic points of interest to augment available data for model training.

More importantly, the Approximate kNN Join unlocks kNN for any compute- or time-constrained workload, including:

Testing: Teams developing geospatial workloads leverage the speed of the Approximate kNN Join to quickly test and iterate on their pipelines.

Meeting latency requirements: Teams use the Approximate kNN Join to meet latency requirements and serve identified nearest neighbors at scale in production.
Why Self-Hosting kNN Join Algorithms Is Harder Than It Looks

Before, teams leveraging the kNN algorithm for geospatial workloads had to:

Determine how to host and scale general-purpose open source kNN solutions, including managing the curse of dimensionality, handling computational complexity that grows with data size, and preprocessing outlier data.

Limit the frequency of kNN join workloads due to high time and compute costs.

What You Gain by Using kNN Join in Wherobots or Apache Sedona

Now, with the kNN Join in Wherobots or Apache Sedona, you can:

Leverage geospatially optimized kNN joins that scale efficiently with increasing data sizes, without the overhead of building and maintaining your own solution.

Iterate at a higher pace by leveraging the Approximate kNN Join to trade accuracy for performance, and, as your workload requires, run the Exact kNN Join to guarantee accurate neighbors for points of interest.

See how ST_KNN and ST_AKNN perform at 44 million geometries across five US states. We'll walk through a brief overview of the kNN Join, share how each performs at various scales, and show how to use them.

How kNN Join Works in WherobotsDB and Apache Sedona

When used with geospatial data, the kNN Join identifies the k nearest neighbors for a given spatial point or region based on geographic proximity. It involves two datasets: a query dataset containing points of interest, and an objects dataset containing the entities you're trying to locate near those points of interest. For each record in the queries dataset, the kNN Join finds the set of k nearest neighbors from the objects dataset based on a user-defined distance metric and k value.

For example, let's say you're interested in locating the three closest coffee shops to specific public transit stops in your neighborhood. The query dataset would be the GPS coordinates of the public transit stops.
The objects dataset would be the GPS coordinates of all coffee shops in your neighborhood. As we're looking for the three closest coffee shops, the k value is 3. We'll say that our distance metric is three blocks. Using this as input, the kNN Join will search the objects and return the three coffee shops within three blocks of each public transit stop in your neighborhood.

An illustration of the kNN join. Using the example above, the red dots are the query dataset containing GPS coordinates of the public transit stops. The green dots are the objects dataset containing GPS coordinates of all coffee shops in your area. We want the 3 nearest coffee shops (k=3) to these public transit stops. Using the kNN join, we find the closest coffee shops to the public transit stops. The query dataset and only their closest found neighbors are shown in df_joined.

kNN Join SQL API Overview

We've made it easy to integrate the kNN Join into your workflows. Here is a quick overview of the kNN Join Spatial SQL APIs. For an in-depth look, see our docs here.

Exact kNN Join

Supported geometries: points, geometries
Hyperparameters: number of nearest neighbors to identify (k)
Output: DataFrame with the k nearest neighbors per query, ranked by distance
SQL API: ST_KNN(...)

```sql
Exact_kNN_Join_Results = ST_KNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
```

Approximate kNN Join

Supported geometries: points, geometries
Hyperparameters: number of nearest neighbors to identify (k)
Output: DataFrame with the k nearest neighbors per query, ranked by distance
SQL API: ST_AKNN(...)

```sql
Approximate_kNN_Join_Results = ST_AKNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
```

kNN Join Performance Benchmarks: Wherobots vs PostGIS

We'll compare how the Exact and Approximate kNN Joins perform versus PostGIS, and how the Approximate kNN Join performs against the Exact kNN Join.
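To make the join semantics from the coffee-shop example concrete, here is a toy, brute-force sketch of an exact kNN join in plain Python. The coordinates and IDs are illustrative, and this is not Sedona's distributed, index-backed implementation:

```python
import math

def knn_join(queries, objects, k):
    """Toy exact kNN join: for each query point, return the k closest object IDs.

    queries/objects: dicts mapping an ID to an (x, y) coordinate.
    Distances here are plain Euclidean; Sedona also supports spherical distance.
    """
    results = {}
    for qid, q in queries.items():
        # Rank every object by distance to this query point, then keep the top k.
        ranked = sorted(objects.items(), key=lambda item: math.dist(q, item[1]))
        results[qid] = [oid for oid, _ in ranked[:k]]
    return results

# Transit stops (queries) and coffee shops (objects), as in the example above.
stops = {"stop_a": (0.0, 0.0), "stop_b": (5.0, 5.0)}
shops = {"s1": (1.0, 0.0), "s2": (0.0, 2.0), "s3": (4.0, 4.0), "s4": (9.0, 9.0)}

# Each stop maps to its three nearest coffee shops, nearest first.
print(knn_join(stops, shops, k=3))
```

The distributed versions avoid this quadratic scan by partitioning space and pruning with spatial indexes, but the output contract is the same: k ranked neighbors per query.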
kNN Join vs PostGIS

Let's look at some performance benchmarks for the Exact and Approximate kNN Joins compared to PostGIS's exact kNN join algorithm. The following benchmarks are based on measured wall clock time for each join on varying data sizes. Each data size is defined by the query data size and the object data size, so a query set of 1M records and an object set of 12M records is described as 1M x 12M in the graph below. To compare scale and performance fairly, we avoid compute comparisons across options by normalizing on cost per hour for each join.

As seen below, the Approximate and Exact kNN Joins scale better than PostGIS with increasing dataset sizes. A key limitation of PostGIS's exact kNN join is its inability to scale horizontally across machines, which is why Wherobots performs better on large-scale exact kNN joins. The Exact and Approximate kNN Joins distribute the join horizontally across workers and process larger datasets more efficiently. Specifically, the Approximate kNN Join is more performant and more cost-efficient per workload than both PostGIS and the Exact kNN Join. Meanwhile, the Exact kNN Join handles larger workloads than PostGIS, with PostGIS unable to complete the join for the 10M x 1B data size.

Approximate vs Exact kNN Join Benchmarks

Let's look at some performance benchmarks to help you choose which kNN Join is right for a given workload. We've chosen two metrics to evaluate the accuracy of the Approximate kNN Join:

Accuracy: Measures the percentage of the Approximate kNN Join results that are the closest objects to queries.

Jaccard index: Measures the overlap between the Approximate kNN Join results and the Exact kNN Join results.

Let's take a look at wall clock time for the Approximate kNN Join versus the Exact kNN Join for increasing data sizes.

Let's take a look at accuracy and Jaccard scores for the Approximate kNN Join for increasing data sizes.
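For intuition, both metrics can be computed directly from the neighbor IDs each join returns for a query. A minimal sketch with made-up flight IDs (not the benchmark data):

```python
def accuracy(approx, exact):
    """Fraction of approximate results that are among the true k nearest neighbors."""
    return len(set(approx) & set(exact)) / len(approx)

def jaccard_index(approx, exact):
    """Overlap between the approximate and exact result sets (intersection / union)."""
    a, e = set(approx), set(exact)
    return len(a & e) / len(a | e)

# One query's k=4 neighbors from each join (illustrative IDs only).
exact_nn = ["f1", "f2", "f3", "f4"]
approx_nn = ["f1", "f2", "f3", "f9"]  # one neighbor differs

print(accuracy(approx_nn, exact_nn))       # 3 of 4 results are true neighbors: 0.75
print(jaccard_index(approx_nn, exact_nn))  # 3 shared / 5 distinct: 0.6
```

In the benchmarks these per-query scores are averaged across the whole query set.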
As seen above, the Approximate kNN Join runs faster than the Exact kNN Join for datasets of varying sizes, is highly accurate, and identifies highly similar sets of nearest neighbors to the Exact kNN Join, even as dataset sizes increase. We recommend the Approximate kNN Join whenever your workloads allow for less precision, are cost constrained, or require high performance.

Tutorial: How to Run kNN Join on Weather and Flight Data in WherobotsDB

The following example shows how to apply the Exact and Approximate kNN Joins to weather events data to find which flights are closest to each weather event, which can be crucial for making real-time decisions in air traffic management and ensuring flight safety. We will load this weather data from the Wherobots Spatial Catalog. The Spatial Catalog includes a collection of Wherobots-maintained open datasets from various data sources like Overture Maps, Landsat, Wild Fires, New York Taxi Data, and more. These datasets are optimized for fast and efficient analytics with WherobotsDB. The queries table contains the locations of weather events, such as storms or turbulence, while the objects table contains flight locations. Visit the user documentation for a full walkthrough of both kNN joins.

1. Preparing the Weather Events Data as the Queries DataFrame

First, we load the weather events data from the Spatial Catalog. Then we assign a monotonically increasing ID to each row and register the result as a temporary view so we can use it in SQL.
```python
from pyspark.sql.functions import monotonically_increasing_id

df_queries = sedona.table("wherobots_pro_data.weather.weather_events")
df_queries = df_queries.withColumn("id", monotonically_increasing_id())
df_queries.createOrReplaceTempView("weather")
df_queries.select("id", "geometry", "type").show(5, False)
```

The weather events queries DataFrame looks like this:

```
+-------+-------------------------+----+
|id     |geometry                 |type|
+-------+-------------------------+----+
|2716677|POINT (-110.974 41.8191) |Snow|
|2656934|POINT (-109.4604 41.5323)|Snow|
|2653971|POINT (-109.0528 41.5945)|Rain|
|2519252|POINT (-105.0333 42.45)  |Rain|
|2480600|POINT (-105.5419 44.3394)|Snow|
+-------+-------------------------+----+
only showing top 5 rows
```

2. Preparing the Flights Data as the Objects DataFrame

We are using flight tracking data from ADS-B flight observations. This data originated from ADSB.lol, which provides publicly available daily updates of crowdsourced flight tracking data.

```python
df_objects = sedona.read.format("geoparquet").load(
    "s3a://wherobots-examples/data/examples/flights/2024_s2.parquet"
)
df_objects.createOrReplaceTempView("flights")
```

The flights objects DataFrame looks like this:

```
+---------------------------+-----------------------------+-------------------+
|desc                       |geometry                     |timestamp          |
+---------------------------+-----------------------------+-------------------+
|BOEING 737-800             |POINT (-110.232108 43.929066)|2024-01-11 00:00:00|
|CESSNA 240 CORVALIS TTX    |POINT (-110.852247 43.765574)|2024-01-01 00:00:00|
|CESSNA 172 SKYHAWK         |POINT (-108.790218 44.801926)|2024-01-23 00:00:00|
|CESSNA 510 CITATION MUSTANG|POINT (-110.795842 43.518448)|2024-01-26 00:00:00|
|BOEING 737-800             |POINT (-110.873321 43.079086)|2024-01-18 00:00:00|
+---------------------------+-----------------------------+-------------------+
only showing top 5 rows
```

3. Run the kNN Join

Once the two datasets are loaded, we can use the following SQL to run the kNN Joins. The query below runs the Exact kNN Join with the number of neighbors set to 4.
```python
df_knn_join = sedona.sql("""
    SELECT
        WEATHER.GEOMETRY AS WEATHER_GEOM,
        FLIGHTS.GEOMETRY AS FLIGHTS_GEOM
    FROM WEATHER JOIN FLIGHTS
    ON ST_KNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
""")
```

You can also run the Approximate kNN Join with the number of neighbors set to 4, and add computed columns (the sphere distance and a connecting line) to the result DataFrame as follows:

```python
df_knn_join = sedona.sql("""
    SELECT
        WEATHER.GEOMETRY AS WEATHER_GEOM,
        WEATHER.ID AS QID,
        FLIGHTS.GEOMETRY AS FLIGHTS_GEOM,
        ST_DistanceSphere(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY) AS DISTANCE,
        ST_MakeLine(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY) AS LINE
    FROM WEATHER JOIN FLIGHTS
    ON ST_AKNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
""")
```

The joined DataFrame looks like the following for a single query point:

```
+-------+-------------------------+-----------------------------+-----------------+
|QID    |QUERIES_GEOM             |OBJECTS_GEOM                 |DISTANCE         |
+-------+-------------------------+-----------------------------+-----------------+
|2804966|POINT (-110.4211 44.5444)|POINT (-110.425023 44.544205)|311.6515627672718|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.416483 44.546535)|436.1578754802839|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.428809 44.546302)|646.4967456258582|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.415679 44.550445)|797.7249421529948|
+-------+-------------------------+-----------------------------+-----------------+
only showing top 4 rows
```

4. Visualize the kNN Join Results

We can visualize the Exact kNN Join results using SedonaKepler by adding the joined DataFrame to a map.

```python
# create a map for the results
map_view = SedonaKepler.create_map(df_unique_qid.select('QUERIES_GEOM'), name="WEATHER EVENTS")
SedonaKepler.add_df(
    map_view,
    df=df_related_rows.select('OBJECTS_GEOM', 'DISTANCE').withColumnRenamed("OBJECTS_GEOM", "geometry"),
    name="FLIGHTS",
)
SedonaKepler.add_df(
    map_view,
    df=df_related_rows.select('LINE', 'DISTANCE').withColumnRenamed("LINE", "geometry"),
    name="KNN LINES",
)
# show the map
map_view
```

Below is a portion of the map generated above, showing the results of the Exact kNN Join. For each weather event from our query dataset (blue dots), we see the 4 closest flights (yellow dots), with lines connecting each weather event to those flights.

Get Started with the Wherobots kNN Join

Wherobots Users

All Wherobots users have access to the Exact and Approximate kNN Joins. If you haven't already, create a free Wherobots organization subscribed to the Community Edition of Wherobots. Start a Wherobots notebook, then explore the python/wherobots-db/wherobots-db-knn-joins.ipynb example to get started. Need additional help? Check out our user documentation, and send us a note if needed at support@wherobots.com.

Apache Sedona Users

We're releasing the Exact kNN Join in Apache Sedona 1.7.0. Subscribe to the Sedona newsletter and join the Sedona community to get notified of releases and get started!

What's Next

We're excited to hear what algorithms you'd like us to support. We can't wait for your feedback and to see what you'll create! Start using Wherobots for free.