Mobility Data Processing at Scale: Why Traditional Spatial Systems Break Down

Posted on March 26, 2026 by Matt Forrest

Mobility data is the continuous stream of GPS location records capturing how people, vehicles, and assets move through the world. Processing it at scale is fundamentally different from standard spatial data work because it carries temporal dependencies: the order and timing of observations define movement, not just position. Most organizations have discovered, often painfully, that the tools they already use were not designed with that in mind. In Part 2, we walk through the technical implementation: a three-notebook medallion architecture built on Wherobots and Apache Sedona that takes raw GPS pings and transforms them into analysis-ready, GeoParquet-backed analytical views.

Why Mobility Data Is Harder to Process Than It Looks

Every second, billions of GPS-equipped devices generate spatiotemporal data capturing how people, goods, and vehicles move through the physical world. The market reflects how seriously organizations are treating this: the global fleet management market is valued at approximately $27 billion in 2025 and projected to exceed $122 billion by 2035. Mobility data analytics platforms are on a similar trajectory, from $2.5 billion to over $11 billion by 2034. But collecting this data and actually extracting reliable intelligence from it are two very different things. Most organizations have discovered, often painfully, that the tools they already use were not designed with GPS mobility data in mind. The result is brittle pipelines, inconsistent methodologies, and analyses that quietly produce misleading conclusions.
Understanding why requires looking at the specific properties that make mobility data uniquely difficult to process correctly.

What Makes Mobility Data Different From Standard Spatial Data

Mobility data is not just spatial data with timestamps attached. It is a distinct category of data that violates assumptions built into most data processing systems. Researchers at ACM have characterized the field as requiring its own dedicated science—Mobility Data Science—because general-purpose data science pipelines consistently produce suboptimal results when applied to movement data. Understanding why requires examining the specific properties that make trajectory data uniquely challenging.

Why GPS Data Volume Overwhelms Traditional Databases

The volume and velocity problem is the recognition that GPS data generation at fleet scale is not a batch analytics challenge; it is a continuous, high-throughput data engineering problem that demands distributed processing from the start. A single connected vehicle generating GPS pings at one-second intervals produces roughly 86,400 records per day. A fleet of 10,000 vehicles generates over 860 million data points daily. Multiply this across the millions of connected vehicles, delivery drones, rideshare fleets, and maritime vessels operating globally, and the scale becomes staggering. Traditional spatial databases like PostGIS, which excel at transactional workloads and moderate-scale analytics, were not designed for this volume. Loading hundreds of millions of GPS points into PostgreSQL, constructing geometries, and running spatial joins or trajectory reconstruction queries can take hours or days on a single node. Adding more hardware does not solve the fundamental problem: PostGIS was not built for distributed, parallel spatial computation.
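The fleet-scale arithmetic above is worth making concrete. A quick back-of-envelope sketch in plain Python (the fleet sizes are illustrative, not taken from any real deployment):

```python
# Back-of-envelope GPS ping volume for a fleet reporting at a fixed interval.
# Fleet sizes below are illustrative, not from a real deployment.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def daily_pings(fleet_size: int, ping_interval_s: int = 1) -> int:
    """Records generated per day by a fleet at the given ping interval."""
    return fleet_size * (SECONDS_PER_DAY // ping_interval_s)

print(daily_pings(1))       # 86400 pings for a single vehicle
print(daily_pings(10_000))  # 864000000: over 860 million records per day
```

At a 1 Hz reporting rate the volume crosses into distributed-processing territory almost immediately; even dropping to 10-second pings only divides the totals by ten.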
How GPS Signal Noise Corrupts Downstream Analysis

GPS noise is error in raw location readings caused by signal reflection, satellite loss in tunnels and urban canyons, and atmospheric interference, and it cascades through every downstream analysis built on top of it. Studies have documented median GPS errors of 7 meters with standard deviations exceeding 23 meters in urban environments. Points can appear on the wrong side of a street, inside buildings, or kilometers from the actual position when signal quality degrades. Speed calculations between consecutive points can spike to physically impossible values. Distance measurements accumulate systematic errors. Clustering algorithms identify phantom stop locations. Without rigorous cleaning and validation at the earliest stages of the pipeline, every subsequent insight is built on a compromised foundation. Researchers at the University of Pennsylvania’s Computational Social Science Lab studied exactly this problem in the context of COVID-19 epidemic modeling. Using the same GPS mobility dataset, they found that different but individually reasonable preprocessing choices led to substantially different conclusions: a methodological “garden of forking paths” where reproducibility became nearly impossible. The root causes: data sparsity, sampling bias, and inconsistent algorithmic choices at the preprocessing stage.

Why Temporal Ordering Is Critical in Mobility Data Processing

Trip segmentation is the process of splitting a continuous GPS stream into discrete trips by detecting temporal gaps, periods where no data was recorded or the device was stationary. Mobility data is not simply geospatial—it is spatiotemporal. Every GPS point has a position and a timestamp, and the relationship between consecutive observations is what defines movement.
This trip segmentation step alone introduces significant methodological complexity, because the threshold you choose (5 minutes? 20 minutes? 60 minutes?) fundamentally changes the structure of your resulting trajectories and all metrics derived from them. Beyond segmentation, ordering matters. Trajectories are sequences, not sets. Every analytical operation—speed calculation, direction changes, stop detection, map matching—depends on correct chronological ordering within each trip. Systems that do not preserve or guarantee ordering (a common challenge in distributed frameworks) can produce geometries that appear valid but contain scrambled temporal information.

Why 2D Spatial Systems Fail for Mobility Analytics

The dimensionality problem is the gap between how most spatial systems model location (latitude and longitude only) and what mobility analysis increasingly requires: 3D and 4D geometry that encodes elevation and time directly into the geometry itself. Most spatial systems treat location as a 2D construct: latitude and longitude. But mobility analysis increasingly demands 3D and 4D processing. Elevation matters for fuel consumption modeling, route optimization in mountainous terrain, aviation and drone trajectories, and any analysis where the difference between 2D and 3D distance is materially significant. Adding a temporal measure dimension (the “M” in XYZM geometries) enables encoding timestamps directly into the geometry itself, supporting trajectory validation and interpolation operations that are impossible with 2D points. Yet 4D geometry support—constructing XYZM points, building trajectories from them, and performing analytical operations that respect all four dimensions—is rare. Most spatial SQL implementations either lack these functions entirely or implement them inconsistently. This forces practitioners to maintain separate columns for elevation and time, losing the computational advantages of integrated 4D geometry processing.
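To see how much the gap threshold matters in trip segmentation, here is a minimal pure-Python sketch. In a real pipeline this runs as a PySpark window function over millions of rows; the five-ping stream and the timestamps below are toy values chosen only to show the threshold sensitivity:

```python
from datetime import datetime, timedelta

def segment_trips(timestamps, gap_threshold):
    """Split an ordered list of ping timestamps into trips wherever the
    gap between consecutive pings exceeds gap_threshold."""
    trips, current = [], [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > gap_threshold:
            trips.append(current)  # close the current trip at the gap
            current = []
        current.append(curr)
    trips.append(current)
    return trips

# Toy stream: pings at 0, 1, 2 minutes, then a 10-minute gap, then 12, 13.
t0 = datetime(2026, 3, 26, 8, 0)
pings = [t0 + timedelta(minutes=m) for m in (0, 1, 2, 12, 13)]

print(len(segment_trips(pings, timedelta(minutes=5))))   # 2 trips
print(len(segment_trips(pings, timedelta(minutes=20))))  # 1 trip
```

The same stream yields two trips at a 5-minute threshold and one trip at 20 minutes, which is exactly why every derived metric (trip count, trip duration, average speed per trip) shifts with that one parameter.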
Where Traditional Spatial Systems Fail for Mobility Data

System                                        Primary Limitation for Mobility Data
Desktop GIS (QGIS, ArcGIS Pro)                Single-machine ceiling, no distributed processing
PostGIS                                       Single-node, not built for hundreds of millions of GPS points
Cloud data warehouses (Snowflake, BigQuery)   Shallow spatial support, cannot handle XYZM or map matching
Vanilla Apache Spark                          No native spatial types, no spatial indexing
External map matching APIs                    Rate limits and per-request pricing make batch processing prohibitive

Desktop GIS: The Single-Machine Ceiling

Tools like QGIS and ArcGIS Pro are extraordinarily capable for visualization, manual analysis, and working with datasets that fit in memory. But they hit a hard wall with mobility data at scale. Loading millions of GPS trajectories, performing trip segmentation with window functions, running DBSCAN clustering on stop points, and executing map matching against a road network are not operations that desktop GIS was designed to handle. Analysts working with fleet-scale data routinely encounter out-of-memory errors, multi-hour processing times, and the inability to iterate quickly on analytical parameters.

Spatial Databases: Scale Without Spatial Intelligence

PostGIS remains the gold standard for spatial SQL and is an excellent choice for many use cases. However, it is fundamentally a single-node system. Scaling PostGIS to handle hundreds of millions of GPS points requires expensive vertical scaling, and even then, operations like trajectory construction across thousands of users, spatial indexing with H3 or GeoHash, and DBSCAN clustering at urban scale can exhaust available resources. Cloud data warehouses like Snowflake, BigQuery, and Redshift have added spatial capabilities, but these tend to be shallow implementations optimized for simple point-in-polygon or distance queries.
Constructing XYZM trajectories from ordered GPS points, performing spatial clustering, computing 3D distances, or running map matching against a road network are either unsupported or require convoluted workarounds that sacrifice performance and maintainability.

General-Purpose Distributed Frameworks: Power Without Spatial Awareness

Apache Spark provides the distributed computing muscle needed for mobility-scale data, but vanilla Spark has no concept of spatial data types, spatial indexing, or geometric operations. Running a spatial join in pure Spark requires broadcasting datasets or implementing custom partitioning strategies—both of which are error-prone and perform poorly at scale compared to purpose-built spatial engines. This is precisely the gap that Apache Sedona and Wherobots were designed to fill. Sedona extends Spark (and other distributed frameworks) with native spatial data types, over 290 spatial SQL functions, spatial indexing, and optimized query planning that understands geometric predicates. Wherobots builds on Sedona to provide a fully managed, cloud-native spatial intelligence platform where teams can process GPS-scale data without managing infrastructure, configuring clusters, or bolting together fragmented toolchains.

Map Matching: The Unsolved Infrastructure Problem

Map matching—the process of snapping noisy GPS traces to the actual road network—is one of the most computationally demanding and methodologically complex steps in any mobility pipeline. It requires loading a complete road network graph, computing probabilistic alignments between GPS observations and candidate road segments, and resolving ambiguities at intersections, parallel roads, and complex interchanges.
Most map matching solutions are either commercial APIs with strict rate limits and per-request pricing that make batch processing of historical data prohibitively expensive, or open-source tools that require significant infrastructure setup and do not scale to city-wide or fleet-wide datasets. Researchers have consistently identified scalability as the primary bottleneck: algorithms that produce accurate matches on small datasets fail to perform when confronted with millions of trajectories. Having map matching available as an integrated capability within the same distributed environment where you are already processing and analyzing your trajectory data—rather than as an external API call or a separate system—eliminates an entire category of infrastructure complexity and data movement overhead.

The Real Cost of a Fragmented Mobility Data Pipeline

In practice, most organizations processing mobility data have assembled a patchwork of tools: Python scripts for data cleaning, PostGIS for spatial operations, custom code for trip segmentation, an external API for map matching, a separate clustering library, and a visualization tool at the end. Each transition between tools introduces data serialization overhead, potential for schema drift, and opportunities for subtle bugs. This fragmentation carries real costs beyond engineering time. When a researcher needs to change the trip segmentation threshold from 20 minutes to 30 minutes, the entire pipeline must be re-executed across multiple systems. When a new data source arrives with a slightly different schema, adapters must be updated at each integration point. When results need to be reproduced for regulatory or academic review, reconstructing the exact sequence of operations across disparate tools is often impractical.
The ideal mobility data pipeline processes GPS pings through ingestion, cleaning, enrichment, trajectory construction, map matching, spatial indexing, clustering, and analytical aggregation—all within a single, distributed, spatially-aware environment where every step is expressed in SQL or Python, every intermediate result is inspectable, and the entire pipeline can be reproduced with a single execution.

What a Modern Mobility Data Architecture Looks Like

The medallion architecture—Bronze, Silver, Gold—has become the standard pattern for progressive data refinement in the data lakehouse world. But applying it to mobility data requires rethinking what each layer does, because spatial data introduces transformations and enrichment steps that have no analog in conventional data engineering.

Bronze is not just ingestion—it is spatial profiling. You are not only loading CSV or Parquet files; you are constructing point geometries, validating coordinate bounds, assessing data quality metrics like altitude validity and temporal coverage, and establishing the spatial extent of your dataset.

Silver is where the heavy lifting happens. This is trip segmentation, 4D geometry construction, trajectory building, movement metric derivation, spatial indexing, and map matching. Each of these operations is computationally intensive, order-dependent, and requires spatial functions that most data platforms simply do not have.

Gold produces the analytical views that power downstream consumption: H3 hexbin density heatmaps, temporal activity patterns, stop detection via spatial clustering, trajectory anomaly flagging, and road segment speed analysis. These views are written as GeoParquet files—compatible with Kepler.gl, QGIS, Felt, Foursquare Studio, and DuckDB Spatial—ensuring that the output of the pipeline is immediately consumable by any modern geospatial visualization or analytics tool.
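As a rough illustration of what a Gold-layer density view computes, here is a plain-Python sketch that uses a rectangular lat/lon grid as a stand-in for H3 cells. Real hexbinning would go through Sedona's H3 functions or the h3 library; the cell size and the Beijing-area coordinates below are arbitrary illustrative values:

```python
from collections import Counter

def grid_cell(lat, lon, cell_deg=0.01):
    """Snap a coordinate to a rectangular grid cell (roughly 1 km at the
    equator for 0.01 degrees); a crude stand-in for an H3 cell ID."""
    return (int(lat // cell_deg), int(lon // cell_deg))

def density(points, cell_deg=0.01):
    """Count pings per cell: the core aggregation behind a hexbin heatmap."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)

# Two pings in one cell, one ping in a neighboring cell.
pings = [(39.9841, 116.3185), (39.9843, 116.3187), (39.9990, 116.3266)]
heat = density(pings)
print(max(heat.values()))  # the densest cell holds 2 pings
```

The distributed version replaces the rectangular cell ID with an H3 index and the Counter with a GROUP BY, but the shape of the computation is the same: assign each point a cell, then aggregate per cell.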
How Wherobots Handles Mobility Data Processing: What We Built

To demonstrate this architecture in practice, we built a three-notebook Wherobots Mobility Solution Accelerator using the Microsoft Research GeoLife GPS Trajectories dataset—one of the few open mobility datasets that includes elevation data, enabling full 4D geometry processing. The dataset contains 17,621 trajectories from 182 users in Beijing, with latitude, longitude, altitude, and timestamps spanning 2007–2012. In Part 2, we walk through every notebook in detail: the Bronze layer’s ingestion and profiling pipeline, the Silver layer’s 4D trajectory construction and map matching workflow, and the Gold layer’s analytical and exploratory views. We cover the specific Apache Sedona spatial SQL functions used at each step, the PySpark window function patterns for trip segmentation and movement metrics, and the real-world challenges we encountered and solved—from Spark’s schema inference corrupting timestamp values, to COLLECT_LIST not preserving order in trajectory construction, to DBSCAN requiring physical column references. If you are building mobility analytics pipelines and hitting the limitations of your current toolchain, Part 2 will give you a concrete, reproducible blueprint for how to do it on Wherobots.
PostGIS vs Wherobots: What It Actually Costs You to Choose Wrong

Posted on March 19, 2026 by Matt Forrest

When building a geospatial platform, technical decisions are never just technical; they are financial. Choosing the wrong architecture for your spatial data doesn’t just frustrate your data team; it directly impacts your bottom line through large cloud infrastructure bills and, perhaps more dangerously, delayed business insights. For decision-makers, the choice between a traditional spatial database (like PostGIS, an open-source extension of PostgreSQL that adds support for storing and querying location data) and a cloud-native geospatial analytics platform (like Wherobots, built with distributed computing including Apache Spark to process massive spatial datasets in parallel across compute clusters) comes down to two fundamental metrics: Time to Insight and Total Cost of Ownership (TCO). To understand where to invest your budget, we need to look beyond the software labels and understand the economics of how these systems handle data. If you are new to the PostGIS vs Wherobots discussion, start with [Part 1: PostGIS, Wherobots, and the Spatial Data Lakehouse: A Strategic Guide for Leaders]. This post assumes you understand the architectural difference and focuses on what it actually costs you to choose wrong.

Key Takeaways:

PostGIS is optimized for low-latency lookups. Wherobots is optimized for high-throughput analytics and data processing. Using the wrong one for the wrong job costs you both time and money.

A PostGIS server must be provisioned for peak load, so you pay for maximum capacity even when usage is low. Wherobots is designed to charge primarily for active compute time, so you are not paying for idle capacity.

For industries like insurance, logistics, and urban planning, the right architecture choice can dramatically reduce query time for large-scale spatial analysis — in some cases from hours or days down to minutes.
PostGIS and Wherobots are not mutually exclusive. Many enterprises use Wherobots to process data at scale, then serve results through PostGIS for live application access.

Use the checklist in this post as a fast diagnostic: if you are waiting hours for spatial queries or your cloud bill is outpacing your revenue growth, you have a strong case for cloud-native spatial compute.

Why Slow Spatial Queries Cost More Than You Think

In the modern enterprise, the value of data decays over time. An answer delivered in 5 seconds is actionable; an answer delivered in 5 days is a post-mortem. The architecture you choose dictates how fast you can answer complex questions.

Low Latency vs High Throughput: What Speed Actually Means for Each Tool

It is crucial to understand the difference between “speed” for an app and “speed” for analytics.

Low Latency (PostGIS): This is the speed of retrieval. When a customer opens your delivery app and asks, “Where is my driver?”, they need an answer in milliseconds. PostGIS is optimized for this. It uses heavy indexing to find a single “needle in a haystack” instantly.

High Throughput (Wherobots): This is the speed of processing. When your risk analyst asks, “Which of our 50,000 retail locations are at risk of flooding based on the new 100-year climate models?”, they are not looking for a needle; they are looking for patterns across the whole haystack.

The Bottleneck: If you try to run that massive climate model analysis in PostGIS, the database has to check every single location against every single flood zone, even with optimizations for search like spatial indexing. Because PostGIS typically runs on a single server, running that analysis forces your app and your analytics to compete for the same resources. The query might take 24 hours, and your customer-facing app slows down the whole time.

The Solution: Wherobots breaks that same job into parallel tasks across a cluster of worker nodes, each handling a partitioned geographic slice.
Because the work happens simultaneously, the job finishes in minutes instead of hours.

Business Impact: Your analysts get answers before lunch, not next week. Your operational apps remain fast for customers because the heavy lifting happened elsewhere.

PostGIS vs Wherobots: Why the Pricing Models Are Fundamentally Different

The second major factor is how you pay for these capabilities. Query speed is only half the cost story. The other half is how each system charges you. The pricing models for databases and cloud-native engines are fundamentally different.

Why PostGIS Forces You to Pay for Peak Capacity Around the Clock

A high-performance database like PostGIS requires expensive hardware, specifically high-speed RAM and fast CPUs, to keep your data accessible.

The Lease Model: PostGIS requires provisioning a server for peak load, which means you pay for maximum capacity even when usage is low. Think of it like leasing a Ferrari just to drive to the grocery store on Sundays. You have to provision this server for your peak usage. If you need to run a heavy report once a week that requires 64 cores of CPU, you must pay for a 64-core server 24 hours a day, 7 days a week.

The Waste: For the other 6 days and 23 hours, that expensive server sits idle, burning budget.

How Wherobots Charges Only for Active Compute Time

Wherobots uses an elastic, on-demand pricing model: you consume compute while an operation is actively running, then billing stops when it finishes. Storage and compute are decoupled, meaning your data sits in low-cost object storage and you rent processing power only when you need it.

Storage is Cheap: You keep your massive datasets in low-cost Object Storage (like Amazon S3), which costs pennies per gigabyte.

Compute is On-Demand: When you need to run that heavy climate model, you rent the 1,000 computers for exactly 15 minutes.
The moment the job is done, the machines turn off, and the billing stops.

The Savings: You convert a massive fixed Capital Expenditure (CapEx) into a lean, manageable Operational Expenditure (OpEx).

PostGIS vs Wherobots: Which One Is Right for Your Use Case?

To make this concrete, let’s look at three distinct industry examples and which tool provides the best ROI for each.

1. Logistics and Delivery: Why Real-Time Tracking Needs PostGIS

Scenario: You need to route drivers in real-time and show customers where their package is.
The Choice: PostGIS.
Why: You need transactional guarantees. If a driver marks a package as “Delivered,” that data must be instantly saved and visible. You are doing millions of tiny, fast lookups.

2. Insurance and Real Estate: Why Portfolio-Scale Risk Analysis Needs Wherobots

Scenario: You need to calculate risk premiums for 10 million homes based on historical wildfire data, distance to fire stations, and vegetation density indices.
The Choice: Wherobots.
Why: This is a “Global Join.” You are comparing massive datasets against each other. PostGIS would take weeks to process this at a national scale. Wherobots can recalculate the entire portfolio every night, allowing you to adjust pricing dynamically.

3. Urban Planning: Why Time-Series Sensor Data Overwhelms a Standard Database

Scenario: You are ingesting telemetry data from 50,000 connected traffic lights and sensors to analyze traffic congestion trends over the last 5 years.
The Choice: Wherobots.
Why: The volume of data (Time-Series) is too large for a standard database. A database would bloat, slow down, and become expensive to back up. Wherobots can read this data directly from cheap storage, aggregate it into trends, and output the results.

PostGIS vs Wherobots: Decision Checklist

PostGIS is a good choice when your primary use case is powering a user-facing application that needs fast, transactional lookups on a relatively stable dataset.
Wherobots is the better choice when you are running analytical queries across complex datasets, processing historical data at scale, or need compute costs that flex with actual usage rather than peak capacity. If you are currently evaluating your data stack, use this simple checklist to guide your architecture decision.

Stick with PostGIS if:

[ ] Your primary goal is powering a user-facing application.
[ ] You need to edit data manually (e.g., fixing property boundaries).
[ ] Your dataset is relatively stable and fits comfortably on one large server.
[ ] You require strict “ACID” transactions (meaning every write is confirmed and visible before the next read — no partial updates, no stale reads).

Move to Wherobots if:

[ ] You are waiting hours or days for analytical queries to finish.
[ ] Your cloud database bill is growing faster than your revenue.
[ ] You need to join two massive datasets (e.g., “All Buildings” + “All Parcels”).
[ ] You are building AI/Machine Learning models that need to “learn” from all your historical data.

The most competitive organizations today realize they don’t have to choose just one. They use Wherobots to crunch the data cheaply and efficiently, and then move the polished results into PostGIS for instant access—a strategy we will cover in our next post on the “Medallion Architecture”, a data design pattern where raw, refined, and production-ready data are stored in separate layers, each optimized for different workloads.

Ready to see what this looks like for your workload? Contact us and get a cost comparison built around your actual data volume.
Streaming Spatial Data into Wherobots with Spark Structured Streaming

Posted on March 18, 2026 (updated March 19, 2026) by Daniel Smith

Real-time Spatial Pipelines Shouldn’t Be This Hard (But They Were)

I’ve been doing geospatial work for over twenty years now. I’ve hand-rolled ETL pipelines, babysat cron jobs, and debugged more coordinate system mismatches than a person should reasonably endure in one lifetime. So when someone says “streaming spatial data,” my first reaction used to be something between a deep sigh and a nervous laugh. Here’s the thing: streaming tabular data with Spark Structured Streaming is a well-trodden path. There are tutorials everywhere. But streaming spatial data where you need to create geometry objects, apply spatial business rules, and land the results in a format that actually understands what a Point is (pun definitely intended); that’s where most guides quietly end. This post walks through a complete, working pipeline that takes raw telemetry data (fleet tracking, asset monitoring, IoT sensors, etc.) and streams it into a Wherobots-managed Iceberg table with full Sedona geometry support. No hand-waving… Real code… The kind of thing you can run on Wherobots Cloud this afternoon.

What We’re Building

The architecture is intentionally straightforward. Four components, one direction, no backflips required:

Data Source — Parquet files landing in an S3 bucket. Each file contains asset tracking events: lat/lon/altitude, speed, heading, operator, asset ID, and status. Think of this as the output from any fleet management system, IoT gateway, or sensor network.

Streaming Ingest Job — A Spark Structured Streaming job running on Wherobots that watches the S3 path, picks up new Parquet files as they arrive, converts raw coordinates into Sedona geometry objects, applies business rules (like a speeding flag), and writes each micro-batch to an Iceberg table in the Wherobots catalog.
Viewer Notebook — A Wherobots notebook that reads the catalog table and gives you stats, batch-level detail, a bar chart, and an interactive map colored by speeding status. The “so what” layer.

A Downstream Application Layer — Once the data is prepared and ready for consumption, it can be fed into a downstream application or database that can serve the data to other consumers: a map, or another data processing system.

The generator that produces the Parquet files is a separate concern; any system that drops Parquet into S3 works. We’re focused on the Wherobots side: how you consume, transform, and catalog streaming spatial data.

The Source Schema

The source data uses a flat schema — no geometry column yet. That’s deliberate. Most real-world telemetry arrives as plain numbers (latitude, longitude, altitude), not as WKB or WKT geometry strings. The spatial enrichment happens during ingest.

Column          Type        Description
event_id        STRING      Unique event identifier
timestamp       TIMESTAMP   Event time
x               DOUBLE      Longitude
y               DOUBLE      Latitude
z               DOUBLE      Altitude (meters)
speed_mph       DOUBLE      Speed in miles per hour
heading         DOUBLE      Compass heading (degrees)
operator_name   STRING      Fleet operator
asset_id        STRING      Individual asset identifier
asset_type      STRING      Asset category (e.g., truck, drone)
status          STRING      Operational status

Eleven columns. No geometry. That’s what Sedona is for.

Setting Up the Streaming Ingest

SedonaContext Initialization

The first thing the ingest job does is create a SedonaContext. On Wherobots Cloud, Spark is already pre-initialized — you just need to wrap it with Sedona capabilities:

from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Ensuring the Catalog Schema and Table Exist

Before the stream starts, we make sure the target schema and table exist.
This is idempotent — safe to run every time the job starts:

sedona.sql("CREATE SCHEMA IF NOT EXISTS wherobots.streaming_demo")

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.streaming_demo.asset_tracks (
    event_id STRING,
    timestamp TIMESTAMP,
    x DOUBLE,
    y DOUBLE,
    z DOUBLE,
    speed_mph DOUBLE,
    heading DOUBLE,
    operator_name STRING,
    asset_id STRING,
    asset_type STRING,
    status STRING,
    geometry GEOMETRY,
    is_speeding BOOLEAN,
    source_file STRING,
    batch_id LONG,
    job_run_id STRING
)
""")

One thing to call out here: the table has more columns than the source. The source Parquet has 11 fields. The table has 16. The extra five (geometry, is_speeding, source_file, batch_id, job_run_id) are all derived during the ingest process. This is the whole point; we’re enriching raw telemetry into spatially-aware, business-rule-tagged, lineage-tracked records.

The Spatial Transform

This is where the real work happens, and honestly it’s surprisingly compact. Here’s the transform that runs on every micro-batch:

transformed = raw.selectExpr(
    "event_id",
    "timestamp",
    "x", "y", "z",
    "speed_mph",
    "heading",
    "operator_name",
    "asset_id",
    "asset_type",
    "status",
    "ST_SetSRID(ST_PointZM(x, y, z, unix_timestamp(timestamp)), 4326) AS geometry",
    "CASE WHEN speed_mph > 65 THEN true ELSE false END AS is_speeding",
    "input_file_name() AS source_file",
)

Three derived columns in one selectExpr. Let’s break them down.

Creating Geometry with ST_PointZM

ST_SetSRID(ST_PointZM(x, y, z, unix_timestamp(timestamp)), 4326) AS geometry

ST_PointZM creates a 4D point geometry:

X = longitude
Y = latitude
Z = altitude in meters
M = the “measure” dimension — here we encode the unix timestamp

This is a pattern I’ve come to appreciate. By encoding time as the M value, each point carries its full spatiotemporal context in the geometry itself. If you later need to compute distances or trajectories, the temporal ordering is baked right into the geometry.
We wrap the whole thing in ST_SetSRID(..., 4326) to declare it’s in WGS 84.

Business Rule: The Speeding Flag

CASE WHEN speed_mph > 65 THEN true ELSE false END AS is_speeding

Nothing fancy. Speed over 65 mph? You’re speeding. This is computed once, during ingest, and stored in the catalog. Downstream consumers (dashboards, notebooks, APIs) read it directly — no need to recompute. This is the “silver layer” principle applied to streaming: apply your business rules at ingest time so the data is analysis-ready when it lands.

Business Rule: Geofence Violation Detection

Here’s where things get spatial. The is_speeding flag is a simple column-level rule — no external data needed. But what about spatial business rules that depend on other geometry? Things like: “is this asset inside a school zone buffer?” or “did this truck enter a restricted area?” That’s where the in_fence column comes in. During each micro-batch, we left-join the asset points against a table of geofence buffer polygons using ST_Intersects. If a point intersects any buffer, in_fence = true.

Why This Has to Happen Inside foreachBatch

Your first instinct might be to add the geofence check in the selectExpr alongside the spatial transform. I tried that. Spark won’t let you join a streaming DataFrame against a catalog table directly — the streaming DF is “unresolved” and the join planner doesn’t know what to do with it. The foreachBatch callback is the escape hatch: by the time your function is called, the DataFrame is fully materialized (static), so you can join it against anything. You might also try createOrReplaceTempView so you can write the join as SQL. That works locally, but on Wherobots Cloud it fails with TABLE_OR_VIEW_NOT_FOUND because the internal DROP that precedes the CREATE hits the catalog resolver. DataFrame API join avoids this entirely.
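To make the payoff of the M-as-timestamp pattern concrete outside of Spark, here is a pure-Python sketch: plain tuples stand in for XYZM point geometries, and per-segment speed falls out of the geometry alone, with no separate timestamp column. The coordinates and unix times below are made up for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(x1, y1, x2, y2):
    """Great-circle distance in meters between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(radians, (x1, y1, x2, y2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def segment_speeds(points_xyzm):
    """Speed (m/s) per segment of an XYZM trajectory. The M ordinate
    (unix seconds) supplies the time delta, so the geometry is enough."""
    speeds = []
    for (x1, y1, _z1, m1), (x2, y2, _z2, m2) in zip(points_xyzm, points_xyzm[1:]):
        speeds.append(haversine_m(x1, y1, x2, y2) / (m2 - m1))
    return speeds

track = [
    (116.3185, 39.9841, 55.0, 1_700_000_000),
    (116.3200, 39.9841, 55.0, 1_700_000_010),  # ~128 m east, 10 s later
]
print(round(segment_speeds(track)[0], 1))  # roughly 12.8 m/s
```

In the actual pipeline the equivalent logic runs in spatial SQL over the stored XYZM geometries; the point of the sketch is that once M carries the timestamp, temporal ordering and interval math travel with every point.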
Loading the Geofence Table: Broadcast + Cache

The geofence table is small (typically dozens to a few hundred buffer polygons) and it rarely changes while the stream is running. Re-reading it from the catalog on every micro-batch would be wasteful. Instead, we load it once at startup, broadcast it so every executor gets a local copy, and cache it so Spark doesn't re-evaluate it:

geofence_df = (
    spark.table(GEOFENCE_TABLE)
    .select(F.col("geometry").alias("fence_geom"))
    .cache()
)
geofence_df.count()  # materialize the cache

F.broadcast() tells Spark to use a broadcast hash join instead of a shuffle join: the small geofence DataFrame gets shipped to every executor and held in memory, so the join against each micro-batch is local and fast. The .cache() ensures the table read only happens once, and the .count() call forces materialization so the cache is warm before the first batch arrives.

One tradeoff: if you update the geofence table while the stream is running, the stream won't see the changes until you restart. For a table that changes rarely (new school zones don't appear every 30 seconds), that's a fine deal.

The Spatial Join Pattern

# geofence_df is the cached DataFrame loaded at startup.
# F.broadcast() goes here — at the join site — so Spark sees it as
# part of the join plan and uses a broadcast hash join (no shuffle).
df = (
    df.alias("a")
    .join(
        F.broadcast(geofence_df).alias("b"),
        F.expr("ST_Intersects(a.geometry, b.fence_geom)"),
        "left",
    )
    .select("a.*", F.expr("b.fence_geom IS NOT NULL AS in_fence"))
)

A few things to note:

Alias the geometry column on the geofence side to fence_geom. Both tables have a geometry column; without the alias, Spark can't resolve the join condition.

Left join ensures every asset point survives, even if it doesn't intersect any fence. Non-matching points get in_fence = false.

The IS NOT NULL check on fence_geom is what converts the join result into a boolean flag.
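The left-join-to-boolean semantics are worth internalizing, because they apply to any "is this point inside any zone?" rule. A pure-Python sketch of the same logic (a stand-in for ST_Intersects using a simple ray-casting point-in-polygon test; the fence coordinates are hypothetical, not the demo's data):

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: does pt fall inside the polygon ring?"""
    x, y = pt
    inside = False
    # Walk each edge; toggle 'inside' each time a horizontal ray
    # from pt crosses an edge.
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

fences = [[(0, 0), (2, 0), (2, 2), (0, 2)]]  # one square geofence
points = [(1, 1), (5, 5)]

# Left-join semantics: every point survives, and the flag mirrors
# the "b.fence_geom IS NOT NULL" check from the Sedona join.
flagged = [(p, any(point_in_polygon(p, f) for f in fences)) for p in points]
# flagged -> [((1, 1), True), ((5, 5), False)]
```

The `any(...)` call is the pure-Python analogue of the left join plus IS NOT NULL: a point inside at least one fence gets True, everything else gets False, and no point is dropped.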
File-Level Lineage with input_file_name()

input_file_name() AS source_file

input_file_name() is a Spark SQL function that resolves to the path of the source file for each row. It has to be called in the selectExpr during the read phase — before foreachBatch materializes the DataFrame — because the file metadata is lost after materialization. This gives you row-level lineage back to the exact Parquet file that produced each record. Useful for debugging, auditing, and answering the question "where did this data come from?"

The foreachBatch Writer

The write side uses Spark's foreachBatch pattern, which gives you a regular DataFrame and a batch ID for each micro-batch:

def write_batch(df, batch_id):
    count = df.count()
    if count > 0:
        df = (
            df.withColumn("batch_id", F.lit(batch_id).cast("long"))
            .withColumn("job_run_id", F.lit(JOB_RUN_ID))
        )
        df.writeTo(CATALOG_TABLE).append()

Adding Batch and Job Run Lineage

Two columns are added inside foreachBatch because they can't be computed earlier:

batch_id — Spark's micro-batch identifier. Resets to 0 on each job run.

job_run_id — Pulled from the Wherobots environment variable WBC_LABEL_product_instance_id. This is the unique ID for each Wherobots Job Run, which means you can trace every record back to the specific job execution that produced it.

Together, batch_id + job_run_id give you globally unique batch identification across job runs. Since batch_id resets to 0 on every restart, job_run_id is what makes cross-run analysis possible.

The Wherobots-Idiomatic Write Pattern

df.writeTo(CATALOG_TABLE).append()

Not df.write.format("iceberg").mode("append").save(path). The .writeTo().append() pattern uses the catalog-managed table reference, which means Wherobots handles the underlying storage location, metadata, and Iceberg commit lifecycle. It's cleaner, and it's how catalog-native writes are meant to work.
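The reason batch_id alone is not a lineage key, while the (job_run_id, batch_id) pair is, can be shown in a few lines of plain Python (hypothetical run IDs, not actual Wherobots values):

```python
# batch_id restarts from 0 on every job run, so it is only unique
# within a run. Pairing it with job_run_id makes the key global.
records = [
    {"job_run_id": "run-a", "batch_id": 0, "event_id": "e1"},
    {"job_run_id": "run-a", "batch_id": 1, "event_id": "e2"},
    {"job_run_id": "run-b", "batch_id": 0, "event_id": "e3"},  # restart: batch_id reset
]

batch_keys = {(r["job_run_id"], r["batch_id"]) for r in records}
assert len(batch_keys) == 3                         # three distinct micro-batches
assert len({r["batch_id"] for r in records}) == 2   # batch_id alone collides
```

Grouping on the pair, as the batch-level detail query later in the post does, is what keeps batches from different job runs distinct.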
Wiring It All Together

The streaming query connects the read, transform, and write stages:

query = (
    transformed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", CHECKPOINT_PATH)
    .trigger(processingTime="30 seconds")
    .start()
)

The checkpoint location (an S3 path) tracks which files have already been processed. If the job restarts, it picks up exactly where it left off — no duplicate processing, no missed files. The trigger interval of 30 seconds means Spark checks for new files twice a minute.

Viewing the Results

Once data starts landing in the catalog, a Wherobots notebook can read it directly:

df = sedona.table("wherobots.streaming_demo.asset_tracks")

That's it. No path management, no format specification, no schema declaration. The catalog knows what the table is, where it lives, and what the schema looks like.

Ingest Stats and Lineage

The viewer notebook computes summary statistics including lineage metrics:

stats = df.agg(
    F.count("*").alias("total_records"),
    F.countDistinct("asset_id").alias("unique_assets"),
    F.countDistinct("batch_id").alias("unique_batches"),
    F.countDistinct("job_run_id").alias("unique_job_runs"),
    F.countDistinct("source_file").alias("unique_source_files"),
    F.min("timestamp").alias("earliest_event"),
    F.max("timestamp").alias("latest_event"),
    F.round(F.avg("speed_mph"), 1).alias("avg_speed_mph"),
)

Batch-Level Detail

Because we captured batch_id, job_run_id, and source_file during ingest, we can inspect exactly what landed in each batch:

df.groupBy("job_run_id", "batch_id").agg(
    F.count("*").alias("records"),
    F.countDistinct("source_file").alias("files_in_batch"),
    F.min("timestamp").alias("first_event"),
    F.max("timestamp").alias("last_event"),
).orderBy("job_run_id", "batch_id").show(50, truncate=False)

Records per Batch — Bar Chart

A quick matplotlib chart shows the volume of each micro-batch, which is useful for understanding whether your trigger interval and maxFilesPerTrigger settings are well-tuned:

import matplotlib.pyplot as plt

batch_pdf = (
    df.groupBy("batch_id")
    .agg(F.count("*").alias("records"))
    .orderBy("batch_id")
    .toPandas()
)

fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(batch_pdf["batch_id"], batch_pdf["records"], color="steelblue")
ax.set_xlabel("Batch ID")
ax.set_ylabel("Records")
ax.set_title("Records per Batch")
plt.tight_layout()
plt.show()

The Map: PyDeck with Speeding Visualization

The payoff. A PyDeck map over Iowa with CARTO's dark-matter basemap, two stacked layers, and a four-color scheme that encodes both is_speeding and in_fence at a glance:

Blue — normal: not speeding, not in a geofence
Red — speeding only
Orange — geofence violation only
Magenta — speeding and in a geofence

Points that hit a geofence get a larger radius (24px, or 30px if also speeding) so they pop out of the sea of blue dots.

def assign_color(row):
    if row["is_speeding"] and row["in_fence"]:
        return (200, 30, 200)   # magenta — both violations
    elif row["in_fence"]:
        return (255, 140, 0)    # orange — geofence only
    elif row["is_speeding"]:
        return (220, 50, 50)    # red — speeding only
    else:
        return (30, 100, 220)   # blue — normal

sample_pdf["radius"] = sample_pdf.apply(
    lambda r: 30 if (r["is_speeding"] and r["in_fence"]) else 24 if r["in_fence"] else 12,
    axis=1,
)

The map itself uses two layers stacked together. Underneath, a PolygonLayer renders the geofence buffer polygons as semi-transparent orange regions, so you can see the zones that triggered violations.
On top, the ScatterplotLayer plots every asset point with the four-color scheme:

import pydeck as pdk

geofence_layer = pdk.Layer(
    "PolygonLayer",
    data=geofence_pdf,
    get_polygon="coordinates",
    get_fill_color=[255, 140, 0, 40],
    get_line_color=[255, 140, 0, 160],
    get_line_width=2,
    pickable=True,
)

points_layer = pdk.Layer(
    "ScatterplotLayer",
    data=sample_pdf,
    get_position=["x", "y"],
    get_fill_color=["color_r", "color_g", "color_b", 180],
    get_radius="radius",
    pickable=True,
    auto_highlight=True,
)

deck = pdk.Deck(
    layers=[geofence_layer, points_layer],
    initial_view_state=pdk.ViewState(latitude=41.9, longitude=-93.4, zoom=6.5),
    tooltip={
        "text": "{asset_id} ({asset_type})\n{operator_name}\nSpeed: {speed_mph} mph\nSpeeding: {is_speeding}\nIn Fence: {in_fence}\nBatch: {batch_id}"
    },
    map_style="https://basemaps.cartocdn.com/gl/dark-matter-gl-style/style.json",
)
deck.show()

The sample is capped at 50,000 points for rendering performance. And because both is_speeding and in_fence are already in the catalog, we're just reading them — not recomputing anything in the notebook. That's the whole point of doing the enrichment at ingest time.

Why This Pattern Matters

Let me step back and talk about why this approach is worth your time, beyond the "hey, cool map" factor.

Spatial Enrichment at Ingest, Not at Query Time

In a lot of organizations, raw coordinates sit in tables as plain doubles, and every analyst who needs geometry has to create it themselves. That means everyone is writing their own ST_Point calls, hoping they got the coordinate order right, and probably not setting the SRID. Baking the geometry creation into the ingest pipeline means it's done once, correctly, and everyone downstream gets a proper GEOMETRY column with SRID 4326.

Business Rules as First-Class Columns

The is_speeding flag is one example, but we went further.
The in_fence column demonstrates the same principle with a spatial join: during each micro-batch, every point is checked against a table of geofence buffer polygons using ST_Intersects. Speed violations, geofence containment, and proximity to restricted areas are all computed during ingest using Sedona on Wherobots and stored as columns in the Iceberg table. Downstream consumers don't need to know the logic. They just filter on a boolean.

Lineage Without Extra Infrastructure

source_file, batch_id, and job_run_id give you three levels of traceability — file, batch, and job run — without deploying a separate lineage system. When someone asks "where did this record come from?", you can answer with a SQL query against the same table.

Iceberg: Geometry as a Native Type

This is the part that would have made 2010-me cry tears of joy. The GEOMETRY column in an Iceberg table is a native type (as of Iceberg v3), not a serialized blob or a WKT string. Spatial predicates can push down to the storage layer. You can query the table with ST_Contains, ST_DWithin, ST_Intersects — the full Sedona function catalog — and the Iceberg metadata helps prune partitions that can't possibly match. That's not something you get from storing WKT in a STRING column in a regular Parquet table.

Catalog-Managed Tables, Not Path-Managed Files

sedona.table("wherobots.streaming_demo.asset_tracks") is all a downstream consumer needs. No S3 paths, no Parquet glob patterns, no format hints. The Wherobots catalog manages the table lifecycle: schema evolution, ACID transactions, snapshot isolation, time travel. If you've ever debugged a pipeline that broke because someone renamed an S3 prefix, you understand why this matters.

Key Takeaways

Spark Structured Streaming works natively on Wherobots. You can use readStream / writeStream with foreachBatch just like you would in any Spark environment — but with Sedona spatial functions and Iceberg tables available out of the box.
Convert coordinates to geometry at ingest time. Use ST_PointZM (or ST_Point, ST_PointZ depending on your data) with ST_SetSRID to create proper geometry objects. Don't make every downstream consumer do this.

Apply business rules during ingest. Flags like is_speeding, geofence containment, proximity alerts — compute them once and store them as columns. The silver layer principle applied to streaming.

Capture lineage in the stream. input_file_name() for file-level tracing, batch_id from foreachBatch, and WBC_LABEL_product_instance_id for job-run-level tracing. Three columns, zero additional infrastructure.

Write with .writeTo().append(), not .write.format().save(). Use the catalog-managed write pattern. Let Wherobots handle the storage details.

About the Author

Daniel Smith is a Solution Architect at Wherobots with over two decades of experience in geospatial technology. He builds demos, helps customers, breaks things, fixes them, and writes about what he learns from time to time. When he's not wrestling with coordinate reference systems or touching things he shouldn't, he's probably skating or convincing his kids that geography is, in fact, the coolest subject. You can find him on LinkedIn.

Ready to stream spatial data into your own Wherobots catalog? Get Started with Wherobots
WherobotsDB is 3x faster with up to 45% better price performance

Posted on March 11, 2026 by Damian

Today we are announcing that the next generation of WherobotsDB, the Apache Sedona and Spark 4 compatible engine, is now generally available. Compared to the previous generation of WherobotsDB, the new architecture accelerates queries by up to 3x, with up to 45% better price performance. Previously in preview, it is now offered through the latest version of WherobotsDB.

How Customers Use WherobotsDB

Our customers use WherobotsDB to create insights from spatial data at scale that result in improved products, services, and decision making in the physical world. They are realizing breakthroughs in fleet operations, improving their risk projections, increasing the accuracy of vegetative forecasts, analyzing change, and overall becoming more capable of innovating against physical-world interests.

"With Wherobots on AWS, not only can we easily scale to millions of acres and continuous tractor telemetry normalization within LeafLake, but also we can rest assured that our costs won't spiral out of control." — G. Bailey Stockdale, CEO, Leaf Agriculture

The workloads customers run, or want to run, are becoming more ambitious too. Customers and AI alike demand better/faster/cheaper solutions for working with a wide variety of raster and vector spatial datasets, and they need to fuse this data with valuable business context as well. And of course, they want to do it without scale and function limitations. WherobotsDB was always designed from the ground up to meet these needs. But solutions are now even easier to realize, because the latest version of WherobotsDB allows you to do more, faster, at a lower cost.

WherobotsDB Benchmark Results: TPC-H and SpatialBench Performance

The next generation of WherobotsDB delivers a substantial performance increase for both spatial and non-spatial queries.
Compared to the previous engine, our benchmark runs of TPC-H and SpatialBench both show significant performance improvements across scale factors of 100 and 1000.

Up to a 3x peak acceleration for analytical queries based on TPC-H at a scale factor of 1000. Workloads can see a nearly threefold increase in throughput. The mean observed acceleration was 1.7x.

Up to a 2.5x acceleration for spatial queries and joins based on SpatialBench at a scale factor of 1000. Spatial operations, such as intersects and joins, complete significantly faster on small to massive datasets. The mean observed acceleration was 1.9x.

Up to a 45% improvement in price performance. Customers can achieve more at less cost, and the added performance makes it possible for interactive workloads to downscale into smaller runtimes. Compared to the previous version of WherobotsDB, the average improvement in price performance was 25%.

Price Performance vs Next Best Engine

This comparison shows the total cost of all SpatialBench queries at SF 1000 that the next best engine could finish under a timeout of 10 hours, which limited the comparison to Q1-Q5 and Q7. The remaining queries (Q6, Q8-Q12) could not be completed by that engine and were excluded from this analysis. The current generation of WherobotsDB is more capable and 46% lower cost than the next best engine, which is a popular Spark-based serverless engine with Spatial SQL support.

How the current version of WherobotsDB compares on cost to the previous version, as well as the next best engine.

WherobotsDB Capabilities: What No Other Engine Offers

WherobotsDB is the only engine capable of meeting the following spatial data requirements that customers have.
You get:

✅ high-performance, cost-efficient, and scalable vector, raster, and tabular data operations in a unified query environment
✅ compatibility with Spark 4 and Sedona
✅ interoperability with zero-copy on lakehouses and data lakes to keep data in your control
✅ unification with RasterFlow, to easily orchestrate planetary-scale inference and analytics workflows starting with raw imagery datasets

The following matrix isolates the vector data processing capabilities of the next best alternatives to WherobotsDB using SpatialBench runs at a scale factor of 1000. Raster capabilities were not compared, because WherobotsDB was the only engine in the set that supports them.

Apache Sedona SpatialBench query capability matrix running on WherobotsDB in comparison to multiple other Spark-based engines.

Contact us and we can share additional details, or rerun these benchmarks on an engine of your choice.

WherobotsDB Architecture: Rust-Native, Arrow-Columnar Execution

The original architecture for WherobotsDB was built on a JVM-based execution model. The JVM is an extraordinary platform for distributed computing, but its row-oriented execution model and memory management introduce overhead that compounds at scale, especially for spatial-heavy workloads where every row carries complex spatial objects that must be serialized, deserialized, and processed one at a time. The new version of WherobotsDB addresses these bottlenecks by replacing the JVM-based execution layer with a Rust-native, Arrow-columnar engine optimized for spatial data execution.

Native Spatial Execution with SedonaDB

The latest version of WherobotsDB takes advantage of SedonaDB, an open-source, blazing-fast analytical database engine where geospatial data is a first-class citizen. The use of Rust and SedonaDB allowed us to move spatial logic out of the JVM and directly onto the native execution layer.
It provides a unified execution model that supports everything from scalar and window functions to complex spatial joins, aggregations, and geometry operations.

Apache DataFusion Integration

Apache DataFusion is a Rust-native query engine built from the ground up on the Apache Arrow in-memory columnar format. By integrating DataFusion's native execution with WherobotsDB's distributed engine, you get the best of both worlds: WherobotsDB's battle-tested scheduling and DataFusion's high-performance native processing. This means your existing WherobotsDB-based workflows, SQL queries, and Python notebooks continue to work exactly as before, but the actual computation now runs in optimized native code rather than in the JVM.

Zero-Copy Efficiency

To remove the significant overhead of data conversion, we implemented a high-performance native geometry type based on the GeoArrow specification. This allows for zero-copy data handling, utilizing Arrow's nested memory layout to represent geometries and geographies without the costly serialization and deserialization steps typically found in spatial databases.

Spark 4 Compatibility

WherobotsDB is now compatible with the latest features of Spark 4 to provide a modern, robust environment that enforces ANSI SQL by default. This upgrade integrates the Wherobots engine with the newest advancements in distributed computing, including improved query planning and execution protocols.

Try WherobotsDB Free: 30-Day Trial

Get started now with a 30-day, $300 free trial available for the Professional Edition of Wherobots. The latest version of WherobotsDB is generally available today and is the default version for all new runtimes on Wherobots Cloud. Because ANSI SQL is now enforced, if you're an existing customer we recommend testing your workloads on the latest version of WherobotsDB in a notebook, SQL session, or job run. For most workloads, the upgrade should be seamless.
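When testing, the ANSI enforcement is the change most likely to surface: under ANSI mode, operations that previously returned NULL or wrapped silently (invalid casts, integer overflow) raise errors instead. If you need the legacy behavior while migrating, standard Spark exposes a session-level toggle; this is a generic Spark 4 setting, not Wherobots-specific guidance:

```sql
-- ANSI mode is on by default in Spark 4; disabling it per session
-- restores the legacy permissive behavior while you migrate.
SET spark.sql.ansi.enabled = false;

-- Under ANSI mode (the default), a query like the following raises
-- an error instead of silently returning NULL:
-- SELECT CAST('not-a-number' AS INT);
```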
If you have questions about the upgrade, experience unexpected behavior, want a custom benchmark, or would like to discuss how Wherobots can benefit your business, reach out to the Wherobots team at support@wherobots.com, sales@wherobots.com, or by filling out our contact us form.
It takes 15 minutes for the Caltrain to get from Sunnyvale to SAP Center

Posted on February 19, 2026 by Pouyan Aminian

That's how long it took our MCP server to go from "how many bus stops are in Maryland" to an answer.

I've been doing a lot of reading lately on how AI is going to transform spatial workloads, and that curiosity led me to this post on geoMusings. Here, Bill demonstrates how Claude Code and agent skills can be used to wire up a chat-to-query-results interface in a few hours. He showcased the new skill by getting the agent to query his local Postgres instance for the number of Metro bus stops in Maryland, which returned a precise 4,563.

"I need to count the number of records in the metro_bus_stops table that are inside Maryland. The database is at localhost:5432, database name is "dev", user "postgres" with password "postgres". Points table: public.metro_bus_stops (geometry column: geom, id column: id). Polygons table: public.maryland_boundary (geometry column: geom, name column: name)"

As a dabbler in AI agents and a minor contributor to Wherobots' very own MCP server, I immediately wondered how our MCP server would do against such a challenge. So I fired up VS Code and just straight up asked: "How many bus stops are in Maryland?"

Bear in mind, at the time I did not know whether we had any bus stop data in Wherobots' data catalogs, what shape that data was in, or whether the MCP server could come up with a reasonable administrative boundary for Maryland. And I fired off this query just as my Caltrain was departing Sunnyvale station.

In about 5 minutes, the MCP server had identified two tables with bus stop information called places_place under the Overture Maps Foundation database in Wherobots Open Catalog. It achieved that by exploring our catalog and running sample queries against those tables to find the right data; all with zero human intervention.
We are right around Lawrence Station at this point.

In the next 5 minutes, the MCP server ran a series of queries against that table, self-identified errors (i.e., got 0 results and understood that was not expected), adjusted the query, switched tables, and changed approaches until it was able to produce actual results. Our MCP server believes there are 19,740 bus stops in Maryland, which is ~5 times as many as Bill's post suggests.

We just got to Santa Clara station, by the way, for those of you who are still following.

So, being a good aspiring data engineer, I challenged the MCP server:

"Why does this blog think there are only 4563 then? https://blog.geomusings.com/2026/01/14/spatial-analysis-with-claude-code/"

The MCP server went back to work and gave me the diagnosis: Bill's query is focused on Metro bus stops, and my original question did not specify that.

So in the last 5 minutes of this journey, I asked it to focus on Washington Metropolitan Area Transit Authority (WMATA) bus stops only and see what it comes up with! And just as we were about to pull into San Jose Diridon Station, the MCP server told me that there are 6,224 Metro bus stops in Maryland.

Now, whether there are 4,563 Metro bus stops in Maryland or 6,224 is a matter to be validated with people far more knowledgeable than myself on buses and their stops. The main point is that AI is making it possible for non-experts like myself to go from a question (expressed in natural language) to real insights in minutes (well, a 15-minute train ride to be precise). Wherobots MCP is giving the AI the ability to answer questions about the real world.

In the real world, I would have asked the MCP server to generate a Notebook for me to reproduce this output and plot it on a map. I would then share that with my colleague to help me validate, correct, and optimize my findings.
What would have taken days to weeks (to go from theory to some early explorations to a shareable PoC and, finally, to production-quality code) can now be achieved in a matter of hours. The Caltrain experiment was just one question. In our recent office hours, we walked through the MCP server end to end, showing how it explores catalogs, generates spatial queries, debugs errors, and produces reproducible outputs. See the full workflow in action. Want to get started with our MCP server? Check out our getting started guide. It takes less than 5 minutes to configure the server and start chatting with the physical world! Try Wherobots Get Started
Scaling Spatial Analysis: How KNN Solves the Spatial Density Problem for Large-Scale Proximity Analysis

Posted on February 5, 2026 by Pranav Toggi

How we processed 44 million geometries across 5 US states by solving the spatial density problem that breaks traditional spatial proximity analysis.

When scaling spatial proximity analysis from city to state to national level, the hidden challenge isn't computational power—it's spatial density. The techniques that work perfectly for urban neighborhoods fail dramatically when applied across heterogeneous landscapes.

The standard professional approach—using ST_DWithin with a fixed search radius—breaks down when spatial density varies. A 500-meter radius might capture 20 candidate features in Manhattan but zero in rural Wyoming. No single distance works for both.

This article demonstrates how k-nearest neighbors (ST_KNN) solves this problem. Unlike fixed-radius predicates that yield density-dependent result sets, KNN applies a top-k constraint—guaranteeing bounded cardinality regardless of local feature distribution. No distance threshold tuning required.

To validate this approach, we ran a buildings-to-roads proximity analysis across five US states on Wherobots Cloud: 44.4 million buildings against 535,000 road segments in 2.3 hours for $157—less than half a cent per geometry. The technique applies equally to any spatial proximity problem: customers to stores, facilities to services, properties to amenities.

The Professional's Dilemma: Static vs Adaptive Search

The ST_DWithin Approach

For any GIS professional, the standard approach to spatial proximity analysis is ST_DWithin with a fixed radius:

sedona.sql('''
SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance
FROM query_geometries a
JOIN target_geometries b
  ON ST_DWithin(a.geometry, b.geometry, 500, true)  -- Fixed 500m radius
ORDER BY distance''')

This works beautifully for spatially homogeneous regions.
But scale it to a state or nation, and you hit the spatial density problem—where non-uniform feature distribution causes query behavior to become unpredictable.

The Spatial Density Problem

Consider the same ST_DWithin(geometry, 500m) query across different spatial contexts:

Figure 1: The same ST_DWithin query produces many candidates in dense urban areas but 0 candidates in sparse rural areas.

The same query produces wildly different result set cardinalities based on local feature density.

The Radius Paradox

Attempting to solve this with radius adjustment creates new problems:

| Radius | Urban Result | Rural Result | Problem |
|--------|--------------|--------------|---------|
| 500m | 20 candidates | 0 candidates | Rural queries fail |
| 2km | 200 candidates | 3 candidates | Urban over-processing |
| 10km | 2000+ candidates | 10 candidates | Urban becomes intractable |

No single radius value works for heterogeneous spatial data: increase it to capture sparse regions, and dense regions become intractable with combinatorial explosion in candidate counts. This isn't a theoretical problem—it's the practical limitation that prevents reliable large-scale spatial analysis using traditional methods.

KNN Spatial Analysis: Solving the Density Problem at Scale

K-nearest neighbors elegantly solves the spatial density problem by enforcing a cardinality constraint rather than a distance constraint—always returning exactly k candidates regardless of local feature density. A useful mental model: ST_DWithin applies a distance predicate (fixed radius, variable cardinality), while ST_KNN applies a rank-based predicate (fixed cardinality, variable distance). This produces the effect of a density-adaptive search area—though KNN doesn't compute any radius internally.
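The cardinality contrast is easy to demonstrate on synthetic data. A pure-Python sketch (plain Euclidean distance on hypothetical coordinates, not Sedona) comparing a fixed-radius query against a top-k query over a dense "urban" cluster and a sparse "rural" scatter:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def within_radius(query, targets, radius):
    """ST_DWithin-style: fixed radius, variable result count."""
    return [t for t in targets if dist(query, t) <= radius]

def knn(query, targets, k):
    """ST_KNN-style: fixed result count, variable distance."""
    return sorted(targets, key=lambda t: dist(query, t))[:k]

# Dense "urban" cluster: 100 points packed into a tiny area.
urban = [(0.01 * i, 0.01 * j) for i in range(10) for j in range(10)]
# Sparse "rural" scatter: 12 points spread far apart.
rural = [(100.0 + 15.0 * i, 100.0) for i in range(12)]

q_urban, q_rural = (0.05, 0.05), (101.0, 100.0)
radius = 0.5

print(len(within_radius(q_urban, urban, radius)))  # 100 -- dense: huge result set
print(len(within_radius(q_rural, rural, radius)))  # 0 -- sparse: radius too small
print(len(knn(q_urban, urban, 10)), len(knn(q_rural, rural, 10)))  # 10 10 -- always k
```

The fixed radius returns everything in the dense cluster and nothing in the sparse one; the top-k query returns exactly 10 candidates in both, with the candidate distances (not the count) absorbing the density difference.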
The Effect in Practice

sedona.sql('''
SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance
FROM query_geometries a
JOIN target_geometries b
  ON ST_KNN(a.geometry, b.geometry, 10)  -- Always 10 candidates
ORDER BY distance''')

The same query now produces consistent results across all spatial densities:

Figure 2: KNN always returns exactly k candidates—nearby ones in dense areas, more distant ones in sparse areas.

KNN doesn't compute or adjust a radius. It simply finds the k nearest neighbors, wherever they are. In dense areas, those neighbors happen to be close; in sparse areas, they're farther away. The "adaptive" behavior emerges naturally from asking "what's nearest?" rather than "what's within X meters?"

The Adaptive Advantage

| Metric | ST_DWithin | ST_KNN |
|--------|------------|--------|
| Result cardinality | Variable (0 to 1000+) | Bounded (k) |
| Time complexity | Data-dependent | Predictable O(n × k) |
| Result quality | Density-dependent | Distribution-invariant |
| Parameter tuning | Manual, error-prone | Not required |

The Key Insight: ST_DWithin asks "What's within X meters?" (fixed radius, variable candidates). ST_KNN asks "What are the K nearest?" (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required.

Note: This "adaptive" behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It's distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.

The Two-Stage Pattern for Accurate Spatial Proximity

KNN computes distances using geometry centroids (or bounding box representatives) for computational efficiency. For complex geometries like linestrings or large polygons, the centroid-to-centroid distance may diverge significantly from the true minimum distance.
We address this with a two-stage refinement approach:

Stage 1 (Candidate Generation): KNN selects k candidates using approximate centroid-based distance — O(n log k)

Stage 2 (Exact Refinement): Precise geometric calculation (ST_ClosestPoint, ST_DistanceSpheroid) on the k candidates only — O(k) per query point

This decomposition gives us both fast candidate pruning and geometrically precise results.

KNN Implementation Pattern

Here's the complete two-stage pattern using Wherobots SQL:

Stage 1 – Candidate Generation:

knn_df = sedona.sql('''
SELECT
  query.id AS query_id,
  query.geometry AS query_geometry,
  target.id AS target_id,
  target.geometry AS target_geometry,
  -- Exact closest point calculation
  ST_ClosestPoint(target.geometry, query.geometry) AS closest_point,
  -- Precise spheroidal distance
  ST_DistanceSpheroid(
    ST_ClosestPoint(target.geometry, query.geometry),
    query.geometry
  ) AS distance_meters
FROM query_geometries AS query
JOIN target_geometries AS target
  ON ST_AKNN(query.geometry, target.geometry, 10, false)''')

knn_df.writeTo('wherobots.pranav.knn_candidates').create()

Stage 2 – Exact Refinement:

sedona.sql('''
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY query_id
      ORDER BY distance_meters ASC
    ) AS rank
  FROM wherobots.pranav.knn_candidates
)
SELECT * FROM ranked WHERE rank = 1;''')

Key functions:

ST_AKNN(..., 10, false): Approximate KNN with k=10 candidates, Euclidean distance. Why ST_AKNN over ST_KNN? When using KNN for candidate generation (followed by exact refinement via ST_ClosestPoint + ST_DistanceSpheroid), approximate KNN is preferred. The relaxed precision bounds of ST_AKNN enable faster index traversal, and any approximation error is eliminated in the refinement stage.

ST_ClosestPoint: Precise closest point on the target geometry.

ST_DistanceSpheroid: Accurate geodetic distance in meters.

ROW_NUMBER(): Rank candidates and select the true closest.

The pattern works for all large non-point geometries.
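The two-stage decomposition can be sketched without Spark. A pure-Python toy (all names and geometries hypothetical) where stage 1 ranks road segments by a cheap centroid distance and stage 2 refines only the k survivors with a more exact metric, here the minimum vertex distance standing in for ST_ClosestPoint + ST_DistanceSpheroid:

```python
import math

def centroid(line):
    xs, ys = zip(*line)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def approx_dist(pt, line):
    """Stage 1: cheap centroid-based distance (like KNN candidate search)."""
    cx, cy = centroid(line)
    return math.hypot(pt[0] - cx, pt[1] - cy)

def exact_dist(pt, line):
    """Stage 2: 'expensive' refinement, a stand-in for the true
    minimum distance: here, distance to the nearest vertex."""
    return min(math.hypot(pt[0] - x, pt[1] - y) for x, y in line)

roads = [
    [(0, 0), (0, 10)],   # long segment: centroid is far from pt,
                         # but one endpoint is very close
    [(3, 3), (4, 4)],
    [(8, 8), (9, 9)],
]
pt, k = (0.5, 0.5), 2

# Stage 1: prune to k candidates with the cheap metric.
candidates = sorted(roads, key=lambda r: approx_dist(pt, r))[:k]
# Stage 2: exact calculation on the k candidates only.
nearest = min(candidates, key=lambda r: exact_dist(pt, r))
# nearest -> [(0, 0), (0, 10)]
```

Note that stage 1 actually ranks the short diagonal segment ahead of the long one (its centroid is closer), yet stage 2 correctly picks the long segment whose endpoint is nearest. That is exactly why refinement on the candidates matters for linestrings and large polygons.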
Results: Consistent Performance Across Spatial Densities

We validated this approach using a buildings-to-roads proximity analysis across five US states. Each state contains a mix of dense urban, suburban, and sparse rural areas—exactly the heterogeneous density that breaks traditional methods.

State      | Buildings (Query Geometries) | Roads (Target Features) | Time (sec) | Cost    | Throughput
New York   | 6,447,782                    | 75,329                  | 1,374      | $25.31  | 4,693/sec
Texas      | 13,289,136                   | 166,943                 | 2,140      | $41.60  | 6,208/sec
Colorado   | 2,764,970                    | 34,432                  | 1,074      | $21.35  | 2,574/sec
California | 13,648,296                   | 135,247                 | 2,086      | $36.95  | 6,541/sec
Florida    | 8,201,965                    | 122,048                 | 1,490      | $30.35  | 5,504/sec
Total      | 44,425,199                   | 535,538                 | 8,164      | $157.08 | 5,301/sec

Key Observations: Consistent Throughput: Average of 5,300 geometries per second across all states, despite each containing vastly different urban/rural mixtures. This consistency is the adaptive search radius in action. Cost Predictability: roughly $3.50 per million geometries regardless of local spatial density. Budget with confidence. Linear Scaling: Processing time scales linearly with geometry count—no density-dependent surprises.

Practical Guidance

ST_KNN vs ST_AKNN for Proximity Analysis: For the two-stage pattern described in this article, use ST_AKNN (approximate) rather than ST_KNN (exact):

Function | Speed  | Precision   | Best For
ST_KNN   | Fast   | Exact       | When the KNN result is the final answer
ST_AKNN  | Faster | Approximate | Search space reduction before exact calculations

Since we’re computing exact distances with ST_DistanceSpheroid on the candidates anyway, the approximate nature of ST_AKNN has no impact on final accuracy—only on speed.
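As a sanity check, the summary figures can be recomputed from the per-state rows (the published totals differ slightly from the raw column sums, presumably due to rounding or overhead not attributed to a single state):

```python
# Per-state rows from the benchmark table: (buildings, seconds, cost in USD)
states = {
    "New York":   (6_447_782, 1_374, 25.31),
    "Texas":      (13_289_136, 2_140, 41.60),
    "Colorado":   (2_764_970, 1_074, 21.35),
    "California": (13_648_296, 2_086, 36.95),
    "Florida":    (8_201_965, 1_490, 30.35),
}

total_geoms = sum(g for g, _, _ in states.values())
total_secs = sum(t for _, t, _ in states.values())
total_cost = sum(c for _, _, c in states.values())

print(total_secs)               # 8,164 sec — about 2.3 hours, matching the conclusion
print(round(total_secs / 3600, 2))
print(total_geoms)              # within ~0.2% of the published 44,425,199 total
print(round(total_cost, 2))     # close to the published $157.08 total
print(total_cost / total_geoms) # a few millionths of a dollar per geometry
```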
Using Distance Bounds with KNN

If you know the maximum acceptable distance for your use case, add a distance bound parameter to further optimize performance:

-- Only consider candidates within 5000 meters
ST_AKNN(query.geometry, target.geometry, 10, TRUE, 5000)

Unlike the conceptual “adaptive search area” discussed earlier, this is an actual distance predicate pushed down to the spatial partitioning stage—not applied as post-hoc filtering. Candidates beyond this threshold are pruned during index traversal, reducing I/O and computation. This is useful when: Business logic requires a limit: e.g., “nearest hospital within 10km”. Performance optimization: You know neighbors beyond X meters aren’t relevant. Emergency response: Facilities must be within a critical response distance.

KNN Performance Optimization Tips: Materialize intermediate results: Cache the KNN output, or write it to disk as an Iceberg table, before applying filters. Process categories separately: Run KNN for each target type (e.g., motorways, then trunk roads) independently. Use appropriate runtime: A medium runtime handled 13M geometries in ~35 minutes.

Applications of KNN-Based Spatial Proximity Analysis: The adaptive search radius pattern applies to any spatial proximity problem with heterogeneous density: Infrastructure & Planning: Properties → nearest utilities, transit, services; Customers → nearest stores, facilities, competitors. Environmental Analysis: Development sites → nearest protected areas, water bodies; Facilities → nearest residential areas, schools. Business Intelligence: Locations → nearest amenities, employment centers; Assets → nearest maintenance facilities, resources. General Pattern: For each query geometry A, find the target geometry B that minimizes some expensive function f(A, B). Use KNN to reduce candidates, then apply the exact calculation to the reduced set.
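Here is a sketch of what a distance bound does to a k-NN search, in pure Python with hypothetical data (a linear scan over a bounded max-heap; real ST_AKNN prunes during spatial index traversal rather than scanning): candidates beyond the bound never enter the candidate set at all.

```python
import heapq
import math
import random

def knn_with_bound(points, q, k, max_dist=None):
    """k nearest neighbors of q, optionally discarding candidates beyond max_dist.

    Mirrors the idea of the optional distance bound: the bound prunes candidates
    during the search itself, not as a post-hoc filter on the result."""
    heap = []  # max-heap of (-distance, point), capped at k entries
    for p in points:
        d = math.dist(p, q)
        if max_dist is not None and d > max_dist:
            continue  # pruned: never becomes a candidate
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, p))
    return sorted((-nd, p) for nd, p in heap)

random.seed(1)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(10_000)]
q = (50, 50)

unbounded = knn_with_bound(points, q, k=10)
bounded = knn_with_bound(points, q, k=10, max_dist=5.0)
# In a dense region the bound changes nothing: the 10 nearest are well inside it.
# In a sparse region (or with a tight bound) it may return fewer than k results,
# which is precisely the "nearest hospital within 10km" business-logic behavior.
```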
Conclusion

Large-scale spatial analysis requires solving the spatial density problem that causes traditional distance predicates to fail. KNN provides the solution: a cardinality-bounded query operator that delivers consistent result sets regardless of local feature distribution, with predictable computational complexity. Key Takeaways: Density-Invariant Selection: KNN’s bounded cardinality constraint naturally accommodates varying feature density—no manual threshold tuning required. Predictable Performance: Consistent throughput and cost across heterogeneous spatial data, enabling reliable budgeting and planning. Two-Stage Pattern: Combine KNN’s fast candidate selection with precise geometric calculations for both speed and accuracy. At roughly $3.50 per million geometries, spatial proximity analysis at any scale becomes economically trivial. The technique that processed 44 million buildings in 2.3 hours works equally well for customer analytics, infrastructure planning, or environmental assessment. KNN’s density-invariant behavior isn’t just a performance optimization—it’s what makes heterogeneous spatial analysis reliable. Resources: Wherobots Documentation: ST_AKNN, ST_KNN, spatial function reference. Wherobots Cloud: Distributed spatial analytics platform. Overture Maps Foundation: Open map data used in our validation.
Wherobots brought modern infrastructure to spatial data in 2025 Posted on January 26, 2026 by Damian In 2026 we’re bridging the gap between AI and data from the physical world. Entering 2025, we knew we needed to prove Wherobots is fundamentally the best place to create and run spatial data workloads at scale. Last year we directed the vast majority of our energy at strengthening the core fundamentals: ease of use, cost, performance, and reliability, knowing this focus would resonate with customers and that the value would be amplified through what we build later on. We knew we needed to bring spatial data into the modern data architecture, which from our vantage point is the data lakehouse. If we did nothing, much of this data would otherwise remain siloed, “special”, and out of reach of modern analytics engines that could put it to work. This is why we led contributions of GEO type support to Iceberg and Parquet. Off-platform, through the open source Apache Sedona project, we saw an opportunity to develop a lightweight query engine that would appeal to developers because it would provide the support they need out of the box and accelerate iterations with spatial data. And so SedonaDB was born. This post is a high level summary of these and other accomplishments from our team in 2025. Now we are actively building on top of this improved foundation to enable AI and data practitioners across industries and use cases to operate with a heightened understanding of the physical world, for any area of interest. Customer success with scalable spatial data platforms This is what matters most. Everything else written here is just supporting evidence that shows how we made our customers more successful with spatial data this year. And what better proof than their own words? “Working with Wherobots let us focus on what matters – helping our clients make better land decisions.
Their platform helps us scale efficiently while keeping our attention on real-world outcomes across energy, conservation, and development” Danan Margason Founder & CEO at Aarden.ai “The fact that Wherobots can mosaic imagery over millions of square kilometers, run AI models over those mosaics, and then organize the outputs in minutes, for 10 to hundreds of dollars, is nothing short of incredible. Making global scale analysis a routine pipeline instead of an enormous and extremely expensive endeavor has the potential to change monitoring tasks far beyond field boundary analysis.” Caleb Robinson Principal Research Scientist, Microsoft AI for Good “With Wherobots, we were able to merge 15+ complex vector datasets in minutes and run high-resolution ML inference at a fraction of the cost of our legacy stack. The combination of speed, scalability, and ease of integration has boosted our engineering productivity and will accelerate how quickly we can deliver new geospatial data products to market.” Rashmit Singh CTO, SatSure “We’re helping to democratize large-scale spatial analytics for everyone. With Wherobots, we can move faster, scale bigger, and help more organizations make smarter decisions about the physical world.” Eric Pollard Founder and CEO, ParGo Elevating spatial data and AI capabilities for AWS customers It’s no surprise that many of these customers are AWS customers. In late 2024, we launched Wherobots Cloud as a product AWS customers could subscribe to directly through the AWS Marketplace. This activated a key value distribution channel between Wherobots and AWS customers. It also started our partnership in earnest with AWS. We continue to work closely with the AWS team to bring world-class spatial capabilities into the hands of their customers so they can better realize their objectives with physical world data.
Leadership in the open data and lakehouse ecosystem We were the team that led the introduction of GEO type support to Iceberg and Parquet, which led to the incorporation of GEO types in the Databricks Delta Lake project. Because these projects form foundational components of the modern data architecture, with GEO type support, a significant portion of spatial data could now be interpreted and safely interoperated on by common compute engines. That also meant it no longer had to live in siloed architectures. It could thrive in the common data estate – the data lake, processed using engines like Spark, Snowflake, BigQuery, or Wherobots. If you squint, there is now a clear path for making spatial data look just like “data” in the eyes of developers and AI systems, particularly when capable engines like Wherobots can crunch it without a problem. Cloud and lakehouse integrations for spatial workflows Wherobots Cloud evolved into a full-fledged spatial intelligence platform designed not just for querying geospatial data very efficiently at scale, but for building production-grade workflows that integrate high value derivatives of physical world data into customers’ existing data architectures. Our native integration with Amazon S3 makes it possible for customers to run Wherobots on spatial data in their storage. In 2025 we announced our integration with Unity Catalog, enabling Databricks customers to activate the value of Wherobots on spatial data in their Databricks lakehouse. We will continue to add integrations such that our customers can just add Wherobots’ magic to the data infrastructure they already have. Apache Sedona appeals to a wider audience The Apache Sedona community developed and launched SedonaDB, the first open-source, single-node analytical database engine that treats spatial data as a first-class citizen. 
They also made it significantly easier to compare query performance across engines using SpatialBench, the first benchmarking framework for spatial queries. SedonaDB makes spatial data significantly more useful and analytically accessible for a wider range of use cases and personas. SpatialBench streamlines the decision-making process for users looking to choose an engine based on spatial query price-performance and capability. Here’s a link to the announcement for both. WherobotsDB keeps getting better We’re continuously improving the WherobotsDB engine, raising the bar we set for ourselves on spatial query price-performance and capability. In 2025 we announced multiple new functions, tools, and compatibility with GeoPandas. We also announced a preview of a new runtime version, 2.x, which contains the latest optimizations for spatial range queries, spatial filtering, and spatial joins. Rust is at the core, and it leverages vectorized execution. Compared to the first major version, 2.x accelerates spatial queries by up to 3.3x. While Wherobots is generally known for its spatial data capability, 2.x is significantly more performant for general purpose query operations, to the degree that it’s also TPC-H competitive with alternative managed Spark engines in the market. We will be sharing benchmarking results when we announce general availability for version 2 soon. Making remote sensing and Earth observation data AI-ready We launched RasterFlow in private preview, the first serverless workflow purpose-built to prepare and perform inference on large scale Earth observation (EO) datasets. RasterFlow is an Earth Intelligence solution, addressing the infrastructure challenges and high costs that prevented companies from utilizing raw EO data in the first place. We’ve packaged years of GeoAI expertise into a serverless, easy to use product.
RasterFlow creates inference-ready mosaics after digesting large, unprepared imagery datasets, and runs model inference on these mosaics with custom or open PyTorch models to perform tasks such as change detection, classification, and segmentation. Results are delivered as geometries in Iceberg tables in a customer’s S3 bucket or as tables in Databricks Unity Catalog, to be processed by WherobotsDB or alternative engines. Enabling AI on physical world data Physical world data is noisy, large, and generally semi-structured or unstructured. This data also needs more context in order to be useful, which requires spatial joins to other datasets. It can be publicly available, or reside as private assets within an organization’s data estate. Teams and AI systems alike need this data to be processed and contextualized with other data in order for it to be useful. At scale, Wherobots provides arguably the best tools for spatial processing and contextualization at the most fundamental level, for the modern data architecture. Now Wherobots is ready to be wired up to AI systems. Late in 2025 we announced the availability of the Wherobots MCP server to give your LLM access to Wherobots’ tools. Now, LLMs can use the MCP server to efficiently design queries by understanding the spatial and non-spatial data in your data estate (via the S3 and Unity Catalog integrations), and run those queries on a high performance engine to answer questions about the physical world. Soon we’ll integrate the MCP server with RasterFlow. That way, an AI agent can design and trigger a workflow in Wherobots that starts with fresh EO data, prepares it for inference, generates predictions using a collection of PyTorch machine learning models, and performs additional enrichment or transforms if needed to produce the result. Join the upcoming office hour on the MCP server to learn more.
What’s next for Wherobots in 2026 You can reach out to the product team at product@wherobots.com, or me directly at damian@wherobots.com, to share the challenges you’re facing and see how we can solve them now or with capabilities we’ll add. Here are a few discrete roadmap items we are working on, and categories of investment planned in 2026:

- Make RasterFlow generally available
- Enable AI on physical world data
- Bring Wherobots closer to developers (VS Code, Kiro, etc.)
- Provide support for Wherobots in the compute environment you need it to run in
- Make WherobotsDB even faster and more cost effective
- Broaden the capabilities of SedonaDB
The Medallion Architecture for Geospatial Data: Why Spatial Intelligence Demands a Different Approach Posted on January 9, 2026 (updated January 12, 2026) by Matt Forrest When most data engineers hear “medallion architecture,” they think of the traditional multi-hop layering pattern that powers countless analytics pipelines. The concept is sound: progressively refine raw data into analytical data and products. But geospatial data breaks conventional data engineering in ways that demand we rethink the entire pipeline. This isn’t just about storing location data in your existing medallion setup. It’s about recognizing that spatial data introduces complexity in data structures, compute patterns, and efficiency requirements that traditional architectures simply cannot handle without significant compromise. The medallion (or multi-hop) architecture, when properly adapted for geospatial workloads, becomes something more powerful: a systematic approach to managing spatial intelligence at scale. In this post, I will outline the approach we teach in the Wherobots Geospatial Data Engineering Associate course: leveraging this base design to group different spatial tasks, query patterns, and outputs into each step, across both raster and vector datasets. Why Geospatial Data Demands Architectural Rethinking Before diving into the medallion architecture itself, let’s acknowledge what makes geospatial data different. The Fragmentation Problem In most organizations, geospatial data doesn’t arrive nicely packaged. You might pull property boundaries from a municipal data portal in shapefiles, elevation data from AWS open data as cloud-optimized GeoTIFFs, satellite imagery from NASA, and local POI data from your internal systems. Each source has different formats, coordinate systems, refresh rates, and quality levels. Traditional data lakes aren’t equipped to handle this heterogeneity efficiently. Moving terabytes of raw geospatial data through your pipeline adds cost and latency.
And then there’s the format problem: legacy formats like shapefiles and JPEG2000 can’t be partially queried. If you want Seattle property data from global shapefiles, the traditional approach is to download the entire file, unzip it, filter locally, then transfer what you need. The Scale-Complexity Tradeoff Geospatial scale isn’t just about row count. It’s about the combination of volume, global distribution, mixed data types (vectors, rasters, array-based formats, point clouds), conversion / interpolation between those data types, and the inherent cost of spatial operations. A simple point-in-polygon operation across a million properties and thousands of geographical features involves computation that traditional systems struggle with. This is why you see organizations either underutilizing their spatial data or building custom solutions that work for one specific problem but don’t generalize. The efficiency gap is real, and it’s expensive. The Medallion Architecture: Bronze, Silver, Gold The medallion architecture solves these challenges by enforcing structure while maintaining flexibility. Here’s how it works for geospatial data: Bronze Layer: Ingestion and Preservation The bronze layer is deceptively simple: get raw geospatial data into your system without transformation. For Data Engineers: This layer is your staging area. You’re ingesting data from disparate sources: APIs, files, databases, cloud storage without enforcing business logic. The goal is to preserve raw data fidelity while standardizing the container. For geospatial data specifically, this means several important decisions: Leave data at source when possible. If you have a global elevation dataset sitting on AWS S3, don’t download it into your local systems. Create a remote reference to it. This is one of the biggest efficiency gains in modern geospatial data pipelines. Convert to cloud-native formats. 
As data arrives in your bronze layer, immediately convert legacy formats (shapefiles, JPEG2000, uncompressed GeoTIFFs) into cloud-native equivalents (GeoParquet, cloud-optimized GeoTIFFs). This isn’t optimization but a prerequisite for efficient querying downstream. Preserve lineage and versioning. Geospatial data often changes: property boundaries get redrawn, satellite imagery updates, elevation models improve. You need to track when data arrived, from where, and what version you’re working with. Raster tiling and storage optimization. For raster data (imagery, DEMs), bronze handles tiling and creates remote references. Instead of loading massive rasters into memory, you tile them and access only what you need. For GIS Professionals: Think of bronze as your data repository. In traditional GIS workflows, you manage multiple datasets, each in its own format, living in different folders or databases. Bronze centralizes this but respects the raw nature of the data. You’re not making decisions about coordinate systems, geometry validation, or spatial relationships yet. This is also where automation becomes critical. Instead of manually downloading datasets monthly, bronze ingestion jobs run on schedules, automatically pulling new data and versioning it. This means your analyses always reflect current reality. Data table formats such as Apache Iceberg and Delta Lake provide several advantages here, since you can easily append records that have changed to the table. They also build on cloud-native formats such as GeoParquet via the new Iceberg V3 specification, and within Wherobots you can read out-of-database rasters (or a reference to a specific part of an image) without moving the source raster file, saving massive amounts of I/O. How Wherobots Handles the Bronze Layer Wherobots Cloud simplifies bronze layer operations through WherobotsDB, its cloud-native spatial processing engine.
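A minimal sketch of the bronze principles above (pure Python with a hypothetical directory layout, not the Wherobots/Havasu implementation): raw bytes are landed untouched, and a sidecar record captures the source, arrival time, and a content-hash version id.

```python
import datetime
import hashlib
import json
import pathlib
import tempfile

def ingest_to_bronze(source_uri: str, payload: bytes, bronze_dir: pathlib.Path) -> dict:
    """Land raw data without transformation and record its lineage:
    where it came from, when it arrived, and a content hash as a version id."""
    arrived = datetime.datetime.now(datetime.timezone.utc).isoformat()
    version = hashlib.sha256(payload).hexdigest()[:12]
    bronze_dir.mkdir(parents=True, exist_ok=True)
    data_path = bronze_dir / f"{version}.bin"
    data_path.write_bytes(payload)  # raw bytes preserved exactly as they arrived
    record = {"source": source_uri, "arrived_at": arrived,
              "version": version, "path": str(data_path)}
    (bronze_dir / f"{version}.meta.json").write_text(json.dumps(record))
    return record

with tempfile.TemporaryDirectory() as d:
    rec = ingest_to_bronze("s3://example/parcels.shp.zip", b"raw shapefile bytes",
                           pathlib.Path(d) / "bronze" / "parcels")
    print(rec["version"])  # identical bytes always yield the identical version id
```

Because the version id is derived from content, re-ingesting an unchanged file is a no-op you can detect, while any change produces a new, auditable version alongside the old one.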
When ingesting raw geospatial data, Wherobots immediately enables critical capabilities for you: Format conversion at ingestion: Wherobots automatically converts legacy spatial formats (shapefiles, JPEG2000, standard GeoTIFFs) into cloud-native formats during ingestion. This happens transparently—you connect to data in whatever format it arrives, and it can be optimized for cloud storage and querying. Remote reference handling: For massive open datasets, Wherobots manages remote references natively. Global elevation datasets sitting on AWS S3 don’t get duplicated into your systems. Instead, Wherobots creates efficient references to the original data, allowing you to query it as if it were local while paying only for what you access. Apache Iceberg storage with spatial extensions: Data in bronze is stored using Havasu, Wherobots’ Apache Iceberg-based spatial table format, which is the foundation of the Iceberg v3 spec. This provides built-in versioning, time travel, and ACID transactions from day one. You can rewind to yesterday’s data for audits or reprocessing without complex manual versioning schemes. Automatic metadata tracking: Wherobots tracks lineage, data arrival times, and schema information automatically, eliminating the need for manual data cataloging in bronze. The key to Wherobots’ approach to bronze is that it enforces good practices without requiring you to build them yourself. You’re getting schema evolution, ACID compliance, and spatial optimization automatically. Silver Layer: Enrichment and Standardization Silver is where the intelligence begins to emerge. This is where you clean, enrich, and structure data into a consistent, analysis-ready state. It’s also often where data science teams may want to work before “publishing” analysis-ready data to end users (in gold tables). For Data Engineers: Silver involves: Coordinate system harmonization. Raw geospatial data arrives in dozens of different projections.
Silver transforms everything into standard coordinate systems (typically WGS 84 for global work, or local UTM zones for regional analysis). Geometry validation and repair. Geospatial data is messy. Overlapping polygons, self-intersecting lines, and invalid geometries from legacy systems all appear during silver transformation. You validate geometries and fix what’s repairable. Spatial enrichment operations. This is where you add geospatial relationships. Perform spatial joins to associate properties with neighborhoods. Buffer roads to identify nearby areas. Create 3D geometries by joining elevation and additional data attributes to 2D features. Calculate spatial statistics like area, perimeter, or distance to nearest feature. Advanced spatial operations. We often think of spatial relationships as just a spatial join, but silver can also include analyses such as a K-Nearest Neighbor join, zonal statistics (or a raster-to-vector join), spatial aggregation, distance within (i.e. features within N distance), area-weighted interpolation (such as weighted population statistics), shortest path analysis, line of sight analysis, space-and-time proximity analysis (i.e. near miss), and more. These require not only optimized spatial data but the right functions to make them work. The key architectural principle in silver is that you’re not making final analytical decisions; you’re preparing the data so downstream consumers can make those decisions efficiently. For GIS Professionals: Silver is where your spatial data processing happens. This is the layer where you answer structural questions: What coordinate system makes sense for this analysis region? Are there geometry issues I need to resolve? What spatial relationships exist between datasets that I’ll need downstream? Should I enrich my vector data with elevation? Satellite indices? Population density? In traditional GIS, this work happens ad-hoc before each analysis.
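To make the enrichment step concrete, here is a self-contained sketch of the classic silver operation, a point-in-polygon spatial join, using ray casting on toy rectangular neighborhoods (a real pipeline would use distributed spatial SQL, e.g. an ST_Contains join, on validated geometries):

```python
def point_in_polygon(pt, ring):
    """Ray-casting point-in-polygon test (even-odd rule) on a simple ring."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray at height y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_join(properties, neighborhoods):
    """Tag each property point with the neighborhood polygon containing it."""
    enriched = []
    for prop in properties:
        hood = next((name for name, ring in neighborhoods.items()
                     if point_in_polygon(prop["location"], ring)), None)
        enriched.append({**prop, "neighborhood": hood})
    return enriched

# Hypothetical neighborhoods as axis-aligned rings, properties as points
neighborhoods = {
    "Ballard": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Fremont": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
properties = [{"id": 1, "location": (3, 4)},
              {"id": 2, "location": (15, 5)},
              {"id": 3, "location": (25, 25)}]  # outside every neighborhood
print(spatial_join(properties, neighborhoods))
```

Property 3 comes back with neighborhood None, the kind of unmatched record that silver-layer quality checks should surface rather than silently drop.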
In the medallion architecture, you do it once, well, and capture the work in reusable transformations. This is efficiency multiplied across teams and projects. Another key note is that you will likely want different scales of compute to handle these processes. Simpler processes like spatial joins on basic vector geometries may only require a smaller compute instance, whereas a large-scale NDVI (or other vegetation index) calculation would require a larger one. The ability to mix and match your compute scales is a critical advantage in a spatial platform, not only for speed but for cost optimization. How Wherobots Handles the Silver Layer The silver layer is where WherobotsDB’s spatial optimization truly shines. Wherobots provides native spatial operations that make complex transformations not just possible but efficient at scale: Coordinate system transformations at speed: Wherobots includes optimized functions for reprojecting geometries across coordinate systems. What might take hours in desktop GIS or loose Python scripts happens in seconds on Wherobots’ distributed architecture. This is critical because spatial efficiency compounds—faster transformations mean you can afford to enrich data more completely. Geometry validation and repair: Wherobots includes native geometry validation and repair functions. You can identify invalid geometries, split self-intersecting polygons, and fix common data quality issues in SQL. This work is parallelized across the cluster, not constrained by a single machine’s memory. Spatial enrichment operations: Silver transformations in Wherobots use Spatial SQL with 300+ functions for vector and raster operations. Spatial joins that would take hours in traditional systems execute in minutes. Buffer operations, intersection calculations, proximity analysis—all happen efficiently on distributed data.
Wherobots handles both vector and raster data natively in the same query, meaning you can enrich vector property data with raster elevation or satellite imagery in a single operation. Raster tiling and remote storage: For raster data, Wherobots can tile and optimize imagery while keeping it remote on S3. You’re not loading massive datasets into memory. Instead, Wherobots’ raster functions work against remote tiles, accessing only what’s needed for each query. Reusable SQL transformations: Silver transformations are written in Spatial SQL and stored as views or jobs. These are version-controlled, reproducible, and can be re-run as new data arrives in bronze. Unlike one-off Python scripts, silver logic in Wherobots is production-grade from day one. The combination of Spatial SQL and WherobotsDB’s distributed architecture means silver layer work that once required deep geospatial expertise and careful optimization is now accessible to data engineers who know SQL. Wherobots handles the spatial complexity. Gold Layer: Analytical Readiness and Delivery Gold is your analytics-ready product layer. Data here is fully prepared, enriched, and available for immediate use in downstream applications and BI or GIS systems. For Data Engineers: Gold contains: Advanced aggregates and spatial statistics. Point-in-polygon counts (how many properties in each neighborhood), zonal statistics (average elevation by region), and spatial clustering (identify hotspots), updated on a regular basis and ready to join by identifier to non-spatial data. Optimized indices and pre-computed relationships. Store commonly-needed spatial joins and queries pre-computed, reducing downstream query time. Tiled and multi-format outputs. Generate PMTiles for web visualization, GeoParquet for analytics, and optionally push to PostGIS or SedonaDB/DuckDB for application serving. Quality-controlled data. Remove erroneous geometries that made it through silver, apply final business logic, and ensure consistency.
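A toy sketch of the first gold item, pre-computing an aggregate from silver-enriched records (pure Python with hypothetical field names): records that failed silver enrichment are filtered out before the aggregate is published.

```python
from collections import defaultdict
from statistics import median

def gold_price_summary(enriched_sales):
    """Pre-compute a gold-layer aggregate: median sale price per neighborhood."""
    by_hood = defaultdict(list)
    for sale in enriched_sales:
        if sale["neighborhood"] is not None:  # quality control from silver
            by_hood[sale["neighborhood"]].append(sale["price"])
    return {hood: {"median_price": median(prices), "sales": len(prices)}
            for hood, prices in by_hood.items()}

sales = [
    {"neighborhood": "Ballard", "price": 800_000},
    {"neighborhood": "Ballard", "price": 950_000},
    {"neighborhood": "Ballard", "price": 700_000},
    {"neighborhood": "Fremont", "price": 650_000},
    {"neighborhood": None,      "price": 1},  # dropped: failed silver enrichment
]
print(gold_price_summary(sales))
# {'Ballard': {'median_price': 800000, 'sales': 3}, 'Fremont': {'median_price': 650000, 'sales': 1}}
```

The resulting table is tiny compared to the raw observations, which is what makes gold aggregates cheap to serve directly to dashboards and applications.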
AI-ready data. Create language-based data that is ready for LLMs or agentic applications to consume, removing heavy geometries that can confuse an LLM. Gold is where you can afford to be opinionated, because the work upstream ensures those opinions are well-informed. For GIS Professionals: Gold is your analysis starting point. Instead of wrestling with raw data, you access gold tables that are: Geometrically valid and properly projected. Enriched with relevant spatial context (elevation, proximity to features, administrative boundaries). Pre-aggregated for common questions. Formatted for direct use in your tools and custom applications (QGIS, Felt, BI dashboards, machine learning pipelines). You go from “I need to combine three datasets and fix projection issues” to “Here’s the enriched regional dataset, ready for analysis.” How Wherobots Handles the Gold Layer Gold layer work in Wherobots focuses on making spatial data immediately useful across your entire organization: Pre-computed aggregates at scale: Wherobots can compute spatial aggregates—point-in-polygon counts, zonal statistics, spatial clustering—and store them efficiently. These pre-computed metrics are small enough to serve directly to applications but powerful enough to support complex analyses. A query that aggregates a billion point observations into regional summaries completes in seconds. Multiple output formats: Gold data can be materialized in multiple forms depending on consumer needs. Export as GeoParquet for analytics, create PMTiles for web visualization, push to PostGIS for application serving, or generate standard Parquet for BI tools. Wherobots’ native format support means you write once and consume many ways. Web tile generation at scale: Wherobots includes a scalable vector tile (VTiles) generator that’s optimized for producing map tiles from massive datasets. Instead of spending weeks generating tiles for a global dataset, Wherobots produces them in hours.
These tiles feed directly into web applications, Felt, QGIS, or any tile-consuming tool. RasterFlow integration: Gold layer work in Wherobots increasingly includes AI-powered enrichment. RasterFlow raster inference allows you to extract insights from satellite imagery at scale—identifying buildings, roads, vegetation patterns—and embed those insights directly into your gold tables. Machine learning models that would require custom integration work in other systems are built into Wherobots’ gold workflow. Spatial SQL API for consumption: Instead of exporting and distributing files, gold data is served through Wherobots’ Spatial SQL API. Applications, dashboards, and downstream systems query gold data directly through Python SDK, Java JDBC driver, or standard SQL endpoints. This means gold data stays fresh and you don’t distribute copies. Gold in Wherobots is less about storing static snapshots and more about providing a live, queryable product layer that adapts to consumer needs. Why This Architecture Solves Geospatial Problems Efficiency Through Format Optimization Traditional systems treat geospatial data like regular tabular data. This is inefficient. A shapefile is a collection of files in a zip that requires downloading the entire dataset to access a subset. GeoParquet, by contrast, stores data in columnar format with spatial indexing built-in. You can query it over the internet by bounding box or geometry filter and only transfer what you need. The medallion architecture enforces this progression from legacy to cloud-native formats, reducing storage costs and query latency dramatically. A terabyte global elevation dataset stays on AWS S3 as a reference. You don’t duplicate it. Separation of Concerns Bronze, silver, and gold are separate databases with separate lineage. This means: Raw data safety. Your source data is never modified. If a transformation downstream goes wrong, you can reprocess from bronze without losing the original. Independent evolution. 
Teams can improve silver transformations without affecting gold. Applications consuming gold don’t care about silver changes as long as contracts remain stable. Governance simplicity. Access controls are straightforward. Grant different teams access to different layers based on their role. Automation and Scalability By encoding transformations into standardized layers, you enable automation. Data ingestion becomes scheduled jobs. Updates propagate automatically from bronze through silver to gold. You stop manually managing data and start managing logic. This scales to global datasets because you’re not moving data—you’re moving queries. Spatial predicates push down to where the data lives, reducing network transfer and computation. Real-World Application: Housing Analytics at Scale Consider a concrete example: analyzing housing market patterns across Seattle using a blend of property records, elevation, transportation networks, satellite imagery, and census data. Bronze Layer: Ingest property sales records from the county (updated monthly) Pull property boundaries from municipal GIS (shapefile, converted to GeoParquet) Reference global DEM (stays on AWS S3) Ingest road network (OpenStreetMap) Store satellite imagery references (Copernicus, AWS Open Data) Silver Layer: Transform property boundaries to WGS 84 and validate geometries Perform spatial join: properties to neighborhoods Enrich with elevation by querying DEM at property centroids Calculate proximity metrics: distance to nearest transit, nearest park Tile satellite imagery for efficient access Gold Layer: Aggregate: median price by neighborhood, price trends by elevation Pre-compute spatial clusters: identify hot markets Generate PMTiles for web visualization Export clean GeoParquet for ML pipelines Push refined data to PostGIS for application serving This progression from raw, fragmented sources to a unified, enriched analytical product is exactly what the medallion architecture enables. 
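Two of the Seattle steps above can be sketched in miniature: a silver-layer enrichment (distance from each property to its nearest transit stop) followed by a gold-layer aggregation (median price per neighborhood). Everything here is invented for illustration (the coordinates, field names, and haversine helper); at scale these would be distributed spatial joins and aggregates in a framework like Apache Sedona, not Python loops.

```python
# Toy sketch of two medallion steps from the Seattle example:
#   silver: enrich each property with distance to its nearest transit stop
#   gold:   aggregate median sale price per neighborhood
# All names, prices, and coordinates are invented for illustration.
from math import radians, sin, cos, asin, sqrt
from statistics import median
from collections import defaultdict

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS 84 points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def silver_enrich(properties, transit_stops):
    """Silver step: attach nearest-transit distance to each property record."""
    return [
        {**p, "nearest_transit_km": min(
            haversine_km(p["lat"], p["lon"], lat, lon) for lat, lon in transit_stops)}
        for p in properties
    ]

def gold_median_price(properties):
    """Gold step: median sale price per neighborhood."""
    groups = defaultdict(list)
    for p in properties:
        groups[p["neighborhood"]].append(p["price"])
    return {n: median(prices) for n, prices in groups.items()}

props = [
    {"neighborhood": "Ballard", "price": 850_000, "lat": 47.6097, "lon": -122.3331},
    {"neighborhood": "Ballard", "price": 910_000, "lat": 47.6105, "lon": -122.3340},
    {"neighborhood": "Fremont", "price": 780_000, "lat": 47.6510, "lon": -122.3500},
]
stops = [(47.6100, -122.3335), (47.6505, -122.3495)]
enriched = silver_enrich(props, stops)
print(gold_median_price(enriched))  # {'Ballard': 880000.0, 'Fremont': 780000}
```

The point of the structure is that the gold function never touches raw inputs: it consumes already-validated, already-enriched silver records, which is exactly the layering the architecture enforces.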
And the efficiency gains compound as more analyses build on the same gold layer. Comparison to Other Approaches Why Not Keep Everything in PostGIS? PostGIS is powerful for transactional spatial queries and operations, but it’s a database optimized for consistency and ACID transactions, not analytics at scale. As data volume grows, PostGIS becomes expensive to operate and scale. You’re paying for transactional guarantees you don’t need for analytics. The medallion approach uses cloud storage (S3) as the primary store, which scales cheaply, and can use PostGIS only for the final gold layer serving to applications that need it. This is typically more cost-effective and performant. Why Not Use a Specialized GIS Data Warehouse? Geospatial-specific data warehouses exist but are typically proprietary, expensive, and inflexible. They don’t integrate easily with your existing data infrastructure or machine learning pipelines. The medallion architecture is platform-agnostic—it works with Spark, Flink, DuckDB, or any distributed compute framework that understands spatial operations. Why Not Just Use Desktop GIS with Cloud Storage? Desktop GIS tools (QGIS, ArcGIS) can read cloud storage but aren’t designed for production pipelines. They require manual steps, don’t automate updates, and don’t scale to the volume and frequency of modern geospatial data. The medallion architecture automates what desktop GIS does manually. 
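The serving pattern the medallion approach favors, consumers querying gold tables directly rather than receiving file exports, is ultimately just SQL. In this sketch SQLite stands in for a real spatial SQL endpoint (it is not the Wherobots API), and the `gold_neighborhood_stats` table and its columns are invented for the example.

```python
# Illustration of querying a gold table directly over SQL rather than
# distributing file copies. SQLite is only a stand-in for a real spatial
# SQL endpoint; the table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gold_neighborhood_stats (neighborhood TEXT, median_price REAL)"
)
conn.executemany(
    "INSERT INTO gold_neighborhood_stats VALUES (?, ?)",
    [("Ballard", 880000.0), ("Fremont", 780000.0)],
)
# A dashboard or application issues plain SQL; no files change hands.
rows = conn.execute(
    "SELECT neighborhood, median_price FROM gold_neighborhood_stats "
    "WHERE median_price > 800000 ORDER BY neighborhood"
).fetchall()
print(rows)  # [('Ballard', 880000.0)]
```

Because consumers hold a query, not a copy, the gold data they see is as fresh as the last pipeline run.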
Implementation Considerations Technology Stack The medallion architecture for geospatial data typically uses: Storage: Cloud object storage (AWS S3, Google Cloud Storage, Azure Blob) Table Format: Apache Iceberg with spatial extensions (Havasu), which provides schema evolution, ACID transactions, time travel, and spatial indexing Compute: Apache Sedona or Wherobots (distributed geospatial framework) or SedonaDB (single-node spatial SQL) Orchestration: Workflow tools (Airflow) to manage bronze→silver→gold pipelines Visualization: Web tiles (PMTiles), GIS tools (QGIS, Felt), or BI platforms Iceberg: The Foundation Apache Iceberg is the table format that makes this all work efficiently. It’s a metadata layer over Parquet files that provides: Schema evolution: Add columns without breaking downstream queries ACID transactions: Reliable concurrent updates Time travel: Query historical snapshots (rewind to yesterday’s data) Partitioning: Automatic data organization for efficient queries Spatial indexing: Efficient spatial predicates and pushdown optimization With Iceberg’s spatial extensions, you get native geometry storage and spatial optimizations, making it the ideal foundation for medallion pipelines. Plus, with many upstream data warehouses now accepting Iceberg V3, you get a zero-ETL path from your gold tables into all of these systems. Data Lineage and Governance The medallion architecture inherently supports data governance. Each layer has clear ownership and lineage. Track which source datasets feed which analyses. Implement role-based access controls at layer boundaries. Maintain data catalogs describing bronze sources, silver transformations, and gold products. Putting this into practice with Wherobots From Theory to Practice The medallion architecture isn’t a theoretical concept—it’s proven across thousands of data engineering organizations.
What makes it powerful for geospatial is that it acknowledges geospatial’s unique challenges: Format diversity: Enforced conversion to cloud-native formats Computation intensity: Spatial predicates pushed down to efficient compute Scale complexity: Remote references and tiling instead of data movement Governance needs: Clear layer separation for access control and lineage Integration requirements: Gold layer can feed into GIS tools, ML pipelines, or applications For teams managing geospatial data, whether you’re data engineers building analytics platforms or GIS professionals upgrading legacy workflows, the medallion architecture provides a systematic, scalable path forward. The alternative (managing spatial data ad-hoc) becomes increasingly untenable as volume, velocity, and complexity grow. The medallion architecture makes it possible to move at scale without moving data. Key Takeaways Bronze is raw: Get all your geospatial sources into cloud storage without transformation, leaving data at source where possible Silver is structure: Standardize projections, validate geometries, enrich with spatial relationships, and convert to cloud-native formats Gold is analysis-ready: Aggregate, optimize, and prepare data for consumption across tools and applications Iceberg is the glue: Use Apache Iceberg’s spatial extensions as your table format to manage schema evolution, lineage, and efficient spatial operations Efficiency is existential: Geospatial data demands careful format choices and architectural decisions—the medallion approach systematizes those choices The geospatial industry is shifting from moving data to moving queries. The medallion architecture is how you make that shift sustainable. Create your Wherobots account
Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence Posted on December 10, 2025 (updated January 8, 2026) by Philip Darringer We’re very excited to announce RasterFlow is now available to select customers in a private preview. If you are interested in learning more or would like to request access to the preview, contact us here! RasterFlow is a serverless image preparation and inference engine that makes it significantly easier to generate Earth Intelligence from planetary scale Earth Observation (EO) datasets. With it, customers and their AI agents will be significantly more capable of innovating with EO data and integrating earth insights into their data infrastructure. How RasterFlow Powers Earth Intelligence at Scale A few weeks ago, we announced our collaboration with the Taylor Geospatial Engine to help them evaluate their Fields of the World (FTW) machine learning model that segments agricultural field boundaries. Using an early release of RasterFlow, we were able to quickly and cost-effectively run this model at scale. Here’s a breakdown of how this works in practice. RasterFlow ingests and assembles the source imagery – in this case Sentinel-2 – into an inference-ready mosaic, generating representative features using the FTW model for planting and harvest seasons, and removing cloud cover as needed (1, 2). The FTW model is run against this mosaic using RasterFlow’s distributed inference engine to predict fields and field boundaries (3). RasterFlow predictions are then vectorized into geometries and made available as an Iceberg table (4) that can be used in WherobotsDB or other downstream applications and data systems for field-level crop insights. RasterFlow’s applicability is much wider than Sentinel-2 and FTW. It supports Zarr and COG imagery datasets and PyTorch computer vision models for inference.
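The mosaicking half of that workflow (steps 1 and 2 above) boils down to choosing, per pixel, the best-quality observation across a time range. Here is a toy version with scenes as plain 2-D grids of (value, cloud_score) pairs; real mosaics work on georeferenced rasters (COG or Zarr) with proper QA bands, and none of this is RasterFlow's actual API.

```python
# Minimal sketch of mosaicking: for each pixel, keep the least cloudy
# observation across a time range. Scenes are toy 2-D lists of
# (pixel_value, cloud_score) tuples, not real georeferenced rasters.
def build_mosaic(scenes):
    """scenes: list of same-shaped grids of (pixel_value, cloud_score).
    Returns a grid of pixel values, choosing the lowest cloud score per cell."""
    rows, cols = len(scenes[0]), len(scenes[0][0])
    return [
        [min((scene[r][c] for scene in scenes), key=lambda px: px[1])[0]
         for c in range(cols)]
        for r in range(rows)
    ]

june = [[(0.8, 0.9), (0.2, 0.1)]]   # first pixel cloudy, second clear
july = [[(0.5, 0.0), (0.9, 0.8)]]   # first pixel clear, second cloudy
print(build_mosaic([june, july]))   # [[0.5, 0.2]]
```

Each output pixel may come from a different acquisition date, which is why a mosaic can be cloud-free even when no single scene is.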
RasterFlow at Scale The images above represent sample outputs for a small area in Kansas, but RasterFlow can be very attractive for larger-scale runs. In our collaboration with the Taylor Geospatial Engine, we executed larger-scale runs including the Continental United States (CONUS), Japan, Mexico, South Africa, Switzerland and Rwanda. RasterFlow’s efficient parallel processing enabled each of these large-scale workflows to complete in minutes to a few hours. RasterFlow autoscales compute resources based on expected compute and inference load, which is a function of area and time range, dataset density, and model complexity. Challenges using EO Data Most data teams do not have the expertise or the budget to build and operate the unique infrastructure and software stack required to extract insights from EO datasets using computer vision models. These barriers have prevented innovative ideas from getting off the ground. According to Gartner, only 1% of AI models today leverage physical world data, vs a projected 80% by 2029. Similarly, AI agents are projected to generate 10 times more data from physical environments than from all digital AI applications combined.¹ However, AI agents can’t economically make sense of this raw data because it has to be prepared by the same costly, complex, and unique infrastructure that the data teams need, and that neither has access to.
Here’s an example that underscores these challenges: if you or an AI agent are trying to analyze wildfire state and predicted spread to measure risk to infrastructure, developers typically need to build dedicated pipelines that: Ingest and prepare imagery for inference, minimizing noise such as cloud cover and edge effects Deploy a machine learning model on prepared imagery, trained to segment and classify fires Tune model inference for scale and efficiency, while minimizing edge and tiling effects from individual tasks Measure change over time using models that take into account wind direction, speed, vegetation, buildings and other infrastructure in the probable path of the fire Join model predictions with other important context including building footprints, land parcels and infrastructure such as powerlines and pipelines to calculate overall risk Forecast the spread of the fire In total, these steps require significant investments in infrastructure development, operations, and talent that most businesses are unable to justify, much less accomplish. On-Demand Imagery Preparation and Inference for Earth Observation Workflows The inspiration for RasterFlow was to make it easy for any company to use large-scale sensor datasets and computer vision models to unblock innovation and AI applications for the physical world. RasterFlow does this by combining decades of expertise with a fully managed inference and mosaicking workflow and API designed for Earth Intelligence at any scale. Here are a few key capabilities: On-demand serverless operations for imagery ingestion, preparation (also known as mosaicking), and inference. Built-in support for popular open datasets and open models so you can get started quickly. Inference results that can be converted to vector geometries and integrated into a lakehouse architecture, stored in a customer’s cloud storage bucket as Parquet files in Apache Iceberg tables.
Ability to easily postprocess these results with WherobotsDB or other lakehouse engines with support for spatial operations, such as Databricks, Snowflake, or Google BigQuery. Simple enough for any engineer, scientist, or analyst to use: just pick a model, an area of interest to deploy that model, and a time range. Advanced users can take advantage of lower-level APIs to customize their planetary-scale inference runs. RasterFlow Operators: Core Functions for Preparing Imagery, Model Inference, and Vectorization RasterFlow provides fully managed operations required for processing Earth Observation datasets, including: Imagery ingestion and preparation to remove cloud cover and edge effects, and build a high-quality inference-ready mosaic Distributed inference for large-scale computer vision, geospatial foundation, and other PyTorch model runs Vectorization of model outputs into geometries or as analytics-ready rasters Ingesting and Preparing Satellite Imagery for Model Inference Satellites and drones capture imagery on a particular flight path. And it may take multiple drone flights, or days, weeks, or even months for the flight paths of a satellite constellation to capture clean imagery for a particular area of interest. Clouds and weather events may still block what you are interested in. In these circumstances it’s important to understand the rate of coverage and define your time horizon accordingly to build a mosaic. A mosaic is a composite image that is the result of composing high-quality pixels (e.g., cloud free) over a time range, and stitching them together for a particular area. Base satellite layers in your favorite map applications (Google Maps, Mapbox) are cloud-free mosaics composed from images over a wide time range. Many computer vision models are trained to find relatively durable things on Earth, like buildings, roads, and land cover.
But when clouds, coverage, imagery edge effects, or other types of “noise” exist in the input imagery, the quality of inference suffers. The purpose of the mosaic is to correct for this noise and make imagery inference-ready, so model inference produces the results you want. RasterFlow takes care of this heavy lifting for you, creating an inference-ready mosaic that maximizes the usefulness of today’s Earth Observation models. Distributed Geospatial Inference at Planetary Scale We’ve moved past the use of eyes to analyze imagery, and are now capable of letting machines do this work for us. With RasterFlow, today’s machine learning models can perform tasks such as object detection, segmentation, and classification, on a very large area of interest, with orders of magnitude more efficiency and scale than an analyst’s eyes can offer. The RasterFlow inference engine is designed for small to very large scale runs. It efficiently parallelizes work across the input mosaic on a distributed, serverless inference architecture while minimizing tiling effects typically produced when inference pipelines operate on individual tiles. Running Hosted or Custom Geospatial AI Models with RasterFlow For convenience, RasterFlow currently hosts popular open source PyTorch geospatial computer vision models that are ready to use. These models currently include: Fields of the World (FTW) Field Boundary Delineation Meta and World Resources Institute Tree Canopy Height Prediction ChesapeakeRSC Road Segmentation Tile2Net Pathway Segmentation You can also import your own custom PyTorch model to your Wherobots Organization for private deployment. RasterFlow + TorchGeo: Simplifying PyTorch-Based Geospatial AI Wherobots actively supports the TorchGeo project, which helps machine learning experts to more easily work with geospatial data within the PyTorch ecosystem.
We will continue to build out RasterFlow integrations with TorchGeo, including onboarding additional TorchGeo models and further simplifying the model lifecycle for PyTorch models. While we are starting with support for PyTorch focusing on TorchGeo models, we are open to adding support for other model frameworks. Calling Geospatial Model Developers: Contribute to RasterFlow We are continually adding new, open source geospatial computer vision models to the Wherobots Model Hub. And if you’re a model developer, we’re interested in speaking with you to onboard your model and distribute the value of your work to a wider audience using Wherobots RasterFlow. Vectorizing Model Outputs: From Raster Predictions to Geospatial Geometries Many computer vision models output rasters, where each pixel in the raster represents a predicted real-world value such as height of the tree canopy, or the confidence that the pixel represents a certain feature such as an agricultural field boundary or a sidewalk. RasterFlow provides built-in support for raster vectorization, turning pixel values into rich, concise geometries. These geometries represent features of interest that can be post-processed, conflated, and integrated into your workflows because they are yours, stored in open source file (Parquet) and table (Iceberg) formats in your S3 bucket. Using RasterFlow with Geospatial Foundation Models and Embeddings Recent developments in Geospatial Foundation Models have generated tremendous interest in the research community, potentially accelerating Earth Observation applications the same way that Large Language Models (LLMs) and embeddings have transformed AI’s ability to generate language. RasterFlow can generate embeddings from the latest open Geospatial Foundation Models, including OlmoEarth from the Allen Institute for AI (Ai2) and Clay. 
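Returning to the vectorization step described above: at its core it means thresholding a prediction raster, grouping connected pixels into features, and emitting one geometry per feature. The toy flood-fill version below emits bounding boxes instead of true polygons and is purely illustrative of the raster-to-vector idea, not RasterFlow's implementation.

```python
# Toy raster vectorization: threshold a prediction raster, flood-fill
# 4-connected components, and emit one bounding box (rmin, cmin, rmax, cmax)
# per feature. Real vectorizers trace proper polygon boundaries.
def vectorize(raster, threshold=0.5):
    rows, cols = len(raster), len(raster[0])
    seen, boxes = set(), []
    for r in range(rows):
        for c in range(cols):
            if raster[r][c] >= threshold and (r, c) not in seen:
                stack, component = [(r, c)], []
                seen.add((r, c))
                while stack:  # iterative 4-connected flood fill
                    cr, cc = stack.pop()
                    component.append((cr, cc))
                    for nr, nc in ((cr-1, cc), (cr+1, cc), (cr, cc-1), (cr, cc+1)):
                        if (0 <= nr < rows and 0 <= nc < cols
                                and raster[nr][nc] >= threshold
                                and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            stack.append((nr, nc))
                rs, cs = [p[0] for p in component], [p[1] for p in component]
                boxes.append((min(rs), min(cs), max(rs), max(cs)))
    return boxes

pred = [
    [0.9, 0.8, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7],
]
print(vectorize(pred))  # [(0, 0, 0, 1), (2, 2, 2, 2)]
```

The output geometries, not the pixel grid, are what lands in downstream tables, which is why vectorized results are so much smaller and easier to join with other spatial data.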
With RasterFlow’s ability to cost-effectively generate embeddings at scale, researchers and practitioners can easily generate embeddings for their area of interest and evaluate their suitability and power. Customers and Partners Using RasterFlow for Scalable Earth Intelligence One highlight while developing RasterFlow has been our collaboration with customers and partners like SatSure, Taylor Geospatial Engine, and Spyrosoft. We’ve used feedback from these teams to ensure we are solving for customer needs. Before the Thanksgiving holiday we shared our recent learnings from working together with Taylor Geospatial Engine, who have been incredibly helpful in providing input on the types of ways their ecosystem of developers and ML engineers would want to interact with RasterFlow. SatSure is an existing Wherobots customer and an early adopter of RasterFlow, and we are excited to see what they build next with it. "RasterFlow meaningfully accelerates the work SatSure and Wherobots already do together. By automating mosaicking, preprocessing, and distributed inference into a single, on-demand workflow, it removes much of the engineering overhead required to operationalize our models at national and multi-season scale. This helps us move new geospatial AI models into production faster, iterate more quickly with customers, and deliver fresher, high-resolution insights across agriculture, banking and financial services, and infrastructure use cases." Rashmit Singh CTO and co-founder, SatSure RasterFlow Availability and Multi-Cloud Architecture Wherobots infrastructure runs natively on AWS and customers pay for use through the AWS marketplace. RasterFlow and WherobotsDB support hybrid architectures, where data is read from, and results are written to, other environments such as GCP, Azure, Oracle, or on-premises. This is particularly useful when processing open datasets or using open models, where the environment in which data is processed may not be a concern.
On-demand pricing for RasterFlow will be announced at a later date, but can be discussed with customers participating in the private preview. Next Steps: Try RasterFlow and Explore the Wherobots Spatial Data Platform We invite anyone who wants to test out RasterFlow to request to join the private preview here. Sign up for the RasterFlow webinar in January. Get started building with the most capable and efficient spatial data platform using the Wherobots Professional Edition. Sign up for the newsletter to keep pace with what’s happening at Wherobots. ¹ Gartner, Innovation Insight: World Models Are Set to Empower AI Agents With Imagination, 27 August 2025
Wherobots and Taylor Geospatial Engine Bring Fields-of-the-World Models to Production Scale Posted on November 26, 2025 by Ben Pruden Agriculture depends on timely, reliable insight into what’s happening on the ground—what’s being planted, what’s being harvested, and how fields evolve over time. The Fields of the World (FTW) project was created to support exactly this mission by building a fully open ecosystem of labeled data, software, standards and models to create a reliable global map of agricultural field boundaries using AI and Earth Observation (EO) data. Over the past several months, Wherobots has been working closely with the Taylor Geospatial Engine (TGE) team driving the FTW project to turn high-performing research models into operational, production-scale data products. This collaboration builds on TGE’s broader effort to accelerate AI & EO development by connecting cutting-edge research to real-world needs. Learn more about FTW at fieldsofthe.world and about TGE’s agricultural AI initiatives here. Turning Fields-of-the-World Research Models Into Operational Pipelines FTW Phase 2 surfaced state-of-the-art computer vision models for interpreting Sentinel-2 time-series imagery and predicting locations of agricultural fields. But research models alone aren’t enough to deliver real agricultural insight. They must be reproducible at scale, compute-efficient, and aligned with downstream applications. This is where Wherobots focused its efforts: optimizing inference pipelines, distributing computation efficiently, and generating data products that can serve as reliable foundations for agricultural modeling, monitoring, and analysis. TGE’s objective was simple: transform breakthrough research into something developers, end-users, and organizations can actually use.
"This project is a testament to what happens when the academic community, nonprofits, and industry all pull in the same direction… Wherobots' ability to run these open-source models at scale makes it possible for the community's work to reach a global audience, and open access to predictions and mosaics ensures that more researchers and innovators can build on top of it." Jennifer Marcus Executive Director, Taylor Geospatial Engine Open-Sourced Seasonal Mosaics and Model Predictions Today, we’re releasing the first production-scale outputs from this collaboration: Sentinel-2 Seasonal Mosaics Cloud-free, analysis-ready mosaics tailored to key agricultural seasons—planting and harvest—built from Sentinel-2 imagery. These mosaics provide a consistent, high-quality foundation for model training, monitoring workflows, and large-scale geospatial analysis. FTW Phase 2 Model Predictions Per-pixel prediction outputs generated by running the top-performing FTW Phase 2 model across these seasonal mosaics, aligned with FTW’s agricultural labeling standards. This is an example output of the field boundary model in a large-scale AOI in Japan and Mexico. The comparison is between the 2023 predictions and the combined 2023 + 2024 predictions (2024 in bright green). Both datasets are openly available on Source Cooperative: https://source.coop/wherobots/fields-of-the-world These resources are designed to make advanced agricultural AI more accessible—supporting innovation in food security, sustainability, and data-driven farming. The Wherobots AI for Earth team were critical in scaling inference with the new field boundary delineation models to a multi-country scale. The fact that they can mosaic multi-season Sentinel-2 imagery over millions of square kilometers, then run models over those mosaics, and organize the outputs in minutes for a few hundred dollars is nothing short of incredible.
Making global scale analysis a routine pipeline instead of an enormous one-off effort has the potential to change monitoring tasks far beyond field boundary analysis. Caleb Robinson, Principal Research Scientist, Microsoft AI for Good Enabled by the Spatial Intelligence Cloud This release also highlights the underlying engine that made it possible. The Wherobots Spatial Intelligence Cloud is built specifically for large-scale geospatial machine learning workflows—constructing analysis-ready mosaics, executing distributed model inference, and writing results into modern, cloud-native formats like Zarr with exceptional efficiency. And model outputs can be further processed and analyzed within WherobotsDB, using the power of Apache Sedona to refine the field geometries or calculate vegetative indices at scale. These capabilities are part of our suite of tools within our AI for Earth product area. Under the hood, the platform uses state-of-the-art tooling for raster processing, GPU-accelerated inference, chunk-aligned storage, and aggressive cost optimization. These choices allow us to run continental-scale pipelines at speeds and costs that would have been unimaginable even a few years ago. As agricultural AI models continue to grow in scope and resolution, this kind of infrastructure becomes essential. Our goal is to make it straightforward for teams to experiment, scale, and operationalize geospatial ML without needing to reinvent the entire data stack. One of the most important parts about what the Wherobots team has done is to make it easy to see where and how the model is failing. For example, the model's poor performance in Nevada and the tiling artifacts from the previous model runs led to important changes in how we should be training models. Nathan Jacobs Assistant Vice Provost for Digital Transformation, Washington University in St. Louis What’s Next This is just the beginning.
We’re continuing to work with the FTW and TGE teams to expand coverage, operationalize more models, and build richer analysis-ready layers for the agricultural AI ecosystem. If you’re exploring geospatial ML, agricultural monitoring, or large-scale satellite data processing, we’re excited to see what you build with these new open datasets—and with the Wherobots AI for Earth capabilities in your pipeline. Getting in Touch about Wherobots AI for Earth If you would like to continue to stay up to date about the Wherobots AI for Earth capabilities for solving real world problems, or speak to the Wherobots team about utilizing these capabilities in production, please reach out to us here.