Mobility Data Processing at Scale: Why Traditional Spatial Systems Break Down

Posted on March 26, 2026 by Matt Forrest

Mobility data is the continuous stream of GPS location records capturing how people, vehicles, and assets move through the world. Processing it at scale is fundamentally different from standard spatial data work because it carries temporal dependencies: the order and timing of observations define movement, not just position. Most organizations have discovered, often painfully, that the tools they already use were not designed with that in mind.

In Part 2, we walk through the technical implementation: a three-notebook medallion architecture built on Wherobots and Apache Sedona that takes raw GPS pings and transforms them into analysis-ready, GeoParquet-backed analytical views.

Why Mobility Data Is Harder to Process Than It Looks

Every second, billions of GPS-equipped devices generate spatiotemporal data capturing how people, goods, and vehicles move through the physical world. The market reflects how seriously organizations are treating this: the global fleet management market is valued at approximately $27 billion in 2025 and projected to exceed $122 billion by 2035. Mobility data analytics platforms are on a similar trajectory, growing from $2.5 billion to over $11 billion by 2034.

But collecting this data and actually extracting reliable intelligence from it are two very different things. Most organizations have discovered, often painfully, that the tools they already use were not designed with GPS mobility data in mind. The result is brittle pipelines, inconsistent methodologies, and analyses that quietly produce misleading conclusions.
What Makes Mobility Data Different From Standard Spatial Data

Mobility data is not just spatial data with timestamps attached. It is a distinct category of data that violates assumptions built into most data processing systems. Researchers at ACM have characterized the field as requiring its own dedicated science—Mobility Data Science—because general-purpose data science pipelines consistently produce suboptimal results when applied to movement data. Understanding why requires examining the specific properties that make trajectory data uniquely challenging.

Why GPS Data Volume Overwhelms Traditional Databases

The volume and velocity problem is the recognition that GPS data generation at fleet scale is not a batch analytics challenge; it is a continuous, high-throughput data engineering problem that demands distributed processing from the start.

A single connected vehicle generating GPS pings at one-second intervals produces roughly 86,400 records per day. A fleet of 10,000 vehicles generates over 860 million data points daily. Multiply this across the millions of connected vehicles, delivery drones, rideshare fleets, and maritime vessels operating globally, and the scale becomes staggering.

Traditional spatial databases like PostGIS, which excel at transactional workloads and moderate-scale analytics, were not designed for this volume. Loading hundreds of millions of GPS points into PostgreSQL, constructing geometries, and running spatial joins or trajectory reconstruction queries can take hours or days on a single node. Adding more hardware does not solve the fundamental problem: PostGIS was not built for distributed, parallel spatial computation.
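The volume arithmetic above is easy to make concrete. A quick sketch (the ~100-byte flat record size is an illustrative assumption, not a measured figure):

```python
# Back-of-the-envelope volume math for fleet-scale GPS ingestion.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400 pings per vehicle at 1 Hz
FLEET_SIZE = 10_000
BYTES_PER_RECORD = 100                  # assumption: flat id/lat/lon/alt/ts row

records_per_day = SECONDS_PER_DAY * FLEET_SIZE
gb_per_day = records_per_day * BYTES_PER_RECORD / 1e9

print(f"{records_per_day:,} records/day")   # 864,000,000 records/day
print(f"~{gb_per_day:.0f} GB/day raw")      # ~86 GB/day at the assumed row size
```

At roughly 86 GB of raw pings per day for a single 10,000-vehicle fleet, a year of history lands in the tens of terabytes before any enrichment, which is why single-node loading and indexing strategies stop being viable.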
How GPS Signal Noise Corrupts Downstream Analysis

GPS noise is error in raw location readings caused by signal reflection, satellite loss in tunnels and urban canyons, and atmospheric interference, and it cascades through every downstream analysis built on top of it. Studies have documented median GPS errors of 7 meters with standard deviations exceeding 23 meters in urban environments. Points can appear on the wrong side of a street, inside buildings, or kilometers from the actual position when signal quality degrades.

Speed calculations between consecutive points can spike to physically impossible values. Distance measurements accumulate systematic errors. Clustering algorithms identify phantom stop locations. Without rigorous cleaning and validation at the earliest stages of the pipeline, every subsequent insight is built on a compromised foundation.

Researchers at the University of Pennsylvania’s Computational Social Science Lab studied exactly this problem in the context of COVID-19 epidemic modeling. Using the same GPS mobility dataset, they found that different but individually reasonable preprocessing choices led to substantially different conclusions: a methodological “garden of forking paths” where reproducibility became nearly impossible. The root causes: data sparsity, sampling bias, and inconsistent algorithmic choices at the preprocessing stage.

Why Temporal Ordering Is Critical in Mobility Data Processing

Trip segmentation is the process of splitting a continuous GPS stream into discrete trips by detecting temporal gaps: periods where no data was recorded or the device was stationary. Mobility data is not simply geospatial—it is spatiotemporal. Every GPS point has a position and a timestamp, and the relationship between consecutive observations is what defines movement.
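As a minimal sketch of the speed-spike validation described above, the snippet below computes the implied speed between consecutive pings with the haversine formula and flags physically impossible jumps. The ~250 km/h threshold is an illustrative assumption, not a value from any particular pipeline:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in meters."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_speed_spikes(pings, max_speed_ms=69.0):
    """Mark pings whose implied speed from the previous ping is physically
    impossible (default ~250 km/h). pings: chronologically ordered
    (lat, lon, unix_ts) tuples. Returns a parallel list of booleans."""
    if not pings:
        return []
    flags = [False]
    for (lat0, lon0, t0), (lat1, lon1, t1) in zip(pings, pings[1:]):
        dt = t1 - t0
        dist = haversine_m(lat0, lon0, lat1, lon1)
        flags.append(dt <= 0 or dist / dt > max_speed_ms)
    return flags

pings = [
    (39.9042, 116.4074, 0),      # Beijing
    (39.9043, 116.4075, 1),      # ~14 m in 1 s: plausible
    (39.9500, 116.4074, 2),      # ~5 km in 1 s: GPS jump, flagged
]
print(flag_speed_spikes(pings))  # [False, False, True]
```

The same pairwise logic translates directly into a distributed setting as a window function partitioned by device and ordered by timestamp.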
The segmentation threshold alone introduces significant methodological complexity, because the gap you choose (5 minutes? 20 minutes? 60 minutes?) fundamentally changes the structure of your resulting trajectories and all metrics derived from them.

Beyond segmentation, ordering matters. Trajectories are sequences, not sets. Every analytical operation—speed calculation, direction changes, stop detection, map matching—depends on correct chronological ordering within each trip. Systems that do not preserve or guarantee ordering (a common challenge in distributed frameworks) can produce geometries that appear valid but contain scrambled temporal information.

Why 2D Spatial Systems Fail for Mobility Analytics

The dimensionality problem is the gap between how most spatial systems model location (latitude and longitude only) and what mobility analysis increasingly requires: 3D and 4D geometry that encodes elevation and time directly into the geometry itself.

Most spatial systems treat location as a 2D construct: latitude and longitude. But mobility analysis increasingly demands 3D and 4D processing. Elevation matters for fuel consumption modeling, route optimization in mountainous terrain, aviation and drone trajectories, and any analysis where the difference between 2D and 3D distance is materially significant. Adding a temporal measure dimension (the “M” in XYZM geometries) enables encoding timestamps directly into the geometry itself, supporting trajectory validation and interpolation operations that are impossible with 2D points.

Yet 4D geometry support—constructing XYZM points, building trajectories from them, and performing analytical operations that respect all four dimensions—is rare. Most spatial SQL implementations either lack these functions entirely or implement them inconsistently. This forces practitioners to maintain separate columns for elevation and time, losing the computational advantages of integrated 4D geometry processing.
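A gap-based segmentation sketch makes the threshold sensitivity concrete. This is plain Python for illustration, not the PySpark window-function implementation the pipeline itself would use:

```python
def segment_trips(pings, gap_s):
    """Split a chronologically ordered list of (lat, lon, unix_ts) pings
    into trips wherever the gap between consecutive pings exceeds gap_s."""
    if not pings:
        return []
    trips, current = [], [pings[0]]
    for prev, ping in zip(pings, pings[1:]):
        if ping[2] - prev[2] > gap_s:   # temporal gap: close the current trip
            trips.append(current)
            current = []
        current.append(ping)
    trips.append(current)
    return trips

# One stream, three thresholds: the gap you pick changes the trip structure.
ts = [0, 30, 60, 400, 430, 2000, 2030]           # seconds
stream = [(39.9, 116.4, t) for t in ts]
for gap in (120, 600, 3600):                      # 2 min, 10 min, 1 h
    print(gap, len(segment_trips(stream, gap)))   # 120 -> 3 trips, 600 -> 2, 3600 -> 1
```

The same seven pings yield three, two, or one trips depending on the threshold, which is exactly why the choice must be documented and reproducible.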
Where Traditional Spatial Systems Fail for Mobility Data

System | Primary Limitation for Mobility Data
--- | ---
Desktop GIS (QGIS, ArcGIS Pro) | Single-machine ceiling, no distributed processing
PostGIS | Single-node, not built for hundreds of millions of GPS points
Cloud data warehouses (Snowflake, BigQuery) | Shallow spatial support, cannot handle XYZM or map matching
Vanilla Apache Spark | No native spatial types, no spatial indexing
External map matching APIs | Rate limits and per-request pricing make batch processing prohibitive

Desktop GIS: The Single-Machine Ceiling

Tools like QGIS and ArcGIS Pro are extraordinarily capable for visualization, manual analysis, and working with datasets that fit in memory. But they hit a hard wall with mobility data at scale. Loading millions of GPS trajectories, performing trip segmentation with window functions, running DBSCAN clustering on stop points, and executing map matching against a road network are not operations that desktop GIS was designed to handle. Analysts working with fleet-scale data routinely encounter out-of-memory errors, multi-hour processing times, and the inability to iterate quickly on analytical parameters.

Spatial Databases: Scale Without Spatial Intelligence

PostGIS remains the gold standard for spatial SQL and is an excellent choice for many use cases. However, it is fundamentally a single-node system. Scaling PostGIS to handle hundreds of millions of GPS points requires expensive vertical scaling, and even then, operations like trajectory construction across thousands of users, spatial indexing with H3 or GeoHash, and DBSCAN clustering at urban scale can exhaust available resources.

Cloud data warehouses like Snowflake, BigQuery, and Redshift have added spatial capabilities, but these tend to be shallow implementations optimized for simple point-in-polygon or distance queries.
Constructing XYZM trajectories from ordered GPS points, performing spatial clustering, computing 3D distances, or running map matching against a road network are either unsupported or require convoluted workarounds that sacrifice performance and maintainability.

General-Purpose Distributed Frameworks: Power Without Spatial Awareness

Apache Spark provides the distributed computing muscle needed for mobility-scale data, but vanilla Spark has no concept of spatial data types, spatial indexing, or geometric operations. Running a spatial join in pure Spark requires broadcasting datasets or implementing custom partitioning strategies—both of which are error-prone and perform poorly at scale compared to purpose-built spatial engines.

This is precisely the gap that Apache Sedona and Wherobots were designed to fill. Sedona extends Spark (and other distributed frameworks) with native spatial data types, over 290 spatial SQL functions, spatial indexing, and optimized query planning that understands geometric predicates. Wherobots builds on Sedona to provide a fully managed, cloud-native spatial intelligence platform where teams can process GPS-scale data without managing infrastructure, configuring clusters, or bolting together fragmented toolchains.

Map Matching: The Unsolved Infrastructure Problem

Map matching—the process of snapping noisy GPS traces to the actual road network—is one of the most computationally demanding and methodologically complex steps in any mobility pipeline. It requires loading a complete road network graph, computing probabilistic alignments between GPS observations and candidate road segments, and resolving ambiguities at intersections, parallel roads, and complex interchanges.
Most map matching solutions are either commercial APIs with strict rate limits and per-request pricing that make batch processing of historical data prohibitively expensive, or open-source tools that require significant infrastructure setup and do not scale to city-wide or fleet-wide datasets. Researchers have consistently identified scalability as the primary bottleneck: algorithms that produce accurate matches on small datasets fail to perform when confronted with millions of trajectories.

Having map matching available as an integrated capability within the same distributed environment where you are already processing and analyzing your trajectory data—rather than as an external API call or a separate system—eliminates an entire category of infrastructure complexity and data movement overhead.

The Real Cost of a Fragmented Mobility Data Pipeline

In practice, most organizations processing mobility data have assembled a patchwork of tools: Python scripts for data cleaning, PostGIS for spatial operations, custom code for trip segmentation, an external API for map matching, a separate clustering library, and a visualization tool at the end. Each transition between tools introduces data serialization overhead, potential for schema drift, and opportunities for subtle bugs.

This fragmentation carries real costs beyond engineering time. When a researcher needs to change the trip segmentation threshold from 20 minutes to 30 minutes, the entire pipeline must be re-executed across multiple systems. When a new data source arrives with a slightly different schema, adapters must be updated at each integration point. When results need to be reproduced for regulatory or academic review, reconstructing the exact sequence of operations across disparate tools is often impractical.
The ideal mobility data pipeline processes GPS pings through ingestion, cleaning, enrichment, trajectory construction, map matching, spatial indexing, clustering, and analytical aggregation—all within a single, distributed, spatially-aware environment where every step is expressed in SQL or Python, every intermediate result is inspectable, and the entire pipeline can be reproduced with a single execution.

What a Modern Mobility Data Architecture Looks Like

The medallion architecture—Bronze, Silver, Gold—has become the standard pattern for progressive data refinement in the data lakehouse world. But applying it to mobility data requires rethinking what each layer does, because spatial data introduces transformations and enrichment steps that have no analog in conventional data engineering.

Bronze is not just ingestion—it is spatial profiling. You are not only loading CSV or Parquet files; you are constructing point geometries, validating coordinate bounds, assessing data quality metrics like altitude validity and temporal coverage, and establishing the spatial extent of your dataset.

Silver is where the heavy lifting happens. This is trip segmentation, 4D geometry construction, trajectory building, movement metric derivation, spatial indexing, and map matching. Each of these operations is computationally intensive, order-dependent, and requires spatial functions that most data platforms simply do not have.

Gold produces the analytical views that power downstream consumption: H3 hexbin density heatmaps, temporal activity patterns, stop detection via spatial clustering, trajectory anomaly flagging, and road segment speed analysis. These views are written as GeoParquet files—compatible with Kepler.gl, QGIS, Felt, Foursquare Studio, and DuckDB Spatial—ensuring that the output of the pipeline is immediately consumable by any modern geospatial visualization or analytics tool.
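As a toy illustration of the Gold-layer density aggregation, the sketch below bins points into fixed lat/lon cells. Real H3 hexbinning would use the H3 index rather than square cells, so the grid here is a simplifying assumption for illustration:

```python
import math
from collections import Counter

def grid_density(points, cell_deg=0.01):
    """Aggregate (lat, lon) points into fixed-size grid cells: a square-cell
    stand-in for an H3 hexbin density view. Returns (cell_min_lat,
    cell_min_lon, count) rows sorted by cell."""
    counts = Counter(
        (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        for lat, lon in points
    )
    return sorted((i * cell_deg, j * cell_deg, n) for (i, j), n in counts.items())

# Two pings in one cell, one in a neighboring cell.
pts = [(39.9751, 116.3301), (39.9752, 116.3302), (39.9905, 116.3305)]
for cell_lat, cell_lon, n in grid_density(pts):
    print(cell_lat, cell_lon, n)
```

The distributed version of this is a single GROUP BY over a spatial index column, which is why Gold-layer views stay cheap to recompute once Silver has done the heavy lifting.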
How Wherobots Handles Mobility Data Processing: What We Built

To demonstrate this architecture in practice, we built a three-notebook Wherobots Mobility Solution Accelerator using the Microsoft Research GeoLife GPS Trajectories dataset—one of the few open mobility datasets that includes elevation data, enabling full 4D geometry processing. The dataset contains 17,621 trajectories from 182 users in Beijing, with latitude, longitude, altitude, and timestamps spanning 2007–2012.

In Part 2, we walk through every notebook in detail: the Bronze layer’s ingestion and profiling pipeline, the Silver layer’s 4D trajectory construction and map matching workflow, and the Gold layer’s analytical and exploratory views. We cover the specific Apache Sedona spatial SQL functions used at each step, the PySpark window function patterns for trip segmentation and movement metrics, and the real-world challenges we encountered and solved—from Spark’s schema inference corrupting timestamp values, to COLLECT_LIST not preserving order in trajectory construction, to DBSCAN requiring physical column references.

If you are building mobility analytics pipelines and hitting the limitations of your current toolchain, Part 2 will give you a concrete, reproducible blueprint for how to do it on Wherobots.

Get Started with Wherobots: Access Now
PostGIS vs Wherobots: What It Actually Costs You to Choose Wrong

Posted on March 19, 2026 by Matt Forrest

When building a geospatial platform, technical decisions are never just technical; they are financial. Choosing the wrong architecture for your spatial data doesn’t just frustrate your data team; it directly impacts your bottom line through large cloud infrastructure bills and, perhaps more dangerously, delayed business insights.

For decision-makers, the choice between a traditional spatial database (like PostGIS, an open-source extension of PostgreSQL that adds support for storing and querying location data) and a cloud-native geospatial analytics platform (like Wherobots, built with distributed computing including Apache Spark to process massive spatial datasets in parallel across compute clusters) comes down to two fundamental metrics: Time to Insight and Total Cost of Ownership (TCO). To understand where to invest your budget, we need to look beyond the software labels and understand the economics of how these systems handle data.

If you are new to the PostGIS vs Wherobots discussion, start with [Part 1: PostGIS, Wherobots, and the Spatial Data Lakehouse: A Strategic Guide for Leaders]. This post assumes you understand the architectural difference and focuses on what it actually costs you to choose wrong.

Key Takeaways:

- PostGIS is optimized for low-latency lookups. Wherobots is optimized for high-throughput analytics and data processing. Using the wrong one for the wrong job costs you both time and money.
- A PostGIS server must be provisioned for peak load, so you pay for maximum capacity even when usage is low. Wherobots is designed to charge primarily for active compute time, so you are not paying for idle capacity.
- For industries like insurance, logistics, and urban planning, the right architecture choice can dramatically reduce query time for large-scale spatial analysis — in some cases from hours or days down to minutes.
- PostGIS and Wherobots are not mutually exclusive. Many enterprises use Wherobots to process data at scale, then serve results through PostGIS for live application access.
- Use the checklist in this post as a fast diagnostic: if you are waiting hours for spatial queries or your cloud bill is outpacing your revenue growth, you have a strong case for cloud-native spatial compute.

Why Slow Spatial Queries Cost More Than You Think

In the modern enterprise, the value of data decays over time. An answer delivered in 5 seconds is actionable; an answer delivered in 5 days is a post-mortem. The architecture you choose dictates how fast you can answer complex questions.

Low Latency vs High Throughput: What Speed Actually Means for Each Tool

It is crucial to understand the difference between “speed” for an app and “speed” for analytics.

Low Latency (PostGIS): This is the speed of retrieval. When a customer opens your delivery app and asks, “Where is my driver?”, they need an answer in milliseconds. PostGIS is optimized for this. It uses heavy indexing to find a single “needle in a haystack” instantly.

High Throughput (Wherobots): This is the speed of processing. When your risk analyst asks, “Which of our 50,000 retail locations are at risk of flooding based on the new 100-year climate models?”, they are not looking for a needle; they are looking for patterns across the whole haystack.

The Bottleneck: If you try to run that massive climate model analysis in PostGIS, the database has to check every single location against every single flood zone, even with optimizations for search like spatial indexing. Because PostGIS typically runs on a single server, running that analysis forces your app and your analytics to compete for the same resources. The query might take 24 hours, and your customer-facing app slows down the whole time.

The Solution: Wherobots breaks that same job into parallel tasks across a cluster of worker nodes, each handling a partitioned geographic slice.
Because the work happens simultaneously, the job finishes in minutes instead of hours.

Business Impact: Your analysts get answers before lunch, not next week. Your operational apps remain fast for customers because the heavy lifting happened elsewhere.

PostGIS vs Wherobots: Why the Pricing Models Are Fundamentally Different

The second major factor is how you pay for these capabilities. Query speed is only half the cost story. The other half is how each system charges you. The pricing models for databases and cloud-native engines are fundamentally different.

Why PostGIS Forces You to Pay for Peak Capacity Around the Clock

A high-performance database like PostGIS requires expensive hardware (specifically, high-speed RAM and fast CPUs) to keep your data accessible.

The Lease Model: PostGIS requires provisioning a server for peak load, which means you pay for maximum capacity even when usage is low. You have to provision this server for your peak usage. If you need to run a heavy report once a week that requires 64 cores of CPU, you must pay for a 64-core server 24 hours a day, 7 days a week.

The Waste: For the other 6 days and 23 hours, that expensive server sits idle, burning budget. It is analogous to leasing a Ferrari just to drive to the grocery store on Sundays.

How Wherobots Charges Only for Active Compute Time

Wherobots uses an elastic, on-demand pricing model: you consume compute while an operation is actively running, then billing stops when it finishes. Storage and compute are decoupled, meaning your data sits in low-cost object storage and you rent processing power only when you need it.

Storage is Cheap: You keep your massive datasets in low-cost Object Storage (like Amazon S3), which costs pennies per gigabyte.

Compute is On-Demand: When you need to run that heavy climate model, you rent the 1,000 computers for exactly 15 minutes.
The moment the job is done, the machines turn off, and the billing stops.

The Savings: You convert a massive fixed Capital Expenditure (CapEx) into a lean, manageable Operational Expenditure (OpEx).

Create your Wherobots account: Get Started

PostGIS vs Wherobots: Which One Is Right for Your Use Case?

To make this concrete, let’s look at three distinct industry examples and which tool provides the best ROI for each.

1. Logistics and Delivery: Why Real-Time Tracking Needs PostGIS

Scenario: You need to route drivers in real-time and show customers where their package is.
The Choice: PostGIS.
Why: You need transactional guarantees. If a driver marks a package as “Delivered,” that data must be instantly saved and visible. You are doing millions of tiny, fast lookups.

2. Insurance and Real Estate: Why Portfolio-Scale Risk Analysis Needs Wherobots

Scenario: You need to calculate risk premiums for 10 million homes based on historical wildfire data, distance to fire stations, and vegetation density indices.
The Choice: Wherobots.
Why: This is a “Global Join.” You are comparing massive datasets against each other. PostGIS would take weeks to process this at a national scale. Wherobots can recalculate the entire portfolio every night, allowing you to adjust pricing dynamically.

3. Urban Planning: Why Time-Series Sensor Data Overwhelms a Standard Database

Scenario: You are ingesting telemetry data from 50,000 connected traffic lights and sensors to analyze traffic congestion trends over the last 5 years.
The Choice: Wherobots.
Why: The volume of time-series data is too large for a standard database. A database would bloat, slow down, and become expensive to back up. Wherobots can read this data directly from cheap storage, aggregate it into trends, and output the results.

PostGIS vs Wherobots: Decision Checklist

PostGIS is a good choice when your primary use case is powering a user-facing application that needs fast, transactional lookups on a relatively stable dataset.
Wherobots is the better choice when you are running analytical queries across complex datasets, processing historical data at scale, or need compute costs that flex with actual usage rather than peak capacity.

If you are currently evaluating your data stack, use this simple checklist to guide your architecture decision.

Stick with PostGIS if:

[ ] Your primary goal is powering a user-facing application.
[ ] You need to edit data manually (e.g., fixing property boundaries).
[ ] Your dataset is relatively stable and fits comfortably on one large server.
[ ] You require strict “ACID” transactions (meaning every write is confirmed and visible before the next read — no partial updates, no stale reads).

Move to Wherobots if:

[ ] You are waiting hours or days for analytical queries to finish.
[ ] Your cloud database bill is growing faster than your revenue.
[ ] You need to join two massive datasets (e.g., “All Buildings” + “All Parcels”).
[ ] You are building AI/Machine Learning models that need to “learn” from all your historical data.

The most competitive organizations today realize they don’t have to choose just one. They use Wherobots to crunch the data cheaply and efficiently, and then move the polished results into PostGIS for instant access—a strategy we will cover in our next post on the “Medallion Architecture”, a data design pattern where raw, refined, and production-ready data are stored in separate layers, each optimized for different workloads.

Ready to see what this looks like for your workload? Contact us and get a cost comparison built around your actual data volume.
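To make the fixed-versus-elastic arithmetic above concrete, here is a toy cost model. Every rate in it is an invented assumption for illustration; it is not Wherobots, AWS, or PostgreSQL pricing:

```python
# Toy CapEx-vs-OpEx comparison. All rates are illustrative assumptions,
# not actual Wherobots or cloud-provider pricing.
HOURS_PER_MONTH = 730

# Provisioned model: a 64-core server sized for the weekly peak, paid 24/7.
provisioned_rate = 3.00                        # $/hour for the big box (assumed)
provisioned_monthly = provisioned_rate * HOURS_PER_MONTH

# Elastic model: the same weekly job on a large on-demand cluster.
cluster_rate = 60.00                           # $/hour for a many-node cluster (assumed)
job_hours = 0.25                               # one 15-minute run
runs_per_month = 4                             # weekly
elastic_monthly = cluster_rate * job_hours * runs_per_month

print(f"provisioned: ${provisioned_monthly:,.0f}/mo")  # $2,190/mo
print(f"elastic:     ${elastic_monthly:,.0f}/mo")      # $60/mo
```

The exact numbers are made up, but the structure is the point: the provisioned cost scales with hours in the month, while the elastic cost scales with hours of actual work.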
Raster Processing at Scale: The Out-of-Database Architecture Behind WherobotsDB

Posted on March 10, 2026 by Pranav Toggi

Introduction

Raster data (satellite imagery, elevation models, sensor grids) is critical to understanding the physical world and increasingly to powering AI. The challenge most data teams face is processing it at scale. Doing so requires an architecture that avoids loading entire files into memory. WherobotsDB solves this with an out-of-database approach that fetches pixel data on demand, enabling terabyte-scale processing without the memory overhead of traditional raster engines.

WherobotsDB extends open-source Apache Sedona with capabilities and performance optimizations purpose-built for preparing physical-world data for AI at scale, while maintaining full API compatibility. Existing Sedona workloads run without code changes. This post covers how WherobotsDB handles the full raster lifecycle: scalable processing architecture, raster math, coordinate reference systems, vector-raster hybrid workflows, and planetary-scale inference.

“With Wherobots, we were able to merge 15+ complex vector datasets in minutes and run high-resolution ML inference on raster imagery at a fraction of the cost of our legacy stack. The combination of speed, scalability, and ease of integration has boosted our engineering productivity and will accelerate how quickly we can deliver new geospatial data products to market.” — Rashmit Singh, CTO, SatSure

What Is Out-of-Database Raster Architecture?

At the foundation of Wherobots raster capabilities is an out-of-database raster architecture, which makes it far easier to process raster imagery in an embarrassingly parallel fashion. Instead of loading entire raster files into memory, only metadata is stored and pixel data is fetched on demand.
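A toy sketch of the metadata-only reference pattern described above. The class and its lazy window read are hypothetical illustrations, not WherobotsDB's actual implementation (the "remote read" is simulated with a generated tile instead of a range-read from object storage):

```python
from dataclasses import dataclass, field

@dataclass
class OutDbRaster:
    """Toy out-of-database raster reference: holds only metadata, and
    fetches pixel windows lazily on first access."""
    url: str
    width: int
    height: int
    _cache: dict = field(default_factory=dict)  # window -> pixel rows

    def read_window(self, x0, y0, x1, y1):
        key = (x0, y0, x1, y1)
        if key not in self._cache:               # touch "remote" bytes only once
            self._cache[key] = [[0] * (x1 - x0) for _ in range(y1 - y0)]
        return self._cache[key]

# The reference itself is tiny; pixels materialize only for requested windows.
scene = OutDbRaster("s3://bucket/scene.tif", width=10_000, height=10_000)
tile = scene.read_window(0, 0, 256, 256)         # only this window is materialized
print(len(tile), len(tile[0]))                   # 256 256
```

Because the reference carries only metadata, shuffling thousands of such objects across a cluster moves kilobytes rather than the gigabytes of pixel data they describe.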
This metadata-only approach means teams can process terabyte-scale imagery collections — statewide mosaics, multi-year satellite archives, continental elevation models — with the same interface they use for vector data. Operations like zonal statistics, clipping, masking, filtering, and raster algebra scale to datasets that would overwhelm in-memory approaches.

Capability | Apache Sedona | WherobotsDB | Notes
--- | --- | --- | ---
Out-DB Raster Support | ◔ | ● | Creates lightweight raster references; pixel data loaded only when needed
Intelligent Caching Layer | ○ | ● | Minimizes repeated remote reads for frequently-accessed rasters
Optimized Shuffle Operations | ○ | ● | Data movement handles only metadata — orders of magnitude faster than full rasters
On-Demand Materialization | ○ | ● | Selectively convert external rasters to in-database format when needed
Automatic Metadata Optimization | ○ | ● | Pre-loads metadata and intelligently repartitions for optimal parallelism
Cloud-Optimized GeoTIFF Support | ◐ | ● | Native COG support with tile-based partial reads from cloud storage

How Out-DB Architecture Transforms Raster Operations

The out-of-database architecture fundamentally changes how raster operations execute:

Operation | Apache Sedona (In-DB Only) | WherobotsDB
--- | --- | ---
Data Movement (Shuffle) | Serializes all pixel data across executors | Serializes only metadata (~KB vs GB)
Tiling Operations | Copies pixel data into each tile | Memory-efficient tiling without data duplication
Clipping | Full pixel-by-pixel processing | Optimized processing paths for common operations
Zonal Statistics | Processes entire raster regardless of region size | Materializes only zonal pixels, optimizing I/O based on region of interest
Raster Loading | Loads entire raster into memory at read time | Lazy loading: metadata on-demand, pixels only when accessed
Resource Management | Standard memory lifecycle | Intelligent caching layer with disk caching for remote files

Key benefits:

On-Demand Data Access: Instead of loading entire raster files into memory, WherobotsDB fetches pixel data only when an operation requires it, reducing memory overhead and enabling processing at terabyte scale.

Memory-Efficient Tiling: Tiles share references to the underlying file with different spatial bounds — enabling massive parallelism without memory overhead.

Smart I/O Reduction: Operations optimize I/O based on the region of interest, and spatial filter push-down skips irrelevant raster files entirely.

Cloud-Optimized GeoTIFFs (COGs) are GeoTIFF files structured so that only the specific byte ranges needed for a given operation are fetched from remote storage, rather than downloading the entire file. Combined with on-demand loading, this architecture minimizes both memory footprint and network I/O.

What Raster Capabilities Does WherobotsDB Include?

Building on this foundation, WherobotsDB includes enhanced raster functions that enable satellite imagery, elevation models, and sensor data analysis directly in SQL alongside traditional vector operations.

Capability | Apache Sedona | WherobotsDB | Notes
--- | --- | --- | ---
Raster to Vector Conversion | ○ | ● | Convert raster regions to vector polygons for hybrid vector-raster analysis workflows
Multi-Band Tile Processing | ○ | ● | Align, stack, and tile rasters from different sources, CRS, and resolutions for distributed multi-source analysis
Zonal Statistics | ◕ | ● | Both support the full statistics suite; WherobotsDB’s Out-DB architecture materializes only zonal pixels, enabling scalability across millions of zones
Custom Raster Algebra | ◕ | ● | Flexible map algebra expressions with near-native execution performance
Spatial Filter Push-down for Rasters | ○ | ● | Uses bounding boxes to skip irrelevant raster files, dramatically reducing I/O for selective queries

These capabilities fall into two categories:

- Transforming raster data for hybrid workflows: Raster to Vector Conversion, Multi-Band Tile Processing.
- Analyzing raster data in place: Zonal Statistics, Custom Raster Algebra, Spatial Filter Push-down.

Raster to Vector Conversion converts contiguous raster regions with the same pixel value into vector polygons.
It is essential for workflows that need to analyze raster-derived features (flood extents, land cover classifications, building footprints from DSM) using vector spatial operations like overlay, buffering, or spatial joins.

Multi-Band Tile Processing solves one of the most common friction points in raster analysis: combining data from different sources. Satellite imagery from different sensors, time periods, or providers typically arrives in different coordinate reference systems, resolutions, and data types. WherobotsDB aligns and stacks these into a unified multi-band raster, then tiles it into spatial chunks for distributed processing — all in a single operation. This enables workflows like change detection across multi-temporal composites, fusing Sentinel-2 optical bands with elevation data, or building analysis-ready multi-spectral stacks, without manual reprojection or resampling steps.

Zonal Statistics computes aggregate statistics (count, sum, mean, median, mode, stddev, variance, min, max) for raster pixels falling within vector zones. Both Apache Sedona and WherobotsDB support zonal statistics — the differentiator is scale. WherobotsDB’s out-of-database architecture materializes only the pixels within each zone rather than processing the entire raster, making it practical to run zonal statistics across millions of zones on terabyte-scale imagery.

Custom Raster Algebra executes user-defined raster algebra expressions with near-native execution performance. It supports complex multi-band calculations, conditional logic, and neighborhood operations — enabling workflows like computing NDVI (Normalized Difference Vegetation Index, a measure of vegetation density derived from red and near-infrared bands) from satellite imagery, or applying threshold-based classification across large imagery collections.

Spatial Filter Push-down for Rasters uses bounding boxes to skip irrelevant raster files, dramatically reducing I/O for selective queries.
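As an aside, the NDVI band math mentioned under Custom Raster Algebra is simple to sketch in plain Python. In WherobotsDB it would be expressed as a raster algebra expression over real band rasters; the nested lists here are a stand-in for illustration:

```python
def ndvi(red, nir):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel over two
    equally shaped band arrays (lists of rows). Returns 0.0 where both
    bands are 0 to avoid division by zero."""
    return [
        [
            0.0 if (n + r) == 0 else (n - r) / (n + r)
            for r, n in zip(red_row, nir_row)
        ]
        for red_row, nir_row in zip(red, nir)
    ]

# Toy 2x2 reflectance tiles: healthy vegetation reflects far more NIR than red.
red = [[0.10, 0.30], [0.05, 0.00]]
nir = [[0.50, 0.30], [0.45, 0.00]]
print(ndvi(red, nir))
```

The output ranges from -1 to 1, with dense vegetation near the top of the range, which is why a threshold on NDVI is a common first-pass classification.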
When a catalog contains thousands of scenes but only a few intersect your area of interest, irrelevant files are eliminated before any processing begins. Because all of these functions are built on the out-of-database architecture, they inherit the same scalability characteristics described above — lazy loading, selective pixel materialization, and intelligent caching — without additional configuration. What Is RasterFlow and How Does It Work? Recently, Wherobots has added an entirely new inference and perception engine for planetary-scale image processing, extending the raster lifecycle beyond analysis into AI. RasterFlow enables teams to run computer vision models against large-scale raster datasets. It handles the full pipeline: preparing imagery, mosaicking, removing edge effects across tiles, executing distributed model inference, and converting predictions into vector geometries, all within Wherobots Cloud. RasterFlow’s outputs are stored as vectorized results in Apache Iceberg tables — an open table format for large-scale analytic datasets — or as predictions within Zarr (a cloud-native format for chunked, compressed multi-dimensional arrays) or COGs, which can be seamlessly analyzed using the full suite of spatial operations in WherobotsDB. This creates end-to-end raster workflows — from raw imagery through model inference to spatial analytics — without moving data between systems or building custom infrastructure. RasterFlow supports both popular open-source geospatial AI models and custom PyTorch models, and can generate embeddings from geospatial foundation models. It is currently available to select customers in private preview. If you’re interested in RasterFlow, join our upcoming session to see it in action. What Comes Next: Query Performance and Spatial Analytics Raster processing is not only a first-class capability in WherobotsDB; it is also one part of a broader set of spatial data processing advances we’ve built beyond open-source Sedona.
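As a toy illustration of the custom raster algebra capability described earlier, the NDVI formula (NIR - Red) / (NIR + Red) plus a threshold classification can be sketched in pure Python. This shows the math only, not Wherobots syntax, and the sample band values are invented:

```python
# Toy sketch of per-pixel raster algebra: compute NDVI from red and
# near-infrared (NIR) bands, then apply a threshold classification.
# Illustrative only; in a raster engine this would be a map algebra
# expression evaluated over real imagery tiles.

def ndvi(red_band, nir_band):
    """NDVI = (NIR - Red) / (NIR + Red), computed pixel by pixel."""
    out = []
    for red_row, nir_row in zip(red_band, nir_band):
        row = []
        for red, nir in zip(red_row, nir_row):
            denom = nir + red
            row.append((nir - red) / denom if denom else 0.0)
        out.append(row)
    return out

def classify(ndvi_band, threshold=0.3):
    """Threshold-based classification: 1 = vegetated, 0 = not."""
    return [[1 if v >= threshold else 0 for v in row] for row in ndvi_band]

# Tiny 2x2 example bands (invented reflectance values)
red = [[0.10, 0.30], [0.05, 0.25]]
nir = [[0.50, 0.35], [0.45, 0.25]]

v = ndvi(red, nir)      # e.g. v[0][0] = (0.5 - 0.1) / (0.5 + 0.1)
mask = classify(v)      # binary vegetation mask
```

The same two-step pattern (band expression, then conditional logic) generalizes to the multi-band calculations and conditional classifications mentioned above.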
Vector and raster workloads both benefit from the same query performance optimizations under the hood: spatial relationship acceleration, automatic join optimization, dynamic data redistribution, and a vectorized GeoParquet reader. Queries that require careful tuning with self-managed Sedona run optimally out of the box with WherobotsDB. In the next post in this series, we’ll go deep on query performance and spatial analytics: how WherobotsDB accelerates spatial joins, range queries, and analytical functions across both vector and raster data types. Get Started with Wherobots Access Now
PostGIS, Wherobots, and the Spatial Data Lakehouse: A Strategic Guide for Leaders Posted on February 27, 2026 (updated March 10, 2026) by Matt Forrest For nearly two decades, the answer to the question “Where should we store our location data?” was simple and singular: The Database. Specifically, the industry-standard PostgreSQL database extended with PostGIS. It was reliable, powerful, and sufficient for the era of web maps and queries. But the world has changed. Organizations today aren’t just managing fixed assets like utility poles or land parcels. They are ingesting high-velocity telemetry from delivery fleets, processing terabytes of daily satellite imagery, and analyzing global datasets from building footprints to flood analysis to human mobility data. The “one-size-fits-all” database can no longer handle this diversity of scale. As a result, modern data leaders face an architectural choice among three interrelated approaches: The Database (PostGIS): The operational gold standard for high-speed transactions. The Processing Engine (Wherobots): The cloud-native engine built for massive-scale geospatial analytics and AI. The Spatial Data Lakehouse: A unified architecture that combines the low cost of a data lake with the governance of a warehouse. Understanding the specific role of each and how they fit together can help create a nimble, cost-effective data strategy for spatial data and analytics. What Is PostGIS and When Should You Use It? PostGIS is an open-source extension for PostgreSQL that adds support for geographic objects, enabling location queries directly inside a relational database. Think of PostGIS as the high-precision engine that powers your day-to-day business operations. It is a “Scale-Up” technology, meaning it lives on a single server that you make larger as your needs grow. PostGIS Strengths: Speed, Transactions, and Precision Instant Precision: PostGIS is optimized for “row-level” access.
If your mobile app needs to tell a user, “Find the five closest drivers to my current location right now,” PostGIS is the perfect tool. It uses sophisticated indexing to find that needle in the haystack in milliseconds. Data Integrity (ACID): In industries like government, banking, or real estate, you cannot afford to lose data or have “eventual consistency.” PostGIS guarantees that when a record is written, it is saved instantly and correctly. The “Vertical” Ceiling: The limitation of PostGIS is physics. Because it runs on one server, there is a hard limit to how much it can process. If you ask it to analyze five years of historical GPS data for 10 million vehicles, the server will likely slow down or fail. It wasn’t built for “Big Data” analytics; it was built for fast transactions. What Is Wherobots and When Does It Outperform PostGIS? Wherobots is a cloud-native spatial analytics platform built on Apache Sedona. Unlike traditional databases that run on a single server, it distributes workloads across hundreds of machines simultaneously. If PostGIS is a sports car designed for speed and agility, Wherobots is a freight train designed for massive hauling capacity. It represents a “Scale-Out” architecture, built specifically for the era of Cloud and AI. Built by the original creators of Apache Sedona, Wherobots delivers the same types of spatial SQL functions that PostGIS does, but in a Spark-based architecture, enabling the heavy distributed computing and processing that Spark has unleashed for preparing data for Cloud and AI workloads. Wherobots Strengths: Scale, AI Pipelines, and Cost Control Unlimited Scalability: Wherobots doesn’t run on one computer. When you send it a job, it spins up a cluster of hundreds or thousands of worker nodes to tackle the problem in parallel. This allows it to process planetary-scale datasets like “All Buildings in the World” or “Global Weather Patterns” in minutes rather than weeks.
Separation of Compute & Storage: This is a critical cost factor. With Wherobots, your data lives in cheap object storage (like Amazon S3), and you only pay for the computing power when you are actually running a query or a join. You can stop paying for the compute the moment the job finishes. AI & Data Science Native: Modern data science teams work in Python and notebooks, not just SQL. Wherobots is native to this ecosystem (built on Apache Sedona), making it the primary engine for training Machine Learning models on geospatial data, such as generating embeddings, predicting crop yields from satellite photos, or forecasting traffic congestion. What Is a Spatial Data Lakehouse and Why Does It Matter? A Spatial Data Lakehouse is an architectural pattern that stores geospatial data in open formats like Apache Iceberg or Parquet in cloud object storage, then allows multiple tools, from BI platforms to AI engines, to query that same data without duplication. It emerged as a solution to a longstanding problem: companies were forced to maintain two separate worlds, a data warehouse for structured reports and a data lake for raw files, creating silos where data was either too expensive to store or too messy to query. The Spatial Data Lakehouse is the modern solution that bridges this gap. Spatial Data Lakehouse Benefits: One Copy, Flexible Access, Lower Cost One Copy of Data: Instead of copying data back and forth between systems (which creates errors and version conflicts), data stays in one place, usually cloud object storage, in open formats like Apache Iceberg or Parquet. Flexible Access: The Lakehouse allows different engines to access the same data leveraging open table formats like Apache Iceberg or Delta Tables, governed by a catalog system like the Wherobots hub, Databricks Unity Catalog, or Snowflake Polaris Catalog.
Your data scientists can use Wherobots to run heavy AI models on the data, while your business analysts use a BI tool to view the same data, without needing to move it. Governance & Cost Control: It brings the “grown-up” features of a database (like security, version history, and transaction safety) to the low-cost environment of the data lake. You get the structure of a warehouse with the low price tag of a lake. The Decision Matrix: What to Use When? To help you navigate this landscape, we’ve broken down the best use cases for each technology. Use PostGIS When: The Mission is “Now”: You are powering a live application where sub-second response times are critical for user experience. Transactional Safety is Paramount: You are managing a system of record (e.g., a Land Registry or Utility Network) where complex edits happen frequently. Data Volume is Manageable: Your active dataset is in the Gigabytes up to low Terabytes range. Use Wherobots When: The Mission is “Insight”: You are analyzing trends, patterns, or aggregates over time (e.g., “Show me the 5-year flood risk for our entire real estate portfolio”). Data Volume is Massive: You are dealing with High-Velocity Telemetry, Raster images, or datasets in the Terabytes or Petabytes range. Cost Efficiency is Critical: You want to avoid paying for idle servers or prefer a “pay-as-you-use” model for heavy workloads (available in the Wherobots Pro tier). You are building for AI: Your team needs to feed massive geospatial datasets into Machine Learning or embedding generation pipelines. How PostGIS, Wherobots, and the Lakehouse Work Together The market is moving away from binary choices. The most successful organizations do not view this as “PostGIS vs. Wherobots.” Instead, they view it as a supply chain. They use Wherobots as the heavy industrial refinery for ingesting, cleaning, and analyzing the massive raw materials of the data lake. 
They then ship the refined, high-value insights to PostGIS, which serves as the high-speed distribution center for the business. By understanding the unique strengths of each player in this landscape, you can build a data architecture that is not only powerful enough for today’s AI demands but also sustainable for tomorrow’s budget. Try Wherobots Get Started
Wherobots and Felt Partner to Modernize Spatial Intelligence Posted on February 10, 2026 (updated February 18, 2026) by Ben Pruden We’re excited to announce Wherobots and Felt are partnering to enable data teams to innovate with physical world data and move beyond legacy GIS, using the modern spatial intelligence stack. The stack with Wherobots and Felt provides a cloud-native spatial processing and collaborative mapping solution that accelerates innovation and time-to-insight across an organization. Wherobots delivers the most capable spatial query and inference engine for creating insights from physical world data (raster, vector, structured) of any scale. Felt delivers an intuitive, browser-based experience that enables business teams to explore this data, ask questions, and share insights collaboratively. The combination provides a new path forward for teams that are innovation-limited by their desktop-bound GIS tools, restrictive licensing arrangements, and unscalable workflows. What is Felt? Felt is the new standard for collaborative mapping, and their product is often described as “the Google Docs of GIS.” It is a cloud-native platform designed to turn complex geospatial data into actionable insights through its unique collaborative map development capabilities. Unlike traditional Geographic Information Systems (GIS) that are often desktop-bound, Felt lives entirely in the browser, allowing teams to create, analyze, and share interactive maps with the speed and ease of a modern productivity tool. Why Legacy GIS is Limiting Spatial Data Innovation Spatial data comes in various formats and sizes, and it needs to be processed and combined with other datasets for people, systems, and AI to innovate with it. Because few systems were developed to handle this wide spectrum of data complexity, format, and scale in a graceful way, users have been forced to create cumbersome, time-consuming workarounds.
In turn, this has led to a high degree of specialization and a special category of GIS tools that simply can’t keep up with demand. Buyers, faced with few options, have been limited to old-guard licensing arrangements for lagging technology solutions. Such engagements plant a thorn in the side of many organizations, which end up constrained in their ability to innovate unless they hire more specialized staff or contractors to develop workarounds that copy data to and from bespoke GIS tools or databases. What buyers want is to deploy modern tooling and AI that “just absorbs the complexity” so that their data practitioners can simply work with this data and achieve their goals. They also want the flexibility to adopt the best tools over time, and they demand data sovereignty. If you’re on a small but nimble team of data scientists and engineers solving world-level problems, you want the capability to easily crunch terabytes of vector, raster, and structured data using familiar tools. Data interactivity is key because the easiest way to understand spatial data quality at scale is to inspect relationships visually at interactive speed, and you can iterate faster toward the shared objective when in-team collaboration is seamless. Historically, teams relied on piecemeal operations that extended innovation cycles because of data and work silos or limited tooling support. Similarly, analysts, planners, and stakeholders want simple, AI-enabled, visual-first tools to understand and ask questions about spatial data so they can make decisions from it. Their ability to tailor analytics from a visual-first tool was limited by the capabilities of the underlying query engine and the data available to it. Moving from Legacy GIS to a Modern Spatial Intelligence Stack Our partnership with Felt directly addresses this friction with a seamless integration between platforms to deliver the spatial intelligence stack.
Using Felt’s AI-assisted and collaborative map development experience, customers can easily query, interrogate, and build insightful visualizations from multiple sources of spatial and non-spatial data in their private data lakes or open repositories. This is done using Wherobots as the query engine that absorbs the heavy lifting associated with multi-modal spatial data processing and cataloging. Key Benefits of the Wherobots and Felt Integration Easy for everyone: No user is left behind. Wherobots supports data engineers, scientists, and developers who need speed and performance, while Felt enables front-end developers, GIS users, and business teams to get insight instantly from the analysis. AI first: From inference and MCP tools to collaborative maps driven by AI and natural language prompts, users have the modern AI tools to generate insights quickly at scale. Built for the lakehouse: The lakehouse architecture maximizes data sovereignty and agility by giving you the freedom to choose the best tools for the job over time. Spatially optimized: World-class spatial capability and performance is ready out of the box. Wherobots provides greater scale, performance, and capability to crunch and perform inference on multi-modal spatial data than any other cloud platform. A fully managed, friendly, open-standards-based deployment: On-demand deployment and pay-as-you-go pricing let you reach your goals through open, well-understood standards and architecture. Customer Example: How Leaf Agriculture Uses Wherobots and Felt The partnership started with a joint customer request from Leaf Agriculture, which has been using Wherobots and Felt over the past year to productionize its LeafLake offering and to harmonize large-scale tractor data for its Leaf Unified API. Leaf serves customers and partners like Syngenta, Bayer, and Farmers Edge, and many others in the agricultural economy.
The delivery of LeafLake, supported by Wherobots and Felt, creates a high-velocity data fabric that turns fragmented agronomic data into planetary-scale intelligence. Wherobots serves as the query engine, running distributed spatial SQL operations to create insight-ready datasets from millions of acres of machine data and imagery at 5–20x the speed and at a fraction of the cost of traditional solutions. By running directly on Leaf’s unified data lake, Wherobots transforms raw telemetry into structured, “AI-ready” insights in seconds. Felt acts as the collaborative “window” into this data, providing a high-performance mapping interface that lives entirely in the browser. Through a native integration, data processed in Wherobots is made directly available to Felt. This “SQL-to-map” workflow allows agronomists and decision-makers to interact with LeafLake data in real time in Felt, enabling a “Google Docs” style of collaboration over complex agricultural insights. This map showcases farm data from Leaf Agriculture’s LeafLake platform through Felt’s interface for the purpose of precision farming based on tractor telemetry and soil data. Get Started Today This integration marks a major step forward in making the creation of spatial intelligence accessible to entire organizations. We can’t wait to see what you build with the combined power of Wherobots and Felt. To learn more: Check out the documentation to get started in minutes. Join us in this upcoming session to see how Wherobots and Felt work together to build the foundation for spatial AI. Talk to our team to explore options.
Wherobots Spatial Intelligence Engine Integrates with Databricks Unity Catalog for Spatial Data Posted on September 11, 2025 (updated March 24, 2026) by Damian TL;DR: Wherobots now integrates with Databricks Unity Catalog, enabling users to process spatial data up to 20x faster with 60% cost savings. This integration supports raster/vector data, 300+ spatial functions, and enterprise security—all while maintaining full compatibility with Apache Sedona and Spark. Databricks Geospatial Performance with Wherobots:
- 5-20x faster spatial query performance
- 60% cost reduction on spatial workloads
- 300+ spatial SQL/Python/Scala functions
- 100% Apache Sedona compatibility (zero code changes)
Databricks users can now enhance their geospatial analytics capabilities with Wherobots, a spatial intelligence engine purpose-built for processing data from the physical world. Wherobots brings advanced raster processing, computer vision ML inference, and industry-leading performance to your existing Databricks environment. With Wherobots Data Federation for Unity Catalog, you can:
- Expand spatial coverage to grow revenues, improve margins, and make better decisions with complete spatial intelligence
- Create innovative spatial data products that leverage aerial imagery, IoT sensors, and mobility data at planetary scale
- Build with raster, vector, and tabular data using familiar SQL, Python, or Scala interfaces
- Run computer vision ML models on geo-imagery and sensor datasets from local to continental scale
- Migrate existing workflows with zero code changes—WherobotsDB is fully compatible with Apache Sedona and Spark
Customers like Dotlas, Leaf Agriculture, and Overture are achieving step-function improvements in performance, cost efficiency, and innovation by integrating Wherobots with their Databricks platforms.
Wherobots Data Federation connects directly to Unity Catalog, allowing you to read from and write to Iceberg or Delta tables with Databricks service principal and OAuth or PAT token authentication—no data migration required. Get started | Schedule a demo What is Spatial Intelligence? Spatial intelligence is the understanding of features of interest and their relationships across space and time in multi-dimensional environments. It bridges the digital and physical worlds: decision making improves, and you can create better products or services with higher returns. Simple spatial intelligence: Customer visits grouped by location or area. But simplified formats like aggregations inherently lack precision, and you need to make tradeoffs between cost and resolution, all of which limit their usefulness. These aggregations are typically grouped by cells in a grid (like H3) and most commonly used to create visualizations. While interesting to look at, visualizations can be used to support a decision, intuition, or analysis, but they are generally less actionable because precision or other context is missing. Complete spatial intelligence: A forecast, or a composite risk, opportunity, or value score, associated with potentially millions of specific assets or features across a continent, derived from any valuable combination of IoT, location, building, weather, road network, terrain, crop, parcel, aerial imagery, BI, or mobility datasets. By processing perspectives about features of interest from multiple valued aspects, a complete picture forms, which becomes highly actionable and extremely useful intelligence. But even the most popular data platforms still don’t make it easy to create. Wherobots enables complete spatial intelligence at scale with Databricks Unity Catalog integration. Why Do Traditional Data Platforms Struggle with Geospatial Data?
There are many cloud data engines and warehouses that support the simple form described above, including BigQuery, Snowflake, and now Databricks with its Spatial SQL support in preview. However, they still lack feature and data type support, reasonable query price-performance at scale, and the solution expertise you may need to create a complete form of spatial intelligence. Here’s why. Shaped by demand, most data platforms were first designed to handle the structured data exhaust from the web and devices connected to it, not data collected from or about the physical world. Physical world data is inherently complex, unstructured, and doesn’t fit neatly into key-based joins. It takes a specialized compute engine to make it easy to create solutions from spatial data. Easy means it’s capable of fusing and transforming various spatial and non-spatial data types with high accuracy, scale, performance, and low cost, while ensuring development is productive with the teams you have. The spatial extensions and APIs for today’s big data engines and warehouses provide limited support for simple workloads. But because of design bottlenecks, missing features, and limited technical support, spatial solutions on these platforms can be expensive and difficult to build, while ambitious ideas remain out of reach. What Makes a Modern Spatial Intelligence Solution Effective? Ideally, the solution for creating complete spatial intelligence just fits into your existing software development workflows, already supports your future needs and the data you want to utilize, is accessible to the teams you have, and just performs at the right scale — on-demand, at a cost that encourages innovation. It’s lakehouse-ready, so you don’t need to move your data or utilize proprietary formats or data types to use it. You also have dedicated expertise within reach to unblock innovation. With this capability at your fingertips, ideas can flow and innovation takes place.
Your business can reach higher levels of efficiency, reducing costs, carbon footprint, and risk. You can speed up deliveries or pickups, increase the effectiveness of CAPEX, improve consistency, grow revenue, and build in ways that were thought to be impossible. This capability is Wherobots, and it’s directly available to Databricks users via data federation with Databricks Unity Catalog. How Wherobots Solves Geospatial Data Challenges Using Wherobots, you can easily build a complete picture of what’s happened, over space and time, and integrate this intelligence into your Databricks data platform to drive growth – faster and at a lower cost than ever. Our mission is to make spatial data easy to utilize, and it’s all we are focused on. The results of our focus speak for themselves. Databricks Unity Catalog Geospatial Integration: Key Wherobots Features Wherobots makes it easy and economical to produce local to planetary scale data solutions that rely on any combination of aerial and overhead imagery, IoT and mobility data, ground truth datasets, and your own business context. And using Wherobots Data Federation with Unity Catalog, you can easily integrate the data products you build with Wherobots into your Databricks data platform while retaining custody and governance of data. What Types of Spatial Data Does Wherobots Support? (Raster & Vector) There are two main classes of spatial data supported by Wherobots: raster and vector data. You also get the support and scale you’d expect for tabular data operations from Wherobots’ Spark-compatible engine. Raster data is typically a collection of sensor or imagery data, where each pixel in the image represents information about what is being captured, like temperature, elevation, infrared spectrum, etc. File formats include GeoTIFF, Zarr, NetCDF, and more. Raster datasets are commonly GBs to TBs in scale.
Vector data is a collection of multi-dimensional geometries or geographies that represent the trajectory, shape, elevation, and location of things. They can be trips, points, and outlines of features like buildings or parcel and crop boundaries. File formats include GeoParquet (soon to be Parquet), Shapefiles, and GeoJSON. Performance Benchmarks: Up to 20x faster queries, 60% cost savings Customers like Leaf Agriculture, Dotlas, Overture, and others have compared the price-performance of Wherobots for their spatial data workloads vs. other managed Spark offerings and leading data platforms. Subscribed to the Professional Edition, they are self-reporting up to 20x better performance (5x-20x is typical) with on-demand savings reaching as high as 60%, and even higher savings are available with the Enterprise Edition. Data teams are also less limited by scaling bottlenecks. This becomes apparent after workloads finish faster on smaller WherobotsDB runtimes, and after customers realize they have significant headroom to scale well past their existing needs. “Previously, our data volumes and processing requirements were increasing faster than we could keep up with, burdening our team with costly rebuilds. Now with Wherobots, not only can we easily scale to millions of acres, we also can rest assured that our costs won’t spiral out of control.” – G. Bailey Stockdale, CEO, Leaf Agriculture These results are a function of specialization and a company-wide focus; WherobotsDB was built first for processing spatial data. This intentional design obviates the typical performance bottlenecks and complexities present in leading data platforms and warehouses, which were first designed for purposes unrelated to processing spatial data. While the quotes from our customers matter the most, we also know how important open performance benchmarks are.
But currently there are no established spatial query benchmarks (or at least no reputable ones), which makes query performance hard to compare across platforms without trials. It’s also hard to claim progress on performance when standards have not been established. We’re working on this too: soon we will release a new open-source spatial query benchmarking framework for Apache Sedona, along with spatial query performance results for query engines, data warehouses, and data platforms. We already have preliminary results that compare WherobotsDB to Apache Sedona on various managed Spark engines, along with query performance from engines with Spatial SQL APIs. Feel free to reach out, and we can share these preliminary results with you. Security & Compliance: Enterprise-Grade Data Protection Wherobots is serverless and built for data security first. There’s no infrastructure to manage, although customers can also choose to run Wherobots in their AWS VPC for maximum control. Apache Sedona Expertise: Built by the Original Creators Wherobots was founded by the original creators of Apache Sedona, and Sedona is the most widely used geospatial extension for Apache Spark and in Databricks. With decades of research and experience with spatial data, open source, and cloud-scale systems, our product and team are ready to support Databricks customers’ solutions on the lakehouse. We’re also a team leading geospatial modernization efforts in open source. Wherobots has supported GEO types for years with our Havasu table format. But rather than keeping this support in-house, we decided these types would better serve the physical world in the open, so we proactively drove support for them in Apache Iceberg and Parquet. When to Use Databricks Native Spatial SQL vs.
Wherobots
Choose Databricks Native Spatial SQL when you need:
- Exploratory geospatial analysis
- Basic spatial joins (ST_Intersects, ST_Contains, ST_Distance)
- Standard point-in-polygon queries
- Simple location-based aggregations
- No raster or satellite imagery processing
Choose Wherobots for Databricks when you need:
- Production spatial intelligence workloads at scale
- Advanced raster and vector data processing
- Computer vision ML inference on satellite/aerial imagery
- Performance optimization (5-20x faster queries)
- Cost reduction (up to 60% savings on spatial compute)
- Planetary-scale datasets (TB to PB range)
- Complex spatial ETL pipelines
- Apache Sedona compatibility for existing workflows
Get Started with Wherobots x Databricks Geospatial Analytics The lakehouse gives you the ability to choose the product best suited for the job. Don’t settle for the simple form of spatial intelligence or what the default provider offers, when the complete form is within reach with better economics, scale, capability, performance, and support. By integrating Wherobots into your Databricks workflows, organizations can reduce costs, improve operations, and realize new innovations powered by data from the physical world. Ready to enhance your Databricks geospatial capabilities? Get Started Here – Schedule a meeting with experts Read Documentation – Integration guide and API reference Processing spatial data in Databricks? Check out this spatial query benchmark on Databricks with SpatialBench.
Wherobots received its SOC 2 Type 2 attestation Posted on June 18, 2025 (updated September 24, 2025) by Jia Yu We are thrilled to announce Wherobots has received its SOC 2 Type 2 attestation report, reinforcing our commitment to data security and data privacy for our customers. SOC 2 (System and Organization Controls) is a standard trusted by industry leaders and a requirement for many enterprises engaging with software providers. It evaluates an organization’s information security practices, ensuring that controls are in place and effective, and that data is kept secure and confidential. While a Type 1 attestation assesses policies and procedures at a single point in time, the Type 2 attestation we have received demands a rigorous, in-depth evaluation and audit of the effectiveness of the implemented controls over time. To view our SOC 2 Type 2 report, or for more information on our security policies, please visit our Trust Center or contact us at security@wherobots.com. Why this matters There are billions of devices roaming the world, logging petabytes of data from trips, activities, and events. Satellites and drones are scanning the world, capturing what’s happening on Earth, how it’s changing, and how humans are terraforming it. This data can be highly sensitive and highly valued. With Wherobots, businesses can utilize this data from the physical world with distinguished scale, price-performance, and ease, while keeping data secure and in their control. With Wherobots’ cloud-native Lakehouse architecture, you can bring Wherobots’ capabilities for spatial and non-spatial data analytics and AI at planetary scale – right to your data, wherever it lives. Security first Security is not just a feature; it’s part of our engineering culture, infused into how we design and build our software, our internal systems, and our production environments.
Wherobots Cloud was developed from the ground up with best practices and secure-by-design principles that work backwards from data security first. In its architecture, the Wherobots Cloud control plane is isolated from its compute plane, and each workload is isolated, from the cloud hypervisor up, to create a trusted environment for your data. This architecture is serverless by default, but the compute plane can also run in your own cloud VPC (BYOC). Wherobots is the distinguished and obvious solution for spatial computation and AI on the lakehouse. Our SOC 2 Type 2 report is now available on the Wherobots Trust Center to expedite procurement, vendor, and security reviews. Want to learn about more Wherobots security capabilities? Visit our docs on Getting Started with Wherobots to review service principals, audit logs, SAML SSO, and more. Create a free Wherobots account to get started.
Iceberg GEO: Technical Insights and Implementation Strategies Posted on February 14, 2025 (updated February 27, 2025) by Ben Pruden In our previous blog post, we announced Apache Iceberg and Parquet’s support for spatial data types and discussed their significance. These enhancements significantly improve the economics of utilizing geospatial data in end solutions: organizations will be able to build higher-value, lower-cost products faster over time. Today, we take a closer look at these GEO data types in Iceberg (collectively, Iceberg GEO), exploring their design, key features, and implementation considerations. We will also demonstrate how to leverage these new Iceberg features with Apache Sedona and Wherobots in an upcoming blog post. The Story Behind Iceberg Support for GEO Types The foundation for Iceberg GEO can be traced back to early 2022 with the launch of the GeoParquet project, which aimed to standardize spatial data exchange across different vendors in the cloud-native and big data ecosystem. This initiative was a crucial first step toward unifying spatial data formats in modern data platforms. Its history, and how GeoParquet will become just Parquet, is described in this GeoParquet community blog post. As interest in spatial data support for lakehouses grew, early design explorations took shape through projects like GeoLake. Wherobots later expanded on these concepts, developing Havasu in 2023, a production-ready extension of Iceberg designed to support geometry, geography, and raster data in cloud data lakehouses. Havasu’s geometry encoding was inspired by GeoParquet, while its storage format remained fully Parquet-compatible, inheriting Iceberg’s key features such as ACID transactions, schema evolution, time travel, and data versioning.
Recognizing the value and need to bring native spatial support to Iceberg, the cloud-native spatial community ultimately decided that integrating these features directly into Iceberg would be more beneficial than maintaining a separate extension. This direction ensures that spatial data is well supported by Iceberg itself, without modifications. In early 2024, the spatial and Iceberg communities formally initiated discussions on adding GEO data types to Iceberg, using Havasu’s design as a reference. The collaboration involved many community members at Wherobots, CARTO, Planet, Apple, Databricks, Snowflake, and others. Through joint efforts and extensive discussions, the proposal was refined and successfully merged. Iceberg GEO Type Definitions and Storage The Iceberg GEO proposal introduces two spatial data types: geometry and geography. This distinction addresses the varying levels of spatial data processing support across different engines: some primarily work with geometry, while others emphasize geography. The primary distinction between these types is how edges between points are interpolated. All other aspects of the Iceberg GEO proposal, such as encoding and bounds, apply to both unless explicitly stated otherwise. Geometry Type The geometry type represents spatial objects in a planar space using Cartesian geometry, assuming all calculations, including distance and area measurements, are performed on a flat surface. It is best suited for applications requiring high-precision, local-scale spatial operations, such as urban planning, country-level modeling, and traffic engineering. Geometry uses planar edge interpolation, where edges between vertices are treated as straight lines. Geography Type The geography type represents spatial objects on the ellipsoidal surface of the Earth, making it more appropriate for global-scale applications such as satellite tracking, aviation navigation, and long-distance routing.
Unlike geometry, geography accounts for the Earth’s curvature, ensuring that spatial operations like distance and area calculations reflect real-world geography. It requires non-planar edge interpolation algorithms, which define how edges between points behave on a curved surface. Edge Interpolation Algorithms Since the Earth is not flat, different interpolation methods affect the accuracy of spatial operations. The community identified six primary interpolation algorithms: Planar, Spherical, Vincenty, Thomas, Andoyer, and Karney. Taking Planar and Spherical as examples: planar interpolation assumes straight-line edges in a Cartesian plane and is used by the Geometry type, while spherical interpolation models edges as geodesic curves on a sphere and can be used by the Geography type. Details of all interpolation methods can be found in this paper. The Iceberg Geography type requires implementations to explicitly specify which non-planar interpolation algorithm is used, ensuring consistency in spatial computations across different engines. Encoding Both Iceberg GEO types follow the OGC Simple Feature Access v1.2.1 data model, supporting geometric objects such as points, line strings, polygons, and geometry collections. They are encoded in the ISO Well-Known Binary (WKB) format, which supports higher-dimensional geometries (Z and M values) but does not include a Spatial Reference Identifier (SRID). A more detailed comparison of WKB variants can be found in the GEOS library documentation. To ensure consistency across spatial tools, Iceberg GEO enforces a longitude-latitude (X, Y) coordinate order, aligning with conventions used by GeoPandas, Apache Sedona, and Google Maps. Coordinate Reference Systems (CRS) Both geometry and geography types support CRS definitions using either an SRID (e.g., srid:4326) or a PROJJSON string (e.g., projjson: {…}), which provides a self-contained CRS definition.
SRIDs are storage-efficient for well-known coordinate systems, while PROJJSON allows detailed CRS specifications. A minor difference between the Geometry and Geography types is that the former allows any CRS, whereas the latter only allows geographic CRSs, which are the only ones that make sense in the context of non-planar edge interpolation. Lower and Upper Bounds Iceberg GEO extends Iceberg’s lower and upper bounds statistics for spatial data by defining bounds based on the westernmost, easternmost, northernmost, and southernmost extents of spatial objects. These longitude and latitude bounds define the bounding box of a data file, allowing query predicates to be checked against it and helping optimize spatial filtering operations like ST_Intersects. However, certain complexities must be considered. This bounding method is necessary for handling objects that cross the antimeridian (±180° longitude), where the lower longitude bound may be greater than the upper bound. Additionally, for the non-planar edge interpolation used in the Geography type, a shape’s bounding box may not always be defined by its vertices, requiring a more precise bounding approach. For example, the territorial waters of Fiji span both hemispheres, with points at (179°E, 18°S), (179°W, 18°S), (179°W, 16°S), and (179°E, 16°S). A naive min/max longitude calculation would incorrectly assume the bounding box extends from 179°W to 179°E, nearly covering the entire globe. Instead, Iceberg GEO correctly identifies 179°E as the westernmost point and 179°W as the easternmost, ensuring accurate query filtering and optimization. For geography types, longitude bounds must fall within [-180, 180], while latitude bounds must be within [-90, 90]. Operations and Catalogs DDL, DML, and DQL Iceberg GEO is natively supported by Iceberg’s table operations, allowing spatial data to be stored, modified, and queried efficiently.
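The Fiji antimeridian example discussed above can be made concrete with a short sketch. One way to compute antimeridian-aware longitude bounds (purely illustrative, not the Iceberg implementation) is to find the largest circular gap between the longitudes and take the bounds as everything outside that gap:

```python
def lon_bounds(lons):
    """Westernmost/easternmost longitude bounds that handle antimeridian
    crossings: find the largest circular gap between the sorted
    longitudes; the bounding range is everything outside that gap."""
    s = sorted(set(lons))
    if len(s) == 1:
        return s[0], s[0]
    # gaps between consecutive longitudes, plus the wrap-around gap
    gaps = [(s[i + 1] - s[i], i) for i in range(len(s) - 1)]
    gaps.append((s[0] + 360 - s[-1], len(s) - 1))
    width, idx = max(gaps)
    # bounds exclude the largest gap: west is the longitude just after
    # the gap, east is the longitude just before it
    west = s[(idx + 1) % len(s)]
    east = s[idx]
    return west, east

# The Fiji case from the text: the naive min/max span would be 358 degrees,
# but the true extent is the narrow 2-degree band around the antimeridian
print(lon_bounds([179, -179, -179, 179]))  # (179, -179)
```

Note that the resulting "lower" bound (179) is numerically greater than the "upper" bound (-179), exactly the inversion the Iceberg spec allows for antimeridian-crossing data files.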
When defining a table schema, users can specify geometry or geography columns with optional CRS parameters. For data manipulation, Iceberg GEO supports inserting, updating, and deleting spatial data using formats such as WKB and WKT. Queries can leverage spatial functions like ST_Intersects, ST_Contains, and ST_Distance, enabling efficient spatial filtering and analysis. Iceberg’s manifest metadata optimizes query execution by pruning unnecessary data files, significantly improving performance for large datasets. Z-Ordering Iceberg GEO does not define a specific behavior for Z-ordering spatial objects, but engines can implement custom solutions. A common approach is to compute spatial indices such as H3 or S2 and use them for Z-order clustering. This helps preserve spatial locality in storage, improving query performance by reducing unnecessary scans. Compaction Compaction in Iceberg consolidates small files to improve performance. For spatial data, compaction can leverage Z-ordering to group spatially related objects together, enhancing data locality and reducing read overhead. During compaction, Iceberg needs to recalculate the lower and upper bounds for spatial objects to maintain accurate spatial statistics, ensuring that query pruning remains effective. Catalog Support Iceberg GEO is fully compatible with the Iceberg REST catalog, which stores metadata in JSON format and allows seamless integration across multiple compute engines. For catalogs like Apache Polaris, AWS Glue, and Hive Metastore, additional changes may be required to recognize the new geometry and geography types. However, since Iceberg GEO follows Iceberg’s existing metadata structures, the effort required for adaptation is minimal. Migrating from Havasu-Iceberg to Native Iceberg GEO on Wherobots With the GEO types now merged into Apache Iceberg, Wherobots will soon begin assisting customers in migrating all Havasu-Iceberg tables to native Iceberg tables.
This transition will streamline spatial data management while ensuring full compatibility with the Apache Iceberg ecosystem. If you are using Wherobots-managed Havasu tables, we will handle the migration automatically. Your existing code and workflows will remain unaffected, and you will receive a notification once the migration is complete. For those using self-managed Havasu tables on Wherobots, we will provide migration tools to help transition your datasets to native Iceberg tables efficiently. Users relying on raster types in Havasu will have a dedicated migration solution available soon. Stay tuned for further updates. Wherobots is the best compute engine for processing spatial data; it makes using Iceberg very easy and has been tuned to work efficiently with Iceberg tables. Our technologies continue to enhance cost performance and data governance, ensuring the best possible experience for spatial data workloads. If you want to get started with our Iceberg-enabled Spatial Intelligence Cloud and begin taking advantage of all the benefits of Iceberg GEO, sign up for a Wherobots pro account on the AWS Marketplace, which includes $400 in compute credits. We host regular getting-started sessions, and past sessions can be viewed on the Wherobots YouTube channel. As we mentioned upfront, expect to see additional content along with demonstrations on our blog moving forward. Sign up for our newsletter to stay up to date with everything we are doing to enable the spatial community to embrace the modern geospatial lakehouse.
Apache Iceberg and Parquet now support GEO Posted on February 11, 2025 (updated May 29, 2025) by Ben Pruden Geospatial data isn’t special anymore, and that’s a good thing. Geospatial solutions were thought of as “special” because the wave of modernization that shaped today’s data ecosystem largely left geospatial data behind. This changes today. Thanks to the efforts of the Apache Iceberg and Parquet communities, we are excited to share that both Iceberg and Parquet now support geometry and geography (collectively, the GEO) data types. Geospatial challenges Geospatial data has been disconnected from the broader data ecosystem that modernized around open file formats like Apache Parquet and open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. The benefits of these cloud-native open file and table formats fueled widespread adoption of data lake and lakehouse architectures. Organizations moved away from expensive proprietary systems, away from data silos that coupled compute with storage and didn’t scale, and away from formats that locked them in and stifled innovation. Relative to legacy options, these cloud-native formats fundamentally change how data is stored, managed, and accessed. This in turn lowers costs, increases agency, and unlocks innovation over time. But because geospatial data was different, which posed a number of technical challenges, it wasn’t supported by these formats from the start. As a result, developers building solutions with geospatial data struggled with fragmented formats, proprietary file types, and data silos – making solutions harder and costlier to build. The silos will break down With native geospatial data type support in Apache Iceberg and Parquet, you can seamlessly run query and processing engines like Wherobots, DuckDB, Apache Sedona, Apache Spark, Databricks, Snowflake, and BigQuery on your data.
All the while benefitting from faster queries and lower storage costs from Parquet-formatted data. These changes improve short- and long-term economics for geospatial solutions. Organizations will have a new freedom to innovate with a lower-cost, highly interoperable architecture. They get to choose the best tool for the job over time without having to shuttle data between systems. Their costs drop, productivity improves, innovation accelerates, and the playing field is leveled with respect to who can provide the best solution for their data. The legacy silos will break down, just like they’ve done for non-geospatial data. And most importantly, these changes will lead to new innovation about our physical world. Benefits of Iceberg and Parquet These changes make geospatial solutions based on a data lake a lot more attractive. Here are a few benefits:
- Iceberg and Parquet alone don’t separate compute from storage, but together they make it possible to utilize low-cost data lake storage along with multiple independent high-performance compute solutions for different use cases
- ACID transactions and data versioning enable the use of multiple compute engines without conflicts
- Time travel allows tracking of data changes over time
- Query performance is higher thanks to features like column pruning, row-group filtering, and fast file access
- Open data formats minimize vendor lock-in
- Geospatial data will be supported across a broader ecosystem of tools and services
- And many more…

In the coming weeks, we will be covering these features in detail and demonstrating how they’re beneficial for geospatial solutions. Grassroots efforts made this happen These changes were the result of grassroots initiatives, investment, and influence from community members at Planet, CARTO, Wherobots, and many others across the Cloud Native Geospatial community.
This includes GeoParquet, which was a grassroots project and an extension of Parquet that proved its worth through use, popularity, countless meetups, and discussions. We also want to give credit to the Iceberg community for working with members of the Wherobots team to bring a solution forward while also influencing the Parquet community to make a GEO native data type. While the Iceberg and Parquet communities led with support for GEO data types, we welcome compatibility and support for GEO data types in all cloud-native formats, including Apache Hudi and Delta Lake. Thoughts from Szehon Ho, Apache Iceberg PMC Member: “The long-awaited incorporation of geospatial data types in the Iceberg V3 spec extends a core theme of Iceberg as a project to provide a universal ‘shared warehouse storage’ across many engines and users, and will now allow this huge, growing ecosystem to work on the same geospatial data as well, unlocking many exciting use cases. It is also a demonstration of the Iceberg community’s willingness to take the time and ‘do hard things’, engaging in months of very active discussions across companies and OSS communities, finally reaching consensus on a spec that supports the largest variety of use cases in the fast-evolving geospatial data domain.” Thoughts from Chris Holmes, co-creator of GeoParquet: “The community developed and rallied behind GeoParquet to make geospatial data in Parquet fully interoperable and to let the geospatial world tap into all the advantages the big data world has been getting from Parquet. I’m very excited to see Parquet and Iceberg formally support geospatial types, and look forward to the acceleration in geospatial innovation that these changes will activate across industries and for our planet.” Looking ahead Committers are already working to bring support for these changes into Apache Sedona, and will notify the community as they are introduced.
At Wherobots, we’ve supported these GEO data types in Havasu (our Iceberg fork), which we built to enable geospatial lakehouse architectures with Wherobots, along with GeoParquet. We’ve begun developing native support for Iceberg and Parquet into how Wherobots operates on customer data, and will put our full support behind these native formats moving forward. To learn more about the reasoning behind the Iceberg GEO types design, the trade-offs we navigated, and what it all means for implementers, please read our follow-up blog: Iceberg GEO: Technical Insights and Implementation Strategies. If you need support throughout your journey adopting and utilizing these cloud-native formats for geospatial use, reach out to Apache Iceberg on Slack or Apache Sedona on Discord. Watch this livestream from Wednesday, May 7 with leaders from Foursquare, Databricks, Planet, and Wherobots as they discuss the historical challenges of handling spatial data, bridging the gap, and future adoption of these advancements. Sign up for our newsletter to stay up to date with everything we are doing to enable the spatial community to embrace the modern geospatial lakehouse.
Building a Spatial Data Lakehouse Posted on December 16, 2024 (updated December 2, 2025) by Pranav Toggi Introduction In today’s data-centric world, geospatial data is collected at an unprecedented scale. Large organizations in sectors like logistics, telecommunications, urban planning, and environmental monitoring gather vast amounts of data daily—from GPS locations to satellite imagery and IoT sensors. This data is incredibly rich, holding insights into everything from human movement patterns to natural resource management. However, it is also extremely complex and challenging to store, process, and analyze effectively. These challenges stem from the sheer scale, complexity, and unique processing needs of geospatial data. Traditional data lakes and warehouses were designed for tabular data and typically struggle to handle massive raster imagery, multi-layered vector datasets, and the spatial queries essential for geospatial analysis. This gap has left many organizations unable to fully leverage their geospatial data, which remains scattered, under-utilized, and costly to manage. Click here to launch this interactive notebook. Enter Havasu—Wherobots’ spatial data lakehouse solution that bridges this gap. Built on Apache Iceberg, Havasu brings the scalability of a data lake and the structured efficiency of a data warehouse to geospatial data, offering a comprehensive solution designed to meet the specific demands of spatial data management. You can learn more about the detailed specifications of Havasu here. With Havasu, organizations can: Achieve Performance at Scale: Havasu optimizes for high-speed processing of geospatial data, whether you’re performing complex spatial joins, analyzing real-time data streams, or querying historical records.
Store Massive Volumes Efficiently: Havasu’s flexible in-DB and out-DB raster storage options allow organizations to manage enormous datasets, keeping costs low while enabling easy access to high-resolution imagery, vector layers, and more. Streamline Data Management: Built on the Iceberg table format, Havasu supports schema evolution, time travel, and versioning, empowering organizations to manage changing geospatial data with ease. This blog will walk you through the core principles of building a spatial data lakehouse with Havasu, showing how you can create and manage spatial datasets at scale, integrate external data sources, and optimize for performance and cost-efficiency. Whether your focus is on city planning, environmental monitoring, or large-scale logistics, this guide will equip you to unlock the full potential of your geospatial data. Creating a Table Creating tables in Wherobots is a breeze, and the process is highly adaptable. Whether working with vector data (points, lines, polygons) or raster images, you can set up tables in just a few steps. For instance, creating a table for storing vector geometries might look like this:

```python
sedona.sql("""
CREATE TABLE wherobots.test_db.cities (id bigint, name string, geom geometry)
""")
```

This command creates a simple table where the geom column is used to store spatial data in a geometry format, which can later be used in spatial queries. Similarly, you can create raster tables to store geospatial images like satellite data. Creating Tables & CRUD Operations In addition to creating tables, Wherobots fully supports CRUD (Create, Read, Update, Delete) operations. For example, if you need to update spatial data or remove records, it’s just as straightforward as using SQL commands.
You can perform spatially-aware deletes, as shown below:

```python
sedona.sql("""
DELETE FROM wherobots.test_db.cities
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
```

This flexibility allows you to manage your spatial data with the same ease as any other data in the database, while still taking advantage of the powerful geospatial capabilities of Wherobots. Connecting Storage A key feature of Havasu is its ability to integrate with external storage, allowing users to bring their own S3 buckets for managing spatial data. This enables users to scale their storage independently, providing flexibility and control over where and how their spatial data is stored. You can configure a catalog in Wherobots to connect directly to your own S3 bucket. This allows Havasu tables to be stored in your specified S3 location, ensuring that your spatial data lakehouse is highly scalable and aligned with your organization’s storage needs. For detailed instructions on setting up cross-account access and configuring your storage, please refer to the documentation on S3 storage integrations. Sample Queries Once your table is created, running spatial queries is as easy as using standard SQL commands. However, Wherobots enhances this experience with powerful spatial functions. For example, finding all locations within a specific area can be done using the ST_Intersects function:

```python
sedona.sql("""
SELECT * FROM wherobots.test_db.cities
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
```

These spatial queries leverage Wherobots’ built-in geospatial functions, making it easy to work with complex spatial relationships directly in your database. Vector Data In Havasu Vector tables in Havasu are crucial for managing and querying vector data, which includes points, lines, and polygons.
These tables allow you to store geospatial features and leverage Havasu’s advanced spatial indexing and optimizations, such as spatial filter pushdown, to ensure fast data retrieval and processing. Let’s break down the process of working with vector tables in Wherobots. Creating Tables Creating a vector table in Wherobots is a straightforward process. When starting your spatial data lakehouse, the first step is to define how you will store vector data. Each feature (or record) in these tables is stored with a geometry column, which holds the spatial data—whether it’s a point representing a city or a polygon outlining a forest. For instance, to create a table for storing buildings in a city, you would use the following SQL command:

```python
sedona.sql("""
CREATE TABLE wherobots.test_db.buildings (id bigint, name string, geom geometry)
""")
```

This creates a table where the geom column is used for vector data like points, lines, and polygons. Once created, you can use Wherobots’ spatial SQL functions to run geospatial queries, such as finding intersections, distances, proximity relationships, and more. Importing Common Formats WherobotsDB provides extensive support for importing spatial data in various commonly used formats, making it easy to integrate your existing geospatial datasets into Havasu. Whether you’re working with GeoJSON, Shapefiles, GeoParquet, WKT files, or even geospatial databases like PostGIS, Wherobots ensures seamless compatibility. Havasu stores spatial data in Apache Parquet format on cloud storage, a design choice that ensures cost-effective storage and a decoupled architecture for computation. This approach allows organizations to scale both storage and compute independently, optimizing for performance and budget. To learn more about importing specific formats and integrating your data with Havasu, check out the Wherobots Documentation on Importing Data Formats.
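Most of the formats above ultimately reduce each geometry to Well-Known Text or Well-Known Binary. As a stdlib-only illustration of the ISO WKB layout (the helper name is hypothetical; real pipelines would use a library such as Shapely or the engine's own readers), a 2D point serializes as a byte-order flag, a geometry type code, and two float64 coordinates:

```python
import struct

def point_wkb(x, y):
    """Encode a 2D point as ISO WKB, little-endian.

    Layout: 1 byte-order flag (1 = little-endian), uint32 geometry
    type code (1 = Point), then two float64 coordinates in (X, Y)
    order, i.e. longitude before latitude.
    """
    return struct.pack('<BIdd', 1, 1, x, y)

# Longitude first; the first five bytes are the header: '0101000000'
print(point_wkb(-122.33, 47.61).hex()[:10])  # '0101000000'
```

This is the same binary shape a BINARY geometry column holds before it is converted to a native geometry type, which is what makes Parquet-to-lakehouse migrations possible without rewriting the underlying bytes.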
Migrating from Parquet Many organizations already store their geospatial data in Parquet format due to its efficient columnar storage. However, Parquet lacks optimized support for spatial queries. Migrating your existing Parquet datasets to Havasu enables you to take full advantage of native spatial data types for geometry and raster, spatial indexing, optimized querying, and Havasu’s native geospatial capabilities.

Create an external table from your Parquet file:

```python
sedona.sql("""
CREATE TABLE spark_catalog.default.parquet_table (
  id BIGINT,
  name STRING,
  geometry BINARY
) USING parquet
LOCATION 's3://path/to/parquet/files'
""")
```

Convert binary columns to geometry: once the Parquet table is loaded, you can convert the geometry from a binary format into a format that Wherobots can use (EWKB or another supported geometry type). This can be done using an ALTER TABLE command:

```python
sedona.sql("""
ALTER TABLE wherobots.test_db.parquet_table
SET GEOMETRY FIELDS geometry AS 'ewkb'
""")
```

This process ensures that you can utilize your existing Parquet datasets in Wherobots without reprocessing all the data. Click here to learn more about migrating Parquet files to Havasu. Converting Tables to Havasu Havasu extends the Iceberg format with spatial capabilities, and converting existing tables into Havasu tables brings advantages like schema evolution, time travel, and partitioning that optimizes spatial queries. To convert an existing table to a Havasu table, you simply need to perform a snapshot of the table. For example:

```python
sedona.sql("""
CALL wherobots.system.snapshot('spark_catalog.default.parquet_table', 'wherobots.test_db.new_havasu_table')
""")
```

This converts the table into the Havasu table format, which benefits from all of Iceberg’s features. Clustering Geometries Clustering geometries is an optimization technique used to spatially organize data, which improves query performance, particularly for large datasets.
By clustering geometries, spatial queries such as range queries or nearest-neighbor searches can be significantly faster. In Wherobots, you can cluster geometries during table creation or after the table has been populated. Here’s how you would cluster geometries in an existing table:

```python
sedona.sql("""
ALTER TABLE wherobots.test_db.buildings CLUSTER BY geom
""")
```

This ensures that the geometries are spatially co-located, which minimizes the number of partitions scanned during spatial queries. To learn more about spatial filter pushdown and indexing strategies, click here. Raster Data In Havasu Havasu introduces raster data as a primitive type, meaning you can define tables with raster columns to store and manipulate spatial imagery, such as satellite imagery and elevation models. Users can create tables to store either in-DB rasters (where the image data is stored directly in the database) or out-DB rasters (where only metadata and paths are stored, with the images in external storage). With out-DB raster support, Havasu allows you to store only the metadata and file paths within the database, while the actual raster files remain in external storage like AWS S3. This architecture makes it feasible to manage and query vast amounts of high-resolution imagery, supporting everything from city-scale maps to continent-wide satellite data without straining database resources. Creating and Loading Raster Data Creating a Basic Raster Table Here’s how you would create a raster table in Havasu:

```python
sedona.sql("""
CREATE TABLE wherobots.test_db.imagery (
  id bigint,
  description string,
  rast raster
)
""")
```

With this setup, the rast column will hold your raster data, allowing you to apply Havasu’s RS_ functions for raster processing and spatial queries. Loading Raster Data Havasu supports writing raster data using DataFrames, which is especially useful when you have image files (like GeoTIFFs) stored in S3 or local storage.
This method allows you to read binary files and convert them into raster-compatible formats using the RS_FromGeoTiff function.

```python
# Create a DataFrame containing raster data
df_binary = sedona.read.format("binaryFile") \
    .option("pathGlobFilter", "*.tif") \
    .option("recursiveFileLookup", "true") \
    .load("s3://wherobots-examples/data/eurosat_small")

df_geotiff = df_binary.withColumn("rast", expr("RS_FromGeoTiff(content)")) \
    .selectExpr("rast", "path AS data")

# Create a Havasu table using the DataFrame
df_geotiff.writeTo("wherobots.test_db.test_table").create()
```

This approach provides the option to write data as in-DB rasters or to reference them externally as out-DB rasters, depending on your storage and access requirements. Choosing Between In-DB and Out-DB Rasters Choosing between in-DB and out-DB storage is key for optimal performance. Here’s a breakdown:
- In-DB Rasters: Suitable for small to medium-sized rasters. These are fully managed within the database and provide faster access since all data is loaded directly from the database tables.
- Out-DB Rasters: Ideal for large, high-resolution images. With this setup, Havasu stores only metadata and file paths, while the raster images reside in external storage like S3, reducing database storage costs and allowing for lazy loading when data is accessed.

Additional optimizations, such as tiling and caching frequently accessed tiles, allow you to efficiently handle extensive data operations without unnecessary overhead, making Havasu a cost-effective choice for large-scale raster data management. Best Practices for Out-DB Rasters
- Caching: Enable caching of frequently accessed tiles to avoid repeated reads from remote storage, improving overall performance.
- Read-Ahead Configuration: Set spark.wherobots.raster.outdb.readahead to an appropriate value (e.g., 64k or higher) to balance read performance against network overhead.
- Using Cloud-Optimized GeoTIFFs (COG): These files are optimized for remote storage, reducing latency and bandwidth by organizing data in tile-friendly formats.

Click here for more details about the out-DB raster type and a guide on improving its performance. Optimizing Performance with Spatial Filter Pushdown One of the strengths of Havasu for a raster data lakehouse is spatial filter pushdown, which applies spatial filters directly to reduce the amount of data read from storage. When querying for specific geographic regions, spatial pushdown ensures that only relevant raster tiles are accessed, significantly improving performance. For example:

```python
sedona.sql("""
SELECT * FROM wherobots.test_db.imagery
WHERE RS_Intersects(rast, ST_GeomFromText('POLYGON ((...))'))
""")
```

This method is particularly effective when the raster tables are partitioned by location or other relevant attributes, making spatial filter pushdown even more efficient. For large-scale data lakehouse solutions, partitioning and indexing are essential to enhance query performance. Partitioning raster tables by spatial identifiers, such as SRID or city names, groups similar data together, further improving the efficacy of spatial filtering. For example, partitioning by SRID can be beneficial when dealing with geographically diverse datasets in different UTM zones. Partitioning by Region: Use partitionBy during table creation to reduce scanning times, especially for spatially filtered queries.
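As an aside on what an SRID partition key encodes: the EPSG code of the WGS 84 UTM zone containing a point follows directly from its coordinates, since zones are 6° wide and EPSG uses 326xx for the northern hemisphere and 327xx for the southern. The helper below is an illustrative stdlib sketch, not a Wherobots API:

```python
def utm_srid(lon, lat):
    """EPSG code of the WGS 84 UTM zone containing a point.

    Zones are 6 degrees wide, numbered 1..60 eastward from 180°W;
    EPSG 326xx covers the northern hemisphere, 327xx the southern.
    """
    zone = min(int((lon + 180) // 6) + 1, 60)  # lon == 180 wraps into zone 60
    return (32600 if lat >= 0 else 32700) + zone

print(utm_srid(-122.33, 47.61))  # Seattle -> 32610 (UTM zone 10N)
print(utm_srid(151.21, -33.87))  # Sydney -> 32756 (UTM zone 56S)
```

Because rasters in the same UTM zone share one SRID, partitioning on it keeps geographically adjacent imagery in the same partition files.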
```python
# Write the data into a table partitioned by SRID/UTM zone
df_geotiff.withColumn("srid", expr("RS_SRID(rast)"))\
    .sort("srid")\
    .write.format("havasu.iceberg").partitionBy("srid")\
    .saveAsTable("wherobots.test_db.eurosat_ms_srid")
```

- Indexing: For tables where partitions aren't as useful, consider indexing raster data by geographic location using Havasu's spatial indexing capabilities (e.g., Hilbert indexing).

```python
sedona.sql("CREATE SPATIAL INDEX FOR wherobots.db.test_table USING hilbert(rast, 10)")
```

Figure: Hilbert curves with different numbers of iterations.

Advanced Raster Operations: Map Algebra and Pixel-Level Operations

Havasu supports a suite of raster RS_ functions that enable complex analyses, including map algebra and zonal statistics. Let's look at when each method is appropriate and its impact on performance.

Map Algebra and Pixel-Level Operations

This functionality is crucial for applications like remote sensing, where you might need to compute remote-sensing indices over large geographic regions across multiple raster bands. For instance, you can use RS_MapAlgebra to calculate the Normalized Difference Vegetation Index (NDVI) on a multi-band raster:

```python
sedona.sql("""
SELECT RS_MapAlgebra(rast, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') AS ndvi
FROM raster_table
""")
```

This flexibility allows analysts to perform custom calculations directly within Havasu, simplifying raster analysis workflows.

RS_ZonalStats for Efficient Calculations within Areas of Interest

For use cases where only specific areas within a raster need processing, RS_ZonalStats offers an efficient alternative. This function calculates statistics (e.g., mean, sum) for pixels within a designated geometry (e.g., a polygonal area of interest), loading only the required subset of pixels into memory. Unlike pixel-level operations that materialize the entire raster, RS_ZonalStats selectively materializes pixels only within the specified geometry.
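The selective-materialization idea behind zonal statistics can be sketched in plain Python: only the pixels covered by the zone mask are read and aggregated, never the full raster. This is a toy illustration of the concept, not Havasu's implementation.

```python
# A tiny raster and a boolean mask marking the zone of interest.
raster = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
zone = [
    [False, True,  True],
    [False, True,  True],
    [False, False, False],
]

# Materialize only the in-zone pixels, then aggregate them.
pixels = [raster[r][c]
          for r in range(len(raster))
          for c in range(len(raster[0]))
          if zone[r][c]]
zonal_mean = sum(pixels) / len(pixels)  # mean over the 4 in-zone pixels
```

At fleet scale the same principle is what keeps RS_ZonalStats cheap: the cost is proportional to the zone, not to the raster.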
This efficiency makes it ideal for spatial analyses that focus on specific regions, without the overhead of processing the entire raster:

```python
sedona.sql("""
SELECT RS_ZonalStats(outdb_raster, geometry, 'mean')
FROM raster_table
""")
```

Because RS_ZonalStats is targeted to the geometry, tiling may not be necessary for these operations. This lets you skip tiling when computing zonal statistics, keeping the process streamlined and avoiding unnecessary data-handling steps. By choosing the appropriate approach for each operation type, you can maximize efficiency when working with large out-DB raster datasets in Havasu, balancing performance against resource management.

Handling Pixel-Level Operations on Out-DB Rasters in Havasu

When performing pixel-level operations on raster data in Havasu, it's essential to understand the implications of using out-DB rasters. Any operation that accesses individual pixels directly, such as a map algebra calculation, materializes those pixels in memory, which can be resource-intensive on a large out-DB raster dataset. This section covers best practices for managing these operations efficiently.

Preprocessing with RS_TileExplode for Pixel-Level Operations

For out-DB rasters, any pixel-level transformation loads pixel data from remote storage into memory, creating a significant memory load. To make these operations more manageable:

- Tile the Raster Data: RS_TileExplode splits the raster into smaller, more manageable tiles, allowing you to work on localized sections rather than loading the entire raster at once.

```python
df_tiled = sedona.sql("""
SELECT RS_TileExplode(raster_data, 1024, 1024, true) AS tile
FROM outdb_rasters
""")
```

Figure: before and after RS_TileExplode().
Image taken from NV5 Geospatial Software.

With the raster data pre-tiled, pixel-level functions like RS_MapAlgebra can be applied to individual tiles, preventing full materialization of large rasters in memory and optimizing both processing time and memory usage. This approach is particularly beneficial for complex calculations that require pixel-by-pixel transformations, since tiling minimizes memory consumption and enables parallelized processing of smaller raster sections.

STAC Rasters in Havasu

Incorporating STAC (SpatioTemporal Asset Catalog) datasets into Havasu involves accessing the STAC dataset as a Sedona DataFrame, surfacing STAC assets (raster bands) as out-DB or in-DB rasters, and finally saving the result in a Havasu table. This guide provides the essentials to help you set up STAC-based raster data in Havasu, with specific attention to handling the dataset efficiently in Wherobots.

Accessing STAC Data and Creating a DataFrame

STAC datasets such as Sentinel-2 imagery are typically available in JSON format, which includes metadata about each raster, such as acquisition time, spatial extent, and the URLs of the imagery files. The first step is to read these JSON files into a DataFrame:

- Fetch and Store JSON Metadata: STAC items, usually hosted on platforms like Earth Search, can be stored as JSON files in an S3 bucket or any accessible location.
- Load into a Sedona DataFrame: Use Sedona's GeoJSON reader to load the STAC metadata into a DataFrame. This DataFrame will contain essential attributes like the URLs of individual raster bands (e.g., blue, red, NIR).

```python
df_stac = sedona.read.format("geojson").load("s3://your-bucket/path-to-stac-metadata")
```

This DataFrame now acts as a container for your STAC items, letting you manipulate, transform, and perform spatial joins on the metadata before loading any actual raster data.
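To make the metadata structure concrete, here is a minimal pure-Python sketch of a single STAC item and a hypothetical helper that collects the per-band asset URLs, i.e. the same assets.<band>.href fields the DataFrame exposes. Both the item content and the band_hrefs helper are illustrative, not part of the Wherobots API.

```python
import json

# A pared-down STAC item: real items carry many more properties and assets.
stac_item = json.loads("""{
  "type": "Feature",
  "properties": {"datetime": "2024-06-01T10:00:00Z", "eo:cloud_cover": 12.5},
  "assets": {
    "blue": {"href": "s3://bucket/B02.tif"},
    "red":  {"href": "s3://bucket/B04.tif"},
    "nir":  {"href": "s3://bucket/B08.tif"}
  }
}""")

def band_hrefs(item, bands):
    """Collect the asset href for each requested band, skipping missing ones."""
    return {b: item["assets"][b]["href"] for b in bands if b in item["assets"]}
```

In the DataFrame workflow these hrefs are exactly what gets handed to RS_FromPath in the next step, one column per band.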
Saving STAC Rasters as Out-DB Columns

Havasu's out-DB raster capability allows you to reference large raster files externally, which is ideal for high-resolution datasets like Sentinel-2 imagery. To set this up:

- Transform URLs: Ensure each STAC asset URL points to a valid path accessible by Wherobots (e.g., converting https URLs to s3:// form if the data is stored on S3).
- Create Out-DB Raster Columns: Use RS_FromPath to reference each asset as an out-DB raster. Here's an example with the blue band:

```python
df_rasters = df_stac.withColumn("raster_blue", expr("RS_FromPath(assets.blue.href)"))
```

Repeat this process for each required band (e.g., red, NIR) to build a DataFrame that references all the necessary raster paths. This setup keeps storage efficient by referencing images externally, reducing data load times and storage requirements in Havasu.

Saving the STAC Raster Data to Havasu

Once you've set up the DataFrame with out-DB raster columns, the next step is to save it as a Havasu table:

- Write to Havasu: Use writeTo() to save the DataFrame as a new Havasu table, enabling further spatial queries and transformations within Wherobots.

```python
df_rasters.writeTo("wherobots.test_db.stac_raster_table").create()
```

With this approach, your STAC imagery is now part of a Havasu table, ready for efficient querying and spatial operations. WherobotsDB enables advanced spatial functions like RS_Intersects and RS_MapAlgebra, making it straightforward to analyze the imagery stored as out-DB rasters.

Considerations

- URL Compatibility: Ensure paths are accessible by Wherobots (e.g., transforming https to s3:// where needed) to avoid data access errors.
- Tile Size and Storage Optimization: For performance, adjust tile sizes and consider filtering edge-case tiles to optimize spatial queries.
- Tuning for Performance: For extensive STAC datasets, consider caching and configuring Spark to handle out-DB rasters efficiently by setting appropriate cache sizes and connection thresholds.
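The https-to-s3 rewrite mentioned above can be done with a small helper. This sketch assumes virtual-hosted-style S3 URLs (bucket.s3.<region>.amazonaws.com/key); https_to_s3 is a hypothetical utility, not a Wherobots function.

```python
import re

def https_to_s3(url: str) -> str:
    """Rewrite a virtual-hosted-style S3 HTTPS URL into an s3:// URI.

    Hypothetical helper: handles bucket.s3.amazonaws.com and
    bucket.s3.<region>.amazonaws.com forms; raises on anything else.
    """
    m = re.match(
        r"https://([^.]+)\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com/(.+)$", url)
    if not m:
        raise ValueError(f"not a recognized S3 HTTPS URL: {url}")
    bucket, key = m.groups()
    return f"s3://{bucket}/{key}"
```

In a Sedona pipeline the same rewrite could be applied to the href column with a built-in string function such as regexp_replace before calling RS_FromPath, so the table never stores an inaccessible URL.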
This streamlined approach provides an effective way to incorporate STAC datasets into Havasu, enabling data engineers to leverage high-resolution satellite imagery in their spatial data lakehouse while optimizing for performance and storage.

Monitoring and Maintenance

Monitor your data lakehouse's performance using WherobotsDB's configuration options, and consider periodically converting standard GeoTIFF files to Cloud-Optimized GeoTIFF (COG) to reduce storage latency. Use gdalinfo to inspect file structures, tile sizes, and compression formats, ensuring that your files are optimized for remote access.

Conclusion

Geospatial data presents a wealth of opportunities for organizations across industries, yet its inherent complexity often makes its full potential difficult to unlock. With Havasu, Wherobots has developed a solution tailored specifically to the demands of geospatial data management, combining the scalability of a data lake with the structured efficiency of a data warehouse. From its seamless integration with cloud storage and support for common geospatial formats to advanced capabilities like spatial indexing, filter pushdown, and raster processing, Havasu empowers organizations to efficiently store, query, and analyze spatial data at scale. By leveraging its open architecture and powerful features, teams can transform vast geospatial datasets into actionable insights, driving innovation in logistics, urban planning, environmental monitoring, and beyond.

When paired with WherobotsDB, Havasu transforms geospatial analytics by providing high-speed querying for both raster and vector data, advanced spatial indexing, and seamless compatibility with cloud-native storage solutions like AWS S3. This differentiates Havasu from legacy solutions that struggle with scalability or require complex application-layer workarounds for spatial data.
As geospatial data continues to grow in both scale and importance, adopting a spatial data lakehouse architecture with Havasu ensures that your organization remains at the forefront of data-driven decision-making. Whether you're just beginning to integrate spatial data into your workflows or optimizing an existing system, Havasu equips you with the tools to manage, process, and analyze geospatial data with unmatched efficiency and flexibility. Learn more about how Havasu can help you redefine your geospatial data strategy in the official documentation.