Raster Processing at Scale: The Out-of-Database Architecture Behind WherobotsDB

Introduction

Raster data (satellite imagery, elevation models, sensor grids) is critical to understanding the physical world and increasingly to powering AI. The challenge most data teams face is processing it at scale.

Processing raster data at scale requires an architecture that avoids loading entire files into memory. WherobotsDB solves this with an out-of-database approach that fetches pixel data on demand, enabling terabyte-scale processing without the memory overhead of traditional raster engines.

WherobotsDB extends open-source Apache Sedona with capabilities and performance optimizations purpose-built for preparing physical-world data for AI at scale, while maintaining full API compatibility. Existing Sedona workloads run without code changes.

This post covers how WherobotsDB handles the full raster lifecycle: scalable processing architecture, raster math, coordinate reference systems, vector-raster hybrid workflows, and planetary-scale inference.

“With Wherobots, we were able to merge 15+ complex vector datasets in minutes and run high-resolution ML inference on raster imagery at a fraction of the cost of our legacy stack. The combination of speed, scalability, and ease of integration has boosted our engineering productivity and will accelerate how quickly we can deliver new geospatial data products to market.”

— Rashmit Singh, CTO, SatSure

What Is Out-of-Database Raster Architecture?

Wherobots raster capabilities are built on an out-of-database raster architecture, which makes it far easier to process raster imagery in an embarrassingly parallel fashion. Instead of loading entire raster files into memory, WherobotsDB stores only metadata and fetches pixel data on demand. Teams can therefore process terabyte-scale imagery collections (statewide mosaics, multi-year satellite archives, continental elevation models) with the same interface they use for vector data. Operations like zonal statistics, clipping, masking, filtering, and raster algebra scale to datasets that would overwhelm in-memory approaches.
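To make the idea concrete, here is a minimal sketch of a raster reference that carries only metadata and reads pixels when asked. It is an illustration under stated assumptions, not the WherobotsDB implementation: `RasterRef`, `read_window`, `FAKE_STORAGE`, and the bucket path are hypothetical names, and an in-memory dictionary stands in for remote storage.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for remote storage: path -> 2D grid of pixel values.
FAKE_STORAGE = {
    "s3://example-bucket/scene.tif": [
        [r * 100 + c for c in range(100)] for r in range(100)
    ],
}


@dataclass(frozen=True)
class RasterRef:
    """Lightweight raster reference: metadata only, no pixel data."""
    path: str
    width: int
    height: int

    def read_window(self, row_off, col_off, rows, cols):
        """Fetch only the requested window of pixels, on demand."""
        grid = FAKE_STORAGE[self.path]
        return [row[col_off:col_off + cols] for row in grid[row_off:row_off + rows]]


ref = RasterRef("s3://example-bucket/scene.tif", width=100, height=100)

# Moving the reference between workers serializes a few bytes of metadata,
# not the 10,000 pixels it points to.
ref_bytes = len(json.dumps(asdict(ref)))
pixel_bytes = len(json.dumps(FAKE_STORAGE[ref.path]))

# Pixels are materialized only for the window an operation needs.
window = ref.read_window(row_off=10, col_off=20, rows=2, cols=3)
```

The point of the sketch is the size gap between `ref_bytes` and `pixel_bytes`: anything that moves references around (such as a shuffle) stays cheap, and the cost of reading pixels is paid only where a window is requested.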

Capability | Apache Sedona | WherobotsDB | Notes
Out-DB Raster Support | | | Creates lightweight raster references; pixel data loaded only when needed
Intelligent Caching Layer | | | Minimizes repeated remote reads for frequently accessed rasters
Optimized Shuffle Operations | | | Data movement handles only metadata, which is orders of magnitude faster than moving full rasters
On-Demand Materialization | | | Selectively converts external rasters to in-database format when needed
Automatic Metadata Optimization | | | Pre-loads metadata and intelligently repartitions for optimal parallelism
Cloud-Optimized GeoTIFF Support | | | Native COG support with tile-based partial reads from cloud storage

How Out-DB Architecture Transforms Raster Operations

The out-of-database architecture fundamentally changes how raster operations execute:

Operation | Apache Sedona (In-DB Only) | WherobotsDB
Data Movement (Shuffle) | Serializes all pixel data across executors | Serializes only metadata (~KB vs GB)
Tiling Operations | Copies pixel data into each tile | Memory-efficient tiling without data duplication
Clipping | Full pixel-by-pixel processing | Optimized processing paths for common operations
Zonal Statistics | Processes entire raster regardless of region size | Materializes only zonal pixels, optimizing I/O based on region of interest
Raster Loading | Loads entire raster into memory at read time | Lazy loading: metadata on demand, pixels only when accessed
Resource Management | Standard memory lifecycle | Intelligent caching layer with disk caching for remote files

Key benefits:

  • On-Demand Data Access: Instead of loading entire raster files into memory, WherobotsDB fetches pixel data only when an operation requires it, reducing memory overhead and enabling processing at terabyte scale.
  • Memory-Efficient Tiling: Tiles share references to the underlying file with different spatial bounds — enabling massive parallelism without memory overhead.
  • Smart I/O Reduction: Operations optimize I/O based on the region of interest, and spatial filter push-down skips irrelevant raster files entirely.
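The memory-efficient tiling benefit can be sketched as follows. This is a simplified illustration, not WherobotsDB's internal representation: the `Tile` class, `make_tiles` function, and file path are hypothetical. Each tile records the shared file path plus its own pixel bounds, so creating tiles copies no pixel data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    path: str      # shared reference to the underlying file
    row_off: int   # top edge of the tile, in pixels
    col_off: int   # left edge of the tile, in pixels
    rows: int
    cols: int


def make_tiles(path, height, width, tile_size):
    """Split a raster into tiles that differ only in their spatial bounds."""
    tiles = []
    for r in range(0, height, tile_size):
        for c in range(0, width, tile_size):
            tiles.append(
                Tile(path, r, c, min(tile_size, height - r), min(tile_size, width - c))
            )
    return tiles


tiles = make_tiles("s3://example-bucket/mosaic.tif", height=1000, width=1500, tile_size=512)
```

Because every tile is a handful of integers and a path, the tiles can be distributed across workers cheaply, and each worker reads only the window its tile describes.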

Cloud-Optimized GeoTIFFs (COGs) are GeoTIFF files structured so that only the specific byte ranges needed for a given operation are fetched from remote storage, rather than downloading the entire file. Combined with on-demand loading, this architecture minimizes both memory footprint and network I/O.
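A rough sketch of how a partial read maps to byte ranges: a tiled GeoTIFF records a byte offset and a byte count for each internal tile (the TIFF `TileOffsets` and `TileByteCounts` tags), so a reader can work out which tiles overlap a pixel window and request only those ranges. The function names and the offset values below are made up for illustration; real offsets come from the file header.

```python
def tiles_for_window(row_off, col_off, rows, cols, tile_size):
    """Return (tile_row, tile_col) indices of internal tiles that overlap the window."""
    first_r, last_r = row_off // tile_size, (row_off + rows - 1) // tile_size
    first_c, last_c = col_off // tile_size, (col_off + cols - 1) // tile_size
    return [(r, c) for r in range(first_r, last_r + 1) for c in range(first_c, last_c + 1)]


def range_headers(tile_ids, tiles_across, offsets, byte_counts):
    """Build one HTTP Range header value per needed tile."""
    headers = []
    for r, c in tile_ids:
        i = r * tiles_across + c            # tiles are indexed row by row
        start = offsets[i]
        end = start + byte_counts[i] - 1    # HTTP byte ranges are inclusive
        headers.append(f"bytes={start}-{end}")
    return headers


# Hypothetical 1024x1024 image with 256x256 tiles: a 4x4 grid of 16 tiles,
# each assumed to occupy 1000 bytes starting at byte 4096.
offsets = [4096 + i * 1000 for i in range(16)]
byte_counts = [1000] * 16

needed = tiles_for_window(row_off=200, col_off=300, rows=100, cols=100, tile_size=256)
headers = range_headers(needed, tiles_across=4, offsets=offsets, byte_counts=byte_counts)
```

In this example a 100x100 pixel window touches 2 of the 16 tiles, so only two ranged requests are needed rather than a full download.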

What Raster Capabilities Does WherobotsDB Include?

Building on this foundation, WherobotsDB includes enhanced raster functions that enable satellite imagery, elevation models, and sensor data analysis directly in SQL alongside traditional vector operations.

Capability | Apache Sedona | WherobotsDB | Notes
Raster to Vector Conversion | | | Convert raster regions to vector polygons for hybrid vector-raster analysis workflows
Multi-Band Tile Processing | | | Align, stack, and tile rasters from different sources, CRS, and resolutions for distributed multi-source analysis
Zonal Statistics | | | Both support the full statistics suite; WherobotsDB’s Out-DB architecture materializes only zonal pixels, enabling scalability across millions of zones
Custom Raster Algebra | | | Flexible map algebra expressions with near-native execution performance
Spatial Filter Push-down for Rasters | | | Uses bounding boxes to skip irrelevant raster files, dramatically reducing I/O for selective queries

These capabilities fall into two categories: 

  1. Transforming raster data for hybrid workflows – Raster to Vector Conversion, Multi-Band Tile Processing.
  2. Analyzing raster data in place – Zonal Statistics, Custom Raster Algebra, Spatial Filter Push-down.

Raster to Vector Conversion converts contiguous raster regions with the same pixel value into vector polygons. It is essential for workflows that analyze raster-derived features (flood extents, land cover classifications, building footprints from a digital surface model) using vector spatial operations like overlay, buffering, or spatial joins.
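The core of this conversion can be sketched in two steps: group contiguous same-value pixels into regions, then collect the unit edges that separate each region from its surroundings. The code below is a simplified illustration, not the WherobotsDB algorithm. It assumes 4-connectivity, works in pixel coordinates, and stops at the set of outline edges rather than assembling them into ordered polygon rings; `label_regions` and `boundary_edges` are hypothetical names.

```python
from collections import deque


def label_regions(grid):
    """Group 4-connected pixels that share a value; return a list of (value, cells)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            value, cells, queue = grid[r0][c0], set(), deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:  # breadth-first flood fill
                r, c = queue.popleft()
                cells.add((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and grid[nr][nc] == value):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append((value, cells))
    return regions


def boundary_edges(cells):
    """Unit edges (pairs of corner vertices) that trace the outline of a region."""
    edges = set()
    for r, c in cells:
        if (r - 1, c) not in cells:
            edges.add(((r, c), (r, c + 1)))          # top edge
        if (r + 1, c) not in cells:
            edges.add(((r + 1, c), (r + 1, c + 1)))  # bottom edge
        if (r, c - 1) not in cells:
            edges.add(((r, c), (r + 1, c)))          # left edge
        if (r, c + 1) not in cells:
            edges.add(((r, c + 1), (r + 1, c + 1)))  # right edge
    return edges


grid = [
    [1, 1, 0],
    [1, 0, 0],
    [2, 2, 0],
]
regions = label_regions(grid)
outline_of_twos = boundary_edges(regions[2][1])
```

A production implementation would also chain the edges into rings, handle holes, and map pixel coordinates to the raster's coordinate reference system through its geotransform.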

Multi-Band Tile Processing solves one of the most common friction points in raster analysis: combining data from different sources. Satellite imagery from different sensors, time periods, or providers typically arrives in different coordinate reference systems, resolutions, and data types. WherobotsDB aligns and stacks these into a unified multi-band raster, then tiles it into spatial chunks for distributed processing — all in a single operation. This enables workflows like change detection across multi-temporal composites, fusing Sentinel-2 optical bands with elevation data, or building analysis-ready multi-spectral stacks, without manual reprojection or resampling steps.
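As a rough sketch of the align-and-stack step, the example below resamples a coarse band onto the grid of a finer band using nearest-neighbor lookup, then stacks the two. It makes strong simplifying assumptions that the real operation does not: both bands already share a CRS and cover exactly the same extent, so reprojection is omitted. The function name and sample values are hypothetical.

```python
def resample_nearest(band, out_rows, out_cols):
    """Resample a 2D band to a new grid size using nearest-neighbor lookup."""
    in_rows, in_cols = len(band), len(band[0])
    return [
        [band[r * in_rows // out_rows][c * in_cols // out_cols] for c in range(out_cols)]
        for r in range(out_rows)
    ]


# A 4x4 "optical" band and a coarser 2x2 "elevation" band over the same extent.
optical = [[r * 4 + c for c in range(4)] for r in range(4)]
elevation = [[10, 20], [30, 40]]

# Align the coarse band to the fine grid, then stack band-major: [band][row][col].
elevation_aligned = resample_nearest(elevation, out_rows=4, out_cols=4)
stacked = [optical, elevation_aligned]
```

Once the bands share one grid, the stacked result can be cut into spatial tiles and processed in parallel like any single raster.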

Zonal Statistics computes aggregate statistics (count, sum, mean, median, mode, stddev, variance, min, max) for raster pixels falling within vector zones. Both Apache Sedona and WherobotsDB support zonal statistics — the differentiator is scale. WherobotsDB’s out-of-database architecture materializes only the pixels within each zone rather than processing the entire raster, making it practical to run zonal statistics across millions of zones on terabyte-scale imagery.
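The difference between processing the whole raster and materializing only zonal pixels can be sketched as below. This is an illustration under simplifying assumptions, not the WherobotsDB implementation: the zone is a pixel-space rectangle rather than an arbitrary polygon, `GRID` stands in for a remote raster, and `read_window`, `zonal_stats`, and the `io_counter` bookkeeping are hypothetical.

```python
import statistics

# Stand-in for a remote 6x6 raster; io_counter tracks how many pixels were read.
GRID = [[r * 6 + c for c in range(6)] for r in range(6)]
io_counter = {"pixels_read": 0}


def read_window(row_off, col_off, rows, cols):
    """Read only the requested window and record how many pixels it touched."""
    window = [row[col_off:col_off + cols] for row in GRID[row_off:row_off + rows]]
    io_counter["pixels_read"] += sum(len(row) for row in window)
    return window


def zonal_stats(zone):
    """Compute summary statistics for a rectangular zone (row_off, col_off, rows, cols)."""
    pixels = [v for row in read_window(*zone) for v in row]
    return {
        "count": len(pixels),
        "sum": sum(pixels),
        "mean": statistics.fmean(pixels),
        "median": statistics.median(pixels),
        "min": min(pixels),
        "max": max(pixels),
    }


stats = zonal_stats((1, 1, 2, 2))
```

Here the zone covers 4 of the raster's 36 pixels, and `io_counter` confirms that only those 4 were read. With millions of small zones over large imagery, that ratio is what keeps the workload tractable.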

Custom Raster Algebra executes user-defined raster algebra expressions with near-native execution performance. It supports complex multi-band calculations, conditional logic, and neighborhood operations — enabling workflows like computing NDVI (Normalized Difference Vegetation Index, a measure of vegetation density derived from red and near-infrared bands) from satellite imagery, or applying threshold-based classification across large imagery collections.
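NDVI is a per-pixel expression, (NIR - Red) / (NIR + Red), which makes it a convenient example of raster algebra. The sketch below uses plain Python lists to show the calculation and a threshold-based classification; it is not WherobotsDB's expression syntax. Returning 0.0 where both bands are zero and the 0.3 vegetation threshold are assumptions made for this example.

```python
def ndvi(red, nir):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red), with 0.0 where the sum is zero."""
    out = []
    for red_row, nir_row in zip(red, nir):
        out.append([
            (n - r) / (n + r) if (n + r) != 0 else 0.0
            for r, n in zip(red_row, nir_row)
        ])
    return out


def classify(index, threshold):
    """Threshold-based classification: True where the index exceeds the threshold."""
    return [[value > threshold for value in row] for row in index]


red = [[50, 100], [0, 200]]
nir = [[150, 100], [0, 50]]

index = ndvi(red, nir)
vegetation = classify(index, threshold=0.3)
```

In this toy input only the top-left pixel, where near-infrared reflectance is well above red, is classified as vegetation.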

Spatial Filter Push-down for Rasters uses bounding boxes to skip irrelevant raster files, dramatically reducing I/O for selective queries. When a catalog contains thousands of scenes but only a few intersect your area of interest, irrelevant files are eliminated before any processing begins.
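The filtering step amounts to a bounding-box intersection test applied to catalog metadata before any pixels are read. Here is a minimal sketch; the catalog entries, file names, and coordinates are invented for illustration.

```python
def intersects(a, b):
    """Test two axis-aligned boxes given as (min_x, min_y, max_x, max_y)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical catalog: each scene is described only by its path and bounding box.
catalog = [
    ("scene_a.tif", (-123.0, 37.0, -122.0, 38.0)),
    ("scene_b.tif", (-100.0, 40.0, -99.0, 41.0)),
    ("scene_c.tif", (-122.5, 37.5, -121.5, 38.5)),
]
area_of_interest = (-122.4, 37.7, -122.3, 37.8)

# Only scenes whose boxes overlap the area of interest are opened at all.
selected = [path for path, bbox in catalog if intersects(bbox, area_of_interest)]
```

Since the test uses metadata alone, a scene that does not intersect the area of interest costs no pixel I/O.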

Because all of these functions are built on the out-of-database architecture, they inherit the same scalability characteristics described above — lazy loading, selective pixel materialization, and intelligent caching, without additional configuration.

What Is RasterFlow and How Does It Work?

Recently, Wherobots added RasterFlow, an entirely new inference and perception engine for planetary-scale image processing that extends the raster lifecycle beyond analysis into AI. RasterFlow enables teams to run computer vision models against large-scale raster datasets. It covers preparing imagery, mosaicking, removing edge effects across tiles, executing distributed model inference, and converting predictions into vector geometries, all within Wherobots Cloud.

RasterFlow’s outputs are stored as vectorized results in Apache Iceberg tables (an open table format for large-scale analytic datasets), or as predictions within Zarr (a cloud-native format for chunked, compressed multi-dimensional arrays) or COGs. All of these can be analyzed using the full suite of spatial operations in WherobotsDB. This creates end-to-end raster workflows, from raw imagery through model inference to spatial analytics, without moving data between systems or building custom infrastructure.

RasterFlow supports both popular open-source geospatial AI models and custom PyTorch models, and can generate embeddings from geospatial foundation models. It is currently available to select customers in private preview. If you’re interested in RasterFlow, join our upcoming session to see it in action.

What Comes Next: Query Performance and Spatial Analytics

Raster processing is a first-class capability in WherobotsDB, and it is one part of a broader set of spatial data processing advances we’ve built beyond open-source Sedona. Vector and raster workloads both benefit from the same query performance optimizations under the hood: spatial relationship acceleration, automatic join optimization, dynamic data redistribution, and a vectorized GeoParquet reader. Queries that require careful tuning with self-managed Sedona run optimally out of the box with WherobotsDB.

In the next post in this series, we’ll go deep on query performance and spatial analytics: how WherobotsDB accelerates spatial joins, range queries, and analytical functions across both vector and raster data types.

Frequently Asked Questions

What is out-of-database raster architecture?

Out-of-database raster architecture stores only metadata in the database while fetching pixel data on demand from remote storage. Instead of loading entire raster files into memory, only the pixels required for a specific operation are materialized. This enables teams to process terabyte-scale imagery collections — including statewide mosaics, multi-year satellite archives, and continental elevation models — without the memory overhead of traditional in-memory approaches.

How does WherobotsDB differ from open-source Apache Sedona for raster processing?

WherobotsDB extends open-source Apache Sedona with additional capabilities and performance optimizations purpose-built for large-scale physical-world data processing. While both support raster operations like zonal statistics, WherobotsDB’s out-of-database architecture materializes only the pixels within each zone rather than processing the entire raster — making it practical to run zonal statistics across millions of zones on terabyte-scale imagery. Existing Apache Sedona workloads run on WherobotsDB without code changes.

What raster operations does WherobotsDB support?

WherobotsDB supports raster-to-vector conversion, multi-band tile processing, zonal statistics, custom raster algebra, and spatial filter push-down. These capabilities cover both transforming raster data for hybrid workflows and analyzing raster data in place. All functions are built on the out-of-database architecture, inheriting lazy loading, selective pixel materialization, and intelligent caching without additional configuration.

What is RasterFlow?

RasterFlow is an inference and perception engine for large-scale image processing built into Wherobots Cloud. It enables teams to run computer vision models against large-scale raster datasets, handling imagery preparation, mosaicking, edge effect removal, distributed model inference, and conversion of predictions into vector geometries. RasterFlow supports open-source geospatial AI models and custom PyTorch models, and can generate embeddings from geospatial foundation models. It is currently available to select customers in private preview.

Do existing Apache Sedona workloads run on WherobotsDB without code changes?

Yes. WherobotsDB maintains full API compatibility with open-source Apache Sedona. Existing Sedona workloads run without code changes.