Fields of the World (FTW) is a Taylor Geospatial effort to produce globally consistent agricultural field-boundary data for land-use monitoring, food-system analysis, and model development. For the 2024-2025 global release, Taylor Geospatial partnered with Wherobots to run their latest FTW model, PRUE, on Wherobots RasterFlow; see the PRUE paper.
In this notebook, we walk through the production pipeline behind that release. The PRUE model turns seasonal Sentinel-2 composites into global field and field_boundaries predictions, and RasterFlow handles the larger systems problem: running data access, compositing, inference, and vectorization as one coordinated workflow and publishing the results as analysis-ready raster and vector products.
field
field_boundaries
The focus here is practical architecture rather than model internals. We cover how the pipeline is structured, what it produced at global scale, and the implementation choices that made the run reproducible.
What we cover:
- build_mosaic: converting raw Sentinel-2 scenes into analysis-ready seasonal feature mosaics on a shared grid
- predict_mosaic: running global PRUE inference over the feature mosaic and writing a prediction Zarr store
- vectorize_mosaic: converting the prediction rasters into field-boundary GeoParquet outputs
RasterFlow provides a reproducible way to run a global-scale GeoAI pipeline.
It is a geospatial workflow system backed by Flyte V2 and Ray. We chose Flyte because long-running global jobs need strong workflow semantics: versioned executions, retries, lineage, and reproducible reruns. We chose Ray Datasets because raster inference and vectorization are naturally blockwise workloads, and Ray gives us elastic parallelism and task streaming to handle the many-small-tasks problem.
The rasterflow_remote client exposes common GeoAI workflow steps while running in each organization’s isolated namespace. As a result, resource controls, failure containment, and reruns stay scoped to that organization.
We construct the client and load the model config for the PRUE model:
from rasterflow_remote.model_registry import get_model_registry_config, ModelRegistryEnum
from rasterflow_remote import RasterflowClient

client = RasterflowClient()

# see https://huggingface.co/wherobots/prue-pt2 for model details
cfg = get_model_registry_config(ModelRegistryEnum.HUGGINGFACE, "wherobots/prue-pt2")
build_mosaic converts raw satellite scenes into analysis-ready feature mosaics on a shared grid.
For the FTW production run, this stage computes planting and harvest composites, writes feature COGs, and then materializes a global feature mosaic that can be stacked with model outputs. We persist that global feature product in Zarr because chunked, cloud-native storage lets us read small spatial windows directly from object storage during validation, inference, and downstream analytics.
from rasterflow_remote import DatasetEnum
from datetime import datetime

mosaic = client.build_mosaic(
    datasets=[DatasetEnum.S2_MED_PLANTING, DatasetEnum.S2_MED_HARVEST],
    aoi="https://github.com/fieldsoftheworld/ftw-inference-app/blob/main/src/data/s2-grid.json",
    start=datetime(2023, 1, 1),
    end=datetime(2025, 1, 1),
    crs_epsg=4326,
)
For implementation details, see Appendix: Feature COG and Appendix: GDAL notes.
The produced feature mosaic is published as a Zarr store on Source Cooperative, so we can inspect its structure, bands, and a spatial subset directly.
import xarray as xr
import rasterix

mosaic_ds = xr.open_zarr(
    "https://data.source.coop/ftw/global-data/features/zarr/alpha/global.zarr"
).pipe(rasterix.assign_index)
mosaic_ds
<xarray.Dataset> Size: 502TB
Dimensions:      (time: 2, band: 10, y: 1566049, x: 4007517)
Coordinates:
  * time         (time) datetime64[ns] 16B 2024-01-01 2025-01-01
  * band         (band) object 80B 's2med_harvest:B02' ... 's2med_planting:N_...
  * y            (y) float64 13MB 83.75 83.75 83.75 ... -56.93 -56.93 -56.93
  * x            (x) float64 32MB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
    spatial_ref  int64 8B ...
Data variables:
    variables    (time, band, y, x) float32 502TB dask.array<chunksize=(1, 1, 4096, 4096), meta=np.ndarray>
Indexes:
  ┌ x  RasterIndex (crs=None)
  └ y
Expanding the coordinates shows the ten feature bands ('s2med_harvest:B02', 's2med_harvest:B03', 's2med_harvest:B04', 's2med_harvest:B08', 's2med_harvest:N_VALID_PIXELS', plus the corresponding 's2med_planting' bands) and a RasterIndex whose affine transform encodes the grid: a pixel size of 8.983e-05 degrees with origin at x = -180, y = 83.75.
Persisting an intermediate global Zarr mosaic lets us query any region without reprojecting or resampling.
Training data for the PRUE model is also aligned to EPSG:4326, so feature and prediction stores remain grid-compatible for side-by-side analysis. For data layout details, see Appendix: Zarr internals.
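To make that grid compatibility concrete, here is a minimal sketch that reads the same window from both published stores and combines them directly. The prediction store is opened here even though it is only introduced later in the post, the window happens to match the Japan AOI used below, and the 0.5 mask threshold is an arbitrary choice for this example.

import xarray as xr
import rasterix

feat = xr.open_zarr(
    "https://data.source.coop/ftw/global-data/features/zarr/alpha/global.zarr"
).pipe(rasterix.assign_index)
pred = xr.open_zarr(
    "https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr"
).pipe(rasterix.assign_index)

window = {"x": slice(140.24, 140.34), "y": slice(37.36, 37.26)}

# Same EPSG:4326 grid, so the two selections share identical x/y coordinates.
nir = feat["variables"].sel(time="2025-01-01", band="s2med_harvest:B08", **window)
field_prob = pred["variables"].sel(time="2025-01-01", band="field", **window)

# Example side-by-side use: mask the harvest NIR composite to predicted field pixels.
masked_nir = nir.where(field_prob > 0.5)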
For the examples below, we use the same Japan AOI shown later in the PMTiles preview so the raster and vector artifacts are easy to compare. For more information on xarray, see Appendix: Xarray notes.
import matplotlib.pyplot as plt
import numpy as np

bbox = {
    "x": slice(140.24, 140.34),
    "y": slice(37.36, 37.26),
}
subset = mosaic_ds.sel(x=bbox["x"], y=bbox["y"])
geo_aspect = 1 / np.cos(np.deg2rad(float(subset.y.mean())))

fig, axes = plt.subplots(1, 2, figsize=(9, 9), constrained_layout=True)
for ax, band_prefix, title in [
    (axes[0], "s2med_planting", "Planting RGB (PMTiles AOI, 2025)"),
    (axes[1], "s2med_harvest", "Harvest RGB (PMTiles AOI, 2025)"),
]:
    subset["variables"].sel(
        time="2025-01-01",
        band=[f"{band_prefix}:B04", f"{band_prefix}:B03", f"{band_prefix}:B02"],
    ).plot.imshow(ax=ax, robust=True)
    ax.set_aspect(geo_aspect)
    ax.set_title(title)
plt.show()
See additional information in Appendix: Zarr internals and Appendix: Xarray notes.
predict_mosaic runs global model inference over the feature mosaic and writes one prediction Zarr store.
You provide the model config, and the workflow executes inference across the global grid. Keeping the result in Zarr preserves the same chunked access pattern as the feature store, so validation and downstream vectorization can stay blockwise and cloud-native instead of stitching regional outputs back together later.
predictions = client.predict_mosaic(
    store="https://data.source.coop/ftw/global-data/features/zarr/alpha/global.zarr",
    model_path=cfg.model_path,
    patch_size=cfg.patch_size,
    clip_size=cfg.clip_size,
    device=cfg.device,
    features=cfg.features,
    labels=cfg.labels,
    actor=cfg.actor,
    max_batch_size=cfg.max_batch_size,
    merge_mode=cfg.merge_mode,
)
For internals and scheduling details, see Appendix: config, Appendix: Ray usage, and Appendix: distributed inference.
The prediction Zarr is also published on Source Cooperative. Let’s verify shape, bands, and value ranges.
The shape, bounds, and time remain aligned with the feature mosaic; however, the bands now represent predicted classes.
pred_ds = xr.open_zarr(
    "https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr"
).pipe(rasterix.assign_index)
pred_ds
<xarray.Dataset> Size: 151TB
Dimensions:      (time: 2, band: 3, y: 1566049, x: 4007517)
Coordinates:
  * time         (time) datetime64[ns] 16B 2024-01-01 2025-01-01
  * band         (band) <U20 240B 'non_field_background' ... 'field_boundaries'
  * y            (y) float64 13MB 83.75 83.75 83.75 ... -56.93 -56.93 -56.93
  * x            (x) float64 32MB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
    spatial_ref  int64 8B ...
Data variables:
    variables    (time, band, y, x) float32 151TB dask.array<chunksize=(1, 1, 2048, 2048), meta=np.ndarray>
Indexes:
  ┌ x  RasterIndex (crs=None)
  └ y
The band coordinate now holds the three predicted classes: 'non_field_background', 'field', and 'field_boundaries'.
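For the value-range check, a minimal sketch is to reduce over a small window rather than the full global array; the window below is the same Japan AOI used throughout this post and is an arbitrary choice for the example.

# Value-range sanity check over a small window (a global min/max would touch every chunk).
window = pred_ds["variables"].sel(x=slice(140.24, 140.34), y=slice(37.36, 37.26))
print(float(window.min().compute()), float(window.max().compute()))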
Let’s plot the three labels for the same PMTiles-area AOI so the raster predictions line up with the vector preview later in the post:
import matplotlib.pyplot as plt
import numpy as np

bbox = {
    "x": slice(140.24, 140.34),
    "y": slice(37.36, 37.26),
}
pred_subset = pred_ds.sel(x=bbox["x"], y=bbox["y"])
geo_aspect = 1 / np.cos(np.deg2rad(float(pred_subset.y.mean())))

grid = pred_subset["variables"].plot.imshow(col="band", row="time", figsize=(12, 8))
for ax in grid.axes.flat:
    ax.set_aspect(geo_aspect)
plt.show()
Zarr metadata for those curious is available in Appendix: Zarr internals.
vectorize_mosaic converts prediction rasters into field-boundary GeoParquet outputs.
You specify feature bands, threshold, and method. Then the workflow writes distributed vector outputs for analytics and downstream publishing. We write GeoParquet because it keeps the result columnar, splittable, and easy to consume from batch analytics systems.
Conceptually, this output is a distributed geospatial table rather than an array: each row is a polygonized field candidate, the geometry column holds the boundary, and the remaining columns carry the scalar attributes you want to inspect in GeoPandas or query later in SedonaDB and WherobotsDB.
For distributed implementation details, see Appendix: vectorization.
from rasterflow_remote.data_models import VectorizeMethodEnum, SemSegRasterioConfig

vectors = client.vectorize_mosaic(
    store="https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr",
    features=["field"],
    threshold=0.5,
    vectorize_method=VectorizeMethodEnum.SEMANTIC_SEGMENTATION_RASTERIO,
    vectorize_config=SemSegRasterioConfig(stats=False, medial_skeletonize=False),
)
A lightweight way to inspect that schema locally is:
import geopandas as gpd

vector_preview = gpd.read_parquet(vectors.uri)
vector_preview.head(3)
That gives you a normal GeoDataFrame view of the vector output: one geometry column plus the per-row attributes emitted by the vectorization stage.
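From there, ordinary GeoPandas operations apply. As a small illustrative example (not part of the production pipeline), and assuming the GeoParquet carries CRS metadata, we can reproject the preview to an equal-area CRS and summarize field sizes; the hectare conversion below is added only for this sketch.

# Illustrative only: per-field area statistics on the preview partition.
preview_ea = vector_preview.to_crs(epsg=6933)   # equal-area projection, units of metres
area_ha = preview_ea.geometry.area / 10_000     # m^2 -> hectares
area_ha.describe()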
Implementation details for distributed/blockwise vectorization are in Appendix: vectorization.
PMTiles packages vector outputs for fast, interactive map delivery. We use it here because a single archive is easy to publish, cache, and stream into browser maps without standing up a tile service.
We also support generating PMTiles from large GeoParquet outputs directly in WherobotsDB using our scalable PMTiles generator workflow, as described here: How to Generate PMTiles for Overture Maps with WherobotsDB VTiles.
This section summarizes the scale of each pipeline step and the size of the generated outputs. Understanding the scale of this production run will help when we discuss design choices later in this blog post.
- Feature COGs: s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/features/cogs/
- Feature Zarr mosaic: s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/features/zarr/alpha/global.zarr
- Prediction Zarr mosaic: s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/predictions/zarr/alpha/global.zarr
- Vector (GeoParquet) outputs: s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/predictions/vectors/alpha/results/

Aggregate data output (above artifacts): ~348.7 TB across 540,794 objects.
You can reproduce this end-to-end global-scale GeoAI pipeline with three API calls:
from datetime import datetime

from rasterflow_remote import RasterflowClient, DatasetEnum
from rasterflow_remote.model_registry import get_model_registry_config, ModelRegistryEnum
from rasterflow_remote.data_models import VectorizeMethodEnum, SemSegRasterioConfig

client = RasterflowClient()
cfg = get_model_registry_config(ModelRegistryEnum.HUGGINGFACE, "wherobots/prue-pt2")

mosaic = client.build_mosaic(
    datasets=[DatasetEnum.S2_MED_PLANTING, DatasetEnum.S2_MED_HARVEST],
    aoi="https://github.com/fieldsoftheworld/ftw-inference-app/blob/main/src/data/s2-grid.json",
    start=datetime(2023, 1, 1),
    end=datetime(2025, 1, 1),
    crs_epsg=4326,
)

predictions = client.predict_mosaic(
    store=mosaic.uri,
    model_path=cfg.model_path,
    patch_size=cfg.patch_size,
    clip_size=cfg.clip_size,
    device=cfg.device,
    features=cfg.features,
    labels=cfg.labels,
    actor=cfg.actor,
    max_batch_size=cfg.max_batch_size,
    merge_mode=cfg.merge_mode,
)

vectors = client.vectorize_mosaic(
    store=predictions.uri,
    features=["field"],
    threshold=0.5,
    vectorize_method=VectorizeMethodEnum.SEMANTIC_SEGMENTATION_RASTERIO,
    vectorize_config=SemSegRasterioConfig(stats=False, medial_skeletonize=False),
)
In short, this global-scale GeoAI pipeline with RasterFlow starts from global feature mosaics, runs one global inference workflow, and publishes both analytics-ready and map-ready outputs.
If you want a hands-on starting point, sign up for the RasterFlow Private Preview today and try the FTW notebook in the Solution Gallery.
If your team is evaluating global-scale raster and vector production workflows, this is exactly what the RasterFlow Private Preview is for.
For API details and additional examples, see the Rasterflow documentation.
Internally, build_mosaic orchestrates four key steps: selecting scenes and computing the planting and harvest composites, writing the composited features as COGs, building the global COG index and GTI-backed virtual mosaic, and materializing aligned arrays on the target grid into the global Zarr store.
The current DOY heuristic and greedy objective are intentionally conservative and production-stable, but there is room for improvement (for example, region-specific phenology priors, dynamic cloud climatology, and cost-aware objective tuning).
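The post does not spell out the compositing objective, but to make the idea concrete, here is a purely illustrative greedy scene selection under assumed inputs (per-scene valid-pixel masks and acquisition day-of-year). It is not the FTW production heuristic; the function name, penalty shape, and defaults are all assumptions for the sketch.

import numpy as np

def greedy_select(valid_masks, doys, target_doy, max_scenes=5, doy_penalty=0.005):
    """Pick scenes that add the most new valid pixels, lightly penalized by distance from a target DOY."""
    covered = np.zeros_like(valid_masks[0], dtype=bool)
    remaining = list(range(len(valid_masks)))
    selected = []
    for _ in range(max_scenes):
        if not remaining:
            break
        # score = newly covered pixels, discounted by how far the scene is from the target DOY
        gains = {
            i: np.count_nonzero(valid_masks[i] & ~covered)
               * max(0.0, 1.0 - doy_penalty * abs(doys[i] - target_doy))
            for i in remaining
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        selected.append(best)
        remaining.remove(best)
        covered |= valid_masks[best]
    return selected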
We also produce a COG index, which is used with GDAL’s GTI driver to build the underlying global mosaic.
import geopandas as gpd
import fsspec

with fsspec.open("https://data.source.coop/ftw/global-data/features/cogs/alpha/index.parquet") as f:
    index = gpd.read_parquet(f)
index.head(3)
After feature COGs are written, we build a global virtual mosaic index and then materialize aligned arrays on the target grid. This relies on GDAL's GTI driver as the index layer and for mosaicking.
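As a rough illustration of how such an index is consumed (an assumption about setup, not the production code), a GDAL build with the GTI driver (GDAL >= 3.8) and Parquet support can open a tile index directly as a mosaic; the local index path and its standard "location" field are assumed here.

import rasterio

# Illustrative: the "GTI:" prefix forces the GTI driver; index.parquet is an assumed local copy of the COG index.
with rasterio.open("GTI:index.parquet") as mosaic_src:
    print(mosaic_src.crs, mosaic_src.width, mosaic_src.height)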
Grid configuration for this run:
- CRS: EPSG:4326
- Pixel size: 8.983119e-5 degrees (roughly 10 m at the equator)
- Global array shape per band and season: 1,566,049 x 4,007,517 pixels
The GTI-backed approach keeps COG indexing separate from downstream chunk/block execution and maps cleanly to distributed scheduling.
Zarr is the storage layer that makes the global feature and prediction mosaics usable as products rather than just intermediate files.
Two layout choices matter here: the logical chunk shape, which is the unit of reads and retries, and the shard shape, which is the unit of object storage.
We split them on purpose. Small logical chunks keep regional reads and retry boundaries precise, while larger shards prevent the store from exploding into billions of tiny objects. Larger shards also lower overall S3 PUT and LIST costs and reduce request-rate pressure.
For the feature store, the logical chunk is (1, 1, 4096, 4096) and shards are written at (1, 1, 12288, 12288), so each object packs a 3 x 3 tile of logical chunks. For the prediction store, the logical chunk is (1, 1, 2048, 2048) and shards are written at (1, 3, 8192, 8192), so each object packs all three class bands across a 4 x 4 spatial tile of logical chunks.
The tradeoff is deliberate: larger shards improve object-store efficiency and reduce listing overhead, but smaller logical chunks still give us better partial reads, safer retries, and more control over Ray execution geometry.
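To make the chunk/shard split concrete, here is a minimal zarr-python v3 sketch, not the production writer, that declares a logical chunk shape and a larger shard shape together; the store path and the small overall shape are illustrative assumptions.

import zarr

# Each shard on object storage packs a 3 x 3 tile of logical (1, 1, 4096, 4096) chunks,
# mirroring the feature-store layout described above.
arr = zarr.create_array(
    store="example_features.zarr",
    shape=(1, 1, 24576, 24576),
    chunks=(1, 1, 4096, 4096),      # logical chunk: unit of reads and retries
    shards=(1, 1, 12288, 12288),    # shard: unit of object storage
    dtype="float32",
)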
The feature Zarr mosaic metadata is as follows:
import zarr

group = zarr.open("https://data.source.coop/ftw/global-data/features/zarr/alpha/global.zarr")
group["variables"].metadata.__dict__
{'shape': (2, 10, 1566049, 4007517),
 'data_type': Float32(endianness='little'),
 'chunk_grid': RegularChunkGrid(chunk_shape=(1, 1, 12288, 12288)),
 'chunk_key_encoding': DefaultChunkKeyEncoding(separator='/'),
 'codecs': (ShardingCodec(chunk_shape=(1, 1, 4096, 4096),
                          codecs=(BytesCodec(endian=<Endian.little: 'little'>),
                                  ZstdCodec(level=0, checksum=False)),
                          index_codecs=(BytesCodec(endian=<Endian.little: 'little'>),
                                        Crc32cCodec()),
                          index_location=<ShardingCodecIndexLocation.end: 'end'>),),
 'dimension_names': ('time', 'band', 'y', 'x'),
 'fill_value': np.float32(nan),
 'attributes': {'coordinates': 'spatial_ref', '_FillValue': 'AAAAAAAA+H8='},
 'storage_transformers': (),
 'extra_fields': {}}
The prediction Zarr mosaic metadata is as follows:
group = zarr.open("https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr")
group["variables"].metadata.__dict__
{'shape': (2, 3, 1566049, 4007517),
 'data_type': Float32(endianness='little'),
 'chunk_grid': RegularChunkGrid(chunk_shape=(1, 3, 8192, 8192)),
 'chunk_key_encoding': DefaultChunkKeyEncoding(separator='/'),
 'codecs': (ShardingCodec(chunk_shape=(1, 1, 2048, 2048),
                          codecs=(BytesCodec(endian=<Endian.little: 'little'>),
                                  ZstdCodec(level=0, checksum=False)),
                          index_codecs=(BytesCodec(endian=<Endian.little: 'little'>),
                                        Crc32cCodec()),
                          index_location=<ShardingCodecIndexLocation.end: 'end'>),),
 'dimension_names': ('time', 'band', 'y', 'x'),
 'fill_value': np.float32(nan),
 'attributes': {'coordinates': 'spatial_ref', '_FillValue': 'AAAAAAAA+H8='},
 'storage_transformers': (),
 'extra_fields': {}}
We use rasterix to attach analytical coordinates lazily so opening the global feature store does not require eagerly reading large coordinate vectors every time.
Why this matters: the grid has 1,566,049 y values and 4,007,517 x values. Materializing both coordinate vectors as float64 on every open would cost (1,566,049 + 4,007,517) * 8 bytes, roughly 45 MB, before reading any data; rasterix instead represents the grid as a lazy affine-transform index.
Operationally, our xarray pattern is to open the store with xr.open_zarr, attach the grid index with rasterix.assign_index, and pull small windows with .sel before computing anything.
This keeps exploratory analysis responsive while preserving direct comparability between feature and prediction stores on the same grid.
The InferenceConfig is the contract between model packaging and distributed execution. It tells RasterFlow how to read features, how large each inference window should be, what runtime to launch, and how overlapping predictions should be merged back into the output mosaic.
In practice, the fields fall into a few groups:
- Model packaging: model_path points at the PT2 archive to load, and actor selects the inference actor that wraps it (here semantic-segmentation PyTorch).
- Windowing: patch_size sets the input window handed to the model, and clip_size sets how much of each patch border is clipped to reduce edge effects.
- Band mapping: features lists the input Zarr band labels in the order the model expects, and labels names the predicted classes that become the output band labels.
- Runtime: device selects where inference runs (e.g. 'cuda' or 'cpu'), and max_batch_size is sized for the default A10 GPU with 24 GB of memory.
- Merging: merge_mode controls how overlapping predictions are blended back into the output mosaic.
from rasterflow_remote.data_models import MergeModeEnum, InferenceActorEnum

{
    # path to the pt2 archive on Hugging Face
    'model_path': 'https://huggingface.co/wherobots/prue-pt2/resolve/main/prue-efnetb7.pt2',
    # an enum used to specify how to handle the model for inference
    'actor': InferenceActorEnum.SEMANTIC_SEGMENTATION_PYTORCH,
    # patch size determines the size of the inputs provided to the model
    'patch_size': 256,
    # clip size determines the size we clip off the borders to reduce edge effects during inference
    'clip_size': 32,
    # device to run inference on, e.g. 'cuda' or 'cpu'
    'device': 'cuda',
    # list of features in order as required by the model. These names map to the input Zarr band labels.
    'features': ['s2med_harvest:B04', 's2med_harvest:B03', 's2med_harvest:B02', 's2med_harvest:B08',
                 's2med_planting:B04', 's2med_planting:B03', 's2med_planting:B02', 's2med_planting:B08'],
    # the named labels for the predicted classes. These will be the output band labels in the output Zarr.
    'labels': ['non_field_background', 'field', 'field_boundaries'],
    # the maximum batch size assuming our default A10 GPU with 24 GB of memory
    'max_batch_size': 128,
    # the method to use when merging overlapping predictions to reduce edge effects at the patch level
    'merge_mode': MergeModeEnum.WEIGHTED_AVERAGE,
}
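The exact WEIGHTED_AVERAGE merge implementation is not shown in this post. As a rough sketch of the idea, overlapping patch predictions can be accumulated into a weighted sum plus a weight canvas, with weights ramping down toward patch borders; all names and the ramp shape below are illustrative assumptions, not the RasterFlow code.

import numpy as np

def border_ramp_weights(patch_size: int, clip_size: int) -> np.ndarray:
    # 1D ramp rises over `clip_size` pixels at each border and is flat (1.0) in the interior.
    ramp = np.minimum(np.arange(1, patch_size + 1), np.arange(patch_size, 0, -1))
    w = np.minimum(ramp, clip_size).astype(np.float32) / clip_size
    return np.outer(w, w)  # 2D weight surface for one patch

def accumulate_patch(canvas_sum, canvas_weight, patch, y0, x0, weights):
    # Accumulate a (classes, patch, patch) prediction into global sum/weight canvases;
    # the merged output is canvas_sum / canvas_weight wherever canvas_weight > 0.
    h, w = patch.shape[-2:]
    canvas_sum[..., y0:y0 + h, x0:x0 + w] += patch * weights
    canvas_weight[y0:y0 + h, x0:x0 + w] += weights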
We chose Ray as the data-plane engine because the workload is embarrassingly parallel at block level, but still needs data-aware scheduling and actor pools for GPU inference.
Ray is used as a bounded block-processing engine over sharded Zarr storage.
Key mapping model: Zarr chunk and shard metadata is planned into Ray Data blocks, and those blocks, not the storage shards, are the unit of scheduling; GPU inference is then served over the blocks by actor pools.
This decoupling lets us tune compute batch shape without rewriting storage layout. In the data pipeline, Ray Data uses Arrow-backed transport so tabular metadata/state can move across stages with minimal copying.
Practical takeaway: throughput scales with actor parallelism while per-task memory remains tied to block geometry, not global mosaic size.
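As a minimal sketch of that takeaway (not the RasterFlow internals; InferenceActor and blocks are hypothetical placeholders), the Ray Data pattern is roughly a callable-class map_batches over planned blocks, where the actor-pool size sets throughput and the batch size stays independent of storage layout.

import ray

ds = ray.data.from_items(blocks)     # `blocks` planned from Zarr chunk/shard metadata (see the conceptual sketch below)
preds = ds.map_batches(
    InferenceActor,                  # hypothetical callable class wrapping the PT2 model
    batch_size=cfg.max_batch_size,   # compute batch shape, decoupled from chunk/shard geometry
    concurrency=8,                   # actor-pool size: the throughput knob
    num_gpus=1,                      # one GPU per actor
)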
The InferenceConfig above feeds directly into block planning, actor setup, and overlap-aware merging. Internally, predict_mosaic orchestrates three phases:
# conceptual: chunk/shard metadata -> Ray blocks
chunk_meta = read_zarr_chunk_metadata(store)
blocks = plan_ray_blocks(chunk_meta, target_rows_per_block)
ray_ds = ray.data.from_items(blocks)

result = (
    ray_ds
    .map(read_zarr_block_with_overlap)
    .map(run_inference_actor_pool)
    .map(clip_and_write_block)
)
vectorize_mosaic operates at block level so geometry generation scales linearly with partition count rather than requiring one global polygonization pass.
High-level flow: read prediction blocks from the Zarr store, threshold the selected feature band (here field at 0.5), polygonize each block with the rasterio-based semantic-segmentation method, and write the resulting geometries as GeoParquet partitions.
This keeps memory bounded and allows retries/reprocessing for individual partitions.
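For intuition, a single block's polygonization step might look like the following sketch; the synthetic block and transform are stand-ins, and the production path uses the rasterio-based semantic-segmentation method configured earlier rather than this exact code.

import numpy as np
import rasterio.features
from rasterio.transform import from_origin
from shapely.geometry import shape

# Synthetic stand-in for one prediction block: 'field' probabilities on a tiny grid.
block = np.random.rand(256, 256).astype("float32")
transform = from_origin(140.24, 37.36, 8.983119e-5, 8.983119e-5)

mask = (block > 0.5).astype(np.uint8)   # threshold, as in the production call
polygons = [
    shape(geom)
    for geom, value in rasterio.features.shapes(mask, mask=mask.astype(bool), transform=transform)
    if value == 1
]
# Each polygon becomes one row in that block's GeoParquet partition.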