
How We Delivered “Fields of The World” with RasterFlow: A Planetary-Scale GeoAI Pipeline


Fields of The World predictions created with RasterFlow

Fields of the World (FTW) is a Taylor Geospatial effort to produce globally consistent agricultural field-boundary data for land-use monitoring, food-system analysis, and model development. For the 2024-2025 global release, Taylor Geospatial partnered with Wherobots to run their latest FTW model, PRUE, on Wherobots RasterFlow; see the PRUE paper.

In this post, we walk through the production pipeline behind that release. The PRUE model turns seasonal Sentinel-2 composites into global field and field_boundaries predictions, and RasterFlow handles the larger systems problem: running data access, compositing, inference, and vectorization as one coordinated workflow and publishing the results as analysis-ready raster and vector products.

The focus here is practical architecture rather than model internals. We cover how the pipeline is structured, what it produced at global scale, and the implementation choices that made the run reproducible.

What we cover:

  1. RasterFlow and why it is designed for reproducible, scalable global runs.
  2. The RasterFlow workflow stages and their resulting outputs.
  3. A Scale Profile with artifact sizes and object counts for this dataset.
  4. Implementation notes in Appendix — Technical Notes.

RasterFlow Execution Model for a Global-Scale GeoAI Pipeline

RasterFlow provides a reproducible way to run a global-scale GeoAI pipeline.

It is a geospatial workflow system backed by Flyte V2 and Ray. We chose Flyte because long-running global jobs need strong workflow semantics: versioned executions, retries, lineage, and reproducible reruns. We chose Ray Data because raster inference and vectorization are naturally blockwise workloads, and Ray gives us elastic parallelism and streaming task execution to handle the many-small-tasks problem.

The rasterflow_remote client exposes common GeoAI workflow steps while running in each organization’s isolated namespace. As a result, resource controls, failure containment, and reruns stay scoped to that organization.

We construct the client and load the model config for the PRUE model:

from rasterflow_remote.model_registry import get_model_registry_config, ModelRegistryEnum
from rasterflow_remote import RasterflowClient

client = RasterflowClient()

# see https://huggingface.co/wherobots/prue-pt2 for model details
cfg = get_model_registry_config(ModelRegistryEnum.HUGGINGFACE, "wherobots/prue-pt2")

Build Mosaic

build_mosaic converts raw satellite scenes into analysis-ready feature mosaics on a shared grid.

For the FTW production run, this stage computes planting and harvest composites, writes feature COGs, and then materializes a global feature mosaic that can be stacked with model outputs. We persist that global feature product in Zarr because chunked, cloud-native storage lets us read small spatial windows directly from object storage during validation, inference, and downstream analytics.

from rasterflow_remote import DatasetEnum
from datetime import datetime

mosaic = client.build_mosaic(
    datasets=[DatasetEnum.S2_MED_PLANTING, DatasetEnum.S2_MED_HARVEST],
    aoi="https://github.com/fieldsoftheworld/ftw-inference-app/blob/main/src/data/s2-grid.json",
    start=datetime(2023, 1, 1),
    end=datetime(2025, 1, 1),
    crs_epsg=4326,
)

For implementation details, see Appendix: Feature COG and Appendix: GDAL notes.

Inspect the published feature mosaic

The feature mosaic produced by this stage is published as a Zarr store on Source Cooperative, so we can inspect its structure, bands, and a spatial subset directly.

import xarray as xr
import rasterix

mosaic_ds = xr.open_zarr(
    "https://data.source.coop/ftw/global-data/features/zarr/alpha/global.zarr"
).pipe(rasterix.assign_index)
mosaic_ds
<xarray.Dataset> Size: 502TB
Dimensions:      (time: 2, band: 10, y: 1566049, x: 4007517)
Coordinates:
  * time         (time) datetime64[ns] 16B 2024-01-01 2025-01-01
  * band         (band) object 80B 's2med_harvest:B02' ... 's2med_planting:N_...
  * y            (y) float64 13MB 83.75 83.75 83.75 ... -56.93 -56.93 -56.93
  * x            (x) float64 32MB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
    spatial_ref  int64 8B ...
Data variables:
    variables    (time, band, y, x) float32 502TB dask.array<chunksize=(1, 1, 4096, 4096), meta=np.ndarray>
Indexes:
  ┌ x        RasterIndex (crs=None)
  └ y

Persisting an intermediate global Zarr mosaic lets us query any region without reprojecting or resampling.

Training data for the PRUE model is also aligned to EPSG:4326, so feature and prediction stores remain grid-compatible for side-by-side analysis. For data layout details, see Appendix: Zarr internals.

For the examples below, we use the same Japan AOI shown later in the PMTiles preview so the raster and vector artifacts are easy to compare. For more information on xarray, see Appendix: Xarray notes.

import matplotlib.pyplot as plt
import numpy as np

bbox = {
    "x": slice(140.24, 140.34),
    "y": slice(37.36, 37.26),
}
subset = mosaic_ds.sel(x=bbox["x"], y=bbox["y"])
geo_aspect = 1 / np.cos(np.deg2rad(float(subset.y.mean())))

fig, axes = plt.subplots(1, 2, figsize=(9, 9), constrained_layout=True)
for ax, band_prefix, title in [
    (axes[0], "s2med_planting", "Planting RGB (PMTiles AOI, 2025)"),
    (axes[1], "s2med_harvest", "Harvest RGB (PMTiles AOI, 2025)"),
]:
    subset["variables"].sel(
        time="2025-01-01",
        band=[f"{band_prefix}:B04", f"{band_prefix}:B03", f"{band_prefix}:B02"],
    ).plot.imshow(ax=ax, robust=True)
    ax.set_aspect(geo_aspect)
    ax.set_title(title)

plt.show()

See additional information in Appendix: Zarr internals and Appendix: Xarray notes.

Predict Mosaic

predict_mosaic runs global model inference over the feature mosaic and writes one prediction Zarr store.

You provide the model config, and the workflow executes inference across the global grid. Keeping the result in Zarr preserves the same chunked access pattern as the feature store, so validation and downstream vectorization can stay blockwise and cloud-native instead of stitching regional outputs back together later.

predictions = client.predict_mosaic(
    store="https://data.source.coop/ftw/global-data/features/zarr/alpha/global.zarr",
    model_path=cfg.model_path,
    patch_size=cfg.patch_size,
    clip_size=cfg.clip_size,
    device=cfg.device,
    features=cfg.features,
    labels=cfg.labels,
    actor=cfg.actor,
    max_batch_size=cfg.max_batch_size,
    merge_mode=cfg.merge_mode,
)

For internals and scheduling details, see Appendix: config, Appendix: Ray usage, and Appendix: distributed inference.

Inspect the published prediction mosaic

The prediction Zarr is also published on Source Cooperative. Let’s verify shape, bands, and value ranges.

The shape, bounds, and time remain aligned with the feature mosaic; however, the bands now represent predicted classes.

pred_ds = xr.open_zarr(
    "https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr"
).pipe(rasterix.assign_index)
pred_ds
<xarray.Dataset> Size: 151TB
Dimensions:      (time: 2, band: 3, y: 1566049, x: 4007517)
Coordinates:
  * time         (time) datetime64[ns] 16B 2024-01-01 2025-01-01
  * band         (band) <U20 240B 'non_field_background' ... 'field_boundaries'
  * y            (y) float64 13MB 83.75 83.75 83.75 ... -56.93 -56.93 -56.93
  * x            (x) float64 32MB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
    spatial_ref  int64 8B ...
Data variables:
    variables    (time, band, y, x) float32 151TB dask.array<chunksize=(1, 1, 2048, 2048), meta=np.ndarray>
Indexes:
  ┌ x        RasterIndex (crs=None)
  └ y

Let’s plot the three labels for the same PMTiles-area AOI so the raster predictions line up with the vector preview later in the post:

import numpy as np

bbox = {
    "x": slice(140.24, 140.34),
    "y": slice(37.36, 37.26),
}
pred_subset = pred_ds.sel(x=bbox["x"], y=bbox["y"])
geo_aspect = 1 / np.cos(np.deg2rad(float(pred_subset.y.mean())))

grid = pred_subset["variables"].plot.imshow(col="band", row="time", figsize=(12, 8))
for ax in grid.axes.flat:
    ax.set_aspect(geo_aspect)

plt.show()

For those curious, the full Zarr metadata is shown in Appendix: Zarr internals.

Vectorize Mosaic

vectorize_mosaic converts prediction rasters into field-boundary GeoParquet outputs.

You specify feature bands, threshold, and method. Then the workflow writes distributed vector outputs for analytics and downstream publishing. We write GeoParquet because it keeps the result columnar, splittable, and easy to consume from batch analytics systems.

Conceptually, this output is a distributed geospatial table rather than an array: each row is a polygonized field candidate, the geometry column holds the boundary, and the remaining columns carry the scalar attributes you want to inspect in GeoPandas or query later in SedonaDB and WherobotsDB.

For distributed implementation details, see Appendix: vectorization.

from rasterflow_remote.data_models import VectorizeMethodEnum, SemSegRasterioConfig

vectors = client.vectorize_mosaic(
    store="https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr",
    features=["field"],
    threshold=0.5,
    vectorize_method=VectorizeMethodEnum.SEMANTIC_SEGMENTATION_RASTERIO,
    vectorize_config=SemSegRasterioConfig(stats=False, medial_skeletonize=False),
)

A lightweight way to inspect that schema locally is:

import geopandas as gpd

vector_preview = gpd.read_parquet(vectors.uri)
vector_preview.head(3)

That gives you a normal GeoDataFrame view of the vector output: one geometry column plus the per-row attributes emitted by the vectorization stage.

Implementation details for distributed/blockwise vectorization are in Appendix: vectorization.

PMTiles

PMTiles packages vector outputs for fast, interactive map delivery. We use it here because a single archive is easy to publish, cache, and stream into browser maps without standing up a tile service.

We also support generating PMTiles from large GeoParquet outputs directly in WherobotsDB using our scalable PMTiles generator workflow, as described here: How to Generate PMTiles for Overture Maps with WherobotsDB VTiles.
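
For local experimentation, a minimal sketch of packaging a small GeoParquet sample as PMTiles is shown below. It assumes the tippecanoe CLI (version 2.17 or later, which can write PMTiles directly) is installed, and the single-partition path is hypothetical; this is a local preview path, not the scalable WherobotsDB VTiles workflow linked above.

import subprocess
import geopandas as gpd

# Local sketch only: read one small GeoParquet partition, export GeoJSON,
# and let tippecanoe write a single PMTiles archive.
sample = gpd.read_parquet("results/part-00000.parquet")  # hypothetical single partition from the vectorize output
sample.to_file("fields_sample.geojson", driver="GeoJSON")

subprocess.run(
    [
        "tippecanoe",
        "-o", "fields_sample.pmtiles",   # single-archive PMTiles output
        "-l", "fields",                  # layer name inside the archive
        "-zg",                           # let tippecanoe choose the max zoom
        "--drop-densest-as-needed",      # thin features rather than fail on dense tiles
        "fields_sample.geojson",
    ],
    check=True,
)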

Scale Profile

This section summarizes the scale of each pipeline step and the size of the generated outputs. Understanding the scale of this production run will help when we discuss design choices later in this blog post.

  • Feature COGs (build_mosaic): 153 TB across 90,918 objects at s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/features/cogs/. Each COG is a median composite from ~5-10 Sentinel-2 scenes per pixel.
  • Feature Zarr mosaic (build_mosaic): 150 TB across 363,999 objects at s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/features/zarr/alpha/global.zarr. 7,499,140 logical chunks, sharded to keep object count manageable.
  • Prediction Zarr mosaic (predict_mosaic): 45 TB across 84,877 objects at s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/predictions/zarr/alpha/global.zarr. 8,982,630 logical chunks, also sharded.
  • Vector outputs, GeoParquet (vectorize_mosaic): 675 GB across 1,000 objects at s3://us-west-2.opendata.source.coop/tge-labs/ftw-global-data/predictions/vectors/alpha/results/. 8,217,195,679 rows.

Aggregate data output (above artifacts): ~348.7 TB across 540,794 objects.

RasterFlow summary

You can reproduce this end-to-end global-scale GeoAI pipeline with three API calls:

from datetime import datetime

from rasterflow_remote import RasterflowClient, DatasetEnum
from rasterflow_remote.model_registry import get_model_registry_config, ModelRegistryEnum
from rasterflow_remote.data_models import VectorizeMethodEnum, SemSegRasterioConfig

client = RasterflowClient()
cfg = get_model_registry_config(ModelRegistryEnum.HUGGINGFACE, "wherobots/prue-pt2")

mosaic = client.build_mosaic(
    datasets=[DatasetEnum.S2_MED_PLANTING, DatasetEnum.S2_MED_HARVEST],
    aoi="https://github.com/fieldsoftheworld/ftw-inference-app/blob/main/src/data/s2-grid.json",
    start=datetime(2023, 1, 1),
    end=datetime(2025, 1, 1),
    crs_epsg=4326,
)

predictions = client.predict_mosaic(
    store=mosaic.uri,
    model_path=cfg.model_path,
    patch_size=cfg.patch_size,
    clip_size=cfg.clip_size,
    device=cfg.device,
    features=cfg.features,
    labels=cfg.labels,
    actor=cfg.actor,
    max_batch_size=cfg.max_batch_size,
    merge_mode=cfg.merge_mode,
)

vectors = client.vectorize_mosaic(
    store=predictions.uri,
    features=["field"],
    threshold=0.5,
    vectorize_method=VectorizeMethodEnum.SEMANTIC_SEGMENTATION_RASTERIO,
    vectorize_config=SemSegRasterioConfig(stats=False, medial_skeletonize=False),
)

In short, this global-scale GeoAI pipeline with RasterFlow starts from global feature mosaics, runs one global inference workflow, and publishes both analytics-ready and map-ready outputs.

If you want a hands-on starting point, sign up for the RasterFlow Private Preview today and try the FTW notebook in the Solution Gallery.

If your team is evaluating global-scale raster + vector production workflows, this is exactly what RasterFlow private preview is for.

For API details and additional examples, see the Rasterflow documentation.

Appendix — Technical Notes

Building Feature COGs

Internally, build_mosaic orchestrates four key steps:

  1. Seasonal DOY filtering — candidate scenes are filtered by day-of-year windows derived from latitude-aware seasonal heuristics.
  2. Quality masking — cloud/quality masks suppress invalid observations.
  3. Greedy scene selection — we iteratively pick scenes that maximize uncovered valid area so we reach broad coverage with a bounded scene count.
  4. Temporal compositing — selected scenes are collapsed into median composites for downstream model features.

The current DOY heuristic and greedy objective are intentionally conservative and production-stable, but there is room for improvement (for example, region-specific phenology priors, dynamic cloud climatology, and cost-aware objective tuning).
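
As an illustration of step 3, here is a minimal, self-contained sketch of the greedy idea (hypothetical helper, not the production implementation): at each iteration, pick the scene whose valid-pixel mask adds the most not-yet-covered area, and stop once a coverage target or scene budget is reached.

import numpy as np

def greedy_select(valid_masks, max_scenes=10, target_coverage=0.95):
    """Sketch of greedy scene selection: valid_masks is a list of boolean
    arrays (True = valid, cloud-free pixel) on a common tile grid."""
    covered = np.zeros_like(valid_masks[0], dtype=bool)
    selected = []
    for _ in range(max_scenes):
        # Gain = number of currently uncovered pixels each candidate scene would add.
        gains = [np.count_nonzero(m & ~covered) for m in valid_masks]
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break
        selected.append(best)
        covered |= valid_masks[best]
        if covered.mean() >= target_coverage:
            break
    return selected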

We also produce a COG index, which is used with GDAL’s GTI driver to build the underlying global mosaic.

import geopandas as gpd
import fsspec

with fsspec.open("https://data.source.coop/ftw/global-data/features/cogs/alpha/index.parquet") as f:
    index = gpd.read_parquet(f)
index.head(3)
tile_id geometry time dataset url
0 01FBE POLYGON ((180 -50.59941, 178.76332 -50.5626, 1… 2024-01-01 00:00:00+00:00 s2med_planting s3://us-west-2.opendata.source.coop/tge-labs/f…
1 01FBE POLYGON ((180 -50.59941, 178.76332 -50.5626, 1… 2025-01-01 00:00:00+00:00 s2med_planting s3://us-west-2.opendata.source.coop/tge-labs/f…
2 01FBF POLYGON ((180 -49.70001, 178.84184 -49.66599, … 2024-01-01 00:00:00+00:00 s2med_planting s3://us-west-2.opendata.source.coop/tge-labs/f…

Mosaicking notes (GTI + Ray)

After feature COGs are written, we build a global virtual mosaic index and then materialize aligned arrays on the target grid. This relies on GDAL’s GTI driver as the index and mosaicking layer.

Grid configuration for this run:

  • CRS: EPSG:4326
  • Resolution: 8.983119e-5 degrees (~10 m at equator)
  • Resampling: cubic

The GTI-backed approach keeps COG indexing separate from downstream chunk/block execution and maps cleanly to distributed scheduling.
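
As a rough sketch of how such an index can be consumed, the snippet below opens the tile index as a single virtual raster via the GTI driver. It assumes GDAL 3.9+ with the GTI driver, a local copy of the index, and an index that carries resolution/extent metadata (otherwise RESX/RESY open options are needed); LOCATION_FIELD points the driver at the column holding tile URLs.

from osgeo import gdal

# Sketch: open the COG tile index as one virtual global mosaic (GDAL >= 3.9).
# The FTW index stores tile paths in a 'url' column rather than the default 'location'.
ds = gdal.OpenEx("GTI:index.parquet", open_options=["LOCATION_FIELD=url"])
print(ds.RasterXSize, ds.RasterYSize, ds.RasterCount)

# Read a small window from the virtual mosaic without materializing the globe.
window = ds.GetRasterBand(1).ReadAsArray(xoff=0, yoff=0, win_xsize=512, win_ysize=512)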

Zarr internals

Zarr is the storage layer that makes the global feature and prediction mosaics usable as products rather than just intermediate files.

Two layout choices matter here:

  • Logical chunks are the units we want readers and compute stages to address.
  • Shards are the larger storage objects we actually write to object storage.

We split them on purpose. Small logical chunks keep regional reads and retry boundaries precise, while larger shards prevent the store from exploding into millions of tiny objects. Larger shards also lower overall S3 PUT and LIST costs and reduce request-rate pressure.

For the feature store, the logical chunk is (1, 1, 4096, 4096) and shards are written at (1, 1, 12288, 12288), so each object packs a 3 x 3 tile of logical chunks. For the prediction store, the logical chunk is (1, 1, 2048, 2048) and shards are written at (1, 3, 8192, 8192), so each object packs all three class bands across a 4 x 4 spatial tile of logical chunks.

The tradeoff is deliberate: larger shards improve object-store efficiency and reduce listing overhead, but smaller logical chunks still give us better partial reads, safer retries, and more control over Ray execution geometry.
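
On the write side, this layout can be expressed directly in zarr-python. A minimal sketch, assuming zarr-python 3.0+ (which accepts separate chunks= and shards= arguments), for the feature-store geometry:

import zarr

# Sketch (zarr-python >= 3.0): 4096 x 4096 logical chunks packed into
# 12288 x 12288 shards, i.e. a 3 x 3 grid of chunks per storage object.
arr = zarr.create_array(
    store="example_features.zarr",
    shape=(2, 10, 1566049, 4007517),   # (time, band, y, x), matching the feature mosaic
    chunks=(1, 1, 4096, 4096),         # read / retry granularity
    shards=(1, 1, 12288, 12288),       # object-store write granularity
    dtype="float32",
    fill_value=float("nan"),
)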

The feature Zarr mosaic metadata is as follows:

import zarr

group = zarr.open("https://data.source.coop/ftw/global-data/features/zarr/alpha/global.zarr")
group["variables"].metadata.__dict__
{'shape': (2, 10, 1566049, 4007517),
 'data_type': Float32(endianness='little'),
 'chunk_grid': RegularChunkGrid(chunk_shape=(1, 1, 12288, 12288)),
 'chunk_key_encoding': DefaultChunkKeyEncoding(separator='/'),
 'codecs': (ShardingCodec(chunk_shape=(1, 1, 4096, 4096), codecs=(BytesCodec(endian=<Endian.little: 'little'>), ZstdCodec(level=0, checksum=False)), index_codecs=(BytesCodec(endian=<Endian.little: 'little'>), Crc32cCodec()), index_location=<ShardingCodecIndexLocation.end: 'end'>),),
 'dimension_names': ('time', 'band', 'y', 'x'),
 'fill_value': np.float32(nan),
 'attributes': {'coordinates': 'spatial_ref', '_FillValue': 'AAAAAAAA+H8='},
 'storage_transformers': (),
 'extra_fields': {}}

The prediction Zarr mosaic metadata is as follows:

group = zarr.open("https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr")
group["variables"].metadata.__dict__
{'shape': (2, 3, 1566049, 4007517),
 'data_type': Float32(endianness='little'),
 'chunk_grid': RegularChunkGrid(chunk_shape=(1, 3, 8192, 8192)),
 'chunk_key_encoding': DefaultChunkKeyEncoding(separator='/'),
 'codecs': (ShardingCodec(chunk_shape=(1, 1, 2048, 2048), codecs=(BytesCodec(endian=<Endian.little: 'little'>), ZstdCodec(level=0, checksum=False)), index_codecs=(BytesCodec(endian=<Endian.little: 'little'>), Crc32cCodec()), index_location=<ShardingCodecIndexLocation.end: 'end'>),),
 'dimension_names': ('time', 'band', 'y', 'x'),
 'fill_value': np.float32(nan),
 'attributes': {'coordinates': 'spatial_ref', '_FillValue': 'AAAAAAAA+H8='},
 'storage_transformers': (),
 'extra_fields': {}}

Xarray notes

We use rasterix to attach analytical coordinates lazily so opening the global feature store does not require eagerly reading large coordinate vectors every time.

Why this matters:

  • x length: 4,007,517
  • y length: 1,566,049
  • At float64 precision, the coordinates alone are roughly (4,007,517 + 1,566,049) * 8 bytes ~= 45 MB before reading any model features.

Operationally, our xarray pattern is:

  1. Open lazily from object storage.
  2. Slice first (.sel) to bound I/O.
  3. Materialize only when plotting/validating.

This keeps exploratory analysis responsive while preserving direct comparability between feature and prediction stores on the same grid.
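
Condensed into a sketch against the published prediction store, that pattern looks like this:

import xarray as xr
import rasterix

# 1. Open lazily: only metadata and index machinery are touched here.
ds = xr.open_zarr(
    "https://data.source.coop/ftw/global-data/predictions/zarr/alpha/global.zarr"
).pipe(rasterix.assign_index)

# 2. Slice first to bound I/O to a small window.
window = ds["variables"].sel(x=slice(140.24, 140.34), y=slice(37.36, 37.26))

# 3. Materialize only what is plotted or validated.
field_2025 = window.sel(time="2025-01-01", band="field").compute()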

The Inference Config

The InferenceConfig is the contract between model packaging and distributed execution. It tells RasterFlow how to read features, how large each inference window should be, what runtime to launch, and how overlapping predictions should be merged back into the output mosaic.

In practice, the fields fall into a few groups:

  • model_path and actor select the model artifact and the inference runtime implementation.
  • patch_size and clip_size define the read window and how much border area is discarded to suppress edge artifacts.
  • features and labels define the exact tensor interface between the Zarr store and the model.
  • device and max_batch_size control throughput versus memory pressure on the target worker.
  • merge_mode defines how overlapping patch predictions are blended into the final raster.

from rasterflow_remote.data_models import MergeModeEnum, InferenceActorEnum

{
    # path to the pt2 archive on hugging face
    'model_path': 'https://huggingface.co/wherobots/prue-pt2/resolve/main/prue-efnetb7.pt2',
    # an enum used to specify how to handle the model for inference
    'actor': InferenceActorEnum.SEMANTIC_SEGMENTATION_PYTORCH,
    # patch size determines the size of the inputs provided to the model
    'patch_size': 256,
    # clip size determines the size we clip off the borders to reduce edge effects during inference
    'clip_size': 32,
    # device to run inference on, e.g. 'cuda' or 'cpu'
    'device': 'cuda',
    # list of features in order as required by the model. These names map to the input Zarr band labels.
    'features': [
        's2med_harvest:B04',
        's2med_harvest:B03',
        's2med_harvest:B02',
        's2med_harvest:B08',
        's2med_planting:B04',
        's2med_planting:B03',
        's2med_planting:B02',
        's2med_planting:B08',
    ],
    # the named labels for the predicted classes. These will be the output band labels in the output Zarr.
    'labels': ['non_field_background', 'field', 'field_boundaries'],
    # the maximum batch size assuming our default A10 GPU with 24GB of memory.
    'max_batch_size': 128,
    # the method to use when merging overlapping predictions to reduce edge effects at the patch level
    'merge_mode': MergeModeEnum.WEIGHTED_AVERAGE
}

How Ray is used for inference

We chose Ray as the data-plane engine because the workload is embarrassingly parallel at block level, but still needs data-aware scheduling and actor pools for GPU inference.

Ray is used as a bounded block-processing engine over sharded Zarr storage.

Key mapping model:

  • Logical chunks define storage math and indexability.
  • Shards reduce object count by packing multiple chunks per object.
  • Ray blocks define execution granularity and may span multiple chunks.

This decoupling lets us tune compute batch shape without rewriting storage layout. In the data pipeline, Ray Data uses Arrow-backed transport so tabular metadata/state can move across stages with minimal copying.

Practical takeaway: throughput scales with actor parallelism while per-task memory remains tied to block geometry, not global mosaic size.
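
As a rough illustration of that mapping (hypothetical names, not the RasterFlow internals), a Ray Data actor pool over blocks can be expressed with map_batches, so model state is created once per actor and per-task memory follows the block shape:

import numpy as np
import ray

class BlockPredictor:
    # Illustrative stateful actor: a real pool would load the PRUE model here.
    def __init__(self):
        self.model = lambda pixels: pixels.mean(axis=(1, 2), keepdims=True)  # stand-in model

    def __call__(self, batch: dict) -> dict:
        batch["prediction"] = self.model(batch["pixels"])
        return batch

# Toy blocks standing in for Zarr-backed read tasks.
blocks = ray.data.from_items(
    [{"pixels": np.zeros((256, 256), dtype=np.float32)} for _ in range(8)]
)
scored = blocks.map_batches(
    BlockPredictor,
    concurrency=4,   # size of the actor pool
    num_gpus=1,      # one GPU per actor; drop on CPU-only clusters
    batch_size=2,
)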

Distributed inference (wherobots-rasterflow)

The InferenceConfig above feeds directly into block planning, actor setup, and overlap-aware merging. Internally, predict_mosaic orchestrates three phases:

  1. Block planning — derive execution blocks from chunk metadata and resource limits.
  2. Read/Infer/Write pipeline — execute overlap-aware reads, actor inference, and deterministic writes.
  3. Retry-safe commit — persist block outputs with stable target regions.

# conceptual: chunk/shard metadata -> Ray blocks

chunk_meta = read_zarr_chunk_metadata(store)
blocks = plan_ray_blocks(chunk_meta, target_rows_per_block)

ray_ds = ray.data.from_items(blocks)
result = (
    ray_ds
    .map(read_zarr_block_with_overlap)
    .map(run_inference_actor_pool)
    .map(clip_and_write_block)
)

Our distributed approach to vectorizing

vectorize_mosaic operates at block level so geometry generation scales linearly with partition count rather than requiring one global polygonization pass.

High-level flow:

  1. Read prediction blocks from Zarr.
  2. Apply thresholding per block.
  3. Polygonize per block.
  4. Emit GeoParquet partitions and merge metadata.

This keeps memory bounded and allows retries/reprocessing for individual partitions.
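
A minimal blockwise sketch of steps 2-4 (hypothetical names; assumes rasterio, shapely, and geopandas are installed): threshold one 'field' probability block, polygonize the mask, and write one GeoParquet partition.

import geopandas as gpd
import numpy as np
import rasterio.features
from affine import Affine
from shapely.geometry import shape

def vectorize_block(field_prob: np.ndarray, transform: Affine, threshold: float = 0.5) -> gpd.GeoDataFrame:
    # Threshold the probability block, then polygonize the binary mask.
    mask = (field_prob >= threshold).astype("uint8")
    polygons = [
        shape(geom)
        for geom, value in rasterio.features.shapes(mask, mask=mask.astype(bool), transform=transform)
        if value == 1
    ]
    return gpd.GeoDataFrame(geometry=polygons, crs="EPSG:4326")

# Toy block with one synthetic "field"; real transforms come from the block's grid window.
block = np.zeros((512, 512), dtype="float32")
block[100:300, 150:350] = 0.9
transform = Affine(8.983119e-5, 0.0, 140.24, 0.0, -8.983119e-5, 37.36)

gdf = vectorize_block(block, transform)
gdf.to_parquet("vectors-part-00000.parquet")  # one GeoParquet partition per block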
