Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data. Learn more.
Author: Len Strnad
AlphaEarth Foundations embeddings can be useful at the pixel level when the task is tightly local, such as inspecting boundaries, fine-scale heterogeneity, or small spatial disturbances. But many practical workflows need signals that stay meaningful after aggregation, whether to fields, parcels, patches, or other management units. Zonal statistics provide a simple way to test that idea: aggregate embeddings within field polygons and ask whether the field-level vectors still preserve interpretable structure.
In this notebook, we compute mean AlphaEarth Foundations (AEF) embeddings for about 3.5 million Iowa fields across 2023-2025, then project those field-level vectors into RGB using the first three PCA components. The result is a GeoParquet dataset we can inspect spatially and compare across fields.
The main takeaway is straightforward: the aggregated embeddings form visible clusters that may correspond to crop type or management practice. That does not prove a downstream classifier will work, but it is encouraging evidence that zonal AEF summaries could support lightweight field-level analysis with simple models rather than per-pixel inference.
RasterFlow is Wherobots’ serverless inference engine for Earth Observation data. It builds inference-ready mosaics from remote sensing data and runs distributed workflows at scale. This notebook uses four capabilities:
build_and_predict_mosaic_recipe
vectorize_mosaic
For this analysis, we use the global AlphaEarth Foundations Zarr mosaic hosted on Source Cooperative. Thanks to Taylor Geospatial and Source Cooperative for publishing the dataset in a form that can be accessed lazily with standard Python tooling.
The mosaic stores annual AEF embeddings at 10-meter resolution in EPSG:4326, with nine temporal snapshots from 2017 through 2025. Working from one global store keeps the workflow simple: we can subset Iowa, compute field-level summaries, and publish a derived product without assembling local inputs.
import xarray as xr
ds = xr.open_zarr("s3://us-west-2.opendata.source.coop/tge-labs/aef-mosaic/", consolidated=False) ds
<xarray.Dataset> Size: 4PB Dimensions: (time: 9, band: 64, y: 1859584, x: 4009984) Coordinates: * time (time) int32 36B 2017 2018 2019 2020 2021 2022 2023 2024 2025 * band (band) object 512B 'A00' 'A01' 'A02' 'A03' ... 'A61' 'A62' 'A63' * y (y) float64 15MB 83.69 83.69 83.69 ... -83.36 -83.36 -83.36 * x (x) float64 32MB -180.0 -180.0 -180.0 ... 180.2 180.2 180.2 Data variables: embeddings (time, band, y, x) int8 4PB dask.array<chunksize=(1, 64, 256, 256), meta=np.ndarray> Attributes: (12/15) proj:code: EPSG:4326 spatial:dimensions: ['y', 'x'] spatial:transform: [8.983111749910169e-05, 0.0, -180.0, 0.0, -8.983... spatial:transform_type: affine spatial:bbox: [-180.0, -83.36280346631479, 180.2213438735178, ... spatial:shape: [1859584, 4009984] ... ... geoemb:model: https://developers.google.com/earth-engine/datas... geoemb:source_data: https://source.coop/tge-labs/aef/v1/annual/ geoemb:data_type: int8 geoemb:gsd: 8.983111749910169e-05 geoemb:quantization: {'method': 'signed_square', 'original_dtype': 'f... zarr_conventions: [{'uuid': 'f17cb550-5864-4468-aeb7-f3180cfb622f'...
array([2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025], dtype=int32)
array(['A00', 'A01', 'A02', 'A03', 'A04', 'A05', 'A06', 'A07', 'A08', 'A09', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'A16', 'A17', 'A18', 'A19', 'A20', 'A21', 'A22', 'A23', 'A24', 'A25', 'A26', 'A27', 'A28', 'A29', 'A30', 'A31', 'A32', 'A33', 'A34', 'A35', 'A36', 'A37', 'A38', 'A39', 'A40', 'A41', 'A42', 'A43', 'A44', 'A45', 'A46', 'A47', 'A48', 'A49', 'A50', 'A51', 'A52', 'A53', 'A54', 'A55', 'A56', 'A57', 'A58', 'A59', 'A60', 'A61', 'A62', 'A63'], dtype=object)
array([ 83.68566 , 83.685571, 83.685481, ..., -83.362579, -83.362669, -83.362759], shape=(1859584,))
array([-179.999955, -179.999865, -179.999775, ..., 180.221119, 180.221209, 180.221299], shape=(4009984,))
At first glance, 3.81 PiB and 1,024,049,664 chunks in S3 sound impractically large. For this demo, a few details keep the workflow manageable:
We use rayzon, a side project I built to compute zonal statistics using Ray.
Without tuning, it computes zonal statistics for about 3 million Iowa fields across three years in roughly five minutes on a Ray cluster with eight workers and 16 cores per worker.
It also supports t-digest-backed quantile estimation for statistics beyond mean, min, and max, though that path is still experimental. If this workflow is relevant to your team, let us know so we can prioritize productizing this type of analysis in RasterFlow.
Using that workflow, we compute mean zonal statistics for Iowa fields across three years of the global AEF Zarr mosaic.
The resulting GeoParquet is published at s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/ for anyone who wants to inspect it.
s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/
import pyarrow.parquet as pq ds = pq.ParquetDataset("s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/") ds.schema
geometry: binary score_mean: float score_std: float source_store: string local_x_offset: int64 local_y_offset: int64 global_x_offset: int64 global_y_offset: int64 path: string feature_id: int64 bbox: struct<xmax: double, xmin: double, ymax: double, ymin: double> child 0, xmax: double child 1, xmin: double child 2, ymax: double child 3, ymin: double mean: list<element: double> child 0, element: double pca_components: list<element: float> child 0, element: float hilbert_key: int64 time: dictionary<values=string, indices=int32, ordered=0> label: dictionary<values=string, indices=int32, ordered=0> -- schema metadata -- geo: '{"version": "1.1.0", "primary_column": "geometry", "columns": {"geo' + 1808
The schema exposes three columns of interest:
geometry
mean
pca_components
We also partition the dataset using a Hive-style layout on time and label. In this example, label is always field; the model also produces field_boundaries and non_field, but those classes are not the focus here.
time
label
field
field_boundaries
non_field
# what are the partitioning keys? ds.partitioning.dictionaries
[<pyarrow.lib.StringArray object at 0x14199d660> [ "2023-01-01", "2024-01-01", "2025-01-01" ], <pyarrow.lib.StringArray object at 0x14199d1e0> [ "field" ]]
The output contains about 3.5 million Iowa fields across the 2023-2025 period.
sum([fragment.count_rows() for fragment in ds.fragments])
3516044
The dataset also includes score_mean, the mean model confidence over each polygonized field. Because fields are formed from contiguous pixels above the vectorization threshold, this gives us a compact way to inspect how confident the model was within each extracted geometry. Here we use the 0.5 threshold from vectorize_mosaic.
score_mean
0.5
If the viewer below is not working, try fullscreen mode here.
We can see a great deal of variance in the confidence scores between fields, which can provide some critical feedback for modeling efforts and help guide stratefied sampling for well-balaned training sets.
We also visualize each field’s mean AEF embedding by mapping the first three PCA components to RGB in GeoParquet.
The map shows clear color clustering, especially purples, yellow-greens, and blues. To check whether that spatial pattern is reflected numerically, we next plot the first two PCA components in a scatter plot.
import pandas as pd import matplotlib.pyplot as plt df = pd.read_parquet( "s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/", columns=["pca_components"], ) sample = df.sample(n=20000) sample["pca_1"] = sample["pca_components"].apply(lambda x: x[0]) sample["pca_2"] = sample["pca_components"].apply(lambda x: x[1]) sample["pca_3"] = sample["pca_components"].apply(lambda x: x[2]) sample.plot.scatter( x="pca_1", y="pca_2", c=sample["pca_components"].tolist(), alpha=0.5, figsize=(10, 8) ) # overlay 2d density plot plt.hexbin(sample["pca_1"], sample["pca_2"], gridsize=50, cmap='Greys', alpha=0.2, marginals=True) # add normalized density to plt.title("PCA Components Colored by (3) PCA Components") plt.xlabel("PCA Component 1") plt.ylabel("PCA Component 2") plt.show()
We see three broad clusters in the 2D PCA scatter plot, and the cluster colors align with the RGB visualization above. The purple and green groups appear tighter than the yellow group, but the overall separation is still encouraging. That suggests a simple linear probe on top of the AEF embeddings may be able to distinguish field types or management patterns without manual feature engineering or complex models.
Beyond classification, AEF may also be useful for stratified sampling, supervised training set design, or in situ analysis of field management practices.
If you are interested in this approach, reach out to the Wherobots team at support@wherobots.com.
How We Delivered “Fields of The World” with RasterFlow: A Planetary-Scale GeoAI Pipeline
See how we used RasterFlow to run a 100TB+ global GeoAI pipeline, from feature mosaics to predictions and vectors, with reproducible workflows.
Graph RAG for the Physical World
Introduction RAG (Retrieval Augmented Generation) has addressed one of AI’s biggest challenges for enterprise users: missing or hallucinating empirical business and real world context . Instead of generating answers from nothing, RAG retrieves relevant documents and feeds them to the model as context. It works. Ask an AI about your company’s Q4 revenue, and RAG […]
Building the Wherobots Mobility Solution Accelerator: A Technical Deep Dive
Three Notebooks, One Medallion Architecture, Full 4D GPS Trajectory Processing: Part 2 of 2
How well does SAM3 detect building footprints? Let’s ask the Wherobots Spatial AI Assistant!
In a recent post, we showed how easy it is to use RasterFlow and Meta’s Segment Anything 3 Model (SAM3) to detect features in the physical world. A single end-to-end pipeline built a 133 GB NAIP mosaic of Marion County, Oregon, ran SAM3 against it with text prompts spanning eight classes, and produced approximately one […]
share this article
Awesome that you’d like to share our articles. Where would you like to share it to: