AlphaEarth Foundations embeddings can be useful at the pixel level when the task is tightly local, such as inspecting boundaries, fine-scale heterogeneity, or small spatial disturbances. But many practical workflows need signals that stay meaningful after aggregation, whether to fields, parcels, patches, or other management units. Zonal statistics provide a simple way to test that idea: aggregate embeddings within field polygons and ask whether the field-level vectors still preserve interpretable structure.
In this notebook, we compute mean AlphaEarth Foundations (AEF) embeddings for about 3.5 million Iowa fields across 2023-2025, then project those field-level vectors into RGB using the first three PCA components. The result is a GeoParquet dataset we can inspect spatially and compare across fields.
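The PCA-to-RGB projection step can be sketched in a few lines. This is a minimal illustration using synthetic vectors in place of the real field-level embeddings (the published dataset was produced by the distributed pipeline described below, not by this snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for per-field mean AEF embeddings: 1,000 fields x 64 bands
emb = rng.normal(size=(1000, 64))

# PCA via SVD on the centered matrix
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ vt[:3].T          # project onto the first three components

# min-max scale each component into [0, 1] so it can be used as an RGB channel
lo, hi = pcs.min(axis=0), pcs.max(axis=0)
rgb = (pcs - lo) / (hi - lo)       # shape (1000, 3), one color per field
```

Fitting PCA once on the full set of field vectors and reusing the same projection for every year keeps colors comparable across time.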
The main takeaway is straightforward: the aggregated embeddings form visible clusters that may correspond to crop type or management practice. That does not prove a downstream classifier will work, but it is encouraging evidence that zonal AEF summaries could support lightweight field-level analysis with simple models rather than per-pixel inference.
RasterFlow is Wherobots’ serverless inference engine for Earth Observation data. It builds inference-ready mosaics from remote sensing data and runs distributed workflows at scale. This notebook uses two of its capabilities:
build_and_predict_mosaic_recipe
vectorize_mosaic
For this analysis, we use the global AlphaEarth Foundations Zarr mosaic hosted on Source Cooperative. Thanks to Taylor Geospatial and Source Cooperative for publishing the dataset in a form that can be accessed lazily with standard Python tooling.
The mosaic stores annual AEF embeddings at 10-meter resolution in EPSG:4326, with nine temporal snapshots from 2017 through 2025. Working from one global store keeps the workflow simple: we can subset Iowa, compute field-level summaries, and publish a derived product without assembling local inputs.
import xarray as xr

ds = xr.open_zarr(
    "s3://us-west-2.opendata.source.coop/tge-labs/aef-mosaic/",
    consolidated=False,
)
ds
<xarray.Dataset> Size: 4PB
Dimensions:     (time: 9, band: 64, y: 1859584, x: 4009984)
Coordinates:
  * time        (time) int32 36B 2017 2018 2019 2020 2021 2022 2023 2024 2025
  * band        (band) object 512B 'A00' 'A01' 'A02' 'A03' ... 'A61' 'A62' 'A63'
  * y           (y) float64 15MB 83.69 83.69 83.69 ... -83.36 -83.36 -83.36
  * x           (x) float64 32MB -180.0 -180.0 -180.0 ... 180.2 180.2 180.2
Data variables:
    embeddings  (time, band, y, x) int8 4PB dask.array<chunksize=(1, 64, 256, 256), meta=np.ndarray>
Attributes: (12/15)
    proj:code:               EPSG:4326
    spatial:dimensions:      ['y', 'x']
    spatial:transform:       [8.983111749910169e-05, 0.0, -180.0, 0.0, -8.983...
    spatial:transform_type:  affine
    spatial:bbox:            [-180.0, -83.36280346631479, 180.2213438735178, ...
    spatial:shape:           [1859584, 4009984]
    ...                      ...
    geoemb:model:            https://developers.google.com/earth-engine/datas...
    geoemb:source_data:      https://source.coop/tge-labs/aef/v1/annual/
    geoemb:data_type:        int8
    geoemb:gsd:              8.983111749910169e-05
    geoemb:quantization:     {'method': 'signed_square', 'original_dtype': 'f...
    zarr_conventions:        [{'uuid': 'f17cb550-5864-4468-aeb7-f3180cfb622f'...
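Because the store opens lazily, selecting a region never touches chunks outside it. Here is a toy illustration of the selection pattern, using a small synthetic dataset in place of the real mosaic and an approximate Iowa bounding box; note that y is stored north-to-south, so the latitude slice runs max to min:

```python
import numpy as np
import xarray as xr

# toy stand-in for the global mosaic: descending y, ascending x, 0.1-degree grid
y = np.linspace(50.0, 30.0, 201)
x = np.linspace(-100.0, -80.0, 201)
ds = xr.Dataset(
    {"embeddings": (("time", "band", "y", "x"),
                    np.zeros((1, 2, 201, 201), dtype=np.int8))},
    coords={"time": [2023], "band": ["A00", "A01"], "y": y, "x": x},
)

# y is descending, so the latitude slice runs max -> min
iowa = ds["embeddings"].sel(
    time=2023,
    y=slice(43.50, 40.37),    # approximate Iowa latitudes
    x=slice(-96.64, -90.14),  # approximate Iowa longitudes
)
```

Against the real store, the same `.sel` pattern only schedules reads for the Dask chunks that intersect the requested window.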
At first glance, 3.81 PiB and 1,024,049,664 chunks in S3 sound impractically large. For this demo, a few details keep the workflow manageable:
We use rayzon, a side project I built to compute zonal statistics using Ray.
Without tuning, it computes zonal statistics for about 3.5 million Iowa fields across three years in roughly five minutes on a Ray cluster with eight workers and 16 cores per worker.
It also supports t-digest-backed quantile estimation for statistics beyond mean, min, and max, though that path is still experimental. If this workflow is relevant to your team, let us know so we can prioritize productizing this type of analysis in RasterFlow.
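One way to see why the petabyte-scale store is manageable here: a back-of-envelope estimate of the raw int8 bytes behind the Iowa subset, using the GSD from the store metadata and the same approximate Iowa bounding box assumed above:

```python
# rough estimate of the raw bytes covering Iowa for three annual snapshots
gsd = 8.983111749910169e-05                        # degrees per pixel, from store metadata
lat_span, lon_span = 43.50 - 40.37, 96.64 - 90.14  # approximate Iowa bbox
rows = lat_span / gsd
cols = lon_span / gsd
total_bytes = rows * cols * 64 * 3                 # 64 bands, 3 years, int8
print(f"{total_bytes / 1e12:.2f} TB")              # well under a terabyte of the ~4 PB store
```

Lazy chunk access means only this sliver of the global mosaic is ever read.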
Using that workflow, we compute mean zonal statistics for Iowa fields across three years of the global AEF Zarr mosaic.
The resulting GeoParquet is published at s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/ for anyone who wants to inspect it.
import pyarrow.parquet as pq

ds = pq.ParquetDataset("s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/")
ds.schema
geometry: binary
score_mean: float
score_std: float
source_store: string
local_x_offset: int64
local_y_offset: int64
global_x_offset: int64
global_y_offset: int64
path: string
feature_id: int64
bbox: struct<xmax: double, xmin: double, ymax: double, ymin: double>
  child 0, xmax: double
  child 1, xmin: double
  child 2, ymax: double
  child 3, ymin: double
mean: list<element: double>
  child 0, element: double
pca_components: list<element: float>
  child 0, element: float
hilbert_key: int64
time: dictionary<values=string, indices=int32, ordered=0>
label: dictionary<values=string, indices=int32, ordered=0>
-- schema metadata --
geo: '{"version": "1.1.0", "primary_column": "geometry", "columns": {"geo' + 1808
The schema exposes three columns of interest: geometry, the field polygon; mean, the 64-band mean AEF embedding; and pca_components, the first three PCA components of that embedding.
We also partition the dataset using a Hive-style layout on time and label. In this example, label is always field; the model also produces field_boundaries and non_field, but those classes are not the focus here.
# what are the partitioning keys?
ds.partitioning.dictionaries
[<pyarrow.lib.StringArray object at 0x14199d660>
 [
   "2023-01-01",
   "2024-01-01",
   "2025-01-01"
 ],
 <pyarrow.lib.StringArray object at 0x14199d1e0>
 [
   "field"
 ]]
The output contains about 3.5 million Iowa fields across the 2023-2025 period.
sum([fragment.count_rows() for fragment in ds.fragments])
3516044
The dataset also includes score_mean, the mean model confidence over each polygonized field. Because fields are formed from contiguous pixels above the vectorization threshold, this gives us a compact way to inspect how confident the model was within each extracted geometry. Here we use the 0.5 threshold from vectorize_mosaic.
If the viewer below is not working, try fullscreen mode here.
We can see a great deal of variance in the confidence scores between fields, which can provide critical feedback for modeling efforts and help guide stratified sampling for well-balanced training sets.
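As a hypothetical sketch of that sampling idea, one could bin fields by score_mean and draw an equal sample from each bin; the confidences below are synthetic stand-ins, not the real column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-in for the per-field confidence column
df = pd.DataFrame({"score_mean": rng.uniform(0.5, 1.0, size=10_000)})

# bin confidence, then draw an equal-sized sample from each bin
df["conf_bin"] = pd.cut(df["score_mean"], bins=[0.5, 0.7, 0.9, 1.0])
sample = df.groupby("conf_bin", observed=True).sample(n=100, random_state=0)
```

Sampling evenly across confidence strata keeps low-confidence fields represented in a labeling effort instead of letting high-confidence regions dominate.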
We also visualize each field’s mean AEF embedding by mapping the first three PCA components, stored in the GeoParquet, to RGB.
The map shows clear color clustering, especially purples, yellow-greens, and blues. To check whether that spatial pattern is reflected numerically, we next plot the first two PCA components in a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet(
    "s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/",
    columns=["pca_components"],
)

sample = df.sample(n=20000)
sample["pca_1"] = sample["pca_components"].apply(lambda x: x[0])
sample["pca_2"] = sample["pca_components"].apply(lambda x: x[1])
sample["pca_3"] = sample["pca_components"].apply(lambda x: x[2])

sample.plot.scatter(
    x="pca_1",
    y="pca_2",
    c=sample["pca_components"].tolist(),
    alpha=0.5,
    figsize=(10, 8),
)

# overlay a 2D density estimate
plt.hexbin(
    sample["pca_1"], sample["pca_2"],
    gridsize=50, cmap="Greys", alpha=0.2, marginals=True,
)

plt.title("First Two PCA Components, Colored by All Three")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()
We see three broad clusters in the 2D PCA scatter plot, and the cluster colors align with the RGB visualization above. The purple and green groups appear tighter than the yellow group, but the overall separation is still encouraging. That suggests a simple linear probe on top of the AEF embeddings may be able to distinguish field types or management patterns without manual feature engineering or complex models.
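A linear probe of that kind is a few lines of scikit-learn. The sketch below uses synthetic two-class data with offset means standing in for field embeddings and crop labels; with real labels (e.g. USDA Cropland Data Layer classes, which are not used here), only the inputs would change:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic stand-in: two "crop types" whose mean embeddings are offset
a = rng.normal(0.0, 1.0, size=(500, 64))
b = rng.normal(1.0, 1.0, size=(500, 64))
X = np.vstack([a, b])
y = np.repeat([0, 1], 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(probe.score(X_te, y_te), 2))  # near-perfect on this cleanly separated toy data
```

The point is the shape of the workflow: one 64-dimensional vector per field, a linear model on top, no per-pixel inference.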
Beyond classification, AEF may also be useful for stratified sampling, supervised training set design, or in situ analysis of field management practices.
If you are interested in this approach, reach out to the Wherobots team at support@wherobots.com.