
AlphaEarth Embeddings, Zonal Statistics, and PCA


AlphaEarth Foundations embeddings can be useful at the pixel level when the task is tightly local, such as inspecting boundaries, fine-scale heterogeneity, or small spatial disturbances. But many practical workflows need signals that stay meaningful after aggregation, whether to fields, parcels, patches, or other management units. Zonal statistics provide a simple way to test that idea: aggregate embeddings within field polygons and ask whether the field-level vectors still preserve interpretable structure.

In this notebook, we compute mean AlphaEarth Foundations (AEF) embeddings for about 3.5 million Iowa fields across 2023-2025, then project those field-level vectors into RGB using the first three PCA components. The result is a GeoParquet dataset we can inspect spatially and compare across fields.

The main takeaway is straightforward: the aggregated embeddings form visible clusters that may correspond to crop type or management practice. That does not prove a downstream classifier will work, but it is encouraging evidence that zonal AEF summaries could support lightweight field-level analysis with simple models rather than per-pixel inference.

RasterFlow Tools Used

RasterFlow is Wherobots’ serverless inference engine for Earth Observation data. It builds inference-ready mosaics from remote sensing data and runs distributed workflows at scale. This notebook uses four capabilities:

  • build_and_predict_mosaic_recipe on the RasterFlow Remote client for building a mosaic and running inference using the Fields of the World model.
  • vectorize_mosaic on the RasterFlow Remote client for converting field predictions from raster mosaics into GeoParquet.
  • An experimental zonal-statistics workflow for aggregating AEF embeddings over field geometries.
  • An experimental PCA workflow for appending the first three components for visualization.

Global Zarr AEF Mosaic

For this analysis, we use the global AlphaEarth Foundations Zarr mosaic hosted on Source Cooperative. Thanks to Taylor Geospatial and Source Cooperative for publishing the dataset in a form that can be accessed lazily with standard Python tooling.

The mosaic stores annual AEF embeddings at 10-meter resolution in EPSG:4326, with nine temporal snapshots from 2017 through 2025. Working from one global store keeps the workflow simple: we can subset Iowa, compute field-level summaries, and publish a derived product without assembling local inputs.

import xarray as xr
ds = xr.open_zarr("s3://us-west-2.opendata.source.coop/tge-labs/aef-mosaic/", consolidated=False)
ds
<xarray.Dataset> Size: 4PB
Dimensions:     (time: 9, band: 64, y: 1859584, x: 4009984)
Coordinates:
  * time        (time) int32 36B 2017 2018 2019 2020 2021 2022 2023 2024 2025
  * band        (band) object 512B 'A00' 'A01' 'A02' 'A03' ... 'A61' 'A62' 'A63'
  * y           (y) float64 15MB 83.69 83.69 83.69 ... -83.36 -83.36 -83.36
  * x           (x) float64 32MB -180.0 -180.0 -180.0 ... 180.2 180.2 180.2
Data variables:
    embeddings  (time, band, y, x) int8 4PB dask.array<chunksize=(1, 64, 256, 256), meta=np.ndarray>
Attributes: (12/15)
    proj:code:               EPSG:4326
    spatial:dimensions:      ['y', 'x']
    spatial:transform:       [8.983111749910169e-05, 0.0, -180.0, 0.0, -8.983...
    spatial:transform_type:  affine
    spatial:bbox:            [-180.0, -83.36280346631479, 180.2213438735178, ...
    spatial:shape:           [1859584, 4009984]
    ...                      ...
    geoemb:model:            https://developers.google.com/earth-engine/datas...
    geoemb:source_data:      https://source.coop/tge-labs/aef/v1/annual/
    geoemb:data_type:        int8
    geoemb:gsd:              8.983111749910169e-05
    geoemb:quantization:     {'method': 'signed_square', 'original_dtype': 'f...
    zarr_conventions:        [{'uuid': 'f17cb550-5864-4468-aeb7-f3180cfb622f'...
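The geoemb:quantization attribute indicates the int8 values were derived from float32 embeddings with a "signed_square" scheme. A plausible decoder, assuming the encoder stored q = round(sign(f) · sqrt(|f|) · 127) (that scaling is my assumption; the exact AEF scheme may differ), would look like:

```python
import numpy as np

def dequantize_signed_square(q: np.ndarray) -> np.ndarray:
    """Invert an assumed signed-square int8 quantization.

    Assumes the encoder stored q = round(sign(f) * sqrt(|f|) * 127),
    which preserves more precision near zero than a linear mapping.
    The true AEF scaling may differ from this sketch.
    """
    scaled = q.astype(np.float32) / 127.0
    return np.sign(scaled) * scaled**2

# Round-trip check against the assumed encoder
f = np.array([-0.9, -0.25, 0.0, 0.1, 0.64], dtype=np.float32)
q = np.round(np.sign(f) * np.sqrt(np.abs(f)) * 127).astype(np.int8)
recovered = dequantize_signed_square(q)
```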

At first glance, 3.81 PiB and 1,024,049,664 chunks in S3 sound impractically large. For this demo, a few details keep the workflow manageable:

  • only land-intersecting chunks are present
  • the dataset is sharded into nine spatial chunks, which lowers the effective object count
  • we subset Iowa rather than operating on the full global mosaic
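Because the mosaic is one global grid with a known affine transform (upper-left origin near (-180.0, 83.69), pixel size geoemb:gsd ≈ 8.983e-05 degrees), subsetting Iowa reduces to converting a lon/lat bounding box into array index ranges. A minimal sketch, where the Iowa bounds are approximate and the origin is read off the coordinate display above:

```python
import math

GSD = 8.983111749910169e-05          # degrees per pixel, from geoemb:gsd
X_ORIGIN, Y_ORIGIN = -180.0, 83.69   # upper-left corner; y decreases southward

def bbox_to_slices(min_lon, min_lat, max_lon, max_lat):
    """Convert a lon/lat bounding box into (y, x) index ranges on the
    global EPSG:4326 grid. Rows run north to south, so the northern
    edge of the box maps to the smaller y index."""
    x0 = math.floor((min_lon - X_ORIGIN) / GSD)
    x1 = math.ceil((max_lon - X_ORIGIN) / GSD)
    y0 = math.floor((Y_ORIGIN - max_lat) / GSD)
    y1 = math.ceil((Y_ORIGIN - min_lat) / GSD)
    return (y0, y1), (x0, x1)

# Approximate Iowa extent
(y0, y1), (x0, x1) = bbox_to_slices(-96.64, 40.37, -90.14, 43.50)
```

With those slices, `ds.embeddings[:, :, y0:y1, x0:x1]` touches only the chunks that intersect Iowa; the rest of the global store is never fetched.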

Zonal Statistics Experimental Workflow

We use rayzon, a side project I built to compute zonal statistics using Ray.

Without tuning, it computes zonal statistics for about 3.5 million Iowa fields across three years in roughly five minutes on a Ray cluster with eight workers and 16 cores per worker.

It also supports t-digest-backed quantile estimation for statistics beyond mean, min, and max, though that path is still experimental. If this workflow is relevant to your team, let us know so we can prioritize productizing this type of analysis in RasterFlow.
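The core aggregation behind mean zonal statistics is simple: rasterize the polygons into a label image, then reduce pixels by label. A minimal NumPy sketch of that reduction (rayzon's actual implementation adds chunked reads, Ray distribution, and t-digests on top of this idea):

```python
import numpy as np

def zonal_means(labels: np.ndarray, values: np.ndarray, n_zones: int) -> np.ndarray:
    """Per-zone means of a (bands, y, x) value raster given a (y, x)
    label raster where 0 = background and 1..n_zones are field ids.
    Returns an (n_zones, bands) array of means."""
    flat_labels = labels.ravel()
    counts = np.bincount(flat_labels, minlength=n_zones + 1)[1:]  # drop background
    bands = values.reshape(values.shape[0], -1)
    sums = np.stack(
        [np.bincount(flat_labels, weights=band, minlength=n_zones + 1)[1:]
         for band in bands],
        axis=1,
    )
    return sums / np.maximum(counts, 1)[:, None]

# Two zones on a 4x4 grid with a 2-band value raster
labels = np.array([[1, 1, 0, 2],
                   [1, 1, 0, 2],
                   [0, 0, 0, 2],
                   [0, 0, 0, 2]])
values = np.stack([np.full((4, 4), 10.0),
                   np.arange(16, dtype=float).reshape(4, 4)])
means = zonal_means(labels, values, n_zones=2)
```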

Zonal Statistics Results

Using that workflow, we compute mean zonal statistics for Iowa fields across three years of the global AEF Zarr mosaic.

The resulting GeoParquet is published at s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/ for anyone who wants to inspect it.

import pyarrow.parquet as pq

ds = pq.ParquetDataset("s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/")
ds.schema
geometry: binary
score_mean: float
score_std: float
source_store: string
local_x_offset: int64
local_y_offset: int64
global_x_offset: int64
global_y_offset: int64
path: string
feature_id: int64
bbox: struct<xmax: double, xmin: double, ymax: double, ymin: double>
  child 0, xmax: double
  child 1, xmin: double
  child 2, ymax: double
  child 3, ymin: double
mean: list<element: double>
  child 0, element: double
pca_components: list<element: float>
  child 0, element: float
hilbert_key: int64
time: dictionary<values=string, indices=int32, ordered=0>
label: dictionary<values=string, indices=int32, ordered=0>
-- schema metadata --
geo: '{"version": "1.1.0", "primary_column": "geometry", "columns": {"geo' + 1808

The schema exposes three columns of interest:

  • geometry: the geometry of each field, with one row per field
  • mean: a 64-element list containing the mean value for each AEF embedding dimension within the field
  • pca_components: a 3-element list containing the first three PCA components of each mean embedding, used here as RGB channels for visualization
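To illustrate how a pca_components-style RGB triplet can be derived from the 64-dimensional mean vectors, here is a minimal PCA sketch using NumPy's SVD; the production workflow's exact centering and scaling choices are assumptions here.

```python
import numpy as np

def pca_rgb(embeddings: np.ndarray) -> np.ndarray:
    """Project (n, 64) mean embeddings onto their first three
    principal components and min-max scale each component to
    [0, 1] for use as RGB channels."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = centered @ vt[:3].T                    # (n, 3) component scores
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    return (comps - lo) / np.maximum(hi - lo, 1e-12)

rng = np.random.default_rng(0)
rgb = pca_rgb(rng.normal(size=(500, 64)))
```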

We also partition the dataset using a Hive-style layout on time and label. In this example, label is always field; the model also produces field_boundaries and non_field, but those classes are not the focus here.

# what are the partitioning keys?
ds.partitioning.dictionaries
[<pyarrow.lib.StringArray object at 0x14199d660>
 [
   "2023-01-01",
   "2024-01-01",
   "2025-01-01"
 ],
 <pyarrow.lib.StringArray object at 0x14199d1e0>
 [
   "field"
 ]]

The output contains about 3.5 million Iowa fields across the 2023-2025 period.

sum([fragment.count_rows() for fragment in ds.fragments])
3516044

Visualizing Confidences

The dataset also includes score_mean, the mean model confidence over each polygonized field. Because fields are formed from contiguous pixels above the vectorization threshold, this gives us a compact way to inspect how confident the model was within each extracted geometry. Here we use the 0.5 threshold from vectorize_mosaic.

[Interactive map viewer: Iowa fields colored by per-field confidence score.]

There is substantial variance in confidence scores between fields, which provides useful feedback for modeling efforts and can help guide stratified sampling toward well-balanced training sets.

Visualizing Zonal Statistics of Mean Embeddings with PCA

We also visualize each field’s mean AEF embedding by mapping the first three PCA components to RGB in GeoParquet.

[Interactive map viewer: Iowa fields colored by PCA-derived RGB.]

The map shows clear color clustering, especially purples, yellow-greens, and blues. To check whether that spatial pattern is reflected numerically, we next plot the first two PCA components in a scatter plot.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet(
    "s3://wherobots-examples/rasterflow/vectors/iowa_aef_zonal_stats/",
    columns=["pca_components"],
)

# Sample fields and split the 3-element PCA list into separate columns
sample = df.sample(n=20000)
sample["pca_1"] = sample["pca_components"].apply(lambda x: x[0])
sample["pca_2"] = sample["pca_components"].apply(lambda x: x[1])
sample["pca_3"] = sample["pca_components"].apply(lambda x: x[2])

# Scatter PC1 vs PC2, coloring each point by its (PC1, PC2, PC3) RGB triplet
sample.plot.scatter(
    x="pca_1", y="pca_2", c=sample["pca_components"].tolist(), alpha=0.5, figsize=(10, 8)
)

# Overlay a 2D density estimate to make cluster mass easier to read
plt.hexbin(sample["pca_1"], sample["pca_2"], gridsize=50, cmap="Greys", alpha=0.2, marginals=True)

plt.title("First Two PCA Components, Colored by PCA-Derived RGB")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

We see three broad clusters in the 2D PCA scatter plot, and the cluster colors align with the RGB visualization above. The purple and green groups appear tighter than the yellow group, but the overall separation is still encouraging. That suggests a simple linear probe on top of the AEF embeddings may be able to distinguish field types or management patterns without manual feature engineering or complex models.
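A "linear probe" here means nothing more than a linear classifier trained directly on the frozen embeddings. A self-contained NumPy sketch on synthetic 64-D clusters, where real labels such as crop type would replace the synthetic ones:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two synthetic "field types" as separated Gaussian clusters in 64-D,
# standing in for mean AEF embeddings with known labels
n = 400
a = rng.normal(loc=0.5, scale=0.3, size=(n, 64))
b = rng.normal(loc=-0.5, scale=0.3, size=(n, 64))
X = np.vstack([a, b])
y = np.array([1] * n + [0] * n)

# Least-squares linear probe: fit weights on a bias-augmented design matrix
Xb = np.hstack([X, np.ones((2 * n, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
preds = (Xb @ w > 0.5).astype(int)
accuracy = (preds == y).mean()
```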

Beyond classification, AEF may also be useful for stratified sampling, supervised training set design, or in situ analysis of field management practices.

Takeaways and Next Steps

  • Aggregated AEF embeddings remain informative at field scale, not just pixel scale.
  • The zonal-statistics workflow scales to millions of fields and hundreds of millions of pixels on a modest Ray cluster.
  • GeoParquet and a single global Zarr mosaic work well together for this style of analysis.
  • Field-level confidence scores provide an additional quality signal alongside the embeddings.
  • The PCA visualization shows clear clustering that may relate to crop type or management practice and is worth testing with labeled data.

If you are interested in this approach, reach out to the Wherobots team at support@wherobots.com.
