
Raster Spatial Joins at Scale: Google Earth Engine and BigQuery vs Apache Sedona and Wherobots


If you are deciding between Google Earth Engine + BigQuery and Apache Sedona + Wherobots for large-scale geospatial analysis, here is the short answer: Wherobots completed a statewide zonal statistics analysis across every building in Texas in 3 minutes 28 seconds. BigQuery timed out on the same query. For a Dallas metro area test using the same datasets, Apache Sedona finished in 37 seconds vs BigQuery’s 8 minutes 24 seconds, making it 13x faster. And before BigQuery could even run the spatial join, filtering the global dataset down to a single bounding box took 58 minutes and 50 seconds.

This post breaks down why that gap exists, what it means architecturally, and when each platform actually makes sense for your workload.

Key Takeaways

Here is what we found:

  • Apache Sedona via Wherobots is 13x faster than BigQuery for large-scale zonal statistics on a metro-area dataset
  • BigQuery with Google Earth Engine could not complete a state-scale spatial join without timing out; Wherobots finished the same job in under 4 minutes
  • Apache Sedona reads directly from cloud storage (S3, STAC) with no data migration required; BigQuery requires importing data into Google Cloud first
  • Apache Sedona supports 300+ spatial functions; BigQuery offers 71, with ST_REGIONSTATS as the only raster-vector function
  • For production pipelines at scale, Wherobots wins on speed, flexibility, and cost of data movement. Google Earth Engine remains useful for exploratory, one-off analysis within its own dataset catalog

What is Zonal Statistics Analysis and Why Does Scale Matter?

Zonal statistics analysis is one kind of result of a spatial join between a raster dataset and a vector dataset: for each vector zone, it aggregates the raster pixel values that fall inside it. This analysis can return values such as:

  • sum: Sum of the pixel values in a zone
  • mean: Arithmetic mean of those values
  • min or max: Smallest or largest pixel value in a zone
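To make these aggregates concrete, here is a minimal pure-Python sketch. The 3x3 grids below are made up for illustration; it simply groups pixel values by zone label and aggregates them:

```python
from collections import defaultdict

# Hypothetical 3x3 elevation "raster" and a matching grid of zone labels.
elevation = [
    [1100, 1150, 1200],
    [1180, 1220, 1300],
    [ 900,  950, 1000],
]
zones = [
    ["27605", "27605", "27606"],
    ["27605", "27606", "27606"],
    ["27607", "27607", "27607"],
]

# Group pixel values by the zone they fall in.
pixels = defaultdict(list)
for elev_row, zone_row in zip(elevation, zones):
    for value, zone in zip(elev_row, zone_row):
        pixels[zone].append(value)

# Aggregate each zone's pixels: sum, mean, min, max.
stats = {
    zone: {"sum": sum(v), "mean": sum(v) / len(v), "min": min(v), "max": max(v)}
    for zone, v in pixels.items()
}
print(round(stats["27605"]["mean"], 1))  # → 1143.3
```

Distributed engines do the same computation, but partition the grids and zones across many machines.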

For example, we could use vector boundaries of postal codes to find the average elevation from a raster:

*Figure: three 5×5 grids arranged like a formula — "zip code vectors" (three contiguous regions labeled 27605, 27606, and 27607) plus "elevation raster" (per-square values from 826 to 1,396) yields "mean elevation per zip code," where every square within a zone shows that zone's mean (e.g., 1,165 for all of 27605).*

Imagine that you want to take a study area, maybe a specific bounding box or metro area, and understand the dominant land classification in that area. In this case, each grid square in the raster might have a value like "water" or "trees", and you want to find the classification with the highest pixel count in each of your vector areas.
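The majority-count idea can be sketched the same way. The grids and class labels below are made up for illustration; per zone, we count pixel classes and keep the most common one:

```python
from collections import Counter, defaultdict

# Hypothetical 4x4 land-cover "raster" and a parallel grid of zone labels.
raster = [
    ["water", "water", "trees", "trees"],
    ["water", "built", "trees", "trees"],
    ["built", "built", "built", "grass"],
    ["built", "built", "grass", "grass"],
]
zones = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
    ["C", "C", "C", "B"],
    ["C", "C", "C", "C"],
]

# Count class occurrences per zone.
counts = defaultdict(Counter)
for raster_row, zone_row in zip(raster, zones):
    for value, zone in zip(raster_row, zone_row):
        counts[zone][value] += 1

# Dominant class = most common pixel value in each zone.
dominant = {zone: c.most_common(1)[0][0] for zone, c in counts.items()}
print(dominant)  # → {'A': 'water', 'B': 'trees', 'C': 'built'}
```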

At a small scale this kind of analysis is relatively easy for one-off work. There are a few engines that allow you to do this.

  • Shapely, GeoPandas, and Rasterio, three Python libraries commonly used together to do this.
  • PostGIS, which includes spatial SQL functions to work with raster and vector data, extending PostgreSQL with spatial functionality.
  • Google Earth Engine, which includes the ability to perform zonal statistics and now adds support for performing calculations with vector data inside BigQuery.
  • Apache Sedona, which provides a functional layer to perform distributed spatial queries using Apache Spark and other engines.

However, there are plenty of use cases where you do want to do this at large scale:

  • Understand the land classification across thousands, hundreds of thousands, or even millions of individual buildings or properties.
  • Analyze climate or weather data in small or large scale areas over time.
  • Get up-to-date fire risk using global or country-specific data.
  • Calculate and update flood risk across thousands of properties by analyzing the height above nearest drainage points.

These data sets have the potential to add massive value to analytical workflows for many industries and sectors. Unfortunately, the Python and PostGIS approaches are difficult to scale effectively. While they work up to a point, they run on a single compute instance and therefore will only scale as far as the computing power and memory of that machine.

So how can you do this at scale and without disrupting your existing architecture? Let’s look at the two cloud options, Apache Sedona and Google BigQuery.

How Google Earth Engine + BigQuery Handles Large-Scale Spatial Joins (And Where It Struggles)

Apache Sedona and Google Earth Engine both provide solutions when you need to process larger amounts of data.

Google Earth Engine, originally created in 2009, stores its raster data in multi-resolution pyramids in an internal format that is not directly exposed to the user; access is mediated by a proprietary library. Google Earth Engine also stores some vector data, which is distributed in a similar way via the same libraries.

In Google Earth Engine you can access planetary scale data and analyze it for small and large scale problems without the need to download the data on your computer. In recent years, Google has started to expand the capabilities for Earth Engine to integrate with the Google Cloud Suite of products, most notably BigQuery.

BigQuery is an online analytical processing (or OLAP) data warehouse that allows you to store massive amounts of long-form data and analyze it efficiently using distributed computing. It uses a proprietary storage format that allows it to quickly distribute workloads across data that’s stored in the Google Cloud architecture.

The downside of this proprietary format is that you must import your data into BigQuery before you can leverage the distributed computing resources that make it powerful. BigQuery makes it possible to calculate zonal statistics using Earth Engine data, but that geospatial data does not reside in the same fast, proprietary format. (This differs from the approach with Wherobots and Apache Sedona outlined later in this post.)

Recently, BigQuery added the function ST_REGIONSTATS that allows you to directly integrate BigQuery SQL with datasets from Earth Engine.

All in all, this workflow is great for traditional tabular analytics if you already have data in BigQuery or are planning to migrate it there. However, if you want to leverage open formats, Apache Sedona and Wherobots allow you to do that at scale with open core tooling.

How Apache Sedona + Wherobots Handles the Same Problem

Apache Sedona is a distributed computing framework that brings spatial functionality into the Apache Spark ecosystem. It allows you to work with large-scale datasets, both vector and raster, and combine them into new datasets using zonal statistical functions. It uses an underlying distributed computing framework that allows you to distribute work in parallel, allowing you to scale up to planetary scale computational workloads.

A key to Sedona’s scalability is that it allows you to leave the data where it sits. Many exabytes of raster data are already in an accessible storage format on the web and Sedona lets you leave that data there without spending the time or money needed to copy it yourself.

This means that there are no proprietary formats, no data ingestion, and no extra steps to actually start working with this data.

Platform Comparison: Apache Sedona + Wherobots vs Google Earth Engine + BigQuery

Here’s how the two approaches compare across key technical and operational dimensions:

| | Apache Sedona & Wherobots | Google Earth Engine & BigQuery |
|---|---|---|
| Open formats | ✅ Works with standard formats like Cloud-Optimized GeoTIFF; no vendor lock-in | ⚠️ Data needs to be ingested into BigQuery; internal proprietary storage for vector and raster data |
| No ETL needed | ✅ Reads directly from remote raster sources like S3 or STAC endpoints without importing | ❌ Only works with supported Google Earth Engine datasets and data stored in Google Cloud Storage |
| Scalable | ✅ Distributed compute + open-source stack enables scaling from local to cloud | ✅ Good for one-off, small to mid-scale analyses |
| Spatial function support | ✅ Over 300 raster and vector functions supported in Apache Sedona | ⚠️ ST_REGIONSTATS is the only raster/vector function; 71 total spatial functions in BigQuery |
| Cloud-native | ✅ Fully supported cloud environment in AWS (GCP coming soon) | ✅ Integrated with Google Cloud; easy to use if your stack is already in GCP |

To scale zonal statistics queries even more, Wherobots only materializes the values or pixels of raster data it needs. Not only does this save time on the computational workload, but it also eliminates a significant amount of moving data back and forth, which adds time and cost in other systems.
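The pruning idea can be illustrated with a small sketch (this is not Wherobots' internal implementation, and the tile names and extents are hypothetical): tiles whose bounding boxes miss the query area are never read at all.

```python
def intersects(a, b):
    """Axis-aligned bounding-box test; boxes are (xmin, ymin, xmax, ymax)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Hypothetical tile index: tile name -> geographic extent.
tiles = {
    "tile_0": (-98.0, 32.0, -97.0, 33.0),
    "tile_1": (-97.0, 32.0, -96.0, 33.0),
    "tile_2": (-95.0, 30.0, -94.0, 31.0),
}

# Dallas-area bounding box used later in this post (rounded).
query_bbox = (-97.0005, 32.6132, -96.4636, 33.0239)

# Only materialize tiles that can possibly contribute pixels.
to_read = [name for name, bbox in tiles.items() if intersects(bbox, query_bbox)]
print(to_read)  # → ['tile_0', 'tile_1']; tile_2's pixels are never read
```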

Benchmark Results: Wherobots vs Google Earth Engine + BigQuery

Using ST_RegionStats in BigQuery via Google Earth Engine

To compare the two systems I used a common open dataset to analyze land classification near buildings in Texas. This analysis can help determine if a building is in a built-up urban environment or if it is an area with different natural resources.

I used the WorldCover dataset from the European Space Agency, which contains land cover data across the entire globe. The data is in AWS S3 storage as an open dataset here, as well as in Google Earth Engine.

For the buildings, I used the Overture Maps dataset, which contains building footprints for the entire globe. Overture data is in both the Wherobots Open Data Catalog and the BigQuery Open Data project.

Initially, I tried to compare the query speed for every building in Texas. Wherobots finished in about 3½ minutes, but BigQuery continued to time out after several attempts.

To scale back my test, I focused on buildings in a small area around Dallas, Texas. Each building was buffered by 30 meters to ensure that we are looking at the area around the buildings, not just the footprint.

To make this query work, I first needed to enable Earth Engine in BigQuery in the Google Cloud console, including:

  1. Activating the Earth Engine API
  2. Adding in required permissions for my project
  3. Locating and subscribing to the required Raster dataset via the Analytics Hub

Below is the code that I used to run this for the Overture Maps data in BigQuery.

WITH texas AS (
  SELECT
    st_buffer(geometry, 30) AS geometry,
    id
  FROM
    `PROJECT.DATASET.overture_texas`
  WHERE
    st_intersects(geometry, st_geogfromtext('POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))'))
)
SELECT
  t.geometry,
  t.id,
  ST_REGIONSTATS(
    t.geometry,
    (SELECT assets.image.href
     FROM `PROJECT.DATASET.landcover`),
    'Map'
  ).mean AS mean
FROM
  texas AS t
ORDER BY
  mean DESC;

You can use the existing Overture Maps data in BigQuery; however, I recommend making a new table with just the data you need for the analysis. Using a WHERE statement to filter the global Overture Maps table down to the small bounding box took 58 minutes and 50 seconds.
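One way to avoid paying that filter cost on every run is to materialize the filtered subset once and reuse it. A minimal sketch, using the same placeholder `PROJECT.DATASET` naming as the query above (the `overture_global` source table name here is hypothetical):

```sql
-- Materialize the Dallas-area subset once, then join later queries
-- against this smaller table instead of the global Overture table.
CREATE OR REPLACE TABLE `PROJECT.DATASET.overture_texas` AS
SELECT id, geometry
FROM `PROJECT.DATASET.overture_global`
WHERE st_intersects(
  geometry,
  st_geogfromtext('POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))')
);
```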

You can view the complete guide for doing this in GCP here.

Using Apache Sedona and Wherobots

With Apache Sedona and Wherobots, we can efficiently load the ESA WorldCover data from AWS S3 open data using built-in functions that read from a Spatio-Temporal Asset Catalog (or STAC) endpoint, an API for browsing large collections of raster data.

Click here to launch this interactive notebook
stac_df = sedona.read.format("stac").load(
    "https://services.terrascope.be/stac/collections/urn:eop:VITO:ESA_WorldCover_10m_2021_AWS_V2"
)

stac_df.printSchema()

Alternatively, we can load the rasters directly from the S3 data source. With the `retile` option enabled, the data is automatically retiled on read, which sets up a tiled, out-of-database raster DataFrame that we can then persist as an Iceberg table for future use:

esa = sedona.read.format("raster")\
    .option("retile", "true")\
    .load("s3://esa-worldcover/v200/2021/map/*.tif*")
    
esa.createOrReplaceTempView("esa_outdb_rasters")

YOUR_CATALOG_NAME = 'my_catalog'

# Persist as a raster table in Iceberg
sedona.sql(f"""
CREATE OR REPLACE TABLE wherobots.{YOUR_CATALOG_NAME}.esa_world_cover
AS
SELECT * FROM esa_outdb_rasters
""")

# ✅ Out‑DB raster table created!

From here I can set up my zonal stats query using the function RS_ZonalStatsAll, which returns a STRUCT of all the statistics rather than just a single selected one, as BigQuery's ST_REGIONSTATS does. The values included are:

  • count: Count of the pixels.
  • sum: Sum of the pixel values.
  • mean: Arithmetic mean.
  • median: Median.
  • mode: Mode.
  • stddev: Standard deviation.
  • variance: Variance.
  • min: Minimum value of the zone.
  • max: Maximum value of the zone.
zonal_stats_df = sedona.sql(f"""
SELECT
  p.id,
  RS_ZonalStatsAll(r.rast, p.buffer, 1) AS stats
FROM
  wherobots.{YOUR_CATALOG_NAME}.texas_buildings p
JOIN
  wherobots.{YOUR_CATALOG_NAME}.esa_world_cover r
  ON RS_Intersects(r.rast, p.buffer)
""")

# ✅ Zonal stats computed!

With this query, you could choose to write the results to an Iceberg table, or you could choose to persist them to files. In this case, I chose to persist these to files for our test.

zonal_stats_df.write \
    .format("geoparquet") \
    .mode("overwrite") \
    .save(user_uri + "/results")

And that’s it. You don’t need to move any data, download anything, or spin up any additional services apart from the notebook to run this process.

Results

The performance gap between the two platforms was significant at both scales tested.

Test 1: All of Texas

The first test that I ran tried to intersect all of the Overture Maps buildings in the state of Texas against this global raster dataset.

With Wherobots, I ran the query seven times, writing the complete dataset to disk on each run. The results were:

  • 3 minutes 28 seconds (average per run)
  • ± 25.9 s (mean ± std. dev. of 7 runs, 1 loop each)

BigQuery was unable to finish the test without timing out.

Test 2: One city

For this test, I limited the query to a bounding box in the area of Dallas shown below.

POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))
*Bounding box around the Dallas, Texas metro area used in the spatial join comparison.*

For this smaller area, the BigQuery computation ran in 8 minutes and 24 seconds.

*Screenshot: BigQuery query summary showing a "Duration" of 8 min 24 sec.*

Below are the results in Wherobots using the smaller area:

  • 37.4 seconds (average per run)
  • ± 1.78 s (mean ± std. dev. of 7 runs, 1 loop each)

Performance Summary

| Area | BigQuery | Wherobots |
|---|---|---|
| State of Texas | N/A (timed out) | 3m 28s |
| Dallas, Texas metro area | 8m 24s | 37s |

When to Use Apache Sedona + Wherobots vs Google Earth Engine

The benchmark results make the performance case clear. But there are architectural reasons to consider Sedona and Wherobots beyond raw speed.

Performance and scale

  • Sedona scales horizontally using distributed compute across Apache Spark, with strategic partitioning and adjustable runtime size to handle metro, national, or global workloads.
  • Apache Sedona supports 300+ raster and vector functions, compared to BigQuery’s single raster-vector function, ST_REGIONSTATS. That function coverage matters when building full analytical pipelines, not just one-off queries.
  • Wherobots includes job management for scheduled, repeatable runs, which matters when your raster datasets change over time, such as fire risk, flood risk, or land cover updates.

Architecture and flexibility

  • Sedona reads directly from remote cloud storage like AWS S3 or STAC endpoints without copying data into a proprietary system. BigQuery requires data to live inside Google Cloud before its distributed compute kicks in.
  • You are not limited to Earth Engine’s dataset catalog. Any open dataset accessible via cloud storage works with Sedona without ingestion or reformatting.
  • The stack is open core. You can run Apache Sedona locally to prototype, then move to Wherobots for production without rewriting your analysis.

For teams evaluating Google Earth Engine + BigQuery against Apache Sedona + Wherobots for production geospatial workloads, the benchmark is clear. At metro scale, Sedona is 13x faster. At state scale, BigQuery could not finish the job. If your data lives outside Google Cloud, or you need more than one raster-vector function, or you need repeatable scheduled pipelines, Wherobots is the stronger architectural choice. Google Earth Engine remains a capable tool for exploratory analysis within its own catalog.

Try the interactive notebook