
Raster Spatial Joins at Scale: Google Earth Engine and BigQuery vs Apache Sedona and Wherobots


There is a wealth of geospatial insights available in raster data sets, including satellite and drone imagery, elevation, land cover, and more. Combining these grids of spatial data with traditional analytical data and workflows can be complicated, slow, and expensive. In this blog post we will explore:

  • Common approaches to this problem
  • Making spatial joins with raster and vector data at scale
  • Comparing a raster/vector spatial join in BigQuery and Google Earth Engine and in Wherobots and Apache Sedona
  • Architectural advantages for doing this with Sedona and Wherobots

Key Takeaways

This benchmark compared Apache Sedona and Google Earth Engine + BigQuery for large-scale raster-vector spatial joins using real-world datasets (ESA WorldCover and Overture Maps):

  • Wherobots processed all Texas buildings in 3m 28s
  • BigQuery timed out on the same Texas-wide analysis
  • For Dallas metro area: Sedona completed in 37s vs BigQuery’s 8m 24s (13x faster)
  • No data migration required: Sedona reads directly from remote cloud storage without copying data
  • Open standards: Works with Cloud-Optimized GeoTIFF and STAC endpoints

Read on for detailed methodology, code examples, and architectural analysis.

What is Zonal Statistics Analysis?

A zonal statistics analysis is one kind of result of a spatial join between a raster dataset and a vector dataset. This analysis can return aggregates such as:

  • sum: Sum of the pixel values in a zone
  • mean: Arithmetic mean of those values
  • min or max: Limit values of the pixels in the zones

For example, we could use vector boundaries of postal codes to find the average elevation from a raster:

[Figure: three 5x5 grids arranged like a formula, “zip code vectors” + “elevation raster” = “mean elevation per zip code.” The first grid is divided into three contiguous regions labeled 27605, 27606, and 27607; the second holds elevation values ranging from 826 to 1,396; in the third, every square within a zip code region shows that region’s mean elevation (for example, all squares in 27605 read 1,165).]

Imagine that you want to take a study area, maybe a specific bounding box or metro area, and understand the dominant land classification in that area. In this case, each grid square in the raster might have a value like “water” or “trees”, and you want to find the classification with the maximum count (the majority class) in each of your vector areas.
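The majority-class idea can be sketched in plain Python. This is a toy illustration, not how any of the engines below implement it: the two in-memory grids stand in for a real land-cover raster and a rasterized set of vector zones.

```python
from collections import Counter

# Toy 3x3 raster of land-cover classes and a matching grid of zone IDs.
# In a real workflow the values would come from a GeoTIFF and the zones
# from rasterized vector polygons; here both are hypothetical literals.
landcover = [
    ["water", "water", "trees"],
    ["water", "trees", "trees"],
    ["urban", "urban", "trees"],
]
zones = [
    ["A", "A", "B"],
    ["A", "B", "B"],
    ["A", "B", "B"],
]

def majority_class_per_zone(values, zone_ids):
    """Group pixel values by zone, then return each zone's most common value."""
    counts = {}
    for value_row, zone_row in zip(values, zone_ids):
        for value, zone in zip(value_row, zone_row):
            counts.setdefault(zone, Counter())[value] += 1
    return {zone: counter.most_common(1)[0][0] for zone, counter in counts.items()}

print(majority_class_per_zone(landcover, zones))
# → {'A': 'water', 'B': 'trees'}
```

Zone A covers three "water" pixels and one "urban" pixel, so its majority class is "water"; zone B is dominated by "trees".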

At a small scale this kind of analysis is relatively easy for one-off work. There are a few engines that allow you to do this.

  • Shapely, GeoPandas, and Rasterio, three Python libraries commonly used together for this task.
  • PostGIS, which extends PostgreSQL with spatial SQL functions for working with raster and vector data.
  • Google Earth Engine, which can perform zonal statistics and now supports calculations with vector data inside BigQuery.
  • Apache Sedona, which provides a functional layer for distributed spatial queries on Apache Spark and other engines.

However, there are plenty of use cases where you do want to do this at large scale:

  • Understand the land classification across thousands, hundreds of thousands, or even millions of individual buildings or properties.
  • Analyze climate or weather data in small or large areas over time.
  • Get up-to-date fire risk using global or country-specific data.
  • Calculate and update flood risk across thousands of properties by analyzing the height above nearest drainage.

These data sets have the potential to add massive value to analytical workflows for many industries and sectors. Unfortunately, the Python and PostGIS approaches are difficult to scale effectively. While they work up to a point, they run on a single compute instance and therefore will only scale as far as the computing power and memory of that machine.

So how can you do this at scale without disrupting your existing architecture? Let’s look at two cloud options: Google Earth Engine with BigQuery, and Apache Sedona with Wherobots.

Google Earth Engine + BigQuery Architecture

Apache Sedona and Google Earth Engine both provide solutions when you need to process larger amounts of data.

Google Earth Engine, originally created in 2009, generally stores its raster data in multi-resolution pyramids in a proprietary format that is not directly exposed to the user; access goes through Earth Engine’s client libraries. Google Earth Engine also stores some vector data, and it is distributed in a similar way via those same libraries.

In Google Earth Engine you can access planetary scale data and analyze it for small and large scale problems without the need to download the data on your computer. In recent years, Google has started to expand the capabilities for Earth Engine to integrate with the Google Cloud Suite of products, most notably BigQuery.

BigQuery is an online analytical processing (or OLAP) data warehouse that allows you to store massive amounts of long-form data and analyze it efficiently using distributed computing. It uses a proprietary storage format that allows it to quickly distribute workloads across data that’s stored in the Google Cloud architecture.

The downside of this proprietary format is that you must import your data into BigQuery before leveraging the distributed computing resources that make it powerful. BigQuery makes it possible to calculate zonal statistics using Earth Engine data, but unfortunately that geospatial data does not reside in the same fast, proprietary format (this differs from the approach with Wherobots and Apache Sedona outlined later in the post).

Recently, BigQuery added the ST_REGIONSTATS function, which lets you integrate BigQuery SQL directly with datasets from Earth Engine.

All in all, this workflow is great for traditional tabular analytics if you already have or are planning to migrate data into BigQuery. However, if you want to leverage open formats, Apache Sedona and Wherobots allow you to do that at scale with open core tooling.

Apache Sedona + Wherobots

Apache Sedona is a distributed computing framework that brings spatial functionality into the Apache Spark ecosystem. It allows you to work with large-scale datasets, both vector and raster, and combine them into new datasets using zonal statistical functions. It uses an underlying distributed computing framework that allows you to distribute work in parallel, allowing you to scale up to planetary scale computational workloads.

A key to Sedona’s scalability is that it allows you to leave the data where it sits. Many exabytes of raster data are already in an accessible storage format on the web and Sedona lets you leave that data there without spending the time or money needed to copy it yourself.

This means that there are no proprietary formats, no data ingestion, and no extra steps to actually start working with this data.

Platform Comparison: Apache Sedona + Wherobots vs Google Earth Engine + BigQuery

Here’s how the two approaches compare across key technical and operational dimensions:

|  | Apache Sedona & Wherobots | Google Earth Engine & BigQuery |
|---|---|---|
| Open formats | ✅ Works with standard formats like Cloud-Optimized GeoTIFF; no vendor lock-in | ⚠️ Data needs to be ingested into BigQuery; internal proprietary storage for vector and raster data |
| No ETL needed | ✅ Reads directly from remote raster sources like S3 or STAC endpoints without importing | ❌ Only works with supported Google Earth Engine datasets and data stored in Google Cloud Storage |
| Scalable | ✅ Distributed compute + open-source stack enables scaling from local to cloud | ✅ Good for one-off, small to mid-scale analyses |
| Spatial function support | ✅ Over 300 raster and vector functions supported in Apache Sedona | ⚠️ ST_REGIONSTATS is the only raster/vector function; 71 total spatial functions in BigQuery |
| Cloud-native | ✅ Fully supported cloud environment in AWS (GCP coming soon) | ✅ Integrated with Google Cloud; easy to use if your stack is already in GCP |

To scale zonal statistics queries even more, Wherobots only materializes the values or pixels of raster data it needs. Not only does this save time on the computational workload, but it also eliminates a significant amount of moving data back and forth, which adds time and cost in other systems.
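The pruning idea, reading only the raster tiles a query actually touches, can be illustrated with a toy sketch. The file names and tile bounds below are hypothetical, and this is a simplification of what a distributed engine does:

```python
# Toy illustration of tile pruning: only tiles whose bounding box
# intersects the query area are ever read. Bounds are hypothetical
# (minx, miny, maxx, maxy) tuples in degrees.
tiles = {
    "tile_a.tif": (-98.0, 32.0, -97.0, 33.0),
    "tile_b.tif": (-97.0, 32.0, -96.0, 33.0),
    "tile_c.tif": (-96.0, 32.0, -95.0, 33.0),
}
query_bbox = (-97.0005, 32.6132, -96.4636, 33.0239)  # Dallas-area box used later in the post

def bbox_intersects(a, b):
    """Axis-aligned bounding-box intersection test."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Only the first two tiles overlap the query area, so the third is skipped.
to_read = [name for name, bounds in tiles.items() if bbox_intersects(bounds, query_bbox)]
print(to_read)
# → ['tile_a.tif', 'tile_b.tif']
```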

Comparing Wherobots and Google Earth Engine + BigQuery

Using ST_RegionStats in BigQuery via Google Earth Engine

To compare the two systems I used a common open dataset to analyze land classification near buildings in Texas. This analysis can help determine if a building is in a built-up urban environment or if it is an area with different natural resources.

I used the WorldCover dataset from the European Space Agency, which contains land cover data across the entire globe. The data is available as an open dataset in AWS S3 storage, as well as in Google Earth Engine.

For the buildings, I used the Overture Maps dataset, which contains building footprints for the entire globe. Overture data is in both the Wherobots Open Data Catalog and the BigQuery Open Data project.

Initially, I tried to compare the query speed for every building in Texas. Wherobots finished in about 3½ minutes, but BigQuery continued to time out after several attempts.

To scale back my test, I focused on buildings in a small area around Dallas, Texas. Each building was buffered by 30 meters to ensure that we are looking at the area around the buildings, not just the footprint.

To make this query work, I first needed to enable Earth Engine in BigQuery in the Google Cloud console, including:

  1. Activating the Earth Engine API
  2. Adding in required permissions for my project
  3. Locating and subscribing to the required Raster dataset via the Analytics Hub

Below is the code that I used to run this for the Overture Maps data in BigQuery.

WITH texas AS (
  SELECT
    ST_BUFFER(geometry, 30) AS geometry, id
  FROM
    `PROJECT.DATASET.overture_texas`
  WHERE
    ST_INTERSECTS(geometry, ST_GEOGFROMTEXT('POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))'))
)
SELECT
  t.geometry,
  t.id,
  ST_REGIONSTATS(
    t.geometry,
    (SELECT assets.image.href
     FROM `PROJECT.DATASET.landcover`
    ),
    'Map'
  ).mean AS mean
FROM
  texas AS t
ORDER BY
  mean DESC;

You can use the existing Overture Maps data in BigQuery, however I recommend making a new table with just the data you need for the analysis. Using a WHERE statement to filter the global Overture Maps table to the small bounding box took 58 minutes and 50 seconds.

You can view the complete guide for doing this in GCP here.

Using Apache Sedona and Wherobots

Using Apache Sedona and Wherobots, we can efficiently load the ESA WorldCover data from AWS S3 open data using built-in functions that read a SpatioTemporal Asset Catalog (STAC) endpoint, an API for reading large collections of raster data.

stac_df = sedona.read.format("stac").load(
    "https://services.terrascope.be/stac/collections/urn:eop:VITO:ESA_WorldCover_10m_2021_AWS_V2"
)

stac_df.printSchema()

Alternatively, we can load this directly from the S3 data source. The retile option automatically retiles the input, which sets us up nicely to create a tiled, out-of-database raster DataFrame that we can then persist as an Iceberg table for future use:

esa = sedona.read.format("raster")\
    .option("retile", "true")\
    .load("s3://esa-worldcover/v200/2021/map/*.tif*")
    
esa.createOrReplaceTempView("esa_outdb_rasters")

YOUR_CATALOG_NAME = 'my_catalog'

# Persist as a raster table in Iceberg
sedona.sql(f"""
CREATE OR REPLACE TABLE wherobots.{YOUR_CATALOG_NAME}.esa_world_cover
AS
SELECT * FROM esa_outdb_rasters
""")

# ✅ Out‑DB raster table created!

From here I can set up my zonal stats query using the RS_ZonalStatsAll function, which returns a STRUCT of all the stats instead of just a single selected one, as BigQuery does. The values included are:

  • count: Count of the pixels.
  • sum: Sum of the pixel values.
  • mean: Arithmetic mean.
  • median: Median.
  • mode: Mode.
  • stddev: Standard deviation.
  • variance: Variance.
  • min: Minimum value of the zone.
  • max: Maximum value of the zone.

zonal_stats_df = sedona.sql(f"""
SELECT
  p.id,
  RS_ZonalStatsAll(r.rast, p.buffer, 1) AS stats
FROM
  wherobots.{YOUR_CATALOG_NAME}.texas_buildings p
JOIN
  wherobots.{YOUR_CATALOG_NAME}.esa_world_cover r
  ON RS_Intersects(r.rast, p.buffer)
""")

# ✅ Zonal stats computed!

With this query, you could choose to write the results to an Iceberg table, or you could choose to persist them to files. In this case, I chose to persist these to files for our test.

zonal_stats_df.write \
    .format("geoparquet") \
    .mode("overwrite") \
    .save(user_uri + "/results")

And that’s it. You don’t need to move any data, download anything, or spin up any additional services apart from the notebook to run this process.

Results

Test 1: All of Texas

The first test that I ran tried to intersect all of the Overture Maps buildings in the state of Texas against this global raster dataset.

With Wherobots, I ran the iteration seven times, writing the complete dataset to disk each time. The results for the iterations were:

  • 3 minutes 28 seconds (average for each iteration)
  • ± 25.9 s per iteration (mean ± std. dev. of 7 runs, 1 loop each)

BigQuery was unable to finish the test without timing out.

Test 2: One city

For this test, I limited the query to a bounding box in the area of Dallas shown below.

POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))
Bounding box around the Dallas Texas metro area used in the spatial join comparison.

For this smaller area, the BigQuery computation ran in 8 minutes and 24 seconds.

Screenshot of a Google BigQuery query running time summary. A value labeled "Duration" reads "8 min 24 sec."

Below are the results in Wherobots using the smaller area:

  • 37.4 seconds (average over 7 iterations)
  • ± 1.78 s per iteration (mean ± std. dev. of 7 runs, 1 loop each)

Performance Summary

| Area | BigQuery | Wherobots |
|---|---|---|
| State of Texas | N/A (timed out) | 3m 28s |
| Dallas, Texas metro area | 8m 24s | 37s |
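The roughly 13x speedup quoted in the key takeaways follows directly from the Dallas-area timings:

```python
# Dallas metro timings from the benchmark above.
bigquery_seconds = 8 * 60 + 24   # 8 min 24 s
wherobots_seconds = 37.4         # mean of 7 runs

speedup = bigquery_seconds / wherobots_seconds
print(f"{speedup:.1f}x faster")
# → 13.5x faster
```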

Why Choose Apache Sedona + Wherobots: Beyond Performance

In addition to the processing speed, there are other advantages to using Apache Sedona with Wherobots.

  • First is the ability to scale. You can scale up by using strategic partitioning, as well as increasing the runtime size for larger areas, even global ones.
  • You are using open core solutions, meaning that you can start this process with Apache Sedona locally and migrate it as needed.
  • You are not constrained to moving data into Google Cloud. You can use datasets from any cloud storage service, like AWS S3, and never have to move them into your compute instance.
  • You are not limited to the datasets that exist within Google Earth Engine. Although it is a comprehensive catalog, you can use any open dataset you want.
  • Overall, this code environment will also be a bit easier for those coming from a Python background, although you can also leverage spatial SQL if you choose.
  • Many other raster and vector functions are available in Apache Sedona that allow you to build and scale full pipelines.
  • Finally, you can leverage Wherobots for job management to run this on a repeated basis if you are using a dataset with temporal changes.

To wrap up, Wherobots and Apache Sedona enable you to use open data on cloud storage, scale your compute as needed, and connect to an existing ecosystem of open data. This provides a scalable, repeatable, open core solution that lets you leverage the wealth of available raster data and integrate it into your analytics systems.

Try the interactive notebook