A Hands-On Guide for Working with Large-Scale Spatial Data
There is a wealth of geospatial insights available in raster data sets, including satellite and drone imagery, elevation, land cover, and more. Combining these grids of spatial data with traditional analytical data and workflows can be complicated, slow, and expensive. In this blog post we will explore:
This benchmark compared Apache Sedona and Google Earth Engine + BigQuery for large-scale raster-vector spatial joins using real-world datasets (ESA WorldCover and Overture Maps).
Read on for detailed methodology, code examples, and architectural analysis.
Zonal statistics are one kind of result of a spatial join between a raster dataset and a vector dataset. This analysis can return values such as:
sum
mean
min
max
For example, we could use vector boundaries of postal codes to find the average elevation from a raster:
Imagine that you want to take a study area, maybe a specific bounding box or metro area, and understand the dominant land classification in that area. In this case, each grid square in the raster might have a value like “water” or “trees” and you want to find the classification with the maximum count in each of your vector areas.
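To make this concrete, here is a minimal, self-contained sketch of zonal statistics using NumPy. The tiny arrays stand in for real data: `raster` holds pixel values (elevation or land-cover codes) and `zones` holds a zone id per pixel (rasterized postal codes or study areas). A real workflow would read these from files or a catalog; this only illustrates the mean and majority (dominant class) computations described above.

```python
import numpy as np

def zonal_stats(raster: np.ndarray, zones: np.ndarray) -> dict:
    """Compute per-zone mean and majority (dominant) value.

    Illustrative only: both arrays are hypothetical stand-ins
    for real raster and rasterized-vector data.
    """
    stats = {}
    for zone in np.unique(zones):
        values = raster[zones == zone]
        vals, counts = np.unique(values, return_counts=True)
        stats[int(zone)] = {
            "mean": float(values.mean()),
            "majority": int(vals[np.argmax(counts)]),
        }
    return stats

# Two zones: zone 1 is the left column, zone 2 the right column.
raster = np.array([[10, 20],
                   [10, 40]])
zones = np.array([[1, 2],
                  [1, 2]])
result = zonal_stats(raster, zones)
print(result)
# zone 1: mean 10.0, majority 10; zone 2: mean 30.0, majority 20
```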
At a small scale this kind of analysis is relatively easy for one-off work; single-machine tools such as Python raster libraries or PostGIS can handle it.
However, there are plenty of use cases where you do want to do this at large scale:
These data sets have the potential to add massive value to analytical workflows for many industries and sectors. Unfortunately, the Python and PostGIS approaches are difficult to scale effectively. While they work up to a point, they run on a single compute instance and therefore will only scale as far as the computing power and memory of that machine.
So how can you do this at scale and without disrupting your existing architecture? Let’s look at the two cloud options, Apache Sedona and Google BigQuery.
Apache Sedona and Google Earth Engine both provide solutions when you need to process larger amounts of data.
Google Earth Engine, originally created in 2009, stores its raster data in multi-resolution pyramids in a format that is not directly exposed to the user; access goes through a proprietary library. Earth Engine also hosts some vector data, which is distributed in a similar fashion through the same libraries.
In Google Earth Engine you can access planetary-scale data and analyze it for small and large problems without needing to download the data to your computer. In recent years, Google has started to expand Earth Engine's capabilities to integrate with the Google Cloud suite of products, most notably BigQuery.
BigQuery is an online analytical processing (or OLAP) data warehouse that allows you to store massive amounts of long-form data and analyze it efficiently using distributed computing. It uses a proprietary storage format that allows it to quickly distribute workloads across data that’s stored in the Google Cloud architecture.
The downside of this proprietary format is that you must import your data into BigQuery before leveraging the distributed computing resources that make it powerful. BigQuery makes it possible to calculate zonal statistics using Earth Engine data, but unfortunately that geospatial data does not reside in the same proprietary, fast format (this differs from the approach with Wherobots and Apache Sedona outlined later in the post).
Recently, BigQuery added the function ST_REGIONSTATS that allows you to directly integrate BigQuery SQL with datasets from Earth Engine.
All in all, this workflow is great for traditional tabular analytics if you already have or are planning to migrate data into BigQuery. However, if you want to leverage open formats, Apache Sedona and Wherobots allow you to do that at scale with open core tooling.
Apache Sedona is a distributed computing framework that brings spatial functionality into the Apache Spark ecosystem. It allows you to work with large-scale datasets, both vector and raster, and combine them into new datasets using zonal statistical functions. It uses an underlying distributed computing framework that allows you to distribute work in parallel, allowing you to scale up to planetary scale computational workloads.
A key to Sedona’s scalability is that it allows you to leave the data where it sits. Many exabytes of raster data are already in an accessible storage format on the web and Sedona lets you leave that data there without spending the time or money needed to copy it yourself.
This means that there are no proprietary formats, no data ingestion, and no extra steps to actually start working with this data.
Here’s how the two approaches compare across key technical and operational dimensions:
To scale zonal statistics queries even further, Wherobots materializes only the raster pixels it needs. Not only does this save time on the computational workload, it also eliminates a significant amount of moving data back and forth, which adds time and cost in other systems.
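Conceptually, this lazy materialization resembles reading only the raster window that overlaps a geometry's bounding box, instead of pulling the whole image. The sketch below illustrates the idea with NumPy on a small in-memory array; the helper function and the raster here are illustrative assumptions, not Wherobots internals.

```python
import numpy as np

def window_for_bbox(bbox, origin, pixel_size, shape):
    """Convert a geographic bbox to row/col slices of a north-up raster.

    bbox = (minx, miny, maxx, maxy); origin = (x, y) of the raster's
    top-left corner; pixel_size = cell size in the same units.
    Illustrative helper, not a real library API.
    """
    minx, miny, maxx, maxy = bbox
    ox, oy = origin
    col0 = max(int((minx - ox) / pixel_size), 0)
    col1 = min(int(np.ceil((maxx - ox) / pixel_size)), shape[1])
    row0 = max(int((oy - maxy) / pixel_size), 0)
    row1 = min(int(np.ceil((oy - miny) / pixel_size)), shape[0])
    return slice(row0, row1), slice(col0, col1)

# A 100x100 raster covering x in [0, 100) and y in (0, 100], 1-unit pixels.
raster = np.arange(100 * 100).reshape(100, 100)
rows, cols = window_for_bbox((10, 80, 20, 90), origin=(0, 100),
                             pixel_size=1.0, shape=raster.shape)
window = raster[rows, cols]  # only a 10x10 window is materialized
print(window.shape, float(window.mean()))
# (10, 10) 1464.5
```

Only 100 of the 10,000 pixels are touched; at planetary scale, skipping the rest is where the time and cost savings come from.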
To compare the two systems I used a common open dataset to analyze land classification near buildings in Texas. This analysis can help determine whether a building is in a built-up urban environment or in an area with other natural land cover.
I used the WorldCover dataset from the European Space Agency, which contains land cover data across the entire globe. The data is available in AWS S3 storage as an open dataset here, as well as in Google Earth Engine.
For the buildings, I used the Overture Maps dataset, which contains building footprints for the entire globe. Overture data is in both the Wherobots Open Data Catalog and the BigQuery Open Data project.
Initially, I tried to compare the query speed for every building in Texas. Wherobots finished in about 3½ minutes, but BigQuery continued to time out after several attempts.
To scale back my test, I focused on buildings in a small area around Dallas, Texas. Each building was buffered by 30 meters to ensure that we are looking at the area around the buildings, not just the footprint.
To make this query work, I first needed to enable Earth Engine in BigQuery in the Google Cloud console, including:
Below is the code that I used to run this for the Overture Maps data in BigQuery.
WITH texas AS (
  SELECT ST_BUFFER(geometry, 30) AS geometry, id
  FROM `PROJECT.DATASET.overture_texas`
  WHERE ST_INTERSECTS(geometry, ST_GEOGFROMTEXT(
    'POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))'))
)
SELECT
  t.geometry,
  t.id,
  ST_REGIONSTATS(
    t.geometry,
    (SELECT assets.image.href FROM `PROJECT.DATASET.landcover`),
    'Map'
  ).mean AS mean
FROM texas AS t
ORDER BY mean DESC;
You can use the existing Overture Maps data in BigQuery; however, I recommend making a new table with just the data you need for the analysis. Using a WHERE statement to filter the global Overture Maps table down to the small bounding box took 58 minutes and 50 seconds.
You can view the complete guide for doing this in GCP here.
Using Apache Sedona and Wherobots, we can efficiently load the ESA WorldCover data from AWS S3 open data using built-in functions that read a SpatioTemporal Asset Catalog (STAC) endpoint, an API for reading large collections of raster data.
stac_df = sedona.read.format("stac").load(
    "https://services.terrascope.be/stac/collections/urn:eop:VITO:ESA_WorldCover_10m_2021_AWS_V2"
)
stac_df.printSchema()
Alternatively, we can load the data directly from the S3 source, as you can see here. The read is automatically retiled, which sets us up nicely to create a tiled, out-of-database raster DataFrame that we can then persist as an Iceberg table for future use:
esa = sedona.read.format("raster") \
    .option("retile", "true") \
    .load("s3://esa-worldcover/v200/2021/map/*.tif*")
esa.createOrReplaceTempView("esa_outdb_rasters")

YOUR_CATALOG_NAME = 'my_catalog'

# Persist as a raster table in Iceberg
sedona.sql(f"""
CREATE OR REPLACE TABLE wherobots.{YOUR_CATALOG_NAME}.esa_world_cover AS
SELECT * FROM esa_outdb_rasters
""")
# ✅ Out-DB raster table created!
From here I can set up my zonal statistics query using the function RS_ZonalStatsAll, which returns a STRUCT of all the statistics rather than just a single selected one, as BigQuery does. The values included are:
count
median
mode
stddev
variance
zonal_stats_df = sedona.sql(f"""
SELECT
  p.id,
  RS_ZonalStatsAll(r.rast, p.buffer, 1) AS stats
FROM wherobots.{YOUR_CATALOG_NAME}.texas_buildings p
JOIN wherobots.{YOUR_CATALOG_NAME}.esa_world_cover r
  ON RS_Intersects(r.rast, p.buffer)
""")
# ✅ Zonal stats computed!
With this query, you could write the results to an Iceberg table or persist them to files. In this case, I chose to persist them to files for our test.
zonal_stats_df.write \
    .format("geoparquet") \
    .mode("overwrite") \
    .save(user_uri + "/results")
And that’s it. You don’t need to move any data, download anything, or spin up any additional services apart from the notebook to run this process.
The first test that I ran tried to intersect all of the Overture Maps buildings in the state of Texas against this global raster dataset.
With Wherobots, I ran the query seven times, writing the complete dataset to disk each time. The results for the iterations were:
BigQuery was unable to finish the test without timing out.
For this test, I limited the query to a bounding box in the area of Dallas shown below.
POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))
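For reproducibility, the same test polygon can be generated from its bounding box corners. This small helper is my own, not part of either platform; it simply builds the closed-ring WKT that both queries use.

```python
def bbox_to_wkt(minx: float, miny: float, maxx: float, maxy: float) -> str:
    """Build a closed-ring WKT polygon from a lon/lat bounding box."""
    ring = [(minx, maxy), (maxx, maxy), (maxx, miny), (minx, miny), (minx, maxy)]
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"

# The Dallas-area bounding box used in this test.
wkt = bbox_to_wkt(-97.000482, 32.613222, -96.463632, 33.023937)
print(wkt)
# POLYGON((-97.000482 33.023937, -96.463632 33.023937, -96.463632 32.613222, -97.000482 32.613222, -97.000482 33.023937))
```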
For this smaller area, the BigQuery computation ran in 8 minutes and 24 seconds.
Below are the results in Wherobots using the smaller area:
In addition to the processing speed, there are other advantages to using Apache Sedona with Wherobots.
To wrap up, Wherobots and Apache Sedona enable you to use open data on cloud storage, scale your compute as needed, and connect to an existing ecosystem of open data. This provides a scalable, repeatable, and open core solution that lets you leverage the wealth of existing raster data and integrate it into your analytics systems.