A Hands-On Guide for Working with Large-Scale Spatial Data
This post shows you how to run computations on geospatial data with Wherobots, Sedona, and GeoPandas, explains how the engines execute computations differently, and describes how you can use these technologies for an amazing workflow. Wherobots notebooks include the Sedona and GeoPandas engines, so you can easily access either library when running computations.
The post will also show you how to write Sedona code with GeoPandas syntax using the Sedona GeoPandas API.
Wherobots is great for quickly running computations on large datasets in parallel. GeoPandas is familiar to lots of geospatial programmers and is interoperable with a lot of libraries, so both computation engines can be useful for an analysis.
Let’s start with an example analysis that shows how to compute building centroids with either Sedona or GeoPandas.
Here’s how to read a GeoParquet file into a Sedona DataFrame.
path = "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
buildings = sedona.read.format("geoparquet").load(path)
Let’s inspect a few rows of data.
buildings.select("BUILD_ID", "height_val", "geom").limit(5).show()

+--------+----------+--------------------+
|BUILD_ID|height_val|                geom|
+--------+----------+--------------------+
| 2080161|       0.0|MULTIPOLYGON (((-...|
| 2080159|       0.0|MULTIPOLYGON (((-...|
| 2080158|       0.0|MULTIPOLYGON (((-...|
| 7776189|      8.49|MULTIPOLYGON (((-...|
| 7776214|     10.49|MULTIPOLYGON (((-...|
+--------+----------+--------------------+
Let’s filter out all the rows with a zero height_val and compute the centroid for each building.
from pyspark.sql.functions import col
from sedona.sql.st_functions import ST_Centroid

(buildings
    .where(col("height_val") != 0.0)
    .withColumn("centroid", ST_Centroid(col("geom")))
    .select("BUILD_ID", "height_val", "centroid")
    .limit(5)
    .show(truncate=False))

+--------+----------+---------------------------------------------+
|BUILD_ID|height_val|centroid                                     |
+--------+----------+---------------------------------------------+
|7776189 |8.49      |POINT (-73.9244121583531 40.876054867139366) |
|7776214 |10.49     |POINT (-73.92479319267416 40.87608655645236) |
|7740167 |37.44     |POINT (-73.93514476461972 40.81269938665691) |
|7740151 |37.27     |POINT (-73.93622071995519 40.81267611885931) |
|7740289 |36.19     |POINT (-73.93705258498942 40.813024939805196)|
+--------+----------+---------------------------------------------+
You could also convert this buildings dataset to a GeoPandas DataFrame and compute the centroid for each building. Here’s how we will run the computation:
Here’s how to filter the DataFrame and grab a few columns with Sedona:
sedona_df = (
    buildings
    .where(col("height_val") != 0.0)
    .select("BUILD_ID", "height_val", col("geom").alias("geometry"))
)
Now convert the DataFrame to a pandas DataFrame:
pandas_df = sedona_df.toPandas()
Convert the pandas DataFrame to a GeoPandas DataFrame:
import geopandas as gpd

geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geometry")
Compute the centroids with GeoPandas and view the contents of the GeoPandas DataFrame:
geopandas_df["centroid"] = geopandas_df.geometry.centroid
print(geopandas_df[["BUILD_ID", "height_val", "centroid"]].head(5))

   BUILD_ID  height_val                    centroid
0   7776189    8.490000  POINT (-73.92441 40.87605)
1   7776214   10.490000  POINT (-73.92479 40.87609)
2   7740167   37.439999   POINT (-73.93514 40.8127)
3   7740151   37.270000  POINT (-73.93622 40.81268)
4   7740289   36.189999  POINT (-73.93705 40.81302)
Suppose you have an Iceberg table named local.db.icetable and would like to analyze the data with GeoPandas.
GeoPandas does not have an Iceberg reader yet, so let’s read the table with Sedona and then convert it to a GeoPandas DataFrame.
Here’s how to create the Sedona DataFrame:
sedona_df = sedona.table("local.db.icetable")
Now let’s convert the DataFrame to a GeoPandas DataFrame:
import geopandas as gpd

pandas_df = sedona_df.toPandas()
geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geometry")
print(geopandas_df)

  id               geometry
0  a  LINESTRING (1 3, 3 1)
1  d  LINESTRING (2 7, 4 9)
2  e  LINESTRING (6 7, 6 9)
3  f  LINESTRING (8 7, 9 9)
Only relatively small DataFrames should be converted from Sedona to GeoPandas. Let’s take a look at the architectural differences between the engines to understand why.
Sedona executes computations on a distributed cluster, so it can quickly run computations on large datasets.
GeoPandas runs computations on a single node, so it’s not as scalable, but it can be quick for small datasets and is interoperable with libraries in the pandas ecosystem.
GeoPandas computations can be distributed with Sedona GeoPandas or dask-geopandas, and we will describe those technologies in the sections that follow.
Let’s take a look at the differences in more detail.
GeoPandas runs computations on a single computer and processes data using only a single node. This is fine for small datasets, but it cannot scale beyond the resources of a single machine.
Sedona runs on many machines working in unison as a cluster. The cluster has a driver node responsible for orchestration and worker nodes that process the data in parallel.
When a computation runs on a single node, only one machine reads the data and executes the query. When a computation is run on a cluster, all worker nodes read data in parallel and execute the query. As you can imagine, running on a cluster is faster and more scalable for large datasets.
Wherobots is a lazy compute engine, and GeoPandas is an eager compute engine.
Lazy compute engines build up query execution graphs but don’t run the underlying computations until they need to produce a result. Let’s take a look at a line of Sedona code that reads a file:
df = sedona.read.format("geojson").load("overture_buildings.geojson")
This line doesn’t read the underlying file. It waits until the user requests a result from the data before doing the work—it’s lazy. This is good because it can inspect and optimize the queries before running them.
Let’s compare this with GeoPandas code that reads in data to a DataFrame:
overture_buildings = gpd.read_file("overture_buildings.geojson")
This GeoPandas code runs instantly and eagerly loads all the data into an in-memory DataFrame. Eager execution can be beneficial, especially when the data is small and when the DataFrame is repeatedly queried. However, it has drawbacks for larger-than-memory datasets.
GeoPandas eagerly loads data into memory. When the machine does not have enough memory to load all the data, it will error out with an out-of-memory exception. In practice, GeoPandas works well with dataset sizes much smaller than the total memory on the machine.
Wherobots is designed to run on datasets larger than a given machine’s memory. It processes data incrementally rather than loading it into memory all at once, allowing it to handle out-of-memory datasets.
Let’s now dive into how the computations are executed.
Wherobots uses a query optimizer to ensure the code executes efficiently. For example, suppose a query requires three columns of a 10-column dataset stored in GeoParquet. The query optimizer will intelligently select only the needed three columns rather than reading all 10 columns of data.
GeoPandas doesn’t have a query optimizer, so users must manually specify the optimizations they want to be applied. For example, when a GeoPandas user queries a GeoParquet file, they need to specify the column pruning and row-group filtering logic manually.
The query optimizer provides a better developer experience and better performance, too.
Sedona supports Python, SQL, Java, Scala, and R interfaces.
GeoPandas supports only a Python API.
The Wherobots Python syntax differs from the GeoPandas Python syntax: Sedona is syntactically similar to PySpark, while GeoPandas is syntactically similar to pandas.
We’ll get to the Sedona GeoPandas API shortly, but first let’s look at how the two engines fit into the broader library ecosystem.
Sedona and GeoPandas are compatible with different libraries.
For example, suppose you would like to read a large geospatial dataset stored in Apache Iceberg, perform some data munging, and then display the results with matplotlib.
Sedona is compatible with Iceberg, and GeoPandas is interoperable with matplotlib, so here is how you can run this computation:
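A minimal sketch of that workflow, assuming a running Sedona session; the table name and the reduction step (a simple `limit`) are hypothetical stand-ins for real data munging:

```python
import geopandas as gpd
import matplotlib.pyplot as plt


def iceberg_to_plot(sedona, table_name):
    # 1. Read the Iceberg table and reduce it with Sedona (distributed).
    df = sedona.table(table_name).limit(1000)
    # 2. Convert the reduced result to GeoPandas (single node).
    gdf = gpd.GeoDataFrame(df.toPandas(), geometry="geometry")
    # 3. Display the results with matplotlib via GeoPandas' plot() method.
    ax = gdf.plot()
    plt.show()
    return ax
```

Sedona does the heavy lifting on the cluster, and matplotlib only ever sees the small, reduced result.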
Sedona and GeoPandas are interoperable with different libraries, and that’s one of the main reasons why it’s nice to have access to both.
The Sedona GeoPandas API is identical to the GeoPandas API, but executes computations with Sedona in a distributed manner. It’s a great way for developers who are familiar with GeoPandas syntax to write code that’s executed with Sedona.
Here’s an example of how to create a Sedona GeoPandas DataFrame:
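The sketch below is illustrative pseudocode (it requires a Wherobots runtime to execute, and the `sedona.geopandas` module path and `read_parquet` function are assumptions based on the API description above):

```python
# Assumed API: sedona.geopandas mirrors the GeoPandas interface.
import sedona.geopandas as sgpd

path = "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
buildings = sgpd.read_parquet(path)  # a distributed GeoDataFrame

# Same syntax as GeoPandas, but executed with Sedona on the cluster.
centroids = buildings.geometry.centroid
```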
The dask-geopandas library is another way to parallelize GeoPandas computations to leverage all cores of a given machine or distribute computations to a cluster.
Here’s a good summary of dask-geopandas: “Since GeoPandas is an extension to the pandas DataFrame, the same way Dask scales pandas can also be applied to GeoPandas”.
A deep dive into the differences between Sedona and dask-geopandas is beyond the scope of this post, which focuses on Sedona and GeoPandas, but we’re excited to investigate this in a future post!
You can convert a Wherobots DataFrame to a GeoPandas DataFrame, as you can see in this code snippet:
pandas_df = sedona_df.toPandas()
geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geom")
When you convert from Sedona DataFrames to GeoPandas, you load all the data into the driver node and process it on a single machine. Converting from a Sedona DataFrame to a GeoPandas DataFrame is not recommended if the dataset is large.
It’s better to convert from Sedona to GeoPandas once a dataset has been pre-processed and reduced in size. Once the data is small, you can convert it to a GeoPandas DataFrame and enjoy the interoperability offered by the GeoPandas ecosystem.
You can also convert from a GeoPandas DataFrame to a Sedona DataFrame. Here’s an example of how to read a FlatGeoBuf file into a GeoPandas DataFrame and then convert it to a Sedona DataFrame:
geopandas_df = gpd.read_file("overture_buildings.fgb")
overture_buildings_df = sedona.createDataFrame(geopandas_df)
Sedona doesn’t have a FlatGeoBuf reader yet, so this is a good way to read FlatGeoBuf files into Sedona DataFrames!
Let’s illustrate how to leverage Sedona’s computational power to process large Iceberg tables, reduce the data to a manageable size, and then visualize the results with GeoPandas and Folium. This workflow showcases the strength of combining these tools for geospatial data analysis.
First, we utilize Wherobots Sedona to efficiently read and process our extensive Iceberg data:
buildings_joined = sedona.sql("""
WITH dry_state_roi AS (
    SELECT *
    FROM wherobots_open_data.overture_2025_03_19_1.divisions_division_area
    WHERE country = 'US' AND subtype = 'region'
)
SELECT
    COUNT(building.id) AS buildings_count,
    roi.region AS region,
    roi.geometry AS region_geometry
FROM wherobots_open_data.overture_2025_03_19_1.buildings_building building
JOIN dry_state_roi roi
    ON ST_Intersects(building.geometry, roi.geometry)
GROUP BY roi.region, roi.geometry
""")
Here, Sedona’s ability to handle large datasets shines as it quickly aggregates building counts by region. This reduction in data size is crucial, as it prepares the data for the next step with GeoPandas.
Next, we convert the Sedona DataFrame to a GeoPandas DataFrame:
import geopandas as gpd

buildings_pdf = buildings_joined.toPandas()
buildings_gpdf = gpd.GeoDataFrame(buildings_pdf, geometry="region_geometry")
Now, we move to visualization using Folium, a library that integrates seamlessly with GeoPandas:
import folium

m = folium.Map([43, -100], zoom_start=4)
folium.Choropleth(
    geo_data=buildings_gpdf.to_json(),
    data=buildings_gpdf,
    columns=["region", "buildings_count"],
    key_on="feature.properties.region",
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Building Count",
).add_to(m)
m
While Sedona works with other graphing libraries, Folium’s direct compatibility with GeoPandas simplifies the visualization process. This code generates an interactive choropleth map, clearly displaying building density across US regions.
This example illustrates the power of Wherobots Sedona for rapid, large-scale data reduction, followed by GeoPandas and Folium for effective visualization. This combination allows for efficient analysis and presentation of complex geospatial data.
Let’s compare the query performance for a dataset with different scale factors using Sedona Wherobots and GeoPandas.
The GeoPandas query is run on a machine with 12 cores and 96GB of RAM. The Wherobots Sedona query is run on a cluster of machines with a total of 24 cores and 96GB of RAM. Both setups have the same total RAM.
Important note: this comparison is between GeoPandas and Wherobots Sedona. dask-geopandas, which would be a more direct comparison since it is also a parallel compute engine, is not included in this analysis. The purpose of the comparison is to find the performance inflection point, not to suggest that GeoPandas and Wherobots Sedona are equivalent tools.
We tested spatial joins on three different datasets with GeoPandas and Sedona Wherobots.
Let’s start with a simple spatial join query that counts the number of Overture Buildings located within postal code zones, by joining the Overture Buildings dataset with a postal codes dataset. Here is the query:
WITH t AS (
    SELECT *
    FROM overture_buildings buildings
    JOIN postal_codes zones
        ON ST_Intersects(buildings.geometry, zones.geometry)
)
SELECT COUNT(1) FROM t
Let’s perform this query with different dataset sizes:
Here are the benchmarking results for each scale factor:
‘-’ indicates the kernel died with an out-of-memory (OOM) error.
Wherobots is faster for all scale factors. GeoPandas cannot run the query on the large dataset because it’s too much data to fit into memory.
Here is the GeoPandas code to run the query:
import geopandas as gpd

overture_buildings = gpd.read_parquet("path/to/overture_buildings/")
postal_codes = gpd.read_parquet("path/to/postal_codes/")

# sindex.query with an array of geometries returns a (2, n) array of
# matching (input, tree) index pairs, so the match count is shape[1].
t = overture_buildings.sindex.query(postal_codes["geometry"], predicate="intersects")
count = t.shape[1]
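The same count can also be computed with `gpd.sjoin`, which may read more naturally than a raw spatial-index query. A minimal sketch on a tiny synthetic dataset:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two toy "buildings" and one "postal zone" that covers only the first.
buildings = gpd.GeoDataFrame(geometry=[Point(0.5, 0.5), Point(5, 5)])
zones = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1)])

# Inner spatial join keeps each building/zone pair that intersects.
joined = gpd.sjoin(buildings, zones, how="inner", predicate="intersects")
count = len(joined)
print(count)
```

Both approaches use the spatial index under the hood; `sindex.query` avoids materializing the joined rows, which matters at the larger scale factors.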
Here’s how to run the code with Sedona:
df = sedona.read.format("geoparquet").load("s3://wherobots-examples/data/geopandas_blog/1M/overture-buildings/")
df.createOrReplaceTempView("overture_buildings")

df = sedona.read.format("geoparquet").load("s3://wherobots-examples/data/geopandas_blog/postal-code/")
df.createOrReplaceTempView("postal_codes")

sedona.sql("""
WITH t AS (
    SELECT *
    FROM overture_buildings buildings
    JOIN postal_codes zones
        ON ST_Intersects(buildings.geometry, zones.geometry)
)
SELECT COUNT(1) FROM t
""").show()
For those interested in replicating the analyses and experiments discussed in this blog post, the datasets are available in a public Amazon S3 bucket. Here’s how you can access them:
s3://wherobots-examples/data/geopandas_blog/
├── 1M/
│   └── overture-buildings/
├── 10M/
│   └── overture-buildings/
├── 50M/
│   └── overture-buildings/
└── postal-code/
Each xM/ directory (e.g., 1M/, 10M/, 50M/) contains datasets scaled to different sizes, allowing for performance comparisons and scalability testing. The postal-code/ directory contains the postal code dataset used in the spatial join examples.
Sedona and GeoPandas are both great tools for analyzing spatial data.
GeoPandas has many advantages: it is familiar to pandas programmers, it is interoperable with a rich ecosystem of libraries like matplotlib and Folium, and it is fast for datasets that fit comfortably in memory. GeoPandas is an excellent option if you have a small dataset and are familiar with pandas.
Wherobots and Sedona are better for larger datasets. Wherobots offers Python and SQL APIs, so it is also a good fit for users who prefer writing SQL.
But as you’ve seen in this post, Sedona and GeoPandas are better together. The technologies combined give you access to a larger ecosystem of libraries and the ability to run computations on large and small datasets.