
Wherobots, Sedona, and GeoPandas are better together


This post shows you how to run computations on geospatial data with Wherobots, Sedona, and GeoPandas, explains how the engines execute computations differently, and describes how you can use these technologies for an amazing workflow.  Wherobots notebooks include the Sedona and GeoPandas engines, so you can easily access either library when running computations.

The post will also show you how to write Sedona code with GeoPandas syntax using the Sedona GeoPandas API.

Wherobots is great for quickly running computations on large datasets in parallel.  GeoPandas is familiar to lots of geospatial programmers and is interoperable with a lot of libraries, so both computation engines can be useful for an analysis.

Let’s start with an example analysis that shows how to compute building centroids with either Sedona or GeoPandas.

Wherobots and GeoPandas example computation

Here’s how to read a GeoParquet file into a Sedona DataFrame.

path = 's3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet'

buildings = sedona.read.format("geoparquet").load(path)

Let’s inspect a few rows of data.

buildings.select("BUILD_ID", "height_val", "geom").limit(5).show()

+--------+----------+--------------------+
|BUILD_ID|height_val|                geom|
+--------+----------+--------------------+
| 2080161|       0.0|MULTIPOLYGON (((-...|
| 2080159|       0.0|MULTIPOLYGON (((-...|
| 2080158|       0.0|MULTIPOLYGON (((-...|
| 7776189|      8.49|MULTIPOLYGON (((-...|
| 7776214|     10.49|MULTIPOLYGON (((-...|
+--------+----------+--------------------+

Let’s filter out all the rows with a zero height_val and compute the centroid for each building.

from pyspark.sql.functions import col
from sedona.sql.st_functions import ST_Centroid

(buildings
    .where(col("height_val") != 0.0)
    .withColumn("centroid", ST_Centroid(col("geom")))
    .select("BUILD_ID", "height_val", "centroid")
    .limit(5)
    .show(truncate=False))

+--------+----------+---------------------------------------------+
|BUILD_ID|height_val|centroid                                     |
+--------+----------+---------------------------------------------+
|7776189 |8.49      |POINT (-73.9244121583531 40.876054867139366) |
|7776214 |10.49     |POINT (-73.92479319267416 40.87608655645236) |
|7740167 |37.44     |POINT (-73.93514476461972 40.81269938665691) |
|7740151 |37.27     |POINT (-73.93622071995519 40.81267611885931) |
|7740289 |36.19     |POINT (-73.93705258498942 40.813024939805196)|
+--------+----------+---------------------------------------------+

You could also convert this buildings dataset to a GeoPandas DataFrame and compute the centroid for each building.  Here’s how we will run the computation:

  • Use Sedona to read and filter the data
  • Convert from Sedona to GeoPandas
  • Use GeoPandas to compute the centroids

Here’s how to filter the DataFrame and grab a few columns with Sedona:

sedona_df = (
    buildings
        .where(col("height_val") != 0.0)
        .select("BUILD_ID", "height_val", "geom")
)

Now convert the DataFrame to a pandas DataFrame:

pandas_df = sedona_df.toPandas()

Convert the pandas DataFrame to a GeoPandas DataFrame:

import geopandas as gpd

geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geom")

Compute the centroids with GeoPandas and view the contents of the GeoPandas DataFrame:

geopandas_df["centroid"] = geopandas_df.geometry.centroid

print(geopandas_df[["BUILD_ID", "height_val", "centroid"]].head(5))

---------------------------------------------------
   BUILD_ID  height_val                    centroid
0   7776189    8.490000  POINT (-73.92441 40.87605)
1   7776214   10.490000  POINT (-73.92479 40.87609)
2   7740167   37.439999   POINT (-73.93514 40.8127)
3   7740151   37.270000  POINT (-73.93622 40.81268)
4   7740289   36.189999  POINT (-73.93705 40.81302)

Read an Iceberg table into a GeoPandas DataFrame

Suppose you have an Iceberg table named local.db.icetable and would like to analyze the data with GeoPandas.

GeoPandas does not have an Iceberg reader yet, so let’s read the table with Sedona and then convert it to a GeoPandas DataFrame.

Here’s how to create the Sedona DataFrame:

sedona_df = sedona.table("local.db.icetable")

Now let’s convert the DataFrame to a GeoPandas DataFrame:

import geopandas as gpd

pandas_df = sedona_df.toPandas()
geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geometry")

print(geopandas_df)

---------------------------
 id               geometry
0  a  LINESTRING (1 3, 3 1)
1  d  LINESTRING (2 7, 4 9)
2  e  LINESTRING (6 7, 6 9)
3  f  LINESTRING (8 7, 9 9)

Only smaller Sedona DataFrames should be converted to GeoPandas DataFrames.  Let’s take a look at the architectural differences between the engines to see why.

How Sedona and GeoPandas execute computations differently

Sedona executes computations on a distributed cluster, so it can quickly run computations on large datasets.

GeoPandas runs computations on a single node, so it’s not as scalable, but it can be quick for small datasets and is interoperable with libraries in the pandas ecosystem.

GeoPandas computations can be distributed with Sedona GeoPandas or dask-geopandas, and we will describe those technologies in the sections that follow.

Let’s take a look at the differences in more detail.

Single-node execution with GeoPandas vs. Sedona cluster computing

GeoPandas runs computations on a single computer and processes data using only a single node. This is fine for small datasets, but it cannot scale beyond the resources of a single machine.

Sedona runs on many machines working in unison as a cluster.  The cluster has a driver node responsible for orchestration and worker nodes that process the data in parallel.  

When a computation runs on a single node, only one machine reads the data and executes the query. When a computation is run on a cluster, all worker nodes read data in parallel and execute the query. As you can imagine, running on a cluster is faster and more scalable for large datasets.
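The driver/worker split can be sketched in plain Python (a toy illustration, not Sedona's actual implementation): the data is split into partitions, workers process their partitions in parallel, and the driver combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n chunks, like partitions spread across worker nodes."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def count_tall(heights, threshold=10.0):
    """Per-partition work: count buildings taller than the threshold."""
    return sum(1 for h in heights if h > threshold)

heights = [0.0, 8.49, 10.49, 37.44, 37.27, 36.19, 5.0, 12.0]

# Single node: one process scans every row sequentially.
single = count_tall(heights)

# "Cluster": each worker handles one partition; the driver sums the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = sum(pool.map(count_tall, partition(heights, 4)))

assert single == parallel  # same answer, computed in parallel
```

Sedona does this at a much larger scale: the partitions live on different machines, and the engine manages the shuffling and combining of results.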

Lazy Sedona execution vs. eager computation with GeoPandas

Wherobots is a lazy compute engine, and GeoPandas is an eager compute engine.

Lazy compute engines build up query execution graphs but don’t run the underlying computations until they need to produce a result.  Let’s take a look at a line of Sedona code that reads a file:

df = sedona.read.format("geojson").load("overture_buildings.geojson")

This line doesn’t read the underlying file. It waits until the user requests a result from the data before doing the work—it’s lazy. This is good because it can inspect and optimize the queries before running them.

Let’s compare this with GeoPandas code that reads in data to a DataFrame:

overture_buildings = gpd.read_file("overture_buildings.geojson")

This GeoPandas code runs instantly and eagerly loads all the data into an in-memory DataFrame. Eager execution can be beneficial, especially when the data is small and when the DataFrame is repeatedly queried. However, it has drawbacks for larger-than-memory datasets.
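To make the lazy/eager distinction concrete, here is a toy lazy engine in plain Python (a conceptual sketch, not how Sedona is implemented): operations are recorded into a plan, and nothing runs until a result is requested.

```python
class LazyFrame:
    """Toy lazy engine: record operations, run nothing until collect()."""

    def __init__(self, source):
        self._source = source      # callable that produces the data
        self._ops = []             # recorded, not-yet-executed operations

    def where(self, predicate):
        self._ops.append(lambda rows: [r for r in rows if predicate(r)])
        return self                # nothing has run yet -- still lazy

    def collect(self):
        rows = self._source()      # the data is read only now
        for op in self._ops:       # an optimizer could rewrite self._ops first
            rows = op(rows)
        return rows

buildings = LazyFrame(lambda: [{"id": 1, "height": 0.0}, {"id": 2, "height": 8.49}])
query = buildings.where(lambda r: r["height"] > 0)  # builds the plan, reads nothing
result = query.collect()                            # now the read and filter happen
```

Because the whole plan is known before anything executes, a lazy engine gets the chance to optimize it; an eager engine has already done the work by the time later steps are known.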

Out-of-memory datasets with GeoPandas

GeoPandas eagerly loads data into memory. When the machine does not have enough memory to load all the data, it will error out with an out-of-memory exception. In practice, GeoPandas works well with dataset sizes much smaller than the total memory on the machine.

Wherobots is designed to run on datasets larger than a given machine’s memory. It processes data incrementally rather than loading it into memory all at once, allowing it to process out-of-memory datasets.
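A toy sketch of the difference in plain Python: streaming rows through a filter needs only constant memory, while the eager approach must first materialize every row.

```python
def read_rows():
    """Stand-in for a large file: yields rows one at a time instead of
    materializing everything in memory (a sketch of incremental processing)."""
    for i in range(1_000_000):
        yield {"id": i, "height": float(i % 50)}

# Eager style (what GeoPandas does): load everything, then filter.
# rows = list(read_rows()); tall = [r for r in rows if r["height"] > 40]

# Incremental style: stream rows through the filter, keeping only a count.
count = sum(1 for r in read_rows() if r["height"] > 40)
```

The streaming version never holds more than one row at a time, which is why an incremental engine can handle datasets that would crash an eager one.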

Let’s now dive into how the computations are executed.

Sedona vs. GeoPandas: Query optimizations

Wherobots uses a query optimizer to ensure the code executes efficiently. For example, suppose a query requires three columns of a 10-column dataset stored in GeoParquet. The query optimizer will intelligently select only the needed three columns rather than reading all 10 columns of data.

GeoPandas doesn’t have a query optimizer, so users must apply optimizations manually.  For example, when a GeoPandas user queries a GeoParquet file, they need to specify the column pruning and row-group filtering logic themselves.
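Here is a toy sketch of what column pruning means, using a plain Python dict to stand in for a columnar file (in real GeoPandas code this is typically done by passing a columns= argument to read_parquet):

```python
# Toy columnar "file": each column is stored separately, as in GeoParquet.
columnar_file = {
    "BUILD_ID":   [2080161, 7776189, 7776214],
    "height_val": [0.0, 8.49, 10.49],
    "geom":       ["MULTIPOLYGON(...)", "MULTIPOLYGON(...)", "MULTIPOLYGON(...)"],
    # ...imagine seven more columns a query optimizer would skip...
}

def read_columns(file, columns=None):
    """Read only the requested columns -- what an optimizer does automatically,
    and what a GeoPandas user must request by hand."""
    wanted = columns if columns is not None else list(file)
    return {name: file[name] for name in wanted}

# Only the two needed columns are read; "geom" is never touched.
pruned = read_columns(columnar_file, columns=["BUILD_ID", "height_val"])
```

With a query optimizer, the engine inspects the query, sees which columns it references, and performs this pruning without the user asking for it.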

The query optimizer provides a better developer experience and better performance, too.

Sedona vs. GeoPandas: Syntax differences

Sedona supports Python, SQL, Java, Scala, and R interfaces.

GeoPandas only supports a Python API.

The Wherobots Python syntax is different from the GeoPandas Python syntax: Sedona is syntactically similar to PySpark, and GeoPandas is syntactically similar to pandas.
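A small illustration of the two styles, using a plain pandas DataFrame to stand in for the buildings data (the Sedona equivalent is shown as a comment):

```python
import pandas as pd

df = pd.DataFrame({"BUILD_ID": [2080161, 7776189], "height_val": [0.0, 8.49]})

# pandas/GeoPandas style: boolean-mask indexing, executed eagerly.
tall = df[df["height_val"] != 0.0]

# The Sedona/PySpark equivalent (not run here) would be:
#   buildings.where(col("height_val") != 0.0)
```

The two styles express the same filter; the difference is in how, and when, the engine executes it.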

Next, let’s compare the library ecosystems.

Sedona vs. GeoPandas: Library ecosystem

Sedona and GeoPandas are compatible with different libraries.

For example, suppose you would like to read a large geospatial dataset stored in Apache Iceberg, perform some data munging, and then display the results with matplotlib.

Sedona is compatible with Iceberg, and GeoPandas is interoperable with matplotlib, so here is how you can run this computation:

  1. Read the Iceberg table with Sedona
  2. Perform data wrangling with Sedona and reduce the dataset size
  3. Convert to GeoPandas and plot with matplotlib

Sedona and GeoPandas are interoperable with different libraries, and that’s one of the main reasons why it’s nice to have access to both.

Sedona GeoPandas API

The Sedona GeoPandas API mirrors the GeoPandas API but executes computations with Sedona in a distributed manner.  It’s a great way for developers who are familiar with GeoPandas syntax to write code that’s executed by Sedona.

Here’s an example of how to create a Sedona GeoPandas DataFrame:
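This example is a sketch only: the import path and reader name shown here are assumptions that may vary across Sedona releases, so check the Sedona GeoPandas API documentation for your version.

```python
# Assumed import path -- may differ across Sedona versions.
import sedona.geopandas as sgpd

# Familiar GeoPandas-style syntax, executed by Sedona under the hood:
buildings = sgpd.read_parquet(
    "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
)
tall = buildings[buildings["height_val"] != 0.0]
centroids = tall.geometry.centroid
```

Because the computation runs on the cluster, this syntax scales to datasets far larger than a single machine’s memory.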

Dask-GeoPandas overview

The dask-geopandas library is another way to parallelize GeoPandas computations to leverage all cores of a given machine or distribute computations to a cluster.

Here’s a good summary of dask-geopandas: “Since GeoPandas is an extension to the pandas DataFrame, the same way Dask scales pandas can also be applied to GeoPandas”.

A deep dive into the differences between Sedona and dask-geopandas is beyond the scope of this post, which focuses on Sedona and GeoPandas, but we’re excited to investigate this in a future post!

Converting from Sedona to GeoPandas

You can convert a Wherobots DataFrame to a GeoPandas DataFrame, as you can see in this code snippet:

pandas_df = sedona_df.toPandas()
geopandas_df = gpd.GeoDataFrame(pandas_df, geometry="geom")

When you convert from Sedona DataFrames to GeoPandas, you load all the data into the driver node and process it on a single machine. Converting from a Sedona DataFrame to a GeoPandas DataFrame is not recommended if the dataset is large.

It’s better to convert from Sedona to GeoPandas once a dataset has been pre-processed and reduced in size. Once the data is small, you can convert it to a GeoPandas DataFrame and enjoy the interoperability offered by the GeoPandas ecosystem.

Converting from GeoPandas to a Sedona DataFrame

You can also convert from a GeoPandas DataFrame to a Sedona DataFrame.
Here’s an example of how to read a FlatGeoBuf file into a GeoPandas DataFrame and then convert it to a Sedona DataFrame:

geopandas_df = gpd.read_file("overture_buildings.fgb")
overture_buildings_df = sedona.createDataFrame(geopandas_df)

Sedona doesn’t have a FlatGeoBuf reader yet, so this is a good way to read FlatGeoBuf files into Sedona DataFrames!

Visualizing Iceberg Data with Wherobots, GeoPandas, and Folium

Let’s illustrate how to leverage Sedona’s computational power to process large Iceberg tables, reduce the data to a manageable size, and then visualize the results with GeoPandas and Folium. This workflow showcases the strength of combining these tools for geospatial data analysis.

First, we utilize Wherobots Sedona to efficiently read and process our extensive Iceberg data:

buildings_joined = sedona.sql("""
WITH dry_state_roi AS (
    SELECT 
        *
    FROM 
        wherobots_open_data.overture_2025_03_19_1.divisions_division_area
    WHERE
        country = 'US'
    AND 
        subtype = 'region'
)
SELECT 
    COUNT(building.id) AS buildings_count, roi.region AS region, roi.geometry AS region_geometry
FROM
    wherobots_open_data.overture_2025_03_19_1.buildings_building building
JOIN
    dry_state_roi roi
ON
    ST_Intersects(building.geometry, roi.geometry)
GROUP BY
    roi.region, roi.geometry
""")

Here, Sedona’s ability to handle large datasets shines as it quickly aggregates building counts by region. This reduction in data size is crucial, as it prepares the data for the next step with GeoPandas.

Next, we convert the Sedona DataFrame to a GeoPandas DataFrame:

import geopandas as gpd
buildings_pdf = buildings_joined.toPandas()
buildings_gpdf = gpd.GeoDataFrame(buildings_pdf, geometry="region_geometry")

Now, we move to visualization using Folium, a library that integrates seamlessly with GeoPandas:

import folium
m = folium.Map([43, -100], zoom_start=4)
folium.Choropleth(
    geo_data=buildings_gpdf.to_json(),
    data=buildings_gpdf,
    columns=["region", "buildings_count"],
    key_on="feature.properties.region",
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Building Count",
).add_to(m)
m

While Sedona integrates with other visualization libraries, Folium’s direct compatibility with GeoPandas simplifies the visualization process. This code generates an interactive choropleth map that clearly displays building density across US regions.

This example illustrates the power of Wherobots Sedona for rapid, large-scale data reduction, followed by GeoPandas and Folium for effective visualization. This combination allows for efficient analysis and presentation of complex geospatial data.

Comparing Sedona Wherobots and GeoPandas performance for a given query

Let’s compare the query performance for a dataset with different scale factors using Sedona Wherobots and GeoPandas.

The GeoPandas query is run on a machine with 12 cores and 96GB of RAM. The Wherobots Sedona query is run on a cluster of machines with a total of 24 cores and 96GB of RAM.  Both setups have the same total RAM.

Important note: This comparison is for GeoPandas vs. Sedona Wherobots.  dask-geopandas is not included in this analysis, although it would be a more direct comparison since it is a parallel compute engine.  The purpose of this comparison is to find the performance inflection point, not to suggest that GeoPandas and Sedona Wherobots are equivalent engines.

We tested spatial joins on three different datasets with GeoPandas and Sedona Wherobots.

Overture buildings & Postal codes join details

Let’s start with a simple spatial join query that counts the number of Overture Buildings located within postal code zones, by joining the Overture Buildings dataset with a postal codes dataset. Here is the query:

WITH t AS (
    SELECT * FROM overture_buildings buildings
    JOIN postal_codes zones
    ON ST_Intersects(buildings.geometry, zones.geometry)
)
SELECT COUNT(1) FROM t

Let’s perform this query with different dataset sizes:

Size        overture_buildings   postal_codes
Small       1,000,000            154,452
Medium      10,000,000           154,452
Large       50,000,000           154,452
Full size   706,967,095          154,452

Here are the benchmarking results for each scale factor:

Size        Wherobots execution time   GeoPandas execution time   Output count
Small       8.43s                      13.3s                      362,042
Medium      12.8s                      1m 6s                      3,099,105
Large       31.6s                      -                          15,170,494
Full size   5m 42s                     -                          295,732,669

‘-’ indicates the kernel died with an out-of-memory (OOM) error.

Wherobots is faster for all scale factors. GeoPandas cannot run the query on the large dataset because it’s too much data to fit into memory.

Here is the GeoPandas code to run the query:

import geopandas as gpd

overture_buildings = gpd.read_parquet("path/to/overture_buildings/")
postal_codes = gpd.read_parquet("path/to/postal_codes/")

t = overture_buildings.sindex.query(postal_codes["geometry"], predicate="intersects")
count = t.shape[1]  # query returns a (2, n) array of matching index pairs

Here’s how to run the code with Sedona:

buildings_df = sedona.read.format("geoparquet").load("s3://wherobots-examples/data/geopandas_blog/1M/overture-buildings/")
buildings_df.createOrReplaceTempView("overture_buildings")
postal_codes_df = sedona.read.format("geoparquet").load("s3://wherobots-examples/data/geopandas_blog/postal-code/")
postal_codes_df.createOrReplaceTempView("postal_codes")

sedona.sql("""
WITH t AS (
SELECT * FROM overture_buildings buildings
	JOIN postal_codes zones
ON ST_Intersects(buildings.geometry, zones.geometry)
)
SELECT COUNT(1) FROM t
""").show()

Accessing the datasets

For those interested in replicating the analyses and experiments discussed in this blog post, the datasets are available in a public Amazon S3 bucket. Here’s how you can access them:

s3://wherobots-examples/data/geopandas_blog/
├── 1M/
│   └── overture-buildings/
├── 10M
│   └── overture-buildings/
├── 50M
│   └── overture-buildings/
└── postal-code/

Each xM/ directory (e.g., 1M/, 10M/, 50M/) contains datasets scaled to different sizes, allowing for performance comparisons and scalability testing. The postal-code/ directory contains the postal code dataset used in the spatial join examples.

Conclusion

Sedona and GeoPandas are both great tools for analyzing spatial data.

GeoPandas has many advantages:

  • Interoperability with lots of libraries in the pandas ecosystem
  • Built-in matplotlib support for graphing
  • Long-standing project with a good community and strong technical leadership 
  • Easy installation

GeoPandas is an excellent option if you have a small dataset and are familiar with pandas.

Wherobots and Sedona are better for larger datasets.  Wherobots has Python and SQL APIs and is also better for users who prefer writing SQL.

But as you’ve seen in this post, Sedona and GeoPandas are better together.  The technologies combined give you access to a larger ecosystem of libraries and the ability to run computations on large and small datasets.

Get Started with Wherobots for free