
Introducing Scalability for GeoPandas in Apache Sedona


Introduction

Learn about the new GeoPandas API for Apache Sedona. This new API allows GeoPandas developers to seamlessly scale their analysis beyond the limitations of a single compute instance, unlocking insights from large-scale datasets. By combining the intuitive Pythonic GeoPandas API with Apache Sedona’s distributed processing capabilities, users get the best of both worlds: familiar syntax with planetary-scale performance.

What is GeoPandas?

Python developers have long relied on GeoPandas as a great starting point for their spatial intelligence needs. As an extension of Pandas DataFrames, GeoPandas adds support for geometries and geospatial functions, making the analysis of location data straightforward, efficient and familiar to Python developers. With a few lines of Pythonic code, developers can easily load various geospatial data formats into GeoPandas DataFrames and then apply powerful geospatial transformations or analyses.

Here’s an example of how you can use GeoPandas to load data from a Parquet file and then filter the data using coordinate-based indexing:

import geopandas
import s3fs

# Define the S3 file path to the NYC buildings dataset and a filesystem object
s3_path = 's3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet'
s3 = s3fs.S3FileSystem(anon=True)

# Load the GeoParquet file directly from S3
nyc_buildings = geopandas.read_parquet(s3_path, filesystem=s3)

# Define a bounding box for Columbus Circle (approximate coordinates)
columbus_circle_bbox = [
    -73.983, 40.767, # bottom-left (longitude, latitude)
    -73.981, 40.769  # top-right (longitude, latitude)
]

# Filter for buildings within the bounding box using coordinate-based indexing
columbus_circle_buildings = nyc_buildings.cx[
    columbus_circle_bbox[0]:columbus_circle_bbox[2],
    columbus_circle_bbox[1]:columbus_circle_bbox[3]
]

# Print the ID, address, and height for the first five rows of the filtered data
print(columbus_circle_buildings[["BUILD_ID", "PROP_ADDR", "height_val"]].head(5))

      BUILD_ID               PROP_ADDR  height_val  
5160   7734681       2 COLUMBUS CIRCLE   38.750000   
5161   7734653           1775 BROADWAY   49.049999   
5164   7734648      233 WEST 58 STREET   41.220001   
5165   7734719  240 CENTRAL PARK SOUTH   58.410000   
5166   7734834            892 9 AVENUE   43.849998   

Scaling GeoPandas for Large Datasets

However, Pandas and GeoPandas alone have fundamental limitations that restrict their use in large-scale analytical workloads.

GeoPandas is limited to a single node and does not natively support distributed computation. It’s easy to get started and great for smaller scale data analysis on your laptop or a single machine. But, as soon as your data volume or compute needs grow, you may be forced to adopt a different approach. Queries may slow down or even fail to complete due to out-of-memory errors as data sizes grow.

When GeoPandas hits this wall, Python developers currently turn to distributed frameworks like Dask, Dask DataFrames, and dask-geopandas. While effective, these solutions can introduce additional complexity and a different programming paradigm. And, as we will discuss in more detail later, they lack dedicated optimizations for distributed processing of geospatial data. This is the gap our new API is designed to fill.

Apache Sedona GeoPandas API: Scalable Geospatial Processing

Today, Wherobots announces a new GeoPandas API for Apache Sedona, available in Wherobots and Apache Sedona 1.8.0. It combines the familiarity and convenience of GeoPandas DataFrames and APIs with the distributed query performance of Apache Sedona on Apache Spark. You can continue to use the GeoPandas syntax you know and love, but your computations will be executed in a distributed, scalable environment so you no longer need to worry about hitting the single node ceiling. Geospatial processing tasks that can be parallelized will instantly benefit from distributed computation, reducing the time and cost required for complex analysis.

Beyond parallelized processing, Apache Sedona also supports query planning and lazy execution to optimize complex analytical pipelines. By deferring execution as late as possible, Sedona maximizes parallelism and minimizes costly data shuffling in distributed systems. For instance, with lazy execution, queries read only the relevant columns and sections of Parquet files once the full query plan is known. By comparison, GeoPandas typically reads the whole Parquet file at ingestion time, which can slow down the overall time to load and process larger datasets. For more details on the benefits of lazy loading, see “Eager vs. lazy data loading” below.
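The eager-versus-lazy distinction can be sketched with a toy example in plain Python (this is only an illustration of the idea; Sedona's real planner operates on Spark query plans, not Python dicts):

```python
# Toy sketch of eager vs. lazy column reads (illustrative only).
# A fake on-disk dataset: three columns.
DATASET = {"id": [1, 2, 3], "geometry": ["g1", "g2", "g3"], "height": [10, 20, 30]}

class EagerFrame:
    """Reads every column at load time, like a plain eager read_parquet."""
    def __init__(self):
        self.columns_read = list(DATASET)        # all columns touched immediately
        self.data = dict(DATASET)

    def select(self, col):
        return self.data[col]

class LazyFrame:
    """Defers all reads until a result is actually requested."""
    def __init__(self):
        self.columns_read = []                   # nothing touched yet
        self._pending = None

    def select(self, col):
        self._pending = col                      # just record the plan
        return self

    def collect(self):
        self.columns_read.append(self._pending)  # read only what the plan needs
        return DATASET[self._pending]

eager = EagerFrame()
eager.select("height")
print(eager.columns_read)   # ['id', 'geometry', 'height']

lazy = LazyFrame().select("height")
print(lazy.columns_read)    # [] -- nothing read until collect()
print(lazy.collect())       # [10, 20, 30]
print(lazy.columns_read)    # ['height']
```

The eager frame pays for all three columns even though the query touches one; the lazy frame sees the whole plan first and reads only the `height` column, which is the behavior that lets Sedona skip irrelevant columns and sections of Parquet files.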

Now, when you evaluate frameworks for geospatial processing, you don’t have to choose between GeoPandas and Apache Sedona, or move data between the two frameworks for different types of analysis. With the new GeoPandas API for Apache Sedona, a single API provides the familiarity of GeoPandas with the ability to scale seamlessly.

How to Use the GeoPandas API in Apache Sedona

In most cases, migrating existing code is as simple as changing your import statements to reference sedona.spark.geopandas:

import sedona.spark.geopandas as gpd
# or
from sedona.spark.geopandas import GeoDataFrame, read_parquet

# GeoPandas
import geopandas as gpd
df = gpd.read_parquet(path)
# GeoPandas on Sedona  
import sedona.spark.geopandas as gpd
df = gpd.read_parquet(path)
# That's it! Same API, distributed execution

GeoPandas vs. Apache Sedona Performance Comparison

Let’s look at some examples that show the relevant performance and scale differences between using GeoPandas directly or the GeoPandas API for Apache Sedona.

Note: The tests below were all conducted on a Medium Wherobots cluster with distributed compute across more than 20 nodes. Since GeoPandas is limited to a single node, it was only able to leverage the resources of the driver node of the cluster. Meanwhile, GeoPandas on Apache Sedona was able to parallelize operations across all of the nodes in the cluster.

Eager vs. lazy data loading

As mentioned above, GeoPandas will typically read the entire dataset into memory during a load operation. In contrast, Apache Sedona and Apache Spark implement lazy data loading and will defer reading the dataset until query time. As a result, Apache Sedona can minimize the amount of data read, only retrieving the data required to complete the query.

Here’s an example that demonstrates this tradeoff when working with large datasets. We load a large Parquet file consisting of 10s of millions of geometries and then perform an area calculation. The data for this example was synthetically generated using the Apache Sedona Spider Spatial Data Generator and stored in S3 in Parquet format.

Number of geometries   GeoPandas                          GeoPandas on Sedona
                       Load     Area calc   Total         Load   Area calc   Total
10,000,000             10.2 s   0.4 s       10.5 s        N/A    0.4 s       0.4 s
20,000,000             20.7 s   0.7 s       21.4 s        N/A    0.6 s       0.6 s
40,000,000             38.6 s   1.4 s       40.0 s        N/A    0.9 s       0.9 s

Although GeoPandas and GeoPandas on Sedona show comparable performance for the area calculation itself, the benefits of lazy loading are clear. Instead of waiting tens of seconds to load the entire file, we get sub-second results. Admittedly, the data load is a one-time operation in GeoPandas whose cost could be amortized across many queries. But with lazy loading in GeoPandas on Sedona, your initial analysis of large datasets is instantly accelerated, and there is little to no downside as you run subsequent queries.
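The amortization tradeoff is easy to quantify with the 10-million-geometry timings above. This rough back-of-the-envelope model assumes every query costs the same 0.4 s of compute for both approaches and ignores caching and cluster effects:

```python
# Rough amortization model using the 10M-geometry timings from the table above.
EAGER_LOAD = 10.2   # one-time full load (GeoPandas)
QUERY_COST = 0.4    # per-query compute, assumed equal for both approaches

def eager_total(n_queries):
    """Total time if the whole file is loaded up front."""
    return EAGER_LOAD + n_queries * QUERY_COST

def lazy_total(n_queries):
    """Total time with lazy loading: no up-front load cost."""
    return n_queries * QUERY_COST

for n in (1, 10, 100):
    print(f"{n:>3} queries: eager {eager_total(n):6.1f} s, lazy {lazy_total(n):6.1f} s")
```

Under these assumptions, eager loading never catches up: its total always trails the lazy approach by the fixed 10.2 s load cost, which is why subsequent queries carry little to no downside.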

Spatial joins

In this example, we perform spatial joins (with predicate dwithin) against two dataframes of varying sizes. The data for these queries was synthetically generated using the Apache Sedona Spider Spatial Data Generator and stored in S3 in Parquet format.

Here’s the code used to perform the spatial join:

left_df = read_parquet(left_data_path_in_s3)
right_df = read_parquet(right_data_path_in_s3)
result = left_df.sjoin(right_df, predicate="dwithin", distance=50)

Left side geometries   Right side geometries   GeoPandas time          GeoPandas on Sedona time
1,000                  1,000                   0.5 s                   3.7 s
5,000                  5,000                   14.5 s                  2.9 s
10,000                 10,000                  60.7 s                  3.8 s
20,000                 20,000                  N/A, did not complete   8.1 s
100,000                100,000                 N/A, did not complete   151.2 s

Note: GeoPandas performs slightly better on the join of 1000 x 1000 geometries. But, as the dataset size increases, GeoPandas on Apache Sedona demonstrates the advantages of distributed computation and performs significantly better. GeoPandas was not able to complete the 20,000 x 20,000 join or the 100,000 x 100,000 join on the single driver node of our Medium Wherobots cluster.
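The dwithin predicate used above matches pairs whose geometries lie within a given distance of each other. A naive single-node version over points (hypothetical data, Euclidean distance) makes the semantics, and why the join gets expensive at scale, concrete:

```python
import math

# Naive dwithin join over points: O(len(left) * len(right)) distance checks.
# Sedona instead partitions space and prunes candidate pairs across the
# cluster; this sketch only illustrates the predicate itself.
def dwithin_join(left, right, distance):
    """Return (left_index, right_index) pairs within `distance` of each other."""
    matches = []
    for i, (lx, ly) in enumerate(left):
        for j, (rx, ry) in enumerate(right):
            if math.hypot(lx - rx, ly - ry) <= distance:
                matches.append((i, j))
    return matches

left_pts = [(0, 0), (100, 100)]
right_pts = [(30, 40), (103, 104)]   # 50 units from (0, 0); 5 from (100, 100)
print(dwithin_join(left_pts, right_pts, 50))   # [(0, 0), (1, 1)]
```

Because the naive approach compares every left geometry against every right geometry, its cost grows with the product of the two dataset sizes, which matches the benchmark above: fine at 1,000 × 1,000, infeasible on a single node at 100,000 × 100,000.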

Real-World Example: Analyzing Overture Building Data

In this example, we look at the Overture Building dataset and analyze buildings by postal code using a spatial join.

Here’s the code used to perform the spatial join:

from sedona.spark.geopandas import read_parquet

DATA_DIR = "s3://wherobots-examples/data/geopandas_blog/"
overture_size = "1M"
postal_codes_path = DATA_DIR + "postal-code/"
overture_path = DATA_DIR + overture_size + "/" + "overture-buildings/"

postal_codes_df = read_parquet(postal_codes_path)
overture_buildings_df = read_parquet(overture_path)
cnt = overture_buildings_df.sjoin(postal_codes_df, predicate="intersects").shape[0]

Note: To run the same code in GeoPandas, you may need to add the storage_options parameter to read from a public bucket, as in:

postal_codes_df = read_parquet(postal_codes_path, storage_options={'anon': True})
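Beyond the raw row count, the natural next step for "buildings by postal code" is a per-code aggregation, which in the GeoPandas API would be a groupby on the joined frame. A stdlib Counter mirrors that logic on toy data (the building IDs and postal codes below are made up):

```python
from collections import Counter

# Toy stand-in for joined.groupby("postal_code").size() on an sjoin result.
# Hypothetical (building_id, postal_code) pairs from a joined frame.
joined_rows = [
    ("b1", "10019"), ("b2", "10019"), ("b3", "10023"),
    ("b4", "10019"), ("b5", "10023"), ("b6", "10001"),
]
buildings_per_postcode = Counter(pc for _, pc in joined_rows)
print(buildings_per_postcode.most_common())
# [('10019', 3), ('10023', 2), ('10001', 1)]
```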

Now, let’s compare the performance of this query, including the data load, across increasing dataset sizes (1 million → 10 million → 50 million).

Dataset size   GeoPandas time          GeoPandas on Sedona time
1 million      10.2 s                  8.7 s
10 million     87.1 s                  11.0 s
50 million     N/A, did not complete   12.5 s

Note: In this query example, Apache Sedona consistently performed better than GeoPandas. Furthermore, GeoPandas was not able to load the entire 50M records on the single driver node of our Medium Wherobots cluster and was, therefore, unable to complete the query.

Advantages of GeoPandas for Small-Scale Data

  • Working with smaller datasets: GeoPandas consistently provides slightly better performance on simple operations and on smaller datasets. Especially when the dataset is relatively small and can be loaded quickly into memory, there is less penalty for GeoPandas’ lack of lazy loading and query planning.
  • Complete functional coverage: GeoPandas is a mature framework with comprehensive geospatial functionality. By comparison, the initial release of GeoPandas API for Apache Sedona provides compatibility for the most popular GeoPandas functions, but some capabilities have not yet been implemented. For details on the current state of compatibility and opportunities to contribute to the open source project, see this GitHub issue.

Advantages of the GeoPandas API for Apache Sedona for Large Datasets

  • Working with larger datasets. With a distributed cluster, you will be able to load larger datasets that would exceed the memory of a single machine. And, with the power of lazy loading, you only need to load the subset of the data that’s actually required to complete each query.
  • Large-scale computations. Complex geospatial analyses will benefit from parallelism and Apache Sedona’s geospatial optimizations, such as:
    • Computationally expensive operations
      • to_crs (ST_Transform), buffer, is_simple, is_valid
    • Joins with expensive predicates
      • touches, overlaps, dwithin

Recommendation

  • If there’s a chance your geospatial analysis could potentially grow in size or complexity, we recommend starting with the GeoPandas API for Apache Sedona today.
  • At smaller scales, performance is comparable to GeoPandas. But, as your needs grow, you will be able to reap the performance and scale benefits that are only possible with Apache Sedona.
  • If you’d like a deeper dive into how Wherobots, Sedona, and GeoPandas complement each other, check out this blog post.
  • When you’re ready, get started with Apache Sedona with Wherobots. You can use the Community Edition for free, which has initial scalability, or upgrade to Pro Edition today and no longer worry about limiting the scale of your geospatial analysis.
Create your Wherobots account