Planetary-scale answers, unlocked.
Learn about the new GeoPandas API for Apache Sedona. This new API allows GeoPandas developers to seamlessly scale their analysis beyond the limitations of a single compute instance, unlocking insights from large-scale datasets. By combining the intuitive Pythonic GeoPandas API with Apache Sedona’s distributed processing capabilities, users get the best of both worlds: familiar syntax with planetary-scale performance.
Python developers have long relied on GeoPandas as a great starting point for their spatial intelligence needs. As an extension of Pandas DataFrames, GeoPandas adds support for geometries and geospatial functions, making the analysis of location data straightforward, efficient and familiar to Python developers. With a few lines of Pythonic code, developers can easily load various geospatial data formats into GeoPandas DataFrames and then apply powerful geospatial transformations or analyses.
Here’s an example of how you can use GeoPandas to load data from a Parquet file and then filter the data using coordinate-based indexing:
```python
import geopandas
import s3fs

# Define the S3 file path to the NYC buildings dataset and a filesystem object
s3_path = 's3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet'
s3 = s3fs.S3FileSystem(anon=True)

# Load the GeoParquet file directly from S3
nyc_buildings = geopandas.read_parquet(s3_path, filesystem=s3)

# Define a bounding box for Columbus Circle (approximate coordinates)
columbus_circle_bbox = [
    -73.983, 40.767,  # bottom-left (longitude, latitude)
    -73.981, 40.769   # top-right (longitude, latitude)
]

# Filter for buildings within the bounding box using coordinate-based indexing
columbus_circle_buildings = nyc_buildings.cx[
    columbus_circle_bbox[0]:columbus_circle_bbox[2],
    columbus_circle_bbox[1]:columbus_circle_bbox[3]
]

# Print the ID, address, and height for the first five rows of the filtered data
print(columbus_circle_buildings[["BUILD_ID", "PROP_ADDR", "height_val"]].head(5))
```

```
      BUILD_ID               PROP_ADDR  height_val
5160   7734681       2 COLUMBUS CIRCLE   38.750000
5161   7734653           1775 BROADWAY   49.049999
5164   7734648      233 WEST 58 STREET   41.220001
5165   7734719  240 CENTRAL PARK SOUTH   58.410000
5166   7734834            892 9 AVENUE   43.849998
```
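Under the hood, the `.cx` indexer selects rows whose geometries fall within the given coordinate range. Conceptually, for point locations, it behaves like this plain-Python bounding-box filter (the coordinates and IDs here are illustrative, not rows from the dataset):

```python
# Conceptual sketch of coordinate-based (.cx) filtering over point locations.
buildings = [
    {"id": 1, "lon": -73.982, "lat": 40.768},  # inside the bounding box
    {"id": 2, "lon": -73.990, "lat": 40.760},  # outside the bounding box
]
minx, miny, maxx, maxy = -73.983, 40.767, -73.981, 40.769

# Keep only buildings whose coordinates fall inside the box
inside = [b["id"] for b in buildings
          if minx <= b["lon"] <= maxx and miny <= b["lat"] <= maxy]
print(inside)  # [1]
```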
However, Pandas and GeoPandas alone have fundamental limitations that restrict their use in large-scale analytical workloads.
GeoPandas is limited to a single node and does not natively support distributed computation. It’s easy to get started and great for smaller scale data analysis on your laptop or a single machine. But, as soon as your data volume or compute needs grow, you may be forced to adopt a different approach. Queries may slow down or even fail to complete due to out-of-memory errors as data sizes grow.
When GeoPandas hits this wall, Python developers currently turn to distributed frameworks like Dask, Dask DataFrames, and dask-geopandas. While effective, these solutions can introduce additional complexity and a different programming paradigm. And, as we will discuss in more detail later, they lack dedicated optimizations for distributed processing of geospatial data. This is the gap our new API is designed to fill.
Today, Wherobots announces a new GeoPandas API for Apache Sedona, available in Wherobots and Apache Sedona 1.8.0. It combines the familiarity and convenience of GeoPandas DataFrames and APIs with the distributed query performance of Apache Sedona on Apache Spark. You can continue to use the GeoPandas syntax you know and love, but your computations will be executed in a distributed, scalable environment so you no longer need to worry about hitting the single node ceiling. Geospatial processing tasks that can be parallelized will instantly benefit from distributed computation, reducing the time and cost required for complex analysis.
Beyond parallelized processing, Apache Sedona also supports query planning and lazy execution to optimize complex analytical pipelines. By deferring execution as late as possible, Sedona maximizes parallelism and minimizes costly data shuffling in distributed systems. For instance, with lazy execution, queries read only the relevant columns and sections of Parquet files once the specific query plan is known. By comparison, GeoPandas typically reads the whole Parquet file at ingestion time, which can slow down loading and processing of larger datasets. For more details on the benefits of lazy loading, see “Eager vs. lazy data loading” below.
Now, when you evaluate frameworks for geospatial processing, you don’t have to choose between GeoPandas and Apache Sedona, or move data between the two frameworks for different types of analysis. With the new GeoPandas API for Apache Sedona, a single API provides the familiarity of GeoPandas with the ability to scale seamlessly.
In most cases, migrating existing code is as simple as changing your import statements to reference sedona.spark.geopandas:
```python
import sedona.spark.geopandas as gpd
```

or

```python
from sedona.spark.geopandas import GeoDataFrame, read_parquet
```
```python
# GeoPandas
import geopandas as gpd

df = gpd.read_parquet(path)
```

```python
# GeoPandas on Sedona
import sedona.spark.geopandas as gpd

df = gpd.read_parquet(path)  # That's it! Same API, distributed execution
```
Let’s look at some examples that show the performance and scale differences between using GeoPandas directly and using the GeoPandas API for Apache Sedona.
Note: The tests below were all conducted on a Medium Wherobots cluster with distributed compute across more than 20 nodes. Since GeoPandas is limited to a single node, it was only able to leverage the resources of the driver node of the cluster. Meanwhile, GeoPandas on Apache Sedona was able to parallelize operations across all of the nodes in the cluster.
Eager vs. lazy data loading

As mentioned above, GeoPandas will typically read the entire dataset into memory during a load operation. In contrast, Apache Sedona and Apache Spark implement lazy data loading and will defer reading the dataset until query time. As a result, Apache Sedona can minimize the amount of data read, only retrieving the data required to complete the query.
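The eager-versus-lazy distinction can be sketched in plain Python, independent of Sedona: an eager load materializes every row up front, while a lazy one builds a recipe that produces rows only when a result is actually consumed:

```python
import itertools

# Eager: materialize every row up front, like reading a whole file into memory.
def load_eager(n):
    return [i * i for i in range(n)]  # all n rows computed immediately

# Lazy: return a generator; rows are produced only when the result is consumed.
def load_lazy(n):
    return (i * i for i in range(n))  # nothing computed yet

# Consuming only the first three rows of a "ten-million-row" lazy load
first_three = list(itertools.islice(load_lazy(10_000_000), 3))
print(first_three)  # [0, 1, 4] -- only 3 of the 10M rows were ever produced
```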
Here’s an example that demonstrates this tradeoff when working with large datasets. We load a large Parquet file consisting of tens of millions of geometries and then perform an area calculation. The data for this example was synthetically generated using the Apache Sedona Spider Spatial Data Generator and stored in S3 in Parquet format.
Although GeoPandas and GeoPandas on Sedona show comparable performance for the area calculation itself, we see the benefits of lazy loading. Instead of waiting tens of seconds to load the entire file, we get sub-second results! Admittedly, data loading is a one-time cost in GeoPandas that could be amortized across many queries. But with lazy loading in GeoPandas on Sedona, your initial analysis of large datasets is instantly accelerated, and there’s little to no downside as you run subsequent queries.
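To make the amortization argument concrete, here is a back-of-the-envelope sketch with purely illustrative timings (not measurements from the benchmark above):

```python
# Illustrative timings, not measured values.
eager_load_s = 30.0   # one-time full read of the file (eager)
eager_query_s = 0.5   # each subsequent query over in-memory data
lazy_query_s = 1.0    # each lazy query re-scans only the data it needs

def eager_total(n_queries):
    # Pay the full load cost once, then cheap queries
    return eager_load_s + n_queries * eager_query_s

def lazy_total(n_queries):
    # No up-front load; each query pays its own (small) scan cost
    return n_queries * lazy_query_s

print(eager_total(1), lazy_total(1))      # 30.5 vs 1.0  -- lazy wins for a first look
print(eager_total(100), lazy_total(100))  # 80.0 vs 100.0 -- eager amortizes eventually
```

Even in the regime where eager loading eventually amortizes, the lazy approach gets you to a first answer far sooner.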
In this example, we perform spatial joins (with predicate dwithin) against two dataframes of varying sizes. The data for these queries was synthetically generated using the Apache Sedona Spider Spatial Data Generator and stored in S3 in Parquet format.
Here’s the code used to perform the spatial join:
```python
from sedona.spark.geopandas import read_parquet

left_df = read_parquet(left_data_path_in_s3)
right_df = read_parquet(right_data_path_in_s3)
result = left_df.sjoin(right_df, predicate="dwithin", distance=50)
```
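For intuition, the `dwithin` predicate matches pairs of geometries that lie within the given distance of each other. A plain-Python sketch over point coordinates (the points and IDs are made up for illustration):

```python
from math import hypot

# Two tiny "dataframes" of labeled points
left = {"a": (0.0, 0.0), "b": (100.0, 100.0)}
right = {"x": (30.0, 40.0), "y": (500.0, 500.0)}

# dwithin joins every pair whose geometries lie within the given distance
pairs = [
    (lid, rid)
    for lid, (lx, ly) in left.items()
    for rid, (rx, ry) in right.items()
    if hypot(rx - lx, ry - ly) <= 50
]
print(pairs)  # [('a', 'x')] -- only a and x are within distance 50
```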
Note: GeoPandas performs slightly better on the join of 1000 x 1000 geometries. But, as the dataset size increases, GeoPandas on Apache Sedona demonstrates the advantages of distributed computation and performs significantly better. GeoPandas was not able to complete the 20,000 x 20,000 join or the 100,000 x 100,000 join on the single driver node of our Medium Wherobots cluster.
In this example, we look at the Overture Building dataset and analyze buildings by postal code using a spatial join.
```python
from sedona.spark.geopandas import read_parquet

DATA_DIR = "s3://wherobots-examples/data/geopandas_blog/"
overture_size = "1M"
postal_codes_path = DATA_DIR + "postal-code/"
overture_path = DATA_DIR + overture_size + "/" + "overture-buildings/"

postal_codes_df = read_parquet(postal_codes_path)
overture_buildings_df = read_parquet(overture_path)

# Count buildings that intersect a postal code area
cnt = overture_buildings_df.sjoin(postal_codes_df, predicate="intersects").shape[0]
```
Note: To run the same code in GeoPandas, you may need to add the `storage_options` parameter to read from a public bucket, as in:

```python
postal_codes_df = read_parquet(postal_codes_path, storage_options={'anon': True})
```
Now, let’s compare the performance of this query, including the data load, across increasing dataset sizes (1 million → 10 million → 50 million).
Note: In this query example, Apache Sedona consistently performed better than GeoPandas. Furthermore, GeoPandas was not able to load all 50 million records on the single driver node of our Medium Wherobots cluster and was therefore unable to complete the query.
The API also covers many other common GeoPandas operations, including:

- `to_crs`
- `buffer`
- `is_simple`
- `is_valid`
- `touches`
- `overlaps`