Last Updated: February 2026
Apache Sedona is an open-source cluster computing system built for processing large-scale spatial data across distributed environments. Originally developed at Arizona State University under the name GeoSpark, and described in the paper “Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond” by Jia Yu and Mohamed Sarwat, it is now a top-level Apache Software Foundation project used by organizations across transportation, logistics, environmental monitoring, insurance, and urban planning. This page covers what Apache Sedona is, how it processes spatial queries, common use cases, and how to get started.
Apache Sedona treats spatial data as a first-class citizen by extending distributed compute frameworks including Apache Spark, Apache Flink, and Snowflake with specialized data types, spatial operations, and indexing techniques optimized for spatial workloads. Unlike general-purpose compute frameworks, Sedona is purpose-built for the unique challenges of spatial data, including complex geometries, coordinate systems, and spatial relationships that standard data types cannot handle efficiently. The following section outlines how Apache Sedona processes spatial queries from data ingestion through to distributed execution.
Apache Sedona supports multiple programming languages, including Python, Scala, Java, R, and SQL, making it accessible to a wide range of data engineering and analytics workflows. Developers can interact with Apache Sedona through whichever language fits their existing stack.
On the integrations side, Apache Sedona runs on Apache Spark, Apache Flink, and Snowflake. Each runtime serves a different need: Apache Spark for large-scale distributed batch processing, Apache Flink for real-time streaming spatial analytics, and Snowflake for teams running spatial workloads inside a cloud data warehouse environment.
The first step in spatial query processing is to ingest geospatial data into Apache Sedona. Data can be loaded from various sources, such as files (Shapefiles, GeoJSON, Parquet, GeoTIFF, CSV, etc.) or databases, into Apache Sedona’s in-memory distributed spatial data structures (typically the Spatial DataFrame).
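As a sketch of the ingestion step, the snippet below loads spatial data into Spatial DataFrames with PySpark. It assumes a recent Apache Sedona release (1.5 or later) with the GeoJSON and GeoParquet readers available; the file paths are hypothetical placeholders.

```python
# Hypothetical ingestion sketch: loading spatial data into Sedona Spatial
# DataFrames from different sources (paths are placeholders).
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# GeoJSON source
parcels = sedona.read.format("geojson").load("s3://bucket/parcels.geojson")

# GeoParquet source
roads = sedona.read.format("geoparquet").load("s3://bucket/roads.parquet")

# CSV with a WKT column: parse it into a geometry column using Sedona SQL
points_raw = sedona.read.option("header", "true").csv("s3://bucket/points.csv")
points_raw.createOrReplaceTempView("points_raw")
points = sedona.sql("SELECT ST_GeomFromWKT(wkt) AS geom, * FROM points_raw")
```

Each resulting DataFrame carries a geometry column that downstream spatial operations can index, partition, and query.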
Next, Sedona uses spatial indexing techniques, such as R-trees and quadtrees, to accelerate query processing. The spatial index partitions the data into smaller, manageable units, enabling efficient data retrieval during query processing.
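To make the partitioning idea concrete, here is a minimal, self-contained sketch (not Sedona's actual implementation) of quadtree-style cell assignment: each point is mapped to a cell ID by repeatedly splitting the bounding box in four, and points sharing a cell land in the same partition.

```python
# Conceptual sketch of quadtree-style spatial partitioning (illustration only,
# not Sedona's internal code): points that share a cell go to one partition.
from collections import defaultdict

def quad_cell(x, y, xmin, ymin, xmax, ymax, depth):
    """Return the quadtree cell id for a point at a given subdivision depth."""
    cell = []
    for _ in range(depth):
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        qx, qy = int(x >= xmid), int(y >= ymid)   # which quadrant?
        cell.append(qy * 2 + qx)
        xmin, xmax = (xmid, xmax) if qx else (xmin, xmid)
        ymin, ymax = (ymid, ymax) if qy else (ymin, ymid)
    return tuple(cell)

def partition(points, bounds, depth=2):
    """Group points by quadtree cell; each group maps to one spatial partition."""
    parts = defaultdict(list)
    for x, y in points:
        parts[quad_cell(x, y, *bounds, depth)].append((x, y))
    return parts

parts = partition([(1, 1), (9, 9), (1.2, 0.8)], (0, 0, 10, 10))
```

A range query then only needs to visit the partitions whose cells overlap the query window, rather than scanning every point.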
Once the data is loaded and indexed, spatial queries can be executed using Sedona’s query execution engine. Sedona supports a wide range of spatial operations, such as spatial joins, distance calculations, and spatial aggregations.
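Spatial joins typically follow a filter-and-refine pattern: a cheap bounding-box test prunes candidate pairs before the exact geometric predicate runs. The following stand-alone sketch (plain Python, for illustration rather than Sedona's engine) shows a distance join built that way.

```python
# Conceptual sketch of a distance join using filter-and-refine:
# a cheap bounding-box overlap test prunes pairs before the exact check.
import math

def bbox_overlaps(a, b):
    axmin, aymin, axmax, aymax = a
    bxmin, bymin, bxmax, bymax = b
    return axmin <= bxmax and bxmin <= axmax and aymin <= bymax and bymin <= aymax

def distance_join(left, right, max_dist):
    """Return pairs of points within max_dist of each other."""
    out = []
    for lx, ly in left:
        lbox = (lx - max_dist, ly - max_dist, lx + max_dist, ly + max_dist)
        for rx, ry in right:
            if bbox_overlaps(lbox, (rx, ry, rx, ry)):          # cheap filter
                if math.hypot(lx - rx, ly - ry) <= max_dist:   # exact refine
                    out.append(((lx, ly), (rx, ry)))
    return out
```

In Sedona itself the same operation is expressed declaratively, e.g. `SELECT * FROM a JOIN b ON ST_Distance(a.geom, b.geom) <= 2`, and the engine handles the filtering, indexing, and distribution.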
Sedona optimizes spatial queries to improve performance. The query optimizer determines an efficient query plan by considering the spatial predicates, available indexes, and the distribution of data across the cluster.
Spatial queries are executed in a distributed manner using Sedona’s computational capabilities. The query execution engine distributes the query workload across the cluster, with each node processing a portion of the data. Intermediate results are combined to produce the final result set. Since spatial objects can be very complex with many coordinates and topology, Sedona implements a custom serializer for efficiently moving spatial data throughout the cluster.
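The motivation for a custom serializer can be seen in miniature: a coordinate sequence packs into a compact fixed-width binary buffer far more cheaply than generic object serialization. The sketch below is an illustration of that idea only, not Sedona's actual wire format.

```python
# Conceptual sketch of compact spatial serialization (illustration only, not
# Sedona's actual format): a count header followed by packed double pairs.
import struct

def serialize_ring(coords):
    """Encode [(x, y), ...] as a 4-byte count plus 16 bytes per coordinate."""
    buf = struct.pack("<i", len(coords))
    for x, y in coords:
        buf += struct.pack("<dd", x, y)
    return buf

def deserialize_ring(buf):
    """Decode the buffer back into a list of (x, y) tuples."""
    (n,) = struct.unpack_from("<i", buf, 0)
    return [struct.unpack_from("<dd", buf, 4 + 16 * i) for i in range(n)]

ring = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0)]
assert deserialize_ring(serialize_ring(ring)) == ring
```

A fixed-width layout like this keeps shuffle payloads small and makes decoding on the receiving node a simple, predictable scan.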
Organizations use Apache Sedona for a range of large-scale geospatial data processing tasks, including:

- Creating weather, climate, and environmental quality reports at national scale by combining vector parcel data with raster data
- Generating planetary-scale GeoParquet files for public distribution via cloud storage
- Converting billions of daily point telemetry observations into vehicle routes
- Enriching parcel-level data with demographic and environmental information for real estate investment analysis
Many of these use cases can be described as geospatial ETL operations. ETL (extract, transform, load) is a data integration process that involves retrieving data from various sources, transforming and combining these datasets, then loading the transformed data into a target system or format for reporting or further analysis. Geospatial ETL shares many of the challenges and requirements of traditional ETL, with the added complexity of managing the geospatial component of the data: working with geospatial data sources and formats, handling spatial data types and transformations, and meeting the scalability and performance demands of spatial operations such as joins based on spatial relationships.
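The extract-transform-load shape of these pipelines can be sketched in a few lines of plain Python (a small stand-in for a distributed Sedona job; the sample data and bounding box are invented for illustration).

```python
# Minimal geospatial ETL sketch: extract point rows from CSV, transform by
# filtering to a bounding box, and load the result as GeoJSON features.
# The sample data and bounding box below are illustrative only.
import csv, io, json

RAW = "id,x,y\n1,2.0,3.0\n2,50.0,50.0\n"
BBOX = (0.0, 0.0, 10.0, 10.0)  # xmin, ymin, xmax, ymax

def etl(raw_csv, bbox):
    xmin, ymin, xmax, ymax = bbox
    features = []
    for row in csv.DictReader(io.StringIO(raw_csv)):      # extract
        x, y = float(row["x"]), float(row["y"])
        if xmin <= x <= xmax and ymin <= y <= ymax:       # transform (filter)
            features.append({
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [x, y]},
                "properties": {"id": row["id"]},
            })
    return json.dumps({"type": "FeatureCollection",
                       "features": features})             # load

result = etl(RAW, BBOX)
```

In a real Sedona pipeline the same three stages run distributed: readers ingest files or tables, spatial SQL performs the transforms, and writers emit formats such as GeoParquet.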
For a real-world example of Apache Sedona in production, watch how Comcast data engineer David Buchanan used Apache Sedona to optimize geospatial ETL pipelines at scale, reducing processing time from 5 hours to 30 minutes:
Apache Sedona is one of the most widely adopted geospatial analytics libraries in the distributed computing ecosystem, with active use across industries including transportation, logistics, environmental monitoring, and insurance. As a top-level Apache Software Foundation (ASF) project since February 2023, Sedona’s governance, licensing, and community participation align with ASF principles.
Sedona has an active and growing developer community, with more than 100 contributors from a wide range of organizations working to advance the state of geospatial analytics and distributed computing. As of October 2024, Apache Sedona had surpassed 38 million downloads, with approximately 2 million downloads per month and year-over-year usage growth of 200%.
Organizations in industries including transportation, urban planning, environmental monitoring, logistics, insurance and risk analysis and more have adopted Apache Sedona. These organizations leverage Sedona’s capabilities to perform large-scale geospatial analysis, extract insights from geospatial data and build geospatial analytical applications at scale.
Apache Sedona has been featured in conferences, workshops, and research publications related to geospatial analytics, distributed computing, and big data processing. For a deeper look at Apache Sedona’s adoption and real-world impact, watch Iceberg Geo Type: Transforming Geospatial Data Management at Scale, presented by Jia Yu and Szehon Ho at Data + AI Summit.
For quick access to documentation and community support:
For a comprehensive introduction, “Cloud Native Geospatial Analytics with Apache Sedona,” published with O’Reilly, covers how to work with large-scale spatial data using Apache Sedona, Apache Spark, and modern cloud technologies. It is aimed at developers, data scientists, and data engineers.
Apache Sedona is a cluster computing system for processing large-scale spatial data. It extends the functionality of distributed compute frameworks including Apache Spark, Apache Flink, and Snowflake, treating spatial data as a first-class citizen with specialized data types, operations, and indexing techniques optimized for spatial workloads.
Yes. Apache Sedona is an Apache Software Foundation (ASF) project. Its governance, licensing, and community participation align with ASF principles.
Apache Sedona supports Java, Python, R, Scala, and SQL.
Apache Sedona is used for large-scale geospatial ETL operations and spatial data analysis. Specific use cases mentioned on the page include: creating weather, climate, and environmental quality reports at national scale by combining vector parcel data with raster data; generating planetary-scale GeoParquet files for public distribution via cloud storage; converting billions of daily point telemetry observations into vehicle routes; and enriching parcel-level data with demographic and environmental information for real estate investment analysis.
Apache Sedona was initiated as GeoSpark by Jia Yu and Mohamed “Mo” Sarwat at Arizona State University in 2015. In 2020, the project entered the Apache Incubator, and in February 2023 it graduated as a top-level ASF project.
Apache Sedona can ingest data from Shapefiles, GeoJSON, Parquet, GeoTIFF, and CSV files, as well as from databases.