What is Apache Sedona?

Apache Sedona - Simplified diagram

Apache Sedona Overview

Apache Sedona is a cluster computing system for processing large-scale spatial data. It treats spatial data as a first-class citizen by extending the functionality of distributed compute frameworks like Apache Spark, Apache Flink, and Snowflake. Apache Sedona was created at Arizona State University under the name GeoSpark and was introduced in the paper “Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond.”

Apache Sedona introduces data types, operations, and indexing techniques optimized for spatial workloads on top of Apache Spark and other distributed compute frameworks. Let’s take a look at the workflow for analyzing spatial data with Apache Sedona.

Apache Sedona Architecture

Spatial Query Processing

The first step in spatial query processing is to ingest geospatial data into Apache Sedona. Data can be loaded from various sources, such as files (Shapefiles, GeoJSON, Parquet, GeoTIFF, CSV, etc.) or databases, into Apache Sedona’s in-memory distributed spatial data structures (typically the Spatial DataFrame).
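As a rough illustration of what ingestion produces, the sketch below flattens a GeoJSON FeatureCollection into rows that carry attribute columns alongside a geometry column. It is plain Python using only the standard library, not Sedona's reader API; the function name `load_features` and the sample data are made up for this example.

```python
import json


def load_features(geojson_text):
    """Flatten a GeoJSON FeatureCollection into rows with a geometry column."""
    collection = json.loads(geojson_text)
    rows = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        row = dict(feature.get("properties") or {})  # attribute columns
        row["geom_type"] = geometry["type"]          # e.g. Point, LineString
        row["coordinates"] = geometry["coordinates"]
        rows.append(row)
    return rows


# Hypothetical sample data, made up for illustration.
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"name": "depot"},
         "geometry": {"type": "Point", "coordinates": [-111.93, 33.42]}},
        {"type": "Feature",
         "properties": {"name": "route"},
         "geometry": {"type": "LineString",
                      "coordinates": [[-111.93, 33.42], [-112.07, 33.45]]}},
    ],
})

rows = load_features(SAMPLE)
for row in rows:
    print(row["name"], row["geom_type"])
```

In a distributed engine the same idea applies at scale: each input record becomes a row whose geometry is a typed value that later operations can index and query.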

Next, Sedona uses spatial indexing techniques, such as R-trees or quadtrees, to accelerate query processing. The spatial index is used to partition the data into smaller, manageable units, enabling efficient data retrieval during query processing.
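To make the indexing idea concrete, here is a minimal point quadtree in plain Python. It is a teaching sketch, not Sedona's implementation: the class name, the `capacity` and `max_depth` parameters, and the sample grid are all assumptions chosen for illustration. Each node splits into four quadrants once it holds more than `capacity` points, and a range query skips any subtree whose bounds cannot intersect the query rectangle.

```python
class QuadTree:
    """Point quadtree: a leaf splits into four quadrants when it overflows."""

    def __init__(self, xmin, ymin, xmax, ymax, capacity=4, depth=0, max_depth=8):
        self.bounds = (xmin, ymin, xmax, ymax)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth  # stops endless splitting on duplicate points
        self.points = []
        self.children = None        # None while this node is a leaf

    def insert(self, x, y):
        xmin, ymin, xmax, ymax = self.bounds
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            return False  # point lies outside this node
        if self.children is None:
            self.points.append((x, y))
            if len(self.points) > self.capacity and self.depth < self.max_depth:
                self._split()
            return True
        # Store the point in the first child that accepts it.
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        xmin, ymin, xmax, ymax = self.bounds
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            QuadTree(xmin, ymin, xmid, ymid, *args),
            QuadTree(xmid, ymin, xmax, ymid, *args),
            QuadTree(xmin, ymid, xmid, ymax, *args),
            QuadTree(xmid, ymid, xmax, ymax, *args),
        ]
        points, self.points = self.points, []
        for x, y in points:
            self.insert(x, y)  # redistribute into the new children

    def query(self, qxmin, qymin, qxmax, qymax):
        """Return all stored points inside the query rectangle."""
        xmin, ymin, xmax, ymax = self.bounds
        if qxmax < xmin or qxmin > xmax or qymax < ymin or qymin > ymax:
            return []  # prune: this subtree cannot intersect the query
        hits = [(x, y) for x, y in self.points
                if qxmin <= x <= qxmax and qymin <= y <= qymax]
        for child in self.children or []:
            hits.extend(child.query(qxmin, qymin, qxmax, qymax))
        return hits


# A 10 x 10 grid of sample points (made up for illustration).
tree = QuadTree(0, 0, 100, 100)
for i in range(10):
    for j in range(10):
        tree.insert(i * 10 + 5, j * 10 + 5)

found = sorted(tree.query(0, 0, 20, 20))
print(found)
```

The pruning step in `query` is what makes the index pay off: most of the tree is never visited for a small query window.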

Once the data is loaded and indexed, spatial queries can be executed using Sedona’s query execution engine. Sedona supports a wide range of spatial operations, such as spatial joins, distance calculations, and spatial aggregations.
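The sketch below shows, in plain Python, what a spatial join followed by an aggregation computes: each point is matched to the polygon that contains it, and matches are then counted per polygon. It is a single-machine nested-loop illustration with hypothetical names and data, not how Sedona executes a join; in Sedona the same logic would typically be expressed with a spatial predicate such as `ST_Contains` in SQL.

```python
from collections import Counter


def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices (not closed)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, zones):
    """Pair each point id with the name of every zone that contains it."""
    pairs = []
    for point_id, (x, y) in points.items():
        for zone_name, ring in zones.items():
            if point_in_polygon(x, y, ring):
                pairs.append((point_id, zone_name))
    return pairs


# Hypothetical zones and points, made up for illustration.
zones = {
    "west": [(0, 0), (5, 0), (5, 10), (0, 10)],
    "east": [(5, 0), (10, 0), (10, 10), (5, 10)],
}
points = {"a": (1, 1), "b": (2, 8), "c": (7, 3), "d": (12, 5)}

pairs = spatial_join(points, zones)
counts = Counter(zone for _, zone in pairs)  # spatial aggregation
print(pairs)
print(dict(counts))
```

The nested loop compares every point with every polygon, which is exactly the cost that spatial partitioning and indexing are meant to avoid on large datasets.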

Sedona optimizes spatial queries to improve performance. The query optimizer determines an efficient query plan by considering the spatial predicates, available indexes, and the distribution of data across the cluster.

Spatial queries are executed in a distributed manner using Sedona’s computational capabilities. The query execution engine distributes the query workload across the cluster, with each node processing a portion of the data. Intermediate results are combined to produce the final result set. Since spatial objects can be very complex with many coordinates and topology, Sedona implements a custom serializer for efficiently moving spatial data throughout the cluster.
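To see why a purpose-built serializer helps, consider the following minimal sketch, which packs a line's coordinates into a compact binary layout: a 4-byte point count followed by 8-byte floats. This is not Sedona's actual serialization format; the layout and the function names are assumptions chosen to illustrate the idea of a fixed, predictable encoding for geometry data.

```python
import struct


def serialize_linestring(coords):
    """Pack (x, y) pairs as a little-endian uint32 count plus float64 values."""
    flat = [value for point in coords for value in point]
    return struct.pack(f"<I{len(flat)}d", len(coords), *flat)


def deserialize_linestring(payload):
    """Inverse of serialize_linestring: rebuild the list of (x, y) pairs."""
    (count,) = struct.unpack_from("<I", payload, 0)
    flat = struct.unpack_from(f"<{2 * count}d", payload, 4)
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical route coordinates, made up for illustration.
line = [(-111.93, 33.42), (-112.07, 33.45), (-112.10, 33.50)]

payload = serialize_linestring(line)
restored = deserialize_linestring(payload)
print(len(payload), restored == line)
```

Each point costs exactly 16 bytes, so the size of a geometry on the wire is predictable, which matters when complex geometries are shuffled between nodes.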

Common Apache Sedona Use Cases

So what exactly are users doing with Apache Sedona? Here are some common examples:

  • Creating custom weather, climate, and environmental quality assessment reports at national scale by combining vector parcel data with environmental raster data products.
  • Generating planetary-scale GeoParquet files for public dissemination via cloud storage by combining, cleaning, and indexing multiple datasets.
  • Converting billions of daily point telemetry observations into routes traveled by vehicles.
  • Enriching parcel-level data with demographic and environmental data at the national level to feed into a real estate investment suitability analysis.

Many of these use cases can be described as geospatial ETL operations. ETL (extract, transform, load) is a data integration process that involves retrieving data from various sources, transforming and combining these datasets, then loading the transformed data into a target system or format for reporting or further analysis. Geospatial ETL shares many of the challenges and requirements of traditional ETL processes, with the additional complexity of managing the geospatial component of the data. That includes working with geospatial data sources and formats, handling spatial data types and transformations, and meeting the scalability and performance requirements of spatial operations such as joins based on spatial relationships. For a more complete overview of use cases with Apache Sedona, you can read our ebook on it here.
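Taking the telemetry use case from the list above, the transform step of such an ETL job can be sketched in plain Python: group point observations by vehicle, order them by time, and emit one route per vehicle as a WKT LINESTRING. The function name, the tuple layout, and the sample observations are assumptions for illustration; a production job would run this grouping in a distributed engine rather than in a single process.

```python
from collections import defaultdict


def build_routes(observations):
    """Turn (vehicle_id, timestamp, lon, lat) points into per-vehicle WKT routes."""
    by_vehicle = defaultdict(list)
    for vehicle_id, timestamp, lon, lat in observations:
        by_vehicle[vehicle_id].append((timestamp, lon, lat))

    routes = {}
    for vehicle_id, fixes in by_vehicle.items():
        fixes.sort()  # order each vehicle's points by timestamp
        if len(fixes) < 2:
            continue  # a single point does not form a line
        coords = ", ".join(f"{lon} {lat}" for _, lon, lat in fixes)
        routes[vehicle_id] = f"LINESTRING ({coords})"
    return routes


# Hypothetical telemetry, deliberately out of order.
observations = [
    ("truck-1", 30, -111.91, 33.44),
    ("truck-1", 10, -111.94, 33.42),
    ("truck-2", 15, -112.07, 33.45),
    ("truck-1", 20, -111.92, 33.43),
]

routes = build_routes(observations)
print(routes)
```

Vehicles with only one observation are dropped here; whether to keep them as points instead is a design choice that depends on the downstream analysis.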

Community Adoption

Apache Sedona has gained significant community adoption and has become a popular geospatial analytics library within the distributed computing and big data ecosystem. As an Apache Software Foundation (ASF) top-level project, Sedona’s governance, licensing, and community participation align with ASF principles.

Sedona has an active and growing developer community, with contributors from many different types of organizations and over 100 individuals interested in advancing the state of geospatial analytics and distributed computing. Sedona has reached over 38 million downloads, at a rate of 2 million downloads per month, with usage growing 200% per year (as of the date this was published).

Organizations in industries including transportation, urban planning, environmental monitoring, logistics, and insurance and risk analysis have adopted Apache Sedona. These organizations use Sedona’s capabilities to perform large-scale geospatial analysis, extract insights from geospatial data, and build geospatial analytical applications at scale. This industry adoption showcases Sedona’s practical relevance and real-world utility.

Apache Sedona has been featured in conferences, workshops, and research publications related to geospatial analytics, distributed computing, and big data processing. These presentations and publications contribute to the awareness, visibility, and adoption both within the enterprise and within the research and academic communities.

Resources

As you get started with Apache Sedona, the following resources will be useful throughout your journey in the world of large-scale geospatial data analytics.

The best place to start learning about Apache Sedona is the authoritative book on the topic, “Cloud Native Geospatial Analytics with Apache Sedona”, which was recently published in early-release format. The team behind the project will continue to release chapters over the coming months until the book is complete.

Cloud-Native Geospatial Analytics with Apache Sedona Book Cover