What is Apache Sedona?

Apache Sedona Overview

Apache Sedona is a cluster computing system for processing large-scale spatial data. It treats spatial data as a first-class citizen by extending the functionality of distributed compute frameworks like Apache Spark, Apache Flink, and Snowflake. Apache Sedona was created at Arizona State University under the name GeoSpark, introduced in the paper “Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond.”

Apache Sedona introduces data types, operations, and indexing techniques optimized for spatial workloads on top of Apache Spark and other distributed compute frameworks. Let’s take a look at the workflow for analyzing spatial data with Apache Sedona.

Apache Sedona Architecture

Spatial Query Processing

The first step in spatial query processing is to ingest geospatial data into Apache Sedona. Data can be loaded from various sources such as files (Shapefiles, GeoJSON, Parquet, GeoTiff, CSV, etc.) or databases into Apache Sedona’s in-memory distributed spatial data structures (typically the Spatial DataFrame).

Next, Sedona uses spatial indexing techniques, such as R-trees or Quadtrees, to accelerate query processing. The spatial index is used to partition the data into smaller, manageable units, enabling efficient data retrieval during query processing.
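
For example, with Sedona’s lower-level RDD API, spatial partitioning and per-partition indexing can be requested explicitly. A minimal sketch, assuming df is a Spatial DataFrame with a geometry column (the grid and index choices here are illustrative):

from sedona.core.enums import GridType, IndexType
from sedona.utils.adapter import Adapter

spatial_rdd = Adapter.toSpatialRdd(df, "geometry")  # convert the Spatial DataFrame to a SpatialRDD
spatial_rdd.analyze()                               # compute dataset boundary statistics
spatial_rdd.spatialPartitioning(GridType.KDBTREE)   # co-locate nearby objects in the same partition
spatial_rdd.buildIndex(IndexType.RTREE, True)       # build an R-tree on each spatial partition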

Once the data is loaded and indexed, spatial queries can be executed using Sedona’s query execution engine. Sedona supports a wide range of spatial operations, such as spatial joins, distance calculations, and spatial aggregations.
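
To make the workflow concrete, here is a minimal sketch of a spatial join using Sedona’s SQL API; the file paths and column names are hypothetical:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Ingest: load two datasets into Spatial DataFrames (paths are hypothetical)
countries = sedona.read.format("geoparquet").load("s3a://my-bucket/countries.geoparquet")
cities = sedona.read.format("geoparquet").load("s3a://my-bucket/cities.geoparquet")
countries.createOrReplaceTempView("countries")
cities.createOrReplaceTempView("cities")

# Query: a spatial join matching each city to the country polygon that contains it
result = sedona.sql("""
    SELECT cities.name AS city, countries.name AS country
    FROM cities JOIN countries
    ON ST_Within(cities.geometry, countries.geometry)
""")
result.show()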

Sedona optimizes spatial queries to improve performance. The query optimizer determines an efficient query plan by considering the spatial predicates, available indexes, and the distribution of data across the cluster.

Spatial queries are executed in a distributed manner using Sedona’s computational capabilities. The query execution engine distributes the query workload across the cluster, with each node processing a portion of the data. Intermediate results are combined to produce the final result set. Since spatial objects can be very complex with many coordinates and topology, Sedona implements a custom serializer for efficiently moving spatial data throughout the cluster.
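
When building a Spark session by hand, this serializer is enabled through Spark’s Kryo settings; a minimal sketch, using the class names Sedona documents for Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("sedona-app") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator") \
    .getOrCreate()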

Common Apache Sedona Use Cases

So what exactly are users doing with Apache Sedona? Here are some common examples:

  • Creating custom weather, climate, and environmental quality assessment reports at national scale by combining vector parcel data with environmental raster data products.
  • Generating planetary scale GeoParquet files for public dissemination via cloud storage by combining, cleaning, and indexing multiple datasets.
  • Converting billions of daily point telemetry observations into routes traveled by vehicles.
  • Enriching parcel level data with demographic and environmental data at the national level to feed into a real estate investment suitability analysis.

Many of these use cases can be described as geospatial ETL operations. ETL (extract, transform, load) is a data integration process that involves retrieving data from various sources, transforming and combining those datasets, and then loading the transformed data into a target system or format for reporting or further analysis. Geospatial ETL shares many of the challenges and requirements of traditional ETL, with the added complexities of managing the geospatial component of the data: working with geospatial data sources and formats, spatial data types and transformations, and the scalability and performance considerations required for spatial operations such as joins based on spatial relationships. For a more complete overview of Apache Sedona use cases, you can read our ebook here.

Community Adoption

Apache Sedona has gained significant community adoption and has become a popular geospatial analytics library within the distributed computing and big data ecosystem. As an Apache Software Foundation (ASF) project, Sedona’s governance, licensing, and community participation align with ASF principles.

Sedona has an active and growing developer community, with contributors from many different types of organizations and over 100 individuals interested in advancing the state of geospatial analytics and distributed computing. Sedona has reached over 38 million downloads, is currently downloaded about 2 million times per month, and usage is growing at a rate of 200% per year (as of the date this was published).

Organizations in industries including transportation, urban planning, environmental monitoring, logistics, insurance and risk analysis, and more have adopted Apache Sedona. These organizations leverage Sedona’s capabilities to perform large-scale geospatial analysis, extract insights from geospatial data, and build geospatial analytical applications at scale. This industry adoption showcases Sedona’s practical relevance and real-world utility.

Apache Sedona has been featured in conferences, workshops, and research publications related to geospatial analytics, distributed computing, and big data processing. These presentations and publications contribute to its awareness, visibility, and adoption both within the enterprise and within the research and academic communities.

Resources

As you get started with Apache Sedona, the following resources will be useful throughout your journey in the world of large-scale geospatial data analytics.

The best place to start learning about Apache Sedona is the authoritative book on the topic, “Cloud Native Geospatial Analytics with Apache Sedona,” which was recently published in early-release format. The team behind the project will continue to release chapters over the coming months until the book is complete. You can get the latest version of the book here.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

Announcing SAML Single Sign-On (SSO) Support

We’re excited to announce that Wherobots Cloud now supports SAML SSO for Professional Edition customers. This enhancement underscores our commitment to providing robust security features tailored to the needs of businesses that handle sensitive data.

What is SAML Single Sign-On?

SAML Single Sign-On diagram

Many companies already use SAML SSO. It is the mechanism by which you log in to third-party tools with your company’s centralized login (Google, Okta, etc.).

SAML (Security Assertion Markup Language) and SSO (Single Sign-On) are essential components of modern security protocols. SAML is an open standard for exchanging authentication and authorization data between parties, and SSO allows users to log in once and gain access to multiple systems without re-entering credentials.

How to enable SAML SSO

Setting up SAML SSO is straightforward. Any Professional Edition Wherobots administrator can enable this feature by following these steps:

  1. Verify Your Domain:

    Go to your organization’s settings, copy the provided TXT DNS record, and add it to your domain provider’s DNS settings (an example record is sketched after these steps). Once done, click “Verify” in Wherobots.

  2. Configure Your Identity Provider (IdP):

    Log in to your IdP (e.g., Google Workspace, Okta, OneLogin, Azure AD) and configure it using the SAML details provided in the Wherobots settings.
    SAML Configuration

  3. Enter IdP Details into Wherobots:

    Input the Identity Provider Issuer URL, SSO URL, and certificate details from your IdP into the SAML section in Wherobots.
    SAML IdP Details view

  4. Enable SAML SSO:

    Make sure there’s an admin user with an email from the verified domain. Then, toggle the “Enable” switch in Wherobots to activate SSO.

  5. Test the Integration:

    Log in using your verified domain email to ensure everything is set up correctly.
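
For illustration, the domain-verification TXT record added at your DNS provider in step 1 generally looks like the line below; the host, TTL, and token value here are hypothetical, so use the exact record shown in your organization’s settings:

example.com.  3600  IN  TXT  "wherobots-domain-verification=1a2b3c4d5e6f"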

For more detailed step-by-step instructions, check out our comprehensive getting started guide.

Why Your Organization Should Use SAML Single Sign-On

Benefits to Users

For users, SAML SSO simplifies the login process. With fewer passwords to remember and less frequent logins, users save time and reduce frustration. This streamlined access means employees can focus more on their tasks and less on managing multiple credentials.

Benefits to Organizations

Organizations benefit from SAML SSO through enhanced security and reduced administrative overhead. Centralized authentication means fewer password-related support tickets and a lower risk of phishing attacks, as users are less likely to fall for credential-stealing schemes. Moreover, it ensures compliance with security policies and regulatory requirements by enforcing strong, consistent access controls. It also allows organizations to grant access to certain third-party services only to specific users or groups.

Location data is particularly sensitive, as it can reveal personal habits, routines, and even confidential business operations. For example, in healthcare, location data of patients visiting specialized clinics could inadvertently disclose medical information. By implementing SAML SSO, organizations can better control access, ensuring that only authorized personnel can view and interact with this information within the Wherobots platform.

Elevate Your Security with SAML SSO

Take this opportunity to simplify authentication, reduce security risks, and improve productivity. If you are not already a Professional Edition customer, upgrade to gain access to SAML SSO. By upgrading, you’ll not only bolster your security measures and simplify logins, but you’ll gain access to more powerful workloads, service principals, and more. Contact us to schedule a demo or get started!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

Unlock Satellite Imagery Insights with WherobotsAI Raster Inference

Recently we introduced WherobotsAI Raster Inference, which simplifies extracting insights from satellite and aerial imagery using SQL or Python and is powered by open-source machine learning models. This feature is currently in preview, and we are expanding its capabilities to support more models. Below we’ll dig into the popular computer vision tasks that Raster Inference supports, describe how it works, and show how you can use it to run batch inference to find and map electricity infrastructure.

Watch the live demo of these capabilities here.

The Power of Machine Learning with Satellite Imagery

Petabytes of satellite imagery are generated each day all over the world in a dizzying number of sensor types and image resolutions. The applications for satellite imagery and other remote sensing data sources are broad and diverse. For example, satellites with consistent, continuous orbits are ideal for monitoring forest carbon stocks to validate carbon credits or estimating agricultural yields.

However, this data has been inaccessible for most analysts and even seasoned ML practitioners because insight extraction required specialized skills. We’ve done the work to make insight extraction simple and accessible to more people. Raster Inference abstracts the complexity and scales to support planetary-scale imagery datasets, so you don’t need ML expertise to derive insights. In this blog, we explore the key features that make Raster Inference effective for land cover classification, solar farm mapping, and marine infrastructure detection. And, in the near future, you will be able to use Raster Inference with your own models!

Introduction to Popular and Supported Machine Learning Tasks

Raster Inference supports the three most common kinds of computer vision models applied to imagery: classification, object detection, and semantic segmentation. Instance segmentation (which combines object localization and semantic segmentation) is another common type of model that is not currently supported, but let us know if you need it by contacting us and we can add it to the roadmap.

Computer Vision Detection Types
Computer Vision Detection Categories from Lin et al. Microsoft COCO: Common Objects in Context

The figure above illustrates these tasks. Image classification is when an image is assigned one or more text labels. In image (a), the scene is assigned the labels “person”, “sheep”, and “dog”. Image (b) is an example of object localization (or object detection). Object localization creates bounding boxes around objects of interest and assigns labels. In this image, five sheep are localized separately along with one human and one dog. Finally, semantic segmentation is when each pixel is given a category label, as shown in image (c). Here we can see all the pixels belonging to sheep are labeled blue, the dog is labeled red, and the person is labeled teal.

While these examples highlight detection tasks on everyday imagery, the same computer vision models can be applied to raster-formatted imagery, the most common family of data formats for satellite and aerial imagery. When objects of interest in raster imagery are localized, their bounding boxes can be georeferenced, meaning each pixel is mapped to spatial coordinates, such as latitude and longitude. Georeferenced object localization is therefore directly usable for spatial analytics.
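
As a sketch of the underlying idea, a pixel-space bounding box can be georeferenced with the raster’s affine transform; this example assumes the rasterio library and a hypothetical scene file:

import rasterio

with rasterio.open("scene.tif") as src:  # hypothetical scene
    # A detector's pixel-space bounding box: (col_min, row_min, col_max, row_max)
    col_min, row_min, col_max, row_max = 120, 80, 180, 140
    x_min, y_max = src.transform * (col_min, row_min)  # upper-left corner in map coordinates
    x_max, y_min = src.transform * (col_max, row_max)  # lower-right corner in map coordinates
    print((x_min, y_min, x_max, y_max), src.crs)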

https://wherobots.com/wp-content/uploads/2024/06/remotesensing-11-00339-g005.png

The example above shows various applications of object detection for localizing and classifying features in high resolution satellite and aerial imagery. This example comes from DOTA, a 15-class dataset of different objects in RGB and grayscale satellite imagery. Public datasets like DOTA are used to develop and benchmark machine learning models.

Beyond the many publicly available object detection models, there are also many semantic segmentation models.

Semantic Segmentation
Sourced from “A Scale-Aware Masked Autoencoder for Multi-scale Geospatial Representation Learning”.

No two machine learning models should be treated equally; each has its own tradeoffs. You can see the difference between the ground truth image (human-annotated buildings representing the real world) and the segmentation results across two models (Scale-MAE and Vanilla MAE). These results are derived from the same image at two different resolutions (referred to as GSD, or Ground Sampling Distance).

  • Scale-MAE is a model developed to handle detection tasks at various resolutions with different sensor inputs. It uses a similar MAE model architecture as the Vanilla MAE, but is trained specifically for detection tasks on overhead imagery that span different resolutions.
  • The Vanilla MAE is not trained to handle varying resolutions in overhead imagery. Its performance suffers in the top row and especially the bottom row, where resolution is coarser, as seen by the mismatch between the Vanilla MAE output and the ground truth image, where many pixels are incorrectly classified.

Satellite Analytics Before Raster Inference

Without Raster Inference, a team looking to extract insights from overhead imagery using ML would typically need to:

  1. Deploy a distributed runtime to scale out workloads such as data loading, preprocessing, and inference.
  2. Develop functionality to operate on raster metadata to easily filter it by location to run inference workloads on specific areas of interest.
  3. Optimize models to run performantly on GPUs, which can involve complex rewrites of the underlying model prediction logic.
  4. Create and manage data preprocessing pipelines to normalize, resize, and collate raster imagery into the correct data type and size required by the model (a sketch of this step follows the list).
  5. Develop the logic to run data loading, preprocessing, and model inference efficiently at scale.
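
For instance, step 4 alone typically means hand-writing chip-preparation logic along these lines (a simplified sketch; the normalization scheme and input size are illustrative):

import numpy as np

def preprocess_chip(chip: np.ndarray, out_size: int = 512) -> np.ndarray:
    # chip has shape (bands, height, width); min-max scale values to [0, 1]
    chip = chip.astype(np.float32)
    chip = (chip - chip.min()) / max(float(chip.max() - chip.min()), 1e-6)
    # pad or crop to the fixed input size the model expects
    bands, h, w = chip.shape
    canvas = np.zeros((bands, out_size, out_size), dtype=np.float32)
    canvas[:, :min(h, out_size), :min(w, out_size)] = chip[:, :out_size, :out_size]
    return canvas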

Raster Inference and its SQL and Python APIs abstract this complexity so you and your team can easily perform inference on massive raster datasets.

Raster Inference APIs for SQL and Python

Raster Inference offers APIs in both SQL and Python to run inference tasks. These APIs are designed to be easy to use, even if you’re not a machine learning expert. RS_CLASSIFY can be used for scene classification, RS_BBOXES_DETECT for object detection, and RS_SEGMENT for semantic segmentation. Each function produces tabular results which can be georeferenced either for the scene, object, or segmentation depending on the function. The records can be joined or visualized with other data (geospatial or traditional) to curate enriched datasets and insights. Here are SQL and Python examples for RS_SEGMENT.

-- SQL
RS_SEGMENT('{model_id}', outdb_raster) AS segment_result

# Python
df = df_raster_input.withColumn("segment_result", rs_segment(model_id, col("outdb_raster")))

Example: Mapping Electricity Infrastructure

Imagine you want to optimize the location of new EV charging stations, but you want to target locations based on the availability of green energy sources, such as local solar farms. You can use Raster Inference to detect and locate solar farms and cross-reference these locations with internal data or other vector geometries that capture demand for EV charging. This use case will be demonstrated in our upcoming release webinar on July 10th.

Let’s walk through how to use Raster Inference for this use case.

First, we run predictions on rasters to find solar farms. The following code block, which calls RS_SEGMENT, shows how easy this is.

CREATE OR REPLACE TEMP VIEW segment_fields AS (
    SELECT
        outdb_raster,
        RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
    FROM
        az_high_demand_with_scene
)

The confidence_array column produced by RS_SEGMENT can be assigned the same geospatial coordinates as the raster input and converted to a vector that can be spatially joined and processed with WherobotsDB using RS_SEGMENT_TO_GEOMS. We select a confidence threshold of 0.65 so that we only georeference high-confidence detections.

WITH t AS (
    SELECT RS_SEGMENT_TO_GEOMS(outdb_raster, confidence_array, array(1), class_map, 0.65) AS result
    FROM predictions_df
)
SELECT result.* FROM t
+----------+--------------------+--------------------+
|     class|avg_confidence_score|            geometry|
+----------+--------------------+--------------------+
|Solar Farm|  0.7205783606825462|MULTIPOLYGON (((-...|
|Solar Farm|  0.7273308333550763|MULTIPOLYGON (((-...|
|Solar Farm|  0.7301468510823231|MULTIPOLYGON (((-...|
|Solar Farm|  0.7180177244988899|MULTIPOLYGON (((-...|
|Solar Farm|   0.728077805771141|MULTIPOLYGON (((-...|
|Solar Farm|     0.7264981572898|MULTIPOLYGON (((-...|
|Solar Farm|  0.7044100126912517|MULTIPOLYGON (((-...|
|Solar Farm|  0.7137283466756343|MULTIPOLYGON (((-...|
+----------+--------------------+--------------------+

This allows us to integrate the vectorized model predictions with other spatial datasets and easily visualize the results with SedonaKepler.
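
For example, rendering the vectorized predictions is a one-liner (the layer name here is illustrative):

SedonaKepler.create_map(predictions_polys, "Solar Farm Predictions")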

https://wherobots.com/wp-content/uploads/2024/06/solar_farm_detection-1-1024x398.png

Here Raster Inference runs on an 85 GiB dataset with 2,200 raster scenes for Arizona. Using a Sedona (tiny) runtime, Raster Inference completed in 430 seconds, predicting solar farms for all low-cloud-cover satellite images for the state of Arizona for the month of October. If we scale up to a San Francisco (small) runtime, the inference speed nearly doubles. In general, average bytes processed per second by Wherobots increases as datasets scale in size because startup costs are amortized over time. Processing speed also increases as runtimes scale in size.

Runtime Size             Inference time (seconds)
Sedona (tiny)            430
San Francisco (small)    246

We use the output of Raster Inference to derive insights about which zip codes have the most solar farms, as shown below. This statement joins predicted solar farms with zip codes by location, then ranks zip codes by the pre-computed solar farm area within each zip code. We skipped the area computation step for brevity, but you can see it and the other steps in the notebook example.

az_solar_zip_codes = sedona.sql("""
SELECT SUM(solar_area) AS solar_area, any_value(az_zta5.geometry) AS geometry, ZCTA5CE10
FROM predictions_polys JOIN az_zta5
ON ST_Intersects(az_zta5.geometry, predictions_polys.geometry)
GROUP BY ZCTA5CE10
ORDER BY solar_area DESC
""")

https://wherobots.com/wp-content/uploads/2024/06/final_analysis.png

These predictions are made possible by SATLAS, a family of machine learning models released with Apache 2.0 licensing by Allen AI. The solar model demonstrated above was derived from the SATLAS foundation model, which can be used as a building block to create models that address specific detection challenges like solar farm detection. Additionally, there are many other open source machine learning models available for deriving insights from satellite imagery, many of which are provided by the TorchGeo project. We are just beginning to explore what these models can achieve for planetary-scale monitoring.

If you have a specific model you would like to see made available, please contact us to let us know.

For detailed instructions on using Raster Inference, please refer to our example Jupyter notebooks in the documentation.

https://wherobots.com/wp-content/uploads/2024/06/Screenshot_2024-06-08_at_2.11.07_PM-1024x683.png

Here are some links to get you started:
https://docs.wherobots.com/latest/tutorials/wherobotsai/wherobots-inference/segmentation/

https://docs.wherobots.com/latest/api/wherobots-inference/pythondoc/inference/sql_functions/

Getting Started

Getting started with WherobotsAI Raster Inference is easy. We’ve provided three models in Wherobots Cloud that can be used with our GPU optimized runtimes. Sign up for an account on Wherobots Cloud, send us a note to access the professional tier, start a GPU runtime, and you can run our example Jupyter notebooks to analyze satellite imagery in SQL or Python.

Stay tuned for updates on improvements to Raster Inference that will make it possible to run more models, including your own custom models. We’re excited to hear what models you’d like us to support, or the integrations you need to make running your own models even easier with Raster Inference. We can’t wait for your feedback and to see what you’ll create!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

Wherobots Cloud: The Cloud-native Spatial Analytics Data Platform

According to Gartner, 97% of data collected at the enterprise sits on the shelf without being put to use. That is a shockingly big number, especially given that the data industry got its hopes up a few years back when the Economist published the article “The world’s most valuable resource is no longer oil, but data.” It is also quite surprising given the hundreds of billions of dollars invested in database and analytics platforms over the past two decades.

One main reason is that data professionals often struggle to connect data to use cases. A natural way for them to do that is to link their data and insights to data about the physical world, aka “spatial data,” and then ask physical-world questions of that data, aka “spatial analytics.” This spatial approach can be an indispensable asset for businesses worldwide. Use cases range from determining optimal delivery routes, to making informed decisions about property investments, to climate and agricultural technology. For instance, commercial real estate data makes more sense when connected to spatial data about nearby objects (e.g., buildings, POIs), man-made events (e.g., crimes, traffic), as well as natural events such as wildfires and floods. The importance of understanding the ‘where’ cannot be overstated, especially when it influences operational efficiency, customer satisfaction, and strategic planning.

The significance of spatial analytics underscores the pressing need for its efficient management within the enterprise data stack. Incumbent data platforms, often not built to handle the intricacies and scale of spatial analytics, fall short in meeting these demands. Recognizing this gap, we introduce Wherobots Cloud, a novel spatial analytics database platform. Here is a summary of features supported:

Wherobots Key Features

Linking Enterprise Data to the Spatial World

Wherobots seamlessly incorporates spatial analytics in the enterprise data stack to bring data many steps closer to use cases. Using a scalable spatial join technology, Wherobots can link customer data stored anywhere to tens of terabytes of spatial data such as maps, roads, buildings, natural events, and man-made events in a few minutes. Users can then apply spatial data processing, analytics, and AI tasks using SQL and Python on their data with unparalleled efficiency and adaptability.

To get started with Wherobots, please visit the Wherobots website.

Scalability

With its scalable, distributed architecture, Wherobots is redefining the way businesses handle geometry and raster spatial data processing and analytics in the cloud. Wherobots achieves that in two main ways:

  1. Separating Compute/Storage: Wherobots uniquely separates the spatial processing and analytics layer from the data storage layer. This approach allows for optimal performance and scalability.
  2. Distributed System Architecture: By employing a distributed system architecture, Wherobots ensures scalable out-of-core spatial computation, catering to massive datasets without compromising speed or accuracy.

Openness

Wherobots builds upon and amplifies the capabilities seen in the open-source Apache Sedona (OSS Sedona). While OSS Sedona provides foundational spatial analytics functions using spatial SQL and Python, Wherobots takes it to the next level with its faster query processing, lakehouse architecture, and its self-service yet fully-managed provisioning on Wherobots Cloud. This makes Wherobots a more comprehensive and streamlined solution for businesses. Based on our benchmarks, Wherobots is up to 10x faster than OSS Sedona for geometry data processing, and up to 20x faster than OSS Sedona for raster data processing.

Spatial Lakehouse Solution

One of the standout features of Wherobots is its support for an Apache Iceberg-compatible spatial table format, dubbed "Havasu." This feature facilitates efficient querying and updating of geometry and raster columns on Parquet files in cloud object stores such as AWS S3. It enables spatial analytics on the sheer volume of data dumped on cloud object stores that, until today, has seldom been put to use. Details about the Havasu spatial data lake format are available here.

Self-service

Wherobots is provisioned as a fully-managed service within the Wherobots Cloud, ensuring that users don’t have to delve into the intricacies of managing cloud or compute resources.

By delegating resource management to Wherobots, businesses can concentrate on their core spatial analytics tasks, achieving their objectives faster, more efficiently, and more cost-effectively.

To use Wherobots, you first need to create an account on Wherobots Cloud. To get started, please visit the Wherobots website.

Connectivity

Wherobots comes equipped with connectors for major data storage platforms and databases. These include cloud object stores, data warehouses like Snowflake and Redshift, lakehouses such as Databricks, and OLTP databases including Postgres / PostGIS.

Wherobots example usage

Using Wherobots, users can perform a plethora of spatial queries and analytics operations on their data. Here are some common operations users can invoke in Wherobots. For more details on these examples, please refer to the Wherobots documentation.

Insert geometry data

sedona.sql("""
INSERT INTO wherobots.test_db.test_table
VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))
""")

Insert external raster data

sedona
    .sql("SELECT RS_FromPath('s3a://XXX.tif') as rast")
    .writeTo("wherobots.test_db.test_table")
    .append()

Create a spatial index

sedona.sql("CREATE SPATIAL INDEX FOR wherobots.db.test_table USING hilbert(geom, 10)")

Read data from PostGIS

import pyspark.sql.functions as f

sedona.read
    .format("jdbc")
    # JDBC connection options (url, user, password) omitted for brevity
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    .load()
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)"))

Read data from CSV on AWS S3

sedona.read
    .format("csv")
    .load("s3a://data.csv")

Read data from a Havasu table

sedona.table("wherobots.test_db.test_table").filter("ST_Contains(city, location) = true")
sedona.sql("SELECT * FROM wherobots.test_db.test_table WHERE ST_Contains(city, location) = true")

Spatial join query to find zones prone to wildfires

fire_zone = sedona.sql(
    """
    SELECT
        z.geometry as zipgeom,
        z.ZCTA5CE10 as zipcode,
        f.FIRE_NAME
    FROM
        wherobots_open_data.us_census.zipcode z,
        wherobots_open_data.weather.wild_fires f
    WHERE
        ST_Intersects(z.geometry, f.geometry)
    """
)

Visualize geometry data in a notebook

SedonaKepler.create_map(geometryDf)



Visualize raster data in a notebook

SedonaUtils.display_image(rasterDF.selectExpr("RS_AsImage(raster)"))

Performance Benchmark

To better showcase Wherobots’ performance, we conducted a comprehensive benchmark of some commonly seen spatial data processing tasks. Please download the full report of our Wherobots Performance Benchmark.

To use Wherobots, you first need to create an account on Wherobots Cloud. To get started, please visit the Wherobots website.

Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet

Introduction

The Overture Maps Foundation (OMF) has recently released its first open-source Parquet dataset (https://overturemaps.org/download/), divided into four themes – places of interest, buildings, transportation networks, and administrative boundaries.

Apache Sedona is an open-source and distributed geospatial analytics engine. It enables large-scale spatial data processing and analysis through native handling of geospatial data types, Spatial SQL for querying, spatial indexing techniques, and support for standard formats like Shapefile, GeoParquet, GeoJSON, WKT, WKB, GeoTiff, and ArcGrid.

In this article, we demonstrate the capabilities of Apache Sedona for working with massive Overture Maps data. The article also shows that the GeoParquet OMF data produced by Sedona can accelerate spatial queries by 60X. This exploration underscores Sedona’s proficiency in handling scalable geospatial tasks using prevalent industry formats like Parquet and GeoParquet.

This article is structured as follows: First, we will explore and analyze the data in its original Parquet format. Then, we will convert it to the GeoParquet format with built-in spatial indexes. Finally, we will use Sedona in a Jupyter notebook to explore and analyze the dataset in GeoParquet form, leveraging capabilities like spatial SQL and spatial Python to derive insights.

Study 1: Analyze the OMF Parquet data using Sedona

Overture Maps uses the Parquet format to store its datasets. You can find the schema of these datasets on the OMF website (https://docs.overturemaps.org/reference). In this example, we will be using the Buildings theme dataset, which is the largest dataset in OMF at around 120 GB. The schema of this dataset is as follows:

id: string (nullable = true)
updatetime: string (nullable = true)
version: integer (nullable = true)
names: map (nullable = true)
 |-- key: string
 |-- value: array (valueContainsNull = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
level: integer (nullable = true)
height: double (nullable = true)
numfloors: integer (nullable = true)
class: string (nullable = true)
sources: array (nullable = true)
 |-- element: map (containsNull = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
bbox: struct (nullable = true)
 |-- minx: double (nullable = true)
 |-- maxx: double (nullable = true)
 |-- miny: double (nullable = true)
 |-- maxy: double (nullable = true)
geometry: binary (nullable = true)

To start using Sedona on OMF data, we first have to create a SedonaContext:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Apache Sedona facilitates easy loading and analysis of Parquet datasets through its APIs.

df = sedona.read.format("parquet").load("s3a://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Since Parquet does not natively support geospatial data types, the geometry columns in this dataset are stored in the WKB (Well-Known Binary) format. Sedona provides functionality to decode the WKB-encoded geometries into proper spatial types like points and polygons.

df.createOrReplaceTempView("df")
sedona.sql("SELECT ST_GeomFromWKB(geometry) AS geometry FROM df").show(5)

+--------------------+
|            geometry|
+--------------------+
|POLYGON ((7.85925...|
|POLYGON ((2.69399...|
|POLYGON ((-95.775...|
|POLYGON ((103.141...|
|POLYGON ((111.900...|
+--------------------+

After processing, the dataset can be used for spatial range queries. We can perform an ST_Contains query between the dataset and Washington state’s boundary polygon to find buildings within Washington state.

df = df.filter("ST_Contains(ST_GeomFromWKT('POLYGON((-123.3208 49.0023,-123.0338 49.0027,-122.0650 49.0018,-121.7491 48.9973,-121.5912 48.9991,-119.6082 49.0009,-118.0378 49.0005,-117.0319 48.9996,-117.0415 47.9614,-117.0394 46.5060,-117.0394 46.4274,-117.0621 46.3498,-117.0277 46.3384,-116.9879 46.2848,-116.9577 46.2388,-116.9659 46.2022,-116.9254 46.1722,-116.9357 46.1432,-116.9584 46.1009,-116.9762 46.0785,-116.9433 46.0537,-116.9165 45.9960,-118.0330 46.0008,-118.9867 45.9998,-119.1302 45.9320,-119.1708 45.9278,-119.2559 45.9402,-119.3047 45.9354,-119.3644 45.9220,-119.4386 45.9172,-119.4894 45.9067,-119.5724 45.9249,-119.6013 45.9196,-119.6700 45.8565,-119.8052 45.8479,-119.9096 45.8278,-119.9652 45.8245,-120.0710 45.7852,-120.1705 45.7623,-120.2110 45.7258,-120.3628 45.7057,-120.4829 45.6951,-120.5942 45.7469,-120.6340 45.7460,-120.6924 45.7143,-120.8558 45.6721,-120.9142 45.6409,-120.9471 45.6572,-120.9787 45.6419,-121.0645 45.6529,-121.1469 45.6078,-121.1847 45.6083,-121.2177 45.6721,-121.3392 45.7057,-121.4010 45.6932,-121.5328 45.7263,-121.6145 45.7091,-121.7361 45.6947,-121.8095 45.7067,-121.9338 45.6452,-122.0451 45.6088,-122.1089 45.5833,-122.1426 45.5838,-122.2009 45.5660,-122.2641 45.5439,-122.3321 45.5482,-122.3795 45.5756,-122.4392 45.5636,-122.5676 45.6006,-122.6891 45.6236,-122.7647 45.6582,-122.7750 45.6817,-122.7619 45.7613,-122.7962 45.8106,-122.7839 45.8642,-122.8114 45.9120,-122.8148 45.9612,-122.8587 46.0160,-122.8848 46.0604,-122.9034 46.0832,-122.9597 46.1028,-123.0579 46.1556,-123.1210 46.1865,-123.1664 46.1893,-123.2810 46.1446,-123.3703 46.1470,-123.4314 46.1822,-123.4287 46.2293,-123.4946 46.2691,-123.5557 46.2582,-123.6209 46.2573,-123.6875 46.2497,-123.7404 46.2691,-123.8729 46.2350,-123.9292 46.2383,-123.9711 46.2677,-124.0212 46.2924,-124.0329 46.2653,-124.2444 46.2596,-124.2691 46.4312,-124.3529 46.8386,-124.4380 47.1832,-124.5616 47.4689,-124.7566 47.8012,-124.8679 48.0423,-124.8679 48.2457,-124.8486 48.3727,-124.7539 48.4984,-124.4174 48.4096,-124.2389 48.3599,-124.0116 48.2964,-123.9141 48.2795,-123.5413 48.2247,-123.3998 48.2539,-123.2501 48.2841,-123.1169 48.4233,-123.1609 48.4533,-123.2220 48.5548,-123.2336 48.5902,-123.2721 48.6901,-123.0084 48.7675,-123.0084 48.8313,-123.3215 49.0023,-123.3208 49.0023))'), geometry) = true")
df.selectExpr("names", "height", "numfloors", "geometry").show(5)

+--------------------+------+---------+--------------------+
|               names|height|numfloors|            geometry|
+--------------------+------+---------+--------------------+
|{common -> [{valu...|   5.1|        1|POLYGON ((-122.32...|
|{common -> [{valu...|  10.9|        1|POLYGON ((-122.18...|
|{common -> [{valu...|   7.9|        1|POLYGON ((-122.31...|
|{common -> [{valu...|   9.2|        1|POLYGON ((-122.22...|
|{common -> [{valu...|   6.4|        1|POLYGON ((-122.21...|
+--------------------+------+---------+--------------------+

We can also leverage Apache Spark’s filter pushdown capabilities on non-spatial columns to reduce the data before geospatial analysis. Since the building dataset was large, we applied a highly selective filter pushdown:

from pyspark.sql.functions import col, size

df_building = df_building.filter(~(size(col("names")) <= 0)) \
                         .filter(col("height") <= 200) \
                         .filter(~(size(col("sources")) <= 0)) \
                         .filter(col("numfloors") == 1)

The count of the dataset after intersecting with the Washington State boundary:

df.count()

Count: 511
Time: 1h 31min 42s

Discussions

The spatial query and non-spatial filter pushdown on the dataset took an hour and a half to complete on a single-node Sedona Docker environment, so users can barely do any interactive analytics on it. The reason is two-fold:

  1. The code was executed on a single-node Sedona Docker environment with 4GB executor RAM, not a real cluster. Performance would improve significantly if run in a distributed Sedona environment across multiple nodes.
  2. This lengthy processing time is primarily due to the limitations of using the Parquet format without geospatial optimizations. Parquet lacks native spatial data types and spatial indexing schemes suited for efficient geospatial analysis. The required data loading and preparation are time-consuming. Using a format optimized for geospatial workloads, like GeoParquet, could significantly reduce the pre-processing time for this analysis.

Study 2: Converting from Parquet to GeoParquet

Sedona enables spatial data format conversion between Parquet and GeoParquet. For more details, please refer to [Sedona GeoParquet blogpost]. After realizing the limitations of Parquet, we decided to leverage this functionality and see how much improvement the GeoParquet format brings.

To achieve the best spatial filter pushdown performance, we need to partition the data based on spatial proximity. In other words, nearby spatial objects should be put in the same GeoParquet partition. For this purpose, we first create a GeoHash ID for each geometry in the OMF data using Sedona's ST_GeoHash. This function generates geographic hashes for a given geometry at a specified precision. The precision refers to the size of the grid cells; a precision of 2 means each cell covers about 1,250 km × 625 km. This precision level was chosen as an optimal balance, since too high a precision produces many small files that slow down query processing and reduce read throughput.

from pyspark.sql.functions import expr

df = df.withColumn("geohash", expr("ST_GeoHash(geometry, 2)")).repartition("geohash")

By repartitioning with GeoHashing, data points with the same GeoHash ID get assigned to contiguous partition slices based on precision. This clusters nearby points together in the same partitions.

Finally, we store the GeoParquet Overture Maps data in our public Wherobots S3 bucket. The same operation is applied to all OMF datasets.

df.write.partitionBy("geohash").format("geoparquet").save("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Discussions

This conversion of all OMF datasets took around 15 minutes to finish on a 20-node AWS EMR cluster, where each node is an m3.2xlarge instance with 8 CPUs and 30 GB RAM. In closing, Sedona provides a streamlined way to convert datasets to GeoParquet format and partition them using GeoHashes for optimal performance. The entire GeoParquet dataset converted here is available in Wherobots’ public S3 bucket for you to experiment with.

Study 3: Interactive geospatial analytics on OMF GeoParquet data

We’ll ingest the public Wherobots dataset from S3 into a Spark DataFrame using Sedona, as before, and intersect it with the Bellevue, Washington boundary.

Fast geospatial queries via Sedona GeoParquet spatial filter pushdown

With GeoParquet, we observe improved query performance versus the Parquet format on the following simple spatial range query (i.e., spatial filter).

spatial_filter = "POLYGON ((-122.235128 47.650163, -122.233796 47.65162, -122.231581 47.653287, -122.228514 47.65482, -122.227526 47.655204, -122.226175 47.655729, -122.222039 47.656743999999996, -122.218428 47.657464, -122.217026 47.657506, -122.21437399999999 47.657588, -122.212091 47.657464, -122.212135 47.657320999999996, -122.21092999999999 47.653552, -122.209834 47.650121, -122.209559 47.648976, -122.209642 47.648886, -122.21042 47.648658999999995, -122.210897 47.64864, -122.211005 47.648373, -122.21103099999999 47.648320999999996, -122.211992 47.64644, -122.212457 47.646426, -122.212469 47.646392, -122.212469 47.646088999999996, -122.212471 47.645213, -122.213115 47.645212, -122.213123 47.644576, -122.21352999999999 47.644576, -122.213768 47.644560999999996, -122.21382 47.644560999999996, -122.21382 47.644456999999996, -122.21373299999999 47.644455, -122.213748 47.643102999999996, -122.213751 47.642790999999995, -122.213753 47.642716, -122.213702 47.642697999999996, -122.213679 47.642689999999995, -122.21364 47.642678, -122.213198 47.642541, -122.213065 47.642500000000005, -122.212918 47.642466, -122.21275 47.642441, -122.212656 47.642433, -122.21253899999999 47.642429, -122.212394 47.64243, -122.212182 47.642444999999995, -122.211957 47.642488, -122.211724 47.642551999999995, -122.21143599999999 47.642647, -122.210906 47.642834, -122.210216 47.643099, -122.209858 47.643215, -122.20973000000001 47.643248, -122.20973599999999 47.643105, -122.209267 47.643217, -122.208832 47.643302, -122.208391 47.643347999999996, -122.207797 47.643414, -122.207476 47.643418, -122.20701199999999 47.643397, -122.206795 47.643387999999995, -122.205742 47.643246, -122.20549 47.643201999999995, -122.20500200000001 47.643119, -122.204802 47.643085, -122.204641 47.643066, -122.204145 47.643012, -122.203547 47.643012, -122.203097 47.643107, -122.20275699999999 47.643283, -122.202507 47.643496999999996, -122.202399 47.643653, -122.202111 47.643771, -122.201668 47.643767, -122.201363 47.643665, -122.20133 47.643648999999996, -122.201096 47.643536, -122.200744 47.64328, -122.200568 47.64309, -122.200391 47.642849, -122.200162 47.642539, -122.199896 47.642500000000005, -122.19980799999999 47.642424, -122.199755 47.642376999999996, -122.199558 47.642227999999996, -122.199439 47.642157, -122.199293 47.642078999999995, -122.199131 47.642004, -122.198928 47.641925, -122.19883 47.641892, -122.19856300000001 47.641811999999994, -122.198203 47.641731, -122.197662 47.641619999999996, -122.196819 47.641436, -122.196294 47.641309, -122.196294 47.642314, -122.19628 47.642855, -122.196282 47.642897999999995, -122.196281 47.643111, -122.196283 47.643415, -122.196283 47.643508999999995, -122.19628399999999 47.643739, -122.196287 47.644203999999995, -122.196287 47.644262999999995, -122.19629 47.644937999999996, -122.19629 47.644954999999996, -122.196292 47.645271, -122.196291 47.645426, -122.19629499999999 47.646315, -122.19629499999999 47.646432, -122.195925 47.646432, -122.195251 47.646432, -122.190853 47.646429999999995, -122.187649 47.646428, -122.187164 47.646426, -122.18683 47.646426, -122.185547 47.646409, -122.185546 47.646316, -122.185537 47.645599, -122.185544 47.644197, -122.185537 47.643294999999995, -122.185544 47.642733, -122.185541 47.641757, -122.185555 47.640681, -122.185561 47.63972, -122.185557 47.638228999999995, -122.185591 47.635419, -122.185611 47.634750999999994, -122.18562299999999 47.634484, -122.18561700000001 47.634375999999996, -122.185592 47.634311, -122.185549 47.634232999999995, -122.185504 
47.634181999999996, -122.185426 47.634119, -122.184371 47.633424999999995, -122.18400000000001 47.633198, -122.183896 47.633134, -122.1838 47.633067, -122.18375499999999 47.633019999999995, -122.183724 47.632959, -122.183695 47.632858, -122.183702 47.632675, -122.182757 47.632622999999995, -122.182365 47.63259, -122.18220600000001 47.632562, -122.181984 47.632504999999995, -122.18163799999999 47.632363, -122.18142 47.632262999999995, -122.181229 47.632165, -122.181612 47.632172999999995, -122.18271899999999 47.632151, -122.183138 47.632135, -122.18440000000001 47.632081, -122.184743 47.632065999999995, -122.185312 47.63205, -122.185624 47.632047, -122.185625 47.631873999999996, -122.184618 47.63187, -122.184291 47.631878, -122.184278 47.631817999999996, -122.183882 47.629942, -122.182689 47.623548, -122.182594 47.622789999999995, -122.182654 47.622155, -122.183135 47.622372999999996, -122.183471 47.622506, -122.18360200000001 47.622552, -122.183893 47.622637999999995, -122.184244 47.62272, -122.184618 47.622777, -122.184741 47.622727999999995, -122.184605 47.622679, -122.18424 47.622622, -122.183985 47.622569, -122.183717 47.622501, -122.183506 47.622439, -122.18327 47.622357, -122.18305699999999 47.622271999999995, -122.182669 47.622088999999995, -122.182796 47.621545, -122.18347 47.619628999999996, -122.18365 47.619098, -122.183859 47.6184, -122.183922 47.617793999999996, -122.183956 47.617292, -122.183792 47.616388, -122.183261 47.614391999999995, -122.183202 47.613802, -122.183209 47.613155, -122.183436 47.612384999999996, -122.18395100000001 47.610445999999996, -122.184338 47.60924, -122.184657 47.609116, -122.18481 47.609051, -122.18491900000001 47.608987, -122.184974 47.608942, -122.185047 47.608846, -122.185082 47.608743999999994, -122.185109 47.608526999999995, -122.185116 47.608359, -122.18513 47.608315999999995, -122.185157 47.608273999999994, -122.185183 47.608247, -122.185246 47.608214, -122.185354 47.608196, -122.185475 47.608191999999995, -122.185472 47.606697, -122.185472 47.606373999999995, -122.185521 47.606272, -122.185528 47.606210999999995, -122.185506 47.606037, -122.185451 47.605872999999995, -122.185411 47.605781, -122.185358 47.605681999999995, -122.185248 47.605509999999995, -122.185127 47.605365, -122.185058 47.605292, -122.184772 47.605038, -122.184428 47.604834, -122.184122 47.604693999999995, -122.183775 47.604574, -122.183644 47.604546, -122.183708 47.604400999999996, -122.183749 47.604223999999995, -122.18376 47.604037, -122.183707 47.603778, -122.183619 47.603556999999995, -122.183559 47.603406, -122.183488 47.603303, -122.183824 47.603167, -122.184108 47.603052, -122.184478 47.602902, -122.18543 47.602495, -122.186669 47.601957, -122.186433 47.601220999999995, -122.186341 47.601127999999996, -122.18874199999999 47.593742999999996, -122.188434 47.592338999999996, -122.188479 47.591786, -122.188217 47.591269999999994, -122.18795399999999 47.590871, -122.186822 47.589228, -122.187421 47.589228999999996, -122.18848299999999 47.589228999999996, -122.188433 47.587922999999996, -122.18990000000001 47.588547, -122.191368 47.589169999999996, -122.19158 47.589222, -122.191779 47.589254999999994, -122.192117 47.589289, -122.191569 47.587478999999995, -122.191323 47.586628999999995, -122.191295 47.586554, -122.191268 47.586479, -122.191192 47.586318, -122.191163 47.586268999999994, -122.1911 47.586164, -122.19099 47.586011, -122.19067 47.585668999999996, -122.1905 47.585515, -122.190301 47.58531, -122.190143 47.585152, -122.189573 47.584576999999996, -122.188702 
47.583735999999995, -122.188646 47.583679, -122.188239 47.583258, -122.188037 47.583005, -122.187832 47.582657, -122.187726 47.582164999999996, -122.18769499999999 47.581964, -122.18768299999999 47.581781, -122.187678 47.581592, -122.18766099999999 47.581455, -122.187674 47.581311, -122.18768 47.581146, -122.187722 47.580877, -122.187817 47.580569999999994, -122.187932 47.580301999999996, -122.188047 47.580087, -122.188161 47.579933999999994, -122.188399 47.579660999999994, -122.18851699999999 47.579547, -122.188621 47.579454, -122.188042 47.579493, -122.18762 47.579527, -122.187806 47.579358, -122.188009 47.579175, -122.18814499999999 47.579051, -122.188177 47.579021, -122.18842000000001 47.5788, -122.188638 47.578461, -122.188895 47.57806, -122.189791 47.577281, -122.190008 47.577103, -122.190372 47.576805, -122.19119 47.576358, -122.191877 47.576087, -122.193025 47.57566, -122.194317 47.575185999999995, -122.196061 47.574664, -122.197239 47.574386999999994, -122.197873 47.574267, -122.198286 47.574189999999994, -122.199091 47.574044, -122.199067 47.574574999999996, -122.199007 47.575921, -122.200335 47.578222, -122.20057299999999 47.578345999999996, -122.2009 47.578517999999995, -122.201095 47.578621999999996, -122.20138399999999 47.578776999999995, -122.201465 47.57882, -122.201516 47.578846999999996, -122.205753 47.581112, -122.209515 47.583124, -122.210634 47.583721, -122.21473399999999 47.587021, -122.21538699999999 47.588254, -122.21580399999999 47.589042, -122.216534 47.590421, -122.220092 47.596261, -122.220434 47.596821, -122.22041899999999 47.597837999999996, -122.220289 47.606455, -122.220234 47.610121, -122.22048 47.615221999999996, -122.220359 47.615379, -122.220283 47.615477999999996, -122.21999 47.615854999999996, -122.219993 47.61597, -122.22023300000001 47.616634, -122.220356 47.616687999999996, -122.220409 47.616712, -122.221401 47.618538, -122.22142 47.618573, -122.221456 47.618635, -122.221791 47.619222, -122.222492 47.619682999999995, -122.222799 47.619886, -122.222083 47.620368, -122.222046 47.620407, -122.222028 47.620449, -122.222025 47.620483, -122.22203999999999 47.620523999999996, -122.222079 47.620557999999996, -122.222156 47.620594999999994, -122.222458 47.620629, -122.222454 47.620673, -122.222454 47.620711, -122.22244599999999 47.621041999999996, -122.223056 47.621041, -122.223129 47.62104, -122.223153 47.62104, -122.223574 47.621041, -122.22377900000001 47.621041, -122.223857 47.621041, -122.22467499999999 47.621041, -122.224712 47.62104, -122.224958 47.62104, -122.225167 47.621049, -122.226882 47.621037, -122.227565 47.621032, -122.228002 47.621029, -122.22797800000001 47.621300999999995, -122.227919 47.626574999999995, -122.227914 47.627085, -122.227901 47.6283, -122.227881 47.630069, -122.227869 47.631177, -122.227879 47.631952999999996, -122.22789 47.633879, -122.227886 47.63409, -122.227871 47.635534, -122.227918 47.635565, -122.228953 47.635624, -122.22895199999999 47.635571999999996, -122.231018 47.635574999999996, -122.233276 47.635588999999996, -122.233287 47.63617, -122.233273 47.63639, -122.233272 47.636469999999996, -122.23327 47.636578, -122.233266 47.636827, -122.233263 47.636851, -122.233262 47.637014, -122.23322999999999 47.638110999999995, -122.233239 47.638219, -122.233262 47.638279, -122.233313 47.638324999999995, -122.233255 47.638359, -122.233218 47.638380999999995, -122.233153 47.638450999999996, -122.233136 47.638552999999995, -122.233137 47.638692, -122.232715 47.639348999999996, -122.232659 47.640093, -122.232704 47.641375, 
-122.233821 47.645111, -122.234906 47.648874, -122.234924 47.648938, -122.235128 47.650163))"

df = sedona.read.format("geoparquet").load("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")
df = df.filter("ST_Contains(ST_GeomFromWKT('"+spatial_filter+"'), geometry) = true")
df.count()

Count: 3423743
Time: 2min 47s

Switching from regular Parquet to GeoParquet reduced the query time from over an hour and a half to around 3 minutes, a 60X speedup. This query was run on the same single-node Sedona Docker environment, without non-spatial filter pushdowns. Further gains could be achieved by leveraging a distributed Sedona cluster. Overall, GeoParquet enables more efficient geospatial queries through both spatial partitioning and enhanced filter pushdown.

Interactive geospatial visualization via SedonaKepler

We also want to highlight SedonaKepler, our interactive visualization API built on Kepler.gl. SedonaKepler provides a powerful, customizable visualization experience through a simple API. Creating visualizations with SedonaKepler is as straightforward as calling a single function. The map created by the code below shows the buildings in Bellevue, Washington.

from sedona.spark import *

SedonaKepler.create_map(df, "Building")

Time: 2min 38s


Photo: Visualization of building dataset spatially filtered to only include data for the city of Bellevue, Washington.

Of course, you can also choose other regions and see what the dataset looks like.


Photo: Visualization of connector and segment datasets spatially filtered to only include data for Mount Rainier National Park, Washington, in Satellite view.

SedonaKepler can also visualize multiple datasets together, revealing insights across interconnected geospatial data. To demonstrate, we are using the segment and connector datasets, as they highlight the transportation network from OMF.

The dots below represent connection points between road segments. The road segments themselves are shown as yellow lines, representing paths that can be traversed by people, cars, or other vehicles. By layering this networked transport data over a map, we can gain insights into how the transportation infrastructure links together across the geographic area.

map_connector = SedonaKepler.create_map(df_connector, "Connector")
SedonaKepler.add_df(map_connector, df_segment, name="Segment")
map_connector

Time: 3min 11s


Photo: Visualization of connector and segment datasets spatially filtered to only include data for the city of Bellevue, Washington.

The Take-Away

Apache Sedona enables both the reading and writing of GeoParquet files, a specialized Parquet variant tailored for spatial data interchange. When executing spatial range queries on GeoParquet files, Apache Sedona supports spatial filter pushdown, optimizing query performance with an over 60X speedup. SedonaKepler is a powerful tool for creating interactive, easy-to-maneuver visualizations, and it can combine multiple datasets in a single map.

Try it yourself

Notebooks

All notebooks used in this article are available on Wherobots GitHub repository: https://github.com/wherobots/OvertureMaps

Interactive notebook using GeoParquet and Sedona

Use Wherobots to deploy Sedona to cloud vendors

The other notebooks used in Study 1 and 2 can be run on an AWS EMR or Databricks cluster. If you want to try them out, please sign up here: https://www.wherobots.ai/demo

Wherobots is a spatial data analytics and AI platform trusted in production, at scale, from the original creators of Apache Sedona.

Free and public Overture Maps GeoParquet data from Wherobots

The GeoParquet format data produced in Study 2 is provided by Wherobots for free. It has the same schema and license as the original Overture Maps Parquet data, except that the geometry column uses a geometry type and there is an additional geohash column of string type. You can access the datasets as follows (a read example follows the list):

  • Buildings: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building
  • Places: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=place
  • AdminBoundary: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=administrativeBoundary
  • AdminLocality: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=locality
  • Transportation Connector: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=connector
  • Transportation Segment: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=segment
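
For example, a minimal sketch of reading the Buildings table and pruning by the geohash partition column; the two-character prefix value here is illustrative:

df = sedona.read.format("geoparquet").load("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")
df = df.filter("geohash = '9q'")  # scan only the partition for this GeoHash cell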