Introducing GeoStats for WherobotsAI and Apache Sedona

We are excited to introduce GeoStats, a machine learning (ML) and statistical toolbox for WherobotsAI and Apache Sedona users. With GeoStats, you can easily identify important patterns in geospatial datasets, such as hotspots and anomalies, and quickly get critical insights from large-scale data. While these algorithms are supported in other packages, we’ve optimized each one to be highly performant for small to planetary-scale geospatial workloads. That means you can get results significantly faster, at a lower cost, and do it all more productively through a unified development experience purpose-built for geospatial data science and ETL.

The Wherobots toolbox supports DBSCAN, Local Outlier Factor (LOF), and Getis-Ord (Gi*) algorithms. Apache Sedona users can utilize DBSCAN starting with Apache Sedona version 1.7.0, and like all other features of Apache Sedona, it’s fully compatible with Wherobots.

Use Cases for GeoStats

DBSCAN

DBSCAN is the most popular algorithm we see in geospatial use cases. It identifies clusters (areas where your data points are closely packed together) and outliers (points that sit apart from the rest).

Typical use cases for DBSCAN are found in:

  • Retail: Decision makers use DBSCAN with location data to understand areas of high and low pedestrian activity and decide where to set up retail establishments.
  • City Planning: City planners use DBSCAN with GPS data to optimize transit support by identifying high usage routes, areas in need of additional transit options, or areas that have too much support.
  • Air Traffic Control: Traffic controllers use DBSCAN to identify areas with increasing weather activity to optimize flight routing.
  • Risk computation: Insurers and others can use DBSCAN to make policy decisions and calculate risk where risk is correlated to the proximity of two or more features of interest.

Local Outlier Factor (LOF)

LOF is an anomaly detection algorithm that identifies outliers present in a dataset.

Typical use cases for LOF include:

  • Data analysis and cleansing: Data teams can use LOF to identify and remove anomalies within a dataset, like removing erroneous GPS data points from a trace dataset

Getis-Ord (Gi*)

Getis-Ord is also a popular algorithm for identifying local hot and cold spots.

Typical use cases for Gi* include:

  • Public health: Officials can use disease data with Gi* to identify areas of abnormal disease outbreak
  • Telecommunications: Network administrators can use Gi* to identify areas of high demand and optimize network deployment
  • Insurance: Insurers can identify areas prone to specific claims to better manage risk

Traditional challenges with using these algorithms on geospatial data

Before GeoStats, teams leveraging any of the algorithms in the toolbox in a data analysis or ML pipeline would:

  1. Struggle to get performance or scale from underlying solutions, which also perform poorly when joining geospatial data.
  2. Determine how to host and scale open source versions of popular ML and statistical algorithms, like PostGIS or scikit-learn DBSCAN, PySal Gi*, or scikit-learn LOF, to work for geospatial data types and geospatial data formats.
  3. Replicate this overhead each time they want to deploy a new algorithm for geospatial data.

Benefits of WherobotsAI GeoStats

With GeoStats in WherobotsAI, you can now:

  1. Easily run native algorithms on a cloud-based engine, optimized for producing spatial data products and insights at scale.
  2. Use these algorithms without the operational overhead associated with setup and maintenance.
  3. Leverage optimized, hosted algorithms within a single platform to easily experiment and get critical insights faster.

We’ll walk through a brief overview of each algorithm, how to use them, and show how they perform at various scales.

Diving Deeper into the GeoStats Toolbox

DBSCAN Overview

DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups points with many nearby neighbors and marks points that lie alone in low-density regions as outliers.

How to use DBSCAN in Wherobots

The following examples assume you have already set up an organization and have an active runtime and notebook, with a DataFrame of interest to run the algorithms on.

WherobotsAI GeoStats DBSCAN Python API Overview
For a full walk through see the Python API reference: dbscan(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: max distance to neighbors (epsilon), min neighbors (min_points)
  • Output: dataframe with cluster id

DBSCAN Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint("_1", "_2").alias("geometry"))
  2. Choose values for your hyperparameters, max distance to neighbors (epsilon) and minimum neighbors (min_points). These values will determine how DBSCAN identifies clusters.
epsilon=0.3
min_points=10
  3. Run DBSCAN on your DataFrame with your chosen hyperparameter values.
clusters_df = dbscan(df, epsilon=0.3, min_points=10, include_outliers=True)
  4. Analyze the results. For each datapoint, DBSCAN returns the cluster it belongs to or marks it as an outlier.
+--------------------+------+-------+
|            geometry|isCore|cluster|
+--------------------+------+-------+
|POINT (1.22185277...| false|      1|
|POINT (0.77885034...| false|      1|
|POINT (-2.2744742...| false|      2|
+--------------------+------+-------+

only showing top 3 rows

There’s a complete example of how to use DBSCAN in the Wherobots user documentation.
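
For comparison with the single-node tools mentioned earlier, a roughly equivalent scikit-learn run on an in-memory array of coordinates looks like the sketch below (the random X array is a placeholder, not the Overture data used later).

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(1000, 2)                      # placeholder (n, 2) array of coordinates
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
# labels holds one cluster id per point, with -1 marking outliers,
# analogous to the cluster / outlier output of the GeoStats dbscan() call above.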

DBSCAN Performance Overview

To show DBSCAN performance in Wherobots, we created a European sample of the Overture buildings dataset and ran DBSCAN to identify clusters of buildings near each other, starting from the geographic center of Europe and working outwards. For each subsampled dataset, we ran DBSCAN with an epsilon of 0.005 degrees (roughly 550 meters) and a min_points value of 4 on a Large runtime in Wherobots Cloud. As seen below, DBSCAN effectively processes an increasing number of records, with 100M records taking 1.6 hours to process.

Local Outlier Factor (LOF) Overview

LOF is an anomaly detection algorithm that identifies outliers present in a dataset. It does this by measuring how close a given data point is to a set of k-nearest neighbors (with k being a user-chosen hyperparameter) in comparison to how close its nearest neighbors are to their nearest neighbors. LOF provides a score that represents the degree to which a record is an inlier or outlier.

How to use LOF in Wherobots

For the full example, please see this docs page.

WherobotsAI GeoStats LOF Python API Overview
For a full walk through see the Python API reference: local_outlier_factor(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: number of nearest neighbors to use
  • Output: score representing degree of inlier or outlier

LOF Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint(f.col("_1"), f.col("_2")).alias("geometry"))
  2. Choose your k value for how many nearest neighbors you want to use to measure density near a given datapoint.
k=20
  3. Run LOF on your DataFrame with your chosen k value.
outliers_df = local_outlier_factor(df, k=20)
  4. Analyze your results. LOF returns a score for each datapoint representing the degree of inlier or outlier.
+--------------------+------------------+
|            geometry|               lof|
+--------------------+------------------+
|POINT (-1.9169927...|0.9991534865548664|
|POINT (-1.7562422...|1.1370318880088373|
|POINT (-2.0107478...|1.1533763384772193|
+--------------------+------------------+
only showing top 3 rows

There’s a complete example of how to use LOF in the Wherobots user documentation.
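
As with DBSCAN, the scikit-learn LOF mentioned earlier is the closest single-node analogue; a minimal sketch on an in-memory array of placeholder data is shown below.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(1000, 2)                      # placeholder (n, 2) array of coordinates
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_           # values near 1 are inliers; larger values are outliers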

LOF Performance Overview

We followed the same procedure as with DBSCAN, but ran LOF to identify outlier buildings. For each set of buildings we ran LOF with k=20 on a Large runtime in Wherobots Cloud. As seen below, GeoStats LOF scales effectively with growing data size, with 100M records taking 10 minutes to process.

Getis-Ord (Gi*) Overview

Getis-Ord is an algorithm for identifying statistically significant local hot and cold spots.

How to use GeoStats Gi*

WherobotsAI GeoStats Gi* Python API Overview
For a full walk through see the Python API reference: g_local(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: star, neighbor weighting
  • Output: Set of statistics that indicate the degree of local hot or cold spot for a given record

Gi* Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
places_df = (
    sedona.table("wherobots_open_data.overture_2024_07_22.places_place")
        .select(f.col("geometry"), f.col("categories"))
        .withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)
  2. Choose how you’d like to weight datapoints (for example, datapoints in a specific geographic area are weighted higher, or datapoints close to a given datapoint are weighted higher) and whether to use star (a boolean indicating whether a record counts as a neighbor of itself).

star = True
neighbor_search_radius_degrees = 1.0
variable_column = "myNumericColumnName"

weighted_dataframe = add_binary_distance_band_column(
    df,
    neighbor_search_radius_degrees,
    include_self=star
)
  3. Run Gi* on your DataFrame with your chosen hyperparameters.
gi_df = g_local(
    weighted_dataframe,
    variable_column,
    star=star
)
  4. Analyze your results. For each datapoint, Gi* returns a set of statistics that indicate the degree of local hot or cold spot.
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|num_places|                  G|                  EG|                  VG|                 Z|                   P|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|       871| 0.1397485091609774|0.013219284603421462|5.542296862370928E-5|16.995969941572465|                 0.0|
|       908|0.16097739240211956|0.013219284603421462|5.542296862370928E-5|19.847528249317246|                 0.0|
|       218|0.11812096144582315|0.013219284603421462|5.542296862370928E-5|14.090861243071908|                 0.0|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
only showing top 3 rows

There’s a complete example of how to use Gi* in the Wherobots user documentation.
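
A common next step is to threshold the z-score and p-value to keep only statistically significant results; the sketch below (using the Z and P columns shown above and the conventional ~95% confidence cutoffs) separates hot spots from cold spots.

# Conventional ~95% confidence cutoffs (1.96 / 0.05); adjust to your needs.
hot_spots = gi_df.filter("Z > 1.96 AND P < 0.05")
cold_spots = gi_df.filter("Z < -1.96 AND P < 0.05")
hot_spots.orderBy("Z", ascending=False).show(5)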

Getis-Ord Performance Overview

To showcase how Gi* performs in Wherobots, we again used the same example as DBSCAN, but ran Gi* on the area of the buildings. For each set of buildings we ran Gi* with a binary neighbor weight and a neighborhood radius of 0.007 degrees (~0.5 miles) on a Large runtime in Wherobots Cloud. As seen below, the algorithm scales mostly linearly with the number of records, with 100M records taking 1.6 hours to process.

Get started with WherobotsAI GeoStats

The way we implemented these algorithms for large-scale geospatial workloads will help you make sense of your geospatial data faster. You can get started for free today.

  • If you haven’t already, create a free Wherobots Organization subscribed to the Community Edition of Wherobots.
  • Start a Wherobots Notebook.
  • In the Notebook environment, explore the notebook_example/python/wherobots-ai/ folder for examples that you can use to get started.
  • Need additional help? Check out our user documentation, and send us a note if needed at support@wherobots.com.

Apache Sedona Users

Apache Sedona users will have access to GeoStats DBSCAN with the Apache Sedona 1.7.0 release. Subscribe to the Sedona newsletter and join the Sedona community to get notified of the release and get started!

What’s next

We’re excited to hear what ML and statistical algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

The WherobotsAI Team

What is Apache Sedona?

Apache Sedona Overview

Apache Sedona is a cluster computing system for processing large-scale spatial data. It treats spatial data as a first-class citizen by extending the functionality of distributed compute frameworks like Apache Spark, Apache Flink, and Snowflake. Apache Sedona was created at Arizona State University under the name GeoSpark, in the paper “Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond.”

Apache Sedona introduces data types, operations, and indexing techniques optimized for spatial workloads on top of Apache Spark and other distributed compute frameworks. Let’s take a look at the workflow for analyzing spatial data with Apache Sedona.

Apache Sedona Architecture

Spatial Query Processing

The first step in spatial query processing is to ingest geospatial data into Apache Sedona. Data can be loaded from various sources such as files (Shapefiles, GeoJSON, Parquet, GeoTiff, CSV, etc.) or databases into Apache Sedona’s in-memory distributed spatial data structures (typically the Spatial DataFrame).

Next, Sedona makes use of spatial indexing techniques to accelerate query processing, such as R-trees or Quad trees. The spatial index is used to partition the data into smaller, manageable units, enabling efficient data retrieval during query processing.

Once the data is loaded and indexed, spatial queries can be executed using Sedona’s query execution engine. Sedona supports a wide range of spatial operations, such as spatial joins, distance calculations, and spatial aggregations.
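
For example, a typical spatial join in Sedona is expressed as ordinary SQL with a spatial predicate; the table and column names below are illustrative placeholders, not a specific dataset.

# Illustrative Spatial SQL: count the points that fall inside each polygon
# (assumes "zones" and "points" have been registered as temp views with geometry columns).
result = sedona.sql("""
    SELECT z.zone_id, COUNT(*) AS num_points
    FROM zones z JOIN points p
      ON ST_Contains(z.geometry, p.geometry)
    GROUP BY z.zone_id
""")
result.show()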

Sedona optimizes spatial queries to improve performance. The query optimizer determines an efficient query plan by considering the spatial predicates, available indexes, and the distribution of data across the cluster.

Spatial queries are executed in a distributed manner using Sedona’s computational capabilities. The query execution engine distributes the query workload across the cluster, with each node processing a portion of the data. Intermediate results are combined to produce the final result set. Since spatial objects can be very complex with many coordinates and topology, Sedona implements a custom serializer for efficiently moving spatial data throughout the cluster.

Common Apache Sedona Use Cases

So what exactly are users doing with Apache Sedona? Here are some common examples:

  • Creating custom weather, climate, and environmental quality assessment reports at national scale by combining vector parcel data with environmental raster data products.
  • Generating planetary scale GeoParquet files for public dissemination via cloud storage by combining, cleaning, and indexing multiple datasets.
  • Converting billions of daily point telemetry observations into routes traveled by vehicles.
  • Enriching parcel level data with demographic and environmental data at the national level to feed into a real estate investment suitability analysis.

Many of these use cases can be described as geospatial ETL operations. ETL (extract, transform, load) is a data integration process that involves retrieving data from various sources, transforming and combining these datasets, then loading the transformed data into a target system or format for reporting or further analysis. Geospatial ETL shares many of the challenges and requirements of traditional ETL, with the added complexity of managing the geospatial component of the data: working with geospatial data sources and formats, spatial data types and transformations, and the scalability and performance considerations required for spatial operations such as joins based on spatial relationships. For a more complete overview of use cases with Apache Sedona, you can read our ebook on it here.

Community Adoption

Apache Sedona has gained significant community adoption and has become a popular geospatial analytics library within the distributed computing and big data ecosystem. As an Apache Software Foundation (ASF) project, Sedona’s governance, licensing, and community participation align with ASF principles.

Sedona has an active and growing developer community, with contributors from many different types of organizations and over 100 individuals interested in advancing the state of geospatial analytics and distributed computing. Sedona has reached over 38 million downloads, with a rate of 2 million downloads per month and usage growing at roughly 200% per year (as of the date this was published).

Organizations in industries including transportation, urban planning, environmental monitoring, logistics, insurance and risk analysis and more have adopted Apache Sedona. These organizations leverage Sedona’s capabilities to perform large-scale geospatial analysis, extract insights from geospatial data and build geospatial analytical applications at scale. The industry adoption of Apache Sedona showcases its practical relevance and real-world utility.

Apache Sedona has been featured in conferences, workshops, and research publications related to geospatial analytics, distributed computing, and big data processing. These presentations and publications contribute to the awareness, visibility, and adoption both within the enterprise and within the research and academic communities.

Resources

As you get started with Apache Sedona, the following resources will be useful throughout your journey in the world of large-scale geospatial data analytics.

The best place to start learning about Apache Sedona is the authoritative book on the topic, “Cloud Native Geospatial Analytics with Apache Sedona”, which was recently published in early-release format. The team behind the project will continue to release chapters until the book is complete over the coming months. You can get the latest version of the book here.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

Unlock Satellite Imagery Insights with WherobotsAI Raster Inference

Recently we introduced WherobotsAI Raster Inference to unlock analytics on satellite and aerial imagery using SQL or Python. Raster Inference simplifies extracting insights from satellite and aerial imagery and is powered by open-source machine learning models. This feature is currently in preview, and we are expanding its capabilities to support more models. Below we’ll dig into the popular computer vision tasks that Raster Inference supports, describe how it works, and show how you can use it to run batch inference to find and map electricity infrastructure.

Watch the live demo of these capabilities here.

The Power of Machine Learning with Satellite Imagery

Petabytes of satellite imagery are generated each day all over the world in a dizzying number of sensor types and image resolutions. The applications for satellite imagery and other remote sensing data sources are broad and diverse. For example, satellites with consistent, continuous orbits are ideal for monitoring forest carbon stocks to validate carbon credits or estimating agricultural yields.

However, this data has been inaccessible for most analysts and even seasoned ML practitioners because insight extraction required specialized skills. We’ve done the work to make insight extraction simple and accessible to more people. Raster Inference abstracts the complexity and scales to support planetary-scale imagery datasets, so you don’t need ML expertise to derive insights. In this blog, we explore the key features that make Raster Inference effective for land cover classification, solar farm mapping, and marine infrastructure detection. And, in the near future, you will be able to use Raster Inference with your own models!

Introduction to Popular and Supported Machine Learning Tasks

Raster Inference supports the three most common kinds of computer vision models applied to imagery: classification, object detection, and semantic segmentation. Instance segmentation (which combines object localization and semantic segmentation) is another common type of model that is not currently supported, but let us know if you need it by contacting us and we can add it to the roadmap.

Computer Vision Detection Types
Computer Vision Detection Categories from Lin et al. Microsoft COCO: Common Objects in Context

The figure above illustrates these tasks. Image classification is when an image is assigned one or more text labels. In image (a), the scene is assigned the labels “person”, “sheep”, and “dog”. Image (b) is an example of object localization (or object detection). Object localization creates bounding boxes around objects of interest and assigns labels. In this image, five sheep are localized separately along with one human and one dog. Finally, semantic segmentation is when each pixel is given a category label, as shown in image (c). Here we can see all the pixels belonging to sheep are labeled blue, the dog is labeled red, and the person is labeled teal.

While these examples highlight detection tasks on regular imagery, these computer vision models can also be applied to raster-formatted imagery. Raster data formats are the most common formats for satellite and aerial imagery. When objects of interest in raster imagery are localized, their bounding boxes can be georeferenced, which means that each pixel is mapped to spatial coordinates, such as latitude and longitude. Georeferenced object localization is therefore well suited for spatial analytics.
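
Conceptually, georeferencing a detection is just applying the raster’s affine geotransform to pixel coordinates. The minimal sketch below uses a made-up upper-left origin and pixel size to convert a pixel-space bounding box to longitude/latitude; real rasters carry this transform in their metadata.

# Toy georeferencing example with a hypothetical north-up geotransform.
origin_lon, origin_lat = -112.5, 33.7        # upper-left corner of the raster (illustrative)
pixel_width, pixel_height = 0.0001, -0.0001  # degrees per pixel; negative height because rows increase southward

def pixel_bbox_to_lonlat(col_min, row_min, col_max, row_max):
    west = origin_lon + col_min * pixel_width
    east = origin_lon + col_max * pixel_width
    north = origin_lat + row_min * pixel_height
    south = origin_lat + row_max * pixel_height
    return west, south, east, north

print(pixel_bbox_to_lonlat(100, 200, 180, 260))  # (west, south, east, north) in degrees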

https://wherobots.com/wp-content/uploads/2024/06/remotesensing-11-00339-g005.png

The example above shows various applications of object detection for localizing and classifying features in high resolution satellite and aerial imagery. This example comes from DOTA, a 15-class dataset of different objects in RGB and grayscale satellite imagery. Public datasets like DOTA are used to develop and benchmark machine learning models.

Not only are there many publicly available object detection models, but there are also many semantic segmentation models.

Semantic Segmentation
Sourced from “A Scale-Aware Masked Autoencoder for Multi-scale Geospatial Representation Learning”.

Not every machine learning model should be treated equally, and each has its own tradeoffs. You can see the difference between the ground truth image (human-annotated buildings representing the real world) and the segmentation results across two models (Scale-MAE and Vanilla MAE). These results are derived from the same image at two different resolutions (referred to as GSD, or Ground Sampling Distance).

  • Scale-MAE is a model developed to handle detection tasks at various resolutions with different sensor inputs. It uses a similar MAE model architecture as the Vanilla MAE, but is trained specifically for detection tasks on overhead imagery that span different resolutions.
  • The Vanilla MAE is not trained to handle varying resolutions in overhead imagery. Its performance suffers in the top row and especially the bottom row, where resolution is coarser, as seen by the mismatch between the Vanilla MAE output and the ground truth image, where many pixels are incorrectly classified.

Satellite Analytics Before Raster Inference

Without Raster Inference, typically a team who is looking to extract insights from overhead imagery using ML would need to:

  1. Deploy a distributed runtime to scale out workloads such as data loading, preprocessing, and inference.
  2. Develop functionality to operate on raster metadata to easily filter it by location to run inference workloads on specific areas of interest.
  3. Optimize models to run performantly on GPUs, which can involve complex rewrites of the underlying model prediction logic.
  4. Create and manage data preprocessing pipelines to normalize, resize, and collate raster imagery into the correct data type and size required by the model.
  5. Develop the logic to run data loading, preprocessing, and model inference efficiently at scale.

Raster Inference and its SQL and Python APIs abstract this complexity so you and your team can easily perform inference on massive raster datasets.

Raster Inference APIs for SQL and Python

Raster Inference offers APIs in both SQL and Python to run inference tasks. These APIs are designed to be easy to use, even if you’re not a machine learning expert. RS_CLASSIFY can be used for scene classification, RS_BBOXES_DETECT for object detection, and RS_SEGMENT for semantic segmentation. Each function produces tabular results which can be georeferenced either for the scene, object, or segmentation depending on the function. The records can be joined or visualized with other data (geospatial or traditional) to curate enriched datasets and insights. Here are SQL and Python examples for RS_Segment.

RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
df = df_raster_input.withColumn("segment_result", rs_segment(model_id, col("outdb_raster")))

Example: Mapping Electricity Infrastructure

Imagine you want to optimize the location of new EV charging stations, but you want to target locations based on the availability of green energy sources, such as local solar farms. You can use Raster Inference to detect and locate solar farms and cross-reference these locations with internal data or other vector geometries that capture demand for EV charging. This use case will be demonstrated in our upcoming release webinar on July 10th.

Let’s walk through how to use Raster Inference for this use case.

First, we run predictions on rasters to find solar farms. The following code block that calls RS_SEGMENT shows how easy this is.

CREATE OR REPLACE TEMP VIEW segment_fields AS (
    SELECT
        outdb_raster,
        RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
    FROM
    az_high_demand_with_scene
)

The confidence_array column produced from RS_SEGMENT can be assigned the same geospatial coordinates as the raster input and converted to a vector that can be spatially joined and processed with WherobotsDB using RS_SEGMENT_TO_GEOMS. We select a confidence threshold of 0.65 so that we only georeference high-confidence detections.

WITH t AS (
        SELECT RS_SEGMENT_TO_GEOMS(outdb_raster, confidence_array, array(1), class_map, 0.65) result
        FROM predictions_df
    )
    SELECT result.* FROM t
+----------+--------------------+--------------------+
|     class|avg_confidence_score|            geometry|
+----------+--------------------+--------------------+
|Solar Farm|  0.7205783606825462|MULTIPOLYGON (((-...|
|Solar Farm|  0.7273308333550763|MULTIPOLYGON (((-...|
|Solar Farm|  0.7301468510823231|MULTIPOLYGON (((-...|
|Solar Farm|  0.7180177244988899|MULTIPOLYGON (((-...|
|Solar Farm|   0.728077805771141|MULTIPOLYGON (((-...|
|Solar Farm|     0.7264981572898|MULTIPOLYGON (((-...|
|Solar Farm|  0.7044100126912517|MULTIPOLYGON (((-...|
|Solar Farm|  0.7137283466756343|MULTIPOLYGON (((-...|
+----------+--------------------+--------------------+

This allows us to integrate the vectorized model predictions with other spatial datasets and easily visualize the results with SedonaKepler.

https://wherobots.com/wp-content/uploads/2024/06/solar_farm_detection-1-1024x398.png

Here Raster Inference runs on an 85 GiB dataset with 2,200 raster scenes for Arizona. Using a Sedona (tiny) runtime, Raster Inference completed in 430 seconds, predicting solar farms for all low cloud cover satellite images for the state of Arizona for the month of October. If we scale up to a San Francisco (small) runtime, the inference speed nearly doubles. In general, average bytes processed per second by Wherobots increases as datasets scale in size because startup costs are amortized over time. Processing speed also increases as runtimes scale in size.

Inference time (seconds)    Runtime Size
430                         Sedona (tiny)
246                         San Francisco (small)

We use predictions from the output of Raster Inference to derive insights about which zip codes have the most solar farms, as shown below. This statement joins predicted solar farms with zip codes by location, then ranks zip codes by the pre-computed solar farm area within each zip code. We skipped this step for brevity but you can see it and others in the notebook example.

az_solar_zip_codes = sedona.sql("""
SELECT solar_area, any_value(az_zta5.geometry) AS geometry, ZCTA5CE10
FROM predictions_polys JOIN az_zta5
WHERE ST_Intersects(az_zta5.geometry, predictions_polys.geometry)
GROUP BY ZCTA5CE10
ORDER BY solar_area DESC
""")

https://wherobots.com/wp-content/uploads/2024/06/final_analysis.png

These predictions are made possible by SATLAS, a family of machine learning models released with Apache 2.0 licensing from Allen AI. The solar model demonstrated above was derived from the SATLAS foundational model. This foundational model can be used as a building block to create models to address specific detection challenges like solar farm detection. Additionally, there are many other open source machine learning models available for deriving insights from satellite imagery, many of which are provided by the TorchGeo project. We are just beginning to explore what these models can achieve for planetary-scale monitoring.

If you have a specific model you would like to see made available, please contact us to let us know.

For detailed instructions on using Raster Inference, please refer to our example Jupyter notebooks in the documentation.

https://wherobots.com/wp-content/uploads/2024/06/Screenshot_2024-06-08_at_2.11.07_PM-1024x683.png

Here are some links to get you started:
https://docs.wherobots.com/latest/tutorials/wherobotsai/wherobots-inference/segmentation/

https://docs.wherobots.com/latest/api/wherobots-inference/pythondoc/inference/sql_functions/

Getting Started

Getting started with WherobotsAI Raster Inference is easy. We’ve provided three models in Wherobots Cloud that can be used with our GPU optimized runtimes. Sign up for an account on Wherobots Cloud, send us a note to access the professional tier, start a GPU runtime, and you can run our example Jupyter notebooks to analyze satellite imagery in SQL or Python.

Stay tuned for updates on improvements to Raster Inference that will make it possible to run more models, including your own custom models. We’re excited to hear what models you’d like us to support, or the integrations you need to make running your own models even easier with Raster Inference. We can’t wait for your feedback and to see what you’ll create!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

🌶 Comparing taco chains :: a consumer retail cannibalization study with isochrones

Authors: Sean Knight and Ilya Marchenko

Using Wherobots for a Retail Cannibalization Study Comparing Two Leading Taco Chains

In this post, we explore how to implement a workflow from the commercial real estate (CRE) space using Wherobots Cloud. This workflow is commonly known as a cannibalization study, and we will be using WherobotsDB, POI data from OvertureMaps, the open source Valhalla API, and visualization capabilities offered by SedonaKepler.

NOTE: This is a guest Wherobots post from our friends at YuzuData. Reach out to them to learn more about their spatial data product services. You can also join for a demo scheduled on this use case with them on July 30th here.

What is a retail cannibalization study?

In CRE (commercial real estate), stakeholders are often interested in questions like “If we build a new fast food restaurant here, how will its performance be affected by other similar fast food locations that already exist nearby?”. The idea of the new fast food restaurant “eating into” the sales of other fast food restaurants that already exist nearby is what is known as ‘cannibalization’.

The main objective of studying this phenomenon is to determine the extent to which a new store might divert sales from existing stores owned by the same company or brand and evaluate the overall impact on the company’s market share and profitability in the area.

Cannibalization Study in Wherobots

For this case study, we will look at two taco chains which are located primarily in Texas: Torchy’s Tacos and Velvet Taco. In general, information about the performance of individual locations and customer demographics is proprietary. We can, however, still learn a great deal about the potential for cannibalization both between these two chains as competitors, and between individual locations of each chain. We also know, based on our own experience, that these chains compete with each other. Which taco shop to go to when we are visiting Texas is always a spicy debate.

We begin by importing modules that will be useful to us as we go on.

import geopandas as gpd
import pandas as pd
import requests
from sedona.spark import *
from pyspark.sql.functions import explode, array
from pyspark.sql import functions as F

Next, we can initiate a Sedona context.

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Identifying Points of Interest

Now, we need to retrieve the locations of Torchy’s Tacos and Velvet Taco locations. In general, one can do this via a variety of both free and paid means. We will look at a simple, free approach that is made possible by the integration of Overture Maps data into the Wherobots environment:

sedona.table("wherobots_open_data.overture.places_place"). \
createOrReplaceTempView("places")

We create a view of the Overture Maps places database, which contains information on points of interest (POIs) worldwide.

Now, we can select the POIs that are relevant to this exercise:

stores = sedona.sql("""
SELECT id, names.common[0].value as name, ST_X(geometry) as long,
 ST_Y(geometry) as lat, geometry, 
 CASE WHEN names.common[0].value LIKE "%Torchy's Tacos%" 
 THEN "Torchy's Tacos" 
 ELSE 'Velvet Taco' END AS chain
FROM places
WHERE addresses[0].region = 'TX'
AND (names.common[0].value LIKE "%Torchy's Tacos%" 
OR names.common[0].value LIKE '%Velvet Taco%')
""")

Calling stores.show() gives us a look at the Spark DataFrame we created:

+--------------------+--------------+-----------+----------+
|                  id|          name|       long|       lat|
+--------------------+--------------+-----------+----------+
|tmp_8104A79216254...|Torchy's Tacos|  -98.59689|  29.60891|
|tmp_D17CA8BD72325...|Torchy's Tacos|  -97.74175|  30.29368|
|tmp_F497329382C10...|   Velvet Taco|  -95.48866|  30.18314|
|tmp_9B40A1BF3237E...|Torchy's Tacos| -96.805853| 32.909982|
|tmp_38210E5EC047B...|Torchy's Tacos|  -96.68755|  33.10118|
|tmp_DF0C5DF6CA549...|Torchy's Tacos|  -97.75159|  30.24542|
|tmp_BE38CAC8D46CF...|Torchy's Tacos|  -97.80877|  30.52676|
|tmp_44390C4117BEA...|Torchy's Tacos|  -97.82594|   30.4547|
|tmp_8032605AA5BDC...|   Velvet Taco| -96.469695| 32.898634|
|tmp_0A2AA67757F42...|Torchy's Tacos|  -96.44858|  32.90856|
|tmp_643821EB9C104...|Torchy's Tacos|  -97.11933|  32.94021|
|tmp_0042962D27E06...|   Velvet Taco|-95.3905374|29.7444214|
|tmp_8D0E2246C3F36...|Torchy's Tacos|  -97.15952|  33.22987|
|tmp_CB939610BC175...|Torchy's Tacos|  -95.62067|  29.60098|
|tmp_54C9A79320840...|Torchy's Tacos|  -97.75604|  30.37091|
|tmp_96D7B4FBCB327...|Torchy's Tacos|  -98.49816|  29.60937|
|tmp_1BB732F35314D...|   Velvet Taco|  -95.41044|    29.804|
|tmp_55787B14975DD...|   Velvet Taco|-96.7173913|32.9758554|
|tmp_7DC02C9CC1FAA...|Torchy's Tacos|  -95.29544|  32.30361|
|tmp_1987B31B9E24D...|   Velvet Taco|  -95.41006| 29.770256|
+--------------------+--------------+-----------+----------+
only showing top 20 rows

We’ve retrieved the latitude and longitude of our locations, as well as the name of the chain each location belongs to. We used the CASE WHEN statement in our query in order to simplify the location names. This way, we can easily select all the stores from the Torchy’s Tacos chain, for example, and not have to worry about individual locations being called things like “Torchy’s Tacos – Rice Village” or “Velvet Taco Midtown”, etc.

We can also visualize these locations using SedonaKepler. First, we can create the map using the following snippet:

location_map = SedonaKepler.create_map(stores, "Locations", 
    config = location_map_cfg)

Then, we can display the results by simply calling location_map in the notebook. For convenience, we included the location_map_cfg Python dict in our notebook, which stores the settings necessary for the map to be created with the locations color-coded by chain. If we wish to make modifications to the map and save the new configuration for later use, we can do so by calling location_map.config and saving the result either as a cell in our notebook or in a separate location_map_cfg.py file.
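
If you want to keep the tweaked configuration outside the notebook, one simple approach (a sketch, not the only option) is to write location_map.config out as a small Python module that later sessions can import:

# Persist the current Kepler configuration as an importable Python file.
with open("location_map_cfg.py", "w") as f:
    f.write(f"location_map_cfg = {location_map.config!r}")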

Generating Isochrones

Now, for each of these locations, we can generate a polygon known as an isochrone or drivetime. These polygons represent the areas that are within a certain time’s drive from the given location. We will generate these drivetimes using the Valhalla isochrone API:

def get_isochrone(lat, lng, costing, time_steps, name, location_id):
    url = "https://valhalla1.openstreetmap.de/isochrone"
    params = {
      "locations": [{"lon": lng, "lat": lat}],
      "contours": [{"time": i} for i in time_steps],
      "costing": costing,
      "polygons": 1,
    }
    response = requests.post(url, json=params)
    if response:
        result = response.json()
        if 'error_code' not in result.keys():
            df = gpd.GeoDataFrame.from_features(result)
            df['name'] = name
            df['id'] = location_id
            return df[['name','id','geometry']]

The function takes as its input a latitude and longitude value, a costing parameter, a list of time steps, a location name, and a location id. The output is a dataframe which contains a Shapely polygon representing the isochrone, along with the name and id of the location the isochrone corresponds to.

We have separate columns for a location id and a location name so that we can use the id column to examine isochrones for individual restaurants and we can use the name column to look at isochrones for each of the chains.

The costing parameter can take on several different values (see the API reference here), and it can be used to create “drivetimes” assuming the user is either walking, driving, or taking public transport.
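
For example, swapping the costing value produces walking isochrones instead of driving ones (Valhalla also accepts costing models such as pedestrian and bicycle); the call below reuses the get_isochrone helper defined above for a single illustrative location, with a made-up location id.

# Hypothetical single-location example: 5- and 10-minute walking isochrones.
walk_isochrones = get_isochrone(
    lat=30.29368, lng=-97.74175,   # one of the Austin locations from the table above
    costing="pedestrian",          # walking instead of driving
    time_steps=[5, 10],
    name="Torchy's Tacos",
    location_id="example-id",
)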

We create a GeoDataFrame of all of the 5-minute drivetimes for our taco restaurant locations

drivetimes_5_min = pd.concat([get_isochrone(row.lat, row.long, 'auto', [5],
 row.chain, row.id) for row in stores.select('id','chain','lat','long').collect()])

and then save it to our S3 storage for later use:

drivetimes_5_min.to_csv('s3://path/drivetimes_5_min_torchys_velvet.csv',
 index = False)

Because we are using a free API and we have to create quite a few of these isochrones, we highly recommend saving the file for later analysis. For the purposes of this blog, we have provided a ready-made isochrone file here, which we can load into Wherobots with the following snippet:

sedona.read.option('header','true').format('csv') .\
load('s3://path/drivetimes_5_min_torchys_velvet.csv') .\
createOrReplaceTempView('drivetimes_5_min')

We can now visualize our drivetime polygons in SedonaKepler. As before, we first create the map with the snippet below.

map_isochrones = sedona.read.option('header','true').format('csv'). \
load('s3://path/drivetimes_5_min_torchys_velvet.csv')

isochrone_map = SedonaKepler.create_map(map_isochrones, "Isochrones",
 config = isochrone_map_cfg)

Now, we can display the result by calling isochrone_map.

The Analysis

At this point, we have a collection of the Torchy’s and Velvet Taco locations in Texas, and we know the areas which are within a 5-minute drive of each location. What we want to do now is to estimate the number of potential customers that live near each of these locations, and the extent to which these populations overlap.

A First Look

Before we look at how these two chains might compete with each other, let’s also take a look at the extent to which restaurants within each chain might be cannibalizing each other’s sales. A quick way to do this is by using the filtering feature in Kepler to look at isochrones for a single chain:



We see that locations for each chain are fairly spread out and (at least at the 5-minute drivetime level), there is not a high degree of cannibalization within each chain. Looking at the isochrones for both chains, however, we notice that Velvet Taco locations often tend to be near Torchy’s Tacos locations (or vice-versa). At this point, all we have are qualitative statements based on these maps. Next, we will show how to use H3 and existing open-source datasets to make these statements more quantitative.

Estimating Cannibalization Potential

As we can see by looking at the map of isochrones above, they are highly irregular polygons which have a considerable amount of overlap. In general, these polygons are not described in a ‘nice’ way by any administrative boundaries such as census block groups, census tracts, etc. Therefore, we will have to be a little creative in order to estimate the population inside them.

One way of doing this using the tools provided by Apache Sedona and Wherobots is to convert these polygons to H3 hexes. We can do this with the following snippet:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
createOrReplaceTempView('h3_isochrones')

This turns our table of drivetime polygons into a table where each row represents a hexagon with sides roughly 400m long, which is part of a drivetime polygon. We also record the chain that these hexagons are associated with (the chain that the polygon they came from belongs to). We store each hexagon in its own row because this will simplify the process of estimating population later on.

Although the question of estimating population inside individual H3 hexes is also a difficult one (we will release a notebook on this soon), open-source datasets with this information are available online, and we will use one such dataset, provided by Kontur:

kontur = sedona.read.option('header','true') .\
load('s3://path/us_h3_8_pop.geojson', format="json") .\
drop('_corrupt_record').dropna() .\
selectExpr('CAST(CONV(properties.h3, 16, 10) AS BIGINT) AS h3',
 'properties.population as population')

kontur.createOrReplaceTempView('kontur')

We can now enhance our h3_isochrones table with population counts for each H3 hex:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
join(kontur, 'h3', 'left').distinct().createOrReplaceTempView('h3_isochrones')

At this stage, we can also quickly compute the cannibalization potential within each chain. Using the following code, for example, we can estimate the number of people who live within a 5-minute drive of more than one Torchy’s Tacos:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
join(kontur, 'h3', 'left').filter('name LIKE "%Torchy%"').select('h3','population') .\\
groupBy('h3').count().filter('count >= 2').join(kontur, 'h3', 'left').distinct() .\\
agg(F.sum('population')).collect()[0][0]
97903.0

We can easily change this code to compute the same information for Velvet Taco by changing filter('name LIKE "%Torchy%"') in line 4 of the above snippet to filter('name LIKE "%Velvet%"'). If we do this, we will see that 100298 people live within a 5-minute drive of more than one Velvet Taco. Thus, we see that the Torchy’s Tacos brand appears to be slightly better at avoiding cannibalization among its own locations (especially given that Torchy’s Tacos has more locations than Velvet Taco).

Now, we can run the following query to show the number of people in Texas who live within a 5-minute drive of a Torchy’s Tacos:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE "%Torchy's%"
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()

The reason we select distinct H3 hexes here is because a single hex can belong to more than one isochrone (as evidenced by the SedonaKepler visualizations above). We get the following output:

+---------------+
|sum(population)|
+---------------+
|      1546765.0|
+---------------+

So roughly 1.5 million people in Texas live within a 5-minute drive of a Torchy’s Tacos location. Looking at our previous calculations for how many people live near more than one restaurant of the same chain, we can see that Torchy’s Tacos locations near each other cannibalize about 6.3% of the potential customers who live within 5 minutes of a Torchy’s location.

Running a similar query for Velvet Taco tells us that roughly half as many people live within a 5-minute drive of a Velvet Taco:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE '%Velvet Taco%'
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()
+---------------+
|sum(population)|
+---------------+
|       750360.0|
+---------------+

As before, we can also see that Velvet Taco locations near each other cannibalize about 13.4% of the potential customers who live within 5 minutes of a Velvet Taco location.
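
The percentages above come straight from the counts we computed; as a quick sanity check:

# Within-chain cannibalization as a share of each chain's 5-minute-drive population.
print(97903 / 1546765)   # ~0.063 -> ~6.3% for Torchy's Tacos
print(100298 / 750360)   # ~0.134 -> ~13.4% for Velvet Taco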

Now, we can estimate the potential for cannibalization between these two chains:

sedona.sql("""
WITH overlap_h3 (h3, population) AS
(
    SELECT DISTINCT a.h3, ANY_VALUE(a.population)
    FROM h3_isochrones a LEFT JOIN h3_isochrones b ON a.h3 = b.h3
    WHERE a.name != b.name
    GROUP BY a.h3
)
SELECT sum(population)
FROM overlap_h3
""").show()

which gives:

+---------------+
|sum(population)|
+---------------+
|       415033.0|
+---------------+

We can see that more than half of the people who live near a Velvet Taco location also live near a Torchy’s Tacos location, and we can visualize this population overlap:

isochrones_h3_map_data = sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\
join(kontur, 'h3', 'left').select('name','population',array('h3')).withColumnRenamed('array(h3)','h3').selectExpr('name','population','ST_H3ToGeom(h3)[0] AS geometry')

isochrones_h3_map = SedonaKepler.create_map(isochrones_h3_map_data, 'Isochrones in H3', config = isochrones_h3_map_cfg)

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet

Introduction

The Overture Maps Foundation (OMF) has recently released its first open-source Parquet dataset (https://overturemaps.org/download/), divided into four themes – places of interest, buildings, transportation networks, and administrative boundaries.

Apache Sedona is an open-source and distributed geospatial analytics engine. It enables large-scale spatial data processing and analysis through native handling of geospatial data types, Spatial SQL for querying, spatial indexing techniques, and support for standard formats like Shapefile, GeoParquet, GeoJSON, WKT, WKB, GeoTiff, and ArcGrid.

In this article, we demonstrate the capabilities of Apache Sedona for working with massive Overture Maps data. The article also shows that GeoParquet OMF data produced by Sedona can accelerate the spatial queries by 60X. This exploration underscores Sedona’s proficiency in handling scalable geospatial tasks using prevalent industry formats like Parquet and GeoParquet.

This article is structured as follows: First, we will explore and analyze the data in its original Parquet format. Then, we will convert it to the GeoParquet format with built-in spatial indexes. Finally, we will use Sedona in a Jupyter notebook to explore and analyze the dataset in GeoParquet form, leveraging capabilities like spatial SQL and spatial Python to derive insights.

Study 1: Analyze the OMF Parquet data using Sedona

Overture Maps uses the Parquet format to store the datasets. You can find the schema of these datasets on the OMF website (https://docs.overturemaps.org/reference). In this example, we will be using the Buildings theme dataset, which is the largest dataset in OMF at around 120 GB. The schema of this dataset is as follows:

id: string (nullable = true)
updatetime: string (nullable = true)
version: integer (nullable = true)
names: map (nullable = true)
 |-- key: string
 |-- value: array (valueContainsNull = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
level: integer (nullable = true)
height: double (nullable = true)
numfloors: integer (nullable = true)
class: string (nullable = true)
sources: array (nullable = true)
 |-- element: map (containsNull = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
bbox: struct (nullable = true)
 |-- minx: double (nullable = true)
 |-- maxx: double (nullable = true)
 |-- miny: double (nullable = true)
 |-- maxy: double (nullable = true)
geometry: binary (nullable = true)

To start using Sedona on OMF data, we first have to create a SedonaContext:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Apache Sedona facilitates easy loading and analysis of Parquet datasets through its APIs.

df = sedona.read.format("parquet").load("s3a://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Since Parquet does not natively support geospatial data types, the geometry columns in this dataset are stored in the WKB (Well-Known Binary) format. Sedona provides functionality to decode the WKB-encoded geometries into proper spatial types like points and polygons.

sedona.sql('SELECT ST_GeomFromWKB(geometry) as geometry FROM df').show(5)

+--------------------+
|            geometry|
+--------------------+
|POLYGON ((7.85925...|
|POLYGON ((2.69399...|
|POLYGON ((-95.775...|
|POLYGON ((103.141...|
|POLYGON ((111.900...|
+--------------------+

After processing, the dataset can be used for spatial range queries. We can perform an ST_Contains query between the dataset and Washington state’s boundary polygon to find buildings within Washington state.

df = df.filter("ST_Contains(ST_GeomFromWKT('POLYGON((-123.3208 49.0023,-123.0338 49.0027,-122.0650 49.0018,-121.7491 48.9973,-121.5912 48.9991,-119.6082 49.0009,-118.0378 49.0005,-117.0319 48.9996,-117.0415 47.9614,-117.0394 46.5060,-117.0394 46.4274,-117.0621 46.3498,-117.0277 46.3384,-116.9879 46.2848,-116.9577 46.2388,-116.9659 46.2022,-116.9254 46.1722,-116.9357 46.1432,-116.9584 46.1009,-116.9762 46.0785,-116.9433 46.0537,-116.9165 45.9960,-118.0330 46.0008,-118.9867 45.9998,-119.1302 45.9320,-119.1708 45.9278,-119.2559 45.9402,-119.3047 45.9354,-119.3644 45.9220,-119.4386 45.9172,-119.4894 45.9067,-119.5724 45.9249,-119.6013 45.9196,-119.6700 45.8565,-119.8052 45.8479,-119.9096 45.8278,-119.9652 45.8245,-120.0710 45.7852,-120.1705 45.7623,-120.2110 45.7258,-120.3628 45.7057,-120.4829 45.6951,-120.5942 45.7469,-120.6340 45.7460,-120.6924 45.7143,-120.8558 45.6721,-120.9142 45.6409,-120.9471 45.6572,-120.9787 45.6419,-121.0645 45.6529,-121.1469 45.6078,-121.1847 45.6083,-121.2177 45.6721,-121.3392 45.7057,-121.4010 45.6932,-121.5328 45.7263,-121.6145 45.7091,-121.7361 45.6947,-121.8095 45.7067,-121.9338 45.6452,-122.0451 45.6088,-122.1089 45.5833,-122.1426 45.5838,-122.2009 45.5660,-122.2641 45.5439,-122.3321 45.5482,-122.3795 45.5756,-122.4392 45.5636,-122.5676 45.6006,-122.6891 45.6236,-122.7647 45.6582,-122.7750 45.6817,-122.7619 45.7613,-122.7962 45.8106,-122.7839 45.8642,-122.8114 45.9120,-122.8148 45.9612,-122.8587 46.0160,-122.8848 46.0604,-122.9034 46.0832,-122.9597 46.1028,-123.0579 46.1556,-123.1210 46.1865,-123.1664 46.1893,-123.2810 46.1446,-123.3703 46.1470,-123.4314 46.1822,-123.4287 46.2293,-123.4946 46.2691,-123.5557 46.2582,-123.6209 46.2573,-123.6875 46.2497,-123.7404 46.2691,-123.8729 46.2350,-123.9292 46.2383,-123.9711 46.2677,-124.0212 46.2924,-124.0329 46.2653,-124.2444 46.2596,-124.2691 46.4312,-124.3529 46.8386,-124.4380 47.1832,-124.5616 47.4689,-124.7566 47.8012,-124.8679 48.0423,-124.8679 48.2457,-124.8486 48.3727,-124.7539 48.4984,-124.4174 48.4096,-124.2389 48.3599,-124.0116 48.2964,-123.9141 48.2795,-123.5413 48.2247,-123.3998 48.2539,-123.2501 48.2841,-123.1169 48.4233,-123.1609 48.4533,-123.2220 48.5548,-123.2336 48.5902,-123.2721 48.6901,-123.0084 48.7675,-123.0084 48.8313,-123.3215 49.0023,-123.3208 49.0023))'), geometry) = true")
df.selectExpr("names", "height", "numfloors", "geometry").show(5)

+--------------------+------+---------+--------------------+
|               names|height|numfloors|            geometry|
+--------------------+------+---------+--------------------+
|{common -> [{valu...|   5.1|        1|POLYGON ((-122.32...|
|{common -> [{valu...|  10.9|        1|POLYGON ((-122.18...|
|{common -> [{valu...|   7.9|        1|POLYGON ((-122.31...|
|{common -> [{valu...|   9.2|        1|POLYGON ((-122.22...|
|{common -> [{valu...|   6.4|        1|POLYGON ((-122.21...|
+--------------------+------+---------+--------------------+

We can also leverage Apache Spark’s filter pushdown capabilities on non-spatial columns to reduce the data before geospatial analysis. Since the building dataset was large, we applied a highly selective filter pushdown:

df_building = df_building.filter(~(size(col("names")) <= 0))\
                         .filter(col("height") <= 200)\
                         .filter(~(size(col("sources")) <= 0))\
                         .filter(col("numfloors") == 1)

The count of the dataset after intersecting with Washington State boundary:

df.count()

Count: 511
Time: 1h 31min 42s

Discussions

The spatial query and non-spatial filter pushdown on the dataset took an hour and a half to complete on a single-node Sedona Docker environment. Therefore, users can barely do any interactive analytics on it. The reason is two-fold:

  1. The code was executed on a single-node Sedona docker environment with 4GB executor RAM, not a real cluster. Performance will be significantly improved if run in a distributed Sedona environment across multiple nodes.
  2. This lengthy processing time is primarily due to the limitations of using the Parquet format without geospatial optimizations. Parquet lacks native spatial data types and spatial indexing schemes suited for efficient geospatial analysis. The required data loading and preparation are time-consuming. Using a format optimized for geospatial workloads, like GeoParquet, could significantly reduce the pre-processing time for this analysis.

Study 2: Converting from Parquet to GeoParquet

Sedona enables spatial data format conversion between Parquet and GeoParquet. For more details, please refer to [Sedona GeoParquet blogpost]. After realizing the limitations of Parquet, we decided to leverage this functionality and see how much improvement the GeoParquet format brings.

To achieve the best spatial filter pushdown performance, we need to partition the data based on spatial proximity. In other words, nearby spatial objects should be put in the same GeoParquet partition. For this purpose, we first create a GeoHash ID for each geometry in the OMF data using Sedona’s ST_GeoHash. This function generates geographic hashes for a given geometry at a specified precision. The precision refers to the size of the grid cells: a precision of 2 means each cell covers roughly 1,250 km x 625 km. This precision level was chosen as an optimal balance, since too high a precision produces many small files that can slow down query processing and reduce read throughput.

df.withColumn("geohash", expr("ST_GeoHash(geometry, 2)")).repartition("geohash")

By repartitioning with GeoHashing, data points with the same GeoHash ID get assigned to contiguous partition slices based on precision. This clusters nearby points together in the same partitions.

Finally, we store the Overture Maps data as GeoParquet in our public Wherobots S3 bucket. The same operation is applied to all OMF datasets.

df.write.partitionBy("geohash").format("geoparquet").save("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Discussions

This conversion on all OMF datasets took around 15 minutes to finish on a 20-node AWS EMR cluster. Each node is an m3.2xlarge instance with 8 CPUs and 30 GB of RAM. In closing, Sedona provides a streamlined way to convert datasets to GeoParquet format and partition using GeoHashes for optimal performance. The entire GeoParquet dataset converted here is available in the Wherobots public S3 bucket for you to experiment with.

Study 3: Interactive geospatial analytics on OMF GeoParquet data

We’ll ingest the public Wherobots dataset from S3 into a Spark DataFrame using Sedona, as before, and intersect the dataset with the Bellevue, Washington boundary.

Fast geospatial queries via Sedona GeoParquet spatial filter pushdown

With GeoParquet, we observe improved query performance versus the Parquet format over the following simple spatial range query (i.e., spatial filter).

spatial_filter = "POLYGON ((-122.235128 47.650163, -122.233796 47.65162, -122.231581 47.653287, -122.228514 47.65482, -122.227526 47.655204, -122.226175 47.655729, -122.222039 47.656743999999996, -122.218428 47.657464, -122.217026 47.657506, -122.21437399999999 47.657588, -122.212091 47.657464, -122.212135 47.657320999999996, -122.21092999999999 47.653552, -122.209834 47.650121, -122.209559 47.648976, -122.209642 47.648886, -122.21042 47.648658999999995, -122.210897 47.64864, -122.211005 47.648373, -122.21103099999999 47.648320999999996, -122.211992 47.64644, -122.212457 47.646426, -122.212469 47.646392, -122.212469 47.646088999999996, -122.212471 47.645213, -122.213115 47.645212, -122.213123 47.644576, -122.21352999999999 47.644576, -122.213768 47.644560999999996, -122.21382 47.644560999999996, -122.21382 47.644456999999996, -122.21373299999999 47.644455, -122.213748 47.643102999999996, -122.213751 47.642790999999995, -122.213753 47.642716, -122.213702 47.642697999999996, -122.213679 47.642689999999995, -122.21364 47.642678, -122.213198 47.642541, -122.213065 47.642500000000005, -122.212918 47.642466, -122.21275 47.642441, -122.212656 47.642433, -122.21253899999999 47.642429, -122.212394 47.64243, -122.212182 47.642444999999995, -122.211957 47.642488, -122.211724 47.642551999999995, -122.21143599999999 47.642647, -122.210906 47.642834, -122.210216 47.643099, -122.209858 47.643215, -122.20973000000001 47.643248, -122.20973599999999 47.643105, -122.209267 47.643217, -122.208832 47.643302, -122.208391 47.643347999999996, -122.207797 47.643414, -122.207476 47.643418, -122.20701199999999 47.643397, -122.206795 47.643387999999995, -122.205742 47.643246, -122.20549 47.643201999999995, -122.20500200000001 47.643119, -122.204802 47.643085, -122.204641 47.643066, -122.204145 47.643012, -122.203547 47.643012, -122.203097 47.643107, -122.20275699999999 47.643283, -122.202507 47.643496999999996, -122.202399 47.643653, -122.202111 47.643771, -122.201668 47.643767, -122.201363 47.643665, -122.20133 47.643648999999996, -122.201096 47.643536, -122.200744 47.64328, -122.200568 47.64309, -122.200391 47.642849, -122.200162 47.642539, -122.199896 47.642500000000005, -122.19980799999999 47.642424, -122.199755 47.642376999999996, -122.199558 47.642227999999996, -122.199439 47.642157, -122.199293 47.642078999999995, -122.199131 47.642004, -122.198928 47.641925, -122.19883 47.641892, -122.19856300000001 47.641811999999994, -122.198203 47.641731, -122.197662 47.641619999999996, -122.196819 47.641436, -122.196294 47.641309, -122.196294 47.642314, -122.19628 47.642855, -122.196282 47.642897999999995, -122.196281 47.643111, -122.196283 47.643415, -122.196283 47.643508999999995, -122.19628399999999 47.643739, -122.196287 47.644203999999995, -122.196287 47.644262999999995, -122.19629 47.644937999999996, -122.19629 47.644954999999996, -122.196292 47.645271, -122.196291 47.645426, -122.19629499999999 47.646315, -122.19629499999999 47.646432, -122.195925 47.646432, -122.195251 47.646432, -122.190853 47.646429999999995, -122.187649 47.646428, -122.187164 47.646426, -122.18683 47.646426, -122.185547 47.646409, -122.185546 47.646316, -122.185537 47.645599, -122.185544 47.644197, -122.185537 47.643294999999995, -122.185544 47.642733, -122.185541 47.641757, -122.185555 47.640681, -122.185561 47.63972, -122.185557 47.638228999999995, -122.185591 47.635419, -122.185611 47.634750999999994, -122.18562299999999 47.634484, -122.18561700000001 47.634375999999996, -122.185592 47.634311, -122.185549 47.634232999999995, -122.185504 
47.634181999999996, -122.185426 47.634119, -122.184371 47.633424999999995, -122.18400000000001 47.633198, -122.183896 47.633134, -122.1838 47.633067, -122.18375499999999 47.633019999999995, -122.183724 47.632959, -122.183695 47.632858, -122.183702 47.632675, -122.182757 47.632622999999995, -122.182365 47.63259, -122.18220600000001 47.632562, -122.181984 47.632504999999995, -122.18163799999999 47.632363, -122.18142 47.632262999999995, -122.181229 47.632165, -122.181612 47.632172999999995, -122.18271899999999 47.632151, -122.183138 47.632135, -122.18440000000001 47.632081, -122.184743 47.632065999999995, -122.185312 47.63205, -122.185624 47.632047, -122.185625 47.631873999999996, -122.184618 47.63187, -122.184291 47.631878, -122.184278 47.631817999999996, -122.183882 47.629942, -122.182689 47.623548, -122.182594 47.622789999999995, -122.182654 47.622155, -122.183135 47.622372999999996, -122.183471 47.622506, -122.18360200000001 47.622552, -122.183893 47.622637999999995, -122.184244 47.62272, -122.184618 47.622777, -122.184741 47.622727999999995, -122.184605 47.622679, -122.18424 47.622622, -122.183985 47.622569, -122.183717 47.622501, -122.183506 47.622439, -122.18327 47.622357, -122.18305699999999 47.622271999999995, -122.182669 47.622088999999995, -122.182796 47.621545, -122.18347 47.619628999999996, -122.18365 47.619098, -122.183859 47.6184, -122.183922 47.617793999999996, -122.183956 47.617292, -122.183792 47.616388, -122.183261 47.614391999999995, -122.183202 47.613802, -122.183209 47.613155, -122.183436 47.612384999999996, -122.18395100000001 47.610445999999996, -122.184338 47.60924, -122.184657 47.609116, -122.18481 47.609051, -122.18491900000001 47.608987, -122.184974 47.608942, -122.185047 47.608846, -122.185082 47.608743999999994, -122.185109 47.608526999999995, -122.185116 47.608359, -122.18513 47.608315999999995, -122.185157 47.608273999999994, -122.185183 47.608247, -122.185246 47.608214, -122.185354 47.608196, -122.185475 47.608191999999995, -122.185472 47.606697, -122.185472 47.606373999999995, -122.185521 47.606272, -122.185528 47.606210999999995, -122.185506 47.606037, -122.185451 47.605872999999995, -122.185411 47.605781, -122.185358 47.605681999999995, -122.185248 47.605509999999995, -122.185127 47.605365, -122.185058 47.605292, -122.184772 47.605038, -122.184428 47.604834, -122.184122 47.604693999999995, -122.183775 47.604574, -122.183644 47.604546, -122.183708 47.604400999999996, -122.183749 47.604223999999995, -122.18376 47.604037, -122.183707 47.603778, -122.183619 47.603556999999995, -122.183559 47.603406, -122.183488 47.603303, -122.183824 47.603167, -122.184108 47.603052, -122.184478 47.602902, -122.18543 47.602495, -122.186669 47.601957, -122.186433 47.601220999999995, -122.186341 47.601127999999996, -122.18874199999999 47.593742999999996, -122.188434 47.592338999999996, -122.188479 47.591786, -122.188217 47.591269999999994, -122.18795399999999 47.590871, -122.186822 47.589228, -122.187421 47.589228999999996, -122.18848299999999 47.589228999999996, -122.188433 47.587922999999996, -122.18990000000001 47.588547, -122.191368 47.589169999999996, -122.19158 47.589222, -122.191779 47.589254999999994, -122.192117 47.589289, -122.191569 47.587478999999995, -122.191323 47.586628999999995, -122.191295 47.586554, -122.191268 47.586479, -122.191192 47.586318, -122.191163 47.586268999999994, -122.1911 47.586164, -122.19099 47.586011, -122.19067 47.585668999999996, -122.1905 47.585515, -122.190301 47.58531, -122.190143 47.585152, -122.189573 47.584576999999996, -122.188702 
47.583735999999995, -122.188646 47.583679, -122.188239 47.583258, -122.188037 47.583005, -122.187832 47.582657, -122.187726 47.582164999999996, -122.18769499999999 47.581964, -122.18768299999999 47.581781, -122.187678 47.581592, -122.18766099999999 47.581455, -122.187674 47.581311, -122.18768 47.581146, -122.187722 47.580877, -122.187817 47.580569999999994, -122.187932 47.580301999999996, -122.188047 47.580087, -122.188161 47.579933999999994, -122.188399 47.579660999999994, -122.18851699999999 47.579547, -122.188621 47.579454, -122.188042 47.579493, -122.18762 47.579527, -122.187806 47.579358, -122.188009 47.579175, -122.18814499999999 47.579051, -122.188177 47.579021, -122.18842000000001 47.5788, -122.188638 47.578461, -122.188895 47.57806, -122.189791 47.577281, -122.190008 47.577103, -122.190372 47.576805, -122.19119 47.576358, -122.191877 47.576087, -122.193025 47.57566, -122.194317 47.575185999999995, -122.196061 47.574664, -122.197239 47.574386999999994, -122.197873 47.574267, -122.198286 47.574189999999994, -122.199091 47.574044, -122.199067 47.574574999999996, -122.199007 47.575921, -122.200335 47.578222, -122.20057299999999 47.578345999999996, -122.2009 47.578517999999995, -122.201095 47.578621999999996, -122.20138399999999 47.578776999999995, -122.201465 47.57882, -122.201516 47.578846999999996, -122.205753 47.581112, -122.209515 47.583124, -122.210634 47.583721, -122.21473399999999 47.587021, -122.21538699999999 47.588254, -122.21580399999999 47.589042, -122.216534 47.590421, -122.220092 47.596261, -122.220434 47.596821, -122.22041899999999 47.597837999999996, -122.220289 47.606455, -122.220234 47.610121, -122.22048 47.615221999999996, -122.220359 47.615379, -122.220283 47.615477999999996, -122.21999 47.615854999999996, -122.219993 47.61597, -122.22023300000001 47.616634, -122.220356 47.616687999999996, -122.220409 47.616712, -122.221401 47.618538, -122.22142 47.618573, -122.221456 47.618635, -122.221791 47.619222, -122.222492 47.619682999999995, -122.222799 47.619886, -122.222083 47.620368, -122.222046 47.620407, -122.222028 47.620449, -122.222025 47.620483, -122.22203999999999 47.620523999999996, -122.222079 47.620557999999996, -122.222156 47.620594999999994, -122.222458 47.620629, -122.222454 47.620673, -122.222454 47.620711, -122.22244599999999 47.621041999999996, -122.223056 47.621041, -122.223129 47.62104, -122.223153 47.62104, -122.223574 47.621041, -122.22377900000001 47.621041, -122.223857 47.621041, -122.22467499999999 47.621041, -122.224712 47.62104, -122.224958 47.62104, -122.225167 47.621049, -122.226882 47.621037, -122.227565 47.621032, -122.228002 47.621029, -122.22797800000001 47.621300999999995, -122.227919 47.626574999999995, -122.227914 47.627085, -122.227901 47.6283, -122.227881 47.630069, -122.227869 47.631177, -122.227879 47.631952999999996, -122.22789 47.633879, -122.227886 47.63409, -122.227871 47.635534, -122.227918 47.635565, -122.228953 47.635624, -122.22895199999999 47.635571999999996, -122.231018 47.635574999999996, -122.233276 47.635588999999996, -122.233287 47.63617, -122.233273 47.63639, -122.233272 47.636469999999996, -122.23327 47.636578, -122.233266 47.636827, -122.233263 47.636851, -122.233262 47.637014, -122.23322999999999 47.638110999999995, -122.233239 47.638219, -122.233262 47.638279, -122.233313 47.638324999999995, -122.233255 47.638359, -122.233218 47.638380999999995, -122.233153 47.638450999999996, -122.233136 47.638552999999995, -122.233137 47.638692, -122.232715 47.639348999999996, -122.232659 47.640093, -122.232704 47.641375, 
-122.233821 47.645111, -122.234906 47.648874, -122.234924 47.648938, -122.235128 47.650163))"

df = sedona.read.format("geoparquet").load("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")
df = df.filter("ST_Contains(ST_GeomFromWKT('"+spatial_filter+"'), geometry) = true")
df.count()

Count: 3423743
Time: 2min 47s

Switching to GeoParquet from regular Parquet reduced query time from over an hour and a half to around 3 minutes, a 60X speedup. This query was done on the same single-node Sedona Docker environment, without non-spatial filter pushdowns. Further gains could be achieved by leveraging a distributed Sedona cluster. Overall, GeoParquet enables more efficient geospatial queries through both spatial partitioning and enhanced filter pushdown.

Interactive geospatial visualization via Sedona Kepler

We also wanted to highlight SedonaKepler, our interactive visualization API built on KeplerGl. SedonaKepler provides a powerful, customizable visualization experience through a simple API. Creating visualizations with SedonaKepler is as straightforward as calling a single function. The map created by the code below indicates the buildings in Bellevue, Washington.

from sedona.spark import *

SedonaKepler.create_map(df, "Building")

Time: 2min 38s


Photo: Visualization of building dataset spatially filtered to only include data for the city of Bellevue, Washington.

Of course, you can also choose another region and see what the dataset looks like.


Photo: Visualization of connector and segment datasets spatially filtered to only include data for Mount Rainier National Park, Washington, in Satellite view.

SedonaKepler can also visualize multiple datasets together, revealing insights across interconnected geospatial data. To demonstrate, we are using the segment and connector datasets, as they highlight the transportation network from OMF.

The dots below represent connection points between road segments. The road segments themselves are shown as yellow lines, representing paths that can be traversed by people, cars, or other vehicles. By layering this networked transport data over a map, we can gain insights into how the transportation infrastructure links together across the geographic area.

map_connector = SedonaKepler.create_map(df_connector, "Connector")
SedonaKepler.add_df(map_connector, df_segment, name="Segment")
map_connector

Time: 3min 11s


Photo: Visualization of connector and segment datasets spatially filtered to only include data for the city of Bellevue, Washington.

The Take-Away

Apache Sedona enables both the reading and writing of GeoParquet files, a specialized Parquet variant tailored for spatial data interchange. When executing spatial range queries on GeoParquet files, Apache Sedona supports spatial filter pushdown, optimizing query performance with an over 60X speedup. SedonaKepler is a powerful tool for creating interactive, easy-to-navigate visualizations, and it allows you to build visualizations from multiple datasets.

Try it yourself

Notebooks

All notebooks used in this article are available in the Wherobots GitHub repository: https://github.com/wherobots/OvertureMaps

Interactive notebook using GeoParquet and Sedona

Use Wherobots to deploy Sedona to cloud vendors

The other notebooks used in Studies 1 and 2 can be run on an AWS EMR or Databricks cluster. If you want to try them out, please sign up here: https://www.wherobots.ai/demo

Wherobots is a spatial data analytics and AI platform trusted in production, at scale, from the original creators of Apache Sedona.

Free and public Overture Maps GeoParquet data from Wherobots

The GeoParquet format data produced in Study 2 is provided by Wherobots for free. It has the same schema and license as the original Overture Maps Parquet data, except that the geometry column is stored as a geometry type and there is an additional geohash column of string type. You can access the datasets at the following locations (a read example follows the list):

  • Buildings: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building
  • Places: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=place
  • AdminBoundary: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=administrativeBoundary
  • AdminLocality: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=locality
  • Transportation Connector: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=connector
  • Transportation Segment: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=segment
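
As a minimal sketch of reading one of these datasets (using the Buildings path above; the bounding box is illustrative, and a Sedona-enabled Spark session named sedona is assumed, as elsewhere in this article):

# Load the public Buildings GeoParquet dataset and apply a spatial filter.
# The ST_Contains predicate is pushed down to the GeoParquet reader.
df = sedona.read.format("geoparquet").load(
    "s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")
df = df.filter(
    "ST_Contains(ST_GeomFromWKT('POLYGON ((-122.24 47.63, -122.18 47.63, "
    "-122.18 47.67, -122.24 47.67, -122.24 47.63))'), geometry)")
df.show(5)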

Announcing Apache Sedona 1.4 – A Giant Leap in Geospatial Processing

Hello Apache Sedona community,

We’re exhilarated to share the launch of Apache Sedona 1.4 (https://sedona.apache.org/1.4.1/), encompassing both 1.4.0 and 1.4.1 versions, signifying a transformative stage in geospatial analytics. These successive versions introduce an array of striking new features, major enhancements, and indispensable bug fixes.

Top-Notch Highlights

Here are the standout features spanning across these two versions:

  1. Use SedonaContext as a unified entry point (SEDONA-287): This allows you to use the same method to initialize your Sedona environment in different computation engines without tedious configuration.
  2. Geodesic/Geography Functions Support (SEDONA-281, SEDONA-286): This functionality adds a whole new dimension to your geospatial data processing, enabling more accurate computations over geographical data.
  3. Redesigned end-to-end raster ETL pipeline (SEDONA-251, SEDONA-269, SEDONA-292): Apache Sedona 1.4 includes new raster readers / writers, raster type, and many essential raster functions, paving the way for streamlined handling of vast volumes of geospatial data.
  4. Google S2 based spatial join (SEDONA-235): Sedona 1.4 introduces support for Google S2 cells, which enables a speedy approximate point-in-polygon join.
  5. Support for Spark 3.4 (SEDONA-276): Our support for Spark 3.4 allows you to merge the powerful capabilities of the latest Spark version with the enhanced geospatial features of Sedona.
  6. Serialization and Deserialization Enhancements (SEDONA-207, SEDONA-227, SEDONA-231): Sedona 1.4 offers faster geometry serialization and deserialization, resulting in a performance boost by a factor of 3 to 7.
  7. Native filter pushdown support in GeoParquet (SEDONA-156): This can push spatial predicates down to GeoParquet to reduce memory consumption by 10X.

Now, let’s dive into Features 1-4 and learn how to leverage them in your applications. We’ll discuss the remaining features in separate blog posts.

Use SedonaContext as a unified entry point (v1.4.1)

Initially designed to be a geospatial computation system on top of Apache Spark, Sedona has evolved over the years. It has extended its support to other engines and is now on a mission to become a cross-platform, scalable geospatial computation engine.

To use Sedona with different underlying engines, you must import several classes and register the necessary Sedona data types, serializers, functions, and optimization rules. This process can be confusing for new users. Let’s examine the Sedona Spark Python interface as an example:

Old Sedona entry point

from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.core.SpatialRDD import SpatialRDD
from sedona.utils.adapter import Adapter
from sedona.core.spatialOperator import JoinQueryRaw
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.sql.types import GeometryType
from sedona.core.enums import GridType
from sedona.core.enums import IndexType
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

spark = SparkSession.\
    builder.\
    master("local[*]").\
    appName("Sedona App").\
    config("spark.serializer", KryoSerializer.getName).\
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName) .\
    config("spark.jars.packages", "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2") .\
    getOrCreate()

SedonaRegistrator.registerAll(spark)

spatialDf = spark.sql("SELECT ST_GeomFromWKT(XXX) FROM")

Users previously had to manually import many classes and add serializer configuration.

However, with the introduction of SedonaContext and other helper functions in SEDONA-287, starting applications has become much easier. Here’s how to do it:

SedonaContext with Spark Python

from sedona.spark import *

config = SedonaContext.builder() .\
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,'
           'org.datasyslab:geotools-wrapper:1.4.0-28.2'). \
    getOrCreate()

sedona = SedonaContext.create(config)

spatialDf = sedona.sql("SELECT ST_GeomFromWKT(XXX) FROM")

SedonaContext with Flink Java

import org.apache.sedona.flink.SedonaContext;

StreamTableEnvironment sedona = SedonaContext.create(env, tableEnv);

Table spatialTbl = sedona.sqlQuery("SELECT ST_GeomFromWKT(XXX) FROM");

It’s worth noting that Sedona will maintain the old entry point for a long time, since many existing users are familiar with the old APIs.

Geodesic/Geography Functions Support (v1.4.1)

As we all know, the Earth is a spheroid, making it difficult to accurately calculate geographic distances between two locations. Given the complicated surface of the Earth, finding the exact distance is a challenging task. In the GIS industry, it is considered best practice to project the geospatial objects (usually in WGS84 / EPSG:4326) to an appropriate meter-based Coordinate Reference System (CRS) before applying geometric calculations such as distance calculation. Users can choose a CRS that is close to the locations of their data. For example, in Apache Sedona, one can calculate the distance between two locations as follows:

Sedona ST_Transform in SQL

CREATE OR REPLACE TEMP VIEW geomTable AS
SELECT ST_GeomFromWKT('POINT(-42.3 71.3)') AS geom1,
 ST_GeomFromWKT('POINT(-42.1 71.1)') AS geom2;

CREATE OR REPLACE TEMP VIEW geomTable_t AS
SELECT ST_Transform(geom1, 'epsg:4326', 'epsg:3857') AS geom1,
 ST_Transform(geom2, 'epsg:4326', 'epsg:3857') AS geom2
FROM geomTable;

SELECT ST_Distance(geom1, geom2) AS distanceInMeter FROM geomTable_t

However, it is common for a single dataset to include geospatial objects across the globe, and there is no CRS suitable for everywhere. Therefore, it can be difficult to calculate distances accurately. In this case, the best way to measure geographic distance is either through great-circle distance (haversine distance) or spheroid distance (Vincenty’s formulae or C. Karney’s algorithm).

SEDONA-281 introduces support for both great-circle distance and spheroid distance, namely ST_DistanceSphere and ST_DistanceSpheroid. Let’s look at an example of calculating the distance between Hong Kong (22.308919, 113.914603) and Toronto (43.677223, -79.630556).

Sedona sphere and spheroid distance in SQL

All distance units in this example are in meters. The actual distance between Hong Kong and Toronto is about 12,544 to 12,568 kilometers. ST_DistanceSphere and ST_DistanceSpheroid both produce very accurate results. However, the ST_Transform-based approach cannot give a good answer if no proper CRS (coordinate reference system) is available.

# ST_DistanceSphere
SELECT ST_DistanceSphere(ST_POINT(22.308919, 113.914603), ST_POINT(43.677223, -79.630556)) AS distance

+--------------------+
|            distance|
+--------------------+
|1.2548548944238186E7|
+--------------------+

# ST_DistanceSpheroid
SELECT ST_DistanceSpheroid(ST_POINT(22.308919, 113.914603), ST_POINT(43.677223, -79.630556)) AS distance

+--------------------+
|            distance|
+--------------------+
|1.2568775317073349E7|
+--------------------+

# ST_Transform based distance calculation
SELECT ST_DISTANCE(ST_TRANSFORM(ST_POINT(22.308919, 113.914603), 'epsg:4326', 'epsg:3857'),
 ST_TRANSFORM(ST_POINT(43.677223, -79.630556), 'epsg:4326', 'epsg:3857')) AS distance

+--------------------+
|            distance|
+--------------------+
|2.1735260976295043E7|
+--------------------+

Rule of thumb

Using ST_DistanceSphere and ST_DistanceSpheroid does not mean that the ST_Transform-based approach is no longer necessary. To select the most suitable approach, use the following rule of thumb: when calculating the distance between two geospatial objects that are far apart, such as cities in different countries or continents, use ST_DistanceSphere or ST_DistanceSpheroid. When calculating the distance between two objects in the same region such as a metropolitan area, use the ST_Transform-based approach.

Now that we can calculate the spherical distance between two geospatial objects, we can apply this to real-world applications that involve distance calculations among many geospatial objects, not just two. For example, given the locations of cities and parks around the world (available from OpenStreetMap), we can find the parks that are within 50 km of each city. This operation is called DistanceJoin in Sedona and is one of the most expensive and resource-consuming operations. Thanks to our highly efficient spatial join algorithm, Sedona can easily handle this operation on huge datasets. If you used the ST_Transform-based approach before, you have already been benefiting from this algorithm. SEDONA-286 extends it to support distance joins over ST_DistanceSphere and ST_DistanceSpheroid. A distance join query can be written as follows:

Sedona distance join in SQL

50000 indicates 50 km

# ST_DistanceSphere join
SELECT *
FROM cities c, parks p
WHERE ST_DistanceSphere(c.geom, p.geom) < 50000

# ST_DistanceSpheroid join
SELECT *
FROM cities c, parks p
WHERE ST_DistanceSpheroid(c.geom, p.geom) < 50000

# ST_Transform + ST_Distance
SELECT *
FROM cities c, parks p
WHERE ST_DISTANCE(ST_TRANSFORM(c.geom, 'epsg:4326', 'epsg:3857'),
 ST_TRANSFORM(p.geom, 'epsg:4326', 'epsg:3857')) < 50000

Redesigned end-to-end raster ETL pipeline (v1.4.1)

Sedona has been implementing geospatial vector data functions for several years, and the support has become mostly mature. In recent releases, the Sedona community has invested more in geospatial raster data support, including raster data reader, writer, and raster manipulation.

To enable comprehensive raster data support, Sedona has made the following efforts:

Raster data type

SEDONA-251 introduces a native raster data type. Similar to the geometry type, the raster type indicates that the column in a schema contains raster data. Each record stored in this type contains two parts: metadata and raster band information. With this in place, we can create a table that has both a geometry column and a raster column.
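
As a minimal sketch (assuming a view named binaryTable whose content column holds raw GeoTiff bytes, loaded as shown in the next subsection), a single table can carry a raster column alongside a geometry column, here the raster's footprint:

# Build a view with both a raster column and a geometry column
# (the raster's bounding box, returned as a Sedona geometry).
sedona.sql("""
    SELECT RS_FromGeoTiff(content) AS raster,
           RS_Envelope(RS_FromGeoTiff(content)) AS geom
    FROM binaryTable
""").createOrReplaceTempView("rasterWithGeom")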

Raster reader

To begin processing raster data, the first step is to load the raster data. Additionally, we must be able to load raster data in parallel given the vast amount of available raster data. SEDONA-251 introduces a scalable raster reader and native raster data constructors. For example, we can read raster data files in Sedona Spark as follows:

sedona.read.format("binaryFile")
.load("raster/*.tiff").createOrReplaceTempView("binaryTable")

To construct a raster column, use one of the raster constructors. We have added constructors such as RS_FromGeoTiff and RS_FromAscGrid, which allow you to convert various raster binary formats to a unified raster format.

CREATE OR REPLACE TEMP VIEW rasterTable AS
SELECT RS_FromGeoTiff(binaryTable.content) AS raster
FROM binaryTable

We can print the schema of this table. As we can see, the type of the raster column is now raster.

sedona.table("rasterTable").printSchema()

root
 |-- raster: raster (nullable = true)

Raster writer

After processing raster data, it is necessary to store the data in an external storage. With SEDONA-269, this can be achieved in Sedona Spark as follows:

First, convert the raster data to a specific binary format, such as GeoTiff. Note that the raster can be imported from one image format, but exported to another image format.

SELECT RS_AsGeoTiff(raster) AS image
FROM rasterTable

After processing the data, we can save it to an external location.

sedona.table("rasterTable").write.format("raster").mode(SaveMode.Overwrite).save("my_raster_file")

Raster functions

SEDONA-251, SEDONA-269, and SEDONA-292 together introduce several functions to transform and analyze raster data. Let’s examine a few examples.

First, let’s assume that we have a GeoTiff image as follows:

Photo: The example GeoTiff image.

Extract bounding box, SRID and number of bands

We can extract the bounding box, SRID, and number of bands of this raster.

SELECT raster, RS_Envelope(raster) as geom, RS_Metadata(raster), RS_NumBands(raster)
FROM rasterTable

+--------------------+--------------------+-----+--------+
|              raster|                geom| srid|numBands|
+--------------------+--------------------+-----+--------+
|GridCoverage2D["g...|POLYGON ((590520 ...|32631|       3|
+--------------------+--------------------+-----+--------+

Now this resulting table has a schema that contains both raster and geometry types (see below):

root
 |-- raster: raster (nullable = true)
 |-- geom: geometry (nullable = true)
 |-- srid: integer (nullable = true)
 |-- numBands: integer (nullable = true)

This allows for more complex geometric operations on the table. For instance, we can save the table as a GeoParquet format and perform filter pushdown. When executing a Sedona spatial range query on this GeoParquet table, Sedona will retrieve only the data that intersects the spatial range query window.

sedona.table("rasterTable").write.format("geoparquet").save("rasterTable.parquet")

Extract individual band

We can extract any band in a raster.

SELECT RS_BandAsArray(raster, 1) as band1, RS_BandAsArray(raster, 2) as band2,
 RS_BandAsArray(raster, 3) as band3
FROM rasterTable

+--------------------+--------------------+--------------------+
|               band1|               band2|               band3|
+--------------------+--------------------+--------------------+
|[0.0, 799.0, 788....|[0.0, 555.0, 546....|[0.0, 330.0, 322....|
+--------------------+--------------------+--------------------+

Modify a band value

We can modify a band value and save it to the original raster

SELECT RS_AddBandFromArray(raster, RS_GreaterThan(band1, 0), 1) AS raster
FROM rasterTable

The resulting GeoTiff looks like this

Photo: The resulting GeoTiff.

Google S2-based spatial join (v1.4.0)

If you’re experiencing suboptimal performance with Sedona’s optimized join, it may be due to complicated and overlapping geometries. Fortunately, Sedona offers a built-in alternative: the Google S2-based approximate equi-join. This equi-join utilizes Spark’s internal equi-join algorithm and may perform better than the optimized join, especially if you choose to skip the refinement step. However, be aware that skipping the refinement step will result in a loss of query accuracy.

Sedona’s approximate equi-join is a useful tool to have in your arsenal when working with complex geometries. And with the recent updates to Sedona, such as the introduction of the native raster data type and the redesigned end-to-end raster ETL pipeline, Sedona is better equipped than ever to meet the evolving needs of the geospatial data processing community.

1. Generate S2 ids for both tables

Use ST_S2CellIds to generate cell IDs. Each geometry may produce one or more IDs. The results of the two queries below are registered as lcs and rcs, respectively, for use in Step 2 (a sketch follows the queries).

SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM lefts
SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM rights
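
A minimal sketch of running these queries and registering their results (assuming input views named lefts and rights, as in the queries above):

# Explode each geometry into its covering S2 cells at level 15 and register
# the results so the equi-join in Step 2 can reference lcs and rcs.
sedona.sql(
    "SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) AS cellId FROM lefts"
).createOrReplaceTempView("lcs")
sedona.sql(
    "SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) AS cellId FROM rights"
).createOrReplaceTempView("rcs")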

2. Perform equi-join

Join the two tables by their S2 cellId

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId

3. Optional: Refine the result

Due to the nature of S2 cell IDs, the equi-join results might have a few false positives depending on the S2 level you choose. A smaller level indicates bigger cells, fewer exploded rows, but more false positives.

To ensure correctness, you can use one of the spatial predicates to filter them out. Use this query instead of the query in Step 2.

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs, rcs
WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)

As you can see, compared to the query in Step 2, we added one more filter, ST_Contains, to remove false positives. You can also use ST_Intersects and other predicates.

Tip: You can skip this step if you don’t need 100% accuracy and want faster query speed.

4. Optional: De-duplicate

Due to the explode function used when we generate S2 Cell Ids, the resulting DataFrame may have several duplicate <lcs_geom, rcs_geom> matches. You can remove them by performing a GroupBy query.

SELECT lcs_id, rcs_id, first(lcs_geom), first(lcs_name), first(rcs_geom), first(rcs_name)
FROM joinresult
GROUP BY (lcs_id, rcs_id)

The first function takes the first value from each group of duplicate values.

If you don’t have a unique id for each geometry, you can also group by geometry itself. See below:

SELECT lcs_geom, rcs_geom, first(lcs_name), first(rcs_name)
FROM joinresult
GROUP BY (lcs_geom, rcs_geom)

Try it now

With these improvements and additions, Apache Sedona 1.4 is better equipped to meet the evolving needs of the geospatial data processing community.

We are immensely grateful to our fantastic community for their ongoing support and contributions to the development and growth of Apache Sedona. We are excited to continue our journey together, pushing the boundaries of geospatial data processing.

Stay tuned for future updates and enjoy your data processing with Apache Sedona 1.4!

Wherobots raises $5.5 Million in seed round to build the data platform for spatial analytics & AI

We are happy to announce that we raised $5.5 Million in a seed round led by Clear Ventures and Wing VC. Wherobots Inc. was founded by Mo Sarwat and Jia Yu (the original creators of the open source software Apache Sedona) in June 2022 to enable every organization to drive value from data via space and time. The seed funding will accelerate the hiring of top-tier talent, advance R&D in spatial data+AI, and continue to grow adoption of the Apache Sedona-powered spatial data platform.

Message to the Apache Sedona community

Whether you are an Apache Sedona contributor, user, or champion, congratulations! Apache Sedona has made a dent in the world, and this is just the start. As of June 13th 2022, Sedona has 104 contributors, 1,500 stars on GitHub, and has been downloaded more than 15 million times. One main reason we started Wherobots a year ago is to sustain the research and development of Apache Sedona in particular and spatial Data+AI platforms in general. We pledge to empower the Sedona community, respect the community standards and code of ethics, increase adoption among developers, and always be inclusive.

Message to the spatial community

We started Wherobots to democratize the process of building and maintaining scalable geospatial data pipelines for all developers. De facto data platforms have always treated the spatial aspect of data as an afterthought or a second-class citizen at best. Wherobots aims to change the status quo by building a data platform that treats spatial as a first-class citizen. Wherobots enables geospatial analytics and AI applications in the cloud. Users can build their geospatial data stack using familiar APIs such as Spatial SQL and Spatial Python, while Wherobots handles the hassle of cloud administration, compute resource management and provisioning, workload scalability, and geospatial data processing support and optimization. So, we ask everyone working in geospatial to join the movement.

Message to the data management community

The data management community has made great strides in making data more accessible on premises and in the cloud. Yet, as per Gartner, 97% of enterprise data remains unused. Our approach to solving this problem is to enable organizations to look at enterprise data through the space and time lens. That relates data to the real physical world, since everything happens somewhere, and hence makes it easier to extract real, tangible value from data. Wherobots does not aim to replace de facto data platforms. Instead, it empowers them by adding the spatial aspect to the data & AI pipeline and putting spatial functionality at the data developer’s disposal. To achieve that, Wherobots builds a highly scalable spatial computing & AI engine that is fully managed and provisioned in the cloud. Such an engine can seamlessly connect to and spatially empower existing database, data warehousing, and popular data storage platforms.

We are Hiring

We are hiring software engineers, cloud platform engineers, devrel engineers, and product managers. If you are interested, please apply here.