Wherobots and PostgreSQL + PostGIS: A Synergy in Spatial Analysis

Introduction

Spatial data has never been more abundant or more valued in decision making (everything happens somewhere!). The tools we use to process location-based data can significantly affect the outcomes of our projects. In the geospatial ecosystem, Apache Sedona-powered Wherobots and PostgreSQL with the PostGIS extension each offer robust capabilities. They share some functionality, but they are more powerful used together than in isolation. This post explores how combining Wherobots and PostgreSQL + PostGIS can enhance spatial data processing, leveraging the unique strengths of each tool to improve data analysis and decision making.

Exploring PostgreSQL + PostGIS

PostgreSQL is a powerful, open-source object-relational database system known for its robustness and reliability. When combined with the PostGIS extension, PostgreSQL transforms into a spatial database capable of executing location-based queries and spatial functions. This combination is ideally suited for projects where data integrity and detailed transaction management (ACID transactions) are crucial.

Key Features:

  • Data Integrity: Ensures high levels of data accuracy and consistency with ACID transaction compliance.
  • Complex Queries: Supports a wide array of SQL and spatial queries for detailed data management.
  • Integration: Seamlessly integrates with existing IT infrastructures and data systems.

Architecture:

PostgreSQL’s architecture, a traditional RDBMS, uses a process-based database server architecture. It is designed for detailed, transactional data management, where data integrity and complex query capabilities are paramount.

Characteristics:

  • Transactional Integrity: PostgreSQL excels in maintaining data consistency and integrity through ACID (Atomicity, Consistency, Isolation, Durability) compliance, making it ideal for applications requiring reliable data storage.
  • Structured Data Management: Data is written into PostgreSQL tables, where it benefits from the sophisticated querying and indexing capabilities provided by PostGIS (see the sketch after this list). This enables precise spatial data management.
  • Integration with Legacy Systems: PostgreSQL + PostGIS can easily integrate with existing IT infrastructure, providing a stable environment for long-term data storage and complex spatial queries.
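
To make these characteristics concrete, here is a minimal sketch of a transactional, indexed spatial workload in PostgreSQL + PostGIS. It uses the psycopg2 driver; the connection string, table name, and coordinates are illustrative assumptions rather than details from this post.

import psycopg2

# Connection details are placeholders; adjust for your environment.
conn = psycopg2.connect("dbname=spatial user=postgres password=secret host=localhost")
cur = conn.cursor()

# Enable PostGIS and create a spatially indexed table inside an ACID transaction.
cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS assets (
        id SERIAL PRIMARY KEY,
        name TEXT,
        geom GEOMETRY(Point, 4326)
    );
""")
cur.execute("CREATE INDEX IF NOT EXISTS assets_geom_idx ON assets USING GIST (geom);")

# Insert a row and run a precise, index-backed proximity query (5 km radius).
cur.execute(
    "INSERT INTO assets (name, geom) VALUES (%s, ST_SetSRID(ST_MakePoint(%s, %s), 4326));",
    ("Sensor A", -97.74, 30.27),
)
cur.execute("""
    SELECT name
    FROM assets
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(-97.75, 30.28), 4326)::geography,
                     5000);
""")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()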

Understanding Wherobots & Apache Sedona

Wherobots, founded by the original creators of Apache Sedona, represents a significant advancement in modern cloud-native spatial data processing. Apache Sedona is an open-source cluster computing system designed for processing large-scale spatial data, and Wherobots (which continuously optimizes Sedona for performance) combines this technology with serverless deployment principles to analyze data distributed across clusters. This approach allows Wherobots to perform complex spatial queries and analytics efficiently, handling massive datasets that are beyond the scope of traditional geospatial data processing tools.

Key Features:

  • Scalability: Easily scales horizontally to handle petabytes of data through a cluster computing architecture.
  • Speed: Utilizes memory-centric architecture to speed up data processing tasks.
  • Complex Algorithms: Supports numerous spatial operations and algorithms essential for advanced geographic data analysis.
  • Spatial SQL API: Over 300 (and growing) spatial SQL functions covering both raster and vector analysis (see the brief sketch after this list).
  • Pythonic Dataframe API: Extends the Spark Dataframe API for spatial data processing.
  • Python DB API: Python module conforming to the Python DB API 2.0 specification with familiar access patterns.
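
As a minimal illustration of the Spatial SQL and DataFrame APIs working together, the sketch below runs a spatial SQL query from Python and then uses ordinary DataFrame operations on the result; the my_points view is a placeholder, and the session setup follows the standard SedonaContext pattern.

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Spatial SQL over a registered view (my_points is a placeholder name).
nearby = sedona.sql("""
    SELECT id, ST_DistanceSphere(geometry, ST_GeomFromWKT('POINT (-97.74 30.27)')) AS dist_m
    FROM my_points
    WHERE ST_DistanceSphere(geometry, ST_GeomFromWKT('POINT (-97.74 30.27)')) < 10000
""")

# The result is a Spark DataFrame, so the familiar DataFrame API applies.
nearby.orderBy("dist_m").show(5)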

Architecture:

The backbone of Wherobots, Apache Sedona, is designed around a modern, cloud-native, distributed computing framework. Wherobots orchestrates and manages a cluster of distributed compute nodes to enable large-scale spatial data processing, bringing compute power directly to the data. This architecture is purpose-built for big data analytics and planetary-scale spatial data processing.

Characteristics:

  • Distributed Processing: Wherobots and Apache Sedona utilize a cluster of nodes, enabling parallel processing of massive datasets. This makes it ideal for tasks requiring high scalability and speed.
  • In-Memory Computation: By leveraging in-memory processing, Sedona significantly reduces the time needed for complex spatial operations, allowing for rapid data exploration and transformation.
  • Cloud Integration: Wherobots architecture is naturally aligned with cloud-based solutions, enabling seamless integration with cloud services like Amazon Web Services (AWS). This facilitates scalable and flexible spatial data analytics.

Complementarity in Action

Sedona in Wherobots and PostgreSQL + PostGIS can both perform spatial queries, but their real power lies in how they complement each other. For example:

  • Data Handling and Performance: Apache Sedona in Wherobots Cloud is second to none at handling and analyzing large volumes of data, making it ideal for initial data processing, exploration, transformation, and analysis. In contrast, PostgreSQL + PostGIS excels at detailed transactions and operations requiring high precision and integrity.
    • Use Wherobots for: Scalable Data Operations. Wherobots efficiently scales processing across multiple nodes, ideal for rapid analysis of spatial data at local to global levels.
    • Use PostgreSQL & PostGIS for: Persistence. PostgreSQL + PostGIS provides the tools necessary for the precise data management crucial to applications like urban planning and asset management.
  • Use Case Flexibility: Wherobots can quickly process vast quantities of spatial data, making it suitable for environments and applications that require immediate insights on large volumes of data. PostgreSQL + PostGIS, being more transaction oriented, is perfect for applications where long-term data storage and management are needed.
    • Use Wherobots for: Processing Speed, when your scenario requires immediate response.
    • Use PostgreSQL & PostGIS for: Data Management, in scenarios requiring meticulous data integrity, such as environmental monitoring over years or detailed urban planning projects.
  • Analytical Depth: When initial processing (extraction, transformation, enrichment, etc.) of large-scale data is done with Wherobots, ETL run times can be greatly reduced. Once processed, the data can be permanently stored and hosted in PostgreSQL + PostGIS. This allows users to perform deep, granular analyses at scale and then persist and serve those insights.
    • Use Wherobots for: Transforming and Enriching Data. Extract, transform, enrich, refine, and analyze data in Wherobots Cloud at scale and with unparalleled speed.
    • Use PostgreSQL & PostGIS for: A Robust Data Ecosystem. Combining these tools ensures a robust data analysis and management ecosystem capable of supporting complex, multi-dimensional spatial queries.

Integration Illustration

Let’s illustrate how these two systems can be integrated in a complementary manner.

We’ll assume the persona of a fictitious company named “Synergistic GeoFusion Technologies (SGT) holdings”.

SGT holdings handles a vast array of spatial data from diverse sources such as sensors, satellites, and user inputs. The volume and velocity of the data collection requires a sophisticated approach to maximize efficiency. Wherobots steps in as the initial processing powerhouse, applying its Apache Sedona-based cluster computing to perform heavy-duty ETL (Extract, Transform, Load) tasks. This process involves cleansing, integrating, and transforming the raw data into a more structured format. Wherobots’ capability to handle massive datasets efficiently complements PostGIS’s robust data storage and querying capabilities by preparing the data for detailed spatial analysis and storage.

Once Wherobots processes the data, it can be seamlessly transferred to PostGIS, which can serve as the system of record. This PostgreSQL extension is well-suited for ongoing data management, thanks to its sophisticated spatial functions and ability to handle complex queries. PostGIS’s strong querying capabilities complement Wherobots’ data processing strength, providing a stable platform for further data manipulation and refinement.
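
As a rough sketch of this hand-off, the snippet below writes a processed WherobotsDB/Sedona DataFrame into a PostGIS table over JDBC. The connection details, table names, and the WKT conversion step are illustrative assumptions, not prescriptions from this post; geometries are serialized to WKT so a generic JDBC write can carry them, and they can be cast back to a geometry type inside PostGIS.

from pyspark.sql import functions as F

# processed_df is assumed to be the output of an upstream Wherobots ETL step
# with a Sedona geometry column named "geometry".
to_postgis = processed_df.withColumn("geom_wkt", F.expr("ST_AsText(geometry)")).drop("geometry")

(to_postgis.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/spatial")  # placeholder host and database
    .option("dbtable", "analytics.processed_features")               # placeholder table
    .option("user", "analytics_writer")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())

# Inside PostGIS the WKT column can then be converted back to geometry, for example:
#   ALTER TABLE analytics.processed_features ADD COLUMN geom GEOMETRY;
#   UPDATE analytics.processed_features SET geom = ST_GeomFromText(geom_wkt, 4326);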

PostGIS is not only a storage layer but also a platform for ongoing data edits and updates, ensuring data integrity and relevance. When complex, resource-intensive spatial analyses are required, Wherobots is re-engaged. This dual engagement allows SGT holdings to handle routine data management in PostGIS while delegating computationally demanding tasks back to Wherobots, thus utilizing each system’s strengths to full effect.

For visualization, SGT holdings again leverages Wherobots to generate vector tiles from the analyzed data. These tiles are crucial for dynamic, scalable visual representations in the company’s internal dashboards and tools. Wherobots’ ability to generate these tiles efficiently complements PostGIS’s role by providing a means to visualize the data stored and managed within PostGIS. This not only enhances user interactions with the data but also provides a seamless bridge from data analysis to actionable insights through visual exploration.

Using Wherobots and PostGIS in this complementary manner, SGT has established a highly efficient workflow that leverages the distinct capabilities of each system. The company can now ingest all of its data, manage it effectively, and run post hoc analysis tasks to serve internal and external clients efficiently and cost-effectively.

Performance Benchmark: Wherobots vs. PostGIS

When comparing WherobotsDB to PostgreSQL with PostGIS for spatial queries, WherobotsDB’s performance benefits become evident as data size increases. Initially, PostGIS’ precomputed GIST indexes give it an edge on smaller datasets due to faster query execution times. However, as datasets grow larger, the dynamic, on-the-fly indexing of WherobotsDB surpasses PostGIS. The overhead associated with WherobotsDB’s distributed system is outweighed by its ability to efficiently handle complex, large-scale queries, ultimately making it significantly faster and more scalable in high-demand scenarios.

In essence, WherobotsDB may start off slower with smaller datasets, but its performance dramatically improves with larger datasets, far exceeding PostGIS’ capabilities. This makes WherobotsDB the preferred choice when working with extensive spatial data that demands efficient, scalable processing.

See the tables below for a detailed comparison of performance metrics.

Dataset              Size     Number of Rows    Types of Geometries
OSM Nodes            256GB    7,456,990,919     Point
Overture Buildings   103GB    706,967,095       Polygon/Multipolygon
OSM Postal Codes     0.86GB   154,452           Polygon

The benchmark runs the following spatial join, counting building-to-node intersections:
WITH t AS 
( 
    SELECT 
        * 
    FROM 
        overture_buildings_test AS buildings 
    JOIN 
        osm_nodes_test AS nodes 
    ON ST_Intersects(
        nodes.geometry, 
        buildings.geometry) 
) 
SELECT 
    COUNT(*) 
FROM 
    t;
Dataset sizes               PostGIS          WherobotsDB (Cairo)   Relative Speed
1M Buildings x 1M Nodes     0.70s            2.24s                 0.31x
1M Buildings x 10M Nodes    32.30s           33.44s                0.97x
10M Buildings x 1M Nodes    28.30s           21.72s                1.30x
10M Buildings x 10M Nodes   49.30s           3m 36.29s             0.23x
All Buildings x 1M Nodes    8m 44s           36.59s                14.32x
All Buildings x 10M Nodes   38m 30s          49.77s                46.42x
10K Buildings x All Nodes   2h 29m 19s       46.42s                193.00x
100K Buildings x All Nodes  4h 22m 36s       51.75s                304.57x
1M Buildings x All Nodes    3h 29m 56s       1m 19.63s             158.17x
10M Buildings x All Nodes   8h 48m 13s       2m 13.65s             237.17x
All Buildings x All Nodes   +24h (aborted)   4m 31.84s             +317.83x

Conclusion: Better Together for Spatial Brilliance

Individually, Wherobots and PostgreSQL + PostGIS are powerful tools. But when combined, they unlock new possibilities for spatial analysis, offering a balanced approach to handling both large-scale data processing and detailed, precise database management. By understanding and implementing their complementary capabilities, organizations can achieve more refined insights and greater operational efficiency in their spatial data projects.

By utilizing both tools strategically, companies can not only enhance their data analysis capabilities but also ensure that they are prepared for a variety of spatial data challenges, now and in the future.

To learn more about Wherobots, reach out or give it a try with a trial account.

Wherobots Joins Overture, Winning The Taco Wars, Spatial SQL API, Geospatial Index Podcast – This Month In Wherobots

Welcome to This Month In Wherobots, the monthly developer newsletter for the Wherobots & Apache Sedona community! This month we have news about Wherobots and the Overture Maps Foundation, a deep dive on new Wherobots Cloud features like raster inference, generating vector tiles, and the Spatial SQL API, plus a look at retail cannibalization analysis for the commercial real estate industry.

Wherobots Joins Overture Maps Foundation


Wherobots has officially joined Overture Maps Foundation to support the next generation of planetary-scale open map data. Wherobots has supported the development of Overture datasets through Overture Maps Foundation’s use of the open-source Apache Sedona project to develop and distribute global data, enabling Overture to embrace modern cloud-native geospatial technologies like GeoParquet. By joining Overture as Contributing Members Wherobots will continue to support the ongoing development, distribution, and evolution of this critical open dataset that enables developers and data practitioners to make sense of the world around us.

Read the announcement blog post

Featured Community Members: Sean Knight & Ilya Marchenko

Apache Sedona featured community members - July 2024

This month’s featured community members are Sean Knight and Ilya Marchenko from YuzuData where they focus on AI and location intelligence for the commercial real estate industry. YuzuData is a Wherobots partner and leverages the power of Apache Sedona and Wherobots Cloud as part of their work analyzing large scale geospatial data. Sean and Ilya recently wrote a blog post showing how to use Wherobots for a retail cannibalization study. Thanks Sean and Ilya for being a part of the community and sharing how you’re building geospatial products using Wherobots!

Comparing Taco Chains: A Consumer Retail Cannibalization Study With Isochrones

Retail cannibalization analysis with Wherobots

Understanding the impact of opening a new retail location on existing locations is an important analysis in the commercial real estate industry. In this code-heavy blog post Sean and Ilya from YuzuData detail a retail cannibalization analysis using WherobotsDB, Overture Maps point of interest data, drive-time isochrones using the Valhalla API, and visualization with SedonaKepler. Sean also presented this analysis earlier this week in a live webinar.

Read the blog post or watch the video recording

Unlock Satellite Imagery Insights With WherobotsAI Raster Inference

Raster inference with WherobotsAI

One of the most exciting features in Wherobots’ latest release is WherobotsAI Raster Inference which enables running machine learning models on satellite imagery for object detection, segmentation, and classification. This post gives a detailed look at the types of models supported by WherobotsAI and an overview of the SQL and Python APIs for raster inference with an example of identifying solar farms for the purpose of mapping electricity infrastructure.

Read the blog post to learn more about WherobotsAI Raster Inference

Generating Global PMTiles In 26 Minutes With WherobotsDB VTiles

Generating PMTiles with Wherobots VTiles vector tiles generator

WherobotsDB VTiles is a highly scalable vector tile generator that can produce tiles from small to planetary-scale datasets quickly and cost-efficiently, with support for the PMTiles format. In this post we see how to generate vector tiles of the entire planet using three Overture layers. Using Wherobots Cloud to generate PMTiles of the Overture buildings layer takes 26 minutes. The post includes all code necessary to recreate these tile generation operations and a discussion of performance considerations.

Read the blog post to learn more about WherobotsDB VTiles

Spatial SQL API Brings Performance Of WherobotsDB To Your Favorite Data Applications

Using Apache Airflow with WherobotsDB

The Wherobots Spatial SQL API enables integration with Wherobots Cloud via Python and Java client drivers. In addition to enabling integrations with your favorite data applications via the client drivers, Wherobots has released an Apache Airflow provider for orchestrating data pipelines and an integration with Harlequin, a popular SQL IDE.

Read the blog post to learn more about the Wherobots Spatial SQL API

Wherobots On The Geospatial Index Podcast


William Lyon from Wherobots was recently a guest on The Geospatial Index podcast. In this episode he discusses the origins of Apache Sedona, the open-source technology behind Wherobots, how users are building spatial data products at massive scale with Wherobots, how Wherobots is improving the developer experience around geospatial analytics, and much more.

Watch the video recording

Upcoming Events

  • Apache Sedona Community Office Hours (Online – August 6th) – Join the Apache Sedona community for updates on the state of Apache Sedona, presentation and demo of recent features, and provide your input into the roadmap, future plans, and contribution opportunities.
  • GeoMeetup: Cloud Native Spatial Data Stack (San Francisco – September 5th) – Join us on September 5th for an exciting GeoMeetup featuring talks from industry leaders with Wherobots and Felt.com. In this meetup we will be exploring the elements of the cloud native spatial data stack.
  • FOSS4G NA 2024 (St Louis – September 9th-11th) – FOSS4G North America is the premier open geospatial technology and business conference. Join the Wherobots team for a pre-conference workshop or come by and chat with us at the Wherobots booth to learn about the latest developments in Apache Sedona.

Want to receive this monthly update in your inbox? Sign up for the This Month In Wherobots Developer Newsletter:


🌶 Comparing taco chains :: a consumer retail cannibalization study with isochrones

Authors: Sean Knight and Ilya Marchenko

Using Wherobots for a Retail Cannibalization Study Comparing Two Leading Taco Chains

In this post, we explore how to implement a workflow from the commercial real estate (CRE) space using Wherobots Cloud. This workflow is commonly known as a cannibalization study, and we will be using WherobotsDB, POI data from OvertureMaps, the open source Valhalla API, and visualization capabilities offered by SedonaKepler.

NOTE: This is a guest Wherobots post from our friends at YuzuData. Reach out to them to learn more about their spatial data product services. You can also join them for a demo of this use case, scheduled on July 30th, here.

What is a retail cannibalization study?

In CRE (commercial real estate), stakeholders are often interested in questions like “If we build a new fast food restaurant here, how will its performance be affected by other similar fast food locations that already exist nearby?”. The idea of the new fast food restaurant “eating into” the sales of other fast food restaurants that already exist nearby is what is known as ‘cannibalization’.

The main objective of studying this phenomenon is to determine the extent to which a new store might divert sales from existing stores owned by the same company or brand and evaluate the overall impact on the company’s market share and profitability in the area.

Cannibalization Study in Wherobots

For this case study, we will look at two taco chains located primarily in Texas: Torchy's Tacos and Velvet Taco. In general, information about the performance of individual locations and customer demographics is proprietary. We can, however, still learn a great deal about the potential for cannibalization, both between these two chains as competitors and between individual locations of each chain. We also know, based on our own experience, that these chains compete with each other. Which taco shop to go to when we are visiting Texas is always a spicy debate.


We begin by importing modules that will be useful to us as we go on.

import geopandas as gpd
import pandas as pd
import requests
from sedona.spark import *
from pyspark.sql.functions import explode, array
from pyspark.sql import functions as F

Next, we can initiate a Sedona context.

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Identifying Points of Interest

Now, we need to retrieve the locations of Torchy’s Tacos and Velvet Taco locations. In general, one can do this via a variety of both free and paid means. We will look at a simple, free approach that is made possible by the integration of Overture Maps data into the Wherobots environment:

sedona.table("wherobots_open_data.overture.places_place"). \\
createOrReplaceTempView("places")

We create a view of the Overture Maps places database, which contains information on points of interest (POI’s) worldwide.

Now, we can select the POI’s which are relevant to this exercise:

stores = sedona.sql("""
SELECT id, names.common[0].value as name, ST_X(geometry) as long,
 ST_Y(geometry) as lat, geometry, 
 CASE WHEN names.common[0].value LIKE "%Torchy's Tacos%" 
 THEN "Torchy's Tacos" 
 ELSE 'Velvet Taco' END AS chain
FROM places
WHERE addresses[0].region = 'TX'
AND (names.common[0].value LIKE "%Torchy's Tacos%" 
OR names.common[0].value LIKE '%Velvet Taco%')
""")

Calling stores.show() gives us a look at the Spark DataFrame we created:

+--------------------+--------------+-----------+----------+
|                  id|          name|       long|       lat|
+--------------------+--------------+-----------+----------+
|tmp_8104A79216254...|Torchy's Tacos|  -98.59689|  29.60891|
|tmp_D17CA8BD72325...|Torchy's Tacos|  -97.74175|  30.29368|
|tmp_F497329382C10...|   Velvet Taco|  -95.48866|  30.18314|
|tmp_9B40A1BF3237E...|Torchy's Tacos| -96.805853| 32.909982|
|tmp_38210E5EC047B...|Torchy's Tacos|  -96.68755|  33.10118|
|tmp_DF0C5DF6CA549...|Torchy's Tacos|  -97.75159|  30.24542|
|tmp_BE38CAC8D46CF...|Torchy's Tacos|  -97.80877|  30.52676|
|tmp_44390C4117BEA...|Torchy's Tacos|  -97.82594|   30.4547|
|tmp_8032605AA5BDC...|   Velvet Taco| -96.469695| 32.898634|
|tmp_0A2AA67757F42...|Torchy's Tacos|  -96.44858|  32.90856|
|tmp_643821EB9C104...|Torchy's Tacos|  -97.11933|  32.94021|
|tmp_0042962D27E06...|   Velvet Taco|-95.3905374|29.7444214|
|tmp_8D0E2246C3F36...|Torchy's Tacos|  -97.15952|  33.22987|
|tmp_CB939610BC175...|Torchy's Tacos|  -95.62067|  29.60098|
|tmp_54C9A79320840...|Torchy's Tacos|  -97.75604|  30.37091|
|tmp_96D7B4FBCB327...|Torchy's Tacos|  -98.49816|  29.60937|
|tmp_1BB732F35314D...|   Velvet Taco|  -95.41044|    29.804|
|tmp_55787B14975DD...|   Velvet Taco|-96.7173913|32.9758554|
|tmp_7DC02C9CC1FAA...|Torchy's Tacos|  -95.29544|  32.30361|
|tmp_1987B31B9E24D...|   Velvet Taco|  -95.41006| 29.770256|
+--------------------+--------------+-----------+----------+
only showing top 20 rows

We’ve retrieved the latitude and longitude of our locations, as well as the name of the chain each location belongs to. We used the CASE WHEN statement in our query in order to simplify the location names. This way, we can easily select all the stores from the Torchy’s Tacos chain, for example, and not have to worry about individual locations being called things like “Torchy’s Tacos – Rice Village” or “Velvet Taco Midtown”, etc.
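
For example, with the simplified chain column in place, selecting one chain's stores is a one-line filter (an illustrative sketch, not from the original post):

# Filter the stores DataFrame down to a single chain using the simplified name.
torchys_stores = stores.filter(F.col("chain") == "Torchy's Tacos")
velvet_stores = stores.filter(F.col("chain") == "Velvet Taco")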

We can also visualize these locations using SedonaKepler. First, we can create the map using the following snippet:

location_map = SedonaKepler.create_map(stores, "Locations", 
    config = location_map_cfg)

Then, we can display the results by simply calling location_map in the notebook. For convenience, we included the location_map_cfg Python dict in our notebook, which stores the settings necessary for the map to be created with the locations color-coded by chain. If we wish to make modifications to the map and save the new configuration for later use, we can do so by calling location_map.config and saving the result either as a cell in our notebook or in a separate location_map_cfg.py file.
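
A minimal sketch of capturing and saving a tweaked configuration might look like the following (the file name is an assumption):

# Capture the current Kepler map settings after interactive edits.
location_map_cfg = location_map.config

# Persist the config dict for reuse in a later session.
with open("location_map_cfg.py", "w") as f:
    f.write(f"location_map_cfg = {location_map_cfg!r}")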

Generating Isochrones

Now, for each of these locations, we can generate a polygon known as an isochrone or drivetime. These polygons represent the areas that are within a certain time's drive from the given location. We will generate these drivetimes using the Valhalla isochrone API:

def get_isochrone(lat, lng, costing, time_steps, name, location_id):
    url = "<https://valhalla1.openstreetmap.de/isochrone>"
    params = {
      "locations": [{"lon": lng, "lat": lat}],
      "contours": [{"time": i} for i in time_steps],
      "costing": costing,
      "polygons": 1,
    }
    response = requests.post(url, json=params)
    if response:
        result = response.json()
        if 'error_code' not in result.keys():
            df = gpd.GeoDataFrame.from_features(result)
            df['name'] = name
            df['id'] = location_id
            return df[['name','id','geometry']]

The function takes as its input a latitude and longitude, a costing parameter, a list of time steps, a location name, and a location id. The output is a GeoDataFrame containing a Shapely polygon representing the isochrone, along with the name and id of the location the isochrone corresponds to.

We have separate columns for a location id and a location name so that we can use the id column to examine isochrones for individual restaurants and we can use the name column to look at isochrones for each of the chains.

The costing parameter can take on several different values (see the API reference here), and it can be used to create “drivetimes” assuming the user is either walking, driving, or taking public transport.
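
For instance, the same helper can produce a 10-minute walking isochrone by swapping the costing value; this call is illustrative and reuses a location from the table above:

# 10-minute walking isochrone for a single Torchy's Tacos location.
walk_10_min = get_isochrone(30.29368, -97.74175, 'pedestrian', [10],
    "Torchy's Tacos", 'tmp_D17CA8BD72325')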

We create a GeoDataFrame of all of the 5-minute drivetimes for our taco restaurant locations

drivetimes_5_min = pd.concat([get_isochrone(row.lat, row.long, 'auto', [5],
 row.chain, row.id) for row in stores.select('id','chain','lat','long').collect()])

and then save it to our S3 storage for later use:

drivetimes_5_min.to_csv('s3://path/drivetimes_5_min_torchys_velvet.csv',
 index = False)

Because we are using a free API and we have to create quite a few of these isochrones, we highly recommend saving the file for later analysis. For the purposes of this blog, we have provided a ready-made isochrone file here, which we can load into Wherobots with the following snippet:

sedona.read.option('header','true').format('csv') \
    .load('s3://path/drivetimes_5_min_torchys_velvet.csv') \
    .createOrReplaceTempView('drivetimes_5_min')

We can now visualize our drivetime polygons in SedonaKepler. As before, we first create the map with the snippet below.

map_isochrones = sedona.read.option('header','true').format('csv') \
    .load('s3://path/drivetimes_5_min_torchys_velvet.csv')

isochrone_map = SedonaKepler.create_map(map_isochrones, "Isochrones",
    config = isochrone_map_cfg)

Now, we can display the result by calling isochrone_map.

The Analysis

At this point, we have a collection of the Torchy’s and Velvet Taco locations in Texas, and we know the areas which are within a 5-minute drive of each location. What we want to do now is to estimate the number of potential customers that live near each of these locations, and the extent to which these populations overlap.

A First Look

Before we look at how these two chains might compete with each other, let’s also take a look at the extent to which restaurants within each chain might be cannibalizing each others’ sales. A quick way to do this is by using the filtering feature in Kepler to look at isochrones for a single chain:



We see that locations for each chain are fairly spread out and, at least at the 5-minute drivetime level, there is not a high degree of cannibalization within each chain. Looking at the isochrones for both chains, however, we notice that Velvet Taco locations often tend to be near Torchy's Tacos locations (or vice versa). At this point, all we have are qualitative statements based on these maps. Next, we will show how to use H3 and existing open-source datasets to make these statements more quantitative.

Estimating Cannibalization Potential

As we can see by looking at the map of isochrones above, they are highly irregular polygons which have a considerable amount of overlap. In general, these polygons are not described in a ‘nice’ way by any administrative boundaries such as census block groups, census tracts, etc. Therefore, we will have to be a little creative in order to estimate the population inside them.

One way of doing this using the tools provided by Apache Sedona and Wherobots is to convert these polygons to H3 hexes. We can do this with the following snippet:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
createOrReplaceTempView('h3_isochrones')

This turns our table of drivetime polygons into a table where each row represents a hexagon with sides roughly 400m long that is part of a drivetime polygon. We also record the chain each hexagon is associated with (the chain of the polygon it came from). Storing each hexagon in its own row simplifies the population estimation later on.

Although the question of estimating population inside individual H3 hexes is also a difficult one (we will release a notebook on this soon), open-source datasets with this information are available online, and we will use one such dataset, provided by Kontur:

kontur = sedona.read.option('header','true') \
    .load('s3://path/us_h3_8_pop.geojson', format="json") \
    .drop('_corrupt_record').dropna() \
    .selectExpr('CAST(CONV(properties.h3, 16, 10) AS BIGINT) AS h3',
        'properties.population as population')

kontur.createOrReplaceTempView('kontur')

We can now enhance our h3_isochrones table with population counts for each H3 hex:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
join(kontur, 'h3', 'left').distinct().createOrReplaceTempView('h3_isochrones')

At this stage, we can also quickly compute the cannibalization potential within each chain. Using the following code, for example, we can estimate the number of people who live within a 5-minute drive of more than one Torchy's Tacos:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
join(kontur, 'h3', 'left').filter('name LIKE "%Torchy%"').select('h3','population') .\\
groupBy('h3').count().filter('count >= 2').join(kontur, 'h3', 'left').distinct() .\\
agg(F.sum('population')).collect()[0][0]
97903.0

We can easily change this code to compute the same information for Velvet Taco by changing filter('name LIKE "%Torchy%"') in the snippet above to filter('name LIKE "%Velvet%"'). If we do this, we will see that 100,298 people live within a 5-minute drive of more than one Velvet Taco. Thus, the Torchy's Tacos brand appears to be slightly better at avoiding cannibalization among its own locations (especially given that Torchy's Tacos has more locations than Velvet Taco).
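
To avoid editing the snippet by hand for each chain, the comparison can be wrapped in a small helper. This is an illustrative sketch that mirrors the query above and assumes the drivetimes_5_min view, the kontur DataFrame, and the earlier imports are in scope:

def multi_location_population(chain_keyword):
    # Estimate the population living within a 5-minute drive of two or more
    # locations of a single chain, mirroring the query shown above.
    return sedona.sql("""
        SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
        FROM drivetimes_5_min
    """).select(explode('h3'), 'name', 'id').withColumnRenamed('col', 'h3') \
        .join(kontur, 'h3', 'left') \
        .filter(f'name LIKE "%{chain_keyword}%"') \
        .select('h3', 'population') \
        .groupBy('h3').count().filter('count >= 2') \
        .join(kontur, 'h3', 'left').distinct() \
        .agg(F.sum('population')).collect()[0][0]

multi_location_population("Torchy")   # ~97,903
multi_location_population("Velvet")   # ~100,298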

Now, we can run the following query to show the number of people in Texas who live within a 5 minutes drive of a Torchy’s Tacos:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE "%Torchy's%"
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()

The reason we select distinct H3 hexes here is because a single hex can belong to more than one isochrone (as evidenced by the SedonaKepler visualizations above). We get the following output:

+---------------+
|sum(population)|
+---------------+
|      1546765.0|
+---------------+

So roughly 1.5 million people in Texas live within a 5-minute drive of a Torchy's Tacos location. Looking at our previous calculations for how many people live near more than one restaurant of the same chain, we can see that Torchy's Tacos locations near each other cannibalize about 6.3% of the potential customers who live within 5 minutes of a Torchy's location (97,903 / 1,546,765 ≈ 6.3%).

Running a similar query for Velvet Taco tells us that roughly half as many people live within a 5-minute drive of a Velvet Taco:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE '%Velvet Taco%'
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()
+---------------+
|sum(population)|
+---------------+
|       750360.0|
+---------------+

As before, we can also see that Velvet Taco locations near each other cannibalize about 13.4% of the potential customers who live within 5 minutes of a Velvet Taco location (100,298 / 750,360 ≈ 13.4%).

Now, we can estimate the potential for cannibalization between these two chains:

sedona.sql("""
WITH overlap_h3 (h3, population) AS
(
    SELECT DISTINCT a.h3, ANY_VALUE(a.population)
    FROM h3_isochrones a LEFT JOIN h3_isochrones b ON a.h3 = b.h3
    WHERE a.name != b.name
    GROUP BY a.h3
)
SELECT sum(population)
FROM overlap_h3
""").show()

which gives:

+---------------+
|sum(population)|
+---------------+
|       415033.0|
+---------------+

We can see that more than half of the people who live near a Velvet Taco location also live near a Torchy's Tacos location (415,033 / 750,360 ≈ 55%), and we can visualize this population overlap:

isochrones_h3_map_data = sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name', 'id').withColumnRenamed('col', 'h3') \
    .join(kontur, 'h3', 'left') \
    .select('name', 'population', array('h3')).withColumnRenamed('array(h3)', 'h3') \
    .selectExpr('name', 'population', 'ST_H3ToGeom(h3)[0] AS geometry')

isochrones_h3_map = SedonaKepler.create_map(isochrones_h3_map_data, 'Isochrones in H3', config = isochrones_h3_map_cfg)
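
As before, calling isochrones_h3_map in a notebook cell renders the map.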

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


New Wherobots Cloud Features, How Overture Maps Uses Apache Sedona, Aircraft Data, & Spatial Lakehouses

Welcome to This Month In Wherobots, the monthly developer newsletter for the Wherobots & Apache Sedona community! In this edition we have a look at the latest Wherobots Cloud release, how the Overture Maps Foundation uses Apache Sedona to generate their data releases, processing a billion aircraft observations, building spatial data lakehouses with Iceberg Havasu, the new Apache Sedona 1.6.0 release, and more!

Introducing WherobotsAI For Planetary Inference And Capabilities That Modernize Spatial Intelligence At Scale

Wherobots announced significant new features in Wherobots Cloud to enable machine learning inference on satellite imagery via SQL, new Python and Java database drivers for interacting with WherobotsDB in your own analytics applications or data orchestration tooling, and a scalable vector tiles generator. These new enhancements are available now in Wherobots Cloud.

Read The Blog Post or Register For The Webinar

Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

Querying Overture Maps GeoParquet data using Apache Sedona

The Overture Maps Foundation publishes an open, comprehensive global map dataset with layers for transportation, places, 3D buildings, and administrative boundaries. The data comes from multiple sources and is published in cloud-native GeoParquet format, made publicly available for download in cloud object storage. To wrangle such a large, planetary-scale dataset, the Overture team uses Apache Sedona to prepare, process, and generate partitioned GeoParquet files. This blog post dives into the benefits of GeoParquet, how Overture uses Sedona to generate GeoParquet (including a dual Geohash partitioning and sorting method), and how to query and analyze the Overture Maps dataset using Wherobots Cloud.
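
As a small illustration of the pattern described here, the sketch below reads a partitioned GeoParquet dataset with Sedona and applies a spatial filter; the S3 path and bounding box are placeholders, not details from the article.

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Read a (placeholder) GeoParquet dataset and filter it spatially.
buildings = sedona.read.format("geoparquet").load("s3://example-bucket/overture/buildings/")
buildings.createOrReplaceTempView("buildings")

seattle_area = sedona.sql("""
    SELECT id, geometry
    FROM buildings
    WHERE ST_Intersects(geometry, ST_PolygonFromEnvelope(-122.46, 47.48, -122.22, 47.73))
""")
seattle_area.show(5)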

Read the article: Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

Featured Community Member: Feng Jiang

June featured community member

Our featured Apache Sedona and Wherobots Community Member this month is Feng Jiang, a Senior Software Engineer at Microsoft, where he works with map and geospatial data at scale. Through his involvement with the Overture Maps Foundation, he also helps maintain and publish the public Overture Maps dataset. In the blog post "Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona" he shared insights gained from working with Apache Sedona in the pipeline used to generate GeoParquet data for Overture's planetary-scale map data. Thanks for your contributions and for being a part of the Apache Sedona community!

Processing A Billion Aircraft Observations With Apache Sedona In Wherobots Cloud

Impacted flight segments

An important factor to consider when analyzing aircraft data is the potential impact of weather and especially severe weather events on aircraft flights. This tutorial uses public ADS-B aircraft trace data combined with weather data to identify which flights have the highest potential to be impacted by severe weather events. We also see how to combine real-time Doppler radar raster data as well as explore the performance of working with a billion row dataset for spatial operations like point-in-polygon searches and spatial joins.

Read The Tutorial: Processing A Billion Aircraft Observations With Apache Sedona In Wherobots Cloud

Training Series: Large-Scale Geospatial Analytics With Graphs And The PyData Ecosystem


Choosing the right tool for the job is an important aspect of data science, and equally important is understanding how the tools fit together and can be used alongside each other. This hands-on workshop shows how to leverage the scale of Apache Sedona with Wherobots Cloud for geospatial data processing, alongside common Python tooling like Geopandas, and how to add graph analytics using Neo4j to our analysis toolkit. Using a dataset of species observations we build a species interaction graph to find which species share habitat overlap, a common workflow for conservation use cases.

Watch The Workshop Recording: Large Scale Geospatial Analytics With Graphs And The PyData Ecosystem

Apache Sedona 1.6 Release

Apache Sedona Ecosystem

Version 1.6.0 of Apache Sedona is now available! This version includes support for Shapely 2.0 and GeoPandas 0.11.1+, enhanced support for geography data, new vector and raster functions, and tighter integration with Python raster data workflows through support for Rasterio and NumPy user-defined functions (UDFs). You can learn more about this release in the release notes.

Read The Apache Sedona 1.6 Release Notes

Building Spatial Data Lakehouses With Iceberg Havasu

Iceberg Havasu: A Spatial Data Lakehouse Format

This talk from Subsurface 2024 introduces the Havasu spatial table format, an extension of Apache Iceberg used to build spatial data lakehouses. We learn about the motivation for adding spatial functionality to Iceberg, how Havasu enables efficient spatial queries for both vector and raster data, and how to use a familiar SQL table interface when building large-scale geospatial analytics applications.

Watch The Recording: Building Spatial Data Lakehouses With Iceberg Havasu

Upcoming Events

Want to receive this monthly update in your inbox? Sign up for the This Month In Wherobots Newsletter:


Introducing WherobotsAI for planetary inference, and capabilities that modernize spatial intelligence at scale

We are excited to announce a preview of WherobotsAI, our new suite of AI- and ML-powered capabilities that unlock spatial intelligence in satellite imagery and GPS location data. Additionally, we are bringing the high performance of WherobotsDB to your favorite data applications with a Spatial SQL API that integrates WherobotsDB with more interfaces, including Apache Airflow for spatial ETL. Finally, we’re introducing the most scalable vector tile generator on earth to make it easier for teams to produce engaging and interactive map applications. All of these new features are capable of operating on planetary-scale data.

Watch the walkthrough of this release here.

Wherobots Mission and Vision

Before we dive into this release, we think it’s important to understand how these capabilities fit into our mission, our product principles, and vision for the Spatial Intelligence Cloud so you can see where we are headed.

Our Mission
These new capabilities are core to Wherobots’ mission, which is to unlock spatial intelligence of earth, society, and business, at a planetary scale. We will do this by making it extremely easy to utilize data and AI technology purpose-built for creating spatial intelligence that’s cloud-native and compatible with modern open data architectures.

Our Product Principles

  • We’re building the spatial intelligence platform for modern organizations. Every organization with a mission directly linked to the performance of tangible assets, goods and services, or data products about what’s happening in the physical world, will need a spatial intelligence platform to be competitive, sustainable, and climate adaptive.
  • It delivers intelligence for the greater good. Teams and their organizations want to analyze their worlds to create a net positive impact for business, society, and the earth.
  • It’s purpose-built yet simple. Spatial intelligence won’t scale through in-house ‘spatial experts’, or through general purpose architectures that are not optimized for spatial workloads or development experiences.
  • It’s efficient at any scale. Maximal performance, scale, and cost efficiency can only be achieved through a cloud-native, serverless solution.
  • It creates intelligence with AI. Every organization will need AI alongside modern analytics to create spatial intelligence.
  • It’s open by default. Pace of innovation depends on choice. Organizations that adopt cloud-native, open source compatible, and modern open data architectures will innovate faster because they have more choices in the solutions they can use.

Our Vision
We exist because creating spatial intelligence at-scale is hard. Our contributions to Apache Sedona, leadership in the open geospatial domain, and investments in Wherobots Cloud have, and will make it easier. Users of Apache Sedona, Wherobots customers, and ultimately any AI application will be enabled to support better decisions about our physical and virtual worlds. They will be able to create solutions to improve these worlds that were otherwise infeasible or too costly to build. And the solutions developed will have a positive impact on society, business, and earth — at a planetary scale.

Introducing WherobotsAI

There are petabytes of satellite or aerial imagery produced every day. Yet for most analysts, scientists, and developers, these datasets are analytically inaccessible beyond what the naked eye can interpret. As a result, most organizations still rely on humans, and their eyes, to analyze satellite or other forms of aerial imagery. Wherobots can already perform analytics on overhead imagery (also known as raster data) and geospatial objects (known as vector data) simultaneously at scale. But organizations also want to use modern AI and ML technologies to streamline and scale otherwise visual, single-threaded tasks like object detection, classification, and segmentation from overhead imagery.

Like satellite imagery that is generally hard to analyze, businesses also find it hard to analyze GPS data in their applications because it’s too noisy; points don’t always correspond to the actual path taken. Teams need an easy solution for snapping noisy GPS data to road or other segment types, at any scale.

Today we are announcing WherobotsAI which offers fully managed AI and machine learning capabilities that accelerate the development of spatial insights, for anyone familiar with SQL or Python. WherobotsAI capabilities include:

[new] Raster Inference (preview): A first of its kind, Raster Inference unlocks the analytical potential of satellite or aerial imagery at a planetary scale, by integrating AI models with WherobotsDB to make it extremely easy to detect, classify, and segment features of interest in satellite and aerial images. You can see how easy it is to detect and georeference solar farms here, with just a few lines of SQL:

SELECT
  outdb_raster,
  RS_SEGMENT('solar-satlas-sentinel2', outdb_raster) AS solar_farm_result
FROM df_raster_input

These georeferenced predictions can be queried with WherobotsDB and can be interactively explored in a Wherobots notebook. Below is an example of detection of solar panels in SedonaKepler.

AI Inference Solar Farm

The models and AI infrastructure powering Raster Inference are fully managed, which means there’s nothing to set up or configure. Today, you can use Raster Inference to detect, segment, and classify solar farms, land cover, and marine infrastructure from terabyte-scale Sentinel-2 true color and multispectral imagery datasets in under half an hour, on our GPU runtimes available in the Wherobots Professional Edition. Soon we will be making the inference metadata for the models public, so if your own models meet this standard, they are supported by Raster Inference.

These models and datasets are just the starting point for WherobotsAI. We are looking forward to hearing from you to help us define the roadmap for what we should build support for next.

Map Matching: If you need to analyze trips at scale but struggle to wrangle noisy GPS data, Map Matching can turn billions of noisy GPS pings into signal by snapping shared points to road or other vector segments. Teams are using Map Matching to process hundreds of millions of vehicle trips per hour. This speed surpasses current commercial solutions, all for a cost of just a few hundred dollars.

Here’s an example of what WherobotsAI Map Matching does to improve the quality of your trip segments.

  • Red and yellow line segments were created from raw, noisy GPS data.
  • Green represents Map Matched segments.

map matching algorithm

Visit the user documentation to learn more and get started with WherobotsAI.

A Spatial SQL API for WherobotsDB

WherobotsDB, our serverless, highly efficient compute engine compatible with Apache Sedona, is up to 60x more performant for spatial joins than popular general-purpose big data engines and warehouses, and up to 20x faster than Apache Sedona on its own. It will remain the most performant, earth-friendly solution for your spatial workloads at any scale.

Until today, teams had two options for harnessing WherobotsDB: they could write and run queries in Wherobots managed notebooks, or run spatial ETL pipelines using the Wherobots jobs interface.

Today, we’re enabling you to bring the utility of WherobotsDB to more interfaces with the new Spatial SQL API. Using this API, teams can remotely execute Spatial SQL queries using a remote SQL editor, build first-party applications using our client SDKs in Python (WherobotsDB API driver) and Java (Wherobots JDBC driver), or orchestrate spatial ETL pipelines using a Wherobots Apache Airflow provider.

Run spatial queries with popular SQL IDEs

The following is an example of how to integrate Harlequin, a popular SQL IDE with WherobotsDB. You’ll need a Wherobots API key to get started with Harlequin (or any remote client). API keys allow you to authenticate with Wherobots Cloud for programmatic access to Wherobots APIs and services. API keys can be created following a few steps in our user documentation.

We will query WherobotsDB using Harlequin in the Airflow example later in this blog.

$ pip install harlequin-wherobots
$ harlequin -a wherobots --api-key $(< api.key)

harlequin api key connection

You can find more information on how to use Harlequin in its documentation, and on the WherobotsDB adapter on its GitHub repository.

The Wherobots Python driver enables integration with many other tools as well. Here’s an example of using the Wherobots Python driver in the QGIS Python console to fetch points of interest from the Overture Maps dataset using Spatial SQL API.

import os

from wherobots.db import connect
from wherobots.db.region import Region
from wherobots.db.runtime import Runtime
import geopandas
from shapely import wkt

with connect(
        token=os.environ.get("WBC_TOKEN"),
        runtime=Runtime.SEDONA,
        region=Region.AWS_US_WEST_2,
        host="api.cloud.wherobots.com"
) as conn:
    curr = conn.cursor()
    curr.execute("""
    SELECT names.common[0].value AS name, categories.main AS category, geometry 
    FROM wherobots_open_data.overture.places_place 
    WHERE ST_DistanceSphere(ST_GeomFromWKT("POINT (-122.46552 37.77196)"), geometry) < 10000
    AND categories.main = "hiking_trail"
    """)
    results = curr.fetchall()
    print(results)

results["geometry"] = results.geometry.apply(wkt.loads)
gdf = geopandas.GeoDataFrame(results, crs="EPSG:4326",geometry="geometry")

def add_geodataframe_to_layer(geodataframe, layer_name):
    # Create a new memory layer
    layer = QgsVectorLayer(geodataframe.to_json(), layer_name, "ogr")

    # Add the layer to the QGIS project
    QgsProject.instance().addMapLayer(layer)

add_geodataframe_to_layer(gdf, "POI Layer")

Using the Wherobots Python driver with QGIS

Visit the Wherobots user documentation to get started with the Spatial SQL API, or see our latest blog post that goes deeper into how to use our database drivers with the Spatial SQL API.

Automating Spatial ETL workflows with the Apache Airflow provider for Wherobots

ETL (extract, transform, load) workflows are oftentimes required to prepare spatial data for interactive analytics, or to refresh datasets automatically as new data arrives. Apache Airflow is a powerful and popular open source orchestrator of data workflows. With the Wherobots Apache Airflow provider, you can now use Apache Airflow to convert your spatial SQL queries into automated workflows running on Wherobots Cloud.

Here’s an example of the Wherobots Airflow provider in use. In this example we identify the top 100 buildings in the state of New York with the most places (facilities, services, business, etc.) registered within them using the Overture Maps dataset, and we’ll eventually auto-refresh the result daily. The initial view can be generated with the following SQL query:

CREATE TABLE wherobots.test_db.top_100_hot_buildings_daily AS
SELECT
  buildings.id AS building,
  first(buildings.names),
  count(places.geometry) AS places_count,
  '2023-07-24' AS ts
FROM wherobots_open_data.overture.places_place places
JOIN wherobots_open_data.overture.buildings_building buildings
  ON ST_CONTAINS(buildings.geometry, places.geometry)
WHERE places.updatetime >= '2023-07-24'
  AND places.updatetime < '2023-07-25'
  AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), places.geometry)
  AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), buildings.geometry)
GROUP BY building
ORDER BY places_count DESC
LIMIT 100
  • A place in Overture is defined as real-world facilities, services, businesses or amenities.
  • We used an arbitrary date of 2023-07-24.
  • New York is defined by a simple bounding box polygon (-79.762152, 40.496103, -71.856214, 45.01585) (we could alternatively join with its appropriate administrative boundary polygon).
  • We use two WHERE clauses on places.updatetime to filter one day’s worth of data.
  • The query creates a new table wherobots.test_db.top_100_hot_buildings_daily to store the query result. Note that it will not directly return any records because we are loading directly into a table.

Now, let's use Harlequin, as described earlier, to inspect the outcome of creating this table with the above query:

SELECT * FROM wherobots.test_db.top_100_hot_buildings_daily

Harlequin query test 2

Apache Airflow and the Airflow Provider for Wherobots allow you to schedule and execute this query each day, injecting the appropriate date filters into your templatized query.

  • In your Apache Airflow instance, install the airflow-providers-wherobots library. You can either execute pip install airflow-providers-wherobots, or add the library to the dependency list of your Apache Airflow runtime.
  • Create a new “generic” connection for Wherobots called wherobots_default, using api.cloud.wherobots.com as the “Host” and your Wherobots API key as the “Password”.

The next step is to create an Airflow DAG. The Wherobots Provider exposes the WherobotsSqlOperator for executing SQL queries. Replace the hardcoded "2023-07-24" in your query with the Airflow template macros {{ ds }} and {{ next_ds }}, which are rendered as the DAG schedule date on the fly:

import datetime

from airflow import DAG
from airflow_providers_wherobots.operators.sql import WherobotsSqlOperator

with DAG(
    dag_id="example_wherobots_sql_dag",
    start_date=datetime.datetime.strptime("2023-07-24", "%Y-%m-%d"),
    schedule="@daily",
    catchup=True,
    max_active_runs=1,
):
    operator = WherobotsSqlOperator(
        task_id="execute_query",
        wait_for_downstream=True,
        sql="""
        INSERT INTO wherobots.test_db.top_100_hot_buildings_daily
        SELECT
          buildings.id AS building,
          first(buildings.names),
          count(places.geometry) AS places_count,
          '{{ ds }}' AS ts
        FROM wherobots_open_data.overture.places_place places
        JOIN wherobots_open_data.overture.buildings_building buildings
          ON ST_CONTAINS(buildings.geometry, places.geometry)
        WHERE places.updatetime >= '{{ ds }}'
          AND places.updatetime < '{{ next_ds }}'
          AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), places.geometry)
          AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), buildings.geometry)
        GROUP BY building
        ORDER BY places_count DESC
        LIMIT 100
        """,
        return_last=False,
    )

You can visualize the status and logs of the DAG's execution in the Apache Airflow UI. As shown below, the operator prints out the exact query rendered and executed when you run your DAG.

apache airflow spatial sql api

Please visit the Wherobots user documentation for more details on how to set up your Apache Airflow instance with the Wherobots Provider.

Generate Vector Tiles — formatted as PMTiles — at Global Scale

Vector tiles are high-resolution representations of features optimized for visualization, computed offline and displayed in map applications. Precomputing tiles decouples dataset preparation from client-side rendering driven by zooming and panning, which lets map developers significantly improve the utility, clarity, and responsiveness of feature-rich interactive map applications.

Traditional vector tile generators like Tippecanoe are limited to the processing capability of a single VM and support only a limited set of input formats. These solutions are great for small-scale tile generation workloads when data is already in the right file format. But if you’re like the teams we’ve worked with, you may start small and need to scale past the limits of a single VM, or have a variety of file formats. You just want to generate vector tiles with the data you have, at any scale, without having to worry about format conversion steps, configuring infrastructure, partitioning your workload around the capability of a VM, or waiting for workloads to complete.

Vector Tile Generation, or VTiles, for WherobotsDB generates vector tiles in PMTiles format from common data lake formats, incredibly quickly and at planetary scale, so you can start small knowing you can scale without looking for another solution. VTiles is incredibly fast because serverless computation is parallelized and the WherobotsDB engine is optimized for vector tile generation. This means your development teams can spend less time building map applications that matter to your customers.

Using a Tokyo runtime, we generated vector tiles with VTiles for all buildings in the Overture dataset, from zoom levels 4-15 across the entire planet, in 23 minutes. That’s fast and efficient for a planetary-scale operation. You can run the tile-generation-example notebook in the Wherobots Pro tier to experience the speed and simplicity of VTiles yourself. Here’s what this looks like:

Visit our user documentation to start generating vector tiles at-scale.

Try Wherobots now

We look forward to hearing how you put these new capabilities to work, along with your feedback to increase the usefulness of the Wherobots Cloud platform. You can try these new features today by creating a Wherobots Cloud account. WherobotsAI is a professional tier feature.

Please reach out on LinkedIn or email us at info@wherobots.com.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


The Spatial SQL API brings the performance of WherobotsDB to your favorite data applications

Since its launch last fall, Wherobots has raised the bar for cloud-native geospatial data analytics, offering the first and only platform for working with vector and raster geospatial data together at a planetary scale. Wherobots delivers a significant breadth of geospatial analytics capabilities, built around a cloud-native data lakehouse architecture and query engine that delivers up to 60x better performance than incumbent solutions. Accessible through the powerful notebook experience data scientists and data engineers know and love, Wherobots Cloud is the most comprehensive, approachable, and fully-managed serverless offering for enabling spatial intelligence at scale.

Today, we’re announcing the Wherobots Spatial SQL API, powered by Apache Sedona, to bring the performance of WherobotsDB to your favorite data applications. This opens the door to a world of direct-SQL integrations with Wherobots Cloud, bringing a serverless cloud engine that’s optimized for spatial workloads at any scale into your spatial ETL pipelines and applications, and taking your users and engineers closer to your data and spatial insights.

Register for our release webinar on July 10th here: https://bit.ly/3yFlFYk

Developers love Wherobots because compute is abstracted and managed by Wherobots Cloud. Because it can run at a planetary scale, Wherobots streamlines development and reduces time to insight. It runs on a data lake architecture, so data doesn’t need to be copied into a proprietary storage system, and integrates into familiar development tools and interfaces for exploratory analytics and orchestrating production spatial ETL pipelines.

Utilize Apache Airflow or SQL IDEs with WherobotsDB via the Spatial SQL API

Wherobots Cloud and the Wherobots Spatial SQL API are powered by WherobotsDB, with Apache Sedona at its core: a distributed computation engine that can horizontally scale to handle computation and analytics on any dataset. Wherobots Cloud automatically manages the infrastructure and compute resources of WherobotsDB to serve your use case based on how much computation power you need.

Behind the scenes, your Wherobots Cloud “runtime” defines the amount of compute resources allocated and the configuration of the software environment that executes your workload (in particular for AI/ML use cases, or if your ETL or analytics workflow depends on 1st or 3rd party libraries).

Our always-free Community Edition gives access to a modest “Sedona” runtime for working with small-scale datasets. Our Professional Edition unlocks access to much larger runtimes, up to our “Tokyo” runtime capable of working on planetary-scale datasets, and GPU-accelerated options for your WherobotsAI workloads.

With the release of the Wherobots Spatial SQL API and its client SDKs, you can bring WherobotsDB, the ease-of-use, and the expressiveness of SQL to your Apache Airflow spatial ETL pipelines, your applications, and soon to tools like Tableau, Superset, and other 3rd party systems and applications that support JDBC.

Our customers love applying the performance and scalability of WherobotsDB to their data preparation workflows and their compute-intensive data processing applications.

Use cases include

  • Preparation of nationwide and planetary-scale datasets for their users and customers
  • Processing hundreds of millions of mobility data records every day
  • Creating and analyzing spatial datasets in support of their real estate strategy and decision-making.

Now customers have the option to integrate new tools with Wherobots for orchestration and development of spatial insights using the Spatial SQL API.

How to get started with the Spatial SQL API

When you establish a connection to the Wherobots Spatial SQL API, a SQL session is started, backed by your selected WherobotsDB runtime (a “Sedona” runtime by default, but you can specify larger runtimes if you need more horsepower). Queries submitted through this connection are securely executed against your runtime, with compute fully managed by Wherobots.

We provide client SDKs in Java and in Python to easily connect and interact with WherobotsDB through the Spatial SQL API, as well as an Airflow Provider to build your spatial ETL DAGs; all of which are open-source and available on package registries, as well as on Wherobots’ GitHub page.

Using the Wherobots SQL Driver in Python

Wherobots provides an open-source Python library that exposes a DB-API 2.0 compatible interface for connecting to WherobotsDB. To build a Python application around the Wherobots DB-API driver, add the wherobots-python-dbapi library to your project’s dependencies:

$ poetry add wherobots-python-dbapi

Or directly install the package on your system with pip:

$ pip install wherobots-python-dbapi

From your Python application, establish a connection with wherobots.db.connect() and use cursors to execute your SQL queries and process their results:

import logging
import sys

from wherobots.db import connect
from wherobots.db.region import Region
from wherobots.db.runtime import Runtime

# Optionally, set up logging to get information about the driver's activity.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(name)20s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

# Get your API key, or securely read it from a local file.
api_key = '...'

with connect(
        host="api.cloud.wherobots.com",
        api_key=api_key,
        runtime=Runtime.SEDONA,
        region=Region.AWS_US_WEST_2) as conn:
    cur = conn.cursor()
    sql = """
      SELECT
          id,
          names['primary'] AS name,
          geometry,
          population
      FROM
          wherobots_open_data.overture_2024_02_15.admins_locality
      WHERE localityType = 'country'
      SORT BY population DESC
      LIMIT 10
    """
    cur.execute(sql)
    results = cur.fetchall()
    print(results)

For more information and future releases, see https://github.com/wherobots/wherobots-python-dbapi-driver on GitHub.

Using the Apache Airflow provider

Wherobots provides an open-source provider for Apache Airflow, defining an Airflow operator for executing SQL queries directly on WherobotsDB. With this new capability, you can integrate your spatial analytics queries, data preparation or data processing steps into new or existing Airflow workflow DAGs.

To build or extend your Airflow DAG using the WherobotsSqlOperator, add the airflow-providers-wherobots dependency to your project:

$ poetry add airflow-providers-wherobots

Define your connection to Wherobots; by default the Wherobots operators use the wherobots_default connection ID:

$ airflow connections add "wherobots_default" \
    --conn-type "wherobots" \
    --conn-host "api.cloud.wherobots.com" \
    --conn-password "$(< api.key)"

Instantiate the WherobotsSqlOperator with your choice of runtime and your SQL query, and integrate it into your Airflow DAG definition:

from wherobots.db.runtime import Runtime
from airflow_providers_wherobots.operators.sql import WherobotsSqlOperator

...

select = WherobotsSqlOperator(
    task_id="select_countries",  # task_id is required by Airflow; name it as you like
    runtime=Runtime.SEDONA,
    sql="""
        SELECT
            id,
            names['primary'] AS name,
            geometry,
            population
        FROM
            wherobots_open_data.overture_2024_02_15.admins_locality
        WHERE localityType = 'country'
        SORT BY population DESC
        LIMIT 10
    """,
)
# select.execute() or integrate into your Airflow DAG definition

apache airflow spatial sql api
For more information and future releases, see https://github.com/wherobots/airflow-providers-wherobots on GitHub.

Using the Wherobots SQL Driver in Java

Wherobots provides an open-source Java library that implements a JDBC (Type 4) driver for connecting to WherobotsDB. To start building Java applications around the Wherobots JDBC driver, add the following line to your build.gradle file’s dependency section:

implementation "com.wherobots:wherobots-jdbc-driver"

In your application, you only need to work with Java’s JDBC APIs from the java.sql package:

import com.wherobots.db.Region;
import com.wherobots.db.Runtime;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Get your API key, or securely read it from a local file.
String apiKey = "...";

Properties props = new Properties();
props.setProperty("apiKey", apiKey);
props.setProperty("runtime", Runtime.SEDONA);
props.setProperty("region", Region.AWS_US_WEST_2);

try (Connection conn = DriverManager.getConnection("jdbc:wherobots://api.cloud.wherobots.com", props)) {
    String sql = """
        SELECT
            id,
            names['primary'] AS name,
            geometry,
            population
        FROM
            wherobots_open_data.overture_2024_02_15.admins_locality
        WHERE localityType = 'country'
        SORT BY population DESC
        LIMIT 10
    """;
    Statement stmt = conn.createStatement();
    try (ResultSet rs = stmt.executeQuery(sql)) {
        while (rs.next()) {
            System.out.printf("%s: %s %f %s\n",
                rs.getString("id"),
                rs.getString("name"),
                rs.getDouble("population"),
                rs.getString("geometry"));
        }
    }
}

For more information and future releases, see https://github.com/wherobots/wherobots-jdbc-driver on GitHub.

Conclusion

The Wherobots Spatial SQL API takes Wherobots’ vision of hassle-free, scalable geospatial data analytics & AI one step further by making it the easiest way to run your Spatial SQL queries in the cloud. Paired with Wherobots and Apache Sedona’s comprehensive support for working with all geospatial data at any scale and in any format, and with Wherobots AI’s inference features available directly from SQL, the Wherobots Spatial SQL API is also the most flexible and the most capable platform for getting the most out of your data.

Wherobots vision

We exist because creating spatial intelligence at scale is hard. Our contributions to Apache Sedona, our leadership in the open geospatial domain, and our investments in Wherobots Cloud have made it easier, and will continue to do so. Users of Apache Sedona, Wherobots customers, and ultimately any AI application will be able to support better decisions about our physical and virtual worlds. They will be able to create solutions to improve these worlds that were otherwise infeasible or too costly to build. And the solutions developed will have a positive impact on society, business, and the earth, at a planetary scale.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


Processing A Billion Aircraft Observations And Combining With Weather Data Using Apache Sedona On Wherobots Cloud

In this post we take a look at combining aircraft flight data with weather data to identify potentially impacted flights, using public ADS-B data and weather data from NOAA. Along the way we introduce the types of spatial queries available with WherobotsDB (spatial range queries, spatial kNN queries, and spatial joins using Spatial SQL), geospatial visualization options, and working with various file formats including GeoParquet, GeoTiff, Shapefiles, and more.

We’ll be running this example in Wherobots Cloud, where you can create a free account to follow along. See the Getting Started With Wherobots Cloud series for an overview of Wherobots Cloud.

Flight Trace Data

The ADS-B flight observation data we’ll be using comes from ADSB.lol, which provides publicly accessible daily updates of crowd-sourced flight tracking data. I adapted some great examples for downloading, parsing, and enriching this dataset from Mark Litwintschik’s blog post and extracted one month of data, which resulted in just over 1.1 billion observations. I extended Mark’s approach a bit and saved this data as GeoParquet, partitioned by S2 cell id. This enables more efficient spatial queries on the dataset.
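As a rough illustration, the sketch below shows how you might tag each trace with an S2 cell id using Sedona’s ST_S2CellIDs function and write the result as GeoParquet partitioned by that id. The paths and the S2 level are illustrative assumptions, not the exact pipeline used to produce this dataset.

from sedona.spark import *
from pyspark.sql.functions import expr

# Minimal sketch: derive a single S2 cell id per point and partition the
# GeoParquet output by it. Paths and the S2 level are illustrative only.
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

traces = sedona.read.format("geoparquet").load("s3a://my-bucket/flights/raw/")
traces = traces.withColumn("s2", expr("array_max(ST_S2CellIDs(geometry, 4))"))

(traces.write
    .format("geoparquet")
    .partitionBy("s2")
    .save("s3a://my-bucket/flights/2024_s2.parquet"))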

First, we’ll import dependencies and configure WherobotsDB to access an S3 bucket with our example data.

from sedona.spark import *

config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
           "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()
sedona = SedonaContext.create(config)

We load one month of world-wide aircraft trace data which is just over 1.1 billion point observations.

# Load flight traces via GeoParquet
traces_df = sedona.read.format("geoparquet").load("s3a://wherobots-examples/data/examples/flights/2024_s2.parquet")
traces_df.createOrReplaceTempView("traces")

traces_df.count()
-------------------
1103371367

Exploratory Data Analysis

I followed Mark’s approach of extracting individual trace observations into their own rows, storing detailed information about each trace (altitude, squawk code, heading, etc.) in a struct. You can find his code in this post.

Let’s view the schema of our Spatial DataFrame that we loaded via GeoParquet.

traces_df.printSchema()
root
 |-- dbFlags: long (nullable = true)
 |-- desc: string (nullable = true)
 |-- icao: string (nullable = true)
 |-- ownOp: string (nullable = true)
 |-- r: string (nullable = true)
 |-- reg_details: struct (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- iso2: string (nullable = true)
 |    |-- iso3: string (nullable = true)
 |    |-- nation: string (nullable = true)
 |-- t: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- trace: struct (nullable = true)
 |    |-- aircraft: struct (nullable = true)
 |    |    |-- alert: long (nullable = true)
 |    |    |-- alt_geom: long (nullable = true)
 |    |    |-- baro_rate: long (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- emergency: string (nullable = true)
 |    |    |-- flight: string (nullable = true)
 |    |    |-- geom_rate: long (nullable = true)
 |    |    |-- gva: long (nullable = true)
 |    |    |-- ias: long (nullable = true)
 |    |    |-- mach: double (nullable = true)
 |    |    |-- mag_heading: double (nullable = true)
 |    |    |-- nac_p: long (nullable = true)
 |    |    |-- nac_v: long (nullable = true)
 |    |    |-- nav_altitude_fms: long (nullable = true)
 |    |    |-- nav_altitude_mcp: long (nullable = true)
 |    |    |-- nav_heading: double (nullable = true)
 |    |    |-- nav_modes: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- nav_qnh: double (nullable = true)
 |    |    |-- nic: long (nullable = true)
 |    |    |-- nic_baro: long (nullable = true)
 |    |    |-- oat: long (nullable = true)
 |    |    |-- rc: long (nullable = true)
 |    |    |-- roll: double (nullable = true)
 |    |    |-- sda: long (nullable = true)
 |    |    |-- sil: long (nullable = true)
 |    |    |-- sil_type: string (nullable = true)
 |    |    |-- spi: long (nullable = true)
 |    |    |-- squawk: string (nullable = true)
 |    |    |-- tas: long (nullable = true)
 |    |    |-- tat: long (nullable = true)
 |    |    |-- track: double (nullable = true)
 |    |    |-- track_rate: double (nullable = true)
 |    |    |-- true_heading: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- version: long (nullable = true)
 |    |    |-- wd: long (nullable = true)
 |    |    |-- ws: long (nullable = true)
 |    |-- altitude: long (nullable = true)
 |    |-- flags: long (nullable = true)
 |    |-- geometric_altitude: long (nullable = true)
 |    |-- geometric_vertical_rate: long (nullable = true)
 |    |-- ground_speed: double (nullable = true)
 |    |-- h3_5: string (nullable = true)
 |    |-- indicated_airspeed: long (nullable = true)
 |    |-- lat: double (nullable = true)
 |    |-- lon: double (nullable = true)
 |    |-- roll_angle: double (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |    |-- track_degrees: double (nullable = true)
 |    |-- vertical_rate: long (nullable = true)
 |-- year: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- date: date (nullable = true)
 |-- s2: long (nullable = true)

We can also inspect the first few rows of data.

traces_df.show(5)
+-------+--------------------+------+-----+------+--------------------+----+-------------------+--------------------+----+--------------------+----------+-------------------+
|dbFlags|                desc|  icao|ownOp|     r|         reg_details|   t|          timestamp|               trace|year|            geometry|      date|                 s2|
+-------+--------------------+------+-----+------+--------------------+----+-------------------+--------------------+----+--------------------+----------+-------------------+
|      0|PIPER PA-28R-180/...|c81c51| null|ZK-RTE|{general, NZ, NZL...|P28R|2024-01-20 00:00:00|{{null, null, nul...|null|POINT (174.722786...|2024-01-20|7854277750134145024|
|      0|        AIRBUS A-320|c81e2c| null|ZK-OJS|{general, NZ, NZL...|A320|2024-01-15 00:00:00|{{0, 36400, 0, A3...|null|POINT (174.549826...|2024-01-15|7854277750134145024|
|      0|PIPER PA-28R-180/...|c81c51| null|ZK-RTE|{general, NZ, NZL...|P28R|2024-01-20 00:00:00|{{null, null, nul...|null|POINT (174.722764...|2024-01-20|7854277750134145024|
|      0|        AIRBUS A-320|c81e2c| null|ZK-OJS|{general, NZ, NZL...|A320|2024-01-15 00:00:00|{{null, null, nul...|null|POINT (174.557251...|2024-01-15|7854277750134145024|
|      0|PIPER PA-28R-180/...|c81c51| null|ZK-RTE|{general, NZ, NZL...|P28R|2024-01-20 00:00:00|{{0, 1600, null, ...|null|POINT (174.722589...|2024-01-20|7854277750134145024|
+-------+--------------------+------+-----+------+--------------------+----+-------------------+--------------------+----+--------------------+----------+-------------------+
only showing top 5 rows

We can query our aircraft traces Spatial DataFrame using Spatial SQL. To start off let’s just demonstrate selecting fields and viewing tabular results. Note that the geometry column here is a spatial type. The data we loaded was GeoParquet formatted so we don’t need to parse or cast the field to a geometry type.

sedona.sql("""
SELECT desc, ownOp, geometry, trace.ground_speed, trace.altitude, trace.aircraft.flight
FROM traces
WHERE trace.altitude IS NOT NULL AND OwnOp IS NOT NULL AND desc IS NOT NULL AND trace.aircraft.flight IS NOT NULL
LIMIT 10
""").show(truncate=False)
+----------------+-------------------+-----------------------------+------------+--------+--------+
|desc            |ownOp              |geometry                     |ground_speed|altitude|flight  |
+----------------+-------------------+-----------------------------+------------+--------+--------+
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.792839 -37.010239)|122.7       |400     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.782449 -37.013079)|161.6       |350     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.779046 -37.01401) |168.9       |475     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.771463 -37.016058)|165.7       |825     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.765248 -37.017746)|162.5       |1175    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.757827 -37.01976) |159.4       |1550    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.748356 -37.022296)|161.3       |1950    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.738951 -37.024796)|163.2       |2325    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.725607 -37.028395)|165.7       |2725    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.713367 -37.03184) |167.3       |3125    |DAL64   |
+----------------+-------------------+-----------------------------+------------+--------+--------+

Let’s see where the aircraft activity is in the ADS-B flight observation dataset. Where are we seeing the most aircraft activity? We can use H3 hexagons to aggregate the individual traces and visualize the results using the SedonaKepler visualization integration. Here we count the number of aircraft trace observations within each level 5 H3 hexagon and create a choropleth map to visualize the spatial distribution.

h3_traces_df = sedona.sql("""
SELECT
    COUNT(*) AS num,
    ST_H3ToGeom(ST_H3CellIDs(geometry, 5, false)) AS geometry,
    ST_H3CellIDs(geometry, 5, false) AS h3
FROM traces
GROUP BY h3
""")

SedonaKepler.create_map(h3_traces_df, name="Distribution of aircraft traces")

Aggregated aircraft traces

Since our data source relies on crowdsourced data retrieved from ground stations, the data coverage is not evenly distributed. We are notably missing observations over the oceans and other areas that lack crowd-sourced ground sensors. This isn’t an issue for this analysis because flight disruption is most likely to occur around airports.

We can filter for aircraft belonging to a specific airline using the ownOp column and also filter for traces that were observed while the aircraft was part of a named flight. This can help give us some insight into the routes flown by this airline.

jbu_h3_traces_df = sedona.sql("""
SELECT
    COUNT(*) AS num,
    ST_H3ToGeom(ST_H3CellIDs(geometry, 5, false)) AS geometry,
    ST_H3CellIDs(geometry, 5, false) AS h3
FROM traces
WHERE ownOp = "JETBLUE AIRWAYS CORP" AND trace.aircraft.flight LIKE "JBU%"
GROUP BY h3
""")

SedonaKepler.create_map(jbu_h3_traces_df, name="Distribution of aircraft traces")

Aggregated Flight Traces

Flight Traces – Spatial Queries And Performance

WherobotsDB supports optimizations for various types of spatial queries including:

  • Spatial range query
  • Spatial kNN query, and
  • Spatial joins

Let’s demonstrate these types of queries and compare the performance with this dataset.

Spatial Range Query

Let’s say we wanted to find all aircraft traces within the bounds of a specific area, perhaps the Gulf of Mexico region. We can do this using the ST_Within Spatial SQL predicate.

Spatial filter polygon

Here we apply this spatial predicate to filter our 1.1 billion observations to only those within the bounds of a polygon representing a specific area. This takes just over 4 seconds.

gulf_filter = 'ST_Within(geometry, ST_GeomFromWKT("POLYGON ((-84.294662 29.840644, -88.952866 30.297018, -89.831772 28.767659, -94.050522 29.61167, -97.038803 27.839076, -97.917709 21.453069, -94.489975 18.479609, -86.843491 21.616579, -80.779037 24.926295, -84.294662 29.840644))"))'
gulf_traces = traces_df.where(gulf_filter)

%%time
gulf_traces.count()
--------------------
7539347

Wall time: 4.37 s

Spatial kNN Query

Another type of spatial query that might be relevant for working with this dataset is a spatial kNN query, which will efficiently find the “k” nearest geometries where the value of k is specified by the user. For example, let’s find the 5 nearest aircraft traces to the Jackson Hole Airport.

jackson_hole_traces = sedona.sql("""
SELECT 
    desc, 
    ownOp,
    trace.aircraft.flight, 
    ST_DistanceSphere(ST_Point(-110.7376, 43.6088), geometry) AS distance
FROM traces
ORDER BY distance ASC
LIMIT 5
""")

%%time

jackson_hole_traces.show()
----------------------------
Wall time: 11.1 s
+--------------------+--------------------+--------+------------------+
|                desc|               ownOp|  flight|          distance|
+--------------------+--------------------+--------+------------------+
|BOMBARDIER BD-700...|         FLEXJET LLC|    null| 76.92424103100763|
|        AIRBUS A-320| UNITED AIRLINES INC|    null|161.98830220247405|
|CESSNA 700 CITATI...|AH CAPITAL MANAGE...|    null|180.05761405860687|
|CESSNA 750 CITATI...|           2FLIU LLC|    null|187.01797656145476|
|DASSAULT FALCON 2000|NEXTERA ENERGY EQ...|N58MW   |204.06240031528034|
+--------------------+--------------------+--------+------------------+

In this case our query takes 11 seconds against a 1.1 billion row dataset.

Historical Impact of Severe Weather Events On Flights

An important analysis related to flight routes is the impact of severe weather events. To enable this type of analysis Wherobots makes public a dataset of severe weather events as part of the Wherobots Open Data Catalog. The dataset has over 8 million weather events reported by NOAA in the US from 2016-2022.

weather = sedona.table("wherobots_pro_data.weather.weather_events")
weather.show()
+--------+-------------+--------+-------------------+-------------------+-----------------+----------+-----------+-----------+-----------+----+--------------+-----+-------+--------------------+
| EventId|         Type|Severity|     StartTime(UTC)|       EndTime(UTC)|Precipitation(in)|  TimeZone|AirportCode|LocationLat|LocationLng|City|        County|State|ZipCode|            geometry|
+--------+-------------+--------+-------------------+-------------------+-----------------+----------+-----------+-----------+-----------+----+--------------+-----+-------+--------------------+
|W-237263|         Cold|  Severe|2016-01-01 08:57:00|2016-01-04 15:32:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237264|         Rain|   Light|2016-01-04 22:49:00|2016-01-04 23:30:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237265|         Cold|  Severe|2016-01-05 00:35:00|2016-01-05 15:32:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237266|         Rain|   Light|2016-01-05 15:32:00|2016-01-05 17:57:00|              0.2|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237267|         Rain|Moderate|2016-01-05 17:57:00|2016-01-05 18:09:00|             0.14|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237268|         Rain|   Light|2016-01-05 18:09:00|2016-01-05 18:23:00|             0.07|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237269|         Rain|Moderate|2016-01-05 18:23:00|2016-01-05 18:57:00|             0.35|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237270|Precipitation|     UNK|2016-01-05 18:57:00|2016-01-05 19:41:00|             0.22|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237271|         Cold|  Severe|2016-01-06 00:57:00|2016-01-06 15:31:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237272|         Rain|   Light|2016-01-06 16:57:00|2016-01-06 18:33:00|             0.16|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237273|Precipitation|     UNK|2016-01-06 18:57:00|2016-01-06 19:10:00|             0.23|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237274|         Rain|Moderate|2016-01-06 19:10:00|2016-01-06 19:57:00|             0.16|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237275|         Rain|   Light|2016-01-06 20:57:00|2016-01-06 21:27:00|             0.03|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237276|         Rain|Moderate|2016-01-06 21:27:00|2016-01-06 22:57:00|             0.26|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237277|         Cold|  Severe|2016-01-07 00:57:00|2016-01-07 15:35:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237278|         Rain|   Light|2016-01-07 18:10:00|2016-01-07 18:57:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237279|         Rain|   Light|2016-01-07 23:57:00|2016-01-08 00:57:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237280|         Cold|  Severe|2016-01-08 00:57:00|2016-01-08 15:29:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237281|         Cold|  Severe|2016-01-09 00:57:00|2016-01-11 14:32:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237282|         Cold|  Severe|2016-01-12 00:57:00|2016-01-12 15:57:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
+--------+-------------+--------+-------------------+-------------------+-----------------+----------+-----------+-----------+-----------+----+--------------+-----+-------+--------------------+
only showing top 20 rows

Severe Weather Events

Let’s start our weather event analysis by finding areas that are commonly susceptible to severe weather events. We’ll again use H3 hexagons to aggregate the data, this time counting the number of severe weather events that occur within each hexagon. We’ll visualize the results as a choropleth to see areas where weather events frequently occur.

weather_h3_df = sedona.sql("""
SELECT
    COUNT(*) AS num,
    ST_H3ToGeom(ST_H3CellIDs(ST_Buffer(geometry, 0.1), 5, false)) AS geometry,
    ST_H3CellIDs(ST_Buffer(geometry, 0.1), 5, false) AS h3
FROM wherobots_pro_data.weather.weather_events
WHERE Severity = "Severe"
GROUP BY h3
""")

weather_h3_df.createOrReplaceTempView("weather_h3")
SedonaKepler.create_map(weather_h3_df.cache(), name="Severe weather events")

Aggregated weather events

Historical Severe Weather Aircraft Trace Analysis

Now that we’ve determined which areas are susceptible to frequent severe weather events, we can analyze flight trace observations in these areas to determine flights that may be susceptible to frequent severe weather events using a spatial join. Note that we’re joining a table with 1.1 billion observations against our weather event data, which takes just under 13 seconds in Wherobots Cloud.

severe_weather_flight_traces_df = sedona.sql("""
SELECT traces.geometry AS geometry, desc, ownOp, trace.aircraft.flight AS flight, num AS eventCount
FROM traces
JOIN weather_h3
WHERE ST_Contains(weather_h3.geometry, traces.geometry) AND trace.aircraft.flight IS NOT null
""")
severe_weather_flight_traces_df.createOrReplaceTempView("severe_weather_traces")

%%time
severe_weather_flight_traces_df.count()
----------------------------
Wall time: 12.7 s

53788395

Using the flight number, we can calculate an “intensity” metric that attempts to measure the potential for a flight to be disrupted by a severe weather event, based on the historical weather data. Here we calculate this intensity metric across all traces to find the most potentially impacted flights.

most_impacted_flights = sedona.sql("""
SELECT sum(eventCount) AS intensity, ST_Collect(collect_list(geometry)) AS geometry, flight
FROM severe_weather_traces
WHERE NOT flight = "        " AND NOT flight = "00000000"
GROUP BY flight
ORDER BY intensity DESC
LIMIT 100
""")

Most impacted flights

Flight Routes

So far we’ve been working with individual aircraft traces (point observations), but it’s much more useful to analyze flights as routes. We can construct individual flight routes from these aircraft traces for a specific airline by filtering on the ownOp and trace.aircraft.flight fields.

First we collect all aircraft traces that match our filter by airline, grouping by flight number. Then we can convert this collection of point geometries into a LineString geometry that represents the path the aircraft took during this flight using the ST_MakeLine Spatial SQL function.

jbu_flights_df = sedona.sql("""
WITH jbu_traces AS (
    SELECT 
        any_value(trace.aircraft.flight) AS flight, 
        collect_list(ST_Point(trace.lon, trace.lat)) AS points, 
        any_value(CAST(date AS string)) AS date
    FROM traces
    WHERE ownOp = 'JETBLUE AIRWAYS CORP' AND trace.aircraft.flight IS NOT null
    GROUP BY trace.aircraft.flight, date
)
SELECT flight, ST_MakeLine(points) AS geometry, date 
FROM jbu_traces
WHERE size(points) > 100
""")

jbu_flights_df.createOrReplaceTempView("jbu_flights")
jbu_flights_df.show()
+--------+--------------------+----------+
|  flight|            geometry|      date|
+--------+--------------------+----------+
|JBU100  |LINESTRING (-112....|2024-01-24|
|JBU101  |LINESTRING (-80.1...|2024-01-21|
|JBU1016 |LINESTRING (-77.3...|2024-01-20|
|JBU1044 |LINESTRING (-80.1...|2024-01-10|
|JBU1053 |LINESTRING (-73.7...|2024-01-03|
|JBU1055 |LINESTRING (-71.0...|2024-01-03|
|JBU1060 |LINESTRING (-77.4...|2024-01-12|
|JBU1060 |LINESTRING (-77.4...|2024-01-26|
|JBU1067 |LINESTRING (-73.7...|2024-01-17|
|JBU1068 |LINESTRING (-80.0...|2024-01-24|
|JBU1114 |LINESTRING (-112....|2024-01-17|
|JBU1131 |LINESTRING (-80.2...|2024-01-22|
|JBU1136 |LINESTRING (-83.3...|2024-01-15|
|JBU1142 |LINESTRING (-80.0...|2024-01-27|
|JBU1196 |LINESTRING (-80.1...|2024-01-03|
|JBU1196 |LINESTRING (-80.1...|2024-01-27|
|JBU1217 |LINESTRING (-81.6...|2024-01-15|
|JBU1230 |LINESTRING (-86.8...|2024-01-22|
|JBU1238 |LINESTRING (-97.6...|2024-01-28|
|JBU1251 |LINESTRING (-73.7...|2024-01-10|
+--------+--------------------+----------+
only showing top 20 rows

Flight routes

Identify Potential For Weather Impacted Flights

Now that we’ve extracted flight routes from the individual aircraft trace point observations we can begin to determine the potential for severe weather events to impact these flights based on our historical weather event data.

Creating Uniform Flight Route Segments

To analyze which flights have the most potential to be impacted based on historical weather events, we first create 100km segments of each flight route. This allows us to identify not only entire flight routes that are potentially impacted by weather events, but also the specific flight segments with the most risk of disruption.

To do this we’ll make use of the ability to create a user defined function in WherobotsDB using the Shapely library. We’ll use Shapely’s line interpolate functionality to create equal length segments of each flight route.

from shapely.geometry import LineString, MultiLineString
from sedona.sql.types import GeometryType

# Define a Shapely UDF to split each flight route into consistent-length line segments
def interpolate_lines(line: GeometryType()):
    try:
        distance_delta = 1.0  # ~100km in degrees
        points_count = int(line.length // distance_delta) + 1
        distances = (distance_delta * i for i in range(points_count))
        # Interpolate evenly spaced points along the route, keeping the final endpoint
        points = [line.interpolate(distance) for distance in distances] + [line.boundary[1]]
        line_strings = []
        for i in range(len(points) - 1):
            line_strings.append(LineString([points[i], points[i + 1]]))
        return MultiLineString(line_strings)
    except Exception:
        return None

sedona.udf.register("interpolate_lines", interpolate_lines, GeometryType())

We can now call our user defined function in a spatial SQL statement and leverage WherobotsDB’s parallel execution to run this UDF.

split_df = sedona.sql("""
SELECT flight, date, explode(ST_Dump(interpolate_lines(geometry))) AS geometry
FROM jbu_flights
""")

split_df.createOrReplaceTempView("split_flights")
split_df.show()
+--------+----------+--------------------+
|  flight|      date|            geometry|
+--------+----------+--------------------+
|JBU100  |2024-01-24|LINESTRING (-118....|
|JBU100  |2024-01-24|LINESTRING (-117....|
|JBU100  |2024-01-24|LINESTRING (-116....|
|JBU100  |2024-01-24|LINESTRING (-115....|
|JBU100  |2024-01-24|LINESTRING (-114....|
|JBU100  |2024-01-24|LINESTRING (-113....|
|JBU100  |2024-01-24|LINESTRING (-112....|
|JBU100  |2024-01-24|LINESTRING (-111....|
|JBU100  |2024-01-24|LINESTRING (-110....|
|JBU100  |2024-01-24|LINESTRING (-109....|
|JBU100  |2024-01-24|LINESTRING (-108....|
|JBU100  |2024-01-24|LINESTRING (-107....|
|JBU100  |2024-01-24|LINESTRING (-106....|
|JBU100  |2024-01-24|LINESTRING (-105....|
|JBU100  |2024-01-24|LINESTRING (-104....|
|JBU100  |2024-01-24|LINESTRING (-103....|
|JBU100  |2024-01-24|LINESTRING (-102....|
|JBU100  |2024-01-24|LINESTRING (-102....|
|JBU100  |2024-01-24|LINESTRING (-101....|
|JBU100  |2024-01-24|LINESTRING (-100....|
+--------+----------+--------------------+
only showing top 20 rows

Joining Flight Route Segments With Weather Events

Now we perform a spatial join to identify flight segments that pass through areas that historically have a high frequency of severe weather events.

severe_weather_split_df = sedona.sql("""
SELECT flight, date, split_flights.geometry AS geometry, num 
FROM split_flights
JOIN weather_h3
WHERE ST_Intersects(weather_h3.geometry, split_flights.geometry)
""")

severe_weather_split_df.show()

This gives us a measure of which individual flight segments have the highest potential for weather impact.

+--------+----------+--------------------+----+
|  flight|      date|            geometry| num|
+--------+----------+--------------------+----+
|JBU2288 |2024-01-02|LINESTRING (-69.3...|1221|
|JBU2288 |2024-01-02|LINESTRING (-70.3...|1774|
|JBU2288 |2024-01-02|LINESTRING (-71.2...|  42|
|JBU2288 |2024-01-02|LINESTRING (-71.2...|1192|
|JBU2288 |2024-01-02|LINESTRING (-71.2...|1231|
|JBU2288 |2024-01-02|LINESTRING (-72.2...| 672|
|JBU2288 |2024-01-02|LINESTRING (-72.2...| 959|
|JBU2288 |2024-01-02|LINESTRING (-73.2...| 652|
|JBU2288 |2024-01-02|LINESTRING (-73.2...| 672|
|JBU1053 |2024-01-03|LINESTRING (-73.7...| 416|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 217|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 467|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 487|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 611|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 885|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 722|
|JBU2424 |2024-01-07|LINESTRING (-81.3...|2237|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 722|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 618|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 607|
+--------+----------+--------------------+----+
only showing top 20 rows

Impacted flight segments

Flights With The Highest Potential For Weather Impact

Now we can join the flight segments and their corresponding weather intensity impact scores with the full flight routes to visualize the flights with the highest potential for weather impact based on the historical weather events.

full_flight_intensity_df = sedona.sql("""
SELECT jbu_flights.flight, jbu_flights.geometry, jbu_flights.date, intensity
FROM jbu_flights
JOIN jbu_intensity ON jbu_flights.flight = jbu_intensity.flight AND jbu_flights.date = jbu_intensity.date
""")

map_view = SedonaKepler.create_map(full_flight_intensity_df.orderBy("intensity", ascending=False).limit(10), name="Flights")
SedonaKepler.add_df(map_view, df=weather_h3_df, name="Weather")
map_view

Impacted flight routes

Doppler Weather Raster Data

So far we’ve been working with historical weather event data; let’s now look at a simple example using near real-time weather radar data. Specifically, we’ll use NOAA’s NEXRAD Doppler weather radar surface reflectivity product to assess the impact of high precipitation at airports.

We’ll load a GeoTiff raster image of the continental US to identify weather patterns, then join this raster data with the location of US airports.

First, we load the raster image from S3 using the RS_FromPath function which will create an out-db raster in WherobotsDB.

nexrad_raster_url = "s3://wherobots-examples/data/noaa/nexrad/n0q_202309231625.tif"
nexrad_df = sedona.sql(f"SELECT RS_FromPath('{nexrad_raster_url}') AS raster")
nexrad_df.createOrReplaceTempView("nexrad")
nexrad_df.printSchema()
root
 |-- raster: raster (nullable = true)

Inspecting the metadata for this raster image we can see the spatial extent, spatial resolution, the CRS, and the number of bands.

sedona.sql("SELECT RS_MetaData(raster) from nexrad").show(truncate=False)
+---------------------------------------------------------------------------+
|rs_metadata(raster)                                                        |
+---------------------------------------------------------------------------+
|[-126.0025, 50.0025, 12200.0, 5400.0, 0.005, -0.005, 0.0, 0.0, 4326.0, 1.0]|
+---------------------------------------------------------------------------+

We can visualize the raster using RS_AsImage.

htmlDf = sedona.sql("SELECT RS_AsImage(raster, 1200) AS raster_image FROM nexrad")
SedonaUtils.display_image(htmlDf)

Doppler weather radar

Next, we’ll load the point geometries of airports from a Shapefile.

spatialRdd = ShapefileReader.readToGeometryRDD(sedona.sparkContext, "s3a://wherobots-examples/data/ne_50m_airports")
poi_df = Adapter.toDf(spatialRdd, sedona)

poi_df.createOrReplaceTempView("airports")
poi_df.show(5)
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|            geometry|scalerank|featurecla| type|            name|abbrev|location|gps_code|iata_code|           wikipedia|natlscale|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|POINT (113.935016...|        2|   Airport|major| Hong Kong Int'l|   HKG|terminal|    VHHH|      HKG|http://en.wikiped...|  150.000|
|POINT (121.231370...|        2|   Airport|major|         Taoyuan|   TPE|terminal|    RCTP|      TPE|http://en.wikiped...|  150.000|
|POINT (4.76437693...|        2|   Airport|major|        Schiphol|   AMS|terminal|    EHAM|      AMS|http://en.wikiped...|  150.000|
|POINT (103.986413...|        2|   Airport|major|Singapore Changi|   SIN|terminal|    WSSS|      SIN|http://en.wikiped...|  150.000|
|POINT (-0.4531566...|        2|   Airport|major| London Heathrow|   LHR| parking|    EGLL|      LHR|http://en.wikiped...|  150.000|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
only showing top 5 rows

And now we’re ready to join the raster radar data to our vector airport locations. This operation retrieves the pixel value (in this case a precipitation index value) from the raster at the location of each airport. We’ll filter for any airports where this index value is above 100, indicating significant precipitation.


airport_precip_df = sedona.sql("""
SELECT name, iata_code, type, RS_Value(raster, geometry, 1) AS precip
FROM airports
JOIN nexrad 
""")

airport_precip_df.where("precip > 100").show(truncate=False)
+-------------------+---------+-----+------+
|name               |iata_code|type |precip|
+-------------------+---------+-----+------+
|Gen E L Logan Int'l|BOS      |major|108.0 |
|Durham Int'l       |RDU      |major|138.0 |
+-------------------+---------+-----+------+

Based on this analysis we can see that Logan International Airport and Raleigh-Durham Airport are experiencing high precipitation events, which may potentially impact ground operations.

Resources

This blog post looked at combining aircraft trace observations from public ADS-B sensors with weather event and radar data to analyze the potential impact of severe weather events on flights.

You can follow along with these examples by creating a free account on Wherobots Cloud and find the code for this example on GitHub.


Learn more about this use case and Wherobots Cloud in a live demo, Flight Analysis With Wherobots Cloud, on June 20th at 8am Pacific time.

Register now.

Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

The latest release of Overture Maps data is now published in GeoParquet format, allowing for more efficient spatial operations when selecting subsets of the dataset and enabling interoperability with a growing ecosystem of data tooling that supports the cloud-native GeoParquet format. In this post we explore some of the motivations and advantages of publishing the Overture Maps data as GeoParquet as well as the process used by Overture to generate and publish a large scale geospatial dataset as GeoParquet using Apache Sedona.

Understanding GeoParquet

Built upon Apache Parquet, GeoParquet is an open-source file format specifically designed for storing geospatial data efficiently. Apache Parquet is a column-oriented data file format optimized for storing data efficiently in cloud object stores like Amazon’s S3 service. However, Parquet lacks native support for geospatial data. GeoParquet extends Parquet by defining a standard for storing geometry data and associated metadata, such as the spatial bounding box and coordinate reference system for each file.

In a nutshell, GeoParquet adds the following information to the file metadata:

  • Version: the GeoParquet specification version of this file
  • Primary column: the main Geometry Type column to be considered when there are multiple Geometry Type columns stored in this file.
  • Column metadata per Geometry Type column:

    • Encoding of the geometry data. Currently only the WKB format is supported.
    • Geometry types: all geometry types that occur in this column, such as POINT, POLYGON, LINESTRING
    • Coordinate reference system information, bounding box, geometry orientation, and so on

In GeoParquet, all geometric data is required to be housed in Parquet’s Binary Type. This data must adhere to the encoding specifications detailed in the column metadata, which currently employs Well-Known Binary (WKB) encoding.

The storage of geometries in GeoParquet using Parquet’s intrinsic binary format ensures that standard Parquet readers can seamlessly access GeoParquet files. This feature significantly enhances GeoParquet’s compatibility with existing data systems. Nevertheless, a dedicated GeoParquet reader, designed to interpret GeoParquet metadata, can utilize this information for further optimizations and to execute more advanced geospatial operations.
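To make this concrete, here is a small sketch of inspecting a GeoParquet file’s metadata with plain pyarrow. The file name is an illustrative assumption; the keys shown follow the GeoParquet specification described above.

import json
import pyarrow.parquet as pq

# Read the Parquet footer and decode the GeoParquet "geo" metadata key.
# "places.parquet" is an illustrative local file name.
file_metadata = pq.read_metadata("places.parquet").metadata
geo = json.loads(file_metadata[b"geo"])

print(geo["version"], geo["primary_column"])
geometry_meta = geo["columns"][geo["primary_column"]]
print(geometry_meta["encoding"])        # "WKB"
print(geometry_meta["geometry_types"])  # e.g. ["Point", "Polygon"]
print(geometry_meta.get("bbox"))        # [minx, miny, maxx, maxy], if present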

Benefits of GeoParquet

GeoParquet integrates all the advantages of Parquet, such as its compact data storage and rapid data retrieval capabilities. Additionally, GeoParquet introduces a standardized approach for storing and accessing geospatial data. This standardization significantly simplifies the process of data sharing and enhances interoperability across various systems and tools within the geospatial data processing ecosystem.

Crucially, GeoParquet enables specialized GeoParquet readers to optimize geospatial queries, significantly enhancing the efficiency of data retrieval. This optimization is particularly vital for applications demanding real-time data processing and analysis.

A significant contributor to this enhanced efficiency is the bounding box information (abbreviated as BBox) embedded in the GeoParquet file metadata. The BBox effectively serves as a spatial index, streamlining the process of data filtering. This is particularly useful when executing geospatially driven queries. When a spatial query is executed — for instance, searching for all points of interest within a certain area — the GeoParquet reader first checks the BBox information. If the query’s geographic area doesn’t intersect with the BBox of the data, the reader can immediately exclude that file from further consideration. This step vastly reduces the amount of data that needs to be retrieved and processed.
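As a minimal sketch of this pruning logic (with hypothetical file paths and bounding boxes), a reader only needs an envelope intersection test per file before deciding whether to scan it:

from shapely.geometry import box

# Hypothetical per-file bounding boxes, as a reader would collect them from
# each GeoParquet file's metadata: (minx, miny, maxx, maxy).
file_bboxes = {
    "part-000.parquet": (-74.3, 40.4, -73.6, 41.0),   # around New York City
    "part-001.parquet": (2.0, 48.0, 3.0, 49.0),       # around Paris
}

# Query area: a rough New York City envelope.
query_area = box(-74.25909, 40.477399, -73.700181, 40.917577)

# Only files whose BBox intersects the query area need to be scanned.
files_to_scan = [
    path
    for path, (minx, miny, maxx, maxy) in file_bboxes.items()
    if box(minx, miny, maxx, maxy).intersects(query_area)
]
print(files_to_scan)  # ['part-000.parquet']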

As a testament to these benefits, with the help of Apache Sedona, Wherobots engineers created a GeoParquet version of the Overture 2023-07-26-alpha.0 data release, which was not originally in GeoParquet format. This transformation led to a remarkable improvement in query performance. When conducting a spatial range query on the Building dataset using Apache Sedona, the process on the GeoParquet version took approximately 3 minutes, a significant reduction from the 1 hour and 30 minutes required for the same query on the original dataset. Further details and insights into this transformation are available in our blog post titled Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet.

The Role Of Apache Sedona

The GIS software landscape offers a variety of tools capable of handling GeoParquet data, such as GeoPandas and QGIS. However, Apache Sedona stands apart in this field. As a renowned open-source cluster computing system, it is specifically tailored for processing large-scale spatial data. Apache Sedona offers a rich suite of spatial operations and analytics functions, all accessible through an intuitive Spatial SQL programming language. This unique blend of features establishes Apache Sedona as a highly effective system for geospatial data analysis.

With the adoption of GeoParquet, Apache Sedona has expanded its capabilities, now able to create GeoParquet files. This integration not only enhances Apache Sedona’s data storage efficiency but also opens up broader prospects for advanced data analysis. This positions Apache Sedona as a distinct and powerful tool in the realm of geospatial data processing, differentiating it from other GIS tools by its ability to handle complex, large-scale spatial datasets efficiently. Further details can be found on Sedona’s website: https://sedona.apache.org/

Generating GeoParquet Files With Apache Sedona

Leveraging Sedona’s programming guides, OMF crafts GeoParquet datasets by incorporating a geometry column sourced from its WKB geospatial data. The process involves utilizing Sedona functions, as illustrated in the Python code snippet below:

myDataFrame.withColumn("geometry", expr("ST_*")).selectExpr("ST_*")
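As a more concrete, hedged illustration of that pattern, the sketch below parses a WKB column into a geometry column and writes the result as GeoParquet. The input path and the wkb column name are illustrative assumptions, not OMF’s actual schema or pipeline.

from sedona.spark import *
from pyspark.sql.functions import expr

# Sketch only: paths and the `wkb` column name are assumptions for illustration.
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

raw_df = sedona.read.parquet("s3a://my-bucket/omf/raw/")
geo_df = raw_df.withColumn("geometry", expr("ST_GeomFromWKB(wkb)"))

geo_df.write.format("geoparquet").save("s3a://my-bucket/omf/geoparquet/")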

In its commitment to improving the spatial query experience for its customers, OMF has implemented a strategic pre-defined indexing method. This method organizes data effectively based on theme and type combinations. To further refine spatial filter pushdown performance, OMF capitalizes on Sedona’s ST_GeoHash function to generate dual GeoHash IDs for each geometry in the dataset.

  1. Coarse-Level GeoHashing (Precision Level 3): The first GeoHash, with a precision level of 3, creates cells approximately 156.5km x 156km in size. This level is ideal for grouping data based on geographic proximity. Using this GeoHash key, a large Sedona DataFrame is partitioned into numerous smaller GeoParquet files. Each file correlates to a unique 3-character GeoHash key, facilitating efficient data pruning at the Parquet file level. Importantly, this approach enables users to manually download individual GeoParquet files by referencing the GeoHash key in the file path, bypassing the need for a Parquet file reader.
  2. Fine-Level GeoHashing (Precision Level 8): Concurrently, OMF utilizes a finer GeoHash at level 8, resulting in smaller cells around 38.2m x 19m and then sorts data using this GeoHash within each GeoParquet file. This precision is used to sort data within each GeoParquet file. By doing so, OMF ensures that the data within each Parquet row-group or data page is organized based on spatial proximity. This organization is crucial for enabling efficient data pruning, particularly on the bbox column.

The bbox column, stored in a native Parquet Struct Type, consists of four Double Type sub-columns: minx, maxx, miny, and maxy. It allows even standard Parquet readers to conduct pruning operations effectively. This capability is particularly beneficial as it complements the BBox information found in the GeoParquet file metadata. While the BBox metadata provides an overall spatial reference for the file, the bbox column offers a more granular level of spatial data, enabling granular data pruning at the Parquet row-group level.
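As a rough sketch of this layout (not OMF’s exact pipeline), the snippet below derives the two GeoHash keys with Sedona’s ST_GeoHash function, adds a bbox struct column, partitions the output by the coarse key, and sorts rows within each partition by the fine key. The geo_df DataFrame and output path are illustrative assumptions carried over from the sketch above.

from pyspark.sql.functions import expr

# Sketch of the dual-GeoHash + bbox layout; `geo_df` and the output path are
# illustrative assumptions.
laid_out = (
    geo_df
    .withColumn("geohash_3", expr("ST_GeoHash(geometry, 3)"))  # coarse key: one file group per key
    .withColumn("geohash_8", expr("ST_GeoHash(geometry, 8)"))  # fine key: row ordering inside files
    .withColumn("bbox", expr(
        "struct(ST_XMin(geometry) AS minx, ST_XMax(geometry) AS maxx, "
        "ST_YMin(geometry) AS miny, ST_YMax(geometry) AS maxy)"))
    .repartition("geohash_3")
    .sortWithinPartitions("geohash_8")
)

laid_out.write.partitionBy("geohash_3").format("geoparquet").save("s3a://my-bucket/omf/indexed/")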

This dual GeoHash approach serves three core purposes:

  • It allows users to download individual GeoParquet files from OMF without requiring a GeoParquet or Parquet reader, simply by checking the GeoHash key.
  • GeoParquet readers can efficiently filter files using the BBox information in each file’s metadata, enhancing query speed.
  • The strategy also enhances the functionality of standard Parquet readers which can effectively filter files using the statistical information of the bbox column, found in each file’s column and row-group metadata, without needing to recognize the file as a GeoParquet file.

Analyze OMF GeoParquet Files with Sedona on Wherobots Cloud

Let’s see some of the benefits of querying a large-scale GeoParquet dataset in action by analyzing the Overture Places theme using Apache Sedona. The easiest way to get started with Apache Sedona is by using the official Apache Sedona Docker image, or by creating a free hosted notebook in Wherobots Cloud.

The following command will start a local Docker container running Apache Sedona and expose a Jupyter Lab environment on localhost port 8888:

docker run -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 \
apache/sedona:latest

This Jupyter Lab environment can now be accessed via a web browser at http://localhost:8888

Next, the following Python code will load the latest Overture Maps GeoParquet release (at the time of writing) in Sedona:

from sedona.spark import *

config = SedonaContext.builder(). \
    config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet"). \
load("s3a://overturemaps-us-west-2/release/2024-01-17-alpha.0/theme=places/type=place")

We can now apply a spatial predicate to filter for points of interest within the bounds of New York City, using the following Spatial SQL functions:

  • ST_PolygonFromEnvelope – create a polygon geometry that will serve as the area roughly representing the boundaries of New York City from a specified bounding box
  • ST_Within – filter rows for only points of interest that fall within the boundary of that polygon geometry

spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"
df = df.where(spatial_filter)
df.createOrReplaceTempView("places")

When executing this spatial filter, Sedona can leverage the BBox GeoParquet metadata in each partitioned GeoParquet file to exclude any files that do not contain points of interest within the polygon defined in the spatial predicate. This results in less data being examined and scanned by Sedona and less data that needs to be retrieved over the network improving query performance. Also, note that the geometry column is interpreted as a geometry type without the need for explicit type casting thanks to the WKB serialization as part of the GeoParquet specification.

To further demonstrate the functionality of Spatial SQL with Apache Sedona, let’s query for all stadiums within the area of New York City, then find all points of interest within a short walking distance of each stadium using the following Spatial SQL functions:

  • ST_Buffer – create a new geometry that represents a buffer around each stadium that represents a short walking distance from the stadium
  • ST_Intersects – filter for points of interest that lie within the buffer geometry, identifying points of interest within walking distance of each stadium

stadium_places = sedona.sql("""
WITH stadiums AS (SELECT * FROM places WHERE categories.main = "stadium_arena")
SELECT * FROM places, stadiums
WHERE ST_Intersects(places.geometry, ST_Buffer(stadiums.geometry, 0.002))
""")
Querying Overture Maps GeoParquet data using Apache Sedona
Visualizing points of interest within walking distance from each stadium.

You can see more detailed examples of analyzing the Overture Places dataset using Apache Sedona in this blog post.

Conclusion

The Overture Maps Foundation’s adoption of GeoParquet, in collaboration with Apache Sedona, marks a significant milestone in geospatial data management. The combination of GeoParquet’s efficient storage format and Apache Sedona’s spatial analytics capabilities brings unprecedented performance and scalability to geospatial dataset publishing and analysis. This integration opens up new avenues for researchers, developers, and geospatial enthusiasts, empowering them to explore and derive insights from vast geospatial datasets more efficiently than ever before.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


Raster Data Analysis, Processing Petabytes of Agronomic Data, Overview of Sedona 1.5, and Unlocking The Spatial Frontier – This Month In Wherobots

Welcome to This Month In Wherobots, the monthly newsletter for data practitioners in the Apache Sedona and Wherobots community. This month we’re exploring raster data analysis with Spatial SQL, processing petabytes of agronomic data with Apache Sedona, a deep dive on new features added in the 1.5 release series, and an overview of working with files in Wherobots Cloud.

Raster Data Analysis With Spatial SQL & Apache Sedona

One of the strengths of Apache Sedona and Wherobots Cloud is the ability to work with large scale vector and raster geospatial data together using Spatial SQL. This post (and video) takes a look at how to get started working with raster data in Sedona using Spatial SQL and some of the use cases for raster data analysis including vector / raster join operations, zonal statistics, and using raster map algebra.

Read The Article: Raster Data Analysis With Spatial SQL & Apache Sedona

Featured Community Member: Luiz Santana

This month’s Wherobots & Apache Sedona featured community member is Luiz Santana. Luiz is Co-Founder and CTO of Leaf Agriculture and has extensive experience as a data architect and developer. He holds a PhD in Computer Science from Universidade Federal de Santa Catarina, where he researched data processing and integration in highly scalable environments. Leaf Agriculture is building the unified API for food and agriculture by leveraging large-scale agronomic data. Luiz has given several conference presentations, such as Apache Sedona: How To Process Petabytes of Agronomic Data With Spark and Perspectives on the Use of Data in Agriculture, which cover how Leaf uses Apache Sedona to analyze large-scale agricultural data and how Sedona fits into their stack alongside other technologies. Thank you, Luiz, for your work with the Apache Sedona and Wherobots community and for sharing your knowledge and experience!

Apache Sedona: How To Process Petabytes of Agronomic Data With Spark

In this presentation from The Developer’s Conference, Luiz Santana shares his experience using Apache Sedona at Leaf Agriculture to process petabytes of agronomic data from satellites, agricultural machines, drones, and other sensors. He discusses how Leaf uses Sedona for tasks such as geometry intersections, geographic searches, and polygon transformations with high performance and speed. Luiz also presented Perspectives On The Use of Data in Agriculture, which covers some of the data challenges that Leaf handles and an overview of the technologies used to address them, including Apache Sedona.

See The Slides From The Presentation

Working With Files – Getting Started With Wherobots Cloud

https://wherobots.com/wp-content/uploads/2024/02/Screenshot_2024-02-16_at_9.44.20_AM-1024x842.png

This post takes a look at loading and working with our own data in Wherobots Cloud as well as creating and saving data as the result of our analysis, such as the end result of a data pipeline. It covers importing files in various formats including CSV, GeoJSON, Shapefile, and GeoTIFF in Wherobots Cloud, working with AWS S3 cloud object storage, and creating GeoParquet files using Apache Sedona.

Read The Post: Working With Files – Getting Started With Wherobots Cloud

Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both raster and vector data

https://wherobots.com/wp-content/uploads/2024/02/Screenshot-2024-02-08-at-2.13.52-PM.png

The 1.5 series of Apache Sedona represents a leap forward in geospatial processing that adds essential features and enhancements to make Sedona a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data analysis. This post covers XYZM coordinates and SRID, vector and raster joins, raster data manipulation, visualization with SedonaKepler and SedonaPyDeck, GeoParquet reading and writing, H3 hexagons, and new cluster compute engine support.

Read The Post: Introducing Sedona 1.5 – Making Sedona The Most Comprehensive & Scalable Spatial Data Processing and ETL Engine For Both Raster and Vector Data

Unlocking the Spatial Frontier: The Evolution and Potential of spatial technology in Apple Vision Pro and Augmented Reality Apps

https://wherobots.com/wp-content/uploads/2024/02/VisionPro.001.png

Apple adopted the term “spatial computing” when announcing the Apple Vision Pro to describe its new augmented reality platform. This post from Wherobots CEO Mo Sarwat examines spatial computing in the context of augmented reality experiences to explore spatial object localization and presentation and the role of spatial query processing and spatial data analytics in Apple Vision Pro.

Read The Post: Unlocking The Spatial Frontier

Raster Data Analysis With Spatial SQL And Apache Sedona

One of the strengths of Apache Sedona and Wherobots Cloud is the ability to work with large scale vector and raster geospatial data together using Spatial SQL. In this post we’ll take a look at how to get started working with raster data in Sedona using Spatial SQL and some of the use cases for raster data analysis including vector / raster join operations, zonal statistics, and using raster map algebra.

Raster Data Overview

Raster data refers to gridded data where each pixel has one or more values associated with it (called bands) and each pixel references a geographic location. Common raster data include aerial imagery, elevation models, precipitation maps, and population distribution datasets.

This image from the National Ecological Observatory Network (NEON) does a great job of introducing raster data concepts using an example of aerial imagery. We can see the value of each pixel, the coordinates of each pixel in pixel space, and how pixel values map to the color rendered when the raster is visualized. We also see the spatial resolution of the data as measured by the size of each pixel, in this case 1 meter. This means that each pixel represents 1 square meter of the Earth’s surface.

https://wherobots.com/wp-content/uploads/2024/02/raster_concept.png

When working with raster data in Sedona we store rasters and their associated metadata as rows in a table and query them using Spatial SQL. Additionally, large rasters can be tiled so that each row in the table represents a single tile of a larger raster image.

https://wherobots.com/wp-content/uploads/2024/02/earth_observation_table.png

One of the tools available to us when analyzing raster data is map algebra: performing calculations using the individual pixel values. We can perform map algebra operations across multiple bands in a single raster (useful when computing indexes like NDVI) and also map algebra operations across different rasters (useful for time series analysis).

https://wherobots.com/wp-content/uploads/2024/02/map_algebra.png

There are several map algebra operations available using Spatial SQL, but in the image above we can see the RS_MapAlgebra function being used to compute the difference in average surface temperature between two years.

Let’s see some examples of how to get started working with raster data in Sedona. If you’d like to follow along you can create a free Wherobots Cloud account and get started today.

Load A Single Raster

Let’s start with a simple example of loading a single GeoTiff from the NEON high resolution imagery dataset. We’ll then perform some basic map algebra operations on the image and introduce some new Spatial SQL functions along the way.

Sedona supports a number of raster data formats, including Arc/Info ASCII Grid, NetCDF, and GeoTiff. See the raster loaders documentation page for more. Rasters can be loaded as "out-of-database" (out-db) rasters from a remote file path or as "in-database" (in-db) rasters. Out-db rasters allow for managing large raster datasets stored on cloud storage and enable efficient handling of remote data, while the in-db format is useful when the raster data needs to be managed within the database because of data integrity or access efficiency requirements. In most cases you’ll want to use the out-db raster format.

We can use the RS_FromPath Spatial SQL function to create an out-db raster from a remote file path. I’ve loaded the example GeoTiff we’ll be using into a public S3 bucket, so when instantiating our SedonaContext we’ll need to configure anonymous access to this bucket.

config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

Next, we load the GeoTiff and create an out-db raster using RS_FromPath, specifying the S3 URI for the single GeoTiff.

file_url = "s3://wherobots-examples/data/examples/NEON_ortho.tif"
neon_df = sedona.sql(f"SELECT RS_FromPath('{file_url}') AS raster")
neon_df.createOrReplaceTempView("neon")
neon_df.printSchema()

We can see that our DataFrame has a single column, raster, which is of type raster.

root
 |-- raster: raster (nullable = true)

Once we’ve loaded a raster we can inspect the metadata using the RS_MetaData Spatial SQL function.

sedona.sql("SELECT RS_MetaData(raster) from neon").show(truncate=False)

This function returns the upper left coordinates of the raster (in the units of the raster’s coordinate system), the width and height of the raster (in number of pixels), the spatial resolution of each pixel (in units of the raster’s CRS), the skew or rotation of the raster (if any), the SRID of the raster’s coordinate system, and the number of bands in the raster.

+------------------------------------------------------------------------+
|rs_metadata(raster)                                                     |
+------------------------------------------------------------------------+
|[470000.0, 7228000.0, 1000.0, 1000.0, 1.0, -1.0, 0.0, 0.0, 32606.0, 3.0]|
+------------------------------------------------------------------------+

Based on this output we can observe a few important details about the raster:

  • the upper left coordinate is 470000, 7228000 in the raster’s CRS
  • the raster is 1000 pixels by 1000 pixels
  • each pixel represents 1 square meter, so the raster image covers 1 square kilometer (1000 pixels × 1 meter = 1 kilometer per side)
  • the SRID is 32606, which is UTM zone 6N
  • the raster has 3 bands, representing the red, green, and blue color spectrums

There are a number of other Spatial SQL functions that can be used to inspect similar metadata values, such as RS_NumBands

sedona.sql("SELECT RS_NumBands(raster) FROM neon").show()
+-------------------+
|rs_numbands(raster)|
+-------------------+
|                  3|
+-------------------+

We can visualize the raster using the RS_AsImage function.

htmlDf = sedona.sql("SELECT RS_AsImage(raster, 500) FROM neon")
SedonaUtils.display_image(htmlDf)

Visualizing a single GeoTiff aerial image

To obtain pixel values at specific coordinates we can use the RS_Value or RS_Values function. Note that for rasters with multiple bands we must specify which band we are interested in. We’ll also need to specify the SRID of our geometry (in this case a Point) to match that of the raster.

Here we retrieve the value of the upper left pixel for band 1, which represents the red color spectrum.

sedona.sql("""
SELECT RS_Value(raster, ST_SetSRID(ST_Point(470000, 7227880), 32606), 1) 
AS pixel_value
FROM neon
""").show(truncate=False)
+-----------+
|pixel_value|
+-----------+
|127.0      |
+-----------+

Map Algebra Operations

Performing raster math is one of the most common and powerful workflows when analyzing raster data. Sedona supports a number of band-specific raster math operations such as:

  • RS_Add, RS_Divide, RS_Mean, RS_NormalizedDifference, etc
  • RS_MapAlgebra

RS_MapAlgebra is used to apply a map algebra script on one or more rasters, for example to calculate NDVI (to identify vegetation) or AWEI (to identify bodies of water). Let’s calculate the Normalized Difference Greenness Index (NDGI) for our raster. NDGI is a metric similar to NDVI; however, it can be computed using only the visible spectrum (RGB), whereas NDVI also requires a near-infrared band.

For each pixel, NDGI is calculated by dividing the difference of the green and red band values by their sum, resulting in a value between -1 and 1. Values greater than zero are indicative of vegetation, while values less than zero indicate its absence.

NDGI formula
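
Written out, the formula described above is NDGI = (G - R) / (G + R), where G and R are the green and red band values for a pixel. For comparison, NDVI = (NIR - R) / (NIR + R), where NIR is the near-infrared band value.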

To calculate NDGI we will use the RS_MapAlgebra function which uses a map algebra scripting language called Jiffle to express the computation to be performed. Note that the result of this operation is a new raster with a single band that represents the NDGI value for each pixel.

ndgi_df = sedona.sql("""
SELECT 
  RS_MapAlgebra(raster, 'D', 'out = ((rast[1] - rast[0]) / (rast[1] + rast[0]));') 
AS raster
FROM neon
""")
ndgi_df.createOrReplaceTempView("ndgi")

We can confirm the NDGI values are in the expected range by using the RS_SummaryStats function which will return the count, sum, mean, standard deviation, minimum, and maximum values of a single band in the raster.

sedona.sql("SELECT RS_SummaryStats(raster) AS stats FROM ndgi").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------+
|stats                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------+
|[1000000.0, -10863.97160595989, -0.010863971605958938, 0.045695423898689726, -0.1493212669683258, 0.18421052631578946]|
+----------------------------------------------------------------------------------------------------------------------+

We can evaluate single pixel values using the RS_Value function. First, selecting a pixel from the upper left area of the image that visually appears to be dirt or gravel, we can confirm the value is less than zero.

sedona.sql("""
SELECT RS_Value(raster, ST_SetSRID(ST_Point(470000, 7227880), 32606), 1) 
AS pixel_value
FROM ndgi
""").show(truncate=False)
+--------------------+
|pixel_value         |
+--------------------+
|-0.09956709956709957|
+--------------------+

Next, choosing a pixel from the lower left area of the image that appears to be tree cover, we can confirm the value is greater than zero.

sedona.sql("""
SELECT RS_Value(raster, ST_SetSRID(ST_Point(470700, 7228000), 32606), 1)
AS pixel_value
FROM ndgi
""").show(truncate=False)
+-------------------+
|pixel_value        |
+-------------------+
|0.03636363636363636|
+-------------------+

We can also export our new NDGI raster as a GeoTiff for further analysis, either directly in the Wherobots Notebook environment or to S3 using the RS_AsGeoTiff function. Here we save the raster as a GeoTiff in the notebook environment.

binary_raster_df = sedona.sql("SELECT RS_AsGeoTiff(raster) AS raster FROM ndgi")
binary_raster_df.write.format("raster").mode("overwrite").save("ndgi.tif")
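
Saving to S3 works the same way; only the save path changes (the bucket below is a hypothetical placeholder your environment would need write access to):

binary_raster_df.write.format("raster").mode("overwrite"). \
    save("s3a://my-bucket/rasters/ndgi")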

Load Multiple Rasters

Previously we loaded a single raster into a single row. In a typical workflow we work with many rasters across many rows. Let’s load a raster dataset from the WorldClim historical climate dataset of precipitation from 1970-2000. Each raster file represents the average precipitation across all observed years for one month of the year.

Instead of the RS_FromPath function we used previously, this time we’ll use the binary file reader to load each raster file and then use RS_FromGeoTiff to convert each row to a raster. We do this so we also have access to the file path, which we’ll use to determine the month for each raster since it is encoded in the filename.

PREC_URL = "s3://wherobots-examples/data/examples/world_clim/wc2.1_10m_prec" #/wc2.1_10m_prec_01.tif
rawDf = sedona.read.format("binaryFile").load(PREC_URL + "/*.tif")
rawDf.createOrReplaceTempView("rawdf")

We’ll use a regular expression to match on the file name and extract the month for each raster, which is the last two digits of the filename.

rasterDf = sedona.sql("""
SELECT 
  RS_FromGeoTiff(content) AS raster, 
  Int(regexp_extract(path, '(.*)([0-9]{2}).tif', 2)) AS month
FROM rawdf
""")

rasterDf.createOrReplaceTempView("prec")
rasterDf.printSchema()
root
 |-- raster: raster (nullable = true)
 |-- month: integer (nullable = true)

To determine the average monthly precipitation for a certain location we can use the RS_Value function.

sedona.sql("""
SELECT 
  month, 
  RS_Value(raster, ST_POINT(-113.9940, 46.8721)) AS avg_prec 
  FROM prec ORDER BY month ASC
""").show(truncate=False)
+-----+--------+
|month|avg_prec|
+-----+--------+
|1    |43.0    |
|2    |27.0    |
|3    |31.0    |
|4    |31.0    |
|5    |50.0    |
|6    |49.0    |
|7    |27.0    |
|8    |31.0    |
|9    |31.0    |
|10   |29.0    |
|11   |35.0    |
|12   |40.0    |
+-----+--------+

Zonal Statistics

Zonal statistics involves joining vector geometries to a raster and calculating statistics or aggregate values from the pixel values that fall within each geometry. Let’s load a dataset of country boundaries (see the loading sketch below) and calculate the average yearly precipitation for each country. To do this we’ll make use of the RS_ZonalStats function to compute the average of all precipitation values within each country’s boundaries.
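
The creation of the countries table isn’t shown here; a minimal loading sketch, mirroring the counties Shapefile example later in this post and assuming a hypothetical Natural Earth admin-0 countries path, looks like this:

countries_shapefile = "s3://wherobots-examples/data/examples/natural_earth/ne_10m_admin_0_countries"  # hypothetical path
spatialRDD = ShapefileReader.readToGeometryRDD(sedona, countries_shapefile)
countries_df = Adapter.toDf(spatialRDD, sedona)
countries_df.createOrReplaceTempView("countries")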

Note that since the RS_ZonalStats function will return a value for each country for each month, we sum these monthly averages together to end up with the yearly average for each country.

world_prec_df = sedona.sql("""
SELECT 
  sum(RS_ZonalStats(prec.raster, countries.geometry, 1, 'avg', true)) AS yearly_avg_prec,  
  any_value(countries.geometry), 
  countries.name
FROM prec, countries
GROUP BY name
ORDER BY yearly_avg_prec DESC
""")
world_prec_df.dropna().show()
+------------------+--------------------+--------------------+
|   yearly_avg_prec| any_value(geometry)|                name|
+------------------+--------------------+--------------------+
|            4937.5|MULTIPOLYGON (((1...|          Micronesia|
|            3526.0|MULTIPOLYGON (((1...|               Palau|
|           3378.75|MULTIPOLYGON (((-...|               Samoa|
|         3345.6875|MULTIPOLYGON (((1...|              Brunei|
| 3234.474358974358|MULTIPOLYGON (((1...|         Solomon Is.|
|            3111.0|MULTIPOLYGON (((-...|        Saint Helena|
|            3008.0|MULTIPOLYGON (((-...|Wallis and Futuna...|
|2988.5508720930234|MULTIPOLYGON (((1...|    Papua New Guinea|
| 2934.628571428572|MULTIPOLYGON (((1...|             Vanuatu|
| 2881.220250521921|MULTIPOLYGON (((1...|            Malaysia|
|2831.0000000000005|MULTIPOLYGON (((-...|          Costa Rica|
|2807.0000000000005|MULTIPOLYGON (((-...|          Faeroe Is.|
| 2733.146385760997|MULTIPOLYGON (((1...|           Indonesia|
|2717.5348837209303|MULTIPOLYGON (((-...|        Sierra Leone|
| 2675.306306306307|MULTIPOLYGON (((-...|              Panama|
|2656.3900709219856|POLYGON ((-11.476...|             Liberia|
|2631.9744668068433|MULTIPOLYGON (((-...|            Colombia|
|2560.3461538461534|MULTIPOLYGON (((-...|                Fiji|
|2555.6666666666665|MULTIPOLYGON (((6...|São Tomé and Pr...|
|2525.6388261851025|MULTIPOLYGON (((1...|         Philippines|
+------------------+--------------------+--------------------+
only showing top 20 rows

Visualizing the results of this zonal statistics analysis, we can see that the countries with the highest average annual precipitation lie mostly along the equator.
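
A similar map can be produced with SedonaKepler. This is a minimal sketch (not from the original post) that renames the aggregated columns positionally so the geometry column has a simple name Kepler can pick up:

# Rename columns positionally: yearly_avg_prec, geometry, name
world_prec_map_df = world_prec_df.dropna().toDF("yearly_avg_prec", "geometry", "name")
SedonaKepler.create_map(world_prec_map_df, name="Yearly Average Precipitation")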

yearly average precipitation

Tiling Rasters

If we are working with large rasters we typically want to break them up into multiple tiles to improve the efficiency of raster operations. Let’s take a look at an example using a dataset from NOAA of night time visible lights. A single raster image represents the intensity of light visible at night across the entire surface of the Earth. There are several derived data products from this dataset that infer measures such as population or economic output, but we’ll use the average annual visible light for a single year.

Night time visible lights

We can use the RS_FromPath function to load this raster into Sedona.

f10_1993_uri = "s3://wherobots-examples/data/examples/DMSP_OLS/F101993.v4b.global.stable_lights.avg_vis.tif"
f10_1993_df = sedona.sql(f"SELECT RS_FromPath('{f10_1993_uri}') AS raster")
f10_1993_df.createOrReplaceTempView("f10_1993")

This gives us a single row with a single raster. Let’s use the RS_TileExplode function to break our single large raster into equally sized tiles, with one tile per row.

tile_df = sedona.sql("SELECT RS_TileExplode(raster, 256, 256) AS (x, y, tile) FROM f10_1993")
tile_df.show(5)
+---+---+--------------------+
|  x|  y|                tile|
+---+---+--------------------+
|  0|  0|OutDbGridCoverage...|
|  1|  0|OutDbGridCoverage...|
|  2|  0|OutDbGridCoverage...|
|  3|  0|OutDbGridCoverage...|
|  4|  0|OutDbGridCoverage...|
+---+---+--------------------+

So far we’ve been querying our data as Sedona Spatial DataFrames without persisting them. To persist these tables we can save our DataFrames as Havasu tables. Havasu is a spatial table format based on Apache Iceberg that Sedona uses to store geospatial data and that enables efficient geospatial operations. You can read more about Havasu in this post.

By default in Wherobots Cloud, the wherobots catalog is configured as a namespace for creating and storing databases, which are persisted in S3 cloud storage private to our Wherobots user.

Here we save our tiled raster table to a database called test_db in the wherobots catalog, so the full name for this table is wherobots.test_db.f10_1993.

sedona.sql("DROP TABLE IF EXISTS wherobots.test_db.f10_1993")
tile_df.writeTo("wherobots.test_db.f10_1993").create()

We can then refer to this table during queries. Here we count the number of rows in the table, which in this case is the number of tiles.

sedona.table("wherobots.test_db.f10_1993").count()
-----------------------
11154

To visualize the boundaries of each tile we can use SedonaKepler, along with the RS_Envelope function, which returns the bounding box of each tile. Note that each tile also has x, y coordinates that represent its position relative to the other tiles.

from pyspark.sql.functions import expr  # needed for the SQL expression in withColumn

tiledMap = SedonaKepler.create_map()
SedonaKepler.add_df(tiledMap, sedona.table("wherobots.test_db.f10_1993").withColumn("tile", expr("RS_Envelope(tile)")), name="tiles")

Tiled raster boundaries

We can also visualize each individual tile using the RS_AsImage function.

from IPython.display import display, HTML  # render the HTML table inline in the notebook

display(HTML(sedona.table("wherobots.test_db.f10_1993").selectExpr("RS_AsImage(tile, 200)", "RS_Envelope(tile)").limit(5).toPandas().to_html(escape=False)))

Tiled raster visualization

To inspect the observed light intensity at a certain point, we first need to find the tile that contains it. We can do this using the RS_Intersects predicate to find intersecting vector and raster geometries.

sedona.sql("""
WITH matched_tile AS (
  SELECT * FROM wherobots.test_db.f10_1993 
  WHERE RS_Intersects(tile, ST_POINT(-113.9940, 46.8721))
)

SELECT RS_Value(tile, ST_POINT(-113.9940, 46.8721))
AS pixel_value
FROM matched_tile

""").show(truncate=False)
+-----------+
|pixel_value|
+-----------+
|63.0       |
+-----------+

Tiled Zonal Statistics

We can also perform zonal statistics operations on a tiled raster dataset. This time let’s determine which counties in the US have the most night time visible light.

First, we’ll load the boundaries of all US counties using a Shapefile from Natural Earth.

counties_shapefile = "s3://wherobots-examples/data/examples/natural_earth/ne_10m_admin_2_counties"
spatialRDD = ShapefileReader.readToGeometryRDD(sedona, counties_shapefile)
counties_df = Adapter.toDf(spatialRDD, sedona)
counties_df.createOrReplaceTempView("counties")

Similar to the zonal statistics example above, we’ll use the RS_ZonalStats function to calculate statistics of the pixel values that fall within each geometry; however, since our raster is tiled we first need to match the tile(s) that intersect each county’s boundary.

county_light_tiled_df = sedona.sql("""
WITH matched_tile AS (
  SELECT tile, geometry, FIPS
  FROM wherobots.test_db.f10_1993, counties
  WHERE RS_Intersects(tile, counties.geometry) 
)

SELECT 
    sum(RS_ZonalStats(matched_tile.tile, matched_tile.geometry, 'mean')) AS mean_light, 
    any_value(matched_tile.geometry) AS geometry, 
    FIPS
FROM matched_tile
GROUP BY FIPS
""")

county_light_tiled_df.createOrReplaceTempView("county_light_1993")
county_light_tiled_df.show(100)
+--------------------+--------------------+-------+
|          mean_light|            geometry|   FIPS|
+--------------------+--------------------+-------+
|   6.563388510224063|MULTIPOLYGON (((-...|US01003|
|   5.303058346553825|POLYGON ((-85.422...|US01019|
|   7.112594570538514|POLYGON ((-86.413...|US01021|
|  2.5223492723492806|POLYGON ((-88.091...|US01025|
|   9.564617731305812|POLYGON ((-85.789...|US01031|
|  13.433770014555993|POLYGON ((-88.130...|US01033|
|   8.240051020408188|POLYGON ((-86.370...|US01037|
|   2.301078582434537|POLYGON ((-86.191...|US01039|
|  0.9495387954422182|POLYGON ((-86.499...|US01041|
|     8.3112128146453|POLYGON ((-85.770...|US01045|
|   2.008212672420753|POLYGON ((-86.916...|US01047|
|   4.339487179487201|POLYGON ((-86.699...|US01053|

Visualizing the results of this analysis, we can see that night time visible light intensity largely follows population centers. Further time series analysis of this data could reveal patterns of growth or resource extraction activities.

SedonaKepler.create_map(county_light_tiled_df, name="Night Time Visible Lights by County")

Average night light intensity zonal statistics

Resources

In this post we explored working with raster data in Apache Sedona and Wherobots Cloud. Here are some relevant resources that you might find helpful:

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter: