Wherobots Joins Overture, Winning The Taco Wars, Spatial SQL API, Geospatial Index Podcast – This Month In Wherobots

Welcome to This Month In Wherobots, the monthly developer newsletter for the Wherobots & Apache Sedona community! This month we have news about Wherobots and the Overture Maps Foundation, a deep dive on new Wherobots Cloud features like raster inference, vector tile generation, and the Spatial SQL API, plus a look at retail cannibalization analysis for the commercial real estate industry.

Wherobots Joins Overture Maps Foundation

Wherobots joins Overture Maps Foundation

Wherobots has officially joined Overture Maps Foundation to support the next generation of planetary-scale open map data. Wherobots has supported the development of Overture datasets through the Overture Maps Foundation's use of the open-source Apache Sedona project to build and distribute global map data, enabling Overture to embrace modern cloud-native geospatial technologies like GeoParquet. By joining Overture as a Contributing Member, Wherobots will continue to support the ongoing development, distribution, and evolution of this critical open dataset that enables developers and data practitioners to make sense of the world around us.

Read the announcement blog post

Featured Community Members: Sean Knight & Ilya Marchenko

Apache Sedona featured community members - July 2024

This month’s featured community members are Sean Knight and Ilya Marchenko from YuzuData, where they focus on AI and location intelligence for the commercial real estate industry. YuzuData is a Wherobots partner and leverages the power of Apache Sedona and Wherobots Cloud as part of their work analyzing large-scale geospatial data. Sean and Ilya recently wrote a blog post showing how to use Wherobots for a retail cannibalization study. Thanks Sean and Ilya for being a part of the community and sharing how you’re building geospatial products using Wherobots!

Comparing Taco Chains: A Consumer Retail Cannibalization Study With Isochrones

Retail cannibalization analysis with Wherobots

Understanding the impact of opening a new retail location on existing locations is an important analysis in the commercial real estate industry. In this code-heavy blog post Sean and Ilya from YuzuData detail a retail cannibalization analysis using WherobotsDB, Overture Maps point of interest data, drive-time isochrones using the Valhalla API, and visualization with SedonaKepler. Sean also presented this analysis earlier this week in a live webinar.

Read the blog post or watch the video recording

Unlock Satellite Imagery Insights With WherobotsAI Raster Inference

Raster inference with WherobotsAI

One of the most exciting features in Wherobots’ latest release is WherobotsAI Raster Inference which enables running machine learning models on satellite imagery for object detection, segmentation, and classification. This post gives a detailed look at the types of models supported by WherobotsAI and an overview of the SQL and Python APIs for raster inference with an example of identifying solar farms for the purpose of mapping electricity infrastructure.

Read the blog post to learn more about WherobotsAI Raster Inference

Generating Global PMTiles In 26 Minutes With WherobotsDB VTiles

Generating PMTiles with Wherobots VTiles vector tiles generator

WherobotsDB VTiles is a highly scalable vector tile generator that can produce tiles from small to planetary-scale datasets quickly and cost-efficiently, with support for the PMTiles format. In this post we see how to generate vector tiles of the entire planet using three Overture layers. Generating PMTiles of the Overture buildings layer in Wherobots Cloud takes 26 minutes. The post includes all the code necessary to recreate these tile generation operations and a discussion of performance considerations.

Read the blog post to learn more about WherobotsDB VTiles

Spatial SQL API Brings Performance Of WherobotsDB To Your Favorite Data Applications

Using Apache Airflow with WherobotsDB

The Wherobots Spatial SQL API enables integration with Wherobots Cloud via Python and Java client drivers. In addition to enabling integrations with your favorite data applications via the client drivers, Wherobots has released an Apache Airflow provider for orchestrating data pipelines and an integration with Harlequin, a popular SQL IDE.

Read the blog post to learn more about the Wherobots Spatial SQL API

Wherobots On The Geospatial Index Podcast

Wherobots On The Geospatial Index Podcast

William Lyon from Wherobots was recently a guest on The Geospatial Index podcast. In this episode he discusses the origins of Apache Sedona, the open-source technology behind Wherobots, how users are building spatial data products at massive scale with Wherobots, how Wherobots is improving the developer experience around geospatial analytics, and much more.

Watch the video recording

Upcoming Events

  • Apache Sedona Community Office Hours (Online – August 6th) – Join the Apache Sedona community for updates on the state of Apache Sedona, a presentation and demo of recent features, and a chance to provide your input on the roadmap, future plans, and contribution opportunities.
  • GeoMeetup: Cloud Native Spatial Data Stack (San Francisco – September 5th) – Join us on September 5th for an exciting GeoMeetup featuring talks from industry leaders from Wherobots and Felt.com. In this meetup we will be exploring the elements of the cloud native spatial data stack.
  • FOSS4G NA 2024 (St Louis – September 9th-11th) – FOSS4G North America is the premier open geospatial technology and business conference. Join the Wherobots team for a pre-conference workshop or come by and chat with us at the Wherobots booth to learn about the latest developments in Apache Sedona.

Want to receive this monthly update in your inbox? Sign up for the This Month In Wherobots Developer Newsletter:


🌶 Comparing taco chains :: a consumer retail cannibalization study with isochrones

Authors: Sean Knight and Ilya Marchenko

Using Wherobots for a Retail Cannibalization Study Comparing Two Leading Taco Chains

In this post, we explore how to implement a workflow from the commercial real estate (CRE) space using Wherobots Cloud. This workflow is commonly known as a cannibalization study, and we will be using WherobotsDB, POI data from OvertureMaps, the open source Valhalla API, and visualization capabilities offered by SedonaKepler.

NOTE: This is a guest Wherobots post from our friends at YuzuData. Reach out to them to learn more about their spatial data product services. You can also join them for a demo of this use case, scheduled for July 30th, here.

What is a retail cannibalization study?

In CRE (commercial real estate), stakeholders are often interested in questions like “If we build a new fast food restaurant here, how will its performance be affected by other similar fast food locations that already exist nearby?” The idea of the new fast food restaurant “eating into” the sales of other fast food restaurants that already exist nearby is known as ‘cannibalization’.

The main objective of studying this phenomenon is to determine the extent to which a new store might divert sales from existing stores owned by the same company or brand and evaluate the overall impact on the company’s market share and profitability in the area.

Cannibalization Study in Wherobots

For this case study, we will look at two taco chains located primarily in Texas: Torchy’s Tacos and Velvet Taco. In general, information about the performance of individual locations and customer demographics is proprietary. We can, however, still learn a great deal about the potential for cannibalization both between these two chains as competitors, and between individual locations of each chain. We also know, based on our own experience, that these chains compete with each other. Which taco shop to go to when we are visiting Texas is always a spicy debate.


We begin by importing modules that will be useful to us as we go on.

import geopandas as gpd
import pandas as pd
import requests
from sedona.spark import *
from pyspark.sql.functions import explode, array
from pyspark.sql import functions as F

Next, we can initiate a Sedona context.

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Identifying Points of Interest

Now, we need to retrieve the Torchy’s Tacos and Velvet Taco locations. In general, one can do this via a variety of both free and paid means. We will look at a simple, free approach that is made possible by the integration of Overture Maps data into the Wherobots environment:

sedona.table("wherobots_open_data.overture.places_place"). \\
createOrReplaceTempView("places")

We create a view of the Overture Maps places database, which contains information on points of interest (POIs) worldwide.
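
If you want a quick sanity check of this view before filtering (optional, and a full count over the global places table can take a moment), a simple count works:

sedona.sql("SELECT COUNT(*) AS num_places FROM places").show()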

Now, we can select the POIs which are relevant to this exercise:

stores = sedona.sql("""
SELECT id, names.common[0].value as name, ST_X(geometry) as long,
 ST_Y(geometry) as lat, geometry, 
 CASE WHEN names.common[0].value LIKE "%Torchy's Tacos%" 
 THEN "Torchy's Tacos" 
 ELSE 'Velvet Taco' END AS chain
FROM places
WHERE addresses[0].region = 'TX'
AND (names.common[0].value LIKE "%Torchy's Tacos%" 
OR names.common[0].value LIKE '%Velvet Taco%')
""")

Calling stores.show() gives us a look at the Spark DataFrame we created:

+--------------------+--------------+-----------+----------+
|                  id|          name|       long|       lat|
+--------------------+--------------+-----------+----------+
|tmp_8104A79216254...|Torchy's Tacos|  -98.59689|  29.60891|
|tmp_D17CA8BD72325...|Torchy's Tacos|  -97.74175|  30.29368|
|tmp_F497329382C10...|   Velvet Taco|  -95.48866|  30.18314|
|tmp_9B40A1BF3237E...|Torchy's Tacos| -96.805853| 32.909982|
|tmp_38210E5EC047B...|Torchy's Tacos|  -96.68755|  33.10118|
|tmp_DF0C5DF6CA549...|Torchy's Tacos|  -97.75159|  30.24542|
|tmp_BE38CAC8D46CF...|Torchy's Tacos|  -97.80877|  30.52676|
|tmp_44390C4117BEA...|Torchy's Tacos|  -97.82594|   30.4547|
|tmp_8032605AA5BDC...|   Velvet Taco| -96.469695| 32.898634|
|tmp_0A2AA67757F42...|Torchy's Tacos|  -96.44858|  32.90856|
|tmp_643821EB9C104...|Torchy's Tacos|  -97.11933|  32.94021|
|tmp_0042962D27E06...|   Velvet Taco|-95.3905374|29.7444214|
|tmp_8D0E2246C3F36...|Torchy's Tacos|  -97.15952|  33.22987|
|tmp_CB939610BC175...|Torchy's Tacos|  -95.62067|  29.60098|
|tmp_54C9A79320840...|Torchy's Tacos|  -97.75604|  30.37091|
|tmp_96D7B4FBCB327...|Torchy's Tacos|  -98.49816|  29.60937|
|tmp_1BB732F35314D...|   Velvet Taco|  -95.41044|    29.804|
|tmp_55787B14975DD...|   Velvet Taco|-96.7173913|32.9758554|
|tmp_7DC02C9CC1FAA...|Torchy's Tacos|  -95.29544|  32.30361|
|tmp_1987B31B9E24D...|   Velvet Taco|  -95.41006| 29.770256|
+--------------------+--------------+-----------+----------+
only showing top 20 rows

We’ve retrieved the latitude and longitude of our locations, as well as the name of the chain each location belongs to. We used the CASE WHEN statement in our query in order to simplify the location names. This way, we can easily select all the stores from the Torchy’s Tacos chain, for example, and not have to worry about individual locations being called things like “Torchy’s Tacos – Rice Village” or “Velvet Taco Midtown”, etc.
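
With the simplified chain column in place, a quick aggregation (a sanity check we added here, not part of the original analysis) confirms how many locations were retrieved per chain:

stores.groupBy('chain').count().show()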

We can also visualize these locations using SedonaKepler. First, we can create the map using the following snippet:

location_map = SedonaKepler.create_map(stores, "Locations", 
    config = location_map_cfg)

Then, we can display the results by simply calling location_map in the notebook. For convenience, we included the location_map_cfg Python dict in our notebook, which stores the settings necessary for the map to be created with the locations color-coded by chain. If we wish to make modifications to the map and save the new configuration for later use, we can do so by calling location_map.config and saving the result either as a cell in our notebook or in a separate location_map_cfg.py file.
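
For example, one minimal way to persist a tweaked configuration to a location_map_cfg.py file (the file name is just a convention) is:

with open('location_map_cfg.py', 'w') as f:
    f.write(f"location_map_cfg = {location_map.config}")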

Generating Isochrones

Now, for each of these locations, we can generate a polygon known as an isochrone or drivetime. These polygons represent the areas that are within a certain driving time of the given location. We will generate these drivetimes using the Valhalla isochrone API:

def get_isochrone(lat, lng, costing, time_steps, name, location_id):
    url = "<https://valhalla1.openstreetmap.de/isochrone>"
    params = {
      "locations": [{"lon": lng, "lat": lat}],
      "contours": [{"time": i} for i in time_steps],
      "costing": costing,
      "polygons": 1,
    }
    response = requests.post(url, json=params)
    if response:
        result = response.json()
        if 'error_code' not in result.keys():
            df = gpd.GeoDataFrame.from_features(result)
            df['name'] = name
            df['id'] = location_id
            return df[['name','id','geometry']]

The function takes as its input a latitude and longitude value, a costing parameter, a list of time steps, a location name, and a location id. The output is a GeoDataFrame containing a Shapely polygon representing the isochrone, along with the name and id of the location the isochrone corresponds to.

We have separate columns for a location id and a location name so that we can use the id column to examine isochrones for individual restaurants and we can use the name column to look at isochrones for each of the chains.

The costing parameter can take on several different values (see the API reference here), and it can be used to create “drivetimes” assuming the user is either walking, driving, or taking public transport.
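
For example, assuming the public Valhalla server accepts the pedestrian costing model (see the API reference), the same helper can produce a 10-minute walking isochrone for a single store:

# Hypothetical example: a 10-minute walking isochrone for one location,
# using the 'pedestrian' costing model instead of 'auto'
walking_isochrone = get_isochrone(29.60891, -98.59689, 'pedestrian', [10],
    "Torchy's Tacos", 'example_location_id')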

We create a GeoDataFrame of all of the 5-minute drivetimes for our taco restaurant locations

drivetimes_5_min = pd.concat([get_isochrone(row.lat, row.long, 'auto', [5],
 row.chain, row.id) for row in stores.select('id','chain','lat','long').collect()])

and then save it to our S3 storage for later use:

drivetimes_5_min.to_csv('s3://path/drivetimes_5_min_torchys_velvet.csv',
 index = False)

Because we are using a free API and we have to create quite a few of these isochrones, we highly recommend saving the file for later analysis. For the purposes of this blog, we have provided a ready-made isochrone file here, which we can load into Wherobots with the following snippet:

sedona.read.option('header','true').format('csv') .\
load('s3://path/drivetimes_5_min_torchys_velvet.csv') .\
createOrReplaceTempView('drivetimes_5_min')

We can now visualize our drivetime polygons in SedonaKepler. As before, we first create the map with the snippet below.

map_isochrones = sedona.read.option('header','true').format('csv'). \
load('s3://path/drivetimes_5_min_torchys_velvet.csv')

isochrone_map = SedonaKepler.create_map(map_isochrones, "Isochrones",
 config = isochrone_map_cfg)

Now, we can display the result by calling isochrone_map.

The Analysis

At this point, we have a collection of the Torchy’s and Velvet Taco locations in Texas, and we know the areas which are within a 5-minute drive of each location. What we want to do now is to estimate the number of potential customers that live near each of these locations, and the extent to which these populations overlap.

A First Look

Before we look at how these two chains might compete with each other, let’s also take a look at the extent to which restaurants within each chain might be cannibalizing each other’s sales. A quick way to do this is by using the filtering feature in Kepler to look at isochrones for a single chain:



We see that locations for each chain are fairly spread out and (at least at the 5-minute drivetime level) there is not a high degree of cannibalization within each chain. Looking at the isochrones for both chains, however, we notice that Velvet Taco locations often tend to be near Torchy’s Tacos locations (or vice-versa). At this point, all we have are qualitative statements based on these maps. Next, we will show how to use H3 and existing open-source datasets to make these statements more quantitative.

Estimating Cannibalization Potential

As we can see by looking at the map of isochrones above, they are highly irregular polygons which have a considerable amount of overlap. In general, these polygons are not described in a ‘nice’ way by any administrative boundaries such as census block groups, census tracts, etc. Therefore, we will have to be a little creative in order to estimate the population inside them.

One way of doing this using the tools provided by Apache Sedona and Wherobots is to convert these polygons to H3 hexes. We can do this with the following snippet:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
createOrReplaceTempView('h3_isochrones')

This turns our table of drivetime polygons into a table where each row represents a hexagon with sides roughly 400m long that is part of a drivetime polygon. We also record the chain that each hexagon is associated with (the chain of the polygon it came from). We store each hexagon in its own row because this will simplify the process of estimating population later on.
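
A quick peek at a few rows of the resulting view confirms the one-hexagon-per-row layout:

sedona.sql("SELECT h3, name, id FROM h3_isochrones LIMIT 5").show()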

Although the question of estimating population inside individual H3 hexes is also a difficult one (we will release a notebook on this soon), open-source datasets with this information are available online, and we will use one such dataset, provided by Kontur:

kontur = sedona.read.option('header','true') .\
load('s3://path/us_h3_8_pop.geojson', format="json") .\
drop('_corrupt_record').dropna() .\
selectExpr('CAST(CONV(properties.h3, 16, 10) AS BIGINT) AS h3',
 'properties.population as population')

kontur.createOrReplaceTempView('kontur')

We can now enhance our h3_isochrones table with population counts for each H3 hex:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
join(kontur, 'h3', 'left').distinct().createOrReplaceTempView('h3_isochrones')

At this stage, we can also quickly compute the cannibalization potential within each chain. Using the following code, for example, we can estimate the number of people who live within a 5-minute drive of more than one Torchy’s Tacos:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\\
join(kontur, 'h3', 'left').filter('name LIKE "%Torchy%"').select('h3','population') .\\
groupBy('h3').count().filter('count >= 2').join(kontur, 'h3', 'left').distinct() .\\
agg(F.sum('population')).collect()[0][0]
97903.0

We can easily change this code to compute the same information for Velvet Taco by changing filter('name LIKE "%Torchy%"') in line 4 of the above snippet to filter('name LIKE "%Velvet%"'). If we do this, we will see that 100298 people live within a 5-minute drive of more than one Velvet Taco. Thus, we see that the Torchy’s Tacos brand appears to be slightly better at avoiding cannibalization among its own locations (especially given that Torchy’s Tacos has more locations than Velvet Taco).
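
For reference, the Velvet Taco version of the calculation is identical apart from that filter; a sketch mirroring the snippet above:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\
join(kontur, 'h3', 'left').filter('name LIKE "%Velvet%"').select('h3','population') .\
groupBy('h3').count().filter('count >= 2').join(kontur, 'h3', 'left').distinct() .\
agg(F.sum('population')).collect()[0][0]
# returns 100298.0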

Now, we can run the following query to show the number of people in Texas who live within a 5-minute drive of a Torchy’s Tacos:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE "%Torchy's%"
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()

The reason we select distinct H3 hexes here is because a single hex can belong to more than one isochrone (as evidenced by the SedonaKepler visualizations above). We get the following output:

+---------------+
|sum(population)|
+---------------+
|      1546765.0|
+---------------+

So roughly 1.5 million people in Texas live within a 5-minute drive of a Torchy’s Tacos location. Looking at our previous calculations for how many people live near more than one restaurant of the same chain, we can see that Torchy’s Tacos locations near each other cannibalize about 6.3% of the potential customers who live within 5 minutes of a Torchy’s location.

Running a similar query for Velvet Taco tells us that roughly half as many people live within a 5-minute drive of a Velvet Taco:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE '%Velvet Taco%'
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()
+---------------+
|sum(population)|
+---------------+
|       750360.0|
+---------------+

As before, we can also see that Velvet Taco locations near each other cannibalize about 13.4% of the potential customers who live within 5 minutes of a Velvet Taco location.
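
The percentages quoted above follow directly from these figures; a quick check:

# Within-chain cannibalization shares implied by the numbers above
print(97903.0 / 1546765.0)    # ~0.063, i.e. about 6.3% for Torchy's Tacos
print(100298.0 / 750360.0)    # ~0.134, i.e. about 13.4% for Velvet Taco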

Now, we can estimate the potential for cannibalization between these two chains:

sedona.sql("""
WITH overlap_h3 (h3, population) AS
(
    SELECT DISTINCT a.h3, ANY_VALUE(a.population)
    FROM h3_isochrones a LEFT JOIN h3_isochrones b ON a.h3 = b.h3
    WHERE a.name != b.name
    GROUP BY a.h3
)
SELECT sum(population)
FROM overlap_h3
""").show()

which gives:

+---------------+
|sum(population)|
+---------------+
|       415033.0|
+---------------+

We can see that more than half of the people who live near a Velvet Taco location also live near a Torchy’s Tacos location, and we can visualize this population overlap:

isochrones_h3_map_data = sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name','id').withColumnRenamed('col','h3') .\
join(kontur, 'h3', 'left').select('name','population',array('h3')).withColumnRenamed('array(h3)','h3').selectExpr('name','population','ST_H3ToGeom(h3)[0] AS geometry')

isochrones_h3_map = SedonaKepler.create_map(isochrones_h3_map_data, 'Isochrones in H3', config = isochrones_h3_map_cfg)

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


New Wherobots Cloud Features, How Overture Maps Uses Apache Sedona, Aircraft Data, & Spatial Lakehouses

Welcome to This Month In Wherobots, the monthly developer newsletter for the Wherobots & Apache Sedona community! In this edition we have a look at the latest Wherobots Cloud release, how the Overture Maps Foundation uses Apache Sedona to generate their data releases, processing a billion aircraft observations, building spatial data lakehouses with Iceberg Havasu, the new Apache Sedona 1.6.0 release, and more!

Introducing WherobotsAI For Planetary Inference And Capabilities That Modernize Spatial Intelligence At Scale

Wherobots announced significant new features in Wherobots Cloud to enable machine learning inference on satellite imagery via SQL, new Python and Java database drivers for interacting with WherobotsDB in your own analytics applications or data orchestration tooling, and a scalable vector tiles generator. These new enhancements are available now in Wherobots Cloud.

Read The Blog Post or Register For The Webinar

Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

Querying Overture Maps GeoParquet data using Apache Sedona

The Overture Maps Foundation publishes an open, comprehensive global map dataset with layers for transportation, places, 3D buildings, and administrative boundaries. This data comes from multiple sources and is published in the cloud-native GeoParquet format, made publicly available for download in cloud object storage. In order to wrangle such a large planetary-scale dataset, the Overture team uses Apache Sedona to prepare, process, and generate partitioned GeoParquet files. This blog post dives into the benefits of GeoParquet, how Overture uses Sedona to generate GeoParquet (including a dual Geohash partitioning and sorting method), and how to query and analyze the Overture Maps dataset using Wherobots Cloud.
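
As a small illustration of the consumer side of that work, here is a minimal sketch (the bucket path and bounding box are illustrative, not Overture's published location) of reading an Overture GeoParquet layer with Apache Sedona and pruning it spatially, which is where GeoParquet's embedded bounding box metadata pays off:

from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Illustrative path to a GeoParquet copy of the Overture places theme
places = sedona.read.format("geoparquet").load("s3a://example-bucket/overture/places/")

# Spatial pruning: keep only features inside a bounding box around the Bay Area
bay_area = places.where(
    "ST_Within(geometry, ST_PolygonFromEnvelope(-123.0, 36.9, -121.2, 38.3))"
)
print(bay_area.count())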

Read the article: Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

Featured Community Member: Feng Jiang

June featured community member

Our featured Apache Sedona and Wherobots community member this month is Feng Jiang, a Senior Software Engineer at Microsoft, where he works with map and geospatial data at scale. Through his involvement with the Overture Maps Foundation he also helps maintain and publish the public Overture Maps dataset. In the blog post "Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona" he shared insights gained from working with Apache Sedona in the pipeline Overture uses to generate planetary-scale GeoParquet map data. Thanks for your contributions and for being a part of the Apache Sedona community!

Processing A Billion Aircraft Observations With Apache Sedona In Wherobots Cloud

Impacted flight segments

An important factor to consider when analyzing aircraft data is the potential impact of weather, especially severe weather events, on flights. This tutorial uses public ADS-B aircraft trace data combined with weather data to identify which flights have the highest potential to be impacted by severe weather events. We also see how to incorporate real-time Doppler radar raster data and explore the performance of working with a billion-row dataset for spatial operations like point-in-polygon searches and spatial joins.

Read The Tutorial: Processing A Billion Aircraft Observations With Apache Sedona In Wherobots Cloud

Training Series: Large-Scale Geospatial Analytics With Graphs And The PyData Ecosystem

Large-Scale Geospatial Analytics With Graphs And The PyData Ecosystem

Choosing the right tool for the job is an important aspect of data science, and equally important is understanding how the tools fit together and can be used alongside each other. This hands-on workshop shows how to leverage the scale of Apache Sedona with Wherobots Cloud for geospatial data processing, alongside common Python tooling like GeoPandas, and how to add graph analytics using Neo4j to our analysis toolkit. Using a dataset of species observations, we build a species interaction graph to find which species share habitat overlap, a common workflow for conservation use cases.

Watch The Workshop Recording: Large Scale Geospatial Analytics With Graphs And The PyData Ecosystem

Apache Sedona 1.6 Release

Apache Sedona Ecosystem

Version 1.6.0 of Apache Sedona is now available! This version includes support for Shapely 2.0 and GeoPandas 0.11.1+, enhanced support for geography data, new vector and raster functions, and tighter integration with Python raster data workflows through support for Rasterio and NumPy user-defined functions (UDFs). You can learn more about this release in the release notes.

Read The Apache Sedona 1.6 Release Notes

Building Spatial Data Lakehouses With Iceberg Havasu

Iceberg Havasu: A Spatial Data Lakehouse Format

This talk from Subsurface 2024 introduces the Havasu spatial table format, an extension of Apache Iceberg used to build spatial data lakehouses. We learn about the motivation for adding spatial functionality to Iceberg, how Havasu enables efficient spatial queries for both vector and raster data, and how to use a familiar SQL table interface when building large-scale geospatial analytics applications.

Watch The Recording: Building Spatial Data Lakehouses With Iceberg Havasu

Upcoming Events

Want to receive this monthly update in your inbox? Sign up for the This Month In Wherobots Newsletter:


Introducing WherobotsAI for planetary inference, and capabilities that modernize spatial intelligence at scale

We are excited to announce a preview of WherobotsAI, our new suite of AI and ML powered capabilities that unlock spatial intelligence in satellite imagery and GPS location data. Additionally, we are bringing the high-performance of WherobotsDB to your favorite data applications with a Spatial SQL API that integrates WherobotsDB with more interfaces including Apache Airflow for Spatial ETL. Finally, we’re introducing the most scalable vector tile generator on earth to make it easier for teams to produce engaging and interactive map applications. All of these new features are capable of operating on planetary-scale data.

Watch the walkthrough of this release here.

Wherobots Mission and Vision

Before we dive into this release, we think it’s important to understand how these capabilities fit into our mission, our product principles, and vision for the Spatial Intelligence Cloud so you can see where we are headed.

Our Mission
These new capabilities are core to Wherobots’ mission, which is to unlock spatial intelligence of earth, society, and business, at a planetary scale. We will do this by making it extremely easy to utilize data and AI technology purpose-built for creating spatial intelligence that’s cloud-native and compatible with modern open data architectures.

Our Product Principles

  • We’re building the spatial intelligence platform for modern organizations. Every organization with a mission directly linked to the performance of tangible assets, goods and services, or data products about what’s happening in the physical world, will need a spatial intelligence platform to be competitive, sustainable, and climate adaptive.
  • It delivers intelligence for the greater good. Teams and their organizations want to analyze their worlds to create a net positive impact for business, society, and the earth.
  • It’s purpose-built yet simple. Spatial intelligence won’t scale through in-house ‘spatial experts’, or through general purpose architectures that are not optimized for spatial workloads or development experiences.
  • It’s efficient at any scale. Maximal performance, scale, and cost efficiency can only be achieved through a cloud-native, serverless solution.
  • It creates intelligence with AI. Every organization will need AI alongside modern analytics to create spatial intelligence.
  • It’s open by default. Pace of innovation depends on choice. Organizations that adopt cloud-native, open source compatible, and modern open data architectures will innovate faster because they have more choices in the solutions they can use.

Our Vision
We exist because creating spatial intelligence at-scale is hard. Our contributions to Apache Sedona, leadership in the open geospatial domain, and investments in Wherobots Cloud have, and will make it easier. Users of Apache Sedona, Wherobots customers, and ultimately any AI application will be enabled to support better decisions about our physical and virtual worlds. They will be able to create solutions to improve these worlds that were otherwise infeasible or too costly to build. And the solutions developed will have a positive impact on society, business, and earth — at a planetary scale.

Introducing WherobotsAI

There are petabytes of satellite and aerial imagery produced every day. Yet for most analysts, scientists, and developers, these datasets are analytically inaccessible beyond what the naked eye can inspect. As a result, most organizations still rely on humans and their eyes to analyze satellite or other forms of aerial imagery. Wherobots can already perform analytics of overhead imagery (also known as raster data) and geospatial objects (known as vector data) simultaneously at scale. But organizations also want to use modern AI and ML technologies to streamline and scale otherwise visual, single-threaded tasks like object detection, classification, and segmentation from overhead imagery.

Like satellite imagery that is generally hard to analyze, businesses also find it hard to analyze GPS data in their applications because it’s too noisy; points don’t always correspond to the actual path taken. Teams need an easy solution for snapping noisy GPS data to road or other segment types, at any scale.

Today we are announcing WherobotsAI which offers fully managed AI and machine learning capabilities that accelerate the development of spatial insights, for anyone familiar with SQL or Python. WherobotsAI capabilities include:

[new] Raster Inference (preview): A first of its kind, Raster Inference unlocks the analytical potential of satellite or aerial imagery at a planetary scale, by integrating AI models with WherobotsDB to make it extremely easy to detect, classify, and segment features of interest in satellite and aerial images. You can see how easy it is to detect and georeference solar farms here, with just a few lines of SQL:

SELECT
  outdb_raster,
  RS_SEGMENT('solar-satlas-sentinel2', outdb_raster) AS solar_farm_result
FROM df_raster_input

These georeferenced predictions can be queried with WherobotsDB and can be interactively explored in a Wherobots notebook. Below is an example of detection of solar panels in SedonaKepler.

AI Inference Solar Farm

The models and AI infrastructure powering Raster Inference are fully managed, which means there’s nothing to set up or configure. Today, you can use Raster Inference to detect, segment, and classify solar farms, land cover, and marine infrastructure from terabyte-scale Sentinel-2 true color and multispectral imagery datasets in under half an hour, on our GPU runtimes available in the Wherobots Professional Edition. Soon we will be making the inference metadata for the models public, so if your own models meet this standard, they will be supported by Raster Inference.

These models and datasets are just the starting point for WherobotsAI. We are looking forward to hearing from you to help us define the roadmap for what we should build support for next.

Map Matching: If you need to analyze trips at scale, but struggle to wrangle noisy GPS data, Map Matching is capable of turning billions of noisy GPS pings into signal, by snapping shared points to road or other vector segments. Teams are using Map Matching to process hundreds of millions of vehicle trips per hour. This speed surpasses any current commercial solutions, all for a cost of just a few hundred dollars.

Here’s an example of what WherobotsAI Map Matching does to improve the quality of your trip segments.

  • Red and yellow line segments were created from raw, noisy GPS data.
  • Green represents Map Matched segments.

map matching algorithm

Visit the user documentation to learn more and get started with WherobotsAI.

A Spatial SQL API for WherobotsDB

WherobotsDB, our serverless, highly efficient compute engine compatible with Apache Sedona, is up to 60x more performant for spatial joins than popular general purpose big data engines and warehouses, and up to 20x faster than Apache Sedona on its own. It will remain the most performant, earth-friendly solution for your spatial workloads at any scale.

Until today, teams had two options for harnessing WherobotsDB: they could write and run queries in Wherobots managed notebooks, or run spatial ETL pipelines using the Wherobots jobs interface.

Today, we’re enabling you to bring the utility of WherobotsDB to more interfaces with the new Spatial SQL API. Using this API, teams can remotely execute Spatial SQL queries using a remote SQL editor, build first-party applications using our client SDKs in Python (WherobotsDB API driver) and Java (Wherobots JDBC driver), or orchestrate spatial ETL pipelines using a Wherobots Apache Airflow provider.

Run spatial queries with popular SQL IDEs

The following is an example of how to integrate Harlequin, a popular SQL IDE with WherobotsDB. You’ll need a Wherobots API key to get started with Harlequin (or any remote client). API keys allow you to authenticate with Wherobots Cloud for programmatic access to Wherobots APIs and services. API keys can be created following a few steps in our user documentation.

We will query WherobotsDB using Harlequin in the Airflow example later in this blog.

$ pip install harlequin-wherobots
$ harlequin -a wherobots --api-key $(< api.key)

harlequin api key connection

You can find more information on how to use Harlequin in its documentation, and on the WherobotsDB adapter on its GitHub repository.

The Wherobots Python driver enables integration with many other tools as well. Here’s an example of using the Wherobots Python driver in the QGIS Python console to fetch points of interest from the Overture Maps dataset using the Spatial SQL API.

import os

from wherobots.db import connect
from wherobots.db.region import Region
from wherobots.db.runtime import Runtime
import geopandas
from shapely import wkt

with connect(
        token=os.environ.get("WBC_TOKEN"),
        runtime=Runtime.SEDONA,
        region=Region.AWS_US_WEST_2,
        host="api.cloud.wherobots.com"
) as conn:
    curr = conn.cursor()
    curr.execute("""
    SELECT names.common[0].value AS name, categories.main AS category, geometry 
    FROM wherobots_open_data.overture.places_place 
    WHERE ST_DistanceSphere(ST_GeomFromWKT("POINT (-122.46552 37.77196)"), geometry) < 10000
    AND categories.main = "hiking_trail"
    """)
    results = curr.fetchall()
    print(results)

results["geometry"] = results.geometry.apply(wkt.loads)
gdf = geopandas.GeoDataFrame(results, crs="EPSG:4326",geometry="geometry")

def add_geodataframe_to_layer(geodataframe, layer_name):
    # Create a new memory layer
    layer = QgsVectorLayer(geodataframe.to_json(), layer_name, "ogr")

    # Add the layer to the QGIS project
    QgsProject.instance().addMapLayer(layer)

add_geodataframe_to_layer(gdf, "POI Layer")

Using the Wherobots Python driver with QGIS

Visit the Wherobots user documentation to get started with the Spatial SQL API, or see our latest blog post that goes deeper into how to use our database drivers with the Spatial SQL API.

Automating Spatial ETL workflows with the Apache Airflow provider for Wherobots

ETL (extract, transform, load) workflows are oftentimes required to prepare spatial data for interactive analytics, or to refresh datasets automatically as new data arrives. Apache Airflow is a powerful and popular open source orchestrator of data workflows. With the Wherobots Apache Airflow provider, you can now use Apache Airflow to convert your spatial SQL queries into automated workflows running on Wherobots Cloud.

Here’s an example of the Wherobots Airflow provider in use. In this example we identify the top 100 buildings in the state of New York with the most places (facilities, services, businesses, etc.) registered within them using the Overture Maps dataset, and we’ll eventually auto-refresh the result daily. The initial view can be generated with the following SQL query:

CREATE TABLE wherobots.test_db.top_100_hot_buildings_daily AS
SELECT
  buildings.id AS building,
  first(buildings.names),
  count(places.geometry) AS places_count,
  '2023-07-24' AS ts
FROM wherobots_open_data.overture.places_place places
JOIN wherobots_open_data.overture.buildings_building buildings
  ON ST_CONTAINS(buildings.geometry, places.geometry)
WHERE places.updatetime >= '2023-07-24'
  AND places.updatetime < '2023-07-25'
  AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), places.geometry)
  AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), buildings.geometry)
GROUP BY building
ORDER BY places_count DESC
LIMIT 100
  • A place in Overture is defined as a real-world facility, service, business, or amenity.
  • We used an arbitrary date of 2023-07-24.
  • New York is defined by a simple bounding box polygon (-79.762152, 40.496103, -71.856214, 45.01585) (we could alternatively join with its appropriate administrative boundary polygon).
  • We use two WHERE clauses on places.updatetime to filter one day’s worth of data.
  • The query creates a new table wherobots.test_db.top_100_hot_buildings_daily to store the query result. Note that it will not directly return any records because we are loading directly into a table.

Now, let’s use Harlequin as described earlier to inspect the outcome of creating this table with the above query:

SELECT * FROM wherobots.test_db.top_100_hot_buildings_daily

Harlequin query test 2

Apache Airflow and the Airflow Provider for Wherobots allow you to schedule and execute this query each day, injecting the appropriate date filters into your templatized query.

  • In your Apache Airflow instance, install the airflow-providers-wherobots library. You can either execute pip install airflow-providers-wherobots, or add the library to the dependency list of your Apache Airflow runtime.
  • Create a new “generic” connection for Wherobots called wherobots_default, using api.cloud.wherobots.com as the “Host” and your Wherobots API key as the “Password”.

The next step is to create an Airflow DAG. The Wherobots Provider exposes the WherobotsSqlOperator for executing SQL queries. Replace the hardcoded “2023-07-24” in your query with the Airflow template macros {{ ds }} and {{ next_ds }}, which will be rendered as the DAG schedule dates on the fly:

import datetime

from airflow import DAG
from airflow_providers_wherobots.operators.sql import WherobotsSqlOperator

with DAG(
    dag_id="example_wherobots_sql_dag",
    start_date=datetime.datetime.strptime("2023-07-24", "%Y-%m-%d"),
    schedule="@daily",
    catchup=True,
    max_active_runs=1,
):
    operator = WherobotsSqlOperator(
        task_id="execute_query",
        wait_for_downstream=True,
        sql="""
        INSERT INTO wherobots.test_db.top_100_hot_buildings_daily
        SELECT
          buildings.id AS building,
          first(buildings.names),
          count(places.geometry) AS places_count,
          '{{ ds }}' AS ts
        FROM wherobots_open_data.overture.places_place places
        JOIN wherobots_open_data.overture.buildings_building buildings
          ON ST_CONTAINS(buildings.geometry, places.geometry)
        WHERE places.updatetime >= '{{ ds }}'
          AND places.updatetime < '{{ next_ds }}'
          AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), places.geometry)
          AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), buildings.geometry)
        GROUP BY building
        ORDER BY places_count DESC
        LIMIT 100
        """,
        return_last=False,
    )

You can visualize the status and logs of the DAG’s execution in the Apache Airflow UI. As shown below, the operator prints out the exact query rendered and executed when you run your DAG.

apache airflow spatial sql api
Please visit the Wherobots user documentation for more details on how to set up your Apache Airflow instance with the Wherobots Provider.

Generate Vector Tiles — formatted as PMTiles — at Global Scale

Vector tiles are high resolution representations of features optimized for visualization, computed offline and displayed in map applications. This decouples dataset preparation from client-side rendering driven by zooming and panning, and map developers use that decoupling to significantly improve the utility, clarity, and responsiveness of feature-rich interactive map applications.

Traditional vector tiles generators like Tippecanoe are limited to the processing capability of a single VM and require the use of limited formats. These solutions are great for small-scale tile generation workloads when data is already in the right file format. But if you’re like the teams we’ve worked with, you may start small and need to scale past the limits of a single VM, or have a variety of file formats. You just want to generate vector tiles with the data you have, at any scale without having to worry about format conversion steps, configuring infrastructure, partitioning your workload around the capability of a VM, or waiting for workloads to complete.

Vector Tile Generation, or VTiles for WherobotsDB, generates vector tiles in PMTiles format across common data lake formats, incredibly quickly and at a planetary scale, so you can start small and know you have the capability to scale without having to look for another solution. VTiles is incredibly fast because serverless computation is parallelized, and the WherobotsDB engine is optimized for vector tile generation. This means your development teams can spend less time on tile generation and more time building the map applications that matter to your customers.

Using a Tokyo runtime, we generated vector tiles with VTiles for all buildings in the Overture dataset, from zoom levels 4-15 across the entire planet, in 23 minutes. That’s fast and efficient for a planetary-scale operation. You can run the tile-generation-example notebook in the Wherobots Professional tier to experience the speed and simplicity of VTiles yourself. Here’s what this looks like:

Visit our user documentation to start generating vector tiles at-scale.

Try Wherobots now

We look forward to hearing how you put these new capabilities to work, along with your feedback to increase the usefulness of the Wherobots Cloud platform. You can try these new features today by creating a Wherobots Cloud account. WherobotsAI is a professional tier feature.

Please reach out on LinkedIn or connect with us by email at info@wherobots.com.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


The Spatial SQL API brings the performance of WherobotsDB to your favorite data applications

Since its launch last fall, Wherobots has raised the bar for cloud-native geospatial data analytics, offering the first and only platform for working with vector and raster geospatial data together at a planetary scale. Wherobots delivers a significant breadth of geospatial analytics capabilities, built around a cloud-native data lakehouse architecture and query engine that delivers up to 60x better performance than incumbent solutions. Accessible through the powerful notebook experience data scientists and data engineers know and love, Wherobots Cloud is the most comprehensive, approachable, and fully-managed serverless offering for enabling spatial intelligence at scale.

Today, we’re announcing the Wherobots Spatial SQL API, powered by Apache Sedona, to bring the performance of WherobotsDB to your favorite data applications. This opens the door to a world of direct-SQL integrations with Wherobots Cloud, bringing a serverless cloud engine that’s optimized for spatial workloads at any scale into your spatial ETL pipelines and applications, and taking your users and engineers closer to your data and spatial insights.

Register for our release webinar on July 10th here: https://bit.ly/3yFlFYk

Developers love Wherobots because compute is abstracted and managed by Wherobots Cloud. Because it can run at a planetary scale, Wherobots streamlines development and reduces time to insight. It runs on a data lake architecture, so data doesn’t need to be copied into a proprietary storage system, and integrates into familiar development tools and interfaces for exploratory analytics and orchestrating production spatial ETL pipelines.

Utilize Apache Airflow or SQL IDEs with WherobotsDB via the Spatial SQL API

Wherobots Cloud and the Wherobots Spatial SQL API are powered by WherobotsDB, with Apache Sedona at its core: a distributed computation engine that can horizontally scale to handle computation and analytics on any dataset. Wherobots Cloud automatically manages the infrastructure and compute resources of WherobotsDB to serve your use case based on how much computation power you need.

Behind the scenes, your Wherobots Cloud “runtime” defines the amount of compute resources allocated and the configuration of the software environment that executes your workload (in particular for AI/ML use cases, or if your ETL or analytics workflow depends on 1st or 3rd party libraries).

Our always-free Community Edition gives access to a modest “Sedona” runtime for working with small-scale datasets. Our Professional Edition unlocks access to much larger runtimes, up to our “Tokyo” runtime capable of working on planetary-scale datasets, and GPU-accelerated options for your WherobotsAI workloads.

With the release of the Wherobots Spatial SQL API and its client SDKs, you can bring WherobotsDB, the ease-of-use, and the expressiveness of SQL to your Apache Airflow spatial ETL pipelines, your applications, and soon to tools like Tableau, Superset, and other 3rd party systems and applications that support JDBC.

Our customers love applying the performance and scalability of WherobotsDB to their data preparation workflows and their compute-intensive data processing applications.

Use cases include

  • Preparation of nationwide and planetary-scale datasets for their users and customers
  • Processing hundreds of millions of mobility data records every day
  • Creating and analyzing spatial datasets in support of their real estate strategy and decision-making.

Now customers have the option to integrate new tools with Wherobots for orchestration and development of spatial insights using the Spatial SQL API.

How to get started with the Spatial SQL API

By establishing a connection to the Wherobots Spatial SQL API, a SQL session is started, backed by your selected WherobotsDB runtime (a “Sedona” runtime by default, but you can specify larger runtimes if you need more horsepower). Queries submitted through this connection are securely executed against your runtime, with compute fully managed by Wherobots.

We provide client SDKs in Java and in Python to easily connect and interact with WherobotsDB through the Spatial SQL API, as well as an Airflow Provider to build your spatial ETL DAGs; all of which are open-source and available on package registries, as well as on Wherobots’ GitHub page.

Using the Wherobots SQL Driver in Python

Wherobots provides an open-source Python library that exposes a DB-API 2.0 compatible interface for connecting to WherobotsDB. To build a Python application around the Wherobots DB-API driver, add the wherobots-python-dbapi library to your project’s dependencies:

$ poetry add wherobots-python-dbapi

Or directly install the package on your system with pip:

$ pip install wherobots-python-dbapi

From your Python application, establish a connection with wherobots.db.connect() and use cursors to execute your SQL queries and use their results:

import logging
import sys

from wherobots.db import connect
from wherobots.db.region import Region
from wherobots.db.runtime import Runtime

# Optionally, set up logging to get information about the driver's
# activity.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(name)20s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

# Get your API key, or securely read it from a local file.
api_key = '...'

with connect(
    host="api.cloud.wherobots.com",
    api_key=api_key,
    runtime=Runtime.SEDONA,
    region=Region.AWS_US_WEST_2,
) as conn:
    cur = conn.cursor()
    sql = """
        SELECT
            id,
            names['primary'] AS name,
            geometry,
            population
        FROM
            wherobots_open_data.overture_2024_02_15.admins_locality
        WHERE localityType = 'country'
        SORT BY population DESC
        LIMIT 10
    """
    cur.execute(sql)
    results = cur.fetchall()
    print(results)

For more information and future releases, see https://github.com/wherobots/wherobots-python-dbapi-driver on GitHub.

Using the Apache Airflow provider

Wherobots provides an open-source provider for Apache Airflow, defining an Airflow operator for executing SQL queries directly on WherobotsDB. With this new capability, you can integrate your spatial analytics queries, data preparation or data processing steps into new or existing Airflow workflow DAGs.

To build or extend your Airflow DAG using the WherobotsSqlOperator, add the airflow-providers-wherobots dependency to your project:

$ poetry add airflow-providers-wherobots

Define your connection to Wherobots; by default the Wherobots operators use the wherobots_default connection ID:

$ airflow connections add "wherobots_default" \
    --conn-type "wherobots" \
    --conn-host "api.cloud.wherobots.com" \
    --conn-password "$(< api.key)"

Instantiate the WherobotsSqlOperator with your choice of runtime and your SQL query, and integrate it into your Airflow DAG definition:

from wherobots.db.runtime import Runtime
from airflow_providers_wherobots.operators.sql import WherobotsSqlOperator

...

select = WherobotsSqlOperator(
  runtime=Runtime.SEDONA,
  sql="""
          SELECT
              id,
              names['primary'] AS name,
              geometry,
              population
          FROM
              wherobots_open_data.overture_2024_02_15.admins_locality
          WHERE localityType = 'country'
          SORT BY population DESC
          LIMIT 10
      """
)
# select.execute() or integrate into your Airflow DAG definition

apache airflow spatial sql api
For more information and future releases, see https://github.com/wherobots/airflow-providers-wherobots on GitHub.

Using the Wherobots SQL Driver in Java

Wherobots provides an open-source Java library that implements a JDBC (Type 4) driver for connecting to WherobotsDB. To start building Java applications around the Wherobots JDBC driver, add the following line to your build.gradle file’s dependency section:

implementation "com.wherobots:wherobots-jdbc-driver"

In your application, you only need to work with Java’s JDBC APIs from the java.sql package:

import com.wherobots.db.Region;
import com.wherobots.db.Runtime;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Get your API key, or securely read it from a local file.
String apiKey = "...";

Properties props = new Properties();
props.setProperty("apiKey", apiKey);
props.setProperty("runtime", Runtime.SEDONA);
props.setProperty("region", Region.AWS_US_WEST_2);

try (Connection conn = DriverManager.getConnection("jdbc:wherobots://api.cloud.wherobots.com", props)) {
    String sql = """
        SELECT
            id,
            names['primary'] AS name,
            geometry,
            population
        FROM
            wherobots_open_data.overture_2024_02_15.admins_locality
        WHERE localityType = 'country'
        SORT BY population DESC
        LIMIT 10
    """;
  Statement stmt = conn.createStatement();
  try (ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
      System.out.printf("%s: %s %f %s\n",
        rs.getString("id"),
        rs.getString("name"),
        rs.getDouble("population"),
        rs.getString("geometry"));
    }
  }
}

For more information and future releases, see https://github.com/wherobots/wherobots-jdbc-driver on GitHub.

Conclusion

The Wherobots Spatial SQL API takes Wherobots’ vision of hassle-free, scalable geospatial data analytics & AI one step further by making it the easiest way to run your Spatial SQL queries in the cloud. Paired with Wherobots and Apache Sedona’s comprehensive support for working with all geospatial data at any scale and in any format, and with Wherobots AI’s inference features available directly from SQL, the Wherobots Spatial SQL API is also the most flexible and the most capable platform for getting the most out of your data.

Wherobots vision

We exist because creating spatial intelligence at-scale is hard. Our contributions to Apache Sedona, leadership in the open geospatial domain, and investments in Wherobots Cloud have, and will make it easier. Users of Apache Sedona, Wherobots customers, and ultimately any AI application will be enabled to support better decisions about our physical and virtual worlds. They will be able to create solutions to improve these worlds that were otherwise infeasible or too costly to build. And the solutions developed will have a positive impact on society, business, and earth — at a planetary scale.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


Processing A Billion Aircraft Observations And Combining With Weather Data Using Apache Sedona On Wherobots Cloud

In this post we take a look at combining aircraft flight data with weather data to identify potentially impacted flights, using public ADS-B data and weather data from NOAA. Along the way we introduce the types of spatial queries available with WherobotsDB, including spatial range queries, spatial kNN queries, and spatial joins using Spatial SQL, as well as geospatial visualization options and working with various file formats including GeoParquet, GeoTIFF, Shapefiles, and more.

We’ll be running this example in Wherobots Cloud, where you can create a free account to follow along. See the Getting Started With Wherobots Cloud series for an overview of Wherobots Cloud.

Flight Trace Data

The ADS-B flight observation data we’ll be using comes from ADSB.lol which provides publicly accessible daily updates of crowd-sourced flight tracking data. I adapted some great examples for downloading, parsing and enriching this dataset from Mark Litwintschik’s blog post and extracted 1 month of data which resulted in just over 1.1 billion observations. I extended Mark’s approach a bit and saved this data as GeoParquet, partitioned by S2 id. This will enable more efficient spatial queries on the dataset.
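As a rough sketch of that last step, writing the enriched observations as GeoParquet partitioned by S2 cell id looks something like the following (the DataFrame name, its s2 column, and the output path are assumptions for illustration):

# Sketch only: persist enriched ADS-B observations as GeoParquet partitioned by S2 cell id.
# `enriched_df` is assumed to already contain a geometry column and an `s2` cell id column;
# the output path is hypothetical.
enriched_df.write.format("geoparquet") \
    .partitionBy("s2") \
    .mode("overwrite") \
    .save("s3a://my-bucket/flights/2024_s2.parquet")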

First, we’ll import dependencies and configure WherobotsDB to access an S3 bucket with our example data.

from sedona.spark import *

config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
           "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()
sedona = SedonaContext.create(config)

We load one month of world-wide aircraft trace data which is just over 1.1 billion point observations.

# Load flight traces via GeoParquet
traces_df = sedona.read.format("geoparquet").load("s3a://wherobots-examples/data/examples/flights/2024_s2.parquet")
traces_df.createOrReplaceTempView("traces")

traces_df.count()
-------------------
1103371367

Exploratory Data Analysis

I followed Mark’s approach of extracting individual trace observations into their own rows, storing detailed information about each trace (altitude, squawk code, heading, etc.) in a struct. You can find his code in this post.

Let’s view the schema of our Spatial DataFrame that we loaded via GeoParquet.

traces_df.printSchema()
root
 |-- dbFlags: long (nullable = true)
 |-- desc: string (nullable = true)
 |-- icao: string (nullable = true)
 |-- ownOp: string (nullable = true)
 |-- r: string (nullable = true)
 |-- reg_details: struct (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- iso2: string (nullable = true)
 |    |-- iso3: string (nullable = true)
 |    |-- nation: string (nullable = true)
 |-- t: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- trace: struct (nullable = true)
 |    |-- aircraft: struct (nullable = true)
 |    |    |-- alert: long (nullable = true)
 |    |    |-- alt_geom: long (nullable = true)
 |    |    |-- baro_rate: long (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- emergency: string (nullable = true)
 |    |    |-- flight: string (nullable = true)
 |    |    |-- geom_rate: long (nullable = true)
 |    |    |-- gva: long (nullable = true)
 |    |    |-- ias: long (nullable = true)
 |    |    |-- mach: double (nullable = true)
 |    |    |-- mag_heading: double (nullable = true)
 |    |    |-- nac_p: long (nullable = true)
 |    |    |-- nac_v: long (nullable = true)
 |    |    |-- nav_altitude_fms: long (nullable = true)
 |    |    |-- nav_altitude_mcp: long (nullable = true)
 |    |    |-- nav_heading: double (nullable = true)
 |    |    |-- nav_modes: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- nav_qnh: double (nullable = true)
 |    |    |-- nic: long (nullable = true)
 |    |    |-- nic_baro: long (nullable = true)
 |    |    |-- oat: long (nullable = true)
 |    |    |-- rc: long (nullable = true)
 |    |    |-- roll: double (nullable = true)
 |    |    |-- sda: long (nullable = true)
 |    |    |-- sil: long (nullable = true)
 |    |    |-- sil_type: string (nullable = true)
 |    |    |-- spi: long (nullable = true)
 |    |    |-- squawk: string (nullable = true)
 |    |    |-- tas: long (nullable = true)
 |    |    |-- tat: long (nullable = true)
 |    |    |-- track: double (nullable = true)
 |    |    |-- track_rate: double (nullable = true)
 |    |    |-- true_heading: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- version: long (nullable = true)
 |    |    |-- wd: long (nullable = true)
 |    |    |-- ws: long (nullable = true)
 |    |-- altitude: long (nullable = true)
 |    |-- flags: long (nullable = true)
 |    |-- geometric_altitude: long (nullable = true)
 |    |-- geometric_vertical_rate: long (nullable = true)
 |    |-- ground_speed: double (nullable = true)
 |    |-- h3_5: string (nullable = true)
 |    |-- indicated_airspeed: long (nullable = true)
 |    |-- lat: double (nullable = true)
 |    |-- lon: double (nullable = true)
 |    |-- roll_angle: double (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |    |-- track_degrees: double (nullable = true)
 |    |-- vertical_rate: long (nullable = true)
 |-- year: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- date: date (nullable = true)
 |-- s2: long (nullable = true)

We can also inspect the first few rows of data.

traces_df.show(5)
+-------+--------------------+------+-----+------+--------------------+----+-------------------+--------------------+----+--------------------+----------+-------------------+
|dbFlags|                desc|  icao|ownOp|     r|         reg_details|   t|          timestamp|               trace|year|            geometry|      date|                 s2|
+-------+--------------------+------+-----+------+--------------------+----+-------------------+--------------------+----+--------------------+----------+-------------------+
|      0|PIPER PA-28R-180/...|c81c51| null|ZK-RTE|{general, NZ, NZL...|P28R|2024-01-20 00:00:00|{{null, null, nul...|null|POINT (174.722786...|2024-01-20|7854277750134145024|
|      0|        AIRBUS A-320|c81e2c| null|ZK-OJS|{general, NZ, NZL...|A320|2024-01-15 00:00:00|{{0, 36400, 0, A3...|null|POINT (174.549826...|2024-01-15|7854277750134145024|
|      0|PIPER PA-28R-180/...|c81c51| null|ZK-RTE|{general, NZ, NZL...|P28R|2024-01-20 00:00:00|{{null, null, nul...|null|POINT (174.722764...|2024-01-20|7854277750134145024|
|      0|        AIRBUS A-320|c81e2c| null|ZK-OJS|{general, NZ, NZL...|A320|2024-01-15 00:00:00|{{null, null, nul...|null|POINT (174.557251...|2024-01-15|7854277750134145024|
|      0|PIPER PA-28R-180/...|c81c51| null|ZK-RTE|{general, NZ, NZL...|P28R|2024-01-20 00:00:00|{{0, 1600, null, ...|null|POINT (174.722589...|2024-01-20|7854277750134145024|
+-------+--------------------+------+-----+------+--------------------+----+-------------------+--------------------+----+--------------------+----------+-------------------+
only showing top 5 rows

We can query our aircraft traces Spatial DataFrame using Spatial SQL. To start, let’s demonstrate selecting fields and viewing tabular results. Note that the geometry column here is a spatial type; because the data was loaded from GeoParquet we don’t need to parse or cast the field to a geometry type.
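For comparison, if these traces had instead been stored in a format without a native geometry type (for example CSV with a WKT column), we would need to construct the geometry ourselves; the view and column names below are hypothetical:

# Hypothetical example: build a geometry column from WKT text when the source
# format (e.g. CSV) has no native geometry type.
wkt_traces_df = sedona.sql("""
SELECT ST_GeomFromWKT(wkt_geometry) AS geometry, icao, timestamp
FROM traces_csv
""")

Since our data is already GeoParquet, we can query the geometry column directly: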

sedona.sql("""
SELECT desc, ownOp, geometry, trace.ground_speed, trace.altitude, trace.aircraft.flight
FROM traces
WHERE trace.altitude IS NOT NULL AND ownOp IS NOT NULL AND desc IS NOT NULL AND trace.aircraft.flight IS NOT NULL
LIMIT 10
""").show(truncate=False)
+----------------+-------------------+-----------------------------+------------+--------+--------+
|desc            |ownOp              |geometry                     |ground_speed|altitude|flight  |
+----------------+-------------------+-----------------------------+------------+--------+--------+
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.792839 -37.010239)|122.7       |400     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.782449 -37.013079)|161.6       |350     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.779046 -37.01401) |168.9       |475     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.771463 -37.016058)|165.7       |825     |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.765248 -37.017746)|162.5       |1175    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.757827 -37.01976) |159.4       |1550    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.748356 -37.022296)|161.3       |1950    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.738951 -37.024796)|163.2       |2325    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.725607 -37.028395)|165.7       |2725    |DAL64   |
|AIRBUS A-350-900|UMB BANK NA TRUSTEE|POINT (174.713367 -37.03184) |167.3       |3125    |DAL64   |
+----------------+-------------------+-----------------------------+------------+--------+--------+

Let’s see where the aircraft activity is in the ADS-B flight observation dataset. Where are we seeing the most aircraft activity? We can use H3 hexagons to aggregate the individual traces and visualize the results using the SedonaKepler visualization integration. Here we count the number of aircraft trace observations within each level 5 H3 hexagon and create a choropleth map to visualize the spatial distribution.

h3_traces_df = sedona.sql("""
SELECT
    COUNT(*) AS num,
    ST_H3ToGeom(ST_H3CellIDs(geometry, 5, false)) AS geometry,
    ST_H3CellIDs(geometry, 5, false) AS h3
FROM traces
GROUP BY h3
""")

SedonaKepler.create_map(h3_traces_df, name="Distribution of aircraft traces")

Aggregated aircraft traces

Since our data source relies on crowdsourced data retrieved from ground stations, the data coverage is not evenly distributed. We are notably missing observations over the oceans and other areas that lack crowd-sourced ground sensors. This isn’t an issue for this analysis because flight disruption is most likely to occur around airports.

We can filter for aircraft belonging to a specific airline using the ownOp column and also filter for traces that were observed while the aircraft was part of a named flight. This can help give us some insight into the routes flown by this airline.

jbu_h3_traces_df = sedona.sql("""
SELECT
    COUNT(*) AS num,
    ST_H3ToGeom(ST_H3CellIDs(geometry, 5, false)) AS geometry,
    ST_H3CellIDs(geometry, 5, false) AS h3
FROM traces
WHERE ownOp = "JETBLUE AIRWAYS CORP" AND trace.aircraft.flight LIKE "JBU%"
GROUP BY h3
""")

SedonaKepler.create_map(jbu_h3_traces_df, name="Distribution of aircraft traces")

Aggregated Flight Traces

Flight Traces – Spatial Queries And Performance

WherobotsDB supports optimizations for various types of spatial queries including:

  • Spatial range queries
  • Spatial kNN queries
  • Spatial joins

Let’s demonstrate these types of queries and compare the performance with this dataset.

Spatial Range Query

Let’s say we wanted to find all aircraft traces within the bounds of a specific area, perhaps the Gulf of Mexico region. We can do this using the ST_Within Spatial SQL predicate.

Spatial filter polygon

Here we apply this spatial predicate to filter our 1.1 billion observations to only those within the bounds of a polygon representing a specific area. This takes just over 4 seconds.

gulf_filter = 'ST_Within(geometry, ST_GeomFromWKT("POLYGON ((-84.294662 29.840644, -88.952866 30.297018, -89.831772 28.767659, -94.050522 29.61167, -97.038803 27.839076, -97.917709 21.453069, -94.489975 18.479609, -86.843491 21.616579, -80.779037 24.926295, -84.294662 29.840644))"))'
gulf_traces = traces_df.where(gulf_filter)

%%time
gulf_traces.count()
--------------------
7539347

Wall time: 4.37 s

Spatial kNN Query

Another type of spatial query that might be relevant for working with this dataset is a spatial kNN query, which will efficiently find the “k” nearest geometries where the value of k is specified by the user. For example, let’s find the 5 nearest aircraft traces to the Jackson Hole Airport.

jackson_hole_traces = sedona.sql("""
SELECT 
    desc, 
    ownOp,
    trace.aircraft.flight, 
    ST_DistanceSphere(ST_Point(-110.7376, 43.6088), geometry) AS distance
FROM traces
ORDER BY distance ASC
LIMIT 5
""")

%%time

jackson_hole_traces.show()
----------------------------
Wall time: 11.1 s
+--------------------+--------------------+--------+------------------+
|                desc|               ownOp|  flight|          distance|
+--------------------+--------------------+--------+------------------+
|BOMBARDIER BD-700...|         FLEXJET LLC|    null| 76.92424103100763|
|        AIRBUS A-320| UNITED AIRLINES INC|    null|161.98830220247405|
|CESSNA 700 CITATI...|AH CAPITAL MANAGE...|    null|180.05761405860687|
|CESSNA 750 CITATI...|           2FLIU LLC|    null|187.01797656145476|
|DASSAULT FALCON 2000|NEXTERA ENERGY EQ...|N58MW   |204.06240031528034|
+--------------------+--------------------+--------+------------------+

In this case our query takes 11 seconds against a 1.1 billion row dataset.

Historical Impact of Severe Weather Events On Flights

An important analysis related to flight routes is the impact of severe weather events. To enable this type of analysis Wherobots makes public a dataset of severe weather events as part of the Wherobots Open Data Catalog. The dataset has over 8 million weather events reported by NOAA in the US from 2016-2022.

weather = sedona.table("wherobots_pro_data.weather.weather_events")
weather.show()
+--------+-------------+--------+-------------------+-------------------+-----------------+----------+-----------+-----------+-----------+----+--------------+-----+-------+--------------------+
| EventId|         Type|Severity|     StartTime(UTC)|       EndTime(UTC)|Precipitation(in)|  TimeZone|AirportCode|LocationLat|LocationLng|City|        County|State|ZipCode|            geometry|
+--------+-------------+--------+-------------------+-------------------+-----------------+----------+-----------+-----------+-----------+----+--------------+-----+-------+--------------------+
|W-237263|         Cold|  Severe|2016-01-01 08:57:00|2016-01-04 15:32:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237264|         Rain|   Light|2016-01-04 22:49:00|2016-01-04 23:30:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237265|         Cold|  Severe|2016-01-05 00:35:00|2016-01-05 15:32:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237266|         Rain|   Light|2016-01-05 15:32:00|2016-01-05 17:57:00|              0.2|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237267|         Rain|Moderate|2016-01-05 17:57:00|2016-01-05 18:09:00|             0.14|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237268|         Rain|   Light|2016-01-05 18:09:00|2016-01-05 18:23:00|             0.07|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237269|         Rain|Moderate|2016-01-05 18:23:00|2016-01-05 18:57:00|             0.35|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237270|Precipitation|     UNK|2016-01-05 18:57:00|2016-01-05 19:41:00|             0.22|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237271|         Cold|  Severe|2016-01-06 00:57:00|2016-01-06 15:31:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237272|         Rain|   Light|2016-01-06 16:57:00|2016-01-06 18:33:00|             0.16|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237273|Precipitation|     UNK|2016-01-06 18:57:00|2016-01-06 19:10:00|             0.23|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237274|         Rain|Moderate|2016-01-06 19:10:00|2016-01-06 19:57:00|             0.16|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237275|         Rain|   Light|2016-01-06 20:57:00|2016-01-06 21:27:00|             0.03|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237276|         Rain|Moderate|2016-01-06 21:27:00|2016-01-06 22:57:00|             0.26|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237277|         Cold|  Severe|2016-01-07 00:57:00|2016-01-07 15:35:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237278|         Rain|   Light|2016-01-07 18:10:00|2016-01-07 18:57:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237279|         Rain|   Light|2016-01-07 23:57:00|2016-01-08 00:57:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237280|         Cold|  Severe|2016-01-08 00:57:00|2016-01-08 15:29:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237281|         Cold|  Severe|2016-01-09 00:57:00|2016-01-11 14:32:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
|W-237282|         Cold|  Severe|2016-01-12 00:57:00|2016-01-12 15:57:00|              0.0|US/Pacific|       KNSI|    33.2338|  -119.4559|null|Ventura County|   CA|   null|POINT (-119.4559 ...|
+--------+-------------+--------+-------------------+-------------------+-----------------+----------+-----------+-----------+-----------+----+--------------+-----+-------+--------------------+
only showing top 20 rows

Severe Weather Events

Let’s start our weather event analysis by finding areas that are commonly susceptible to severe weather events. We’ll again use H3 hexagons to aggregate the data, this time counting the number of severe weather events that occur within each hexagon. We’ll visualize the results as a choropleth to see areas where weather events frequently occur.

weather_h3_df = sedona.sql("""
SELECT
    COUNT(*) AS num,
    ST_H3ToGeom(ST_H3CellIDs(ST_Buffer(geometry, 0.1), 5, false)) AS geometry,
    ST_H3CellIDs(ST_Buffer(geometry, 0.1), 5, false) AS h3
FROM wherobots_pro_data.weather.weather_events
WHERE Severity = "Severe"
GROUP BY h3
""")

weather_h3_df.createOrReplaceTempView("weather_h3")
SedonaKepler.create_map(weather_h3_df.cache(), name="Severe weather events")

Aggregated weather events

Historical Severe Weather Aircraft Trace Analysis

Now that we’ve determined which areas are susceptible to frequent severe weather events, we can use a spatial join to find the flight trace observations that fall within those areas. Note that we’re joining a table with 1.1 billion observations against our aggregated weather event data, which takes just under 13 seconds in Wherobots Cloud.

severe_weather_flight_traces_df = sedona.sql("""
SELECT traces.geometry AS geometry, desc, ownOp, trace.aircraft.flight AS flight, num AS eventCount
FROM traces
JOIN weather_h3
WHERE ST_Contains(weather_h3.geometry, traces.geometry) AND trace.aircraft.flight IS NOT null
""")
severe_weather_flight_traces_df.createOrReplaceTempView("severe_weather_traces")

%%time
severe_weather_flight_traces_df.count()
----------------------------
Wall time: 12.7 s

53788395

Using the flight number we can calculate an “intensity” metric that attempts to measure the potential for a flight to be disrupted by severe weather, based on the historical weather data. Here we calculate this intensity metric across all traces to find the most potentially impacted flights.

most_impacted_flights = sedona.sql("""
SELECT sum(eventCount) AS intensity, ST_Collect(collect_list(geometry)) AS geometry, flight
FROM severe_weather_traces
WHERE NOT flight = "        " AND NOT flight = "00000000"
GROUP BY flight
ORDER BY intensity DESC
LIMIT 100
""")

Most impacted flights

Flight Routes

So far we’ve been working with individual aircraft traces (point observations), but it’s much more useful to analyze flights as routes. We can construct individual flight routes from these aircraft traces for a specific airline by filtering on the ownOp and trace.aircraft.flight fields.

First we collect all aircraft traces that match our filter by airline, grouping by flight number. Then we can convert this collection of point geometries into a LineString geometry that represents the path the aircraft took during this flight using the ST_MakeLine Spatial SQL function.

jbu_flights_df = sedona.sql("""
WITH jbu_traces AS (
    SELECT 
        any_value(trace.aircraft.flight) AS flight, 
        collect_list(ST_Point(trace.lon, trace.lat)) AS points, 
        any_value(CAST(date AS string)) AS date
    FROM traces
    WHERE ownOp = 'JETBLUE AIRWAYS CORP' AND trace.aircraft.flight IS NOT null
    GROUP BY trace.aircraft.flight, date
)
SELECT flight, ST_MakeLine(points) AS geometry, date 
FROM jbu_traces
WHERE size(points) > 100
""")

jbu_flights_df.show()
+--------+--------------------+----------+
|  flight|            geometry|      date|
+--------+--------------------+----------+
|JBU100  |LINESTRING (-112....|2024-01-24|
|JBU101  |LINESTRING (-80.1...|2024-01-21|
|JBU1016 |LINESTRING (-77.3...|2024-01-20|
|JBU1044 |LINESTRING (-80.1...|2024-01-10|
|JBU1053 |LINESTRING (-73.7...|2024-01-03|
|JBU1055 |LINESTRING (-71.0...|2024-01-03|
|JBU1060 |LINESTRING (-77.4...|2024-01-12|
|JBU1060 |LINESTRING (-77.4...|2024-01-26|
|JBU1067 |LINESTRING (-73.7...|2024-01-17|
|JBU1068 |LINESTRING (-80.0...|2024-01-24|
|JBU1114 |LINESTRING (-112....|2024-01-17|
|JBU1131 |LINESTRING (-80.2...|2024-01-22|
|JBU1136 |LINESTRING (-83.3...|2024-01-15|
|JBU1142 |LINESTRING (-80.0...|2024-01-27|
|JBU1196 |LINESTRING (-80.1...|2024-01-03|
|JBU1196 |LINESTRING (-80.1...|2024-01-27|
|JBU1217 |LINESTRING (-81.6...|2024-01-15|
|JBU1230 |LINESTRING (-86.8...|2024-01-22|
|JBU1238 |LINESTRING (-97.6...|2024-01-28|
|JBU1251 |LINESTRING (-73.7...|2024-01-10|
+--------+--------------------+----------+
only showing top 20 rows

Flight routes

Identify Potential For Weather Impacted Flights

Now that we’ve extracted flight routes from the individual aircraft trace point observations we can begin to determine the potential for severe weather events to impact these flights based on our historical weather event data.

Creating Uniform Flight Route Segments

To analyze which flights have the most potential to be impacted based on historical weather events we first create 100km segments of each flight which will allow us to identify not only entire flight routes that are potentially impacted by weather events, but specific flight segments with the most risk of disruption.

To do this we’ll make use of the ability to create a user defined function in WherobotsDB using the Shapely library. We’ll use Shapely’s line interpolate functionality to create equal length segments of each flight route.

from shapely.geometry import LineString, MultiLineString, Point
from sedona.sql.types import GeometryType

# Define a Shapely UDF to split each flight route into consistent-length line segments
def interpolate_lines(line: GeometryType()):
    try:
        distance_delta = 1.0  # segment length in degrees, roughly 100km
        points_count = int(line.length // distance_delta) + 1
        distances = (distance_delta * i for i in range(points_count))
        # Interpolate evenly spaced points along the route, keeping the final endpoint
        points = [line.interpolate(distance) for distance in distances] + [Point(line.coords[-1])]
        line_strings = []
        for i in range(len(points) - 1):
            line_strings.append(LineString([points[i], points[i + 1]]))
        return MultiLineString(line_strings)
    except Exception:
        return None

sedona.udf.register("interpolate_lines", interpolate_lines, GeometryType())

We can now call our user defined function in a spatial SQL statement and leverage WherobotsDB’s parallel execution to run this UDF.

split_df = sedona.sql("""
SELECT flight, date, explode(ST_Dump(interpolate_lines(geometry))) AS geometry
FROM jbu_flights
""")

split_df.createOrReplaceTempView("split_flights")
split_df.show()
+--------+----------+--------------------+
|  flight|      date|            geometry|
+--------+----------+--------------------+
|JBU100  |2024-01-24|LINESTRING (-118....|
|JBU100  |2024-01-24|LINESTRING (-117....|
|JBU100  |2024-01-24|LINESTRING (-116....|
|JBU100  |2024-01-24|LINESTRING (-115....|
|JBU100  |2024-01-24|LINESTRING (-114....|
|JBU100  |2024-01-24|LINESTRING (-113....|
|JBU100  |2024-01-24|LINESTRING (-112....|
|JBU100  |2024-01-24|LINESTRING (-111....|
|JBU100  |2024-01-24|LINESTRING (-110....|
|JBU100  |2024-01-24|LINESTRING (-109....|
|JBU100  |2024-01-24|LINESTRING (-108....|
|JBU100  |2024-01-24|LINESTRING (-107....|
|JBU100  |2024-01-24|LINESTRING (-106....|
|JBU100  |2024-01-24|LINESTRING (-105....|
|JBU100  |2024-01-24|LINESTRING (-104....|
|JBU100  |2024-01-24|LINESTRING (-103....|
|JBU100  |2024-01-24|LINESTRING (-102....|
|JBU100  |2024-01-24|LINESTRING (-102....|
|JBU100  |2024-01-24|LINESTRING (-101....|
|JBU100  |2024-01-24|LINESTRING (-100....|
+--------+----------+--------------------+
only showing top 20 rows

Joining Flight Route Segments With Weather Events

Now we perform a spatial join to identify flight segments that pass through areas that historically have a high frequency of severe weather events.

severe_weather_split_df = sedona.sql("""
SELECT flight, date, split_flights.geometry AS geometry, num 
FROM split_flights
JOIN weather_h3
WHERE ST_Intersects(weather_h3.geometry, split_flights.geometry)
""")

severe_weather_split_df.show()

This gives us a measure of which individual flight segments have the highest potential for weather impact.

+--------+----------+--------------------+----+
|  flight|      date|            geometry| num|
+--------+----------+--------------------+----+
|JBU2288 |2024-01-02|LINESTRING (-69.3...|1221|
|JBU2288 |2024-01-02|LINESTRING (-70.3...|1774|
|JBU2288 |2024-01-02|LINESTRING (-71.2...|  42|
|JBU2288 |2024-01-02|LINESTRING (-71.2...|1192|
|JBU2288 |2024-01-02|LINESTRING (-71.2...|1231|
|JBU2288 |2024-01-02|LINESTRING (-72.2...| 672|
|JBU2288 |2024-01-02|LINESTRING (-72.2...| 959|
|JBU2288 |2024-01-02|LINESTRING (-73.2...| 652|
|JBU2288 |2024-01-02|LINESTRING (-73.2...| 672|
|JBU1053 |2024-01-03|LINESTRING (-73.7...| 416|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 217|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 467|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 487|
|JBU2424 |2024-01-07|LINESTRING (-81.2...| 611|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 885|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 722|
|JBU2424 |2024-01-07|LINESTRING (-81.3...|2237|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 722|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 618|
|JBU2424 |2024-01-07|LINESTRING (-81.3...| 607|
+--------+----------+--------------------+----+
only showing top 20 rows

Impacted flight segments

Flights With The Highest Potential For Weather Impact

Now we can join the flight segments and their corresponding weather intensity impact scores with the full flight routes to visualize the flights with the highest potential for weather impact based on the historical weather events.
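The join below references a jbu_intensity view that isn’t created in the snippets above. A minimal sketch, assuming it simply sums the per-segment severe weather event counts by flight and date:

# Hypothetical reconstruction of the `jbu_intensity` view used below: sum the
# per-segment severe weather event counts for each JetBlue flight and date.
severe_weather_split_df.createOrReplaceTempView("severe_weather_split")

jbu_intensity_df = sedona.sql("""
SELECT flight, date, sum(num) AS intensity
FROM severe_weather_split
GROUP BY flight, date
""")
jbu_intensity_df.createOrReplaceTempView("jbu_intensity")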

full_flight_intensity_df = sedona.sql("""
SELECT jbu_flights.flight, jbu_flights.geometry, jbu_flights.date, intensity
FROM jbu_flights
JOIN jbu_intensity ON jbu_flights.flight = jbu_intensity.flight AND jbu_flights.date = jbu_intensity.date
""")

map_view = SedonaKepler.create_map(full_flight_intensity_df.orderBy("intensity", ascending=False).limit(10), name="Flights")
SedonaKepler.add_df(map_view, df=weather_h3_df, name="Weather")
map_view

Impacted flight routes

Doppler Weather Raster Data

So far we’ve been working with historical weather event data; let’s now look at a simple example using more real-time weather radar data. Specifically, we’ll use NOAA’s NEXRAD Doppler weather radar surface reflectivity product to assess the impact of high precipitation at airports.

We’ll load a GeoTiff raster image of the continental US to identify weather patterns, then join this raster data with the location of US airports.

First, we load the raster image from S3 using the RS_FromPath function which will create an out-db raster in WherobotsDB.

nexrad_raster_url = "s3://wherobots-examples/data/noaa/nexrad/n0q_202309231625.tif"
nexrad_df = sedona.sql(f"SELECT RS_FromPath('{nexrad_raster_url}') AS raster")
nexrad_df.createOrReplaceTempView("nexrad")
nexrad_df.printSchema()
root
 |-- raster: raster (nullable = true)

Inspecting the metadata for this raster image we can see the spatial extent, spatial resolution, CRS, and number of bands.

sedona.sql("SELECT RS_MetaData(raster) from nexrad").show(truncate=False)
+---------------------------------------------------------------------------+
|rs_metadata(raster)                                                        |
+---------------------------------------------------------------------------+
|[-126.0025, 50.0025, 12200.0, 5400.0, 0.005, -0.005, 0.0, 0.0, 4326.0, 1.0]|
+---------------------------------------------------------------------------+

We can visualize the raster using RS_AsImage.

htmlDf = sedona.sql("SELECT RS_AsImage(raster, 1200) AS raster_image FROM nexrad")
SedonaUtils.display_image(htmlDf)

Doppler weather radar

Next, we’ll load the point geometries of airports from a Shapefile.

spatialRdd = ShapefileReader.readToGeometryRDD(sedona.sparkContext, "s3a://wherobots-examples/data/ne_50m_airports")
poi_df = Adapter.toDf(spatialRdd, sedona)

poi_df.show(5)
poi_df.createOrReplaceTempView("airports")
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|            geometry|scalerank|featurecla| type|            name|abbrev|location|gps_code|iata_code|           wikipedia|natlscale|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|POINT (113.935016...|        2|   Airport|major| Hong Kong Int'l|   HKG|terminal|    VHHH|      HKG|http://en.wikiped...|  150.000|
|POINT (121.231370...|        2|   Airport|major|         Taoyuan|   TPE|terminal|    RCTP|      TPE|http://en.wikiped...|  150.000|
|POINT (4.76437693...|        2|   Airport|major|        Schiphol|   AMS|terminal|    EHAM|      AMS|http://en.wikiped...|  150.000|
|POINT (103.986413...|        2|   Airport|major|Singapore Changi|   SIN|terminal|    WSSS|      SIN|http://en.wikiped...|  150.000|
|POINT (-0.4531566...|        2|   Airport|major| London Heathrow|   LHR| parking|    EGLL|      LHR|http://en.wikiped...|  150.000|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
only showing top 5 rows

And now we’re ready to join the raster radar data to our vector airport locations. This operation retrieves the pixel value (in this case a precipitation index value) from the raster at the location of each airport. We’ll filter for any airports where this index value is above 100, indicating significant precipitation.


airport_precip_df = sedona.sql("""
SELECT name, iata_code, type, RS_Value(raster, geometry, 1) AS precip
FROM airports
JOIN nexrad 
""")

airport_precip_df.where("precip > 100").show(truncate=False)
+-------------------+---------+-----+------+
|name               |iata_code|type |precip|
+-------------------+---------+-----+------+
|Gen E L Logan Int'l|BOS      |major|108.0 |
|Durham Int'l       |RDU      |major|138.0 |
+-------------------+---------+-----+------+

Based on this analysis we can see that Logan International Airport and Raleigh-Durham Airport are experiencing high precipitation events, which may potentially impact ground operations.

Resources

This blog post looked at combining aircraft trace observations from public ADS-B sensors with weather event and radar data to analyze the potential impact of severe weather events on flights.

You can follow along with these examples by creating a free account on Wherobots Cloud and find the code for this example on GitHub.


Learn more about this use case and Wherobots Cloud in a live demo Flight Analysis With Wherobots Cloud on June 20th at 8am Pacific time.

Register now.

Raster Data Analysis, Processing Petabytes of Agronomic Data, Overview of Sedona 1.5, and Unlocking The Spatial Frontier – This Month In Wherobots

Welcome to This Month In Wherobots, the monthly newsletter for data practitioners in the Apache Sedona and Wherobots community. This month we’re exploring raster data analysis with Spatial SQL, processing petabytes of agronomic data with Apache Sedona, a deep dive on new features added in the 1.5 release series, and an overview of working with files in Wherobots Cloud.

Raster Data Analysis With Spatial SQL & Apache Sedona

One of the strengths of Apache Sedona and Wherobots Cloud is the ability to work with large scale vector and raster geospatial data together using Spatial SQL. This post (and video) takes a look at how to get started working with raster data in Sedona using Spatial SQL and some of the use cases for raster data analysis including vector / raster join operations, zonal statistics, and using raster map algebra.

Read The Article: Raster Data Analysis With Spatial SQL & Apache Sedona

Featured Community Member: Luiz Santana

This month’s Wherobots & Apache Sedona featured community member is Luiz Santana. Luiz is Co-Founder and CTO of Leaf Agriculture. He has extensive experience as a former data architect and developer. Luiz has a PhD in Computer Science from Universidade Federal de Santa Catarina and spent time researching data processing and integration in highly scalable environments. Leaf Agriculture is building the unified API for food and agriculture by leveraging large-scale agronomic data. Luiz has given several presentations at conferences such as Apache Sedona: How To Process Petabytes of agrnomic data with Spark, and Perspectives on the use of data in Agriculture, which covers how Leaf uses Apache Sedona to analyze large-scale agricultural data and how Sedona fits into their stack alongside other technologies. Thank you Luiz for your work with the Apache Sedona and Wherobots community and sharing your knowledge and experience!

Apache Sedona: How To Process Petabytes of Agronomic Data With Spark

In this presentation from The Developer’s Conference Luiz Santana shares the experience of using Apache Sedona at Leaf Agriculture to process petabytes of agronomic data from satellites, agricultural machines, drones and other sensors. He discusses how Leaf uses Sedona for tasks such as geometry intersections, geographic searches, and polygon transformations with high performance and speed. Luiz also presented Perspectives On The Use of Data in Agriculture which covers some of the data challenges that Leaf handles and an overview of the technologies used to address these challenges, including Apache Sedona.

See The Slides From The Presentation

Working With Files – Getting Started With Wherobots Cloud

https://wherobots.com/wp-content/uploads/2024/02/Screenshot_2024-02-16_at_9.44.20_AM-1024x842.png

This post takes a look at loading and working with our own data in Wherobots Cloud as well as creating and saving data as the result of our analysis, such as the end result of a data pipeline. It covers importing files in various formats including CSV, GeoJSON, Shapefile, and GeoTIFF in Wherobots Cloud, working with AWS S3 cloud object storage, and creating GeoParquet files using Apache Sedona.

Read The Post: Working With Files – Getting Started With Wherobots Cloud

Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both raster and vector data

https://wherobots.com/wp-content/uploads/2024/02/Screenshot-2024-02-08-at-2.13.52-PM.png

The 1.5 series of Apache Sedona represents a leap forward in geospatial processing that adds essential features and enhancements to make Sedona a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data analysis. This post covers XYZM coordinates and SRID, vector and raster joins, raster data manipulation, visualization with SedonaKepler and SedonaPyDeck, GeoParquet reading and writing, H3 hexagons, and new cluster compute engine support.

Read The Post: Introducing Sedona 1.5 – Making Sedona The Most Comprehensive & Scalable Spatial Data Processing and ETL Engine For Both Raster and Vector Data

Unlocking the Spatial Frontier: The Evolution and Potential of spatial technology in Apple Vision Pro and Augmented Reality Apps

https://wherobots.com/wp-content/uploads/2024/02/VisionPro.001.png

Apple adopted the term “spatial computing” when announcing the Apple Vision Pro to describe its new augmented reality platform. This post from Wherobots CEO Mo Sarwat examines spatial computing in the context of augmented reality experiences to explore spatial object localization and presentation and the role of spatial query processing and spatial data analytics in Apple Vision Pro.

Read The Post: Unlocking The Spatial Frontier

Raster Data Analysis With Spatial SQL And Apache Sedona

One of the strengths of Apache Sedona and Wherobots Cloud is the ability to work with large scale vector and raster geospatial data together using Spatial SQL. In this post we’ll take a look at how to get started working with raster data in Sedona using Spatial SQL and some of the use cases for raster data analysis including vector / raster join operations, zonal statistics, and using raster map algebra.

Raster Data Overview

Raster data refers to gridded data where each pixel has one or more values associated with it (called bands) and each pixel references a geographic location. Common examples of raster data include aerial imagery, elevation models, precipitation maps, and population distribution datasets.

This image from the National Ecological Observatory Network (NEON) does a great job of introducing raster data concepts using an example of aerial imagery. We can see the value of each pixel, the coordinates of each pixel in pixel space, and how pixel values map to the color rendered when the raster is visualized. We also see the spatial resolution of the data as measured by the height of each pixel, in this case 1 meter. This means that each pixel represents 1 square meter of the Earth’s surface.

https://wherobots.com/wp-content/uploads/2024/02/raster_concept.png

When working with raster data in Sedona we store rasters and their associated metadata as rows in a table and query them using Spatial SQL. Additionally, large rasters can be tiled so that each row in the table represents a single tile of a larger raster image.

https://wherobots.com/wp-content/uploads/2024/02/earth_observation_table.png

One of the tools available to us when analyzing raster data is map algebra: performing calculations using the individual pixel values. We can perform map algebra operations across multiple bands in a single raster (useful when computing indexes like NDVI) and also map algebra operations across different rasters (useful for time series analysis).

https://wherobots.com/wp-content/uploads/2024/02/map_algebra.png

There are several map algebra operations available using Spatial SQL, but in the image above we can see the RS_MapAlgebra function being used to compute the difference in average surface temperature between two years.

Let’s see some examples of how to get started working with raster data in Sedona. If you’d like to follow along you can create a free Wherobots Cloud account and get started today.

Load A Single Raster

Let’s start with a simple example of loading a single GeoTiff from the NEON high resolution imagery dataset. We’ll then perform some basic map algebra operations on the image and introduce some new Spatial SQL functions along the way.

Sedona supports a number of raster data formats, including Arc/Info ASCII Grid, NetCDF, and GeoTiff. See the raster loaders documentation page for more. Rasters can be loaded as "out-of-database" (out-db) rasters from a remote file path or as "in-database" (in-db) rasters. Out-db rasters allow for managing large raster datasets stored on cloud storage and enable efficient handling of remote data, while the in-db format is useful when the raster data needs to be managed within the database because of data integrity or access efficiency requirements. In most cases we want to use the out-db raster format.

We can use the RS_FromPath Spatial SQL function to create an out-db raster from a remote file path. I’ve loaded the example GeoTiff we’ll be using into a public S3 bucket, so when instantiating our SedonaContext we’ll need to configure anonymous access to this bucket.

config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

Next, we load the GeoTiff and create an out-db raster using RS_FromPath, specifying the S3 URI for the single GeoTiff.

file_url = "s3://wherobots-examples/data/examples/NEON_ortho.tif"
neon_df = sedona.sql(f"SELECT RS_FromPath('{file_url}') AS raster")
neon_df.createOrReplaceTempView("neon")
neon_df.printSchema()

We can see that our DataFrame has a single column, raster, which is of type raster.

root
 |-- raster: raster (nullable = true)

Once we’ve loaded a raster we can inspect the metadata using the RS_MetaData Spatial SQL function.

sedona.sql("SELECT RS_MetaData(raster) from neon").show(truncate=False)

This function returns the upper left coordinates of the raster (in the units of the raster’s coordinate system), the width and height of the raster (in pixels), the spatial resolution of each pixel (in units of the raster’s CRS), the skew or rotation of the raster (if any), the SRID of the raster’s coordinate system, and the number of bands in the raster.

+------------------------------------------------------------------------+
|rs_metadata(raster)                                                     |
+------------------------------------------------------------------------+
|[470000.0, 7228000.0, 1000.0, 1000.0, 1.0, -1.0, 0.0, 0.0, 32606.0, 3.0]|
+------------------------------------------------------------------------+

Based on this output we can observe a few important details about the raster:

  • the upper left coordinate is 470000, 7228000 in the raster’s CRS
  • the raster is 1000 pixels by 1000 pixels
  • each pixel represents 1 square meter which means the raster image represents 1 square kilometer (1000 pixels x 1 meter)
  • the SRID is 32606, which is UTM zone 6N
  • the raster has 3 bands, representing the red, green, and blue color spectrums

There are a number of other Spatial SQL functions that can be used to inspect similar metadata values, such as RS_NumBands:

sedona.sql("SELECT RS_NumBands(raster) FROM neon").show()
+-------------------+
|rs_numbands(raster)|
+-------------------+
|                  3|
+-------------------+

We can visualize the raster using the RS_AsImage function.

htmlDf = sedona.sql("SELECT RS_AsImage(raster, 500) FROM neon")
SedonaUtils.display_image(htmlDf)

Visualizing a single GeoTiff aerial image

To obtain pixel values at specific coordinates we can use the RS_Value or RS_Values function. Note that for rasters with multiple bands we must specify which band we are interested in. We’ll also need to specify the SRID of our geometry (in this case a Point) to match that of the raster.

Here we retrieve the value of a pixel near the upper left of the image for band 1, which represents the red color spectrum.

sedona.sql("""
SELECT RS_Value(raster, ST_SetSRID(ST_Point(470000, 7227880), 32606), 1) 
AS pixel_value
FROM neon
""").show(truncate=False)
+-----------+
|pixel_value|
+-----------+
|127.0      |
+-----------+
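RS_Values works the same way but accepts an array of points and returns one value per point. A quick sketch, assuming the variant that takes a band index as its third argument (the second point is an arbitrary location inside the raster’s extent):

# Sketch: look up several pixel values at once with RS_Values (band 1).
sedona.sql("""
SELECT RS_Values(raster,
                 array(ST_SetSRID(ST_Point(470000, 7227880), 32606),
                       ST_SetSRID(ST_Point(470500, 7227500), 32606)),
                 1) AS pixel_values
FROM neon
""").show(truncate=False)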

Map Algebra Operations

Performing raster math is one of the most common and powerful workflows when analyzing raster data. Sedona supports a number of band-specific raster math operations such as:

  • RS_Add, RS_Divide, RS_Mean, RS_NormalizedDifference, etc
  • RS_MapAlgebra

RS_MapAlgebra is used to apply a map algebra script to one or more rasters, for example to calculate NDVI (to identify vegetation) or AWEI (to identify bodies of water). Let’s calculate the Normalized Difference Greenness Index (NDGI) for our raster. NDGI is a metric similar to NDVI, but it can be computed using only the visible spectrum (RGB), whereas NDVI also requires near infrared.

For each pixel, NDGI is calculated by dividing the difference of the green and red band values by the sum of the green and red band values, i.e. NDGI = (green - red) / (green + red), which yields a value between -1 and 1. Values greater than zero indicate vegetation, while values less than zero indicate the absence of vegetation.

NDGI formula

To calculate NDGI we will use the RS_MapAlgebra function which uses a map algebra scripting language called Jiffle to express the computation to be performed. Note that the result of this operation is a new raster with a single band that represents the NDGI value for each pixel.

ndgi_df = sedona.sql("""
SELECT 
  RS_MapAlgebra(raster, 'D', 'out = ((rast[1] - rast[0]) / (rast[1] + rast[0]));') 
AS raster
FROM neon
""")
ndgi_df.createOrReplaceTempView("ndgi")

We can confirm the NDGI values are in the expected range by using the RS_SummaryStats function which will return the count, sum, mean, standard deviation, minimum, and maximum values of a single band in the raster.

sedona.sql("SELECT RS_SummaryStats(raster) AS stats FROM ndgi").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------+
|stats                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------+
|[1000000.0, -10863.97160595989, -0.010863971605958938, 0.045695423898689726, -0.1493212669683258, 0.18421052631578946]|
+----------------------------------------------------------------------------------------------------------------------+

We can evaluate single pixel values using the RS_Value function. First, selecting a pixel from the upper left area of the image that visually appears to be dirt or gravel we can confirm the value is less than zero.

sedona.sql("""
SELECT RS_Value(raster, ST_SetSRID(ST_Point(470000, 7227880), 32606), 1) 
AS pixel_value
FROM ndgi
""").show(truncate=False)
+--------------------+
|pixel_value         |
+--------------------+
|-0.09956709956709957|
+--------------------+

Next, choosing a pixel from the lower left area of the image that appears to be tree cover we can confirm the value is greater than zero.

sedona.sql("""
SELECT RS_Value(raster, ST_SetSRID(ST_Point(470700, 7228000), 32606), 1)
AS pixel_value
FROM ndgi
""").show(truncate=False)
+-------------------+
|pixel_value        |
+-------------------+
|0.03636363636363636|
+-------------------+

We can also export our new NDGI raster as a GeoTiff for further analysis, either directly in the Wherobots Notebook environment or to S3 using the RS_AsGeoTiff function. Here we save the raster as a GeoTiff in the notebook environment.

binary_raster_df = sedona.sql("SELECT RS_AsGeoTiff(raster) AS raster FROM ndgi")
binary_raster_df.write.format("raster").mode("overwrite").save("ndgi.tif")

Load Multiple Rasters

Previously we loaded a single raster into a single row. In a typical workflow we work with many rasters across many rows. Let’s load a raster dataset from the WorldClim historical climate dataset of precipitation from 1970-2000. Each raster file represents the average precipitation, across all observed years, for one month of the year.

Instead of the RS_FromPath function we used previously, this time we’ll use the binary file reader to load each raster file and then RS_FromGeoTiff to convert each row to a raster. We do this so we also have access to the file path, which we’ll use to determine the month for each raster since the month is encoded in the filename.

PREC_URL = "s3://wherobots-examples/data/examples/world_clim/wc2.1_10m_prec" #/wc2.1_10m_prec_01.tif
rawDf = sedona.read.format("binaryFile").load(PREC_URL + "/*.tif")
rawDf.createOrReplaceTempView("rawdf")

We’ll use a regular expression to match on the file name and extract the month for each raster, which is the last two digits of the filename.

rasterDf = sedona.sql("""
SELECT 
  RS_FromGeoTiff(content) AS raster, 
  Int(regexp_extract(path, '(.*)([0-9]{2}).tif', 2)) AS month
FROM rawdf
""")

rasterDf.createOrReplaceTempView("prec")
rasterDf.printSchema()
root
 |-- raster: raster (nullable = true)
 |-- month: integer (nullable = true)

To determine the average monthly precipitation for a certain location we can use the RS_Value function.

sedona.sql("""
SELECT 
  month, 
  RS_Value(raster, ST_POINT(-113.9940, 46.8721)) AS avg_prec 
  FROM prec ORDER BY month ASC
""").show(truncate=False)
+-----+--------+
|month|avg_prec|
+-----+--------+
|1    |43.0    |
|2    |27.0    |
|3    |31.0    |
|4    |31.0    |
|5    |50.0    |
|6    |49.0    |
|7    |27.0    |
|8    |31.0    |
|9    |31.0    |
|10   |29.0    |
|11   |35.0    |
|12   |40.0    |
+-----+--------+

Zonal Statistics

Zonal statistics involves joining vector geometries to the raster and calculating statistical or aggregating values based on the pixel values that fall within each vector geometry. Let’s load a dataset of country boundaries and calculate the average yearly precipitation for each country. To do this we’ll make use of the RS_ZonalStats function to compute the average of all precipitation values within each country’s boundaries.

Note that since the RS_ZonalStats function will return a value for each county for each month we will sum these monthly averages together to end up with the yearly average for each country.
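The query below assumes a countries view with name and geometry columns, which isn’t created in this excerpt. A minimal sketch of loading it from a Natural Earth countries Shapefile (the S3 path is hypothetical) mirrors the counties example later in this post:

# Hypothetical: load country boundaries and register the `countries` view used below.
countries_rdd = ShapefileReader.readToGeometryRDD(sedona, "s3://my-bucket/ne_110m_admin_0_countries")
countries_df = Adapter.toDf(countries_rdd, sedona)
countries_df.createOrReplaceTempView("countries")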

world_prec_df = sedona.sql("""
SELECT 
  sum(RS_ZonalStats(prec.raster, countries.geometry, 1, 'avg', true)) AS yearly_avg_prec,  
  any_value(countries.geometry), 
  countries.name
FROM prec, countries
GROUP BY name
ORDER BY yearly_avg_prec DESC
""")
world_prec_df.dropna().show()
+------------------+--------------------+--------------------+
|   yearly_avg_prec| any_value(geometry)|                name|
+------------------+--------------------+--------------------+
|            4937.5|MULTIPOLYGON (((1...|          Micronesia|
|            3526.0|MULTIPOLYGON (((1...|               Palau|
|           3378.75|MULTIPOLYGON (((-...|               Samoa|
|         3345.6875|MULTIPOLYGON (((1...|              Brunei|
| 3234.474358974358|MULTIPOLYGON (((1...|         Solomon Is.|
|            3111.0|MULTIPOLYGON (((-...|        Saint Helena|
|            3008.0|MULTIPOLYGON (((-...|Wallis and Futuna...|
|2988.5508720930234|MULTIPOLYGON (((1...|    Papua New Guinea|
| 2934.628571428572|MULTIPOLYGON (((1...|             Vanuatu|
| 2881.220250521921|MULTIPOLYGON (((1...|            Malaysia|
|2831.0000000000005|MULTIPOLYGON (((-...|          Costa Rica|
|2807.0000000000005|MULTIPOLYGON (((-...|          Faeroe Is.|
| 2733.146385760997|MULTIPOLYGON (((1...|           Indonesia|
|2717.5348837209303|MULTIPOLYGON (((-...|        Sierra Leone|
| 2675.306306306307|MULTIPOLYGON (((-...|              Panama|
|2656.3900709219856|POLYGON ((-11.476...|             Liberia|
|2631.9744668068433|MULTIPOLYGON (((-...|            Colombia|
|2560.3461538461534|MULTIPOLYGON (((-...|                Fiji|
|2555.6666666666665|MULTIPOLYGON (((6...|São Tomé and Pr...|
|2525.6388261851025|MULTIPOLYGON (((1...|         Philippines|
+------------------+--------------------+--------------------+
only showing top 20 rows

Visualizing the results of this zonal statistics analysis we can see that the countries with the highest average annual precipitation lie mostly along the equator.

yearly average precipitation

Tiling Rasters

If we are working with large rasters we typically want to break them up into multiple tiles to improve the efficiency of raster operations. Let’s look at an example using a NOAA dataset of night time visible lights. A single raster image represents the intensity of light visible at night across the entire surface of the Earth. There are several data products derived from this dataset that infer measures such as population or economic output, but we’ll use the average annual visible light for a single year.

Night time visible lights

We can use the RS_FromPath function to load this raster into Sedona.

f10_1993_uri = "s3://wherobots-examples/data/examples/DMSP_OLS/F101993.v4b.global.stable_lights.avg_vis.tif"
f10_1993_df = sedona.sql(f"SELECT RS_FromPath('{f10_1993_uri}') AS raster")
f10_1993_df.createOrReplaceTempView("f10_1993")

This gives us a single row with a single raster. Let’s use the RS_TileExplode function to break our single large raster into equally sized tiles, with one tile per row.

tile_df = sedona.sql("SELECT RS_TileExplode(raster, 256, 256) AS (x, y, tile) FROM f10_1993")
tile_df.show(5)
+---+---+--------------------+
|  x|  y|                tile|
+---+---+--------------------+
|  0|  0|OutDbGridCoverage...|
|  1|  0|OutDbGridCoverage...|
|  2|  0|OutDbGridCoverage...|
|  3|  0|OutDbGridCoverage...|
|  4|  0|OutDbGridCoverage...|
+---+---+--------------------+

So far we’ve been querying our data as Sedona Spatial DataFrames without persisting them. To persist these tables we can save our DataFrames as Havasu tables. Havasu is a spatial table format based on Apache Iceberg that Sedona uses to store geospatial data and that enables efficient geospatial operations. You can read more about Havasu in this post.

By default in Wherobots Cloud the wherobots catalog is configured as a namespace for creating and storing databases that will be persisted in S3 cloud storage private to our Wherobots user.

Here we save our tiled raster table to a database called test_db in the wherobots catalog, thus the full name for this table is wherobots.test_db.f10_1993

sedona.sql("DROP TABLE IF EXISTS wherobots.test_db.f10_1993")
tile_df.writeTo("wherobots.test_db.f10_1993").create()

We can then refer to this table during queries. Here we count the number of rows in the table, which in this case is the number of tiles.

sedona.table("wherobots.test_db.f10_1993").count()
-----------------------
11154

To visualize the boundaries of each tile we can use SedonaKepler, along with the RS_Envelope function which will return the bounding box of each tile. Note that each tile also has x,y coordinates which represents its position relative to other tiles.

from pyspark.sql.functions import expr

tiledMap = SedonaKepler.create_map()
SedonaKepler.add_df(tiledMap, sedona.table("wherobots.test_db.f10_1993").withColumn("tile", expr("RS_Envelope(tile)")), name="tiles")
tiledMap

Tiled raster boundaries

We can also visualize each individual tile using the RS_AsImage function.

from IPython.display import display, HTML

display(HTML(sedona.table("wherobots.test_db.f10_1993").selectExpr("RS_AsImage(tile, 200)", "RS_Envelope(tile)").limit(5).toPandas().to_html(escape=False)))

Tiled raster visualization

To inspect the observed light intensity at a certain point we first need to find the tile that contains the point we wish to inspect. We can do this using the RS_Intersects predicate to find intersecting vector and raster geometries.

sedona.sql("""
WITH matched_tile AS (
  SELECT * FROM wherobots.test_db.f10_1993 
  WHERE RS_Intersects(tile, ST_POINT(-113.9940, 46.8721))
)

SELECT RS_Value(tile, ST_POINT(-113.9940, 46.8721))
AS pixel_value
FROM matched_tile

""").show(truncate=False)
+-----------+
|pixel_value|
+-----------+
|63.0       |
+-----------+

Tiled Zonal Statistics

We can also perform zonal statistics operations on a tiled raster dataset. This time let’s determine which counties in the US have the most night time visible light.

First, we’ll load the boundaries of all US counties using a Shapefile from Natural Earth.

counties_shapefile = "s3://wherobots-examples/data/examples/natural_earth/ne_10m_admin_2_counties"
spatialRDD = ShapefileReader.readToGeometryRDD(sedona, counties_shapefile)
counties_df = Adapter.toDf(spatialRDD, sedona)
counties_df.createOrReplaceTempView("counties")

Similar to the zonal statistics example above, we’ll use the RS_ZonalStats function to calculate statistics of the pixel values that fall within each geometry. However, since our raster is tiled, we first need to match the tile(s) that intersect the boundary of each county.

county_light_tiled_df = sedona.sql("""
WITH matched_tile AS (
  SELECT tile, geometry, FIPS
  FROM wherobots.test_db.f10_1993, counties
  WHERE RS_Intersects(tile, counties.geometry) 
)

SELECT 
    sum(RS_ZonalStats(matched_tile.tile, matched_tile.geometry, 'mean')) AS mean_light, 
    any_value(matched_tile.geometry) AS geometry, 
    FIPS
FROM matched_tile
GROUP BY FIPS
""")

county_light_tiled_df.createOrReplaceTempView("county_light_1993")
county_light_tiled_df.show(100)
+--------------------+--------------------+-------+
|          mean_light|            geometry|   FIPS|
+--------------------+--------------------+-------+
|   6.563388510224063|MULTIPOLYGON (((-...|US01003|
|   5.303058346553825|POLYGON ((-85.422...|US01019|
|   7.112594570538514|POLYGON ((-86.413...|US01021|
|  2.5223492723492806|POLYGON ((-88.091...|US01025|
|   9.564617731305812|POLYGON ((-85.789...|US01031|
|  13.433770014555993|POLYGON ((-88.130...|US01033|
|   8.240051020408188|POLYGON ((-86.370...|US01037|
|   2.301078582434537|POLYGON ((-86.191...|US01039|
|  0.9495387954422182|POLYGON ((-86.499...|US01041|
|     8.3112128146453|POLYGON ((-85.770...|US01045|
|   2.008212672420753|POLYGON ((-86.916...|US01047|
|   4.339487179487201|POLYGON ((-86.699...|US01053|

Visualizing the results of this analysis, we can see that night time visible light intensity largely corresponds with population centers. Further time series analysis of this data could reveal patterns of growth or resource extraction activity.

SedonaKepler.create_map(county_light_tiled_df, name="Night Time Visible Lights by County")

Average night light intensity zonal statistics

Resources

In this post we explored working with raster data in Apache Sedona and Wherobots Cloud. Here are some relevant resources that you might find helpful:

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


Working With Files – Getting Started With Wherobots Cloud Part 3

This is the third post in a series that will introduce Wherobots Cloud and WherobotsDB, covering how to get started with cloud-native geospatial analytics at scale.

In the previous post in this series we saw how to access and query data using Spatial SQL in Wherobots Cloud via the Wherobots Open Data Catalog. In this post we’re going to take a look at loading and working with our own data in Wherobots Cloud, as well as creating and saving data from our analysis, such as the output of a data pipeline. We will cover importing files in various formats including CSV, GeoJSON, Shapefile, and GeoTIFF into WherobotsDB, working with AWS S3 cloud object storage, and creating GeoParquet files using Apache Sedona.

If you’d like to follow along you can create a free Wherobots Cloud account at cloud.wherobots.com.

Loading A CSV File From A Public S3 Bucket

First, let’s explore loading a CSV file from a public AWS S3 bucket. In our SedonaContext object we’ll configure the anonymous S3 authentication provider for this specific bucket so that we can access its contents. Most access configuration happens in the SedonaContext configuration object in the notebook; however, we can also apply these settings when creating the notebook runtime by specifying additional Spark configuration in the runtime configuration UI. See this page in the documentation for more examples of configuring cloud object storage access using the SedonaContext configuration object.

from sedona.spark import *

config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
           "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

Access to private S3 buckets can be configured either by specifying access keys or, for a more secure option, by using an IAM role trust policy.
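
For example, access keys for a private bucket can be supplied through the same per-bucket S3A configuration properties. A minimal sketch with a placeholder bucket name and placeholder credentials (an IAM role trust policy avoids embedding keys in the notebook entirely):

# Placeholder bucket name and credentials; replace with your own values
config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.my-private-bucket.access.key",
           "<AWS_ACCESS_KEY_ID>"). \
    config("spark.hadoop.fs.s3a.bucket.my-private-bucket.secret.key",
           "<AWS_SECRET_ACCESS_KEY>"). \
    getOrCreate()

sedona = SedonaContext.create(config)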

Now that we’ve configured the anonymous S3 credentials provider we can use the S3 URI to access objects within the bucket. In this case we’ll load a CSV file of bird species observations. We’ll use the ST_Point function to convert the individual longitude and latitude columns into a single Point geometry column.

S3_CSV_URL = "s3://wherobots-examples/data/examples/birdbuddy_oct23.csv"
bb_df = sedona.read.format('csv'). \
    option('header', 'true'). \
    option('delimiter', ','). \
    option('inferSchema', 'true'). \
    load(S3_CSV_URL)

bb_df = bb_df.selectExpr(
    'ST_Point(anonymized_longitude, anonymized_latitude) AS location', 
    'timestamp', 
    'common_name', 
    'scientific_name')

bb_df.createOrReplaceTempView('bb')
bb_df.show(truncate=False)
+----------------------------+-----------------------+-------------------------+----------------------+
|location                    |timestamp              |common_name              |scientific_name       |
+----------------------------+-----------------------+-------------------------+----------------------+
|POINT (-118.59075 34.393112)|2023-10-01 00:00:02.415|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:02.415|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:04.544|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:04.544|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.474|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.474|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.487|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.487|california scrub jay     |aphelocoma californica|
|POINT (-120.5542 43.804134) |2023-10-01 00:00:05.931|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:05.931|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:06.522|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:06.522|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:09.113|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:09.113|lesser goldfinch         |spinus psaltria       |
|POINT (-118.59075 34.393112)|2023-10-01 00:00:09.434|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:09.434|california scrub jay     |aphelocoma californica|
|POINT (-122.8521 46.864)    |2023-10-01 00:00:17.488|red winged blackbird     |agelaius phoeniceus   |
|POINT (-122.8521 46.864)    |2023-10-01 00:00:17.488|red winged blackbird     |agelaius phoeniceus   |
|POINT (-122.2438 47.8534)   |2023-10-01 00:00:18.046|chestnut backed chickadee|poecile rufescens     |
|POINT (-122.2438 47.8534)   |2023-10-01 00:00:18.046|chestnut backed chickadee|poecile rufescens     |
+----------------------------+-----------------------+-------------------------+----------------------+

We can visualize a sample of this data using SedonaKepler, the Kepler GL integration for Apache Sedona.

SedonaKepler.create_map(df=bb_df.sample(0.001), name="Bird Species")

Visualizing point data with Sedona Kepler

Uploading A GeoJSON File Using The Wherobots Cloud File Browser

Next, let’s see how we can upload our own data in Wherobots Cloud using the Wherobots Cloud File Browser UI. Wherobots Cloud accounts include secure file storage in private S3 buckets specific to our user or shared with other users of our organization.

There are two options for uploading our own data into Wherobots Cloud:

  1. Via the file browser UI in the Wherobots Cloud web application
  2. Using the AWS CLI by generating temporary ingest credentials

We’ll explore both options, first using the file browser UI. We will upload a GeoJSON file of the US Watershed Boundary Dataset. In the Wherobots Cloud web application, navigate to the “Files” tab then select the “data” directory. Here you’ll see two folders, one with the name customer-XXXXXXXXX and the other shared. The “customer” folder is private to each Wherobots Cloud user while the “shared” folder is private to the users within your Wherobots Cloud organization.

We can upload files using the “Upload” button. Once the file is uploaded we can click on the clipboard icon to the right of the filename to copy the S3 URL for this file to access it in the notebook environment.

The Wherobots Cloud file browser

We’ll save the S3 URL for this GeoJSON file as a variable to refer to later. Note that this file is private to my Wherobots user and not publicly accessible, so if you’re following along you’ll have a different URL.

S3_URL_JSON = "s3://wbts-wbc-m97rcg45xi/qjnq6fcbf1/data/customer-hd1rff9kg390pk/getting_started/watershed_boundaries.geojson"

WherobotsDB supports native readers for many file types, including GeoJSON, so we’ll specify the "geojson" format to import our watersheds data into a Spatial DataFrame. This is a multiline GeoJSON file, where all features are contained in one large object. WherobotsDB can also handle GeoJSON files with each feature on a single line. Refer to the documentation here for more information about working with GeoJSON files in WherobotsDB.

watershed_df = sedona.read.format("geojson"). \
    option("multiLine", "true"). \
    load(S3_URL_JSON). \
    selectExpr("explode(features) as features"). \
    select("features.*")
watershed_df.createOrReplaceTempView("watersheds")
watershed_df.printSchema()
root
 |-- geometry: geometry (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- areaacres: double (nullable = true)
 |    |-- areasqkm: double (nullable = true)
 |    |-- globalid: string (nullable = true)
 |    |-- huc6: string (nullable = true)
 |    |-- loaddate: string (nullable = true)
 |    |-- metasourceid: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- objectid: long (nullable = true)
 |    |-- referencegnis_ids: string (nullable = true)
 |    |-- shape_Area: double (nullable = true)
 |    |-- shape_Length: double (nullable = true)
 |    |-- sourcedatadesc: string (nullable = true)
 |    |-- sourcefeatureid: string (nullable = true)
 |    |-- sourceoriginator: string (nullable = true)
 |    |-- states: string (nullable = true)
 |    |-- tnmid: string (nullable = true)
 |-- type: string (nullable = true)

The WherobotsDB GeoJSON loader parses GeoJSON exactly as it is stored: as a single object. That’s why we explode the features column above, which gives us one row per feature containing the feature’s geometry and a struct of its associated properties.

The geometries stored in GeoJSON are loaded as geometry types, so we can operate on the DataFrame without explicitly creating geometries (as we did when loading the CSV file above).
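
For comparison, a GeoJSON file with one feature per line can be read without the multiLine option. A minimal sketch with a hypothetical file path; in that case the reader should produce one row per feature, so no explode step is needed:

# Hypothetical newline-delimited GeoJSON file, one Feature object per line
features_df = sedona.read.format("geojson"). \
    load("s3://my-bucket/data/features_per_line.geojson")
features_df.printSchema()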

Here we filter for all watersheds whose states property includes California and visualize them using SedonaKepler.

california_df = sedona.sql("""
SELECT geometry, properties.name, properties.huc6
FROM watersheds
WHERE properties.states LIKE "%CA%"
""")

SedonaKepler.create_map(df=california_df, name="California Watersheds")

California watersheds

Uploading Files To Wherobots Cloud Using The AWS CLI

If we have large files, or many files, to upload to Wherobots Cloud, we can use the AWS CLI directly instead of the file browser web application by generating temporary ingest credentials. After clicking "Upload", select "Create Ingest Credentials" to create temporary credentials that can be used with the AWS CLI to upload data into your private Wherobots Cloud file storage.

Uploading files to Wherobots Cloud via the AWS CLI

Once we generate our credentials we’ll need to configure the AWS CLI to use these credentials by adding them to a profile in the ~/.aws/credentials file (on Mac/Linux systems) or by running the aws configure command. See this page for more information on working with the AWS CLI.

[default]
aws_access_key_id = ASIAUZT33PSSUFF73PQA
aws_secret_access_key = O4zTVJTLNURqcFJof6F21Rh7gIQOGuqAUJNREUBa
aws_session_token = IQoJb3JpZ2luX2VjEOj//////////wEaCXVzLXdlc3QtMiJIMEYCIQCcn17jQory/9dbWjoq47cnxU4lENEE6S1akq1dEQx+4AIhALuFkR/XZtCOiw/AwEKtbCpj0IjDTR24MzSPxbSVoFOkKrICCMH//////////wEQABoMMzI5ODk4NDkxMDQ1IgwK3vI/VJyUkHJMltAqhgL6dhz0ikL2kpB7fCIKE52sw5NHlmG1LfuQkmxlhWHEJHnvFd1PrYEneDBTyXbt3Mxx8HQ86/k23zePAbm3mdOyVrrd7r9nA+cPYu5Jv93aGf+3brgGd/3fRMJy6y2Lwydsfuj/3u2/c8Ox7pTJKtcJYN14C8f3BNzTqtpR3bDjpTWG2+JMGFjgOx4lf9GuuXhW39tH8qOONA/y2lRiM00/j8cVOu1AZ/R5gRqL2/fCTFdxp9oBKJHXO8RZJ2u7/H67dmgdDFcw4T/ZuIvhEOtZ9TG2Vo9Vqb4jk+tP5E0ZhAnPyjWfhAdXD8at9/4i6S2WGhsywl5fnwBLYjRFRas5nnX7yzEEMLuQvq4GOpwBpyB0/qrzwPeRhHwb/K/ipspU2DMGSL0BFg6DoEpAOct/flMMmRYTaEhWV/Igexr746Hwox7ZOdN1gCLED1+iy8R/xcASNJJ9Yt194ItTvVtT4I6NTF13Oi50t0KqLURP43t1A67YwuZiWm+V7npUyiyHezBYLzTwf/rRi17lgpYO6/NSUSjgFeOqLDob11HGywzW6ifik+y269rq

Let’s upload a Shapefile of US Forest Service roads to our Wherobots Cloud file storage and then import it using WherobotsDB. The command to upload our Shapefile data will look like this (your S3 URL will be different). Note that we use the --recursive flag since we want to upload the multiple files that make up a Shapefile.

aws s3 cp --recursive roads_shapefile s3://wbts-wbc-m97rcg45xi/qjnq6fcbf1/data/customer-hd1rff9kg390pk/getting_started/roads_shapefile

Back in our notebook, we can use WherobotsDB’s Shapefile Reader, which first generates a Spatial RDD from the Shapefile that we can then convert to a Spatial DataFrame. See this page in the documentation for more information about working with Shapefile data in Wherobots.

S3_URL_SHAPEFILE = "s3://wbts-wbc-m97rcg45xi/qjnq6fcbf1/data/customer-hd1rff9kg390pk/getting_started/roads_shapefile"

spatialRDD = ShapefileReader.readToGeometryRDD(sedona, S3_URL_SHAPEFILE)
roads_df = Adapter.toDf(spatialRDD, sedona)
roads_df.printSchema()

We can visualize a sample of our forest service roads using SedonaKepler.

SedonaKepler.create_map(df=roads_df.sample(0.1), name="Forest Service Roads")

Visualizing forest service roads

Working With Raster Data – Loading GeoTIFF Raster Files

So far we’ve been working with vector data: geometries and their associated properties. We can also work with raster data in Wherobots Cloud. Let’s load a GeoTIFF as an out-db raster using the RS_FromPath Spatial SQL function. Once we’ve loaded the raster we can use Spatial SQL to manipulate and work with its band data.

ortho_url = "s3://wherobots-examples/data/examples/NEON_ortho.tif"
ortho_df = sedona.sql(f"SELECT RS_FromPath('{ortho_url}') AS raster")
ortho_df.createOrReplaceTempView("ortho")
ortho_df.show(truncate=False)

For example, we can use the RS_AsImage function to visualize the GeoTIFF, in this case an aerial image of a forest and road scene.

htmlDf = sedona.sql("SELECT RS_AsImage(raster) FROM ortho")
SedonaUtils.display_image(htmlDf)

Visualizing raster with RS_AsImage function

This image has three bands of data: red, green, and blue pixel values.

sedona.sql("SELECT RS_NumBands(raster) FROM ortho").show()

+-------------------+
|rs_numbands(raster)|
+-------------------+
|                  3|
+-------------------+

We can use the RS_MapAlgebra function to calculate the normalized difference greenness index (NDGI), a metric similar to the normalized difference vegetation index (NDVI) used to quantify the health and density of vegetation and land cover. The RS_MapAlgebra function allows us to execute complex computations using values from one or more bands, and even across multiple rasters, for example if we had a sequence of images over time and were interested in change detection.

ndgi_df = sedona.sql("""
SELECT RS_MapAlgebra(raster, 'D', 'out = (rast[1] - rast[0]) / (rast[1] + rast[0]);')
AS ndgi 
FROM ortho
""")

Writing Files With Wherobots Cloud

A common workflow with Wherobots Cloud is to load several files, perform some geospatial analysis and save the results as part of a larger data pipeline, often using GeoParquet. GeoParquet is a cloud-native file format that enables efficient data storage and retrieval of geospatial data.

Let’s perform some geospatial analysis using the data we loaded above and save the results as GeoParquet to our Wherobots Cloud S3 bucket. We’ll perform a spatial join of our bird observations against the watershed boundaries, then use a GROUP BY to count the number of bird observations in each watershed.

birdshed_df = sedona.sql("""
SELECT 
    COUNT(*) AS num, 
    any_value(watersheds.geometry) AS geometry, 
    any_value(watersheds.properties.name) AS name, 
    any_value(watersheds.properties.huc6) AS huc6
FROM bb, watersheds
WHERE ST_Contains(watersheds.geometry, bb.location) 
    AND watersheds.properties.states LIKE "%CA%"
GROUP BY watersheds.properties.huc6
ORDER BY num DESC
""")

birdshed_df.show()

Here we can see the watersheds with the highest number of bird observations in our dataset.

+------+--------------------+--------------------+------+
|   num|            geometry|                name|  huc6|
+------+--------------------+--------------------+------+
|138414|MULTIPOLYGON (((-...|   San Francisco Bay|180500|
| 75006|MULTIPOLYGON (((-...|Ventura-San Gabri...|180701|
| 74254|MULTIPOLYGON (((-...|Laguna-San Diego ...|180703|
| 48452|MULTIPOLYGON (((-...|Central Californi...|180600|
| 33842|MULTIPOLYGON (((-...|    Lower Sacramento|180201|
| 20476|MULTIPOLYGON (((-...|           Santa Ana|180702|
| 17014|MULTIPOLYGON (((-...|         San Joaquin|180400|
| 15288|MULTIPOLYGON (((-...|Northern Californ...|180101|
| 13636|MULTIPOLYGON (((-...|             Truckee|160501|
|  9964|MULTIPOLYGON (((-...|Southern Oregon C...|171003|
|  6864|MULTIPOLYGON (((-...|Tulare-Buena Vist...|180300|
|  5120|MULTIPOLYGON (((-...|     Northern Mojave|180902|
|  3660|MULTIPOLYGON (((-...|          Salton Sea|181002|
|  2362|MULTIPOLYGON (((-...|              Carson|160502|
|  1040|MULTIPOLYGON (((-...|      Lower Colorado|150301|
|   814|MULTIPOLYGON (((-...|    Mono-Owens Lakes|180901|
|   584|MULTIPOLYGON (((-...|             Klamath|180102|
|   516|MULTIPOLYGON (((-...|    Upper Sacramento|180200|
|   456|MULTIPOLYGON (((-...|      North Lahontan|180800|
|   436|MULTIPOLYGON (((-...|Central Nevada De...|160600|
+------+--------------------+--------------------+------+
only showing top 20 rows

We can also visualize this data as a choropleth map using SedonaKepler.
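
A minimal sketch of that map call (styling the choropleth by the num column is configured interactively in the Kepler UI):

SedonaKepler.create_map(df=birdshed_df, name="Bird Observations By Watershed")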

Bird observations per watershed

The GeoParquet data we’ll save will include the geometry of each watershed boundary, the count of bird observations, and the name and id of each watershed. We’ll save this GeoParquet file to our private Wherobots S3 bucket. Previously we accessed our S3 URL via the Wherobots File Browser UI, but we can also access this URL as an environment variable in the notebook environment.

import os

# Our private S3 path is exposed to the notebook as an environment variable
USER_S3_PATH = os.environ.get("USER_S3_PATH")

Since this file isn’t very large we’ll save it as a single partition, but larger results would typically be saved as GeoParquet partitioned on a geospatial index or administrative boundary. The WherobotsDB GeoParquet writer will add the geospatial metadata when saving as GeoParquet. See this page in the documentation for more information about creating GeoParquet files with Sedona.

birdshed_df.repartition(1).write.mode("overwrite"). \
    format("geoparquet"). \
    save(USER_S3_PATH + "geoparquet/birdshed.parquet")
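
If the result were much larger, we could write partitioned GeoParquet instead. A minimal sketch, assuming we partition on the huc6 watershed code (the output path is illustrative; a geohash or other spatial key column would work the same way):

# Hypothetical partitioned write using the huc6 column as the partition key
birdshed_df.write.mode("overwrite"). \
    format("geoparquet"). \
    partitionBy("huc6"). \
    save(USER_S3_PATH + "geoparquet/birdshed_partitioned.parquet")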

That was a look at working with files in Wherobots Cloud. You can get started with large-scale geospatial analytics by creating a free account at cloud.wherobots.com. Please join the Wherobots & Apache Sedona Community site and let us know what you’re working on with Wherobots, what type of examples you’d like to see next, or if you have any feedback!

Sign in to Wherobots Cloud to get started today.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


SedonaSnow: Apache Sedona On Snowflake, Accelerating Your GIS Pipeline, Exploring Global Fishing Watch Data With GeoParquet, and Apache Sedona 1.5.1 Release

Welcome to This Month In Wherobots the monthly developer newsletter for the Wherobots & Apache Sedona community! In this edition we cover SedonaSnow: Apache Sedona on Snowflake, accelerating your GIS pipeline with Apache Sedona, exploring Global Fishing Watch public data with SedonaDB and GeoParquet, and a look at new features and updates in the 1.5.1 release of Apache Sedona.

Apache Sedona Now Available In Snowflake Marketplace: SedonaSnow

SedonaSnow

Apache Sedona, the scalable open-source geospatial compute engine, is now available on Snowflake via the Snowflake Marketplace or via manual installation. The SedonaSnow plugin brings Apache Sedona’s Spatial SQL functionality to Snowflake with 130+ Sedona "ST" SQL functions that can be used alongside Snowflake SQL.

Read More About Using SedonaSnow In Snowflake In This Tutorial

Featured Community Members: Alihan Zihna & Fernando Ayuso Palacios

This month’s Wherobots & Apache Sedona featured community members are Alihan Zihna, Lead Data Scientist at CKDelta, and Fernando Ayuso Palacios, Director of Data Science & Data Engineering, also at CKDelta. Alihan and Fernando presented "GIS Pipeline Acceleration With Apache Sedona" at the Data + AI Summit, where they shared how they improved the performance of their geospatial analysis pipelines, taking one pipeline from 48 hours down to 10 minutes using Apache Sedona. Thank you Fernando and Alihan for being a part of the community and sharing your work!

GIS Pipeline Acceleration With Apache Sedona

GIS pipeline accelerations with Apache Sedona

In this presentation from Data + AI Summit, Fernando and Alihan discuss some of the use cases for working with large-scale geospatial data at conglomerate CKDelta, part of the Hutchison Group, which operates ports, utility networks, retail stores, and mobile telecom networks with hundreds of millions of users across dozens of countries. They discuss how geospatial analytics at scale is important for identifying water leakage in their utility network, understanding customer satisfaction, identifying sites for electric vehicle charging station installation, and forecasting the supply and demand of energy. They provide a technical overview of Apache Sedona and share the results of improving and extending their geospatial analytics pipelines, including one process that reduced running time from 48 hours to 10 minutes using Apache Sedona.

Watch the recording of "GIS Pipeline Acceleration With Apache Sedona"

The Wherobots Notebook Environment – Getting Started With Wherobots Cloud & SedonaDB Part 2

Wherobots Initial Notebook

In Part 2 of our Getting Started With Wherobots Cloud & SedonaDB series we dive into the Wherobots Notebook Environment including how to configure and start notebook runtimes, an overview of the sample notebooks included in Wherobots Cloud, and how to use version control like git with notebooks. If you missed it check out Part 1: An Overview of Wherobots Cloud or sign up for a free Wherobots Cloud account to get started directly.

Read More About The Wherobots Notebooks Environment

Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet

Matched vs unmatched vessels

This post is a hands-on look at offshore ocean infrastructure and industrial vessel activity with SedonaDB using data from Global Fishing Watch. We also see how GeoParquet can be used with this data to improve the efficiency of data retrieval and enable large-scale geospatial visualization using GeoArrow and the Lonboard Python visualization library.

Read "Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet"

Apache Sedona 1.5.1 Release

Apache Sedona 1.5.1 Release Notes

The most recent release of Apache Sedona introduces some exciting updates including support for Spark 3.5, 20+ new raster functions, 7 new vector functions, support for running Sedona in Snowflake with SedonaSnow, updates to Sedona’s GeoParquet reader and writer, and more! The updated raster functions include RS_ZonalStats for computing zonal statistics, RS_Tile and RS_TileExplode for tiling large rasters, and updates to RS_MapAlgebra to enable user-defined raster functions that can work across multiple rasters. Updated vector functions include ST_IsValidReason, which exposes the reason a geometry is invalid, and ST_LineLocatePoint, which can be useful for map matching and snapping data to road networks.
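
As a quick illustration of those two vector functions, here is a minimal sketch (assuming a notebook with a SedonaContext named sedona; the geometries are invented for the example):

sedona.sql("""
SELECT
  ST_IsValidReason(ST_GeomFromWKT('POLYGON ((0 0, 2 2, 2 0, 0 2, 0 0))')) AS invalid_reason,
  ST_LineLocatePoint(ST_GeomFromWKT('LINESTRING (0 0, 10 0)'), ST_Point(3, 1)) AS line_fraction
""").show(truncate=False)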

Read More About Apache Sedona 1.5.1 In The Release Notes

Hands-On With Havasu and GeoParquet

GeoParquet and Iceberg Havasu

Each month you can find a new livestream tutorial on the Wherobots YouTube channel. January’s livestream was all about working with GeoParquet and Havasu tables in SedonaDB. We dig into some of the optimizations built into the Apache Parquet format to learn how Parquet delivers efficient data storage and retrieval, before exploring the GeoParquet specification for storing geospatial data in Parquet. We cover loading, analyzing, and creating GeoParquet files using SedonaDB with a focus on comparing the performance of various GeoParquet partitioning strategies. Finally, we see how the Havasu extension to the Apache Iceberg table format enables working with both vector and raster geospatial data backed by GeoParquet, but with the familiar developer experience of SQL tables.

Watch The Recording: Hands-On With Havasu And GeoParquet

Upcoming Events

  • Apache Sedona Community Office Hour (Online Zoom Call – February 13, 2024) – Join the Apache Sedona community for updates on the state of Apache Sedona, presentations and demos of recent features, and the chance to provide your input on the roadmap, future plans, and contribution opportunities.
  • Raster Data Analysis With Spatial SQL & SedonaDB (Online Livestream – February 29th, 2024) – This month our livestream is focused on raster data analysis. We’ll see how to load raster data in SedonaDB and perform raster operations like map algebra and zonal statistics using Spatial SQL. Be sure to subscribe to the Wherobots YouTube channel to keep up to date with more Wherobots livestreams and videos!

Want to receive this monthly update in your inbox? Sign up for the This Month In Wherobots Newsletter: