Introducing GeoStats for WherobotsAI and Apache Sedona

We are excited to introduce GeoStats, a machine learning (ML) and statistical toolbox for WherobotsAI and Apache Sedona users. With GeoStats, you can easily identify critical patterns in geospatial datasets, such as hotspots and anomalies, and quickly extract insights from large-scale data. While these algorithms are available in other packages, we’ve optimized each one to perform well on geospatial workloads from small to planetary scale. That means you can get results significantly faster and at a lower cost, and do it all more productively through a unified development experience purpose-built for geospatial data science and ETL.

The Wherobots toolbox supports the DBSCAN, Local Outlier Factor (LOF), and Getis-Ord (Gi*) algorithms. Apache Sedona users can use DBSCAN starting with Apache Sedona version 1.7.0 and, like all other features of Apache Sedona, it’s fully compatible with Wherobots.

Use Cases for GeoStats

DBSCAN

DBSCAN is the most popular algorithm we see in geospatial use cases. It identifies clusters, areas of your data that are closely packed together, and outliers, areas of your data that are set apart.

Typical use cases for DBSCAN are found in:

  • Retail: Decision makers use DBSCAN with location data to understand areas of high and low pedestrian activity to decide where to set up retail establishments.
  • City Planning: City planners use DBSCAN with GPS data to optimize transit support by identifying high usage routes, areas in need of additional transit options, or areas that have too much support.
  • Air Traffic Control: Traffic controllers use DBSCAN to identify areas with increasing weather activity to optimize flight routing.
  • Risk computation: Insurers and others can use DBSCAN to make policy decisions and calculate risk where risk is correlated to the proximity of two or more features of interest.

Local Outlier Factor (LOF)

LOF is an anomaly detection algorithm that identifies outliers present in a dataset.

Typical use cases for LOF include:

  • Data analysis and cleansing: Data teams can use LOF to identify and remove anomalies within a dataset, such as erroneous GPS data points in a trace dataset.

Getis-Ord (Gi*)

Getis-Ord is also a popular algorithm for identifying local hot and cold spots.

Typical use cases for Gi* include:

  • Public health: Officials can use disease data with Gi* to identify areas of abnormal disease outbreak.
  • Telecommunications: Network administrators can use Gi* to identify areas of high demand and optimize network deployment.
  • Insurance: Insurers can identify areas prone to specific claims to better manage risk.

Traditional challenges with using these algorithms on geospatial data

Before GeoStats, teams leveraging any of the algorithms in the toolbox in a data analysis or ML pipeline would:

  1. Struggle to get performance or scale from the underlying solutions that also don’t perform well when joining geospatial data.
  2. Determine how to host and scale open source versions of popular ML and statistical algorithms, like PostGIS or scikit-learn DBSCAN, PySAL Gi*, or scikit-learn LOF, to work for geospatial data types and geospatial data formats.
  3. Replicate this overhead each time they want to deploy a new algorithm for geospatial data.

Benefits of WherobotsAI GeoStats

With GeoStats in WherobotsAI, you can now:

  1. Easily run native algorithms on a cloud-based engine, optimized for producing spatial data products and insights at scale.
  2. Use these algorithms without the operational overhead associated with setup and maintenance.
  3. Leverage optimized, hosted algorithms within a single platform to easily experiment and get critical insights faster.

We’ll walk through a brief overview of each algorithm, how to use them, and show how they perform at various scales.

Diving Deeper into the GeoStats Toolbox

DBSCAN Overview

DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups together points that have many nearby neighbors and marks as outliers the points that lie alone in low-density regions.
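To make the idea concrete, here is a minimal, brute-force sketch of the textbook DBSCAN procedure in plain Python. It is only an illustration of the algorithm, not the distributed GeoStats implementation; the function name, the NOISE label, and the sample points are our own assumptions, and a point counts as its own neighbor when checking min_points.

```python
from math import dist

NOISE = -1  # label used in this sketch for outliers


def dbscan_sketch(points, epsilon, min_points):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force range query; the point itself is included.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= epsilon]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_points:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


# Two tight groups of four points and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan_sketch(points, epsilon=1.5, min_points=3)
print(labels)
```

Each range query here scans every point, so the sketch is quadratic in the number of points. That cost is exactly why a spatially optimized, distributed implementation matters at large scale.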

How to use DBSCAN in Wherobots

The following examples assume you have already set up an organization and have an active runtime and notebook, with a DataFrame of interest to run the algorithms on.

WherobotsAI GeoStats DBSCAN Python API Overview
For a full walk through see the Python API reference: dbscan(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: max distance to neighbors (epsilon), min neighbors (min_points)
  • Output: dataframe with cluster id

DBSCAN Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint("_1", "_2").alias("geometry"))
  2. Choose values for your hyperparameters, max distance to neighbors (epsilon) and minimum neighbors (min_points). These values will determine how DBSCAN identifies clusters.
epsilon = 0.3
min_points = 10
  3. Run DBSCAN on your DataFrame with your chosen hyperparameter values.
clusters_df = dbscan(df, epsilon=epsilon, min_points=min_points, include_outliers=True)
  4. Analyze the results. For each datapoint, DBSCAN returns the cluster it’s associated with, or flags it as an outlier.
+--------------------+------+-------+
|            geometry|isCore|cluster|
+--------------------+------+-------+
|POINT (1.22185277...| false|      1|
|POINT (0.77885034...| false|      1|
|POINT (-2.2744742...| false|      2|
+--------------------+------+-------+

only showing top 3 rows

There’s a complete example of how to use DBSCAN in the Wherobots user documentation.

DBSCAN Performance Overview

To show DBSCAN performance in Wherobots, we created a European sample of the Overture buildings dataset and ran DBSCAN to identify clusters of buildings near each other, starting from the geographic center of Europe and working outwards. For each subsampled dataset, we ran DBSCAN with an epsilon of 0.005 degrees (~0.35 miles north-south) and a min_points value of 4 on a Large runtime in Wherobots Cloud. As seen below, DBSCAN effectively processes an increasing number of records, with 100M records taking 1.6 hrs to process.

Local Outlier Factor (LOF)

LOF is an anomaly detection algorithm that identifies outliers present in a dataset. It does this by measuring how close a given data point is to its k nearest neighbors (with k being a user-chosen hyperparameter) in comparison to how close those neighbors are to their own nearest neighbors. LOF provides a score that represents the degree to which a record is an inlier or outlier.
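The comparison described above can be sketched in a few lines of plain Python. This is a brute-force illustration of the standard LOF definition (k-distance, reachability distance, local reachability density), not the GeoStats implementation. The function name and sample data are our own, and the sketch assumes there are no duplicate points and ignores ties at the k-th neighbor.

```python
from math import dist


def lof_sketch(points, k):
    """Return one LOF score per point (about 1 = inlier, much larger = outlier)."""
    n = len(points)
    # k nearest neighbors of each point (excluding itself), as (distance, index).
    knn = []
    for i in range(n):
        ds = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        knn.append(ds[:k])
    k_distance = [nbrs[-1][0] for nbrs in knn]

    def local_reachability_density(i):
        # Inverse of the mean reachability distance from i to its neighbors.
        reach = [max(k_distance[j], d) for d, j in knn[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for _, j in knn[i]) / (len(knn[i]) * lrd[i])
        for i in range(n)
    ]


# A 3x3 grid of points plus one far-away point (made-up data).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_sketch(points, k=3)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Points inside the grid score close to 1 because their neighborhoods are about as dense as their neighbors' neighborhoods, while the far-away point scores much higher.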

How to use LOF in Wherobots

For the full example, please see this docs page.

WherobotsAI GeoStats LOF Python API Overview
For a full walk through see the Python API reference: local_outlier_factor(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: number of nearest neighbors to use
  • Output: score representing degree of inlier or outlier

LOF Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint(f.col("_1"), f.col("_2")).alias("geometry"))
  2. Choose your k value for how many nearest neighbors you want to use to measure density near a given datapoint.
k = 20
  3. Run LOF on your DataFrame with your chosen k value.
outliers_df = local_outlier_factor(df, k=k)
  4. Analyze your results. LOF returns a score for each datapoint representing the degree of inlier or outlier.
+--------------------+------------------+
|            geometry|               lof|
+--------------------+------------------+
|POINT (-1.9169927...|0.9991534865548664|
|POINT (-1.7562422...|1.1370318880088373|
|POINT (-2.0107478...|1.1533763384772193|
+--------------------+------------------+
only showing top 3 rows

There’s a complete example of how to use LOF in the Wherobots user documentation.

LOF Performance Overview

We followed the same procedure as for DBSCAN, but ran LOF to score how isolated each building is relative to its neighbors. For each set of buildings, we ran LOF with k=20 on a Large runtime in Wherobots Cloud. As seen below, GeoStats LOF scales effectively with growing data size, with 100M records taking 10 mins to process.

Getis-Ord (Gi*) Overview

Getis-Ord is an algorithm for identifying statistically significant local hot and cold spots.
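As a rough illustration of what "statistically significant" means here, the sketch below computes the widely used z-score form of the Gi* statistic with binary distance-band weights, plus a two-sided p-value from a normal approximation. It is a plain-Python sketch under our own assumptions (function name, sample data, each record counted as its own neighbor), and GeoStats may compute and report the statistic differently.

```python
from math import dist, erfc, sqrt


def gi_star_sketch(points, values, radius):
    """Return (z, p) per point using binary distance-band weights."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum(v * v for v in values) / n - mean * mean)
    results = []
    for i in range(n):
        # Weight 1 for every point within the radius (including i itself), else 0.
        w = [1.0 if dist(points[i], q) <= radius else 0.0 for q in points]
        sum_w = sum(w)
        sum_w2 = sum(x * x for x in w)
        numerator = sum(wj * xj for wj, xj in zip(w, values)) - mean * sum_w
        # Undefined if every point is a neighbor of i or if all values are equal.
        denominator = std * sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        z = numerator / denominator
        p = erfc(abs(z) / sqrt(2))  # two-sided p-value, normal approximation
        results.append((z, p))
    return results


# Ten locations on a line; the last three have high values (made-up data).
points = [(x, 0) for x in range(10)]
values = [1, 1, 1, 1, 1, 1, 1, 10, 12, 11]
results = gi_star_sketch(points, values, radius=1.0)
for point, (z, p) in zip(points, results):
    print(point, f"z={z:+.2f}", f"p={p:.4f}")
```

A large positive z with a small p marks a hot spot (high values surrounded by high values), and a large negative z marks a cold spot.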

How to use GeoStats Gi*

WherobotsAI GeoStats Gi* Python API Overview
For a full walk through see the Python API reference: g_local(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: star, neighbor weighting
  • Output: Set of statistics that indicate the degree of local hot or cold spot for a given record

Gi* Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
places_df = (
    sedona.table("wherobots_open_data.overture_2024_07_22.places_place")
        .select(f.col("geometry"), f.col("categories"))
        .withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)
  2. Choose how you’d like to weight datapoints (e.g., datapoints in a specific geographic area need to be weighted higher, or any datapoint close to a given datapoint needs to be weighted higher) and star (a boolean that indicates whether a record is a neighbor of itself).

star = True
neighbor_search_radius_degrees = 1.0
variable_column = "myNumericColumnName"

# df: your DataFrame with a geometry column and the numeric column named in variable_column
weighted_dataframe = add_binary_distance_band_column(
    df,
    neighbor_search_radius_degrees,
    include_self=star
)
  3. Run Gi* on your DataFrame with your chosen hyperparameters.
gi_df = g_local(
    weighted_dataframe,
    variable_column,
    star=star
)
  4. Analyze your results. For each datapoint, Gi* returns a set of statistics that indicate the degree of local hot or cold spot.
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|num_places|                  G|                  EG|                  VG|                 Z|                   P|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|       871| 0.1397485091609774|0.013219284603421462|5.542296862370928E-5|16.995969941572465|                 0.0|
|       908|0.16097739240211956|0.013219284603421462|5.542296862370928E-5|19.847528249317246|                 0.0|
|       218|0.11812096144582315|0.013219284603421462|5.542296862370928E-5|14.090861243071908|                 0.0|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
only showing top 3 rows

There’s a complete example of how to use Gi* in the Wherobots user documentation.

Getis-Ord Performance Overview

To showcase how Gi* performs in Wherobots, we again used the same example as for DBSCAN, but ran Gi* on the area of the buildings. For each set of buildings, we ran Gi* with a binary neighbor weight and a neighborhood radius of 0.007 degrees (~0.5 miles) on a Large runtime in Wherobots Cloud. As seen below, the algorithm scales mostly linearly with the number of records, with 100M records taking 1.6 hours to process.
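Because the neighborhood radius is expressed in degrees, the ground distance it covers depends on latitude. The sketch below is a spherical-Earth approximation that converts a span in degrees to north-south and east-west distances. The helper name and the 50° N latitude are our own assumptions for illustration.

```python
from math import cos, radians

METERS_PER_DEGREE_LAT = 111_195  # mean Earth radius (6,371,008.8 m) * pi / 180
METERS_PER_MILE = 1609.344


def degrees_to_meters(degrees, latitude):
    """Approximate north-south and east-west ground distance of a span in degrees."""
    north_south = degrees * METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * cos(radians(latitude))
    return north_south, east_west


ns, ew = degrees_to_meters(0.007, latitude=50.0)  # latitude is an assumed value
print(f"north-south: {ns / METERS_PER_MILE:.2f} mi, east-west: {ew / METERS_PER_MILE:.2f} mi")
```

North-south, 0.007 degrees works out to a little under half a mile; east-west it is shorter the farther you are from the equator. If you need a radius that is uniform in meters, projecting the data to a local planar coordinate system first is one option.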

Get started with WherobotsAI GeoStats

The way we implemented these algorithms for large-scale geospatial workloads will help you make sense of your geospatial data faster. You can get started for free today.

  • If you haven’t already, create a free Wherobots Organization subscribed to the Community Edition of Wherobots.
  • Start a Wherobots Notebook.
  • In the Notebook environment, explore the notebook_example/python/wherobots-ai/ folder for examples that you can use to get started.
  • Need additional help? Check out our user documentation, and send us a note if needed at support@wherobots.com.

Apache Sedona Users

Apache Sedona users will have access to GeoStats DBSCAN with the Apache Sedona 1.7.0 release. Subscribe to the Sedona newsletter and join the Sedona community to get notified of the release and get started!

What’s next

We’re excited to hear what ML and statistical algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

The WherobotsAI Team

Easily create trip insights at scale by snapping millions of GPS points to road segments using WherobotsAI Map Matching

What is Map Matching?

GPS data is inherently noisy and often lacks precision, which can make it challenging to extract accurate insights. This imprecision means that the GPS points logged may not accurately represent the actual locations where a device was. For example, GPS data from a drive around a lake may incorrectly include points that are over the water!

To address these inaccuracies, teams commonly use two approaches:

  1. Identifying and Dropping Erroneous Points: This method involves manually or algorithmically filtering out GPS points that are clearly incorrect. However, this approach can reduce analytical accuracy, and it is costly and time-intensive.
  2. Map Matching Techniques: A smarter and more effective approach involves using map matching techniques. These techniques take the noisy GPS data points and compute the most likely path taken based on known transportation segments such as roadways or trails.

WherobotsAI Map Matching offers an advanced solution for this problem. It performs map matching at scale, on millions or even billions of trips, so that the GPS data aligns with the paths most likely taken.

An illustration of map matching. Blue dots: GPS samples, Green line: matched trajectory.

Map matching is a common solution for preparing GPS data for use in a wide range of applications including:

  • Satellite and GPS-based navigation
  • GPS tracking of freight
  • Assessing risk of driving behavior for improved insurance pricing
  • Post hoc analysis of self-driving car trips for telematics teams
  • Transportation engineering and urban planning

The objective of map matching is to accurately determine which road or path in the digital map corresponds to the observed geographic coordinates, considering factors such as the accuracy of the location data, the density and layout of the road network, and the speed and direction of travel.

Existing Solutions for Map Matching

Most map matching implementations are variants of the Hidden Markov Model (HMM)-based algorithm described by Newson and Krumm in their seminal paper, "Hidden Markov Map Matching through Noise and Sparseness." This foundational research has influenced a variety of map matching solutions available today.
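To give a feel for the HMM formulation, here is a toy sketch in plain Python. Road segments are the hidden states, a Gaussian on the distance between a GPS fix and a segment plays the role of the emission probability, and an exponential penalty on the mismatch between distance moved by the fixes and distance moved along the roads plays the role of the transition probability. The Viterbi algorithm then picks the most likely sequence of segments. This is a simplification under our own assumptions: planar coordinates in meters, straight-line distance standing in for network route distance, and made-up names, parameters, and data. It is not the WherobotsAI implementation.

```python
from math import dist


def project(p, seg):
    """Closest point to p on the line segment seg = (a, b)."""
    (ax, ay), (bx, by) = seg
    dx, dy = bx - ax, by - ay
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def map_match_sketch(observations, segments, sigma=10.0, beta=5.0):
    """Return the most likely segment name for each observation (Viterbi)."""
    names = list(segments)
    # Candidate snapped position of every observation on every segment.
    snapped = [{s: project(z, segments[s]) for s in names} for z in observations]

    def emission(t, s):
        # Gaussian log-likelihood of the GPS error (constant terms dropped).
        return -0.5 * (dist(observations[t], snapped[t][s]) / sigma) ** 2

    def transition(t, prev, cur):
        # ASSUMPTION: straight-line distance stands in for network route distance.
        moved = dist(observations[t - 1], observations[t])
        route = dist(snapped[t - 1][prev], snapped[t][cur])
        return -abs(moved - route) / beta

    score = {s: emission(0, s) for s in names}
    back = []
    for t in range(1, len(observations)):
        new_score, pointers = {}, {}
        for cur in names:
            best = max(names, key=lambda prev: score[prev] + transition(t, prev, cur))
            new_score[cur] = score[best] + transition(t, best, cur) + emission(t, cur)
            pointers[cur] = best
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the most likely final state.
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Two parallel roads 30 m apart; the third fix drifts toward the wrong road.
segments = {"main_st": ((0, 0), (100, 0)), "side_st": ((0, 30), (100, 30))}
observations = [(10, 4), (30, -3), (50, 16), (70, 5), (90, -2)]
path = map_match_sketch(observations, segments)
print(path)
```

Snapping each fix to its nearest road independently would put the third fix on side_st. The transition term makes that detour unlikely, so the whole trip stays on main_st.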

However, traditional HMM-based approaches have notable downsides when working with large-scale GPS datasets:

  1. Significant Costs: Many commercially available map matching APIs charge substantial fees for large-scale usage.
  2. Performance Issues: Traditional map matching algorithms, while accurate, are often not optimized for large-scale computation. They can be prohibitively slow, especially when dealing with extensive GPS data, as the underlying computation struggles to handle the data scale efficiently.

These challenges highlight the need for more efficient and cost-effective solutions capable of handling large-scale GPS datasets without compromising on performance.

RESTful API Map Matching Options

The Mapbox Map Matching API, HERE Maps Route Matching API, and Google Roads API are powerful RESTful APIs for performing map matching. These solutions are particularly effective for small-scale applications. However, for large-scale applications, such as population-level analysis involving millions of trajectories, the costs can become prohibitively high.

For example, as of July 2024, the approximate costs for matching 1 million trips are:

  • Mapbox: $1,600
  • HERE Maps: $4,400
  • Google Maps Platform: $8,000

These prices are based on public pricing pages and do not consider any potential volume-based discounts that may be available.
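To see how these list prices add up at population scale, here is a back-of-the-envelope calculation. It extrapolates the quoted prices linearly, ignores volume discounts, and uses a hypothetical workload of 90 million trips (the size of the performance test described later in this post).

```python
# List prices quoted above (USD per 1 million matched trips, July 2024).
cost_per_million_trips = {
    "Mapbox": 1_600,
    "HERE Maps": 4_400,
    "Google Maps Platform": 8_000,
}


def estimated_cost(provider, trips):
    """Linear extrapolation of the quoted list price; ignores volume discounts."""
    return cost_per_million_trips[provider] * trips / 1_000_000


trips = 90_000_000  # hypothetical workload size
for provider in cost_per_million_trips:
    print(f"{provider}: ${estimated_cost(provider, trips):,.0f}")
```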

While these APIs provide robust and accurate map matching capabilities, organizations seeking to perform extensive analyses often must explore more cost-effective alternatives.

Open-Source Map Matching Solutions

Open-source software such as Valhalla or GraphHopper can also be used for map matching. However, these solutions are designed for use on a single machine. If your map matching workload exceeds the capacity of that machine, it will suffer from extended processing times. Furthermore, if you scale vertically up the ladder of VM sizes, you will eventually run out of headroom.

Meet WherobotsAI Map Matching

WherobotsAI Map Matching is a high performance, low cost, and planetary scale map matching capability for your telematics pipelines.

WherobotsAI provides a scalable map matching feature designed for small to very large scale trajectory datasets. It works seamlessly with other Wherobots capabilities, which means you can implement data cleaning, data transformations, and map matching in one single (serverless) data processing pipeline. We’ll see how it works in the following sections.

How it works

WherobotsAI Map Matching takes a DataFrame containing trajectories and another DataFrame containing road segments, and produces a DataFrame containing map matched results. Here is a walk-through of using WherobotsAI Map Matching to match trajectories in the VED dataset to the OpenStreetMap (OSM) road network.

1. Preparing the Trajectory Data

First, we load the trajectory data. We’ll use the preprocessed VED dataset stored as GeoParquet files for demonstration.

dfPath = sedona.read.format("geoparquet").load("s3://wherobots-benchmark-prod/data/mm/ved/VED_traj/")

The trajectory dataset should contain the following attributes:

  • A unique ID for trips. In this example the ids attribute is the unique ID of each trip.
  • A geometry attribute containing LineStrings. In this example, the geometry attribute holds the path of each trip.
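If your GPS data arrives as individual fixes rather than one LineString per trip, you need a preprocessing step along these lines. The sketch below uses plain Python to group fixes by trip, order them by time, and emit WKT LineStrings. The row layout, helper name, and sample values are hypothetical; in a real pipeline you would express the same grouping with DataFrame operations.

```python
from collections import defaultdict


def build_trajectories(gps_rows):
    """Group (trip_id, timestamp, lat, lon) rows into one LINESTRING per trip."""
    by_trip = defaultdict(list)
    for trip_id, timestamp, lat, lon in gps_rows:
        by_trip[trip_id].append((timestamp, lon, lat))
    trajectories = []
    for ids, (trip_id, fixes) in enumerate(sorted(by_trip.items())):
        fixes.sort()  # order GPS fixes by time
        # WKT uses x y order, i.e. longitude first, then latitude.
        coords = ", ".join(f"{lon} {lat}" for _, lon, lat in fixes)
        trajectories.append(
            {"ids": ids, "trip": trip_id, "geometry": f"LINESTRING ({coords})"}
        )
    return trajectories


# Hypothetical fixes: (trip_id, timestamp, latitude, longitude), not in time order.
rows = [
    (706, 1, 42.2776, -83.6987),
    (706, 0, 42.2775, -83.6990),
    (707, 0, 42.2610, -83.7100),
    (707, 1, 42.2612, -83.7105),
]
trajectories = build_trajectories(rows)
for row in trajectories:
    print(row)
```

A valid LineString needs at least two points, so trips with a single fix should be filtered out before this step.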

The rows in the trajectory DataFrame look like this:

+---+-----+----+--------------------+--------------------+
|ids|VehId|Trip|              coords|            geometry|
+---+-----+----+--------------------+--------------------+
|  0|    8| 706|[{0, 42.277558333...|LINESTRING (-83.6...|
|  1|    8| 707|[{0, 42.277681388...|LINESTRING (-83.6...|
|  2|    8| 708|[{0, 42.261997222...|LINESTRING (-83.7...|
|  3|   10|1558|[{0, 42.277065833...|LINESTRING (-83.7...|
|  4|   10|1561|[{0, 42.286599722...|LINESTRING (-83.7...|
+---+-----+----+--------------------+--------------------+
only showing top 5 rows

2. Preparing the Road Network Data

We’ll use the OpenStreetMap (OSM) data for the Ann Arbor, Michigan region to map match our trip data against. Wherobots provides an API for loading road network data from OSM XML files.

from wherobots import matcher
dfEdge = matcher.load_osm("s3://wherobots-examples/data/osm_AnnArbor_large.xml", "[car]")
dfEdge.show(5)

The loaded road network DataFrame looks like this:

+--------------------+----------+--------+----------+-----------+----------+-----------+
|            geometry|       src|     dst|   src_lat|    src_lon|   dst_lat|    dst_lon|
+--------------------+----------+--------+----------+-----------+----------+-----------+
|LINESTRING (-83.7...|  68133325|27254523| 42.238819|-83.7390142|42.2386159|-83.7390153|
|LINESTRING (-83.7...|9405840276|27254523|42.2386058|-83.7388915|42.2386159|-83.7390153|
|LINESTRING (-83.7...|  68133353|27254523|42.2385675|-83.7390856|42.2386159|-83.7390153|
|LINESTRING (-83.7...|2262917109|27254523|42.2384552|-83.7390313|42.2386159|-83.7390153|
|LINESTRING (-83.7...|9979197063|27489080|42.3200426|-83.7272283|42.3200887|-83.7273003|
+--------------------+----------+--------+----------+-----------+----------+-----------+
only showing top 5 rows

Users can also prepare the road network data from any data source using any data processing procedures, as long as the schema of the road network DataFrame conforms to the requirement of the Map Matching API.
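As an illustration of preparing edges from another source, the sketch below turns node coordinates and ordered node lists into rows with the same columns as the table above (geometry, src, dst, src_lat, src_lon, dst_lat, dst_lon). The helper name and input layout are hypothetical, one-way restrictions are ignored, and the reference documentation remains the authority on the exact schema the Map Matching API expects.

```python
def build_edges(nodes, ways):
    """Turn node coordinates and ordered node lists into per-edge rows."""
    edges = []
    for way in ways:
        # Each consecutive pair of nodes along a way becomes one edge.
        for src, dst in zip(way, way[1:]):
            src_lat, src_lon = nodes[src]
            dst_lat, dst_lon = nodes[dst]
            edges.append({
                # WKT uses longitude first, then latitude.
                "geometry": f"LINESTRING ({src_lon} {src_lat}, {dst_lon} {dst_lat})",
                "src": src, "dst": dst,
                "src_lat": src_lat, "src_lon": src_lon,
                "dst_lat": dst_lat, "dst_lon": dst_lon,
            })
    return edges


# Hypothetical input: node id -> (lat, lon), and one way visiting three nodes.
nodes = {1: (42.2388, -83.7390), 2: (42.2386, -83.7391), 3: (42.2384, -83.7393)}
ways = [[1, 2, 3]]
edges = build_edges(nodes, ways)
for edge in edges:
    print(edge)
```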

3. Run Map Matching

Once the trajectory and road network data are ready, we can run matcher.match to match trajectories to the road network.

dfMmResult = matcher.match(dfEdge, dfPath, "geometry", "geometry")

The dfMmResult DataFrame contains the trajectories snapped to the roads in the matched_points attribute:

+---+--------------------+--------------------+--------------------+
|ids|     observed_points|      matched_points|       matched_nodes|
+---+--------------------+--------------------+--------------------+
|275|LINESTRING (-83.6...|LINESTRING (-83.6...|[62574078, 773611...|
|253|LINESTRING (-83.6...|LINESTRING (-83.6...|[5930199197, 6252...|
| 88|LINESTRING (-83.7...|LINESTRING (-83.7...|[4931645364, 6249...|
|561|LINESTRING (-83.6...|LINESTRING (-83.6...|[29314519, 773612...|
|154|LINESTRING (-83.7...|LINESTRING (-83.7...|[5284529433, 6252...|
+---+--------------------+--------------------+--------------------+
only showing top 5 rows

We can visualize the map matching result using SedonaKepler to see what the matched trajectories look like:

from sedona.maps.SedonaKepler import SedonaKepler

# Create an empty map, then add the road network and both versions of the trips as layers.
mapAll = SedonaKepler.create_map()
SedonaKepler.add_df(mapAll, dfEdge, name="Road Network")
SedonaKepler.add_df(mapAll, dfMmResult.selectExpr("observed_points AS geometry"), name="Observed Points")
SedonaKepler.add_df(mapAll, dfMmResult.selectExpr("matched_points AS geometry"), name="Matched Points")
mapAll

The following figure shows the map matching results. The red lines are original trajectories, and the green lines are matched trajectories. We can see that the noisy original trajectories are all snapped to the road network.

map matching results example 2

Performance

We used WherobotsAI Map Matching to match 90 million trips across the entire US in just 1.5 hours on the Wherobots Tokyo runtime, which equates to approximately 1 million trips per minute. The average cost of matching 1 million trips is an order of magnitude lower than with the options outlined above.

The “optimization magic” behind WherobotsAI Map Matching lies in how Wherobots intelligently and automatically co-partitions trajectory and road network datasets based on the spatial proximity of their elements. This partitioning strategy balances the computational load evenly and makes map matching with Wherobots highly efficient, scalable, and affordable compared to alternatives.
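The general idea of co-partitioning can be sketched with a uniform grid: both datasets are assigned to the same spatial cells, so each cell can be matched independently and in parallel. This is a deliberately simplified illustration with hypothetical names and data. A uniform grid does not balance load on skewed data, and we are not describing the actual partitioning scheme Wherobots uses.

```python
from collections import defaultdict
from math import floor


def cells_for_bbox(bbox, cell_size):
    """Grid cells overlapped by a bounding box (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    return {
        (cx, cy)
        for cx in range(floor(min_x / cell_size), floor(max_x / cell_size) + 1)
        for cy in range(floor(min_y / cell_size), floor(max_y / cell_size) + 1)
    }


def co_partition(trajectories, roads, cell_size):
    """Assign both datasets to the same grid so each cell can be matched on its own."""
    partitions = defaultdict(lambda: {"trajectories": [], "roads": []})
    for name, bbox in trajectories.items():
        for cell in cells_for_bbox(bbox, cell_size):
            partitions[cell]["trajectories"].append(name)
    for name, bbox in roads.items():
        for cell in cells_for_bbox(bbox, cell_size):
            partitions[cell]["roads"].append(name)
    # Only cells holding both trajectories and roads produce matching work.
    return {c: p for c, p in partitions.items() if p["trajectories"] and p["roads"]}


# Hypothetical bounding boxes for two trips and three roads.
trajectories = {"trip_a": (0.1, 0.1, 0.4, 0.3), "trip_b": (2.2, 2.1, 2.8, 2.6)}
roads = {
    "road_1": (0.0, 0.0, 0.9, 0.9),
    "road_2": (2.0, 2.0, 2.9, 2.9),
    "road_3": (5.0, 5.0, 5.5, 5.5),
}
partitions = co_partition(trajectories, roads, cell_size=1.0)
for cell, contents in sorted(partitions.items()):
    print(cell, contents)
```

Here road_3 never meets a trajectory, so no work is scheduled for its cell. An adaptive scheme would additionally split crowded cells so that every partition carries a similar amount of work.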

Try It Out!

You can try out WherobotsAI Map Matching by starting a notebook environment in Wherobots Cloud and running our example notebook:

notebook_example/python/wherobots-ai/mapmatching_example.ipynb

You can also check out the WherobotsAI Map Matching tutorial and reference documentation for more information!

Unlock Satellite Imagery Insights with WherobotsAI Raster Inference

Recently we introduced WherobotsAI Raster Inference to unlock analytics on satellite and aerial imagery using SQL or Python. Raster Inference simplifies extracting insights from this imagery and is powered by open-source machine learning models. This feature is currently in preview, and we are expanding its capabilities to support more models. Below we’ll dig into the popular computer vision tasks that Raster Inference supports, describe how it works, and show how you can use it to run batch inference to find and map electricity infrastructure.

Watch the live demo of these capabilities here.

The Power of Machine Learning with Satellite Imagery

Petabytes of satellite imagery are generated each day all over the world in a dizzying number of sensor types and image resolutions. The applications for satellite imagery and other remote sensing data sources are broad and diverse. For example, satellites with consistent, continuous orbits are ideal for monitoring forest carbon stocks to validate carbon credits or estimating agricultural yields.

However, this data has been inaccessible for most analysts and even seasoned ML practitioners because insight extraction required specialized skills. We’ve done the work to make insight extraction simple and accessible to more people. Raster Inference abstracts the complexity and scales to support planetary-scale imagery datasets, so you don’t need ML expertise to derive insights. In this blog, we explore the key features that make Raster Inference effective for land cover classification, solar farm mapping, and marine infrastructure detection. And, in the near future, you will be able to use Raster Inference with your own models!

Introduction to Popular and Supported Machine Learning Tasks

Raster Inference supports the three most common kinds of computer vision models that are applied to imagery: classification, object detection, and semantic segmentation. Instance segmentation, which combines object localization and semantic segmentation, is another common type of model. It is not currently supported, but if you need it, contact us and we can add it to the roadmap.

Computer Vision Detection Types
Computer Vision Detection Categories from Lin et al. Microsoft COCO: Common Objects in Context

The figure above illustrates these tasks. Image classification is when an image is assigned one or more text labels. In image (a), the scene is assigned the labels “person”, “sheep”, and “dog”. Image (b) is an example of object localization (or object detection). Object localization creates bounding boxes around objects of interest and assigns labels. In this image, five sheep are localized separately along with one human and one dog. Finally, semantic segmentation is when each pixel is given a category label, as shown in image (c). Here we can see all the pixels belonging to sheep are labeled blue, the dog is labeled red, and the person is labeled teal.

While these examples highlight detection tasks on regular imagery, these computer vision models can be applied to raster formatted imagery. Raster data formats are the most common data formats for satellite and aerial imagery. When objects of interest in raster imagery are localized, their bounding boxes can be georeferenced, which means that each pixel is localized to spatial coordinates, such as latitude and longitude. Therefore, georeferencing is object localization suited for spatial analytics.
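Georeferencing a detection boils down to applying the raster's affine geotransform to pixel positions. The sketch below uses the six-coefficient geotransform convention popularized by GDAL and converts a pixel bounding box into spatial coordinates. The helper names and the sample transform are our own assumptions, and the bounding-box helper assumes a north-up raster with no rotation.

```python
def pixel_to_coords(col, row, transform):
    """Map a pixel position to spatial coordinates with an affine geotransform."""
    origin_x, pixel_width, row_rotation, origin_y, col_rotation, pixel_height = transform
    x = origin_x + col * pixel_width + row * row_rotation
    y = origin_y + col * col_rotation + row * pixel_height
    return x, y


def georeference_bbox(bbox, transform):
    """Convert a pixel box (col_min, row_min, col_max, row_max) to (x_min, y_min, x_max, y_max)."""
    col_min, row_min, col_max, row_max = bbox
    x_min, y_max = pixel_to_coords(col_min, row_min, transform)  # upper-left corner
    x_max, y_min = pixel_to_coords(col_max, row_max, transform)  # lower-right corner
    return x_min, y_min, x_max, y_max


# Hypothetical north-up raster: upper-left corner at (-112.0, 34.0), 0.25-degree pixels.
# pixel_height is negative because row numbers grow downwards.
transform = (-112.0, 0.25, 0.0, 34.0, 0.0, -0.25)
detection = (4, 8, 12, 16)  # pixel bounding box from an object detector
print(georeference_bbox(detection, transform))
```

Once a detection is expressed in longitude and latitude, it can be joined with any other spatial dataset.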

https://wherobots.com/wp-content/uploads/2024/06/remotesensing-11-00339-g005.png

The example above shows various applications of object detection for localizing and classifying features in high resolution satellite and aerial imagery. This example comes from DOTA, a 15-class dataset of different objects in RGB and grayscale satellite imagery. Public datasets like DOTA are used to develop and benchmark machine learning models.

Not only are there many publicly available object detection models, but also there are many semantic segmentation models.

Semantic Segmentation
Sourced from “A Scale-Aware Masked Autoencoder for Multi-scale Geospatial Representation Learning”.

Not every machine learning model should be treated equally; each has its own tradeoffs. You can see the difference between the ground truth image (human annotated buildings representing the real world) and segmentation results across two models (Scale-MAE and Vanilla MAE). These results are derived from the same image at two different resolutions (referred to as GSD, or Ground Sampling Distance).

  • Scale-MAE is a model developed to handle detection tasks at various resolutions with different sensor inputs. It uses a similar MAE model architecture as the Vanilla MAE, but is trained specifically for detection tasks on overhead imagery that span different resolutions.
  • The Vanilla MAE is not trained to handle varying resolutions in overhead imagery. Its performance suffers in the top row and especially the bottom row, where resolution is coarser, as seen by the mismatch between Vanilla MAE and the ground truth image where many pixels are incorrectly classified.

Satellite Analytics Before Raster Inference

Without Raster Inference, typically a team who is looking to extract insights from overhead imagery using ML would need to:

  1. Deploy a distributed runtime to scale out workloads such as data loading, preprocessing, and inference.
  2. Develop functionality to operate on raster metadata to easily filter it by location to run inference workloads on specific areas of interest.
  3. Optimize models to run performantly on GPUs, which can involve complex rewrites of the underlying model prediction logic.
  4. Create and manage data preprocessing pipelines to normalize, resize, and collate raster imagery into the correct data type and size required by the model.
  5. Develop the logic to run data loading, preprocessing, and model inference efficiently at scale.

Raster Inference and its SQL and Python APIs abstract this complexity so you and your team can easily perform inference on massive raster datasets.

Raster Inference APIs for SQL and Python

Raster Inference offers APIs in both SQL and Python to run inference tasks. These APIs are designed to be easy to use, even if you’re not a machine learning expert. RS_CLASSIFY can be used for scene classification, RS_BBOXES_DETECT for object detection, and RS_SEGMENT for semantic segmentation. Each function produces tabular results that can be georeferenced for the scene, object, or segmentation, depending on the function. The records can be joined or visualized with other data (geospatial or traditional) to curate enriched datasets and insights. Here are SQL and Python examples for RS_SEGMENT.

RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
df = df_raster_input.withColumn("segment_result", rs_segment(model_id, col("outdb_raster")))

Example: Mapping Electricity Infrastructure

Imagine you want to optimize the location of new EV charging stations, but you want to target locations based on the availability of green energy sources, such as local solar farms. You can use Raster Inference to detect and locate solar farms and cross-reference these locations with internal data or other vector geometries that capture demand for EV charging. This use case will be demonstrated in our upcoming release webinar on July 10th.

Let’s walk through how to use Raster Inference for this use case.

First, we run predictions on rasters to find solar farms. The following code block that calls RS_SEGMENT shows how easy this is.

CREATE OR REPLACE TEMP VIEW segment_fields AS (
    SELECT
        outdb_raster,
        RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
    FROM
    az_high_demand_with_scene
)

The confidence_array column produced by RS_SEGMENT can be assigned the same geospatial coordinates as the raster input and converted to a vector that can be spatially joined and processed with WherobotsDB using RS_SEGMENT_TO_GEOMS. We select a confidence threshold of 0.65 so that we only georeference high-confidence detections.

WITH t AS (
    SELECT RS_SEGMENT_TO_GEOMS(outdb_raster, confidence_array, array(1), class_map, 0.65) result
    FROM predictions_df
)
SELECT result.* FROM t

+----------+--------------------+--------------------+
|     class|avg_confidence_score|            geometry|
+----------+--------------------+--------------------+
|Solar Farm|  0.7205783606825462|MULTIPOLYGON (((-...|
|Solar Farm|  0.7273308333550763|MULTIPOLYGON (((-...|
|Solar Farm|  0.7301468510823231|MULTIPOLYGON (((-...|
|Solar Farm|  0.7180177244988899|MULTIPOLYGON (((-...|
|Solar Farm|   0.728077805771141|MULTIPOLYGON (((-...|
|Solar Farm|     0.7264981572898|MULTIPOLYGON (((-...|
|Solar Farm|  0.7044100126912517|MULTIPOLYGON (((-...|
|Solar Farm|  0.7137283466756343|MULTIPOLYGON (((-...|
+----------+--------------------+--------------------+

This allows us to integrate the vectorized model predictions with other spatial datasets and easily visualize the results with SedonaKepler.

https://wherobots.com/wp-content/uploads/2024/06/solar_farm_detection-1-1024x398.png
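To build intuition for the confidence threshold used above, here is a tiny plain-Python sketch that keeps only the pixels whose confidence for a class meets the threshold and averages the confidence of the retained pixels. The helper name and the sample scores are made up, and whether RS_SEGMENT_TO_GEOMS compares with >= or >, or computes avg_confidence_score exactly this way, is an assumption on our part.

```python
def threshold_class(confidence, threshold):
    """Keep pixels whose confidence for one class meets the threshold."""
    mask = [[score >= threshold for score in row] for row in confidence]
    kept = [score for row in confidence for score in row if score >= threshold]
    avg_confidence = sum(kept) / len(kept) if kept else None
    return mask, avg_confidence


# Hypothetical per-pixel confidence scores for the "Solar Farm" class.
confidence = [
    [0.10, 0.70, 0.80],
    [0.20, 0.75, 0.60],
    [0.05, 0.30, 0.90],
]
mask, avg_confidence = threshold_class(confidence, threshold=0.65)
for row in mask:
    print(["#" if keep else "." for keep in row])
print("average confidence of kept pixels:", round(avg_confidence, 4))
```

Raising the threshold shrinks the mask and yields fewer, more reliable polygons; lowering it recovers more candidate detections at the cost of more false positives.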

Here Raster Inference runs on an 85 GiB dataset with 2,200 raster scenes for Arizona. Using a Sedona (tiny) runtime, Raster Inference completed in 430 seconds, predicting solar farms for all low cloud cover satellite images for the state of Arizona for the month of October. If we scale up to a San Francisco (small) runtime, inference completes in 246 seconds, about 1.75x faster. In general, average bytes processed per second by Wherobots increases as datasets scale in size because startup costs are amortized over time. Processing speed also increases as runtimes scale in size.

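+-------------+------------------------+
| Runtime Size|Inference time (seconds)|
+-------------+------------------------+
|       Sedona|                     430|
|San Francisco|                     246|
+-------------+------------------------+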
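The throughput and speedup implied by these numbers can be checked with a few lines of arithmetic. The figures come from the text above; only the variable names are ours.

```python
dataset_gib = 85
inference_seconds = {"Sedona (tiny)": 430, "San Francisco (small)": 246}

# Average throughput per runtime, in MiB per second.
for runtime, seconds in inference_seconds.items():
    mib_per_second = dataset_gib * 1024 / seconds
    print(f"{runtime}: {mib_per_second:.0f} MiB/s")

speedup = inference_seconds["Sedona (tiny)"] / inference_seconds["San Francisco (small)"]
print(f"speedup: {speedup:.2f}x")
```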

We use predictions from the output of Raster Inference to derive insights about which zip codes have the most solar farms, as shown below. This statement joins predicted solar farms with zip codes by location, then ranks zip codes by the pre-computed solar farm area within each zip code. We skipped this step for brevity but you can see it and others in the notebook example.

az_solar_zip_codes = sedona.sql("""
-- solar_area is pre-computed per predicted polygon; sum it within each zip code
SELECT sum(solar_area) AS solar_area, any_value(az_zta5.geometry) AS geometry, ZCTA5CE10
FROM predictions_polys JOIN az_zta5
WHERE ST_Intersects(az_zta5.geometry, predictions_polys.geometry)
GROUP BY ZCTA5CE10
ORDER BY solar_area DESC
""")

https://wherobots.com/wp-content/uploads/2024/06/final_analysis.png

These predictions are made possible by SATLAS, a family of machine learning models released with Apache 2.0 licensing from Allen AI. The solar model demonstrated above was derived from the SATLAS foundational model. This foundational model can be used as a building block to create models to address specific detection challenges like solar farm detection. Additionally, there are many other open source machine learning models available for deriving insights from satellite imagery, many of which are provided by the TorchGeo project. We are just beginning to explore what these models can achieve for planetary-scale monitoring.

If you have a specific model you would like to see made available, please contact us to let us know.

For detailed instructions on using Raster Inference, please refer to our example Jupyter notebooks in the documentation.

https://wherobots.com/wp-content/uploads/2024/06/Screenshot_2024-06-08_at_2.11.07_PM-1024x683.png

Here are some links to get you started:
https://docs.wherobots.com/latest/tutorials/wherobotsai/wherobots-inference/segmentation/

https://docs.wherobots.com/latest/api/wherobots-inference/pythondoc/inference/sql_functions/

Getting Started

Getting started with WherobotsAI Raster Inference is easy. We’ve provided three models in Wherobots Cloud that can be used with our GPU optimized runtimes. Sign up for an account on Wherobots Cloud, send us a note to access the professional tier, start a GPU runtime, and you can run our example Jupyter notebooks to analyze satellite imagery in SQL or Python.

Stay tuned for updates on improvements to Raster Inference that will make it possible to run more models, including your own custom models. We’re excited to hear what models you’d like us to support, or the integrations you need to make running your own models even easier with Raster Inference. We can’t wait for your feedback and to see what you’ll create!
