7 Mins Read

8 Oct 2024

Introducing GeoStats for WherobotsAI and Apache Sedona

Authors

Introducing GeoStats for WherobotsAI and Apache Sedona

We are excited to introduce GeoStats, a machine learning (ML) and statistical toolbox for WherobotsAI and Apache Sedona users. With GeoStats, you can easily identify critical patterns in geospatial datasets such as hotspots and anomalies, and quickly get critical insights from large scale data. While these algorithms are supported in other packages, we’ve optimized each algorithm to be highly performant for small to planetary scale geospatial workloads. That means, you can get results from these algorithms significantly faster, at a lower cost, and do it all more productively, through a unified development experience purpose-built for geospatial data science and ETL.

The Wherobots toolbox supports DBSCAN, Local Outlier Factor (LOF), and Getis-Ord (Gi*) algorithms. Apache Sedona users can utilize DBSCAN starting with Apache Sedona version 1.7.0, and like all other features of Apache Sedona, its fully compatible with Wherobots.

Use Cases for GeoStats

DBSCAN

DBSCAN is the most popular algorithm we see in geospatial use cases. It identifies clusters, areas of your data that are closely packed together, and outliers, areas of your data that are set apart.

Typical use cases for DBSCAN are found in:

Retail: Decision makers use DBSCAN with location data to understand areas of high and low pedestrian activity to decide where to setup retailing establishments.
City Planning: City planners use DBSCAN with GPS data to optimize transit support by identifying high usage routes, areas in need of additional transit options, or areas that have too much support.
Air Traffic Control: Traffic controllers use DBSCAN to identify areas with increasing weather activity to optimize flight routing.
Risk computation: Insurers and others can use DBSCAN to make policy decisions and calculate risk where risk is correlated to the proximity of two or more features of interest.

Local Outlier Factor (LOF)

LOF is an anomaly detection algorithm that identifies outliers present in a dataset.

Typical use cases for LOF include:

Data analysis and cleansing: Data teams can use LOF to identify and remove anomalies within a dataset, like removing erroneous GPS data points from a trace dataset

Getis-Ord (Gi*)

Getis-Ord is also a popular algorithm for identifying local hot and cold spots.

Typical use cases for Gi* include:

Public health: Officials can use disease data with Gi* to identify areas of abnormal disease outbreak
Telecommunications: Network administrators can use Gi* to identify areas of high demand and optimize network deployment
Insurance: Insurers can identify areas prone to specific claims to better manage risk

Click here to launch this interactive notebook

Launch Notebook

Traditional challenges with using these algorithms on geospatial data

Before GeoStats, teams leveraging any of the algorithms in the toolbox in a data analysis or ML pipeline would:

Struggle to get performance or scale from the underlying solutions that also don’t perform well when joining geospatial data.
Determine how to host and scale open source versions of popular ML and statistical algorithms, like PostGIS or scikit-learn DBSCAN, PySal Gi*, or scikit-learn LOF, to work for geospatial data types and geospatial data formats.
Replicate this overhead each time they want to deploy a new algorithm for geospatial data.

Benefits of WherobotsAI GeoStats

With GeoStats in WherobotsAI, you can now:

Easily run native algorithms on a cloud-based engine, optimized for producing spatial data products and insights at scale.
Use these algorithms without the operational overhead associated with setup and maintenance.
Leverage optimized, hosted algorithms within a single platform to easily experiment and get critical insights faster.

We’ll walk through a brief overview of each algorithm, how to use them, and show how they perform at various scales.

Diving Deeper into the GeoStats Toolbox

DBSCAN Overview

DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups points with many nearby neighbors and marks as outlier points that lie alone in low-density regions.

How to use DBSCAN in Wherobots

The following examples assume you have already setup an organization and have an active runtime and notebook, with a dataframe of interest to run the algorithms on.

WherobotsAI GeoStats DBSCAN Python API Overview
For a full walk through see the Python API reference: dbscan(...).

Supported Geometries: points, linestrings, polygons
Hyperparameters: max distance to neighbors (epsilon), min neighbors (min_points)
Output: dataframe with cluster id

DBSCAN Walk Through

Choose your dataset and create a Sedona DataFrame.

dataset=sedona.createDataFrame(X).select(ST_MakePoint("_1", "_2").alias("geometry"))

Choose values for your hyperparameters, max distance to neighbors (epsilon) and minimum neighbors (min_points). These values will determine how DBSCAN identifies clusters.

epsilon=0.3
min_points=10

Run DBSCAN on your DataFrame with your chosen hyperparameter values.

clusters_df = dbscan(df, epsilon=0.3, min_points=10, include_outliers=True)

Analyze the results. For each datapoint, DBSCAN returns the cluster it’s associated with or if it’s an outlier.

+--------------------+------+-------+
|            geometry|isCore|cluster|
+--------------------+------+-------+
|POINT (1.22185277...| false|      1|
|POINT (0.77885034...| false|      1|
|POINT (-2.2744742...| false|      2|
+--------------------+------+-------+

only showing top 3 rows

There’s a complete example of how to use DBSCAN in the Wherobots user documentation.

DBSCAN Performance Overview

To show DBSCAN performance in Wherobots, we created a European sample of the Overture buildings dataset, and ran DBSCAN to identify clusters of buildings near each other, starting from the geographic center of Europe and worked outwards. For each subsampled dataset, we run DBSCAN with an epsilon of 0.005 degrees (i.e. ~30 feet) and min_points value of 4 on a Large runtime in Wherobots Cloud. As seen below, DBSCAN effectively processes an increasing number of records, with 100M records taking 1.6 hrs to process.

Local Outlier Factor (LOF)

LOF is an anomaly detection algorithms that identifies outliers present in a dataset. It does this by measuring how close a given data point is to a set of k-nearest neighbors (with k being a user chosen hyperparameter) in comparison to how close its nearest neighbors are to their nearest neighbors. LOF provides a score that represents the degree to which a record is an inlier or outlier.

How to use LOF in Wherobots

For the full example, please see this docs page.

WherobotsAI GeoStats LOF Python API Overview
For a full walk through see the Python API reference: local_outlier_factor(...).

Supported Geometries: points, linestrings, polygons
Hyperparameters: number of nearest neighbors to use
Output: score representing degree of inlier or outlier

LOF Walk Through

Choose your dataset and create a Sedona DataFrame.

df = sedona.createDataFrame(X).select(ST_MakePoint(f.col("_1"), f.col("_2")).alias("geometry"))

Choose your k value for how many nearest neighbors you want to use to measure density near a given datapoint.

k=20

Run LOF on your DataFrame with your chosen k value.

outliers_df = local_outlier_factor(df, k=20)

Analyze your results. LOF returns a score for each datapoint representing the degree of inlier or outlier.

+--------------------+------------------+
|            geometry|               lof|
+--------------------+------------------+
|POINT (-1.9169927...|0.9991534865548664|
|POINT (-1.7562422...|1.1370318880088373|
|POINT (-2.0107478...|1.1533763384772193|
+--------------------+------------------+
only showing top 3 rows

There’s a complete example of how to use LOF in the Wherobots user documentation.

LOF Performance Overview

We followed the same procedure with DBSCAN but ran LOF to identify clusters of buildings near each other. With each set of buildings we ran LOF with a k=20 on a large Wherobots Cloud runtime. As seen below, GeoStats LOF scales effectively with growing data size with 100M records taking 10 mins to process.

Getis-Ord (Gi*) Overview

Getis-Ord is an algorithm for identifying statistically significant local hot and cold spots.

How to use GeoStats Gi*

WherobotsAI GeoStats Gi* Python API Overview
For the full example, please see this docs g_local(...).

Supported Geometries: points, linestrings, polygons
Hyperparameters: star, neighbor weighting
Output: Set of statistics that indicate the degree of local hot or cold spot for a given record

Gi* Walk Through

Choose your dataset and create a Sedona Dataframe.

places_df = (
    sedona.table("wherobots_open_data.overture_2024_07_22.places_place")
        .select(f.col("geometry"), f.col("categories"))
        .withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)

Choose how you’d like to weight datapoints (ex: datapoints in a specific geographic area need to be weighted higher or any datapoint close to a given datapoint need to be weighted higher) and star (boolean to indicate if a record is a neighbor of itself).

star = True
neighbor_search_radius_degrees = 1.0
variable_column = "myNumericColumnName"

weighted_dataframe = add_binary_distance_band_column(
        df,
        neighbor_search_radius_degrees,
        include_self=star
)

Run Gi* on your DataFrame with your chosen hyperparameters.

gi_df = g_local(
        weighted_dataframe,
    variable_column,
    star=star
)

Analyze your results. For each datapoint, Gi* returns a set of statistics that indicate the degree of local hot or cold spot.

+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|num_places|                  G|                  EG|                  VG|                 Z|                   P|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|       871| 0.1397485091609774|0.013219284603421462|5.542296862370928E-5|16.995969941572465|                 0.0|
|       908|0.16097739240211956|0.013219284603421462|5.542296862370928E-5|19.847528249317246|                 0.0|
|       218|0.11812096144582315|0.013219284603421462|5.542296862370928E-5|14.090861243071908|                 0.0|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
only showing top 3 rows

There’s a complete example of how to use Gi* in the Wherobots user documentation.

Getis-Ord Performance Overview

To showcase how Gi performs in Wherobots, again we used the same example as DBSCAN, but ran Gi on the area of the buildings. With each set of buildings we ran Gi* with a binary neighbor weight and a neighborhood radius of .007 degrees (~0.5 miles) on a Large runtime in Wherobots Cloud. As seen below, the algorithm scales mostly linearly with the number of records, with 100M records taking 1.6 hours to process.

Get started with WherobotsAI GeoStats

The way we implemented these algorithms for large scale geospatial workloads, will help you make sense of your geospatial data faster. You can get started for free today.

If you haven’t already, create a free Wherobots Organization subscribed to the Community Edition of Wherobots.
Start a Wherobots Notebook
In the Notebook environment, explore the notebook_example/python/wherobots-ai/ folder for examples that you can use to get started.
Need additional help? Check out our user documentation, and send us a note if needed at support@wherobots.com.

Apache Sedona Users

Apache Sedona users will have access to GeoStats DBSCAN with the Apache Sedona 1.7.0 release. Subscribe to the Spatial Intelligence Newsletter and join the Sedona community to get notified of the release and get started!

What’s next

We’re excited to hear what ML and statistical algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!

Get started with Wherobots Pro

Access Here

7 Mins Read 10 Dec 2025

Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence

RasterFlow takes insights and embeddings from satellite and overhead imagery datasets into Apache Iceberg tables, with ease and efficiency at any scale.

Computer Vision + 4

4 Mins Read 27 Feb 2026

PostGIS, Wherobots, and the Spatial Data Lakehouse: A Strategic Guide for Leaders

Explore PostGIS, Wherobots, and the Spatial Data Lakehouse. Learn when to use each for scalable geospatial analytics, AI, and cost-efficient data strategy.

Spatial Lakehouse

4 Mins Read 19 Feb 2026

It takes 15 minutes for the Caltrain to get from Sunnyvale to SAP Center

That’s how long it took our MCP server to go from “how many bus stops are in Maryland” to an answer

General + 2

4 Mins Read 10 Feb 2026

Wherobots and Felt Partner to Modernize Spatial Intelligence

We’re excited to announce Wherobots and Felt are partnering to enable data teams to innovate with physical world data and move beyond legacy GIS, using the modern spatial intelligence stack. The stack with Wherobots and Felt provides a cloud-native, spatial processing and collaborative mapping solution that accelerates innovation and time-to-insight across an organization. What is […]

General + 1

Introducing GeoStats for WherobotsAI and Apache Sedona

Introducing GeoStats for WherobotsAI and Apache Sedona

Use Cases for GeoStats

DBSCAN

Local Outlier Factor (LOF)

Getis-Ord (Gi*)

Traditional challenges with using these algorithms on geospatial data

Benefits of WherobotsAI GeoStats

Diving Deeper into the GeoStats Toolbox

DBSCAN Overview

How to use DBSCAN in Wherobots

DBSCAN Performance Overview

Local Outlier Factor (LOF)

How to use LOF in Wherobots

LOF Performance Overview

Getis-Ord (Gi*) Overview

How to use GeoStats Gi*

Getis-Ord Performance Overview

Get started with WherobotsAI GeoStats

Apache Sedona Users

What’s next

RELATED POSTS