We are excited to introduce GeoStats, a machine learning (ML) and statistical toolbox for WherobotsAI and Apache Sedona users. With GeoStats, you can easily identify critical patterns in geospatial datasets, such as hotspots and anomalies, and quickly get insights from large-scale data. While these algorithms are supported in other packages, we’ve optimized each algorithm to be highly performant for small to planetary-scale geospatial workloads. That means you can get results from these algorithms significantly faster and at a lower cost, and do it all more productively through a unified development experience purpose-built for geospatial data science and ETL.
The Wherobots toolbox supports the DBSCAN, Local Outlier Factor (LOF), and Getis-Ord (Gi*) algorithms. Apache Sedona users can use DBSCAN starting with Apache Sedona version 1.7.0, and like all other features of Apache Sedona, it’s fully compatible with Wherobots.
DBSCAN is the most popular algorithm we see in geospatial use cases. It identifies clusters, areas of your data that are closely packed together, and outliers, areas of your data that are set apart.
Typical use cases for DBSCAN are found in:
LOF is an anomaly detection algorithm that identifies outliers present in a dataset.
Typical use cases for LOF include:
Getis-Ord is also a popular algorithm for identifying local hot and cold spots.
Typical use cases for Gi* include:
Before GeoStats, teams leveraging any of the algorithms in the toolbox in a data analysis or ML pipeline would:
With GeoStats in WherobotsAI, you can now:
We’ll give a brief overview of each algorithm, walk through how to use it, and show how it performs at various scales.
DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups together points that have many nearby neighbors and marks as outliers the points that lie alone in low-density regions.
The following examples assume you have already set up an organization and have an active runtime and notebook, with a DataFrame of interest to run the algorithms on.
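To make the density idea concrete, here is a minimal pure-Python sketch of DBSCAN. It is an illustration only, not the GeoStats implementation: the function name `dbscan_sketch`, the brute-force neighbor search, and the sample points are all assumptions made for this example.

```python
from math import dist

def dbscan_sketch(points, epsilon, min_points):
    """Return (labels, is_core); label -1 marks an outlier. Illustrative only."""
    n = len(points)
    # Brute-force neighbor lists; each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    # A core point has at least min_points neighbors within epsilon.
    is_core = [len(nb) >= min_points for nb in neighbors]
    labels = [-1] * n  # everything is an outlier until a cluster claims it
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unclaimed core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # only core points keep expanding
        cluster += 1
    return labels, is_core

# Hypothetical sample: two tight groups of four points and one isolated point.
sample = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (5.1, 5.1),
    (10.0, 0.0),
]
labels, is_core = dbscan_sketch(sample, epsilon=0.3, min_points=3)
print(labels)
```

The two tight groups end up in separate clusters, and the isolated point keeps the label -1 because it has too few neighbors to be a core point and no core point reaches it.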
WherobotsAI GeoStats DBSCAN Python API Overview
For a full walk through see the Python API reference: dbscan(...).
DBSCAN Walk Through
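# Build a DataFrame with a point geometry column from the (x, y) pairs in X
df = sedona.createDataFrame(X).select(ST_MakePoint("_1", "_2").alias("geometry"))
# Choose the search radius and the minimum neighbor count for a core point
epsilon = 0.3
min_points = 10
clusters_df = dbscan(df, epsilon=epsilon, min_points=min_points, include_outliers=True)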
+--------------------+------+-------+
|            geometry|isCore|cluster|
+--------------------+------+-------+
|POINT (1.22185277...| false|      1|
|POINT (0.77885034...| false|      1|
|POINT (-2.2744742...| false|      2|
+--------------------+------+-------+
only showing top 3 rows
There’s a complete example of how to use DBSCAN in the Wherobots user documentation.
To show DBSCAN performance in Wherobots, we created a European sample of the Overture buildings dataset and ran DBSCAN to identify clusters of buildings near each other, starting from the geographic center of Europe and working outwards. For each subsampled dataset, we ran DBSCAN with an epsilon of 0.005 degrees (about 0.35 miles, or roughly 1,800 feet, measured north-south) and a min_points value of 4 on a Large runtime in Wherobots Cloud. As seen below, DBSCAN effectively processes an increasing number of records, with 100M records taking 1.6 hours to process.
LOF is an anomaly detection algorithm that identifies outliers present in a dataset. It does this by measuring how close a given data point is to its k nearest neighbors (with k being a user-chosen hyperparameter) in comparison to how close those neighbors are to their own nearest neighbors. LOF provides a score that represents the degree to which a record is an inlier or outlier.
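The comparison described above can be sketched in a few lines of pure Python. This is an illustration, not the GeoStats implementation: `lof_sketch` is a hypothetical name, the neighbor search is brute force, ties at the k-th distance are simply cut off at k neighbors, and the input is assumed to contain no duplicate points.

```python
from math import dist

def lof_sketch(points, k):
    """Return one LOF score per point. Illustrative only."""
    n = len(points)
    # k nearest neighbors of each point (excluding itself) as (distance, index).
    knn = []
    for i in range(n):
        others = sorted(
            (dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        knn.append(others[:k])
    # Distance from each point to its k-th nearest neighbor.
    k_distance = [nb[-1][0] for nb in knn]

    def local_reachability_density(i):
        # Reachability distance from neighbor j: max(k_distance(j), d(i, j)).
        reach = [max(k_distance[j], d) for d, j in knn[i]]
        return len(reach) / sum(reach)

    density = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(density[j] for _, j in knn[i]) / (len(knn[i]) * density[i])
        for i in range(n)
    ]

# Hypothetical sample: a 3x3 grid of points plus one far-away point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
scores = lof_sketch(grid + [(10.0, 10.0)], k=3)
print(max(range(len(scores)), key=scores.__getitem__))  # index of the top score
```

Scores near 1 mean a point is about as densely surrounded as its neighbors, while scores well above 1 flag likely outliers. In this sample the far-away point gets by far the largest score.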
For the full example, please see this docs page.
WherobotsAI GeoStats LOF Python API Overview
For a full walk through see the Python API reference: local_outlier_factor(...).
LOF Walk Through
df = sedona.createDataFrame(X).select(ST_MakePoint(f.col("_1"), f.col("_2")).alias("geometry"))
k=20
outliers_df = local_outlier_factor(df, k=20)
+--------------------+------------------+
|            geometry|               lof|
+--------------------+------------------+
|POINT (-1.9169927...|0.9991534865548664|
|POINT (-1.7562422...|1.1370318880088373|
|POINT (-2.0107478...|1.1533763384772193|
+--------------------+------------------+
only showing top 3 rows
There’s a complete example of how to use LOF in the Wherobots user documentation.
We followed the same procedure as with DBSCAN, but ran LOF to identify buildings that are outliers relative to their neighbors. With each set of buildings we ran LOF with k=20 on a Large runtime in Wherobots Cloud. As seen below, GeoStats LOF scales effectively with growing data size, with 100M records taking 10 minutes to process.
Getis-Ord is an algorithm for identifying statistically significant local hot and cold spots.
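As a rough illustration of what "statistically significant" means here, the sketch below computes the standardized Gi* statistic (a z-score) with binary distance-band weights in pure Python. It is not the GeoStats implementation: `gi_star_sketch` is a hypothetical name, the neighbor search is brute force, and it assumes the values are not all identical and that no neighborhood covers the entire dataset.

```python
from math import dist, sqrt

def gi_star_sketch(points, values, radius):
    """Return one Gi* z-score per point, using 0/1 distance-band weights."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum(v * v for v in values) / n - mean * mean)
    z_scores = []
    for i in range(n):
        # Weight 1 for every point within the radius, including i itself
        # (the "star" variant); weight 0 for everything else.
        band = [j for j in range(n) if dist(points[i], points[j]) <= radius]
        w_sum = len(band)  # for 0/1 weights, sum(w) equals sum(w squared)
        local_sum = sum(values[j] for j in band)
        numerator = local_sum - mean * w_sum
        denominator = std * sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z_scores.append(numerator / denominator)
    return z_scores

# Hypothetical sample: ten cells on a line, with high values at one end.
cells = [(float(x), 0.0) for x in range(10)]
counts = [10, 10, 10, 1, 1, 1, 1, 1, 1, 1]
z = gi_star_sketch(cells, counts, radius=1.0)
print([round(score, 2) for score in z])
```

Large positive z-scores mark hot spots (high values surrounded by high values) and large negative z-scores mark cold spots. In this sample the cell in the middle of the high-value run scores well above the common 1.96 threshold.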
WherobotsAI GeoStats Gi* Python API Overview
For the full example, please see the docs for g_local(...).
Gi* Walk Through
# h3_zoom_level must be defined before this step
places_df = (
    sedona.table("wherobots_open_data.overture_2024_07_22.places_place")
    .select(f.col("geometry"), f.col("categories"))
    .withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)
star = True
neighbor_search_radius_degrees = 1.0
variable_column = "myNumericColumnName"
# df is the input DataFrame holding the geometry and the numeric variable column
weighted_dataframe = add_binary_distance_band_column(
    df,
    neighbor_search_radius_degrees,
    include_self=star
)
gi_df = g_local(
    weighted_dataframe,
    variable_column,
    star=star
)
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|num_places|                  G|                  EG|                  VG|                 Z|                   P|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|       871| 0.1397485091609774|0.013219284603421462|5.542296862370928E-5|16.995969941572465|                 0.0|
|       908|0.16097739240211956|0.013219284603421462|5.542296862370928E-5|19.847528249317246|                 0.0|
|       218|0.11812096144582315|0.013219284603421462|5.542296862370928E-5|14.090861243071908|                 0.0|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
only showing top 3 rows
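The Z column in this output can be reproduced from the G, EG, and VG columns. The sketch below reads EG as the expected value of G and VG as its variance, which is an assumption based on the column names, though the arithmetic does match the first row. The two-sided normal p-value is likewise an assumption about how P is derived, and `z_and_p` is a hypothetical helper.

```python
from math import erfc, sqrt

def z_and_p(g, expected_g, variance_g):
    """Standardize G and compute a two-sided p-value under a standard normal."""
    z = (g - expected_g) / sqrt(variance_g)
    p = erfc(abs(z) / sqrt(2))
    return z, p

# Values copied from the first row of the output above.
z_row, p_row = z_and_p(
    g=0.1397485091609774,
    expected_g=0.013219284603421462,
    variance_g=5.542296862370928e-5,
)
print(round(z_row, 3))
```

A z-score near 17 is far out in the tail of the normal distribution, so the p-value is vanishingly small, which is consistent with the 0.0 shown in the P column.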
There’s a complete example of how to use Gi* in the Wherobots user documentation.
To showcase how Gi* performs in Wherobots, we again used the same example as for DBSCAN, but ran Gi* on the area of the buildings. With each set of buildings we ran Gi* with a binary neighbor weight and a neighborhood radius of 0.007 degrees (~0.5 miles) on a Large runtime in Wherobots Cloud. As seen below, the algorithm scales mostly linearly with the number of records, with 100M records taking 1.6 hours to process.
The way we implemented these algorithms for large-scale geospatial workloads will help you make sense of your geospatial data faster. You can get started for free today.
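Because the neighborhood radius is given in degrees, it helps to sanity-check what ground distance it corresponds to. The sketch below is a rough spherical-Earth approximation; the helper `degrees_to_meters` and the constants are assumptions for illustration and are not part of GeoStats.

```python
from math import cos, pi, radians

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius, spherical approximation
METERS_PER_MILE = 1609.344

def degrees_to_meters(deg, latitude_deg=0.0, along="latitude"):
    """Approximate ground distance covered by `deg` degrees."""
    meters_per_degree = EARTH_RADIUS_M * pi / 180
    if along == "longitude":
        # East-west degrees shrink with the cosine of the latitude.
        meters_per_degree *= cos(radians(latitude_deg))
    return deg * meters_per_degree

north_south_m = degrees_to_meters(0.007)
east_west_m = degrees_to_meters(0.007, latitude_deg=50.0, along="longitude")
print(round(north_south_m / METERS_PER_MILE, 2), round(east_west_m))
```

Measured north-south, 0.007 degrees comes out just under half a mile. At European latitudes the same radius covers noticeably less ground east-west, so a radius in degrees describes an area that is not perfectly circular on the ground.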
notebook_example/python/wherobots-ai/
Apache Sedona users will have access to GeoStats DBSCAN with the Apache Sedona 1.7.0 release. Subscribe to the Spatial Intelligence Newsletter and join the Sedona community to get notified of the release and get started!
We’re excited to hear what ML and statistical algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!