Introducing GeoStats for WherobotsAI and Apache Sedona
We are excited to introduce GeoStats, a machine learning (ML) and statistical toolbox for WherobotsAI and Apache Sedona users. With GeoStats, you can easily identify patterns in geospatial datasets, such as hotspots and anomalies, and quickly get critical insights from large-scale data. While these algorithms are supported in other packages, we’ve optimized each one to be highly performant for small to planetary-scale geospatial workloads. That means you can get results from these algorithms significantly faster and at a lower cost, and do it all more productively through a unified development experience purpose-built for geospatial data science and ETL.
The Wherobots toolbox supports the DBSCAN, Local Outlier Factor (LOF), and Getis-Ord (Gi*) algorithms. Apache Sedona users can use DBSCAN starting with Apache Sedona version 1.7.0, and like all other features of Apache Sedona, it’s fully compatible with Wherobots.
Use Cases for GeoStats
DBSCAN
DBSCAN is the most popular algorithm we see in geospatial use cases. It identifies clusters, areas of your data that are closely packed together, and outliers, areas of your data that are set apart.
Typical use cases for DBSCAN are found in:
- Retail: Decision makers use DBSCAN with location data to understand areas of high and low pedestrian activity and decide where to set up retail establishments.
- City Planning: City planners use DBSCAN with GPS data to optimize transit support by identifying high usage routes, areas in need of additional transit options, or areas that have too much support.
- Air Traffic Control: Traffic controllers use DBSCAN to identify areas with increasing weather activity to optimize flight routing.
- Risk computation: Insurers and others can use DBSCAN to make policy decisions and calculate risk where risk is correlated to the proximity of two or more features of interest.
Local Outlier Factor (LOF)
LOF is an anomaly detection algorithm that identifies outliers present in a dataset.
Typical use cases for LOF include:
- Data analysis and cleansing: Data teams can use LOF to identify and remove anomalies within a dataset, such as erroneous GPS data points in a trace dataset.
Getis-Ord (Gi*)
Getis-Ord is also a popular algorithm for identifying local hot and cold spots.
Typical use cases for Gi* include:
- Public health: Officials can use disease data with Gi* to identify areas of abnormal disease outbreak.
- Telecommunications: Network administrators can use Gi* to identify areas of high demand and optimize network deployment.
- Insurance: Insurers can identify areas prone to specific claims to better manage risk.
Traditional challenges with using these algorithms on geospatial data
Before GeoStats, teams leveraging any of the algorithms in the toolbox in a data analysis or ML pipeline would:
- Struggle to get performance or scale from underlying solutions that also perform poorly when joining geospatial data.
- Determine how to host and scale open source versions of popular ML and statistical algorithms, like PostGIS or scikit-learn DBSCAN, PySAL Gi*, or scikit-learn LOF, so that they work with geospatial data types and formats.
- Replicate this overhead each time they want to deploy a new algorithm for geospatial data.
Benefits of WherobotsAI GeoStats
With GeoStats in WherobotsAI, you can now:
- Easily run native algorithms on a cloud-based engine, optimized for producing spatial data products and insights at scale.
- Use these algorithms without the operational overhead associated with setup and maintenance.
- Leverage optimized, hosted algorithms within a single platform to easily experiment and get critical insights faster.
We’ll give a brief overview of each algorithm, walk through how to use it, and show how it performs at various scales.
Diving Deeper into the GeoStats Toolbox
DBSCAN Overview
DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups together points that have many nearby neighbors and marks as outliers the points that lie alone in low-density regions.
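To make the density idea concrete, here is a minimal pure-Python sketch of DBSCAN on a handful of 2D points. It is a toy O(n^2) illustration, not the GeoStats implementation. The function name dbscan_sketch, the label -1 for outliers, and the choice to count a point itself toward min_points are assumptions made for this example.

```python
import math

def dbscan_sketch(points, epsilon, min_points):
    """Toy O(n^2) DBSCAN over (x, y) tuples.

    Returns one label per point: 0, 1, ... for clusters, -1 for outliers.
    Assumption for this sketch: a point counts toward its own min_points.
    """
    def neighbors(i):
        # Indices of all points within epsilon of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= epsilon]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1  # provisional outlier; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: reachable, but not dense itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels

points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 10)]
labels = dbscan_sketch(points, epsilon=0.2, min_points=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two tight groups become clusters 0 and 1, and the isolated point at (10, 10) is marked as an outlier.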
How to use DBSCAN in Wherobots
The following examples assume you have already set up an organization and have an active runtime and notebook, with a DataFrame of interest to run the algorithms on.
WherobotsAI GeoStats DBSCAN Python API Overview
For a full walkthrough, see the Python API reference: dbscan(...).
- Supported Geometries: points, linestrings, polygons
- Hyperparameters: max distance to neighbors (epsilon), min neighbors (min_points)
- Output: dataframe with cluster id
DBSCAN Walk Through
- Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint("_1", "_2").alias("geometry"))
- Choose values for your hyperparameters, max distance to neighbors (epsilon) and minimum neighbors (min_points). These values will determine how DBSCAN identifies clusters.
epsilon=0.3
min_points=10
- Run DBSCAN on your DataFrame with your chosen hyperparameter values.
clusters_df = dbscan(df, epsilon=epsilon, min_points=min_points, include_outliers=True)
- Analyze the results. For each datapoint, DBSCAN returns the cluster it’s associated with, or indicates that it’s an outlier.
+--------------------+------+-------+
|            geometry|isCore|cluster|
+--------------------+------+-------+
|POINT (1.22185277...| false|      1|
|POINT (0.77885034...| false|      1|
|POINT (-2.2744742...| false|      2|
+--------------------+------+-------+
only showing top 3 rows
There’s a complete example of how to use DBSCAN in the Wherobots user documentation.
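The walkthrough leaves the choice of epsilon to you. One common heuristic, which is not part of GeoStats and is shown here only as an illustration, is to sort each point's distance to its k-th nearest neighbor and look for a sharp jump in the sorted values. The helper name k_distances and the sample points are hypothetical.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

sample_points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
                 (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
                 (10, 10)]
kd = k_distances(sample_points, k=2)
print([round(d, 2) for d in kd])
# [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 7.0]
```

In this toy data the jump from about 0.1 to about 7.0 suggests an epsilon slightly above 0.1. On real data the curve is smoother, and picking the knee is a judgment call.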
DBSCAN Performance Overview
To show DBSCAN performance in Wherobots, we created a European sample of the Overture buildings dataset and ran DBSCAN to identify clusters of buildings near each other, starting from the geographic center of Europe and working outwards. For each subsampled dataset, we ran DBSCAN with an epsilon of 0.005 degrees (roughly 550 meters of latitude) and a min_points value of 4 on a Large runtime in Wherobots Cloud. As seen below, DBSCAN effectively processes an increasing number of records, with 100M records taking 1.6 hrs to process.
Local Outlier Factor (LOF)
LOF is an anomaly detection algorithm that identifies outliers present in a dataset. It does this by measuring how close a given data point is to its k nearest neighbors (with k being a user-chosen hyperparameter) in comparison to how close those neighbors are to their own nearest neighbors. LOF provides a score that represents the degree to which a record is an inlier or outlier.
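The comparison described above can be sketched in a few lines of pure Python. This is a toy O(n^2) illustration of the textbook LOF definition, not the GeoStats implementation. The function name lof_scores is hypothetical, ties at the k-th distance are broken arbitrarily, and the points are assumed to be distinct.

```python
import math

def lof_scores(points, k):
    """Toy O(n^2) Local Outlier Factor over distinct (x, y) tuples."""
    n = len(points)
    # For each point: its k nearest neighbors as (distance, index), nearest first.
    knn = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        knn.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [nbrs[-1][0] for nbrs in knn]

    def local_reachability_density(i):
        # Inverse of the mean reachability distance from i to its neighbors,
        # where reach(i, j) = max(k_dist[j], distance(i, j)).
        total = sum(max(k_dist[j], d) for d, j in knn[i])
        return len(knn[i]) / total

    density = [local_reachability_density(i) for i in range(n)]
    # LOF: mean density of a point's neighbors divided by its own density.
    return [
        sum(density[j] for _, j in knn[i]) / (len(knn[i]) * density[i])
        for i in range(n)
    ]

grid_and_stray = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(grid_and_stray, k=2)
print([round(s, 2) for s in scores])  # [1.0, 1.0, 1.0, 1.0, 6.03]
```

The four grid points are as dense as their neighbors and score 1.0, while the stray point at (5, 5) sits in a much sparser neighborhood and scores about 6.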
How to use LOF in Wherobots
For the full example, please see this docs page.
WherobotsAI GeoStats LOF Python API Overview
For a full walkthrough, see the Python API reference: local_outlier_factor(...).
- Supported Geometries: points, linestrings, polygons
- Hyperparameters: number of nearest neighbors to use
- Output: score representing degree of inlier or outlier
LOF Walk Through
- Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint(f.col("_1"), f.col("_2")).alias("geometry"))
- Choose your k value for how many nearest neighbors you want to use to measure density near a given datapoint.
k=20
- Run LOF on your DataFrame with your chosen k value.
outliers_df = local_outlier_factor(df, k=20)
- Analyze your results. LOF returns a score for each datapoint representing the degree of inlier or outlier.
+--------------------+------------------+
|            geometry|               lof|
+--------------------+------------------+
|POINT (-1.9169927...|0.9991534865548664|
|POINT (-1.7562422...|1.1370318880088373|
|POINT (-2.0107478...|1.1533763384772193|
+--------------------+------------------+
only showing top 3 rows
There’s a complete example of how to use LOF in the Wherobots user documentation.
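As a rough guide to reading these scores: a value near 1 means a point is about as dense as its neighbors, values well above 1 suggest an outlier, and values below 1 suggest a point in a denser-than-average spot. The snippet below applies a cut-off to the three sample scores shown above. The threshold of 1.5 is a hypothetical value chosen for illustration, not a recommendation from this post.

```python
# LOF scores copied from the sample output above.
sample_scores = [0.9991534865548664, 1.1370318880088373, 1.1533763384772193]

# Hypothetical cut-off for this illustration only; tune it for your own data.
threshold = 1.5

flagged = [score for score in sample_scores if score > threshold]
print(flagged)  # []
```

None of the three sample rows exceeds the cut-off, so all three would be treated as inliers under this assumed threshold.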
LOF Performance Overview
We followed the same procedure as with DBSCAN, but ran LOF to score how much each building is an outlier relative to the buildings near it. With each set of buildings, we ran LOF with k=20 on a Large runtime in Wherobots Cloud. As seen below, GeoStats LOF scales effectively with growing data size, with 100M records taking 10 mins to process.
Getis-Ord (Gi*) Overview
Getis-Ord is an algorithm for identifying statistically significant local hot and cold spots.
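For intuition, here is a minimal pure-Python sketch of the standard Gi* z-score with binary distance-band weights, where each point counts as its own neighbor. It is a toy O(n^2) illustration, not the GeoStats implementation. The function name gi_star_z is hypothetical, and the sketch assumes the values are not all equal and that no neighborhood covers the entire dataset.

```python
import math

def gi_star_z(points, values, radius):
    """Toy O(n^2) Getis-Ord Gi* z-scores with binary distance-band weights."""
    n = len(points)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        # Binary weights: 1 for every point within `radius` of point i,
        # including i itself (the "star" variant), and 0 otherwise.
        nbrs = [j for j in range(n) if math.dist(points[i], points[j]) <= radius]
        w_sum = len(nbrs)  # sum of weights; also the sum of squared weights
        local_sum = sum(values[j] for j in nbrs)
        numerator = local_sum - mean * w_sum
        denominator = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

line = [(i, 0) for i in range(10)]
counts = [10, 10, 10, 1, 1, 1, 1, 1, 1, 1]
z_scores = gi_star_z(line, counts, radius=1)
print(round(z_scores[1], 2), round(z_scores[5], 2))  # 3.0 -1.29
```

The point surrounded by high values gets a large positive z-score (a hot spot), while a point surrounded by low values gets a negative one.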
How to use GeoStats Gi*
WherobotsAI GeoStats Gi* Python API Overview
For the full example, please see the docs for g_local(...).
- Supported Geometries: points, linestrings, polygons
- Hyperparameters: star, neighbor weighting
- Output: Set of statistics that indicate the degree of local hot or cold spot for a given record
Gi* Walk Through
- Choose your dataset and create a Sedona DataFrame.
places_df = (
    sedona.table("wherobots_open_data.overture_2024_07_22.places_place")
    .select(f.col("geometry"), f.col("categories"))
    .withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)
- Choose how you’d like to weight datapoints (for example, weighting datapoints in a specific geographic area more heavily, or weighting any datapoint close to a given datapoint more heavily) and set star (a boolean that indicates whether a record is a neighbor of itself).
star = True
neighbor_search_radius_degrees = 1.0
variable_column = "myNumericColumnName"
weighted_dataframe = add_binary_distance_band_column(
    df,
    neighbor_search_radius_degrees,
    include_self=star
)
- Run Gi* on your DataFrame with your chosen hyperparameters.
gi_df = g_local(
    weighted_dataframe,
    variable_column,
    star=star
)
- Analyze your results. For each datapoint, Gi* returns a set of statistics that indicate the degree of local hot or cold spot.
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|num_places|                  G|                  EG|                  VG|                 Z|                   P|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|       871| 0.1397485091609774|0.013219284603421462|5.542296862370928E-5|16.995969941572465|                 0.0|
|       908|0.16097739240211956|0.013219284603421462|5.542296862370928E-5|19.847528249317246|                 0.0|
|       218|0.11812096144582315|0.013219284603421462|5.542296862370928E-5|14.090861243071908|                 0.0|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
only showing top 3 rows
There’s a complete example of how to use Gi* in the Wherobots user documentation.
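The columns in the sample output appear to be related in the usual way for a standardized statistic: Z is the observed G minus its expected value EG, divided by the square root of its variance VG. The check below reproduces Z for the first sample row. The p-value line is an assumption about how a tail probability might be computed, shown only to illustrate why a Z this large is reported with P = 0.0.

```python
import math

# G, EG, and VG from the first row of the sample output above.
g, eg, vg = 0.1397485091609774, 0.013219284603421462, 5.542296862370928e-5

# Standardize the observed statistic by its expected value and variance.
z = (g - eg) / math.sqrt(vg)
print(round(z, 3))  # 16.996

# Upper-tail probability of a standard normal at z, computed as 1 - CDF(z).
# Assumption: the post does not state the exact p-value formula GeoStats uses.
p = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
print(p)  # 0.0, because erf(...) rounds to exactly 1.0 in double precision
```

A large positive Z with a P near zero marks a statistically significant hot spot, and a large negative Z would mark a cold spot.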
Getis-Ord Performance Overview
To showcase how Gi* performs in Wherobots, we again used the same example as for DBSCAN, but ran Gi* on the area of the buildings. With each set of buildings, we ran Gi* with a binary neighbor weight and a neighborhood radius of 0.007 degrees (~0.5 miles) on a Large runtime in Wherobots Cloud. As seen below, the algorithm scales mostly linearly with the number of records, with 100M records taking 1.6 hours to process.
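Because the radius is given in degrees, it can help to convert it to a ground distance. The sketch below uses a spherical-Earth approximation, and the helper name degrees_to_km is hypothetical. It measures the arc along a meridian; a degree of longitude covers less ground away from the equator, shrinking by roughly the cosine of the latitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def degrees_to_km(degrees):
    """Length of an arc of `degrees` along a great circle, such as a meridian."""
    return math.radians(degrees) * EARTH_RADIUS_KM

radius_km = degrees_to_km(0.007)
radius_miles = radius_km / 1.609344  # kilometers per mile
print(round(radius_km, 2), round(radius_miles, 2))  # 0.78 0.48
```

This matches the roughly 0.5 mile radius quoted for the Gi* runs.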
Get started with WherobotsAI GeoStats
The way we implemented these algorithms for large-scale geospatial workloads will help you make sense of your geospatial data faster. You can get started for free today.
- If you haven’t already, create a free Wherobots Organization subscribed to the Community Edition of Wherobots.
- Start a Wherobots Notebook.
- In the Notebook environment, explore the notebook_example/python/wherobots-ai/ folder for examples that you can use to get started.
- Need additional help? Check out our user documentation, and send us a note if needed at support@wherobots.com.
Apache Sedona Users
Apache Sedona users will have access to GeoStats DBSCAN with the Apache Sedona 1.7.0 release. Subscribe to the Sedona newsletter and join the Sedona community to get notified of the release and get started!
What’s next
We’re excited to hear what ML and statistical algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!
The WherobotsAI Team