Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data. Learn more.
Authors
We are excited to introduce the k-Nearest Neighbors Join (kNN Join) for WherobotsDB and Apache Sedona 1.7.0. With the kNN Join, you can efficiently find the closest entities to your points of interest from datasets at scale. We’ve released two types of kNN Joins: the Exact kNN Join and the Approximate kNN Join.
With the Exact kNN Join you get precise results, but the additional precision adds cost and time to the query. The Exact kNN Join is useful when precise object proximity is critical. It is now available in WherobotsDB and will be available in Apache Sedona 1.7.0. The Approximate kNN Join delivers results faster and at a lower cost than the Exact kNN Join, and is tested to be ~3-7% less accurate than the Exact kNN Join for varying data sizes (more info below). The Approximate kNN Join, available now in WherobotsDB, is useful when performance matters more than accuracy, such as testing workloads or when query speed matters more than precision.
The kNN Join allows you to quickly find a chosen set (k) of nearest entities (objects) to your points of interest (queries).
The Exact kNN Join will efficiently identify the exact k closest objects to queries of interest. This makes it a highly useful algorithm when precise object proximity is crucial. Common applications of the Exact kNN Join at scale include:
The Approximate kNN Join trades off accuracy for higher query speed. It will find a set of k close neighbors (but not necessarily the closest neighbors) to your query set. Some common applications of the Approximate kNN Join include:
More importantly, the Approximate kNN Join unlocks kNN to be used for any compute or time constrained workloads, including:
Before, teams leveraging the kNN algorithm for geospatial workloads would have to:
Now with the kNN Join in Wherobots or Apache Sedona, you can:
We’ll walk through a brief overview of the kNN Join, share how each perform at various scale, and how to use it.
When used with geospatial data, the kNN Join identifies the k-nearest neighbors for a given spatial point or region based on geographic proximity. It involves two datasets: a query dataset containing points of interest, and an objects dataset, containing the entities you’re trying to locate near those points of interest. For each record in the queries dataset, the kNN Join finds the resulting set (k) of nearest neighbors from the objects dataset based on a user-defined distance metric and k value.
For example, let’s say you’re interested in locating the three closest coffee shops to specific public transit stops in your neighborhood. The query dataset would be gps coordinates of the public transit stops. The objects dataset would be gps coordinates of all coffee shops in your neighborhood. As we’re looking for the three closest coffee shops, the k value is 3. We’ll say that our distance metric is three blocks. Using this as input, the kNN Join will conduct a search of the objects and return three coffee shops that are within three blocks of each of the public transit stops in your neighborhood.An illustration of kNN join. Using the example above, the red dots are the query dataset containing gps coordinates of the public transit stops. The green dots are the objects dataset containing gps coordinates of all coffee shops in your area. We want the 3-nearest coffee shops (k=3) to these public transit stops. Using kNN join, we find the closest coffee shops to the public transit stops. The query dataset and only their closest found neighbors are shown in the df_joined.
We’ve made it easy to integrate the kNN Join into your workflows. Here is a quick overview of the kNN Join Spatial SQL APIs. For an in-depth look, see our docs here.
ST_KNN(...)
Exact_kNN_Join_Results = ST_KNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
ST_AKNN(...)
Approximate_kNN_Join_Results = ST_AKNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
We’ll compare how the Exact and Approximate kNN Joins perform vs PostGIS, and how the Approximate kNN Join performs against the Exact kNN Join.
Let’s look at some performance benchmarks for the Exact and Approximate kNN Joins compared to PostGIS’s exact kNN join algorithm. The following benchmarks are based on measured wall clock time for each join on varying data sizes. Each data size is defined by the query data size and the object data size, so a query set of 1M records and a object set of 12M records is described as 1M x 12M in the graph below. To compare scale and performance, we avoid compute comparisons across options by normalizing on cost per hour for each join.
As seen below, the Approximate and Exact kNN Joins scale better than PostGIS with increasing dataset sizes. A key limitation of PostGIS’s exact kNN join is its inability to scale horizontally across machines, which is why performance with large scale exact kNN joins is better with Wherobots. Meanwhile, the Exact and Approximate kNN Joins distribute the join horizontally across workers and process larger datasets more efficiently. Specifically, the Approximate kNN Join is more performant and efficient at cost per workload than PostGIS and Exact kNN Join. Meanwhile, the Exact kNN Join is more effective for larger workloads than PostGIS, with PostGIS unable to complete the join for the 10M x 1B data size.
Let’s look at some performance benchmarks to help you choose which kNN Join is right for a given workload. We’ve chosen a variety of metrics to evaluate the accuracy of the Approximate kNN Join.
Let’s take a look at wall clock time for the Approximate kNN Join versus Exact kNN Join for increasing data sizes.
Let’s take a look at accuracy and Jaccard scores for the Approximate kNN Join for increasing data sizes.
As seen above, the Approximate kNN Join will run faster than the Exact kNN Join for datasets of varying size, is highly accurate, and identifies highly similar sets of nearest neighbors as the Exact KNN Join, even for increasing dataset sizes. We recommend leveraging the Approximate kNN Join whenever you’re workloads allow for less precision, are cost constrained, or require high performance.
The following example details how to apply the Exact and Approximate kNN Joins on weather events data to find which flights are closest to each weather event, which can be crucial for making real-time decisions in air traffic management and ensuring flight safety.
We will load this weather data from the Wherobots Spatial Catalog. The Spatial Catalog includes a collection of Wherobots maintained open datasets from various data sources like Overture Maps, LandSAT, Wild Fires, New York Taxi Data, and more. These datasets are optimized for fast and efficient analytics with WherobotsDB.
The queries table contains the locations of weather events, such as storms or turbulence, while the objects table contains flight locations. Visit the user documentation for a full walk through of both kNN joins.
1. Preparing the Weather Events Data as Queries DataFrame
First, we load the weather events data from the Spatial Data Catalog. Then, we assign a monotonically increasing ID to each row in the weather events, and load it to a temporary view so we can use in the SQL.
df_queries = sedona.table("wherobots_pro_data.weather.weather_events") df_queries = df_queries.withColumn("id", monotonically_increasing_id()) df_queries.createOrReplaceTempView("weather") df_queries.select("id", "geometry", "type").show(20, False)
The weather queries events DataFrame looks like this:
+-------+-------------------------+----+ |id |geometry |type| +-------+-------------------------+----+ |2716677|POINT (-110.974 41.8191) |Snow| |2656934|POINT (-109.4604 41.5323)|Snow| |2653971|POINT (-109.0528 41.5945)|Rain| |2519252|POINT (-105.0333 42.45) |Rain| |2480600|POINT (-105.5419 44.3394)|Snow| +-------+-------------------------+----+ only showing top 5 rows
2. Preparing the Flights Data as Objects DataFrame
We are using the flights tracking data from ABS-B flight observation data. This data originated from ADSB.lol, which provides publicly available daily updates of crowdsourced flight tracking data.
df_objects = sedona.read.format("geoparquet").load("s3a://wherobots-examples/data/examples/flights/2024_s2.parquet") df_objects.createOrReplaceTempView("flights")
The flights objects DataFrame looks like this:
+---------------------------+-----------------------------+-------------------+ |desc |geometry |timestamp | +---------------------------+-----------------------------+-------------------+ |BOEING 737-800 |POINT (-110.232108 43.929066)|2024-01-11 00:00:00| |CESSNA 240 CORVALIS TTX |POINT (-110.852247 43.765574)|2024-01-01 00:00:00| |CESSNA 172 SKYHAWK |POINT (-108.790218 44.801926)|2024-01-23 00:00:00| |CESSNA 510 CITATION MUSTANG|POINT (-110.795842 43.518448)|2024-01-26 00:00:00| |BOEING 737-800 |POINT (-110.873321 43.079086)|2024-01-18 00:00:00| +---------------------------+-----------------------------+-------------------+ only showing top 5 rows
3. Run the kNN Join
Once the two datasets are loaded, we can use the following SQL to run the kNN Joins.
The query below runs the Exact kNN Join with the number of neighbors set to 4.
df_knn_join = sedona.sql(""" SELECT WEATHER.GEOMETRY AS WEATHER_GEOM, FLIGHTS.GEOMETRY AS FLIGHTS_GEOM FROM WEATHER JOIN FLIGHTS ON ST_KNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE) """)
You can also run the Approximate kNN Join with number of neighbors set to 4, and add UDFs to the result DataFrame as following.
df_knn_join = sedona.sql(""" SELECT WEATHER.GEOMETRY AS WEATHER_GEOM, [WEATHER.ID](http://weather.id/) AS QID, FLIGHTS.GEOMETRY AS FLIGHTS_GEOM, ST_DISTANCESPHERE(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY) AS DISTANCE, ST_MAKELINE(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY) AS LINE FROM WEATHER JOIN FLIGHTS ON ST_AKNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE) """)
The joined DataFrame looks like the following for a single query point.
+-------+-------------------------+-----------------------------+-----------------+ |QID |QUERIES_GEOM |OBJECTS_GEOM |DISTANCE | +-------+-------------------------+-----------------------------+-----------------+ |2804966|POINT (-110.4211 44.5444)|POINT (-110.425023 44.544205)|311.6515627672718| |2804966|POINT (-110.4211 44.5444)|POINT (-110.416483 44.546535)|436.1578754802839| |2804966|POINT (-110.4211 44.5444)|POINT (-110.428809 44.546302)|646.4967456258582| |2804966|POINT (-110.4211 44.5444)|POINT (-110.415679 44.550445)|797.7249421529948| +-------+-------------------------+-----------------------------+-----------------+ only showing top 4 rows+-------+-------------------------+-----------------------------+-----------------+ |QID |QUERIES_GEOM |OBJECTS_GEOM |DISTANCE | +-------+-------------------------+-----------------------------+-----------------+ |2804966|POINT (-110.4211 44.5444)|POINT (-110.425023 44.544205)|311.6515627672718| |2804966|POINT (-110.4211 44.5444)|POINT (-110.416483 44.546535)|436.1578754802839| |2804966|POINT (-110.4211 44.5444)|POINT (-110.428809 44.546302)|646.4967456258582| |2804966|POINT (-110.4211 44.5444)|POINT (-110.415679 44.550445)|797.7249421529948| +-------+-------------------------+-----------------------------+-----------------+ only showing top 4 rows
4. Visualize the kNN Join Results
We can also visualize the Exact kNN Join results using SedonaKepler by adding the joined DataFrame to a map.
# create map for the results map_view = SedonaKepler.create_map(df_unique_qid.select('QUERIES_GEOM'), name="WEATHER EVENTS") SedonaKepler.add_df(map_view, df=df_related_rows.select('OBJECTS_GEOM', 'DISTANCE').withColumnRenamed("OBJECTS_GEOM", "geometry"), name="FLIGHTS") SedonaKepler.add_df(map_view, df=df_related_rows.select('LINE', 'DISTANCE').withColumnRenamed("LINE", "geometry"), name="KNN LINES") # show the map map_view
Below we see a portion of the map we generated above. In this portion of the map, we see the results of the Exact kNN Join. Given our distance requirement between a weather event and flight (the lines), for each of the weather events from our query database (blue dots), we see the 4 closest flights to each weather event (yellow dots) .
All Wherobots users have access to the Exact and Approximate kNN Joins.
python/wherobots-db/wherobots-db-knn-joins.ipynb
We’re releasing the Exact kNN Join in Apache Sedona 1.7.0. Subscribe to the Sedona newsletter and join the Sedona community to get notified of the releases and get started!
We’re excited to hear what algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!
Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence
RasterFlow takes insights and embeddings from satellite and overhead imagery datasets into Apache Iceberg tables, with ease and efficiency at any scale.
Wherobots brought modern infrastructure to spatial data in 2025
We’re bridging the gap between AI and data from the physical world in 2026
The Medallion Architecture for Geospatial Data: Why Spatial Intelligence Demands a Different Approach
When most data engineers hear “medallion architecture,” they think of the traditional multi-hop layering pattern that powers countless analytics pipelines. The concept is sound: progressively refine raw data into analytical data and products. But geospatial data breaks conventional data engineering in ways that demand we rethink the entire pipeline. This isn’t about just storing location […]
share this article
Awesome that you’d like to share our articles. Where would you like to share it to: