9 Mins Read

5 Nov 2024

Introducing kNN Join for Wherobots and Apache Sedona

Authors

Feng Zhang

Introducing kNN Join for Wherobots and Apache Sedona

We are excited to introduce the k-Nearest Neighbors Join (kNN Join) for WherobotsDB and Apache Sedona 1.7.0. With the kNN Join, you can efficiently find the closest entities to your points of interest from datasets at scale. We’ve released two types of kNN Joins: the Exact kNN Join and the Approximate kNN Join.

With the Exact kNN Join you get precise results, but the additional precision adds cost and time to the query. The Exact kNN Join is useful when precise object proximity is critical. It is now available in WherobotsDB and will be available in Apache Sedona 1.7.0. The Approximate kNN Join delivers results faster and at a lower cost than the Exact kNN Join, and is tested to be ~3-7% less accurate than the Exact kNN Join for varying data sizes (more info below). The Approximate kNN Join, available now in WherobotsDB, is useful when performance matters more than accuracy, such as testing workloads or when query speed matters more than precision.

Click here to launch this interactive notebook

Launch Notebook

When is the kNN Join useful?

The kNN Join allows you to quickly find a chosen set (k) of nearest entities (objects) to your points of interest (queries).

Typical Use Cases for the Exact kNN Join

The Exact kNN Join will efficiently identify the exact k closest objects to queries of interest. This makes it a highly useful algorithm when precise object proximity is crucial. Common applications of the Exact kNN Join at scale include:

Reverse geocoding: Ride share services can use the Exact kNN Join to map high quantities of user tracking coordinates to street address for smooth rider pickups.
Identifying nearest transit options: Map services can use the Exact kNN Join to accurately locate (or simulate) the closest public transit options for many people simultaneously.
Analyzing the distribution of public services: City planners can use the Exact kNN Join to identify geographic areas without public services nearby to inform future development plans.
Efficient freight routing: Shipping and cargo trucks can use the Exact kNN Join to locate exact locations of nearest refueling stations, ramps, marine life, etc. to create efficient routes and reduce time and cost.

Typical Use Cases for the Approximate kNN Join

The Approximate kNN Join trades off accuracy for higher query speed. It will find a set of k close neighbors (but not necessarily the closest neighbors) to your query set. Some common applications of the Approximate kNN Join include:

Local search: Map services use Approximate kNN Join to help users find places of interest within a certain distance of their geolocation.
Image search: Geospatial modeling teams leverage Approximate kNN to find images for geographic points of interest to augment available data for model training.

More importantly, the Approximate kNN Join unlocks kNN to be used for any compute or time constrained workloads, including:

Testing: Teams developing geospatial workloads leverage the speed of the Approximate kNN Join to quickly test and iterate their pipelines.
Meeting latency requirements: Teams use Approximate kNN Join to meet latency requirements and serve identified nearest neighbors at scale in production.

Hurdles of self-hosting kNN join algorithms

Before, teams leveraging the kNN algorithm for geospatial workloads would have to:

Determine how to host and scale general-purpose open source kNN solutions, including managing the curse of dimensionality, manage computational complexity that increases with data size, and preprocessing outlier data.
Limit the frequency of executing kNN join workloads due to high time and compute costs.

Benefits of the kNN Join

Now with the kNN Join in Wherobots or Apache Sedona, you can:

Leverage geospatially optimized kNN joins that scale efficiently with increasing data sizes without the overhead of building and maintaining your own solution.
Iterate at a higher pace, by leveraging the Approximate kNN Join to trade-off performance versus accuracy. And, as your workload requires, run the Exact kNN Join to guarantee accurate neighbors for points of interest.

We’ll walk through a brief overview of the kNN Join, share how each perform at various scale, and how to use it.

How it Works: Diving Deeper on the kNN Join

When used with geospatial data, the kNN Join identifies the k-nearest neighbors for a given spatial point or region based on geographic proximity. It involves two datasets: a query dataset containing points of interest, and an objects dataset, containing the entities you’re trying to locate near those points of interest. For each record in the queries dataset, the kNN Join finds the resulting set (k) of nearest neighbors from the objects dataset based on a user-defined distance metric and k value.

For example, let’s say you’re interested in locating the three closest coffee shops to specific public transit stops in your neighborhood. The query dataset would be gps coordinates of the public transit stops. The objects dataset would be gps coordinates of all coffee shops in your neighborhood. As we’re looking for the three closest coffee shops, the k value is 3. We’ll say that our distance metric is three blocks. Using this as input, the kNN Join will conduct a search of the objects and return three coffee shops that are within three blocks of each of the public transit stops in your neighborhood.

An illustration of kNN join. Using the example above, the red dots are the query dataset containing gps coordinates of the public transit stops. The green dots are the objects dataset containing gps coordinates of all coffee shops in your area. We want the 3-nearest coffee shops (k=3) to these public transit stops. Using kNN join, we find the closest coffee shops to the public transit stops. The query dataset and only their closest found neighbors are shown in the df_joined.

kNN Join SQL API Overview

We’ve made it easy to integrate the kNN Join into your workflows. Here is a quick overview of the kNN Join Spatial SQL APIs. For an in-depth look, see our docs here.

Exact kNN Join

	Supported Geometries	Hyperparameters	Output	SQL API
Exact kNN Join	points, geometries	number of nearest neighbors to identify (k)	dataframe with k nearest neighbors per query ranked by distance	`ST_KNN(...)`

Exact_kNN_Join_Results = ST_KNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)

Approximate kNN Join

	Supported Geometries	Hyperparameters	Output	SQL API
Approximate kNN Join	points, geometries	number of nearest neighbors to identify (k)	dataframe with k nearest neighbors per query ranked by distance	`ST_AKNN(...)`

Approximate_kNN_Join_Results = ST_AKNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)

Performance Benchmarks

We’ll compare how the Exact and Approximate kNN Joins perform vs PostGIS, and how the Approximate kNN Join performs against the Exact kNN Join.

kNN Join vs PostGIS

Let’s look at some performance benchmarks for the Exact and Approximate kNN Joins compared to PostGIS’s exact kNN join algorithm. The following benchmarks are based on measured wall clock time for each join on varying data sizes. Each data size is defined by the query data size and the object data size, so a query set of 1M records and a object set of 12M records is described as 1M x 12M in the graph below. To compare scale and performance, we avoid compute comparisons across options by normalizing on cost per hour for each join.

As seen below, the Approximate and Exact kNN Joins scale better than PostGIS with increasing dataset sizes. A key limitation of PostGIS’s exact kNN join is its inability to scale horizontally across machines, which is why performance with large scale exact kNN joins is better with Wherobots. Meanwhile, the Exact and Approximate kNN Joins distribute the join horizontally across workers and process larger datasets more efficiently. Specifically, the Approximate kNN Join is more performant and efficient at cost per workload than PostGIS and Exact kNN Join. Meanwhile, the Exact kNN Join is more effective for larger workloads than PostGIS, with PostGIS unable to complete the join for the 10M x 1B data size.

Approximate vs Exact KNN Join Benchmarks

Let’s look at some performance benchmarks to help you choose which kNN Join is right for a given workload. We’ve chosen a variety of metrics to evaluate the accuracy of the Approximate kNN Join.

Accuracy: Measures the percentage of the Approximate kNN Join results that are the closest objects to queries.
Jaccard Index: Measures the overlap between the Approximate kNN Join results and the Exact kNN Join results.

Let’s take a look at wall clock time for the Approximate kNN Join versus Exact kNN Join for increasing data sizes.

Let’s take a look at accuracy and Jaccard scores for the Approximate kNN Join for increasing data sizes.

As seen above, the Approximate kNN Join will run faster than the Exact kNN Join for datasets of varying size, is highly accurate, and identifies highly similar sets of nearest neighbors as the Exact KNN Join, even for increasing dataset sizes. We recommend leveraging the Approximate kNN Join whenever you’re workloads allow for less precision, are cost constrained, or require high performance.

Tutorial

The following example details how to apply the Exact and Approximate kNN Joins on weather events data to find which flights are closest to each weather event, which can be crucial for making real-time decisions in air traffic management and ensuring flight safety.

We will load this weather data from the Wherobots Spatial Catalog. The Spatial Catalog includes a collection of Wherobots maintained open datasets from various data sources like Overture Maps, LandSAT, Wild Fires, New York Taxi Data, and more. These datasets are optimized for fast and efficient analytics with WherobotsDB.

The queries table contains the locations of weather events, such as storms or turbulence, while the objects table contains flight locations. Visit the user documentation for a full walk through of both kNN joins.

1. Preparing the Weather Events Data as Queries DataFrame

First, we load the weather events data from the Spatial Data Catalog. Then, we assign a monotonically increasing ID to each row in the weather events, and load it to a temporary view so we can use in the SQL.

df_queries = sedona.table("wherobots_pro_data.weather.weather_events")
df_queries = df_queries.withColumn("id", monotonically_increasing_id())
df_queries.createOrReplaceTempView("weather")

df_queries.select("id", "geometry", "type").show(20, False)

The weather queries events DataFrame looks like this:

+-------+-------------------------+----+
|id     |geometry                 |type|
+-------+-------------------------+----+
|2716677|POINT (-110.974 41.8191) |Snow|
|2656934|POINT (-109.4604 41.5323)|Snow|
|2653971|POINT (-109.0528 41.5945)|Rain|
|2519252|POINT (-105.0333 42.45)  |Rain|
|2480600|POINT (-105.5419 44.3394)|Snow|
+-------+-------------------------+----+
only showing top 5 rows

2. Preparing the Flights Data as Objects DataFrame

We are using the flights tracking data from ABS-B flight observation data. This data originated from ADSB.lol, which provides publicly available daily updates of crowdsourced flight tracking data.

df_objects = sedona.read.format("geoparquet").load("s3a://wherobots-examples/data/examples/flights/2024_s2.parquet")
df_objects.createOrReplaceTempView("flights")

The flights objects DataFrame looks like this:

+---------------------------+-----------------------------+-------------------+
|desc                       |geometry                     |timestamp          |
+---------------------------+-----------------------------+-------------------+
|BOEING 737-800             |POINT (-110.232108 43.929066)|2024-01-11 00:00:00|
|CESSNA 240 CORVALIS TTX    |POINT (-110.852247 43.765574)|2024-01-01 00:00:00|
|CESSNA 172 SKYHAWK         |POINT (-108.790218 44.801926)|2024-01-23 00:00:00|
|CESSNA 510 CITATION MUSTANG|POINT (-110.795842 43.518448)|2024-01-26 00:00:00|
|BOEING 737-800             |POINT (-110.873321 43.079086)|2024-01-18 00:00:00|
+---------------------------+-----------------------------+-------------------+
only showing top 5 rows

3. Run the kNN Join

Once the two datasets are loaded, we can use the following SQL to run the kNN Joins.

The query below runs the Exact kNN Join with the number of neighbors set to 4.

df_knn_join = sedona.sql("""
SELECT
    WEATHER.GEOMETRY AS WEATHER_GEOM,
    FLIGHTS.GEOMETRY AS FLIGHTS_GEOM
FROM WEATHER
JOIN FLIGHTS ON ST_KNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
""")

You can also run the Approximate kNN Join with number of neighbors set to 4, and add UDFs to the result DataFrame as following.

df_knn_join = sedona.sql("""
SELECT
        WEATHER.GEOMETRY AS WEATHER_GEOM,
        [WEATHER.ID](http://weather.id/) AS QID,
        FLIGHTS.GEOMETRY AS FLIGHTS_GEOM,
        ST_DISTANCESPHERE(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY) AS DISTANCE,
        ST_MAKELINE(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY) AS LINE
FROM WEATHER
JOIN FLIGHTS ON ST_AKNN(WEATHER.GEOMETRY, FLIGHTS.GEOMETRY, 4, FALSE)
""")

The joined DataFrame looks like the following for a single query point.

+-------+-------------------------+-----------------------------+-----------------+
|QID    |QUERIES_GEOM             |OBJECTS_GEOM                 |DISTANCE         |
+-------+-------------------------+-----------------------------+-----------------+
|2804966|POINT (-110.4211 44.5444)|POINT (-110.425023 44.544205)|311.6515627672718|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.416483 44.546535)|436.1578754802839|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.428809 44.546302)|646.4967456258582|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.415679 44.550445)|797.7249421529948|
+-------+-------------------------+-----------------------------+-----------------+
only showing top 4 rows+-------+-------------------------+-----------------------------+-----------------+
|QID    |QUERIES_GEOM             |OBJECTS_GEOM                 |DISTANCE         |
+-------+-------------------------+-----------------------------+-----------------+
|2804966|POINT (-110.4211 44.5444)|POINT (-110.425023 44.544205)|311.6515627672718|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.416483 44.546535)|436.1578754802839|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.428809 44.546302)|646.4967456258582|
|2804966|POINT (-110.4211 44.5444)|POINT (-110.415679 44.550445)|797.7249421529948|
+-------+-------------------------+-----------------------------+-----------------+
only showing top 4 rows

4. Visualize the kNN Join Results

We can also visualize the Exact kNN Join results using SedonaKepler by adding the joined DataFrame to a map.

# create map for the results
map_view = SedonaKepler.create_map(df_unique_qid.select('QUERIES_GEOM'), name="WEATHER EVENTS")
SedonaKepler.add_df(map_view, df=df_related_rows.select('OBJECTS_GEOM', 'DISTANCE').withColumnRenamed("OBJECTS_GEOM", "geometry"), name="FLIGHTS")
SedonaKepler.add_df(map_view, df=df_related_rows.select('LINE', 'DISTANCE').withColumnRenamed("LINE", "geometry"), name="KNN LINES")

# show the map
map_view

Below we see a portion of the map we generated above. In this portion of the map, we see the results of the Exact kNN Join. Given our distance requirement between a weather event and flight (the lines), for each of the weather events from our query database (blue dots), we see the 4 closest flights to each weather event (yellow dots) .

Get started with the Wherobots kNN Join

Wherobots Users

All Wherobots users have access to the Exact and Approximate kNN Joins.

If you haven’t already, create a free Wherobots organization subscribed to the Community Edition of Wherobots.
Start a Wherobots Notebook
In the Notebook environment, explore the python/wherobots-db/wherobots-db-knn-joins.ipynb example that you can use to get started.
Need additional help? Check out our user documentation, and send us a note if needed at support@wherobots.com.

Apache Sedona Users

We’re releasing the Exact kNN Join in Apache Sedona 1.7.0. Subscribe to the Sedona newsletter and join the Sedona community to get notified of the releases and get started!

What’s next

We’re excited to hear what algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!

Start using Wherobots for free

Get Started

Introducing kNN Join for Wherobots and Apache Sedona

Introducing kNN Join for Wherobots and Apache Sedona

When is the kNN Join useful?

Typical Use Cases for the Exact kNN Join

Typical Use Cases for the Approximate kNN Join

Hurdles of self-hosting kNN join algorithms

Benefits of the kNN Join

How it Works: Diving Deeper on the kNN Join

kNN Join SQL API Overview

Exact kNN Join

Approximate kNN Join

Performance Benchmarks

kNN Join vs PostGIS

Approximate vs Exact KNN Join Benchmarks

Tutorial

Get started with the Wherobots kNN Join

Wherobots Users

Apache Sedona Users

What’s next

RELATED POSTS