Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data. Learn more.
Authors
When scaling spatial proximity analysis from city to state to national level, the hidden challenge isn’t computational power—it’s spatial density. The techniques that work perfectly for urban neighborhoods fail dramatically when applied across heterogeneous landscapes.
The standard professional approach—using ST_DWithin with a fixed search radius—breaks down when spatial density varies. A 500-meter radius might capture 20 candidate features in Manhattan but zero in rural Wyoming. No single distance works for both.
ST_DWithin
This article demonstrates how k-nearest neighbors (ST_KNN) solves this problem. Unlike fixed-radius predicates that yield density-dependent result sets, KNN applies a top-k constraint—guaranteeing bounded cardinality regardless of local feature distribution. No distance threshold tuning required.
ST_KNN
To validate this approach, we ran a buildings-to-roads proximity analysis across five US states on Wherobots Cloud: 44.4 million buildings against 535,000 road segments in 2.3 hours for $157—less than half a cent per geometry. The technique applies equally to any spatial proximity problem: customers to stores, facilities to services, properties to amenities.
For any GIS professional, the standard approach to spatial proximity analysis is `ST_DWithin` with a fixed radius:
sedona.sql(''' SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance FROM query_geometries a JOIN target_geometries b ON ST_DWithin(a.geometry, b.geometry, 500, true) -- Fixed 500m radius ORDER BY distance''')
This works beautifully for spatially homogeneous regions. But scale it to a state or nation, and you hit the spatial density problem—where non-uniform feature distribution causes query behavior to become unpredictable.
Consider the same ST_DWithin(geometry, 500m) query across different spatial contexts:
ST_DWithin(geometry, 500m)
The same query produces wildly different result set cardinalities based on local feature density.
Attempting to solve this with radius adjustment creates new problems:
No single radius value works for heterogeneous spatial data: increase it to capture sparse regions, and dense regions become intractable with combinatorial explosion in candidate counts.
This isn’t a theoretical problem—it’s the practical limitation that prevents reliable large-scale spatial analysis using traditional methods.
K-nearest neighbors elegantly solves the spatial density problem by enforcing a cardinality constraint rather than a distance constraint—always returning exactly k candidates regardless of local feature density.
A useful mental model: ST_DWithin applies a distance predicate (fixed radius, variable cardinality), while ST_KNN applies a rank-based predicate (fixed cardinality, variable distance). This produces the effect of a density-adaptive search area—though KNN doesn’t compute any radius internally.
sedona.sql(''' SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance FROM query_geometries a JOIN target_geometries b ON ST_KNN(a.geometry, b.geometry, 10) -- Always 10 candidates ORDER BY distance''')
The same query now produces consistent results across all spatial densities:
KNN doesn’t compute or adjust a radius. It simply finds the k nearest neighbors, wherever they are. In dense areas, those neighbors happen to be close; in sparse areas, they’re farther away. The “adaptive” behavior emerges naturally from asking “what’s nearest?” rather than “what’s within X meters?”
The Key Insight: ST_DWithin asks “What’s within X meters?” (fixed radius, variable candidates). ST_KNN asks “What are the K nearest?” (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required. Note: This “adaptive” behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It’s distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.
The Key Insight: ST_DWithin asks “What’s within X meters?” (fixed radius, variable candidates). ST_KNN asks “What are the K nearest?” (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required.
Note: This “adaptive” behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It’s distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.
KNN computes distances using geometry centroids (or bounding box representatives) for computational efficiency. For complex geometries like linestrings or large polygons, the centroid-to-centroid distance may diverge significantly from the true minimum Hausdorff distance.
We address this with a two-stage refinement approach:
ST_ClosestPoint
ST_DistanceSpheroid
This decomposition gives us both fast candidate pruning and geometrically precise results.
Here’s the complete two-stage pattern using Wherobots SQL:
Stage 1 – Candidate Generation:
knn_df = sedona.sql(''' SELECT query.id AS query_id, query.geometry AS query_geometry, target.id AS target_id, target.geometry AS target_geometry, -- Exact closest point calculation ST_ClosestPoint(target.geometry, query.geometry) AS closest_point, -- Precise spheroidal distance ST_DistanceSpheroid( ST_ClosestPoint(target.geometry, query.geometry), query.geometry ) AS distance_meters FROM query_geometries AS query JOIN target_geometries AS target ON ST_AKNN(query.geometry, target.geometry, 10, false)''') knn_df.writeTo('wherobots.pranav.knn_candidates')
Stage 2 – Exact refinement:
sedona.sql(''' ranked AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY query_id ORDER BY distance_meters ASC ) AS rank FROM wherobots.pranav.knn_candidates ) SELECT * FROM ranked WHERE rank = 1;''')
Key functions:
ST_AKNN(..., 10, false)
ST_AKNN
ROW_NUMBER()
The pattern works for all large non-point geometries.
We validated this approach using a buildings-to-roads proximity analysis across five US states. Each state contains a mix of dense urban, suburban, and sparse rural areas—exactly the heterogeneous density that breaks traditional methods.
For the two-stage pattern described in this article, use ST_AKNN (approximate) rather than ST_KNN (exact):
Since we’re computing exact distances with ST_DistanceSpheroid on the candidates anyway, the approximate nature of ST_AKNN has no impact on final accuracy—only on speed.
If you know the maximum acceptable distance for your use case, add a distance bound parameter to further optimize performance:
-- Only consider candidates within 5000 metersST_AKNN(query.geometry, target.geometry, 10, TRUE, 5000)
Unlike the conceptual “adaptive search area” discussed earlier, this is an actual distance predicate pushed down to the spatial partitioning stage—not applied as post-hoc filtering. Candidates beyond this threshold are pruned during index traversal, reducing I/O and computation.This is useful when:
The adaptive search radius pattern applies to any spatial proximity problem with heterogeneous density:
f(A, B)
Large-scale spatial analysis requires solving the spatial density problem that causes traditional distance predicates to fail. KNN provides the solution: a cardinality-bounded query operator that delivers consistent result sets regardless of local feature distribution, with predictable computational complexity.
At $0.0035 per geometry, spatial proximity analysis at any scale becomes economically trivial. The technique that processed 44 million buildings in 2.3 hours works equally well for customer analytics, infrastructure planning, or environmental assessment.
KNN’s density-invariant behavior isn’t just a performance optimization—it’s what makes heterogeneous spatial analysis reliable.
Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence
RasterFlow takes insights and embeddings from satellite and overhead imagery datasets into Apache Iceberg tables, with ease and efficiency at any scale.
Mobility Data Processing at Scale: Why Traditional Spatial Systems Break Down
A Wherobots Solution Accelerator for GPS Mobility Analytics — Part 1 of 2
PostGIS vs Wherobots: What It Actually Costs You to Choose Wrong
When building a geospatial platform, technical decisions are never just technical, they are financial. Choosing the wrong architecture for your spatial data doesn’t just frustrate your data team; it directly impacts your bottom line through large cloud infrastructure bills and, perhaps more dangerously, delayed business insights. For decision-makers, the choice between a traditional spatial database […]
Streaming Spatial Data into Wherobots with Spark Structured Streaming
Real-time Spatial Pipelines Shouldn’t Be This Hard (But They Were) I’ve been doing geospatial work for over twenty years now. I’ve hand-rolled ETL pipelines, babysat cron jobs, and debugged more coordinate system mismatches than a person should reasonably endure in one lifetime. So when someone says “streaming spatial data,” my first reaction used to be […]
share this article
Awesome that you’d like to share our articles. Where would you like to share it to: