Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data. Learn more.
Authors
When scaling spatial proximity analysis from city to state to national level, the hidden challenge isn’t computational power—it’s spatial density. The techniques that work perfectly for urban neighborhoods fail dramatically when applied across heterogeneous landscapes.
The standard professional approach—using ST_DWithin with a fixed search radius—breaks down when spatial density varies. A 500-meter radius might capture 20 candidate features in Manhattan but zero in rural Wyoming. No single distance works for both.
ST_DWithin
This article demonstrates how k-nearest neighbors (ST_KNN) solves this problem. Unlike fixed-radius predicates that yield density-dependent result sets, KNN applies a top-k constraint—guaranteeing bounded cardinality regardless of local feature distribution. No distance threshold tuning required.
ST_KNN
To validate this approach, we ran a buildings-to-roads proximity analysis across five US states on Wherobots Cloud: 44.4 million buildings against 535,000 road segments in 2.3 hours for $157—less than half a cent per geometry. The technique applies equally to any spatial proximity problem: customers to stores, facilities to services, properties to amenities.
For any GIS professional, the standard approach to spatial proximity analysis is `ST_DWithin` with a fixed radius:
sedona.sql(''' SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance FROM query_geometries a JOIN target_geometries b ON ST_DWithin(a.geometry, b.geometry, 500, true) -- Fixed 500m radius ORDER BY distance''')
This works beautifully for spatially homogeneous regions. But scale it to a state or nation, and you hit the spatial density problem—where non-uniform feature distribution causes query behavior to become unpredictable.
Consider the same ST_DWithin(geometry, 500m) query across different spatial contexts:
ST_DWithin(geometry, 500m)
The same query produces wildly different result set cardinalities based on local feature density.
Attempting to solve this with radius adjustment creates new problems:
No single radius value works for heterogeneous spatial data: increase it to capture sparse regions, and dense regions become intractable with combinatorial explosion in candidate counts.
This isn’t a theoretical problem—it’s the practical limitation that prevents reliable large-scale spatial analysis using traditional methods.
K-nearest neighbors elegantly solves the spatial density problem by enforcing a cardinality constraint rather than a distance constraint—always returning exactly k candidates regardless of local feature density.
A useful mental model: ST_DWithin applies a distance predicate (fixed radius, variable cardinality), while ST_KNN applies a rank-based predicate (fixed cardinality, variable distance). This produces the effect of a density-adaptive search area—though KNN doesn’t compute any radius internally.
sedona.sql(''' SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance FROM query_geometries a JOIN target_geometries b ON ST_KNN(a.geometry, b.geometry, 10) -- Always 10 candidates ORDER BY distance''')
The same query now produces consistent results across all spatial densities:
KNN doesn’t compute or adjust a radius. It simply finds the k nearest neighbors, wherever they are. In dense areas, those neighbors happen to be close; in sparse areas, they’re farther away. The “adaptive” behavior emerges naturally from asking “what’s nearest?” rather than “what’s within X meters?”
The Key Insight: ST_DWithin asks “What’s within X meters?” (fixed radius, variable candidates). ST_KNN asks “What are the K nearest?” (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required. Note: This “adaptive” behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It’s distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.
The Key Insight: ST_DWithin asks “What’s within X meters?” (fixed radius, variable candidates). ST_KNN asks “What are the K nearest?” (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required.
Note: This “adaptive” behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It’s distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.
KNN computes distances using geometry centroids (or bounding box representatives) for computational efficiency. For complex geometries like linestrings or large polygons, the centroid-to-centroid distance may diverge significantly from the true minimum Hausdorff distance.
We address this with a two-stage refinement approach:
ST_ClosestPoint
ST_DistanceSpheroid
This decomposition gives us both fast candidate pruning and geometrically precise results.
Here’s the complete two-stage pattern using Wherobots SQL:
Stage 1 – Candidate Generation:
knn_df = sedona.sql(''' SELECT query.id AS query_id, query.geometry AS query_geometry, target.id AS target_id, target.geometry AS target_geometry, -- Exact closest point calculation ST_ClosestPoint(target.geometry, query.geometry) AS closest_point, -- Precise spheroidal distance ST_DistanceSpheroid( ST_ClosestPoint(target.geometry, query.geometry), query.geometry ) AS distance_meters FROM query_geometries AS query JOIN target_geometries AS target ON ST_AKNN(query.geometry, target.geometry, 10, false)''') knn_df.writeTo('wherobots.pranav.knn_candidates')
Stage 2 – Exact refinement:
sedona.sql(''' ranked AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY query_id ORDER BY distance_meters ASC ) AS rank FROM wherobots.pranav.knn_candidates ) SELECT * FROM ranked WHERE rank = 1;''')
Key functions:
ST_AKNN(..., 10, false)
ST_AKNN
ROW_NUMBER()
The pattern works for all large non-point geometries.
We validated this approach using a buildings-to-roads proximity analysis across five US states. Each state contains a mix of dense urban, suburban, and sparse rural areas—exactly the heterogeneous density that breaks traditional methods.
For the two-stage pattern described in this article, use ST_AKNN (approximate) rather than ST_KNN (exact):
Since we’re computing exact distances with ST_DistanceSpheroid on the candidates anyway, the approximate nature of ST_AKNN has no impact on final accuracy—only on speed.
If you know the maximum acceptable distance for your use case, add a distance bound parameter to further optimize performance:
-- Only consider candidates within 5000 metersST_AKNN(query.geometry, target.geometry, 10, TRUE, 5000)
Unlike the conceptual “adaptive search area” discussed earlier, this is an actual distance predicate pushed down to the spatial partitioning stage—not applied as post-hoc filtering. Candidates beyond this threshold are pruned during index traversal, reducing I/O and computation.This is useful when:
The adaptive search radius pattern applies to any spatial proximity problem with heterogeneous density:
f(A, B)
Large-scale spatial analysis requires solving the spatial density problem that causes traditional distance predicates to fail. KNN provides the solution: a cardinality-bounded query operator that delivers consistent result sets regardless of local feature distribution, with predictable computational complexity.
At $0.0035 per geometry, spatial proximity analysis at any scale becomes economically trivial. The technique that processed 44 million buildings in 2.3 hours works equally well for customer analytics, infrastructure planning, or environmental assessment.
KNN’s density-invariant behavior isn’t just a performance optimization—it’s what makes heterogeneous spatial analysis reliable.
How We Delivered “Fields of The World” with RasterFlow: A Planetary-Scale GeoAI Pipeline
See how we used RasterFlow to run a 100TB+ global GeoAI pipeline, from feature mosaics to predictions and vectors, with reproducible workflows.
Spatial Data Pipeline Architecture: PostGIS and Wherobots Together
In the world of data architecture, there is a dangerous myth that you have to choose “one tool to rule them all.” We often see organizations paralyzed by the debate: “Should we use a Database or a Data Lake?” A spatial data pipeline architecture built for both large-scale analytics and operational queries is one of […]
Iceberg v3 Gets Native Geo Types. It’s More Than a Format Upgrade
Introduction Geospatial data touches nearly every industry, and until recently, the open lakehouse had no native way to handle it. Snowflake recently announced Iceberg v3 support with native geometry and geography types. It’s the first major engine to ship the geospatial extensions to the Iceberg spec. These types are now part of the open standard, […]
share this article
Awesome that you’d like to share our articles. Where would you like to share it to: