
Scaling Spatial Analysis: How KNN Solves the Spatial Density Problem for Large-Scale Proximity Analysis


How we processed 44 million geometries across 5 US states by solving the spatial density problem that breaks traditional spatial proximity analysis

When scaling spatial proximity analysis from city to state to national level, the hidden challenge isn’t computational power—it’s spatial density. The techniques that work perfectly for urban neighborhoods fail dramatically when applied across heterogeneous landscapes.

The standard professional approach—using ST_DWithin with a fixed search radius—breaks down when spatial density varies. A 500-meter radius might capture 20 candidate features in Manhattan but zero in rural Wyoming. No single distance works for both.

This article demonstrates how k-nearest neighbors (ST_KNN) solves this problem. Unlike fixed-radius predicates that yield density-dependent result sets, KNN applies a top-k constraint—guaranteeing bounded cardinality regardless of local feature distribution. No distance threshold tuning required.

To validate this approach, we ran a buildings-to-roads proximity analysis across five US states on Wherobots Cloud: 44.4 million buildings against 535,000 road segments in 2.3 hours for $157, less than half a cent per thousand geometries. The technique applies equally to any spatial proximity problem: customers to stores, facilities to services, properties to amenities.

The ST_DWithin Approach

For any GIS professional, the standard approach to spatial proximity analysis is `ST_DWithin` with a fixed radius:

```python
sedona.sql('''
SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) AS distance
FROM query_geometries a
JOIN target_geometries b
ON ST_DWithin(a.geometry, b.geometry, 500, true) -- fixed 500 m radius
ORDER BY distance
''')
```

This works beautifully for spatially homogeneous regions. But scale it to a state or nation, and you hit the spatial density problem—where non-uniform feature distribution causes query behavior to become unpredictable.

The Spatial Density Problem

Consider the same `ST_DWithin` query with a fixed 500 m radius across different spatial contexts:

Figure 1: The same ST_DWithin query produces many candidates in dense urban areas but 0 candidates in sparse rural areas

The same query produces wildly different result set cardinalities based on local feature density.

The Radius Paradox

Attempting to solve this with radius adjustment creates new problems:

| Radius | Urban Result | Rural Result | Problem |
| --- | --- | --- | --- |
| 500 m | 20 candidates | 0 candidates | Rural queries fail |
| 2 km | 200 candidates | 3 candidates | Urban over-processing |
| 10 km | 2,000+ candidates | 10 candidates | Urban becomes intractable |

No single radius value works for heterogeneous spatial data: increase it to capture sparse regions, and dense regions become intractable with combinatorial explosion in candidate counts.

This isn’t a theoretical problem—it’s the practical limitation that prevents reliable large-scale spatial analysis using traditional methods.
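
The blow-up is visible in back-of-envelope arithmetic: under an assumed uniform density, the expected candidate count of a fixed-radius query grows with the square of the radius. A quick sketch (the 25 features/km² density is an illustrative assumption, not a figure from this study):

```python
import math

# Back-of-envelope sketch: with uniform density, expected candidates per
# fixed-radius query scale with the *square* of the radius.
# The 25 features/km^2 density is assumed for illustration only.
density = 25.0 / 1e6  # features per square meter (25 per km^2)

for radius_m in (500, 2000, 10000):
    expected = density * math.pi * radius_m ** 2
    print(f"{radius_m:>6} m radius -> ~{expected:,.0f} expected candidates")
```

Quadrupling the radius multiplies the expected candidate count by sixteen, which is why widening the search for sparse regions makes dense regions intractable.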

KNN Spatial Analysis: Solving the Density Problem at Scale

K-nearest neighbors elegantly solves the spatial density problem by enforcing a cardinality constraint rather than a distance constraint—always returning exactly k candidates regardless of local feature density.

A useful mental model: ST_DWithin applies a distance predicate (fixed radius, variable cardinality), while ST_KNN applies a rank-based predicate (fixed cardinality, variable distance). This produces the effect of a density-adaptive search area—though KNN doesn’t compute any radius internally.

The Effect in Practice

```python
sedona.sql('''
SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) AS distance
FROM query_geometries a
JOIN target_geometries b
ON ST_KNN(a.geometry, b.geometry, 10, false) -- always 10 candidates, planar distance
ORDER BY distance
''')
```

The same query now produces consistent results across all spatial densities:

Figure 2: KNN always returns exactly k candidates—nearby ones in dense areas, more distant ones in sparse areas.

KNN doesn’t compute or adjust a radius. It simply finds the k nearest neighbors, wherever they are. In dense areas, those neighbors happen to be close; in sparse areas, they’re farther away. The “adaptive” behavior emerges naturally from asking “what’s nearest?” rather than “what’s within X meters?”
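
The contrast is easy to reproduce in miniature. A toy comparison (synthetic coordinates, linear scans instead of spatial indexes; all names here are made up for illustration):

```python
import random

# Toy comparison: a fixed-radius predicate vs. a top-k predicate over the
# same synthetic point sets. Real engines use spatial indexes, not scans.
random.seed(7)

def count_within(points, center, radius):
    """ST_DWithin-style: result cardinality depends on local density."""
    cx, cy = center
    return sum(1 for x, y in points if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2)

def k_nearest(points, center, k):
    """ST_KNN-style: always returns min(k, len(points)) neighbors."""
    cx, cy = center
    return sorted(points, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)[:k]

# 1,000 points in a 1 km square ("urban") vs. 10 points ("rural")
urban = [(random.uniform(0, 1000), random.uniform(0, 1000)) for _ in range(1000)]
rural = [(random.uniform(0, 1000), random.uniform(0, 1000)) for _ in range(10)]
center = (500.0, 500.0)

print(count_within(urban, center, 100), count_within(rural, center, 100))
print(len(k_nearest(urban, center, 10)), len(k_nearest(rural, center, 10)))
```

The fixed-radius count swings with density, while the top-k query returns exactly ten neighbors in both settings.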

The Adaptive Advantage

| Metric | ST_DWithin | ST_KNN |
| --- | --- | --- |
| Result cardinality | Variable (0 to 1,000+) | Bounded (k) |
| Time complexity | Data-dependent | Predictable O(n × k) |
| Result quality | Density-dependent | Distribution-invariant |
| Parameter tuning | Manual, error-prone | Not required |

The Key Insight: ST_DWithin asks “What’s within X meters?” (fixed radius, variable candidates). ST_KNN asks “What are the K nearest?” (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required.

Note: This “adaptive” behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It’s distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.

The Two-Stage Pattern for Accurate Spatial Proximity

KNN computes distances using geometry centroids (or bounding-box representatives) for computational efficiency. For complex geometries like linestrings or large polygons, the centroid-to-centroid distance may diverge significantly from the true minimum distance between the geometries.

We address this with a two-stage refinement approach:

  • Stage 1 (Candidate Generation): KNN selects k candidates using approximate centroid-based distance, O(n log k)
  • Stage 2 (Exact Refinement): Precise geometric calculation (ST_ClosestPoint + ST_DistanceSpheroid) on the k candidates only, O(k) per query point

This decomposition gives us both fast candidate pruning and geometrically precise results.
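
The decomposition can be illustrated in plain Python. In this self-contained sketch (made-up planar helpers standing in for the engine's centroid-based KNN and exact refinement), Stage 1 prunes segments by cheap centroid distance and Stage 2 refines with an exact point-to-segment distance:

```python
import math

# Two-stage sketch: cheap centroid pruning, then exact refinement on the
# k survivors only. Illustrative planar geometry, not Wherobots internals.

def centroid(seg):
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def point_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def point_segment_dist(p, seg):
    """Exact minimum distance from point p to line segment seg."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    if dx == dy == 0:
        return point_dist(p, (x1, y1))
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp to the segment
    return point_dist(p, (x1 + t * dx, y1 + t * dy))

def two_stage_nearest(p, segments, k=3):
    # Stage 1: k candidates by cheap centroid distance
    candidates = sorted(segments, key=lambda s: point_dist(p, centroid(s)))[:k]
    # Stage 2: exact refinement on the k candidates only
    return min(candidates, key=lambda s: point_segment_dist(p, s))

segments = [((0, 0), (10, 0)), ((0, 5), (0, 15)), ((20, 20), (30, 20))]
print(two_stage_nearest((1, 1), segments))
```

Note that Stage 2 can overturn Stage 1's ranking: a segment whose centroid is farther away can still be the true nearest, which is why k must be large enough to keep the real winner in the candidate set.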

KNN Implementation Pattern

Here’s the complete two-stage pattern using Wherobots SQL:

Stage 1 – Candidate Generation:

```python
knn_df = sedona.sql('''
SELECT
    query.id AS query_id,
    query.geometry AS query_geometry,
    target.id AS target_id,
    target.geometry AS target_geometry,

    -- Exact closest point calculation
    ST_ClosestPoint(target.geometry, query.geometry) AS closest_point,

    -- Precise spheroidal distance
    ST_DistanceSpheroid(
        ST_ClosestPoint(target.geometry, query.geometry),
        query.geometry
    ) AS distance_meters

FROM query_geometries AS query
JOIN target_geometries AS target
ON ST_AKNN(query.geometry, target.geometry, 10, false)
''')

knn_df.writeTo('wherobots.pranav.knn_candidates').create()
```

Stage 2 – Exact refinement:

```python
sedona.sql('''
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY query_id
            ORDER BY distance_meters ASC
        ) AS rank
    FROM wherobots.pranav.knn_candidates
)
SELECT * FROM ranked WHERE rank = 1
''')
```

Key functions:

  • ST_AKNN(..., 10, false): Approximate KNN with k=10 candidates, Euclidean distance
  • Why ST_AKNN over ST_KNN?: When using KNN for candidate generation (followed by exact refinement via ST_ClosestPoint + ST_DistanceSpheroid), approximate KNN is preferred. The relaxed precision bounds of ST_AKNN enable faster index traversal, and any approximation error is eliminated in the refinement stage.
  • ST_ClosestPoint: Precise closest point on target geometry
  • ST_DistanceSpheroid: Accurate geodetic distance in meters
  • ROW_NUMBER(): Rank candidates and select the true closest

The pattern works for all large non-point geometries.

Results: Consistent Performance Across Spatial Densities

We validated this approach using a buildings-to-roads proximity analysis across five US states. Each state contains a mix of dense urban, suburban, and sparse rural areas—exactly the heterogeneous density that breaks traditional methods.

| State | Buildings (Query Geometries) | Roads (Target Features) | Time (sec) | Cost | Throughput |
| --- | --- | --- | --- | --- | --- |
| New York | 6,447,782 | 75,329 | 1,374 | $25.31 | 4,693/sec |
| Texas | 13,289,136 | 166,943 | 2,140 | $41.60 | 6,208/sec |
| Colorado | 2,764,970 | 34,432 | 1,074 | $21.35 | 2,574/sec |
| California | 13,648,296 | 135,247 | 2,086 | $36.95 | 6,541/sec |
| Florida | 8,201,965 | 122,048 | 1,490 | $30.35 | 5,504/sec |
| Total | 44,425,199 | 535,538 | 8,380 | $157.08 | 5,301/sec |

Key Observations

  • Consistent Throughput: An average of 5,300 geometries per second across all states, despite each containing vastly different urban/rural mixtures. This consistency is KNN’s density-adaptive behavior in action.
  • Cost Predictability: $0.0035 per 1,000 geometries regardless of local spatial density. Budget with confidence.
  • Linear Scaling: Processing time scales linearly with geometry count—no density-dependent surprises.
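
These headline figures can be sanity-checked from the table itself. A throwaway sketch, with the numbers copied from the rows above:

```python
# Recompute throughput per state and unit cost from the reported figures.
states = {
    "New York":   (6_447_782, 1374),
    "Texas":      (13_289_136, 2140),
    "Colorado":   (2_764_970, 1074),
    "California": (13_648_296, 2086),
    "Florida":    (8_201_965, 1490),
}
for name, (geoms, secs) in states.items():
    print(f"{name:<10} {geoms / secs:,.0f} geometries/sec")

total_geoms, total_cost = 44_425_199, 157.08
cost_per_thousand = 1000 * total_cost / total_geoms
print(f"${cost_per_thousand:.4f} per 1,000 geometries")
```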

Practical Guidance

ST_KNN vs ST_AKNN for Proximity Analysis

For the two-stage pattern described in this article, use ST_AKNN (approximate) rather than ST_KNN (exact):

| Function | Speed | Precision | Best For |
| --- | --- | --- | --- |
| ST_KNN | Fast | Exact | When the KNN result is the final answer |
| ST_AKNN | Faster | Approximate | Search-space reduction before exact calculations |

Since we’re computing exact distances with ST_DistanceSpheroid on the candidates anyway, the approximate nature of ST_AKNN has no impact on final accuracy—only on speed.

Using Distance Bounds with KNN

If you know the maximum acceptable distance for your use case, add a distance bound parameter to further optimize performance:

```sql
-- Only consider candidates within 5000 meters
ST_AKNN(query.geometry, target.geometry, 10, TRUE, 5000)
```

Unlike the conceptual “adaptive search area” discussed earlier, this is an actual distance predicate pushed down to the spatial partitioning stage—not applied as post-hoc filtering. Candidates beyond this threshold are pruned during index traversal, reducing I/O and computation.
This is useful when:

  • Business logic requires a limit: e.g., “nearest hospital within 10km”
  • Performance optimization: You know neighbors beyond X meters aren’t relevant
  • Emergency response: Facilities must be within a critical response distance
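
Semantically, the bound turns "the k nearest" into "at most k within d". A toy model of that contract (plain Python with made-up coordinates; real engines prune during index traversal, not with a linear scan):

```python
import math

# Toy model of a distance-bounded KNN: drop candidates beyond max_dist,
# then keep the k nearest survivors. Models the result contract only.
def bounded_knn(query, points, k, max_dist):
    survivors = [(math.dist(query, p), p) for p in points]
    survivors = [(d, p) for d, p in survivors if d <= max_dist]  # prune by bound
    survivors.sort(key=lambda t: t[0])
    return [p for _, p in survivors[:k]]  # at most k, possibly fewer

pts = [(0, 1), (0, 3), (0, 7), (0, 50)]
print(bounded_knn((0, 0), pts, k=3, max_dist=10))  # the far point is pruned
print(bounded_knn((0, 0), pts, k=3, max_dist=2))   # fewer than k survivors
```

Unlike plain KNN, a bounded query can return fewer than k results, which is exactly what the "nearest hospital within 10km" use case requires.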

KNN Performance Optimization Tips

  1. Materialize intermediate results: Cache the KNN output, or write it to disk as an Iceberg table, before applying filters
  2. Process categories separately: Run KNN for each target type (e.g., motorways, then trunk roads) independently
  3. Use appropriate runtime: Medium runtime handled 13M geometries in ~35 minutes

Applications of KNN-Based Spatial Proximity Analysis

The density-adaptive KNN pattern applies to any spatial proximity problem with heterogeneous density:

  • Infrastructure & Planning
    • Properties → nearest utilities, transit, services
    • Customers → nearest stores, facilities, competitors
  • Environmental Analysis
    • Development sites → nearest protected areas, water bodies
    • Facilities → nearest residential areas, schools
  • Business Intelligence
    • Locations → nearest amenities, employment centers
    • Assets → nearest maintenance facilities, resources
  • General Pattern: For each query geometry A, find target geometry B that minimizes some expensive function f(A, B). Use KNN to reduce candidates, then apply exact calculation to the reduced set.
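
That general pattern fits in a few lines. In this sketch, `nearest_by` and the 1-D interval helpers are invented for illustration; the cheap key plays the role of the centroid-based KNN, and the expensive function plays the role of exact refinement:

```python
# Generic reduce-then-refine: a cheap key prunes to k candidates, an
# expensive function picks the true minimizer among them.
def nearest_by(query, targets, cheap_key, expensive_f, k=10):
    candidates = sorted(targets, key=lambda t: cheap_key(query, t))[:k]
    return min(candidates, key=lambda t: expensive_f(query, t))

# Toy usage on 1-D intervals: the cheap key is distance to an interval's
# midpoint; the expensive function is exact distance to the interval.
intervals = [(2, 8), (9, 30), (40, 41)]

def mid_dist(q, iv):
    return abs(q - (iv[0] + iv[1]) / 2)

def exact_dist(q, iv):
    return 0 if iv[0] <= q <= iv[1] else min(abs(q - iv[0]), abs(q - iv[1]))

# The cheap ranking prefers (2, 8), but exact refinement picks (9, 30),
# which actually contains the query point.
print(nearest_by(10, intervals, mid_dist, exact_dist, k=2))
```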

Conclusion

Large-scale spatial analysis requires solving the spatial density problem that causes traditional distance predicates to fail. KNN provides the solution: a cardinality-bounded query operator that delivers consistent result sets regardless of local feature distribution, with predictable computational complexity.

Key Takeaways

  1. Density-Invariant Selection: KNN’s bounded cardinality constraint naturally accommodates varying feature density—no manual threshold tuning required.
  2. Predictable Performance: Consistent throughput and cost across heterogeneous spatial data, enabling reliable budgeting and planning.
  3. Two-Stage Pattern: Combine KNN’s fast candidate selection with precise geometric calculations for both speed and accuracy.


At $0.0035 per 1,000 geometries, spatial proximity analysis at any scale becomes economically trivial. The technique that processed 44 million buildings in 2.3 hours works equally well for customer analytics, infrastructure planning, or environmental assessment.

KNN’s density-invariant behavior isn’t just a performance optimization—it’s what makes heterogeneous spatial analysis reliable.

Resources

Create Your Wherobots Account