6 Mins Read

5 Feb 2026

Scaling Spatial Analysis: How KNN Solves the Spatial Density Problem for Large-Scale Proximity Analysis

Authors

Pranav Toggi

How we processed 44 million geometries across 5 US states by solving the spatial density problem that breaks traditional spatial proximity analysis

When scaling spatial proximity analysis from city to state to national level, the hidden challenge isn’t computational power—it’s spatial density. The techniques that work perfectly for urban neighborhoods fail dramatically when applied across heterogeneous landscapes.

The standard professional approach—using ST_DWithin with a fixed search radius—breaks down when spatial density varies. A 500-meter radius might capture 20 candidate features in Manhattan but zero in rural Wyoming. No single distance works for both.

This article demonstrates how k-nearest neighbors (ST_KNN) solves this problem. Unlike fixed-radius predicates that yield density-dependent result sets, KNN applies a top-k constraint—guaranteeing bounded cardinality regardless of local feature distribution. No distance threshold tuning required.

To validate this approach, we ran a buildings-to-roads proximity analysis across five US states on Wherobots Cloud: 44.4 million buildings against 535,000 road segments in 2.3 hours for $157—less than half a cent per geometry. The technique applies equally to any spatial proximity problem: customers to stores, facilities to services, properties to amenities.

The Professional’s Dilemma: Static vs Adaptive Search

The `ST_DWithin` Approach

For any GIS professional, the standard approach to spatial proximity analysis is `ST_DWithin` with a fixed radius:

sedona.sql('''
  SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance
  FROM query_geometries a
  JOIN target_geometries b 
      ON ST_DWithin(a.geometry, b.geometry, 500, true)  -- Fixed 500m radius
  ORDER BY distance
''')

This works beautifully for spatially homogeneous regions. But scale it to a state or nation, and you hit the spatial density problem—where non-uniform feature distribution causes query behavior to become unpredictable.

The Spatial Density Problem

Consider the same ST_DWithin(geometry, 500m) query across different spatial contexts:

st_dwithin_problem — *Figure 1: The same ST_DWithin query produces many candidates in dense urban areas but 0 candidates in sparse rural areas*

The same query produces wildly different result set cardinalities based on local feature density.

The Radius Paradox

Attempting to solve this with radius adjustment creates new problems:

Radius	Urban Result	Rural Result	Problem
500m	20 candidates	0 candidates	Rural queries fail
2km	200 candidates	3 candidates	Urban over-processing
10km	2000+ candidates	10 candidates	Urban becomes intractable

No single radius value works for heterogeneous spatial data: increase it to capture sparse regions, and dense regions become intractable with combinatorial explosion in candidate counts.

This isn’t a theoretical problem—it’s the practical limitation that prevents reliable large-scale spatial analysis using traditional methods.

KNN Spatial Analysis: Solving the Density Problem at Scale

K-nearest neighbors elegantly solves the spatial density problem by enforcing a cardinality constraint rather than a distance constraint—always returning exactly k candidates regardless of local feature density.

A useful mental model: ST_DWithin applies a distance predicate (fixed radius, variable cardinality), while ST_KNN applies a rank-based predicate (fixed cardinality, variable distance). This produces the effect of a density-adaptive search area—though KNN doesn’t compute any radius internally.

The Effect in Practice

sedona.sql('''
  SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) as distance
  FROM query_geometries a
  JOIN target_geometries b 
      ON ST_KNN(a.geometry, b.geometry, 10)  -- Always 10 candidates
  ORDER BY distance
''')

The same query now produces consistent results across all spatial densities:

KNN doesn’t compute or adjust a radius. It simply finds the k nearest neighbors, wherever they are. In dense areas, those neighbors happen to be close; in sparse areas, they’re farther away. The “adaptive” behavior emerges naturally from asking “what’s nearest?” rather than “what’s within X meters?”

The Adaptive Advantage

Metric	ST_DWithin	ST_KNN
Result cardinality	Variable (0 to 1000+)	Bounded (k)
Time complexity	Data-dependent	Predictable O(n × k)
Result quality	Density-dependent	Distribution-invariant
Parameter tuning	Manual, error-prone	Not required

The Key Insight: ST_DWithin asks “What’s within X meters?” (fixed radius, variable candidates). ST_KNN asks “What are the K nearest?” (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required.

Note: This “adaptive” behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It’s distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.

The Two-Stage Pattern for Accurate Spatial Proximity

KNN computes distances using geometry centroids (or bounding box representatives) for computational efficiency. For complex geometries like linestrings or large polygons, the centroid-to-centroid distance may diverge significantly from the true minimum Hausdorff distance.

We address this with a two-stage refinement approach:

Stage 1 (Candidate Generation): KNN selects k candidates using approximate centroid-based distance— $O(n\log k)$
Stage 2 (Exact Refinement): Precise geometric calculation (ST_ClosestPoint, ST_DistanceSpheroid) on k candidates only— $O(k)$ per query point

This decomposition gives us both fast candidate pruning and geometrically precise results.

KNN Implementation Pattern

Here’s the complete two-stage pattern using Wherobots SQL:

Stage 1 – Candidate Generation:

knn_df = sedona.sql('''
  SELECT 
      query.id AS query_id,
      query.geometry AS query_geometry,
      target.id AS target_id,
      target.geometry AS target_geometry,
      
      -- Exact closest point calculation
      ST_ClosestPoint(target.geometry, query.geometry) AS closest_point,
      
      -- Precise spheroidal distance
      ST_DistanceSpheroid(
          ST_ClosestPoint(target.geometry, query.geometry),
          query.geometry
      ) AS distance_meters
      
  FROM query_geometries AS query
  JOIN target_geometries AS target
      ON ST_AKNN(query.geometry, target.geometry, 10, false)
''')

knn_df.writeTo('wherobots.pranav.knn_candidates')

Stage 2 – Exact refinement:

sedona.sql('''
  ranked AS ( 
      SELECT *,
             ROW_NUMBER() OVER (
                 PARTITION BY query_id 
                 ORDER BY distance_meters ASC
             ) AS rank
      FROM wherobots.pranav.knn_candidates
  )

  SELECT * FROM ranked WHERE rank = 1;
''')

Key functions:

ST_AKNN(..., 10, false): Approximate KNN with k=10 candidates, Euclidean distance
Why ST_AKNN over ST_KNN?: When using KNN for candidate generation (followed by exact refinement via ST_ClosestPoint + ST_DistanceSpheroid), approximate KNN is preferred. The relaxed precision bounds of ST_AKNN enable faster index traversal, and any approximation error is eliminated in the refinement stage.
ST_ClosestPoint: Precise closest point on target geometry
ST_DistanceSpheroid: Accurate geodetic distance in meters
ROW_NUMBER(): Rank candidates and select the true closest

The pattern works for all large non-point geometries.

Results: Consistent Performance Across Spatial Densities

We validated this approach using a buildings-to-roads proximity analysis across five US states. Each state contains a mix of dense urban, suburban, and sparse rural areas—exactly the heterogeneous density that breaks traditional methods.

State	Buildings (Query Geometries)	Roads (Target Features)	Time (sec)	Cost	Throughput
New York	6,447,782	75,329	1,374	$25.31	4,693/sec
Texas	13,289,136	166,943	2,140	$41.60	6,208/sec
Colorado	2,764,970	34,432	1,074	$21.35	2,574/sec
California	13,648,296	135,247	2,086	$36.95	6,541/sec
Florida	8,201,965	122,048	1,490	$30.35	5,504/sec
Total	44,425,199	535,538	535,538	$157.08	5,301/sec

Key Observations

Consistent Throughput: Average of 5,300 geometries per second across all states, despite each containing vastly different urban/rural mixtures. This consistency is the adaptive search radius in action.
Cost Predictability: $0.0035 per geometry regardless of local spatial density. Budget with confidence.
Linear Scaling: Processing time scales linearly with geometry count—no density-dependent surprises.

Practical Guidance

`ST_KNN` vs `ST_AKNN` for Proximity Analysis

For the two-stage pattern described in this article, use ST_AKNN (approximate) rather than ST_KNN (exact):

Function	Speed	Precision	Best For
ST_KNN	Fast	Exact	When KNN result is the final answer
ST_AKNN	Faster	Approximate	Search space reduction before exact calculations

Since we’re computing exact distances with ST_DistanceSpheroid on the candidates anyway, the approximate nature of ST_AKNN has no impact on final accuracy—only on speed.

Using Distance Bounds with KNN

If you know the maximum acceptable distance for your use case, add a distance bound parameter to further optimize performance:

-- Only consider candidates within 5000 meters
ST_AKNN(query.geometry, target.geometry, 10, true, 5000)

Unlike the conceptual “adaptive search area” discussed earlier, this is an actual distance predicate pushed down to the spatial partitioning stage—not applied as post-hoc filtering. Candidates beyond this threshold are pruned during index traversal, reducing I/O and computation.
This is useful when:

Business logic requires a limit: e.g., “nearest hospital within 10km”
Performance optimization: You know neighbors beyond X meters aren’t relevant
Emergency response: Facilities must be within a critical response distance

KNN Performance Optimization Tips

Materialize intermediate results: Either cache or write to disk as Iceberg table, the KNN output before applying filters
Process categories separately: Run KNN for each target type (e.g., motorways, then trunk roads) independently
Use appropriate runtime: Medium runtime handled 13M geometries in ~35 minutes

Applications of KNN-Based Spatial Proximity Analysis

The adaptive search radius pattern applies to any spatial proximity problem with heterogeneous density:

Infrastructure & Planning
- Properties → nearest utilities, transit, services
- Customers → nearest stores, facilities, competitors
Environmental Analysis
- Development sites → nearest protected areas, water bodies
- Facilities → nearest residential areas, schools
Business Intelligence
- Locations → nearest amenities, employment centers
- Assets → nearest maintenance facilities, resources
General Pattern: For each query geometry A, find target geometry B that minimizes some expensive function f(A, B). Use KNN to reduce candidates, then apply exact calculation to the reduced set.

Conclusion

Large-scale spatial analysis requires solving the spatial density problem that causes traditional distance predicates to fail. KNN provides the solution: a cardinality-bounded query operator that delivers consistent result sets regardless of local feature distribution, with predictable computational complexity.

Key Takeaways

Density-Invariant Selection: KNN’s bounded cardinality constraint naturally accommodates varying feature density—no manual threshold tuning required.
Predictable Performance: Consistent throughput and cost across heterogeneous spatial data, enabling reliable budgeting and planning.
Two-Stage Pattern: Combine KNN’s fast candidate selection with precise geometric calculations for both speed and accuracy.

At $0.0035 per geometry, spatial proximity analysis at any scale becomes economically trivial. The technique that processed 44 million buildings in 2.3 hours works equally well for customer analytics, infrastructure planning, or environmental assessment.

KNN’s density-invariant behavior isn’t just a performance optimization—it’s what makes heterogeneous spatial analysis reliable.

Resources

Wherobots Documentation: ST_AKNN, ST_KNN, spatial function reference
Wherobots Cloud: Distributed spatial analytics platform
Overture Maps Foundation: Open map data used in our validation

Create your Wherobots account

START BUILDING

7 Mins Read 8 Jul 2026

How Bad Telemetry Data Sabotages Modern Fleets

By the Teams at Action Engine & Wherobots Fleet monitoring is undergoing a generational shift. Fleet monitoring, the systems that ingest and analyze vehicle telemetry to track fleet health, performance, and safety, has become the foundation of how operators run vehicles, not just track them. Modern vehicles generate orders of magnitude more telemetry than even […]

Stories + 2

3 Mins Read 8 Jul 2026

Introducing the Wherobots Innovation Edition, designed to accelerate your physical world objectives

The Wherobots Innovation Edition helps you deliver outcomes on top of spatial data that propel your organization forward, and make this data AI-ready. Today we are announcing the Wherobots Innovation Edition. This is an annual partnership that pairs the full Wherobots Cloud platform with our forward deployed spatial engineering expertise, developed over years of delivering […]

post

13 Mins Read 24 Jun 2026

Spatial Data in Apache Iceberg: Optimizations and Management That Matter

Spatial data in Apache Iceberg needs different optimization than tabular data. A geometry column has no natural sort order, so unsorted files carry wide, overlapping bounding boxes and query planners cannot prune them… At all… This behaviour turns a selective spatial filter into a full table scan. A second problem compounds it: one oversized geometry […]

Data Management + 3