
Scaling Spatial Analysis: How KNN Solves the Spatial Density Problem for Large-Scale Proximity Analysis


How we processed 44 million geometries across 5 US states by solving the spatial density problem that breaks traditional spatial proximity analysis

When scaling spatial proximity analysis from city to state to national level, the hidden challenge isn’t computational power—it’s spatial density. The techniques that work perfectly for urban neighborhoods fail dramatically when applied across heterogeneous landscapes.

The standard professional approach—using ST_DWithin with a fixed search radius—breaks down when spatial density varies. A 500-meter radius might capture 20 candidate features in Manhattan but zero in rural Wyoming. No single distance works for both.

This article demonstrates how k-nearest neighbors (ST_KNN) solves this problem. Unlike fixed-radius predicates that yield density-dependent result sets, KNN applies a top-k constraint—guaranteeing bounded cardinality regardless of local feature distribution. No distance threshold tuning required.

To validate this approach, we ran a buildings-to-roads proximity analysis across five US states on Wherobots Cloud: 44.4 million buildings against 535,000 road segments in 2.3 hours for $157, less than half a cent per thousand geometries. The technique applies equally to any spatial proximity problem: customers to stores, facilities to services, properties to amenities.

The ST_DWithin Approach

For any GIS professional, the standard approach to spatial proximity analysis is `ST_DWithin` with a fixed radius:

```python
sedona.sql('''
SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) AS distance
FROM query_geometries a
JOIN target_geometries b
ON ST_DWithin(a.geometry, b.geometry, 500, true) -- fixed 500 m radius
ORDER BY distance
''')
```

This works beautifully for spatially homogeneous regions. But scale it to a state or nation, and you hit the spatial density problem—where non-uniform feature distribution causes query behavior to become unpredictable.

The Spatial Density Problem

Consider the same `ST_DWithin` query with a fixed 500 m radius across different spatial contexts:

Figure 1: The same ST_DWithin query produces many candidates in dense urban areas but 0 candidates in sparse rural areas

The same query produces wildly different result set cardinalities based on local feature density.

The Radius Paradox

Attempting to solve this with radius adjustment creates new problems:

| Radius | Urban Result | Rural Result | Problem |
| --- | --- | --- | --- |
| 500 m | 20 candidates | 0 candidates | Rural queries fail |
| 2 km | 200 candidates | 3 candidates | Urban over-processing |
| 10 km | 2,000+ candidates | 10 candidates | Urban becomes intractable |

No single radius value works for heterogeneous spatial data: increase it to capture sparse regions, and dense regions become intractable with combinatorial explosion in candidate counts.

This isn’t a theoretical problem—it’s the practical limitation that prevents reliable large-scale spatial analysis using traditional methods.
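
The blow-up is visible in back-of-envelope arithmetic: under an assumed uniform density, the expected candidate count of a fixed-radius query grows with the square of the radius. A quick sketch (the 25 features/km² density is an illustrative assumption, not a figure from this study):

```python
import math

# Back-of-envelope sketch: with uniform density, expected candidates per
# fixed-radius query scale with the *square* of the radius.
# The 25 features/km^2 density is assumed for illustration only.
density = 25.0 / 1e6  # features per square meter (25 per km^2)

for radius_m in (500, 2000, 10000):
    expected = density * math.pi * radius_m ** 2
    print(f"{radius_m:>6} m radius -> ~{expected:,.0f} expected candidates")
```

Quadrupling the radius multiplies the expected candidate count by sixteen, which is why widening the search for sparse regions makes dense regions intractable.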

KNN Spatial Analysis: Solving the Density Problem at Scale

K-nearest neighbors elegantly solves the spatial density problem by enforcing a cardinality constraint rather than a distance constraint—always returning exactly k candidates regardless of local feature density.

A useful mental model: ST_DWithin applies a distance predicate (fixed radius, variable cardinality), while ST_KNN applies a rank-based predicate (fixed cardinality, variable distance). This produces the effect of a density-adaptive search area—though KNN doesn’t compute any radius internally.

The Effect in Practice

```python
sedona.sql('''
SELECT a.*, b.*, ST_Distance(a.geometry, b.geometry) AS distance
FROM query_geometries a
JOIN target_geometries b
ON ST_KNN(a.geometry, b.geometry, 10, false) -- always 10 candidates, planar distance
ORDER BY distance
''')
```

The same query now produces consistent results across all spatial densities:

Figure 2: KNN always returns exactly k candidates—nearby ones in dense areas, more distant ones in sparse areas.

KNN doesn’t compute or adjust a radius. It simply finds the k nearest neighbors, wherever they are. In dense areas, those neighbors happen to be close; in sparse areas, they’re farther away. The “adaptive” behavior emerges naturally from asking “what’s nearest?” rather than “what’s within X meters?”
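
The contrast is easy to reproduce in miniature. A toy comparison (synthetic coordinates, linear scans instead of spatial indexes; all names here are made up for illustration):

```python
import random

# Toy comparison: a fixed-radius predicate vs. a top-k predicate over the
# same synthetic point sets. Real engines use spatial indexes, not scans.
random.seed(7)

def count_within(points, center, radius):
    """ST_DWithin-style: result cardinality depends on local density."""
    cx, cy = center
    return sum(1 for x, y in points if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2)

def k_nearest(points, center, k):
    """ST_KNN-style: always returns min(k, len(points)) neighbors."""
    cx, cy = center
    return sorted(points, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)[:k]

# 1,000 points in a 1 km square ("urban") vs. 10 points ("rural")
urban = [(random.uniform(0, 1000), random.uniform(0, 1000)) for _ in range(1000)]
rural = [(random.uniform(0, 1000), random.uniform(0, 1000)) for _ in range(10)]
center = (500.0, 500.0)

print(count_within(urban, center, 100), count_within(rural, center, 100))
print(len(k_nearest(urban, center, 10)), len(k_nearest(rural, center, 10)))
```

The fixed-radius count swings with density, while the top-k query returns exactly ten neighbors in both settings.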

The Adaptive Advantage

| Metric | ST_DWithin | ST_KNN |
| --- | --- | --- |
| Result cardinality | Variable (0 to 1,000+) | Bounded (k) |
| Time complexity | Data-dependent | Predictable O(n × k) |
| Result quality | Density-dependent | Distribution-invariant |
| Parameter tuning | Manual, error-prone | Not required |

The Key Insight: ST_DWithin asks “What’s within X meters?” (fixed radius, variable candidates). ST_KNN asks “What are the K nearest?” (fixed candidates, variable distance). This fundamental difference is why KNN handles heterogeneous density gracefully—no radius tuning required.

Note: This “adaptive” behavior is conceptual—a way to understand why KNN outperforms fixed-radius approaches for heterogeneous data. It’s distinct from the optional distance bound parameter (see Practical Guidance), which imposes an actual maximum distance limit on candidates.

The Two-Stage Pattern for Accurate Spatial Proximity

KNN computes distances using geometry centroids (or bounding-box representatives) for computational efficiency. For complex geometries like linestrings or large polygons, the centroid-to-centroid distance may diverge significantly from the true minimum distance between the geometries.

We address this with a two-stage refinement approach:

  • Stage 1 (Candidate Generation): KNN selects k candidates using approximate centroid-based distance, O(n log k)
  • Stage 2 (Exact Refinement): Precise geometric calculation (ST_ClosestPoint + ST_DistanceSpheroid) on the k candidates only, O(k) per query point

This decomposition gives us both fast candidate pruning and geometrically precise results.
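
The decomposition can be illustrated in plain Python. In this self-contained sketch (made-up planar helpers standing in for the engine's centroid-based KNN and exact refinement), Stage 1 prunes segments by cheap centroid distance and Stage 2 refines with an exact point-to-segment distance:

```python
import math

# Two-stage sketch: cheap centroid pruning, then exact refinement on the
# k survivors only. Illustrative planar geometry, not Wherobots internals.

def centroid(seg):
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def point_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def point_segment_dist(p, seg):
    """Exact minimum distance from point p to line segment seg."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    if dx == dy == 0:
        return point_dist(p, (x1, y1))
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp to the segment
    return point_dist(p, (x1 + t * dx, y1 + t * dy))

def two_stage_nearest(p, segments, k=3):
    # Stage 1: k candidates by cheap centroid distance
    candidates = sorted(segments, key=lambda s: point_dist(p, centroid(s)))[:k]
    # Stage 2: exact refinement on the k candidates only
    return min(candidates, key=lambda s: point_segment_dist(p, s))

segments = [((0, 0), (10, 0)), ((0, 5), (0, 15)), ((20, 20), (30, 20))]
print(two_stage_nearest((1, 1), segments))
```

Note that Stage 2 can overturn Stage 1's ranking: a segment whose centroid is farther away can still be the true nearest, which is why k must be large enough to keep the real winner in the candidate set.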

KNN Implementation Pattern

Here’s the complete two-stage pattern using Wherobots SQL:

Stage 1 – Candidate Generation:

```python
knn_df = sedona.sql('''
SELECT
    query.id AS query_id,
    query.geometry AS query_geometry,
    target.id AS target_id,
    target.geometry AS target_geometry,

    -- Exact closest point calculation
    ST_ClosestPoint(target.geometry, query.geometry) AS closest_point,

    -- Precise spheroidal distance
    ST_DistanceSpheroid(
        ST_ClosestPoint(target.geometry, query.geometry),
        query.geometry
    ) AS distance_meters

FROM query_geometries AS query
JOIN target_geometries AS target
ON ST_AKNN(query.geometry, target.geometry, 10, false)
''')

knn_df.writeTo('wherobots.pranav.knn_candidates').create()
```

Stage 2 – Exact refinement:

```python
sedona.sql('''
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY query_id
            ORDER BY distance_meters ASC
        ) AS rank
    FROM wherobots.pranav.knn_candidates
)
SELECT * FROM ranked WHERE rank = 1
''')
```

Key functions:

  • ST_AKNN(..., 10, false): Approximate KNN with k=10 candidates, Euclidean distance
  • Why ST_AKNN over ST_KNN?: When using KNN for candidate generation (followed by exact refinement via ST_ClosestPoint + ST_DistanceSpheroid), approximate KNN is preferred. The relaxed precision bounds of ST_AKNN enable faster index traversal, and any approximation error is eliminated in the refinement stage.
  • ST_ClosestPoint: Precise closest point on target geometry
  • ST_DistanceSpheroid: Accurate geodetic distance in meters
  • ROW_NUMBER(): Rank candidates and select the true closest

The pattern works for all large non-point geometries.

Results: Consistent Performance Across Spatial Densities

We validated this approach using a buildings-to-roads proximity analysis across five US states. Each state contains a mix of dense urban, suburban, and sparse rural areas—exactly the heterogeneous density that breaks traditional methods.

| State | Buildings (Query Geometries) | Roads (Target Features) | Time (sec) | Cost | Throughput |
| --- | --- | --- | --- | --- | --- |
| New York | 6,447,782 | 75,329 | 1,374 | $25.31 | 4,693/sec |
| Texas | 13,289,136 | 166,943 | 2,140 | $41.60 | 6,208/sec |
| Colorado | 2,764,970 | 34,432 | 1,074 | $21.35 | 2,574/sec |
| California | 13,648,296 | 135,247 | 2,086 | $36.95 | 6,541/sec |
| Florida | 8,201,965 | 122,048 | 1,490 | $30.35 | 5,504/sec |
| Total | 44,425,199 | 535,538 | 8,380 | $157.08 | 5,301/sec |

Key Observations

  • Consistent Throughput: An average of 5,300 geometries per second across all states, despite each containing vastly different urban/rural mixtures. This consistency is KNN’s density-adaptive behavior in action.
  • Cost Predictability: $0.0035 per 1,000 geometries regardless of local spatial density. Budget with confidence.
  • Linear Scaling: Processing time scales linearly with geometry count—no density-dependent surprises.
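
These headline figures can be sanity-checked from the table itself. A throwaway sketch, with the numbers copied from the rows above:

```python
# Recompute throughput per state and unit cost from the reported figures.
states = {
    "New York":   (6_447_782, 1374),
    "Texas":      (13_289_136, 2140),
    "Colorado":   (2_764_970, 1074),
    "California": (13_648_296, 2086),
    "Florida":    (8_201_965, 1490),
}
for name, (geoms, secs) in states.items():
    print(f"{name:<10} {geoms / secs:,.0f} geometries/sec")

total_geoms, total_cost = 44_425_199, 157.08
cost_per_thousand = 1000 * total_cost / total_geoms
print(f"${cost_per_thousand:.4f} per 1,000 geometries")
```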

Practical Guidance

ST_KNN vs ST_AKNN for Proximity Analysis

For the two-stage pattern described in this article, use ST_AKNN (approximate) rather than ST_KNN (exact):

| Function | Speed | Precision | Best For |
| --- | --- | --- | --- |
| ST_KNN | Fast | Exact | When the KNN result is the final answer |
| ST_AKNN | Faster | Approximate | Search-space reduction before exact calculations |

Since we’re computing exact distances with ST_DistanceSpheroid on the candidates anyway, the approximate nature of ST_AKNN has no impact on final accuracy—only on speed.

Using Distance Bounds with KNN

If you know the maximum acceptable distance for your use case, add a distance bound parameter to further optimize performance:

```sql
-- Only consider candidates within 5000 meters
ST_AKNN(query.geometry, target.geometry, 10, TRUE, 5000)
```

Unlike the conceptual “adaptive search area” discussed earlier, this is an actual distance predicate pushed down to the spatial partitioning stage—not applied as post-hoc filtering. Candidates beyond this threshold are pruned during index traversal, reducing I/O and computation.
This is useful when:

  • Business logic requires a limit: e.g., “nearest hospital within 10km”
  • Performance optimization: You know neighbors beyond X meters aren’t relevant
  • Emergency response: Facilities must be within a critical response distance
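
Semantically, the bound turns "the k nearest" into "at most k within d". A toy model of that contract (plain Python with made-up coordinates; real engines prune during index traversal, not with a linear scan):

```python
import math

# Toy model of a distance-bounded KNN: drop candidates beyond max_dist,
# then keep the k nearest survivors. Models the result contract only.
def bounded_knn(query, points, k, max_dist):
    survivors = [(math.dist(query, p), p) for p in points]
    survivors = [(d, p) for d, p in survivors if d <= max_dist]  # prune by bound
    survivors.sort(key=lambda t: t[0])
    return [p for _, p in survivors[:k]]  # at most k, possibly fewer

pts = [(0, 1), (0, 3), (0, 7), (0, 50)]
print(bounded_knn((0, 0), pts, k=3, max_dist=10))  # the far point is pruned
print(bounded_knn((0, 0), pts, k=3, max_dist=2))   # fewer than k survivors
```

Unlike plain KNN, a bounded query can return fewer than k results, which is exactly what the "nearest hospital within 10km" use case requires.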

KNN Performance Optimization Tips

  1. Materialize intermediate results: Cache the KNN output, or write it to disk as an Iceberg table, before applying filters
  2. Process categories separately: Run KNN for each target type (e.g., motorways, then trunk roads) independently
  3. Use appropriate runtime: Medium runtime handled 13M geometries in ~35 minutes

Applications of KNN-Based Spatial Proximity Analysis

The density-adaptive KNN pattern applies to any spatial proximity problem with heterogeneous density:

  • Infrastructure & Planning
    • Properties → nearest utilities, transit, services
    • Customers → nearest stores, facilities, competitors
  • Environmental Analysis
    • Development sites → nearest protected areas, water bodies
    • Facilities → nearest residential areas, schools
  • Business Intelligence
    • Locations → nearest amenities, employment centers
    • Assets → nearest maintenance facilities, resources
  • General Pattern: For each query geometry A, find target geometry B that minimizes some expensive function f(A, B). Use KNN to reduce candidates, then apply exact calculation to the reduced set.
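
That general pattern fits in a few lines. In this sketch, `nearest_by` and the 1-D interval helpers are invented for illustration; the cheap key plays the role of the centroid-based KNN, and the expensive function plays the role of exact refinement:

```python
# Generic reduce-then-refine: a cheap key prunes to k candidates, an
# expensive function picks the true minimizer among them.
def nearest_by(query, targets, cheap_key, expensive_f, k=10):
    candidates = sorted(targets, key=lambda t: cheap_key(query, t))[:k]
    return min(candidates, key=lambda t: expensive_f(query, t))

# Toy usage on 1-D intervals: the cheap key is distance to an interval's
# midpoint; the expensive function is exact distance to the interval.
intervals = [(2, 8), (9, 30), (40, 41)]

def mid_dist(q, iv):
    return abs(q - (iv[0] + iv[1]) / 2)

def exact_dist(q, iv):
    return 0 if iv[0] <= q <= iv[1] else min(abs(q - iv[0]), abs(q - iv[1]))

# The cheap ranking prefers (2, 8), but exact refinement picks (9, 30),
# which actually contains the query point.
print(nearest_by(10, intervals, mid_dist, exact_dist, k=2))
```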

Conclusion

Large-scale spatial analysis requires solving the spatial density problem that causes traditional distance predicates to fail. KNN provides the solution: a cardinality-bounded query operator that delivers consistent result sets regardless of local feature distribution, with predictable computational complexity.

Key Takeaways

  1. Density-Invariant Selection: KNN’s bounded cardinality constraint naturally accommodates varying feature density—no manual threshold tuning required.
  2. Predictable Performance: Consistent throughput and cost across heterogeneous spatial data, enabling reliable budgeting and planning.
  3. Two-Stage Pattern: Combine KNN’s fast candidate selection with precise geometric calculations for both speed and accuracy.


At $0.0035 per 1,000 geometries, spatial proximity analysis at any scale becomes economically trivial. The technique that processed 44 million buildings in 2.3 hours works equally well for customer analytics, infrastructure planning, or environmental assessment.

KNN’s density-invariant behavior isn’t just a performance optimization—it’s what makes heterogeneous spatial analysis reliable.

Resources

Create Your Wherobots Account