
How well does SAM3 detect building footprints? Let’s ask the Wherobots Spatial AI Assistant!


In a recent post, we showed how easy it is to use RasterFlow and Meta’s Segment Anything 3 Model (SAM3) to detect features in the physical world. A single end-to-end pipeline built a 133 GB NAIP mosaic of Marion County, Oregon, ran SAM3 against it with text prompts spanning eight classes, and wrote approximately one million detection polygons to a Wherobots table, including roughly 312,000 building roofs.

That is an impressive result on its own. But once the inference job finished, the obvious next question was: are these detections any good? Specifically, how well do they agree with an independent reference dataset of building footprints?

The Overture Maps Foundation publishes a global buildings dataset that is freely available in the Wherobots Hub. If I could compare the SAM3 roof detections against Overture for the same county, I would have a first-pass evaluation of whether SAM3 is finding the right things in roughly the right places.

The catch: I am a product manager, not a data scientist, and I am not well-versed in the standard techniques used to evaluate the output of remote-sensing models. Intersection-over-union, recall and precision curves, confidence calibration, the right metric coordinate reference system… I knew the terms, but I had not performed an evaluation like this from scratch before.

That is exactly where the Wherobots Spatial AI Assistant comes in.

Evaluating SAM3 on Aerial Imagery in Four Prompts

I opened a conversation with the assistant inside Claude, using the Wherobots MCP server. I described what I had: a fresh SAM3 detection table in org_catalog.sam3_marion_db.sam3_outputs, and a goal of comparing it to Overture buildings. The full session took four prompts.

Prompt 1: Describe the data.

“can you see SAM3 results in org_catalog.sam3_marion_db.sam3_outputs? can you tell me more about this dataset of SAM3 detections for Marion County OR?”

Within seconds, the assistant had walked the catalog, run summary statistics, and returned a complete profile: eight detection classes (layers), around one million total rows, 312k roofs, confidence scores ranging from 0.50 to 0.96, the table’s spatial extent, and the source mosaic. One of the queries it ran was a simple breakdown of the detections by class (layer):

SELECT layer, COUNT(*) AS n, AVG(bbox_score) AS avg_score
FROM org_catalog.sam3_marion_db.sam3_outputs
GROUP BY layer
ORDER BY n DESC

A sample of SAM3 roof detections (blue) over NAIP imagery in Marion County, Oregon. Each polygon is a segmentation mask, not a bounding box.

Prompt 2: Design the comparison.

“if I wanted to compare the buildings (roofs) detecting with SAM3 against the building footprints in the overture dataset in wherobots_open_data.overture_maps_foundation, how would I do that?”

The assistant designed a four-stage approach: clip both datasets to a Marion County area of interest; spatial-join them on intersection; compute intersection-over-union per pair; and aggregate to recall, precision, and calibration metrics. It also surfaced a set of caveats:

  • Roofs are not footprints, since overhangs and occlusion mean the shapes will never match exactly.
  • Overture is not ground truth in rural areas and is likely to miss buildings.
  • The 0.5 confidence cutoff was already baked into the SAM3 outputs, which limits any precision-recall analysis to the upper half of the confidence range.
  • Area math must be performed in UTM zone 10N rather than EPSG:4326.

These considerations gave me confidence that the analysis was well thought through and likely accurate.
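To make the last caveat concrete, here is a minimal sketch of why areas computed directly on lon/lat coordinates are not comparable across latitudes. It assumes a spherical Earth with a mean radius of 6,371 km and uses roughly 44.9° N as a representative latitude for Marion County; both are my assumptions for illustration, not values from the notebook.

```python
import math

# Assumption: spherical Earth with a mean radius of 6,371 km.
EARTH_RADIUS_M = 6_371_000.0
M_PER_DEG_LAT = math.pi * EARTH_RADIUS_M / 180.0


def metres_per_degree(lat_deg: float) -> tuple[float, float]:
    """Return (metres per degree of longitude, metres per degree of latitude)."""
    return M_PER_DEG_LAT * math.cos(math.radians(lat_deg)), M_PER_DEG_LAT


def approx_area_m2(width_deg: float, height_deg: float, lat_deg: float) -> float:
    """Approximate ground area of a small lon/lat rectangle at a given latitude."""
    m_lon, m_lat = metres_per_degree(lat_deg)
    return (width_deg * m_lon) * (height_deg * m_lat)


# The same 0.0001 x 0.0001 degree "square" at the equator and at ~44.9 N
# (an assumed, approximate latitude for Marion County).
area_equator = approx_area_m2(0.0001, 0.0001, 0.0)
area_marion = approx_area_m2(0.0001, 0.0001, 44.9)
print(round(area_equator, 1), round(area_marion, 1))
```

The identical square in degrees covers roughly 30% less ground at Marion County's latitude than at the equator, which is why the area math belongs in a projected, metre-based CRS.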

Prompt 3: Use the right area of interest.

Rather than a rough bounding box, I wanted the comparison clipped to the actual Marion County admin boundary, fetched from a trustworthy source.

“can you use wkls to get the official admin boundary from Overture for Marion County, like this: gdf = gpd.read_file(wkls['us']['or']['Marion County'].geojson())”

The assistant rebuilt the join strategy around the admin polygon generated using wkls. It pulled the boundary as a single-row Spark view that could be broadcast into the spatial joins, and combined an Iceberg bounding-box prefilter on Overture with an exact ST_Intersects predicate against the real Marion County shape. The generated code was straightforward:

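import wkls
import geopandas as gpd

# Fetch the Marion County admin boundary from Overture via wkls.
gdf = gpd.read_file(wkls['us']['or']['Marion County'].geojson())
aoi_geom = gdf.geometry.iloc[0]
aoi_wkt = aoi_geom.wkt

# `sedona` is the SedonaContext created in the notebook's setup cell.
# Register the boundary as a one-row temp view for the spatial joins.
sedona.sql(f"""
  CREATE OR REPLACE TEMP VIEW aoi AS
  SELECT ST_GeomFromWKT('{aoi_wkt}') AS geom
""")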

Prompt 4: Let’s do the analysis in a notebook.

“can you create this analysis in a Jupyter notebook that I can run with Wherobots?”

The assistant returned a 24-cell notebook covering everything from SedonaContext setup through the final visualization. I ran it in VS Code using a Wherobots cloud runtime. The rest of this post walks through the code and the results.

Comparing SAM3 Building Detections to Overture Footprints

Here’s the notebook:

sam3_vs_overture_marion.ipynb

The notebook has four sections:

Setup and constants. Setting up imports, initializing the SedonaContext, and defining the target CRS (EPSG:32610, UTM zone 10N), IoU thresholds, and table names. It’s worth noting that the assistant chose UTM zone 10N automatically (the right metric CRS for western Oregon) without me having to ask.

Fetch the AOI from wkls. The notebook calls wkls['us']['or']['Marion County'].geojson(), reads the result into a GeoPandas frame, extracts the polygon’s WKT and bounding box, and registers a one-row aoi Spark view. It then renders the boundary on a SedonaKepler map so I could visually confirm I had the right AOI before running anything expensive.

Sanity counts. Two COUNT(*) queries over the SAM3 and Overture tables, both clipped to the admin boundary:

  • SAM3 roofs inside Marion County: 261,414 (the full dataset has 312k; the remainder fell outside the official admin polygon).
  • Overture buildings inside Marion County: 146,642.

SAM3 finds notably more candidate roof shapes than Overture has buildings: about 1.8 times as many inside the county boundary.

Filtered AOI views. Two temporary views, sam3_roofs and overture_bldgs, each with the lon/lat geometry and a UTM-projected geometry. The Overture view uses a two-stage filter: a bounding-box prefilter that takes advantage of Iceberg column statistics and an exact ST_Intersects predicate against the admin polygon for Marion County.
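The two-stage idea can be sketched in plain Python. This is a simplified illustration, not the notebook's code: it treats each building as a single point (for example, a centroid) and uses a ray-casting test as a stand-in for the exact ST_Intersects predicate. All function names here are hypothetical.

```python
def bbox_of(ring):
    """Bounding box (minx, miny, maxx, maxy) of a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(pt, bbox):
    """Cheap test: is the point inside the bounding box?"""
    x, y = pt
    minx, miny, maxx, maxy = bbox
    return minx <= x <= maxx and miny <= y <= maxy


def in_polygon(pt, ring):
    """Exact test: ray-casting (even-odd rule) against the polygon ring."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def two_stage_filter(points, ring):
    """Stage 1: bounding-box prefilter. Stage 2: exact test on the survivors."""
    bbox = bbox_of(ring)
    candidates = [p for p in points if in_bbox(p, bbox)]
    return [p for p in candidates if in_polygon(p, ring)]


# Toy AOI: a triangle. (3, 3) passes the bbox check but fails the exact test.
aoi_ring = [(0, 0), (4, 0), (0, 4)]
kept = two_stage_filter([(1, 1), (3, 3), (5, 5)], aoi_ring)
print(kept)
```

The prefilter discards anything far from the county without touching the detailed boundary, so the more expensive exact test only runs on plausible candidates.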

Spatial join. This step matches candidate buildings. For every pair of SAM3 roof and Overture building whose geometries intersect, the query computes the intersection area, the two source areas, and the IoU:

SELECT
  s.sam3_id,
  s.bbox_score,
  o.overture_id,
  ST_Area(ST_Intersection(s.geom_m, o.geom_m)) AS inter_m2,
  ST_Area(s.geom_m) AS sam3_m2,
  ST_Area(o.geom_m) AS overture_m2,
  ST_Area(ST_Intersection(s.geom_m, o.geom_m)) /
    NULLIF(ST_Area(s.geom_m) + ST_Area(o.geom_m)
           - ST_Area(ST_Intersection(s.geom_m, o.geom_m)), 0) AS iou
FROM sam3_roofs s
JOIN overture_bldgs o
  ON ST_Intersects(s.geom_4326, o.geom_4326)

Intersection-over-union (IoU) is the area shared by two polygons divided by the area of their union (the sum of the two areas minus the shared area). A value of 1.0 is a perfect match; 0.0 means no overlap.
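As a worked example of the formula, here is a minimal sketch that computes IoU for two axis-aligned rectangles given in metres. Real roofs and footprints are arbitrary polygons, which is why the notebook relies on Sedona's ST_Intersection and ST_Area; the rectangle helpers below are hypothetical and only there to make the arithmetic visible.

```python
def rect_area(rect):
    """Area of a rectangle given as (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = rect
    return max(0.0, maxx - minx) * max(0.0, maxy - miny)


def rect_iou(a, b):
    """Intersection-over-union of two axis-aligned rectangles.

    Returns None when the union has zero area, mirroring the NULLIF guard
    in the SQL query.
    """
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union > 0 else None


# Hypothetical 10 m x 10 m roof, and a footprint shifted 2 m to the east.
roof = (0.0, 0.0, 10.0, 10.0)
footprint = (2.0, 0.0, 12.0, 10.0)
iou = rect_iou(roof, footprint)  # intersection 80 m2, union 120 m2
print(round(iou, 3))
```

Even a 2 m offset on a 10 m building drops the IoU to about two-thirds, which helps explain why roof outlines and ground footprints rarely score close to 1.0.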

The ST_Intersects predicate runs on the lon/lat geometries, where Sedona’s spatial join planner has the column statistics it needs to be efficient; the ST_Area and ST_Intersection calls run on the UTM-projected geometries, where the resulting numbers are in square meters and meaningful. The query returns 207,109 candidate matched pairs and is cached for the rest of the analysis.

Resolution and metrics. Sometimes a single SAM3 polygon covers several Overture buildings, for example when SAM3 segments a multi-roof complex as one shape. Other times, several SAM3 polygons land on the same Overture building. To keep the comparison clean, the notebook keeps only the best-matching pair on each side. The remaining cells then compute the numbers the rest of the post relies on:

  • Recall. What fraction of Overture buildings has a matching SAM3 detection of roughly the right shape.
  • Precision. What fraction of SAM3 detections lines up with a known Overture building.
  • Calibration. Whether SAM3’s confidence score is a reliable signal of how well-shaped each detection actually is.
  • Aggregate area. How the total roof area SAM3 found across the county compares to the total building footprint area Overture has.
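The resolution step and the first two metrics can be sketched as follows. This is one plausible reading rather than the notebook's actual code: I assume "best-matching" means highest IoU, that a match counts when IoU reaches a threshold, and that 0.5 is that threshold. The IDs and values are toy data.

```python
def best_per_key(pairs, key_index):
    """Keep the highest-IoU pair for each id found at position key_index.

    Each pair is (sam3_id, overture_id, iou).
    """
    best = {}
    for pair in pairs:
        key = pair[key_index]
        if key not in best or pair[2] > best[key][2]:
            best[key] = pair
    return best


def recall_precision(pairs, n_overture, n_sam3, iou_threshold=0.5):
    """Recall over Overture buildings and precision over SAM3 detections.

    iou_threshold=0.5 is an assumed value, not one taken from the notebook.
    """
    best_by_overture = best_per_key(pairs, 1)
    best_by_sam3 = best_per_key(pairs, 0)
    hit_overture = sum(1 for p in best_by_overture.values() if p[2] >= iou_threshold)
    hit_sam3 = sum(1 for p in best_by_sam3.values() if p[2] >= iou_threshold)
    return hit_overture / n_overture, hit_sam3 / n_sam3


# Toy candidate pairs: s1 and s2 both touch o1; s3 spans o2 and o3.
toy_pairs = [
    ("s1", "o1", 0.9),
    ("s2", "o1", 0.3),
    ("s3", "o2", 0.4),
    ("s3", "o3", 0.7),
]
# Totals include buildings and detections that never intersected anything.
recall, precision = recall_precision(toy_pairs, n_overture=4, n_sam3=5)
print(recall, precision)
```

Note that the denominators are the full counts inside the AOI, not just the rows that survived the join; unmatched buildings and detections are exactly what recall and precision are meant to penalize.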

A final cell renders five high-IoU and five low-IoU matched pairs on aerial imagery for visual spot-checking, so I could see with my own eyes where SAM3 and Overture agree and where they don’t.

Example results comparing Overture building footprints (yellow) and SAM3 roof detections (blue). Top: a high quality result. Center: SAM3 failed to segment the roof in this building. Bottom: SAM3 segments part of the roof, but not the whole footprint.

SAM3 Building Detection Accuracy: Recall, Precision, and IoU Results

I wasn’t sure how best to assess the results, so I asked Claude to help with the analysis. Here is a summary:

  • The overall agreement is very high. SAM3 and Overture report total roof area within 7% of each other across this county. That level of agreement suggests they are measuring the same physical objects, not coincidentally landing on the same total.
  • SAM3’s confidence score correlates with the accuracy of the polygon. SAM3 detections with high confidence scores have polygons that line up well with the real buildings, while lower-confidence detections have less accurate shapes. That means we can use the confidence score as a reliable filter.
  • Many of SAM3’s “false positives” are real buildings Overture missed. The assistant pointed out that many of the SAM3 detections without an Overture match are actual rural houses and outbuildings visible on aerial imagery. So the true precision is likely higher than the 75% measured against Overture.
  • When SAM3 finds a building, it usually gets the shape right. Only about 12 percentage points separate “found something on this building” from “found roughly the right shape on this building.” That means SAM3 is not just placing a point on the map, it is accurately outlining the building roofs.
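A calibration check of the kind described in the second bullet can be sketched by grouping matched detections into confidence bins and averaging IoU per bin. The bin width, the function name, and the sample values below are all hypothetical; they are not the notebook's numbers.

```python
import math
from collections import defaultdict


def mean_iou_by_confidence(matches, bin_width=0.1):
    """Average IoU per confidence bin.

    matches: iterable of (confidence, iou) for matched detections.
    Returns {bin_lower_edge: mean_iou}, ordered by bin.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bin index -> [sum of IoU, count]
    for score, iou in matches:
        # Round before flooring so that 0.6 / 0.1 lands in bin 6, not bin 5.
        idx = math.floor(round(score / bin_width, 9))
        totals[idx][0] += iou
        totals[idx][1] += 1
    return {
        round(idx * bin_width, 10): iou_sum / count
        for idx, (iou_sum, count) in sorted(totals.items())
    }


# Made-up (confidence, IoU) pairs, for illustration only.
toy_matches = [(0.55, 0.40), (0.58, 0.50), (0.72, 0.60), (0.91, 0.85), (0.95, 0.80)]
calibration = mean_iou_by_confidence(toy_matches)
for lower_edge, mean_iou in calibration.items():
    print(f"{lower_edge:.1f}+  mean IoU {mean_iou:.3f}")
```

If mean IoU rises steadily from the lowest bin to the highest, as it does in this toy data, the score is a usable filter: raising the confidence cutoff trades some detections for better-shaped polygons.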

These results are impressive! With a single text prompt (“roofs”), I produced detections that match a curated, multi-source reference dataset to within 7% on total roof area, find the right shape on three-quarters of known buildings, and carry a confidence score an application can actually trust.

Even though SAM3 was not fine-tuned for roof detection, the resulting output would be operationally useful for many use cases.

Using the Spatial AI Coding Assistant to Evaluate Computer Vision ML Models

A few hours after I started, I had an initial assessment of the quality of the SAM3 detections and a notebook with reproducible results that anyone can rerun in minutes. The assistant does not replace the data scientist, but it accelerates the initial spatial and statistical analysis, and it surfaced important caveats from the start.

Try It Yourself

Try the Spatial AI Coding Assistant