Detecting Objects From Text Prompts with RasterFlow and Segment Anything 3

RasterFlow Now Supports Detections from Text!

RasterFlow now makes it simple to run promptable geospatial vision models across large aerial and satellite imagery collections, removing the need to build bespoke inference pipelines. With our new SAM 3 support, you can prompt for concepts like “roofs”, “roads”, or “shipping containers” and turn those detections into vector outputs ready for analysis in Wherobots, from city scale to country scale.

RasterFlow visualization tool showing SAM 3 detections.

Here we see SAM 3 localize and capture roofs really well. It also segments roads pretty cleanly, though there is still room for improvement. Results are shown over the inference input, a 30 centimeter NAIP basemap, and include all instance pixels above 50% confidence.

In this post, we put SAM 3 to the test detecting roofs in suburban neighborhoods, shipping containers in crowded loading docks, and tractors across agricultural landscapes. Throughout, we comment on where it succeeds, where it falls short, and how results compare to previous SAM models. Finally, we discuss what’s next for computer vision in Earth observation (EO) and how the community can build on models like SAM to create more promptable, flexible applications that handle the scale and diversity of remote sensing imagery.

How Segment Anything Model 3 Improves on SAM 1 and SAM 2 for Remote Sensing

In 2023, the Segment Anything Model (SAM) set a new paradigm for computer vision. While most models were trained to address one task at a time, SAM demonstrated that a single model could reliably localize and segment many kinds of objects in complex natural scenes.

SAM panoptic segmentation examples.
SAM mask predictions across diverse scenes.

SAM also established a new standard task for computer vision, one that goes beyond classification or semantic segmentation: Promptable Visual Segmentation (PVS), in which a model takes spatial prompts such as boxes, points, or masks as input and predicts segmentation masks.
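
To make PVS concrete, here is a minimal sketch against Meta’s open source segment-anything package (SAM 1). The image array is a stand-in for a real scene, and the checkpoint is the publicly released ViT-H weights file:

import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Placeholder inputs: supply your own image and the released ViT-H checkpoint.
image_rgb = np.zeros((768, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB scene

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)

# A single foreground point prompt (x, y). SAM returns candidate masks and
# quality scores, but no class label for what the masks contain.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 384]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,
)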

However, SAM 1 struggled on out-of-distribution imagery domains. Back in 2023, while I was working on detection projects with 10 meter resolution imagery, I found it often took multiple rounds of prompting to get useful segments, and many results still needed manual cleanup.

For example, in a rural town, SAM 1 can segment many features across different spatial scales, but it still misses all roads.

SAM 1 segmenting features in rural Dikwa, Nigeria.
SAM 1 segments many features across different spatial scales in Dikwa, Nigeria, but still misses roads.

SAM 2 improved accuracy and added video support, but both SAM 1 and SAM 2 were limited to making predictions without associated labels. The masks they produced were not grounded in categories useful for deriving insights.

Fast forward to today: SAM 3 addresses Promptable Concept Segmentation (PCS), a task where a model can accept either spatial prompts (masks, boxes, points) or text prompts like “cat”, “dog”, or even “roofs”!

The SAM 3 architecture.
The SAM 3 architecture supports Promptable Concept Segmentation, accepting text and image exemplars in addition to spatial prompts to produce masks with labeled concepts. It also benefits from pretraining on overhead imagery. Source: YouTube.

This opens up new possibilities for detecting objects in imagery. Because SAM 3’s training data spans many imaging domains, including overhead aerial imagery, it can succeed in many Earth observation contexts where previous model generations struggled. For example, one can use SAM 3 to predict “roofs” in NAIP imagery simply by asking, with no model training or ad hoc labeling needed, and get pretty stellar results.

Roofs detected across an urban scene.
With RasterFlow, we can create imagery mosaics and predict roofs simply by prompting SAM 3 with short noun phrases like “roofs”.

To put SAM 3 to the test on Earth observation imagery, we generated many more SAM 3 predictions on top of National Agriculture Imagery Program (NAIP) 30 centimeter imagery using RasterFlow, our scalable mosaic building and inference engine. The table below details RasterFlow’s performance on this high resolution detection task.

RasterFlow Task              Billed Runtime   Output                        Total Output Size        Spatial Scale of Output
build_gti_mosaic             11m 47s          NAIP Zarr mosaic              133.08 GB (123.94 GiB)   ~114.15 km × 66.81 km
predict_mosaic_geometries    39m 51s          GeoParquet geometry results   263.85 MB (251.63 MiB)   ~111.02 km × 66.81 km

How to Run SAM 3 on Aerial and Satellite Imagery with RasterFlow

What used to take a complex mix of imagery ETL, bespoke inference pipelines, and self-provisioned infrastructure for large Earth observation processing now takes under an hour with two simple Python functions. Let’s check it out below.

Build an Aerial Imagery Mosaic from NAIP

First, we will generate a mosaic: a seamless, stitched-together image from many independent remote sensing scenes. RasterFlow handles all the data sourcing, loading, cleaning, and partitioning into a data asset optimized for inference with a particular model.

from rasterflow_remote import RasterflowClient

rf_client = RasterflowClient()
mosaic_output = rf_client.build_gti_mosaic(
    # Tile index of candidate NAIP scenes and the area of interest to mosaic
    gti="s3://wherobots-examples/rasterflow/indexes/naip_index.parquet",
    aoi="s3://wherobots-examples/rasterflow/aois/marion_county.parquet",
    bands=["red", "green", "blue", "nir"],
    location_field="url",  # column in the index that points at each scene
    crs_epsg=3857,  # Web Mercator output grid
    time_column="year",
    skip_xy_coords=False,
    xy_chunksize=1024,
    # Keep only 30 cm scenes captured during 2022
    query="res == 0.3 and time >= '2022-01-01' and time <= '2023-01-01'",
    requester_pays=True,  # NAIP on AWS lives in a requester-pays bucket
    sort_field="time",
)
print(mosaic_output.uri)
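
Before running inference, it can be worth sanity checking the mosaic. The output is a plain Zarr store, so standard tooling works. A minimal sketch, assuming xarray and s3fs are installed and your environment has credentials for the bucket:

import xarray as xr

# Lazily open the Zarr mosaic; variables mirror the `bands` argument above.
ds = xr.open_zarr(mosaic_output.uri)
print(ds)  # inspect dimensions, chunks, and coordinate ranges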

Run the SAM 3 Inference on the Mosaic

With our mosaic built, we can now run inference with SAM 3. Unlike more rigid models that only predict one category, SAM 3 can accept one or more text prompts and detect all matching objects in a single pass. The runtime for a given batch of imagery scales linearly with the number of detections, and more prompts tend to mean more detections, so expect inference runs to take longer as you add prompts.

from rasterflow_remote.data_models import GeometryActorEnum, MergeModeEnum

model_output = rf_client.predict_mosaic_geometries(
    store="s3://wherobots-examples/rasterflow/mosaics/marion_county.zarr",
    model_path="https://huggingface.co/wherobots/sam3-text-geometry-pt2/resolve/main/full_sam3_pipeline.pt2",
    patch_size=1008,  # pixels per inference tile
    clip_size=0,
    device="cuda",
    features=["red", "green", "blue"],  # SAM 3 consumes RGB input
    # A single pass detects every prompted concept below
    labels=["roads", "airplanes", "airports", "roofs", "solar panels", "swimming pools", "shipping containers", "tractors"],
    actor=GeometryActorEnum.TEXT_TO_VECTOR_GEOMETRIES,
    max_batch_size=1,
    confidence_threshold=0.5,  # keep instances above 50% confidence
    merge_mode=MergeModeEnum.NONE,
    xy_block_multiplier=1,
)
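
The call returns a handle to the GeoParquet results. As a small, hedged sketch, we assume here that the handle exposes a uri attribute the way build_gti_mosaic’s output does; check your RasterFlow version for the exact field:

# Assumption: mirrors mosaic_output.uri above; not confirmed for prediction outputs.
print(model_output.uri)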

Analyze the SAM 3 Detections with WherobotsDB

We can visualize both the mosaic and detections directly in the notebook with an embedded RasterFlow map. The RGB imagery mosaic and the SAM 3 detections have been web-optimized with RasterFlow for fast and fluid browsing.

After visually exploring detections, we can load the results into WherobotsDB for quantitative analysis. WherobotsDB lets us post-process geometries, calculate zonal statistics on other rasters, count objects, measure clustering, and much more.

import os
from sedona.spark import *
from pyspark.sql.functions import *
from wherobots import vtiles

# Start a Sedona-enabled Spark session
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

parquet_path = "s3://wherobots-examples/rasterflow/model-outputs/marion_county_sam3/"
df = sedona.read.format("geoparquet").load(parquet_path)

df.printSchema()
df.show(10)

Here we’ll use ST_AreaSpheroid to estimate the total area, in square kilometers, of all roofs in Marion County.

# Keep only the "roofs" detections
roofs_df = df.filter(col("label") == "roofs")

# Inspect the SRID on the geometries so we can transform to lon/lat
source_srid = roofs_df.selectExpr("ST_SRID(geometry) AS srid").first()["srid"]
source_crs = f"EPSG:{source_srid}"
print(source_crs)

# Run the SRID inspection cell above first so source_crs reflects the data's actual CRS.
roof_areas = roofs_df.withColumn(
    "area_sq_m",
    expr(f"ST_AreaSpheroid(ST_Transform(geometry, '{source_crs}', 'EPSG:4326'))"),
).cache()

# Total roof area in square meters and square kilometers
roof_areas_agg = roof_areas.agg(
    sum("area_sq_m").alias("total_area_sq_m"),
    (sum("area_sq_m") / lit(1_000_000.0)).alias("total_area_sq_km"),
)

roof_areas_agg.show(truncate=False)

# Per-roof size statistics
roof_size_stats = roof_areas.agg(
    count("*").alias("roof_count"),
    avg("area_sq_m").alias("avg_roof_sq_m"),
    expr("percentile_approx(area_sq_m, 0.5)").alias("median_roof_sq_m"),
    (avg("area_sq_m") * lit(10.7639)).alias("avg_roof_sq_ft"),  # 1 sq m ≈ 10.7639 sq ft
)

stats = roof_size_stats.first()

print(f"roof_count: {stats['roof_count']:,}")
print(f"avg_roof_sq_m: {stats['avg_roof_sq_m']:.2f}")
print(f"median_roof_sq_m: {stats['median_roof_sq_m']:.2f}")

# Detection counts per prompted label
label_counts = df.groupBy("label").count().orderBy(col("count").desc(), col("label"))
label_counts.show(truncate=False)
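
We mentioned measuring clustering above; since the detections are ordinary geometries, a quick density view is straightforward. A minimal sketch, assuming Sedona’s ST_GeoHash function is available in your runtime (precision 6 geohash cells are roughly 1.2 km × 0.6 km):

# Rough clustering view: count roof detections per geohash cell.
roof_density = (
    roofs_df
    .withColumn(
        "cell",
        expr(f"ST_GeoHash(ST_Transform(geometry, '{source_crs}', 'EPSG:4326'), 6)"),
    )
    .groupBy("cell")
    .count()
    .orderBy(col("count").desc())
)
roof_density.show(10, truncate=False)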

We can also generate web-optimized vectors in PMTiles format. PMTiles are easily shareable and plug directly into geospatial visualization applications, making them a great choice for distributing detection results.

# Expose each detection label as its own vector tile layer
df_tiles = df.withColumn("layer", col("label"))

output_path = "s3://wherobots-examples/rasterflow/model-outputs/marion_county_sam3.pmtiles"
vtiles.generate_pmtiles(df_tiles, output_path)

SAM 3 Accuracy on Earth Observation: Where It Succeeds and Where It Falls Short

After experimenting with SAM 3 on NAIP, I’m impressed with the range of categories it can positively identify. At the same time, the model still has clear precision and recall gaps for more niche semantic categories, like “tractors” or “shipping containers”.

SAM 3 detecting tractors in a field.
SAM 3 correctly detects tractors in a harvested field.
SAM 3 detecting tractors at a dealer lot.
In an agricultural context, SAM 3 finds some tractors but misses others.
SAM 3 detecting shipping containers in a storage yard.
SAM 3 correctly identifies shipping containers in a storage yard.
SAM 3 false positive shipping container detections on rectangular structures.
False positives: SAM 3 labels rectangular roofs as “shipping containers”.

These examples illustrate the gap between SAM 3’s strengths on common categories and its current limitations on more niche ones. I’m bullish that if SAM 3 were fine-tuned on a larger corpus of mixed high-resolution imagery and high-quality labels, it would perform even better outside of its primary domain of natural imagery.

Even so, I think SAM 3’s potential for simpler categories like “roofs” is underutilized. SAM 3 could potentially improve many existing datasets we rely on to make decisions, like Overture Buildings, and extend them with detections of other kinds of structures.

What to Try Next with SAM 3 on Wherobots

Some things we didn’t showcase but I recommend trying in your experiments (a starter sketch follows the list):

  • Can you find and count vehicles that are a certain color?
  • Can you use SAM 3 to segment city streets? What about rural roads?
  • Try to map forest canopy! SAM 3 can do a pretty great job segmenting concepts that don’t map cleanly to individual objects but are more textural.
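
All of these are just different labels passed to the same call we used above. A minimal sketch, assuming the same client and mosaic from earlier in this post; the prompt strings here are illustrative, not tested:

# Hypothetical follow-up run: same API as above, different concept prompts.
experiment_output = rf_client.predict_mosaic_geometries(
    store="s3://wherobots-examples/rasterflow/mosaics/marion_county.zarr",
    model_path="https://huggingface.co/wherobots/sam3-text-geometry-pt2/resolve/main/full_sam3_pipeline.pt2",
    patch_size=1008,
    clip_size=0,
    device="cuda",
    features=["red", "green", "blue"],
    # Untested example prompts: colored vehicles, road types, and textural concepts
    labels=["red cars", "city streets", "rural roads", "forest canopy"],
    actor=GeometryActorEnum.TEXT_TO_VECTOR_GEOMETRIES,
    max_batch_size=1,
    confidence_threshold=0.5,
    merge_mode=MergeModeEnum.NONE,
    xy_block_multiplier=1,
)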

If you’re excited to try SAM 3 on Wherobots, sign up for RasterFlow Private Preview. You can also get in touch at ryan@wherobots.com or talk to us here. I’d love to hear about what you’re looking to build and how SAM 3 could fit into your detection workflows.