
The Medallion Architecture for Geospatial Data: Why Spatial Intelligence Demands a Different Approach


When most data engineers hear “medallion architecture,” they think of the traditional multi-hop layering pattern that powers countless analytics pipelines. The concept is sound: progressively refine raw data into analysis-ready data products. But geospatial data breaks conventional data engineering in ways that demand we rethink the entire pipeline.

This isn’t just about storing location data in your existing medallion setup. It’s about recognizing that spatial data introduces complexity in data structures, compute patterns, and efficiency requirements that traditional architectures simply cannot handle without significant compromise. The medallion (or multi-hop) architecture, when properly adapted for geospatial workloads, becomes something more powerful: a systematic approach to managing spatial intelligence at scale.

In this post, I will outline how we teach this in the Wherobots Geospatial Data Engineering Associate course: taking this base design and grouping different spatial tasks, query patterns, and outputs into each step, leveraging both raster and vector datasets.

Why Geospatial Data Demands Architectural Rethinking

Before diving into the medallion architecture itself, let’s acknowledge what makes geospatial data different.

The Fragmentation Problem

In most organizations, geospatial data doesn’t arrive nicely packaged. You might pull property boundaries from a municipal data portal in shapefiles, elevation data from AWS open data as cloud-optimized GeoTIFFs, satellite imagery from NASA, and local POI data from your internal systems.

Each source has different formats, coordinate systems, refresh rates, and quality levels.

Traditional data lakes aren’t equipped to handle this heterogeneity efficiently. Moving terabytes of raw geospatial data through your pipeline adds cost and latency.

And then there’s the format problem: legacy formats like shapefiles and JPEG2000 can’t be partially queried. If you want Seattle property data from global shapefiles, the traditional approach is to download the entire file, unzip it, filter locally, then transfer what you need.

The Scale-Complexity Tradeoff

Geospatial scale isn’t just about row count. It’s about the combination of volume, global distribution, mixed data types (vectors, rasters, array-based formats, point clouds), conversion / interpolation between those data types, and the inherent cost of spatial operations. A simple point-in-polygon operation across a million properties and thousands of geographical features involves computation that traditional systems struggle with.
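
To make that cost concrete, here is what a single point-in-polygon test involves, sketched as the classic ray-casting algorithm in plain Python. This is purely illustrative (engines like Apache Sedona run indexed, distributed versions of this check); multiply it by every vertex of every polygon and every one of the million points to see why naive approaches stall:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: cast a ray to the right of (x, y) and count edge crossings.
    An odd number of crossings means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:  # crossing lies to the right of the point
                inside = not inside
    return inside

# A unit square as a ring of (x, y) vertices
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

Every test walks every edge of the polygon, which is why spatial indexing (pruning candidate pairs before the exact test) matters so much at scale.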

This is why you see organizations either underutilizing their spatial data or building custom solutions that work for one specific problem but don’t generalize. The efficiency gap is real, and it’s expensive.

The Medallion Architecture: Bronze, Silver, Gold

The medallion architecture solves these challenges by enforcing structure while maintaining flexibility. Here’s how it works for geospatial data:

Bronze Layer: Ingestion and Preservation

The bronze layer is deceptively simple: get raw geospatial data into your system without transformation.

For Data Engineers: This layer is your staging area. You’re ingesting data from disparate sources (APIs, files, databases, cloud storage) without enforcing business logic. The goal is to preserve raw data fidelity while standardizing the container.

For geospatial data specifically, this means several important decisions:

  • Leave data at source when possible. If you have a global elevation dataset sitting on AWS S3, don’t download it into your local systems. Create a remote reference to it. This is one of the biggest efficiency gains in modern geospatial data pipelines.
  • Convert to cloud-native formats. As data arrives in your bronze layer, immediately convert legacy formats (shapefiles, JPEG2000, uncompressed GeoTIFFs) into cloud-native equivalents (GeoParquet, cloud-optimized GeoTIFFs). This isn’t optimization but a prerequisite for efficient querying downstream.
  • Preserve lineage and versioning. Geospatial data often changes: property boundaries get redrawn, satellite imagery updates, elevation models improve. You need to track when data arrived, from where, and what version you’re working with.
  • Raster tiling and storage optimization. For raster data (imagery, DEMs), bronze handles tiling and creates remote references. Instead of loading massive rasters into memory, you tile them and access only what you need.
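
To illustrate the tiling idea in the last bullet, the standard Web Mercator (slippy-map) scheme maps a coordinate to a tile address, so a reader can see how a remote reference fetches only one small tile of a global raster. This is a toy sketch of the public tiling convention, not the Wherobots tiling implementation:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Map a WGS 84 lon/lat to its Web Mercator (slippy-map) tile address,
    so only that tile of a large raster needs to be fetched."""
    n = 2 ** zoom  # the world is n x n tiles at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y
```

At zoom 10 the globe is already over a million tiles, so addressing the one you need instead of downloading the raster is where the I/O savings come from.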

For GIS Professionals: Think of bronze as your data repository. In traditional GIS workflows, you manage multiple datasets, each in its own format, living in different folders or databases. Bronze centralizes this but respects the raw nature of the data. You’re not making decisions about coordinate systems, geometry validation, or spatial relationships yet.

This is also where automation becomes critical. Instead of manually downloading datasets monthly, bronze ingestion jobs run on schedules, automatically pulling new data and versioning it. This means your analyses always reflect current reality.

Open table formats such as Apache Iceberg and Delta Lake provide several advantages here, since you can easily append records that have changed to the table. They also build on cloud-native file formats such as GeoParquet (now reflected in the new Iceberg V3 specification), and within Wherobots you can read out-of-database rasters (or a reference to a specific part of an image) without moving the source raster file, saving massive amounts of I/O.

How Wherobots Handles the Bronze Layer

Wherobots Cloud simplifies bronze layer operations through WherobotsDB, its cloud-native spatial processing engine. When ingesting raw geospatial data, Wherobots immediately enables critical capabilities for you:

  • Format conversion at ingestion: Wherobots automatically converts legacy spatial formats (shapefiles, JPEG2000, standard GeoTIFFs) into cloud-native formats during ingestion. This happens transparently: you connect to data in whatever format it arrives in, and it can be optimized for cloud storage and querying.
  • Remote reference handling: For massive open datasets, Wherobots manages remote references natively. Global elevation datasets sitting on AWS S3 don’t get duplicated into your systems. Instead, Wherobots creates efficient references to the original data, allowing you to query it as if it were local while paying only for what you access.
  • Apache Iceberg storage with spatial extensions: Data in bronze is stored using Havasu, Wherobots’ Apache Iceberg-based spatial table format – which is the foundation of the Iceberg v3 spec. This provides built-in versioning, time travel, and ACID transactions from day one. You can rewind to yesterday’s data for audits or reprocessing without complex manual versioning schemes.
  • Automatic metadata tracking: Wherobots tracks lineage, data arrival times, and schema information automatically, eliminating the need for manual data cataloging in bronze.
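
The snapshot idea behind time travel can be sketched in a few lines of plain Python. This toy `SnapshotTable` is illustrative only (Iceberg tracks snapshots as metadata over immutable files rather than copying rows), but the queryable-history behavior is the same:

```python
class SnapshotTable:
    """Toy illustration of snapshot-based versioning: every commit produces a
    new immutable snapshot, so earlier states stay queryable (time travel)."""

    def __init__(self):
        self._snapshots = [[]]  # snapshot 0: the empty table

    def append(self, rows):
        """Commit new rows; returns the id of the snapshot just created."""
        self._snapshots.append(self._snapshots[-1] + list(rows))
        return len(self._snapshots) - 1

    def scan(self, snapshot_id=None):
        """Read the latest snapshot, or any historical one by id."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return self._snapshots[snapshot_id]
```

Rewinding to yesterday’s data is then just a `scan` with an older snapshot id, with no manual versioning scheme on your side.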

The key to Wherobots’ approach to bronze is that it enforces good practices without requiring you to build them yourself. You’re getting schema evolution, ACID compliance, and spatial optimization automatically.

Silver Layer: Enrichment and Standardization

Silver is where the intelligence begins to emerge. This is where you clean, enrich, and structure data into a consistent, analysis-ready state. It’s also often where data science teams work before “publishing” analysis-ready data to end users (in gold tables).

For Data Engineers: Silver involves:

  • Coordinate system harmonization. Raw geospatial data arrives in dozens of different projections. Silver transforms everything into standard coordinate systems (typically WGS 84 for global work, or local UTM zones for regional analysis).
  • Geometry validation and repair. Geospatial data is messy. Overlapping polygons, self-intersecting lines, and invalid geometries from legacy systems all appear during silver transformation. You validate geometries and fix what’s repairable.
  • Spatial enrichment operations. This is where you add geospatial relationships. Perform spatial joins to associate properties with neighborhoods. Buffer roads to identify nearby areas. Create 3D geometries by joining elevation and additional data attributes to 2D features. Calculate spatial statistics like area, perimeter, or distance to nearest feature.
  • Advanced spatial operations. We often think of spatial relationships as just a spatial join, but silver is also where you perform analyses such as K-nearest-neighbor joins, zonal statistics (a raster-to-vector join), spatial aggregation, distance-within queries (i.e. features within N distance), area-weighted interpolation (such as weighted population statistics), shortest-path analysis, line-of-sight analysis, space-time proximity analysis (i.e. near-miss detection), and more. These require not only optimized spatial data but the right functions to make them work.
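
As one example of these operations, a K-nearest-neighbor join can be sketched as a brute-force pass in plain Python using haversine distance. Real engines replace the full sort with spatial indexing, and all names here are hypothetical, but the semantics are the same:

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two WGS 84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def knn_join(left, right, k):
    """For each left feature, find its k nearest right features (brute force)."""
    return {
        name: sorted(right, key=lambda r: haversine_km(lon, lat, *right[r]))[:k]
        for name, (lon, lat) in left.items()
    }
```

Joining a property to its nearest transit stop, for instance, is `knn_join(properties, stops, 1)`; the engine’s job is to make that feasible for millions of features instead of a handful.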

The key architectural principle in silver is that you’re not making final analytical decisions; you’re preparing the data so downstream consumers can make those decisions efficiently.

For GIS Professionals: Silver is where your spatial data processing happens. This is the layer where you answer structural questions:

  • What coordinate system makes sense for this analysis region?
  • Are there geometry issues I need to resolve?
  • What spatial relationships exist between datasets that I’ll need downstream?
  • Should I enrich my vector data with elevation? Satellite indices? Population density?

In traditional GIS, this work happens ad-hoc before each analysis. In the medallion architecture, you do it once, well, and capture the work in reusable transformations. This is efficiency multiplied across teams and projects.

Another key note: you will likely want different scales of compute for these processes. An easier process, like a spatial join on basic vector geometries, may only require a smaller compute instance, whereas a large-scale NDVI calculation (or other vegetation index) would require a larger one. The ability to mix and match compute scales is a critical advantage in a spatial platform, not only for speed but for cost optimization.

How Wherobots Handles the Silver Layer

The silver layer is where WherobotsDB spatial optimization truly shines. Wherobots provides native spatial operations that make complex transformations not just possible but efficient at scale:

  • Coordinate system transformations at speed: Wherobots includes optimized functions for reprojecting geometries across coordinate systems. What might take hours in desktop GIS or loose Python scripts happens in seconds on Wherobots’ distributed architecture. This is critical because spatial efficiency compounds: faster transformations mean you can afford to enrich data more completely.
  • Geometry validation and repair: Wherobots includes native geometry validation and repair functions. You can identify invalid geometries, split self-intersecting polygons, and fix common data quality issues in SQL. This work is parallelized across the cluster, not constrained by a single machine’s memory.
  • Spatial enrichment operations: Silver transformations in Wherobots use Spatial SQL with 300+ functions for vector and raster operations. Spatial joins that would take hours in traditional systems execute in minutes. Buffer operations, intersection calculations, proximity analysis—all happen efficiently on distributed data. Wherobots handles both vector and raster data natively in the same query, meaning you can enrich vector property data with raster elevation or satellite imagery in a single operation.
  • Raster tiling and remote storage: For raster data, Wherobots can tile and optimize imagery while keeping it remote on S3. You’re not loading massive datasets into memory. Instead, Wherobots’ raster functions work against remote tiles, accessing only what’s needed for each query.
  • Reusable SQL transformations: Silver transformations are written in Spatial SQL and stored as views or jobs. These are version-controlled, reproducible, and can be re-run as new data arrives in bronze. Unlike one-off Python scripts, silver logic in Wherobots is production-grade from day one.
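
For intuition on what a reprojection function actually computes, here is the spherical WGS 84 to Web Mercator (EPSG:3857) forward projection for a single point, in plain Python. This is the textbook formula only, not the Wherobots implementation, which handles full geometries and many more coordinate systems:

```python
import math

R = 6378137.0  # Web Mercator sphere radius in metres

def wgs84_to_web_mercator(lon, lat):
    """Project a WGS 84 (EPSG:4326) lon/lat to Web Mercator (EPSG:3857) metres."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y
```

A distributed engine applies this kind of per-vertex arithmetic to billions of vertices in parallel, which is why reprojection parallelizes so well.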

The combination of Spatial SQL and WherobotsDB’s distributed architecture means silver layer work that once required deep geospatial expertise and careful optimization is now accessible to data engineers who know SQL. Wherobots handles the spatial complexity.

Gold Layer: Analytical Readiness and Delivery

Gold is your analytics-ready product layer. Data here is fully prepared, enriched, and available for immediate use in downstream applications and BI or GIS systems.

For Data Engineers: Gold contains:

  • Advanced aggregates and spatial statistics. Point-in-polygon counts (how many properties in each neighborhood), zonal statistics (average elevation by region), and spatial clustering (identifying hotspots), updated on a regular basis and ready to join by identifier to non-spatial data.
  • Optimized indices and pre-computed relationships. Store commonly-needed spatial joins and queries pre-computed, reducing downstream query time.
  • Tiled and multi-format outputs. Generate PMTiles for web visualization, GeoParquet for analytics, and optionally push to PostGIS or SedonaDB/DuckDB for application serving.
  • Quality-controlled data. Remove erroneous geometries that made it through silver, apply final business logic, and ensure consistency.
  • AI-ready data. Create language-based data that LLMs and agentic applications can consume, removing heavy geometries that can confuse a model.
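
The first bullet, pre-aggregated gold tables, boils down to a group-by over enriched silver rows. A minimal sketch with hypothetical field names:

```python
from collections import defaultdict
from statistics import median

def median_price_by_neighborhood(rows):
    """Collapse enriched silver rows into a gold-style summary table."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["neighborhood"]].append(row["price"])
    return {name: median(prices) for name, prices in groups.items()}

sales = [  # hypothetical silver output: one row per enriched sale
    {"neighborhood": "Ballard", "price": 800_000},
    {"neighborhood": "Ballard", "price": 900_000},
    {"neighborhood": "Fremont", "price": 750_000},
]
```

The point is that the expensive spatial work (assigning each sale to a neighborhood) already happened in silver; gold consumers only pay for a cheap keyed aggregation.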

Gold is where you can afford to be opinionated because the work upstream ensures those opinions are well-informed.

For GIS Professionals: Gold is your analysis starting point. Instead of wrestling with raw data, you access gold tables that are:

  • Geometrically valid and properly projected
  • Enriched with relevant spatial context (elevation, proximity to features, administrative boundaries)
  • Pre-aggregated for common questions
  • Formatted for direct use in your tools and custom applications (QGIS, Felt, BI dashboards, machine learning pipelines)

You go from “I need to combine three datasets and fix projection issues” to “Here’s the enriched regional dataset, ready for analysis.”

How Wherobots Handles the Gold Layer

Gold layer work in Wherobots focuses on making spatial data immediately useful across your entire organization:

  • Pre-computed aggregates at scale: Wherobots can compute spatial aggregates—point-in-polygon counts, zonal statistics, spatial clustering—and store them efficiently. These pre-computed metrics are small enough to serve directly to applications but powerful enough to support complex analyses. A query that aggregates a billion point observations into regional summaries completes in seconds.
  • Multiple output formats: Gold data can be materialized in multiple forms depending on consumer needs. Export as GeoParquet for analytics, create PMTiles for web visualization, push to PostGIS for application serving, or generate standard Parquet for BI tools. Wherobots’ native format support means you write once and consume many ways.
  • Web tile generation at scale: Wherobots includes a scalable vector tile (VTiles) generator that’s optimized for producing map tiles from massive datasets. Instead of spending weeks generating tiles for a global dataset, Wherobots produces them in hours. These tiles feed directly into web applications, Felt, QGIS, or any tile-consuming tool.
  • RasterFlow integration: Gold layer work in Wherobots increasingly includes AI-powered enrichment. RasterFlow raster inference allows you to extract insights from satellite imagery at scale—identifying buildings, roads, vegetation patterns—and embed those insights directly into your gold tables. Machine learning models that would require custom integration work in other systems are built into Wherobots’ gold workflow.
  • Spatial SQL API for consumption: Instead of exporting and distributing files, gold data is served through Wherobots’ Spatial SQL API. Applications, dashboards, and downstream systems query gold data directly through Python SDK, Java JDBC driver, or standard SQL endpoints. This means gold data stays fresh and you don’t distribute copies.

Gold in Wherobots is less about storing static snapshots and more about providing a live, queryable product layer that adapts to consumer needs.

Why This Architecture Solves Geospatial Problems

Efficiency Through Format Optimization

Traditional systems treat geospatial data like regular tabular data. This is inefficient. A shapefile is a collection of files in a zip that requires downloading the entire dataset to access a subset. GeoParquet, by contrast, stores data in columnar format with spatial indexing built-in. You can query it over the internet by bounding box or geometry filter and only transfer what you need.

The medallion architecture enforces this progression from legacy to cloud-native formats, reducing storage costs and query latency dramatically. A terabyte global elevation dataset stays on AWS S3 as a reference. You don’t duplicate it.

Separation of Concerns

Bronze, silver, and gold are separate storage layers with separate lineage. This means:

  • Raw data safety. Your source data is never modified. If a transformation downstream goes wrong, you can reprocess from bronze without losing the original.
  • Independent evolution. Teams can improve silver transformations without affecting gold. Applications consuming gold don’t care about silver changes as long as contracts remain stable.
  • Governance simplicity. Access controls are straightforward. Grant different teams access to different layers based on their role.

Automation and Scalability

By encoding transformations into standardized layers, you enable automation. Data ingestion becomes scheduled jobs. Updates propagate automatically from bronze through silver to gold. You stop manually managing data and start managing logic.

This scales to global datasets because you’re not moving data—you’re moving queries. Spatial predicates push down to where the data lives, reducing network transfer and computation.

Real-World Application: Housing Analytics at Scale

Consider a concrete example: analyzing housing market patterns across Seattle using a blend of property records, elevation, transportation networks, satellite imagery, and census data.

Bronze Layer:

  • Ingest property sales records from the county (updated monthly)
  • Pull property boundaries from municipal GIS (shapefile, converted to GeoParquet)
  • Reference global DEM (stays on AWS S3)
  • Ingest road network (OpenStreetMap)
  • Store satellite imagery references (Copernicus, AWS Open Data)

Silver Layer:

  • Transform property boundaries to WGS 84 and validate geometries
  • Perform spatial join: properties to neighborhoods
  • Enrich with elevation by querying DEM at property centroids
  • Calculate proximity metrics: distance to nearest transit, nearest park
  • Tile satellite imagery for efficient access

Gold Layer:

  • Aggregate: median price by neighborhood, price trends by elevation
  • Pre-compute spatial clusters: identify hot markets
  • Generate PMTiles for web visualization
  • Export clean GeoParquet for ML pipelines
  • Push refined data to PostGIS for application serving
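
Stripped to its skeleton, the whole progression is three composable stages, each consuming the previous layer’s output. A toy sketch with hypothetical field names and a stand-in enrichment function:

```python
def bronze_ingest(raw_records):
    """Bronze: standardize the container, preserve every field untouched."""
    return [dict(r, _layer="bronze") for r in raw_records]

def silver_enrich(bronze_rows, neighborhood_of):
    """Silver: drop invalid rows and enrich with a spatial attribute."""
    return [
        dict(r, neighborhood=neighborhood_of(r["parcel"]), _layer="silver")
        for r in bronze_rows
        if r.get("price", 0) > 0  # stand-in for geometry validation
    ]

def gold_summarize(silver_rows):
    """Gold: aggregate enriched rows into an analysis-ready product."""
    totals = {}
    for r in silver_rows:
        totals.setdefault(r["neighborhood"], []).append(r["price"])
    return {n: sum(p) / len(p) for n, p in totals.items()}
```

In practice each stage is a scheduled job writing to its own tables rather than a function call, but the contract is the same: gold only ever reads silver, and silver only ever reads bronze.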

This progression from raw, fragmented sources to a unified, enriched analytical product is exactly what the medallion architecture enables. And the efficiency gains compound as more analyses build on the same gold layer.

Comparison to Other Approaches

Why Not Keep Everything in PostGIS?

PostGIS is powerful for transactional spatial queries and operations, but it’s a database optimized for consistency and ACID transactions, not analytics at scale. As data volume grows, PostGIS becomes expensive to operate and scale. You’re paying for transactional guarantees you don’t need for analytics.

The medallion approach uses cloud storage (S3) as the primary store, which scales cheaply, and can use PostGIS only for the final gold layer serving to applications that need it. This is typically more cost-effective and performant.

Why Not Use a Specialized GIS Data Warehouse?

Geospatial-specific data warehouses exist but are typically proprietary, expensive, and inflexible. They don’t integrate easily with your existing data infrastructure or machine learning pipelines. The medallion architecture is platform-agnostic—it works with Spark, Flink, DuckDB, or any distributed compute framework that understands spatial operations.

Why Not Just Use Desktop GIS with Cloud Storage?

Desktop GIS tools (QGIS, ArcGIS) can read cloud storage but aren’t designed for production pipelines. They require manual steps, don’t automate updates, and don’t scale to the volume and frequency of modern geospatial data. The medallion architecture automates what desktop GIS does manually.

Implementation Considerations

Technology Stack

The medallion architecture for geospatial data typically uses:

  • Storage: Cloud object storage (AWS S3, Google Cloud Storage, Azure Blob)
  • Table Format: Apache Iceberg with spatial extensions (Havasu), which provides schema evolution, ACID transactions, time travel, and spatial indexing
  • Compute: Apache Sedona or Wherobots (distributed geospatial frameworks), or SedonaDB (single-node spatial SQL)
  • Orchestration: Workflow tools (Airflow) to manage bronze→silver→gold pipelines
  • Visualization: Web tiles (PMTiles), GIS tools (QGIS, Felt), or BI platforms

Iceberg: The Foundation

Apache Iceberg is the table format that makes this all work efficiently. It’s a metadata layer over Parquet files that provides:

  • Schema evolution: Add columns without breaking downstream queries
  • ACID transactions: Reliable concurrent updates
  • Time travel: Query historical snapshots (rewind to yesterday’s data)
  • Partitioning: Automatic data organization for efficient queries
  • Spatial indexing: Efficient spatial predicates and pushdown optimization

With Iceberg’s spatial extensions, you get native geometry storage and spatial optimizations, making it the ideal foundation for medallion pipelines. And with many upstream data warehouses now accepting Iceberg V3, you get a zero-ETL path from your gold tables into all of those systems.
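
The payoff of spatial indexing is that a bounding-box predicate only touches candidate data. A toy uniform-grid index in plain Python shows the idea; production table formats use richer structures (file-level statistics, space-filling-curve partitioning), but the pruning principle is the same:

```python
from collections import defaultdict

class GridIndex:
    """Toy uniform-grid index: bucket points by cell so a bounding-box query
    scans only candidate cells instead of the whole dataset."""

    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y, payload):
        self.buckets[self._key(x, y)].append((x, y, payload))

    def query(self, xmin, ymin, xmax, ymax):
        """Visit only cells the box overlaps, then filter exactly."""
        hits = []
        for cx in range(int(xmin // self.cell), int(xmax // self.cell) + 1):
            for cy in range(int(ymin // self.cell), int(ymax // self.cell) + 1):
                for x, y, payload in self.buckets.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(payload)
        return hits
```

Pushing a spatial predicate “down” means running this kind of coarse cell test against file metadata before any data is read, so a Seattle query never touches files that only cover other continents.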

Data Lineage and Governance

The medallion architecture inherently supports data governance. Each layer has clear ownership and lineage. Track which source datasets feed which analyses. Implement role-based access controls at layer boundaries. Maintain data catalogs describing bronze sources, silver transformations, and gold products.

Putting this into practice with Wherobots

From Theory to Practice

The medallion architecture isn’t a theoretical concept—it’s proven across thousands of data engineering organizations. What makes it powerful for geospatial is that it acknowledges geospatial’s unique challenges:

  • Format diversity: Enforced conversion to cloud-native formats
  • Computation intensity: Spatial predicates pushed down to efficient compute
  • Scale complexity: Remote references and tiling instead of data movement
  • Governance needs: Clear layer separation for access control and lineage
  • Integration requirements: Gold layer can feed into GIS tools, ML pipelines, or applications

For teams managing geospatial data, whether you’re data engineers building analytics platforms or GIS professionals upgrading legacy workflows, the medallion architecture provides a systematic, scalable path forward.

The alternative (managing spatial data ad-hoc) becomes increasingly untenable as volume, velocity, and complexity grow. The medallion architecture makes it possible to move at scale without moving data.


Key Takeaways

  • Bronze is raw: Get all your geospatial sources into cloud storage without transformation, leaving data at source where possible
  • Silver is structure: Standardize projections, validate geometries, enrich with spatial relationships, and convert to cloud-native formats
  • Gold is analysis-ready: Aggregate, optimize, and prepare data for consumption across tools and applications
  • Iceberg is the glue: Use Apache Iceberg’s spatial extensions as your table format to manage schema evolution, lineage, and efficient spatial operations
  • Efficiency is existential: Geospatial data demands careful format choices and architectural decisions—the medallion approach systematizes those choices

The geospatial industry is shifting from moving data to moving queries. The medallion architecture is how you make that shift sustainable.
