When most data engineers hear “medallion architecture,” they think of the traditional multi-hop layering pattern that powers countless analytics pipelines. The concept is sound: progressively refine raw data into analysis-ready data products. But geospatial data breaks conventional data engineering in ways that demand we rethink the entire pipeline.
This isn’t just about storing location data in your existing medallion setup. It’s about recognizing that spatial data introduces complexity in data structures, compute patterns, and efficiency requirements that traditional architectures simply cannot handle without significant compromise. The medallion (or multi-hop) architecture, when properly adapted for geospatial workloads, becomes something more powerful: a systematic approach to managing spatial intelligence at scale.
In this post, I’ll outline how we teach this in the Wherobots Geospatial Data Engineering Associate course: using the medallion design as a base and grouping different spatial tasks, query patterns, and outputs into each layer, across both raster and vector datasets.
Before diving into the medallion architecture itself, let’s acknowledge what makes geospatial data different.
The Fragmentation Problem
In most organizations, geospatial data doesn’t arrive nicely packaged. You might pull property boundaries from a municipal data portal in shapefiles, elevation data from AWS open data as cloud-optimized GeoTIFFs, satellite imagery from NASA, and local POI data from your internal systems.
Each source has different formats, coordinate systems, refresh rates, and quality levels.
Traditional data lakes aren’t equipped to handle this heterogeneity efficiently. Moving terabytes of raw geospatial data through your pipeline adds cost and latency.
And then there’s the format problem: legacy formats like shapefiles and JPEG2000 can’t be partially queried. If you want Seattle property data from a global shapefile, the traditional approach is to download the entire file, unzip it, filter locally, and then transfer what you need.
The Scale-Complexity Tradeoff
Geospatial scale isn’t just about row count. It’s about the combination of volume, global distribution, mixed data types (vectors, rasters, array-based formats, point clouds), conversion / interpolation between those data types, and the inherent cost of spatial operations. A simple point-in-polygon operation across a million properties and thousands of geographical features involves computation that traditional systems struggle with.
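To make that concrete, here’s a minimal sketch of that point-in-polygon workload using Apache Sedona’s Spatial SQL. The bucket paths and column names are hypothetical; in Wherobots Cloud a configured session is provided for you.

```python
from sedona.spark import SedonaContext

# Create a Spark session with Sedona's spatial functions registered
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Hypothetical inputs: ~1M property points and thousands of zone polygons
sedona.read.format("geoparquet").load("s3://bucket/properties/") \
    .createOrReplaceTempView("properties")
sedona.read.format("geoparquet").load("s3://bucket/zones/") \
    .createOrReplaceTempView("zones")

# Point-in-polygon: the engine plans this as a distributed spatial join,
# not a naive million-by-thousands comparison
matched = sedona.sql("""
    SELECT p.property_id, z.zone_id
    FROM properties p
    JOIN zones z ON ST_Contains(z.geometry, p.geometry)
""")
```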
This is why you see organizations either underutilizing their spatial data or building custom solutions that work for one specific problem but don’t generalize. The efficiency gap is real, and it’s expensive.
The medallion architecture solves these challenges by enforcing structure while maintaining flexibility. Here’s how it works for geospatial data:
The Bronze Layer
The bronze layer is deceptively simple: get raw geospatial data into your system without transformation.
For Data Engineers: This layer is your staging area. You’re ingesting data from disparate sources (APIs, files, databases, cloud storage) without enforcing business logic. The goal is to preserve raw data fidelity while standardizing the container.
For geospatial data specifically, this means several important decisions: keep source formats and coordinate systems exactly as delivered, capture metadata such as CRS and acquisition date alongside each record, and version every ingest so you can trace what arrived and when.
For GIS Professionals: Think of bronze as your data repository. In traditional GIS workflows, you manage multiple datasets, each in its own format, living in different folders or databases. Bronze centralizes this but respects the raw nature of the data. You’re not making decisions about coordinate systems, geometry validation, or spatial relationships yet.
This is also where automation becomes critical. Instead of manually downloading datasets monthly, bronze ingestion jobs run on schedules, automatically pulling new data and versioning it. This means your analyses always reflect current reality.
Open table formats such as Apache Iceberg and Delta Lake provide several advantages here: you can easily append or update records that have changed, and they sit on top of cloud-native file formats like GeoParquet (with native geometry support arriving in the Iceberg V3 specification). Within Wherobots, you can also read out-of-database rasters (a reference to a specific part of an image) without moving the source raster file, saving massive amounts of I/O.
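A minimal sketch of what bronze ingestion looks like in practice, assuming a Sedona session like the one above and a hypothetical Iceberg catalog:

```python
# Bronze ingest (paths and table names are hypothetical): read the source as
# delivered (no reprojection, no validation) and append it to a bronze
# Iceberg table so each scheduled run lands as a new, versioned snapshot.
raw = sedona.read.format("geoparquet").load("s3://vendor-drop/parcels/2025-06/")

# createOrReplace() on the first run; append() on subsequent scheduled runs
raw.writeTo("my_catalog.bronze.parcels").append()
```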
Wherobots Cloud simplifies bronze layer operations through WherobotsDB, its cloud-native spatial processing engine. When ingesting raw geospatial data, Wherobots immediately enables critical capabilities for you: schema evolution as sources change, ACID-compliant appends, spatial optimization of the underlying files, and out-of-database raster references.
The key to Wherobots’ approach to bronze is that it enforces good practices without requiring you to build them yourself. You’re getting schema evolution, ACID compliance, and spatial optimization automatically.
The Silver Layer
Silver is where the intelligence begins to emerge. This is where you clean, enrich, and structure data into a consistent, analysis-ready state. It’s also often where Data Science teams may want to work before “publishing” analysis-ready data to end users (in gold tables).
For Data Engineers: Silver involves validating and repairing geometries, standardizing coordinate reference systems, enriching records with spatial context, and structuring everything into consistent, analysis-ready tables.
The key architectural principle in silver is that you’re not making final analytical decisions; you’re preparing the data so downstream consumers can make those decisions efficiently.
For GIS Professionals: Silver is where your spatial data processing happens. This is the layer where you answer structural questions: Are the geometries valid? Is everything in a common coordinate system? How do the datasets relate to one another spatially?
In traditional GIS, this work happens ad-hoc before each analysis. In the medallion architecture, you do it once, well, and capture the work in reusable transformations. This is efficiency multiplied across teams and projects.
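A sketch of one such reusable silver transformation, with hypothetical table and column names (EPSG:32610 is just an example projected CRS for the Seattle area):

```python
# Silver: repair invalid geometries and standardize the CRS once, for everyone
sedona.table("my_catalog.bronze.parcels").createOrReplaceTempView("bronze_parcels")

clean = sedona.sql("""
    SELECT
        parcel_id,
        ST_MakeValid(geometry) AS geometry,    -- repair invalid shapes
        ST_Transform(ST_MakeValid(geometry),
                     'EPSG:4326', 'EPSG:32610') AS geometry_utm  -- meters, for distance work
    FROM bronze_parcels
    WHERE geometry IS NOT NULL
""")
clean.writeTo("my_catalog.silver.parcels").createOrReplace()
```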
Another key note is that you will likely want different scales of compute to handle these processes. Easier processes like spatial joins on basic vector geometries may only require a small compute instance, whereas a large-scale NDVI calculation (or other vegetation index) would require a larger one, as the raster sketch below illustrates. The ability to mix and match compute scales is a critical advantage in a spatial platform, not only for speed but for cost optimization.
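For contrast, here’s a sketch of the heavier raster side. The paths and band order are assumptions (band 0 red, band 3 near-infrared); this follows the map-algebra pattern Sedona documents for NDVI:

```python
# Raster workload: compute NDVI per scene. Size the cluster for this job,
# not for the lightweight vector joins.
imagery = sedona.read.format("binaryFile").load("s3://bucket/imagery/*.tif")
imagery.selectExpr("RS_FromGeoTiff(content) AS rast") \
    .createOrReplaceTempView("imagery")

ndvi = sedona.sql("""
    SELECT RS_MapAlgebra(rast, 'D',
        'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') AS ndvi
    FROM imagery
""")
```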
The silver layer is where WherobotsDB spatial optimization truly shines. Wherobots provides native spatial operations (distributed spatial joins, geometry validation and repair, coordinate transformation, raster map algebra) that make complex transformations not just possible but efficient at scale.
The combination of Spatial SQL and WherobotsDB’s distributed architecture means silver layer work that once required deep geospatial expertise and careful optimization is now accessible to data engineers who know SQL. Wherobots handles the spatial complexity.
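For example, a common silver enrichment, counting transit stops within walking distance of each parcel, is a few lines of Spatial SQL. Table names are hypothetical, and the 500-meter threshold assumes the projected geometries from the earlier sketch:

```python
enriched = sedona.sql("""
    SELECT p.parcel_id,
           COUNT(s.stop_id) AS stops_within_500m
    FROM silver_parcels p
    LEFT JOIN transit_stops s
      ON ST_DWithin(p.geometry_utm, s.geometry_utm, 500)  -- 500 meters
    GROUP BY p.parcel_id
""")
```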
The Gold Layer
Gold is your analytics-ready product layer. Data here is fully prepared, enriched, and available for immediate use in downstream applications and BI or GIS systems.
For Data Engineers: Gold contains aggregated and enriched tables, datasets pre-joined and shaped for specific consumers, and products ready for direct use by BI, GIS, and application teams.
Gold is where you can afford to be opinionated because the work upstream ensures those opinions are well-informed.
For GIS Professionals: Gold is your analysis starting point. Instead of wrestling with raw data, you access gold tables that are already combined, in consistent projections, enriched with the attributes you need, and ready for analysis.
You go from “I need to combine three datasets and fix projection issues” to “Here’s the enriched regional dataset, ready for analysis.”
Gold layer work in Wherobots focuses on making spatial data immediately useful across your entire organization: publishing enriched tables and views that BI dashboards, GIS clients, and downstream applications can query directly.
Gold in Wherobots is less about storing static snapshots and more about providing a live, queryable product layer that adapts to consumer needs.
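As a sketch, publishing a gold product can be as simple as a scheduled rewrite of an aggregated table, building on the hypothetical silver tables above:

```python
# Gold: one enriched, analysis-ready table per consumer need, rebuilt
# automatically whenever silver updates
gold = sedona.sql("""
    SELECT z.zone_id,
           first(z.geometry)        AS geometry,
           COUNT(p.parcel_id)       AS parcel_count,
           AVG(p.stops_within_500m) AS avg_transit_access
    FROM silver_parcels p
    JOIN zones z ON ST_Contains(z.geometry, p.geometry)
    GROUP BY z.zone_id
""")
gold.writeTo("my_catalog.gold.zone_summary").createOrReplace()
```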
Traditional systems treat geospatial data like regular tabular data. This is inefficient. A shapefile is a collection of files in a zip that requires downloading the entire dataset to access a subset. GeoParquet, by contrast, stores data in columnar format with spatial indexing built-in. You can query it over the internet by bounding box or geometry filter and only transfer what you need.
The medallion architecture enforces this progression from legacy to cloud-native formats, reducing storage costs and query latency dramatically. A terabyte global elevation dataset stays on AWS S3 as a reference. You don’t duplicate it.
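Here’s what that looks like in practice: a sketch of querying a hypothetical global GeoParquet dataset in place, where Sedona pushes the spatial filter down so only matching portions of the files are transferred:

```python
seattle_bbox = ("POLYGON((-122.46 47.48, -122.22 47.48, "
                "-122.22 47.73, -122.46 47.73, -122.46 47.48))")

# The filter is evaluated against GeoParquet's bounding-box metadata
# before geometry data is read, so most of the dataset is never touched
seattle = sedona.read.format("geoparquet") \
    .load("s3://open-data/global-buildings/") \
    .where(f"ST_Intersects(geometry, ST_GeomFromWKT('{seattle_bbox}'))")
```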
Bronze, silver, and gold are separate databases with separate lineage. This means each layer has clear ownership, can evolve independently, and carries traceable lineage back to its sources.
By encoding transformations into standardized layers, you enable automation. Data ingestion becomes scheduled jobs. Updates propagate automatically from bronze through silver to gold. You stop manually managing data and start managing logic.
This scales to global datasets because you’re not moving data—you’re moving queries. Spatial predicates push down to where the data lives, reducing network transfer and computation.
Consider a concrete example: analyzing housing market patterns across Seattle using a blend of property records, elevation, transportation networks, satellite imagery, and census data.
Bronze Layer: property records, elevation tiles, transportation networks, satellite imagery, and census data all land as-is, each versioned on ingest.
Silver Layer: geometries are validated, every dataset is standardized to a common coordinate system, and properties are enriched with elevation, transit access, and vegetation measures.
Gold Layer: a single enriched housing dataset, aggregated to the geographies and metrics analysts actually query.
This progression from raw, fragmented sources to a unified, enriched analytical product is exactly what the medallion architecture enables. And the efficiency gains compound as more analyses build on the same gold layer.
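A sketch of the final silver-to-gold step for this example, with all table and column names assumed:

```python
# Gold: one enriched housing product per census tract, combining the
# cleaned property, elevation, and vegetation layers from silver
housing = sedona.sql("""
    SELECT t.tract_id,
           percentile_approx(p.sale_price, 0.5) AS median_sale_price,
           AVG(p.elevation_m)                   AS avg_elevation_m,
           AVG(p.ndvi)                          AS avg_greenness
    FROM silver_properties p
    JOIN census_tracts t ON ST_Contains(t.geometry, p.geometry)
    GROUP BY t.tract_id
""")
housing.writeTo("my_catalog.gold.seattle_housing_by_tract").createOrReplace()
```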
PostGIS is powerful for transactional spatial queries and operations, but it’s a database optimized for consistency and ACID transactions, not analytics at scale. As data volume grows, PostGIS becomes expensive to operate and scale. You’re paying for transactional guarantees you don’t need for analytics.
The medallion approach uses cloud storage (S3) as the primary store, which scales cheaply, and can use PostGIS only for the final gold layer serving to applications that need it. This is typically more cost-effective and performant.
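A sketch of that final serving hop, with hypothetical connection details. Geometry travels as WKT over plain JDBC; convert it back with ST_GeomFromText once it lands in PostGIS:

```python
serve = sedona.table("my_catalog.gold.zone_summary") \
    .selectExpr("zone_id", "parcel_count", "avg_transit_access",
                "ST_AsText(geometry) AS geom_wkt")

serve.write.format("jdbc") \
    .option("url", "jdbc:postgresql://db-host:5432/gis") \
    .option("dbtable", "gold_zone_summary") \
    .option("user", "app_user") \
    .option("password", "...") \
    .mode("overwrite") \
    .save()
```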
Geospatial-specific data warehouses exist but are typically proprietary, expensive, and inflexible. They don’t integrate easily with your existing data infrastructure or machine learning pipelines. The medallion architecture is platform-agnostic—it works with Spark, Flink, DuckDB, or any distributed compute framework that understands spatial operations.
Desktop GIS tools (QGIS, ArcGIS) can read cloud storage but aren’t designed for production pipelines. They require manual steps, don’t automate updates, and don’t scale to the volume and frequency of modern geospatial data. The medallion architecture automates what desktop GIS does manually.
The medallion architecture for geospatial data typically uses cloud object storage (such as S3) for the data itself, an open table format like Apache Iceberg for transactions and versioning, cloud-native file formats like GeoParquet and cloud-optimized GeoTIFF, and a distributed spatial engine for compute.
Apache Iceberg is the table format that makes this all work efficiently. It’s a metadata layer over Parquet files that provides ACID transactions, schema evolution, hidden partitioning, and time travel across table snapshots.
With Iceberg’s spatial extensions, you get native geometry storage and spatial optimizations, making it the ideal foundation for medallion pipelines. And with many upstream data warehouses now accepting Iceberg V3, you get a zero-ETL path from your gold tables into all of those systems.
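For instance, time travel means you can re-run an analysis against the exact snapshot a report was built from. The table name is hypothetical, and the syntax assumes Spark 3.3+ with an Iceberg catalog:

```python
# Query the silver table as it existed on a specific date
as_of = sedona.sql("""
    SELECT COUNT(*) AS parcel_count
    FROM my_catalog.silver.parcels TIMESTAMP AS OF '2025-06-01 00:00:00'
""")
```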
The medallion architecture inherently supports data governance. Each layer has clear ownership and lineage. Track which source datasets feed which analyses. Implement role-based access controls at layer boundaries. Maintain data catalogs describing bronze sources, silver transformations, and gold products.
The medallion architecture isn’t a theoretical concept—it’s proven across thousands of data engineering organizations. What makes it powerful for geospatial is that it acknowledges geospatial’s unique challenges: fragmented sources and formats, legacy files that can’t be read partially, mixed vector and raster data types, and spatial operations whose cost grows faster than row counts.
For teams managing geospatial data, whether you’re data engineers building analytics platforms or GIS professionals upgrading legacy workflows, the medallion architecture provides a systematic, scalable path forward.
The alternative, managing spatial data ad hoc, becomes increasingly untenable as volume, velocity, and complexity grow. The medallion architecture makes it possible to move at scale without moving data.
The geospatial industry is shifting from moving data to moving queries. The medallion architecture is how you make that shift sustainable.