Geospatial solutions have long been treated as “special,” because the wave of innovation that modernized today’s data ecosystem largely left geospatial data behind. That changes today. Thanks to the efforts of the Apache Iceberg and Apache Parquet communities, we are excited to share that both Iceberg and Parquet now support geometry and geography (collectively, the GEO) data types.
Geospatial data has been disconnected from the broader data ecosystem that modernized around open file formats like Apache Parquet and open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. The benefits of these cloud-native open file and table formats fueled widespread adoption of data lake and lakehouse architectures. Organizations moved away from expensive proprietary systems, from data siloes that coupled compute with storage and didn’t scale, and from formats that locked them in and stifled innovation. Relative to legacy options, these cloud-native formats fundamentally change how data is stored, managed, and accessed, which lowers costs, increases agency, and unlocks innovation over time. But geospatial data is different in ways that posed real technical challenges, so it wasn’t supported by these formats from the start. As a result, developers building geospatial solutions struggled with fragmented formats, proprietary file types, and data siloes, making solutions harder and costlier to build.
With native geospatial data type support in Apache Iceberg and Parquet, you can seamlessly run query and processing engines like Wherobots, DuckDB, Apache Sedona, Apache Spark, Databricks, Snowflake, and BigQuery on your data, all while benefiting from faster queries and lower storage costs with Parquet-formatted data. These changes improve the short- and long-term economics of geospatial solutions. Organizations gain new freedom to innovate on a lower-cost, highly interoperable architecture. They can choose the best tool for the job over time without shuttling data between systems. Costs go down, productivity improves, innovation accelerates, and the playing field is leveled with respect to who can provide the best solution for their data. Legacy siloes will break down, just as they have for non-geospatial data. Most importantly, these changes will spark new innovation about our physical world.
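Under the hood, both formats represent geometry values using the OGC Well-Known Binary (WKB) encoding inside byte-array columns, which is a big part of why so many engines can read the same data. As a rough illustration of what that serialization looks like, here is a minimal sketch of encoding a 2D point as little-endian WKB using only the Python standard library (the helper name is ours, not part of either spec):

```python
import struct

def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB.

    Layout: 1 byte for the byte-order flag (1 = little-endian),
    a 4-byte uint32 geometry type code (1 = Point),
    then two 8-byte IEEE-754 doubles for x and y.
    """
    return struct.pack("<BIdd", 1, 1, x, y)

wkb = point_to_wkb(-122.33, 47.61)  # longitude, latitude
assert len(wkb) == 21               # 1 + 4 + 8 + 8 bytes
assert wkb[0] == 1                  # little-endian marker
```

Real tables would of course write these bytes through an Iceberg or Parquet writer rather than by hand, but the shared WKB representation is what lets engines exchange geometry values without format-specific conversion.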
These changes make geospatial solutions built on a data lake far more attractive. In the coming weeks, we will cover these features in detail and demonstrate how they benefit geospatial solutions.
These changes are the result of grassroots initiatives, investment, and influence from community members at Planet, CARTO, Wherobots, and many others across the Cloud Native Geospatial community. That includes GeoParquet, a grassroots extension of Parquet that proved its worth through real-world use, popularity, countless meetups, and discussions. We also want to credit the Iceberg community for working with members of the Wherobots team to bring a solution forward, while also influencing the Parquet community to add a native GEO data type. While the Iceberg and Parquet communities led with support for GEO data types, we welcome compatibility and support for GEO data types in all cloud-native formats, including Apache Hudi and Delta Lake.
Thoughts from Szehon Ho, Apache Iceberg PMC Member: “The long-awaited incorporation of geospatial data types in the Iceberg V3 spec extends a core theme of Iceberg as a project to provide a universal ‘shared warehouse storage’ across many engines and users, and will now allow this huge, growing ecosystem to work on the same geospatial data as well, unlocking many exciting use cases. It is also a demonstration of Iceberg community’s willingness to take the time and ‘do hard things’, engaging in months of very active discussions across companies and OSS communities, finally reaching consensus on a spec that supports the largest variety of use cases in the fast-evolving geospatial data domain.”
Thoughts from Chris Holmes, co-creator of GeoParquet: “The community developed and rallied behind GeoParquet to make geospatial data in Parquet fully interoperable and to let the geospatial world tap into all the advantages the big data world has been getting from Parquet. I’m very excited to see Parquet and Iceberg formally support geospatial types, and look forward to the acceleration in geospatial innovation that these changes will activate across industries and for our planet.”
Committers are already working to bring support for these changes to Apache Sedona and will notify the community as features are introduced.
At Wherobots, we’ve supported these GEO data types in Havasu (our Iceberg fork), which we built to enable geospatial lakehouse architectures with Wherobots, along with GeoParquet. We’ve begun building native Iceberg and Parquet support into how Wherobots operates on customer data, and we will put our full support behind these native formats moving forward. To learn more about the reasoning behind the Iceberg GEO types design, the trade-offs we navigated, and what it all means for implementers, read our follow-up blog: Iceberg GEO: Technical Insights and Implementation Strategies. If you need support adopting and using these cloud-native formats for geospatial work, reach out to the Apache Iceberg community on Slack or the Apache Sedona community on Discord.
Watch this livestream from Wednesday, May 7, featuring leaders from Foursquare, Databricks, Planet, and Wherobots as they discuss the historical challenges of handling spatial data, how these changes bridge the gap, and the future adoption of these advancements. Sign up for our newsletter to stay up to date on everything we are doing to help the spatial community embrace the modern geospatial lakehouse.