
Apache Iceberg and Parquet now support GEO data types

Geospatial data isn’t special anymore, and that’s a good thing.

Geospatial solutions were long thought of as “special” because the changes that modernized today’s data ecosystem left geospatial data mostly behind. That changes today.

Thanks to the efforts of the Apache Iceberg and Parquet communities, we are excited to share that both Iceberg and Parquet now support geometry and geography data types (collectively, the GEO data types).

Geospatial challenges

Geospatial data has been disconnected from the broader data ecosystem, which modernized around open file formats like Apache Parquet and open table formats like Apache Iceberg, Delta Lake, and Apache Hudi.

The benefits of these cloud-native open file and table formats fueled widespread adoption of data lake and lakehouse architectures. Organizations moved away from expensive proprietary systems, away from data silos that coupled compute with storage and didn’t scale, and away from formats that locked them in and stifled innovation. Relative to legacy options, these cloud-native formats fundamentally change how data is stored, managed, and accessed. This in turn lowers costs, increases agency, and unlocks innovation over time. But geospatial data was different in ways that posed a number of technical challenges, so these formats didn’t support it from the start. As a result, developers building solutions with geospatial data struggled with fragmented formats, proprietary file types, and data silos, all of which made solutions harder and costlier to build.

The silos will break down

With native geospatial data type support in Apache Iceberg and Parquet, you can seamlessly run query and processing engines like Wherobots, DuckDB, Apache Sedona, Apache Spark, Databricks, Snowflake, and BigQuery on your data, all while benefiting from the faster queries and lower storage costs of Parquet-formatted data.

These changes improve the short- and long-term economics of geospatial solutions. Organizations will have new freedom to innovate with a lower-cost, highly interoperable architecture. They get to choose the best tool for the job over time without having to shuttle data between systems. Costs fall, productivity improves, innovation accelerates, and the playing field is leveled with respect to who can provide the best solution for their data. The legacy silos will break down, just as they have for non-geospatial data. And most importantly, these changes will lead to new innovation about our physical world.

Benefits of Iceberg and Parquet

These changes make geospatial solutions based on a data lake a lot more attractive. Here are a few benefits. 

  • Neither Iceberg nor Parquet separates compute from storage by itself, but together they make it possible to pair low-cost data lake storage with multiple independent, high-performance compute engines for different use cases
  • ACID transactions and data versioning enable the use of multiple compute engines without conflicts
  • Time travel allows tracking of data changes over time
  • Queries run faster thanks to features like column pruning, row-group filtering, and fast file access
  • Open data formats minimize vendor lock-in
  • Geospatial data will be supported across a broader ecosystem of tools and services
  • And many more…
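To make the row-group filtering point concrete for geospatial data, here is a minimal sketch in plain Python. It assumes each row group carries a bounding box summarizing its geometry column, so a reader can skip any group whose box does not intersect the query window. The `RowGroup` class, the `query` function, and the sample coordinates are hypothetical illustrations, not the API of Parquet, Iceberg, or any engine.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (xmin, ymin, xmax, ymax)
BBox = Tuple[float, float, float, float]
Point = Tuple[float, float]


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group plus its bounding-box statistics."""
    name: str
    bbox: BBox            # assumed to cover every point stored in the group
    points: List[Point]


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(window: BBox, p: Point) -> bool:
    """True when point p lies inside the query window (edges included)."""
    return window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]


def query(row_groups: List[RowGroup], window: BBox):
    """Return (names of row groups actually scanned, matching points)."""
    scanned, hits = [], []
    for rg in row_groups:
        # Skip the whole group using only its statistics, without reading rows.
        if not intersects(rg.bbox, window):
            continue
        scanned.append(rg.name)
        hits.extend(p for p in rg.points if contains(window, p))
    return scanned, hits


# Illustrative sample data: three row groups clustered by region.
ROW_GROUPS = [
    RowGroup("rg0", (-125.0, 32.0, -114.0, 42.0), [(-122.4, 37.8), (-118.2, 34.1)]),
    RowGroup("rg1", (-81.0, 25.0, -71.0, 45.0), [(-74.0, 40.7), (-80.2, 25.8)]),
    RowGroup("rg2", (-1.0, 48.0, 14.0, 53.0), [(2.35, 48.86), (13.4, 52.5)]),
]

if __name__ == "__main__":
    scanned, hits = query(ROW_GROUPS, (-123.0, 37.0, -121.0, 38.5))
    print("scanned:", scanned)
    print("hits:", hits)
```

In this sketch only one of the three row groups is read for the sample window. The benefit depends on how well the data is spatially clustered: if every row group spans the whole globe, no group can be skipped.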

In the coming weeks, we will cover these features in detail and demonstrate how they benefit geospatial solutions.

Grassroots efforts made this happen

These changes were the result of grassroots initiatives, investment, and influence from community members at Planet, CARTO, Wherobots, and many others across the Cloud Native Geospatial community. This includes GeoParquet, a grassroots project and extension of Parquet that proved its worth through use and popularity, countless meetups, and discussions. We also want to credit the Iceberg community for working with members of the Wherobots team to bring a solution forward while also influencing the Parquet community to create a native GEO data type.

While the Iceberg and Parquet communities led the way with support for GEO data types, we welcome compatibility and support for GEO data types in all cloud-native formats, including Apache Hudi and Delta Lake.


Thoughts from Szehon Ho, Apache Iceberg PMC Member

“The long-awaited incorporation of geospatial data types in the Iceberg V3 spec extends a core theme of Iceberg as a project to provide a universal ‘shared warehouse storage’ across many engines and users, and will now allow this huge, growing ecosystem to work on the same geospatial data as well, unlocking many exciting use cases. It is also a demonstration of Iceberg community’s willingness to take the time and ‘do hard things’, engaging in months of very active discussions across companies and OSS communities, finally reaching consensus on a spec that supports the largest variety of use cases in the fast-evolving geospatial data domain.”

Thoughts from Chris Holmes, co-creator of GeoParquet

“The community developed and rallied behind GeoParquet to make geospatial data in Parquet fully interoperable and to let the geospatial world tap into all the advantages the big data world has been getting from Parquet. I’m very excited to see Parquet and Iceberg formally support geospatial types, and look forward to the acceleration in geospatial innovation that these changes will activate across industries and for our planet.”

Looking ahead


Committers are already working to bring support for these data types into Apache Sedona and will notify the community as that support is introduced.

At Wherobots, we’ve supported these GEO data types in GeoParquet and in Havasu (our Iceberg fork), which we built to enable geospatial lakehouse architectures with Wherobots. We’ve begun building native Iceberg and Parquet support into how Wherobots operates on customer data, and will put our full support behind these native formats moving forward.

To learn more about the reasoning behind the Iceberg GEO types design, the trade-offs we navigated, and what it all means for implementers, please read our follow-up blog: Iceberg GEO: Technical Insights and Implementation Strategies. If you need support throughout your journey adopting and utilizing these cloud-native formats for geospatial use, reach out to Apache Iceberg on Slack or Apache Sedona on Discord.

Watch the livestream “Geospatial Tables in the Open Lakehouse - Iceberg and Parquet” from Wednesday, May 7, with leaders from Foursquare, Databricks, Planet, and Wherobots as they discuss the historical challenges of handling spatial data, bridging the gap, and future adoption of these advancements.

Sign up for our newsletter to stay up to date with everything we are doing to enable the spatial community to embrace the modern geospatial lakehouse.
