The latest release of Overture Maps data is now published in GeoParquet format, allowing for more efficient spatial operations when selecting subsets of the dataset and enabling interoperability with a growing ecosystem of data tooling that supports the cloud-native GeoParquet format. In this post, we explore some of the motivations and advantages of publishing the Overture Maps data as GeoParquet, as well as the process Overture used to generate and publish this large-scale geospatial dataset as GeoParquet with Apache Sedona.
Built upon Apache Parquet, GeoParquet is an open-source file format specifically designed for storing geospatial data efficiently. Apache Parquet is a column-oriented data file format optimized for storing data efficiently in cloud object stores such as Amazon S3. However, Parquet lacks native support for geospatial data. GeoParquet extends Parquet by defining a standard for storing geometry data and associated metadata, such as the spatial bounding box and coordinate reference system of each file.
In a nutshell, GeoParquet adds the following information to the file metadata:
The version of the GeoParquet specification used and the name of the primary geometry column.
Column metadata for each geometry column, such as the geometry encoding (currently Well-Known Binary), the geometry types present, the coordinate reference system, and optionally the column's bounding box.
In GeoParquet, all geometry data must be stored using Parquet's binary type, and it must adhere to the encoding specified in the column metadata, which is currently Well-Known Binary (WKB).
The storage of geometries in GeoParquet using Parquet’s intrinsic binary format ensures that standard Parquet readers can seamlessly access GeoParquet files. This feature significantly enhances GeoParquet’s compatibility with existing data systems. Nevertheless, a dedicated GeoParquet reader, designed to interpret GeoParquet metadata, can utilize this information for further optimizations and to execute more advanced geospatial operations.
GeoParquet integrates all the advantages of Parquet, such as its compact data storage and rapid data retrieval capabilities. Additionally, GeoParquet introduces a standardized approach for storing and accessing geospatial data. This standardization significantly simplifies the process of data sharing and enhances interoperability across various systems and tools within the geospatial data processing ecosystem.
Crucially, GeoParquet enables specialized GeoParquet readers to optimize geospatial queries, significantly enhancing the efficiency of data retrieval. This optimization is particularly vital for applications demanding real-time data processing and analysis.
A significant contributor to this enhanced efficiency is the bounding box information (abbreviated as BBox) embedded in the GeoParquet file metadata. The BBox effectively serves as a spatial index, streamlining the process of data filtering. This is particularly useful when executing geospatially driven queries. When a spatial query is executed — for instance, searching for all points of interest within a certain area — the GeoParquet reader first checks the BBox information. If the query’s geographic area doesn’t intersect with the BBox of the data, the reader can immediately exclude that file from further consideration. This step vastly reduces the amount of data that needs to be retrieved and processed.
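To make this concrete, here is a minimal sketch of how a GeoParquet-aware reader could use that metadata to decide whether a file needs to be scanned at all. It assumes a local file named places.parquet whose geometry column metadata includes a bbox, and it uses the pyarrow library; the file name and query window are placeholders:

import json
import pyarrow.parquet as pq

# Query window (minx, miny, maxx, maxy), here the New York City envelope used later in this post
query = (-74.25909, 40.477399, -73.700181, 40.917577)

# GeoParquet stores its metadata under the "geo" key of the Parquet file's
# key-value metadata, including a per-column bounding box.
geo = json.loads(pq.read_metadata("places.parquet").metadata[b"geo"])
minx, miny, maxx, maxy = geo["columns"]["geometry"]["bbox"]

# Skip the file entirely if its bounding box does not intersect the query window
if maxx < query[0] or minx > query[2] or maxy < query[1] or miny > query[3]:
    print("skip this file")
else:
    print("scan this file")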
As a testament to these benefits, Wherobots engineers used Apache Sedona to create a GeoParquet version of the Overture 2023-07-26-alpha.0 data release, which was not originally published in GeoParquet format. This transformation led to a remarkable improvement in query performance: a spatial range query on the Buildings dataset using Apache Sedona took approximately 3 minutes on the GeoParquet version, down from the 1 hour and 30 minutes required for the same query on the original dataset. Further details and insights into this transformation are available in our blog post titled Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet.
The GIS software landscape offers a variety of tools capable of handling GeoParquet data, such as GeoPandas and QGIS. However, Apache Sedona stands apart in this field. As a renowned open-source cluster computing system, it is specifically tailored for processing large-scale spatial data. Apache Sedona offers a rich suite of spatial operations and analytics functions, all accessible through an intuitive Spatial SQL programming language. This unique blend of features establishes Apache Sedona as a highly effective system for geospatial data analysis.
With the adoption of GeoParquet, Apache Sedona has expanded its capabilities and can now read and write GeoParquet files. This integration not only enhances Apache Sedona’s data storage efficiency but also opens up broader prospects for advanced data analysis. It positions Apache Sedona as a distinct and powerful tool in the realm of geospatial data processing, differentiating it from other GIS tools by its ability to handle complex, large-scale spatial datasets efficiently. Further details can be found on Sedona’s website: https://sedona.apache.org/
Leveraging Sedona’s programming guides, OMF (the Overture Maps Foundation) crafts GeoParquet datasets by deriving a geometry column from its WKB geospatial data. The process uses Sedona functions, as illustrated in the Python code snippet below:
from pyspark.sql.functions import expr

# Convert the WKB-encoded column into a Sedona geometry column
myDataFrame = myDataFrame.withColumn("geometry", expr("ST_GeomFromWKB(geometry)"))
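Once the WKB data has been converted to a geometry column, the resulting DataFrame can be written back out in the GeoParquet format. A minimal sketch, assuming the conversion above; the output path is a placeholder:

# Persist the converted DataFrame as GeoParquet (output path is a placeholder)
myDataFrame.write.format("geoparquet").mode("overwrite").save("s3a://my-bucket/overture/places/")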
In its commitment to improving the spatial query experience for its customers, OMF has implemented a strategic pre-defined indexing method. This method organizes data effectively based on theme and type combinations. To further refine spatial filter pushdown performance, OMF capitalizes on Sedona’s ST_GeoHash function to generate dual GeoHash IDs for each geometry in the dataset.
The bbox column, stored as a native Parquet struct type, consists of four double-type sub-columns: minx, maxx, miny, and maxy. Because these are plain Parquet columns, even standard Parquet readers can use their statistics for effective pruning. This capability complements the BBox information found in the GeoParquet file metadata: while the BBox metadata provides an overall spatial reference for the file, the bbox column offers a finer level of spatial detail, enabling pruning at the Parquet row-group level.
This dual GeoHash approach serves three core purposes: it groups spatially close geometries into the same files, it keeps each file’s bounding box tight so that file-level pruning on the GeoParquet BBox metadata stays effective, and it orders records within each file so that row-group statistics can prune the data even further.
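As an illustration only, and not OMF’s exact pipeline, the per-row bbox column and the two GeoHash IDs can be produced with Sedona functions along these lines; the column names, GeoHash precisions, and output path are assumptions:

from pyspark.sql.functions import expr

# Per-row bounding box as a Parquet struct column (minx, maxx, miny, maxy)
df = df.withColumn(
    "bbox",
    expr("struct(ST_XMin(geometry) AS minx, ST_XMax(geometry) AS maxx, "
         "ST_YMin(geometry) AS miny, ST_YMax(geometry) AS maxy)"))

# Two GeoHash IDs per geometry: a coarse one for grouping rows into files,
# a fine one for ordering rows within each file
df = df.withColumn("coarse_geohash", expr("ST_GeoHash(geometry, 3)")) \
       .withColumn("fine_geohash", expr("ST_GeoHash(geometry, 8)"))

# Sorting by the GeoHash columns before writing keeps spatially close rows together
df.sort("coarse_geohash", "fine_geohash").write.format("geoparquet").save("output_path")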
Let’s see some of the benefits of querying a large-scale GeoParquet dataset in action by analyzing the Overture Places theme using Apache Sedona. The easiest way to get started with Apache Sedona is by using the official Apache Sedona Docker image, or by creating a free hosted notebook in Wherobots Cloud.
The following command will start a local Docker container running Apache Sedona and expose a Jupyter Lab environment on localhost port 8888:
docker run -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 \
  apache/sedona:latest
This Jupyter Lab environment can now be accessed via a web browser at http://localhost:8888
Next, the following Python code will load the latest Overture Maps GeoParquet release (at the time of writing) in Sedona:
from sedona.spark import *

config = SedonaContext. \
    builder(). \
    config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet"). \
    load("s3a://overturemaps-us-west-2/release/2024-01-17-alpha.0/theme=places/type=place")
We can now apply a spatial predicate to filter for points of interest within the bounds of New York City, using the following Spatial SQL functions:
ST_PolygonFromEnvelope
ST_Within
spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"

df = df.where(spatial_filter)
df.createOrReplaceTempView("places")
When executing this spatial filter, Sedona can leverage the BBox GeoParquet metadata in each partitioned GeoParquet file to exclude any files that do not contain points of interest within the polygon defined in the spatial predicate. This means less data is examined and scanned by Sedona and less data needs to be retrieved over the network, improving query performance. Also note that the geometry column is interpreted as a geometry type without explicit type casting, thanks to the WKB serialization defined by the GeoParquet specification.
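As a quick, illustrative sanity check, the filtered DataFrame’s schema and row count can be inspected directly:

df.printSchema()   # the geometry column is reported with a geometry type, not binary
print(df.count())  # number of places that fall inside the New York City envelope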
To further demonstrate the functionality of Spatial SQL with Apache Sedona, let’s query for all stadiums within the area of New York City, then find all points of interest within a short walking distance of each stadium using the following Spatial SQL functions:
ST_Buffer
ST_Intersects
stadium_places = sedona.sql("""
WITH stadiums AS (SELECT * FROM places WHERE categories.main = "stadium_arena")
SELECT * FROM places, stadiums
WHERE ST_Intersects(places.geometry, ST_Buffer(stadiums.geometry, 0.002))
""")
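The joined result is a regular DataFrame, so it can be inspected or aggregated as usual, for example:

print(stadium_places.count())           # points of interest within the buffer around each stadium
stadium_places.show(5, truncate=False)  # preview a few of the matching rows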
You can see more detailed examples of analyzing the Overture Places dataset using Apache Sedona in this blog post.
The Overture Maps Foundation’s adoption of GeoParquet, in collaboration with Apache Sedona, marks a significant milestone in geospatial data management. The combination of GeoParquet’s efficient storage format and Apache Sedona’s spatial analytics capabilities brings unprecedented performance and scalability to geospatial dataset publishing and analysis. This integration opens up new avenues for researchers, developers, and geospatial enthusiasts, empowering them to explore and derive insights from vast geospatial datasets more efficiently than ever before.
Interested in learning more about Apache Sedona? Download this free O’Reilly book and learn practical solutions for the challenges of working with various types of geospatial data at scale.