Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona
The latest release of Overture Maps data is now published in GeoParquet format, allowing for more efficient spatial operations when selecting subsets of the dataset and enabling interoperability with a growing ecosystem of data tooling that supports the cloud-native GeoParquet format. In this post we explore some of the motivations and advantages of publishing the […]
TABLE OF CONTENTS
Contributors
-
Feng Jiang
Senior software engineer At Microsoft. Loving to build the maps with decent tech.
-
Jia Yu
Jia Yu is a co-founder and the Chief Architect of Wherobots Inc. He is the PMC Chair of Apache Sedona
-
William Lyon
Developer Relations At Wherobots
The latest release of Overture Maps data is now published in GeoParquet format, allowing for more efficient spatial operations when selecting subsets of the dataset and enabling interoperability with a growing ecosystem of data tooling that supports the cloud-native GeoParquet format. In this post we explore some of the motivations and advantages of publishing the Overture Maps data as GeoParquet as well as the process used by Overture to generate and publish a large scale geospatial dataset as GeoParquet using Apache Sedona.
Understanding GeoParquet
Built upon Apache Parquet, GeoParquet is an open-source file format specifically designed for storing geospatial data efficiently. Apache Parquet is a column-oriented data file format optimized for storing data efficiently in cloud data object stores like Amazon’s S3 service. However Parquet lacks native support for geospatial data. GeoParquet extends Parquet by defining a standard for storing geometry data and associated metadata, such as the spatial bounding box and coordinate reference system for each file.
In a nutshell, GeoParquet adds the following information to the file metadata:
- Version: the GeoParquet specification version of this file
- Primary column: the main Geometry Type column should be considered when there are multiple Geometry Type columns stored in this file.
-
Column metadata per Geometry Type column:
- Encoding of the geometry data. Currently only WKB format is supported.
- Geometry types: all possible geometry types occurred in this column, such as POINT, POLYGON, LINESTRING
- Coordinate reference system information, bounding box, geometry orientation, and so on
In GeoParquet, all geometric data is required to be housed in Parquet’s Binary Type. This data must adhere to the encoding specifications detailed in the column metadata, which currently employs Well-Known Binary (WKB) encoding.
The storage of geometries in GeoParquet using Parquet’s intrinsic binary format ensures that standard Parquet readers can seamlessly access GeoParquet files. This feature significantly enhances GeoParquet’s compatibility with existing data systems. Nevertheless, a dedicated GeoParquet reader, designed to interpret GeoParquet metadata, can utilize this information for further optimizations and to execute more advanced geospatial operations.
Benefits of GeoParquet
GeoParquet integrates all the advantages of Parquet, such as its compact data storage and rapid data retrieval capabilities. Additionally, GeoParquet introduces a standardized approach for storing and accessing geospatial data. This standardization significantly simplifies the process of data sharing and enhances interoperability across various systems and tools within the geospatial data processing ecosystem.
Crucially, GeoParquet enables specialized GeoParquet readers to optimize geospatial queries, significantly enhancing the efficiency of data retrieval. This optimization is particularly vital for applications demanding real-time data processing and analysis.
A significant contributor to this enhanced efficiency is the bounding box information (abbreviated as BBox) embedded in the GeoParquet file metadata. The BBox effectively serves as a spatial index, streamlining the process of data filtering. This is particularly useful when executing geospatially driven queries. When a spatial query is executed — for instance, searching for all points of interest within a certain area — the GeoParquet reader first checks the BBox information. If the query’s geographic area doesn’t intersect with the BBox of the data, the reader can immediately exclude that file from further consideration. This step vastly reduces the amount of data that needs to be retrieved and processed.
As a testimony, with the help of Apache Sedona, Wherobots engineers created a GeoParquet version of the Overture 2023-07-26-alpha.0 data release which was not in GeoParquet format. This transformation led to a remarkable improvement in query performance. When conducting a spatial range query on the Building dataset using Apache Sedona, the process on the GeoParquet version took approximately 3 minutes, a significant reduction from the 1 hour and 30 minutes required for the same query on the original dataset. Further details and insights into this transformation are available in our blog post titled Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet.
The Role Of Apache Sedona
The GIS software landscape offers a variety of tools capable of handling GeoParquet data, such as GeoPandas and QGIS. However, Apache Sedona stands apart in this field. As a renowned open-source cluster computing system, it is specifically tailored for processing large-scale spatial data. Apache Sedona offers a rich suite of spatial operations and analytics functions, all accessible through an intuitive Spatial SQL programming language. This unique blend of features establishes Apache Sedona as a highly effective system for geospatial data analysis.
With the adoption of GeoParquet, Apache Sedona has expanded its capabilities, now able to create GeoParquet files. This integration not only enhances Apache Sedona’s data storage efficiency but also opens up broader prospects for advanced data analysis. This positions Apache Sedona as a distinct and powerful tool in the realm of geospatial data processing, differentiating it from other GIS tools by its ability to handle complex, large-scale spatial datasets efficiently. Further details can be found on Sedona’s website: https://sedona.apache.org/
Generating GeoParquet Files With Apache Sedona
Leveraging Sedona’s programming guides, OMF crafts GeoParquet datasets by incorporating a geometry column sourced from its WKB geospatial data. The process involves utilizing Sedona functions, as illustrated in the python code snippet below:
myDataFrame.withColumn("geometry", expr("ST_*")).selectExpr("ST_*")
In its commitment to improving the spatial query experience for its customers, OMF has implemented a strategic pre-defined indexing method. This method organizes data effectively based on theme and type combinations. To further refine spatial filter pushdown performance, OMF capitalizes on Sedona’s ST_GeoHash
function to generate dual GeoHash IDs for each geometry in the dataset.
- Coarse-Level GeoHashing (Precision Level 3): The first GeoHash, with a precision level of 3, creates cells approximately 156.5km x 156km in size. This level is ideal for grouping data based on geographic proximity. Using this GeoHash key, a large Sedona DataFrame is partitioned into numerous smaller GeoParquet files. Each file correlates to a unique 3-character GeoHash key, facilitating efficient data pruning at the Parquet file level. Importantly, this approach enables users to manually download individual GeoParquet files by referencing the GeoHash key in the file path, bypassing the need for a Parquet file reader.
- Fine-Level GeoHashing (Precision Level 8): Concurrently, OMF utilizes a finer GeoHash at level 8, resulting in smaller cells around 38.2m x 19m and then sorts data using this GeoHash within each GeoParquet file. This precision is used to sort data within each GeoParquet file. By doing so, OMF ensures that the data within each Parquet row-group or data page is organized based on spatial proximity. This organization is crucial for enabling efficient data pruning, particularly on the bbox column.
The bbox column, stored in a native Parquet Struct Type, consists of four Double Type sub-columns: minx, maxx, miny, and maxy. It allows even standard Parquet readers to conduct pruning operations effectively. This capability is particularly beneficial as it complements the BBox information found in the GeoParquet file metadata. While the BBox metadata provides an overall spatial reference for the file, the bbox column offers a more granular level of spatial data, enabling granular data pruning at the Parquet row-group level.
This dual GeoHash approach serves three core purposes:
- It allows users to download individual GeoParquet files from OMF without requiring a GeoParquet or Parquet reader, simply by checking the GeoHash key.
- GeoParquet readers can efficiently filter files using the BBox information in each file’s metadata, enhancing query speed.
- The strategy also enhances the functionality of standard Parquet readers which can effectively filter files using the statistical information of the bbox column, found in each file’s column and row-group metadata, without needing to recognize the file as a GeoParquet file.
Analyze OMF GeoParquet Files with Sedona on Wherobots Cloud
Let’s see some of the benefits of querying a large-scale GeoParquet dataset in action by analyzing the Overture Places theme using Apache Sedona. The easiest way to get started with Apache Sedona is by using the official Apache Sedona Docker image, or by creating a free hosted notebook in Wherobots Cloud.
The following command will start a local Docker container running Apache Sedona and expose a Jupyter Lab environment on localhost port 8888:
docker run -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 \
apache/sedona:latest
This Jupyter Lab environment can now be accessed via a web browser at http://localhost:8888
Next, the following Python code will load the latest Overture Maps GeoParquet release (at the time of writing) in Sedona:
from sedona.spark import *
config = SedonaContext. \
builder(). \
config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \ getOrCreate()
sedona = SedonaContext.create(config)
df = sedona.read.format("geoparquet"). \
load("s3a://overturemaps-us-west-2/release/2024-01-17-alpha.0/theme=places/type=place")
We can now apply a spatial predicate to filter for points of interest within the bounds of New York City, using the following Spatial SQL functions:
ST_PolygonFromEnvelope
– create a polygon geometry that will serve as the area roughly representing the boundaries of New York City from a specified bounding boxST_Within
– filter rows for only points of interest that fall within the boundary of that polygon geometry
spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"
df = df.where(spatial_filter)
df.createOrReplaceTempView("places")
When executing this spatial filter, Sedona can leverage the BBox GeoParquet metadata in each partitioned GeoParquet file to exclude any files that do not contain points of interest within the polygon defined in the spatial predicate. This results in less data being examined and scanned by Sedona and less data that needs to be retrieved over the network improving query performance. Also, note that the geometry column is interpreted as a geometry type without the need for explicit type casting thanks to the WKB serialization as part of the GeoParquet specification.
To further demonstrate the functionality of Spatial SQL with Apache Sedona, let’s query for all stadiums within the area of New York City, then find all points of interest within a short walking distance of each stadium using the following Spatial SQL functions:
ST_Buffer
– create a new geometry that represents a buffer around each stadium that represents a short walking distance from the stadiumST_Intersects
– filter for points of interest that lie within the buffer geometry, identifying points of interest within walking distance of each stadium
stadium_places = sedona.sql("""
WITH stadiums AS (SELECT * FROM places WHERE categories.main = "stadium_arena")
SELECT * FROM places, stadiums
WHERE ST_Intersects(places.geometry, ST_Buffer(stadiums.geometry, 0.002))
""")
You can see more detailed examples of analyzing the Overture Places dataset using Apache Sedona in this blog post.
Conclusion
The Overture Maps Foundation’s adoption of GeoParquet, in collaboration with Apache Sedona, marks a significant milestone in geospatial data management. The combination of GeoParquet’s efficient storage format and Apache Sedona’s spatial analytics capabilities brings unprecedented performance and scalability to geospatial dataset publishing and analysis. This integration opens up new avenues for researchers, developers, and geospatial enthusiasts, empowering them to explore and derive insights from vast geospatial datasets more efficiently than ever before.
Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:
Contributors
-
Feng Jiang
Senior software engineer At Microsoft. Loving to build the maps with decent tech.
-
Jia Yu
Jia Yu is a co-founder and the Chief Architect of Wherobots Inc. He is the PMC Chair of Apache Sedona
-
William Lyon
Developer Relations At Wherobots