7 Mins Read

16 Apr 2024

Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

Authors

Querying Overture Maps GeoParquet data using Apache Sedona

Visualizing points of interest within walking distance from each stadium.

The latest release of Overture Maps data is now published in GeoParquet format, allowing for more efficient spatial operations when selecting subsets of the dataset and enabling interoperability with a growing ecosystem of data tooling that supports the cloud-native GeoParquet format. In this post we explore some of the motivations and advantages of publishing the Overture Maps data as GeoParquet as well as the process used by Overture to generate and publish a large scale geospatial dataset as GeoParquet using Apache Sedona.

Understanding GeoParquet

Built upon Apache Parquet, GeoParquet is an open-source file format specifically designed for storing geospatial data efficiently. Apache Parquet is a column-oriented data file format optimized for storing data efficiently in cloud data object stores like Amazon’s S3 service. However Parquet lacks native support for geospatial data. GeoParquet extends Parquet by defining a standard for storing geometry data and associated metadata, such as the spatial bounding box and coordinate reference system for each file.

In a nutshell, GeoParquet adds the following information to the file metadata:

Version: the GeoParquet specification version of this file
Primary column: the main Geometry Type column should be considered when there are multiple Geometry Type columns stored in this file.
Column metadata per Geometry Type column:
- Encoding of the geometry data. Currently only WKB format is supported.
- Geometry types: all possible geometry types occurred in this column, such as POINT, POLYGON, LINESTRING
- Coordinate reference system information, bounding box, geometry orientation, and so on

In GeoParquet, all geometric data is required to be housed in Parquet’s Binary Type. This data must adhere to the encoding specifications detailed in the column metadata, which currently employs Well-Known Binary (WKB) encoding.

The storage of geometries in GeoParquet using Parquet’s intrinsic binary format ensures that standard Parquet readers can seamlessly access GeoParquet files. This feature significantly enhances GeoParquet’s compatibility with existing data systems. Nevertheless, a dedicated GeoParquet reader, designed to interpret GeoParquet metadata, can utilize this information for further optimizations and to execute more advanced geospatial operations.

Benefits of GeoParquet

GeoParquet integrates all the advantages of Parquet, such as its compact data storage and rapid data retrieval capabilities. Additionally, GeoParquet introduces a standardized approach for storing and accessing geospatial data. This standardization significantly simplifies the process of data sharing and enhances interoperability across various systems and tools within the geospatial data processing ecosystem.

Crucially, GeoParquet enables specialized GeoParquet readers to optimize geospatial queries, significantly enhancing the efficiency of data retrieval. This optimization is particularly vital for applications demanding real-time data processing and analysis.

A significant contributor to this enhanced efficiency is the bounding box information (abbreviated as BBox) embedded in the GeoParquet file metadata. The BBox effectively serves as a spatial index, streamlining the process of data filtering. This is particularly useful when executing geospatially driven queries. When a spatial query is executed — for instance, searching for all points of interest within a certain area — the GeoParquet reader first checks the BBox information. If the query’s geographic area doesn’t intersect with the BBox of the data, the reader can immediately exclude that file from further consideration. This step vastly reduces the amount of data that needs to be retrieved and processed.

As a testimony, with the help of Apache Sedona, Wherobots engineers created a GeoParquet version of the Overture 2023-07-26-alpha.0 data release which was not in GeoParquet format. This transformation led to a remarkable improvement in query performance. When conducting a spatial range query on the Building dataset using Apache Sedona, the process on the GeoParquet version took approximately 3 minutes, a significant reduction from the 1 hour and 30 minutes required for the same query on the original dataset. Further details and insights into this transformation are available in our blog post titled Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet.

The Role Of Apache Sedona

The GIS software landscape offers a variety of tools capable of handling GeoParquet data, such as GeoPandas and QGIS. However, Apache Sedona stands apart in this field. As a renowned open-source cluster computing system, it is specifically tailored for processing large-scale spatial data. Apache Sedona offers a rich suite of spatial operations and analytics functions, all accessible through an intuitive Spatial SQL programming language. This unique blend of features establishes Apache Sedona as a highly effective system for geospatial data analysis.

With the adoption of GeoParquet, Apache Sedona has expanded its capabilities, now able to create GeoParquet files. This integration not only enhances Apache Sedona’s data storage efficiency but also opens up broader prospects for advanced data analysis. This positions Apache Sedona as a distinct and powerful tool in the realm of geospatial data processing, differentiating it from other GIS tools by its ability to handle complex, large-scale spatial datasets efficiently. Further details can be found on Sedona’s website: https://sedona.apache.org/

Generating GeoParquet Files With Apache Sedona

Leveraging Sedona’s programming guides, OMF crafts GeoParquet datasets by incorporating a geometry column sourced from its WKB geospatial data. The process involves utilizing Sedona functions, as illustrated in the python code snippet below:

myDataFrame.withColumn("geometry", expr("ST_*")).selectExpr("ST_*")

In its commitment to improving the spatial query experience for its customers, OMF has implemented a strategic pre-defined indexing method. This method organizes data effectively based on theme and type combinations. To further refine spatial filter pushdown performance, OMF capitalizes on Sedona’s ST_GeoHash function to generate dual GeoHash IDs for each geometry in the dataset.

Coarse-Level GeoHashing (Precision Level 3): The first GeoHash, with a precision level of 3, creates cells approximately 156.5km x 156km in size. This level is ideal for grouping data based on geographic proximity. Using this GeoHash key, a large Sedona DataFrame is partitioned into numerous smaller GeoParquet files. Each file correlates to a unique 3-character GeoHash key, facilitating efficient data pruning at the Parquet file level. Importantly, this approach enables users to manually download individual GeoParquet files by referencing the GeoHash key in the file path, bypassing the need for a Parquet file reader.
Fine-Level GeoHashing (Precision Level 8): Concurrently, OMF utilizes a finer GeoHash at level 8, resulting in smaller cells around 38.2m x 19m and then sorts data using this GeoHash within each GeoParquet file. This precision is used to sort data within each GeoParquet file. By doing so, OMF ensures that the data within each Parquet row-group or data page is organized based on spatial proximity. This organization is crucial for enabling efficient data pruning, particularly on the bbox column.

The bbox column, stored in a native Parquet Struct Type, consists of four Double Type sub-columns: minx, maxx, miny, and maxy. It allows even standard Parquet readers to conduct pruning operations effectively. This capability is particularly beneficial as it complements the BBox information found in the GeoParquet file metadata. While the BBox metadata provides an overall spatial reference for the file, the bbox column offers a more granular level of spatial data, enabling granular data pruning at the Parquet row-group level.

This dual GeoHash approach serves three core purposes:

It allows users to download individual GeoParquet files from OMF without requiring a GeoParquet or Parquet reader, simply by checking the GeoHash key.
GeoParquet readers can efficiently filter files using the BBox information in each file’s metadata, enhancing query speed.
The strategy also enhances the functionality of standard Parquet readers which can effectively filter files using the statistical information of the bbox column, found in each file’s column and row-group metadata, without needing to recognize the file as a GeoParquet file.

Analyze OMF GeoParquet Files with Sedona on Wherobots Cloud

Let’s see some of the benefits of querying a large-scale GeoParquet dataset in action by analyzing the Overture Places theme using Apache Sedona. The easiest way to get started with Apache Sedona is by using the official Apache Sedona Docker image, or by creating a free hosted notebook in Wherobots Cloud.

The following command will start a local Docker container running Apache Sedona and expose a Jupyter Lab environment on localhost port 8888:

docker run -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 \
apache/sedona:latest

This Jupyter Lab environment can now be accessed via a web browser at http://localhost:8888

Next, the following Python code will load the latest Overture Maps GeoParquet release (at the time of writing) in Sedona:

from sedona.spark import *

config = SedonaContext. \
builder(). \
config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \ getOrCreate()

sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet"). \
load("s3a://overturemaps-us-west-2/release/2024-01-17-alpha.0/theme=places/type=place")

We can now apply a spatial predicate to filter for points of interest within the bounds of New York City, using the following Spatial SQL functions:

ST_PolygonFromEnvelope – create a polygon geometry that will serve as the area roughly representing the boundaries of New York City from a specified bounding box
ST_Within – filter rows for only points of interest that fall within the boundary of that polygon geometry

spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"
df = df.where(spatial_filter)
df.createOrReplaceTempView("places")

When executing this spatial filter, Sedona can leverage the BBox GeoParquet metadata in each partitioned GeoParquet file to exclude any files that do not contain points of interest within the polygon defined in the spatial predicate. This results in less data being examined and scanned by Sedona and less data that needs to be retrieved over the network improving query performance. Also, note that the geometry column is interpreted as a geometry type without the need for explicit type casting thanks to the WKB serialization as part of the GeoParquet specification.

To further demonstrate the functionality of Spatial SQL with Apache Sedona, let’s query for all stadiums within the area of New York City, then find all points of interest within a short walking distance of each stadium using the following Spatial SQL functions:

ST_Buffer – create a new geometry that represents a buffer around each stadium that represents a short walking distance from the stadium
ST_Intersects – filter for points of interest that lie within the buffer geometry, identifying points of interest within walking distance of each stadium

stadium_places = sedona.sql("""
WITH stadiums AS (SELECT * FROM places WHERE categories.main = "stadium_arena")
SELECT * FROM places, stadiums
WHERE ST_Intersects(places.geometry, ST_Buffer(stadiums.geometry, 0.002))

You can see more detailed examples of analyzing the Overture Places dataset using Apache Sedona in this blog post.

Conclusion

The Overture Maps Foundation’s adoption of GeoParquet, in collaboration with Apache Sedona, marks a significant milestone in geospatial data management. The combination of GeoParquet’s efficient storage format and Apache Sedona’s spatial analytics capabilities brings unprecedented performance and scalability to geospatial dataset publishing and analysis. This integration opens up new avenues for researchers, developers, and geospatial enthusiasts, empowering them to explore and derive insights from vast geospatial datasets more efficiently than ever before.

O'Reilly Apache Sedona Cloud Native Geospatial with Apache Sedona book cover

Interested in learning more about Apache Sedona? Download this free O’Reilly book and learn practical solutions for challenges working with various type of geospatial at scale.

Access the free guide

Download Now

TABLE OF CONTENTS

Understanding GeoParquet
Benefits of GeoParquet
The Role Of Apache Sedona
Generating GeoParquet Files With Apache Sedona
Analyze OMF GeoParquet Files with Sedona on Wherobots Cloud
Conclusion

Contributors

Feng Jiang

Senior software engineer At Microsoft. Loving to build the maps with decent tech.
Jia Yu

Jia Yu is the co-founder and Chief Architect of Wherobots. He also serves as the PMC Chair of Apache Sedona.

7 Mins Read 8 Jul 2026

How Bad Telemetry Data Sabotages Modern Fleets

By the Teams at Action Engine & Wherobots Fleet monitoring is undergoing a generational shift. Fleet monitoring, the systems that ingest and analyze vehicle telemetry to track fleet health, performance, and safety, has become the foundation of how operators run vehicles, not just track them. Modern vehicles generate orders of magnitude more telemetry than even […]

Stories + 2

3 Mins Read 8 Jul 2026

Introducing the Wherobots Innovation Edition, designed to accelerate your physical world objectives

The Wherobots Innovation Edition helps you deliver outcomes on top of spatial data that propel your organization forward, and make this data AI-ready. Today we are announcing the Wherobots Innovation Edition. This is an annual partnership that pairs the full Wherobots Cloud platform with our forward deployed spatial engineering expertise, developed over years of delivering […]

post

13 Mins Read 24 Jun 2026

Spatial Data in Apache Iceberg: Optimizations and Management That Matter

Spatial data in Apache Iceberg needs different optimization than tabular data. A geometry column has no natural sort order, so unsorted files carry wide, overlapping bounding boxes and query planners cannot prune them… At all… This behaviour turns a selective spatial filter into a full table scan. A second problem compounds it: one oversized geometry […]

Data Management + 3