A Hands-On Guide for Working with Large-Scale Spatial Data
The Overture Maps Foundation dataset is now Generally Available (GA), making efficient access to its vast open map data a top priority for developers, researchers, and organizations. In this blog post, we’ll explore how to seamlessly integrate and leverage Overture datasets within Wherobots, guiding you through creating some exciting visualizations.
The spatial catalog utilizes the open table format, Havasu Iceberg, which natively supports spatial types and enables performance optimization through spatial push-down and indexing. Havasu builds upon the solid foundation of Apache Iceberg, extending its capabilities to better serve geospatial data needs. As an enhancement of Iceberg tables, Havasu inherits all the benefits that make Iceberg a powerful choice for data management, such as schema evolution, time travel, and efficient metadata handling.
Update: Great news for geospatial data users! As the Iceberg community plans to natively support GEO datatypes, we’ll be transitioning Havasu to Iceberg native in the near future. This means all Wherobots users can expect full compatibility between Havasu and the Iceberg V3 spec, including these new GEO datatypes.
Wherobots aggregates open datasets from various sources, cleaning and transforming them into the Havasu Iceberg format. This process facilitates the integration of enterprise data with real-world physical information. Some premium datasets are available exclusively to our Pro Edition subscribers. For more detailed information about our open data offerings, please refer to the Wherobots Spatial Catalog.
Wherobots includes the latest Overture Maps release in its spatial catalog, offering users direct access to this and other popular geospatial datasets. We handle all the configuration and pre-processing, ensuring you always have up-to-date Overture data as new releases become available.
Overture organizes its datasets into distinct themes, each containing one or more types. For example, the buildings theme contains the building and building_part types, while the divisions theme contains division, division_area, and division_boundary.
Wherobots simplifies access to Overture data by seamlessly integrating it into the Wherobots Cloud platform. The raw Overture data is processed into Havasu Iceberg tables, making it readily available for analysis within the WherobotsDB environment. Users can leverage the power of spatial SQL queries within the Notebook experience to directly interact with and gain insights from these tables:
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

sedona.sql("SHOW TABLES IN wherobots_open_data.overture_maps_foundation").show(truncate=False)
+------------------------+---------------------------+-----------+
|namespace               |tableName                  |isTemporary|
+------------------------+---------------------------+-----------+
|overture_maps_foundation|places_place               |false      |
|overture_maps_foundation|transportation_segment     |false      |
|overture_maps_foundation|transportation_connector   |false      |
|overture_maps_foundation|buildings_building         |false      |
|overture_maps_foundation|divisions_division_area    |false      |
|overture_maps_foundation|buildings_building_part    |false      |
|overture_maps_foundation|base_land_cover            |false      |
|overture_maps_foundation|addresses_address          |false      |
|overture_maps_foundation|base_land_use              |false      |
|overture_maps_foundation|base_infrastructure        |false      |
|overture_maps_foundation|divisions_division_boundary|false      |
|overture_maps_foundation|base_land                  |false      |
|overture_maps_foundation|geocodes                   |false      |
|overture_maps_foundation|base_water                 |false      |
|overture_maps_foundation|divisions_division         |false      |
|overture_maps_foundation|base_bathymetry            |false      |
+------------------------+---------------------------+-----------+
Each table is accessible through a simple, standardized command. Below is an example for a single table, but you can access other tables in the same manner:
sedona.sql("SELECT * FROM wherobots_open_data.overture_maps_foundation.base_land_use").show()
Here’s an example of working with the tables in the Overture catalog:
sedona.sql("""
    SELECT subtype, COUNT(subtype) AS count
    FROM wherobots_open_data.overture_maps_foundation.buildings_building
    GROUP BY subtype
    ORDER BY count DESC
""").show(truncate=False)
+--------------+---------+
|subtype       |count    |
+--------------+---------+
|residential   |107898149|
|outbuilding   |7696532  |
|agricultural  |4290701  |
|commercial    |4008648  |
|industrial    |3520833  |
|education     |1686603  |
|religious     |1077831  |
|civic         |1015775  |
|service       |833565   |
|transportation|475667   |
|medical       |297068   |
|entertainment |200166   |
|military      |43198    |
|NULL          |0        |
+--------------+---------+
Overture has made a significant shift in how they distribute their datasets worldwide, adopting GeoParquet as their new format of choice. GeoParquet, an extension of the popular Parquet format, is specifically designed to handle geospatial data efficiently.
Update: More good news! Alongside Iceberg, Apache Parquet now also supports GEO data types, and we anticipate GeoParquet merging into the Parquet standard over time.
To create these GeoParquet files, Overture leverages Apache Sedona, an open-source project that’s gaining traction in the geospatial community. If you’re curious about how Overture uses Apache Sedona, a separate post dives deeper into the details. This move to GeoParquet isn’t just about keeping up with trends; it brings some real benefits.
One of the key advantages of using GeoParquet is its ability to support spatial filter push-down. This means that when you’re querying the data, you’re not wasting time and resources sifting through irrelevant information. Instead, the system only reads the files that are relevant to your specific query. This can lead to significant performance improvements, especially when dealing with large-scale geospatial datasets.
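The pruning idea behind spatial filter push-down can be sketched in plain Python. This is an illustrative toy, not Sedona's actual reader: each GeoParquet file advertises a bounding box in its metadata, and the reader skips any file whose box cannot intersect the query window. The file names and bounding boxes below are made up.

```python
def bbox_intersects(a, b):
    """Axis-aligned intersection test on (xmin, ymin, xmax, ymax) boxes."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Hypothetical file-level bbox metadata, as GeoParquet readers would see it
files = {
    "part-000.parquet": (-98.0, 30.0, -97.4, 30.6),  # around Austin
    "part-001.parquet": (-74.3, 40.4, -73.7, 41.0),  # around New York City
    "part-002.parquet": (2.2, 48.8, 2.5, 48.9),      # around Paris
}

query_window = (-97.99, 30.04, -97.46, 30.54)  # an Austin-area query

# Only files whose bbox can intersect the query window are actually read
to_scan = [name for name, bbox in files.items() if bbox_intersects(bbox, query_window)]
print(to_scan)  # ['part-000.parquet']
```

At planetary scale this is the difference between scanning a handful of files and scanning thousands, before a single geometry is deserialized.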
While GeoParquet is a step forward for geospatial data storage, Havasu Iceberg takes things even further. Unlike GeoParquet, which is essentially a file format, Havasu is a comprehensive table format. This means it can offer more advanced features like ACID transactions, which are crucial for maintaining data integrity in complex, multi-user environments. See this post on the benefits of Apache Iceberg for GEO datatypes.
Havasu Iceberg improves the developer experience over GeoParquet by abstracting away file management complexities. It presents data as SQL tables, allowing intuitive querying and manipulation using familiar operations, and adds support for write operations, partitioning, and other advanced features. This higher-level abstraction lets you focus on data analysis and application logic rather than low-level file handling, boosting productivity and easing maintenance.
Overture’s GA release introduces the “addresses” theme, which currently consists of a single type named “address”. This theme, still in its alpha stage, covers addresses from 14 countries. Due to its alpha status, the schema is subject to changes. The current schema structure is as follows:
root
 |-- id: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- bbox: struct (nullable = true)
 |    |-- xmin: float (nullable = true)
 |    |-- xmax: float (nullable = true)
 |    |-- ymin: float (nullable = true)
 |    |-- ymax: float (nullable = true)
 |-- country: string (nullable = true)
 |-- postcode: string (nullable = true)
 |-- street: string (nullable = true)
 |-- number: string (nullable = true)
 |-- unit: string (nullable = true)
 |-- address_levels: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- value: string (nullable = true)
 |-- postal_city: string (nullable = true)
 |-- version: integer (nullable = true)
 |-- sources: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- property: string (nullable = true)
 |    |    |-- dataset: string (nullable = true)
 |    |    |-- record_id: string (nullable = true)
 |    |    |-- update_time: string (nullable = true)
 |    |    |-- confidence: double (nullable = true)
Let’s look at a few examples to understand the new theme from Overture:
This query first generates an H3 index for every Point, then groups data sharing the same H3 index to count occurrences. Finally, it adjusts polygon longitudes derived from these H3 indexes to ensure accurate visualization:
address_h3 = sedona.sql("""
    -- Generate H3 cells at zoom level 2
    WITH address_ww AS (
        SELECT *, ST_H3CellIDs(geometry, 2, false) AS h3
        FROM wherobots_open_data.overture_maps_foundation.addresses_address
    ),
    -- Generate polygons representing the H3 cells, group the data by H3 index,
    -- and count how many records belong to each cell
    address_h3 AS (
        SELECT COUNT(*) AS num, ST_H3ToGeom(h3)[0] AS geometry, h3
        FROM address_ww
        GROUP BY h3
        ORDER BY num DESC
    )
    -- Finally, shift longitudes for polygons that cross the International Date Line
    SELECT num, h3,
        CASE WHEN ST_CrossesDateLine(geometry) = true
             THEN ST_ShiftLongitude(geometry)
             ELSE geometry
        END AS geometry
    FROM address_h3
""")
We can now visualize this aggregated data as a choropleth map using SedonaKepler.
North America has some missing values. Let’s take a closer look at addresses in the United States:
address_US = sedona.sql("""
    WITH address_US AS (
        SELECT *, ST_H3CellIDs(geometry, 3, false) AS h3
        FROM wherobots_open_data.overture_maps_foundation.addresses_address
        WHERE country = 'US'
    )
    SELECT COUNT(*) AS num, ST_H3ToGeom(h3)[0] AS geometry, h3
    FROM address_US
    GROUP BY h3
    ORDER BY num DESC
""")
The visualization reveals gaps in data coverage, likely due to privacy regulations in certain states or the absence of addresses in specific areas:
In this analysis, we explore the spatial accessibility of grocery stores in two major cities: Austin, Texas, and New York City, New York. Using the latest release of Overture Maps Foundation’s dataset, we aim to measure and compare the distances between addresses and the nearest grocery stores in each city. The process leverages the power of our new K-nearest neighbors (KNN) algorithm to identify the closest grocery stores to each address.
Accessibility to essential services like grocery stores is a crucial factor in urban planning, public health, and economic development. Understanding how far residents have to travel to reach a grocery store can reveal insights into the livability of different neighborhoods, the potential for food deserts, and the effectiveness of current infrastructure. For instance:
We begin by loading the Overture Maps Foundation’s datasets for addresses and places. The places dataset contains various categories, including grocery stores, which are our primary focus.
addresses_df = sedona.sql("SELECT * FROM wherobots_open_data.overture_maps_foundation.addresses_address")
places_df = sedona.sql("SELECT * FROM wherobots_open_data.overture_maps_foundation.places_place")
For each city, we define a geographic bounding box to filter the relevant addresses and places. Additionally, we apply a 1000-meter buffer around the bounding box used for places, so that addresses near the edges of the box still have all of their nearby places considered.
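The intuition behind the buffer can be sketched with a back-of-the-envelope calculation in plain Python. The constant below (roughly the length of one degree of latitude in meters) and the cosine scaling are approximations for illustration; `ST_Buffer(..., 1000, True)` does this properly on the spheroid.

```python
import math

# Approximate length of one degree of latitude, in meters (assumption)
METERS_PER_DEGREE = 111_320.0

def pad_bbox(xmin, ymin, xmax, ymax, meters):
    """Crudely pad a lon/lat bbox by `meters` on every side."""
    dlat = meters / METERS_PER_DEGREE
    # Longitude degrees shrink with latitude, so scale by cos(mid-latitude)
    mid_lat = math.radians((ymin + ymax) / 2)
    dlon = meters / (METERS_PER_DEGREE * math.cos(mid_lat))
    return (xmin - dlon, ymin - dlat, xmax + dlon, ymax + dlat)

# Pad the Austin bounding box used below by ~1000 m on every side
padded = pad_bbox(-97.998275, 30.045308, -97.465501, 30.543325, 1000)
```

Without this padding, an address sitting right on the bbox edge could be matched to a store inside the box even when a closer store lies just outside it.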
austin_addresses = addresses_df.where("ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((-97.998275 30.045308, -97.998275 30.543325, -97.465501 30.543325, -97.465501 30.045308, -97.998275 30.045308))'))")
austin_places = places_df.where("ST_Intersects(geometry, ST_Buffer(ST_GeomFromWKT('POLYGON ((-97.998275 30.045308, -97.998275 30.543325, -97.465501 30.543325, -97.465501 30.045308, -97.998275 30.045308))'), 1000, True))")
nyc_addresses = addresses_df.where("ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((-74.016067 40.566497, -73.943949 40.532062, -73.73584 40.581101, -73.735153 40.606128, -73.742708 40.612383, -73.737213 40.630625, -73.735153 40.667092, -73.720729 40.68584, -73.72279 40.71395, -73.717982 40.738928, -73.777049 40.791454, -73.790786 40.816402, -73.755071 40.850691, -73.744768 40.871982, -73.834743 40.896381, -73.838864 40.906761, -73.853974 40.910913, -73.86153 40.904685, -73.917163 40.919734, -73.962494 40.826795, -74.012689 40.754014, -74.025738 40.706143, -74.064201 40.653029, -74.092767 40.647822, -74.108564 40.647822, -74.125391 40.642351, -74.132259 40.643393, -74.181711 40.645998, -74.201973 40.631409, -74.20369 40.624634, -74.202316 40.617598, -74.201973 40.611343, -74.202316 40.606652, -74.197852 40.601178, -74.201286 40.593618, -74.206437 40.586318, -74.215709 40.559197, -74.231163 40.558414, -74.246617 40.549545, -74.250051 40.54198, -74.245587 40.520582, -74.254172 40.515101, -74.257606 40.508576, -74.258636 40.499178, -74.249708 40.493435, -74.227729 40.496306, -74.180338 40.50962, -74.130199 40.528672, -74.070101 40.577192, -74.016067 40.566497))'))")
nyc_places = places_df.where("ST_Intersects(geometry, ST_Buffer(ST_GeomFromWKT('POLYGON ((-74.016067 40.566497, -73.943949 40.532062, -73.73584 40.581101, -73.735153 40.606128, -73.742708 40.612383, -73.737213 40.630625, -73.735153 40.667092, -73.720729 40.68584, -73.72279 40.71395, -73.717982 40.738928, -73.777049 40.791454, -73.790786 40.816402, -73.755071 40.850691, -73.744768 40.871982, -73.834743 40.896381, -73.838864 40.906761, -73.853974 40.910913, -73.86153 40.904685, -73.917163 40.919734, -73.962494 40.826795, -74.012689 40.754014, -74.025738 40.706143, -74.064201 40.653029, -74.092767 40.647822, -74.108564 40.647822, -74.125391 40.642351, -74.132259 40.643393, -74.181711 40.645998, -74.201973 40.631409, -74.20369 40.624634, -74.202316 40.617598, -74.201973 40.611343, -74.202316 40.606652, -74.197852 40.601178, -74.201286 40.593618, -74.206437 40.586318, -74.215709 40.559197, -74.231163 40.558414, -74.246617 40.549545, -74.250051 40.54198, -74.245587 40.520582, -74.254172 40.515101, -74.257606 40.508576, -74.258636 40.499178, -74.249708 40.493435, -74.227729 40.496306, -74.180338 40.50962, -74.130199 40.528672, -74.070101 40.577192, -74.016067 40.566497))'), 1000, True))")
Next, we filter out the grocery stores from the places dataset using the primary category field. This allows us to focus on the relevant data for our analysis.
from pyspark.sql.functions import col

austin_groceries = austin_places.where(col("categories")["primary"] == "grocery_store").repartition(3)
nyc_groceries = nyc_places.where(col("categories")["primary"] == "grocery_store").repartition(5)
The core of our analysis is the K-nearest neighbors (KNN) algorithm, which we use to identify the nearest grocery store for each address. KNN is a powerful tool in spatial analysis, enabling us to efficiently find the closest points of interest (in this case, grocery stores) to each address.
In this context, we perform a spatial join using ST_KNN to calculate the distance between each address and its nearest grocery store. This method ensures that we accurately capture the nearest grocery store for every address in both cities.
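Conceptually, a k = 1 KNN join pairs each left-side row with its single closest right-side row. A minimal pure-Python sketch of that semantics, using brute force and Euclidean distance for brevity (the actual ST_KNN join uses spherical distance and spatial indexing, and the toy coordinates below are made up):

```python
# Toy data: ids mapped to (x, y) coordinates
addresses = {"a1": (0.0, 0.0), "a2": (5.0, 5.0)}
stores = {"s1": (1.0, 0.0), "s2": (4.0, 4.0)}

def nearest(point, candidates):
    """Return the id of the candidate closest to point (brute force)."""
    px, py = point
    return min(candidates, key=lambda c: (candidates[c][0] - px) ** 2 + (candidates[c][1] - py) ** 2)

# One output row per address: (address, its nearest store)
knn_join = {addr: nearest(pt, stores) for addr, pt in addresses.items()}
print(knn_join)  # {'a1': 's1', 'a2': 's2'}
```

The brute-force version is quadratic in the number of points; the value of a dedicated KNN join is doing the same pairing at city scale without comparing every address to every store.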
# Register the DataFrames as temp views so they can be referenced from SQL
austin_addresses.createOrReplaceTempView("austin_addresses")
austin_groceries.createOrReplaceTempView("austin_groceries")

austin_groceries_join = sedona.sql('''
    SELECT
        austin_addresses.id AS address_id,
        austin_addresses.country AS address_country,
        austin_addresses.postcode AS address_postcode,
        austin_addresses.street AS address_street,
        austin_addresses.number AS address_number,
        austin_addresses.unit AS address_unit,
        austin_addresses.GEOMETRY AS address_GEOM,
        austin_groceries.id AS place_id,
        austin_groceries.names['primary'] AS place_name,
        austin_groceries.GEOMETRY AS place_GEOM,
        ST_DISTANCESPHERE(austin_addresses.GEOMETRY, austin_groceries.GEOMETRY) AS DISTANCE
    FROM austin_addresses
    JOIN austin_groceries
        ON ST_KNN(austin_addresses.GEOMETRY, austin_groceries.GEOMETRY, 1, FALSE)
''')
nyc_addresses.createOrReplaceTempView("nyc_addresses")
nyc_groceries.createOrReplaceTempView("nyc_groceries")

nyc_groceries_join = sedona.sql('''
    SELECT
        nyc_addresses.id AS address_id,
        nyc_addresses.country AS address_country,
        nyc_addresses.postcode AS address_postcode,
        nyc_addresses.street AS address_street,
        nyc_addresses.number AS address_number,
        nyc_addresses.unit AS address_unit,
        nyc_addresses.GEOMETRY AS address_GEOM,
        nyc_groceries.id AS place_id,
        nyc_groceries.names['primary'] AS place_name,
        nyc_groceries.GEOMETRY AS place_GEOM,
        ST_DISTANCESPHERE(nyc_addresses.GEOMETRY, nyc_groceries.GEOMETRY) AS DISTANCE
    FROM nyc_addresses
    JOIN nyc_groceries
        ON ST_KNN(nyc_addresses.GEOMETRY, nyc_groceries.GEOMETRY, 1, FALSE)
''')
In this SQL query, the ST_KNN function is used to perform the KNN join, identifying the closest grocery store to each address. The ST_DISTANCESPHERE function then calculates the great-circle distance between the two points.
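For reference, the great-circle distance can be reproduced with the haversine formula on a spherical Earth. This is a sketch of the underlying math, not Sedona's implementation; the 6,371,008 m mean radius and the sample coordinates are assumptions, so Sedona's exact result may differ slightly.

```python
import math

def distance_sphere(lon1, lat1, lon2, lat2, radius=6_371_008.0):
    """Haversine great-circle distance in meters on a sphere of the given radius."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

# Two points in Austin, roughly 8.7 km apart
d = distance_sphere(-97.74, 30.27, -97.70, 30.20)
```

Great-circle distance is straight-line ("as the crow flies") distance on the sphere; road-network distances to a store will generally be longer.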
We calculate key statistical measures such as the mean, median, and standard deviation of the distances for both cities. These statistics provide insight into each city’s overall accessibility to grocery stores.
from pyspark.sql import functions as F

austin_stats = austin_groceries_join.agg(
    F.mean("DISTANCE").alias("mean"),
    F.median("DISTANCE").alias("median"),
    F.stddev("DISTANCE").alias("stddev"),
).collect()[0]
nyc_stats = nyc_groceries_join.agg(
    F.mean("DISTANCE").alias("mean"),
    F.median("DISTANCE").alias("median"),
    F.stddev("DISTANCE").alias("stddev"),
).collect()[0]
A histogram of distances to the nearest grocery store in each city reveals the stark difference in grocery-store walkability between the two cities.
In line with Wherobots’ mission to democratize access to open data and facilitate analysis of our world, we’ve made Overture data even more accessible and user-friendly. By leveraging the scalable compute capabilities of Wherobots Cloud, you can now effortlessly explore and analyze this rich geospatial dataset.
The combination of Overture’s comprehensive data and Wherobots’ powerful platform opens up a world of possibilities for further analysis. Imagine conducting walkability studies to assess pedestrian-friendliness, identifying optimal locations for new businesses based on proximity to key amenities, or even modeling traffic patterns to improve urban planning.
Stay tuned for more exciting explorations into the world of geospatial data analysis with Wherobots and Overture! Subscribe to our newsletter to stay up to date with the latest content.