A Hands-On Guide for Working with Large-Scale Spatial Data
The Overture Maps Foundation dataset is now Generally Available (GA), making efficient access to its vast open map data a top priority for developers, researchers, and organizations. In this blog post, we’ll explore how to seamlessly integrate and leverage Overture datasets within Wherobots, guiding you through creating some exciting visualizations.
The spatial catalog utilizes the open table format, Havasu Iceberg, which natively supports spatial types and enables performance optimization through spatial push-down and indexing. Havasu builds upon the solid foundation of Apache Iceberg, extending its capabilities to better serve geospatial data needs. As an enhancement of Iceberg tables, Havasu inherits all the benefits that make Iceberg a powerful choice for data management, such as schema evolution, time travel, and efficient metadata handling.
Update: Great news for geospatial data users! As the Iceberg community plans to natively support GEO datatypes, we’ll be transitioning Havasu to Iceberg native in the near future. This means all Wherobots users can expect full compatibility between Havasu and the Iceberg V3 spec, including these new GEO datatypes.
Wherobots aggregates open datasets from various sources, cleaning and transforming them into the Havasu Iceberg format. This process facilitates the integration of enterprise data with real-world physical information. Some premium datasets are available exclusively to our Pro Edition subscribers. For more detailed information about our open data offerings, please refer to the Wherobots Spatial Catalog.
Wherobots includes the latest Overture Maps release in its spatial catalog, offering users direct access to this and other popular geospatial datasets. We handle all the configuration and pre-processing, ensuring you always have up-to-date Overture data as new releases become available.
Overture organizes its datasets into distinct themes, each containing one or more types. For example, the buildings theme contains the building and building_part types, while the divisions theme contains division, division_area, and division_boundary.
Wherobots simplifies access to Overture data by seamlessly integrating it into the Wherobots Cloud platform. The raw Overture data is processed into Havasu Iceberg tables, making it readily available for analysis within the WherobotsDB environment. Users can leverage the power of spatial SQL queries within the Notebook experience to directly interact with and gain insights from these tables:
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

sedona.sql("SHOW TABLES IN wherobots_open_data.overture_maps_foundation").show(truncate=False)
+------------------------+---------------------------+-----------+
|namespace               |tableName                  |isTemporary|
+------------------------+---------------------------+-----------+
|overture_maps_foundation|places_place               |false      |
|overture_maps_foundation|transportation_segment     |false      |
|overture_maps_foundation|transportation_connector   |false      |
|overture_maps_foundation|buildings_building         |false      |
|overture_maps_foundation|divisions_division_area    |false      |
|overture_maps_foundation|buildings_building_part    |false      |
|overture_maps_foundation|base_land_cover            |false      |
|overture_maps_foundation|addresses_address          |false      |
|overture_maps_foundation|base_land_use              |false      |
|overture_maps_foundation|base_infrastructure        |false      |
|overture_maps_foundation|divisions_division_boundary|false      |
|overture_maps_foundation|base_land                  |false      |
|overture_maps_foundation|geocodes                   |false      |
|overture_maps_foundation|base_water                 |false      |
|overture_maps_foundation|divisions_division         |false      |
|overture_maps_foundation|base_bathymetry            |false      |
+------------------------+---------------------------+-----------+
Each table is accessible through a simple, standardized command. Below is an example for a single table, but you can access other tables in the same manner:
sedona.sql("SELECT * FROM wherobots_open_data.overture_maps_foundation.base_land_use").show()
Here’s an example of working with the tables in the Overture catalog:
sedona.sql("""
    SELECT subtype, COUNT(subtype) AS count
    FROM wherobots_open_data.overture_maps_foundation.buildings_building
    GROUP BY subtype
    ORDER BY count DESC
""").show(truncate=False)
+--------------+---------+
|subtype       |count    |
+--------------+---------+
|residential   |107898149|
|outbuilding   |7696532  |
|agricultural  |4290701  |
|commercial    |4008648  |
|industrial    |3520833  |
|education     |1686603  |
|religious     |1077831  |
|civic         |1015775  |
|service       |833565   |
|transportation|475667   |
|medical       |297068   |
|entertainment |200166   |
|military      |43198    |
|NULL          |0        |
+--------------+---------+
Overture has made a significant shift in how they distribute their datasets worldwide, adopting GeoParquet as their new format of choice. GeoParquet, an extension of the popular Parquet format, is specifically designed to handle geospatial data efficiently.
Update: More good news! Alongside Iceberg, Apache Parquet now also supports GEO data types, and we anticipate GeoParquet merging into the Parquet standard over time.
To create these GeoParquet files, Overture leverages Apache Sedona, an open-source project that’s gaining traction in the geospatial community. If you’re curious about how Overture uses Apache Sedona, a separate post dives deeper into the details. This move to GeoParquet isn’t just about keeping up with trends; it brings some real benefits.
One of the key advantages of using GeoParquet is its ability to support spatial filter push-down. This means that when you’re querying the data, you’re not wasting time and resources sifting through irrelevant information. Instead, the system only reads the files that are relevant to your specific query. This can lead to significant performance improvements, especially when dealing with large-scale geospatial datasets.
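The pruning idea behind spatial filter push-down can be sketched in plain Python. This is an illustrative toy, not Sedona's actual reader: each GeoParquet file advertises a bounding box in its metadata, and the reader skips any file whose box cannot intersect the query window. The file names and bounding boxes below are made up.

```python
def bbox_intersects(a, b):
    """Axis-aligned intersection test on (xmin, ymin, xmax, ymax) boxes."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Hypothetical file-level bbox metadata, as GeoParquet readers would see it
files = {
    "part-000.parquet": (-98.0, 30.0, -97.4, 30.6),  # around Austin
    "part-001.parquet": (-74.3, 40.4, -73.7, 41.0),  # around New York City
    "part-002.parquet": (2.2, 48.8, 2.5, 48.9),      # around Paris
}

query_window = (-97.99, 30.04, -97.46, 30.54)  # an Austin-area query

# Only files whose bbox can intersect the query window are actually read
to_scan = [name for name, bbox in files.items() if bbox_intersects(bbox, query_window)]
print(to_scan)  # ['part-000.parquet']
```

At planetary scale this is the difference between scanning a handful of files and scanning thousands, before a single geometry is deserialized.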
While GeoParquet is a step forward for geospatial data storage, Havasu Iceberg takes things even further. Unlike GeoParquet, which is essentially a file format, Havasu is a comprehensive table format. This means it can offer more advanced features like ACID transactions, which are crucial for maintaining data integrity in complex, multi-user environments. See this post on the benefits of Apache Iceberg for GEO datatypes.
Havasu Iceberg improves the developer experience over GeoParquet by abstracting away file management complexities. It presents data as SQL tables, allowing intuitive querying and manipulation using familiar operations, and adds support for write operations, partitioning, and other advanced features. This higher-level abstraction lets you focus on data analysis and application logic rather than low-level file handling, boosting productivity and easing maintenance.
Overture’s GA release introduces the “addresses” theme, which currently consists of a single type named “address”. This theme, still in its alpha stage, covers addresses from 14 countries. Due to its alpha status, the schema is subject to changes. The current schema structure is as follows:
root
 |-- id: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- bbox: struct (nullable = true)
 |    |-- xmin: float (nullable = true)
 |    |-- xmax: float (nullable = true)
 |    |-- ymin: float (nullable = true)
 |    |-- ymax: float (nullable = true)
 |-- country: string (nullable = true)
 |-- postcode: string (nullable = true)
 |-- street: string (nullable = true)
 |-- number: string (nullable = true)
 |-- unit: string (nullable = true)
 |-- address_levels: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- value: string (nullable = true)
 |-- postal_city: string (nullable = true)
 |-- version: integer (nullable = true)
 |-- sources: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- property: string (nullable = true)
 |    |    |-- dataset: string (nullable = true)
 |    |    |-- record_id: string (nullable = true)
 |    |    |-- update_time: string (nullable = true)
 |    |    |-- confidence: double (nullable = true)
Let’s look at a few examples to understand the new theme from Overture:
This query first generates an H3 index for every Point, then groups data sharing the same H3 index to count occurrences. Finally, it adjusts polygon longitudes derived from these H3 indexes to ensure accurate visualization:
address_h3 = sedona.sql("""
    -- Generate H3 cells at zoom level 2
    WITH address_ww AS (
        SELECT *, ST_H3CellIDs(geometry, 2, false) AS h3
        FROM wherobots_open_data.overture_maps_foundation.addresses_address
    ),
    -- Generate polygons representing the H3 cells, group the data by H3 index,
    -- and count how many records belong to each cell
    address_h3 AS (
        SELECT COUNT(*) AS num, ST_H3ToGeom(h3)[0] AS geometry, h3
        FROM address_ww
        GROUP BY h3
        ORDER BY num DESC
    )
    -- Finally, shift longitudes for polygons that cross the International Date Line
    SELECT num, h3,
        CASE WHEN ST_CrossesDateLine(geometry) = true
             THEN ST_ShiftLongitude(geometry)
             ELSE geometry
        END AS geometry
    FROM address_h3
""")
We can now visualize this aggregated data as a choropleth map using SedonaKepler.
North America has some missing values. Let’s take a closer look at addresses in the United States:
address_US = sedona.sql("""
    WITH address_US AS (
        SELECT *, ST_H3CellIDs(geometry, 3, false) AS h3
        FROM wherobots_open_data.overture_maps_foundation.addresses_address
        WHERE country = 'US'
    )
    SELECT COUNT(*) AS num, ST_H3ToGeom(h3)[0] AS geometry, h3
    FROM address_US
    GROUP BY h3
    ORDER BY num DESC
""")
The visualization reveals gaps in data coverage, likely due to privacy regulations in certain states or the absence of addresses in specific areas:
In this analysis, we explore the spatial accessibility of grocery stores in two major cities: Austin, Texas, and New York City, New York. Using the latest release of Overture Maps Foundation’s dataset, we aim to measure and compare the distances between addresses and the nearest grocery stores in each city. The process leverages the power of our new K-nearest neighbors (KNN) algorithm to identify the closest grocery stores to each address.
Accessibility to essential services like grocery stores is a crucial factor in urban planning, public health, and economic development. Understanding how far residents have to travel to reach a grocery store can reveal insights into the livability of different neighborhoods, the potential for food deserts, and the effectiveness of current infrastructure. For instance:
We begin by loading the Overture Maps Foundation’s datasets for addresses and places. The places dataset contains various categories, including grocery stores, which are our primary focus.
addresses_df = sedona.sql("SELECT * FROM wherobots_open_data.overture_maps_foundation.addresses_address")
places_df = sedona.sql("SELECT * FROM wherobots_open_data.overture_maps_foundation.places_place")
For each city, we define a geographic bounding box to filter the relevant addresses and places. Additionally, we apply a 1000-meter buffer around the bounding box used for places, so that addresses near the edges of the box still have all of their nearby places considered.
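The intuition behind the buffer can be sketched with a back-of-the-envelope calculation in plain Python. The constant below (roughly the length of one degree of latitude in meters) and the cosine scaling are approximations for illustration; `ST_Buffer(..., 1000, True)` does this properly on the spheroid.

```python
import math

# Approximate length of one degree of latitude, in meters (assumption)
METERS_PER_DEGREE = 111_320.0

def pad_bbox(xmin, ymin, xmax, ymax, meters):
    """Crudely pad a lon/lat bbox by `meters` on every side."""
    dlat = meters / METERS_PER_DEGREE
    # Longitude degrees shrink with latitude, so scale by cos(mid-latitude)
    mid_lat = math.radians((ymin + ymax) / 2)
    dlon = meters / (METERS_PER_DEGREE * math.cos(mid_lat))
    return (xmin - dlon, ymin - dlat, xmax + dlon, ymax + dlat)

# Pad the Austin bounding box used below by ~1000 m on every side
padded = pad_bbox(-97.998275, 30.045308, -97.465501, 30.543325, 1000)
```

Without this padding, an address sitting right on the bbox edge could be matched to a store inside the box even when a closer store lies just outside it.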
austin_addresses = addresses_df.where("ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((-97.998275 30.045308, -97.998275 30.543325, -97.465501 30.543325, -97.465501 30.045308, -97.998275 30.045308))'))")
austin_places = places_df.where("ST_Intersects(geometry, ST_Buffer(ST_GeomFromWKT('POLYGON ((-97.998275 30.045308, -97.998275 30.543325, -97.465501 30.543325, -97.465501 30.045308, -97.998275 30.045308))'), 1000, True))")
nyc_addresses = addresses_df.where("ST_Intersects(geometry, ST_GeomFromWKT('POLYGON ((-74.016067 40.566497, -73.943949 40.532062, -73.73584 40.581101, -73.735153 40.606128, -73.742708 40.612383, -73.737213 40.630625, -73.735153 40.667092, -73.720729 40.68584, -73.72279 40.71395, -73.717982 40.738928, -73.777049 40.791454, -73.790786 40.816402, -73.755071 40.850691, -73.744768 40.871982, -73.834743 40.896381, -73.838864 40.906761, -73.853974 40.910913, -73.86153 40.904685, -73.917163 40.919734, -73.962494 40.826795, -74.012689 40.754014, -74.025738 40.706143, -74.064201 40.653029, -74.092767 40.647822, -74.108564 40.647822, -74.125391 40.642351, -74.132259 40.643393, -74.181711 40.645998, -74.201973 40.631409, -74.20369 40.624634, -74.202316 40.617598, -74.201973 40.611343, -74.202316 40.606652, -74.197852 40.601178, -74.201286 40.593618, -74.206437 40.586318, -74.215709 40.559197, -74.231163 40.558414, -74.246617 40.549545, -74.250051 40.54198, -74.245587 40.520582, -74.254172 40.515101, -74.257606 40.508576, -74.258636 40.499178, -74.249708 40.493435, -74.227729 40.496306, -74.180338 40.50962, -74.130199 40.528672, -74.070101 40.577192, -74.016067 40.566497))'))")
nyc_places = places_df.where("ST_Intersects(geometry, ST_Buffer(ST_GeomFromWKT('POLYGON ((-74.016067 40.566497, -73.943949 40.532062, -73.73584 40.581101, -73.735153 40.606128, -73.742708 40.612383, -73.737213 40.630625, -73.735153 40.667092, -73.720729 40.68584, -73.72279 40.71395, -73.717982 40.738928, -73.777049 40.791454, -73.790786 40.816402, -73.755071 40.850691, -73.744768 40.871982, -73.834743 40.896381, -73.838864 40.906761, -73.853974 40.910913, -73.86153 40.904685, -73.917163 40.919734, -73.962494 40.826795, -74.012689 40.754014, -74.025738 40.706143, -74.064201 40.653029, -74.092767 40.647822, -74.108564 40.647822, -74.125391 40.642351, -74.132259 40.643393, -74.181711 40.645998, -74.201973 40.631409, -74.20369 40.624634, -74.202316 40.617598, -74.201973 40.611343, -74.202316 40.606652, -74.197852 40.601178, -74.201286 40.593618, -74.206437 40.586318, -74.215709 40.559197, -74.231163 40.558414, -74.246617 40.549545, -74.250051 40.54198, -74.245587 40.520582, -74.254172 40.515101, -74.257606 40.508576, -74.258636 40.499178, -74.249708 40.493435, -74.227729 40.496306, -74.180338 40.50962, -74.130199 40.528672, -74.070101 40.577192, -74.016067 40.566497))'), 1000, True))")
Next, we filter out the grocery stores from the places dataset using the primary category field. This allows us to focus on the relevant data for our analysis.
from pyspark.sql.functions import col

austin_groceries = austin_places.where(col("categories")["primary"] == "grocery_store").repartition(3)
nyc_groceries = nyc_places.where(col("categories")["primary"] == "grocery_store").repartition(5)
The core of our analysis is the K-nearest neighbors (KNN) algorithm, which we use to identify the nearest grocery store for each address. KNN is a powerful tool in spatial analysis, enabling us to efficiently find the closest points of interest (in this case, grocery stores) to each address.
In this context, we perform a spatial join using ST_KNN to calculate the distance between each address and its nearest grocery store. This method ensures that we accurately capture the nearest grocery store for every address in both cities.
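Conceptually, a k = 1 KNN join pairs each left-side row with its single closest right-side row. A minimal pure-Python sketch of that semantics, using brute force and Euclidean distance for brevity (the actual ST_KNN join uses spherical distance and spatial indexing, and the toy coordinates below are made up):

```python
# Toy data: ids mapped to (x, y) coordinates
addresses = {"a1": (0.0, 0.0), "a2": (5.0, 5.0)}
stores = {"s1": (1.0, 0.0), "s2": (4.0, 4.0)}

def nearest(point, candidates):
    """Return the id of the candidate closest to point (brute force)."""
    px, py = point
    return min(candidates, key=lambda c: (candidates[c][0] - px) ** 2 + (candidates[c][1] - py) ** 2)

# One output row per address: (address, its nearest store)
knn_join = {addr: nearest(pt, stores) for addr, pt in addresses.items()}
print(knn_join)  # {'a1': 's1', 'a2': 's2'}
```

The brute-force version is quadratic in the number of points; the value of a dedicated KNN join is doing the same pairing at city scale without comparing every address to every store.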
# Register the DataFrames as temp views so they can be referenced from SQL
austin_addresses.createOrReplaceTempView("austin_addresses")
austin_groceries.createOrReplaceTempView("austin_groceries")

austin_groceries_join = sedona.sql('''
    SELECT
        austin_addresses.id AS address_id,
        austin_addresses.country AS address_country,
        austin_addresses.postcode AS address_postcode,
        austin_addresses.street AS address_street,
        austin_addresses.number AS address_number,
        austin_addresses.unit AS address_unit,
        austin_addresses.GEOMETRY AS address_GEOM,
        austin_groceries.id AS place_id,
        austin_groceries.names['primary'] AS place_name,
        austin_groceries.GEOMETRY AS place_GEOM,
        ST_DISTANCESPHERE(austin_addresses.GEOMETRY, austin_groceries.GEOMETRY) AS DISTANCE
    FROM austin_addresses
    JOIN austin_groceries
        ON ST_KNN(austin_addresses.GEOMETRY, austin_groceries.GEOMETRY, 1, FALSE)
''')
nyc_addresses.createOrReplaceTempView("nyc_addresses")
nyc_groceries.createOrReplaceTempView("nyc_groceries")

nyc_groceries_join = sedona.sql('''
    SELECT
        nyc_addresses.id AS address_id,
        nyc_addresses.country AS address_country,
        nyc_addresses.postcode AS address_postcode,
        nyc_addresses.street AS address_street,
        nyc_addresses.number AS address_number,
        nyc_addresses.unit AS address_unit,
        nyc_addresses.GEOMETRY AS address_GEOM,
        nyc_groceries.id AS place_id,
        nyc_groceries.names['primary'] AS place_name,
        nyc_groceries.GEOMETRY AS place_GEOM,
        ST_DISTANCESPHERE(nyc_addresses.GEOMETRY, nyc_groceries.GEOMETRY) AS DISTANCE
    FROM nyc_addresses
    JOIN nyc_groceries
        ON ST_KNN(nyc_addresses.GEOMETRY, nyc_groceries.GEOMETRY, 1, FALSE)
''')
In this SQL query, the ST_KNN function is used to perform the KNN join, identifying the closest grocery store to each address. The ST_DISTANCESPHERE function then calculates the great-circle distance between the two points.
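For reference, the great-circle distance can be reproduced with the haversine formula on a spherical Earth. This is a sketch of the underlying math, not Sedona's implementation; the 6,371,008 m mean radius and the sample coordinates are assumptions, so Sedona's exact result may differ slightly.

```python
import math

def distance_sphere(lon1, lat1, lon2, lat2, radius=6_371_008.0):
    """Haversine great-circle distance in meters on a sphere of the given radius."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

# Two points in Austin, roughly 8.7 km apart
d = distance_sphere(-97.74, 30.27, -97.70, 30.20)
```

Great-circle distance is straight-line ("as the crow flies") distance on the sphere; road-network distances to a store will generally be longer.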
We calculate key statistical measures such as the mean, median, and standard deviation of the distances for both cities. These statistics provide insight into each city’s overall accessibility to grocery stores.
from pyspark.sql import functions as F

austin_stats = austin_groceries_join.agg(
    F.mean("DISTANCE").alias("mean"),
    F.median("DISTANCE").alias("median"),
    F.stddev("DISTANCE").alias("stddev"),
).collect()[0]
nyc_stats = nyc_groceries_join.agg(
    F.mean("DISTANCE").alias("mean"),
    F.median("DISTANCE").alias("median"),
    F.stddev("DISTANCE").alias("stddev"),
).collect()[0]
A histogram of distances to the nearest grocery store in each city reveals the stark difference in grocery-store walkability between the two cities.
In line with Wherobots’ mission to democratize access to open data and facilitate analysis of our world, we’ve made Overture data even more accessible and user-friendly. By leveraging the scalable compute capabilities of Wherobots Cloud, you can now effortlessly explore and analyze this rich geospatial dataset.
The combination of Overture’s comprehensive data and Wherobots’ powerful platform opens up a world of possibilities for further analysis. Imagine conducting walkability studies to assess pedestrian-friendliness, identifying optimal locations for new businesses based on proximity to key amenities, or even modeling traffic patterns to improve urban planning.
Stay tuned for more exciting explorations into the world of geospatial data analysis with Wherobots and Overture! Subscribe to our newsletter to stay up to date with the latest content.