7 Dec 2023

Analyzing The Overture Maps Places Dataset Using Apache Sedona, Wherobots Cloud, & GeoParquet

Authors

Pranav Toggi

Visualizing Overture Maps points of interest

Introduction

Overture Maps, supported by the Overture Maps Foundation (OMF), offers a comprehensive geospatial dataset, now in GeoParquet format, organized into themes such as places of interest, buildings, transportation networks, and administrative boundaries. GeoParquet extends the standard Parquet format with standardized metadata for geometry columns, such as the geometry encoding, coordinate reference system, and bounding box, which makes it well suited to geospatial analytics. Unlike plain Parquet, where a geometry is just another binary or string column, GeoParquet lets readers identify and handle spatial columns consistently and, when bounding-box information is available, skip data that falls outside the area of interest.

This article aims to showcase the practical applications and benefits of Overture Maps data available in the Wherobots Open Data Catalog. By delving into real-world use cases, we demonstrate how the Overture Maps dataset enables deeper and faster insights into urban dynamics and broadens the scope for advanced geospatial analysis.

To follow along, first create a free account in Wherobots Cloud.

Data Schema for Places Theme in Overture Maps

The Places theme in Overture Maps represents point locations of various facilities, services, or amenities. Key schema design choices include:

Extensible Attributes: Basic common attributes such as phone, mail, website, and brand are included. Additional attributes not currently in the official release are allowed with an “ext” prefix. Attributes specific to certain types of places are planned for future inclusion.
Controlled Categories: A hierarchical categorization system (taxonomy) allows for the transformation of various categorization systems to the Overture framework. This taxonomy is intended to be comprehensive and will be fine-tuned over time.

Schema Representation

root
|-- id: string (nullable = true)
|-- updatetime: string (nullable = true)
|-- version: integer (nullable = true)
|-- names: map (nullable = true)
|    |-- key: string
|    |-- value: array (valueContainsNull = true)
|    |    |-- element: map (containsNull = true)
|    |    |    |-- key: string
|    |    |    |-- value: string (valueContainsNull = true)
|-- categories: struct (nullable = true)
|    |-- main: string (nullable = true)
|    |-- alternate: array (nullable = true)
|    |    |-- element: string (containsNull = true)
|-- confidence: double (nullable = true)
|-- websites: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- socials: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- emails: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- phones: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- brand: struct (nullable = true)
|    |-- names: map (nullable = true)
|    |    |-- key: string
|    |    |-- value: array (valueContainsNull = true)
|    |    |    |-- element: map (containsNull = true)
|    |    |    |    |-- key: string
|    |    |    |    |-- value: string (valueContainsNull = true)
|    |-- wikidata: string (nullable = true)
|-- addresses: array (nullable = true)
|    |-- element: map (containsNull = true)
|    |    |-- key: string
|    |    |-- value: string (valueContainsNull = true)
|-- sources: array (nullable = true)
|    |-- element: map (containsNull = true)
|    |    |-- key: string
|    |    |-- value: string (valueContainsNull = true)
|-- bbox: struct (nullable = true)
|    |-- minx: double (nullable = true)
|    |-- maxx: double (nullable = true)
|    |-- miny: double (nullable = true)
|    |-- maxy: double (nullable = true)
|-- geometry: geometry (nullable = true)
|-- geohash: string (nullable = true)
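
As a rough illustration of how these nested types are navigated, the plain-Python sketch below models one record as dictionaries and lists that mirror the schema above. The record values, the "language" key, and the helper first_common_name are hypothetical and invented for illustration; only the field layout comes from the schema.

```python
from typing import Optional

# Hypothetical record shaped like the Places schema (all values are invented).
record = {
    "id": "example-id",
    # names: map<string, array<map<string, string>>>
    "names": {"common": [{"value": "Example Cafe", "language": "local"}]},
    # categories: struct<main: string, alternate: array<string>>
    "categories": {"main": "coffee_shop", "alternate": ["cafe", "bagel_shop"]},
}


def first_common_name(rec: dict) -> Optional[str]:
    """Return names["common"][0]["value"], or None if any level is missing."""
    common = (rec.get("names") or {}).get("common") or []
    return common[0].get("value") if common else None


print(first_common_name(record))           # Example Cafe
print(record["categories"]["main"])        # coffee_shop
print(first_common_name({"names": None}))  # None
```

The same path appears later in this article in Spark syntax as names.common[0].value; the None handling here reflects that every level of the schema is nullable.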

Accessing Overture Maps Places Dataset

To analyze the data from Overture Maps, we first create a SedonaContext and connect it to the Wherobots Open Data Catalog as follows:

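from sedona.spark import *
# Spark SQL helpers used by the snippets later in this article
from pyspark.sql.functions import array_contains, col, count, countDistinct, explode, rank
from pyspark.sql.window import Window

# Point Spark at the wherobots_examples catalog (Iceberg SparkCatalog with an S3 warehouse)
config = (
    SedonaContext.builder()
    .config("spark.sql.catalog.wherobots_examples.type", "hadoop")
    .config("spark.sql.catalog.wherobots_examples", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.wherobots_examples.warehouse", "s3://wherobots-examples-prod/havasu/warehouse")
    .config("spark.sql.catalog.wherobots_examples.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

sedona = SedonaContext.create(config)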

Next, we load the Places theme of Overture Maps as a DataFrame:

places_df = sedona.table("wherobots_examples.overture.places_place")

Spatial Filtering for NYC Metropolitan Area

For all use cases in this article, we focus on the New York City (NYC) metropolitan area. We apply spatial filtering to limit our dataset to this specific area, using the bounding box coordinates of New York City.

spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"
places_df = places_df.where(spatial_filter)
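
To make the effect of this filter concrete, here is a minimal plain-Python sketch of the same test for point geometries. It is not how Sedona evaluates the predicate; the function name point_within_envelope and the sample coordinates are assumptions made for illustration, while the envelope values are the ones used above.

```python
# Envelope from the filter above: (min_lon, min_lat, max_lon, max_lat).
NYC_ENVELOPE = (-74.25909, 40.477399, -73.700181, 40.917577)


def point_within_envelope(lon: float, lat: float, envelope=NYC_ENVELOPE) -> bool:
    """Return True if the point lies strictly inside the rectangle.

    Points exactly on an edge return False, following the usual "within"
    definition, which requires the point to be in the polygon's interior.
    """
    min_lon, min_lat, max_lon, max_lat = envelope
    return min_lon < lon < max_lon and min_lat < lat < max_lat


print(point_within_envelope(-73.9857, 40.7484))  # True  (Midtown Manhattan)
print(point_within_envelope(-75.1652, 39.9526))  # False (Philadelphia)
print(point_within_envelope(-74.25909, 40.7))    # False (on the western edge)
```

Note that the envelope takes longitude first and latitude second, matching the argument order of ST_PolygonFromEnvelope in the filter.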

To illustrate the comprehensive coverage of the dataset, the following map showcases 278,998 points of interest just in the New York City area.

Visualizing places from the Overture Maps dataset

Elevate key nested fields to top level columns

To facilitate easier aggregation and analysis, it’s important to transform certain nested fields into top-level columns. In our dataset, we focus on the ‘main’ and ‘alternate’ subcategories within the ‘categories’ column of the places dataset.

First, we create a new column ‘category’ that directly holds the values from ‘categories.main’:

places_df = places_df.withColumn("category", col("categories.main"))

Next, we use the explode function to transform the ‘alternate’ subcategories. The explode function is used to expand an array or map column into multiple rows. When applied to the ‘categories.alternate’ array, each element in the array is turned into a separate row, effectively creating a new row for each alternate category associated with the same place.

places_df_exploded = places_df.withColumn("alternate_category", explode("categories.alternate"))

Here’s what the explode transformation looks like:

Before applying transformation:

+--------------------+--------------------------------------------+
|id                  |categories                                  |
+--------------------+--------------------------------------------+
|tmp_F36B3571B3E58...|{hvac_services, [industrial_equipment]}     |
|tmp_240555DC4354D...|{elementary_school, [school, public_school]}|
+--------------------+--------------------------------------------+

After applying transformation:

+------------------------------------+-----------------+--------------------+
|id                                  |category         |alternate_category  |
+------------------------------------+-----------------+--------------------+
|tmp_F36B3571B3E583C482BD02CAC65657B6|hvac_services    |industrial_equipment|
|tmp_240555DC4354D0975F72960E276D481C|elementary_school|school              |
|tmp_240555DC4354D0975F72960E276D481C|elementary_school|public_school       |
+------------------------------------+-----------------+--------------------+
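
The row-multiplying behaviour can also be sketched in plain Python. The helper explode_alternates below is a hypothetical stand-in that mimics the semantics of explode on the two sample rows above, plus one invented row without alternates; it is not Spark code.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, str, Optional[list]]  # (id, category, alternate categories)


def explode_alternates(rows: Iterable[Row]) -> Iterator[Tuple[str, str, str]]:
    """Yield one (id, category, alternate_category) tuple per array element.

    Rows whose array is None or empty yield nothing, mirroring how Spark's
    explode drops them (explode_outer would keep them with a null instead).
    """
    for place_id, category, alternates in rows:
        for alt in alternates or []:
            yield place_id, category, alt


rows = [
    ("tmp_F36B3571B3E583C482BD02CAC65657B6", "hvac_services", ["industrial_equipment"]),
    ("tmp_240555DC4354D0975F72960E276D481C", "elementary_school", ["school", "public_school"]),
    ("hypothetical_id_without_alternates", "park", None),  # invented row
]

exploded = list(explode_alternates(rows))
for row in exploded:
    print(row)
```

One consequence worth keeping in mind: places with no alternate categories disappear from the exploded DataFrame, so counts based on it cover only places that have at least one alternate category.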

Explore categories

Group the data by ‘category’ and count the occurrences to understand the distribution of categories. After GroupBy, the categories are ranked based on number of occurrences. This tells us about the most common business categories in NYC.

categories_df = places_df.groupBy("category").agg(count("*").alias("count"))
categories_df = categories_df.orderBy("count", ascending=False)
windowSpec = Window.orderBy(col("count").desc())
categories_df = categories_df.withColumn("overall_rank", rank().over(windowSpec))
categories_df.show(10, truncate=False)

This gives us the following output:

+--------------------------------+-----+------------+
|category                        |count|overall_rank|
+--------------------------------+-----+------------+
|beauty_salon                    |10919|1           |
|community_services_non_profits  |5963 |2           |
|church_cathedral                |5507 |3           |
|professional_services           |4675 |4           |
|landmark_and_historical_building|4436 |5           |
|hospital                        |4035 |6           |
|dentist                         |3538 |7           |
|real_estate                     |3330 |8           |
|park                            |3171 |9           |
|school                          |3016 |10          |
+--------------------------------+-----+------------+
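
For readers unfamiliar with the rank() window function, the following plain-Python sketch reproduces its tie handling. The function rank_desc and the two "hypothetical_" categories are assumptions added for illustration; the other three counts are taken from the table above.

```python
def rank_desc(counts: dict) -> dict:
    """Assign SQL-style rank() by descending count.

    Tied counts share a rank and the next rank skips ahead (1, 2, 3, 3, 5, ...).
    """
    ranks = {}
    previous_count = None
    current_rank = 0
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for position, (category, count) in enumerate(ordered, start=1):
        if count != previous_count:
            # A new count value takes the row's position as its rank.
            current_rank = position
            previous_count = count
        ranks[category] = current_rank
    return ranks


counts = {
    "beauty_salon": 10919,
    "community_services_non_profits": 5963,
    "church_cathedral": 5507,
    "hypothetical_tied_category": 5507,  # invented to show a tie
    "hypothetical_small_category": 100,  # invented to show the skipped rank
}
ranks = rank_desc(counts)
print(ranks)
```

Here the two categories with 5507 places both receive rank 3, and the next category receives rank 5 rather than 4. This matters later in the article, where differences between ranks are compared.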

Explore Coffee Shops

Let's explore the coffee shop category in more detail.

coffee_df = places_df.filter(places_df.category == "coffee_shop")
coffee_alt_cats = places_df_exploded.filter(places_df_exploded.category == "coffee_shop").groupBy("alternate_category").agg(count("*").alias("count"))
coffee_alt_cats = coffee_alt_cats.orderBy("count", ascending = False)
coffee_alt_cats.show(11, truncate=False)

We group the coffee shop rows by ‘alternate_category’, count the occurrences, and order the result so that the most common alternate categories among coffee shops appear first.

The bar chart below shows the relative frequency of each alternate category as a percentage of total coffee shops.

Now, let's filter coffee_df down to coffee shops that list ‘bagel_shop’ as an alternate category. This is useful if you want to grab coffee and a bagel on your way to work without standing in line at two different shops.

coffee_bagel_df = coffee_df.filter(array_contains(coffee_df.categories.alternate,"bagel_shop"))
coffee_bagel_df = coffee_bagel_df.select(coffee_bagel_df.id, coffee_bagel_df.names, coffee_bagel_df.geometry)
coffee_bagel_df = coffee_bagel_df.withColumn("name", col("names.common")[0]["value"]).drop("names")

Visualizing coffee shops that make bagels in NYC using SedonaKepler:

Visualizing points of interest from Overture Maps

Exploring Stadiums Arena category

Let’s imagine we want to analyze places where we might see a show or sports event, such as stadiums and arenas, and understand the types of businesses located within walking distance. This analysis can provide insights into the commercial ecosystem surrounding entertainment venues and help us understand the urban dynamics in these areas.

We begin by filtering for the ‘stadium_arena’ category and then creating temporary views for places and arenas. In PySpark, a DataFrame must be registered as a temporary view or table before SQL commands can be run against it.

arena_df = places_df.filter(places_df.category == "stadium_arena")
arena_df.createOrReplaceTempView("Arenas")
places_df.createOrReplaceTempView("Places")

To identify businesses close to stadium arenas, we perform a spatial join.

The following SQL query finds businesses within 0.002 units of a stadium arena. Because the geometries use longitude/latitude coordinates, the unit is degrees: 0.002 degrees is roughly 220 m north-south, and somewhat less east-west, at NYC's latitude, which we assume here to be a walkable distance of about one block. The query uses ST_Buffer to expand each arena geometry by 0.002 degrees, creating a search area, and ST_Intersects to check which places fall inside it.

arena_places = sedona.sql('''
    SELECT
        Places.id AS places_id,
        Places.geometry AS places_geometry,
        Places.category AS places_category,
        Arenas.id AS arena_id,
        Arenas.geometry AS arena_geometry,
        Arenas.names.common[0].value AS arena_name
    FROM
        Places, Arenas
    WHERE
        ST_Intersects(Places.geometry, ST_Buffer(Arenas.geometry, 0.002))
    ''')
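
Since the buffer distance in the query is expressed in degrees, it can help to see what it corresponds to on the ground. The sketch below is a back-of-the-envelope conversion under a spherical-Earth assumption; the function degrees_to_meters, the radius constant, and the latitude of 40.7 are illustrative choices, not part of the original analysis.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def degrees_to_meters(degrees: float, latitude_deg: float):
    """Return (north_south_m, east_west_m) spanned by `degrees` at a latitude."""
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180
    north_south = degrees * meters_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = degrees_to_meters(0.002, 40.7)  # 40.7 is roughly NYC's latitude
print(f"north-south: {ns:.0f} m, east-west: {ew:.0f} m")
```

Under these assumptions the buffer reaches roughly 222 m north-south and 169 m east-west, so the search area is slightly elliptical on the ground. For analyses that need an exact metric distance, projecting the geometries to a meter-based coordinate system before buffering would be a more careful approach.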

To get a better picture, here’s a rendered map of the arena_places DataFrame using SedonaKepler. The red dots are the Stadium Arenas while the blue dots are the businesses in the arena’s vicinity.

Next, we group the proximal businesses by ‘category’ and count the occurrences to understand the distribution of proximal businesses. After the GroupBy operation, the categories are ranked based on number of occurrences. This tells us about the most common business categories in proximity to Stadium Arenas.

arena_places_count = arena_places.groupBy("places_category").agg(countDistinct("places_id").alias("count"))
arena_places_count = arena_places_count.orderBy("count", ascending=False)
windowSpec = Window.orderBy(col("count").desc())
arena_places_count = arena_places_count.withColumn("arena_rank", rank().over(windowSpec))
arena_places_count.show(15, truncate=False)
arena_places_count.count()

To highlight which types of businesses are more commonly found in the vicinity of stadium arenas, we compare how each business category ranks overall with how it ranks near stadium arenas. This is achieved by:

Joining the categories_count and arena_places_count DataFrames
Calculating the rank differences
Ordering the result by rank difference

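# Register both DataFrames as temporary views so they can be queried with SQL
categories_df.createOrReplaceTempView("categories_count")
arena_places_count.createOrReplaceTempView("arena_places_count")

sedona.sql('''
    SELECT
        cc.category,
        cc.count AS overall_count,
        apc.count AS arena_count,
        cc.overall_rank,
        apc.arena_rank,
        cc.overall_rank - apc.arena_rank AS rank_difference
    FROM categories_count AS cc
    LEFT JOIN arena_places_count AS apc
        ON cc.category = apc.places_category
    -- The filter on apc.arena_rank drops unmatched rows, so this behaves like an inner join
    WHERE apc.arena_rank <= 50
    ORDER BY rank_difference DESC NULLS LAST
''').show(12, truncate=False)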

We get the table below:

+---------------------------------+-------------+-----------+------------+----------+---------------+
|category                         |overall_count|arena_count|overall_rank|arena_rank|rank_difference|
+---------------------------------+-------------+-----------+------------+----------+---------------+
|advertising_agency               |628          |73         |95          |47        |48             |
|broadcasting_media_production    |873          |102        |66          |27        |39             |
|theatre                          |1046         |129        |56          |18        |38             |
|travel_services                  |710          |76         |80          |42        |38             |
|college_university               |1503         |188        |38          |10        |28             |
|jewelry_store                    |1661         |227        |33          |7         |26             |
|arts_and_entertainment           |1154         |107        |48          |23        |25             |
|counseling_and_mental_health     |1023         |82         |61          |39        |22             |
|topic_concert_venue              |1056         |89         |54          |33        |21             |
|hotel                            |1626         |151        |35          |16        |19             |
|event_planning                   |1082         |84         |52          |37        |15             |
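|financial_service                |1892         |168        |26          |12        |14             |
+---------------------------------+-------------+-----------+------------+----------+---------------+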
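
The rank_difference column can be reproduced with a few lines of plain Python. The sketch below uses four rows copied from the table above; the function rank_differences and its max_arena_rank parameter are hypothetical names that mirror the join, the rank filter, and the ordering in the SQL query.

```python
# Ranks copied from four rows of the table above.
overall_rank = {"advertising_agency": 95, "theatre": 56, "hotel": 35, "financial_service": 26}
arena_rank = {"advertising_agency": 47, "theatre": 18, "hotel": 16, "financial_service": 12}


def rank_differences(overall: dict, arena: dict, max_arena_rank: int = 50) -> list:
    """Join the two rankings on category and sort by overall_rank - arena_rank.

    A large positive difference means the category ranks much higher near
    arenas than it does across the whole city.
    """
    rows = [
        (category, overall[category] - rank)
        for category, rank in arena.items()
        if category in overall and rank <= max_arena_rank
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for category, difference in rank_differences(overall_rank, arena_rank):
    print(category, difference)
```

The differences printed for these rows (48, 38, 19, and 14) match the rank_difference column in the table.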

Insights

High Concentration of Relevant Services: Categories like ‘advertising_agency’, ‘broadcasting_media_production’, and ‘theatre’ rank much higher near stadium arenas than they do citywide, suggesting a synergy with sports and entertainment venues.
University-Affiliated Stadiums: The proximity of ‘college_university’ to stadium arenas might suggest that some of these stadiums are located within or near university campuses, serving as venues for college sports events, which are often significant in the United States.
Accommodation for Visitors: The high ranking of ‘hotel’ near stadium arenas indicates a demand for accommodation by visitors who may be traveling to NYC for games or events. This is consistent with the transient nature of sports and entertainment events, which often draw fans and participants from outside the local area.
Luxury and Leisure: ‘jewelry_store’ has a rank difference of 26, which may point to demand for luxury shopping around stadium arenas, potentially linked to the high-profile nature of events held at these locations.
Entertainment and Event Planning: Categories like ‘arts_and_entertainment’, ‘event_planning’, and ‘topic_concert_venue’ have higher rankings near arenas, reflecting the role of these venues as hubs for events and cultural activities.

Conclusion

The Overture Maps data in the Wherobots Spatial Catalog makes spatial analytics efficient. Combining an analytics-friendly data format with a distributed spatial engine opens up new possibilities for geospatial analysis. The analyses presented here are a starting point and rest on several assumptions, such as the buffer distance used to approximate walking distance. Even so, Wherobots and Overture Maps data offer a broad foundation for uncovering new insights and informing data-driven decisions.

You can follow along with the code from this blog post by creating a free account in Wherobots Cloud.

Contributors

Pranav Toggi

Pranav is building the next wave of scalable geospatial analytics engines at Wherobots. He is passionate about enabling spatial intelligence to solve planetary-scale problems and democratizing the power of geospatial data through open-source software like Apache Sedona.
