Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data. Learn more.
Authors
Overture Maps, supported by the Overture Maps Foundation (OMF), offers a comprehensive geospatial data set, now in GeoParquet format, categorized into themes like places of interest, buildings, transportation networks, and administrative boundaries. GeoParquet, a geospatially optimized variant of the standard Parquet format, enhances the management of spatial data, making it particularly well-suited for geospatial analytics. Unlike traditional Parquet, GeoParquet is specifically designed to efficiently store and handle spatial information, which includes the addition of spatial indexing and optimized storage of geometry data.
This article aims to showcase the practical applications and benefits of Overture Maps data available in the Wherobots Open Data Catalog. By delving into real-world use cases, we demonstrate how the Overture Maps dataset enables deeper and faster insights into urban dynamics and broadens the scope for advanced geospatial analysis.
To follow along, first create a free account in Wherobots Cloud.
The Places theme in Overture Maps represents point locations of various facilities, services, or amenities. Key schema design choices include:
root |-- id: string (nullable = true) |-- updatetime: string (nullable = true) |-- version: integer (nullable = true) |-- names: map (nullable = true) | |-- key: string | |-- value: array (valueContainsNull = true) | | |-- element: map (containsNull = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) |-- categories: struct (nullable = true) | |-- main: string (nullable = true) | |-- alternate: array (nullable = true) | | |-- element: string (containsNull = true) |-- confidence: double (nullable = true) |-- websites: array (nullable = true) | |-- element: string (containsNull = true) |-- socials: array (nullable = true) | |-- element: string (containsNull = true) |-- emails: array (nullable = true) | |-- element: string (containsNull = true) |-- phones: array (nullable = true) | |-- element: string (containsNull = true) |-- brand: struct (nullable = true) | |-- names: map (nullable = true) | | |-- key: string | | |-- value: array (valueContainsNull = true) | | | |-- element: map (containsNull = true) | | | | |-- key: string | | | | |-- value: string (valueContainsNull = true) | |-- wikidata: string (nullable = true) |-- addresses: array (nullable = true) | |-- element: map (containsNull = true) | | |-- key: string | | |-- value: string (valueContainsNull = true) |-- sources: array (nullable = true) | |-- element: map (containsNull = true) | | |-- key: string | | |-- value: string (valueContainsNull = true) |-- bbox: struct (nullable = true) | |-- minx: double (nullable = true) | |-- maxx: double (nullable = true) | |-- miny: double (nullable = true) | |-- maxy: double (nullable = true) |-- geometry: geometry (nullable = true) |-- geohash: string (nullable = true)
To analyze the data from Overture Maps, we first create and connect our SedonaContext to the Wherobots Open Data Catalog like so,
SedonaContext
from sedona.spark import * config = SedonaContext.builder(). \ config("spark.sql.catalog.wherobots_examples.type", "hadoop"). \ config("spark.sql.catalog.wherobots_examples", "org.apache.iceberg.spark.SparkCatalog"). \ config("spark.sql.catalog.wherobots_examples.warehouse", "s3://wherobots-examples-prod/havasu/warehouse"). \ config("spark.sql.catalog.wherobots_examples.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \ getOrCreate() sedona = SedonaContext.create(config)
Next, we access the Places theme dataset of Overture Maps via,
places_df = sedona.table("wherobots_examples.overture.places_place")
For all use cases in this article, we focus on the New York City (NYC) metropolitan area. We apply spatial filtering to limit our dataset to this specific area, using the bounding box coordinates of New York City.
spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))" places_df = places_df.where(spatial_filter)
To illustrate the comprehensive coverage of the dataset, the following map showcases 278,998 points of interest just in the New York City area.
To facilitate easier aggregation and analysis, it’s important to transform certain nested fields into top-level columns. In our dataset, we focus on the ‘main’ and ‘alternate’ subcategories within the ‘categories’ column of the places dataset.
First, we create a new column ‘category’ that directly holds the values from ‘categories.main’:
places_df = places_df.withColumn("category", col("categories.main"))
Next, we use the explode function to transform the ‘alternate’ subcategories. The explode function is used to expand an array or map column into multiple rows. When applied to the ‘categories.alternate’ array, each element in the array is turned into a separate row, effectively creating a new row for each alternate category associated with the same place.
explode
places_df_exploded = places_df.withColumn("alternate_category", explode("categories.alternate"))
Here’s what the explode transformation looks like:
Before applying transformation:
+--------------------+--------------------------------------------+ |id |categories | +--------------------+--------------------------------------------+ |tmp_F36B3571B3E58...|{hvac_services, [industrial_equipment]} | |tmp_240555DC4354D...|{elementary_school, [school, public_school]}| +--------------------+--------------------------------------------+
After applying transformation:
+------------------------------------+-----------------+--------------------+ |id |category |alternate_category | +------------------------------------+-----------------+--------------------+ |tmp_F36B3571B3E583C482BD02CAC65657B6|hvac_services |industrial_equipment| |tmp_240555DC4354D0975F72960E276D481C|elementary_school|school | |tmp_240555DC4354D0975F72960E276D481C|elementary_school|public_school | +------------------------------------+-----------------+--------------------+
Group the data by ‘category’ and count the occurrences to understand the distribution of categories. After GroupBy, the categories are ranked based on number of occurrences. This tells us about the most common business categories in NYC.
categories_df = places_df.groupBy("category").agg(count("*").alias("count")) categories_df = categories_df.orderBy("count", ascending=False) windowSpec = Window.orderBy(col("count").desc()) categories_df = categories_df.withColumn("overall_rank", rank().over(windowSpec)) categories_df.show(10, truncate=False)
This gives us the following output:
+--------------------------------+-----+------------+ |category |count|overall_rank| +--------------------------------+-----+------------+ |beauty_salon |10919|1 | |community_services_non_profits |5963 |2 | |church_cathedral |5507 |3 | |professional_services |4675 |4 | |landmark_and_historical_building|4436 |5 | |hospital |4035 |6 | |dentist |3538 |7 | |real_estate |3330 |8 | |park |3171 |9 | |school |3016 |10 | +--------------------------------+-----+------------+
Lets explore the coffee shops category a bit more.
coffee_df = places_df.filter(places_df_exploded.category == "coffee_shop") coffee_alt_cats = places_df_exploded.filter(places_df_exploded.category == "coffee_shop").groupBy("alternate_category").agg(count("*").alias("count")) coffee_alt_cats = coffee_alt_cats.orderBy("count", ascending = False) coffee_alt_cats.show(11, truncate=False)
We group the coffee shop data by ‘alternate_category’ and count the occurrences to understand the distribution of coffee shop types.After grouping by ‘alternate_category’, the data is aggregated to count the occurrences and then ordered to show the most common alternate categories within coffee shops.
The Bar chart below shows the relative frequency of each alternate category as a percentage of total coffee shops.
Now, lets filter the coffee_df with ‘bagel_shop’ as alternate_category because you may want to grab coffee and bagels on your way to work without having to stand in line at both a coffee shop and a bagel shop.
coffee_df
coffee_bagel_df = coffee_df.filter(array_contains(coffee_df.categories.alternate,"bagel_shop")) coffee_bagel_df = coffee_bagel_df.select(coffee_bagel_df.id, coffee_bagel_df.names, coffee_bagel_df.geometry) coffee_bagel_df = coffee_bagel_df.withColumn("name", col("names.common")[0]["value"]).drop("names")
Visualizing coffee shops that make bagels in NYC using SedonaKepler:
Let’s imagine we want to analyze places where we might see a show or sports event, such as stadiums and arenas, and understand the types of businesses located within walking distance. This analysis can provide insights into the commercial ecosystem surrounding entertainment venues and help us understand the urban dynamics in these areas.
We begin by filtering out the category and then creating temporary views for places and arenas. In PySpark, in order to execute SQL commands on a DataFrame, you need to register it as a temporary view or table first.
arena_df = places_df.filter(places_df.category == "stadium_arena") arena_df.createOrReplaceTempView("Arenas") places_df.createOrReplaceTempView("Places")
To Identify proximal businesses to Stadium Arenas, we perform a spatial intersection.
The following SQL query performs a spatial intersection to find businesses within a 0.002 unit distance (about 1 block) from Stadium Arenas . It uses ST_Intersects for spatial relation checks, combined with ST_Buffer to expand the Arena geometries by 0.02 units, creating a search area. The value 0.02 units is assumed to be the walkable distance of 1 block.
arena_places = sedona.sql(''' SELECT Places.id AS places_id, Places.geometry AS places_geometry, Places.category AS places_category, Arenas.id AS arena_id, Arenas.geometry AS arena_geometry, Arenas.names.common[0].value AS arena_name FROM Places, Arenas WHERE ST_Intersects(Places.geometry, ST_Buffer(Arenas.geometry, 0.002)) ''')
To get a better picture, here’s a rendered map of the arena_places DataFrame using SedonaKepler. The red dots are the Stadium Arenas while the blue dots are the businesses in the arena’s vicinity.
Next, we group the proximal businesses by ‘category’ and count the occurrences to understand the distribution of proximal businesses. After the GroupBy operation, the categories are ranked based on number of occurrences. This tells us about the most common business categories in proximity to Stadium Arenas.
arena_places_count = arena_places.groupBy("places_category").agg(countDistinct("places_id").alias("count")) arena_places_count = arena_places_count.orderBy("count", ascending=False) windowSpec = Window.orderBy(col("count").desc()) arena_places_count = arena_places_count.withColumn("arena_rank", rank().over(windowSpec)) arena_places_count.show(15, truncate=False) arena_places_count.count()
To highlight which types of businesses are more commonly found in the vicinity of Stadium Arenas, we compare the frequency of various business categories overall versus those near Stadium Arenas. This is achieved by,
sedona.sql(''' SELECT cc.category, cc.count AS overall_count, apc.count AS arena_count, cc.overall_rank, apc.arena_rank, cc.overall_rank - apc.arena_rank AS rank_difference FROM categories_count AS cc LEFT JOIN arena_places_count AS apc ON cc.category = apc.places_category WHERE apc.arena_rank <= 50 ORDER BY rank_difference desc nulls last ''').show(12, truncate=False)
We get the table below,
+---------------------------------+-------------+-----------+------------+----------+---------------+ |category |overall_count|arena_count|overall_rank|arena_rank|rank_difference| +---------------------------------+-------------+-----------+------------+----------+---------------+ |advertising_agency |628 |73 |95 |47 |48 | |broadcasting_media_production |873 |102 |66 |27 |39 | |theatre |1046 |129 |56 |18 |38 | |travel_services |710 |76 |80 |42 |38 | |college_university |1503 |188 |38 |10 |28 | |jewelry_store |1661 |227 |33 |7 |26 | |arts_and_entertainment |1154 |107 |48 |23 |25 | |counseling_and_mental_health |1023 |82 |61 |39 |22 | |topic_concert_venue |1056 |89 |54 |33 |21 | |hotel |1626 |151 |35 |16 |19 | |event_planning |1082 |84 |52 |37 |15 | |financial_service |1892 |168 |26 |12 |14 |
The Overture Maps data in Wherobots Spatial Catalog, offers great efficiency in spatial analytics. This synergy between advanced data formats and powerful analytics tools opens up new possibilities for geospatial analysis and insights. The analyses presented here are just the beginning and make several assumptions. However, with Wherobots and Overture Maps data, the possibilities for uncovering new insights and informing data-driven decisions are virtually limitless.
You can follow along with the code from this blog post by creating a free account in Wherobots Cloud.
Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence
RasterFlow takes insights and embeddings from satellite and overhead imagery datasets into Apache Iceberg tables, with ease and efficiency at any scale.
Wherobots brought modern infrastructure to spatial data in 2025
We’re bridging the gap between AI and data from the physical world in 2026
The Medallion Architecture for Geospatial Data: Why Spatial Intelligence Demands a Different Approach
When most data engineers hear “medallion architecture,” they think of the traditional multi-hop layering pattern that powers countless analytics pipelines. The concept is sound: progressively refine raw data into analytical data and products. But geospatial data breaks conventional data engineering in ways that demand we rethink the entire pipeline. This isn’t about just storing location […]
share this article
Awesome that you’d like to share our articles. Where would you like to share it to: