Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data. Learn more.
Authors
Overture Maps, supported by the Overture Maps Foundation (OMF), offers a comprehensive geospatial data set, now in GeoParquet format, categorized into themes like places of interest, buildings, transportation networks, and administrative boundaries. GeoParquet, a geospatially optimized variant of the standard Parquet format, enhances the management of spatial data, making it particularly well-suited for geospatial analytics. Unlike traditional Parquet, GeoParquet is specifically designed to efficiently store and handle spatial information, which includes the addition of spatial indexing and optimized storage of geometry data.
This article aims to showcase the practical applications and benefits of Overture Maps data available in the Wherobots Open Data Catalog. By delving into real-world use cases, we demonstrate how the Overture Maps dataset enables deeper and faster insights into urban dynamics and broadens the scope for advanced geospatial analysis.
To follow along, first create a free account in Wherobots Cloud.
The Places theme in Overture Maps represents point locations of various facilities, services, or amenities. Key schema design choices include:
root |-- id: string (nullable = true) |-- updatetime: string (nullable = true) |-- version: integer (nullable = true) |-- names: map (nullable = true) | |-- key: string | |-- value: array (valueContainsNull = true) | | |-- element: map (containsNull = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) |-- categories: struct (nullable = true) | |-- main: string (nullable = true) | |-- alternate: array (nullable = true) | | |-- element: string (containsNull = true) |-- confidence: double (nullable = true) |-- websites: array (nullable = true) | |-- element: string (containsNull = true) |-- socials: array (nullable = true) | |-- element: string (containsNull = true) |-- emails: array (nullable = true) | |-- element: string (containsNull = true) |-- phones: array (nullable = true) | |-- element: string (containsNull = true) |-- brand: struct (nullable = true) | |-- names: map (nullable = true) | | |-- key: string | | |-- value: array (valueContainsNull = true) | | | |-- element: map (containsNull = true) | | | | |-- key: string | | | | |-- value: string (valueContainsNull = true) | |-- wikidata: string (nullable = true) |-- addresses: array (nullable = true) | |-- element: map (containsNull = true) | | |-- key: string | | |-- value: string (valueContainsNull = true) |-- sources: array (nullable = true) | |-- element: map (containsNull = true) | | |-- key: string | | |-- value: string (valueContainsNull = true) |-- bbox: struct (nullable = true) | |-- minx: double (nullable = true) | |-- maxx: double (nullable = true) | |-- miny: double (nullable = true) | |-- maxy: double (nullable = true) |-- geometry: geometry (nullable = true) |-- geohash: string (nullable = true)
To analyze the data from Overture Maps, we first create and connect our SedonaContext to the Wherobots Open Data Catalog like so,
SedonaContext
from sedona.spark import * config = SedonaContext.builder(). \ config("spark.sql.catalog.wherobots_examples.type", "hadoop"). \ config("spark.sql.catalog.wherobots_examples", "org.apache.iceberg.spark.SparkCatalog"). \ config("spark.sql.catalog.wherobots_examples.warehouse", "s3://wherobots-examples-prod/havasu/warehouse"). \ config("spark.sql.catalog.wherobots_examples.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \ getOrCreate() sedona = SedonaContext.create(config)
Next, we access the Places theme dataset of Overture Maps via,
places_df = sedona.table("wherobots_examples.overture.places_place")
For all use cases in this article, we focus on the New York City (NYC) metropolitan area. We apply spatial filtering to limit our dataset to this specific area, using the bounding box coordinates of New York City.
spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))" places_df = places_df.where(spatial_filter)
To illustrate the comprehensive coverage of the dataset, the following map showcases 278,998 points of interest just in the New York City area.
To facilitate easier aggregation and analysis, it’s important to transform certain nested fields into top-level columns. In our dataset, we focus on the ‘main’ and ‘alternate’ subcategories within the ‘categories’ column of the places dataset.
First, we create a new column ‘category’ that directly holds the values from ‘categories.main’:
places_df = places_df.withColumn("category", col("categories.main"))
Next, we use the explode function to transform the ‘alternate’ subcategories. The explode function is used to expand an array or map column into multiple rows. When applied to the ‘categories.alternate’ array, each element in the array is turned into a separate row, effectively creating a new row for each alternate category associated with the same place.
explode
places_df_exploded = places_df.withColumn("alternate_category", explode("categories.alternate"))
Here’s what the explode transformation looks like:
Before applying transformation:
+--------------------+--------------------------------------------+ |id |categories | +--------------------+--------------------------------------------+ |tmp_F36B3571B3E58...|{hvac_services, [industrial_equipment]} | |tmp_240555DC4354D...|{elementary_school, [school, public_school]}| +--------------------+--------------------------------------------+
After applying transformation:
+------------------------------------+-----------------+--------------------+ |id |category |alternate_category | +------------------------------------+-----------------+--------------------+ |tmp_F36B3571B3E583C482BD02CAC65657B6|hvac_services |industrial_equipment| |tmp_240555DC4354D0975F72960E276D481C|elementary_school|school | |tmp_240555DC4354D0975F72960E276D481C|elementary_school|public_school | +------------------------------------+-----------------+--------------------+
Group the data by ‘category’ and count the occurrences to understand the distribution of categories. After GroupBy, the categories are ranked based on number of occurrences. This tells us about the most common business categories in NYC.
categories_df = places_df.groupBy("category").agg(count("*").alias("count")) categories_df = categories_df.orderBy("count", ascending=False) windowSpec = Window.orderBy(col("count").desc()) categories_df = categories_df.withColumn("overall_rank", rank().over(windowSpec)) categories_df.show(10, truncate=False)
This gives us the following output:
+--------------------------------+-----+------------+ |category |count|overall_rank| +--------------------------------+-----+------------+ |beauty_salon |10919|1 | |community_services_non_profits |5963 |2 | |church_cathedral |5507 |3 | |professional_services |4675 |4 | |landmark_and_historical_building|4436 |5 | |hospital |4035 |6 | |dentist |3538 |7 | |real_estate |3330 |8 | |park |3171 |9 | |school |3016 |10 | +--------------------------------+-----+------------+
Lets explore the coffee shops category a bit more.
coffee_df = places_df.filter(places_df_exploded.category == "coffee_shop") coffee_alt_cats = places_df_exploded.filter(places_df_exploded.category == "coffee_shop").groupBy("alternate_category").agg(count("*").alias("count")) coffee_alt_cats = coffee_alt_cats.orderBy("count", ascending = False) coffee_alt_cats.show(11, truncate=False)
We group the coffee shop data by ‘alternate_category’ and count the occurrences to understand the distribution of coffee shop types.After grouping by ‘alternate_category’, the data is aggregated to count the occurrences and then ordered to show the most common alternate categories within coffee shops.
The Bar chart below shows the relative frequency of each alternate category as a percentage of total coffee shops.
Now, lets filter the coffee_df with ‘bagel_shop’ as alternate_category because you may want to grab coffee and bagels on your way to work without having to stand in line at both a coffee shop and a bagel shop.
coffee_df
coffee_bagel_df = coffee_df.filter(array_contains(coffee_df.categories.alternate,"bagel_shop")) coffee_bagel_df = coffee_bagel_df.select(coffee_bagel_df.id, coffee_bagel_df.names, coffee_bagel_df.geometry) coffee_bagel_df = coffee_bagel_df.withColumn("name", col("names.common")[0]["value"]).drop("names")
Visualizing coffee shops that make bagels in NYC using SedonaKepler:
Let’s imagine we want to analyze places where we might see a show or sports event, such as stadiums and arenas, and understand the types of businesses located within walking distance. This analysis can provide insights into the commercial ecosystem surrounding entertainment venues and help us understand the urban dynamics in these areas.
We begin by filtering out the category and then creating temporary views for places and arenas. In PySpark, in order to execute SQL commands on a DataFrame, you need to register it as a temporary view or table first.
arena_df = places_df.filter(places_df.category == "stadium_arena") arena_df.createOrReplaceTempView("Arenas") places_df.createOrReplaceTempView("Places")
To Identify proximal businesses to Stadium Arenas, we perform a spatial intersection.
The following SQL query performs a spatial intersection to find businesses within a 0.002 unit distance (about 1 block) from Stadium Arenas . It uses ST_Intersects for spatial relation checks, combined with ST_Buffer to expand the Arena geometries by 0.02 units, creating a search area. The value 0.02 units is assumed to be the walkable distance of 1 block.
arena_places = sedona.sql(''' SELECT Places.id AS places_id, Places.geometry AS places_geometry, Places.category AS places_category, Arenas.id AS arena_id, Arenas.geometry AS arena_geometry, Arenas.names.common[0].value AS arena_name FROM Places, Arenas WHERE ST_Intersects(Places.geometry, ST_Buffer(Arenas.geometry, 0.002)) ''')
To get a better picture, here’s a rendered map of the arena_places DataFrame using SedonaKepler. The red dots are the Stadium Arenas while the blue dots are the businesses in the arena’s vicinity.
Next, we group the proximal businesses by ‘category’ and count the occurrences to understand the distribution of proximal businesses. After the GroupBy operation, the categories are ranked based on number of occurrences. This tells us about the most common business categories in proximity to Stadium Arenas.
arena_places_count = arena_places.groupBy("places_category").agg(countDistinct("places_id").alias("count")) arena_places_count = arena_places_count.orderBy("count", ascending=False) windowSpec = Window.orderBy(col("count").desc()) arena_places_count = arena_places_count.withColumn("arena_rank", rank().over(windowSpec)) arena_places_count.show(15, truncate=False) arena_places_count.count()
To highlight which types of businesses are more commonly found in the vicinity of Stadium Arenas, we compare the frequency of various business categories overall versus those near Stadium Arenas. This is achieved by,
sedona.sql(''' SELECT cc.category, cc.count AS overall_count, apc.count AS arena_count, cc.overall_rank, apc.arena_rank, cc.overall_rank - apc.arena_rank AS rank_difference FROM categories_count AS cc LEFT JOIN arena_places_count AS apc ON cc.category = apc.places_category WHERE apc.arena_rank <= 50 ORDER BY rank_difference desc nulls last ''').show(12, truncate=False)
We get the table below,
+---------------------------------+-------------+-----------+------------+----------+---------------+ |category |overall_count|arena_count|overall_rank|arena_rank|rank_difference| +---------------------------------+-------------+-----------+------------+----------+---------------+ |advertising_agency |628 |73 |95 |47 |48 | |broadcasting_media_production |873 |102 |66 |27 |39 | |theatre |1046 |129 |56 |18 |38 | |travel_services |710 |76 |80 |42 |38 | |college_university |1503 |188 |38 |10 |28 | |jewelry_store |1661 |227 |33 |7 |26 | |arts_and_entertainment |1154 |107 |48 |23 |25 | |counseling_and_mental_health |1023 |82 |61 |39 |22 | |topic_concert_venue |1056 |89 |54 |33 |21 | |hotel |1626 |151 |35 |16 |19 | |event_planning |1082 |84 |52 |37 |15 | |financial_service |1892 |168 |26 |12 |14 |
The Overture Maps data in Wherobots Spatial Catalog, offers great efficiency in spatial analytics. This synergy between advanced data formats and powerful analytics tools opens up new possibilities for geospatial analysis and insights. The analyses presented here are just the beginning and make several assumptions. However, with Wherobots and Overture Maps data, the possibilities for uncovering new insights and informing data-driven decisions are virtually limitless.
You can follow along with the code from this blog post by creating a free account in Wherobots Cloud.
Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence
RasterFlow takes insights and embeddings from satellite and overhead imagery datasets into Apache Iceberg tables, with ease and efficiency at any scale.
PostGIS vs Wherobots: What It Actually Costs You to Choose Wrong
When building a geospatial platform, technical decisions are never just technical, they are financial. Choosing the wrong architecture for your spatial data doesn’t just frustrate your data team; it directly impacts your bottom line through large cloud infrastructure bills and, perhaps more dangerously, delayed business insights. For decision-makers, the choice between a traditional spatial database […]
Streaming Spatial Data into Wherobots with Spark Structured Streaming
Real-time Spatial Pipelines Shouldn’t Be This Hard (But They Were) I’ve been doing geospatial work for over twenty years now. I’ve hand-rolled ETL pipelines, babysat cron jobs, and debugged more coordinate system mismatches than a person should reasonably endure in one lifetime. So when someone says “streaming spatial data,” my first reaction used to be […]
WherobotsDB is 3x faster with up to 45% better price performance
The next generation of WherobotsDB, the Apache Sedona and Spark 4 compatible engine, is now generally available.
share this article
Awesome that you’d like to share our articles. Where would you like to share it to: