TABLE OF CONTENTS

    Introduction

    Overture Maps, supported by the Overture Maps Foundation (OMF), offers a comprehensive geospatial data set, now in GeoParquet format, categorized into themes like places of interest, buildings, transportation networks, and administrative boundaries. GeoParquet, a geospatially optimized variant of the standard Parquet format, enhances the management of spatial data, making it particularly well-suited for geospatial analytics. Unlike traditional Parquet, GeoParquet is specifically designed to efficiently store and handle spatial information, which includes the addition of spatial indexing and optimized storage of geometry data.

    This article aims to showcase the practical applications and benefits of Overture Maps data available in the Wherobots Open Data Catalog. By delving into real-world use cases, we demonstrate how the Overture Maps dataset enables deeper and faster insights into urban dynamics and broadens the scope for advanced geospatial analysis.

    To follow along, first create a free account in Wherobots Cloud.

    Data Schema for Places Theme in Overture Maps

    The Places theme in Overture Maps represents point locations of various facilities, services, or amenities. Key schema design choices include:

    • Extensible Attributes: Basic common attributes such as phone, mail, website, and brand are included. Additional attributes not currently in the official release are allowed with an "ext" prefix. Attributes specific to certain types of places are planned for future inclusion.
    • Controlled Categories: A hierarchical categorization system (taxonomy) allows for the transformation of various categorization systems to the Overture framework. This taxonomy is intended to be comprehensive and will be fine-tuned over time.
    Schema Representation
    root
    |-- id: string (nullable = true)
    |-- updatetime: string (nullable = true)
    |-- version: integer (nullable = true)
    |-- names: map (nullable = true)
    |    |-- key: string
    |    |-- value: array (valueContainsNull = true)
    |    |    |-- element: map (containsNull = true)
    |    |    |    |-- key: string
    |    |    |    |-- value: string (valueContainsNull = true)
    |-- categories: struct (nullable = true)
    |    |-- main: string (nullable = true)
    |    |-- alternate: array (nullable = true)
    |    |    |-- element: string (containsNull = true)
    |-- confidence: double (nullable = true)
    |-- websites: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- socials: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- emails: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- phones: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- brand: struct (nullable = true)
    |    |-- names: map (nullable = true)
    |    |    |-- key: string
    |    |    |-- value: array (valueContainsNull = true)
    |    |    |    |-- element: map (containsNull = true)
    |    |    |    |    |-- key: string
    |    |    |    |    |-- value: string (valueContainsNull = true)
    |    |-- wikidata: string (nullable = true)
    |-- addresses: array (nullable = true)
    |    |-- element: map (containsNull = true)
    |    |    |-- key: string
    |    |    |-- value: string (valueContainsNull = true)
    |-- sources: array (nullable = true)
    |    |-- element: map (containsNull = true)
    |    |    |-- key: string
    |    |    |-- value: string (valueContainsNull = true)
    |-- bbox: struct (nullable = true)
    |    |-- minx: double (nullable = true)
    |    |-- maxx: double (nullable = true)
    |    |-- miny: double (nullable = true)
    |    |-- maxy: double (nullable = true)
    |-- geometry: geometry (nullable = true)
    |-- geohash: string (nullable = true)

    Accessing Overture Maps Places Dataset

    To analyze the data from Overture Maps, we first create and connect our SedonaContext to the Wherobots Open Data Catalog like so,

    from sedona.spark import *
    
    config = SedonaContext.builder(). \
    config("spark.sql.catalog.wherobots_examples.type", "hadoop"). \
    config("spark.sql.catalog.wherobots_examples", "org.apache.iceberg.spark.SparkCatalog"). \
    config("spark.sql.catalog.wherobots_examples.warehouse", "s3://wherobots-examples-prod/havasu/warehouse"). \
    config("spark.sql.catalog.wherobots_examples.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \
    getOrCreate()
    
    sedona = SedonaContext.create(config)

    Next, we access the Places theme dataset of Overture Maps via,

    places_df = sedona.table("wherobots_examples.overture.places_place")

    Spatial Filtering for NYC Metropolitan Area

    For all use cases in this article, we focus on the New York City (NYC) metropolitan area. We apply spatial filtering to limit our dataset to this specific area, using the bounding box coordinates of New York City.

    spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"
    places_df = places_df.where(spatial_filter)

    To illustrate the comprehensive coverage of the dataset, the following map showcases 278,998 points of interest just in the New York City area.

    Visualizing places from the Overture Maps dataset

    Elevate key nested fields to top level columns

    To facilitate easier aggregation and analysis, it’s important to transform certain nested fields into top-level columns. In our dataset, we focus on the ‘main’ and ‘alternate’ subcategories within the ‘categories’ column of the places dataset.

    First, we create a new column ‘category’ that directly holds the values from ‘categories.main’:

    places_df = places_df.withColumn("category", col("categories.main")) 

    Next, we use the explode function to transform the ‘alternate’ subcategories. The explode function is used to expand an array or map column into multiple rows. When applied to the ‘categories.alternate’ array, each element in the array is turned into a separate row, effectively creating a new row for each alternate category associated with the same place.

    places_df_exploded = places_df.withColumn("alternate_category", explode("categories.alternate")) 

    Here’s what the explode transformation looks like:

    Before applying transformation:

    +--------------------+--------------------------------------------+
    |id                  |categories                                  |
    +--------------------+--------------------------------------------+
    |tmp_F36B3571B3E58...|{hvac_services, [industrial_equipment]}     |
    |tmp_240555DC4354D...|{elementary_school, [school, public_school]}|
    +--------------------+--------------------------------------------+

    After applying transformation:

    +------------------------------------+-----------------+--------------------+
    |id                                  |category         |alternate_category  |
    +------------------------------------+-----------------+--------------------+
    |tmp_F36B3571B3E583C482BD02CAC65657B6|hvac_services    |industrial_equipment|
    |tmp_240555DC4354D0975F72960E276D481C|elementary_school|school              |
    |tmp_240555DC4354D0975F72960E276D481C|elementary_school|public_school       |
    +------------------------------------+-----------------+--------------------+

    Explore categories

    Group the data by ‘category’ and count the occurrences to understand the distribution of categories. After GroupBy, the categories are ranked based on number of occurrences. This tells us about the most common business categories in NYC.

    categories_df = places_df.groupBy("category").agg(count("*").alias("count"))
    categories_df = categories_df.orderBy("count", ascending=False)
    windowSpec = Window.orderBy(col("count").desc())
    categories_df = categories_df.withColumn("overall_rank", rank().over(windowSpec))
    categories_df.show(10, truncate=False)

    This gives us the following output:

    +--------------------------------+-----+------------+
    |category                        |count|overall_rank|
    +--------------------------------+-----+------------+
    |beauty_salon                    |10919|1           |
    |community_services_non_profits  |5963 |2           |
    |church_cathedral                |5507 |3           |
    |professional_services           |4675 |4           |
    |landmark_and_historical_building|4436 |5           |
    |hospital                        |4035 |6           |
    |dentist                         |3538 |7           |
    |real_estate                     |3330 |8           |
    |park                            |3171 |9           |
    |school                          |3016 |10          |
    +--------------------------------+-----+------------+

    Explore Coffee Shops

    Lets explore the coffee shops category a bit more.

    coffee_df = places_df.filter(places_df_exploded.category == "coffee_shop")
    coffee_alt_cats = places_df_exploded.filter(places_df_exploded.category == "coffee_shop").groupBy("alternate_category").agg(count("*").alias("count"))
    coffee_alt_cats = coffee_alt_cats.orderBy("count", ascending = False)
    coffee_alt_cats.show(11, truncate=False)

    We group the coffee shop data by ‘alternate_category’ and count the occurrences to understand the distribution of coffee shop types.
    After grouping by ‘alternate_category’, the data is aggregated to count the occurrences and then ordered to show the most common alternate categories within coffee shops.

    The Bar chart below shows the relative frequency of each alternate category as a percentage of total coffee shops.

    Visualizing alternate categories

    Now, lets filter the coffee_df with ‘bagel_shop’ as alternate_category because you may want to grab coffee and bagels on your way to work without having to stand in line at both a coffee shop and a bagel shop.

    coffee_bagel_df = coffee_df.filter(array_contains(coffee_df.categories.alternate,"bagel_shop"))
    coffee_bagel_df = coffee_bagel_df.select(coffee_bagel_df.id, coffee_bagel_df.names, coffee_bagel_df.geometry)
    coffee_bagel_df = coffee_bagel_df.withColumn("name", col("names.common")[0]["value"]).drop("names")

    Visualizing coffee shops that make bagels in NYC using SedonaKepler:

    Visualizing points of interest from Overture Maps

    Exploring Stadiums Arena category

    Let’s imagine we want to analyze places where we might see a show or sports event, such as stadiums and arenas, and understand the types of businesses located within walking distance. This analysis can provide insights into the commercial ecosystem surrounding entertainment venues and help us understand the urban dynamics in these areas.

    We begin by filtering out the category and then creating temporary views for places and arenas. In PySpark, in order to execute SQL commands on a DataFrame, you need to register it as a temporary view or table first.

    arena_df = places_df.filter(places_df.category == "stadium_arena")
    arena_df.createOrReplaceTempView("Arenas")
    places_df.createOrReplaceTempView("Places")

    To Identify proximal businesses to Stadium Arenas, we perform a spatial intersection.

    The following SQL query performs a spatial intersection to find businesses within a 0.002 unit distance (about 1 block) from Stadium Arenas . It uses ST_Intersects for spatial relation checks, combined with ST_Buffer to expand the Arena geometries by 0.02 units, creating a search area. The value 0.02 units is assumed to be the walkable distance of 1 block.

    arena_places = sedona.sql('''
        SELECT
            Places.id AS places_id,
            Places.geometry AS places_geometry,
            Places.category AS places_category,
            Arenas.id AS arena_id,
            Arenas.geometry AS arena_geometry,
            Arenas.names.common[0].value AS arena_name
        FROM
            Places, Arenas
        WHERE
            ST_Intersects(Places.geometry, ST_Buffer(Arenas.geometry, 0.002))
        ''')

    To get a better picture, here’s a rendered map of the arena_places DataFrame using SedonaKepler. The red dots are the Stadium Arenas while the blue dots are the businesses in the arena’s vicinity.

    Visualizing Overture Maps points of interest

    Next, we group the proximal businesses by ‘category’ and count the occurrences to understand the distribution of proximal businesses. After the GroupBy operation, the categories are ranked based on number of occurrences. This tells us about the most common business categories in proximity to Stadium Arenas.

    arena_places_count = arena_places.groupBy("places_category").agg(countDistinct("places_id").alias("count"))
    arena_places_count = arena_places_count.orderBy("count", ascending=False)
    windowSpec = Window.orderBy(col("count").desc())
    arena_places_count = arena_places_count.withColumn("arena_rank", rank().over(windowSpec))
    arena_places_count.show(15, truncate=False)
    arena_places_count.count()

    To highlight which types of businesses are more commonly found in the vicinity of Stadium Arenas, we compare the frequency of various business categories overall versus those near Stadium Arenas. This is achieved by,

    • Joining the categories_count and arena_places_count DataFrames
    • Calculating the rank differences
    • Ordering the result by rank difference
    sedona.sql('''
        SELECT
            cc.category,
            cc.count AS overall_count,
            apc.count AS arena_count,
            cc.overall_rank,
            apc.arena_rank,
            cc.overall_rank - apc.arena_rank AS rank_difference
        FROM categories_count AS cc
        LEFT JOIN arena_places_count AS apc
            ON cc.category = apc.places_category
        WHERE apc.arena_rank <= 50
        ORDER BY rank_difference desc nulls last
    ''').show(12, truncate=False)
    

    We get the table below,

    +---------------------------------+-------------+-----------+------------+----------+---------------+
    |category                         |overall_count|arena_count|overall_rank|arena_rank|rank_difference|
    +---------------------------------+-------------+-----------+------------+----------+---------------+
    |advertising_agency               |628          |73         |95          |47        |48             |
    |broadcasting_media_production    |873          |102        |66          |27        |39             |
    |theatre                          |1046         |129        |56          |18        |38             |
    |travel_services                  |710          |76         |80          |42        |38             |
    |college_university               |1503         |188        |38          |10        |28             |
    |jewelry_store                    |1661         |227        |33          |7         |26             |
    |arts_and_entertainment           |1154         |107        |48          |23        |25             |
    |counseling_and_mental_health     |1023         |82         |61          |39        |22             |
    |topic_concert_venue              |1056         |89         |54          |33        |21             |
    |hotel                            |1626         |151        |35          |16        |19             |
    |event_planning                   |1082         |84         |52          |37        |15             |
    |financial_service                |1892         |168        |26          |12        |14             |
    Insights
    • High Concentration of Relevant Services: Categories like ‘advertising_agency’, ‘broadcasting_media_production’, and ‘theatre’ are much more common near stadium arenas than their overall city rankings, suggesting a synergy with sports and entertainment venues.
    • University-Affiliated Stadiums: The proximity of ‘college_university’ to stadium arenas might suggest that some of these stadiums are located within or near university campuses, serving as venues for college sports events, which are often significant in the United States.
    • Accommodation for Visitors: The high ranking of ‘hotel’ near stadium arenas indicates a demand for accommodation by visitors who may be traveling to NYC for games or events. This is consistent with the transient nature of sports and entertainment events, which often draw fans and participants from outside the local area.
    • Luxury and Leisure: ‘jewelry_store’ has a rank difference of 26, showcasing a demand for luxury shopping experiences around stadium arenas, potentially linked to the high-profile nature of events held at these locations.
    • Entertainment and Event Planning: Categories like ‘arts_and_entertainment’, ‘event_planning’, and ‘topic_concert_venue’ have higher rankings near arenas, reflecting the role of these venues as hubs for events and cultural activities.

    Conclusion

    The Overture Maps data in Wherobots open data catalog, offers great efficiency in spatial analytics. This synergy between advanced data formats and powerful analytics tools opens up new possibilities for geospatial analysis and insights. The analyses presented here are just the beginning and make several assumptions. However, with Wherobots and Overture Maps data, the possibilities for uncovering new insights and informing data-driven decisions are virtually limitless.

    You can follow along with the code from this blog post by creating a free account in Wherobots Cloud.

    Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter: