WherobotsDB is now 3x faster with up to 45% better price performance Learn why

Tutorial: How to Use GERS IDs in Wherobots for Geospatial Data Integration

Leverage GERS ID Wherobots Blog Image

Introduction

In the world of geospatial data, entity matching and data integration are common challenges. In this blog post, we’ll explore how to use the Overture Maps Foundation GERS IDs within Wherobots to link information about the same physical location across different datasets.

What are GERS IDs?

GERS IDs (Global Entity Reference System identifiers) are persistent, unique identifiers for physical places and entities in the physical world. Created by the Overture Maps Foundation, these IDs serve as a universal reference system that allows different datasets to refer to the same physical location reliably.

Attributes of GERS IDs include:

  • Persistent identification: The same physical location maintains the same GERS ID over time
  • Cross-dataset compatibility: Enable joining and enrichment across multiple data providers
  • Standardized reference: Provide a common language for location data across the geospatial ecosystem

The Challenge: Working with Non-GERS Datasets

While Overture Maps data comes with GERS IDs built in, many other datasets–open source, commercial data products, and of course internal company datasets–don’t include these identifiers. This presents a challenge: how do you match your existing location data to GERS IDs to enable integration with the broader ecosystem?

At the recent Cloud Native Geospatial Summit, I co-presented with the Overture Maps Foundation team in a workshop session on GERS. My presentation focused on how to take a non-GERSified Point of Interest (POI) dataset and join it to the Overture Places dataset to assign GERS IDs to those POIs that have one. This simple process also allows users to identify those POIs that do not appear in the Overture Places dataset and thus do not have GERS IDs associated with them.

This blog post is a blogified version of that workshop, presented as a tutorial.

If you’d like to follow along with this tutorial, you’ll need a Wherobots Cloud account. You can sign up for a free community edition here, or sign up for a paid plan through the AWS marketplace here.

Tutorial

Setting Up Our Environment

First, let’s initialize our Wherobots environment with the necessary libraries:

from sedona.spark import *
from pyspark.sql import functions as f
import os

config = (SedonaContext.builder().getOrCreate())
sedona = SedonaContext.create(config)

Creating a GERS Matching Function

The heart of our solution is a pair of utility functions that perform GERS ID matching.

The first of these, gersify(), is for finding a matching point and getting the GERS ID for a single point geometry stored as a WKT or well-known-text format. It takes as input a WKT point geometry and a search string that relates to the name of the POI.

This is for if you need to process a one-off. However, if you have a dataframe with many points (say, 16,000), putting this function into a loop would be incredibly slow. That’s why we have the next function.

The second function, gersify_dataframe(), is for doing this same GERS matching operation on a larger scale. It takes as input a dataframe of points and a search string that relates to those points (i.e. “park” or “stadium”).

Let’s examine the code for these:

def gersify(point_wkt, search_param):
    """
    Uses a point and a search string to find the closest matching GERS ID.

    Returns a DataFrame with the GERS ID and other attributes from Overture Maps.
    """
    # Create a point from WKT
    query = f"""
    WITH point AS (
        SELECT ST_GeomFromWKT('{point_wkt}') as point
    )

    SELECT 
        p.id as gers_id, 
        p.geometry as OMF_geom, 
        p.names.primary, 
        p.categories.main as category,
        p.websites.primary as website,
        p.phones.primary as phone,
        p.geometry
    FROM 
        point,
        places p
    WHERE
        ST_DWithin(p.geometry, point, 500, true)
    ORDER BY
        ST_Distance(p.geometry, point) ASC
    """

    return_df = sedona.sql(query).withColumn("distance_from_point", f.expr("ST_DistanceSpheroid(geometry, point)")).cache().where(f"names.primary like '%{search_param}%'")

    return return_df

def gersify_dataframe(df, search_param):
    """
    Matches and add GERS ID to any dataset. 

    Returns all rows that have a GERS ID.
    """
    # Register the input dataframe as a temporary view
    df.createOrReplaceTempView("_temp_df")

    # For each row in the dataframe, find the closest matching GERS ID
    inter_query = """
    WITH points AS (
        SELECT 
            id,
            geometry
        FROM 
            _temp_df
    )

    SELECT 
        p.id as gers_id, 
        p.geometry as OMF_geom, 
        df.*,
        ST_DistanceSpheroid(p.geometry, df.geometry) as distance_from_point
    FROM 
        points df,
        places p
    WHERE
        ST_DWithin(p.geometry, df.geometry, 500, true)
        AND p.names.primary LIKE '%Starbucks%'
    QUALIFY
        ROW_NUMBER() OVER (PARTITION BY df.id ORDER BY ST_DistanceSpheroid(p.geometry, df.geometry)) = 1
    """

    # Execute the query to find matches
    inter_df = sedona.sql(inter_query)
    inter_df.createOrReplaceTempView("inter_df")

    # Find rows that didn't match and include them in the result
    remaining_rows = """
    SELECT 
        NULL as gers_id, NULL AS OMF_geom, df.*, NULL as distance_from_point
    FROM 
        _temp_df df
    LEFT ANTI JOIN
        inter_df i_df
    ON
        df.id = i_df.id
    """

    return_df = sedona.sql(remaining_rows)

    # Combine matched and unmatched rows
    final_df = inter_df.union(return_df)

    return final_df

Loading Our Dataset

For this example, we’ll work with a dataset of Starbucks locations across the United States:

df.count()  # 16820 locations
df.show()

Let’s take a look at our dataset. Here are the first few rows of the result:

stateaddresscategorynamewebsitephoneidgeometry
CO12027 E Lincoln Ave.coffee_shopStarbucks30369033000POINT (-104.84465…)
CO6424 E Hampden Ave.coffee_shopStarbuckshttps://www.starb…+130375864621POINT (-104.91555…)
CO4030 S Parker Rd.coffee_shopStarbuckshttps://www.starb…+130369033002POINT (-104.80830…)

Visualizing the Original Dataset

Before enrichment, let’s visualize our base dataset to get an idea of what we are working with.

map = SedonaKepler.create_map(df, "Original Location Dataset")
map

Adding GERS IDs to Our Dataset

Now for the exciting part, let’s enrich our dataset with GERS IDs:

# NOTE: the string search is case sensitive
df_gersified = gersify_dataframe(df, "Starbucks")

Let’s examine the results:

df_gersified.show(20, False)
gers_idOMF_geomstateaddresscategorynamewebsitephoneidgeometrydistance_from_point
2fbc782d-2f1e-4aed-8c9a-38b1e71cPOINT (-104.84465408325195 3…CO12027 E Lincoln Ave.coffee_shopStarbucks30369033000POINT (-104.84465408325195 39.59344482421875)13.190219010431478
3e70a787-2e92-4224-d008-f6366a17POINT (-104.91578674316406 3…CO6424 E Hampden Ave.coffee_shopStarbuckshttps://www.starbucks.com/store-locator/store/8713/+130375864621POINT (-104.91555786132812 39.65428161621094)22.779467851622873
dd60478c-562e-487f-e9ae-540aa5e0POINT (-104.80859375 39.631…CO4030 S Parker Rd.coffee_shopStarbuckshttps://www.starbucks.com/store-locator/store/16430/+130369033002POINT (-104.8083038330078 39.631256103515625)25.400139164709438

Visualizing Our GERS-Enriched Dataset

Let’s visualize both datasets together to see what points matched and what points did not match to anything.

SedonaKepler.add_df(map, df_gersified.drop("geometry"), "OMF enrichment locations")
map

Key Benefits of GERS ID Enrichment

By enriching our dataset with GERS IDs, we’ve unlocked several powerful capabilities:

  1. Cross-dataset integration: We can now join this Starbucks dataset with any other dataset that uses GERS IDs, such as any Overture Maps dataset or other GERS-enriched datasets.
  2. Data enrichment workflows: The GERS IDs allow us to pull in additional attributes from the Overture Maps dataset or any other dataset sharing the same identification system.
  3. Scalable processing: Using Wherobots and Apache Sedona, we can process large datasets efficiently. Our example processed over 16,000 locations in under a minute on a Medium Wherobots cluster.

Conclusion

GERS IDs represent a powerful tool for geospatial data integration, and Wherobots makes it easy to incorporate them into your workflows. With the functions demonstrated in this blog post, you can enrich any location dataset with GERS IDs, enabling integration with the broader geospatial data ecosystem.

Want to try it yourself? Sign up for a Wherobots account and explore the full capabilities of the platform for your spatial data processing needs.

Create your Wherobots account