In the world of geospatial data, entity matching and data integration are common challenges. In this blog post, we’ll explore how to use the Overture Maps Foundation GERS IDs within Wherobots to link information about the same physical location across different datasets.
GERS IDs (Global Entity Reference System identifiers) are persistent, unique identifiers for places and other entities in the physical world. Created by the Overture Maps Foundation, these IDs serve as a universal reference system that allows different datasets to refer to the same physical location reliably.
Attributes of GERS IDs include:
While Overture Maps data comes with GERS IDs built in, many other datasets (open source, commercial data products, and of course internal company datasets) don’t include these identifiers. This presents a challenge: how do you match your existing location data to GERS IDs to enable integration with the broader ecosystem?
At the recent Cloud Native Geospatial Summit, I co-presented with the Overture Maps Foundation team in a workshop session on GERS. My presentation focused on how to take a non-GERSified Point of Interest (POI) dataset and join it to the Overture Places dataset to assign GERS IDs to those POIs that have one. This simple process also allows users to identify those POIs that do not appear in the Overture Places dataset and thus do not have GERS IDs associated with them.
This blog post is a blogified version of that workshop, presented as a tutorial.
If you’d like to follow along with this tutorial, you’ll need a Wherobots Cloud account. You can sign up for a free community edition here, or sign up for a paid plan through the AWS marketplace here.
First, let’s initialize our Wherobots environment with the necessary libraries:
from sedona.spark import *
from pyspark.sql import functions as f
import os

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
The heart of our solution is a pair of utility functions that perform GERS ID matching.
The first of these, gersify(), finds the matching place and returns its GERS ID for a single point geometry supplied as well-known text (WKT). It takes as input a WKT point geometry and a search string that relates to the name of the POI.
Use it when you need to process a one-off lookup. However, if you have a dataframe with many points (say, 16,000), calling this function in a loop would be incredibly slow. That’s why we have the next function.
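For a one-off lookup, the WKT input usually has to be built from a longitude/latitude pair. The helper below is a minimal sketch of that step; `point_wkt` is a hypothetical name that is not part of the workshop code, and it assumes coordinates are in WGS84 with WKT's usual x/y (longitude first) order.

```python
def point_wkt(lon: float, lat: float) -> str:
    """Build a WKT point string (hypothetical helper, not from the workshop code).

    WKT lists x before y, so longitude comes first. The range checks catch
    some swapped or out-of-range coordinates before they reach the query.
    """
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    return f"POINT ({lon} {lat})"


# Example coordinates, chosen only for illustration
wkt = point_wkt(-122.3321, 47.6062)

# In a Wherobots notebook, the string could then be passed on, for example:
# matches = gersify(wkt, "Starbucks")
```

Swapping latitude and longitude is an easy mistake to make here; the range check only catches the cases where the swapped value is out of bounds, so it is worth eyeballing the result on a map.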
The second function, gersify_dataframe(), performs the same GERS matching operation at a larger scale. It takes as input a dataframe of points and a search string that relates to those points (e.g. “park” or “stadium”).
Let’s examine the code for these:
def gersify(point_wkt, search_param):
    """
    Uses a point and a search string to find the closest matching GERS ID.
    Returns a DataFrame with the GERS ID and other attributes from Overture Maps.
    """
    # Both functions query a table or view named `places` holding Overture Places data.
    # Build a one-row CTE from the WKT point, then keep places within 500 meters
    # whose primary name contains the search string, closest first.
    query = f"""
    WITH pt AS (
        SELECT ST_GeomFromWKT('{point_wkt}') AS point
    )
    SELECT
        p.id AS gers_id,
        p.geometry AS OMF_geom,
        p.names.primary,
        p.categories.main AS category,
        p.websites.primary AS website,
        p.phones.primary AS phone,
        ST_DistanceSpheroid(p.geometry, pt.point) AS distance_from_point
    FROM pt, places p
    WHERE ST_DWithin(p.geometry, pt.point, 500, true)
      AND p.names.primary LIKE '%{search_param}%'
    ORDER BY distance_from_point ASC
    """
    return sedona.sql(query).cache()


def gersify_dataframe(df, search_param):
    """
    Matches and adds GERS IDs to any dataset that has `id` and `geometry` columns.
    Returns every input row; rows without a match have a NULL gers_id.
    """
    # Register the input dataframe as a temporary view
    df.createOrReplaceTempView("_temp_df")

    # For each row in the dataframe, keep only the closest matching place
    # within 500 meters whose primary name contains the search string
    inter_query = f"""
    SELECT
        p.id AS gers_id,
        p.geometry AS OMF_geom,
        df.*,
        ST_DistanceSpheroid(p.geometry, df.geometry) AS distance_from_point
    FROM _temp_df df, places p
    WHERE ST_DWithin(p.geometry, df.geometry, 500, true)
      AND p.names.primary LIKE '%{search_param}%'
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY df.id
        ORDER BY ST_DistanceSpheroid(p.geometry, df.geometry)
    ) = 1
    """

    # Execute the query to find matches
    inter_df = sedona.sql(inter_query)
    inter_df.createOrReplaceTempView("inter_df")

    # Find rows that didn't match and include them in the result with NULL placeholders
    remaining_rows = """
    SELECT
        NULL AS gers_id,
        NULL AS OMF_geom,
        df.*,
        NULL AS distance_from_point
    FROM _temp_df df
    LEFT ANTI JOIN inter_df i_df ON df.id = i_df.id
    """
    return_df = sedona.sql(remaining_rows)

    # Combine matched and unmatched rows
    final_df = inter_df.union(return_df)
    return final_df
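To make the matching rule easier to reason about outside of Spark, here is a small pure-Python sketch of the same idea: for each input point, keep the nearest place within 500 meters whose name contains the search string, and keep unmatched points with an empty ID. Everything in it is illustrative: `haversine_m`, `match_nearest`, the sample coordinates, and the IDs `a1` to `a4` are made up for this example, and it uses a spherical haversine distance rather than the spheroid distance Sedona computes, so the distances will differ slightly.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; a spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def match_nearest(pois, places, search, radius_m=500.0):
    """For each POI, attach the nearest place whose name contains `search`.

    POIs with no candidate inside the radius are kept with gers_id=None,
    mirroring the matched-plus-unmatched union in gersify_dataframe().
    """
    results = []
    for poi in pois:
        best = None  # (place id, distance) of the closest candidate so far
        for place in places:
            if search not in place["name"]:  # case-sensitive, like SQL LIKE
                continue
            d = haversine_m(poi["lon"], poi["lat"], place["lon"], place["lat"])
            if d <= radius_m and (best is None or d < best[1]):
                best = (place["id"], d)
        results.append({
            **poi,
            "gers_id": best[0] if best else None,
            "distance_m": best[1] if best else None,
        })
    return results


# Made-up sample data: the IDs are placeholders, not real GERS IDs
places = [
    {"id": "a1", "name": "Starbucks Pike", "lon": -122.3325, "lat": 47.6064},
    {"id": "a2", "name": "Starbucks Near", "lon": -122.3322, "lat": 47.6062},
    {"id": "a3", "name": "Starbucks Far", "lon": -122.3450, "lat": 47.6062},
    {"id": "a4", "name": "Other Cafe", "lon": -122.3321, "lat": 47.6062},
]
pois = [
    {"id": 1, "lon": -122.3321, "lat": 47.6062},
    {"id": 2, "lon": -100.0, "lat": 40.0},
]
matched = match_nearest(pois, places, "Starbucks")
```

The nested loop compares every point with every place, which is exactly what does not scale; the spatial join in gersify_dataframe() exists so that the distance predicate does that pruning for you.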
For this example, we’ll work with a dataset of Starbucks locations across the United States:
df.count()  # 16820 locations
df.show()
Let’s take a look at our dataset. Here are the first few rows of the result:
Before enrichment, let’s visualize our base dataset to get an idea of what we are working with.
map = SedonaKepler.create_map(df, "Original Location Dataset")
map
Now for the exciting part: let’s enrich our dataset with GERS IDs.
# NOTE: the string search is case sensitive
df_gersified = gersify_dataframe(df, "Starbucks")
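Because the name filter is case sensitive, a search for "starbucks" would miss places named "Starbucks". One possible workaround, sketched here under assumptions, is to lowercase both sides of the comparison. `name_filter` is a hypothetical helper that only builds the SQL fragment; it is not part of the functions above, and using it would mean swapping it into their WHERE clause yourself.

```python
def name_filter(search: str, case_sensitive: bool = True) -> str:
    """Build the name predicate for the WHERE clause (hypothetical helper).

    Assumes the places table is aliased as `p`, as in the queries above.
    Note: `search` is interpolated directly, so quotes in it are not escaped.
    """
    pattern = f"%{search}%"
    if case_sensitive:
        return f"p.names.primary LIKE '{pattern}'"
    # Lowercase both the column and the pattern so case no longer matters
    return f"lower(p.names.primary) LIKE '{pattern.lower()}'"


strict = name_filter("Starbucks")
relaxed = name_filter("Starbucks", case_sensitive=False)
```

A looser filter will usually raise the match rate but can also pull in more false positives, so it is worth checking a sample of the new matches before relying on them.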
Let’s examine the results:
df_gersified.show(20, False)
Let’s visualize both datasets together to see which points matched and which did not.
SedonaKepler.add_df(map, df_gersified.drop("geometry"), "OMF enrichment locations")
map
By enriching our dataset with GERS IDs, we’ve unlocked several powerful capabilities:
GERS IDs represent a powerful tool for geospatial data integration, and Wherobots makes it easy to incorporate them into your workflows. With the functions demonstrated in this blog post, you can enrich any location dataset with GERS IDs, enabling integration with the broader geospatial data ecosystem.
Want to try it yourself? Sign up for a Wherobots account and explore the full capabilities of the platform for your spatial data processing needs.