In the world of geospatial data, entity matching and data integration are common challenges. In this blog post, we’ll explore how to use the Overture Maps Foundation GERS IDs within Wherobots to link information about the same physical location across different datasets.
GERS IDs (Global Entity Reference System identifiers) are persistent, unique identifiers for places and other entities in the physical world. Created by the Overture Maps Foundation, these IDs serve as a universal reference system that allows different datasets to refer to the same physical location reliably.
Attributes of GERS IDs include:
While Overture Maps data comes with GERS IDs built in, many other datasets–open source, commercial data products, and of course internal company datasets–don’t include these identifiers. This presents a challenge: how do you match your existing location data to GERS IDs to enable integration with the broader ecosystem?
At the recent Cloud Native Geospatial Summit, I co-presented with the Overture Maps Foundation team in a workshop session on GERS. My presentation focused on how to take a non-GERSified Point of Interest (POI) dataset and join it to the Overture Places dataset to assign GERS IDs to those POIs that have one. This simple process also allows users to identify those POIs that do not appear in the Overture Places dataset and thus do not have GERS IDs associated with them.
This blog post is a blogified version of that workshop, presented as a tutorial.
If you’d like to follow along with this tutorial, you’ll need a Wherobots Cloud account. You can sign up for a free community edition here, or sign up for a paid plan through the AWS marketplace here.
First, let’s initialize our Wherobots environment with the necessary libraries:
from sedona.spark import *
from pyspark.sql import functions as f
import os

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
The heart of our solution is a pair of utility functions that perform GERS ID matching.
The first of these, gersify(), finds a matching place and returns its GERS ID for a single point geometry supplied as well-known text (WKT). It takes as input a WKT point geometry and a search string that relates to the name of the POI.
This is useful when you need to process a one-off point. However, if you have a dataframe with many points (say, 16,000), calling this function in a loop would be incredibly slow. That’s why we have the next function.
The second function, gersify_dataframe(), performs the same GERS matching operation at a larger scale. It takes as input a dataframe of points and a search string that relates to those points (e.g. “park” or “stadium”).
Let’s examine the code for these:
def gersify(point_wkt, search_param):
    """
    Uses a point and a search string to find the closest matching GERS ID.
    Returns a DataFrame with the GERS ID and other attributes from Overture Maps,
    ordered from nearest to farthest.
    """
    # Assumes the Overture Places data is already registered as a view named "places"
    query = f"""
    WITH point AS (
        SELECT ST_GeomFromWKT('{point_wkt}') AS point
    )
    SELECT
        p.id AS gers_id,
        p.geometry AS OMF_geom,
        p.names.primary,
        p.categories.main AS category,
        p.websites.primary AS website,
        p.phones.primary AS phone,
        p.geometry,
        -- Computed here because the point column is not part of the output
        ST_DistanceSpheroid(p.geometry, point) AS distance_from_point
    FROM point, places p
    WHERE ST_DWithin(p.geometry, point, 500, true)
      AND p.names.primary LIKE '%{search_param}%'
    ORDER BY distance_from_point ASC
    """
    return sedona.sql(query)


def gersify_dataframe(df, search_param):
    """
    Matches and adds GERS IDs to any dataset with id and geometry columns.
    Returns every input row; rows without a match have a NULL gers_id.
    """
    # Register the input dataframe as a temporary view
    df.createOrReplaceTempView("_temp_df")

    # For each row in the dataframe, find the closest matching GERS ID
    inter_query = f"""
    WITH points AS (
        SELECT id, geometry FROM _temp_df
    )
    SELECT
        p.id AS gers_id,
        p.geometry AS OMF_geom,
        df.*,
        ST_DistanceSpheroid(p.geometry, df.geometry) AS distance_from_point
    FROM points df, places p
    WHERE ST_DWithin(p.geometry, df.geometry, 500, true)
      AND p.names.primary LIKE '%{search_param}%'
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY df.id
        ORDER BY ST_DistanceSpheroid(p.geometry, df.geometry)
    ) = 1
    """

    # Execute the query to find matches
    inter_df = sedona.sql(inter_query)
    inter_df.createOrReplaceTempView("inter_df")

    # Find rows that didn't match and include them in the result.
    # Select only id and geometry so the columns line up with inter_df for the union.
    remaining_rows = """
    SELECT
        NULL AS gers_id,
        NULL AS OMF_geom,
        df.id,
        df.geometry,
        NULL AS distance_from_point
    FROM _temp_df df
    LEFT ANTI JOIN inter_df i_df ON df.id = i_df.id
    """
    return_df = sedona.sql(remaining_rows)

    # Combine matched and unmatched rows
    final_df = inter_df.union(return_df)
    return final_df
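The matching rule both functions apply (keep places within 500 meters whose primary name contains the search string, then take the nearest) can be illustrated without Spark. The following is a minimal pure-Python sketch of that rule, not the Wherobots implementation: the helper names, the sample coordinates, and the IDs are all hypothetical, and it uses a spherical haversine distance where the SQL uses a spheroidal one, so distances will differ slightly.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; a spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    dlon = radians(lon2 - lon1)
    dlat = radians(lat2 - lat1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def closest_match(poi, places, search_param, radius_m=500):
    """Return (gers_id, distance_m) for the nearest place passing both filters, or None."""
    candidates = []
    for place in places:
        # Case-sensitive substring test, mirroring LIKE '%search_param%'
        if search_param not in place["name"]:
            continue
        d = haversine_m(poi["lon"], poi["lat"], place["lon"], place["lat"])
        if d <= radius_m:
            candidates.append((d, place["gers_id"]))
    if not candidates:
        return None
    d, gers_id = min(candidates)  # smallest distance wins
    return gers_id, d


# Hypothetical sample data; these are not real GERS IDs
places = [
    {"gers_id": "id-near", "name": "Starbucks", "lon": -122.3350, "lat": 47.6080},
    {"gers_id": "id-far", "name": "Starbucks Reserve", "lon": -122.3400, "lat": 47.6080},
    {"gers_id": "id-other", "name": "Cafe Vita", "lon": -122.3351, "lat": 47.6080},
]
poi = {"lon": -122.3352, "lat": 47.6080}

match = closest_match(poi, places, "Starbucks")
print(match[0] if match else "no match")
```

The brute-force loop here compares every POI against every place, which is exactly the cost the distance-join version avoids at scale.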
For this example, we’ll work with a dataset of Starbucks locations across the United States:
df.count()  # 16820 locations
df.show()
Let’s take a look at our dataset. Here are the first few rows of the result:
Before enrichment, let’s visualize our base dataset to get an idea of what we are working with.
map = SedonaKepler.create_map(df, "Original Location Dataset")
map
Now for the exciting part: let’s enrich our dataset with GERS IDs.
# NOTE: the string search is case sensitive
df_gersified = gersify_dataframe(df, "Starbucks")
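Since the name is matched with a SQL LIKE pattern, the search is case sensitive, and characters such as `%`, `_`, or `'` in a search string would change the pattern or break the query. One possible way around both is to build the predicate with a small helper. The sketch below is an illustration only: `name_filter_sql` is a hypothetical helper that is not part of Sedona or Wherobots, and the quoting assumes Spark's default string-literal parsing, where backslash is the escape character.

```python
def name_filter_sql(search_param: str, column: str = "p.names.primary") -> str:
    """Build a case-insensitive LIKE predicate with the search text escaped.

    Hypothetical helper; assumes Spark SQL's default backslash escaping
    inside string literals and uses '!' as the LIKE escape character.
    """
    text = search_param.lower()
    # Escape the LIKE escape character first, then the LIKE wildcards
    for ch in ("!", "%", "_"):
        text = text.replace(ch, "!" + ch)
    # Escape backslashes and single quotes for the SQL string literal
    text = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"lower({column}) LIKE '%{text}%' ESCAPE '!'"


print(name_filter_sql("Starbucks"))
print(name_filter_sql("McDonald's"))
```

The returned fragment could stand in for the `p.names.primary LIKE ...` condition inside the query string, at the cost of lowercasing the name column for every candidate row.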
Let’s examine the results:
df_gersified.show(20, False)
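Beyond eyeballing the first 20 rows, it can help to summarize how many locations matched and how far the matches were from the original points. The sketch below is one way to do that on collected rows; `summarize_matches` is a hypothetical helper, and it assumes the result has the `gers_id` and `distance_from_point` columns produced above and is small enough to collect to the driver (about 16,800 rows here).

```python
from statistics import median


def summarize_matches(rows):
    """Summarize match results from rows that expose gers_id and distance_from_point."""
    rows = list(rows)
    matched = [r for r in rows if r["gers_id"] is not None]
    total = len(rows)
    return {
        "total": total,
        "matched": len(matched),
        "unmatched": total - len(matched),
        "match_rate": len(matched) / total if total else 0.0,
        # Median distance in meters between each POI and its matched place
        "median_distance_m": (
            median(r["distance_from_point"] for r in matched) if matched else None
        ),
    }


# Hypothetical rows standing in for collected Spark rows
sample = [
    {"gers_id": "a", "distance_from_point": 10.0},
    {"gers_id": None, "distance_from_point": None},
    {"gers_id": "b", "distance_from_point": 30.0},
]
print(summarize_matches(sample))

# With the tutorial's dataframe, the call would look like:
# summarize_matches(df_gersified.select("gers_id", "distance_from_point").collect())
```

A low match rate or a large median distance is a hint that the search string or the 500 meter radius may need adjusting for your data.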
Let’s visualize both datasets together to see which points matched and which did not.
SedonaKepler.add_df(map, df_gersified.drop("geometry"), "OMF enrichment locations")
map
By enriching our dataset with GERS IDs, we’ve unlocked several powerful capabilities:
GERS IDs represent a powerful tool for geospatial data integration, and Wherobots makes it easy to incorporate them into your workflows. With the functions demonstrated in this blog post, you can enrich any location dataset with GERS IDs, enabling integration with the broader geospatial data ecosystem.
Want to try it yourself? Sign up for a Wherobots account and explore the full capabilities of the platform for your spatial data processing needs.