At Wherobots, we’re excited to offer access to the Foursquare Open Places dataset through the Wherobots Global Hub. We maintain a pipeline to keep this dataset updated along with the Foursquare releases. In this tutorial, we’ll show you how to work with Foursquare’s Open Places dataset using Wherobots. We’ll demonstrate how to access, query, and visualize Points of Interest (POI) data, culminating in aggregating and visualizing the data as a choropleth map.
The overall process we will show is applicable to many different use cases; for this tutorial we will use it to analyze coffee shops across San Francisco and see how they are distributed by neighborhood.
If you would like to follow along, you can spin up a free instance here and access this notebook for yourself in the Wherobots Jupyter Notebook environment under the examples folder.
Let’s dive in.
Foursquare’s Places dataset is a comprehensive global collection of points of interest (POIs) containing over 100 million locations worldwide. What makes this dataset particularly valuable is that it’s continuously updated and verified through Foursquare’s Placemaker Tools, which enable community contributions to maintain accuracy.
The dataset includes details such as names, addresses, geographic coordinates, category labels, and contact information for each place.
To follow along with this tutorial, you’ll need a Wherobots account. You can sign up for free at Wherobots.com.
First, let’s set up our Sedona context in Wherobots. This will start up our Spark-based compute environment and give us access to the geospatial functions we need to work with the Foursquare dataset.
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
The Foursquare data is already accessible in the Wherobots Open Data Catalog, so there’s no need to download or import it separately. Let’s check what tables are available:
sedona.sql("SHOW tables IN wherobots_open_data.foursquare").show(truncate=False)
+----------+----------+-----------+
|namespace |tableName |isTemporary|
+----------+----------+-----------+
|foursquare|categories|false      |
|foursquare|places    |false      |
+----------+----------+-----------+
There are two tables: places, which contains the actual locations, and categories, which provides the classification system. Let’s look at the schema of the places table:
sedona.table("wherobots_open_data.foursquare.places").printSchema()
You’ll get a result that looks something like this:
root
 |-- fsq_place_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- address: string (nullable = true)
 |-- locality: string (nullable = true)
 |-- region: string (nullable = true)
 |-- postcode: string (nullable = true)
 |-- admin_region: string (nullable = true)
 |-- post_town: string (nullable = true)
 |-- po_box: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_created: string (nullable = true)
 |-- date_refreshed: string (nullable = true)
 |-- date_closed: string (nullable = true)
 |-- tel: string (nullable = true)
 |-- website: string (nullable = true)
 |-- email: string (nullable = true)
 |-- facebook_id: long (nullable = true)
 |-- instagram: string (nullable = true)
 |-- twitter: string (nullable = true)
 |-- fsq_category_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- fsq_category_labels: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- placemaker_url: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- bbox: struct (nullable = true)
 |    |-- xmin: double (nullable = true)
 |    |-- ymin: double (nullable = true)
 |    |-- xmax: double (nullable = true)
 |    |-- ymax: double (nullable = true)
The schema includes fields like fsq_place_id, name, latitude, longitude, address, country, fsq_category_labels, and more. Importantly, it also includes a geometry field that we can use for spatial operations.
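The bbox struct in the schema stores the geometry’s bounding box, which is useful as a cheap prefilter before an exact geometry test. As a rough illustration in plain Python (hypothetical data, not the Sedona API; the query engine does this internally):

```python
# Sketch: a bounding-box check as a cheap prefilter before an exact
# geometry test. The bbox struct stores xmin/ymin/xmax/ymax, so a point
# can be rejected with four comparisons. Illustrative only.

def in_bbox(lon, lat, bbox):
    """Return True if the point falls inside the bounding box."""
    return (bbox["xmin"] <= lon <= bbox["xmax"]
            and bbox["ymin"] <= lat <= bbox["ymax"])

# A hypothetical bounding box roughly covering San Francisco
sf_bbox = {"xmin": -122.52, "ymin": 37.70, "xmax": -122.35, "ymax": 37.83}

print(in_bbox(-122.41, 37.77, sf_bbox))  # a downtown SF point -> True
print(in_bbox(-118.24, 34.05, sf_bbox))  # a Los Angeles point -> False
```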
Wherobots provides access to multiple versions of the Foursquare data. You can select a specific version using the VERSION AS OF clause:
sedona.sql("SELECT * FROM wherobots_open_data.foursquare.places VERSION AS OF 'dt=2025-02-06'").show(5, truncate=True)
If you don’t specify a version, you’ll get the latest one by default.
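Because the version labels are ISO dates, they sort lexicographically in chronological order, so “latest by default” amounts to taking the maximum label. A tiny plain-Python sketch with hypothetical version strings:

```python
# Sketch: snapshot labels like dt=YYYY-MM-DD sort lexicographically in
# chronological order, so the latest snapshot is simply the maximum.
# The version list here is hypothetical, for illustration only.

versions = ["dt=2024-11-19", "dt=2025-01-10", "dt=2025-02-06"]
latest = max(versions)
print(latest)  # dt=2025-02-06
```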
Let’s start by selecting the columns we’re interested in:
places_df = sedona.sql("""
    SELECT fsq_place_id, name, geometry, fsq_category_labels,
           address, country, date_refreshed, date_created
    FROM wherobots_open_data.foursquare.places
    WHERE date_closed IS NULL
      AND name IS NOT NULL
      AND geometry IS NOT NULL
      AND country IS NOT NULL
""")
places_df.createOrReplaceTempView("places")
Now, let’s run a quick query to see what we’re working with and get some basic stats:
sedona.sql(""" SELECT COUNT(*) as total_places, COUNT(CASE WHEN name IS NOT NULL THEN 1 END) as has_name, COUNT(CASE WHEN address IS NOT NULL THEN 1 END) as has_address, COUNT(CASE WHEN fsq_category_labels IS NOT NULL THEN 1 END) as has_categories, COUNT(CASE WHEN date_closed IS NULL THEN 1 END) as still_open FROM places """).show()
When you run the above query, you’ll get a result like this:
+------------+---------+-----------+--------------+----------+
|total_places| has_name|has_address|has_categories|still_open|
+------------+---------+-----------+--------------+----------+
|   104635092|104635092|   67380672|      92959858|  98436217|
+------------+---------+-----------+--------------+----------+
That’s over 104 million places in the dataset. About 67 million have addresses, and nearly 93 million have category labels. That’s an impressive dataset.
Let’s narrow our focus to a specific region. We can first use standard SQL to filter to the United States:
us_df = sedona.sql("""
    SELECT *
    FROM places
    WHERE country = 'US'
""")
This approach using standard SQL works well for cases where the region you want to filter for is already defined in the dataset. For example, the dataset has a country column, so we can filter by country.
But what if we wanted to filter by a geographic region that isn’t already defined in the dataset? For example, maybe we want to filter by census block groups, cities, or internal company polygons that we have defined for trade areas.
That’s where the power of spatial joins in Wherobots comes in. We can define any arbitrary polygon and use it to filter the data.
Note that the actual polygon is a very long WKT string, so we truncate it here in the example code; the full WKT polygon is defined in the notebook itself for you to access. As an alternative, you could use other open boundary datasets, such as the Overture Divisions dataset, and join the Foursquare places against those divisions.
# Define a polygon for San Francisco
# (truncated here; the full WKT string is defined in the notebook)
SF = "MULTIPOLYGON (((-122.4773830027518 37.8110279985324, ...))"

# Filter to places within San Francisco
sf_df = sedona.sql(f"""
    SELECT *
    FROM places
    WHERE ST_Contains(ST_GeomFromWKT('{SF}'), geometry)
""")
sf_df.createOrReplaceTempView("sf_places")
Now we have just the places located within San Francisco.
Filtering the data for a single polygon may not be a huge task by itself, but you really start to see the power of Wherobots when this scales up. We could just as easily spatially join all 104 million places in the dataset against a million different polygons (buildings, cities, counties, senate districts, zip codes, and so on), assigning each point to the polygon that contains it, and we could use that exact same ST_Contains function to do so.
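To build intuition for the test that ST_Contains performs for each point, here is a minimal ray-casting point-in-polygon check in plain Python. This is only a sketch of the even-odd rule; a real geometry engine handles edge cases, multipolygons, and spatial indexing far more robustly:

```python
# Sketch of the even-odd (ray casting) rule behind point-in-polygon
# tests like ST_Contains: cast a horizontal ray from the point and
# count how many polygon edges it crosses. Odd -> inside, even -> outside.
# Illustrative only; not how the engine is actually implemented.

def point_in_polygon(x, y, ring):
    """ring: list of (x, y) vertices of a simple closed polygon."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does the ray from (x, y) cross edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```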
Wherobots includes built-in visualization capabilities through SedonaKepler. Let’s create a simple map of our San Francisco places:
sf_map = SedonaKepler.create_map(sf_df, "Places")
We can see from the image that there are POIs clear across San Francisco, with a dense concentration of places in the Financial District downtown.
Now say we want to search for specific places such as a brand or retail chain. For example, perhaps we want to find all Starbucks locations in San Francisco. To do so we can use the following query:
sbux_df = sedona.sql("""
    SELECT *
    FROM sf_places
    WHERE LOWER(name) = 'starbucks'
""")
sbux_map = SedonaKepler.create_map(sbux_df, "Starbucks in San Francisco")
The Foursquare data uses a hierarchical category system. Let’s see what categories are most common in our San Francisco dataset:
sedona.sql(""" SELECT fsq_category_labels[0] as primary_category, COUNT(*) as count FROM sf_places GROUP BY fsq_category_labels[0] ORDER BY count DESC LIMIT 20 """).show(truncate=False)
Here are the top categories in San Francisco by count:
We can also filter to places that fall into a specific category. For example, instead of searching for Starbucks locations by name, let’s select all Points of Interest that are in the Coffee Shop category:
category_places = sedona.sql("""
    SELECT *
    FROM sf_places
    WHERE ARRAY_CONTAINS(
        fsq_category_labels,
        'Dining and Drinking > Cafe, Coffee, and Tea House > Coffee Shop')
""")
category_places.createOrReplaceTempView("category_places")
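Each category label encodes its place in the hierarchy as a single " > "-delimited path, so coarser groupings can be derived by splitting the string. A small plain-Python sketch:

```python
# Sketch: Foursquare category labels encode their hierarchy in one
# " > "-delimited string, so you can roll a specific category up to a
# broader parent just by splitting the path. Plain Python, for illustration.

label = "Dining and Drinking > Cafe, Coffee, and Tea House > Coffee Shop"

levels = label.split(" > ")
print(levels[0])   # top-level category: "Dining and Drinking"
print(levels[-1])  # most specific category: "Coffee Shop"

# e.g. roll the specific category up to its level-2 parent
parent = " > ".join(levels[:2])
print(parent)      # "Dining and Drinking > Cafe, Coffee, and Tea House"
```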
Here’s a quick look at the table that we get after running that query.
+--------------------+--------------------+--------------------+--------------------+-------+--------------+------------+
|        fsq_place_id|                name|            geometry| fsq_category_labels|country|date_refreshed|date_created|
+--------------------+--------------------+--------------------+--------------------+-------+--------------+------------+
|5622e819498ecbeed...|Four Barrel at TI...|POINT (-122.37331...|[Dining and Drink...|     US|    2024-10-26|  2015-10-18|
|49d6619ff964a520b...|        Cafe deStijl|POINT (-122.40036...|[Dining and Drink...|     US|    2022-07-29|  2009-04-03|
|49dbc22af964a520f...|Battery Street Co...|POINT (-122.40133...|[Dining and Drink...|     US|    2024-05-06|  2009-04-07|
|5b1599d604d1ae002...|           Good Mojo|POINT (-122.40049...|[Dining and Drink...|     US|    2024-10-26|  2018-06-04|
|4b02f2f9f964a5205...|           Starbucks|POINT (-122.40102...|[Dining and Drink...|     US|    2024-11-05|  2009-11-17|
|49fa3e4ff964a520d...|  Jackson Place Cafe|POINT (-122.40135...|[Dining and Drink...|     US|    2024-07-17|  2009-05-01|
|4d065736a26854819...|Réveille Coffee C...|POINT (-122.40035...|[Dining and Drink...|     US|    2024-10-20|  2010-12-13|
|4bafe772f964a5205...|            om bucks|POINT (-122.40090...|[Dining and Drink...|     US|    2023-04-04|  2010-03-28|
|4ab317e7f964a5207...|       Peet's Coffee|POINT (-122.40115...|[Dining and Drink...|     US|    2023-09-20|  2009-09-18|
|2af323a0d2d542feb...|  Aroma Espresso Bar|POINT (-122.40102...|[Dining and Drink...|     US|    2012-08-27|  2012-08-27|
+--------------------+--------------------+--------------------+--------------------+-------+--------------+------------+
For a more advanced visualization, let’s create a choropleth map showing the number of coffee shops in each San Francisco neighborhood.
First we will need a dataset of polygons for the SF neighborhoods. Luckily, we have one from the SF open data portal. We’ll load these from a CSV file:
SF_NEIGHBORHOODS_URL = "s3://wherobots-examples/data/sf_neighborhoods.csv"

neighborhoods = (sedona.read.format('csv')
    .option('header', 'true')
    .option('delimiter', ',')
    .option('inferSchema', 'true')
    .load(SF_NEIGHBORHOODS_URL)
)
neighborhoods.createOrReplaceTempView("neighborhoods")
For context, here’s the table that results after we load those neighborhood polygon boundaries.
+--------------------+--------------------+
|            the_geom|          neighborho|
+--------------------+--------------------+
|MULTIPOLYGON (((-...|            Seacliff|
|MULTIPOLYGON (((-...|      Haight Ashbury|
|MULTIPOLYGON (((-...|       Outer Mission|
|MULTIPOLYGON (((-...|        Inner Sunset|
|MULTIPOLYGON (((-...|Downtown/Civic Ce...|
|MULTIPOLYGON (((-...|     Diamond Heights|
|MULTIPOLYGON (((-...|           Lakeshore|
|MULTIPOLYGON (((-...|        Russian Hill|
|MULTIPOLYGON (((-...|          Noe Valley|
|MULTIPOLYGON (((-...| Treasure Island/YBI|
+--------------------+--------------------+
As you can see, there are two columns: the_geom, a multipolygon defining each neighborhood’s boundary, and neighborho, the neighborhood’s name.
If you’ve spent any time in SF you’ll certainly recognize some of these names.
Now that we have the boundaries we’ll use a spatial join to count coffee shops in each neighborhood:
neighborhood_agg = sedona.sql("""
    SELECT ST_GeomFromWKT(n.the_geom) AS geometry,
           n.neighborho AS neighborhood,
           COUNT(*) AS location_count,
           collect_list(p.name) AS coffee_shops
    FROM category_places p
    JOIN neighborhoods n
      ON ST_Contains(ST_GeomFromWKT(n.the_geom), p.geometry)
    GROUP BY n.the_geom, n.neighborho
""")
This gives us the following table where we can see the neighborhood name, its polygon geometry, the count of coffee shops contained in that neighborhood, and an array of the names of the coffee shops.
+--------------------+--------------------+--------------+--------------------+
|            geometry|        neighborhood|location_count|        coffee_shops|
+--------------------+--------------------+--------------+--------------------+
|MULTIPOLYGON (((-...| Treasure Island/YBI|             1|[Four Barrel at T...|
|MULTIPOLYGON (((-...|        Potrero Hill|            25|[Starbucks, WFM C...|
|MULTIPOLYGON (((-...|     South of Market|           119|[W6 Coffee Bar, S...|
|MULTIPOLYGON (((-...|             Bayview|            20|[The Happy Vegan,...|
|MULTIPOLYGON (((-...|  Financial District|           186|[Starbucks, Jacks...|
|MULTIPOLYGON (((-...|   Visitacion Valley|             4|[Joe Leland, Miss...|
|MULTIPOLYGON (((-...|           Chinatown|            11|[Cafe Vivo, Latte...|
|MULTIPOLYGON (((-...|Downtown/Civic Ce...|            98|[Chai Bar, Dignit...|
|MULTIPOLYGON (((-...|         North Beach|            30|[Cafe deStijl, Ba...|
|MULTIPOLYGON (((-...|            Nob Hill|            22|[Cafe Mozart, Gal...|
+--------------------+--------------------+--------------+--------------------+
Finally we have what we need to create our visualization. Time to create our choropleth map.
Note here that I am using a custom map config to set the color scale, style, and other properties.
map = SedonaKepler.create_map(df=neighborhood_agg, name="Coffee Shop Count", config=map_config)
The resulting visualization reveals interesting patterns. Downtown San Francisco (the Financial District) has the highest concentration with 186 coffee shops, while areas like the Presidio are relatively sparse with only around 10 options. The Inner Richmond neighborhood shows about 37 coffee shops.
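Under the hood, a choropleth maps each count onto one of a small number of color classes. A minimal plain-Python sketch of equal-interval binning, using counts from the table above (Kepler.gl computes this internally; the map config only styles it):

```python
# Sketch: how a choropleth assigns counts to color classes using
# equal-interval bins. SedonaKepler/Kepler.gl handles this internally;
# this only illustrates the idea with counts from the tutorial.

def equal_interval_bin(value, vmin, vmax, n_bins):
    """Map a value to a bin index in [0, n_bins - 1]."""
    if value >= vmax:
        return n_bins - 1
    width = (vmax - vmin) / n_bins
    return int((value - vmin) // width)

counts = {"Financial District": 186, "South of Market": 119, "Presidio": 10}
for name, c in counts.items():
    print(name, equal_interval_bin(c, 0, 186, 5))
# Financial District falls in the top bin (4), Presidio in the bottom (0)
```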
With these live Sedona Kepler maps, you can hover over the neighborhood you’re interested in to see the full list of coffee shops in that area.
That wraps up our tutorial on working with Foursquare Places data in Wherobots. I hope you enjoyed learning about these powerful geospatial capabilities.
Before we wrap up, let’s take a quick look at what makes this dataset from Foursquare unique.
Unlike some other POI datasets, Foursquare’s Places data is continually updated through their Placemaker Tools. These web-based tools allow community members to:
This community-driven approach helps ensure the data remains current and accurately reflects real-world changes.
The Foursquare Places dataset combined with Wherobots’ powerful geospatial capabilities offers an excellent foundation for location-based analytics. Whether you’re studying urban patterns, analyzing business distributions, or planning a coffee crawl through San Francisco, these tools make it easy to extract meaningful insights from spatial data.
Ready to explore the Foursquare Places data for yourself? Sign up for a free Wherobots account and try out this notebook!
By the way, a couple of months ago I released a video walking through this exact use case in one of our Wherobots demo notebooks. Feel free to check it out.