🌶 Comparing taco chains: a consumer retail cannibalization study with isochrones

Authors: Sean Knight and Ilya Marchenko

Using Wherobots for a Retail Cannibalization Study Comparing Two Leading Taco Chains

In this post, we explore how to implement a workflow from the commercial real estate (CRE) space using Wherobots Cloud. This workflow is commonly known as a cannibalization study, and we will be using WherobotsDB, POI data from OvertureMaps, the open source Valhalla API, and visualization capabilities offered by SedonaKepler.

NOTE: This is a guest Wherobots post from our friends at YuzuData. Reach out to them to learn more about their spatial data product services. You can also join them for a demo of this use case, scheduled for July 30th, here.

What is a retail cannibalization study?

In CRE (commercial real estate), stakeholders are often interested in questions like “If we build a new fast food restaurant here, how will its performance be affected by other similar fast food locations that already exist nearby?” The idea of the new fast food restaurant “eating into” the sales of other fast food restaurants that already exist nearby is what is known as ‘cannibalization’.

The main objective of studying this phenomenon is to determine the extent to which a new store might divert sales from existing stores owned by the same company or brand and evaluate the overall impact on the company’s market share and profitability in the area.

Cannibalization Study in Wherobots

For this case study, we will look at two taco chains which are located primarily in Texas: Torchy’s Tacos and Velvet Taco. Information about the performance of individual locations and customer demographics is generally proprietary. We can, however, still learn a great deal about the potential for cannibalization, both between these two chains as competitors and between individual locations of each chain. We also know, based on our own experience, that these chains compete with each other. Which taco shop to go to when we are visiting Texas is always a spicy debate.


We begin by importing modules that will be useful to us as we go on.

import geopandas as gpd
import pandas as pd
import requests
from sedona.spark import *
from pyspark.sql.functions import explode, array
from pyspark.sql import functions as F

Next, we can initiate a Sedona context.

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Identifying Points of Interest

Now, we need to retrieve the Torchy’s Tacos and Velvet Taco locations. In general, one can do this via a variety of both free and paid means. We will look at a simple, free approach that is made possible by the integration of Overture Maps data into the Wherobots environment:

sedona.table("wherobots_open_data.overture.places_place"). \
createOrReplaceTempView("places")

We create a view of the Overture Maps places database, which contains information on points of interest (POIs) worldwide.

Now, we can select the POIs relevant to this exercise:

stores = sedona.sql("""
SELECT id, names.common[0].value as name, ST_X(geometry) as long,
 ST_Y(geometry) as lat, geometry, 
 CASE WHEN names.common[0].value LIKE "%Torchy's Tacos%" 
 THEN "Torchy's Tacos" 
 ELSE 'Velvet Taco' END AS chain
FROM places
WHERE addresses[0].region = 'TX'
AND (names.common[0].value LIKE "%Torchy's Tacos%" 
OR names.common[0].value LIKE '%Velvet Taco%')
""")

Calling stores.show() gives us a look at the Spark DataFrame we created:

+--------------------+--------------+-----------+----------+
|                  id|          name|       long|       lat|
+--------------------+--------------+-----------+----------+
|tmp_8104A79216254...|Torchy's Tacos|  -98.59689|  29.60891|
|tmp_D17CA8BD72325...|Torchy's Tacos|  -97.74175|  30.29368|
|tmp_F497329382C10...|   Velvet Taco|  -95.48866|  30.18314|
|tmp_9B40A1BF3237E...|Torchy's Tacos| -96.805853| 32.909982|
|tmp_38210E5EC047B...|Torchy's Tacos|  -96.68755|  33.10118|
|tmp_DF0C5DF6CA549...|Torchy's Tacos|  -97.75159|  30.24542|
|tmp_BE38CAC8D46CF...|Torchy's Tacos|  -97.80877|  30.52676|
|tmp_44390C4117BEA...|Torchy's Tacos|  -97.82594|   30.4547|
|tmp_8032605AA5BDC...|   Velvet Taco| -96.469695| 32.898634|
|tmp_0A2AA67757F42...|Torchy's Tacos|  -96.44858|  32.90856|
|tmp_643821EB9C104...|Torchy's Tacos|  -97.11933|  32.94021|
|tmp_0042962D27E06...|   Velvet Taco|-95.3905374|29.7444214|
|tmp_8D0E2246C3F36...|Torchy's Tacos|  -97.15952|  33.22987|
|tmp_CB939610BC175...|Torchy's Tacos|  -95.62067|  29.60098|
|tmp_54C9A79320840...|Torchy's Tacos|  -97.75604|  30.37091|
|tmp_96D7B4FBCB327...|Torchy's Tacos|  -98.49816|  29.60937|
|tmp_1BB732F35314D...|   Velvet Taco|  -95.41044|    29.804|
|tmp_55787B14975DD...|   Velvet Taco|-96.7173913|32.9758554|
|tmp_7DC02C9CC1FAA...|Torchy's Tacos|  -95.29544|  32.30361|
|tmp_1987B31B9E24D...|   Velvet Taco|  -95.41006| 29.770256|
+--------------------+--------------+-----------+----------+
only showing top 20 rows

We’ve retrieved the latitude and longitude of our locations, as well as the name of the chain each location belongs to. We used the CASE WHEN statement in our query in order to simplify the location names. This way, we can easily select all the stores from the Torchy’s Tacos chain, for example, and not have to worry about individual locations being called things like “Torchy’s Tacos – Rice Village” or “Velvet Taco Midtown”, etc.

We can also visualize these locations using SedonaKepler. First, we can create the map using the following snippet:

location_map = SedonaKepler.create_map(stores, "Locations", 
    config = location_map_cfg)

Then, we can display the results by simply calling location_map in the notebook. For convenience, we included the location_map_cfg Python dict in our notebook, which stores the settings necessary for the map to be created with the locations color-coded by chain. If we wish to make modifications to the map and save the new configuration for later use, we can do so by calling location_map.config and saving the result either as a cell in our notebook or in a separate location_map_cfg.py file.
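
For example, here is a minimal sketch of capturing and persisting a modified configuration (the file name is just illustrative):

# Capture the current Kepler configuration after adjusting the map in the notebook UI
location_map_cfg = location_map.config

# Persist it for reuse in a later session (hypothetical file name)
with open('location_map_cfg.py', 'w') as f:
    f.write(f"location_map_cfg = {location_map_cfg!r}")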

Generating Isochrones

Now, for each of these locations, we can generate a polygon known as an isochrone or drivetime. These polygons represent the areas that are within a certain driving time of the given location. We will generate these drivetimes using the Valhalla isochrone API:

def get_isochrone(lat, lng, costing, time_steps, name, location_id):
    url = "https://valhalla1.openstreetmap.de/isochrone"
    params = {
      "locations": [{"lon": lng, "lat": lat}],
      "contours": [{"time": i} for i in time_steps],
      "costing": costing,
      "polygons": 1,
    }
    response = requests.post(url, json=params)
    if response:
        result = response.json()
        if 'error_code' not in result.keys():
            df = gpd.GeoDataFrame.from_features(result)
            df['name'] = name
            df['id'] = location_id
            return df[['name','id','geometry']]

The function takes as its input a latitude and longitude value, a costing parameter, a list of time steps (in minutes), a location name, and a location id. The output is a GeoDataFrame containing a Shapely polygon representing the isochrone, along with the name and id of the location the isochrone corresponds to.

We have separate columns for a location id and a location name so that we can use the id column to examine isochrones for individual restaurants and we can use the name column to look at isochrones for each of the chains.

The costing parameter can take on several different values (see the API reference here), and it can be used to create “drivetimes” assuming the user is either walking, driving, or taking public transport.
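
For example, a hedged variation of the call above requests a 10-minute walking isochrone instead of a 5-minute drivetime (the coordinates and id below are purely illustrative):

# Same helper as above, but with pedestrian costing and a 10-minute contour
walk_isochrone = get_isochrone(
    lat=29.7444, lng=-95.3905,   # illustrative coordinates in Houston, TX
    costing='pedestrian',        # Valhalla also supports 'auto', 'bicycle', and others
    time_steps=[10],
    name='Velvet Taco',
    location_id='example-id'
)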

We create a GeoDataFrame of all of the 5-minute drivetimes for our taco restaurant locations

drivetimes_5_min = pd.concat([get_isochrone(row.lat, row.long, 'auto', [5],
 row.chain, row.id) for row in stores.select('id','chain','lat','long').collect()])

and then save it to our S3 storage for later use:

drivetimes_5_min.to_csv('s3://path/drivetimes_5_min_torchys_velvet.csv',
 index = False)

Because we are using a free API and we have to create quite a few of these isochrones, we highly recommend saving the file for later analysis. For the purposes of this blog, we have provided a ready-made isochrone file here, which we can load into Wherobots with the following snippet:

sedona.read.option('header','true').format('csv'). \
load('s3://path/drivetimes_5_min_torchys_velvet.csv'). \
createOrReplaceTempView('drivetimes_5_min')

We can now visualize our drivetime polygons in SedonaKepler. As before, we first create the map with the snippet below.

map_isochrones = sedona.read.option('header','true').format('csv'). \
load('s3://path/drivetimes_5_min_torchys_velvet.csv')

isochrone_map = SedonaKepler.create_map(map_isochrones, "Isochrones",
 config = isochrone_map_cfg)

Now, we can display the result by calling isochrone_map.

The Analysis

At this point, we have a collection of the Torchy’s and Velvet Taco locations in Texas, and we know the areas which are within a 5-minute drive of each location. What we want to do now is to estimate the number of potential customers that live near each of these locations, and the extent to which these populations overlap.

A First Look

Before we look at how these two chains might compete with each other, let’s also take a look at the extent to which restaurants within each chain might be cannibalizing each others’ sales. A quick way to do this is by using the filtering feature in Kepler to look at isochrones for a single chain:



We see that locations for each chain are fairly spread out and, at least at the 5-minute drivetime level, there is not a high degree of cannibalization within each chain. Looking at the isochrones for both chains, however, we notice that Velvet Taco locations often tend to be near Torchy’s Tacos locations (or vice-versa). At this point, all we have are qualitative statements based on these maps. Next, we will show how to use H3 and existing open-source datasets to make these statements more quantitative.

Estimating Cannibalization Potential

As we can see by looking at the map of isochrones above, they are highly irregular polygons which have a considerable amount of overlap. In general, these polygons are not described in a ‘nice’ way by any administrative boundaries such as census block groups, census tracts, etc. Therefore, we will have to be a little creative in order to estimate the population inside them.

One way of doing this using the tools provided by Apache Sedona and Wherobots is to convert these polygons to H3 hexes. We can do this with the following snippet:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name', 'id').withColumnRenamed('col', 'h3'). \
createOrReplaceTempView('h3_isochrones')

This turns our table of drivetime polygons into a table where each row represents a hexagon with sides roughly 400m long, which is a part of a drivetime polygon. We also record the chain that these hexagons are associated to (the chain that the polygon they came from belongs to). We store each hexagon in its own row because this will simplify the process of estimating population later on.

Although the question of estimating population inside individual H3 hexes is also a difficult one (we will release a notebook on this soon), open-source datasets with this information are available online, and we will use one such dataset, provided by Kontur:

kontur = sedona.read.option('header','true'). \
load('s3://path/us_h3_8_pop.geojson', format="json"). \
drop('_corrupt_record').dropna(). \
selectExpr('CAST(CONV(properties.h3, 16, 10) AS BIGINT) AS h3',
 'properties.population as population')

kontur.createOrReplaceTempView('kontur')

We can now enhance our h3_isochrones table with population counts for each H3 hex:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name', 'id').withColumnRenamed('col', 'h3'). \
join(kontur, 'h3', 'left').distinct().createOrReplaceTempView('h3_isochrones')

At this stage, we can also quickly compute the cannibalization potential within each chain. Using the following code, for example, we can estimate the number of people who live within a 5-minute drive of more than one Torchy's Tacos:

sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name', 'id').withColumnRenamed('col', 'h3'). \
join(kontur, 'h3', 'left').filter('name LIKE "%Torchy%"').select('h3', 'population'). \
groupBy('h3').count().filter('count >= 2').join(kontur, 'h3', 'left').distinct(). \
agg(F.sum('population')).collect()[0][0]
97903.0

We can easily change this code to compute the same information for Velvet Taco by changing filter('name LIKE "%Torchy%"') in the snippet above to filter('name LIKE "%Velvet%"'). If we do this, we will see that 100298 people live within a 5-minute drive of more than one Velvet Taco. Thus, the Torchy’s Tacos brand appears to be slightly better at avoiding cannibalization among its own locations (especially given that Torchy’s Tacos has more locations than Velvet Taco).

Now, we can run the following query to show the number of people in Texas who live within a 5-minute drive of a Torchy’s Tacos:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE "%Torchy's%"
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()

The reason we select distinct H3 hexes here is because a single hex can belong to more than one isochrone (as evidenced by the SedonaKepler visualizations above). We get the following output:

+---------------+
|sum(population)|
+---------------+
|      1546765.0|
+---------------+

So roughly 1.5 million people in Texas live within a 5-minute drive of a Torchy’s Tacos location. Looking at our previous calculations for how many people live near more than one restaurant of the same chain, we can see that Torchy’s Tacos locations near each other cannibalize about 6.3% of the potential customers who live within 5 minutes of a Torchy’s location.

Running a similar query for Velvet Taco tells us that roughly half as many people live within a 5-minute drive of a Velvet Taco:

sedona.sql("""
WITH distinct_h3 (h3, population) AS 
(
    SELECT DISTINCT h3, ANY_VALUE(population)
    FROM h3_isochrones
    WHERE name LIKE '%Velvet Taco%'
    GROUP BY h3
)
SELECT SUM(population)
FROM distinct_h3
""").show()
+---------------+
|sum(population)|
+---------------+
|       750360.0|
+---------------+

As before, we can also see that Velvet Taco locations near each other cannibalize about 13.4% of the potential customers who live within 5 minutes of a Velvet Taco location.
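
Both percentages follow directly from the figures computed above; a quick back-of-the-envelope check:

# Totals and within-chain overlaps from the queries above
torchys_total, torchys_overlap = 1546765.0, 97903.0
velvet_total, velvet_overlap = 750360.0, 100298.0

print(f"Torchy's self-cannibalization: {torchys_overlap / torchys_total:.1%}")   # ~6.3%
print(f"Velvet Taco self-cannibalization: {velvet_overlap / velvet_total:.1%}")  # ~13.4%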

Now, we can estimate the potential for cannibalization between these two chains:

sedona.sql("""
WITH overlap_h3 (h3, population) AS
(
    SELECT DISTINCT a.h3, ANY_VALUE(a.population)
    FROM h3_isochrones a LEFT JOIN h3_isochrones b ON a.h3 = b.h3
    WHERE a.name != b.name
    GROUP BY a.h3
)
SELECT sum(population)
FROM overlap_h3
""").show()

which gives:

+---------------+
|sum(population)|
+---------------+
|       415033.0|
+---------------+

We can see that more than half of the people who live near a Velvet Taco location also live near a Torchy’s Tacos location, and we can visualize this population overlap:

isochrones_h3_map_data = sedona.sql("""
SELECT ST_H3CellIds(ST_GeomFromWKT(geometry), 8, false) AS h3, name, id
FROM drivetimes_5_min
""").select(explode('h3'), 'name', 'id').withColumnRenamed('col', 'h3'). \
join(kontur, 'h3', 'left').select('name', 'population', array('h3')).withColumnRenamed('array(h3)', 'h3'). \
selectExpr('name', 'population', 'ST_H3ToGeom(h3)[0] AS geometry')

isochrones_h3_map = SedonaKepler.create_map(isochrones_h3_map_data, 'Isochrones in H3', config = isochrones_h3_map_cfg)



Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

The latest release of Overture Maps data is now published in GeoParquet format, allowing for more efficient spatial operations when selecting subsets of the dataset and enabling interoperability with a growing ecosystem of data tooling that supports the cloud-native GeoParquet format. In this post we explore some of the motivations and advantages of publishing the Overture Maps data as GeoParquet as well as the process used by Overture to generate and publish a large scale geospatial dataset as GeoParquet using Apache Sedona.

Understanding GeoParquet

Built upon Apache Parquet, GeoParquet is an open-source file format specifically designed for storing geospatial data efficiently. Apache Parquet is a column-oriented data file format optimized for storing data efficiently in cloud data object stores like Amazon’s S3 service. However, Parquet lacks native support for geospatial data. GeoParquet extends Parquet by defining a standard for storing geometry data and associated metadata, such as the spatial bounding box and coordinate reference system for each file.

In a nutshell, GeoParquet adds the following information to the file metadata:

  • Version: the GeoParquet specification version of this file
  • Primary column: the main geometry column to be used when multiple geometry columns are stored in the file.
  • Column metadata per Geometry Type column:

    • Encoding of the geometry data. Currently only WKB format is supported.
    • Geometry types: all geometry types that occur in this column, such as POINT, POLYGON, LINESTRING
    • Coordinate reference system information, bounding box, geometry orientation, and so on

In GeoParquet, all geometric data is required to be housed in Parquet’s Binary Type. This data must adhere to the encoding specifications detailed in the column metadata, which currently employs Well-Known Binary (WKB) encoding.

The storage of geometries in GeoParquet using Parquet’s intrinsic binary format ensures that standard Parquet readers can seamlessly access GeoParquet files. This feature significantly enhances GeoParquet’s compatibility with existing data systems. Nevertheless, a dedicated GeoParquet reader, designed to interpret GeoParquet metadata, can utilize this information for further optimizations and to execute more advanced geospatial operations.
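
As a small illustration of that compatibility, the same file can be opened with a plain Parquet reader or a GeoParquet-aware one (the file name below is hypothetical):

import pyarrow.parquet as pq
import geopandas as gpd

# A plain Parquet reader sees the geometry column as raw WKB bytes
table = pq.read_table("places.parquet")
print(table.schema.field("geometry").type)   # binary

# A GeoParquet-aware reader decodes the same column into geometry objects
gdf = gpd.read_parquet("places.parquet")
print(gdf.geometry.head())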

Benefits of GeoParquet

GeoParquet integrates all the advantages of Parquet, such as its compact data storage and rapid data retrieval capabilities. Additionally, GeoParquet introduces a standardized approach for storing and accessing geospatial data. This standardization significantly simplifies the process of data sharing and enhances interoperability across various systems and tools within the geospatial data processing ecosystem.

Crucially, GeoParquet enables specialized GeoParquet readers to optimize geospatial queries, significantly enhancing the efficiency of data retrieval. This optimization is particularly vital for applications demanding real-time data processing and analysis.

A significant contributor to this enhanced efficiency is the bounding box information (abbreviated as BBox) embedded in the GeoParquet file metadata. The BBox effectively serves as a spatial index, streamlining the process of data filtering. This is particularly useful when executing geospatially driven queries. When a spatial query is executed — for instance, searching for all points of interest within a certain area — the GeoParquet reader first checks the BBox information. If the query’s geographic area doesn’t intersect with the BBox of the data, the reader can immediately exclude that file from further consideration. This step vastly reduces the amount of data that needs to be retrieved and processed.
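
To make the pruning step concrete, here is a rough sketch of what a reader could do with the file-level metadata, assuming the standard GeoParquet "geo" metadata key and a query bounding box in the same coordinate reference system:

import json
import pyarrow.parquet as pq

def file_may_match(path, query_bbox):
    """Return False if the file's GeoParquet bounding box cannot intersect the query box."""
    meta = pq.read_schema(path).metadata          # Parquet key/value metadata (bytes -> bytes)
    geo = json.loads(meta[b"geo"])                # GeoParquet metadata lives under the "geo" key
    primary = geo["primary_column"]
    minx, miny, maxx, maxy = geo["columns"][primary]["bbox"]
    qminx, qminy, qmaxx, qmaxy = query_bbox
    # Keep the file only if the two boxes overlap
    return not (maxx < qminx or qmaxx < minx or maxy < qminy or qmaxy < miny)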

As a testament to this, with the help of Apache Sedona, Wherobots engineers created a GeoParquet version of the Overture 2023-07-26-alpha.0 data release, which was not originally published in GeoParquet format. This transformation led to a remarkable improvement in query performance. When conducting a spatial range query on the Building dataset using Apache Sedona, the process on the GeoParquet version took approximately 3 minutes, a significant reduction from the 1 hour and 30 minutes required for the same query on the original dataset. Further details and insights into this transformation are available in our blog post titled Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet.

The Role Of Apache Sedona

The GIS software landscape offers a variety of tools capable of handling GeoParquet data, such as GeoPandas and QGIS. However, Apache Sedona stands apart in this field. As a renowned open-source cluster computing system, it is specifically tailored for processing large-scale spatial data. Apache Sedona offers a rich suite of spatial operations and analytics functions, all accessible through an intuitive Spatial SQL programming language. This unique blend of features establishes Apache Sedona as a highly effective system for geospatial data analysis.

With the adoption of GeoParquet, Apache Sedona has expanded its capabilities, now able to create GeoParquet files. This integration not only enhances Apache Sedona’s data storage efficiency but also opens up broader prospects for advanced data analysis. This positions Apache Sedona as a distinct and powerful tool in the realm of geospatial data processing, differentiating it from other GIS tools by its ability to handle complex, large-scale spatial datasets efficiently. Further details can be found on Sedona’s website: https://sedona.apache.org/

Generating GeoParquet Files With Apache Sedona

Leveraging Sedona’s programming guides, OMF crafts GeoParquet datasets by incorporating a geometry column sourced from its WKB geospatial data. The process involves utilizing Sedona functions, as illustrated in the Python code snippet below:

myDataFrame.withColumn("geometry", expr("ST_*")).selectExpr("ST_*")
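
Here ST_* is a stand-in for whichever constructor matches the source encoding; for WKB-encoded geometries, a concrete (hedged) version might look like this:

from pyspark.sql.functions import expr

# Hypothetical input DataFrame whose geometry column holds WKB bytes
geo_df = myDataFrame.withColumn("geometry", expr("ST_GeomFromWKB(geometry)")) \
                    .selectExpr("id", "geometry")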

In its commitment to improving the spatial query experience for its customers, OMF has implemented a strategic pre-defined indexing method. This method organizes data effectively based on theme and type combinations. To further refine spatial filter pushdown performance, OMF capitalizes on Sedona’s ST_GeoHash function to generate dual GeoHash IDs for each geometry in the dataset.

  1. Coarse-Level GeoHashing (Precision Level 3): The first GeoHash, with a precision level of 3, creates cells approximately 156.5km x 156km in size. This level is ideal for grouping data based on geographic proximity. Using this GeoHash key, a large Sedona DataFrame is partitioned into numerous smaller GeoParquet files. Each file correlates to a unique 3-character GeoHash key, facilitating efficient data pruning at the Parquet file level. Importantly, this approach enables users to manually download individual GeoParquet files by referencing the GeoHash key in the file path, bypassing the need for a Parquet file reader.
  2. Fine-Level GeoHashing (Precision Level 8): Concurrently, OMF utilizes a finer GeoHash at level 8, resulting in smaller cells around 38.2m x 19m, and sorts the data within each GeoParquet file by this GeoHash. By doing so, OMF ensures that the data within each Parquet row-group or data page is organized based on spatial proximity. This organization is crucial for enabling efficient data pruning, particularly on the bbox column.

The bbox column, stored in a native Parquet Struct Type, consists of four Double Type sub-columns: minx, maxx, miny, and maxy. It allows even standard Parquet readers to conduct pruning operations effectively. This capability is particularly beneficial as it complements the BBox information found in the GeoParquet file metadata. While the BBox metadata provides an overall spatial reference for the file, the bbox column offers a more granular level of spatial data, enabling granular data pruning at the Parquet row-group level.
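
A rough sketch of how such a layout could be produced with Sedona is shown below; the DataFrame name, bucket path, and write options are assumptions for illustration, not OMF's actual pipeline:

from pyspark.sql.functions import col, expr

laid_out = (omf_df
    .withColumn("geohash_3", expr("ST_GeoHash(geometry, 3)"))   # coarse key: one file group per 3-character cell
    .withColumn("geohash_8", expr("ST_GeoHash(geometry, 8)"))   # fine key: spatial ordering within each file
    .withColumn("bbox", expr(
        "named_struct('minx', ST_XMin(geometry), 'maxx', ST_XMax(geometry), "
        "'miny', ST_YMin(geometry), 'maxy', ST_YMax(geometry))"))
    .repartition(col("geohash_3"))
    .sortWithinPartitions("geohash_8"))

(laid_out.write
    .partitionBy("geohash_3")
    .format("geoparquet")
    .save("s3a://example-bucket/omf/theme=places/type=place/"))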

This dual GeoHash approach serves three core purposes:

  • It allows users to download individual GeoParquet files from OMF without requiring a GeoParquet or Parquet reader, simply by checking the GeoHash key.
  • GeoParquet readers can efficiently filter files using the BBox information in each file’s metadata, enhancing query speed.
  • The strategy also enhances the functionality of standard Parquet readers which can effectively filter files using the statistical information of the bbox column, found in each file’s column and row-group metadata, without needing to recognize the file as a GeoParquet file.

Analyze OMF GeoParquet Files with Sedona on Wherobots Cloud

Let’s see some of the benefits of querying a large-scale GeoParquet dataset in action by analyzing the Overture Places theme using Apache Sedona. The easiest way to get started with Apache Sedona is by using the official Apache Sedona Docker image, or by creating a free hosted notebook in Wherobots Cloud.

The following command will start a local Docker container running Apache Sedona and expose a Jupyter Lab environment on localhost port 8888:

docker run -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 \
apache/sedona:latest

This Jupyter Lab environment can now be accessed via a web browser at http://localhost:8888

Next, the following Python code will load the latest Overture Maps GeoParquet release (at the time of writing) in Sedona:

from sedona.spark import *

config = SedonaContext.builder(). \
    config("fs.s3a.aws.credentials.provider",
           "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet"). \
load("s3a://overturemaps-us-west-2/release/2024-01-17-alpha.0/theme=places/type=place")

We can now apply a spatial predicate to filter for points of interest within the bounds of New York City, using the following Spatial SQL functions:

  • ST_PolygonFromEnvelope – create a polygon geometry that will serve as the area roughly representing the boundaries of New York City from a specified bounding box
  • ST_Within – filter rows for only points of interest that fall within the boundary of that polygon geometry

spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"
df = df.where(spatial_filter)
df.createOrReplaceTempView("places")

When executing this spatial filter, Sedona can leverage the BBox GeoParquet metadata in each partitioned GeoParquet file to exclude any files that do not contain points of interest within the polygon defined in the spatial predicate. This results in less data being examined and scanned by Sedona and less data that needs to be retrieved over the network, improving query performance. Also, note that the geometry column is interpreted as a geometry type without the need for explicit type casting, thanks to the WKB serialization that is part of the GeoParquet specification.

To further demonstrate the functionality of Spatial SQL with Apache Sedona, let’s query for all stadiums within the area of New York City, then find all points of interest within a short walking distance of each stadium using the following Spatial SQL functions:

  • ST_Buffer – create a new geometry that represents a buffer around each stadium that represents a short walking distance from the stadium
  • ST_Intersects – filter for points of interest that lie within the buffer geometry, identifying points of interest within walking distance of each stadium

stadium_places = sedona.sql("""
WITH stadiums AS (SELECT * FROM places WHERE categories.main = "stadium_arena")
SELECT * FROM places, stadiums
WHERE ST_Intersects(places.geometry, ST_Buffer(stadiums.geometry, 0.002))
""")
Querying Overture Maps GeoParquet data using Apache Sedona
Visualizing points of interest within walking distance from each stadium.

You can see more detailed examples of analyzing the Overture Places dataset using Apache Sedona in this blog post.

Conclusion

The Overture Maps Foundation’s adoption of GeoParquet, in collaboration with Apache Sedona, marks a significant milestone in geospatial data management. The combination of GeoParquet’s efficient storage format and Apache Sedona’s spatial analytics capabilities brings unprecedented performance and scalability to geospatial dataset publishing and analysis. This integration opens up new avenues for researchers, developers, and geospatial enthusiasts, empowering them to explore and derive insights from vast geospatial datasets more efficiently than ever before.



Raster Data Analysis, Processing Petabytes of Agronomic Data, Overview of Sedona 1.5, and Unlocking The Spatial Frontier – This Month In Wherobots

Welcome to This Month In Wherobots, the monthly newsletter for data practitioners in the Apache Sedona and Wherobots community. This month we’re exploring raster data analysis with Spatial SQL, processing petabytes of agronomic data with Apache Sedona, a deep dive on new features added in the 1.5 release series, and an overview of working with files in Wherobots Cloud.

Raster Data Analysis With Spatial SQL & Apache Sedona

One of the strengths of Apache Sedona and Wherobots Cloud is the ability to work with large scale vector and raster geospatial data together using Spatial SQL. This post (and video) takes a look at how to get started working with raster data in Sedona using Spatial SQL and some of the use cases for raster data analysis including vector / raster join operations, zonal statistics, and using raster map algebra.

Read The Article: Raster Data Analysis With Spatial SQL & Apache Sedona

Featured Community Member: Luiz Santana

This month’s Wherobots & Apache Sedona featured community member is Luiz Santana. Luiz is Co-Founder and CTO of Leaf Agriculture. He has extensive experience as a former data architect and developer. Luiz has a PhD in Computer Science from Universidade Federal de Santa Catarina and spent time researching data processing and integration in highly scalable environments. Leaf Agriculture is building the unified API for food and agriculture by leveraging large-scale agronomic data. Luiz has given several presentations at conferences such as Apache Sedona: How To Process Petabytes of agrnomic data with Spark, and Perspectives on the use of data in Agriculture, which covers how Leaf uses Apache Sedona to analyze large-scale agricultural data and how Sedona fits into their stack alongside other technologies. Thank you Luiz for your work with the Apache Sedona and Wherobots community and sharing your knowledge and experience!

Apache Sedona: How To Process Petabytes of Agronomic Data With Spark

In this presentation from The Developer’s Conference Luiz Santana shares the experience of using Apache Sedona at Leaf Agriculture to process petabytes of agronomic data from satellites, agricultural machines, drones and other sensors. He discusses how Leaf uses Sedona for tasks such as geometry intersections, geographic searches, and polygon transformations with high performance and speed. Luiz also presented Perspectives On The Use of Data in Agriculture which covers some of the data challenges that Leaf handles and an overview of the technologies used to address these challenges, including Apache Sedona.

See The Slides From The Presentation

Working With Files – Getting Started With Wherobots Cloud


This post takes a look at loading and working with our own data in Wherobots Cloud as well as creating and saving data as the result of our analysis, such as the end result of a data pipeline. It covers importing files in various formats including CSV, GeoJSON, Shapefile, and GeoTIFF in Wherobots Cloud, working with AWS S3 cloud object storage, and creating GeoParquet files using Apache Sedona.

Read The Post: Working With Files – Getting Started With Wherobots Cloud

Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both raster and vector data


The 1.5 series of Apache Sedona represents a leap forward in geospatial processing that adds essential features and enhancements to make Sedona a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data analysis. This post covers XYZM coordinates and SRID, vector and raster joins, raster data manipulation, visualization with SedonaKepler and SedonaPyDeck, GeoParquet reading and writing, H3 hexagons, and new cluster compute engine support.

Read The Post: Introducing Sedona 1.5 – Making Sedona The Most Comprehensive & Scalable Spatial Data Processing and ETL Engine For Both Raster and Vector Data

Unlocking the Spatial Frontier: The Evolution and Potential of spatial technology in Apple Vision Pro and Augmented Reality Apps


Apple adopted the term “spatial computing” when announcing the Apple Vision Pro to describe its new augmented reality platform. This post from Wherobots CEO Mo Sarwat examines spatial computing in the context of augmented reality experiences to explore spatial object localization and presentation and the role of spatial query processing and spatial data analytics in Apple Vision Pro.

Read The Post: Unlocking The Spatial Frontier


SedonaSnow: Apache Sedona On Snowflake, Accelerating Your GIS Pipeline, Exploring Global Fishing Watch Data With GeoParquet, and Apache Sedona 1.5.1 Release

Welcome to This Month In Wherobots the monthly developer newsletter for the Wherobots & Apache Sedona community! In this edition we cover SedonaSnow: Apache Sedona on Snowflake, accelerating your GIS pipeline with Apache Sedona, exploring Global Fishing Watch public data with SedonaDB and GeoParquet, and a look at new features and updates in the 1.5.1 release of Apache Sedona.

Apache Sedona Now Available In Snowflake Marketplace: SedonaSnow

SedonaSnow

Apache Sedona, the scalable open-source geospatial compute engine, is now available on Snowflake via the Snowflake Marketplace or via manual installation. The SedonaSnow plugin brings Apache Sedona’s Spatial SQL functionality to Snowflake via 130+ Sedona "ST" SQL functions that can be used alongside Snowflake SQL.

Read More About Using SedonaSnow In Snowflake In This Tutorial

Featured Community Members: Alihan Zihna & Fernando Ayuso Palacios

This month’s Wherobots & Apache Sedona featured community members are Alihan Zihna, Lead Data Scientist at CKDelta and Fernando Palacios, Director of Data Science & Data Engineering also at CKDelta. Alihan and Fernando presented "GIS Pipeline Acceleration With Apache Sedona" at the Data + AI Summit where they share how they were able to improve the performance and innovation of their geospatial analysis pipelines, going from a pipeline that took 48 hours to complete down to 10 minutes using Apache Sedona. Thank you Fernando and Alihan for being a part of the community and sharing your work!

GIS Pipeline Acceleration With Apache Sedona

GIS pipeline accelerations with Apache Sedona

In this presentation from Data + AI Summit, Fernando and Alihan discuss some of the various use cases for working with large-scale geospatial data at conglomerate CKDelta, part of the CK Hutchison Group, which operates ports, utility networks, retail stores and mobile telecom networks with hundreds of millions of users across dozens of countries. They discuss how geospatial analytics at scale is important for identifying water leakage in their utility network, understanding customer satisfaction, identifying sites for electric vehicle charging station installation, and forecasting the supply and demand of energy. They provide a technical overview of Apache Sedona and share the results of improving and extending their geospatial analytics pipelines, including one process that reduced running time from 48 hours to 10 minutes using Apache Sedona.

Watch the recording of "GIS Pipeline Acceleration With Apache Sedona"

The Wherobots Notebook Environment – Getting Started With Wherobots Cloud & SedonaDB Part 2

Wherobots Initial Notebook

In Part 2 of our Getting Started With Wherobots Cloud & SedonaDB series we dive into the Wherobots Notebook Environment including how to configure and start notebook runtimes, an overview of the sample notebooks included in Wherobots Cloud, and how to use version control like git with notebooks. If you missed it check out Part 1: An Overview of Wherobots Cloud or sign up for a free Wherobots Cloud account to get started directly.

Read More About The Wherobots Notebooks Environment

Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet

Matched vs unmatched vessels

This post is a hands-on look at offshore ocean infrastructure and industrial vessel activity with SedonaDB using data from Global Fishing Watch. We also see how GeoParquet can be used with this data to improve the efficiency of data retrieval and enable large-scale geospatial visualization using GeoArrow and the Lonboard Python visualization library.

Read "Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet"

Apache Sedona 1.5.1 Release

Apache Sedona 1.5.1 Release Notes

The most recent release of Apache Sedona introduces some exciting new updates including support for Spark 3.5, 20+ new raster functions, 7 new vector functions, support for running Sedona in Snowflake with SedonaSnow, updates to Sedona’s GeoParquet reader and writer, and more! The updated raster functions include RS_ZonalStats for computing zonal statistics, RS_Tile and RS_TileExplode to enable tiling large rasters, and updates to RS_MapAlgebra to enable user defined raster functions that can work across multiple rasters. Updated vector functions include ST_IsValidReason which exposes the reason geometries might not be valid, and ST_LineLocatePoint which can be useful for map matching and snapping data to road networks.
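
As a quick, hedged taste of two of the new vector functions (assuming a SedonaContext named sedona is available; the geometries are made up for illustration):

sedona.sql("""
SELECT
  ST_IsValidReason(ST_GeomFromWKT('POLYGON((0 0, 2 2, 0 2, 2 0, 0 0))')) AS why_invalid,
  ST_LineLocatePoint(ST_GeomFromWKT('LINESTRING(0 0, 10 0)'),
                     ST_GeomFromWKT('POINT(2.5 1)')) AS fraction_along_line
""").show(truncate=False)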

Read More About Apache Sedona 1.5.1 In The Release Notes

Hands-On With Havasu and GeoParquet

GeoParquet and Iceberg Havasu

Each month you can find a new livestream tutorial on the Wherobots YouTube channel. January’s livestream was all about working with GeoParquet and Havasu tables in SedonaDB. We dig into some of the optimizations built into the Apache Parquet format to learn how Parquet delivers efficient data storage and retrieval, before exploring the GeoParquet specification for storing geospatial data in Parquet. We cover loading, analyzing, and creating GeoParquet files using SedonaDB with a focus on comparing the performance of various GeoParquet partitioning strategies. Finally, we see how the Havasu extension to the Apache Iceberg table format enables working with both vector and raster geospatial data backed by GeoParquet, but with the familiar developer experience of SQL tables.

Watch The Recording: Hands-On With Havasu And GeoParquet

Upcoming Events

  • Apache Sedona Community Office Hour (Online Zoom Call – February 13, 2024) – Join the Apache Sedona community for updates on the state of Apache Sedona, presentation and demo of recent features, and provide your input into the roadmap, future plans, and contribution opportunities.
  • Raster Data Analysis With Spatial SQL & SedonaDB (Online Livestream – February 29th, 2024) – This month our livestream is focused on raster data analysis. We’ll see how to load raster data in SedonaDB and perform raster operations like map algebra and zonal statistics using Spatial SQL. Be sure to subscribe to the Wherobots YouTube channel to keep up to date with more Wherobots livestreams and videos!



Introducing The Havasu Spatial Table Format, Wherobots 2023 In Review, Analyzing Overture Maps and Real Estate Data: This Month In Wherobots

In the latest edition of This Month In Wherobots the latest highlights from the Wherobots & Apache Sedona community include an overview of the Havasu spatial table format, a look back at Wherobots’ 2023 in review, analyzing Overture Maps and real estate data, finding the perfect Christmas tree, plus big data analytics for sustainable smart cities.

Havasu: A Table Format for Spatial Attributes In A Data Lake Architecture

The Havasu spatial table format is an extension of Apache Iceberg that brings native spatial support to data lakes. This post introduces Havasu and some of its key features and technical insights, such as support for primitive spatial data types and storage, spatial statistics, and spatial filter pushdown and indexing. Examples of creating and querying Havasu tables, as well as how Havasu works with GeoParquet, are also covered. If you’ve used the Wherobots open data catalog you may have already leveraged Havasu!

Read More About The Havasu Spatial Table Format

Featured Community Member: Muhammed Oğuzhan Mete

January Featured Community Member

Our featured community member this month is Dr. Muhammed Oğuzhan Mete. Muhammed is an Assistant Professor at Istanbul Technical University in the Geomatics Engineering Department. His research areas of focus are land management, real estate management, GIS, machine learning, and big data analytics. He is also an AWS Community Builder. Muhammed presented "Geospatial Big Data Analytics For Sustainable Smart Cities" at the FOSS4G 2023 conference where he shared how Apache Sedona can be used for large scale spatial analytics.

Connect with Muhammed On LinkedIn

Geospatial Big Data Analytics For Sustainable Smart Cities

Sustainable smart cities methodology

In this presentation from FOSS4G 2023, Dr. Muhammed Oğuzhan Mete covers the need for large-scale geospatial data analysis for sustainable smart city project development. He covers some of the use cases for large scale data analysis relevant for smart cities such as how to make smart infrastructure decisions to meet goals for sustainability. He discusses how to work with geospatial big data in cloud computing environments for the purposes of analyzing energy performance of buildings at scale and how spatial joins and spatial clustering algorithms can be implemented using Apache Sedona and Dask GeoPandas to identify clusters of lower energy efficiency scores to inform policy decision making. Finally, he shows a benchmark comparing the performance of Apache Sedona and Dask GeoPandas for spatial joins and spatial clustering. You’ll have to watch the video to see the final results!

Watch the recording of "Geospatial Big Data Analytics For Sustainable Smart Cities"

Wherobots: 2023 Year In Review

2023 has been an exciting year for Wherobots and this post points out a handful of key moments for Wherobots and the Apache Sedona community. Highlights of the year included 130% growth of Apache Sedona usage, key hires for growing the Wherobots team, launching Wherobots Cloud and SedonaDB, and raising a $5.5M seed funding round.

Read More About Wherobots: 2023 Year In Review

Analyzing The Overture Maps Places Dataset

Visualizing Overture Maps points of interest

In this article from Pranav Toggi we learn how to query and analyze the Overture Maps Places dataset using SedonaDB and Wherobots Cloud. After a review of the data schema we see how to filter for points of interest within New York City and explore businesses within walking distance of stadiums and event venues. Along the way we learn several ways to visualize the data and results of our analysis.

Read "Analyzing The Overture Maps Places Dataset Using SedonaDB, Wherobots Cloud, & GeoParquet"

Finding The Perfect Christmas Tree With USFS Map Data, QGIS, & SedonaDB

Christmas Tree Hunting With SedonaDB, QGIS, and USFS Motor Vehicle User Map data

This tutorial walks through how to use US Forest Service road data and aerial imagery to find the perfect Christmas tree. It covers loading and querying USFS road data, annotating aerial imagery in QGIS, then combining them and querying using SQL to find routes in National Forests to the perfect tree stands.

Read "Finding The Perfect Christmas Tree With USFS Map Data, QGIS, & SedonaDB"

Analyzing Real Estate Data With SedonaDB

Change In Real Estate Values In The US 2018-2023

In a livestream tutorial on the Wherobots YouTube channel we worked through how to use data from Zillow to analyze real estate values in the US. We calculated the change in real estate values at the county level over the last 5 years and created choropleth maps to help visualize the results. We also wrote up a written tutorial so you can follow along in Wherobots Cloud.

Watch Analyzing Real Estate Data With SedonaDB

Upcoming Events

  • GeoBuiz Summit (Monterey, California – January 9-11, 2024) – Wherobots CEO Mo Sarwat will be presenting at the GeoBuiz Summit in a plenary panel addressing the "Geospatial Value Chain for Retail and Commerce". If you’re at the conference stop by and say hi!
  • GeoParquet Community Day (San Francisco – January 30th, 2024) – Join us for GeoParquet Community Day to highlight the usage of spatial data in Parquet, open table formats, and cloud-native spatial analytics.
  • Data Day Texas (Austin, TX – January 27, 2024) – Will from Wherobots will be presenting a talk on geospatial analytics with Apache Sedona and Spatial SQL
  • Hands On With Havasu and GeoParquet (Online Livestream – January 25th, 2024) – In this livestream we’ll take a look at the Havasu spatial table format and how Havasu can be used with GeoParquet. Be sure to subscribe to the Wherobots YouTube channel to keep up to date with more Wherobots livestreams and videos!



The Biggest Apache Sedona Release Ever, Wherobots Cloud Launch, Building Maps With Felt & Self Service Geospatial Analytics – This Month In Wherobots

Welcome to This Month In Wherobots where we highlight the latest goings on from the Wherobots & Apache Sedona community. In this edition we’re taking a look at the largest release ever in the history of Apache Sedona, the latest SedonaDB and Wherobots Cloud launch, self-service geospatial analytics, building maps with Felt, and a look at the Wherobots Online Community.

Featured Community Members: Nara Khou & Cort Lunke

Nara Khou and Cort Lunke

Each month we highlight special members of the community who contribute their expertise and passion to the Wherobots and Apache Sedona community. This month’s featured community members are Nara Khou, Lead Data Engineer at Land O’Lakes, and Cort Lunke, Data & Analytics Lead at Land O’Lakes. Earlier this year Nara and Cort presented "Self-Service Geospatial Analytics Built On Databricks, Apache Sedona and R" at Databricks’ Data+AI Summit. Thanks so much Nara and Cort for sharing your success story using Apache Sedona with the community!

Self-Service Geospatial Analytics Built On Databricks, Apache Sedona And R

data pipeline process

In this presentation Nara and Cort discuss some of the challenges of working with spatial data and how Apache Sedona can be used to address some of these difficulties in an enterprise data environment like Databricks. Cort and Nara share why they chose Apache Sedona for working with spatial data at scale to analyze watershed and cropland data. They discuss the data processing pipeline used for the project and demo the end to end data pipeline from data collection, processing and analysis using Apache Sedona, and visualization using R-Studio – all within Databricks.

Watch the recording of Nara & Cort’s Data+AI Summit Presentation

SedonaDB & Wherobots Cloud Launch

The Wherobots team was excited to reveal SedonaDB, the cloud-native spatial analytics database platform, at the FOSS4G North America conference. SedonaDB builds upon the scalability and stability of the Apache Sedona project, bringing large-scale geospatial analytics capabilities to enterprises looking for a cloud-native solution. SedonaDB also introduces the Havasu open table format which enables efficient querying and updating of geometry and raster columns in Parquet files stored in cloud object stores such as AWS S3.

Get started with SedonaDB on Wherobots Cloud Free tier today.

Apache Sedona 1.5 Release

The most recent release of Apache Sedona v1.5.0 was the biggest release in the history of Apache Sedona. This release includes native support for Uber H3 hexagon functions, comprehensive raster ETL and analytics support, more ST functions to enable new geospatial workloads, XYZM support, and visualization with SedonaKepler and SedonaPyDeck. This version is also available in the official Apache Sedona Docker Image. You can find more about this release in the Apache Sedona GitHub repository and read more about the v1.5.0 Apache Sedona release here.

Learn more about the latest Apache Sedona release

Building Maps With Felt

An important requirement for data infrastructure tools like SedonaDB and Wherobots Cloud is that they integrate well with the technology ecosystems around them. In the world of spatial databases this includes geospatial visualization tooling like the web-based mapping tool Felt. This blog post shows how to integrate the Felt API with Wherobots Cloud so we can leverage SedonaDB’s geospatial analysis capabilities using Spatial SQL, then publish the results of our analysis to Felt’s beautiful web-based mapping tooling. This example uses data from BirdBuddy, which publishes data about bird sightings at its smart birdfeeders, to find the range of some of our favorite bird species.

Read the blog post "Creating Collaborative Web Maps With The Felt API And Wherobots Cloud".

The Wherobots Online Community Launch

Wherobots online community

The Wherobots Online Community is the forum for community members to come together, ask questions, and share their expertise and excitement about spatial analytics. This site was launched earlier this month and we’re excited to have a home for the community. Please feel free to join the community, introduce yourself, and share what you’re working on and why you’re excited about spatial analytics! We’ve also launched the Wherobots YouTube Channel as a way to share educational content about spatial analytics – please check it out and subscribe.

Join The Wherobots Online Community

Upcoming Events

  • GeoParquet Community Day (San Francisco – January 30th, 2024) – Join us for GeoParquet Community Day to highlight the usage of spatial data in Parquet, open table formats, and cloud-native spatial analytics.
  • Analyzing Real Estate Data Using SedonaDB & Wherobots Cloud (Online Livestream – December 12th, 2023) – In this livestream we’ll take a look at analyzing US real estate data at the county level, using data from Zillow and Natural Earth. We’ll introduce the Wherobots Cloud platform and SedonaDB using Python and Spatial SQL. Be sure to subscribe to the Wherobots YouTube channel to keep up to date with more Wherobots livestreams and videos!



Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet

Introduction

The Overture Maps Foundation (OMF) has recently released its first open-source Parquet dataset (https://overturemaps.org/download/), divided into four themes – places of interest, buildings, transportation networks, and administrative boundaries.

Apache Sedona is an open-source and distributed geospatial analytics engine. It enables large-scale spatial data processing and analysis through native handling of geospatial data types, Spatial SQL for querying, spatial indexing techniques, and support for standard formats like Shapefile, GeoParquet, GeoJSON, WKT, WKB, GeoTiff, and ArcGrid.

In this article, we demonstrate the capabilities of Apache Sedona for working with massive Overture Maps data. The article also shows that GeoParquet OMF data produced by Sedona can accelerate spatial queries by 60X. This exploration underscores Sedona’s proficiency in handling scalable geospatial tasks using prevalent industry formats like Parquet and GeoParquet.

This article is structured as follows: First, we will explore and analyze the data in its original Parquet format. Then, we will convert it to the GeoParquet format with built-in spatial indexes. Finally, we will use Sedona in a Jupyter notebook to explore and analyze the dataset in GeoParquet form, leveraging capabilities like spatial SQL and spatial Python to derive insights.

Study 1: Analyze the OMF Parquet data using Sedona

Overture Maps uses the Parquet format to store its datasets. You can find the schema of these datasets on the OMF website (https://docs.overturemaps.org/reference). In this example, we will be using the Buildings theme dataset which, at around 120GB, is the largest dataset in OMF. The schema of this dataset is as follows:

id: string (nullable = true)
updatetime: string (nullable = true)
version: integer (nullable = true)
names: map (nullable = true)
 |-- key: string
 |-- value: array (valueContainsNull = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
level: integer (nullable = true)
height: double (nullable = true)
numfloors: integer (nullable = true)
class: string (nullable = true)
sources: array (nullable = true)
 |-- element: map (containsNull = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
bbox: struct (nullable = true)
 |-- minx: double (nullable = true)
 |-- maxx: double (nullable = true)
 |-- miny: double (nullable = true)
 |-- maxy: double (nullable = true)
geometry: binary (nullable = true)

To start using Sedona on OMF data, we first have to create a SedonaContext:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Apache Sedona facilitates easy loading and analysis of Parquet datasets through its APIs.

df = sedona.read.format("parquet").load("s3a://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Since Parquet does not natively support geospatial data types, the geometry columns in this dataset are stored in the WKB (Well-Known Binary) format. Sedona provides functionality to decode the WKB-encoded geometries into proper spatial types like points and polygons.

# Register the DataFrame as a temp view so it can be referenced in the SQL query
df.createOrReplaceTempView("df")
sedona.sql('SELECT ST_GeomFromWKB(geometry) as geometry FROM df').show(5)

+--------------------+
|            geometry|
+--------------------+
|POLYGON ((7.85925...|
|POLYGON ((2.69399...|
|POLYGON ((-95.775...|
|POLYGON ((103.141...|
|POLYGON ((111.900...|
+--------------------+

After processing, the dataset can be used for spatial range queries. We can perform an ST_Contains query between the dataset and Washington state’s boundary polygon to find the buildings within Washington state.

df = df.filter("ST_Contains(ST_GeomFromWKT('POLYGON((-123.3208 49.0023,-123.0338 49.0027,-122.0650 49.0018,-121.7491 48.9973,-121.5912 48.9991,-119.6082 49.0009,-118.0378 49.0005,-117.0319 48.9996,-117.0415 47.9614,-117.0394 46.5060,-117.0394 46.4274,-117.0621 46.3498,-117.0277 46.3384,-116.9879 46.2848,-116.9577 46.2388,-116.9659 46.2022,-116.9254 46.1722,-116.9357 46.1432,-116.9584 46.1009,-116.9762 46.0785,-116.9433 46.0537,-116.9165 45.9960,-118.0330 46.0008,-118.9867 45.9998,-119.1302 45.9320,-119.1708 45.9278,-119.2559 45.9402,-119.3047 45.9354,-119.3644 45.9220,-119.4386 45.9172,-119.4894 45.9067,-119.5724 45.9249,-119.6013 45.9196,-119.6700 45.8565,-119.8052 45.8479,-119.9096 45.8278,-119.9652 45.8245,-120.0710 45.7852,-120.1705 45.7623,-120.2110 45.7258,-120.3628 45.7057,-120.4829 45.6951,-120.5942 45.7469,-120.6340 45.7460,-120.6924 45.7143,-120.8558 45.6721,-120.9142 45.6409,-120.9471 45.6572,-120.9787 45.6419,-121.0645 45.6529,-121.1469 45.6078,-121.1847 45.6083,-121.2177 45.6721,-121.3392 45.7057,-121.4010 45.6932,-121.5328 45.7263,-121.6145 45.7091,-121.7361 45.6947,-121.8095 45.7067,-121.9338 45.6452,-122.0451 45.6088,-122.1089 45.5833,-122.1426 45.5838,-122.2009 45.5660,-122.2641 45.5439,-122.3321 45.5482,-122.3795 45.5756,-122.4392 45.5636,-122.5676 45.6006,-122.6891 45.6236,-122.7647 45.6582,-122.7750 45.6817,-122.7619 45.7613,-122.7962 45.8106,-122.7839 45.8642,-122.8114 45.9120,-122.8148 45.9612,-122.8587 46.0160,-122.8848 46.0604,-122.9034 46.0832,-122.9597 46.1028,-123.0579 46.1556,-123.1210 46.1865,-123.1664 46.1893,-123.2810 46.1446,-123.3703 46.1470,-123.4314 46.1822,-123.4287 46.2293,-123.4946 46.2691,-123.5557 46.2582,-123.6209 46.2573,-123.6875 46.2497,-123.7404 46.2691,-123.8729 46.2350,-123.9292 46.2383,-123.9711 46.2677,-124.0212 46.2924,-124.0329 46.2653,-124.2444 46.2596,-124.2691 46.4312,-124.3529 46.8386,-124.4380 47.1832,-124.5616 47.4689,-124.7566 47.8012,-124.8679 48.0423,-124.8679 48.2457,-124.8486 48.3727,-124.7539 48.4984,-124.4174 48.4096,-124.2389 48.3599,-124.0116 48.2964,-123.9141 48.2795,-123.5413 48.2247,-123.3998 48.2539,-123.2501 48.2841,-123.1169 48.4233,-123.1609 48.4533,-123.2220 48.5548,-123.2336 48.5902,-123.2721 48.6901,-123.0084 48.7675,-123.0084 48.8313,-123.3215 49.0023,-123.3208 49.0023))'), geometry) = true")
df.selectExpr("names", "height", "numfloors", "geometry").show(5)

+--------------------+------+---------+--------------------+
|               names|height|numfloors|            geometry|
+--------------------+------+---------+--------------------+
|{common -> [{valu...|   5.1|        1|POLYGON ((-122.32...|
|{common -> [{valu...|  10.9|        1|POLYGON ((-122.18...|
|{common -> [{valu...|   7.9|        1|POLYGON ((-122.31...|
|{common -> [{valu...|   9.2|        1|POLYGON ((-122.22...|
|{common -> [{valu...|   6.4|        1|POLYGON ((-122.21...|
+--------------------+------+---------+--------------------+

We can also leverage Apache Spark's filter pushdown on non-spatial columns to reduce the data volume before the geospatial analysis. Since the building dataset is large, we apply several highly selective filters:

from pyspark.sql.functions import col, size

df = df.filter(~(size(col("names")) <= 0)) \
       .filter(col("height") <= 200) \
       .filter(~(size(col("sources")) <= 0)) \
       .filter(col("numfloors") == 1)

The count of the dataset after intersecting with the Washington State boundary and applying the filters above:

df.count()

Count: 511
Time: 1h 31min 42s

Discussions

The spatial query and non-spatial filter pushdown on the dataset took an hour and a half to complete on a single-node Sedona Docker environment. As a result, users can barely do any interactive analytics on it. The reason is two-fold:

  1. The code was executed in a single-node Sedona Docker environment with 4 GB of executor RAM, not a real cluster. Performance would improve significantly in a distributed Sedona environment running across multiple nodes.
  2. This lengthy processing time is primarily due to the limitations of using the Parquet format without geospatial optimizations. Parquet lacks native spatial data types and spatial indexing schemes suited for efficient geospatial analysis. The required data loading and preparation are time-consuming. Using a format optimized for geospatial workloads, like GeoParquet, could significantly reduce the pre-processing time for this analysis.

Study 2: Converting from Parquet to GeoParquet

Sedona supports converting spatial data between Parquet and GeoParquet; for more details, please refer to the [Sedona GeoParquet blogpost]. After running into the limitations of Parquet, we decided to leverage this functionality and see how much improvement the GeoParquet format brings.

To achieve the best spatial filter pushdown performance, we need to partition the data based on spatial proximity; in other words, nearby spatial objects should be placed in the same GeoParquet partition. For this purpose, we first create a GeoHash ID for each geometry in the OMF data using Sedona's ST_GeoHash function, which generates geographic hashes for a given geometry at a specified precision. The precision controls the size of the grid cells; at precision 2, each cell is roughly 1,250 km × 625 km. This level was chosen as a balance, since too high a precision produces many small files that slow down query processing and reduce read throughput.

from pyspark.sql.functions import expr

df = df.withColumn("geohash", expr("ST_GeoHash(geometry, 2)")).repartition("geohash")

By repartitioning with GeoHashing, data points with the same GeoHash ID get assigned to contiguous partition slices based on precision. This clusters nearby points together in the same partitions.
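As a quick sanity check (a minimal sketch; the point below is arbitrary and the exact hash value depends on the location you pick), you can compute a precision-2 GeoHash for a single point and see how the building rows distribute across precision-2 cells:

# A coarse GeoHash for one point (ST_Point takes longitude, latitude)
sedona.sql("SELECT ST_GeoHash(ST_Point(-122.33, 47.61), 2) AS cell").show()

# Row counts per precision-2 cell; nearby geometries share the same cell
df.withColumn("geohash", expr("ST_GeoHash(geometry, 2)")) \
  .groupBy("geohash").count() \
  .orderBy("count", ascending=False).show(10)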

Finally, we store the GeoParquet version of the Overture Maps data in Wherobots' public S3 bucket. The same operation is applied to all OMF datasets.

df.write.partitionBy("geohash").format("geoparquet").save("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Discussions

This conversion on all OMF datasets took around 15 minutes on a 20-node AWS EMR cluster, where each node is an m3.2xlarge instance with 8 CPUs and 30 GB of RAM. In closing, Sedona provides a streamlined way to convert datasets to GeoParquet and to partition them by GeoHash for optimal performance. The entire GeoParquet dataset converted here is available in Wherobots' public S3 bucket for you to experiment with.

Study 3: Interactive geospatial analytics on OMF GeoParquet data

We'll ingest the public Wherobots dataset from S3 into a Spark DataFrame using Sedona, as before, and intersect it with the boundary of Bellevue, Washington.

Fast geospatial queries via Sedona GeoParquet spatial filter pushdown

With GeoParquet, we observe much better query performance than with the plain Parquet format on the following simple spatial range query (i.e., spatial filter).

spatial_filter = "POLYGON ((-122.235128 47.650163, -122.233796 47.65162, -122.231581 47.653287, -122.228514 47.65482, -122.227526 47.655204, -122.226175 47.655729, -122.222039 47.656743999999996, -122.218428 47.657464, -122.217026 47.657506, -122.21437399999999 47.657588, -122.212091 47.657464, -122.212135 47.657320999999996, -122.21092999999999 47.653552, -122.209834 47.650121, -122.209559 47.648976, -122.209642 47.648886, -122.21042 47.648658999999995, -122.210897 47.64864, -122.211005 47.648373, -122.21103099999999 47.648320999999996, -122.211992 47.64644, -122.212457 47.646426, -122.212469 47.646392, -122.212469 47.646088999999996, -122.212471 47.645213, -122.213115 47.645212, -122.213123 47.644576, -122.21352999999999 47.644576, -122.213768 47.644560999999996, -122.21382 47.644560999999996, -122.21382 47.644456999999996, -122.21373299999999 47.644455, -122.213748 47.643102999999996, -122.213751 47.642790999999995, -122.213753 47.642716, -122.213702 47.642697999999996, -122.213679 47.642689999999995, -122.21364 47.642678, -122.213198 47.642541, -122.213065 47.642500000000005, -122.212918 47.642466, -122.21275 47.642441, -122.212656 47.642433, -122.21253899999999 47.642429, -122.212394 47.64243, -122.212182 47.642444999999995, -122.211957 47.642488, -122.211724 47.642551999999995, -122.21143599999999 47.642647, -122.210906 47.642834, -122.210216 47.643099, -122.209858 47.643215, -122.20973000000001 47.643248, -122.20973599999999 47.643105, -122.209267 47.643217, -122.208832 47.643302, -122.208391 47.643347999999996, -122.207797 47.643414, -122.207476 47.643418, -122.20701199999999 47.643397, -122.206795 47.643387999999995, -122.205742 47.643246, -122.20549 47.643201999999995, -122.20500200000001 47.643119, -122.204802 47.643085, -122.204641 47.643066, -122.204145 47.643012, -122.203547 47.643012, -122.203097 47.643107, -122.20275699999999 47.643283, -122.202507 47.643496999999996, -122.202399 47.643653, -122.202111 47.643771, -122.201668 47.643767, -122.201363 47.643665, -122.20133 47.643648999999996, -122.201096 47.643536, -122.200744 47.64328, -122.200568 47.64309, -122.200391 47.642849, -122.200162 47.642539, -122.199896 47.642500000000005, -122.19980799999999 47.642424, -122.199755 47.642376999999996, -122.199558 47.642227999999996, -122.199439 47.642157, -122.199293 47.642078999999995, -122.199131 47.642004, -122.198928 47.641925, -122.19883 47.641892, -122.19856300000001 47.641811999999994, -122.198203 47.641731, -122.197662 47.641619999999996, -122.196819 47.641436, -122.196294 47.641309, -122.196294 47.642314, -122.19628 47.642855, -122.196282 47.642897999999995, -122.196281 47.643111, -122.196283 47.643415, -122.196283 47.643508999999995, -122.19628399999999 47.643739, -122.196287 47.644203999999995, -122.196287 47.644262999999995, -122.19629 47.644937999999996, -122.19629 47.644954999999996, -122.196292 47.645271, -122.196291 47.645426, -122.19629499999999 47.646315, -122.19629499999999 47.646432, -122.195925 47.646432, -122.195251 47.646432, -122.190853 47.646429999999995, -122.187649 47.646428, -122.187164 47.646426, -122.18683 47.646426, -122.185547 47.646409, -122.185546 47.646316, -122.185537 47.645599, -122.185544 47.644197, -122.185537 47.643294999999995, -122.185544 47.642733, -122.185541 47.641757, -122.185555 47.640681, -122.185561 47.63972, -122.185557 47.638228999999995, -122.185591 47.635419, -122.185611 47.634750999999994, -122.18562299999999 47.634484, -122.18561700000001 47.634375999999996, -122.185592 47.634311, -122.185549 47.634232999999995, -122.185504 
47.634181999999996, -122.185426 47.634119, -122.184371 47.633424999999995, -122.18400000000001 47.633198, -122.183896 47.633134, -122.1838 47.633067, -122.18375499999999 47.633019999999995, -122.183724 47.632959, -122.183695 47.632858, -122.183702 47.632675, -122.182757 47.632622999999995, -122.182365 47.63259, -122.18220600000001 47.632562, -122.181984 47.632504999999995, -122.18163799999999 47.632363, -122.18142 47.632262999999995, -122.181229 47.632165, -122.181612 47.632172999999995, -122.18271899999999 47.632151, -122.183138 47.632135, -122.18440000000001 47.632081, -122.184743 47.632065999999995, -122.185312 47.63205, -122.185624 47.632047, -122.185625 47.631873999999996, -122.184618 47.63187, -122.184291 47.631878, -122.184278 47.631817999999996, -122.183882 47.629942, -122.182689 47.623548, -122.182594 47.622789999999995, -122.182654 47.622155, -122.183135 47.622372999999996, -122.183471 47.622506, -122.18360200000001 47.622552, -122.183893 47.622637999999995, -122.184244 47.62272, -122.184618 47.622777, -122.184741 47.622727999999995, -122.184605 47.622679, -122.18424 47.622622, -122.183985 47.622569, -122.183717 47.622501, -122.183506 47.622439, -122.18327 47.622357, -122.18305699999999 47.622271999999995, -122.182669 47.622088999999995, -122.182796 47.621545, -122.18347 47.619628999999996, -122.18365 47.619098, -122.183859 47.6184, -122.183922 47.617793999999996, -122.183956 47.617292, -122.183792 47.616388, -122.183261 47.614391999999995, -122.183202 47.613802, -122.183209 47.613155, -122.183436 47.612384999999996, -122.18395100000001 47.610445999999996, -122.184338 47.60924, -122.184657 47.609116, -122.18481 47.609051, -122.18491900000001 47.608987, -122.184974 47.608942, -122.185047 47.608846, -122.185082 47.608743999999994, -122.185109 47.608526999999995, -122.185116 47.608359, -122.18513 47.608315999999995, -122.185157 47.608273999999994, -122.185183 47.608247, -122.185246 47.608214, -122.185354 47.608196, -122.185475 47.608191999999995, -122.185472 47.606697, -122.185472 47.606373999999995, -122.185521 47.606272, -122.185528 47.606210999999995, -122.185506 47.606037, -122.185451 47.605872999999995, -122.185411 47.605781, -122.185358 47.605681999999995, -122.185248 47.605509999999995, -122.185127 47.605365, -122.185058 47.605292, -122.184772 47.605038, -122.184428 47.604834, -122.184122 47.604693999999995, -122.183775 47.604574, -122.183644 47.604546, -122.183708 47.604400999999996, -122.183749 47.604223999999995, -122.18376 47.604037, -122.183707 47.603778, -122.183619 47.603556999999995, -122.183559 47.603406, -122.183488 47.603303, -122.183824 47.603167, -122.184108 47.603052, -122.184478 47.602902, -122.18543 47.602495, -122.186669 47.601957, -122.186433 47.601220999999995, -122.186341 47.601127999999996, -122.18874199999999 47.593742999999996, -122.188434 47.592338999999996, -122.188479 47.591786, -122.188217 47.591269999999994, -122.18795399999999 47.590871, -122.186822 47.589228, -122.187421 47.589228999999996, -122.18848299999999 47.589228999999996, -122.188433 47.587922999999996, -122.18990000000001 47.588547, -122.191368 47.589169999999996, -122.19158 47.589222, -122.191779 47.589254999999994, -122.192117 47.589289, -122.191569 47.587478999999995, -122.191323 47.586628999999995, -122.191295 47.586554, -122.191268 47.586479, -122.191192 47.586318, -122.191163 47.586268999999994, -122.1911 47.586164, -122.19099 47.586011, -122.19067 47.585668999999996, -122.1905 47.585515, -122.190301 47.58531, -122.190143 47.585152, -122.189573 47.584576999999996, -122.188702 
47.583735999999995, -122.188646 47.583679, -122.188239 47.583258, -122.188037 47.583005, -122.187832 47.582657, -122.187726 47.582164999999996, -122.18769499999999 47.581964, -122.18768299999999 47.581781, -122.187678 47.581592, -122.18766099999999 47.581455, -122.187674 47.581311, -122.18768 47.581146, -122.187722 47.580877, -122.187817 47.580569999999994, -122.187932 47.580301999999996, -122.188047 47.580087, -122.188161 47.579933999999994, -122.188399 47.579660999999994, -122.18851699999999 47.579547, -122.188621 47.579454, -122.188042 47.579493, -122.18762 47.579527, -122.187806 47.579358, -122.188009 47.579175, -122.18814499999999 47.579051, -122.188177 47.579021, -122.18842000000001 47.5788, -122.188638 47.578461, -122.188895 47.57806, -122.189791 47.577281, -122.190008 47.577103, -122.190372 47.576805, -122.19119 47.576358, -122.191877 47.576087, -122.193025 47.57566, -122.194317 47.575185999999995, -122.196061 47.574664, -122.197239 47.574386999999994, -122.197873 47.574267, -122.198286 47.574189999999994, -122.199091 47.574044, -122.199067 47.574574999999996, -122.199007 47.575921, -122.200335 47.578222, -122.20057299999999 47.578345999999996, -122.2009 47.578517999999995, -122.201095 47.578621999999996, -122.20138399999999 47.578776999999995, -122.201465 47.57882, -122.201516 47.578846999999996, -122.205753 47.581112, -122.209515 47.583124, -122.210634 47.583721, -122.21473399999999 47.587021, -122.21538699999999 47.588254, -122.21580399999999 47.589042, -122.216534 47.590421, -122.220092 47.596261, -122.220434 47.596821, -122.22041899999999 47.597837999999996, -122.220289 47.606455, -122.220234 47.610121, -122.22048 47.615221999999996, -122.220359 47.615379, -122.220283 47.615477999999996, -122.21999 47.615854999999996, -122.219993 47.61597, -122.22023300000001 47.616634, -122.220356 47.616687999999996, -122.220409 47.616712, -122.221401 47.618538, -122.22142 47.618573, -122.221456 47.618635, -122.221791 47.619222, -122.222492 47.619682999999995, -122.222799 47.619886, -122.222083 47.620368, -122.222046 47.620407, -122.222028 47.620449, -122.222025 47.620483, -122.22203999999999 47.620523999999996, -122.222079 47.620557999999996, -122.222156 47.620594999999994, -122.222458 47.620629, -122.222454 47.620673, -122.222454 47.620711, -122.22244599999999 47.621041999999996, -122.223056 47.621041, -122.223129 47.62104, -122.223153 47.62104, -122.223574 47.621041, -122.22377900000001 47.621041, -122.223857 47.621041, -122.22467499999999 47.621041, -122.224712 47.62104, -122.224958 47.62104, -122.225167 47.621049, -122.226882 47.621037, -122.227565 47.621032, -122.228002 47.621029, -122.22797800000001 47.621300999999995, -122.227919 47.626574999999995, -122.227914 47.627085, -122.227901 47.6283, -122.227881 47.630069, -122.227869 47.631177, -122.227879 47.631952999999996, -122.22789 47.633879, -122.227886 47.63409, -122.227871 47.635534, -122.227918 47.635565, -122.228953 47.635624, -122.22895199999999 47.635571999999996, -122.231018 47.635574999999996, -122.233276 47.635588999999996, -122.233287 47.63617, -122.233273 47.63639, -122.233272 47.636469999999996, -122.23327 47.636578, -122.233266 47.636827, -122.233263 47.636851, -122.233262 47.637014, -122.23322999999999 47.638110999999995, -122.233239 47.638219, -122.233262 47.638279, -122.233313 47.638324999999995, -122.233255 47.638359, -122.233218 47.638380999999995, -122.233153 47.638450999999996, -122.233136 47.638552999999995, -122.233137 47.638692, -122.232715 47.639348999999996, -122.232659 47.640093, -122.232704 47.641375, 
-122.233821 47.645111, -122.234906 47.648874, -122.234924 47.648938, -122.235128 47.650163))"

df = sedona.read.format("geoparquet").load("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")
df = df.filter("ST_Contains(ST_GeomFromWKT('"+spatial_filter+"'), geometry) = true")
df.count()

Count: 3423743
Time: 2min 47s

Switching from regular Parquet to GeoParquet reduced the query time from over an hour and a half to around 3 minutes, a more than 30X speedup. This query was run on the same single-node Sedona Docker environment, without the non-spatial filter pushdowns. Further gains could be achieved by using a distributed Sedona cluster. Overall, GeoParquet enables more efficient geospatial queries through both spatial partitioning and enhanced filter pushdown.
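If you are curious how Spark plans this query, you can print the physical plan with PySpark's explain(); the scan node shows how the GeoParquet files are being read and filtered (exactly what is reported as pushed down varies by Sedona version):

# Print the formatted physical plan for the Bellevue range query
df.explain(mode="formatted")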

Interactive geospatial visualization via Sedona Kepler

We also want to highlight SedonaKepler, our interactive visualization API built on Kepler.gl. SedonaKepler provides a powerful, customizable visualization experience through a simple API; creating a visualization is as straightforward as calling a single function. The map created by the code below shows the buildings in Bellevue, Washington.

from sedona.spark import SedonaKepler

SedonaKepler.create_map(df, "Building")

Time: 2min 38s


Photo: Visualization of building dataset spatially filtered to only include data for the city of Bellevue, Washington.

Of course, you can also choose another region and see what the dataset looks like there.
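For example, here is a rough, illustrative sketch that re-filters the buildings to an approximate bounding box around Mount Rainier National Park (the envelope coordinates are approximate and only for demonstration) and maps them:

# Approximate envelope around Mount Rainier National Park: ST_PolygonFromEnvelope(minX, minY, maxX, maxY)
rainier = sedona.read.format("geoparquet") \
    .load("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building") \
    .filter("ST_Contains(ST_PolygonFromEnvelope(-122.0, 46.7, -121.5, 47.1), geometry) = true")

SedonaKepler.create_map(rainier, "Building")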


Photo: Visualization of the connector and segment datasets spatially filtered to Mount Rainier National Park, Washington, in satellite view.

SedonaKepler can also visualize multiple datasets together, revealing insights across interconnected geospatial data. To demonstrate, we use the segment and connector datasets, which describe the OMF transportation network.

The dots below represent connection points between road segments. The road segments themselves are shown as yellow lines, representing paths that can be traversed by people, cars, or other vehicles. By layering this networked transport data over a map, we can gain insights into how the transportation infrastructure links together across the geographic area.
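The df_connector and df_segment DataFrames are not created earlier in this post; a minimal sketch of how they might be prepared, assuming the public transportation GeoParquet paths listed at the end of this article and the same Bellevue spatial filter used above, looks like this:

base = "s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation"
df_connector = sedona.read.format("geoparquet").load(base + "/type=connector")
df_segment = sedona.read.format("geoparquet").load(base + "/type=segment")

# Keep only the features inside the Bellevue boundary polygon defined earlier
df_connector = df_connector.filter("ST_Contains(ST_GeomFromWKT('" + spatial_filter + "'), geometry) = true")
df_segment = df_segment.filter("ST_Contains(ST_GeomFromWKT('" + spatial_filter + "'), geometry) = true")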

map_connector = SedonaKepler.create_map(df_connector, "Connector")
SedonaKepler.add_df(map_connector, df_segment, name="Segment")
map_connector

Time: 3min 11s


Photo: Visualization of connector and segment datasets spatially filtered to only include data for the city of Bellevue, Washington.

The Take-Away

Apache Sedona enables both reading and writing of GeoParquet files, a Parquet variant tailored for spatial data interchange. When executing spatial range queries on GeoParquet files, Apache Sedona supports spatial filter pushdown, which in our test cut the query time from over an hour and a half to under three minutes. SedonaKepler is a powerful tool for creating interactive, easy-to-navigate visualizations, and it can layer multiple datasets in a single map.
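As a compact recap of the pattern used across the three studies, here is a sketch assuming the same release paths used above; the output bucket name is a placeholder, and spatial_filter is the Bellevue polygon defined earlier:

from sedona.spark import SedonaContext
from pyspark.sql.functions import expr

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# 1. Read raw OMF Parquet and decode the WKB geometry column
raw = sedona.read.format("parquet").load(
    "s3a://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")
raw = raw.withColumn("geometry", expr("ST_GeomFromWKB(geometry)"))

# 2. Partition by a coarse GeoHash and write GeoParquet (destination bucket is hypothetical)
raw = raw.withColumn("geohash", expr("ST_GeoHash(geometry, 2)")).repartition("geohash")
raw.write.partitionBy("geohash").format("geoparquet").save("s3://your-bucket/omf-buildings-geoparquet")

# 3. Read the GeoParquet back and run a spatial range query that benefits from filter pushdown
buildings = sedona.read.format("geoparquet").load("s3://your-bucket/omf-buildings-geoparquet")
buildings.filter("ST_Contains(ST_GeomFromWKT('" + spatial_filter + "'), geometry) = true").count()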

Try it yourself

Notebooks

All notebooks used in this article are available in the Wherobots GitHub repository: https://github.com/wherobots/OvertureMaps

Interactive notebook using GeoParquet and Sedona

Use Wherobots to deploy Sedona to cloud vendors

The other notebooks used in Study 1 and 2 can be run on an AWS EMR or Databricks cluster. If you want to try them out, please sign up here: https://www.wherobots.ai/demo

Wherobots is a spatial data analytics and AI platform trusted in production, at scale, from the original creators of Apache Sedona.

Free and public Overture Maps GeoParquet data from Wherobots

The GeoParquet data produced in Study 2 is provided by Wherobots for free. It has the same schema and license as the original Overture Maps Parquet data, except that the geometry column is stored as a geometry type and an additional geohash column (string type) is included. You can access the datasets at the paths below; a short read example follows the list:

  • Buildings: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building
  • Places: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=place
  • AdminBoundary: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=administrativeBoundary
  • AdminLocality: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=locality
  • Transportation Connector: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=connector
  • Transportation Segment: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=segment
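For example, to load the Places dataset into a Sedona DataFrame (any of the paths above can be substituted):

places = sedona.read.format("geoparquet").load(
    "s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=place")
places.printSchema()
places.show(5)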