Generating global PMTiles from Overture Maps in 26 minutes with WherobotsDB VTiles

We previously described the performance and scalability challenges of generating tiles and how they can be overcome with WherobotsDB VTiles. Today we will demonstrate how you can use VTiles to generate vector tiles for three planetary scale Overture Maps Foundation datasets in PMTiles format: places, buildings, and division areas.

Quick recap: What are Vector Tiles and why should you use PMTiles?

Vector tiles are small chunks of map data that allow for efficient and customizable map rendering at varying zoom levels. Vector tiles contain geometric and attribute data, for example roads and their names, that facilitate dynamic styling of map features on the fly, offering more flexibility and interactivity.

PMTiles is a cloud-native file format that is designed for holding an entire collection of tiles, in this case vector tiles. The PMTiles format enables individual tiles to be queried directly from cloud object storage like Amazon S3. By querying directly from cloud storage, you no longer need to set up and manage dedicated infrastructure, reducing your costs, effort, and time-to-tile-generation.
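
To illustrate that cloud-native access pattern (a minimal sketch assuming the Python requests library and a hypothetical, publicly readable PMTiles URL), a client can fetch just a small byte range of an archive instead of downloading the whole file:

import requests

# Hypothetical archive URL; replace with your own.
url = "https://example.com/tiles.pmtiles"

# Fetch only the fixed-size header (the first 127 bytes in the PMTiles v3 spec)
# rather than the entire archive.
resp = requests.get(url, headers={"Range": "bytes=0-126"})
print(len(resp.content))   # 127
print(resp.content[:7])    # b'PMTiles' magic bytes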

Tile Viewer

If you’re sharing, inspecting, or debugging tiles, you’ll need to visualize them. To make these processes easier, Wherobots created a tile viewer site, available at tile-viewer.wherobots.com. This tool comes from the PMTiles GitHub repository and offers the following features:

  • Viewing tiles with a dark themed basemap
  • Inspecting individual tiles, selected from a list of all the tiles in the set
  • Inspecting the metadata and header information of the PMTiles file

This viewer takes a URL for a tileset. If your tiles are stored in a private S3 bucket, you will need to generate a signed URL. Wherobots Cloud has a function for converting your S3 URI to a signed URL:

from wherobots.tools.utility.s3_utils import get_signed_url

get_signed_url(my_s3_path, expiration_in_seconds)

Here, my_s3_path is an S3 URI, like s3://myBucket/my/prefix/to/tiles.pmtiles, and expiration_in_seconds is an int representing the number of seconds the signed URL will remain valid.
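
For example (a minimal sketch using a hypothetical path), you can sign a PMTiles file for one hour and print the URL to paste into the tile viewer:

from wherobots.tools.utility.s3_utils import get_signed_url

# Hypothetical S3 path; replace with the location of your own PMTiles file.
signed_url = get_signed_url("s3://myBucket/my/prefix/to/tiles.pmtiles", 3600)
print(signed_url)  # paste this URL into tile-viewer.wherobots.com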

The tile viewer will be used to explore the tiles we generate in our examples.

Examples

The following examples show tile generation using VTiles for three Overture layers at a planetary scale. Because we are working with planetary scale datasets and want quick results, we will use the large runtimes available in the professional tier of Wherobots Cloud.

Tile generation time is provided in each example, and includes time to load the input data, transform it, generate tiles, and save the PMTiles file in an S3 bucket. It does not include the time to start the cluster.

To run the examples below, just make sure your Sedona session is started:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Places

We start by creating PMTiles for the places dataset. With VTiles, this is a straightforward case for several reasons:

  1. The dataset contains only points. A point feature rarely spans multiple tiles because it has no dimensions. Tile generation time is strongly influenced by the total number of (feature, tile) pairs, that is, the sum over all features of the number of tiles each feature intersects.
  2. At 50 million records, this is a relatively small dataset compared to the buildings dataset at 2.3 billion features.
  3. We will do minimal customization. VTiles’ feature filters allow us to control which features go into which tiles based on the tile id (x, y, z) and the feature itself (area, length, and user-provided columns). We will cover feature filters in more depth in the division areas example.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.places_place").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("categories.main").alias("category"),
        f.lit('places').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/places.pmtiles",
    GenerationConfig(6, 15)
)

This example generates a PMTiles file for zooms 6 through 15. Since the places dataset contains features that are not relevant at a global level, we selected a minimum zoom of 6, at which a single tile covers roughly the area of a large European country. The max zoom of 15 is selected because the precision provided should be sufficient and overzooming means that our places will still render at higher zooms. The OpenStreetMap wiki has a helpful page about how large a tile is at each zoom level. The name and category of each place will be included in the tiles.
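
As a rough guide to those zoom levels (a quick sketch of the arithmetic behind that OpenStreetMap wiki page, assuming the standard Web Mercator tiling scheme):

def tile_width_km(zoom):
    # Approximate width of a single Web Mercator tile at the equator, in km.
    earth_circumference_km = 40075.0
    return earth_circumference_km / (2 ** zoom)

print(round(tile_width_km(6)))       # ~626 km, on the order of a large European country
print(round(tile_width_km(15), 1))   # ~1.2 km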

Tile generation time was 2 minutes and 23 seconds on a Tokyo runtime. The resulting PMTiles archive is 28.1 GB.

Buildings

This example generates tiles for all buildings in the Overture building dataset, about 2.3 billion features. The roughly uniform size of the features and the relatively small size of buildings compared to the higher-zoom tiles mean that the number of (feature, tile) combinations is similar to |features| * |zooms|. Because of this homogeneity, we can expect a quick execution without the use of a feature filter. This example represents a typical use case where there is a very large number of features and where the extent of a tile at maximum zoom is larger than the size of a feature.
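
To put that in perspective, here is a back-of-envelope count of (feature, tile) pairs, assuming each building lands in roughly one tile per zoom level across the zoom range used below (a rough sketch, not a measured figure):

features = 2_300_000_000      # approximate size of the Overture buildings dataset
zoom_levels = 15 - 10 + 1     # zooms 10 through 15, as configured below
print(f"{features * zoom_levels:,}")  # ~13.8 billion (feature, tile) pairs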

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.buildings_building").select(
        "geometry",
        f.lit('buildings').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/buildings.pmtiles",
    GenerationConfig(10, 15)
)

This example generates a PMTiles file for zooms 10 through 15. The minimum zoom of 10 was selected because buildings aren’t useful at lower zooms for most use cases. The max zoom of 15 was selected because the precision provided should be sufficient and overzooming means that our buildings will still render at higher zooms. The properties of a very large percentage of the Overture buildings are null, so we haven’t included any here.

Tile generation time was 26 minutes on a Tokyo runtime. The resulting PMTiles archive is 438.4 GB.

Division Areas

The third example creates tiles for all polygons and multipolygons in the Overture division areas dataset. This dataset contains just under one million records. Despite its small size, it can be challenging to process. It contains polygons and multipolygons representing areas ranging from countries, which are large and highly detailed, to small neighborhoods with minimal detail. The appropriate min/max zoom for countries and neighborhoods is very different.

Recall from the places example that the amount of work the system must do is strongly related to the number of (feature, tile) pairs. A country outline like Canada might cover an entire tile at zoom 5. It will be in roughly 2 * 4^(max_zoom - 5) tiles across all zooms; if max_zoom is 15, that’s over 2 million tiles. You can quickly wind up with an unexpectedly large execution time and tiles archive if you do not take this into account. Most use cases will benefit from setting different min and max zooms for different features, which you can do in VTiles via a feature filter.
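
As a quick check on that figure (a sketch of the arithmetic only), the 2 * 4^(max_zoom - 5) heuristic gives:

base_zoom, max_zoom = 5, 15
estimated_tiles = 2 * 4 ** (max_zoom - base_zoom)
print(f"{estimated_tiles:,}")  # 2,097,152 -- over 2 million (feature, tile) pairs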

Let’s first profile the base case with no feature filter.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area.pmtiles",
    GenerationConfig(3, 15)
)

This run took a bit over 3 hours on a Tokyo runtime. The resulting PMTiles archive is 158.0 GB. This small dataset takes more time to process than the buildings dataset, which is more than 2,300 times larger!

Feature Filters

We can significantly accelerate the execution time of this example using the VTiles feature filters. These feature filters are most commonly used to determine what features should be in a tile on the basis of a category and the zoom level. In this case we will only show countries at lower zooms and neighborhoods at the highest zoom levels. The visual impact of a feature that is much larger than the tile is minimal in typical use cases. The visual impact of a neighborhood is null when it’s smaller than the tile can resolve; it is literally invisible, or perhaps a single pixel. By excluding these features that add no visual information, we save processing time and storage costs, as well as increase the performance of serving the now-smaller tiles.

Here is an example of using feature filters to improve performance of this division area generation task:

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area_filtered.pmtiles",
    GenerationConfig(
        min_zoom=2, 
        max_zoom=15,
        feature_filter = (
            ((f.col("subType") == f.lit("country")) & (f.col("tile.z") < f.lit(7))) |
            ((f.col("subType") == f.lit("region")) & (f.lit(3) < f.col("tile.z")) & (f.col("tile.z") < f.lit(10))) |
            ((f.col("subType") == f.lit("county")) & (f.lit(9) < f.col("tile.z")) & (f.col("tile.z")  < f.lit(12))) |
            ((f.col("subType") == f.lit("locality")) & (f.lit(10) < f.col("tile.z")) & (f.col("tile.z")  < f.lit(14))) |
            ((f.col("subType") == f.lit("localadmin")) & (f.lit(13) < f.col("tile.z"))) |
            ((f.col("subType") == f.lit("neighborhood")) & (f.lit(13) < f.col("tile.z")))
        )
    )
)

This run took less than 10 minutes on a Tokyo runtime. The resulting PMTiles archive is 8.9 GB.

Feature filters reduced tile generation time by more than 90%, reduced the dataset size, and lowered the cost compared to the original example. The tiles will also appear less cluttered to the user, without anyone having to get their hands dirty playing with style sheets.

A Note on Working without Feature Filters

We know that there are use cases with large geometries where it might be difficult to write an effective feature filter, or where filtering may be undesirable. For those use cases we have launched a feature in Wherobots 1.3.1 to improve tile generation performance: an option on GenerationConfig called repartition_frequency. When features are repeatedly split as the algorithm zooms in, the child features wind up in the same partition. This can cause well-partitioned input datasets to become skewed by even a single large record. Setting a repartition frequency of 2 or 4 can help keep utilization of the cluster high by keeping partitions roughly uniform in size.
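
For example (a hypothetical sketch; the exact keyword usage of repartition_frequency shown here is an assumption based on the description above):

from wherobots.vtiles import GenerationConfig

# Repartition the intermediate feature/tile data every 2 zoom levels to keep
# partition sizes roughly uniform when large features are split repeatedly.
config = GenerationConfig(
    min_zoom=3,
    max_zoom=15,
    repartition_frequency=2,
)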

Conclusion

The VTiles tile generation functionality is a fast and cost effective way to generate tiles for global data. The Apache Spark-based runtime powered by Apache Sedona and Wherobots Cloud makes loading and transforming data for input into the system straightforward and performant even on large datasets. You can leverage feature filters to curate the contents of your tiles to your use cases and performance goals. We encourage you to try out VTiles with your own data on Wherobots Cloud.

Creating Collaborative Web Maps With The Felt API And Wherobots Cloud

An important requirement for data infrastructure tools like WherobotsDB and Wherobots Cloud is that they integrate well with the technology ecosystems around them. In the world of spatial databases this includes geospatial visualization tooling. Being able to create maps with data from WherobotsDB is an important use case for Wherobots Cloud, so in this blog post I wanted to explore how to create collaborative web maps with Felt, providing data using the Felt API.

For this map I wanted to integrate the Felt API with Wherobots Cloud so I could do some geospatial analysis using Spatial SQL and WherobotsDB then publish the results of my analysis to Felt’s beautiful web-based mapping tooling.

I decided to use data from BirdBuddy, which publishes data about bird sightings at its smart bird feeders, to find the range of some of my favorite bird species.

North American bird ranges

You can follow along by creating a free account on Wherobots Cloud.

Collaborative Web Maps With Felt

Felt is a web-based tool for creating collaborative maps. Felt brings collaborative real-time editing functionality, similar to what you’ve seen in Google Docs or Notion, to the world of mapmaking. You can annotate, comment, and draw on the map, then share the results with anyone on the web via a single link to start collaborating.

Felt has also invested in supporting a wide range of data formats for adding data to maps with their Upload Anything tool, a simple drag-and-drop interface that supports formats including Shapefile, GeoJSON, GeoTIFF, CSV, JPEG, and more. Felt also built a QGIS plugin, so if you’re used to working with desktop GIS tooling you can easily export your project and layers to Felt’s web-based tooling via QGIS.

Felt also enables programmatically creating maps and layers, as well as adding data to a map, via the Felt API. We’ll use the Felt API to upload the results of our Sedona analysis and to create and publish a map.

Wherobots Cloud File Management

Our data comes from Bird Buddy, which makes a smart bird feeder that can identify bird species and (optionally) report their location.

BirdBuddy screenshot

Bird Buddy publishes its data as CSV files, so we’ll download the latest data and then upload the file to our Wherobots Cloud instance via the "Files" tab. The free tier of Wherobots Cloud includes free data storage in AWS S3, which we can access within the Wherobots notebook environment using the S3 URL of the file.

Wherobots Cloud file management

Once you’ve uploaded a file you can click the copy file icon to copy the file’s S3 path to access the file in the Wherobots notebook environment. Note that these files are private to your Wherobots organization, so the S3 URL below won’t be accessible to anyone outside my organization.

S3_URL = "s3://<YOUR_S3_URL_HERE>/birdbuddy/"

Now we’ll load the BirdBuddy CSV data and convert it to a Spatial DataFrame so we can use Spatial SQL to find the range of each species.

bb_df = sedona.read.format('csv').option('header','true').option('delimiter', ',').load(S3_URL)
bb_df.show(5)

Looking at the first few rows of the DataFrame we can see we have latitude and longitude stored as separate fields, as well as information about the bird species.

+-------------------+--------------------+----------+-----------------+----------------+
|anonymized_latitude|anonymized_longitude| timestamp|      common_name| scientific_name|
+-------------------+--------------------+----------+-----------------+----------------+
|          45.441235|          -122.51253|2023-09...|  dark eyed junco|      junco h...|
|           41.75291|            -83.6242|2023-09...|northern cardinal|cardinalis ca...|
|            43.8762|            -78.9261|2023-09...|northern cardinal|cardinalis ca...|
|            33.7657|            -84.2951|2023-09...|northern cardinal|cardinalis ca...|
|            30.4805|            -84.2243|2023-09...|northern cardinal|cardinalis ca...|
+-------------------+--------------------+----------+-----------------+----------------+
only showing top 5 rows

Spatial SQL With WherobotsDB

Now we’re ready to use the power of Spatial SQL to analyze our Bird Buddy data. We want to find the range of each species, but first let’s explore the data.

First we’ll convert our latitude and longitude fields into Point geometries using the ST_Point SQL function.

bb_df = bb_df.selectExpr('ST_Point(CAST(anonymized_longitude AS Decimal(24,20)), CAST(anonymized_latitude AS Decimal(24,20))) AS location', 'timestamp', 'common_name', 'scientific_name')
bb_df.createOrReplaceTempView('bb')
bb_df.show(5)

Now the location field is a proper geometry type that our Spatial DataFrame can take advantage of.

+--------------------+--------------------+-----------------+--------------------+
|            location|           timestamp|      common_name|     scientific_name|
+--------------------+--------------------+-----------------+--------------------+
|POINT (-122.51253...|2023-09-01 00:00:...|  dark eyed junco|      junco hyemalis|
|POINT (-83.6242 4...|2023-09-01 00:00:...|northern cardinal|cardinalis cardin...|
|POINT (-78.9261 4...|2023-09-01 00:00:...|northern cardinal|cardinalis cardin...|
|POINT (-84.2951 3...|2023-09-01 00:00:...|northern cardinal|cardinalis cardin...|
|POINT (-84.2243 3...|2023-09-01 00:00:...|northern cardinal|cardinalis cardin...|
+--------------------+--------------------+-----------------+--------------------+
only showing top 5 rows

We have just under 14 million bird observations in our DataFrame.

bb_df.count()
------------
13972003

If we want to find all observations of Juncos in the data, we can write a SQL query to filter the results and visualize the observations on a map using SedonaKepler, the integration for Kepler.gl.

junco_df = sedona.sql("SELECT * FROM bb WHERE common_name LIKE '%junco' ")
junco_df.show(5)

We used the SQL LIKE string comparison operator to find all observations relating to Juncos, then stored the results in a new DataFrame junco_df.

+--------------------+--------------------+---------------+---------------+
|            location|           timestamp|    common_name|scientific_name|
+--------------------+--------------------+---------------+---------------+
|POINT (-122.51253...|2023-09-01 00:00:...|dark eyed junco| junco hyemalis|
|POINT (-94.5916 3...|2023-09-01 00:00:...|dark eyed junco| junco hyemalis|
|POINT (-85.643 31...|2023-09-01 00:00:...|dark eyed junco| junco hyemalis|
|POINT (-87.7645 3...|2023-09-01 00:00:...|dark eyed junco| junco hyemalis|
|POINT (-122.16346...|2023-09-01 00:00:...|dark eyed junco| junco hyemalis|
+--------------------+--------------------+---------------+---------------+
only showing top 5 rows

Now we’ll visualize the contents of our new junco_df DataFrame using SedonaKepler.

SedonaKepler.create_map(df=junco_df, name='Juncos')

Juncos observation map

Based on the map above it looks like Juncos have quite a large range throughout North America.

Next, we’ll filter the overall dataset to a few of my favorite bird species, then use the power of Spatial SQL with a GROUP BY operation to create convex hulls (polygon geometries) from the individual observations (point geometries) of each species.

By creating a convex hull around all point observations grouped by species we will create a new geometry that represents the observed range of each species in our dataset.

range_df = sedona.sql("""
    SELECT common_name, COUNT(*) AS num, ST_ConvexHull(ST_Union_aggr(location)) AS geometry 
    FROM bb 
    WHERE common_name IN ('california towhee', 'steller’s jay', 'mountain chickadee', 'eastern bluebird') 
    GROUP BY common_name 
    ORDER BY num DESC
""")
range_df.show()

Note our use of the following Spatial SQL functions:

  • ST_ConvexHull – given a geometry (here the union of our point geometries), return a polygon geometry of the convex hull containing all of the points
  • ST_Union_aggr – an aggregate function that combines multiple geometries into a single geometry, in this case used alongside a GROUP BY

+------------------+-----+--------------------+
|       common_name|  num|            geometry|
+------------------+-----+--------------------+
|  eastern bluebird|65971|POLYGON ((-80.345...|
|     steller’s jay|37864|POLYGON ((-110.26...|
| california towhee|22007|POLYGON ((-117.05...|
|mountain chickadee| 4102|POLYGON ((-110.99...|
+------------------+-----+--------------------+

Now we have a new DataFrame range_df with four rows, one for each of the species we specified in the query above, and the geometry field is now a polygon that represents the observed range of that species in our dataset. Pretty neat – let’s visualize these species ranges using Felt.

The Felt API supports file uploads in a variety of formats, but we’ll use GeoJSON. We’ll convert our Spatial DataFrame into a GeoPandas GeoDataFrame and then export to a GeoJSON file so we can upload it to the Felt API.

import geopandas

range_gdf = geopandas.GeoDataFrame(range_df.toPandas(), geometry="geometry")
range_gdf.to_file('birdbuddy_range.geojson', driver='GeoJSON')

We’ve now created a GeoJSON file birdbuddy_range.geojson that looks a bit like this (we’ve omitted some lines):

{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {
                "common_name": "eastern bluebird",
                "num": 65971
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            -80.3452,
                            25.6062
                        ],
                        [
                            -98.2271,
                            26.2516
                        ],
                        ...
                        [
                            -80.3452,
                            25.6062
                        ]
                    ]
                ]
            }
        },
        ...
    ]
}

Felt Maps API

If you haven’t already, create a free Felt account and then in your account settings generate a new access token so you’ll be able to create maps and upload data via the Felt API.

Creating a Felt API token

FELT_TOKEN = '<YOUR_TOKEN_HERE>'

To create a new map and upload data we’ll actually need to make a few network requests to the Felt API:

  1. /maps to create a new map. This endpoint will return the id and url of the new map.
  2. /maps/{map_id}/layers to create a new layer in our new map. Note we need to use the map_id from the previous request. This endpoint will return a presigned upload URL that will allow us to upload our GeoJSON file.
  3. /maps/{map_id}/layers/{layer_id}/finish_upload to indicate we have finished uploading our data using the presigned upload URL.

The function below will create a new map in Felt, then create a new layer and upload our GeoJSON file to this layer. See the Felt API docs for more examples of what’s possible with the Felt API.

import requests

def create_felt_map(access_token, filename, map_title, layer_name):

    # First create a new map using the /maps endpoint
    create_map_response = requests.post(
        f"https://felt.com/api/v1/maps",
        headers={
            "authorization": f"Bearer {access_token}",
            "content-type": "application/json",
        },
        json={"title": map_title},
    )
    create_map_data = create_map_response.json()
    map_id = create_map_data['data']['id']
    map_url = create_map_data['data']['attributes']['url']
    print(create_map_data)

    # Next, we'll create a new layer and get a presigned upload url so we can upload our GeoJSON file
    layer_response = requests.post(
        f"https://felt.com/api/v1/maps/{map_id}/layers",
        headers={
            "authorization": f"Bearer {access_token}",
            "content-type": "application/json",
        },
        json={"file_names": [filename], "name": layer_name},
    )

    # This endpoint will return a pre-signed URL that we use to upload the file to Felt
    presigned_upload = layer_response.json()
    url = presigned_upload["data"]["attributes"]["url"]
    presigned_attributes = presigned_upload["data"]["attributes"]["presigned_attributes"]

    # A 204 response indicates that the upload was successful
    with open(filename, "rb") as file_obj:
        output = requests.post(
            url,
            # Order is important, file should come at the end
            files={**presigned_attributes, "file": file_obj},
        )
    layer_id = presigned_upload['data']['attributes']['layer_id']
    print(output)
    print(layer_id)
    print(presigned_upload)

    # Finally, we call the /maps/:map_id/layers/:layer_id/finish_upload endpoint to complete the process
    finish_upload = requests.post(
        f"https://felt.com/api/v1/maps/{map_id}/layers/{layer_id}/finish_upload",
        headers={
            "authorization": f"Bearer {access_token}",
            "content-type": "application/json",
        },
        json={"filename": filename, "name": layer_name},
    )
    print(finish_upload.json())

Now to create a new Felt map we can call this function, passing our API token, the name of our GeoJSON file as well as what we’d like to call our new map and the data layer.

create_felt_map(FELT_TOKEN, "birdbuddy_range.geojson", "North American Bird Ranges", "My Favorite Birds")

We’ve now created a new map in Felt and uploaded our GeoJSON data as a new layer. We can share the URL with anyone on the web to view or collaborate on our map!

North American bird ranges

We can also embed the map in our Jupyter notebook:

from IPython.display import HTML
HTML('<iframe width="1600" height="600" frameborder="0" title="My Favorite Bird Ranges" src="https://felt.com/embed/map/North-American-Bird-Ranges-a4c5cOCaRMiL64KK5N27TA"></iframe>')

The 30 Day Map Challenge

Every year the geospatial data and map community joins together to organize the "30 Day Map Challenge," a fun and informal challenge to create a new map and share it on social media each day for one month.

The 30 Day Map Challenge

This BirdBuddy map was my 30 Day Map Challenge map for Day 3: Polygons. You can find the full Jupyter Notebook with all code on GitHub here as well as some of my other 30 Day Map Challenge maps in this repository. If you’d like to follow along with my attempt at the rest of the 30 Day Map Challenge feel free to connect with me on Twitter or LinkedIn.
