Generating global PMTiles from Overture Maps in 26 minutes with WherobotsDB VTiles
Learn why WherobotsDB VTiles is the most scalable and performant solution for generating vector tiles. We will demonstrate this by generating vector tiles from a massive (>2 billion record) worldwide dataset in 26 minutes.
Contributors
James Willis
James is a Senior Geospatial Software Engineer simplifying the experience of gaining insights and deriving value from spatial vector data.
We previously described the performance and scalability challenges of generating tiles and how they can be overcome with WherobotsDB VTiles. Today we will demonstrate how you can use VTiles to generate vector tiles for three planetary scale Overture Maps Foundation datasets in PMTiles format: places, buildings, and division areas.
Quick recap: What are Vector Tiles and why should you use PMTiles?
Vector tiles are small chunks of map data that allow for efficient and customizable map rendering at varying zoom levels. Vector tiles contain geometric and attribute data, for example roads and their names, that facilitate dynamic styling of map features on the fly, offering more flexibility and interactivity.
PMTiles is a cloud-native file format that is designed for holding an entire collection of tiles, in this case vector tiles. The PMTiles format enables individual tiles to be queried directly from cloud object storage like Amazon S3. By querying directly from cloud storage, you no longer need to set up and manage dedicated infrastructure, reducing your costs, effort, and time-to-tile-generation.
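To make the "query directly from object storage" idea concrete, here is a minimal sketch that parses a few fields from the start of a PMTiles archive. It assumes the PMTiles v3 layout: a fixed 127-byte header beginning with the magic bytes `PMTiles` and a version byte, followed by little-endian offsets and, further in, the min and max zoom. A real client would fetch these bytes with an HTTP range request and then follow the directory offsets. The function name and the synthetic header below are illustrative only and not part of VTiles.

```python
import struct

HEADER_LENGTH = 127  # assumed size of the PMTiles v3 header in bytes


def parse_pmtiles_header(buf):
    """Parse a few fields from the fixed-size header of a PMTiles v3 archive."""
    if len(buf) < HEADER_LENGTH:
        raise ValueError("need at least 127 bytes")
    if buf[0:7] != b"PMTiles":
        raise ValueError("not a PMTiles archive")
    # The root directory offset and length are little-endian uint64 values
    # stored immediately after the magic bytes and the version byte.
    root_offset, root_length = struct.unpack_from("<QQ", buf, 8)
    return {
        "version": buf[7],
        "root_offset": root_offset,
        "root_length": root_length,
        "min_zoom": buf[100],
        "max_zoom": buf[101],
    }


# Synthetic header for illustration only (not read from a real archive).
fake = bytearray(HEADER_LENGTH)
fake[0:7] = b"PMTiles"
fake[7] = 3
struct.pack_into("<QQ", fake, 8, 127, 500)
fake[100], fake[101] = 6, 15

header = parse_pmtiles_header(bytes(fake))
print(header)
```

Because the header and directories sit at known byte ranges, a reader only ever downloads the small slices it needs, which is why no tile server is required.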
Tile Viewer
If you’re sharing, inspecting, or debugging tiles, you’ll need to visualize them. To make these processes easier, Wherobots created a tile viewer site, available at tile-viewer.wherobots.com. This tool comes from the PMTiles GitHub repository, and it offers the following features:
- Viewing tiles with a dark themed basemap
- Inspecting individual tiles, selected from a list of all the tiles in the set
- Inspecting the metadata and header information of the PMTiles file
This viewer takes a URL for a tileset. If your tiles are stored in a private S3 bucket, you will need to generate a signed URL. Wherobots Cloud has a function for converting your S3 URI to a signed URL:
from wherobots.tools.utility.s3_utils import get_signed_url
get_signed_url(my_s3_path, expiration_in_seconds)
Here, my_s3_path is an S3 URI, like s3://myBucket/my/prefix/to/tiles.pmtiles, and expiration_in_seconds is an int giving the number of seconds for which the signed URL will be valid.
The tile viewer will be used to explore the tiles we generate in our examples.
Examples
The following examples show tile generation using VTiles for three Overture layers at a planetary scale. Because we are working with planetary scale datasets and want quick results, we will use the large runtimes available in the professional tier of Wherobots Cloud.
Tile generation time is provided in each example, and includes time to load the input data, transform it, generate tiles, and save the PMTiles file in an S3 bucket. It does not include the time to start the cluster.
To run the examples below, make sure your Sedona session is started:
from sedona.spark import SedonaContext
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
Places
We start by creating PMTiles for the places dataset. With VTiles, this is a straightforward case for several reasons:
- The dataset contains only points. A point has no extent, so it rarely spans multiple tiles. Tile generation time is strongly influenced by the number of (feature, tile) pairs, that is, the number of tiles each feature intersects, summed over all features.
- At 50 million records, this is a relatively small dataset compared to the buildings dataset at 2.3 billion features.
- We will do minimal customization. VTiles’ feature filters allow us to control which features go into which tiles based on the tile id (x, y, z) and the feature itself (area, length, and user-provided columns). We will go more in depth on feature filters in the division areas example.
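The first point in the list above can be illustrated with a short sketch. It uses the standard Web Mercator (slippy map) tile formula to find the tile containing a point at each zoom. This is not VTiles code; the function names are hypothetical, and tile buffers, which can occasionally place a point in a neighboring tile as well, are ignored.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the Web Mercator tile containing a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge or beyond the Mercator limit stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_point(lon, lat, min_zoom, max_zoom):
    """List the (z, x, y) tiles a single point falls into across a zoom range."""
    return [(z, *lonlat_to_tile(lon, lat, z)) for z in range(min_zoom, max_zoom + 1)]


# An arbitrary point, for zooms 6 through 15 as in the places example.
pairs = tiles_for_point(139.6917, 35.6895, 6, 15)
print(len(pairs))  # one (feature, tile) pair per zoom level
```

A point contributes exactly one pair per zoom level, so the total work for a point dataset grows linearly with the number of features and the number of zooms.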
import os
import pyspark.sql.functions as f
from wherobots.vtiles import GenerationConfig, generate_pmtiles
generate_pmtiles(
    # Keep only the geometry and the attributes we want in the tiles.
    sedona.table("wherobots_open_data.overture_2024_05_16.places_place").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("categories.main").alias("category"),
        f.lit("places").alias("layer"),
    ),
    # Output location of the PMTiles archive.
    os.getenv("USER_S3_PATH") + "tile_blog/places.pmtiles",
    # Minimum zoom 6, maximum zoom 15.
    GenerationConfig(6, 15),
)
This example generates a PMTiles file for zooms 6 through 15. Since the places dataset contains features that are not relevant at a global level, we selected a minimum zoom of 6, about the size of a large European country. The max zoom of 15 is selected because the precision provided should be sufficient and overzooming means that our places will still render at higher zooms. The OpenStreetMap wiki has a helpful page about how large a tile is at each zoom level. The name and category of each place will be included in the tiles.
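As a quick sanity check on these zoom choices, the sketch below estimates the ground width of a tile at the equator for a few zoom levels. It assumes the usual Web Mercator tiling, where the world is one tile wide at zoom 0 and each zoom level halves the tile width; tiles cover less ground away from the equator. The function name is hypothetical.

```python
EQUATOR_KM = 40075.016686  # WGS84 equatorial circumference in kilometers


def tile_width_km(zoom):
    """Approximate east-west extent of one tile at the equator."""
    return EQUATOR_KM / (2 ** zoom)


for z in (6, 10, 15):
    print(f"zoom {z}: about {tile_width_km(z):.2f} km wide at the equator")
```

At zoom 6 a tile is several hundred kilometers across, and at zoom 15 it is on the order of a kilometer.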
Tile generation time was 2 minutes and 23 seconds on a Tokyo runtime. The resulting PMTiles archive is 28.1 GB.
Buildings
This example generates tiles for all buildings in the Overture buildings dataset, about 2.3 billion features. Because the features are roughly uniform in size and small relative to the tiles at the higher zooms, the number of (feature, tile) combinations is close to |features| * |zooms|. Because of this homogeneity, we can expect a quick execution without the use of a feature filter. This example represents a typical use case where there is a very large number of features and where the extent of a tile at maximum zoom is larger than the size of a feature.
import os
import pyspark.sql.functions as f
from wherobots.vtiles import GenerationConfig, generate_pmtiles
generate_pmtiles(
    # Only the geometry is kept; every feature goes into the "buildings" layer.
    sedona.table("wherobots_open_data.overture_2024_05_16.buildings_building").select(
        "geometry",
        f.lit("buildings").alias("layer"),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/buildings.pmtiles",
    # Minimum zoom 10, maximum zoom 15.
    GenerationConfig(10, 15),
)
This example generates a PMTiles file for zooms 10 through 15. The minimum zoom of 10 was selected because buildings aren’t useful at lower zooms for most use cases. The max zoom of 15 was selected because the precision provided should be sufficient and overzooming means that our buildings will still render at higher zooms. The properties of a very large percentage of the Overture buildings are null so we haven’t included any here.
Tile generation time was 26 minutes on a Tokyo runtime. The resulting PMTiles archive is 438.4 GB.
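The |features| * |zooms| approximation is easy to work through. The sketch below applies it to the two datasets generated so far, using the feature counts and zoom ranges quoted in this post. It is a back-of-the-envelope estimate that assumes every feature lands in exactly one tile per zoom, and the function name is hypothetical.

```python
def estimated_pairs(feature_count, min_zoom, max_zoom):
    """Estimate (feature, tile) pairs when each feature fits in one tile per zoom."""
    zoom_levels = max_zoom - min_zoom + 1
    return feature_count * zoom_levels


places_pairs = estimated_pairs(50_000_000, 6, 15)
buildings_pairs = estimated_pairs(2_300_000_000, 10, 15)

print(f"places:    ~{places_pairs:,} pairs")
print(f"buildings: ~{buildings_pairs:,} pairs")
```

The estimate grows linearly with both the feature count and the number of zoom levels, which is why the buildings run stays predictable despite its size.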
Division Areas
The third example creates tiles for all polygons and multipolygons in the Overture division areas dataset. This dataset is just under one million records. Despite its small size, this dataset can be challenging to process. It contains polygons and multipolygons representing areas, from countries which are large and highly detailed, to small neighborhoods with minimal detail. The appropriate min/max zoom for countries and neighborhoods is very different.
Recall from the places example that the amount of work the system must do is strongly related to the number of (feature, tile) pairs. A country outline like Canada might cover an entire tile at zoom 5. It will be in roughly 2 * 4^(max_zoom - 5) tiles across all zooms; if max_zoom is 15, that’s over 2 million tiles. You can quickly wind up with an unexpectedly large execution time and tiles archive if you do not take this into account. Most use cases will benefit from setting different min and max zooms for different features, which you can do in VTiles via a feature filter.
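To see where a number of this magnitude comes from, the sketch below sums the tiles touched by a feature that fills a given number of whole tiles at a starting zoom, where each tile splits into four children at the next zoom. This is a simplified model with a hypothetical function name: a feature that fills exactly one tile at zoom 5 gives about 1.4 million tiles through zoom 15, and a feature spanning more than one tile at zoom 5, as a country the size of Canada would, pushes the count toward the rough 2 * 4^(max_zoom - 5) figure used above.

```python
def tiles_covered(start_zoom, max_zoom, tiles_at_start=1):
    """Count tiles across all zooms for a feature that fills whole tiles."""
    return sum(
        tiles_at_start * 4 ** (z - start_zoom)
        for z in range(start_zoom, max_zoom + 1)
    )


single_tile_total = tiles_covered(5, 15)  # feature fills exactly one zoom-5 tile
rough_estimate = 2 * 4 ** (15 - 5)        # the rough figure quoted in the text

print(f"exact sum for one zoom-5 tile: {single_tile_total:,}")
print(f"rough estimate from the text:  {rough_estimate:,}")
```

Either way, the count is dominated by the deepest zoom levels, which is why capping the max zoom for large features pays off so quickly.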
Let’s first profile the base case with no feature filter.
import os
import pyspark.sql.functions as f
from wherobots.vtiles import GenerationConfig, generate_pmtiles
generate_pmtiles(
    # Use the division subtype (country, region, ...) as the tile layer name.
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias("layer"),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area.pmtiles",
    # Minimum zoom 3, maximum zoom 15, no feature filter.
    GenerationConfig(3, 15),
)
This run took a bit over 3 hours on a Tokyo runtime. The resulting PMTiles archive is 158.0 GB. This small dataset takes more time than the buildings dataset that is more than 2300 times larger!
Feature Filters
We can significantly accelerate the execution time of this example using the VTiles feature filters. These feature filters are most commonly used to determine what features should be in a tile on the basis of a category and the zoom level. In this case we will only show countries at lower zooms and neighborhoods at the highest zoom levels. The visual impact of a feature that is much larger than the tile is minimal in typical use cases. The visual impact of a neighborhood is null when it’s smaller than the tile can resolve; it is literally invisible, or perhaps a single pixel. By excluding these features that add no visual information, we save processing time and storage costs, as well as increase the performance of serving the now-smaller tiles.
Here is an example of using feature filters to improve performance of this division area generation task:
import os
import pyspark.sql.functions as f
from wherobots.vtiles import GenerationConfig, generate_pmtiles
generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias("layer"),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area_filtered.pmtiles",
    GenerationConfig(
        min_zoom=2,
        max_zoom=15,
        # Keep a feature in a tile only when its subtype matches the tile's zoom band.
        feature_filter=(
            ((f.col("subType") == f.lit("country")) & (f.col("tile.z") < f.lit(7))) |
            ((f.col("subType") == f.lit("region")) & (f.lit(3) < f.col("tile.z")) & (f.col("tile.z") < f.lit(10))) |
            ((f.col("subType") == f.lit("county")) & (f.lit(9) < f.col("tile.z")) & (f.col("tile.z") < f.lit(12))) |
            ((f.col("subType") == f.lit("locality")) & (f.lit(10) < f.col("tile.z")) & (f.col("tile.z") < f.lit(14))) |
            ((f.col("subType") == f.lit("localadmin")) & (f.lit(13) < f.col("tile.z"))) |
            ((f.col("subType") == f.lit("neighborhood")) & (f.lit(13) < f.col("tile.z")))
        ),
    ),
)
This run took less than 10 minutes on a Tokyo runtime. The resulting PMTiles archive is 8.9 GB.
Feature filters reduced tile generation time by more than 90% (from a bit over 3 hours to less than 10 minutes), shrank the archive from 158.0 GB to 8.9 GB, and lowered the cost compared to the unfiltered run. The tiles will also appear less cluttered to the user, without anyone having to get their hands dirty with style sheets.
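The zoom bands encoded in the filter expression can be hard to read at a glance. The following plain-Python sketch restates them as inclusive zoom ranges per subtype, derived from the inequalities in the example together with its min_zoom of 2 and max_zoom of 15. It is only a reading aid with hypothetical names, not how VTiles evaluates the filter.

```python
# Inclusive (first_zoom, last_zoom) per subtype, derived from the filter above.
ZOOM_RANGES = {
    "country": (2, 6),        # tile.z < 7, starting at min_zoom 2
    "region": (4, 9),         # 3 < tile.z < 10
    "county": (10, 11),       # 9 < tile.z < 12
    "locality": (11, 13),     # 10 < tile.z < 14
    "localadmin": (14, 15),   # 13 < tile.z, up to max_zoom 15
    "neighborhood": (14, 15), # 13 < tile.z, up to max_zoom 15
}


def keep_feature(subtype, zoom):
    """Return True if a feature of this subtype is kept at this zoom."""
    if subtype not in ZOOM_RANGES:
        return False
    first, last = ZOOM_RANGES[subtype]
    return first <= zoom <= last


def subtypes_at_zoom(zoom):
    """List the subtypes that appear in tiles at a given zoom."""
    return sorted(s for s in ZOOM_RANGES if keep_feature(s, zoom))


for z in range(2, 16):
    print(z, subtypes_at_zoom(z))
```

Laying the bands out this way makes it easy to spot overlaps, such as counties and localities both appearing at zoom 11, and to confirm that no subtype is drawn far outside the zooms where it is visually useful.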
A Note on Working without Feature Filters
We know that there are use cases with large geometries where it might be difficult to write an effective feature filter, or where filtering is undesirable. For those use cases, Wherobots 1.3.1 introduced a GenerationConfig option called repartition_frequency that improves tile generation performance. When features are repeatedly split as the algorithm zooms in, the child features wind up in the same partition as their parent. As a result, even a single large record can skew an otherwise well-partitioned input dataset. Setting repartition_frequency to 2 or 4 can help keep cluster utilization high by keeping partitions roughly uniform in size.
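The skew described here can be illustrated with a toy model. In the sketch below, a single large feature starts in one partition and splits into four children at every zoom level, with children staying in their parent's partition unless a repartition spreads all pieces evenly. This is only an illustration of the effect, not the VTiles implementation; in particular, treating the frequency as "repartition every N zoom levels" is an assumption, and the function name is hypothetical.

```python
def max_partition_load(num_partitions, zoom_levels, repartition_frequency=None):
    """Toy model: largest partition size after splitting one big feature."""
    # All pieces of the large feature start in partition 0.
    loads = [1] + [0] * (num_partitions - 1)
    for level in range(1, zoom_levels + 1):
        # Each piece splits into four children that stay in the same partition.
        loads = [4 * count for count in loads]
        # Assumed meaning of the frequency: rebalance every N zoom levels.
        if repartition_frequency and level % repartition_frequency == 0:
            base, extra = divmod(sum(loads), num_partitions)
            loads = [base + (1 if i < extra else 0) for i in range(num_partitions)]
    return max(loads)


skewed = max_partition_load(num_partitions=8, zoom_levels=6)
balanced = max_partition_load(num_partitions=8, zoom_levels=6, repartition_frequency=2)

print(f"largest partition without repartitioning: {skewed}")
print(f"largest partition with frequency 2:       {balanced}")
```

In this model the total number of pieces is the same in both runs; repartitioning only changes how evenly they are spread, which is what keeps all workers busy.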
Conclusion
The VTiles tile generation functionality is a fast and cost effective way to generate tiles for global data. The Apache Spark-based runtime powered by Apache Sedona and Wherobots Cloud makes loading and transforming data for input into the system straightforward and performant even on large datasets. You can leverage feature filters to curate the contents of your tiles to your use cases and performance goals. We encourage you to try out VTiles with your own data on Wherobots Cloud.