
Generate Global PMTiles in 26 Minutes: Processing 2.3B Overture Maps Data


How to Generate Global PMTiles from Overture Maps in 26 Minutes with WherobotsDB

Overview: PMTiles Generation at Planetary Scale

WherobotsDB VTiles generates global PMTiles from Overture Maps data in 26 minutes by processing billions of features with distributed computation. The system handles three datasets: Places (50M features, 2m 23s), Buildings (2.3B features, 26 minutes), and Division Areas (1M features, optimized to 10 minutes with feature filters).

Key capabilities:

  • Processes 2.3 billion building features in 26 minutes
  • Reduces division area processing from 3 hours to 10 minutes with feature filters
  • Generates cloud-native PMTiles format for direct S3 queries
  • Scales to planetary datasets using Apache Sedona and Spark

This article demonstrates how to use WherobotsDB VTiles to generate vector tiles for three planetary-scale Overture Maps Foundation datasets in PMTiles format: Places, Buildings, and Division Areas, with feature filtering and distributed computation across billions of features.

What are Vector Tiles and PMTiles?

Vector tiles and PMTiles together make it possible to store, render, and serve large-scale maps efficiently in the cloud. Vector tiles are small chunks of map data that allow for efficient and customizable map rendering at varying zoom levels. They contain geometric and attribute data, for example roads and their names, that facilitate dynamic styling of map features on the fly, offering more flexibility and interactivity.

PMTiles is a cloud-native file format that is designed for holding an entire collection of tiles, in this case vector tiles. The PMTiles format enables individual tiles to be queried directly from cloud object storage like Amazon S3. By querying directly from cloud storage, you no longer need to set up and manage dedicated infrastructure, reducing your costs, effort, and time-to-tile-generation.
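To make "queried directly from cloud object storage" more concrete: a PMTiles reader typically starts by fetching the first bytes of the archive with an HTTP range request, then uses the offsets found there to fetch only the directory and tile bytes it needs. The sketch below decodes a few header fields. The byte offsets follow my reading of the published PMTiles v3 specification and should be checked against it; `parse_pmtiles_header` and `build_example_header` are hypothetical helper names, and the header is synthetic so the example runs without a real file.

```python
import struct

HEADER_LENGTH = 127  # fixed header size in PMTiles v3 (assumption: per the published spec)


def parse_pmtiles_header(buf: bytes) -> dict:
    """Decode a few fields from the first 127 bytes of a PMTiles v3 archive."""
    if len(buf) < HEADER_LENGTH:
        raise ValueError("need at least 127 bytes of header")
    if buf[0:7] != b"PMTiles":
        raise ValueError("missing PMTiles magic number")
    # Offsets and lengths are little-endian unsigned 64-bit integers.
    root_offset, root_length = struct.unpack_from("<QQ", buf, 8)
    return {
        "version": buf[7],
        "root_directory_offset": root_offset,
        "root_directory_length": root_length,
        "min_zoom": buf[100],
        "max_zoom": buf[101],
    }


def build_example_header(min_zoom: int, max_zoom: int) -> bytes:
    """Build a synthetic header (hypothetical values) for demonstration only."""
    buf = bytearray(HEADER_LENGTH)
    buf[0:7] = b"PMTiles"
    buf[7] = 3
    struct.pack_into("<QQ", buf, 8, 127, 2048)  # made-up directory offset and length
    buf[100] = min_zoom
    buf[101] = max_zoom
    return bytes(buf)


header = parse_pmtiles_header(build_example_header(6, 15))
print(header)
```

Against a real archive, the same function could be fed the result of a ranged GET for bytes 0 through 126, which is why no tile server is required.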

Key terms:

  • Vector tiles: Geometric and attribute data chunks that enable dynamic map styling at varying zoom levels
  • PMTiles: Cloud-native archive format for storing entire tile collections with direct S3 query support
  • Feature filtering: Method to control which features appear at specific zoom levels based on geometry size and type
  • Overture Maps: Open-source geospatial dataset including 2.3B+ buildings, 50M+ places, and administrative boundaries

How to Visualize and Inspect PMTiles

If you’re sharing, inspecting, or debugging tiles, you’ll need to visualize them. To make these processes easier, Wherobots created a tile viewer site, available at tile-viewer.wherobots.com. This tool comes from the PMTiles GitHub repository, and it offers the following features:

  • Viewing tiles with a dark themed basemap
  • Inspecting individual tiles, selected from a list of all the tiles in the set
  • Inspecting the metadata and header information of the PMTiles file

This viewer takes a URL for a tileset. If your tiles are stored in a private S3 bucket, you will need to generate a signed URL. Wherobots Cloud has a function for converting your S3 URI to a signed URL:

from wherobots.tools.utility.s3_utils import get_signed_url

get_signed_url(my_s3_path, expiration_in_seconds)

my_s3_path is an S3 URI, like s3://myBucket/my/prefix/to/tiles.pmtiles, and expiration_in_seconds is an int giving the number of seconds for which the signed URL will be valid.

The tile viewer will be used to explore the tiles we generate in our examples.

PMTiles Generation Examples: Overture Maps

The following examples show tile generation using VTiles for three Overture layers at planetary scale. Because we are working with planetary-scale datasets and want quick results, we will use the large runtimes available in the professional tier of Wherobots Cloud.

Tile generation time is provided in each example, and includes time to load the input data, transform it, generate tiles, and save the PMTiles file in an S3 bucket. It does not include the time to start the cluster.

To run the examples below, make sure your Sedona session is started:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Generating PMTiles for Overture Places Dataset (50M Features)

We start by creating PMTiles for the places dataset. With VTiles, this is a straightforward case for several reasons:

  1. The dataset contains only points. A point has no extent, so it rarely spans multiple tiles. Tile generation time is strongly influenced by the number of (feature, tile) pairs, that is, the sum over all features of the number of tiles each feature intersects.
  2. At 50 million records, this is a relatively small dataset compared to the buildings dataset at 2.3 billion features.
  3. We will do minimal customization. VTiles’ feature filters allow us to control which features go into which tiles based on the tile id (x, y, z) and the feature itself (area, length, and user-provided columns). We will go more in depth on feature filters in the division areas example.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.places_place").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("categories.main").alias("category"),
        f.lit('places').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/places.pmtiles",
    GenerationConfig(6, 15)
)

This example generates a PMTiles file for zooms 6 through 15. Since the places dataset contains features that are not relevant at a global level, we selected a minimum zoom of 6, at which a single tile is about the size of a large European country. The max zoom of 15 is selected because the precision it provides should be sufficient, and overzooming means that our places will still render at higher zooms. The OpenStreetMap wiki has a helpful page about how large a tile is at each zoom level. The name and category of each place will be included in the tiles.
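As a rough guide to the zoom choices above, the width of a tile at the equator can be estimated by dividing the Earth's equatorial circumference by the number of tile columns at that zoom. This is a minimal sketch assuming the standard Web Mercator tiling scheme; `tile_width_km` is a hypothetical helper, and tiles cover less ground distance away from the equator.

```python
EARTH_CIRCUMFERENCE_KM = 40075.017  # approximate equatorial circumference


def tile_width_km(zoom: int) -> float:
    """Approximate east-west width of one tile at the equator, in kilometers."""
    return EARTH_CIRCUMFERENCE_KM / (2 ** zoom)


for zoom in (6, 10, 15):
    print(f"zoom {zoom}: about {tile_width_km(zoom):.2f} km per tile at the equator")
```

Under these assumptions a zoom 6 tile is roughly 626 km wide at the equator and a zoom 15 tile is a little over 1.2 km wide.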

Performance Results:

  • Dataset size: 50 million features
  • Processing time: 2 minutes 23 seconds
  • Output file size: 28.1 GB
  • Zoom levels: 6-15
  • Runtime: Tokyo (large)

Generating PMTiles for Overture Buildings Dataset (2.3B Features)

This example generates tiles for all buildings in the Overture buildings dataset, about 2.3 billion features. The features are roughly uniform in size, and a building is small compared with even the highest-zoom tiles, so the number of (feature, tile) combinations is close to |features| * |zooms|. Because of this homogeneity, we can expect a quick execution without the use of a feature filter. This example represents a typical use case where there is a very large number of features and where the extent of a tile at maximum zoom is larger than the size of a feature.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.buildings_building").select(
        "geometry",
        f.lit('buildings').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/buildings.pmtiles",
    GenerationConfig(10, 15)
)

This example generates a PMTiles file for zooms 10 through 15. The minimum zoom of 10 was selected because buildings aren’t useful at lower zooms for most use cases. The max zoom of 15 was selected because the precision provided should be sufficient and overzooming means that our buildings will still render at higher zooms. The properties of a very large percentage of the Overture buildings are null so we haven’t included any here.
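The (feature, tile) estimate for buildings can be illustrated with the standard slippy-map tile formula. The sketch below counts how many tiles a bounding box touches at each zoom; it is not how VTiles computes tile membership, the helper names are hypothetical, and the building-sized bounding box is made up for illustration.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple:
    """Return the (x, y) Web Mercator tile containing a lon/lat point."""
    n = 2 ** zoom
    x = math.floor((lon + 180.0) / 360.0 * n)
    y = math.floor((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west: float, south: float, east: float, north: float, zoom: int) -> int:
    """Count the tiles a bounding box touches at a single zoom level."""
    x_min, y_max = lonlat_to_tile(west, south, zoom)
    x_max, y_min = lonlat_to_tile(east, north, zoom)
    return (x_max - x_min + 1) * (y_max - y_min + 1)


def feature_tile_pairs(bbox: tuple, min_zoom: int, max_zoom: int) -> int:
    """Sum the tile count for one bounding box over a zoom range."""
    return sum(tiles_for_bbox(*bbox, zoom) for zoom in range(min_zoom, max_zoom + 1))


# Hypothetical building footprint, roughly 11 m on a side near the equator.
building_bbox = (10.0, 0.005, 10.0001, 0.0051)
pairs_per_building = feature_tile_pairs(building_bbox, 10, 15)
print("pairs for one building:", pairs_per_building)
print("estimate for 2.3B buildings:", 2_300_000_000 * pairs_per_building)
```

A building that sits inside one tile at every zoom contributes one pair per zoom level, so six zoom levels over 2.3 billion buildings give on the order of 14 billion pairs. A building that straddles a tile boundary contributes slightly more.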

Performance Results:

  • Dataset size: 2.3 billion features
  • Processing time: 26 minutes
  • Output file size: 438.4 GB
  • Zoom levels: 10-15
  • Runtime: Tokyo (large)

Generating PMTiles for Overture Division Areas Dataset

The third example creates tiles for all polygons and multipolygons in the Overture division areas dataset. This dataset is just under one million records. Despite its small size, this dataset can be challenging to process. It contains polygons and multipolygons representing areas, from countries which are large and highly detailed, to small neighborhoods with minimal detail. The appropriate min/max zoom for countries and neighborhoods is very different.

Recall from the places example that the amount of work the system must do is strongly related to the number of (feature, tile) pairs. A country outline like Canada might cover an entire tile at zoom 5. It will be in roughly 2 * 4^(max_zoom - 5) tiles across all zooms; if max_zoom is 15, that’s over 2 million tiles. You can quickly wind up with an unexpectedly large execution time and tiles archive if you do not take this into account. Most use cases will benefit from setting different min and max zooms for different features, which you can do in VTiles via a feature filter.
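The growth described above can be checked with a short calculation. This is a sketch under a simplifying assumption: the feature fully covers a fixed number of tiles at a starting zoom, and every covered tile splits into four covered tiles at the next zoom. `tiles_across_zooms` is a hypothetical helper, not part of VTiles.

```python
def tiles_across_zooms(tiles_at_start: int, start_zoom: int, max_zoom: int) -> int:
    """Total tiles covered from start_zoom through max_zoom, quadrupling per zoom."""
    return sum(
        tiles_at_start * 4 ** (zoom - start_zoom)
        for zoom in range(start_zoom, max_zoom + 1)
    )


one_tile_total = tiles_across_zooms(1, 5, 15)
deepest_zoom_only = 4 ** (15 - 5)
print("tiles at zoom 15 alone:", deepest_zoom_only)
print("tiles across zooms 5 through 15:", one_tile_total)
```

For a feature that fills exactly one tile at zoom 5, the total is about 1.4 million tiles, the same order of magnitude as the rough estimate in the text. A feature that also overlaps neighboring tiles at zoom 5 lands higher, and most of the total always comes from the deepest zoom level.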

Let’s first profile the base case with no feature filter.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area.pmtiles",
    GenerationConfig(3, 15)
)

Performance Results (Without Optimization):

  • Dataset size: 1 million features
  • Processing time: 3+ hours
  • Output file size: 158.0 GB
  • Zoom levels: 3-15
  • Runtime: Tokyo (large)
  • Issue: Large geometries like countries span millions of tiles

How to Optimize PMTiles Generation with Feature Filters

We can significantly reduce the execution time of this example using VTiles feature filters. Feature filters are most commonly used to determine which features should be in a tile on the basis of a category and the zoom level. In this case we will show countries only at lower zooms and neighborhoods only at the highest zoom levels. The visual impact of a feature that is much larger than the tile is minimal in typical use cases. The visual impact of a neighborhood is nil when it’s smaller than the tile can resolve; it is literally invisible, or perhaps a single pixel. By excluding features that add no visual information, we save processing time and storage costs, and we improve the performance of serving the now-smaller tiles.

Here is an example of using feature filters to improve performance of this division area generation task:

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area_filtered.pmtiles",
    GenerationConfig(
        min_zoom=2,
        max_zoom=15,
        feature_filter=(
            ((f.col("subType") == f.lit("country")) & (f.col("tile.z") < f.lit(7))) |
            ((f.col("subType") == f.lit("region")) & (f.lit(3) < f.col("tile.z")) & (f.col("tile.z") < f.lit(10))) |
            ((f.col("subType") == f.lit("county")) & (f.lit(9) < f.col("tile.z")) & (f.col("tile.z") < f.lit(12))) |
            ((f.col("subType") == f.lit("locality")) & (f.lit(10) < f.col("tile.z")) & (f.col("tile.z") < f.lit(14))) |
            ((f.col("subType") == f.lit("localadmin")) & (f.lit(13) < f.col("tile.z"))) |
            ((f.col("subType") == f.lit("neighborhood")) & (f.lit(13) < f.col("tile.z")))
        )
    )
)
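To see which zoom levels each subtype ends up in, the filter expression above can be mirrored in plain Python. This is only an illustration of the same strict inequalities; `ZOOM_RULES`, `is_included`, and `zooms_for` are hypothetical names, and the real filtering is done by the Spark expression passed to GenerationConfig.

```python
# (exclusive lower bound, exclusive upper bound) on tile.z; None means unbounded.
ZOOM_RULES = {
    "country": (None, 7),
    "region": (3, 10),
    "county": (9, 12),
    "locality": (10, 14),
    "localadmin": (13, None),
    "neighborhood": (13, None),
}


def is_included(subtype: str, zoom: int) -> bool:
    """Return True if a feature of this subtype passes the filter at this zoom."""
    if subtype not in ZOOM_RULES:
        return False  # a subtype with no matching clause is dropped at every zoom
    lower, upper = ZOOM_RULES[subtype]
    return (lower is None or lower < zoom) and (upper is None or zoom < upper)


def zooms_for(subtype: str, min_zoom: int = 2, max_zoom: int = 15) -> list:
    """List the zoom levels at which a subtype is written to tiles."""
    return [zoom for zoom in range(min_zoom, max_zoom + 1) if is_included(subtype, zoom)]


for subtype in ZOOM_RULES:
    print(f"{subtype:>12}: zooms {zooms_for(subtype)}")
```

Listing the ranges this way makes gaps and overlaps easy to spot before launching a long job. For example, counties appear only at zooms 10 and 11, and any subtype not named in the filter is excluded entirely.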

Performance Results:

  • Dataset size: 1 million features (filtered by zoom/subtype)
  • Processing time: 9 minutes 47 seconds
  • Output file size: 8.9 GB
  • Zoom levels: 2-15 (filtered by feature type)
  • Runtime: Tokyo (large)
  • Improvement: about 94% less processing time, 94% smaller file size
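The improvement figures follow directly from the reported numbers. The short calculation below reproduces them, treating "3+ hours" as exactly three hours, which makes the time reduction a lower bound.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which a quantity shrank going from before to after."""
    return (1 - after / before) * 100


unfiltered_seconds = 3 * 60 * 60   # "3+ hours", taken as a lower bound
filtered_seconds = 9 * 60 + 47     # 9 minutes 47 seconds

time_reduction = percent_reduction(unfiltered_seconds, filtered_seconds)
size_reduction = percent_reduction(158.0, 8.9)  # output sizes in GB
speedup = unfiltered_seconds / filtered_seconds

print(f"processing time reduced by at least {time_reduction:.1f}%")
print(f"file size reduced by {size_reduction:.1f}%")
print(f"speedup of at least {speedup:.1f}x")
```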

Improving PMTiles Performance Without Feature Filters

We know that there are use cases with large geometries where it might be difficult to write an effective feature filter, or where filtering is undesirable. For those use cases, Wherobots 1.3.1 adds an option on GenerationConfig called repartition_frequency to improve tile generation performance. When features are repeatedly split as the algorithm zooms in, the child features wind up in the same partition as their parent. This can cause a well-partitioned input dataset to become skewed by even a single large record. Setting repartition_frequency to 2 or 4 can help keep cluster utilization high by keeping partitions roughly uniform in size.
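The skew effect can be illustrated with a toy model. This is not the VTiles algorithm: it assumes one large feature that splits into four children at every zoom step while small features do not split, and it assumes that a repartition frequency of N means "rebalance every N zoom steps", which the text above does not spell out. `simulate_skew` is a hypothetical function.

```python
def simulate_skew(num_partitions, small_per_partition, zoom_steps, repartition_frequency=None):
    """Return max partition size divided by mean partition size after zooming in."""
    small = [small_per_partition] * num_partitions
    large_children = [0] * num_partitions
    large_children[0] = 1  # a single large record lands in partition 0

    for step in range(1, zoom_steps + 1):
        # Each piece of the large feature splits into four, staying in its partition.
        large_children = [count * 4 for count in large_children]
        if repartition_frequency and step % repartition_frequency == 0:
            # Toy repartition: spread the pieces evenly across all partitions.
            base, extra = divmod(sum(large_children), num_partitions)
            large_children = [base + (1 if i < extra else 0) for i in range(num_partitions)]

    sizes = [s + c for s, c in zip(small, large_children)]
    return max(sizes) / (sum(sizes) / len(sizes))


skew_without = simulate_skew(8, 1000, 6)
skew_with = simulate_skew(8, 1000, 6, repartition_frequency=2)
print(f"skew without repartitioning: {skew_without:.2f}")
print(f"skew with repartitioning every 2 steps: {skew_with:.2f}")
```

In this model the partition holding the large record grows to more than three times the average size without repartitioning, so most of the cluster would sit idle waiting for it, while periodic rebalancing keeps every partition at the same size.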

PMTiles Generation Performance Summary

WherobotsDB VTiles is a fast and cost-effective solution for generating tiles from global datasets. The Apache Spark-based runtime powered by Apache Sedona enables straightforward data loading and transformation, delivering proven performance at planetary scale:

Performance metrics:

  • 2.3 billion building features processed in 26 minutes
  • 94% reduction in processing time with feature filtering (3+ hours → 10 minutes)
  • 94% file size reduction (158 GB → 8.9 GB)
  • Direct S3 querying without tile server infrastructure

Best practices for optimization:

  • Apply feature filters to match geometry size with zoom level resolution
  • Use repartitioning for datasets with large, irregular geometries
  • Set zoom ranges appropriate to feature type (buildings: 10-15, places: 6-15)
  • Leverage Apache Sedona for data loading and transformation

Feature filters let you curate tile contents to match your specific use cases and performance goals. Try VTiles with your own data on Wherobots Cloud.
