In our previous post, we described the performance and scalability challenges of generating vector tiles and how WherobotsDB VTiles helps overcome them. In this article, we put that theory into action and demonstrate how you can use VTiles to generate vector tiles for three planetary-scale Overture Maps Foundation datasets in PMTiles format: Places, Buildings, and Division Areas.
If you’ve ever wondered how to efficiently generate global PMTiles from Overture Maps, this post shows how we did it in as little as 26 minutes, processing billions of features with feature filtering and distributed computation.
Vector tiles and PMTiles together make it possible to store, render, and serve large-scale maps efficiently in the cloud. Vector tiles are small chunks of map data that allow for efficient and customizable map rendering at varying zoom levels. They contain geometric and attribute data (for example, roads and their names) that facilitate dynamic styling of map features on the fly, offering more flexibility and interactivity.
PMTiles is a cloud-native file format that is designed for holding an entire collection of tiles, in this case vector tiles. The PMTiles format enables individual tiles to be queried directly from cloud object storage like Amazon S3. By querying directly from cloud storage, you no longer need to set up and manage dedicated infrastructure, reducing your costs, effort, and time-to-tile-generation.
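To make that concrete, here is a minimal sketch of reading a single tile out of a local PMTiles archive with the open-source `pmtiles` Python package (`pip install pmtiles`). The reader API shown is our understanding of that package, and the file path and z/x/y coordinates are placeholders; against S3, the same reader pattern is driven by ranged GET requests instead of a local file.

```python
from pmtiles.reader import Reader, MmapSource  # from the `pmtiles` package

# Placeholder path and tile coordinates.
with open("tiles.pmtiles", "rb") as fh:
    reader = Reader(MmapSource(fh))   # serves byte ranges out of the archive
    print(reader.header())            # zoom bounds, tile type, and other metadata
    tile = reader.get(6, 32, 21)      # the MVT tile bytes for one z/x/y address
```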
If you’re sharing, inspecting, or debugging tiles, you’ll need to visualize them. To make these processes easier, Wherobots created a tile viewer site, available at tile-viewer.wherobots.com. This tool comes from the PMTiles GitHub repository and offers features for inspecting and debugging tilesets.
The viewer takes a URL for a tileset. If your tiles are stored in a private S3 bucket, you will need to generate a signed URL. Wherobots Cloud has a function for converting your S3 URI to a signed URL:
```python
from wherobots.tools.utility.s3_utils import get_signed_url

get_signed_url(my_s3_path, expiration_in_seconds)
```
Here, `my_s3_path` is an S3 URI, like `s3://myBucket/my/prefix/to/tiles.pmtiles`, and `expiration_in_seconds` is an int representing the number of seconds the signed URL will remain valid.
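For example, a minimal sketch (the bucket path and the 3600-second expiry are placeholder values):

```python
from wherobots.tools.utility.s3_utils import get_signed_url

# Placeholder path and expiry; substitute your own tileset location.
signed_url = get_signed_url("s3://myBucket/my/prefix/to/tiles.pmtiles", 3600)
print(signed_url)  # paste this URL into tile-viewer.wherobots.com
```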
The tile viewer will be used to explore the tiles we generate in our examples.
The following examples show tile generation using VTiles for three Overture layers at a planetary scale. Because we are working with planetary-scale datasets and want quick results, we will use the large runtimes available in the professional tier of Wherobots Cloud.
Tile generation time is provided in each example, and includes time to load the input data, transform it, generate tiles, and save the PMTiles file in an S3 bucket. It does not include the time to start the cluster.
To run the examples below, make sure your Sedona session is started:
```python
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
```
We start by creating PMTiles for the places dataset. With VTiles, this is a straightforward case: places are point features, so each one intersects only a single tile per zoom level, which keeps the number of (feature, tile) pairs, and thus the work the system must do, low.
```python
import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.places_place").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("categories.main").alias("category"),
        f.lit('places').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/places.pmtiles",
    GenerationConfig(6, 15),
)
```
This example generates a PMTiles file for zooms 6 through 15. Since the places dataset contains features that are not relevant at a global level, we selected a minimum zoom of 6, at which a tile covers roughly the area of a large European country. The max zoom of 15 is selected because the precision provided should be sufficient, and overzooming means that our places will still render at higher zooms. The OpenStreetMap wiki has a helpful page about how large a tile is at each zoom level. The name and category of each place will be included in the tiles.
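As a quick sanity check on those zoom choices, here is a back-of-the-envelope computation of tile width at the equator (a sketch; Web Mercator tiles shrink east-west as latitude increases):

```python
# Approximate ground width of one Web Mercator tile at the equator.
EARTH_CIRCUMFERENCE_KM = 40_075

def tile_width_km(zoom: int) -> float:
    return EARTH_CIRCUMFERENCE_KM / (2 ** zoom)

print(f"zoom 6:  ~{tile_width_km(6):,.0f} km")  # ~626 km: large-country scale
print(f"zoom 15: ~{tile_width_km(15):.2f} km")  # ~1.22 km: neighborhood scale
```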
Tile generation time was 2 minutes and 23 seconds on a Tokyo runtime. The resulting PMTiles archive is 28.1 GB.
This example generates tiles for all buildings in the Overture building dataset, about 2.3 billion features. The features are roughly uniform in size, and buildings are small relative to even the highest-zoom tiles, so the number of (feature, tile) combinations is close to `|features| * |zooms|`. Because of this homogeneity, we can expect a quick execution without the use of a feature filter. This example represents a typical use case: a very large number of features, each smaller than the extent of a tile at maximum zoom.
```python
import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.buildings_building").select(
        "geometry",
        f.lit('buildings').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/buildings.pmtiles",
    GenerationConfig(10, 15),
)
```
This example generates a PMTiles file for zooms 10 through 15. The minimum zoom of 10 was selected because buildings aren’t useful at lower zooms for most use cases. The max zoom of 15 was selected because the precision provided should be sufficient, and overzooming means that our buildings will still render at higher zooms. The properties of a very large percentage of the Overture buildings are null, so we haven’t included any here.
Tile generation time was 26 minutes on a Tokyo runtime. The resulting PMTiles archive is 438.4 GB.
The third example creates tiles for all polygons and multipolygons in the Overture division areas dataset. This dataset contains just under one million records. Despite its small size, it can be challenging to process: its polygons and multipolygons represent areas ranging from countries, which are large and highly detailed, down to small neighborhoods with minimal detail. The appropriate min/max zoom for countries and for neighborhoods is very different.
Recall from the places example that the amount of work the system must do is strongly related to the number of (feature, tile) pairs. A country outline like Canada might cover an entire tile at zoom 5, so it will appear in roughly `2 * 4^(max_zoom - 5)` tiles across all zooms; if max_zoom is 15, that’s over 2 million tiles. If you do not take this into account, you can quickly wind up with an unexpectedly long execution and an unexpectedly large tiles archive. Most use cases benefit from setting different min and max zooms for different features, which you can do in VTiles via a feature filter.
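A quick computation makes the scale concrete (a sketch of the estimate above, not Wherobots code):

```python
# Back-of-the-envelope check: a feature covering a whole tile at zoom 5
# intersects about 4**(z - 5) tiles at each zoom z from 5 to max_zoom.
max_zoom = 15
exact = sum(4 ** (z - 5) for z in range(5, max_zoom + 1))
approx = 2 * 4 ** (max_zoom - 5)  # the rough factor-of-2 bound used above
print(f"{exact:,}")   # 1,398,101
print(f"{approx:,}")  # 2,097,152 -- "over 2 million tiles"
```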
Let’s first profile the base case with no feature filter.
```python
import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area.pmtiles",
    GenerationConfig(3, 15),
)
```
This run took a bit over 3 hours on a Tokyo runtime, and the resulting PMTiles archive is 158.0 GB. This small dataset takes more time than the buildings dataset, which is more than 2,300 times larger!
We can significantly accelerate this example using VTiles feature filters. Feature filters are most commonly used to determine which features should appear in a tile based on a category and the zoom level. In this case, we will show countries only at lower zooms and neighborhoods only at the highest zooms. A feature much larger than the tile has minimal visual impact in typical use cases, and a neighborhood smaller than the tile can resolve has none at all; it is literally invisible, or at best a single pixel. By excluding features that add no visual information, we save processing time and storage costs, and the now-smaller tiles are faster to serve.
Here is an example of using feature filters to improve performance of this division area generation task:
```python
import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area_filtered.pmtiles",
    GenerationConfig(
        min_zoom=2,
        max_zoom=15,
        feature_filter=(
            ((f.col("subType") == f.lit("country")) & (f.col("tile.z") < f.lit(7)))
            | ((f.col("subType") == f.lit("region")) & (f.lit(3) < f.col("tile.z")) & (f.col("tile.z") < f.lit(10)))
            | ((f.col("subType") == f.lit("county")) & (f.lit(9) < f.col("tile.z")) & (f.col("tile.z") < f.lit(12)))
            | ((f.col("subType") == f.lit("locality")) & (f.lit(10) < f.col("tile.z")) & (f.col("tile.z") < f.lit(14)))
            | ((f.col("subType") == f.lit("localadmin")) & (f.lit(13) < f.col("tile.z")))
            | ((f.col("subType") == f.lit("neighborhood")) & (f.lit(13) < f.col("tile.z")))
        ),
    ),
)
```
This run took less than 10 minutes on a Tokyo runtime. The resulting PMTiles archive is 8.9 GB.
Feature filters reduced tile generation time by more than 90%, shrank the resulting archive, and lowered the cost compared to the unfiltered run. The tiles will also appear less cluttered to the user, without anyone having to get their hands dirty in style sheets.
We know that there are use cases with large geometries where it might be difficult to write an effective feature filter, or where filtering is undesirable. For those use cases, Wherobots 1.3.1 introduces an option on GenerationConfig called repartition_frequency, sketched below. When features are repeatedly split as the algorithm zooms in, the child features wind up in the same partition; this can cause a well-partitioned input dataset to become skewed by even a single large record. Setting a repartition frequency of 2 or 4 helps keep cluster utilization high by keeping partitions roughly uniform in size.
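Here is a minimal sketch of how that option might be used (assuming repartition_frequency is accepted as a keyword argument of GenerationConfig alongside the zoom bounds; the exact signature may differ, so check the Wherobots documentation):

```python
from wherobots.vtiles import GenerationConfig

# Hypothetical usage: repartition the working set every 4 zoom levels so
# child features from repeatedly split large geometries are spread back
# across the cluster instead of piling up in one partition.
config = GenerationConfig(
    min_zoom=2,
    max_zoom=15,
    repartition_frequency=4,  # assumed keyword; introduced in Wherobots 1.3.1
)
```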
The VTiles tile generation functionality is a fast and cost-effective way to generate tiles for global data. The Apache Spark-based runtime, powered by Apache Sedona and Wherobots Cloud, makes loading and transforming input data straightforward and performant, even on large datasets. You can leverage feature filters to curate the contents of your tiles to your use cases and performance goals. We encourage you to try out VTiles with your own data on Wherobots Cloud.