Generating global PMTiles from Overture Maps in 26 minutes with WherobotsDB VTiles

We previously described the performance and scalability challenges of generating tiles and how they can be overcome with WherobotsDB VTiles. Today we will demonstrate how you can use VTiles to generate vector tiles for three planetary scale Overture Maps Foundation datasets in PMTiles format: places, buildings, and division areas.

Quick recap: What are Vector Tiles and why should you use PMTiles?

Vector tiles are small chunks of map data that allow for efficient and customizable map rendering at varying zoom levels. Vector tiles contain geometric and attribute data, for example roads and their names, that facilitate dynamic styling of map features on the fly, offering more flexibility and interactivity.

PMTiles is a cloud-native file format that is designed for holding an entire collection of tiles, in this case vector tiles. The PMTiles format enables individual tiles to be queried directly from cloud object storage like Amazon S3. By querying directly from cloud storage, you no longer need to set up and manage dedicated infrastructure, reducing your costs, effort, and time-to-tile-generation.
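Because a PMTiles archive is a single file with an internal index, any client that can issue byte-range reads can fetch individual tiles from it. Here is a minimal sketch using the open-source pmtiles Python package from the PMTiles project (the file name is hypothetical):

# pip install pmtiles (the reference reader from the PMTiles project)
from pmtiles.reader import Reader, MmapSource

with open("tiles.pmtiles", "rb") as f:   # hypothetical local archive
    reader = Reader(MmapSource(f))
    print(reader.header())               # zoom bounds, tile type, tile counts
    tile_bytes = reader.get(6, 32, 21)   # fetch a single tile by (z, x, y)

Over HTTP or S3, the same lookups translate into range requests against the hosted file, which is what makes the format serverless-friendly.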

Tile Viewer

If you’re sharing, inspecting, or debugging tiles you’ll need to visualize them. To make these processes easier, Wherobots created a tile viewer site, available at tile-viewer.wherobots.com. This tool comes from the PMTiles GitHub repository, and it offers the following features:

  • Viewing tiles with a dark themed basemap
  • Inspecting individual tiles, selected from a list of all the tiles in the set
  • Inspecting the metadata and header information of the PMTiles file

This viewer takes a URL for a tileset. If your tiles are stored in a private S3 bucket, you will need to generate a signed URL. Wherobots Cloud has a function for converting your S3 URI to a signed URL:

from wherobots.tools.utility.s3_utils import get_signed_url

get_signed_url(my_s3_path, expiration_in_seconds)

my_s3_path is an S3 URI, like s3://myBucket/my/prefix/to/tiles.pmtiles, and expiration_in_seconds is an int representing the number of seconds the signed URL will remain valid.
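For example, to create a URL that remains valid for one hour:

# Signed URL valid for one hour; paste it into tile-viewer.wherobots.com
signed_url = get_signed_url("s3://myBucket/my/prefix/to/tiles.pmtiles", 3600)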

The tile viewer will be used to explore the tiles we generate in our examples.

Examples

The following examples show tile generation using VTiles for three Overture layers at a planetary scale. Because we are working with planetary scale datasets and want quick results, we will use the large runtimes available in the professional tier of Wherobots Cloud.

Tile generation time is provided in each example, and includes time to load the input data, transform it, generate tiles, and save the PMTiles file in an S3 bucket. It does not include the time to start the cluster.

To run the examples below, just make sure your Sedona session is started:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Places

We start by creating PMTiles for the places dataset. With VTiles, this is a straightforward case for several reasons:

  1. The dataset contains only points. A point feature rarely spans multiple tiles because it has no dimensions. Tile generation time is strongly influenced by the total number of (feature, tile) pairs, that is, the sum over all features of the number of tiles each feature intersects.
  2. At 50 million records, this is a relatively small dataset compared to the buildings dataset at 2.3 billion features.
  3. We will do minimal customization. VTiles’ feature filters allow us to control which features go into which tiles based on the tile id (x, y, z) and the feature itself (area, length, and user-provided columns). We will go more in depth on feature filters in the division areas example.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.places_place").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("categories.main").alias("category"),
        f.lit('places').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/places.pmtiles",
    GenerationConfig(6, 15)
)

This example generates a PMTiles file for zooms 6 through 15. Since the places dataset contains features that are not relevant at a global level, we selected a minimum zoom of 6, about the size of a large European country. The max zoom of 15 is selected because the precision provided should be sufficient and overzooming means that our places will still render at higher zooms. The OpenStreetMap wiki has a helpful page about how large a tile is at each zoom level. The name and category of each place will be included in the tiles.
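For a quick intuition without leaving the notebook, the tile grid doubles in resolution with each zoom level, so a back-of-the-envelope estimate of tile width at the equator is a one-liner:

# Approximate Web Mercator tile width at the equator for a given zoom
EARTH_CIRCUMFERENCE_KM = 40075

def tile_width_km(zoom):
    return EARTH_CIRCUMFERENCE_KM / (2 ** zoom)

tile_width_km(6)   # ~626 km: on the order of a large European country
tile_width_km(15)  # ~1.2 km: ample precision for point features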

Tile generation time was 2 minutes and 23 seconds on a Tokyo runtime. The resulting PMTiles archive is 28.1 GB.

Buildings

This example generates tiles for all buildings in the Overture buildings dataset, about 2.3 billion features. The roughly uniform size of the features, and the small size of buildings relative to the tiles at higher zooms, mean that the number of (feature, tile) combinations is similar to |features| * |zooms|. Because of this homogeneity, we can expect a quick execution without the use of a feature filter. This example represents a typical use case: a very large number of features, where the extent of a tile at maximum zoom is larger than the typical feature.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.buildings_building").select(
        "geometry",
        f.lit('buildings').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/buildings.pmtiles",
    GenerationConfig(10, 15)
)

This example generates a PMTiles file for zooms 10 through 15. The minimum zoom of 10 was selected because buildings aren’t useful at lower zooms for most use cases. The max zoom of 15 was selected because the precision provided should be sufficient, and overzooming means that our buildings will still render at higher zooms. The properties of a very large percentage of the Overture buildings are null, so we haven’t included any here.

Tile generation time was 26 minutes on a Tokyo runtime. The resulting PMTiles archive is 438.4 GB.

Division Areas

The third example creates tiles for all polygons and multipolygons in the Overture division areas dataset. This dataset is just under one million records. Despite its small size, this dataset can be challenging to process. It contains polygons and multipolygons representing areas, from countries which are large and highly detailed, to small neighborhoods with minimal detail. The appropriate min/max zoom for countries and neighborhoods is very different.

Recall from the places example that the amount of work the system must do is strongly related to the number of (feature, tile) pairs. A country outline like Canada might cover an entire tile at zoom 5. It will be in roughly 2 * 4^(max_zoom - 5) tiles across all zooms; if max_zoom is 15, that’s over 2 million tiles. You can quickly wind up with an unexpectedly large execution time and tiles archive if you do not take this into account. Most use cases will benefit from setting different min and max zooms for different features, which you can do in VTiles via a feature filter.
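To make that concrete, here is the arithmetic behind the approximation above (a sketch of the reasoning, not VTiles internals). The number of tiles such a feature covers quadruples with each zoom level; summing that geometric series, and allowing for overlap with neighboring tiles at each level, gives roughly 2 * 4^(max_zoom - 5):

# (feature, tile) pairs for a feature covering a full tile at zoom 5
exact_series = sum(4 ** (z - 5) for z in range(5, 16))  # ~1.4 million, before boundary overlap
approximation = 2 * 4 ** (15 - 5)                       # 2,097,152: "over 2 million" tiles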

Let’s first profile the base case with no feature filter.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area.pmtiles",
    GenerationConfig(3, 15)
)

This run took a bit over 3 hours on a Tokyo runtime. The resulting PMTiles archive is 158.0 GB. This small dataset takes more time than the buildings dataset that is more than 2300 times larger!

Feature Filters

We can significantly accelerate the execution time of this example using the VTiles feature filters. These feature filters are most commonly used to determine what features should be in a tile on the basis of a category and the zoom level. In this case we will only show countries at lower zooms and neighborhoods at the highest zoom levels. The visual impact of a feature that is much larger than the tile is minimal in typical use cases. The visual impact of a neighborhood is null when it’s smaller than the tile can resolve; it is literally invisible, or perhaps a single pixel. By excluding these features that add no visual information, we save processing time and storage costs, as well as increase the performance of serving the now-smaller tiles.

Here is an example of using feature filters to improve performance of this division area generation task:

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area_filtered.pmtiles",
    GenerationConfig(
        min_zoom=2,
        max_zoom=15,
        # Note: subtype was aliased to "layer" in the select above
        feature_filter=(
            ((f.col("layer") == f.lit("country")) & (f.col("tile.z") < f.lit(7))) |
            ((f.col("layer") == f.lit("region")) & (f.lit(3) < f.col("tile.z")) & (f.col("tile.z") < f.lit(10))) |
            ((f.col("layer") == f.lit("county")) & (f.lit(9) < f.col("tile.z")) & (f.col("tile.z") < f.lit(12))) |
            ((f.col("layer") == f.lit("locality")) & (f.lit(10) < f.col("tile.z")) & (f.col("tile.z") < f.lit(14))) |
            ((f.col("layer") == f.lit("localadmin")) & (f.lit(13) < f.col("tile.z"))) |
            ((f.col("layer") == f.lit("neighborhood")) & (f.lit(13) < f.col("tile.z")))
        )
    )
)

This run took less than 10 minutes on a Tokyo runtime. The resulting PMTiles archive is 8.9 GB.

Feature filters reduced tile generation time by more than 90%, reduced the dataset size, and lowered the cost compared to the original example. Tiles will also appear less cluttered to the user without having to get one’s hands dirty playing with style sheets.

A Note on Working without Feature Filters

We know that there are use cases with large geometries where it might be difficult to write an effective feature filter, or where filtering is undesirable. For those use cases we have launched a feature in Wherobots 1.3.1 to improve tile generation performance: an option on GenerationConfig called repartition_frequency. When features are repeatedly split as the algorithm zooms in, the resulting child features wind up in the same partition. This can cause a well-partitioned input dataset to become skewed by even a single large record. Setting a repartition frequency of 2 or 4 helps keep cluster utilization high by keeping partitions roughly uniform in size.
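Here is a minimal sketch of how this looks (assuming Wherobots 1.3.1+, where GenerationConfig accepts a repartition_frequency keyword argument; the input DataFrame and output path are hypothetical):

import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    large_geometry_df,  # hypothetical DataFrame with geometry and layer columns
    os.getenv("USER_S3_PATH") + "tile_blog/large_geoms.pmtiles",
    # Repartition every 2 zoom levels to keep partition sizes roughly uniform
    GenerationConfig(min_zoom=3, max_zoom=15, repartition_frequency=2),
)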

Conclusion

The VTiles tile generation functionality is a fast and cost effective way to generate tiles for global data. The Apache Spark-based runtime powered by Apache Sedona and Wherobots Cloud makes loading and transforming data for input into the system straightforward and performant even on large datasets. You can leverage feature filters to curate the contents of your tiles to your use cases and performance goals. We encourage you to try out VTiles with your own data on Wherobots Cloud.

Unlock Satellite Imagery Insights with WherobotsAI Raster Inference

Recently we introduced WherobotsAI Raster Inference, which simplifies extracting insights from satellite and aerial imagery using SQL or Python, powered by open-source machine learning models. This feature is currently in preview, and we are expanding its capabilities to support more models. Below we’ll dig into the popular computer vision tasks that Raster Inference supports, describe how it works, and show how you can use it to run batch inference to find and map electricity infrastructure.

Watch the live demo of these capabilities here.

The Power of Machine Learning with Satellite Imagery

Petabytes of satellite imagery are generated each day all over the world in a dizzying number of sensor types and image resolutions. The applications for satellite imagery and other remote sensing data sources are broad and diverse. For example, satellites with consistent, continuous orbits are ideal for monitoring forest carbon stocks to validate carbon credits or estimating agricultural yields.

However, this data has been inaccessible for most analysts and even seasoned ML practitioners because insight extraction required specialized skills. We’ve done the work to make insight extraction simple and accessible to more people. Raster Inference abstracts the complexity and scales to support planetary-scale imagery datasets, so you don’t need ML expertise to derive insights. In this blog, we explore the key features that make Raster Inference effective for land cover classification, solar farm mapping, and marine infrastructure detection. And, in the near future, you will be able to use Raster Inference with your own models!

Introduction to Popular and Supported Machine Learning Tasks

Raster Inference supports the three most common kinds of computer vision models that are applied to imagery: classification, object detection, and semantic segmentation. Instance segmentation, which combines object localization and semantic segmentation, is another common type of model that is not currently supported; if you need it, contact us and we can add it to the roadmap.

Computer Vision Detection Types
Computer Vision Detection Categories from Lin et al. Microsoft COCO: Common Objects in Context

The figure above illustrates these tasks. Image classification is when an image is assigned one or more text labels. In image (a), the scene is assigned the labels “person”, “sheep”, and “dog”. Image (b) is an example of object localization (or object detection). Object localization creates bounding boxes around objects of interest and assigns labels. In this image, five sheep are localized separately along with one human and one dog. Finally, semantic segmentation is when each pixel is given a category label, as shown in image (c). Here we can see all the pixels belonging to sheep are labeled blue, the dog is labeled red, and the person is labeled teal.

While these examples highlight detection tasks on everyday imagery, the same computer vision models can be applied to raster-formatted imagery. Raster data formats are the most common formats for satellite and aerial imagery. When objects of interest in raster imagery are localized, their bounding boxes can be georeferenced, which means that each pixel is mapped to spatial coordinates, such as latitude and longitude. Georeferencing is what makes object localization suitable for spatial analytics.
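As a simple illustration of georeferencing (a generic sketch, not the Raster Inference internals), a north-up raster's geotransform maps pixel indices to geographic coordinates:

def pixel_to_geo(col_idx, row_idx, origin_x, origin_y, pixel_width, pixel_height):
    # Affine, north-up geotransform: pixel_height is negative because
    # row indices increase downward while latitude decreases.
    return (origin_x + col_idx * pixel_width, origin_y + row_idx * pixel_height)

# e.g. georeference one corner of a detected bounding box (illustrative numbers)
pixel_to_geo(120, 45, origin_x=-112.5, origin_y=33.7,
             pixel_width=1e-5, pixel_height=-1e-5)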

https://wherobots.com/wp-content/uploads/2024/06/remotesensing-11-00339-g005.png

The example above shows various applications of object detection for localizing and classifying features in high resolution satellite and aerial imagery. This example comes from DOTA, a 15-class dataset of different objects in RGB and grayscale satellite imagery. Public datasets like DOTA are used to develop and benchmark machine learning models.

Not only are there many publicly available object detection models, but there are also many semantic segmentation models.

Semantic Segmentation
Sourced from “A Scale-Aware Masked Autoencoder for Multi-scale Geospatial Representation Learning”.

Not every machine learning model should be treated equally; each has its own tradeoffs. You can see the difference between the ground truth image (human-annotated buildings representing the real world) and the segmentation results across two models (Scale-MAE and Vanilla MAE). These results are derived from the same image at two different resolutions (referred to as GSD, or Ground Sampling Distance).

  • Scale-MAE is a model developed to handle detection tasks at various resolutions with different sensor inputs. It uses a similar MAE model architecture as the Vanilla MAE, but is trained specifically for detection tasks on overhead imagery that span different resolutions.
  • The Vanilla MAE is not trained to handle varying resolutions in overhead imagery. Its performance suffers in the top row and especially the bottom row, where the resolution is coarser; many pixels are incorrectly classified, as seen in the mismatch between the Vanilla MAE output and the ground truth image.

Satellite Analytics Before Raster Inference

Without Raster Inference, a team looking to extract insights from overhead imagery using ML would typically need to:

  1. Deploy a distributed runtime to scale out workloads such as data loading, preprocessing, and inference.
  2. Develop functionality to operate on raster metadata to easily filter it by location to run inference workloads on specific areas of interest.
  3. Optimize models to run performantly on GPUs, which can involve complex rewrites of the underlying model prediction logic.
  4. Create and manage data preprocessing pipelines to normalize, resize, and collate raster imagery into the correct data type and size required by the model.
  5. Develop the logic to run data loading, preprocessing, and model inference efficiently at scale.

Raster Inference and its SQL and Python APIs abstract this complexity so you and your team can easily perform inference on massive raster datasets.

Raster Inference APIs for SQL and Python

Raster Inference offers APIs in both SQL and Python to run inference tasks. These APIs are designed to be easy to use, even if you’re not a machine learning expert. RS_CLASSIFY can be used for scene classification, RS_BBOXES_DETECT for object detection, and RS_SEGMENT for semantic segmentation. Each function produces tabular results which can be georeferenced either for the scene, object, or segmentation depending on the function. The records can be joined or visualized with other data (geospatial or traditional) to curate enriched datasets and insights. Here are SQL and Python examples for RS_SEGMENT.

-- SQL
RS_SEGMENT('{model_id}', outdb_raster) AS segment_result

# Python
df = df_raster_input.withColumn("segment_result", rs_segment(model_id, col("outdb_raster")))
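Scene classification follows the same pattern. A minimal sketch using the SQL API (the table name and model identifier below are hypothetical):

model_id = "landcover-classification"  # hypothetical model identifier
df = sedona.sql(f"""
SELECT
    outdb_raster,
    RS_CLASSIFY('{model_id}', outdb_raster) AS classification
FROM my_raster_table
""")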

Example: Mapping Electricity Infrastructure

Imagine you want to optimize the location of new EV charging stations, but you want to target locations based on the availability of green energy sources, such as local solar farms. You can use Raster Inference to detect and locate solar farms and cross-reference these locations with internal data or other vector geometries that capture demand for EV charging. This use case will be demonstrated in our upcoming release webinar on July 10th.

Let’s walk through how to use Raster Inference for this use case.

First, we run predictions on rasters to find solar farms. The following code block, which calls RS_SEGMENT, shows how easy this is.

CREATE OR REPLACE TEMP VIEW segment_fields AS (
    SELECT
        outdb_raster,
        RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
    FROM
        az_high_demand_with_scene
)

The confidence_array column produced by RS_SEGMENT can be assigned the same geospatial coordinates as the raster input and converted to vector geometries that can be spatially joined and processed with WherobotsDB using RS_SEGMENT_TO_GEOMS. We select a confidence threshold of 0.65 so that we only georeference high-confidence detections.

WITH t AS (
    SELECT RS_SEGMENT_TO_GEOMS(outdb_raster, confidence_array, array(1), class_map, 0.65) AS result
    FROM predictions_df
)
SELECT result.* FROM t
+----------+--------------------+--------------------+
|     class|avg_confidence_score|            geometry|
+----------+--------------------+--------------------+
|Solar Farm|  0.7205783606825462|MULTIPOLYGON (((-...|
|Solar Farm|  0.7273308333550763|MULTIPOLYGON (((-...|
|Solar Farm|  0.7301468510823231|MULTIPOLYGON (((-...|
|Solar Farm|  0.7180177244988899|MULTIPOLYGON (((-...|
|Solar Farm|   0.728077805771141|MULTIPOLYGON (((-...|
|Solar Farm|     0.7264981572898|MULTIPOLYGON (((-...|
|Solar Farm|  0.7044100126912517|MULTIPOLYGON (((-...|
|Solar Farm|  0.7137283466756343|MULTIPOLYGON (((-...|
+----------+--------------------+--------------------+

This allows us to integrate the vectorized model predictions with other spatial datasets and easily visualize the results with SedonaKepler.
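A minimal sketch of that visualization step (assuming predictions_polys holds the vectorized detections from the previous query):

from sedona.spark import SedonaKepler

# Render the detected solar farm polygons on an interactive map
solar_map = SedonaKepler.create_map(df=predictions_polys, name="Solar Farm Detections")
solar_map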

https://wherobots.com/wp-content/uploads/2024/06/solar_farm_detection-1-1024x398.png

Here Raster Inference runs on an 85 GiB dataset with 2,200 raster scenes for Arizona. Using a Sedona (tiny) runtime, Raster Inference completed in 430 seconds, predicting solar farms across all low-cloud-cover satellite images for the state of Arizona for the month of October. If we scale up to a San Francisco (small) runtime, the inference speed nearly doubles. In general, average bytes processed per second increases as datasets grow because startup costs are amortized over time. Processing speed also increases as runtimes scale in size.

Runtime size            Inference time (seconds)
Sedona (tiny)           430
San Francisco (small)   246

We use predictions from the output of Raster Inference to derive insights about which zip codes have the most solar farms, as shown below. This statement joins predicted solar farms with zip codes by location, then ranks zip codes by the pre-computed solar farm area within each zip code. We skipped the area pre-computation step for brevity, but you can see it and others in the example notebook.

az_solar_zip_codes = sedona.sql("""
SELECT SUM(solar_area) AS solar_area, -- aggregate the pre-computed per-polygon areas
       any_value(az_zta5.geometry) AS geometry, ZCTA5CE10
FROM predictions_polys JOIN az_zta5
ON ST_Intersects(az_zta5.geometry, predictions_polys.geometry)
GROUP BY ZCTA5CE10
ORDER BY solar_area DESC
""")

https://wherobots.com/wp-content/uploads/2024/06/final_analysis.png

These predictions are made possible by SATLAS, a family of machine learning models released with Apache 2.0 licensing from Allen AI. The solar model demonstrated above was derived from the SATLAS foundational model. This foundational model can be used as a building block to create models to address specific detection challenges like solar farm detection. Additionally, there are many other open source machine learning models available for deriving insights from satellite imagery, many of which are provided by the TorchGeo project. We are just beginning to explore what these models can achieve for planetary-scale monitoring.

If you have a specific model you would like to see made available, please contact us to let us know.

For detailed instructions on using Raster Inference, please refer to our example Jupyter notebooks in the documentation.

https://wherobots.com/wp-content/uploads/2024/06/Screenshot_2024-06-08_at_2.11.07_PM-1024x683.png

Here are some links to get you started:
https://docs.wherobots.com/latest/tutorials/wherobotsai/wherobots-inference/segmentation/

https://docs.wherobots.com/latest/api/wherobots-inference/pythondoc/inference/sql_functions/

Getting Started

Getting started with WherobotsAI Raster Inference is easy. We’ve provided three models in Wherobots Cloud that can be used with our GPU optimized runtimes. Sign up for an account on Wherobots Cloud, send us a note to access the professional tier, start a GPU runtime, and you can run our example Jupyter notebooks to analyze satellite imagery in SQL or Python.

Stay tuned for updates on improvements to Raster Inference that will make it possible to run more models, including your own custom models. We’re excited to hear what models you’d like us to support, or the integrations you need to make running your own models even easier with Raster Inference. We can’t wait for your feedback and to see what you’ll create!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter.

Quickly and Easily Create Vector Tiles with Wherobots and Overture Maps Foundation Data

Generating map tiles can be challenging and expensive, especially when dealing with large datasets like those from the Overture Maps Foundation (Overture). It’s challenging and expensive because typical solutions are not scalable or performant, which forces you to build workarounds that waste your time and are not economical. Wherobots addresses these challenges using WherobotsDB, a purpose-built, planetary scale compute engine with a native vector tile generator (VTiles) that produces tiles in an easy to work with, cloud-native file format (PMTiles).

In this post, we’ll show you how Wherobots makes generating vector tiles from billions of features at large scale a breeze! We’ll prove this to you using a demo that generates a tileset for all 11 Overture layers in New York City, and we’ll repeat this at a larger scale for all the transportation segments across the state of New York.

What Are Vector Tiles and PMTiles?

Vector tiles are small chunks of map data that allow for efficient rendering at varying zoom levels. Unlike raster tiles, which are pre-rendered images, vector tiles contain attribute and geometric data that facilitate dynamic styling of map features on the fly, offering more flexibility and interactivity.

PMTiles is a cloud-native file format that is designed for holding an entire collection of tiles, in this case vector tiles. The PMTiles format allows individual tiles to be queried directly from cloud object storage like Amazon S3. By querying directly from cloud storage, you no longer need to set up and manage dedicated infrastructure, reducing your costs and time-to-tile-generation.

The Challenge of Creating Tiles in a Distributed Environment

Generating tiles from worldwide map datasets has always been a challenge. You had to process billions of geometric features using solutions that are not purpose-built for this scale, resulting in:

  • High compute costs from development and production runs as well as fixed infrastructure.
  • Frequent crashes forcing developers to babysit and troubleshoot the system.
  • Tiles that are out-of-date because tile generation is time consuming and difficult.
  • Tile generation taking hours or days, rather than minutes or seconds, to complete.

Benefits of Using Wherobots for Tile Generation

  • Efficiency: Wherobots optimizes the tile generation process, making it faster and more cost efficient.
  • Scalability: WherobotsDB and VTiles are part of a purpose-built system designed to handle large datasets without compromising performance, providing a resilient solution for your geospatial needs.
  • Flexibility: VTiles accelerates time-to-production by allowing you to iterate faster using a more responsive, lower cost development experience.

Wherobots for Highly Scalable Vector Tile Generation

Wherobots VTiles, our new native vector tile generator, incorporates innovative algorithms for distributed tile generation on WherobotsDB, a high performance spatial compute engine. VTiles is designed to generate vector tiles from small to planetary scale datasets quickly and cost-efficiently. Wherobots handles the heavy lifting and infrastructure management, ensuring the tile generation process is performant, scalable, and easy. We will prove this in the following demos.

Demo: From NYC to NY State

We’ll use Wherobots VTiles to generate PMTiles for all Overture layers in New York City. Then we will scale the demo up by generating PMTiles for all transportation segments in New York State, using feature filter optimizations.

Step 1: Preparing the Data

First, we need to load the administrative, places, transportation, base, and buildings layers (yes, all of them!) from the Overture data for New York City. The latest Overture dataset is included in the Wherobots Spatial Catalog out-of-the-box, which makes it easy to load all of these layers in a single statement.

_aoi = "POLYGON ((-73.957901 40.885486, …, -73.957901 40.885486))"
df_transportation_segment = sedona.sql(f"""
SELECT
ST_INTERSECTION(ST_GEOMFROMWKT("{_aoi}"),geometry),
"transportation_segment" as layer
FROM
wherobots_open_data.overture_2024_02_15.transportation_segment t1
WHERE ST_INTERSECTS(ST_GeomFromText("{_aoi}"),geometry)
""")

Here we apply a spatial filter (WHERE ST_Intersects()) and a “clipping” function (ST_Intersection()) to reduce our data to our area of interest, New York City. We also add a layer attribute containing our layer name. You can add or bring in additional attributes if needed. This “load-intersect” statement is executed for each of the 11 Overture Maps Foundation layers in Wherobots, and the resulting DataFrames are added to a list for the next step in data prep.
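Here is a sketch of how that list might be assembled (the layer and table names below are illustrative; the Wherobots Spatial Catalog holds all 11 Overture layer tables):

tables_4_tiles = []
for table_name, layer in [
    ("wherobots_open_data.overture_2024_02_15.admins_administrativeBoundary", "admins_administrativeBoundary"),
    ("wherobots_open_data.overture_2024_02_15.places_place", "places_place"),
    # ... the remaining Overture layer tables ...
]:
    tables_4_tiles.append(sedona.sql(f"""
        SELECT
            "{layer}" AS layer,
            ST_Intersection(ST_GeomFromWKT("{_aoi}"), geometry) AS geometry
        FROM {table_name}
        WHERE ST_Intersects(ST_GeomFromWKT("{_aoi}"), geometry)
    """))

df_admins_administrativeBoundary = tables_4_tiles[0]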

We want to cut a single PMTiles file for all the layers in one go, and to accomplish this all the layers need to be unioned together. Here we iterate through the list of layers and perform the union with the DataFrame API.

from pyspark.sql.functions import lit
from sedona.sql.st_constructors import ST_GeomFromText
from sedona.sql.st_functions import ST_Intersection
from sedona.sql.st_predicates import ST_Intersects

# Start from the first layer, then union in the remaining layers
nyc_unioned_data = df_admins_administrativeBoundary

for t in tables_4_tiles[1:]:
    # unionByName matches columns by name rather than by position
    nyc_unioned_data = nyc_unioned_data.unionByName(
        t.where(ST_Intersects(t.geometry, ST_GeomFromText(lit(_aoi))))
        .select(
            "layer",
            ST_Intersection(ST_GeomFromText(lit(_aoi)), t.geometry).alias("geometry"),
        )
    )

# Drop any features that were clipped away entirely
nyc_unioned_data = nyc_unioned_data.where("ST_IsEmpty(geometry) = False")

Step 2: Generate the Tiles

With our data prepared, we are ready to kick off the tile generation process:

import os

from wherobots import vtiles

# Generate the tiles
nyc_tiles_df = vtiles.generate(nyc_unioned_data)

# Define the storage endpoint
nyc_full_tiles_path = os.getenv("USER_S3_PATH") + "nyc_tiles.pmtiles"

# Define a vtile builder, passing the data for automatic schema discovery
builder = vtiles.PMTilesConfigBuilder().from_features_data_frame(nyc_unioned_data)

# Define the order in which to layer the tiles
nyc_ordered_layers = [buildings, places, ..., base]  # full layer list elided
builder.layers = nyc_ordered_layers

# Build the tile configuration
ordered_config = builder.build()

# Write the tiles to storage
vtiles.write_pmtiles(nyc_tiles_df, nyc_full_tiles_path, ordered_config)

This process took 79 seconds and generated a 74.7 MiB file. You can visualize the results in Wherobots by calling vtiles.show_pmtiles(nyc_full_tiles_path).

Wherobots VTiles breaks the dataset down into manageable chunks, processing each layer efficiently. The result is a single PMTiles archive that can be easily visualized, styled, and served directly from S3.

Step 3: Expanding to NY State Transportation

Next, we scale the process up to cover all transportation segments in New York State. This involves a larger dataset, but WherobotsDB handles it seamlessly. The process is identical to the one above, but this time we utilize zoom-based feature filtering (do we really need to see driveways when viewing at the state scale?) in the vtiles.GenerationConfig().

from pyspark.sql.functions import col, when

gen_config = vtiles.GenerationConfig(
    # Minimum zoom level for generation
    min_zoom=4,
    # Maximum zoom level for generation
    max_zoom=16,
    feature_filter=(
        when(
            # Only add motorway and trunk features to tiles at zoom level 8 and above
            (col("class").isin(["motorway", "trunk"])) & (col("tile.z") < 8), False)
        .when(…)  # additional class/zoom rules elided
        .otherwise(True)  # Default to rendering a feature
    )
)

We pass our configuration into vtiles.generate() and reuse the same vtiles.write_pmtiles() flow as above, as sketched below. PMTiles generation for NY State transportation segments completed in ~2.5 minutes.
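Putting the two calls together (the output path and PMTiles config below are hypothetical, following the same builder pattern as the NYC example):

df_transportation_segment_tiles = vtiles.generate(df_transportation_segment, gen_config)

ny_state_tiles_path = os.getenv("USER_S3_PATH") + "ny_state_transportation.pmtiles"  # hypothetical path
state_config = vtiles.PMTilesConfigBuilder().from_features_data_frame(df_transportation_segment).build()
vtiles.write_pmtiles(df_transportation_segment_tiles, ny_state_tiles_path, state_config)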

Wrapping Up

Previously, generating vector tiles from large datasets, like those from the Overture Maps Foundation, was challenging. You now have a solution to these challenges. Wherobots makes tile generation at any scale easy, reliable, and cost effective. We simplify the process by unlocking distributed tile generation, and we make it possible to visualize and share large-scale geospatial data efficiently. Whether you’re focusing on a single city like New York City, an entire state, or the entire planet, Wherobots provides tile generation capability that’s purpose-built for any scale.

Ready to dive into vector tile generation with Wherobots? Start by signing up for a free Wherobots account, walk through this tutorial, connect Wherobots to your data in S3, and enjoy the benefits of efficient and flexible map visualization. Check out the WherobotsDB VTiles tutorial and reference documentation for more information.

Want More Tiles News at Wherobots?

  1. The tiles we generate in this blog are available for free via AWS S3
    1. New York City is here
    2. New York State is here
  2. We are working on making a variety of global Overture PMTiles available for free via Amazon S3. We will document the process for and performance of preparing these tiles in a future publication.
  3. We are working on launching a tile viewer to help folks visualize and inspect the contents of PMTiles files.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter.