Wherobots and PostgreSQL + PostGIS: A Synergy in Spatial Analysis

Introduction

Spatial data has never been more abundant or more valued in decision making (everything happens somewhere!). The tools we use to process location-based data can significantly impact the outcomes of our projects. In the geospatial ecosystem, Apache Sedona-powered Wherobots and PostgreSQL with the PostGIS extension each offer robust capabilities. They share some functionality, but they are more powerful when used together rather than in isolation. This post explores how combining Wherobots and PostgreSQL + PostGIS can enhance spatial data processing, offering a synergy that leverages the unique strengths of each tool to improve data analysis and decision making.

Exploring PostgreSQL + PostGIS

PostgreSQL is a powerful, open-source object-relational database system known for its robustness and reliability. When combined with the PostGIS extension, PostgreSQL transforms into a spatial database capable of executing location-based queries and spatial functions. This combination is ideally suited for projects where data integrity and detailed transaction management (ACID transactions) are crucial.

Key Features:

  • Data Integrity: Ensures high levels of data accuracy and consistency with ACID transaction compliance.
  • Complex Queries: Supports a wide array of SQL and spatial queries for detailed data management.
  • Integration: Seamlessly integrates with existing IT infrastructures and data systems.

Architecture:

A traditional RDBMS, PostgreSQL uses a process-based database server architecture. It is designed for detailed, transactional data management, where data integrity and complex query capabilities are paramount.

Characteristics:

  • Transactional Integrity: PostgreSQL excels in maintaining data consistency and integrity through ACID (Atomicity, Consistency, Isolation, Durability) compliance, making it ideal for applications requiring reliable data storage.
  • Structured Data Management: Data must be written into PostgreSQL tables, allowing for sophisticated querying and indexing capabilities provided by PostGIS. This ensures precise spatial data management.
  • Integration with Legacy Systems: PostgreSQL + PostGIS can easily integrate with existing IT infrastructure, providing a stable environment for long-term data storage and complex spatial queries.

Understanding Wherobots & Apache Sedona

Wherobots, founded by the original creators of Apache Sedona, represents a significant advancement in modern cloud native spatial data processing. Apache Sedona is an open-source cluster computing system designed for the processing of large-scale spatial data, and Wherobots (continuously optimizing Sedona for performance) employs this technology with serverless deployment principles to analyze data distributed across clusters. This approach allows Wherobots to perform complex spatial queries and analytics efficiently, handling massive datasets that are beyond the scope of traditional geospatial data processing tools.

Key Features:

  • Scalability: Easily scales horizontally to handle petabytes of data across a cluster computing architecture.
  • Speed: Utilizes memory-centric architecture to speed up data processing tasks.
  • Complex Algorithms: Supports numerous spatial operations and algorithms essential for advanced geographic data analysis.
  • Spatial SQL API: Over 300 (and growing) spatial SQL functions covering both raster and vector analysis.
  • Pythonic Dataframe API: Extends the Spark Dataframe API for spatial data processing.
  • Python DB API: Python module conforming to the Python DB API 2.0 specification with familiar access patterns.

Architecture:

The backbone of Wherobots, Apache Sedona, is designed with a modern cloud-native and distributed computing framework in mind. Wherobots orchestrates and manages the cluster’s distributed compute nodes to enable large-scale spatial data processing by bringing compute power directly to the data. This architecture is purpose-built for handling big data analytics and planetary-scale spatial data processing.

Characteristics:

  • Distributed Processing: Wherobots and Apache Sedona utilize a cluster of nodes, enabling parallel processing of massive datasets. This makes it ideal for tasks requiring high scalability and speed.
  • In-Memory Computation: By leveraging in-memory processing, Sedona significantly reduces the time needed for complex spatial operations, allowing for rapid data exploration and transformation.
  • Cloud Integration: Wherobots architecture is naturally aligned with cloud-based solutions, enabling seamless integration with cloud services like Amazon Web Services (AWS). This facilitates scalable and flexible spatial data analytics.

Complementarity in Action

Sedona in Wherobots and PostgreSQL + PostGIS can both perform spatial queries, but their real power lies in how they complement each other. For example:

  • Data Handling and Performance: Apache Sedona in Wherobots Cloud is second to none at handling and analyzing large volumes of data, making it ideal for initial data processing, exploration, transformation, and analysis. In contrast, PostgreSQL + PostGIS excels at managing more detailed transactions and operations requiring high precision and integrity.
    • Use Wherobots for: Scalable data operations. Wherobots efficiently scales processing across multiple nodes, ideal for rapid analysis of spatial data at local to global levels.
    • Use PostgreSQL & PostGIS for: Persistence. PostgreSQL + PostGIS provides the tools necessary for precise data management, crucial for applications like urban planning and asset management.
  • Use Case Flexibility: Wherobots can quickly process vast quantities of spatial data, making it suitable for environments and applications that require immediate insights on large volumes of data. PostgreSQL + PostGIS, being more transaction oriented, is perfect for applications where long-term data storage and management are needed.
    • Use Wherobots for: Processing speed, when your scenario requires an immediate response.
    • Use PostgreSQL & PostGIS for: Data management, in scenarios requiring meticulous data integrity, such as environmental monitoring over years or detailed urban planning projects.
  • Analytical Depth: When initial processing (extractions, transformations, enrichment, etc.) of large-scale data is done with Wherobots, ETL run times can be greatly reduced. Once processed, the data can be permanently stored and hosted in PostgreSQL + PostGIS. This allows users to perform deep, granular analyses at scale and then persist and serve those insights.
    • Use Wherobots for: Transforming and enriching data. Extract, transform, enrich, refine, and analyze data in Wherobots Cloud at scale and with unparalleled speed.
    • Use PostgreSQL & PostGIS for: A robust data ecosystem. Combining these tools ensures a robust data analysis and management ecosystem capable of supporting complex, multi-dimensional spatial queries.

Integration Illustration

Let’s illustrate how these two systems can be integrated in a complementary manner.

We’ll assume the persona of a fictitious company named “Synergistic GeoFusion Technologies (SGT) holdings”.

SGT holdings handles a vast array of spatial data from diverse sources such as sensors, satellites, and user inputs. The volume and velocity of the data collection require a sophisticated approach to maximize efficiency. Wherobots steps in as the initial processing powerhouse, applying its Apache Sedona-based cluster computing to perform heavy-duty ETL (Extract, Transform, Load) tasks. This process involves cleansing, integrating, and transforming the raw data into a more structured format. Wherobots’ capability to handle massive datasets efficiently complements PostGIS’s robust data storage and querying capabilities by preparing the data for detailed spatial analysis and storage.

Once Wherobots processes the data, it can be seamlessly transferred to PostGIS, which can serve as the system of record. This PostgreSQL extension is well-suited for ongoing data management, thanks to its sophisticated spatial functions and ability to handle complex queries. PostGIS’s strong querying capabilities complement Wherobots’ data processing strength, providing a stable platform for further data manipulation and refinement.
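
As a rough sketch of that hand-off: the DataFrame processed_df, its geometry column, and all connection details below are hypothetical, and the PostgreSQL JDBC driver is assumed to be available on the cluster. Processed results could be staged into PostGIS over JDBC like this:

import pyspark.sql.functions as f

# Serialize the geometry to WKT so the generic JDBC writer can move it as text
jdbc_df = processed_df.withColumn("geom_wkt", f.expr("ST_AsText(geometry)")).drop("geometry")

(jdbc_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/spatial")
    .option("dbtable", "staging.processed_features")
    .option("user", "analyst")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())

# On the PostGIS side, the staged WKT can then be converted to a native geometry column:
# ALTER TABLE staging.processed_features ADD COLUMN geom geometry(Geometry, 4326);
# UPDATE staging.processed_features SET geom = ST_GeomFromText(geom_wkt, 4326);

Staging the geometry as WKT (or WKB) keeps the transfer portable; once in PostGIS, the column can be indexed with GiST for the transactional queries described below.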

PostGIS is not only a storage layer but also a platform for ongoing data edits and updates, ensuring data integrity and relevance. When complex, resource-intensive spatial analyses are required, Wherobots is re-engaged. This dual engagement allows SGT holdings to handle routine data management in PostGIS while delegating computationally demanding tasks back to Wherobots, thus utilizing each system’s strengths to full effect.

For visualization, SGT holdings again leverages Wherobots to generate vector tiles from the analyzed data. These tiles are crucial for dynamic, scalable visual representations in the company’s internal dashboards and tools. Wherobots’ ability to generate these tiles efficiently complements PostGIS’s role by providing a means to visualize the data stored and managed within PostGIS. This not only enhances user interactions with the data but also provides a seamless bridge from data analysis to actionable insights through visual exploration.

Using Wherobots and PostGIS in this complementary manner, SGT has established a highly efficient workflow that leverages the distinct capabilities of each system. They now have the capability to ingest all their data, manage it effectively, and run their post hoc analysis tasks to serve internal and external clients efficiently and in a cost-effective manner.

Performance Benchmark: Wherobots vs. PostGIS

When comparing WherobotsDB to PostgreSQL with PostGIS for spatial queries, WherobotsDB’s performance benefits become evident as data size increases. Initially, PostGIS’ precomputed GIST indexes give it an edge on smaller datasets due to faster query execution times. However, as datasets grow larger, the dynamic, on-the-fly indexing of WherobotsDB surpasses PostGIS. The overhead associated with WherobotsDB’s distributed system is outweighed by its ability to efficiently handle complex, large-scale queries, ultimately making it significantly faster and more scalable in high-demand scenarios.

In essence, WherobotsDB may start off slower with smaller datasets, but its performance dramatically improves with larger datasets, far exceeding PostGIS’ capabilities. This makes WherobotsDB the preferred choice when working with extensive spatial data that demands efficient, scalable processing.

See the tables below for a detailed comparison of performance metrics.

The datasets used in the benchmark:

| Dataset | Size | Number of Rows | Types of Geometries |
| --- | --- | --- | --- |
| OSM Nodes | 256 GB | 7,456,990,919 | Point |
| Overture Buildings | 103 GB | 706,967,095 | Polygon/Multipolygon |
| OSM Postal Codes | 0.86 GB | 154,452 | Polygon |

The benchmark query counts the building/node pairs that intersect:

WITH t AS 
( 
    SELECT 
        * 
    FROM 
        overture_buildings_test AS buildings 
    JOIN 
        osm_nodes_test AS nodes 
    ON ST_Intersects(
        nodes.geometry, 
        buildings.geometry) 
) 
SELECT 
    COUNT(*) 
FROM 
    t;

| Dataset sizes | PostGIS | WherobotsDB (Cairo) | Relative Speed |
| --- | --- | --- | --- |
| 1M Buildings x 1M Nodes | 0.70s | 2.24s | 0.31x |
| 1M Buildings x 10M Nodes | 32.30s | 33.44s | 0.97x |
| 10M Buildings x 1M Nodes | 28.30s | 21.72s | 1.30x |
| 10M Buildings x 10M Nodes | 49.30s | 3m 36.29s | 0.23x |
| All Buildings x 1M Nodes | 8m 44s | 36.59s | 14.32x |
| All Buildings x 10M Nodes | 38m 30s | 49.77s | 46.42x |
| 10K Buildings x All Nodes | 2h 29m 19s | 46.42s | 193.00x |
| 100K Buildings x All Nodes | 4h 22m 36s | 51.75s | 304.57x |
| 1M Buildings x All Nodes | 3h 29m 56s | 1m 19.63s | 158.17x |
| 10M Buildings x All Nodes | 8h 48m 13s | 2m 13.65s | 237.17x |
| All Buildings x All Nodes | +24h (Aborted) | 4m 31.84s | +317.83x |

Conclusion: Better Together for Spatial Brilliance

Individually, Wherobots and PostgreSQL + PostGIS are powerful tools. But when combined, they unlock new possibilities for spatial analysis, offering a balanced approach to handling both large-scale data processing and detailed, precise database management. By understanding and implementing their complementary capabilities, organizations can achieve more refined insights and greater operational efficiency in their spatial data projects.

By utilizing both tools strategically, companies can not only enhance their data analysis capabilities but also ensure that they are prepared for a variety of spatial data challenges, now and in the future.

To learn more about Wherobots, reach out or give it a try with a trial account.

Generating global PMTiles from Overture Maps in 26 minutes with WherobotsDB VTiles

We previously described the performance and scalability challenges of generating tiles and how they can be overcome with WherobotsDB VTiles. Today we will demonstrate how you can use VTiles to generate vector tiles for three planetary scale Overture Maps Foundation datasets in PMTiles format: places, buildings, and division areas.

Quick recap: What are Vector Tiles and why should you use PMTiles?

Vector tiles are small chunks of map data that allow for efficient and customizable map rendering at varying zoom levels. Vector tiles contain geometric and attribute data, for example roads and their names, that facilitate dynamic styling of map features on the fly, offering more flexibility and interactivity.

PMTiles is a cloud-native file format that is designed for holding an entire collection of tiles, in this case vector tiles. The PMTiles format enables individual tiles to be queried directly from cloud object storage like Amazon S3. By querying directly from cloud storage, you no longer need to set up and manage dedicated infrastructure, reducing your costs, effort, and time-to-tile-generation.

Tile Viewer

If you’re sharing, inspecting, or debugging tiles you’ll need to visualize them. To make these processes easier, Wherobots created a tile viewer site, available at tile-viewer.wherobots.com. This tool comes from the PMTiles GitHub repository, and it offers the following features:

  • Viewing tiles with a dark themed basemap
  • Inspecting individual tiles, selected from a list of all the tiles in the set
  • Inspecting the metadata and header information of the PMTiles file

This viewer takes a URL for a tileset. If your tiles are stored in a private S3 bucket, you will need to generate a signed URL. Wherobots Cloud has a function for converting your S3 URI to a signed URL:

from wherobots.tools.utility.s3_utils import get_signed_url

get_signed_url(my_s3_path, expiration_in_seconds)

my_s3_path is an S3 URI, like s3://myBucket/my/prefix/to/tiles.pmtiles, and expiration_in_seconds is an int representing the number of seconds the signed URL will remain valid.
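
For example, using the hypothetical path above and a one-hour expiration:

from wherobots.tools.utility.s3_utils import get_signed_url

# 3600 seconds = 1 hour of validity for the generated link
tiles_url = get_signed_url("s3://myBucket/my/prefix/to/tiles.pmtiles", 3600)
print(tiles_url)  # paste this URL into tile-viewer.wherobots.com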

The tile viewer will be used to explore the tiles we generate in our examples.

Examples

The following examples show tile generation using VTiles for three Overture layers at a planetary scale. Because we are working with planetary scale datasets and want quick results, we will use the large runtimes available in the professional tier of Wherobots Cloud.

Tile generation time is provided in each example, and includes time to load the input data, transform it, generate tiles, and save the PMTiles file in an S3 bucket. It does not include the time to start the cluster.

To run the examples below, just make sure your Sedona session is started:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Places

We start by creating PMTiles for the places dataset. With VTiles, this is a straightforward case for several reasons:

  1. The dataset contains only points. A point feature rarely spans multiple tiles since it has no dimensions. Tile generation time is strongly influenced by the total number of (feature, tile) pairs, i.e. the sum over all features of the number of tiles each feature intersects.
  2. At 50 million records, this is a relatively small dataset compared to the buildings dataset at 2.3 billion features.
  3. We will do minimal customization. VTiles’ feature filters allow us to control which features go into which tiles based on the tile id (x, y, z) and the feature itself (area, length, and user-provided columns). We will go more in depth on feature filters in the division areas example.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.places_place").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("categories.main").alias("category"),
        f.lit('places').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/places.pmtiles",
    GenerationConfig(6, 15)
)

This example generates a PMTiles file for zooms 6 through 15. Since the places dataset contains features that are not relevant at a global level, we selected a minimum zoom of 6, about the size of a large European country. The max zoom of 15 is selected because the precision provided should be sufficient and overzooming means that our places will still render at higher zooms. The OpenStreetMap wiki has a helpful page about how large a tile is at each zoom level. The name and category of each place will be included in the tiles.

Tile generation time was 2 minutes and 23 seconds on a Tokyo runtime. The resulting PMTiles archive is 28.1 GB.

Buildings

This example generates tiles for all buildings in the Overture building dataset. This is about 2.3 billion features. The roughly uniform size of the features, and the small size of buildings relative to the higher-zoom tiles, mean that the number of (feature, tile) combinations is similar to |features| * |zooms|. Because of this homogeneity, we can expect a quick execution without the use of a feature filter. This example represents a typical use case where there is a very large number of features and where the extent of a tile at maximum zoom is larger than the size of a feature.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.buildings_building").select(
        "geometry",
        f.lit('buildings').alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/buildings.pmtiles",
    GenerationConfig(10, 15)
)

This example generates a PMTiles file for zooms 10 through 15. The minimum zoom of 10 was selected because buildings aren’t useful at lower zooms for most use cases. The max zoom of 15 was selected because the precision provided should be sufficient and overzooming means that our buildings will still render at higher zooms. The properties of a very large percentage of the Overture buildings are null so we haven’t included any here.

Tile generation time was 26 minutes on a Tokyo runtime. The resulting PMTiles archive is 438.4 GB.

Division Areas

The third example creates tiles for all polygons and multipolygons in the Overture division areas dataset. This dataset is just under one million records. Despite its small size, this dataset can be challenging to process. It contains polygons and multipolygons representing areas, from countries which are large and highly detailed, to small neighborhoods with minimal detail. The appropriate min/max zoom for countries and neighborhoods is very different.

Recall from the places example that the amount of work the system must do is strongly related to the number of (feature, tile) pairs. A country outline like Canada might cover an entire tile at zoom 5. It will be in roughly 2 * 4^(max_zoom - 5) tiles across all zooms; if max_zoom is 15, that’s over 2 million tiles. You can quickly wind up with an unexpectedly large execution time and tiles archive if you do not take this into account. Most use cases will benefit from setting different min and max zooms for different features, which you can do in VTiles via a feature filter.
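
Plugging numbers into that rough estimate shows why it matters:

# rough (feature, tile) count for one country outline that fills a whole zoom-5 tile
max_zoom = 15
approx_tiles = 2 * 4 ** (max_zoom - 5)
print(f"{approx_tiles:,}")  # 2,097,152, i.e. over 2 million tiles for a single feature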

Let’s first profile the base case with no feature filter.

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area.pmtiles",
    GenerationConfig(3, 15)
)

This run took a bit over 3 hours on a Tokyo runtime. The resulting PMTiles archive is 158.0 GB. This small dataset takes more time than the buildings dataset that is more than 2300 times larger!

Feature Filters

We can significantly accelerate the execution time of this example using the VTiles feature filters. These feature filters are most commonly used to determine what features should be in a tile on the basis of a category and the zoom level. In this case we will only show countries at lower zooms and neighborhoods at the highest zoom levels. The visual impact of a feature that is much larger than the tile is minimal in typical use cases. The visual impact of a neighborhood is null when it’s smaller than the tile can resolve; it is literally invisible, or perhaps a single pixel. By excluding these features that add no visual information, we save processing time and storage costs, as well as increase the performance of serving the now-smaller tiles.

Here is an example of using feature filters to improve performance of this division area generation task:

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area_filtered.pmtiles",
    GenerationConfig(
        min_zoom=2, 
        max_zoom=15,
        feature_filter = (
            ((f.col("subType") == f.lit("country")) & (f.col("tile.z") < f.lit(7))) |
            ((f.col("subType") == f.lit("region")) & (f.lit(3) < f.col("tile.z")) & (f.col("tile.z") < f.lit(10))) |
            ((f.col("subType") == f.lit("county")) & (f.lit(9) < f.col("tile.z")) & (f.col("tile.z")  < f.lit(12))) |
            ((f.col("subType") == f.lit("locality")) & (f.lit(10) < f.col("tile.z")) & (f.col("tile.z")  < f.lit(14))) |
            ((f.col("subType") == f.lit("localadmin")) & (f.lit(13) < f.col("tile.z"))) |
            ((f.col("subType") == f.lit("neighborhood")) & (f.lit(13) < f.col("tile.z")))
        )
    )
)

This run took less than 10 minutes on a Tokyo runtime. The resulting PMTiles archive is 8.9 GB.

Feature filters reduced tile generation time by more than 90%, reduced the dataset size, and lowered the cost compared to the original example. Tiles will also appear less cluttered to the user without having to get one’s hands dirty playing with style sheets.

A Note on Working without Feature Filters

We know that there are use cases with large geometries where it might be difficult to write an effective feature filter, or where filtering is undesirable. For those use cases we have launched a feature in Wherobots 1.3.1 to improve tile generation performance: an option on the GenerationConfig called repartition_frequency. When features are repeatedly split as the algorithm zooms in, those child features wind up in the same partition. This can cause well-partitioned input datasets to become skewed by even a single large record. Setting a repartition frequency of 2 or 4 can help keep utilization of the cluster high by keeping partitions roughly uniform in size.
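
As a sketch of how that option could be applied to the division areas job above (the output path is arbitrary, and the exact semantics and defaults of repartition_frequency should be confirmed in the Wherobots 1.3.1 documentation):

import pyspark.sql.functions as f
import os

from wherobots.vtiles import GenerationConfig, generate_pmtiles

generate_pmtiles(
    sedona.table("wherobots_open_data.overture_2024_05_16.divisions_division_area").select(
        "geometry",
        f.col("names.primary").alias("name"),
        f.col("subtype").alias('layer'),
    ),
    os.getenv("USER_S3_PATH") + "tile_blog/division_area_repartitioned.pmtiles",
    GenerationConfig(
        min_zoom=3,
        max_zoom=15,
        # redistribute split child features every 2 zoom levels to keep
        # partition sizes roughly uniform across the cluster
        repartition_frequency=2,
    )
)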

Conclusion

The VTiles tile generation functionality is a fast and cost effective way to generate tiles for global data. The Apache Spark-based runtime powered by Apache Sedona and Wherobots Cloud makes loading and transforming data for input into the system straightforward and performant even on large datasets. You can leverage feature filters to curate the contents of your tiles to your use cases and performance goals. We encourage you to try out VTiles with your own data on Wherobots Cloud.

Havasu: A Table Format for Spatial Attributes in a Data Lake Architecture

Introduction

In the past decade, many organizations have adopted BLOB storage (e.g., AWS S3) as a primary storage platform. These organizations collect large volumes of data and ingest it as files into S3 for its scalability, cost efficiency, and reliability. However, there has since been a need to interact with such data using SQL, which is familiar to developers, much as one queries a relational database. That need led to the invention of open table formats such as Apache Iceberg, which enable users to perform wide-ranging database operations, including concurrent data updates, table metadata handling, time travel, and versioning, on files stored in object stores such as S3 without loading those files into a relational database.

However, when dealing with spatial data, existing open table formats such as Apache Iceberg, Hudi, and Delta Lake do not provide native support for spatial attributes. This lack of native spatial support forces users to handle the spatial aspect of data at the application layer. This approach often cannot handle the intricacies and scale of spatial analytics and falls short in meeting the demands of analytical workloads leveraging spatial data. The significance of spatial analytics underscores the pressing need for its efficient management within the enterprise data stack, which includes managing spatial attributes in open table formats.

To remedy this, Havasu was created: an open table format that extends Apache Iceberg to support spatial data. Havasu introduces a range of pivotal features, including native support for manipulating and storing geometry and raster objects directly in data lake tables, and it enables seamless querying and processing of spatial tables using Spatial SQL.

Key features of Havasu

ACID on Spatial Data

Havasu stands out for ensuring the ACID (Atomicity, Consistency, Isolation, and Durability) properties in spatial data transactions, a critical feature for reliable data management. This guarantees that each transaction is processed completely or not at all, maintaining data integrity even in complex environments. Furthermore, Havasu supports schema evolution, allowing for adaptable data structures without compromising existing data. This flexibility is key for efficiently managing and evolving spatial data over time, catering to the dynamic needs of spatial databases.

Native Spatial Support

Havasu supports geometry/geography and raster as primary data types, and seamlessly integrates with the compute layer. Users can easily read and write spatial data using Havasu, and process spatial data using any computation engine (e.g., WherobotsDB) as long as that computation engine implements a reader/writer for Havasu.

Efficiency

Computation engines (like WherobotsDB), when incorporating Havasu, can benefit from Havasu’s spatial filter pushdown support, which significantly accelerates spatial range queries. Havasu allows storing spatial data in the object storage of the customer’s choice, and it decouples storage from computation, which makes storing vast amounts of spatial data very cost-effective. Havasu also comes equipped with an efficient storage mode for raster data in Parquet files, namely out-db storage, which enables high-throughput reads of large numbers of rasters.

Openness

Havasu is a table format with an open specification. It is based on Apache Iceberg, which has an open specification and is widely adopted by the big data ecosystem. The extension to the Apache Iceberg specification is also clearly specified, so any implementation adopting this extended specification is able to read and write Havasu tables. Customers can store Havasu tables in the storage of their choice, without being tightly coupled to one specific vendor or implementation.

Havasu in a Nutshell: Key Technical Insights

The open-source Havasu specification can be found in the Wherobots Havasu documentation. The Havasu table format extends the Iceberg table spec to support managing large spatial datasets as tables, in the following ways:

  • Primitive spatial data types and storage: the Havasu specification extends the Iceberg specification to support spatial data types.
  • Spatial statistics: extending the Iceberg manifest files to support spatial statistics.
  • Spatial filter push down and indexing: extending the Iceberg specification to support spatial filter pushdown and spatial indexing, which greatly reduces data retrieval overhead.

All other aspects of the Iceberg spec are unchanged. For example, the Havasu specification does not change the fundamental organization of table files.

Primitive spatial data types and storage

Geometry

Geometry values consist of an optional spatial reference ID (abbreviated as SRID) and a geometry shape. The SRID is a 32-bit integer that identifies the coordinate system that the geometry shape is using. The interpretation of SRID is implementation dependent. For example, the SRID could be an EPSG code, or a code defined by a proprietary coordinate system. The geometry shape is one of the types defined by OGC Simple Features for SQL specification. The geometry shape can be stored in one of the following formats in underlying parquet files.

| Encoding | Parquet physical type | Logical type | Description |
| --- | --- | --- | --- |
| ewkb | BINARY | | Extended Well-known binary (EWKB) |
| wkb | BINARY | | Well-known binary (WKB) |
| wkt | BINARY | UTF8 | Well-known text (WKT) |
| geojson | BINARY | UTF8 | GeoJSON (https://datatracker.ietf.org/doc/html/rfc7946) |

When the geometry column is at the root of the schema, and the geometry encoding is one of wkb and ewkb, the application can optionally write the GeoParquet metadata to the Parquet files. The GeoParquet metadata is defined by the GeoParquet specification.

Raster

A raster is one or more grids of cells. All the grids should have width rows and height columns. The grid cells are represented by the band field. The grids are geo-referenced using an affine transformation that maps the grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the crs field. The CRS will be serialized as a WKT string when stored in data files.

Havasu supports persisting raster band values in two different ways:

  • in-db: The band values are stored in the same data file as the geo-referencing information. The band values are stored in the bands field of the raster value.
  • out-db: The band values are stored in files external to Havasu tables. The raster value stored in Havasu data file contains the geo-referencing information and URI of external raster files. The URI of external raster files are stored in the bands field of the raster value.

Spatial statistics

Havasu collects and records the spatial statistics of data files when writing data to the table. The spatial statistics include the minimum bounding rectangle (MBR) of the geometries in the data file.

Geometry bounds

The bounds of geometry values should be derived using their minimum bounding rectangles (MBRs). The MBR of a geometry value is defined as the smallest rectangle that contains the geometry value. The SRIDs of geometry values are ignored when computing the MBRs. The MBRs of all geometry values in a data file should be unioned together into a single MBR, which is the MBR of the data file.

Raster bounds

Raster bounds are MBRs of rasters in WGS84 coordinate system. They are computed by transforming the envelope of the raster in its native coordinate system to WGS84. Raster bounds have a special rule for handling MBRs crossing the anti-meridian. Implementations of the Havasu specification should be able to handle MBRs crossing the anti-meridian correctly, otherwise spatial query optimizations will derive incomplete query results.

Spatial filter push down and indexing

Database engines can take advantage of the spatial statistics of data files to optimize the query execution plan. For example, if the query predicate is a spatial range query, the engine can use the spatial statistics to prune the data files that do not contain any data satisfying the predicate. This process is called spatial filter pushdown. How spatial query optimization is implemented in scan planning is implementation dependent. For example, in WherobotsDB, for a spatial range query ST_Within(geom, Q), where geom is the geometry field in a Havasu table and Q is a constant geometry used as the query window, WherobotsDB converts the spatial query predicate to an inclusive projection ST_Intersects(MBR[geom], Q), where MBR[geom] is the minimum bounding box of all values of geom in a data file. It then evaluates the projection using the field statistics maintained in manifest files to skip data files that cannot satisfy the query.
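
To make the idea concrete, here is a small, purely conceptual sketch using shapely, with hypothetical file names and MBRs; it illustrates the pruning principle rather than Havasu’s actual scan-planning code:

from shapely import wkt
from shapely.geometry import box

# hypothetical per-file MBRs (minx, miny, maxx, maxy) as recorded in manifest statistics
file_mbrs = {
    "data-00001.parquet": (-74.05, 40.60, -73.85, 40.90),    # New York area
    "data-00002.parquet": (-118.60, 33.70, -117.90, 34.40),  # Los Angeles area
}

# constant query window Q from a predicate like ST_Within(geom, Q)
Q = wkt.loads("POLYGON ((-74.02 40.70, -73.93 40.70, -73.93 40.80, -74.02 40.80, -74.02 40.70))")

# a file can be skipped when its MBR does not intersect Q
files_to_scan = [name for name, mbr in file_mbrs.items() if box(*mbr).intersects(Q)]
print(files_to_scan)  # only data-00001.parquet needs to be read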

Spatial filter pushdown works best when spatial data that are near each other are stored in the same file. Havasu provides a CREATE SPATIAL INDEX syntax for rewriting the table to sort the records by a geometry column.

CREATE SPATIAL INDEX FOR <table_name> USING hilbert(<geometry_column>, <precision>) [ WHERE <condition> ] [ OPTIONS <options> ];

This statement will rewrite the data files of the table and cluster the data by the geometry column. This feature is very useful when the table contains a large amount of data and the spatial filter is very selective. For example, if the table contains 1TB of data and the spatial filter will only select 1% of the data, ideally Havasu will only read ~ 10GB of data to answer the query.

Navigating Spatial Data with Havasu-Powered WherobotsDB Tables

WherobotsDB implements a reader/writer for the Havasu spatial table format. Users can perform many interesting spatial database operations on Havasu tables using WherobotsDB in Wherobots Cloud. Here we explore some common operations using WherobotsDB. For details, please read Wherobots documentation. To follow along create a free Wherobots Cloud account.

Create a new table in Wherobots table catalog

First, let’s create a Havasu table in the wherobots table catalog. This catalog by default is configured to use your Wherobots Cloud S3 storage, but another storage location can also be specified. We’ll use a dataset of Taxi rides as our example.

CREATE TABLE wherobots.test_db.taxi (
  pickup GEOMETRY,
  Trip_Pickup_DateTime STRING,
  Payment_Type STRING,
  Fare_Amt DECIMAL(10,0))
USING havasu.iceberg
-- By default this table will be stored in your Wherobots Cloud S3 account
-- Optionally specify other location
-- LOCATION 's3://path/to/warehouse/test_db/taxi'

List tables in Wherobots table catalog

We can view all tables within the wherobots table catalog using SHOW TABLES:

SHOW TABLES IN wherobots.test_db

We can see the wherobots.test_db.taxi table that we just created:

+---------+-----------+-----------+
|namespace|  tableName|isTemporary|
+---------+-----------+-----------+
|  test_db|       taxi|      false|
+---------+-----------+-----------+

Describe a table in Wherobots table catalog

To view the columns and datatypes of each column we can describe the table:

DESCRIBE TABLE wherobots.test_db.taxi

Note here that our pickup column is of type geometry:

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|              pickup| geometry|   null|
|Trip_Pickup_DateTime|   string|   null|
|        Payment_Type|   string|   null|
|            Fare_Amt|   string|   null|
+--------------------+---------+-------+

Insert geometry data

We can insert data into our table using SQL. Here we specify the geometry value using the ST_GeomFromText function which takes a WKT string, in this case to describe the point geometry that represents the pickup location.

sedona.sql("""
INSERT INTO wherobots.test_db.taxi
VALUES(ST_GeomFromText('POINT (-73.96969 40.749244)'), '10/16/09 22:35', 'Credit', 42)
""")

We can also write spatial DataFrames to Havasu tables. Here we load a NYC taxi dataset into a Sedona DataFrame, then append the data to our wherobots.test_db.taxi Havasu table:

from pyspark.sql.functions import col

taxidf = sedona.read.format('csv').option("header","true").option("delimiter", ",").load("s3a://wherobots-examples-prod/data/nyc-taxi-data.csv")
taxidf = taxidf.selectExpr('ST_Point(CAST(Start_Lon AS Decimal(24,20)), CAST(Start_Lat AS Decimal(24,20))) AS pickup', 'Trip_Pickup_DateTime', 'Payment_Type', 'CAST(Fare_Amt AS DECIMAL)')
taxidf = taxidf.filter(col("pickup").isNotNull())

taxidf.writeTo("wherobots.test_db.taxi").append()

Create spatial index

Creating a spatial index will rewrite the table to sort records by the geometry column. Havasu supports the hilbert index strategy which will sort the data based on the Hilbert space filling curve which is very efficient at sorting geospatial data based on proximity. We can configure the precision by specifying a value for the precision parameter, which is the number of bits used to represent the Hilbert index.

sedona.sql("CREATE SPATIAL INDEX FOR wherobots.test_db.taxi USING hilbert(pickup, 10)")

Read data from Havasu table

We can query our Havasu tables using familiar SQL, however when using WherobotsDB we have the advantage of spatial queries using Spatial SQL functions. Here we search for all taxi pickups that occurred within a certain area around a given point:

sedona.sql("""
SELECT * FROM wherobots.test_db.taxi 
WHERE ST_Intersects(pickup, ST_Buffer(ST_GeomFromText('POINT (-73.96969 40.749244)'), 0.001))
""").show(truncate=False)
+----------------------------+--------------------+------------+--------+
|pickup                      |Trip_Pickup_DateTime|Payment_Type|Fare_Amt|
+----------------------------+--------------------+------------+--------+
|POINT (-73.96969 40.749244) |1/5/09 16:29        |Credit      |9       |
|POINT (-73.969387 40.749159)|1/20/09 14:38       |CASH        |7       |
|POINT (-73.969308 40.75001) |1/8/09 17:48        |CASH        |11      |
|POINT (-73.969355 40.749315)|1/7/09 16:52        |CASH        |10      |
|POINT (-73.970238 40.749497)|1/19/09 2:42        |Credit      |45      |
|POINT (-73.969492 40.749103)|1/21/09 19:34       |Credit      |15      |
|POINT (-73.970158 40.749055)|1/15/09 14:34       |CASH        |8       |
|POINT (-73.969638 40.748663)|1/27/09 17:46       |CASH        |9       |
|POINT (-73.970167 40.749033)|1/2/09 18:49        |CASH        |8       |
|POINT (-73.97059 40.749077) |1/18/09 20:39       |Credit      |10      |
|POINT (-73.970105 40.748985)|1/12/09 11:39       |CASH        |9       |
|POINT (-73.970228 40.749027)|1/8/09 16:07        |CASH        |5       |
|POINT (-73.9697 40.748737)  |1/5/09 18:04        |Credit      |6       |
|POINT (-73.970628 40.749132)|1/27/09 18:11       |CASH        |10      |
|POINT (-73.969573 40.748677)|1/29/09 19:35       |CASH        |5       |
|POINT (-73.969783 40.749163)|1/6/09 19:48        |Cash        |8       |
|POINT (-73.969522 40.748948)|1/4/09 16:24        |CASH        |5       |
|POINT (-73.969529 40.749625)|1/7/09 23:38        |CASH        |7       |
|POINT (-73.969473 40.749072)|1/29/09 18:04       |CASH        |16      |
|POINT (-73.970575 40.749395)|1/7/09 19:36        |CASH        |8       |
+----------------------------+--------------------+------------+--------+
only showing top 20 rows

Insert out-db raster data

We can also work with raster data in Havasu tables. Here we insert raster data into a Havasu table using the out-db option. You can read more about working with raster data in Havasu tables in the documentation.

sedona.sql("SELECT RS_FromPath('s3a://XXX.tif') as rast"). \\
    writeTo("wherobots.test_db.test_table").append()

Havasu-Powered Wherobots open data catalog

Wherobots collects open datasets from various data sources, then cleans and transforms them to the Havasu format to enable linking enterprise data to the real physical world. All datasets are provided for free (excluding the AWS data transfer fee). Certain datasets are only accessible to our Pro Edition users. To learn more, please read Wherobots Open Data.

| Dataset name | Availability in Wherobots | Type | Count | Description |
| --- | --- | --- | --- | --- |
| Overture Maps buildings/building | Community edition | Polygon | 785 million | Any human-made structures with roofs or interior spaces |
| Overture Maps places/place | Community edition | Point | 59 million | Any business or point of interest within the world |
| Overture Maps admins/administrativeBoundary | Community edition | LineString | 96 thousand | Any officially defined border between two Administrative Localities |
| Overture Maps admins/locality | Community edition | Point | 2948 | Countries and hierarchical subdivisions of countries |
| Overture Maps transportation/connector | Community edition | Point | 330 million | Points of physical connection between two or more segments |
| Overture Maps transportation/segment | Community edition | LineString | 294 million | Center-line of a path which may be traveled |
| Google & Microsoft open buildings | Professional edition | Polygon | 2.5 billion | Google & Microsoft Open Buildings, combined by VIDA |
| LandSAT surface temperature | Professional edition | Raster (GeoTiff) | 166K images, 10 TB | The temperature of the Earth’s surface in Kelvin, from Aug 2023 to Oct 2023 |
| US Census ZCTA codes | Professional edition | Polygon | 33144 | ZIP Code Tabulation Areas defined in 2018 |
| NYC TLC taxi trip records | Professional edition | Point | 200 million | NYC TLC taxi trip pickup and dropoff records per trip |
| Open Street Maps all nodes | Professional edition | Point | 8 billion | All the nodes of the OpenStreetMap Planet dataset |
| Open Street Maps postal codes | Professional edition | Polygon | 154 thousand | Boundaries of postal code areas as defined in OpenStreetMap |
| Weather events | Professional edition | Point | 8.6 million | Events such as rain, snow, storm, from 2016 – 2022 |
| Wild fires | Professional edition | Point | 1.8 million | Wildfires that occurred in the United States from 1992 to 2015 |

The Wherobots open data catalog can be extremely useful when tables are combined, typically using spatial joins, to address real world business use cases like risk analysis, site selection, fleet vehicle optimization and answering other business intelligence questions.

Spatial join query to find zones prone to wildfires

Let’s see how we can make use of the Wherobots open data catalog and Havasu tables to perform a spatial join that finds the US zipcode regions experiencing the most wildfires. To do this we will use the wherobots_open_data.us_census.zipcode Havasu table, which contains the polygon geometries of US zipcodes, and wherobots_open_data.weather.wild_fires, which contains point geometries of wildfire events.

We perform a spatial join operation using the ST_Intersects spatial SQL function to define a predicate that will join fires that occur within their respective zipcodes.

fire_zone = sedona.sql(
    """
    SELECT
        z.geometry as zipgeom,
        z.ZCTA5CE10 as zipcode,
        f.FIRE_NAME
    FROM
        wherobots_open_data.us_census.zipcode z,
        wherobots_open_data.weather.wild_fires f
    WHERE
        ST_Intersects(z.geometry, f.geometry)
    """
)

We can then group this data by zipcode to find the count of fires that occur in each zipcode and visualize the results. This type of analysis can be useful for risk assessment and insurance premium pricing.
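
For example, a per-zipcode fire count can be computed directly from the joined DataFrame (the visualization step is left out here):

fires_per_zipcode = (
    fire_zone
    .groupBy("zipcode")
    .count()
    .orderBy("count", ascending=False)
)
fires_per_zipcode.show(10)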

wildfire risk by county

Resources

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter.


Wherobots Cloud Overview – Getting Started With Wherobots Cloud Part 1

This is the first post in a series that will introduce Wherobots Cloud and WherobotsDB, covering how to get started with cloud-native geospatial analytics at scale.

What Is Wherobots Cloud & WherobotsDB?

Wherobots Cloud is a fully managed cloud platform that enables developers and data scientists to efficiently manage their spatial analytics and AI pipelines in the cloud. The core of Wherobots Cloud is powered by Apache Sedona and WherobotsDB, a scalable geospatial analytics database engine. WherobotsDB is built on a distributed compute architecture which enables scalable computation over massive datasets without sacrificing speed. With an architecture that separates the compute layer from the storage layer, WherobotsDB is truly cloud-native.

WherobotsDB is built upon the open-source Apache Sedona project, which provides the foundation for scalable geospatial analytics. By leveraging WherobotsDB in Wherobots Cloud, developers and data scientists can take advantage of optimized query processing and a data lakehouse architecture built on Havasu – an Apache Iceberg-compatible spatial table format – within a self-service, fully managed cloud environment.

Creating A Free Wherobots Cloud Account

You can get started with Wherobots Cloud for free – no credit card required – with a generous free tier. Sign up at cloud.wherobots.com.

Wherobots Cloud Interface

Once you’ve signed in to your Wherobots Cloud account you’ll be able to view your account dashboard, notebooks, scheduled jobs, and files. Creating a notebook environment is the first step to getting started with WherobotsDB. With the Wherobots free tier you’ll have access to a default resource configuration which can be expanded by signing up for the Wherobots Professional Tier.

The Wherobots Notebook Environment

Once you’ve started the notebook environment click "Open" to launch JupyterLab. The Jupyter notebook environment will be your main interface for working with WherobotsDB via Python or Scala. You can create your own Jupyter notebook to begin your geospatial analysis workflow or explore the many sample notebooks via the file explorer in the left tab. Example notebooks are available to show how to work with both vector and raster data via Spatial SQL, how to access the Wherobots open data catalog, how to leverage the Havasu spatial table format, and much more.

You can also upload notebooks shared by others via platforms like GitHub, for example this repository includes a few notebooks from some of my recent projects.

Working With Files

In addition to the Wherobots open data catalog, WherobotsDB is able to import data from virtually any spatial data format, including flat files, Shapefiles, GeoJSON, GeoParquet, and PostGIS databases. Included with your Wherobots Cloud account is free private file hosting via AWS S3. You can upload and manage files via the "Files" tab in Wherobots Cloud. Once uploaded, you’ll be able to access files via a private S3 URL, making the files available only to accounts in your Wherobots Cloud organization.
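
As a minimal sketch (assuming the GeoParquet and GeoJSON data sources available in recent Sedona/WherobotsDB releases; the S3 paths below are placeholders for the private URLs shown under the "Files" tab):

# Hypothetical paths; replace with the private S3 URLs from your Files tab
buildings_df = sedona.read.format("geoparquet").load("s3://<your-bucket>/data/buildings.geoparquet")
points_df = sedona.read.format("geojson").load("s3://<your-bucket>/data/points.geojson")

buildings_df.printSchema()
points_df.show(5)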

Wherobots Online Community

The Wherobots Online Community is the place to ask questions, learn what others in the community are building with Wherobots and Apache Sedona, and share your spatial expertise with the community. Please join the community and connect with others interested in the exciting world of spatial data analysis.

Documentation And Resources

You can find documentation for all things WherobotsDB and Wherobots Cloud at docs.wherobots.com. The documentation includes information and examples covering how to further configure and manage your Wherobots Cloud account, as well as how to leverage WherobotsDB’s rich Spatial SQL and Python APIs for working with geospatial data in the cloud.

Another helpful resource is the Wherobots YouTube channel for technical videos and hands-on livestreams showing what’s possible with Apache Sedona & Wherobots Cloud.

If you haven’t made the most of your free Wherobots Cloud account yet, you can sign in and get started at cloud.wherobots.com.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter.