Building a Spatial Data Lakehouse

Introduction

In today’s data-centric world, geospatial data is collected at an unprecedented scale. Large organizations in sectors like logistics, telecommunications, urban planning, and environmental monitoring gather vast amounts of data daily—from GPS locations to satellite imagery and IoT sensors. This data is incredibly rich, holding insights into everything from human movement patterns to natural resource management. However, it is also extremely complex and challenging to store, process, and analyze effectively.

These challenges stem from the sheer scale, complexity, and unique processing needs of geospatial data. Traditional data lakes and warehouses were designed for tabular data and typically struggle to handle massive raster imagery, multi-layered vector datasets, and the spatial queries essential for geospatial analysis. This gap has left many organizations unable to fully leverage their geospatial data, which remains scattered, under-utilized, and costly to manage.

Click here to launch this interactive notebook

Enter Havasu—Wherobots’ spatial data lakehouse solution that bridges this gap. Built on Apache Iceberg, Havasu brings the scalability of a data lake and the structured efficiency of a data warehouse to geospatial data, offering a comprehensive solution designed to meet the specific demands of spatial data management. You can learn more about the detailed specifications of Havasu here.

With Havasu, organizations can:

  • Achieve Performance at Scale: Havasu optimizes for high-speed processing of geospatial data, whether you’re performing complex spatial joins, analyzing real-time data streams, or querying historical records.
  • Store Massive Volumes Efficiently: Havasu’s flexible in-DB and out-DB raster storage options allow organizations to manage enormous datasets, keeping costs low while enabling easy access to high-resolution imagery, vector layers, and more.
  • Streamline Data Management: Built on the Iceberg table format, Havasu supports schema evolution, time travel, and versioning, empowering organizations to manage changing geospatial data with ease.

This blog will walk you through the core principles of building a spatial data lakehouse with Havasu, showing how you can create and manage spatial datasets at scale, integrate external data sources, and optimize for performance and cost-efficiency. Whether your focus is on city planning, environmental monitoring, or large-scale logistics, this guide will equip you to unlock the full potential of your geospatial data.

Creating a Table

Creating tables in Wherobots is a breeze, and the process is highly adaptable. Whether working with vector data (points, lines, polygons) or raster images, you can set up tables in just a few steps. For instance, creating a table for storing vector geometries might look like this:

sedona.sql("""
CREATE TABLE wherobots.test_db.cities (id bigint, name string, geom geometry)
""")

This command creates a simple table where the geom column is used to store spatial data in a geometry format, which can later be used in spatial queries. Similarly, you can create raster tables to store geospatial images like satellite data.

Creating Tables & CRUD Operations

In addition to creating tables, Wherobots fully supports CRUD (Create, Read, Update, Delete) operations. Updating spatial data or removing records is as straightforward as running standard SQL commands. For example, you can perform spatially-aware deletes:

sedona.sql("""
DELETE FROM wherobots.test_db.cities 
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

This flexibility allows you to manage your spatial data with the same ease as any other data in the database, while still taking advantage of the powerful geospatial capabilities of Wherobots.
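
Updates are just as spatially aware. Here’s a minimal sketch (the replacement value and filter polygon are illustrative):

sedona.sql("""
UPDATE wherobots.test_db.cities
SET name = 'Renamed City'
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")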

Connecting Storage

A key feature of Havasu is its ability to integrate with external storage, allowing users to bring their own S3 buckets for managing spatial data. This enables users to scale their storage independently, providing flexibility and control over where and how their spatial data is stored.

You can configure a catalog in Wherobots to connect directly to your own S3 bucket. This allows Havasu tables to be stored in your specified S3 location, ensuring that your spatial data lakehouse is highly scalable and aligned with your organization’s storage needs.

For detailed instructions on setting up cross-account access and configuring your storage, refer to the documentation on S3 storage integrations.

Sample Queries

Once your table is created, running spatial queries is as easy as using standard SQL commands, enhanced by Wherobots’ powerful spatial functions. For example, finding all locations within a specific area can be done with the ST_Intersects function:

sedona.sql("""
SELECT * FROM wherobots.test_db.cities
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

These spatial queries leverage Wherobots’ built-in geospatial functions, making it easy to work with complex spatial relationships directly in your database.
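
Proximity questions follow the same pattern. As a sketch (the reference point is illustrative, and distances are in the geometry’s coordinate units), you could rank cities by distance from a location:

sedona.sql("""
SELECT name, ST_Distance(geom, ST_GeomFromText('POINT (1 1)')) AS dist
FROM wherobots.test_db.cities
ORDER BY dist
LIMIT 10
""")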

Vector Data In Havasu

Vector tables in Havasu are crucial for managing and querying vector data, which includes points, lines, and polygons. These tables allow you to store geospatial features and leverage Havasu’s advanced spatial indexing and optimizations, such as spatial filter pushdown, to ensure fast data retrieval and processing. Let’s break down the process of working with vector tables in Wherobots.

Creating Tables

Creating a vector table in Wherobots is a straightforward process. When starting your spatial data lakehouse, the first step is to define how you will store vector data.

Each feature (or record) in these tables stores its spatial data in a geometry column, whether it’s a point representing a city or a polygon outlining a forest.

For instance, to create a table for storing buildings in a city, you would use the following SQL command:

sedona.sql("""
CREATE TABLE wherobots.test_db.buildings (id bigint, name string, geom geometry)
""")

This creates a table where the geom column stores vector data such as points, lines, and polygons. Once created, you can use Wherobots’ spatial SQL functions to run geospatial queries, such as finding intersections, distances, and proximity relationships.
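
For example, here is a sketch of a proximity filter (the point and search radius are illustrative) that finds all buildings within a given distance of a location:

sedona.sql("""
SELECT id, name
FROM wherobots.test_db.buildings
WHERE ST_DWithin(geom, ST_GeomFromText('POINT (1 1)'), 0.5)
""")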

Importing Common Formats

WherobotsDB provides extensive support for importing spatial data in various commonly used formats, making it easy to integrate your existing geospatial datasets into Havasu. Whether you’re working with GeoJSON, Shapefiles, GeoParquet, WKT files, or even geospatial databases like PostGIS, Wherobots ensures seamless compatibility.
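
As one illustration (the bucket path and table name are hypothetical), a GeoParquet file can be read into a DataFrame and persisted as a Havasu table:

# Read a GeoParquet file and persist it as a Havasu table
df = sedona.read.format("geoparquet").load("s3://your-bucket/path/to/data.parquet")
df.writeTo("wherobots.test_db.imported_geoparquet").create()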

Havasu stores spatial data in Apache Parquet format on cloud storage, a design choice that ensures cost-effective storage and a decoupled architecture for computation. This approach allows organizations to scale both storage and compute independently, optimizing for performance and budget.

To learn more about importing specific formats and integrating your data with Havasu, check out the Wherobots Documentation on Importing Data Formats.

Migrating from Parquet

Many organizations already store their geospatial data in Parquet format due to its efficient columnar storage. However, Parquet lacks optimized support for spatial queries. Migrating your existing Parquet datasets to Havasu lets you take full advantage of native spatial data types for geometry and raster, spatial indexing, and optimized spatial querying.

  1. Create an external table from your Parquet file:


    sedona.sql("""
    CREATE TABLE spark_catalog.default.parquet_table (
    id BIGINT,
    name STRING,
    geometry BINARY
    ) USING parquet LOCATION 's3://path/to/parquet/files'
    """)

  2. Convert binary columns to geometry:
    Once the Parquet table has been snapshotted into a Havasu table (see Converting Tables to Havasu below), you can declare how the binary column is encoded (EWKB or another supported encoding) so that Havasu interprets it as a native geometry type. This is done with an ALTER TABLE command:


    sedona.sql("""
    ALTER TABLE wherobots.test_db.parquet_table
    SET GEOMETRY FIELDS geometry AS 'ewkb'
    """)

This process ensures that you can utilize your existing Parquet datasets in Wherobots without reprocessing all the data.

Click here to learn more about migrating Parquet files to Havasu.

Converting Tables to Havasu

Havasu extends the Iceberg format with spatial capabilities, and converting existing tables into Havasu tables brings a lot of advantages like schema evolution, time travel, and partitioning that optimizes spatial queries.

To convert an existing table to a Havasu table, you simply need to perform a snapshot of the table. For example:

sedona.sql("""
CALL wherobots.system.snapshot('spark_catalog.default.parquet_table', 'wherobots.test_db.new_havasu_table')
""")

This converts the table into a Havasu table format that benefits from all of Iceberg’s features.

Clustering Geometries

Clustering geometries is an optimization technique used to spatially organize data, which improves query performance, particularly for large datasets. By clustering geometries, spatial queries such as range queries or nearest-neighbor searches can be significantly faster.

In Wherobots, you can cluster geometries during table creation or after the table has been populated. Here’s how you would cluster geometries in an existing table:

sedona.sql("""
ALTER TABLE wherobots.test_db.buildings
CLUSTER BY geom
""")

This ensures that the geometries are spatially co-located, which minimizes the number of partitions scanned during spatial queries.

To learn more about Spatial Filter Push-down and indexing strategies, click here.

Raster Data In Havasu

Havasu introduces raster data as a primitive type, meaning you can define tables with raster columns to store and manipulate spatial imagery, such as satellite imagery and elevation models.

Users can create tables to store either in-DB rasters (image data stored directly in the database) or out-DB rasters (only metadata and file paths stored in the database, while the actual raster files remain in external storage like AWS S3). This architecture makes it feasible to manage and query vast amounts of high-resolution imagery, supporting everything from city-scale maps to continent-wide satellite data without straining database resources.

Creating and Loading Raster Data

Creating a Basic Raster Table

Here’s how you would create a raster table in Havasu:

sedona.sql("""
CREATE TABLE wherobots.test_db.imagery (
id bigint,
description string,
rast raster)
""")

With this setup, the rast column will hold your raster data, allowing you to apply Havasu’s RS_ functions for raster processing and spatial queries.

Loading Raster Data

Havasu supports writing raster data using DataFrames, especially useful when you have image files (like GeoTIFFs) stored in S3 or local storage. This method allows you to read binary files and convert them into raster-compatible formats using the RS_FromGeoTiff function.

from pyspark.sql.functions import expr

# Create a DataFrame containing raster data
df_binary = sedona.read.format("binaryFile")\
    .option("pathGlobFilter", "*.tif")\
    .option("recursiveFileLookup", "true")\
    .load('s3://wherobots-examples/data/eurosat_small')
df_geotiff = df_binary.withColumn("rast", expr("RS_FromGeoTiff(content)"))\
    .selectExpr("rast", "path AS data")

# Create a Havasu table using the DataFrame
df_geotiff.writeTo("wherobots.test_db.test_table").create()

This approach provides the option to write data as in-DB rasters or to reference them externally as out-DB rasters, depending on your storage and access requirements.
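
For the out-DB variant, a minimal sketch (reusing df_binary from above; the table name is hypothetical) references the files by path with RS_FromPath instead of materializing their contents:

# Store only file references: build out-DB rasters from the S3 paths
df_outdb = df_binary.selectExpr("RS_FromPath(path) AS rast", "path AS data")
df_outdb.writeTo("wherobots.test_db.test_table_outdb").create()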

Choosing Between In-DB and Out-DB Rasters

Choosing between in-DB and out-DB storage is key for optimal performance. Here’s a breakdown:

  • In-DB Rasters: Suitable for small to medium-sized rasters. These are fully managed within the database and provide faster access since all data is loaded directly from the database tables.
  • Out-DB Rasters: Ideal for large, high-resolution images. With this setup, Havasu stores only metadata and file paths, while the raster images reside in external storage like S3, reducing database storage costs and allowing for lazy loading when data is accessed.

Additional optimizations, such as tiling and caching frequently accessed tiles, allow you to efficiently handle extensive data operations without unnecessary overhead, making Havasu a cost-effective choice for large-scale raster data management.

Best Practices for Out-DB Rasters

  • Caching: Enable caching of frequently accessed tiles to avoid repeated reads from remote storage, improving overall performance.
  • Read-Ahead Configuration: Set spark.wherobots.raster.outdb.readahead to an appropriate value (e.g., 64k or higher) to balance read performance against network overhead (see the snippet after this list).
  • Using Cloud-Optimized GeoTIFFs (COG): These files are optimized for remote storage, reducing latency and bandwidth usage by organizing data in tile-friendly formats.
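
As a sketch of the read-ahead setting mentioned above (the value is illustrative), you can apply it at the session level before querying out-DB rasters:

# Illustrative session-level tuning for out-DB raster reads
sedona.conf.set("spark.wherobots.raster.outdb.readahead", "64k")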

Click here for more details about the Out-DB raster type and a guide on improving its performance.

Optimizing Performance with Spatial Filter Pushdown

One of the strengths of Havasu for a raster data lakehouse is spatial filter pushdown, which applies spatial filters directly to reduce the amount of data read from storage. When querying for specific geographic regions, spatial pushdown ensures that only relevant raster tiles are accessed, significantly improving performance.

For example:

sedona.sql("""
SELECT * FROM wherobots.test_db.imagery
WHERE RS_Intersects(rast, ST_GeomFromText('POLYGON ((...))'))
""")

This method is particularly effective when the raster tables are partitioned by location or other relevant attributes, making spatial filter pushdown even more efficient.

For large-scale data lakehouse solutions, partitioning and indexing are essential to enhance query performance. Partitioning raster tables by spatial identifiers, such as SRID or city names, groups similar data together, further improving the efficacy of spatial filtering. For example, partitioning by SRID can be beneficial when dealing with geographically diverse datasets in different UTM zones.

  • Partitioning by Region: Use partitionBy during table creation to reduce scanning times, especially for spatially filtered queries.

# Write the data into a table partitioned by SRID/UTM zone
df_geotiff.withColumn("srid", expr("RS_SRID(rast)"))\
    .sort('srid')\
    .write.format("havasu.iceberg").partitionBy("srid")\
    .saveAsTable("wherobots.test_db.eurosat_ms_srid")

UTM zones across the United States


  • Indexing: For tables where partitions aren’t as useful, consider indexing raster data by geographic location using Havasu’s spatial indexing capabilities (e.g., hilbert indexing).


    sedona.sql("""
    CREATE SPATIAL INDEX FOR wherobots.test_db.test_table
    USING hilbert(rast, 10)
    """)

Hilbert curve with different number of iterations

Advanced Raster Operations: Map Algebra and Pixel-Level Operations

Havasu supports a suite of raster RS_ functions that enable complex analyses, including map algebra and zonal statistics. Let’s dive into when each method is appropriate and how it affects performance.

Map Algebra and Pixel-Level Operations

This functionality is crucial for applications like remote sensing, where you might need to compute remote sensing indices on large geographical regions across multiple raster bands.

For instance, you can use RS_MapAlgebra to calculate the Normalized Difference Vegetation Index (NDVI) on a multi-band raster:

sedona.sql("""
SELECT RS_MapAlgebra(rast, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') as ndvi 
FROM raster_table
""")

This flexibility allows analysts to perform custom calculations directly within Havasu, simplifying workflows for raster analysis.

RS_ZonalStats for Efficient Calculations within Areas of Interest

For use cases where only specific areas within a raster need processing, RS_ZonalStats offers an efficient alternative. This function calculates statistics (e.g., mean, sum, etc.) for pixels within a designated geometry (e.g., a polygonal area of interest), loading only the required subset of pixels into memory.

Unlike pixel-level operations that materialize the entire raster, RS_ZonalStats selectively materializes pixels only within the specified geometry. This efficiency makes it ideal for spatial analyses that focus on specific regions without the overhead of processing the entire raster:

sedona.sql("""
SELECT RS_ZonalStats(outdb_raster, geometry, 'mean')
FROM raster_table
""")

Because RS_ZonalStats is targeted to the geometry, tiling may not be necessary for these operations. This distinction allows you to avoid tiling when performing zonal statistics, keeping the process streamlined and reducing unnecessary data handling steps.

By choosing the appropriate approach based on your operation type, you can maximize efficiency when working with large out-DB raster datasets in Havasu, ensuring both performance and resource management.

Handling Pixel-Level Operations on Out-DB Rasters in Havasu

When performing pixel-level operations on raster data in Havasu, it’s essential to understand the implications of using out-DB rasters. Any operation that accesses individual pixels directly, such as map algebra calculations, will materialize these pixels in memory, which can be resource-intensive if working on a large out-DB raster dataset. This section explores best practices for managing these operations efficiently.

Preprocessing with RS_TileExplode for Pixel-Level Operations

For out-DB rasters, any pixel-level transformation will load pixel data from remote storage into memory, creating a significant memory load. To make these operations more manageable:

  • Tile the Raster Data: Using RS_TileExplode splits the raster into smaller, more manageable tiles, allowing you to work on localized sections rather than loading the entire raster at once.


df_tiled = sedona.sql("""
SELECT RS_TileExplode(raster_data, 1024, 1024, true) AS (x, y, tile)
FROM outdb_rasters
""")

Before and after RS_TileExplode(). Image taken from nv5GeoSpatialSoftware

With the raster data pre-tiled, pixel-level functions like RS_MapAlgebra can be applied to individual tiles, preventing full materialization of large rasters in memory and optimizing processing time and memory usage.

This approach is particularly beneficial for complex calculations that require pixel-by-pixel transformations, as tiling minimizes memory consumption and enables parallelized processing of smaller raster sections.
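
Putting the pieces together, here is a hedged sketch (reusing df_tiled and the NDVI expression from earlier; band indexes depend on your imagery) that applies map algebra tile by tile:

from pyspark.sql.functions import expr

# Compute NDVI per tile instead of per full raster
df_ndvi = df_tiled.withColumn(
    "ndvi",
    expr("RS_MapAlgebra(tile, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);')"))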

STAC Rasters in Havasu

To incorporate STAC (SpatioTemporal Asset Catalog) datasets into Havasu, the process involves accessing the STAC dataset as a Sedona DataFrame, surfacing STAC assets (raster bands) as out-DB or in-DB rasters, and finally saving the result in a Havasu table. This guide provides the essentials to help you set up STAC-based raster data in Havasu, with specific attention to handling the dataset efficiently in Wherobots.

Accessing STAC Data and Creating a DataFrame

STAC datasets like Sentinel-2 imagery are typically available in JSON format, which includes metadata about each raster, such as acquisition time, spatial extent, and the URLs of imagery files. The first step is to read these JSON files into a DataFrame:

  • Fetch and Store JSON Metadata: STAC items, usually hosted on platforms like Earth-search, can be stored as JSON files in an S3 bucket or any accessible location.
  • Load into Sedona DataFrame: Use Sedona’s GeoJSON reader to load the STAC metadata into a DataFrame. This DataFrame will contain essential attributes like URLs to individual raster bands (e.g., blue, red, NIR).

df_stac = sedona.read.format("geojson").load("s3://your-bucket/path-to-stac-metadata")

This DataFrame now acts as a container for your STAC items, enabling you to manipulate, transform, and spatially join the metadata before loading the actual raster data.

Saving STAC Rasters as Out-DB Columns

Havasu’s out-DB raster capability allows you to reference large raster files externally, ideal for high-resolution datasets like Sentinel-2 imagery. To set this up:

  • Transform URLs: Ensure each STAC asset URL points to a valid path accessible by Wherobots (e.g., converting https URLs to s3:// format if stored on S3; see the sketch at the end of this section).
  • Create Out-DB Raster Columns: Use RS_FromPath to reference each asset as an out-DB raster. Here’s an example with a blue band raster:

df_rasters = df_stac.withColumn("raster_blue", expr("RS_FromPath(assets.blue.href)"))

Repeat this process for each required band (e.g., red, NIR) to build a DataFrame that references all the necessary raster paths. This setup maintains efficient storage by referencing images externally, reducing data load times and storage requirements in Havasu.
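
If your STAC metadata exposes HTTPS asset URLs, a hypothetical rewrite (the host and bucket names are illustrative) converts them to s3:// paths before calling RS_FromPath:

from pyspark.sql.functions import col, expr, regexp_replace

# Rewrite an HTTPS asset URL to an s3:// path (host and bucket are illustrative)
df_stac = df_stac.withColumn(
    "blue_path",
    regexp_replace(col("assets.blue.href"),
                   "^https://sentinel-cogs\\.s3\\.us-west-2\\.amazonaws\\.com/",
                   "s3://sentinel-cogs/"))
df_rasters = df_stac.withColumn("raster_blue", expr("RS_FromPath(blue_path)"))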

Saving the STAC Raster Data to Havasu

Once you’ve set up the DataFrame with out-DB raster columns, the next step is to save it as a Havasu table:

  • Write to Havasu: Use writeTo() to save the DataFrame as a new Havasu table, enabling further spatial queries and transformations within Wherobots.

df_rasters.writeTo("wherobots.test_db.stac_raster_table").create()

With this approach, your STAC imagery is now part of a Havasu table, ready for efficient querying and spatial operations. WherobotsDB enables advanced spatial functions like RS_Intersects and RS_MapAlgebra, making it straightforward to analyze the imagery stored as out-DB rasters.

Considerations

  • URL Compatibility: Ensure paths are accessible by Wherobots (e.g., transforming https to s3:// if needed) to avoid data access errors.
  • Tile Size and Storage Optimization: For performance, adjust tile sizes and consider filtering edge-case tiles to optimize spatial queries.
  • Tuning for Performance: For extensive STAC datasets, consider caching and configuring Spark to handle out-DB rasters efficiently by setting appropriate cache sizes and connection thresholds.

This streamlined approach provides an effective way to incorporate STAC datasets in Havasu, enabling data engineers to leverage high-resolution satellite imagery in their spatial data lakehouse while optimizing for performance and storage.

Monitoring and Maintenance

Monitor your data lakehouse’s performance using WherobotsDB’s configuration options, and consider periodically converting standard GeoTIFF files to Cloud-Optimized GeoTIFF (COG) to reduce read latency from remote storage. Use gdalinfo to inspect file structures, tile sizes, and compression formats, ensuring that your files are optimized for remote access.
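
As a sketch of the COG conversion step (paths are illustrative; requires GDAL 3.1+ with the COG driver), a standard GeoTIFF can be rewritten with gdal_translate:

import subprocess

# Convert a standard GeoTIFF to a Cloud-Optimized GeoTIFF via GDAL's CLI
subprocess.run(
    ["gdal_translate", "-of", "COG", "input.tif", "output_cog.tif"],
    check=True)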

Conclusion

Geospatial data presents a wealth of opportunities for organizations across industries, yet its inherent complexity often makes it difficult to unlock its full potential. With Havasu, Wherobots has developed a solution tailored specifically to the demands of geospatial data management, combining the scalability of a data lake with the structured efficiency of a data warehouse.

From its seamless integration with cloud storage and support for common geospatial formats to advanced capabilities like spatial indexing, filter pushdown, and raster processing, Havasu empowers organizations to efficiently store, query, and analyze spatial data at scale. By leveraging its open architecture and powerful features, teams can transform vast geospatial datasets into actionable insights, driving innovation in logistics, urban planning, environmental monitoring, and beyond.

When paired with WherobotsDB, Havasu transforms geospatial analytics by providing high-speed querying for both raster and vector data, advanced spatial indexing, and seamless compatibility with cloud-native storage solutions like AWS S3. This differentiates Havasu from legacy solutions that struggle with scalability or require complex application-layer workarounds for spatial data.

As geospatial data continues to grow in both scale and importance, adopting a spatial data lakehouse architecture with Havasu ensures that your organization remains at the forefront of data-driven decision-making. Whether you’re just beginning to integrate spatial data into your workflows or optimizing an existing system, Havasu equips you with the tools to manage, process, and analyze geospatial data with unmatched efficiency and flexibility.

Learn more about how Havasu can help you redefine your geospatial data strategy in the official documentation. Unlock the future of geospatial intelligence today.
