Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data
In today’s data-centric world, geospatial data is collected at an unprecedented scale. Large organizations in sectors like logistics, telecommunications, urban planning, and environmental monitoring gather vast amounts of data daily—from GPS locations to satellite imagery and IoT sensors. This data is incredibly rich, holding insights into everything from human movement patterns to natural resource management. However, it is also extremely complex and challenging to store, process, and analyze effectively.
These challenges stem from the sheer scale, complexity, and unique processing needs of geospatial data. Traditional data lakes and warehouses were designed for tabular data and typically struggle to handle massive raster imagery, multi-layered vector datasets, and the spatial queries essential for geospatial analysis. This gap has left many organizations unable to fully leverage their geospatial data, which remains scattered, under-utilized, and costly to manage.
Enter Havasu—Wherobots’ spatial data lakehouse solution that bridges this gap. Built on Apache Iceberg, Havasu brings the scalability of a data lake and the structured efficiency of a data warehouse to geospatial data, offering a comprehensive solution designed to meet the specific demands of spatial data management. You can learn more about the detailed specifications of Havasu here.
With Havasu, organizations can manage vector and raster data side by side, run spatial queries at scale, and keep storage and compute costs under control.
This blog will walk you through the core principles of building a spatial data lakehouse with Havasu, showing how you can create and manage spatial datasets at scale, integrate external data sources, and optimize for performance and cost-efficiency. Whether your focus is on city planning, environmental monitoring, or large-scale logistics, this guide will equip you to unlock the full potential of your geospatial data.
Creating tables in Wherobots is a breeze, and the process is highly adaptable. Whether working with vector data (points, lines, polygons) or raster images, you can set up tables in just a few steps. For instance, creating a table for storing vector geometries might look like this:
sedona.sql(""" CREATE TABLE wherobots.test_db.cities (id bigint, name string, geom geometry) """)
This command creates a simple table where the geom column is used to store spatial data in a geometry format, which can later be used in spatial queries. Similarly, you can create raster tables to store geospatial images like satellite data.
In addition to creating tables, Wherobots fully supports CRUD (Create, Read, Update, Delete) operations. For example, if you need to update spatial data or remove records, it’s just as straightforward as using SQL commands. You can perform spatially-aware deletes, as shown below:
sedona.sql(""" DELETE FROM wherobots.test_db.cities WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))')) """)
This flexibility allows you to manage your spatial data with the same ease as any other data in the database, while still taking advantage of the powerful geospatial capabilities of Wherobots.
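Inserting new records works the same way. Here is a minimal sketch against the cities table created above; the id, name, and point coordinates are purely illustrative:

sedona.sql("""
    INSERT INTO wherobots.test_db.cities
    VALUES (1, 'Sample City', ST_GeomFromText('POINT (1 1)'))
""")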
A key feature of Havasu is its ability to integrate with external storage, allowing users to bring their own S3 buckets for managing spatial data. This enables users to scale their storage independently, providing flexibility and control over where and how their spatial data is stored.
You can configure a catalog in Wherobots to connect directly to your own S3 bucket. This allows Havasu tables to be stored in your specified S3 location, ensuring that your spatial data lakehouse is highly scalable and aligned with your organization’s storage needs.
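For a rough sense of what this looks like, the sketch below shows a generic Iceberg-style catalog pointed at your own bucket. The catalog name, implementation class, and warehouse path are placeholders rather than the exact Wherobots settings, which are covered in the storage integration documentation referenced next:

# Illustrative only: a generic Iceberg-style catalog definition pointing at your own bucket.
# The actual Wherobots catalog class and property names may differ; see the S3 storage docs.
sedona.conf.set("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
sedona.conf.set("spark.sql.catalog.my_catalog.warehouse", "s3://my-company-bucket/havasu-warehouse/")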
For detailed instructions on setting up cross-account access and configuring your storage, please refer to the documentation on S3 storage integrations.
Once your table is created, running spatial queries is as easy as using standard SQL commands. However, Wherobots enhances this experience with powerful spatial functions. For example, finding all locations within a specific area could be done using the ST_Intersects function:
sedona.sql(""" SELECT * FROM wherobots.test_db.cities WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))')) """)
These spatial queries leverage Wherobots’ built-in geospatial functions, making it easy to work with complex spatial relationships directly in your database.
Vector tables in Havasu are crucial for managing and querying vector data, which includes points, lines, and polygons. These tables allow you to store geospatial features and leverage Havasu’s advanced spatial indexing and optimizations, such as spatial filter pushdown, to ensure fast data retrieval and processing. Let’s break down the process of working with vector tables in Wherobots.
Creating a vector table in Wherobots is a straightforward process. When starting your spatial data lakehouse, the first step is to define how you will store vector data.
Each feature (or record) in these tables stores its spatial data in a geometry column, whether that is a point representing a city or a polygon outlining a forest.
For instance, to create a table for storing buildings in a city, you would use the following SQL command:
sedona.sql(""" CREATE TABLE wherobots.test_db.buildings (id bigint, name string, geom geometry) """)
This creates a table where the geom column is used for vector data like points, lines, and polygons. Once created, you can use Wherobots’ spatial SQL functions to run geospatial queries, such as finding intersections, distances, proximity relationships and more.
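For example, a simple proximity query against the buildings table might look like the following. This is a minimal sketch assuming planar coordinates; the reference point and distance threshold are illustrative:

sedona.sql("""
    SELECT id, name
    FROM wherobots.test_db.buildings
    WHERE ST_Distance(geom, ST_GeomFromText('POINT (1 1)')) < 0.5
""")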
WherobotsDB provides extensive support for importing spatial data in various commonly used formats, making it easy to integrate your existing geospatial datasets into Havasu. Whether you’re working with GeoJSON, Shapefiles, GeoParquet, WKT files, or even geospatial databases like PostGIS, Wherobots ensures seamless compatibility.
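For instance, loading an existing GeoParquet dataset and persisting it as a Havasu table could look like the sketch below; the S3 path and table name are placeholders:

# Read an existing GeoParquet dataset (path is a placeholder)
df = sedona.read.format("geoparquet").load("s3://my-bucket/my-geoparquet/")

# Persist it as a Havasu table to benefit from spatial indexing and filter pushdown
df.writeTo("wherobots.test_db.imported_geoparquet").create()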
Havasu stores spatial data in Apache Parquet format on cloud storage, a design choice that ensures cost-effective storage and a decoupled architecture for computation. This approach allows organizations to scale both storage and compute independently, optimizing for performance and budget.
To learn more about importing specific formats and integrating your data with Havasu, check out the Wherobots Documentation on Importing Data Formats.
Many organizations already store their geospatial data in Parquet format due to its efficient columnar storage. However, Parquet lacks optimized support for spatial queries. Migrating your existing Parquet datasets to Havasu enables you to take full advantage of Native Spatial Data Types for geometry and raster, spatial indexing, optimized querying, and Havasu’s native geospatial capabilities.
Create an external table from your Parquet file:
sedona.sql("""CREATE TABLE spark_catalog.default.parquet_table ( id BIGINT, name STRING, geometry BINARY) USING parquet LOCATION 's3://path/to/parquet/files'""")
Convert binary columns to geometry: Once the Parquet table is loaded, you can convert the geometry column from its binary encoding (EWKB or another supported encoding) into Wherobots’ native geometry type. This can be done using an ALTER TABLE command:
sedona.sql("""ALTER TABLE wherobots.test_db.parquet_tableSET GEOMETRY FIELDS geometry AS 'ewkb'""")
This process ensures that you can utilize your existing Parquet datasets in Wherobots without reprocessing all the data.
Click here to learn more about migrating Parquet files to Havasu.
Havasu extends the Iceberg format with spatial capabilities, and converting existing tables into Havasu tables brings a lot of advantages like schema evolution, time travel, and partitioning that optimizes spatial queries.
To convert an existing table to a Havasu table, you simply need to perform a snapshot of the table. For example:
sedona.sql(""" CALL wherobots.system.snapshot('spark_catalog.default.parquet_table', 'wherobots.test_db.new_havasu_table') """)
This converts the table into a Havasu table format that benefits from all of Iceberg’s features.
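For instance, Iceberg’s time travel lets you query an earlier state of the table. Here is a minimal sketch against the converted table, with an illustrative timestamp (requires a Spark version with SQL time travel support):

sedona.sql("""
    SELECT count(*)
    FROM wherobots.test_db.new_havasu_table
    TIMESTAMP AS OF '2024-01-01 00:00:00'
""")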
Clustering geometries is an optimization technique used to spatially organize data, which improves query performance, particularly for large datasets. By clustering geometries, spatial queries such as range queries or nearest-neighbor searches can be significantly faster.
In Wherobots, you can cluster geometries during table creation or after the table has been populated. Here’s how you would cluster geometries in an existing table:
sedona.sql(""" ALTER TABLE wherobots.test_db.buildings CLUSTER BY geom") """)
This ensures that the geometries are spatially co-located, which minimizes the number of partitions scanned during spatial queries.
To learn more about Spatial Filter Push-down and indexing strategies, click here.
Havasu introduces raster data as a primitive type, meaning you can define tables with raster columns to store and manipulate spatial imagery, such as satellite imagery and elevation models.
Users can create tables that store either in-DB rasters, where the image data lives directly in the database, or out-DB rasters, where only metadata and file paths are stored and the images themselves remain in external storage such as AWS S3. This architecture makes it feasible to manage and query vast amounts of high-resolution imagery, supporting everything from city-scale maps to continent-wide satellite data without straining database resources.
Here’s how you would create a raster table in Havasu:
sedona.sql(""" CREATE TABLE wherobots.test_db.imagery ( id bigint, description string, rast raster) """)
With this setup, the rast column will hold your raster data, allowing you to apply Havasu’s RS_ functions for raster processing and spatial queries.
Havasu supports writing raster data using DataFrames, especially useful when you have image files (like GeoTIFFs) stored in S3 or local storage. This method allows you to read binary files and convert them into raster-compatible formats using the RS_FromGeoTiff function.
# Create a DataFrame containing raster data
df_binary = sedona.read.format("binaryFile")\
    .option("pathGlobFilter", "*.tif")\
    .option("recursiveFileLookup", "true")\
    .load('s3://wherobots-examples/data/eurosat_small')

df_geotiff = df_binary.withColumn("rast", expr("RS_FromGeoTiff(content)"))\
    .selectExpr("rast", "path AS data")

# Create a Havasu table using the DataFrame
df_geotiff.writeTo("wherobots.test_db.test_table").create()
This approach provides the option to write data as in-DB rasters or to reference them externally as out-DB rasters, depending on your storage and access requirements.
Choosing between in-DB and out-DB storage is key for optimal performance. In-DB rasters keep the pixel data inside the table itself, which keeps everything self-contained and works well for smaller datasets. Out-DB rasters store only metadata and file paths in the table, leaving the imagery in external storage such as S3, which is the better fit for large, high-resolution collections that would otherwise strain database resources.
Additional optimizations, such as tiling and caching frequently accessed tiles, allow you to efficiently handle extensive data operations without unnecessary overhead, making Havasu a cost-effective choice for large-scale raster data management.
Best Practices for Out-DB Rasters
One relevant setting is the spark.wherobots.raster.outdb.readahead configuration, which controls how out-DB raster data is read ahead from remote storage.
Click here for more details about the Out-DB Raster type and a guide on improving its performance.
One of the strengths of Havasu for a raster data lakehouse is spatial filter pushdown, which applies spatial filters directly to reduce the amount of data read from storage. When querying for specific geographic regions, spatial pushdown ensures that only relevant raster tiles are accessed, significantly improving performance.
For example:
sedona.sql(""" SELECT * FROM wherobots.test_db.imagery WHERE RS_Intersects(rast, ST_GeomFromText('POLYGON ((...))')) """)
This method is particularly effective when the raster tables are partitioned by location or other relevant attributes, making spatial filter pushdown even more efficient.
For large-scale data lakehouse solutions, partitioning and indexing are essential to enhance query performance. Partitioning raster tables by spatial identifiers, such as SRID or city names, groups similar data together, further improving the efficacy of spatial filtering. For example, partitioning by SRID can be beneficial when dealing with geographically diverse datasets in different UTM zones.
Partitioning by Region: Use partitionBy during table creation to reduce scanning times, especially for spatially filtered queries.
# Write the data into a table partitioned by SRID/UTM zone
df_geotiff.withColumn("srid", expr("RS_SRID(rast)"))\
    .sort('srid')\
    .write.format("havasu.iceberg")\
    .partitionBy("srid")\
    .saveAsTable("wherobots.test_db.eurosat_ms_srid")
Indexing: For tables where partitions aren’t as useful, consider indexing raster data by geographic location using Havasu’s spatial indexing capabilities (e.g., hilbert indexing).
sedona.sql("""CREATE SPATIAL INDEX FOR wherobots.db.test_table USING hilbert(rast, 10)")""")
Hilbert curves with different numbers of iterations
Havasu supports a suite of raster RS_ functions that enable complex analyses, including map algebra and zonal statistics. Let’s dive into when each method is appropriate and their impact on performance.
This functionality is crucial for applications like remote sensing, where you might need to compute remote sensing indices on large geographical regions across multiple raster bands.
For instance, you can use RS_MapAlgebra to calculate the Normalized Difference Vegetation Index (NDVI) on a multi-band raster:
sedona.sql(""" SELECT RS_MapAlgebra(rast, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') as ndvi FROM raster_table """)
This flexibility allows analysts to perform custom calculations directly within Havasu, simplifying workflows for raster analysis.
For use cases where only specific areas within a raster need processing, RS_ZonalStats offers an efficient alternative. This function calculates statistics (e.g., mean or sum) for pixels within a designated geometry (e.g., a polygonal area of interest), loading only the required subset of pixels into memory.
Unlike pixel-level operations that materialize the entire raster, RS_ZonalStats selectively materializes pixels only within the specified geometry. This efficiency makes it ideal for spatial analyses that focus on specific regions without the overhead of processing the entire raster:
sedona.sql(""" SELECT RS_ZonalStats(outdb_raster, geometry, 'mean') FROM raster_table """)
Because RS_ZonalStats is targeted to the geometry, tiling may not be necessary for these operations. This distinction allows you to avoid tiling when performing zonal statistics, keeping the process streamlined and reducing unnecessary data handling steps.
By choosing the appropriate approach based on your operation type, you can maximize efficiency when working with large out-DB raster datasets in Havasu, ensuring both performance and resource management.
When performing pixel-level operations on raster data in Havasu, it’s essential to understand the implications of using out-DB rasters. Any operation that accesses individual pixels directly, such as map algebra calculations, will materialize these pixels in memory, which can be resource-intensive if working on a large out-DB raster dataset. This section explores best practices for managing these operations efficiently.
For out-DB rasters, any pixel-level transformation will load pixel data from remote storage into memory, creating a significant memory load. To make these operations more manageable:
Tile the Raster Data: Using RS_TileExplode splits the raster into smaller, more manageable tiles, allowing you to work on localized sections rather than loading the entire raster at once.
df_tiled = sedona.sql(""" SELECT RS_TileExplode(raster_data, 1024, 1024, true) AS tile FROM outdb_rasters """)
Before and after RS_TileExplode(). Image taken from NV5 Geospatial Software
With the raster data pre-tiled, pixel-level functions like RS_MapAlgebra can be applied to individual tiles, preventing full materialization of large rasters in memory and optimizing processing time and memory usage.
This approach is particularly beneficial for complex calculations that require pixel-by-pixel transformations, as tiling minimizes memory consumption and enables parallelized processing of smaller raster sections.
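Putting the two steps together, a per-tile NDVI computation might look like the sketch below. It assumes the tiled DataFrame from above exposes a tile column and the same band ordering used in the earlier map algebra example:

# Register the tiled DataFrame so it can be queried with SQL
df_tiled.createOrReplaceTempView("tiled_rasters")

# Apply map algebra tile by tile instead of materializing whole rasters
df_ndvi = sedona.sql("""
    SELECT RS_MapAlgebra(tile, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') AS ndvi
    FROM tiled_rasters
""")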
To incorporate STAC (SpatioTemporal Asset Catalog) datasets into Havasu, the process involves accessing the STAC dataset as a Sedona DataFrame, surfacing STAC assets (raster bands) as out-DB or in-DB rasters, and finally saving the result in a Havasu table. This guide provides the essentials to help you set up STAC-based raster data in Havasu, with specific attention to handling the dataset efficiently in Wherobots.
STAC datasets like Sentinel-2 imagery are typically available in JSON format, which includes metadata about each raster, such as acquisition time, spatial extent, and the URLs of imagery files. The first step is to read these JSON files into a DataFrame:
Load into Sedona DataFrame: Use Sedona’s GeoJSON reader to load the STAC metadata into a DataFrame. This DataFrame will contain essential attributes like URLs to individual raster bands (e.g., blue, red, NIR).
df_stac = sedona.read.format("geojson").load("s3://your-bucket/path-to-stac-metadata")
This DataFrame now acts as a container for your STAC items, enabling you to manipulate, transform and/or perform spatial joins on the metadata before loading actual raster data.
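For example, you might restrict the metadata to an area of interest before touching any imagery. This is a minimal sketch that assumes the GeoJSON reader surfaced the item footprints in a geometry column; the polygon is a placeholder:

from pyspark.sql.functions import expr

# Keep only STAC items whose footprint intersects the area of interest
df_aoi = df_stac.where(expr(
    "ST_Intersects(geometry, ST_GeomFromText('POLYGON ((...))'))"
))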
Havasu’s out-DB raster capability allows you to reference large raster files externally, ideal for high-resolution datasets like Sentinel-2 imagery. To set this up:
Create Out-DB Raster Columns: Use RS_FromPath to reference each asset as an out-DB raster. Here’s an example with a blue band raster:
df_rasters = df_stac.withColumn("raster_blue", expr("RS_FromPath(assets.blue.href)"))
Repeat this process for each required band (e.g., red, NIR) to build a DataFrame that references all the necessary raster paths. This setup maintains efficient storage by referencing images externally, reducing data load times and storage requirements in Havasu.
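For instance, the red and NIR bands can be added the same way. This sketch assumes the STAC items expose assets named red and nir:

# Reference the remaining bands as out-DB rasters (asset names assumed)
df_rasters = df_rasters \
    .withColumn("raster_red", expr("RS_FromPath(assets.red.href)")) \
    .withColumn("raster_nir", expr("RS_FromPath(assets.nir.href)"))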
Once you’ve set up the DataFrame with out-DB raster columns, the next step is to save it as a Havasu table:
Write to Havasu: Use writeTo() to save the DataFrame as a new Havasu table, enabling further spatial queries and transformations within Wherobots.
df_rasters.writeTo("wherobots.test_db.stac_raster_table").create()
With this approach, your STAC imagery is now part of a Havasu table, ready for efficient querying and spatial operations. WherobotsDB enables advanced spatial functions like RS_Intersects and RS_MapAlgebra, making it straightforward to analyze the imagery stored as out-DB rasters.
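For example, restricting a query to imagery that covers a given area could look like the sketch below, run against the table created above; the polygon is a placeholder:

sedona.sql("""
    SELECT *
    FROM wherobots.test_db.stac_raster_table
    WHERE RS_Intersects(raster_blue, ST_GeomFromText('POLYGON ((...))'))
""")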
This streamlined approach provides an effective way to incorporate STAC datasets in Havasu, enabling data engineers to leverage high-resolution satellite imagery in their spatial data lakehouse while optimizing for performance and storage.
Monitor your data lakehouse solution’s performance using WherobotsDB’s configurations and consider periodically converting standard GeoTIFF files to Cloud-Optimized GeoTIFF (COG) to reduce storage latency. Use gdalinfo to inspect file structures, tile sizes, and compression formats, ensuring that your files are optimized for remote access.
Geospatial data presents a wealth of opportunities for organizations across industries, yet its inherent complexity often makes it difficult to unlock its full potential. With Havasu, Wherobots has developed a solution tailored specifically to the demands of geospatial data management, combining the scalability of a data lake with the structured efficiency of a data warehouse.
From its seamless integration with cloud storage and support for common geospatial formats to advanced capabilities like spatial indexing, filter pushdown, and raster processing, Havasu empowers organizations to efficiently store, query, and analyze spatial data at scale. By leveraging its open architecture and powerful features, teams can transform vast geospatial datasets into actionable insights, driving innovation in logistics, urban planning, environmental monitoring, and beyond.
When paired with WherobotsDB, Havasu transforms geospatial analytics by providing high-speed querying for both raster and vector data, advanced spatial indexing, and seamless compatibility with cloud-native storage solutions like AWS S3. This differentiates Havasu from legacy solutions that struggle with scalability or require complex application-layer workarounds for spatial data.
As geospatial data continues to grow in both scale and importance, adopting a spatial data lakehouse architecture with Havasu ensures that your organization remains at the forefront of data-driven decision-making. Whether you’re just beginning to integrate spatial data into your workflows or optimizing an existing system, Havasu equips you with the tools to manage, process, and analyze geospatial data with unmatched efficiency and flexibility.
Learn more about how Havasu can help you redefine your geospatial data strategy in the official documentation. Unlock the future of geospatial intelligence today.