In the past decade, many organizations have adopted blob storage (e.g., AWS S3) as a primary storage platform. These organizations collect massive volumes of data and ingest it into S3 as files for scalability, cost efficiency, and reliability. However, there has since been a need to interact with such data using SQL, which is familiar to developers, much as they would query a relational database. This led to the invention of open table formats such as Apache Iceberg, which enable users to perform wide-ranging database operations, including concurrent data updates, table metadata handling, time travel, and versioning, on files stored in object stores such as S3 without loading those files into a relational database.
However, when dealing with spatial data, existing open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake do not provide native support for spatial attributes. This lack of native spatial support forces users to handle the spatial aspect of data at the application layer, an approach that often cannot handle the intricacies and scale of spatial analytics and falls short of meeting the demands of analytical workloads that leverage spatial data. The significance of spatial analytics underscores the pressing need to manage it efficiently within the enterprise data stack, which includes managing spatial attributes in open table formats.
Havasu remedies this. Havasu is an open table format that extends Apache Iceberg to support spatial data. It introduces a range of pivotal features, including native support for storing and manipulating geometry and raster objects directly in data lake tables, and enables seamless querying and processing of spatial tables using Spatial SQL.
Havasu stands out for ensuring the ACID (Atomicity, Consistency, Isolation, and Durability) properties in spatial data transactions, a critical feature for reliable data management. This guarantees that each transaction is processed completely or not at all, maintaining data integrity even in complex environments. Furthermore, Havasu supports schema evolution, allowing for adaptable data structures without compromising existing data. This flexibility is key for efficiently managing and evolving spatial data over time, catering to the dynamic needs of spatial databases.
Havasu supports geometry/geography and raster as primary data types, and seamlessly integrates with the compute layer. Users can easily read and write spatial data using Havasu, and process spatial data using any computation engine (e.g., WherobotsDB) as long as that computation engine implements a reader/writer for Havasu.
Computation engines (like WherobotsDB), when incorporating Havasu, can benefit from Havasu's spatial filter pushdown support, which significantly accelerates spatial range queries. Havasu allows storing spatial data in the object storage of the customer's choice, and it decouples storage from computation, which makes storing vast amounts of spatial data very cost effective. Havasu also comes equipped with an efficient storage mode for raster data in Parquet files, namely out-db storage, which enables high-throughput reads of large numbers of rasters.
Havasu is a table format with an open specification. It is based on Apache Iceberg, which also has an open specification and is widely adopted by the big data ecosystem. The extension to the Apache Iceberg specification is itself clearly specified, so any implementation adopting the extended specification can read and write Havasu tables. Customers can store Havasu tables in the storage of their choice, without being tightly coupled to one specific vendor or implementation.
The open-source Havasu specification can be found in the Wherobots Havasu documentation. The Havasu table format extends the Iceberg table spec to support managing large spatial datasets as tables. All other aspects of the Iceberg spec are unchanged; for example, the Havasu specification does not change the fundamental organization of table files.
Geometry values consist of an optional spatial reference ID (abbreviated as SRID) and a geometry shape. The SRID is a 32-bit integer that identifies the coordinate system the geometry shape is using. The interpretation of the SRID is implementation dependent; for example, it could be an EPSG code, or a code defined by a proprietary coordinate system. The geometry shape is one of the types defined by the OGC Simple Features for SQL specification, and it can be stored in the underlying Parquet files in one of several encodings, such as wkb and ewkb.
When the geometry column is at the root of the schema and the geometry encoding is either wkb or ewkb, the application can optionally write GeoParquet metadata to the Parquet files. The GeoParquet metadata is defined by the GeoParquet specification.
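To make the ewkb encoding concrete, here is a minimal pure-Python sketch (not Havasu's implementation) that serializes a 2D point plus an SRID as EWKB; the point_to_ewkb helper and the sample coordinates are ours:

```python
import struct

def point_to_ewkb(x, y, srid):
    """Encode a 2D point as little-endian EWKB: a byte-order flag,
    the geometry type with the SRID-present flag set, the SRID,
    then the x/y coordinates as IEEE-754 doubles."""
    EWKB_SRID_FLAG = 0x20000000
    POINT_TYPE = 1
    return struct.pack(
        "<BII2d",                      # byte order, type, srid, x, y
        1,                             # 1 = little-endian
        POINT_TYPE | EWKB_SRID_FLAG,
        srid,
        x, y,
    )

ewkb = point_to_ewkb(-73.96969, 40.749244, 4326)
print(ewkb.hex())  # starts with 0101000020e6100000 (LE point with SRID 4326)
```

The same bytes round-trip through any EWKB-aware reader; plain wkb is identical except the SRID flag and the 4-byte SRID are omitted.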
A raster is one or more grids of cells. All the grids share the same dimensions: width columns and height rows. The grid cells are represented by the band field. The grids are geo-referenced using an affine transformation that maps grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the crs field, which is serialized as a WKT string when stored in data files.
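The affine geo-referencing described above can be illustrated with a small sketch; the six-coefficient transform and the grid_to_world helper below are illustrative, not Havasu's actual field layout:

```python
def grid_to_world(col, row, transform):
    """Apply an affine geo-transform (a, b, c, d, e, f) that maps a
    grid cell (col, row) to world coordinates (x, y):
        x = a*col + b*row + c
        y = d*col + e*row + f
    """
    a, b, c, d, e, f = transform
    return (a * col + b * row + c, d * col + e * row + f)

# Hypothetical transform: 0.25-degree square pixels, no rotation,
# with the raster's upper-left corner anchored at (-180, 90).
t = (0.25, 0.0, -180.0, 0.0, -0.25, 90.0)
print(grid_to_world(0, 0, t))  # upper-left corner of the grid
print(grid_to_world(4, 4, t))
```

Inverting the same transform takes a world coordinate back to a grid cell, which is how cell lookups against a geo-referenced raster work.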
Havasu supports persisting raster band values in two different ways: in-db, where band values are stored inline in the data files, and out-db, where the table stores references to external raster files.
Havasu collects and records the spatial statistics of data files when writing data to the table. The spatial statistics include the minimum bounding rectangle (MBR) of the geometries in each data file.
The bounds of geometry values should be derived using their minimum bounding rectangles (MBRs). The MBR of a geometry value is defined as the smallest rectangle that contains the geometry. The SRIDs of geometry values are ignored when computing MBRs. The MBRs of all geometry values in a data file are unioned together into a single MBR, which becomes the MBR of the data file.
Raster bounds are the MBRs of rasters in the WGS84 coordinate system. They are computed by transforming the envelope of each raster from its native coordinate system to WGS84. Raster bounds follow a special rule for handling MBRs that cross the anti-meridian. Implementations of the Havasu specification should handle such MBRs correctly; otherwise, spatial query optimizations will produce incomplete query results.
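The MBR union and the anti-meridian rule can be sketched in a few lines of Python; the convention below (an MBR with xmin > xmax encodes a crossing) and the sample bounds are our assumptions, not the spec's exact wire format:

```python
def union_mbrs(mbrs):
    """Union per-geometry MBRs (xmin, ymin, xmax, ymax) into one file-level MBR."""
    xmins, ymins, xmaxs, ymaxs = zip(*mbrs)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))

def lon_ranges(mbr):
    """Split an MBR that crosses the anti-meridian (xmin > xmax) into
    one or two ordinary longitude ranges."""
    xmin, _, xmax, _ = mbr
    return [(xmin, xmax)] if xmin <= xmax else [(xmin, 180.0), (-180.0, xmax)]

def mbrs_intersect(a, b):
    """Intersection test that stays correct when either MBR crosses the anti-meridian."""
    if a[3] < b[1] or b[3] < a[1]:   # latitude ranges disjoint
        return False
    return any(r1[0] <= r2[1] and r2[0] <= r1[1]
               for r1 in lon_ranges(a) for r2 in lon_ranges(b))

# A raster spanning the anti-meridian (e.g., around Fiji): xmin=170 > xmax=-170.
fiji_mbr = (170.0, -20.0, -170.0, -10.0)
query = (-175.0, -18.0, -160.0, -12.0)
print(mbrs_intersect(fiji_mbr, query))  # True
```

A naive test comparing raw xmin/xmax would miss this overlap, which is exactly the "incomplete query results" failure mode the specification warns about.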
Database engines can take advantage of the spatial statistics of data files to optimize the query execution plan. For example, if the query predicate is a spatial range query, the engine can use the spatial statistics to prune the data files that do not contain any data satisfying the predicate. This process is called spatial filter pushdown. How spatial query optimization is implemented in scan planning is implementation dependent. For example, in WherobotsDB, for a spatial range query ST_Within(geom, Q), where geom is the geometry field in a Havasu table and Q is a constant geometry serving as the query window, WherobotsDB converts the spatial query predicate to an inclusive projection ST_Intersects(MBR[geom], Q), where MBR[geom] is the minimum bounding box of all values of geom in a data file. WherobotsDB then evaluates the projection using the field statistics maintained in manifest files.
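The pruning step behind spatial filter pushdown can be sketched in a few lines; the file names and per-file MBRs below are hypothetical, and a real engine evaluates this against the statistics in Iceberg manifest files rather than a dict:

```python
def prune_files(file_stats, query_mbr):
    """Keep only the data files whose recorded geometry MBR intersects
    the query window's MBR; all other files can be skipped entirely."""
    def intersects(a, b):
        return (a[0] <= b[2] and b[0] <= a[2] and
                a[1] <= b[3] and b[1] <= a[3])
    return [path for path, mbr in file_stats.items() if intersects(mbr, query_mbr)]

# Hypothetical per-file MBR statistics (xmin, ymin, xmax, ymax)
stats = {
    "part-00000.parquet": (-74.3, 40.4, -73.6, 41.0),    # around NYC
    "part-00001.parquet": (-122.6, 37.2, -121.7, 38.0),  # around the Bay Area
}
print(prune_files(stats, (-74.0, 40.7, -73.9, 40.8)))  # only the NYC file survives
```

Because the projection is inclusive, a surviving file may still contain no matching rows, but a pruned file is guaranteed to contain none, so correctness is preserved while I/O drops sharply.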
Spatial filter pushdown works best when spatial data that are near each other are stored in the same file. Havasu provides a CREATE SPATIAL INDEX syntax for rewriting the table to sort records by a geometry column.
CREATE SPATIAL INDEX FOR <table_name> USING hilbert(<geometry_column>, <precision>) [ WHERE <condition> ] [ OPTIONS <options> ];
This statement rewrites the data files of the table and clusters the data by the geometry column. This feature is very useful when the table contains a large amount of data and the spatial filter is highly selective. For example, if the table contains 1 TB of data and the spatial filter selects only 1% of it, ideally Havasu will read only ~10 GB of data to answer the query.
WherobotsDB implements a reader/writer for the Havasu spatial table format. Users can perform many interesting spatial database operations on Havasu tables using WherobotsDB in Wherobots Cloud. Here we explore some common operations using WherobotsDB. For details, please read the Wherobots documentation. To follow along, create a free Wherobots Cloud account.
Create a new table in Wherobots table catalog
First, let’s create a Havasu table in the wherobots table catalog. This catalog by default is configured to use your Wherobots Cloud S3 storage, but another storage location can also be specified. We’ll use a dataset of Taxi rides as our example.
CREATE TABLE wherobots.test_db.taxi (
  pickup GEOMETRY,
  Trip_Pickup_DateTime STRING,
  Payment_Type STRING,
  Fare_Amt DECIMAL(10,0))
USING havasu.iceberg
-- By default this table will be stored in your Wherobots Cloud S3 account
-- Optionally specify other location
-- LOCATION 's3://path/to/warehouse/test_db/taxi'
List tables in Wherobots table catalog
We can view all tables within the wherobots table catalog using SHOW TABLES:
SHOW TABLES IN wherobots.test_db
We can see the wherobots.test_db.taxi table that we just created:
+---------+-----------+-----------+
|namespace|  tableName|isTemporary|
+---------+-----------+-----------+
|  test_db|       taxi|      false|
+---------+-----------+-----------+
Describe a table in Wherobots table catalog
To view the columns and datatypes of each column we can describe the table:
DESCRIBE TABLE wherobots.test_db.taxi
Note here that our pickup column is of type geometry:
+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|              pickup| geometry|   null|
|Trip_Pickup_DateTime|   string|   null|
|        Payment_Type|   string|   null|
|            Fare_Amt|   string|   null|
+--------------------+---------+-------+
Insert geometry data
We can insert data into our table using SQL. Here we specify the geometry value using the ST_GeomFromText function, which takes a WKT string, in this case describing the point geometry that represents the pickup location.
sedona.sql("""
    INSERT INTO wherobots.test_db.taxi
    VALUES (ST_GeomFromText('POINT (-73.96969 40.749244)'), '10/16/09 22:35', 'Credit', 42)
""")
We can also write spatial DataFrames to Havasu tables. Here we load a NYC taxi dataset into a Sedona DataFrame, then append the data to our wherobots.test_db.taxi Havasu table:
from pyspark.sql.functions import col

taxidf = sedona.read.format('csv').option("header", "true").option("delimiter", ",") \
    .load("s3a://wherobots-examples-prod/data/nyc-taxi-data.csv")
taxidf = taxidf.selectExpr(
    'ST_Point(CAST(Start_Lon AS Decimal(24,20)), CAST(Start_Lat AS Decimal(24,20))) AS pickup',
    'Trip_Pickup_DateTime',
    'Payment_Type',
    'CAST(Fare_Amt AS DECIMAL)')
taxidf = taxidf.filter(col("pickup").isNotNull())
taxidf.writeTo("wherobots.test_db.taxi").append()
Create spatial index
Creating a spatial index rewrites the table so that records are sorted by the geometry column. Havasu supports the hilbert index strategy, which sorts the data along a Hilbert space-filling curve, a very efficient way to order geospatial data by proximity. We can configure the precision by specifying a value for the precision parameter, which is the number of bits used to represent the Hilbert index.
sedona.sql("CREATE SPATIAL INDEX FOR wherobots.test_db.taxi USING hilbert(pickup, 10)")
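For intuition, here is the classic xy-to-Hilbert-index algorithm in pure Python, a sketch of the ordering idea rather than Havasu's implementation; an engine would first scale each geometry's centroid onto a 2**precision by 2**precision grid before computing the index:

```python
def hilbert_index(order, x, y):
    """Map a cell (x, y) on a 2**order x 2**order grid to its position
    along the Hilbert curve (the standard iterative xy-to-d algorithm)."""
    d = 0
    s = 2 ** (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the curve's orientation stays consistent
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# On a 2x2 grid the curve visits (0,0), (0,1), (1,1), (1,0) in order.
print([hilbert_index(1, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
```

Sorting rows by this index keeps spatially close records close in the file ordering, which is what makes the MBR statistics of each rewritten file tight and the pushdown pruning effective.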
Read data from Havasu table
We can query our Havasu tables using familiar SQL; with WherobotsDB we also have the advantage of spatial queries using Spatial SQL functions. Here we search for all taxi pickups that occurred within a certain area around a given point:
sedona.sql("""
    SELECT * FROM wherobots.test_db.taxi
    WHERE ST_Intersects(pickup, ST_Buffer(ST_GeomFromText('POINT (-73.96969 40.749244)'), 0.001))
""").show(truncate=False)
+----------------------------+--------------------+------------+--------+
|pickup                      |Trip_Pickup_DateTime|Payment_Type|Fare_Amt|
+----------------------------+--------------------+------------+--------+
|POINT (-73.96969 40.749244) |1/5/09 16:29        |Credit      |9       |
|POINT (-73.969387 40.749159)|1/20/09 14:38       |CASH        |7       |
|POINT (-73.969308 40.75001) |1/8/09 17:48        |CASH        |11      |
|POINT (-73.969355 40.749315)|1/7/09 16:52        |CASH        |10      |
|POINT (-73.970238 40.749497)|1/19/09 2:42        |Credit      |45      |
|POINT (-73.969492 40.749103)|1/21/09 19:34       |Credit      |15      |
|POINT (-73.970158 40.749055)|1/15/09 14:34       |CASH        |8       |
|POINT (-73.969638 40.748663)|1/27/09 17:46       |CASH        |9       |
|POINT (-73.970167 40.749033)|1/2/09 18:49        |CASH        |8       |
|POINT (-73.97059 40.749077) |1/18/09 20:39       |Credit      |10      |
|POINT (-73.970105 40.748985)|1/12/09 11:39       |CASH        |9       |
|POINT (-73.970228 40.749027)|1/8/09 16:07        |CASH        |5       |
|POINT (-73.9697 40.748737)  |1/5/09 18:04        |Credit      |6       |
|POINT (-73.970628 40.749132)|1/27/09 18:11       |CASH        |10      |
|POINT (-73.969573 40.748677)|1/29/09 19:35       |CASH        |5       |
|POINT (-73.969783 40.749163)|1/6/09 19:48        |Cash        |8       |
|POINT (-73.969522 40.748948)|1/4/09 16:24        |CASH        |5       |
|POINT (-73.969529 40.749625)|1/7/09 23:38        |CASH        |7       |
|POINT (-73.969473 40.749072)|1/29/09 18:04       |CASH        |16      |
|POINT (-73.970575 40.749395)|1/7/09 19:36        |CASH        |8       |
+----------------------------+--------------------+------------+--------+
only showing top 20 rows
Insert out-db raster data
We can also work with raster data in Havasu tables. Here we insert raster data into a Havasu table using the out-db option. You can read more about working with raster data in Havasu tables in the documentation.
sedona.sql("SELECT RS_FromPath('s3a://XXX.tif') AS rast") \
    .writeTo("wherobots.test_db.test_table").append()
Wherobots collects open datasets from various data sources, then cleans and transforms them to Havasu format to enable linking enterprise data to the real physical world. All datasets are provided for free (excluding AWS data transfer fees). Certain datasets are only accessible to Pro Edition users. To learn more, please read Wherobots Open Data.
The Wherobots open data catalog can be extremely useful when tables are combined, typically using spatial joins, to address real world business use cases like risk analysis, site selection, fleet vehicle optimization and answering other business intelligence questions.
Spatial join query to find zones prone to wild fires
Let's see how we can make use of the Wherobots open data catalog by performing a spatial join on Havasu tables to find the US zipcode regions that experience the most wild fires. To do this we will use the wherobots_open_data.us_census.zipcode Havasu table, which contains the polygon geometries of US zipcodes, and wherobots_open_data.weather.wild_fires, which contains point geometries of wild fire events.
We perform a spatial join using the ST_Intersects Spatial SQL function to define a predicate that joins each fire with the zipcode it occurred in.
fire_zone = sedona.sql("""
    SELECT z.geometry as zipgeom, z.ZCTA5CE10 as zipcode, f.FIRE_NAME
    FROM wherobots_open_data.us_census.zipcode z, wherobots_open_data.weather.wild_fires f
    WHERE ST_Intersects(z.geometry, f.geometry)
""")
We can then group this data by Zipcode to find the count of fires that occur in each zip code and visualize the results. This type of analysis can be useful for risk analysis and insurance premium pricing.
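In WherobotsDB this aggregation is a fire_zone.groupBy("zipcode").count() over the joined DataFrame; since running that needs a live session, here is a pure-Python sketch of the same group-and-count logic over hypothetical join results (the zipcodes and fire names are made up):

```python
from collections import Counter

# Hypothetical (zipcode, fire_name) rows as they might come back from the join
rows = [("93555", "DOME"), ("93555", "RIDGE"), ("92371", "SWARTHOUT"), ("93555", "INYO")]

# Count fires per zipcode and rank the most fire-prone zones first
fires_per_zip = Counter(zipcode for zipcode, _ in rows)
print(fires_per_zip.most_common(2))  # [('93555', 3), ('92371', 1)]
```

The ranked counts, joined back to the zipcode polygons, are what you would feed into a choropleth visualization or a premium-pricing model.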