In the past decade, many organizations have adopted blob storage (e.g., AWS S3) as a primary storage platform. These organizations collect massive volumes of data and ingest it into S3 as files for scalability, cost efficiency, and reliability. However, there has since been a need to interact with such data using SQL, which is familiar to developers, much as they would query a relational database. This led to the invention of open table formats such as Apache Iceberg, which enable users to perform wide-ranging database operations, including concurrent data updates, table metadata handling, time travel, and versioning, on files stored in object stores such as S3 without loading those files into a relational database.
However, when dealing with spatial data, existing open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake do not provide native support for spatial attributes. This lack of native spatial support forces users to handle the spatial aspect of data at the application layer, an approach that often cannot handle the intricacies and scale of spatial analytics and falls short of meeting the demands of analytical workloads that leverage spatial data. The significance of spatial analytics underscores the pressing need to manage it efficiently within the enterprise data stack, which includes managing spatial attributes in open table formats.
Havasu remedies this. Havasu is an open table format that extends Apache Iceberg to support spatial data. It introduces a range of pivotal features, including native support for storing and manipulating geometry and raster objects directly in data lake tables, and enables seamless querying and processing of spatial tables using Spatial SQL.
Havasu stands out for ensuring the ACID (Atomicity, Consistency, Isolation, and Durability) properties in spatial data transactions, a critical feature for reliable data management. This guarantees that each transaction is processed completely or not at all, maintaining data integrity even in complex environments. Furthermore, Havasu supports schema evolution, allowing for adaptable data structures without compromising existing data. This flexibility is key for efficiently managing and evolving spatial data over time, catering to the dynamic needs of spatial databases.
Havasu supports geometry/geography and raster as primary data types, and seamlessly integrates with the compute layer. Users can easily read and write spatial data using Havasu, and process spatial data using any computation engine (e.g., WherobotsDB) as long as that computation engine implements a reader/writer for Havasu.
Computation engines (like WherobotsDB), when incorporating Havasu, can benefit from Havasu's spatial filter pushdown support, which significantly accelerates spatial range queries. Havasu allows storing spatial data in the object storage of the customer's choice, and it decouples storage from computation, which makes storing vast amounts of spatial data very cost effective. Havasu also comes equipped with an efficient storage mode for raster data in Parquet files, namely out-db storage, which enables high-throughput reads of large numbers of rasters.
Havasu is a table format with an open specification. It is based on Apache Iceberg, which also has an open specification and is widely adopted by the big data ecosystem. The extension to the Apache Iceberg specification is itself clearly specified, so any implementation adopting the extended specification can read and write Havasu tables. Customers can store Havasu tables in the storage of their choice, without being tightly coupled to one specific vendor or implementation.
The open-source Havasu specification can be found in the Wherobots Havasu documentation. The Havasu table format extends the Iceberg table spec to support managing large spatial datasets as tables. All other aspects of the Iceberg spec are unchanged; for example, the Havasu specification does not change the fundamental organization of table files.
Geometry values consist of an optional spatial reference ID (abbreviated as SRID) and a geometry shape. The SRID is a 32-bit integer that identifies the coordinate system the geometry shape is using. The interpretation of the SRID is implementation dependent; for example, it could be an EPSG code, or a code defined by a proprietary coordinate system. The geometry shape is one of the types defined by the OGC Simple Features for SQL specification, and it can be stored in the underlying Parquet files in one of several encodings, such as wkb and ewkb.
When the geometry column is at the root of the schema and the geometry encoding is either wkb or ewkb, the application can optionally write GeoParquet metadata to the Parquet files. The GeoParquet metadata is defined by the GeoParquet specification.
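To make the ewkb encoding concrete, here is a minimal pure-Python sketch (not Havasu's implementation) that serializes a 2D point plus an SRID as EWKB; the point_to_ewkb helper and the sample coordinates are ours:

```python
import struct

def point_to_ewkb(x, y, srid):
    """Encode a 2D point as little-endian EWKB: a byte-order flag,
    the geometry type with the SRID-present flag set, the SRID,
    then the x/y coordinates as IEEE-754 doubles."""
    EWKB_SRID_FLAG = 0x20000000
    POINT_TYPE = 1
    return struct.pack(
        "<BII2d",                      # byte order, type, srid, x, y
        1,                             # 1 = little-endian
        POINT_TYPE | EWKB_SRID_FLAG,
        srid,
        x, y,
    )

ewkb = point_to_ewkb(-73.96969, 40.749244, 4326)
print(ewkb.hex())  # starts with 0101000020e6100000 (LE point with SRID 4326)
```

The same bytes round-trip through any EWKB-aware reader; plain wkb is identical except the SRID flag and the 4-byte SRID are omitted.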
A raster is one or more grids of cells. All the grids share the same dimensions: width columns and height rows. The grid cells are represented by the band field. The grids are geo-referenced using an affine transformation that maps grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the crs field, which is serialized as a WKT string when stored in data files.
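The affine geo-referencing described above can be illustrated with a small sketch; the six-coefficient transform and the grid_to_world helper below are illustrative, not Havasu's actual field layout:

```python
def grid_to_world(col, row, transform):
    """Apply an affine geo-transform (a, b, c, d, e, f) that maps a
    grid cell (col, row) to world coordinates (x, y):
        x = a*col + b*row + c
        y = d*col + e*row + f
    """
    a, b, c, d, e, f = transform
    return (a * col + b * row + c, d * col + e * row + f)

# Hypothetical transform: 0.25-degree square pixels, no rotation,
# with the raster's upper-left corner anchored at (-180, 90).
t = (0.25, 0.0, -180.0, 0.0, -0.25, 90.0)
print(grid_to_world(0, 0, t))  # upper-left corner of the grid
print(grid_to_world(4, 4, t))
```

Inverting the same transform takes a world coordinate back to a grid cell, which is how cell lookups against a geo-referenced raster work.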
Havasu supports persisting raster band values in two different ways: in-db, where band values are stored inline in the data files, and out-db, where the table stores references to external raster files.
Havasu collects and records the spatial statistics of data files when writing data to the table. The spatial statistics include the minimum bounding rectangle (MBR) of the geometries in each data file.
The bounds of geometry values should be derived using their minimum bounding rectangles (MBRs). The MBR of a geometry value is defined as the smallest rectangle that contains the geometry. The SRIDs of geometry values are ignored when computing MBRs. The MBRs of all geometry values in a data file are unioned together into a single MBR, which becomes the MBR of the data file.
Raster bounds are the MBRs of rasters in the WGS84 coordinate system. They are computed by transforming the envelope of each raster from its native coordinate system to WGS84. Raster bounds follow a special rule for handling MBRs that cross the anti-meridian. Implementations of the Havasu specification should handle such MBRs correctly; otherwise, spatial query optimizations will produce incomplete query results.
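The MBR union and the anti-meridian rule can be sketched in a few lines of Python; the convention below (an MBR with xmin > xmax encodes a crossing) and the sample bounds are our assumptions, not the spec's exact wire format:

```python
def union_mbrs(mbrs):
    """Union per-geometry MBRs (xmin, ymin, xmax, ymax) into one file-level MBR."""
    xmins, ymins, xmaxs, ymaxs = zip(*mbrs)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))

def lon_ranges(mbr):
    """Split an MBR that crosses the anti-meridian (xmin > xmax) into
    one or two ordinary longitude ranges."""
    xmin, _, xmax, _ = mbr
    return [(xmin, xmax)] if xmin <= xmax else [(xmin, 180.0), (-180.0, xmax)]

def mbrs_intersect(a, b):
    """Intersection test that stays correct when either MBR crosses the anti-meridian."""
    if a[3] < b[1] or b[3] < a[1]:   # latitude ranges disjoint
        return False
    return any(r1[0] <= r2[1] and r2[0] <= r1[1]
               for r1 in lon_ranges(a) for r2 in lon_ranges(b))

# A raster spanning the anti-meridian (e.g., around Fiji): xmin=170 > xmax=-170.
fiji_mbr = (170.0, -20.0, -170.0, -10.0)
query = (-175.0, -18.0, -160.0, -12.0)
print(mbrs_intersect(fiji_mbr, query))  # True
```

A naive test comparing raw xmin/xmax would miss this overlap, which is exactly the "incomplete query results" failure mode the specification warns about.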
Database engines can take advantage of the spatial statistics of data files to optimize the query execution plan. For example, if the query predicate is a spatial range query, the engine can use the spatial statistics to prune the data files that do not contain any data satisfying the predicate. This process is called spatial filter pushdown. How spatial query optimization is implemented in scan planning is implementation dependent. For example, in WherobotsDB, for a spatial range query ST_Within(geom, Q), where geom is the geometry field in a Havasu table and Q is a constant geometry serving as the query window, WherobotsDB converts the spatial query predicate to an inclusive projection ST_Intersects(MBR[geom], Q), where MBR[geom] is the minimum bounding box of all values of geom in a data file. WherobotsDB then evaluates the projection using the field statistics maintained in manifest files.
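The pruning step behind spatial filter pushdown can be sketched in a few lines; the file names and per-file MBRs below are hypothetical, and a real engine evaluates this against the statistics in Iceberg manifest files rather than a dict:

```python
def prune_files(file_stats, query_mbr):
    """Keep only the data files whose recorded geometry MBR intersects
    the query window's MBR; all other files can be skipped entirely."""
    def intersects(a, b):
        return (a[0] <= b[2] and b[0] <= a[2] and
                a[1] <= b[3] and b[1] <= a[3])
    return [path for path, mbr in file_stats.items() if intersects(mbr, query_mbr)]

# Hypothetical per-file MBR statistics (xmin, ymin, xmax, ymax)
stats = {
    "part-00000.parquet": (-74.3, 40.4, -73.6, 41.0),    # around NYC
    "part-00001.parquet": (-122.6, 37.2, -121.7, 38.0),  # around the Bay Area
}
print(prune_files(stats, (-74.0, 40.7, -73.9, 40.8)))  # only the NYC file survives
```

Because the projection is inclusive, a surviving file may still contain no matching rows, but a pruned file is guaranteed to contain none, so correctness is preserved while I/O drops sharply.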
Spatial filter pushdown works best when spatial data that are near each other are stored in the same file. Havasu provides a CREATE SPATIAL INDEX syntax for rewriting the table to sort records by a geometry column.
CREATE SPATIAL INDEX FOR <table_name> USING hilbert(<geometry_column>, <precision>) [ WHERE <condition> ] [ OPTIONS <options> ];
This statement rewrites the data files of the table and clusters the data by the geometry column. This feature is very useful when the table contains a large amount of data and the spatial filter is highly selective. For example, if the table contains 1 TB of data and the spatial filter selects only 1% of it, ideally Havasu will read only ~10 GB of data to answer the query.
WherobotsDB implements a reader/writer for the Havasu spatial table format. Users can perform many interesting spatial database operations on Havasu tables using WherobotsDB in Wherobots Cloud. Here we explore some common operations using WherobotsDB. For details, please read the Wherobots documentation. To follow along, create a free Wherobots Cloud account.
Create a new table in Wherobots table catalog
First, let’s create a Havasu table in the wherobots table catalog. This catalog by default is configured to use your Wherobots Cloud S3 storage, but another storage location can also be specified. We’ll use a dataset of Taxi rides as our example.
CREATE TABLE wherobots.test_db.taxi (
  pickup GEOMETRY,
  Trip_Pickup_DateTime STRING,
  Payment_Type STRING,
  Fare_Amt DECIMAL(10,0))
USING havasu.iceberg
-- By default this table will be stored in your Wherobots Cloud S3 account
-- Optionally specify other location
-- LOCATION 's3://path/to/warehouse/test_db/taxi'
List tables in Wherobots table catalog
We can view all tables within the wherobots table catalog using SHOW TABLES:
SHOW TABLES IN wherobots.test_db
We can see the wherobots.test_db.taxi table that we just created:
+---------+-----------+-----------+
|namespace|  tableName|isTemporary|
+---------+-----------+-----------+
|  test_db|       taxi|      false|
+---------+-----------+-----------+
Describe a table in Wherobots table catalog
To view the columns and datatypes of each column we can describe the table:
DESCRIBE TABLE wherobots.test_db.taxi
Note here that our pickup column is of type geometry:
+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|              pickup| geometry|   null|
|Trip_Pickup_DateTime|   string|   null|
|        Payment_Type|   string|   null|
|            Fare_Amt|   string|   null|
+--------------------+---------+-------+
Insert geometry data
We can insert data into our table using SQL. Here we specify the geometry value using the ST_GeomFromText function, which takes a WKT string, in this case describing the point geometry that represents the pickup location.
sedona.sql("""
    INSERT INTO wherobots.test_db.taxi
    VALUES (ST_GeomFromText('POINT (-73.96969 40.749244)'), '10/16/09 22:35', 'Credit', 42)
""")
We can also write spatial DataFrames to Havasu tables. Here we load a NYC taxi dataset into a Sedona DataFrame, then append the data to our wherobots.test_db.taxi Havasu table:
from pyspark.sql.functions import col

taxidf = sedona.read.format('csv').option("header", "true").option("delimiter", ",") \
    .load("s3a://wherobots-examples-prod/data/nyc-taxi-data.csv")
taxidf = taxidf.selectExpr(
    'ST_Point(CAST(Start_Lon AS Decimal(24,20)), CAST(Start_Lat AS Decimal(24,20))) AS pickup',
    'Trip_Pickup_DateTime',
    'Payment_Type',
    'CAST(Fare_Amt AS DECIMAL)')
taxidf = taxidf.filter(col("pickup").isNotNull())
taxidf.writeTo("wherobots.test_db.taxi").append()
Create spatial index
Creating a spatial index rewrites the table so that records are sorted by the geometry column. Havasu supports the hilbert index strategy, which sorts the data along a Hilbert space-filling curve, a very efficient way to order geospatial data by proximity. We can configure the precision by specifying a value for the precision parameter, which is the number of bits used to represent the Hilbert index.
sedona.sql("CREATE SPATIAL INDEX FOR wherobots.test_db.taxi USING hilbert(pickup, 10)")
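For intuition, here is the classic xy-to-Hilbert-index algorithm in pure Python, a sketch of the ordering idea rather than Havasu's implementation; an engine would first scale each geometry's centroid onto a 2**precision by 2**precision grid before computing the index:

```python
def hilbert_index(order, x, y):
    """Map a cell (x, y) on a 2**order x 2**order grid to its position
    along the Hilbert curve (the standard iterative xy-to-d algorithm)."""
    d = 0
    s = 2 ** (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the curve's orientation stays consistent
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# On a 2x2 grid the curve visits (0,0), (0,1), (1,1), (1,0) in order.
print([hilbert_index(1, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
```

Sorting rows by this index keeps spatially close records close in the file ordering, which is what makes the MBR statistics of each rewritten file tight and the pushdown pruning effective.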
Read data from Havasu table
We can query our Havasu tables using familiar SQL; with WherobotsDB we also have the advantage of spatial queries using Spatial SQL functions. Here we search for all taxi pickups that occurred within a certain area around a given point:
sedona.sql("""
    SELECT * FROM wherobots.test_db.taxi
    WHERE ST_Intersects(pickup, ST_Buffer(ST_GeomFromText('POINT (-73.96969 40.749244)'), 0.001))
""").show(truncate=False)
+----------------------------+--------------------+------------+--------+
|pickup                      |Trip_Pickup_DateTime|Payment_Type|Fare_Amt|
+----------------------------+--------------------+------------+--------+
|POINT (-73.96969 40.749244) |1/5/09 16:29        |Credit      |9       |
|POINT (-73.969387 40.749159)|1/20/09 14:38       |CASH        |7       |
|POINT (-73.969308 40.75001) |1/8/09 17:48        |CASH        |11      |
|POINT (-73.969355 40.749315)|1/7/09 16:52        |CASH        |10      |
|POINT (-73.970238 40.749497)|1/19/09 2:42        |Credit      |45      |
|POINT (-73.969492 40.749103)|1/21/09 19:34       |Credit      |15      |
|POINT (-73.970158 40.749055)|1/15/09 14:34       |CASH        |8       |
|POINT (-73.969638 40.748663)|1/27/09 17:46       |CASH        |9       |
|POINT (-73.970167 40.749033)|1/2/09 18:49        |CASH        |8       |
|POINT (-73.97059 40.749077) |1/18/09 20:39       |Credit      |10      |
|POINT (-73.970105 40.748985)|1/12/09 11:39       |CASH        |9       |
|POINT (-73.970228 40.749027)|1/8/09 16:07        |CASH        |5       |
|POINT (-73.9697 40.748737)  |1/5/09 18:04        |Credit      |6       |
|POINT (-73.970628 40.749132)|1/27/09 18:11       |CASH        |10      |
|POINT (-73.969573 40.748677)|1/29/09 19:35       |CASH        |5       |
|POINT (-73.969783 40.749163)|1/6/09 19:48        |Cash        |8       |
|POINT (-73.969522 40.748948)|1/4/09 16:24        |CASH        |5       |
|POINT (-73.969529 40.749625)|1/7/09 23:38        |CASH        |7       |
|POINT (-73.969473 40.749072)|1/29/09 18:04       |CASH        |16      |
|POINT (-73.970575 40.749395)|1/7/09 19:36        |CASH        |8       |
+----------------------------+--------------------+------------+--------+
only showing top 20 rows
Insert out-db raster data
We can also work with raster data in Havasu tables. Here we insert raster data into a Havasu table using the out-db option. You can read more about working with raster data in Havasu tables in the documentation.
sedona.sql("SELECT RS_FromPath('s3a://XXX.tif') AS rast") \
    .writeTo("wherobots.test_db.test_table").append()
Wherobots collects open datasets from various data sources, then cleans and transforms them to Havasu format to enable linking enterprise data to the real physical world. All datasets are provided for free (excluding AWS data transfer fees). Certain datasets are only accessible to Pro Edition users. To learn more, please read Wherobots Open Data.
The Wherobots open data catalog can be extremely useful when tables are combined, typically using spatial joins, to address real world business use cases like risk analysis, site selection, fleet vehicle optimization and answering other business intelligence questions.
Spatial join query to find zones prone to wild fires
Let's see how we can make use of the Wherobots open data catalog by performing a spatial join on Havasu tables to find the US zipcode regions that experience the most wild fires. To do this we will use the wherobots_open_data.us_census.zipcode Havasu table, which contains the polygon geometries of US zipcodes, and wherobots_open_data.weather.wild_fires, which contains point geometries of wild fire events.
We perform a spatial join using the ST_Intersects Spatial SQL function to define a predicate that joins each fire with the zipcode it occurred in.
fire_zone = sedona.sql("""
    SELECT z.geometry as zipgeom, z.ZCTA5CE10 as zipcode, f.FIRE_NAME
    FROM wherobots_open_data.us_census.zipcode z, wherobots_open_data.weather.wild_fires f
    WHERE ST_Intersects(z.geometry, f.geometry)
""")
We can then group this data by Zipcode to find the count of fires that occur in each zip code and visualize the results. This type of analysis can be useful for risk analysis and insurance premium pricing.
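In WherobotsDB this aggregation is a fire_zone.groupBy("zipcode").count() over the joined DataFrame; since running that needs a live session, here is a pure-Python sketch of the same group-and-count logic over hypothetical join results (the zipcodes and fire names are made up):

```python
from collections import Counter

# Hypothetical (zipcode, fire_name) rows as they might come back from the join
rows = [("93555", "DOME"), ("93555", "RIDGE"), ("92371", "SWARTHOUT"), ("93555", "INYO")]

# Count fires per zipcode and rank the most fire-prone zones first
fires_per_zip = Counter(zipcode for zipcode, _ in rows)
print(fires_per_zip.most_common(2))  # [('93555', 3), ('92371', 1)]
```

The ranked counts, joined back to the zipcode polygons, are what you would feed into a choropleth visualization or a premium-pricing model.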