Introduction
In the past decade, many organizations have adopted blob storage (e.g., AWS S3) as a primary storage platform. These organizations collect large volumes of data and ingest it as files into S3 for its scalability, cost efficiency, and reliability. However, there has since been a need to interact with such data using SQL, which is familiar to developers, much as one would query a relational database. That need led to open table formats such as Apache Iceberg, which enables users to perform wide-ranging database operations, including concurrent data updates, table metadata handling, time travel, and versioning, on files stored in object stores such as S3 without loading those files into a relational database.
However, existing open table formats such as Apache Iceberg, Hudi, and Delta Lake do not provide native support for spatial attributes. This lack of native spatial support forces users to handle the spatial aspects of data at the application layer, an approach that often cannot handle the intricacies and scale of spatial analytics and falls short of the demands of analytical workloads that leverage spatial data. The significance of spatial analytics underscores the pressing need for its efficient management within the enterprise data stack, which includes managing spatial attributes in open table formats.
To remedy that, Havasu is an open table format that extends Apache Iceberg to support spatial data. Havasu introduces a range of pivotal features, including native support for manipulating and storing geometry and raster objects directly in data lake tables, and enables seamless querying and processing of spatial tables using Spatial SQL.
Key features of Havasu
ACID on Spatial Data
Havasu stands out for ensuring the ACID (Atomicity, Consistency, Isolation, and Durability) properties in spatial data transactions, a critical feature for reliable data management. This guarantees that each transaction is processed completely or not at all, maintaining data integrity even in complex environments. Furthermore, Havasu supports schema evolution, allowing for adaptable data structures without compromising existing data. This flexibility is key for efficiently managing and evolving spatial data over time, catering to the dynamic needs of spatial databases.
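Because Havasu inherits Iceberg's transactional metadata model, schema evolution can be expressed as ordinary DDL. A minimal sketch, assuming a table like the wherobots.test_db.taxi table created later in this post and a hypothetical tip_amt column, using standard Iceberg-style DDL in Spark SQL:

# Hypothetical column added to an existing Havasu/Iceberg table.
# Adding a column is a metadata-only transaction; existing data files are not rewritten.
sedona.sql("ALTER TABLE wherobots.test_db.taxi ADD COLUMN tip_amt DECIMAL(10,2)")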
Native Spatial Support
Havasu supports geometry/geography and raster as primary data types, and seamlessly integrates with the compute layer. Users can easily read and write spatial data using Havasu, and process spatial data using any computation engine (e.g., WherobotsDB) as long as that computation engine implements a reader/writer for Havasu.
Efficiency
Computation engines (like WherobotsDB) that incorporate Havasu can benefit from Havasu's spatial filter pushdown support, which significantly accelerates spatial range queries. Havasu allows storing spatial data in the object storage of the customer's choice, and it decouples storage from computation, which makes storing vast amounts of spatial data very cost effective. Havasu also comes equipped with an efficient storage mode for raster data in Parquet files, namely out-db storage, which enables high-throughput reading of large numbers of rasters.
Openness
Havasu is a table format with an open specification. Havasu is based on Apache Iceberg, which has an open specification and is widely adopted by the big data ecosystem. The extension to the Apache Iceberg specification is also clearly specified, so any implementation adopting this extended specification is able to read and write Havasu tables. Customers can store Havasu tables in the storage of their choice, without being tightly coupled to one specific vendor or implementation.
Havasu in a Nutshell: Key Technical Insights
The open-source Havasu specification can be found in the Wherobots Havasu documentation. The Havasu table format extends the Iceberg table spec to support managing large spatial datasets as tables, in the following ways:
- Primitive spatial data types and storage: the Havasu specification extends the Iceberg specification to support spatial data types.
- Spatial statistics: the Havasu specification extends the Iceberg manifest files to record spatial statistics.
- Spatial filter pushdown and indexing: the Havasu specification extends the Iceberg specification to support spatial filter pushdown and spatial indexing, which greatly reduces data retrieval overhead.
All other aspects of the Iceberg spec are unchanged. For example, the Havasu specification does not change the fundamental organization of table files.
Primitive spatial data types and storage
Geometry
Geometry values consist of an optional spatial reference ID (abbreviated as SRID) and a geometry shape. The SRID is a 32-bit integer that identifies the coordinate system used by the geometry shape. The interpretation of the SRID is implementation dependent; for example, it could be an EPSG code, or a code defined by a proprietary coordinate system. The geometry shape is one of the types defined by the OGC Simple Features for SQL specification, and can be stored in one of the following formats in the underlying Parquet files.
| Encoding | Parquet physical type | Logical type | Description |
|----------|-----------------------|--------------|-------------|
| ewkb | BINARY | | Extended well-known binary (EWKB) |
| wkb | BINARY | | Well-known binary (WKB) |
| wkt | BINARY | UTF8 | Well-known text (WKT) |
| geojson | BINARY | UTF8 | GeoJSON (RFC 7946: https://datatracker.ietf.org/doc/html/rfc7946) |
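To make the encodings concrete, the following sketch renders the same point in each format using Sedona's Spatial SQL conversion functions; it illustrates the encodings themselves, not how Havasu selects one for storage:

sedona.sql("""
SELECT
  ST_AsBinary(geom)  AS wkb,     -- well-known binary
  ST_AsEWKB(geom)    AS ewkb,    -- extended WKB, embeds the SRID
  ST_AsText(geom)    AS wkt,     -- well-known text
  ST_AsGeoJSON(geom) AS geojson  -- RFC 7946
FROM (SELECT ST_SetSRID(ST_GeomFromText('POINT (-73.96969 40.749244)'), 4326) AS geom) t
""").show(truncate=False)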
When the geometry column is at the root of the schema and the geometry encoding is wkb or ewkb, the application can optionally write GeoParquet metadata to the Parquet files, as defined by the GeoParquet specification.
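As a quick way to check for that metadata: the GeoParquet specification stores its JSON under the "geo" key of the Parquet file's key-value metadata. A minimal sketch using pyarrow, with a hypothetical file name:

import pyarrow.parquet as pq

# Hypothetical file name; GeoParquet metadata lives under the "geo" key.
schema = pq.read_schema("part-00000.parquet")
geo_metadata = (schema.metadata or {}).get(b"geo")
print(geo_metadata)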
Raster
A raster is one or more grids of cells, where all grids in a raster share the same width and height. The grid cells are represented by the band field. The grids are geo-referenced using an affine transformation that maps grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the crs field, and is serialized as a WKT string when stored in data files.
Havasu supports persisting raster band values in two different ways:
- in-db: The band values are stored in the same data file as the geo-referencing information, in the bands field of the raster value.
- out-db: The band values are stored in files external to Havasu tables. The raster value stored in the Havasu data file contains the geo-referencing information and the URIs of the external raster files; the URIs are stored in the bands field of the raster value.
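As a sketch of the difference, the snippet below writes the same GeoTiff both ways; the bucket path and table names are hypothetical and the target tables are assumed to exist. RS_FromGeoTiff builds an in-db raster from file content, while RS_FromPath (used again later in this post) records only a reference to the external file:

# In-db: read the GeoTiff bytes and store the band values inside the table.
indb = sedona.read.format("binaryFile").load("s3a://my-bucket/raster.tif") \
    .selectExpr("RS_FromGeoTiff(content) AS rast")
indb.writeTo("wherobots.test_db.rasters_indb").append()

# Out-db: store only the geo-referencing information and the file's URI.
outdb = sedona.sql("SELECT RS_FromPath('s3a://my-bucket/raster.tif') AS rast")
outdb.writeTo("wherobots.test_db.rasters_outdb").append()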
Spatial statistics
Havasu collects and records the spatial statistics of data files when writing data to the table. The spatial statistics include the minimum bounding rectangle (MBR) of the geometries in each data file.
Geometry bounds
The bounds of geometry values should be derived from their minimum bounding rectangles (MBRs). The MBR of a geometry value is defined as the smallest rectangle that contains it. The SRIDs of geometry values are ignored when computing MBRs. The MBRs of all geometry values in a data file should be unioned together into a single MBR, which is the MBR of the data file.
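As a quick illustration of the union rule, Sedona's aggregate envelope function computes exactly this kind of file-level MBR over a set of geometries (the inline data here is made up for the example):

sedona.sql("""
SELECT ST_Envelope_Aggr(geom) AS file_mbr
FROM (
  SELECT ST_GeomFromText('POINT (0 0)') AS geom
  UNION ALL
  SELECT ST_GeomFromText('LINESTRING (1 1, 2 3)')
) t
""").show(truncate=False)
# file_mbr: POLYGON ((0 0, 0 3, 2 3, 2 0, 0 0))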
Raster bounds
Raster bounds are the MBRs of rasters in the WGS84 coordinate system. They are computed by transforming the envelope of the raster from its native coordinate system to WGS84. Raster bounds have a special rule for handling MBRs crossing the anti-meridian. Implementations of the Havasu specification should handle MBRs crossing the anti-meridian correctly; otherwise spatial query optimizations will derive incomplete query results.
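One common convention for this, sketched below as an illustrative assumption rather than the exact rule from the Havasu specification, encodes an anti-meridian-crossing MBR with xmin > xmax and splits it into two ordinary MBRs before any intersection test:

# A minimal sketch: an (xmin, ymin, xmax, ymax) box with xmin > xmax is taken
# to wrap across the anti-meridian and is split before testing.
def split_antimeridian(mbr):
    xmin, ymin, xmax, ymax = mbr
    if xmin <= xmax:
        return [mbr]
    return [(xmin, ymin, 180.0, ymax), (-180.0, ymin, xmax, ymax)]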
Spatial filter push down and indexing
Database engines can take advantage of the spatial statistics of data files to optimize the query execution plan. For example, if the query predicate is a spatial range query, the engine can use the spatial statistics to prune the data files that do not contain any data satisfying the query predicate. This process is called spatial filter pushdown. How spatial query optimization is implemented in scan planning is implementation dependent. For example, for a spatial range query ST_Within(geom, Q), where geom is the geometry field in a Havasu table and Q is a constant geometry serving as the query window, WherobotsDB converts the spatial query predicate to an inclusive projection ST_Intersects(MBR[geom], Q), where MBR[geom] is the minimum bounding rectangle of all values of geom in a data file. WherobotsDB then evaluates this projection using the field statistics maintained in manifest files.
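A minimal, self-contained sketch of that pruning test (not WherobotsDB's actual planner code), assuming MBRs are (xmin, ymin, xmax, ymax) tuples and made-up per-file statistics:

# A data file can be skipped when its MBR does not intersect the query window's MBR.
def mbrs_intersect(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Made-up per-file MBR statistics, as they might appear in manifest entries.
file_mbrs = {
    "file-1.parquet": (-74.05, 40.60, -73.85, 40.90),  # New York area
    "file-2.parquet": (-118.7, 33.7, -117.8, 34.4),    # Los Angeles area
}
query_window_mbr = (-74.0, 40.70, -73.90, 40.80)

files_to_scan = [f for f, mbr in file_mbrs.items()
                 if mbrs_intersect(mbr, query_window_mbr)]
print(files_to_scan)  # ['file-1.parquet']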
Spatial filter pushdown works best when spatial data that are near each other are stored in the same file. Havasu provides a CREATE SPATIAL INDEX syntax for rewriting the table to sort the records by a geometry column.
CREATE SPATIAL INDEX FOR <table_name> USING hilbert(<geometry_column>, <precision>) [ WHERE <condition> ] [ OPTIONS <options> ];
This statement rewrites the data files of the table and clusters the data by the geometry column. This feature is very useful when the table contains a large amount of data and the spatial filter is very selective. For example, if the table contains 1 TB of data and the spatial filter selects only 1% of it, ideally Havasu will read only ~10 GB of data to answer the query.
Navigating Spatial Data with Havasu-Powered WherobotsDB Tables
WherobotsDB implements a reader/writer for the Havasu spatial table format. Users can perform many interesting spatial database operations on Havasu tables using WherobotsDB in Wherobots Cloud. Here we explore some common operations using WherobotsDB; for details, please read the Wherobots documentation. To follow along, create a free Wherobots Cloud account.
Create a new table in Wherobots table catalog
First, let's create a Havasu table in the wherobots table catalog. This catalog by default is configured to use your Wherobots Cloud S3 storage, but another storage location can also be specified. We'll use a dataset of taxi rides as our example.
CREATE TABLE wherobots.test_db.taxi (
pickup GEOMETRY,
Trip_Pickup_DateTime STRING,
Payment_Type STRING,
Fare_Amt DECIMAL(10,0))
USING havasu.iceberg
-- By default this table will be stored in your Wherobots Cloud S3 account
-- Optionally specify other location
-- LOCATION 's3://path/to/warehouse/test_db/taxi'
List tables in Wherobots table catalog
We can view all tables within the wherobots table catalog using SHOW TABLES:
SHOW TABLES IN wherobots.test_db
We can see the wherobots.test_db.taxi table that we just created:
+---------+-----------+-----------+
|namespace| tableName|isTemporary|
+---------+-----------+-----------+
| test_db| taxi| false|
+---------+-----------+-----------+
Describe a table in Wherobots table catalog
To view the columns and datatypes of each column we can describe the table:
DESCRIBE TABLE wherobots.test_db.taxi
Note here that our pickup column is of type geometry:
+--------------------+---------+-------+
| col_name|data_type|comment|
+--------------------+---------+-------+
| pickup| geometry| null|
|Trip_Pickup_DateTime| string| null|
| Payment_Type| string| null|
| Fare_Amt| string| null|
+--------------------+---------+-------+
Insert geometry data
We can insert data into our table using SQL. Here we specify the geometry value using the ST_GeomFromText function, which takes a WKT string, in this case describing the point geometry that represents the pickup location.
sedona.sql("""
INSERT INTO wherobots.test_db.taxi
VALUES(ST_GeomFromText('POINT (-73.96969 40.749244)'), '10/16/09 22:35', 'Credit', 42)
""")
We can also write spatial DataFrames to Havasu tables. Here we load a NYC taxi dataset into a Sedona DataFrame, then append the data to our wherobots.test_db.taxi Havasu table:
from pyspark.sql.functions import col  # needed for the isNotNull filter below

taxidf = sedona.read.format('csv').option("header", "true").option("delimiter", ",").load("s3a://wherobots-examples-prod/data/nyc-taxi-data.csv")
taxidf = taxidf.selectExpr('ST_Point(CAST(Start_Lon AS Decimal(24,20)), CAST(Start_Lat AS Decimal(24,20))) AS pickup', 'Trip_Pickup_DateTime', 'Payment_Type', 'CAST(Fare_Amt AS DECIMAL)')
taxidf = taxidf.filter(col("pickup").isNotNull())  # drop rows with no pickup location
taxidf.writeTo("wherobots.test_db.taxi").append()
Create spatial index
Creating a spatial index will rewrite the table to sort records by the geometry column. Havasu supports the hilbert index strategy, which sorts the data along a Hilbert space-filling curve, a very efficient way to order geospatial data by proximity. We can configure the precision by specifying a value for the precision parameter, which is the number of bits used to represent the Hilbert index.
sedona.sql("CREATE SPATIAL INDEX FOR wherobots.test_db.taxi USING hilbert(pickup, 10)")
Read data from Havasu table
We can query our Havasu tables using familiar SQL; when using WherobotsDB we also gain the advantage of spatial queries using Spatial SQL functions. Here we search for all taxi pickups that occurred within a certain area around a given point:
sedona.sql("""
SELECT * FROM wherobots.test_db.taxi
WHERE ST_Intersects(pickup, ST_Buffer(ST_GeomFromText('POINT (-73.96969 40.749244)'), 0.001))
""").show(truncate=False)
+----------------------------+--------------------+------------+--------+
|pickup |Trip_Pickup_DateTime|Payment_Type|Fare_Amt|
+----------------------------+--------------------+------------+--------+
|POINT (-73.96969 40.749244) |1/5/09 16:29 |Credit |9 |
|POINT (-73.969387 40.749159)|1/20/09 14:38 |CASH |7 |
|POINT (-73.969308 40.75001) |1/8/09 17:48 |CASH |11 |
|POINT (-73.969355 40.749315)|1/7/09 16:52 |CASH |10 |
|POINT (-73.970238 40.749497)|1/19/09 2:42 |Credit |45 |
|POINT (-73.969492 40.749103)|1/21/09 19:34 |Credit |15 |
|POINT (-73.970158 40.749055)|1/15/09 14:34 |CASH |8 |
|POINT (-73.969638 40.748663)|1/27/09 17:46 |CASH |9 |
|POINT (-73.970167 40.749033)|1/2/09 18:49 |CASH |8 |
|POINT (-73.97059 40.749077) |1/18/09 20:39 |Credit |10 |
|POINT (-73.970105 40.748985)|1/12/09 11:39 |CASH |9 |
|POINT (-73.970228 40.749027)|1/8/09 16:07 |CASH |5 |
|POINT (-73.9697 40.748737) |1/5/09 18:04 |Credit |6 |
|POINT (-73.970628 40.749132)|1/27/09 18:11 |CASH |10 |
|POINT (-73.969573 40.748677)|1/29/09 19:35 |CASH |5 |
|POINT (-73.969783 40.749163)|1/6/09 19:48 |Cash |8 |
|POINT (-73.969522 40.748948)|1/4/09 16:24 |CASH |5 |
|POINT (-73.969529 40.749625)|1/7/09 23:38 |CASH |7 |
|POINT (-73.969473 40.749072)|1/29/09 18:04 |CASH |16 |
|POINT (-73.970575 40.749395)|1/7/09 19:36 |CASH |8 |
+----------------------------+--------------------+------------+--------+
only showing top 20 rows
Insert out-db raster data
We can also work with raster data in Havasu tables. Here we insert raster data into a Havasu table using the out-db option. You can read more about working with raster data in Havasu tables in the documentation.
sedona.sql("SELECT RS_FromPath('s3a://XXX.tif') as rast"). \\
writeTo("wherobots.test_db.test_table").append()
Havasu-Powered Wherobots open data catalog
Wherobots collects open datasets from various data sources, then cleans and transforms them to the Havasu format to enable linking enterprise data to the real physical world. All datasets are provided for free (except AWS data transfer fees). Certain datasets are only accessible to our Professional Edition users. To learn more, please read Wherobots Open Data.
| Dataset name | Availability in Wherobots | Type | Count | Description |
|--------------|---------------------------|------|-------|-------------|
| Overture Maps buildings/building | Community edition | Polygon | 785 million | Any human-made structures with roofs or interior spaces |
| Overture Maps places/place | Community edition | Point | 59 million | Any business or point of interest within the world |
| Overture Maps admins/administrativeBoundary | Community edition | LineString | 96 thousand | Any officially defined border between two Administrative Localities |
| Overture Maps admins/locality | Community edition | Point | 2948 | Countries and hierarchical subdivisions of countries |
| Overture Maps transportation/connector | Community edition | Point | 330 million | Points of physical connection between two or more segments |
| Overture Maps transportation/segment | Community edition | LineString | 294 million | Center-line of a path which may be traveled |
| Google & Microsoft open buildings | Professional edition | Polygon | 2.5 billion | Google & Microsoft Open Buildings, combined by VIDA |
| LandSAT surface temperature | Professional edition | Raster (GeoTiff) | 166K images, 10 TB | The temperature of the Earth's surface in Kelvin, from Aug 2023 to Oct 2023 |
| US Census ZCTA codes | Professional edition | Polygon | 33144 | ZIP Code Tabulation Areas defined in 2018 |
| NYC TLC taxi trip records | Professional edition | Point | 200 million | NYC TLC taxi trip pickup and dropoff records per trip |
| Open Street Maps all nodes | Professional edition | Point | 8 billion | All the nodes of the OpenStreetMap Planet dataset |
| Open Street Maps postal codes | Professional edition | Polygon | 154 thousand | Boundaries of postal code areas as defined in OpenStreetMap |
| Weather events | Professional edition | Point | 8.6 million | Events such as rain, snow, and storms, from 2016 to 2022 |
| Wild fires | Professional edition | Point | 1.8 million | Wildfires that occurred in the United States from 1992 to 2015 |
The Wherobots open data catalog can be extremely useful when tables are combined, typically using spatial joins, to address real-world business use cases like risk analysis, site selection, fleet vehicle optimization, and answering other business intelligence questions.
Spatial join query to find zones prone to wild fires
Let's see how we can make use of the Wherobots open data catalog using Havasu tables to perform a spatial join operation that finds the US zipcode regions experiencing the most wildfires. To do this we will use the wherobots_open_data.us_census.zipcode Havasu table, which contains the polygon geometries of US zipcodes, and wherobots_open_data.weather.wild_fires, which contains point geometries of wildfire events.
We perform a spatial join operation using the ST_Intersects spatial SQL function to define a predicate that joins fires to the zipcodes in which they occurred.
fire_zone = sedona.sql(
"""
SELECT
z.geometry as zipgeom,
z.ZCTA5CE10 as zipcode,
f.FIRE_NAME
FROM
wherobots_open_data.us_census.zipcode z,
wherobots_open_data.weather.wild_fires f
WHERE
ST_Intersects(z.geometry, f.geometry)
"""
)
We can then group this data by zipcode to find the count of fires that occurred in each zipcode and visualize the results, as sketched below. This type of analysis can be useful for risk analysis and insurance premium pricing.
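A minimal sketch of that aggregation, using the fire_zone DataFrame created above with standard PySpark DataFrame operations (the visualization step is left out):

# Count fires per zipcode and list the most fire-prone zipcodes first.
fires_per_zip = fire_zone.groupBy("zipcode").count().orderBy("count", ascending=False)
fires_per_zip.show(10)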
Resources
Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter.