Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data
We’re excited to introduce Apache Sedona‘s 1.5 series, a leap forward in geospatial processing that brings essential features and enhancements. Since our last blog post on the Sedona 1.4 series, Apache Sedona has evolved significantly. The 1.5 series, launched with versions 1.5.0 in October 2023 and 1.5.1 in January 2024, marks a period of substantial advancement. These developments aim to transform Sedona into a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data handling across both batch and streaming data. You can learn more about the release from our release notes, website and GitHub repository.
XYZM coordinates and SRID
The concepts of X, Y, Z, and M in geospatial data, along with vector data, are foundational to understanding how geographic information is represented and analyzed in Geographic Information Systems (GIS).
- X typically refers to the longitude
- Y typically refers to the latitude
- Z adds depth or elevation to the spatial representation, which is critical for conducting 3D modeling for urban planning, architecture, and environmental studies.
- M stands for a measure, which can be time, distance, or other quantitative attribute that enhances the analysis, such as speed, weight, or cost.
SRID stands for Spatial Reference System Identifier. It’s a unique identifier which typically references a registry or a catalog of spatial reference systems, such as the EPSG (European Petroleum Survey Group) Geodetic Parameter Registry.
As of today, PostGIS is the most popular open-source database engine that offers full support of XYZM and SRID. It comes up with a standard called EWKB (Extended WKB) and EWKT (Extended WKT) to encode and decode the XYZM and SRID information of geometries. In contrast, Sedona versions 1.4 and earlier, along with several cloud database platforms, have shown limitations in handling XYZM coordinates and SRID comprehensively. These limitations typically manifest in several ways
- The geometry constructors do not understand EWKT and EWKB
- No geometry editors or accessors for Z, M dimensions and SRID information
- No geometry outputs for for Z, M dimensions and SRID information
- Z and M values silently go missing even if they are explicitly set by users
In Sedona 1.5 series, the community has revised several Sedona components such as the serializer and geometry functions to uphold the full support of XYZM and SRID. Now let’s look at an example in Sedona 1.5.1.
Spatial SQL query
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT ZM(1 2 3 4)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT Z(1 2 3)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT M(1 2 3)'))
Output
SRID=4269;POINT ZM(1 2 3 4)
SRID=4269;POINT Z(1 2 3)
SRID=4269;POINT M(1 2 3)
As shown in the example, Sedona can read and write EWKT geometries. The Z and M values and SRID information are always preserved.
Users can also perform some operations on the Z dimension. For example,
Spatial SQL query
SELECT ST_3DDistance(ST_GeomFromEWKT('POINT Z(1 1 1)'), ST_GeomFromEWKT('POINT Z(2 2 2)'))
Output
1.7320508075688772
Vector and raster join
In GIS, spatial data is primarily categorized into two types: vector data and raster data.
- Vector data: represents geographical features using points, lines, and polygons. These elements are defined by X and Y coordinates that describe their shape and position in space. Common vector formats include WKT, WKB, Shapefile (.shp), and GeoJSON (.geojson)
- Raster data: represents geographic information as a grid of cells (similar to images), with each cell holding a specific value. Raster data is well-suited for representing continuous phenomena, such as temperature gradients, elevation surfaces, or satellite imagery. Each raster data ofter comes with georeference information and a bounding box which aligns the image with a spatially correct geographic region. Common raster formats include GeoTiff (.tiff), and NetCDF (.nc).
Finding a scalable spatial database system that effectively handles both vector and raster data is challenging. However, the integration of these data types has significant real-world applications, such as overlaying zoning boundaries on satellite images for urban planning or combining vector-based river networks with raster-based elevation models for environmental analysis. This capability is essential for assessing land use compliance, planning development, and studying environmental phenomena like watershed dynamics and flood risks.
With the Sedona 1.5 series, joining massive vector and raster datasets has become feasible, opening the door to discovering more profound insights.
Let’s use the LandSAT LST (Land Surface Temperature) dataset as an example. Each raster image in this dataset describes the temperature of a specific region. We can join them with the zip code zones of the earth to calculate the average temperature per city.
Let’s first load this dataset to Sedona and visualize the raster DataFrame as follows:
from sedona.spark import *
imageDf = sedona.sql("SELECT RS_AsImage(rast, 100), datetime FROM landsat_lst")
SedonaUtils.display_image(imageDf)
The output looks like this
Let’s read the the zip code zones of the earth into Sedona and print the content:
SELECT ZCTA5CE10, geometry FROM zipcode
The output is as follows:
+---------+--------------------+
|ZCTA5CE10| geometry|
+---------+--------------------+
| 36522|POLYGON ((-88.432...|
| 36521|POLYGON ((-88.342...|
| 36548|POLYGON ((-88.037...|
| 36585|POLYGON ((-88.169...|
| 36583|POLYGON ((-88.263...|
+---------+--------------------+
Now let’s join the 2 dataset together using RS_Intersects
and the join condition is that if a zip code zone intersects a raster image.
SELECT ZCTA5CE10, geometry, rast, datetime
FROM wherobots_open_data.us_census.zipcode z, landsat_lst l
WHERE RS_Intersects(z.geometry, l.rast)
The output is as follows:
+---------+--------------------+--------------------+--------------------+
|ZCTA5CE10| geometry| rast| datetime|
+---------+--------------------+--------------------+--------------------+
| 85356|POLYGON ((-114.34...|GridCoverage2D[""...|2023-09-01 18:10:...|
| 85364|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
| 92250|POLYGON ((-115.45...|GridCoverage2D[""...|2023-09-01 18:10:...|
| 85365|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
| 85347|POLYGON ((-114.16...|GridCoverage2D[""...|2023-09-01 18:10:...|
+---------+--------------------+--------------------+--------------------+
Now let’s calculate the temperature of each zone using RS_ZonalStats
SELECT ZCTA5CE10, RS_ZonalStats(rast, geometry, 'avg') as avg
FROM zone_temp
The output is as follows:
+---------+------------------+
|ZCTA5CE10| avg|
+---------+------------------+
| 85350| 85.66846905011799|
| 85350|111.90322619887085|
| 85365| 43.69896389654749|
| 85350|117.59570757880628|
| 85365|120.40936393442645|
+---------+------------------+
Flexible raster data manipulation
The Sedona 1.5 series introduces additional functions to address the challenge of raster data from remote sensing devices not being in the preferred format for analysis. These new features are designed to streamline the process of transforming and analyzing such data. Let’s explore a few interesting examples.
RS_Clip
It returns a raster that is clipped by the given geometry
Original raster
SELECT RS_Clip(
RS_FromGeoTiff(content), 1,
ST_GeomFromWKT('POLYGON ((236722 4204770, 243900 4204770, 243900 4197590, 221170 4197590, 236722 4204770))'),
200, false
)
Output
RS_Resample
It resample a raster using a given resampling algorithm and new dimensions (width and height), a new grid corner to pivot the raster at (gridX and gridY) and a set of georeferencing attributes (scaleX and scaleY)
Original raster
SELECT RS_Resample(raster, 1000, 1000, false, 'NearestNeighbor') as resampled_raster
FROM raster_table
Output
RS_AsRaster
Converts a Geometry to a Raster object
SELECT RS_AsRaster(ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'), r.raster, 'b', 230) as rasterized_geom
FROM raster_table r
Output
RS_MapAlgebra
Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.
Let’s still use the rasterized geometry above and convert it to a white background.
SELECT RS_MapAlgebra(rasterized_geom, NULL, 'out[0] = rast[0] == 0 ? 230 : 0;') AS raster
FROM rasterized_table
Output
SedonaKepler and SedonaPyDeck
Sedona introduced SedonaKepler and SedonaPyDeck to integrate with KeplerGL and DeckGL, aiming to enhance geospatial data visualization. These integrations allow users to utilize advanced mapping and visualization features directly from Sedona’s spatial data processing results.
One can simply run the following command to visualize a Sedona DataFrame using SedonaKepler and SedonaPyDeck
from sedona.spark import *
SedonaKepler.create_map(df=groupedresult, name="AirportCount")
SedonaPyDeck.create_choropleth_map(df=groupedresult, plot_col='AirportCount')
SedonaPyDeck.create_geometry_map(df_building, elevation_col='height')
GeoParquet reader
Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. It can read any GeoParquet files written by third-party systems such as GeoPandas. Most importantly, Sedona will use the [bbox](https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox)
properties in the metadata to determine if all data in the file will be discarded by the spatial predicate. This optimization could reduce the number of files scanned when the queried GeoParquet dataset was partitioned by spatial proximity.
df = sedona.read.format("geoparquet").load("s3a://my/path/*.parquet")
.filter("ST_Contains(ST_GeomFromWKT('POLYGON (XXX)'))")
A user can also inspect the schema of a GeoParquet file.
df.printSchema()
Output
root
|-- pop_est: long (nullable = true)
|-- continent: string (nullable = true)
|-- name: string (nullable = true)
|-- iso_a3: string (nullable = true)
|-- gdp_md_est: double (nullable = true)
|-- geometry: geometry (nullable = true)
A user can also inspect the metadata of a GeoParquet file.
df = sedona.read.format("geoparquet.metadata").load("s3a://my/path/*.parquet")
df.show()
df.printSchema()
Output of the GeoParquet metadata
+------------+--------------+------------------------------------------------------------------------------------------+-------+
|version |primary_column|columns |geohash|
+------------+--------------+------------------------------------------------------------------------------------------+-------+
|1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-45.0, 45.0038534, -33.75027, 50.5347176], null}} |g0 |
|1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-45.0, 33.8141064, -33.7678528, 39.3735873], null}} |en |
|1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-44.8187256, -56.1774766, -34.1630816, -50.8584471], null}} |5n |
|1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-112.2099566, -33.7316387, -101.2729317, -28.2008956], null}}|3d |
|1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-33.2666016, -72.9196355, -22.6803555, -67.5007358], null}} |57 |
+------------+--------------+------------------------------------------------------------------------------------------+-------+
Output of the GeoParquet metadata
root
|-- version: string (nullable = true)
|-- primary_column: string (nullable = true)
|-- columns: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- encoding: string (nullable = true)
| | |-- geometry_types: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- bbox: array (nullable = true)
| | | |-- element: double (containsNull = true)
| | |-- crs: string (nullable = true)
|-- geohash: string (nullable = true)
GeoParquet reader legacy mode
Due to a breaking change in Apache Sedona 1.4.0 to the SQL type of GeometryUDT
(SEDONA-205) as well as the serialization format of geometry values (SEDONA-207), Parquet files containing geometry columns written by Apache Sedona 1.3.1 or earlier cannot be read by Apache Sedona 1.4.0 or later.
For parquet files written by "parquet"
format when using Apache Sedona 1.3.1-incubating or earlier:
df.write.format("parquet").save("path/to/parquet/files")
Since v1.5.1, GeoParquet supports reading legacy Parquet files. you can use "geoparquet"
format with the .option("legacyMode", "true")
option.
df = sedona.read.format("geoparquet").option("legacyMode", "true").load("path/to/legacy-parquet-files")
GeoParquet writer
Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. Moreover, a user can also control the metadata written to the GeoParquet files.
df.write.format("geoparquet")
.option("geoparquet.version", "1.0.0")
.option("geoparquet.crs", projjson)
.save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")
To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see ST_GeoHash) and then save as a GeoParquet file. An example is as follows:
SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash
FROM spatialDf
ORDER BY geohash
The following figure is the visualization of the bbox of Sedona GeoParquet files ordered by GeoHash.
Uber H3 IDs
Uber H3 is a technique to model and index and geospatial data. H3 grids the earth surface by casting it on a icosahedron, and tiles it by hexagons + pentagons.
Sedona 1.5 now provides the following functions related to Uber H3:
- ST_H3CellIDs(geom: geometry, level: Int, fullCover: true)
- ST_H3CellDistance(cell1: Long, cell2: Long)
- ST_H3KRing(cell: Long, k: Int, exactRing: Boolean)
- ST_H3ToGeom(cells: Array[Long])
Let’s use the Seattle road network dataset (from OSM) as an example.
You can create H3 cell ids using ST_H3CellIds
as follows.
roadDf = sedona.read.format("csv").option("header", "true").load(PATH_PREFIX + "data/OSM2015_roads.csv")
roadDf = roadDf.selectExpr(
"monotonically_increasing_id() as id",
"ST_GeomFromWkt(geometry) as road", "`attr#1` as attr",
"ST_H3ToGeom(ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false)) as roadH3Geom",
"ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false) as roadH3"
)
Output
+---+--------------------+--------------------+--------------------+--------------------+
| id| road| attr| roadH3Geom| roadH3|
+---+--------------------+--------------------+--------------------+--------------------+
| 0|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966219243...|
| 1|LINESTRING (-122....| [service|alley|MULTIPOLYGON (((-...|[6222150966281830...|
| 2|LINESTRING (-122....| [surface|asphalt|MULTIPOLYGON (((-...|[6222150966517759...|
| 3|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966243164...|
| 4|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966286090...|
| 5|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966439772...|
| 6|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966722559...|
| 7|LINESTRING (-122....| [highway|service]|MULTIPOLYGON (((-...|[622215096642338815]|
| 8|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966830039...|
| 9|LINESTRING (-122....| [service|alley|MULTIPOLYGON (((-...|[6222150959224913...|
+---+--------------------+--------------------+--------------------+--------------------+
We can visualize H3 cells using GeoPandas Plot
pandasDf = roadDf.limit(100).toPandas()
roadGpd = gpd.GeoDataFrame(pandasDf, geometry="road")
h3Gpd = gpd.GeoDataFrame(pandasDf, geometry="roadH3Geom")
ax = h3Gpd.plot(edgecolor='gray', linewidth=1.0)
ax.set_xlabel('Longitude (degrees)')
ax.set_ylabel('Latitude (degrees)')
roadGpd.plot(ax=ax, edgecolor='black', linewidth=1.0)
You can also use Sedona H3 functions together with SedonaPyDeck for better visualization. Please read our blog post: Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet
New cluster compute engine support
Sedona 1.5 introduces the addition of support for Snowflake, compatibility with Spark versions up to 3.5 to enhance geospatial analysis. These updates facilitate advanced spatial data processing and analysis in distributed environments, leveraging the latest Spark capabilities and integrating with Snowflake for cloud-based data warehousing solutions.
Community contributions and acknowledgements
The Sedona 1.5.0 and 1.5.1 releases will not be possible without the help of the Apache Sedona community. We thank the contribution from the following community members:
jiayuasu, jbampton, hongbo-miao, Kontinuation, furqaankhan, iGN5117, prantogg, duhaode520, gregleleu, MyEnthusiastic, wendrickje, yyy1000
Future directions
Apache Sedona has gained significant growth in the past few years. We are excited for an electrifying future with Sedona as it gears up to revolutionize geospatial processing. Here are future directions of Apache Sedona.
- Adding more cloud platform integrations in addition to the current Spark, Flink and Snowflake support
- Developing advanced raster and vector data processing features.
- Improving performance and scalability for large datasets.
- Adding more comprehensive geospatial functions.
- Increasing compatibility with the broader spatial data ecosystem such as QGIS, Lonboard, Rasterio, NumPy, Ibis
- Strengthening documentation and community engagement for wider adoption.