9 Mins Read

8 Feb 2024

Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data

Authors

Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data

We’re excited to introduce Apache Sedona‘s 1.5 series, a leap forward in geospatial processing that brings essential features and enhancements. Since our last blog post on the Sedona 1.4 series, Apache Sedona has evolved significantly. The 1.5 series, launched with versions 1.5.0 in October 2023 and 1.5.1 in January 2024, marks a period of substantial advancement. These developments aim to transform Sedona into a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data handling across both batch and streaming data. You can learn more about the release from our release notes, website and GitHub repository.

XYZM coordinates and SRID

The concepts of X, Y, Z, and M in geospatial data, along with vector data, are foundational to understanding how geographic information is represented and analyzed in Geographic Information Systems (GIS).

X typically refers to the longitude
Y typically refers to the latitude
Z adds depth or elevation to the spatial representation, which is critical for conducting 3D modeling for urban planning, architecture, and environmental studies.
M stands for a measure, which can be time, distance, or other quantitative attribute that enhances the analysis, such as speed, weight, or cost.

SRID stands for Spatial Reference System Identifier. It’s a unique identifier which typically references a registry or a catalog of spatial reference systems, such as the EPSG (European Petroleum Survey Group) Geodetic Parameter Registry.

As of today, PostGIS is the most popular open-source database engine that offers full support of XYZM and SRID. It comes up with a standard called EWKB (Extended WKB) and EWKT (Extended WKT) to encode and decode the XYZM and SRID information of geometries. In contrast, Sedona versions 1.4 and earlier, along with several cloud database platforms, have shown limitations in handling XYZM coordinates and SRID comprehensively. These limitations typically manifest in several ways

The geometry constructors do not understand EWKT and EWKB
No geometry editors or accessors for Z, M dimensions and SRID information
No geometry outputs for for Z, M dimensions and SRID information
Z and M values silently go missing even if they are explicitly set by users

In Sedona 1.5 series, the community has revised several Sedona components such as the serializer and geometry functions to uphold the full support of XYZM and SRID. Now let’s look at an example in Sedona 1.5.1.

Spatial SQL query

SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT ZM(1 2 3 4)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT Z(1 2 3)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT M(1 2 3)'))

Output

SRID=4269;POINT ZM(1 2 3 4)
SRID=4269;POINT Z(1 2 3)
SRID=4269;POINT M(1 2 3)

As shown in the example, Sedona can read and write EWKT geometries. The Z and M values and SRID information are always preserved.

Users can also perform some operations on the Z dimension. For example,

Spatial SQL query

SELECT ST_3DDistance(ST_GeomFromEWKT('POINT Z(1 1 1)'), ST_GeomFromEWKT('POINT Z(2 2 2)'))

Output

1.7320508075688772

Vector and raster join

In GIS, spatial data is primarily categorized into two types: vector data and raster data.

Vector data: represents geographical features using points, lines, and polygons. These elements are defined by X and Y coordinates that describe their shape and position in space. Common vector formats include WKT, WKB, Shapefile (.shp), and GeoJSON (.geojson)
Raster data: represents geographic information as a grid of cells (similar to images), with each cell holding a specific value. Raster data is well-suited for representing continuous phenomena, such as temperature gradients, elevation surfaces, or satellite imagery. Each raster data ofter comes with georeference information and a bounding box which aligns the image with a spatially correct geographic region. Common raster formats include GeoTiff (.tiff), and NetCDF (.nc).

Finding a scalable spatial database system that effectively handles both vector and raster data is challenging. However, the integration of these data types has significant real-world applications, such as overlaying zoning boundaries on satellite images for urban planning or combining vector-based river networks with raster-based elevation models for environmental analysis. This capability is essential for assessing land use compliance, planning development, and studying environmental phenomena like watershed dynamics and flood risks.

With the Sedona 1.5 series, joining massive vector and raster datasets has become feasible, opening the door to discovering more profound insights.

Let’s use the LandSAT LST (Land Surface Temperature) dataset as an example. Each raster image in this dataset describes the temperature of a specific region. We can join them with the zip code zones of the earth to calculate the average temperature per city.

Let’s first load this dataset to Sedona and visualize the raster DataFrame as follows:

from sedona.spark import *
imageDf = sedona.sql("SELECT RS_AsImage(rast, 100), datetime FROM landsat_lst")
SedonaUtils.display_image(imageDf)

The output looks like this

Let’s read the the zip code zones of the earth into Sedona and print the content:

SELECT ZCTA5CE10, geometry FROM zipcode

The output is as follows:

+---------+--------------------+
|ZCTA5CE10|            geometry|
+---------+--------------------+
|    36522|POLYGON ((-88.432...|
|    36521|POLYGON ((-88.342...|
|    36548|POLYGON ((-88.037...|
|    36585|POLYGON ((-88.169...|
|    36583|POLYGON ((-88.263...|
+---------+--------------------+

Now let’s join the 2 dataset together using RS_Intersects and the join condition is that if a zip code zone intersects a raster image.

SELECT ZCTA5CE10, geometry, rast, datetime
FROM wherobots_open_data.us_census.zipcode z, landsat_lst l
WHERE RS_Intersects(z.geometry, l.rast)

The output is as follows:

+---------+--------------------+--------------------+--------------------+
|ZCTA5CE10|            geometry|                rast|            datetime|
+---------+--------------------+--------------------+--------------------+
|    85356|POLYGON ((-114.34...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85364|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    92250|POLYGON ((-115.45...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85365|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85347|POLYGON ((-114.16...|GridCoverage2D[""...|2023-09-01 18:10:...|
+---------+--------------------+--------------------+--------------------+

Now let’s calculate the temperature of each zone using RS_ZonalStats

SELECT ZCTA5CE10, RS_ZonalStats(rast, geometry, 'avg') as avg
FROM zone_temp

The output is as follows:

+---------+------------------+
|ZCTA5CE10|               avg|
+---------+------------------+
|    85350| 85.66846905011799|
|    85350|111.90322619887085|
|    85365| 43.69896389654749|
|    85350|117.59570757880628|
|    85365|120.40936393442645|
+---------+------------------+

Flexible raster data manipulation

The Sedona 1.5 series introduces additional functions to address the challenge of raster data from remote sensing devices not being in the preferred format for analysis. These new features are designed to streamline the process of transforming and analyzing such data. Let’s explore a few interesting examples.

RS_Clip

It returns a raster that is clipped by the given geometry

Original raster

SELECT RS_Clip(
        RS_FromGeoTiff(content), 1,
        ST_GeomFromWKT('POLYGON ((236722 4204770, 243900 4204770, 243900 4197590, 221170 4197590, 236722 4204770))'),
        200, false
    )

Output

RS_Resample

It resample a raster using a given resampling algorithm and new dimensions (width and height), a new grid corner to pivot the raster at (gridX and gridY) and a set of georeferencing attributes (scaleX and scaleY)

Original raster

SELECT RS_Resample(raster, 1000, 1000, false, 'NearestNeighbor') as resampled_raster
FROM raster_table

Output

RS_AsRaster

Converts a Geometry to a Raster object

SELECT RS_AsRaster(ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'), r.raster, 'b', 230) as rasterized_geom
FROM raster_table r

Output

RS_MapAlgebra

Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.

Let’s still use the rasterized geometry above and convert it to a white background.

SELECT RS_MapAlgebra(rasterized_geom, NULL, 'out[0] = rast[0] == 0 ? 230 : 0;') AS raster
FROM rasterized_table

Output

SedonaKepler and SedonaPyDeck

Sedona introduced SedonaKepler and SedonaPyDeck to integrate with KeplerGL and DeckGL, aiming to enhance geospatial data visualization. These integrations allow users to utilize advanced mapping and visualization features directly from Sedona’s spatial data processing results.

One can simply run the following command to visualize a Sedona DataFrame using SedonaKepler and SedonaPyDeck

from sedona.spark import *
SedonaKepler.create_map(df=groupedresult, name="AirportCount")
SedonaPyDeck.create_choropleth_map(df=groupedresult, plot_col='AirportCount')
SedonaPyDeck.create_geometry_map(df_building, elevation_col='height')

GeoParquet reader

Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. It can read any GeoParquet files written by third-party systems such as GeoPandas. Most importantly, Sedona will use the [bbox](https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox) properties in the metadata to determine if all data in the file will be discarded by the spatial predicate. This optimization could reduce the number of files scanned when the queried GeoParquet dataset was partitioned by spatial proximity.

df = sedona.read.format("geoparquet").load("s3a://my/path/*.parquet")
    .filter("ST_Contains(ST_GeomFromWKT('POLYGON (XXX)'))")

A user can also inspect the schema of a GeoParquet file.

df.printSchema()

Output

root
 |-- pop_est: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- name: string (nullable = true)
 |-- iso_a3: string (nullable = true)
 |-- gdp_md_est: double (nullable = true)
 |-- geometry: geometry (nullable = true)

A user can also inspect the metadata of a GeoParquet file.

df = sedona.read.format("geoparquet.metadata").load("s3a://my/path/*.parquet")
df.show()
df.printSchema()

Output of the GeoParquet metadata

+------------+--------------+------------------------------------------------------------------------------------------+-------+
|version     |primary_column|columns                                                                                   |geohash|
+------------+--------------+------------------------------------------------------------------------------------------+-------+
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 45.0038534, -33.75027, 50.5347176], null}}            |g0     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 33.8141064, -33.7678528, 39.3735873], null}}          |en     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-44.8187256, -56.1774766, -34.1630816, -50.8584471], null}}  |5n     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-112.2099566, -33.7316387, -101.2729317, -28.2008956], null}}|3d     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-33.2666016, -72.9196355, -22.6803555, -67.5007358], null}}  |57     |
+------------+--------------+------------------------------------------------------------------------------------------+-------+

Output of the GeoParquet metadata

root
|-- version: string (nullable = true)
|-- primary_column: string (nullable = true)
|-- columns: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- encoding: string (nullable = true)
| | |-- geometry_types: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- bbox: array (nullable = true)
| | | |-- element: double (containsNull = true)
| | |-- crs: string (nullable = true)
|-- geohash: string (nullable = true)

GeoParquet reader legacy mode

Due to a breaking change in Apache Sedona 1.4.0 to the SQL type of GeometryUDT (SEDONA-205) as well as the serialization format of geometry values (SEDONA-207), Parquet files containing geometry columns written by Apache Sedona 1.3.1 or earlier cannot be read by Apache Sedona 1.4.0 or later.

For parquet files written by "parquet" format when using Apache Sedona 1.3.1-incubating or earlier:

df.write.format("parquet").save("path/to/parquet/files")

Since v1.5.1, GeoParquet supports reading legacy Parquet files. you can use "geoparquet" format with the .option("legacyMode", "true") option.

df = sedona.read.format("geoparquet").option("legacyMode", "true").load("path/to/legacy-parquet-files")

GeoParquet writer

Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. Moreover, a user can also control the metadata written to the GeoParquet files.

df.write.format("geoparquet")
        .option("geoparquet.version", "1.0.0")
        .option("geoparquet.crs", projjson)
        .save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")

To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see ST_GeoHash) and then save as a GeoParquet file. An example is as follows:

SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash
FROM spatialDf
ORDER BY geohash

The following figure is the visualization of the bbox of Sedona GeoParquet files ordered by GeoHash.

Uber H3 IDs

Uber H3 is a technique to model and index and geospatial data. H3 grids the earth surface by casting it on a icosahedron, and tiles it by hexagons + pentagons.

Sedona 1.5 now provides the following functions related to Uber H3:

ST_H3CellIDs(geom: geometry, level: Int, fullCover: true)
ST_H3CellDistance(cell1: Long, cell2: Long)
ST_H3KRing(cell: Long, k: Int, exactRing: Boolean)
ST_H3ToGeom(cells: Array[Long])

Let’s use the Seattle road network dataset (from OSM) as an example.

You can create H3 cell ids using ST_H3CellIds as follows.

roadDf = sedona.read.format("csv").option("header", "true").load(PATH_PREFIX + "data/OSM2015_roads.csv")
roadDf = roadDf.selectExpr(
    "monotonically_increasing_id() as id",
    "ST_GeomFromWkt(geometry) as road", "`attr#1` as attr",
    "ST_H3ToGeom(ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false)) as roadH3Geom",
    "ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false) as roadH3"
)

Output

+---+--------------------+--------------------+--------------------+--------------------+
| id|                road|                attr|          roadH3Geom|              roadH3|
+---+--------------------+--------------------+--------------------+--------------------+
|  0|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966219243...|
|  1|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150966281830...|
|  2|LINESTRING (-122....|    [surface|asphalt|MULTIPOLYGON (((-...|[6222150966517759...|
|  3|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966243164...|
|  4|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966286090...|
|  5|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966439772...|
|  6|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966722559...|
|  7|LINESTRING (-122....|   [highway|service]|MULTIPOLYGON (((-...|[622215096642338815]|
|  8|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966830039...|
|  9|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150959224913...|
+---+--------------------+--------------------+--------------------+--------------------+

We can visualize H3 cells using GeoPandas Plot

pandasDf = roadDf.limit(100).toPandas()
roadGpd = gpd.GeoDataFrame(pandasDf, geometry="road")
h3Gpd = gpd.GeoDataFrame(pandasDf, geometry="roadH3Geom")
ax = h3Gpd.plot(edgecolor='gray', linewidth=1.0)
ax.set_xlabel('Longitude (degrees)')
ax.set_ylabel('Latitude (degrees)')
roadGpd.plot(ax=ax, edgecolor='black', linewidth=1.0)

You can also use Sedona H3 functions together with SedonaPyDeck for better visualization. Please read our blog post: Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet

New cluster compute engine support

Sedona 1.5 introduces the addition of support for Snowflake, compatibility with Spark versions up to 3.5 to enhance geospatial analysis. These updates facilitate advanced spatial data processing and analysis in distributed environments, leveraging the latest Spark capabilities and integrating with Snowflake for cloud-based data warehousing solutions.

Community contributions and acknowledgements

The Sedona 1.5.0 and 1.5.1 releases will not be possible without the help of the Apache Sedona community. We thank the contribution from the following community members:

jiayuasu, jbampton, hongbo-miao, Kontinuation, furqaankhan, iGN5117, prantogg, duhaode520, gregleleu, MyEnthusiastic, wendrickje, yyy1000

Future directions

Apache Sedona has gained significant growth in the past few years. We are excited for an electrifying future with Sedona as it gears up to revolutionize geospatial processing. Here are future directions of Apache Sedona.

Adding more cloud platform integrations in addition to the current Spark, Flink and Snowflake support
Developing advanced raster and vector data processing features.
Improving performance and scalability for large datasets.
Adding more comprehensive geospatial functions.
Increasing compatibility with the broader spatial data ecosystem such as QGIS, Lonboard, Rasterio, NumPy, Ibis
Strengthening documentation and community engagement for wider adoption.

O'Reilly Apache Sedona Cloud Native Geospatial with Apache Sedona book cover

Interested in learning more about Apache Sedona? Download this hands-on guide with real-world examples for working with large-scale spatial data.

Get practical solutions for spatial data challenges

Access Now

Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data

Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data

XYZM coordinates and SRID

Vector and raster join

Flexible raster data manipulation

RS_Clip

RS_Resample

RS_AsRaster

RS_MapAlgebra

SedonaKepler and SedonaPyDeck

GeoParquet reader

GeoParquet reader legacy mode

GeoParquet writer

Uber H3 IDs

New cluster compute engine support

Community contributions and acknowledgements

Future directions

RELATED POSTS