Planetary-scale answers, unlocked.
A Hands-On Guide for Working with Large-Scale Spatial Data. Learn more.
Authors
We’re excited to introduce Apache Sedona‘s 1.5 series, a leap forward in geospatial processing that brings essential features and enhancements. Since our last blog post on the Sedona 1.4 series, Apache Sedona has evolved significantly. The 1.5 series, launched with versions 1.5.0 in October 2023 and 1.5.1 in January 2024, marks a period of substantial advancement. These developments aim to transform Sedona into a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data handling across both batch and streaming data. You can learn more about the release from our release notes, website and GitHub repository.
The concepts of X, Y, Z, and M in geospatial data, along with vector data, are foundational to understanding how geographic information is represented and analyzed in Geographic Information Systems (GIS).
SRID stands for Spatial Reference System Identifier. It’s a unique identifier which typically references a registry or a catalog of spatial reference systems, such as the EPSG (European Petroleum Survey Group) Geodetic Parameter Registry.
As of today, PostGIS is the most popular open-source database engine that offers full support of XYZM and SRID. It comes up with a standard called EWKB (Extended WKB) and EWKT (Extended WKT) to encode and decode the XYZM and SRID information of geometries. In contrast, Sedona versions 1.4 and earlier, along with several cloud database platforms, have shown limitations in handling XYZM coordinates and SRID comprehensively. These limitations typically manifest in several ways
In Sedona 1.5 series, the community has revised several Sedona components such as the serializer and geometry functions to uphold the full support of XYZM and SRID. Now let’s look at an example in Sedona 1.5.1.
Spatial SQL query
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT ZM(1 2 3 4)')) SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT Z(1 2 3)')) SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT M(1 2 3)'))
Output
SRID=4269;POINT ZM(1 2 3 4) SRID=4269;POINT Z(1 2 3) SRID=4269;POINT M(1 2 3)
As shown in the example, Sedona can read and write EWKT geometries. The Z and M values and SRID information are always preserved.
Users can also perform some operations on the Z dimension. For example,
SELECT ST_3DDistance(ST_GeomFromEWKT('POINT Z(1 1 1)'), ST_GeomFromEWKT('POINT Z(2 2 2)'))
1.7320508075688772
In GIS, spatial data is primarily categorized into two types: vector data and raster data.
Finding a scalable spatial database system that effectively handles both vector and raster data is challenging. However, the integration of these data types has significant real-world applications, such as overlaying zoning boundaries on satellite images for urban planning or combining vector-based river networks with raster-based elevation models for environmental analysis. This capability is essential for assessing land use compliance, planning development, and studying environmental phenomena like watershed dynamics and flood risks.
With the Sedona 1.5 series, joining massive vector and raster datasets has become feasible, opening the door to discovering more profound insights.
Let’s use the LandSAT LST (Land Surface Temperature) dataset as an example. Each raster image in this dataset describes the temperature of a specific region. We can join them with the zip code zones of the earth to calculate the average temperature per city.
Let’s first load this dataset to Sedona and visualize the raster DataFrame as follows:
from sedona.spark import * imageDf = sedona.sql("SELECT RS_AsImage(rast, 100), datetime FROM landsat_lst") SedonaUtils.display_image(imageDf)
The output looks like this
Let’s read the the zip code zones of the earth into Sedona and print the content:
SELECT ZCTA5CE10, geometry FROM zipcode
The output is as follows:
+---------+--------------------+ |ZCTA5CE10| geometry| +---------+--------------------+ | 36522|POLYGON ((-88.432...| | 36521|POLYGON ((-88.342...| | 36548|POLYGON ((-88.037...| | 36585|POLYGON ((-88.169...| | 36583|POLYGON ((-88.263...| +---------+--------------------+
Now let’s join the 2 dataset together using RS_Intersects and the join condition is that if a zip code zone intersects a raster image.
RS_Intersects
SELECT ZCTA5CE10, geometry, rast, datetime FROM wherobots_open_data.us_census.zipcode z, landsat_lst l WHERE RS_Intersects(z.geometry, l.rast)
+---------+--------------------+--------------------+--------------------+ |ZCTA5CE10| geometry| rast| datetime| +---------+--------------------+--------------------+--------------------+ | 85356|POLYGON ((-114.34...|GridCoverage2D[""...|2023-09-01 18:10:...| | 85364|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...| | 92250|POLYGON ((-115.45...|GridCoverage2D[""...|2023-09-01 18:10:...| | 85365|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...| | 85347|POLYGON ((-114.16...|GridCoverage2D[""...|2023-09-01 18:10:...| +---------+--------------------+--------------------+--------------------+
Now let’s calculate the temperature of each zone using RS_ZonalStats
RS_ZonalStats
SELECT ZCTA5CE10, RS_ZonalStats(rast, geometry, 'avg') as avg FROM zone_temp
+---------+------------------+ |ZCTA5CE10| avg| +---------+------------------+ | 85350| 85.66846905011799| | 85350|111.90322619887085| | 85365| 43.69896389654749| | 85350|117.59570757880628| | 85365|120.40936393442645| +---------+------------------+
The Sedona 1.5 series introduces additional functions to address the challenge of raster data from remote sensing devices not being in the preferred format for analysis. These new features are designed to streamline the process of transforming and analyzing such data. Let’s explore a few interesting examples.
It returns a raster that is clipped by the given geometry
Original raster
SELECT RS_Clip( RS_FromGeoTiff(content), 1, ST_GeomFromWKT('POLYGON ((236722 4204770, 243900 4204770, 243900 4197590, 221170 4197590, 236722 4204770))'), 200, false )
It resample a raster using a given resampling algorithm and new dimensions (width and height), a new grid corner to pivot the raster at (gridX and gridY) and a set of georeferencing attributes (scaleX and scaleY)
SELECT RS_Resample(raster, 1000, 1000, false, 'NearestNeighbor') as resampled_raster FROM raster_table
Converts a Geometry to a Raster object
SELECT RS_AsRaster(ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'), r.raster, 'b', 230) as rasterized_geom FROM raster_table r
Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.
Let’s still use the rasterized geometry above and convert it to a white background.
SELECT RS_MapAlgebra(rasterized_geom, NULL, 'out[0] = rast[0] == 0 ? 230 : 0;') AS raster FROM rasterized_table
Sedona introduced SedonaKepler and SedonaPyDeck to integrate with KeplerGL and DeckGL, aiming to enhance geospatial data visualization. These integrations allow users to utilize advanced mapping and visualization features directly from Sedona’s spatial data processing results.
One can simply run the following command to visualize a Sedona DataFrame using SedonaKepler and SedonaPyDeck
from sedona.spark import * SedonaKepler.create_map(df=groupedresult, name="AirportCount") SedonaPyDeck.create_choropleth_map(df=groupedresult, plot_col='AirportCount') SedonaPyDeck.create_geometry_map(df_building, elevation_col='height')
Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. It can read any GeoParquet files written by third-party systems such as GeoPandas. Most importantly, Sedona will use the [bbox](https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox) properties in the metadata to determine if all data in the file will be discarded by the spatial predicate. This optimization could reduce the number of files scanned when the queried GeoParquet dataset was partitioned by spatial proximity.
[bbox](https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox)
df = sedona.read.format("geoparquet").load("s3a://my/path/*.parquet") .filter("ST_Contains(ST_GeomFromWKT('POLYGON (XXX)'))")
A user can also inspect the schema of a GeoParquet file.
df.printSchema()
root |-- pop_est: long (nullable = true) |-- continent: string (nullable = true) |-- name: string (nullable = true) |-- iso_a3: string (nullable = true) |-- gdp_md_est: double (nullable = true) |-- geometry: geometry (nullable = true)
A user can also inspect the metadata of a GeoParquet file.
df = sedona.read.format("geoparquet.metadata").load("s3a://my/path/*.parquet") df.show() df.printSchema()
Output of the GeoParquet metadata
+------------+--------------+------------------------------------------------------------------------------------------+-------+ |version |primary_column|columns |geohash| +------------+--------------+------------------------------------------------------------------------------------------+-------+ |1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-45.0, 45.0038534, -33.75027, 50.5347176], null}} |g0 | |1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-45.0, 33.8141064, -33.7678528, 39.3735873], null}} |en | |1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-44.8187256, -56.1774766, -34.1630816, -50.8584471], null}} |5n | |1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-112.2099566, -33.7316387, -101.2729317, -28.2008956], null}}|3d | |1.0.0-beta.1|geometry |{geometry -> {WKB, [Point], [-33.2666016, -72.9196355, -22.6803555, -67.5007358], null}} |57 | +------------+--------------+------------------------------------------------------------------------------------------+-------+
root |-- version: string (nullable = true) |-- primary_column: string (nullable = true) |-- columns: map (nullable = true) | |-- key: string | |-- value: struct (valueContainsNull = true) | | |-- encoding: string (nullable = true) | | |-- geometry_types: array (nullable = true) | | | |-- element: string (containsNull = true) | | |-- bbox: array (nullable = true) | | | |-- element: double (containsNull = true) | | |-- crs: string (nullable = true) |-- geohash: string (nullable = true)
Due to a breaking change in Apache Sedona 1.4.0 to the SQL type of GeometryUDT (SEDONA-205) as well as the serialization format of geometry values (SEDONA-207), Parquet files containing geometry columns written by Apache Sedona 1.3.1 or earlier cannot be read by Apache Sedona 1.4.0 or later.
GeometryUDT
For parquet files written by "parquet" format when using Apache Sedona 1.3.1-incubating or earlier:
"parquet"
df.write.format("parquet").save("path/to/parquet/files")
Since v1.5.1, GeoParquet supports reading legacy Parquet files. you can use "geoparquet" format with the .option("legacyMode", "true") option.
"geoparquet"
.option("legacyMode", "true")
df = sedona.read.format("geoparquet").option("legacyMode", "true").load("path/to/legacy-parquet-files")
Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. Moreover, a user can also control the metadata written to the GeoParquet files.
df.write.format("geoparquet") .option("geoparquet.version", "1.0.0") .option("geoparquet.crs", projjson) .save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")
To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see ST_GeoHash) and then save as a GeoParquet file. An example is as follows:
SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash FROM spatialDf ORDER BY geohash
The following figure is the visualization of the bbox of Sedona GeoParquet files ordered by GeoHash.
Uber H3 is a technique to model and index and geospatial data. H3 grids the earth surface by casting it on a icosahedron, and tiles it by hexagons + pentagons.
Sedona 1.5 now provides the following functions related to Uber H3:
Let’s use the Seattle road network dataset (from OSM) as an example.
You can create H3 cell ids using ST_H3CellIds as follows.
ST_H3CellIds
roadDf = sedona.read.format("csv").option("header", "true").load(PATH_PREFIX + "data/OSM2015_roads.csv") roadDf = roadDf.selectExpr( "monotonically_increasing_id() as id", "ST_GeomFromWkt(geometry) as road", "`attr#1` as attr", "ST_H3ToGeom(ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false)) as roadH3Geom", "ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false) as roadH3" )
+---+--------------------+--------------------+--------------------+--------------------+ | id| road| attr| roadH3Geom| roadH3| +---+--------------------+--------------------+--------------------+--------------------+ | 0|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966219243...| | 1|LINESTRING (-122....| [service|alley|MULTIPOLYGON (((-...|[6222150966281830...| | 2|LINESTRING (-122....| [surface|asphalt|MULTIPOLYGON (((-...|[6222150966517759...| | 3|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966243164...| | 4|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966286090...| | 5|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966439772...| | 6|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966722559...| | 7|LINESTRING (-122....| [highway|service]|MULTIPOLYGON (((-...|[622215096642338815]| | 8|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966830039...| | 9|LINESTRING (-122....| [service|alley|MULTIPOLYGON (((-...|[6222150959224913...| +---+--------------------+--------------------+--------------------+--------------------+
We can visualize H3 cells using GeoPandas Plot
pandasDf = roadDf.limit(100).toPandas() roadGpd = gpd.GeoDataFrame(pandasDf, geometry="road") h3Gpd = gpd.GeoDataFrame(pandasDf, geometry="roadH3Geom") ax = h3Gpd.plot(edgecolor='gray', linewidth=1.0) ax.set_xlabel('Longitude (degrees)') ax.set_ylabel('Latitude (degrees)') roadGpd.plot(ax=ax, edgecolor='black', linewidth=1.0)
You can also use Sedona H3 functions together with SedonaPyDeck for better visualization. Please read our blog post: Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet
Sedona 1.5 introduces the addition of support for Snowflake, compatibility with Spark versions up to 3.5 to enhance geospatial analysis. These updates facilitate advanced spatial data processing and analysis in distributed environments, leveraging the latest Spark capabilities and integrating with Snowflake for cloud-based data warehousing solutions.
The Sedona 1.5.0 and 1.5.1 releases will not be possible without the help of the Apache Sedona community. We thank the contribution from the following community members:
jiayuasu, jbampton, hongbo-miao, Kontinuation, furqaankhan, iGN5117, prantogg, duhaode520, gregleleu, MyEnthusiastic, wendrickje, yyy1000
Apache Sedona has gained significant growth in the past few years. We are excited for an electrifying future with Sedona as it gears up to revolutionize geospatial processing. Here are future directions of Apache Sedona.
Interested in learning more about Apache Sedona? Download this hands-on guide with real-world examples for working with large-scale spatial data.
Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence
RasterFlow takes insights and embeddings from satellite and overhead imagery datasets into Apache Iceberg tables, with ease and efficiency at any scale.
Wherobots brought modern infrastructure to spatial data in 2025
We’re bridging the gap between AI and data from the physical world in 2026
The Medallion Architecture for Geospatial Data: Why Spatial Intelligence Demands a Different Approach
When most data engineers hear “medallion architecture,” they think of the traditional multi-hop layering pattern that powers countless analytics pipelines. The concept is sound: progressively refine raw data into analytical data and products. But geospatial data breaks conventional data engineering in ways that demand we rethink the entire pipeline. This isn’t about just storing location […]
share this article
Awesome that you’d like to share our articles. Where would you like to share it to: