TABLE OF CONTENTS

    Contributors

    • Jia Yu

      Jia Yu is a co-founder and the Chief Architect of Wherobots Inc. He is the PMC Chair of Apache Sedona

    • Kristin Cowalcijk

      Kristin Cowalcijk is a staff engineer of Wherobots Inc. He is a PMC of Apache Sedona.

    • Mo Sarwat

    Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data

    We’re excited to introduce Apache Sedona‘s 1.5 series, a leap forward in geospatial processing that brings essential features and enhancements. Since our last blog post on the Sedona 1.4 series, Apache Sedona has evolved significantly. The 1.5 series, launched with versions 1.5.0 in October 2023 and 1.5.1 in January 2024, marks a period of substantial advancement. These developments aim to transform Sedona into a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data handling across both batch and streaming data. You can learn more about the release from our release notes, website and GitHub repository.

    XYZM coordinates and SRID

    The concepts of X, Y, Z, and M in geospatial data, along with vector data, are foundational to understanding how geographic information is represented and analyzed in Geographic Information Systems (GIS).

    • X typically refers to the longitude
    • Y typically refers to the latitude
    • Z adds depth or elevation to the spatial representation, which is critical for conducting 3D modeling for urban planning, architecture, and environmental studies.
    • M stands for a measure, which can be time, distance, or other quantitative attribute that enhances the analysis, such as speed, weight, or cost.

    SRID stands for Spatial Reference System Identifier. It’s a unique identifier which typically references a registry or a catalog of spatial reference systems, such as the EPSG (European Petroleum Survey Group) Geodetic Parameter Registry.

    As of today, PostGIS is the most popular open-source database engine that offers full support of XYZM and SRID. It comes up with a standard called EWKB (Extended WKB) and EWKT (Extended WKT) to encode and decode the XYZM and SRID information of geometries. In contrast, Sedona versions 1.4 and earlier, along with several cloud database platforms, have shown limitations in handling XYZM coordinates and SRID comprehensively. These limitations typically manifest in several ways

    • The geometry constructors do not understand EWKT and EWKB
    • No geometry editors or accessors for Z, M dimensions and SRID information
    • No geometry outputs for for Z, M dimensions and SRID information
    • Z and M values silently go missing even if they are explicitly set by users

    In Sedona 1.5 series, the community has revised several Sedona components such as the serializer and geometry functions to uphold the full support of XYZM and SRID. Now let’s look at an example in Sedona 1.5.1.

    Spatial SQL query

    SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT ZM(1 2 3 4)'))
    SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT Z(1 2 3)'))
    SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT M(1 2 3)'))

    Output

    SRID=4269;POINT ZM(1 2 3 4)
    SRID=4269;POINT Z(1 2 3)
    SRID=4269;POINT M(1 2 3)

    As shown in the example, Sedona can read and write EWKT geometries. The Z and M values and SRID information are always preserved.

    Users can also perform some operations on the Z dimension. For example,

    Spatial SQL query

    SELECT ST_3DDistance(ST_GeomFromEWKT('POINT Z(1 1 1)'), ST_GeomFromEWKT('POINT Z(2 2 2)'))

    Output

    1.7320508075688772

    Vector and raster join

    In GIS, spatial data is primarily categorized into two types: vector data and raster data.

    • Vector data: represents geographical features using points, lines, and polygons. These elements are defined by X and Y coordinates that describe their shape and position in space. Common vector formats include WKT, WKB, Shapefile (.shp), and GeoJSON (.geojson)
    • Raster data: represents geographic information as a grid of cells (similar to images), with each cell holding a specific value. Raster data is well-suited for representing continuous phenomena, such as temperature gradients, elevation surfaces, or satellite imagery. Each raster data ofter comes with georeference information and a bounding box which aligns the image with a spatially correct geographic region. Common raster formats include GeoTiff (.tiff), and NetCDF (.nc).

    Finding a scalable spatial database system that effectively handles both vector and raster data is challenging. However, the integration of these data types has significant real-world applications, such as overlaying zoning boundaries on satellite images for urban planning or combining vector-based river networks with raster-based elevation models for environmental analysis. This capability is essential for assessing land use compliance, planning development, and studying environmental phenomena like watershed dynamics and flood risks.

    With the Sedona 1.5 series, joining massive vector and raster datasets has become feasible, opening the door to discovering more profound insights.

    Let’s use the LandSAT LST (Land Surface Temperature) dataset as an example. Each raster image in this dataset describes the temperature of a specific region. We can join them with the zip code zones of the earth to calculate the average temperature per city.

    Let’s first load this dataset to Sedona and visualize the raster DataFrame as follows:

    from sedona.spark import *
    imageDf = sedona.sql("SELECT RS_AsImage(rast, 100), datetime FROM landsat_lst")
    SedonaUtils.display_image(imageDf)

    The output looks like this

    Untitled

    Let’s read the the zip code zones of the earth into Sedona and print the content:

    SELECT ZCTA5CE10, geometry FROM zipcode

    The output is as follows:

    +---------+--------------------+
    |ZCTA5CE10|            geometry|
    +---------+--------------------+
    |    36522|POLYGON ((-88.432...|
    |    36521|POLYGON ((-88.342...|
    |    36548|POLYGON ((-88.037...|
    |    36585|POLYGON ((-88.169...|
    |    36583|POLYGON ((-88.263...|
    +---------+--------------------+

    Now let’s join the 2 dataset together using RS_Intersects and the join condition is that if a zip code zone intersects a raster image.

    SELECT ZCTA5CE10, geometry, rast, datetime
    FROM wherobots_open_data.us_census.zipcode z, landsat_lst l
    WHERE RS_Intersects(z.geometry, l.rast)

    The output is as follows:

    +---------+--------------------+--------------------+--------------------+
    |ZCTA5CE10|            geometry|                rast|            datetime|
    +---------+--------------------+--------------------+--------------------+
    |    85356|POLYGON ((-114.34...|GridCoverage2D[""...|2023-09-01 18:10:...|
    |    85364|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
    |    92250|POLYGON ((-115.45...|GridCoverage2D[""...|2023-09-01 18:10:...|
    |    85365|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
    |    85347|POLYGON ((-114.16...|GridCoverage2D[""...|2023-09-01 18:10:...|
    +---------+--------------------+--------------------+--------------------+

    Now let’s calculate the temperature of each zone using RS_ZonalStats

    SELECT ZCTA5CE10, RS_ZonalStats(rast, geometry, 'avg') as avg
    FROM zone_temp

    The output is as follows:

    +---------+------------------+
    |ZCTA5CE10|               avg|
    +---------+------------------+
    |    85350| 85.66846905011799|
    |    85350|111.90322619887085|
    |    85365| 43.69896389654749|
    |    85350|117.59570757880628|
    |    85365|120.40936393442645|
    +---------+------------------+

    Flexible raster data manipulation

    The Sedona 1.5 series introduces additional functions to address the challenge of raster data from remote sensing devices not being in the preferred format for analysis. These new features are designed to streamline the process of transforming and analyzing such data. Let’s explore a few interesting examples.

    RS_Clip

    It returns a raster that is clipped by the given geometry

    Original raster

    Untitled 1

    SELECT RS_Clip(
            RS_FromGeoTiff(content), 1,
            ST_GeomFromWKT('POLYGON ((236722 4204770, 243900 4204770, 243900 4197590, 221170 4197590, 236722 4204770))'),
            200, false
        )

    Output

    Untitled 2

    RS_Resample

    It resample a raster using a given resampling algorithm and new dimensions (width and height), a new grid corner to pivot the raster at (gridX and gridY) and a set of georeferencing attributes (scaleX and scaleY)

    Original raster

    Untitled 3

    SELECT RS_Resample(raster, 1000, 1000, false, 'NearestNeighbor') as resampled_raster
    FROM raster_table

    Output

    Untitled 4

    RS_AsRaster

    Converts a Geometry to a Raster object

    SELECT RS_AsRaster(ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'), r.raster, 'b', 230) as rasterized_geom
    FROM raster_table r

    Output

    Untitled 5

    RS_MapAlgebra

    Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.

    Let’s still use the rasterized geometry above and convert it to a white background.

    SELECT RS_MapAlgebra(rasterized_geom, NULL, 'out[0] = rast[0] == 0 ? 230 : 0;') AS raster
    FROM rasterized_table

    Output

    Untitled 6

    SedonaKepler and SedonaPyDeck

    Sedona introduced SedonaKepler and SedonaPyDeck to integrate with KeplerGL and DeckGL, aiming to enhance geospatial data visualization. These integrations allow users to utilize advanced mapping and visualization features directly from Sedona’s spatial data processing results.

    One can simply run the following command to visualize a Sedona DataFrame using SedonaKepler and SedonaPyDeck

    from sedona.spark import *
    SedonaKepler.create_map(df=groupedresult, name="AirportCount")
    SedonaPyDeck.create_choropleth_map(df=groupedresult, plot_col='AirportCount')
    SedonaPyDeck.create_geometry_map(df_building, elevation_col='height')

    Untitled 7

    Untitled 8

    Untitled 9

    GeoParquet reader

    Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. It can read any GeoParquet files written by third-party systems such as GeoPandas. Most importantly, Sedona will use the [bbox](https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox) properties in the metadata to determine if all data in the file will be discarded by the spatial predicate. This optimization could reduce the number of files scanned when the queried GeoParquet dataset was partitioned by spatial proximity.

    df = sedona.read.format("geoparquet").load("s3a://my/path/*.parquet")
        .filter("ST_Contains(ST_GeomFromWKT('POLYGON (XXX)'))")

    A user can also inspect the schema of a GeoParquet file.

    df.printSchema()

    Output

    root
     |-- pop_est: long (nullable = true)
     |-- continent: string (nullable = true)
     |-- name: string (nullable = true)
     |-- iso_a3: string (nullable = true)
     |-- gdp_md_est: double (nullable = true)
     |-- geometry: geometry (nullable = true)

    A user can also inspect the metadata of a GeoParquet file.

    df = sedona.read.format("geoparquet.metadata").load("s3a://my/path/*.parquet")
    df.show()
    df.printSchema()
    
    

    Output of the GeoParquet metadata

    +------------+--------------+------------------------------------------------------------------------------------------+-------+
    |version     |primary_column|columns                                                                                   |geohash|
    +------------+--------------+------------------------------------------------------------------------------------------+-------+
    |1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 45.0038534, -33.75027, 50.5347176], null}}            |g0     |
    |1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 33.8141064, -33.7678528, 39.3735873], null}}          |en     |
    |1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-44.8187256, -56.1774766, -34.1630816, -50.8584471], null}}  |5n     |
    |1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-112.2099566, -33.7316387, -101.2729317, -28.2008956], null}}|3d     |
    |1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-33.2666016, -72.9196355, -22.6803555, -67.5007358], null}}  |57     |
    +------------+--------------+------------------------------------------------------------------------------------------+-------+
    

    Output of the GeoParquet metadata

    
    
    root
    |-- version: string (nullable = true)
    |-- primary_column: string (nullable = true)
    |-- columns: map (nullable = true)
    | |-- key: string
    | |-- value: struct (valueContainsNull = true)
    | | |-- encoding: string (nullable = true)
    | | |-- geometry_types: array (nullable = true)
    | | | |-- element: string (containsNull = true)
    | | |-- bbox: array (nullable = true)
    | | | |-- element: double (containsNull = true)
    | | |-- crs: string (nullable = true)
    |-- geohash: string (nullable = true)

    GeoParquet reader legacy mode

    Due to a breaking change in Apache Sedona 1.4.0 to the SQL type of GeometryUDT (SEDONA-205) as well as the serialization format of geometry values (SEDONA-207), Parquet files containing geometry columns written by Apache Sedona 1.3.1 or earlier cannot be read by Apache Sedona 1.4.0 or later.

    For parquet files written by "parquet" format when using Apache Sedona 1.3.1-incubating or earlier:

    df.write.format("parquet").save("path/to/parquet/files")

    Since v1.5.1, GeoParquet supports reading legacy Parquet files. you can use "geoparquet" format with the .option("legacyMode", "true") option.

    df = sedona.read.format("geoparquet").option("legacyMode", "true").load("path/to/legacy-parquet-files")

    GeoParquet writer

    Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. Moreover, a user can also control the metadata written to the GeoParquet files.

    df.write.format("geoparquet")
            .option("geoparquet.version", "1.0.0")
            .option("geoparquet.crs", projjson)
            .save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")

    To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see ST_GeoHash) and then save as a GeoParquet file. An example is as follows:

    SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash
    FROM spatialDf
    ORDER BY geohash

    The following figure is the visualization of the bbox of Sedona GeoParquet files ordered by GeoHash.

    Untitled 10

    Uber H3 IDs

    Uber H3 is a technique to model and index and geospatial data. H3 grids the earth surface by casting it on a icosahedron, and tiles it by hexagons + pentagons.

    Sedona 1.5 now provides the following functions related to Uber H3:

    • ST_H3CellIDs(geom: geometry, level: Int, fullCover: true)
    • ST_H3CellDistance(cell1: Long, cell2: Long)
    • ST_H3KRing(cell: Long, k: Int, exactRing: Boolean)
    • ST_H3ToGeom(cells: Array[Long])

    Let’s use the Seattle road network dataset (from OSM) as an example.

    Untitled 11

    You can create H3 cell ids using ST_H3CellIds as follows.

    roadDf = sedona.read.format("csv").option("header", "true").load(PATH_PREFIX + "data/OSM2015_roads.csv")
    roadDf = roadDf.selectExpr(
        "monotonically_increasing_id() as id",
        "ST_GeomFromWkt(geometry) as road", "`attr#1` as attr",
        "ST_H3ToGeom(ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false)) as roadH3Geom",
        "ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false) as roadH3"
    )

    Output

    +---+--------------------+--------------------+--------------------+--------------------+
    | id|                road|                attr|          roadH3Geom|              roadH3|
    +---+--------------------+--------------------+--------------------+--------------------+
    |  0|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966219243...|
    |  1|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150966281830...|
    |  2|LINESTRING (-122....|    [surface|asphalt|MULTIPOLYGON (((-...|[6222150966517759...|
    |  3|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966243164...|
    |  4|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966286090...|
    |  5|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966439772...|
    |  6|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966722559...|
    |  7|LINESTRING (-122....|   [highway|service]|MULTIPOLYGON (((-...|[622215096642338815]|
    |  8|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966830039...|
    |  9|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150959224913...|
    +---+--------------------+--------------------+--------------------+--------------------+

    We can visualize H3 cells using GeoPandas Plot

    pandasDf = roadDf.limit(100).toPandas()
    roadGpd = gpd.GeoDataFrame(pandasDf, geometry="road")
    h3Gpd = gpd.GeoDataFrame(pandasDf, geometry="roadH3Geom")
    ax = h3Gpd.plot(edgecolor='gray', linewidth=1.0)
    ax.set_xlabel('Longitude (degrees)')
    ax.set_ylabel('Latitude (degrees)')
    roadGpd.plot(ax=ax, edgecolor='black', linewidth=1.0)

    Untitled 12

    You can also use Sedona H3 functions together with SedonaPyDeck for better visualization. Please read our blog post: Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet

    Unmatched industrial vessels

    New cluster compute engine support

    Sedona 1.5 introduces the addition of support for Snowflake, compatibility with Spark versions up to 3.5 to enhance geospatial analysis. These updates facilitate advanced spatial data processing and analysis in distributed environments, leveraging the latest Spark capabilities and integrating with Snowflake for cloud-based data warehousing solutions.

    Community contributions and acknowledgements

    The Sedona 1.5.0 and 1.5.1 releases will not be possible without the help of the Apache Sedona community. We thank the contribution from the following community members:

    jiayuasu, jbampton, hongbo-miao, Kontinuation, furqaankhan, iGN5117, prantogg, duhaode520, gregleleu, MyEnthusiastic, wendrickje, yyy1000

    Future directions

    Apache Sedona has gained significant growth in the past few years. We are excited for an electrifying future with Sedona as it gears up to revolutionize geospatial processing. Here are future directions of Apache Sedona.

    • Adding more cloud platform integrations in addition to the current Spark, Flink and Snowflake support
    • Developing advanced raster and vector data processing features.
    • Improving performance and scalability for large datasets.
    • Adding more comprehensive geospatial functions.
    • Increasing compatibility with the broader spatial data ecosystem such as QGIS, Lonboard, Rasterio, NumPy, Ibis
    • Strengthening documentation and community engagement for wider adoption.

    Contributors

    • Jia Yu

      Jia Yu is a co-founder and the Chief Architect of Wherobots Inc. He is the PMC Chair of Apache Sedona

    • Kristin Cowalcijk

      Kristin Cowalcijk is a staff engineer of Wherobots Inc. He is a PMC of Apache Sedona.

    • Mo Sarwat