Announcing Apache Sedona 1.4 – A Giant Leap in Geospatial Processing

8 Mins Read

5 Jul 2023

Announcing Apache Sedona 1.4 – A Giant Leap in Geospatial Processing

Hello Apache Sedona community, We’re exhilarated to share the launch of Apache Sedona 1.4 (https://sedona.apache.org/1.4.1/), encompassing both 1.4.0 and 1.4.1 versions, signifying a transformative stage in geospatial analytics. These successive versions introduce an array of striking new features, major enhancements, and indispensable bug fixes. Top-Notch Highlights Here are the standout features spanning across these two […]

TABLE OF CONTENTS

Hello Apache Sedona community,

We’re exhilarated to share the launch of Apache Sedona 1.4 (https://sedona.apache.org/1.4.1/), encompassing both 1.4.0 and 1.4.1 versions, signifying a transformative stage in geospatial analytics. These successive versions introduce an array of striking new features, major enhancements, and indispensable bug fixes.

Top-Notch Highlights

Here are the standout features spanning across these two versions:

Use SedonaContext as a unified entry point (SEDONA-287): This allows you to use the same method to initialize your Sedona environment in different computation engines without tedious configuration.
Geodesic/Geography Functions Support (SEDONA-281, SEDONA-286): This functionality adds a whole new dimension to your geospatial data processing, enabling more accurate computations over geographical data.
Redesigned end-to-end raster ETL pipeline (SEDONA-251, SEDONA-269, SEDONA-292): Apache Sedona 1.4 includes new raster readers / writers, raster type, and many essential raster functions, paving the way for streamlined handling of vast volumes of geospatial data.
Google S2 based spatial join (SEDONA-235): Sedona 1.4 introduces the support of Google S2 cells which enables a speedy approximate point-in-polygon join.
Support for Spark 3.4 (SEDONA-276): Our support for Spark 3.4 allows you to merge the powerful capabilities of the latest Spark version with the enhanced geospatial features of Sedona.
Serialization and Deserialization Enhancements (SEDONA-207, SEDONA-227, SEDONA-231): Sedona 1.4 offered faster geometry serialization and deserialization, resulting in a performance boost by a factor of 3 to 7.
Native filter pushdown support in GeoParquet (SEDONA-156): this can push down spatial predicate on GeoParquet to reduce memory consumption by 10X.

Now, let’s dive into Features 1-4 and learn how to leverage them in your applications. We’ll discuss the remaining features in separate blog posts.

Use SedonaContext as a unified entry point (v1.4.1)

Initially designed to be a geospatial computation system on top of Apache Spark, Sedona has evolved over the years. It has extended its support to other engines and is now on a mission to become a cross-platform, scalable geospatial computation engine.

To use Sedona with different underlying engines, you must import several classes and register necessary Sedona data types, serializers, functions, and optimization rules. This process can be confusing for new users over time. Let’s examine the Sedona Spark Python interface as an example:

Old Sedona entry point

from sedona.register import SedonaRegistrator
from sedona.core.SpatialRDD import SpatialRDD
from sedona.utils.adapter import Adapter
from sedona.core.spatialOperator import JoinQueryRaw
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.sql.types import GeometryType
from sedona.core.enums import GridType
from sedona.core.enums import IndexType
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

spark = SparkSession.\
    builder.\
    master("local[*]").\
    appName("Sedona App").\
    config("spark.serializer", KryoSerializer.getName).\
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName) .\
    config("spark.jars.packages", "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2") .\
    getOrCreate()

SedonaRegistrator.registerAll(spark)

spatialDf = sedona.sql("SELECT ST_GeomFromWKT(XXX) FROM")

Users previously had to manually import many classes and add serializer configuration.

However, with the introduction of SedonaContext and other helper functions in SEDONA-287, starting applications has become much easier. Here’s how to do it:

SedonaContext with Spark Python

from sedona.spark import *

config = SedonaContext.builder() .\
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,'
           'org.datasyslab:geotools-wrapper:1.4.0-28.2'). \
    getOrCreate()

sedona = SedonaContext.create(config)

spatialDf = sedona.sql("SELECT ST_GeomFromWKT(XXX) FROM")

SedonaContext with Flink Java

import org.apache.sedona.flink.SedonaContext

StreamTableEnvironment sedona = SedonaContext.create(env, tableEnv);

Table spatialTbl = sedona.sqlQuery("SELECT ST_GeomFromWKT(XXX) FROM")

It’s worth noting that Sedona will maintain the old entry point for a long time, since many existing users are familiar with the old APIs.

Geodesic/Geography Functions Support (v1.4.1)

As we all know, the Earth is a spheroid, making it difficult to accurately calculate geographic distances between two locations. Given the complicated surface of the Earth, finding the exact distance is a challenging task. In the GIS industry, it is considered best practice to project the geospatial objects (usually in WGS84 / EPSG:4326) to an appropriate meter-based Coordinate Reference System (CRS) before applying geometric calculations such as distance calculation. Users can choose a CRS that is close to the locations of their data. For example, in Apache Sedona, one can calculate the distance between two locations as follows:

Sedona ST_Transfrom in SQL

CREATE OR REPLACE TEMP VIEW geomTable AS
SELECT ST_GeomFromWKT('POINT(-42.3 71.3)') AS geom1,
 ST_GeomFromWKT('POINT(-42.1 71.1)') AS geom2;

CREATE OR REPLACE TEMP VIEW geomTable_t AS
SELECT ST_Transform(geom1, 'epsg:4326', 'epsg:3857') AS geom1,
 ST_Transform(geom2, 'epsg:4326', 'epsg:3857') As geom2 
FROM geomTable

SELECT ST_Distance(geom1, geom2) AS distanceInMeter FROM geomTable_t

However, it is common for a single dataset to include geospatial objects across the globe, and there is no CRS suitable for everywhere. Therefore, it can be difficult to calculate distances accurately. In this case, the best way to measure geographic distance is either through great-circle distance (haversine distance) or spheroid distance (Vincenty’s formulae or C. Karney’s algorithm).

SEDONA-281 introduces support for both great-circle distance and spheroid distance, namely ST_DistanceSphere and ST_DistanceSpheroid. Let’s look at an example of calculating the distance between Hong Kong (22.308919, 113.914603) and Toronto (43.677223, -79.630556).

Sedona sphere and spheroid distance in SQL

All distance units in this example are in meters. The actual distance between Hong Kong and Toronto is about 12,544 to 12,568 kilometers. ST_DistanceSphere and ST_DistanceSpheroid both produce very accurate results. However, the ST_Transformbased approach cannot give a good answer if no proper CRS (coordinate reference system) is available.

# ST_DistanceSphere
SELECT ST_DistanceSphere(ST_POINT(22.308919, 113.914603), ST_POINT(43.677223, -79.630556)) AS distance

+--------------------+
|            distance|
+--------------------+
|1.2548548944238186E7|
+--------------------+

# ST_DistanceSpheroid
SELECT ST_DistanceSpheroid(ST_POINT(22.308919, 113.914603), ST_POINT(43.677223, -79.630556)) AS distance

+--------------------+
|            distance|
+--------------------+
|1.2568775317073349E7|
+--------------------+

# ST_Transform based distance calculation
SELECT ST_DISTANCE(ST_TRANSFORM(ST_POINT(22.308919, 113.914603), 'epsg:4326', 'epsg:3857'),
 ST_TRANSFORM(ST_POINT(43.677223, -79.630556), 'epsg:4326', 'epsg:3857')) AS distance

+--------------------+
|            distance|
+--------------------+
|2.1735260976295043E7|
+--------------------+

Rule of thumb

Using ST_DistanceSphere and ST_DistanceSpheroid does not mean that the ST_Transform-based approach is no longer necessary. To select the most suitable approach, use the following rule of thumb: when calculating the distance between two geospatial objects that are far apart, such as cities in different countries or continents, use ST_DistanceSphere or ST_DistanceSpheroid. When calculating the distance between two objects in the same region such as a metropolitan area, use the ST_Transform-based approach.

Now that we are able to calculate the spherical distance between two geospatial objects, we can apply this to real-world applications that involve the distance calculation among many geospatial objects and not just two. For example, we can find the parks that are within 50KM distance from each city among the locations of cities and parks in the world (available from OpenStreetMap). This operation is called DistanceJoin in Sedona and is one of the most expensive and resource-consuming operations. Thanks to our highly efficient spatial join algorithm, Sedona can easily handle this operation on huge datasets. If you used our ST_Transform-based approach before, you may have enjoyed this algorithm for a while. SEDONA-286 extends our algorithm to support distance join over ST_DistanceSphere and ST_DistanceSpheroid. A distance join query can be written as follows:

Sedona distance join in SQL

50000 indicates 50KM

# ST_DistanceSphere join
SELECT *
FROM cities c, parks p
WHERE ST_DistanceSphere(c.geom, p.geom) < 50000

# ST_DistanceSpheroid join
SELECT *
FROM cities c, parks p
WHERE ST_DistanceSpheroid(c.geom, p.geom) < 50000

# ST_Transform + ST_Distance
SELECT *
FROM cities c, parks p
WHERE ST_DISTANCE(ST_TRANSFORM(c.geom, 'epsg:4326', 'epsg:3857'),
 ST_TRANSFORM(p.geom, 'epsg:4326', 'epsg:3857')) < 50000

Redesigned end-to-end raster ETL pipeline (v1.4.1)

Sedona has been implementing geospatial vector data functions for several years, and the support has become mostly mature. In recent releases, the Sedona community has invested more in geospatial raster data support, including raster data reader, writer, and raster manipulation.

To enable comprehensive raster data support, Sedona has made the following efforts:

Raster data type

SEDONA-251 introduces a native raster data type. Similar to the geometry type, the raster type indicates that the column in a schema contains raster data. Each record stored in this type contains two parts: metadata and raster band information. With this in place, we can create a table that has both a geometry column and a raster column.

Raster reader

To begin processing raster data, the first step is to load the raster data. Additionally, we must be able to load raster data in parallel given the vast amount of available raster data. SEDONA-251 introduces a scalable raster reader and native raster data constructors. For example, we can read raster data files in Sedona Spark as follows:

sedona.read.format("binaryFile")
.load("raster/*.tiff").createOrReplaceTempView("binaryTable")

To construct a raster column, use the following steps. We have added additional constructors, such as RS_FromGeoTiff and RS_FromAscGrid, which allow you to convert various raster binary formats to a unified raster format.

CREATE OR REPLACE TEMP VIEW rasterTable AS
SELECT RS_FromGeoTiff(binaryTable.content) AS raster
FROM binaryTable

We can print the schema of this table. As we can see, the type of this column now is raster.

sedone.table("binaryTable").printSchema()

root
 |-- raster: raster (nullable = true)

Raster writer

After processing raster data, it is necessary to store the data in an external storage. With SEDONA-269, this can be achieved in Sedona Spark as follows:

First, convert the raster data to a specific binary format, such as GeoTiff. Note that the raster can be imported from one image format, but exported to another image format.

SELECT RS_AsGeoTiff(raster) AS image
FROM rasterTable

After processing the data, we can save it to an external location.

sedona.table("rasterTable").write.format("raster").mode(SaveMode.Overwrite).save("my_raster_file")

Raster functions

SEDONA-251, SEDONA-269, and SEDONA-292 together introduce several functions to transform and analyze raster data. Let’s examine a few examples.

First, let’s assume that we have a GeoTiff image as follows:

file

Extract bounding box, SRID and number of bands

We can extract the bounding box, srid, and number of bands of this raster.

SELECT raster, RS_Envelope(raster) as bbox, RS_Metadata(raster), RS_NumBands(raster)
FROM rasterTable

+--------------------+--------------------+-----+--------+
|              raster|                geom| srid|numBands|
+--------------------+--------------------+-----+--------+
|GridCoverage2D["g...|POLYGON ((590520 ...|32631|       3|
+--------------------+--------------------+-----+--------+

Now this resulting table has a schema that contains both raster and geometry type (see below)

root
 |-- raster: raster (nullable = true)
 |-- geom: geometry (nullable = true)
 |-- srid: integer (nullable = true)
 |-- numBands: integer (nullable = true)

This allows for more complex geometric operations on the table. For instance, we can save the table as a GeoParquet format and perform filter pushdown. When executing a Sedona spatial range query on this GeoParquet table, Sedona will retrieve only the data that intersects the spatial range query window.

sedona.table("rasterTable").write.format("geoparquet").save("rasterTable.parquet")

Extract individual band

We can extract any band in a raster.

SELECT RS_BandAsArray(raster, 1) as band1, RS_BandAsArray(raster, 2) as band2,
 RS_BandAsArray(raster, 3) as band3
FROM rasterTable

+--------------------+--------------------+--------------------+
|               band1|               band2|               band3|
+--------------------+--------------------+--------------------+
|[0.0, 799.0, 788....|[0.0, 555.0, 546....|[0.0, 330.0, 322....|
+--------------------+--------------------+--------------------+

Modify a band value

We can modify a band value and save it to the original raster

SELECT RS_AddBandFromArray(raster, RS_GreaterThan(band1, 0), 1) AS raster)
FROM rasterTable

The resulting GeoTiff looks like this

file

Google S2-based spatial join (v1.4.0)

If you’re experiencing suboptimal performance with Sedona’s optimized join, it may be due to complicated and overlapping geometries. Fortunately, Sedona offers a built-in alternative: the Google S2-based approximate equi-join. This equi-join utilizes Spark’s internal equi-join algorithm and may perform better than the optimized join, especially if you choose to skip the refinement step. However, be aware that skipping the refinement step will result in a loss of query accuracy.

Sedona’s approximate equi-join is a useful tool to have in your arsenal when working with complex geometries. And with the recent updates to Sedona, such as the introduction of the native raster data type and the redesigned end-to-end raster ETL pipeline, Sedona is better equipped than ever to meet the evolving needs of the geospatial data processing community.

1. Generate S2 ids for both tables

Use ST_S2CellIds to generate cell IDs. Each geometry may produce one or more IDs.

SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM lefts

SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM rights

2. Perform equi-join

Join the two tables by their S2 cellId

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId

3. Optional: Refine the result

Due to the nature of S2 Cellid, the equi-join results might have a few false-positives depending on the S2 level you choose. A smaller level indicates bigger cells, less exploded rows, but more false positives.

To ensure the correctness, you can use one of the Spatial Predicates to filter out them. Use this query instead of the query in Step 2.

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs, rcs
WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)

As you see, compared to the query in Step 2, we added one more filter, which is ST_Contains, to remove false positives. You can also use ST_Intersects and so on.

!!!tip
You can skip this step if you don’t need 100% accuracy and want faster query speed.

4. Optional: De-duplicate

Due to the explode function used when we generate S2 Cell Ids, the resulting DataFrame may have several duplicate <lcs_geom, rcs_geom> matches. You can remove them by performing a GroupBy query.

SELECT lcs_id, rcs_id , first(lcs_geom), first(lcs_name), first(rcs_geom), first(rcs_name)
FROM joinresult
GROUP BY (lcs_id, rcs_id)

The first function is to take the first value from a number of duplicate values.

If you don’t have a unique id for each geometry, you can also group by geometry itself. See below:

SELECT lcs_geom, rcs_geom, first(lcs_name), first(rcs_name)
FROM joinresult
GROUP BY (lcs_geom, rcs_geom)

Try it now

With these improvements and additions, Apache Sedona 1.4 s better equipped to meet the evolving needs of the geospatial data processing community.

We are immensely grateful to our fantastic community for their ongoing support and contributions to the development and growth of Apache Sedona. We are excited to continue our journey together, pushing the boundaries of geospatial data processing.

Stay tuned for future updates and enjoy your data processing with Apache Sedona 1.4!

10 Mins Read 5 Jun 2024

Introducing WherobotsAI for planetary inference, and capabilities that modernize spatial intelligence at scale

We are excited to announce a preview of WherobotsAI, our new suite of AI and ML powered capabilities that unlock spatial intelligence in satellite imagery and GPS location data. Additionally, we are bringing the high-performance of WherobotsDB to your favorite data applications with a Spatial SQL API that integrates WherobotsDB with more interfaces including Apache […]

Apache Sedona + 3

6 Mins Read 13 Sep 2024

Wherobots and PostgreSQL + PostGIS: A Synergy in Spatial Analysis

Introduction Spatial data has never been more abundant and valued in decision making (everything happens somewhere!). The tools we use to process location based data can significantly impact the outcomes of our projects. In the geospatial ecosystem, Apache Sedona-powered Wherobots and PostgreSQL with the PostGIS extension each offer robust capabilities. They share some functionalities, but […]

Apache Sedona + 2

4 Mins Read 1 Aug 2024

Wherobots Joins Overture, Winning The Taco Wars, Spatial SQL API, Geospatial Index Podcast – This Month In Wherobots

Welcome to This Month In Wherobots the monthly developer newsletter for the Wherobots & Apache Sedona community! This month we have news about Wherobots and the Overture Maps Foundation, a deep dive on new Wherobots Cloud features like raster inference, generating vector tiles, and the Spatial SQL API, plus a look at retail cannibalization analysis […]

Apache Sedona + 1

3 Mins Read 23 Jul 2024

Wherobots Joins Overture Maps Foundation As Contributing Member To Enable Open Cloud-Native Geospatial Intelligence

Wherobots is excited to share that we have officially joined Overture Maps Foundation as a Contributing Member to support the next generation of planetary-scale open map data. Wherobots believes wholeheartedly in Overture’s mission to bring open global-scale map data to the world while leveraging cloud-native technologies to enable efficient and accessible usage of these datasets. […]

News