Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data

We’re excited to introduce Apache Sedona’s 1.5 series, a leap forward in geospatial processing that brings essential features and enhancements. Since our last blog post on the Sedona 1.4 series, Apache Sedona has evolved significantly. The 1.5 series, launched with versions 1.5.0 in October 2023 and 1.5.1 in January 2024, marks a period of substantial advancement. These developments aim to transform Sedona into a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data handling across both batch and streaming data. You can learn more about the release from our release notes, website and GitHub repository.

XYZM coordinates and SRID

The concepts of X, Y, Z, and M coordinates in geospatial data are foundational to understanding how geographic information is represented and analyzed in Geographic Information Systems (GIS).

  • X typically refers to the longitude
  • Y typically refers to the latitude
  • Z adds depth or elevation to the spatial representation, which is critical for conducting 3D modeling for urban planning, architecture, and environmental studies.
  • M stands for a measure, which can be time, distance, or other quantitative attribute that enhances the analysis, such as speed, weight, or cost.

SRID stands for Spatial Reference System Identifier. It’s a unique identifier which typically references a registry or a catalog of spatial reference systems, such as the EPSG (European Petroleum Survey Group) Geodetic Parameter Registry.

As of today, PostGIS is the most popular open-source database engine that offers full support of XYZM and SRID. It established the EWKB (Extended WKB) and EWKT (Extended WKT) standards to encode and decode the XYZM and SRID information of geometries. In contrast, Sedona versions 1.4 and earlier, along with several cloud database platforms, have shown limitations in handling XYZM coordinates and SRID comprehensively. These limitations typically manifest in several ways:

  • The geometry constructors do not understand EWKT and EWKB
  • No geometry editors or accessors for Z, M dimensions and SRID information
  • No geometry outputs for Z, M dimensions and SRID information
  • Z and M values silently go missing even if they are explicitly set by users

In the Sedona 1.5 series, the community has revised several Sedona components, such as the serializer and geometry functions, to provide full support of XYZM and SRID. Now let’s look at an example in Sedona 1.5.1.

Spatial SQL query

SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT ZM(1 2 3 4)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT Z(1 2 3)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT M(1 2 3)'))

Output

SRID=4269;POINT ZM(1 2 3 4)
SRID=4269;POINT Z(1 2 3)
SRID=4269;POINT M(1 2 3)

As shown in the example, Sedona can read and write EWKT geometries. The Z and M values and SRID information are always preserved.
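
Sedona 1.5 also ships accessors and editors for these values. A minimal sketch (ST_SRID, ST_SetSRID, and ST_Z are part of Sedona’s function catalog; the literals are arbitrary):

# Read back the SRID after setting it on a geometry
sedona.sql("SELECT ST_SRID(ST_SetSRID(ST_GeomFromWKT('POINT (1 2)'), 4269)) AS srid").show()  # srid = 4269

# Extract the Z value of a 3D point
sedona.sql("SELECT ST_Z(ST_GeomFromEWKT('POINT Z(1 2 3)')) AS z").show()  # z = 3.0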

Users can also perform some operations on the Z dimension. For example,

Spatial SQL query

SELECT ST_3DDistance(ST_GeomFromEWKT('POINT Z(1 1 1)'), ST_GeomFromEWKT('POINT Z(2 2 2)'))

Output

1.7320508075688772

Vector and raster join

In GIS, spatial data is primarily categorized into two types: vector data and raster data.

  • Vector data: represents geographical features using points, lines, and polygons. These elements are defined by X and Y coordinates that describe their shape and position in space. Common vector formats include WKT, WKB, Shapefile (.shp), and GeoJSON (.geojson)
  • Raster data: represents geographic information as a grid of cells (similar to images), with each cell holding a specific value. Raster data is well-suited for representing continuous phenomena, such as temperature gradients, elevation surfaces, or satellite imagery. Raster data often comes with georeference information and a bounding box that aligns the image with a spatially correct geographic region. Common raster formats include GeoTiff (.tiff) and NetCDF (.nc). A minimal loading sketch for both data types follows this list.
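
In Sedona, both data types are handled through ordinary DataFrame columns. Below is a minimal loading sketch (the paths are placeholders; RS_FromGeoTiff and ST_GeomFromWKT are the same functions used in the examples that follow):

from sedona.spark import *

# Vector: parse a WKT string into a geometry column
vector_df = sedona.sql("SELECT ST_GeomFromWKT('POINT (-88.3 32.9)') AS geom")

# Raster: read GeoTiff bytes with Spark's binaryFile source, then decode them
raster_df = sedona.read.format("binaryFile").load("s3a://my-bucket/rasters/*.tiff") \
    .selectExpr("RS_FromGeoTiff(content) AS rast")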

Finding a scalable spatial database system that effectively handles both vector and raster data is challenging. However, the integration of these data types has significant real-world applications, such as overlaying zoning boundaries on satellite images for urban planning or combining vector-based river networks with raster-based elevation models for environmental analysis. This capability is essential for assessing land use compliance, planning development, and studying environmental phenomena like watershed dynamics and flood risks.

With the Sedona 1.5 series, joining massive vector and raster datasets has become feasible, opening the door to discovering more profound insights.

Let’s use the Landsat LST (Land Surface Temperature) dataset as an example. Each raster image in this dataset describes the temperature of a specific region. We can join them with zip code zones to calculate the average temperature per zip code zone.

Let’s first load this dataset into Sedona and visualize the raster DataFrame as follows:

from sedona.spark import *
imageDf = sedona.sql("SELECT RS_AsImage(rast, 100), datetime FROM landsat_lst")
SedonaUtils.display_image(imageDf)

The output looks like this

[Image: rendered previews of the landsat_lst rasters]

Let’s read the zip code zones into Sedona and print the content:

SELECT ZCTA5CE10, geometry FROM zipcode

The output is as follows:

+---------+--------------------+
|ZCTA5CE10|            geometry|
+---------+--------------------+
|    36522|POLYGON ((-88.432...|
|    36521|POLYGON ((-88.342...|
|    36548|POLYGON ((-88.037...|
|    36585|POLYGON ((-88.169...|
|    36583|POLYGON ((-88.263...|
+---------+--------------------+

Now let’s join the two datasets using RS_Intersects. The join condition is that a zip code zone intersects a raster image.

SELECT ZCTA5CE10, geometry, rast, datetime
FROM wherobots_open_data.us_census.zipcode z, landsat_lst l
WHERE RS_Intersects(z.geometry, l.rast)

The output is as follows:

+---------+--------------------+--------------------+--------------------+
|ZCTA5CE10|            geometry|                rast|            datetime|
+---------+--------------------+--------------------+--------------------+
|    85356|POLYGON ((-114.34...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85364|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    92250|POLYGON ((-115.45...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85365|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85347|POLYGON ((-114.16...|GridCoverage2D[""...|2023-09-01 18:10:...|
+---------+--------------------+--------------------+--------------------+
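
The zone_temp table used in the next query is assumed to be this join result registered as a temporary view, for example:

joined_df = sedona.sql("""
    SELECT ZCTA5CE10, geometry, rast, datetime
    FROM wherobots_open_data.us_census.zipcode z, landsat_lst l
    WHERE RS_Intersects(z.geometry, l.rast)
""")
joined_df.createOrReplaceTempView("zone_temp")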

Now let’s calculate the average temperature of each zone using RS_ZonalStats:

SELECT ZCTA5CE10, RS_ZonalStats(rast, geometry, 'avg') as avg
FROM zone_temp

The output is as follows:

+---------+------------------+
|ZCTA5CE10|               avg|
+---------+------------------+
|    85350| 85.66846905011799|
|    85350|111.90322619887085|
|    85365| 43.69896389654749|
|    85350|117.59570757880628|
|    85365|120.40936393442645|
+---------+------------------+

Flexible raster data manipulation

The Sedona 1.5 series introduces additional functions to address a common challenge: raster data from remote sensing devices is often not in the preferred format for analysis. These new features are designed to streamline the process of transforming and analyzing such data. Let’s explore a few interesting examples.

RS_Clip

It returns a raster that is clipped by the given geometry.

Original raster

[Image: the original raster]

SELECT RS_Clip(
        RS_FromGeoTiff(content), 1,
        ST_GeomFromWKT('POLYGON ((236722 4204770, 243900 4204770, 243900 4197590, 221170 4197590, 236722 4204770))'),
        200, false
    )

Output

[Image: the clipped raster]

RS_Resample

It resamples a raster using a given resampling algorithm, new dimensions (width and height), a new grid corner to pivot the raster at (gridX and gridY), and a set of georeferencing attributes (scaleX and scaleY).

Original raster

[Image: the original raster]

SELECT RS_Resample(raster, 1000, 1000, false, 'NearestNeighbor') as resampled_raster
FROM raster_table

Output

[Image: the resampled raster]

RS_AsRaster

Converts a geometry to a raster object.

SELECT RS_AsRaster(ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'), r.raster, 'b', 230) as rasterized_geom
FROM raster_table r

Output

[Image: the rasterized geometry]

RS_MapAlgebra

Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.

Let’s reuse the rasterized geometry above and convert its background to white.

SELECT RS_MapAlgebra(rasterized_geom, NULL, 'out[0] = rast[0] == 0 ? 230 : 0;') AS raster
FROM rasterized_table

Output

[Image: the inverted raster on a white background]

SedonaKepler and SedonaPyDeck

Sedona introduced SedonaKepler and SedonaPyDeck to integrate with KeplerGL and DeckGL, aiming to enhance geospatial data visualization. These integrations allow users to utilize advanced mapping and visualization features directly from Sedona’s spatial data processing results.

One can simply run the following commands to visualize a Sedona DataFrame using SedonaKepler and SedonaPyDeck:

from sedona.spark import *
SedonaKepler.create_map(df=groupedresult, name="AirportCount")
SedonaPyDeck.create_choropleth_map(df=groupedresult, plot_col='AirportCount')
SedonaPyDeck.create_geometry_map(df_building, elevation_col='height')

[Images: SedonaKepler map, SedonaPyDeck choropleth map, and SedonaPyDeck geometry map]

GeoParquet reader

Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. It can read any GeoParquet files written by third-party systems such as GeoPandas. Most importantly, Sedona will use the bbox properties (https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox) in the metadata to determine if all data in a file can be discarded by the spatial predicate. This optimization can reduce the number of files scanned when the queried GeoParquet dataset is partitioned by spatial proximity.

df = sedona.read.format("geoparquet").load("s3a://my/path/*.parquet") \
    .filter("ST_Contains(ST_GeomFromWKT('POLYGON (XXX)'), geometry)")

A user can also inspect the schema of a GeoParquet file.

df.printSchema()

Output

root
 |-- pop_est: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- name: string (nullable = true)
 |-- iso_a3: string (nullable = true)
 |-- gdp_md_est: double (nullable = true)
 |-- geometry: geometry (nullable = true)

A user can also inspect the metadata of a GeoParquet file.

df = sedona.read.format("geoparquet.metadata").load("s3a://my/path/*.parquet")
df.show()
df.printSchema()

Output of the GeoParquet metadata

+------------+--------------+------------------------------------------------------------------------------------------+-------+
|version     |primary_column|columns                                                                                   |geohash|
+------------+--------------+------------------------------------------------------------------------------------------+-------+
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 45.0038534, -33.75027, 50.5347176], null}}            |g0     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 33.8141064, -33.7678528, 39.3735873], null}}          |en     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-44.8187256, -56.1774766, -34.1630816, -50.8584471], null}}  |5n     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-112.2099566, -33.7316387, -101.2729317, -28.2008956], null}}|3d     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-33.2666016, -72.9196355, -22.6803555, -67.5007358], null}}  |57     |
+------------+--------------+------------------------------------------------------------------------------------------+-------+

Schema of the GeoParquet metadata

root
 |-- version: string (nullable = true)
 |-- primary_column: string (nullable = true)
 |-- columns: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- encoding: string (nullable = true)
 |    |    |-- geometry_types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- bbox: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |    |    |-- crs: string (nullable = true)
 |-- geohash: string (nullable = true)

GeoParquet reader legacy mode

Due to a breaking change in Apache Sedona 1.4.0 to the SQL type of GeometryUDT (SEDONA-205) as well as the serialization format of geometry values (SEDONA-207), Parquet files containing geometry columns written by Apache Sedona 1.3.1 or earlier cannot be read by Apache Sedona 1.4.0 or later.

For Parquet files written using the "parquet" format by Apache Sedona 1.3.1-incubating or earlier:

df.write.format("parquet").save("path/to/parquet/files")

Since v1.5.1, the GeoParquet reader supports reading such legacy Parquet files. You can use the "geoparquet" format with the .option("legacyMode", "true") option.

df = sedona.read.format("geoparquet").option("legacyMode", "true").load("path/to/legacy-parquet-files")

GeoParquet writer

The Sedona GeoParquet writer now fully conforms to GeoParquet spec 1.0.0. Moreover, a user can also control the metadata written to the GeoParquet files.

df.write.format("geoparquet") \
    .option("geoparquet.version", "1.0.0") \
    .option("geoparquet.crs", projjson) \
    .save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")

To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see ST_GeoHash) and then save as a GeoParquet file. An example is as follows:

SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash
FROM spatialDf
ORDER BY geohash
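
The same ordering can also be expressed with the DataFrame API before writing. A minimal sketch (spatialDf and its geom column are assumed from the query above; the output path is a placeholder):

from pyspark.sql.functions import expr

sorted_df = spatialDf \
    .withColumn("geohash", expr("ST_GeoHash(geom, 5)")) \
    .sort("geohash") \
    .drop("geohash")
sorted_df.write.format("geoparquet").save("s3a://my/path/sorted.parquet")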

The following figure is the visualization of the bbox of Sedona GeoParquet files ordered by GeoHash.

[Image: bounding boxes of GeoParquet files ordered by GeoHash]

Uber H3 IDs

Uber H3 is a technique to model and index geospatial data. H3 grids the earth’s surface by projecting it onto an icosahedron and tiling it with hexagons (plus a few pentagons).

Sedona 1.5 now provides the following functions related to Uber H3:

  • ST_H3CellIDs(geom: geometry, level: Int, fullCover: Boolean)
  • ST_H3CellDistance(cell1: Long, cell2: Long)
  • ST_H3KRing(cell: Long, k: Int, exactRing: Boolean)
  • ST_H3ToGeom(cells: Array[Long])
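
The cell-level functions operate directly on the Long cell IDs returned by ST_H3CellIDs. A small illustration of ST_H3CellDistance and ST_H3KRing (the coordinates are arbitrary; array indexing picks the first covering cell):

# Grid distance between the level-10 cells of two nearby points
sedona.sql("""
    SELECT ST_H3CellDistance(
        ST_H3CellIDs(ST_GeomFromWKT('POINT (-122.33 47.61)'), 10, false)[0],
        ST_H3CellIDs(ST_GeomFromWKT('POINT (-122.32 47.61)'), 10, false)[0]
    ) AS dist
""").show()

# All cells within k = 1 rings around a cell
sedona.sql("""
    SELECT ST_H3KRing(
        ST_H3CellIDs(ST_GeomFromWKT('POINT (-122.33 47.61)'), 10, false)[0],
        1, false
    ) AS neighbors
""").show(truncate=False)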

Let’s use the Seattle road network dataset (from OSM) as an example.

[Image: the Seattle road network]

You can create H3 cell IDs using ST_H3CellIDs as follows.

roadDf = sedona.read.format("csv").option("header", "true").load(PATH_PREFIX + "data/OSM2015_roads.csv")
roadDf = roadDf.selectExpr(
    "monotonically_increasing_id() as id",
    "ST_GeomFromWkt(geometry) as road", "`attr#1` as attr",
    "ST_H3ToGeom(ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false)) as roadH3Geom",
    "ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false) as roadH3"
)

Output

+---+--------------------+--------------------+--------------------+--------------------+
| id|                road|                attr|          roadH3Geom|              roadH3|
+---+--------------------+--------------------+--------------------+--------------------+
|  0|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966219243...|
|  1|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150966281830...|
|  2|LINESTRING (-122....|    [surface|asphalt|MULTIPOLYGON (((-...|[6222150966517759...|
|  3|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966243164...|
|  4|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966286090...|
|  5|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966439772...|
|  6|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966722559...|
|  7|LINESTRING (-122....|   [highway|service]|MULTIPOLYGON (((-...|[622215096642338815]|
|  8|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966830039...|
|  9|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150959224913...|
+---+--------------------+--------------------+--------------------+--------------------+

We can visualize the H3 cells using GeoPandas plots:

import geopandas as gpd

# Convert a sample of the Sedona DataFrame to GeoPandas
pandasDf = roadDf.limit(100).toPandas()
roadGpd = gpd.GeoDataFrame(pandasDf, geometry="road")
h3Gpd = gpd.GeoDataFrame(pandasDf, geometry="roadH3Geom")

# Draw the H3 cells first, then overlay the roads
ax = h3Gpd.plot(edgecolor='gray', linewidth=1.0)
ax.set_xlabel('Longitude (degrees)')
ax.set_ylabel('Latitude (degrees)')
roadGpd.plot(ax=ax, edgecolor='black', linewidth=1.0)

[Image: H3 cells overlaid on the road network]

You can also use Sedona H3 functions together with SedonaPyDeck for better visualization. Please read our blog post: Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet

[Image: unmatched industrial vessels]

New cluster compute engine support

Sedona 1.5 introduces support for Snowflake and compatibility with Spark versions up to 3.5, enhancing geospatial analysis. These updates facilitate advanced spatial data processing and analysis in distributed environments, leveraging the latest Spark capabilities and integrating with Snowflake for cloud-based data warehousing solutions.

Community contributions and acknowledgements

The Sedona 1.5.0 and 1.5.1 releases would not have been possible without the help of the Apache Sedona community. We thank the following community members for their contributions:

jiayuasu, jbampton, hongbo-miao, Kontinuation, furqaankhan, iGN5117, prantogg, duhaode520, gregleleu, MyEnthusiastic, wendrickje, yyy1000

Future directions

Apache Sedona has grown significantly in the past few years. We are excited for an electrifying future with Sedona as it gears up to revolutionize geospatial processing. Here are the future directions of Apache Sedona.

  • Adding more cloud platform integrations in addition to the current Spark, Flink and Snowflake support
  • Developing advanced raster and vector data processing features.
  • Improving performance and scalability for large datasets.
  • Adding more comprehensive geospatial functions.
  • Increasing compatibility with the broader spatial data ecosystem such as QGIS, Lonboard, Rasterio, NumPy, Ibis
  • Strengthening documentation and community engagement for wider adoption.

Wherobots Cloud: The Cloud-native Spatial Analytics Data Platform

According to Gartner, 97% of data collected at the enterprise sits on the shelf without being put into use. That is a shockingly big number, especially given that the data industry got its hopes up a few years back when the Economist published its article “The most valuable resource is no longer oil, it’s data”. That is also quite surprising given the hundreds of billions of dollars invested in database and analytics platforms over the past two decades.

One main reason is that data professionals often struggle to connect data to use cases. A natural way for data professionals to achieve that is to link their data/insights to data about the physical world, aka “Spatial Data”, and hence ask physical-world questions of such data, aka “Spatial Analytics”. This spatial approach can be an indispensable asset for businesses worldwide. Use cases range from determining optimal delivery routes to making informed decisions about property investments, to climate and agricultural technology. For instance, commercial real estate data makes more sense when connected to spatial data about nearby objects (e.g., buildings, POIs), man-made events (e.g., crime, traffic), as well as natural events such as wildfires and floods. The importance of understanding the ‘where’ cannot be overstated, especially when it influences operational efficiency, customer satisfaction, and strategic planning.

The significance of spatial analytics underscores the pressing need for its efficient management within the enterprise data stack. Incumbent data platforms, often not built to handle the intricacies and scale of spatial analytics, fall short in meeting these demands. Recognizing this gap, we introduce Wherobots Cloud, a novel spatial analytics database platform. Here is a summary of features supported:

Wherobots Key Features

Linking Enterprise Data to the Spatial world

Wherobots seamlessly incorporates spatial analytics in the enterprise data stack to bring data many steps closer to use cases. Using a scalable spatial join technology, Wherobots can link customer data stored anywhere to tens of terabytes of spatial data such as maps, roads, buildings, natural events, and man-made events in a few minutes. Users can then apply spatial data processing, analytics, and AI tasks using SQL and Python on their data with unparalleled efficiency and adaptability.

To get started with Wherobots, please visit the Wherobots website.

Scalability

With its scalable, distributed architecture, Wherobots is redefining the way businesses handle geometry and raster spatial data processing and analytics in the cloud. Wherobots achieves that in two main ways:

  1. Separating Compute/Storage: Wherobots uniquely separates the spatial processing and analytics layer from the data storage layer. This approach allows for optimal performance and scalability.
  2. Distributed System Architecture: By employing a distributed system architecture, Wherobots ensures scalable out-of-core spatial computation, catering to massive datasets without compromising speed or accuracy.

Openness

Wherobots builds upon and amplifies the capabilities seen in the open-source Apache Sedona (OSS Sedona). While OSS Sedona provides foundational spatial analytics functions using spatial SQL and Python, Wherobots takes it to the next level with its faster query processing, lakehouse architecture, and its self-service yet fully-managed provisioning on Wherobots Cloud. This makes Wherobots a more comprehensive and streamlined solution for businesses. Based on our benchmarks, Wherobots is up to 10x faster than OSS Sedona for geometry data processing, and up to 20x faster than OSS Sedona for raster data processing.

Spatial Lakehouse Solution

One of the standout features of Wherobots is its support for an Apache Iceberg-compatible spatial table format, dubbed "Havasu." This feature facilitates efficient querying and updating of geometry and raster columns on Parquet files in cloud object stores such as AWS S3. It enables spatial analytics on the sheer volume of data dumped in cloud object stores that, until today, remains seldom put to use. Details about the Havasu spatial data lake format are available here.

Self-service

Wherobots is provisioned as a fully-managed service within the Wherobots Cloud, ensuring that users don’t have to delve into the intricacies of managing cloud or compute resources.

By delegating resource management to Wherobots, businesses can concentrate on their core spatial analytics tasks, achieving their objectives faster, efficiently, and cost-effectively.

To use Wherobots, you first need to create an account on Wherobots Cloud. To get started, please visit the Wherobots website.

Connectivity

Wherobots comes equipped with connectors for major data storage platforms and databases. These include cloud object stores, data warehouses like Snowflake and Redshift, lakehouses such as Databricks, and OLTP databases including Postgres / PostGIS.

Wherobots example usage

Using Wherobots, users can perform a plethora of spatial queries and analytics operations on their data. Here are some common operations users can invoke in Wherobots. For more details on these examples, please refer to the Wherobots documentation.

Insert geometry data

sedona.sql("""
INSERT INTO wherobots.test_db.test_table
VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))
""")

Insert external raster data

sedona \
    .sql("SELECT RS_FromPath('s3a://XXX.tif') as rast") \
    .writeTo("wherobots.test_db.test_table") \
    .append()

Create a spatial index

sedona.sql("CREATE SPATIAL INDEX FOR wherobots.db.test_table USING hilbert(geom, 10)")

Read data from PostGIS

from pyspark.sql import functions as f

sedona.read \
    .format("jdbc") \
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table") \
    .load() \
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)"))

Read data from CSV on AWS S3

sedona.read \
    .format("csv") \
    .load("s3a://data.csv")

Read data from a Havasu table

sedona.table("wherobots.test_db.test_table").filter("ST_Contains(city, location) = true")
sedona.sql("SELCT * FROM wherobots.test_db.test_table WHERE ST_Contains(city, location) = true")

Spatial join query to find zones prone to wildfires

fire_zone = sedona.sql(
    """
    SELECT
        z.geometry as zipgeom,
        z.ZCTA5CE10 as zipcode,
        f.FIRE_NAME
    FROM
        wherobots_open_data.us_census.zipcode z,
        wherobots_open_data.weather.wild_fires f
    WHERE
        ST_Intersects(z.geometry, f.geometry)
    """
)

Visualize geometry data in a notebook

SedonaKepler.create_map(geometryDf)



Visualize raster data in a notebook

SedonaUtils.display_image(rasterDF.selectExpr("RS_AsImage(raster)"))

Performance Benchmark

To better showcase Wherobots’s performance, we conducted a comprehensive performance benchmark on some commonly seen spatial data processing tasks. Please download the full report of our Wherobots Performance Benchmark.

To use Wherobots, you first need to create an account on Wherobots Cloud. To get started, please visit the Wherobots website.

Spatial Data, GeoParquet, and Apache Sedona

Apache Parquet is a modern columnar data file format optimized for analytical workloads. It is widely adopted in the big data ecosystem. Spatial scientists may also want to use Parquet to store spatial data, especially when analyzing large-scale datasets on Spark or cloud data warehouses. However, problems arise when storing spatial data in Parquet files:

  1. Parquet does not have native support for geometry objects such as points, linestrings, and polygons. These geometry objects need to be serialized to be stored as a primitive type (such as BINARY) in Parquet files, as sketched below. There’s no easy way to tell which column stores geometry objects by simply looking at the schema of the files. Users may adopt conventions such as storing geometry objects in a geom or geometry column, or they have to read some metadata about the dataset to find which columns hold geometry data.
  2. There are many serialization formats, such as WKB, WKT, and GeoJSON. Users need to know which serialization format was used to properly read the geometry objects from geometry columns.
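
To illustrate the first problem, here is how a geometry is typically round-tripped through WKB in Python. Shapely is used only for the illustration; any WKB codec works the same way:

from shapely import wkb
from shapely.geometry import Point

# Serialize a point to WKB bytes -- this is what lands in a BINARY parquet column
binary_value = wkb.dumps(Point(1.0, 2.0))

# Deserialize it back; nothing in the parquet schema itself says this column is a geometry
geom = wkb.loads(binary_value)
print(geom)  # POINT (1 2)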

GeoParquet is an incubating OGC standard for addressing these problems. It defines how to serialize geometry objects as Parquet values, along with file-level metadata for describing geometry columns. Implementations that follow this standard can easily exchange Parquet files containing spatial data and enjoy the benefits of the Parquet columnar format when running analytics on GeoParquet files.

A Brief Introduction to GeoParquet Specification

The main idea of the GeoParquet specification is that it defines a standard schema for file-level metadata attached to the footer of Parquet files. It defines which columns hold geometry data and the encoding of the geometry data in each column. For example, we can load a GeoParquet file using GeoPandas and see that the column storing geometry objects is interpreted as a geometry field:

>>> import geopandas as gpd
>>> df_geom = gpd.read_parquet('./sample.parquet')
>>> df_geom
  id                                                  g
0  1  POLYGON ((-111.50299 33.17965, -111.50299 33.8...
1  2  POLYGON ((-111.50299 33.17965, -111.50299 33.8...
>>> df_geom.dtypes
id      object
g     geometry
dtype: object

We can inspect the metadata of this GeoParquet file using PyArrow. The metadata keyed by geo stores information on the geometry columns:

>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> import json
>>> table = pa.parquet.read_table('./sample.parquet')
>>> metadata_str = table.schema.metadata[b'geo'].decode('utf-8')
>>> metadata = json.loads(metadata_str)
>>> print(json.dumps(metadata, indent=2))
{
  "primary_column": "g",
  "columns": {
    "g": {
      "encoding": "WKB",
      "crs": null,
      "geometry_type": "Polygon",
      "bbox": [
        -112.473156,
        33.179649,
        -111.502991,
        33.868652
      ]
    }
  },
  "version": "0.4.0",
  "creator": {
    "library": "geopandas",
    "version": "0.12.2"
  }
}

The metadata designates column g as the primary geometry column, encoded as WKB. There is some additional information about this column, such as the geometry types and the CRS of geometries in this column. Readers can refer to the column metadata section of the specification to understand how to interpret this information.

There is an interesting piece of metadata for geometry columns keyed by bbox: the minimum bounding box of all the geometries in the column. This metadata is useful for skipping unrelated GeoParquet files when running a spatial range query on a dataset made up of many GeoParquet files. For example, if a query window only overlaps with the bounding boxes of part1.parquet and part2.parquet, the query engine can safely skip scanning part3.parquet and part4.parquet, reducing the IO cost and answering the query faster.
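
Conceptually, the pruning decision is a simple rectangle-overlap test between the query window and each file’s bbox metadata. A toy sketch of the logic (not Sedona’s actual implementation; the file names and boxes are made up):

def bbox_overlaps(a, b):
    # Each bbox is (min_x, min_y, max_x, max_y); True if the rectangles intersect
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

query_window = (-112.47, 33.18, -111.50, 33.87)
file_bboxes = {
    "part1.parquet": (-113.0, 33.0, -112.0, 34.0),
    "part2.parquet": (-112.0, 33.0, -111.0, 34.0),
    "part3.parquet": (-111.0, 33.0, -110.0, 34.0),
}

# Only part1 and part2 overlap the window, so part3 is never scanned
files_to_scan = [f for f, bbox in file_bboxes.items() if bbox_overlaps(query_window, bbox)]
print(files_to_scan)  # ['part1.parquet', 'part2.parquet']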

Working with GeoParquet in Apache Sedona

Apache Sedona 1.4.1 supports reading and writing GeoParquet files on Apache Spark 3.0 ~ 3.4. We’ll go through an example to demonstrate the usage of GeoParquet support of Apache Sedona, and how to speed up your spatial queries using GeoParquet.

Saving DataFrames to GeoParquet Files

We’ll use the TIGER census 2022 Shapefile dataset in our examples. Readers can navigate to https://www2.census.gov/geo/tiger/TIGER2022/TABBLOCK20/ and download the datasets. We’ve already prepared the data for Arizona state in our public S3 bucket so we can simply run this code to load the Shapefiles.

from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter

spatialRdd = ShapefileReader.readToGeometryRDD(sc, "s3a://wherobots-examples/data/census-2022/tl_2022_04_tabblock20")
df = Adapter.toDf(spatialRdd, spark)
df.show(5)

+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+---------+--------+-----------+------------+---------+-----+
|            geometry|STATEFP20|COUNTYFP20|TRACTCE20|BLOCKCE20|        GEOID20|    NAME20|MTFCC20|UR20|UACE20|UATYPE20|FUNCSTAT20|  ALAND20|AWATER20| INTPTLAT20|  INTPTLON20|HOUSING20|POP20|
+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+---------+--------+-----------+------------+---------+-----+
|POLYGON ((-109.80...|       04|       009|   961100|     1175|040099611001175|Block 1175|  G5040|   R|      |        |         S|        0|   57613|+32.9046017|-109.8078124|        0|    0|
|POLYGON ((-110.44...|       04|       009|   961600|     1025|040099616001025|Block 1025|  G5040|   R|      |        |         S|100235475|    8124|+32.8481268|-110.3910394|        3|    2|
|POLYGON ((-109.26...|       04|       003|   000501|     1186|040030005011186|Block 1186|  G5040|   R|      |        |         S| 28230312|   67735|+31.5716354|-109.2328547|        0|    0|
|POLYGON ((-109.18...|       04|       003|   000501|     1195|040030005011195|Block 1195|  G5040|   R|      |        |         S| 12013891|  451563|+31.5362891|-109.1563449|        0|    0|
|POLYGON ((-113.35...|       04|       005|   002301|     3702|040050023013702|Block 3702|  G5040|   R|      |        |         S|   118622| 1600804|+35.9649737|-113.3170363|        0|    0|
+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+---------+--------+-----------+------------+---------+-----+

Now we can simply save the DataFrame as GeoParquet using df.write.format("geoparquet").save:

df.write.format("geoparquet").save(DATA_ROOT_PATH + "/tl_2022_04_tabblock20.parquet")

Loading DataFrames from GeoParquet Files

We can call sedona.read.format("geoparquet").load to load a DataFrame from GeoParquet files. The geometry columns will be decoded as geometry objects automatically.

df_geoparquet = sedona.read.format("geoparquet").load(DATA_ROOT_PATH + "/tl_2022_04_tabblock20.parquet")
df_geoparquet.show(5)
df_geoparquet.printSchema()

+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+---------+--------+-----------+------------+---------+-----+
|            geometry|STATEFP20|COUNTYFP20|TRACTCE20|BLOCKCE20|        GEOID20|    NAME20|MTFCC20|UR20|UACE20|UATYPE20|FUNCSTAT20|  ALAND20|AWATER20| INTPTLAT20|  INTPTLON20|HOUSING20|POP20|
+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+---------+--------+-----------+------------+---------+-----+
|POLYGON ((-109.80...|       04|       009|   961100|     1175|040099611001175|Block 1175|  G5040|   R|      |        |         S|        0|   57613|+32.9046017|-109.8078124|        0|    0|
|POLYGON ((-110.44...|       04|       009|   961600|     1025|040099616001025|Block 1025|  G5040|   R|      |        |         S|100235475|    8124|+32.8481268|-110.3910394|        3|    2|
|POLYGON ((-109.26...|       04|       003|   000501|     1186|040030005011186|Block 1186|  G5040|   R|      |        |         S| 28230312|   67735|+31.5716354|-109.2328547|        0|    0|
|POLYGON ((-109.18...|       04|       003|   000501|     1195|040030005011195|Block 1195|  G5040|   R|      |        |         S| 12013891|  451563|+31.5362891|-109.1563449|        0|    0|
|POLYGON ((-113.35...|       04|       005|   002301|     3702|040050023013702|Block 3702|  G5040|   R|      |        |         S|   118622| 1600804|+35.9649737|-113.3170363|        0|    0|
+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+---------+--------+-----------+------------+---------+-----+
only showing top 5 rows

root
 |-- geometry: geometry (nullable = true)
 |-- STATEFP20: string (nullable = true)
 |-- COUNTYFP20: string (nullable = true)
 |-- TRACTCE20: string (nullable = true)
 |-- BLOCKCE20: string (nullable = true)
 |-- GEOID20: string (nullable = true)
 |-- NAME20: string (nullable = true)
 |-- MTFCC20: string (nullable = true)
 |-- UR20: string (nullable = true)
 |-- UACE20: string (nullable = true)
 |-- UATYPE20: string (nullable = true)
 |-- FUNCSTAT20: string (nullable = true)
 |-- ALAND20: string (nullable = true)
 |-- AWATER20: string (nullable = true)
 |-- INTPTLAT20: string (nullable = true)
 |-- INTPTLON20: string (nullable = true)
 |-- HOUSING20: string (nullable = true)
 |-- POP20: string (nullable = true)

As we can see, the geometry column was read back as a geometry column. Users can directly apply ST functions on this column without first parsing it from WKB.

Interoperability with GeoPandas

The primary goal of GeoParquet is defining a standard so that we can exchange parquet files containing geospatial data amongst various geospatial libraries and platforms. For instance, the GeoParquet files saved by Apache Sedona could be loaded by GeoPandas and vice versa. We can load the GeoParquet files we saved in the previous section using GeoPandas and see that the geometry column was correctly loaded.

gpd_df = geopandas.read_parquet(DATA_ROOT_PATH + "/tl_2022_04_tabblock20.parquet")
gpd_df.head()

geometry STATEFP20 COUNTYFP20 TRACTCE20 BLOCKCE20          GEOID20      NAME20 MTFCC20 UR20 UACE20 UATYPE20 FUNCSTAT20    ALAND20 AWATER20   INTPTLAT20    INTPTLON20 HOUSING20 POP20
0  POLYGON ((-109.80930 32.90518, -109.80928 32.9...        04        009    961100      1175  040099611001175  Block 1175   G5040    R                          S          0    57613  +32.9046017  -109.8078124         0     0
1  POLYGON ((-110.44854 32.76838, -110.44853 32.7...        04        009    961600      1025  040099616001025  Block 1025   G5040    R                          S  100235475     8124  +32.8481268  -110.3910394         3     2
2  POLYGON ((-109.26002 31.57156, -109.25791 31.5...        04        003    000501      1186  040030005011186  Block 1186   G5040    R                          S   28230312    67735  +31.5716354  -109.2328547         0     0
3  POLYGON ((-109.18276 31.55277, -109.14511 31.5...        04        003    000501      1195  040030005011195  Block 1195   G5040    R                          S   12013891   451563  +31.5362891  -109.1563449         0     0
4  POLYGON ((-113.35416 36.04097, -113.35416 36.0...        04        005    002301      3702  040050023013702  Block 3702   G5040    R                          S     118622  1600804  +35.9649737  -113.3170363         0     0
gpd_df.plot()

Load all Census 2022 Data

The entire census 2022 dataset takes 8 GB and contains 8 million polygons. We’ve already stored all the data in a single GeoParquet file, so readers can load it directly from our public S3 bucket. It is recommended to run the following query on Spark clusters with a good network connection to AWS S3; otherwise, you can download the dataset from here and store it somewhere you have fast access to.

df_census_2022 = sedona.read.format("geoparquet").load("s3a://wherobots-examples/data/census-2022-geoparquet.snappy.parquet")
df_census_2022.show(5)

+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+-------+--------+-----------+------------+---------+-----+
|            geometry|STATEFP20|COUNTYFP20|TRACTCE20|BLOCKCE20|        GEOID20|    NAME20|MTFCC20|UR20|UACE20|UATYPE20|FUNCSTAT20|ALAND20|AWATER20| INTPTLAT20|  INTPTLON20|HOUSING20|POP20|
+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+-------+--------+-----------+------------+---------+-----+
|POLYGON ((-91.968...|       22|       083|   970600|     3076|220839706003076|Block 3076|  G5040|   R|      |        |         S|1806215|       0|+32.2045100|-091.9584632|        0|    0|
|POLYGON ((-79.872...|       42|       129|   805200|     1000|421298052001000|Block 1000|  G5040|   U| 58162|       U|         S| 401352|       0|+40.1526521|-079.8673403|       68|  158|
|POLYGON ((-95.451...|       48|       339|   693103|     2013|483396931032013|Block 2013|  G5040|   U| 87300|       U|         S|  12340|       0|+30.3059421|-095.4503930|        9|   43|
|POLYGON ((-94.345...|       29|       037|   060904|     3004|290370609043004|Block 3004|  G5040|   U| 37189|       U|         S|  29419|       0|+38.6592501|-094.3441431|       18|   31|
|POLYGON ((-87.965...|       55|       089|   640102|     3027|550896401023027|Block 3027|  G5040|   U| 34122|       U|         S|   3807|       0|+43.3208487|-087.9652632|        4|    3|
+--------------------+---------+----------+---------+---------+---------------+----------+-------+----+------+--------+----------+-------+--------+-----------+------------+---------+-----+
only showing top 5 rows

Query the dataset

Now let’s run some spatial range queries on this dataset. Here we define a range query to fetch records within the city of Phoenix:

spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-112.473156, 33.179649, -111.502991, 33.868652))"

We can run the query and count the result set, and use %%time to measure time elapsed running this query:

%%time
df_census_2022.where(spatial_filter).count()

CPU times: user 15.3 ms, sys: 12.6 ms, total: 27.9 ms
Wall time: 1min 10s
Out[18]: 57812

The query took more than 1 minute to finish, and the result set contains 57812 rows. Now let’s explore ways to redistribute the dataset to achieve lower query latency.

Redistribute the dataset for faster spatial queries

GeoParquet files have bbox metadata for geometry columns. If we partition the dataset into multiple GeoParquet files such that geometry objects that are close to each other are stored in the same GeoParquet file, then each GeoParquet file covers only a small portion of the entire space. We only need to scan the GeoParquet files whose bbox overlaps with the query window and can skip scanning the rest. This optimization is called “filter pushdown”, and it has been implemented in Apache Sedona since 1.4.0. In this section we explore different ways to repartition the dataset to speed up spatial range queries.

Partition by State

The census 2022 dataset comes with a STATEFP20 column, which is the state ID of each record. We can partition the dataset by this field so that records for the same state are grouped together.

df_census_2022.repartition("STATEFP20").write.partitionBy('STATEFP20').format("geoparquet").save(DATA_ROOT_PATH + '/census_2022_part_statefp20.parquet')
df_census_2022_part_statefp20 = sedona.read.format("geoparquet").load(DATA_ROOT_PATH + "/census_2022_part_statefp20.parquet")

Now we can run the same spatial range query. Note that we don’t need to manually add a predicate for the partitioning field or manually figure out which state the query window overlaps with. Apache Sedona will prune GeoParquet files according to their bbox metadata.

%%time
df_census_2022_part_statefp20.where(spatial_filter).count()

CPU times: user 18.9 ms, sys: 0 ns, total: 18.9 ms
Wall time: 7.3 s
Out[17]: 57812

Now the query took less than 8 seconds. Let’s visualize the bounding boxes of each GeoParquet file to see the distribution of the partitioned dataset.

[Image: bounding boxes of GeoParquet files partitioned by state]

Partition by S2

If your dataset does not have a field such as STATEFP20 to directly partition by, you can partition the dataset by the S2 cell ID of the geometry. This requires selecting a proper S2 resolution level suited for your dataset. Here we repartition the dataset by the level-4 S2 cell ID of the geometry values.

from pyspark.sql.functions import expr

df_census_2022_s2 = df_census_2022.withColumn("s2", expr("array_max(ST_S2CellIds(ST_Centroid(geometry), 4))"))
df_census_2022_s2.repartition("s2").write.partitionBy("s2").format("geoparquet").save(DATA_ROOT_PATH + '/census_2022_part_s2.parquet')
df_census_2022_part_s2 = sedona.read.format("geoparquet").load(DATA_ROOT_PATH + "/census_2022_part_s2.parquet")

Let’s run the same query on this dataset partitioned by S2. Note that we don’t need to add a predicate for the s2 partitioning column.

%%time
df_census_2022_part_s2.where(spatial_filter).count()

CPU times: user 15.3 ms, sys: 856 µs, total: 16.2 ms
Wall time: 8.41 s
Out[16]: 57812

The bounding boxes of each GeoParquet file look like this.

[Image: bounding boxes of GeoParquet files partitioned by S2 cell]

Sort by the GeoHash of geometry

An alternative approach is to sort the dataset by the GeoHash of the geometry values. The way GeoHash is encoded means that geometry objects close to each other have similar GeoHash values. We can utilize this property to redistribute our datasets for faster queries:

from pyspark.sql.functions import col, expr

num_geoparquet_files = 100

sedona.conf.set("spark.sql.shuffle.partitions", num_geoparquet_files)
sedona.conf.set("spark.sql.adaptive.enabled", "false")
df_census_2022 = df_census_2022.withColumn("geohash", expr("ST_GeoHash(ST_Centroid(geometry), 20)")) \
    .sort(col("geohash")) \
    .drop("geohash")
df_census_2022.write.format("geoparquet").save(DATA_ROOT_PATH + "/census_2022_sorted.parquet")
sedona.conf.set("spark.sql.adaptive.enabled", "true")

Running the same query again, we can see it finishes in roughly 12 seconds.

%%time
df_census_2022_sorted = sedona.read.format("geoparquet").load(DATA_ROOT_PATH + "/census_2022_sorted.parquet")
df_census_2022_sorted.where(spatial_filter).count()

CPU times: user 12.4 ms, sys: 6.42 ms, total: 18.8 ms
Wall time: 11.7 s
Out[18]: 57812

The bounding boxes of each GeoParquet file look like this. Note that although the bounding boxes of the files significantly overlap with each other, we can still prune most of the files when running spatial queries.

[Image: bounding boxes of GeoParquet files sorted by GeoHash]

The Take-Away

Apache Sedona supports reading and writing GeoParquet files – a variant of Parquet for exchanging spatial data. Apache Sedona can perform spatial filter pushdown when running spatial range queries on GeoParquet files. You can group your GeoParquet dataset by spatial proximity to achieve faster spatial query performance.

Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet

Introduction

The Overture Maps Foundation (OMF) has recently released its first open-source Parquet dataset (https://overturemaps.org/download/), divided into four themes – places of interest, buildings, transportation networks, and administrative boundaries.

Apache Sedona is an open-source and distributed geospatial analytics engine. It enables large-scale spatial data processing and analysis through native handling of geospatial data types, Spatial SQL for querying, spatial indexing techniques, and support for standard formats like Shapefile, GeoParquet, GeoJSON, WKT, WKB, GeoTiff, and ArcGrid.

In this article, we demonstrate the capabilities of Apache Sedona for working with massive Overture Maps data. The article also shows that GeoParquet OMF data produced by Sedona can accelerate the spatial queries by 60X. This exploration underscores Sedona’s proficiency in handling scalable geospatial tasks using prevalent industry formats like Parquet and GeoParquet.

This article is structured as follows: First, we will explore and analyze the data in its original Parquet format. Then, we will convert it to the GeoParquet format with built-in spatial indexes. Finally, we will use Sedona in a Jupyter notebook to explore and analyze the dataset in GeoParquet form, leveraging capabilities like spatial SQL and spatial Python to derive insights.

Study 1: Analyze the OMF Parquet data using Sedona

Overture Maps uses the Parquet format to store its datasets. You can find the schema of these datasets on the OMF website (https://docs.overturemaps.org/reference). In this example, we will be using the Buildings theme dataset, which is the largest dataset in OMF at around 120 GB. The schema of this dataset is as follows:

id: string (nullable = true)
updatetime: string (nullable = true)
version: integer (nullable = true)
names: map (nullable = true)
 |-- key: string
 |-- value: array (valueContainsNull = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
level: integer (nullable = true)
height: double (nullable = true)
numfloors: integer (nullable = true)
class: string (nullable = true)
sources: array (nullable = true)
 |-- element: map (containsNull = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
bbox: struct (nullable = true)
 |-- minx: double (nullable = true)
 |-- maxx: double (nullable = true)
 |-- miny: double (nullable = true)
 |-- maxy: double (nullable = true)
geometry: binary (nullable = true)

To start using Sedona on OMF data, we first have to create a SedonaContext:

from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Apache Sedona facilitates easy loading and analysis of Parquet datasets through its APIs.

df = sedona.read.format("parquet").load("s3a://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Since Parquet does not natively support geospatial data types, the geometry columns in this dataset are stored in the WKB (Well-Known Binary) format. Sedona provides functionality to decode the WKB-encoded geometries into proper spatial types like points and polygons.

df.createOrReplaceTempView("df")
sedona.sql("SELECT ST_GeomFromWKB(geometry) as geometry FROM df").show(5)

+--------------------+
|            geometry|
+--------------------+
|POLYGON ((7.85925...|
|POLYGON ((2.69399...|
|POLYGON ((-95.775...|
|POLYGON ((103.141...|
|POLYGON ((111.900...|
+--------------------+

After processing, the dataset can be used for spatial range queries. We can perform an ST_Contains query between the dataset and Washington state’s boundary polygon to find the buildings within Washington state.

df = df.filter("ST_Contains(ST_GeomFromWKT('POLYGON((-123.3208 49.0023,-123.0338 49.0027,-122.0650 49.0018,-121.7491 48.9973,-121.5912 48.9991,-119.6082 49.0009,-118.0378 49.0005,-117.0319 48.9996,-117.0415 47.9614,-117.0394 46.5060,-117.0394 46.4274,-117.0621 46.3498,-117.0277 46.3384,-116.9879 46.2848,-116.9577 46.2388,-116.9659 46.2022,-116.9254 46.1722,-116.9357 46.1432,-116.9584 46.1009,-116.9762 46.0785,-116.9433 46.0537,-116.9165 45.9960,-118.0330 46.0008,-118.9867 45.9998,-119.1302 45.9320,-119.1708 45.9278,-119.2559 45.9402,-119.3047 45.9354,-119.3644 45.9220,-119.4386 45.9172,-119.4894 45.9067,-119.5724 45.9249,-119.6013 45.9196,-119.6700 45.8565,-119.8052 45.8479,-119.9096 45.8278,-119.9652 45.8245,-120.0710 45.7852,-120.1705 45.7623,-120.2110 45.7258,-120.3628 45.7057,-120.4829 45.6951,-120.5942 45.7469,-120.6340 45.7460,-120.6924 45.7143,-120.8558 45.6721,-120.9142 45.6409,-120.9471 45.6572,-120.9787 45.6419,-121.0645 45.6529,-121.1469 45.6078,-121.1847 45.6083,-121.2177 45.6721,-121.3392 45.7057,-121.4010 45.6932,-121.5328 45.7263,-121.6145 45.7091,-121.7361 45.6947,-121.8095 45.7067,-121.9338 45.6452,-122.0451 45.6088,-122.1089 45.5833,-122.1426 45.5838,-122.2009 45.5660,-122.2641 45.5439,-122.3321 45.5482,-122.3795 45.5756,-122.4392 45.5636,-122.5676 45.6006,-122.6891 45.6236,-122.7647 45.6582,-122.7750 45.6817,-122.7619 45.7613,-122.7962 45.8106,-122.7839 45.8642,-122.8114 45.9120,-122.8148 45.9612,-122.8587 46.0160,-122.8848 46.0604,-122.9034 46.0832,-122.9597 46.1028,-123.0579 46.1556,-123.1210 46.1865,-123.1664 46.1893,-123.2810 46.1446,-123.3703 46.1470,-123.4314 46.1822,-123.4287 46.2293,-123.4946 46.2691,-123.5557 46.2582,-123.6209 46.2573,-123.6875 46.2497,-123.7404 46.2691,-123.8729 46.2350,-123.9292 46.2383,-123.9711 46.2677,-124.0212 46.2924,-124.0329 46.2653,-124.2444 46.2596,-124.2691 46.4312,-124.3529 46.8386,-124.4380 47.1832,-124.5616 47.4689,-124.7566 47.8012,-124.8679 48.0423,-124.8679 48.2457,-124.8486 48.3727,-124.7539 48.4984,-124.4174 48.4096,-124.2389 48.3599,-124.0116 48.2964,-123.9141 48.2795,-123.5413 48.2247,-123.3998 48.2539,-123.2501 48.2841,-123.1169 48.4233,-123.1609 48.4533,-123.2220 48.5548,-123.2336 48.5902,-123.2721 48.6901,-123.0084 48.7675,-123.0084 48.8313,-123.3215 49.0023,-123.3208 49.0023))'), geometry) = true")
df.selectExpr("names", "height", "numfloors", "geometry").show(5)

+--------------------+------+---------+--------------------+
|               names|height|numfloors|            geometry|
+--------------------+------+---------+--------------------+
|{common -> [{valu...|   5.1|        1|POLYGON ((-122.32...|
|{common -> [{valu...|  10.9|        1|POLYGON ((-122.18...|
|{common -> [{valu...|   7.9|        1|POLYGON ((-122.31...|
|{common -> [{valu...|   9.2|        1|POLYGON ((-122.22...|
|{common -> [{valu...|   6.4|        1|POLYGON ((-122.21...|
+--------------------+------+---------+--------------------+

We can also leverage Apache Spark’s filter pushdown capabilities on non-spatial columns to reduce the data before geospatial analysis. Since the building dataset was large, we applied a highly selective filter pushdown:

from pyspark.sql.functions import col, size

df_building = df_building.filter(~(size(col("names")) <= 0)) \
                         .filter(col("height") <= 200) \
                         .filter(~(size(col("sources")) <= 0)) \
                         .filter(col("numfloors") == 1)

The count of the dataset after intersecting with the Washington state boundary:

df.count()

Count: 511
Time: 1h 31min 42s

Discussions

The spatial query and non-spatial filter pushdown on the dataset took an hour and half to complete on a single-node Sedona Docker environment. Therefore, users can barely do any interactive analytics on it. The reason is two-fold:

  1. The code was executed on a single-node Sedona docker environment with 4GB executor RAM, not a real cluster. Performance will be significantly improved if run in a distributed Sedona environment across multiple nodes.
  2. This lengthy processing time is primarily due to the limitations of using the Parquet format without geospatial optimizations. Parquet lacks native spatial data types and spatial indexing schemes suited for efficient geospatial analysis. The required data loading and preparation are time-consuming. Using a format optimized for geospatial workloads, like GeoParquet, could significantly reduce the pre-processing time for this analysis.

Study 2: Converting from Parquet to GeoParquet

Sedona enables spatial data format conversion between Parquet and GeoParquet. For more details, please refer to the Sedona GeoParquet blog post above. After realizing the limitations of Parquet, we decided to leverage this functionality and see how much improvement the GeoParquet format brings.

To achieve the best spatial filter pushdown performance, we need to partition the data based on spatial proximity. In other words, nearby spatial objects should be put in the same GeoParquet partition. For this purpose, we first create a GeoHash ID for each geometry in the OMF data using Sedona’s ST_GeoHash. This function generates geographic hashes for a given geometry at a specified precision. The precision refers to the size of the grid cells, where a precision of 2 indicates each cell is about 1,250 km x 625 km. This precision level was chosen as an optimal balance, since too high a precision produces many small files that slow down query processing and reduce read throughput.

from pyspark.sql.functions import expr
df = df.withColumn("geohash", expr("ST_GeoHash(geometry, 2)")).repartition("geohash")

By repartitioning with GeoHashing, data points with the same GeoHash ID get assigned to contiguous partition slices based on precision. This clusters nearby points together in the same partitions.

Finally, we store the GeoParquet Overture Maps data in our public Wherobots S3 bucket. The same operation is applied to all OMF datasets.

df.write.partitionBy("geohash").format("geoparquet").save("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")

Discussions

This conversion on all OMF datasets took around 15 minutes to finish on a 20-node AWS EMR cluster. Each node is an m3.2xlarge instance with 8 CPUs and 30 GB RAM. In closing, Sedona provides a streamlined way to convert datasets to GeoParquet format and partition them using GeoHashes for optimal performance. The entire GeoParquet dataset converted here is available in Wherobots’ public S3 bucket for you to experiment with.

Study 3: Interactive geospatial analytics on OMF GeoParquet data

We’ll ingest the public Wherobots dataset from S3 into a Spark DataFrame using Sedona, as before, and intersect the Bellevue, Washington boundary with the dataset.

Fast geospatial queries via Sedona GeoParquet spatial filter pushdown

With GeoParquet, we observe improved query performance versus the Parquet format over the following simple spatial range query (i.e., spatial filter).

spatial_filter = "POLYGON ((-122.235128 47.650163, -122.233796 47.65162, -122.231581 47.653287, -122.228514 47.65482, -122.227526 47.655204, -122.226175 47.655729, -122.222039 47.656743999999996, -122.218428 47.657464, -122.217026 47.657506, -122.21437399999999 47.657588, -122.212091 47.657464, -122.212135 47.657320999999996, -122.21092999999999 47.653552, -122.209834 47.650121, -122.209559 47.648976, -122.209642 47.648886, -122.21042 47.648658999999995, -122.210897 47.64864, -122.211005 47.648373, -122.21103099999999 47.648320999999996, -122.211992 47.64644, -122.212457 47.646426, -122.212469 47.646392, -122.212469 47.646088999999996, -122.212471 47.645213, -122.213115 47.645212, -122.213123 47.644576, -122.21352999999999 47.644576, -122.213768 47.644560999999996, -122.21382 47.644560999999996, -122.21382 47.644456999999996, -122.21373299999999 47.644455, -122.213748 47.643102999999996, -122.213751 47.642790999999995, -122.213753 47.642716, -122.213702 47.642697999999996, -122.213679 47.642689999999995, -122.21364 47.642678, -122.213198 47.642541, -122.213065 47.642500000000005, -122.212918 47.642466, -122.21275 47.642441, -122.212656 47.642433, -122.21253899999999 47.642429, -122.212394 47.64243, -122.212182 47.642444999999995, -122.211957 47.642488, -122.211724 47.642551999999995, -122.21143599999999 47.642647, -122.210906 47.642834, -122.210216 47.643099, -122.209858 47.643215, -122.20973000000001 47.643248, -122.20973599999999 47.643105, -122.209267 47.643217, -122.208832 47.643302, -122.208391 47.643347999999996, -122.207797 47.643414, -122.207476 47.643418, -122.20701199999999 47.643397, -122.206795 47.643387999999995, -122.205742 47.643246, -122.20549 47.643201999999995, -122.20500200000001 47.643119, -122.204802 47.643085, -122.204641 47.643066, -122.204145 47.643012, -122.203547 47.643012, -122.203097 47.643107, -122.20275699999999 47.643283, -122.202507 47.643496999999996, -122.202399 47.643653, -122.202111 47.643771, -122.201668 47.643767, -122.201363 47.643665, -122.20133 47.643648999999996, -122.201096 47.643536, -122.200744 47.64328, -122.200568 47.64309, -122.200391 47.642849, -122.200162 47.642539, -122.199896 47.642500000000005, -122.19980799999999 47.642424, -122.199755 47.642376999999996, -122.199558 47.642227999999996, -122.199439 47.642157, -122.199293 47.642078999999995, -122.199131 47.642004, -122.198928 47.641925, -122.19883 47.641892, -122.19856300000001 47.641811999999994, -122.198203 47.641731, -122.197662 47.641619999999996, -122.196819 47.641436, -122.196294 47.641309, -122.196294 47.642314, -122.19628 47.642855, -122.196282 47.642897999999995, -122.196281 47.643111, -122.196283 47.643415, -122.196283 47.643508999999995, -122.19628399999999 47.643739, -122.196287 47.644203999999995, -122.196287 47.644262999999995, -122.19629 47.644937999999996, -122.19629 47.644954999999996, -122.196292 47.645271, -122.196291 47.645426, -122.19629499999999 47.646315, -122.19629499999999 47.646432, -122.195925 47.646432, -122.195251 47.646432, -122.190853 47.646429999999995, -122.187649 47.646428, -122.187164 47.646426, -122.18683 47.646426, -122.185547 47.646409, -122.185546 47.646316, -122.185537 47.645599, -122.185544 47.644197, -122.185537 47.643294999999995, -122.185544 47.642733, -122.185541 47.641757, -122.185555 47.640681, -122.185561 47.63972, -122.185557 47.638228999999995, -122.185591 47.635419, -122.185611 47.634750999999994, -122.18562299999999 47.634484, -122.18561700000001 47.634375999999996, -122.185592 47.634311, -122.185549 47.634232999999995, -122.185504 
47.634181999999996, -122.185426 47.634119, -122.184371 47.633424999999995, -122.18400000000001 47.633198, -122.183896 47.633134, -122.1838 47.633067, -122.18375499999999 47.633019999999995, -122.183724 47.632959, -122.183695 47.632858, -122.183702 47.632675, -122.182757 47.632622999999995, -122.182365 47.63259, -122.18220600000001 47.632562, -122.181984 47.632504999999995, -122.18163799999999 47.632363, -122.18142 47.632262999999995, -122.181229 47.632165, -122.181612 47.632172999999995, -122.18271899999999 47.632151, -122.183138 47.632135, -122.18440000000001 47.632081, -122.184743 47.632065999999995, -122.185312 47.63205, -122.185624 47.632047, -122.185625 47.631873999999996, -122.184618 47.63187, -122.184291 47.631878, -122.184278 47.631817999999996, -122.183882 47.629942, -122.182689 47.623548, -122.182594 47.622789999999995, -122.182654 47.622155, -122.183135 47.622372999999996, -122.183471 47.622506, -122.18360200000001 47.622552, -122.183893 47.622637999999995, -122.184244 47.62272, -122.184618 47.622777, -122.184741 47.622727999999995, -122.184605 47.622679, -122.18424 47.622622, -122.183985 47.622569, -122.183717 47.622501, -122.183506 47.622439, -122.18327 47.622357, -122.18305699999999 47.622271999999995, -122.182669 47.622088999999995, -122.182796 47.621545, -122.18347 47.619628999999996, -122.18365 47.619098, -122.183859 47.6184, -122.183922 47.617793999999996, -122.183956 47.617292, -122.183792 47.616388, -122.183261 47.614391999999995, -122.183202 47.613802, -122.183209 47.613155, -122.183436 47.612384999999996, -122.18395100000001 47.610445999999996, -122.184338 47.60924, -122.184657 47.609116, -122.18481 47.609051, -122.18491900000001 47.608987, -122.184974 47.608942, -122.185047 47.608846, -122.185082 47.608743999999994, -122.185109 47.608526999999995, -122.185116 47.608359, -122.18513 47.608315999999995, -122.185157 47.608273999999994, -122.185183 47.608247, -122.185246 47.608214, -122.185354 47.608196, -122.185475 47.608191999999995, -122.185472 47.606697, -122.185472 47.606373999999995, -122.185521 47.606272, -122.185528 47.606210999999995, -122.185506 47.606037, -122.185451 47.605872999999995, -122.185411 47.605781, -122.185358 47.605681999999995, -122.185248 47.605509999999995, -122.185127 47.605365, -122.185058 47.605292, -122.184772 47.605038, -122.184428 47.604834, -122.184122 47.604693999999995, -122.183775 47.604574, -122.183644 47.604546, -122.183708 47.604400999999996, -122.183749 47.604223999999995, -122.18376 47.604037, -122.183707 47.603778, -122.183619 47.603556999999995, -122.183559 47.603406, -122.183488 47.603303, -122.183824 47.603167, -122.184108 47.603052, -122.184478 47.602902, -122.18543 47.602495, -122.186669 47.601957, -122.186433 47.601220999999995, -122.186341 47.601127999999996, -122.18874199999999 47.593742999999996, -122.188434 47.592338999999996, -122.188479 47.591786, -122.188217 47.591269999999994, -122.18795399999999 47.590871, -122.186822 47.589228, -122.187421 47.589228999999996, -122.18848299999999 47.589228999999996, -122.188433 47.587922999999996, -122.18990000000001 47.588547, -122.191368 47.589169999999996, -122.19158 47.589222, -122.191779 47.589254999999994, -122.192117 47.589289, -122.191569 47.587478999999995, -122.191323 47.586628999999995, -122.191295 47.586554, -122.191268 47.586479, -122.191192 47.586318, -122.191163 47.586268999999994, -122.1911 47.586164, -122.19099 47.586011, -122.19067 47.585668999999996, -122.1905 47.585515, -122.190301 47.58531, -122.190143 47.585152, -122.189573 47.584576999999996, -122.188702 
47.583735999999995, -122.188646 47.583679, -122.188239 47.583258, -122.188037 47.583005, -122.187832 47.582657, -122.187726 47.582164999999996, -122.18769499999999 47.581964, -122.18768299999999 47.581781, -122.187678 47.581592, -122.18766099999999 47.581455, -122.187674 47.581311, -122.18768 47.581146, -122.187722 47.580877, -122.187817 47.580569999999994, -122.187932 47.580301999999996, -122.188047 47.580087, -122.188161 47.579933999999994, -122.188399 47.579660999999994, -122.18851699999999 47.579547, -122.188621 47.579454, -122.188042 47.579493, -122.18762 47.579527, -122.187806 47.579358, -122.188009 47.579175, -122.18814499999999 47.579051, -122.188177 47.579021, -122.18842000000001 47.5788, -122.188638 47.578461, -122.188895 47.57806, -122.189791 47.577281, -122.190008 47.577103, -122.190372 47.576805, -122.19119 47.576358, -122.191877 47.576087, -122.193025 47.57566, -122.194317 47.575185999999995, -122.196061 47.574664, -122.197239 47.574386999999994, -122.197873 47.574267, -122.198286 47.574189999999994, -122.199091 47.574044, -122.199067 47.574574999999996, -122.199007 47.575921, -122.200335 47.578222, -122.20057299999999 47.578345999999996, -122.2009 47.578517999999995, -122.201095 47.578621999999996, -122.20138399999999 47.578776999999995, -122.201465 47.57882, -122.201516 47.578846999999996, -122.205753 47.581112, -122.209515 47.583124, -122.210634 47.583721, -122.21473399999999 47.587021, -122.21538699999999 47.588254, -122.21580399999999 47.589042, -122.216534 47.590421, -122.220092 47.596261, -122.220434 47.596821, -122.22041899999999 47.597837999999996, -122.220289 47.606455, -122.220234 47.610121, -122.22048 47.615221999999996, -122.220359 47.615379, -122.220283 47.615477999999996, -122.21999 47.615854999999996, -122.219993 47.61597, -122.22023300000001 47.616634, -122.220356 47.616687999999996, -122.220409 47.616712, -122.221401 47.618538, -122.22142 47.618573, -122.221456 47.618635, -122.221791 47.619222, -122.222492 47.619682999999995, -122.222799 47.619886, -122.222083 47.620368, -122.222046 47.620407, -122.222028 47.620449, -122.222025 47.620483, -122.22203999999999 47.620523999999996, -122.222079 47.620557999999996, -122.222156 47.620594999999994, -122.222458 47.620629, -122.222454 47.620673, -122.222454 47.620711, -122.22244599999999 47.621041999999996, -122.223056 47.621041, -122.223129 47.62104, -122.223153 47.62104, -122.223574 47.621041, -122.22377900000001 47.621041, -122.223857 47.621041, -122.22467499999999 47.621041, -122.224712 47.62104, -122.224958 47.62104, -122.225167 47.621049, -122.226882 47.621037, -122.227565 47.621032, -122.228002 47.621029, -122.22797800000001 47.621300999999995, -122.227919 47.626574999999995, -122.227914 47.627085, -122.227901 47.6283, -122.227881 47.630069, -122.227869 47.631177, -122.227879 47.631952999999996, -122.22789 47.633879, -122.227886 47.63409, -122.227871 47.635534, -122.227918 47.635565, -122.228953 47.635624, -122.22895199999999 47.635571999999996, -122.231018 47.635574999999996, -122.233276 47.635588999999996, -122.233287 47.63617, -122.233273 47.63639, -122.233272 47.636469999999996, -122.23327 47.636578, -122.233266 47.636827, -122.233263 47.636851, -122.233262 47.637014, -122.23322999999999 47.638110999999995, -122.233239 47.638219, -122.233262 47.638279, -122.233313 47.638324999999995, -122.233255 47.638359, -122.233218 47.638380999999995, -122.233153 47.638450999999996, -122.233136 47.638552999999995, -122.233137 47.638692, -122.232715 47.639348999999996, -122.232659 47.640093, -122.232704 47.641375, 
-122.233821 47.645111, -122.234906 47.648874, -122.234924 47.648938, -122.235128 47.650163))"

df = sedona.read.format("geoparquet").load("s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building")
df = df.filter("ST_Contains(ST_GeomFromWKT('"+spatial_filter+"'), geometry) = true")
df.count()

Count: 3423743
Time: 2min 47s

Switching from regular Parquet to GeoParquet reduced the query time from over an hour and a half to around 3 minutes, a 60X speedup. The query ran in the same single-node Sedona Docker environment, without any non-spatial filter pushdown. Further gains could be achieved by leveraging a distributed Sedona cluster. Overall, GeoParquet enables more efficient geospatial queries through both spatial partitioning and enhanced filter pushdown.
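
If your data is currently in plain Parquet, the conversion is straightforward. Below is a minimal sketch (the input path and column names are hypothetical) that decodes the geometry and sorts by geohash before writing GeoParquet, so that spatially nearby rows land in the same files and the per-file statistics become selective:

from pyspark.sql.functions import expr

# Hypothetical input: plain Parquet with a WKB-encoded geometry column.
df = sedona.read.format("parquet").load("s3://my-bucket/buildings-parquet/")

df = (df
      .withColumn("geometry", expr("ST_GeomFromWKB(geometry)"))  # decode WKB to a geometry column
      .withColumn("geohash", expr("ST_GeoHash(geometry, 12)"))   # geohash used for spatial clustering
      .sort("geohash"))                                          # cluster nearby rows into the same files

df.write.format("geoparquet").save("s3://my-bucket/buildings-geoparquet/")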

Interactive geospatial visualization via SedonaKepler

We also wanted to highlight SedonaKepler, our interactive visualization API built on kepler.gl. SedonaKepler provides a powerful, customizable visualization experience through a simple API. Creating visualizations with SedonaKepler is as straightforward as calling a single function. The map created by the code below shows the buildings in Bellevue, Washington.

from sedona.spark import *

SedonaKepler.create_map(df, "Building")

Time: 2min 38s

Photo: Visualization of building dataset spatially filtered to only include data for the city of Bellevue, Washington.

Of course, you can also choose another region and see what the dataset looks like there.

Photo: Visualization of the connector and segment datasets spatially filtered to only include data for Mount Rainier National Park, Washington, in satellite view.

SedonaKepler can also visualize multiple datasets together, revealing insights across interconnected geospatial data. To demonstrate, we use the segment and connector datasets, which capture the transportation network from OMF.

The dots below represent connection points between road segments. The road segments themselves are shown as yellow lines, representing paths that can be traversed by people, cars, or other vehicles. By layering this networked transport data over a map, we can gain insights into how the transportation infrastructure links together across the geographic area.

map_connector = SedonaKepler.create_map(df_connector, "Connector")
SedonaKepler.add_df(map_connector, df_segment, name="Segment")
map_connector

Time: 3min 11s

Photo: Visualization of connector and segment datasets spatially filtered to only include data for the city of Bellevue, Washington.
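
A side note on persisting these maps: SedonaKepler returns a kepler.gl map object, so, assuming the keplergl package’s save_to_html method, you can export the interactive map to a standalone HTML file that can be shared or embedded:

# Export the interactive map to a self-contained HTML file (file name is arbitrary).
map_connector.save_to_html(file_name="bellevue_transportation.html")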

The Take-Away

Apache Sedona enables both reading and writing of GeoParquet files, a specialized Parquet variant tailored for spatial data interchange. When executing spatial range queries on GeoParquet files, Apache Sedona applies spatial filter pushdown, improving query performance by over 60X in our benchmark. SedonaKepler is a powerful tool for creating interactive, easy-to-navigate visualizations, and it can render multiple datasets on one map.

Try it yourself

Notebooks

All notebooks used in this article are available in the Wherobots GitHub repository: https://github.com/wherobots/OvertureMaps

Interactive notebook using GeoParquet and Sedona

Use Wherobots to deploy Sedona to cloud vendors

The other notebooks used in Study 1 and 2 can be run on an AWS EMR or Databricks cluster. If you want to try them out, please sign up here: https://www.wherobots.ai/demo

Wherobots is a spatial data analytics and AI platform trusted in production, at scale, from the original creators of Apache Sedona.

Free and public Overture Maps GeoParquet data from Wherobots

The GeoParquet data produced in Study 2 is provided by Wherobots for free. It has the same schema and license as the original Overture Maps Parquet data, except that the geometry column uses the native geometry type and an additional geohash column of string type has been added. You can access the datasets as follows (a quick-start snippet follows the list):

  • Buildings: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=building
  • Places: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=place
  • AdminBoundary: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=administrativeBoundary
  • AdminLocality: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=admins/type=locality
  • Transportation Connector: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=connector
  • Transportation Segment: s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=transportation/type=segment
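
As a quick sanity check, here is a minimal sketch that loads one of the datasets above and inspects its schema; any of the listed paths should work the same way:

df = sedona.read.format("geoparquet").load(
    "s3://wherobots-public-data/overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=place")
df.printSchema()  # expect a geometry-typed 'geometry' column plus a string 'geohash' column
df.show(5)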

Announcing Apache Sedona 1.4 – A Giant Leap in Geospatial Processing

Hello Apache Sedona community,

We’re thrilled to share the launch of Apache Sedona 1.4 (https://sedona.apache.org/1.4.1/), which encompasses the 1.4.0 and 1.4.1 releases and marks a transformative stage in geospatial analytics. Together, these versions introduce an array of striking new features, major enhancements, and indispensable bug fixes.

Top-Notch Highlights

Here are the standout features spanning across these two versions:

  1. Use SedonaContext as a unified entry point (SEDONA-287): This allows you to use the same method to initialize your Sedona environment in different computation engines without tedious configuration.
  2. Geodesic/Geography Functions Support (SEDONA-281, SEDONA-286): This functionality adds a whole new dimension to your geospatial data processing, enabling more accurate computations over geographical data.
  3. Redesigned end-to-end raster ETL pipeline (SEDONA-251, SEDONA-269, SEDONA-292): Apache Sedona 1.4 includes new raster readers / writers, raster type, and many essential raster functions, paving the way for streamlined handling of vast volumes of geospatial data.
  4. Google S2 based spatial join (SEDONA-235): Sedona 1.4 introduces the support of Google S2 cells which enables a speedy approximate point-in-polygon join.
  5. Support for Spark 3.4 (SEDONA-276): Our support for Spark 3.4 allows you to merge the powerful capabilities of the latest Spark version with the enhanced geospatial features of Sedona.
  6. Serialization and Deserialization Enhancements (SEDONA-207, SEDONA-227, SEDONA-231): Sedona 1.4 offers faster geometry serialization and deserialization, resulting in a performance boost of 3X to 7X.
  7. Native filter pushdown support in GeoParquet (SEDONA-156): this pushes spatial predicates down to GeoParquet scans, reducing memory consumption by up to 10X.

Now, let’s dive into Features 1-4 and learn how to leverage them in your applications. We’ll discuss the remaining features in separate blog posts.

Use SedonaContext as a unified entry point (v1.4.1)

Initially designed to be a geospatial computation system on top of Apache Spark, Sedona has evolved over the years. It has extended its support to other engines and is now on a mission to become a cross-platform, scalable geospatial computation engine.

To use Sedona with different underlying engines, you previously had to import several classes and register the necessary Sedona data types, serializers, functions, and optimization rules, a process that can be confusing for new users. Let’s examine the Sedona Spark Python interface as an example:

Old Sedona entry point

from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.core.SpatialRDD import SpatialRDD
from sedona.utils.adapter import Adapter
from sedona.core.spatialOperator import JoinQueryRaw
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.sql.types import GeometryType
from sedona.core.enums import GridType
from sedona.core.enums import IndexType
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

spark = SparkSession.\
    builder.\
    master("local[*]").\
    appName("Sedona App").\
    config("spark.serializer", KryoSerializer.getName).\
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName) .\
    config("spark.jars.packages", "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2") .\
    getOrCreate()

SedonaRegistrator.registerAll(spark)

spatialDf = spark.sql("SELECT ST_GeomFromWKT(XXX) FROM")

Users previously had to manually import many classes and add serializer configuration.

However, with the introduction of SedonaContext and other helper functions in SEDONA-287, starting applications has become much easier. Here’s how to do it:

SedonaContext with Spark Python

from sedona.spark import *

config = SedonaContext.builder() .\
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,'
           'org.datasyslab:geotools-wrapper:1.4.0-28.2'). \
    getOrCreate()

sedona = SedonaContext.create(config)

spatialDf = sedona.sql("SELECT ST_GeomFromWKT(XXX) FROM")

SedonaContext with Flink Java

import org.apache.sedona.flink.SedonaContext;

StreamTableEnvironment sedona = SedonaContext.create(env, tableEnv);

Table spatialTbl = sedona.sqlQuery("SELECT ST_GeomFromWKT(XXX) FROM");

It’s worth noting that Sedona will maintain the old entry point for a long time, since many existing users are familiar with the old APIs.

Geodesic/Geography Functions Support (v1.4.1)

As we all know, the Earth is a spheroid, which makes accurately calculating the distance between two locations difficult. In the GIS industry, it is considered best practice to project geospatial objects (usually in WGS84 / EPSG:4326) to an appropriate meter-based Coordinate Reference System (CRS) before applying geometric calculations such as distance. Users should choose a CRS close to the locations of their data. For example, in Apache Sedona, one can calculate the distance between two locations as follows:

Sedona ST_Transform in SQL

CREATE OR REPLACE TEMP VIEW geomTable AS
SELECT ST_GeomFromWKT('POINT(-42.3 71.3)') AS geom1,
 ST_GeomFromWKT('POINT(-42.1 71.1)') AS geom2;

CREATE OR REPLACE TEMP VIEW geomTable_t AS
SELECT ST_Transform(geom1, 'epsg:4326', 'epsg:3857') AS geom1,
 ST_Transform(geom2, 'epsg:4326', 'epsg:3857') AS geom2
FROM geomTable;

SELECT ST_Distance(geom1, geom2) AS distanceInMeter FROM geomTable_t;

However, a single dataset often includes geospatial objects from across the globe, and no single CRS is suitable everywhere, so distances are difficult to calculate accurately. In this case, the best way to measure geographic distance is either the great-circle distance (haversine distance) or the spheroid distance (Vincenty’s formulae or Karney’s algorithm).

SEDONA-281 introduces support for both great-circle distance and spheroid distance, namely ST_DistanceSphere and ST_DistanceSpheroid. Let’s look at an example of calculating the distance between Hong Kong (22.308919, 113.914603) and Toronto (43.677223, -79.630556).

Sedona sphere and spheroid distance in SQL

All distance units in this example are in meters. The actual distance between Hong Kong and Toronto is about 12,544 to 12,568 kilometers. ST_DistanceSphere and ST_DistanceSpheroid both produce very accurate results. However, the ST_Transform-based approach cannot give a good answer when no suitable CRS (coordinate reference system) is available.

# ST_DistanceSphere
SELECT ST_DistanceSphere(ST_POINT(22.308919, 113.914603), ST_POINT(43.677223, -79.630556)) AS distance

+--------------------+
|            distance|
+--------------------+
|1.2548548944238186E7|
+--------------------+

# ST_DistanceSpheroid
SELECT ST_DistanceSpheroid(ST_POINT(22.308919, 113.914603), ST_POINT(43.677223, -79.630556)) AS distance

+--------------------+
|            distance|
+--------------------+
|1.2568775317073349E7|
+--------------------+

# ST_Transform based distance calculation
SELECT ST_DISTANCE(ST_TRANSFORM(ST_POINT(22.308919, 113.914603), 'epsg:4326', 'epsg:3857'),
 ST_TRANSFORM(ST_POINT(43.677223, -79.630556), 'epsg:4326', 'epsg:3857')) AS distance

+--------------------+
|            distance|
+--------------------+
|2.1735260976295043E7|
+--------------------+

Rule of thumb

Using ST_DistanceSphere and ST_DistanceSpheroid does not mean that the ST_Transform-based approach is no longer necessary. To select the most suitable approach, use the following rule of thumb: when calculating the distance between two geospatial objects that are far apart, such as cities in different countries or continents, use ST_DistanceSphere or ST_DistanceSpheroid. When calculating the distance between two objects in the same region such as a metropolitan area, use the ST_Transform-based approach.

Now that we can calculate the spherical distance between two geospatial objects, we can apply it to real-world applications that involve distance calculations among many geospatial objects, not just two. For example, given the locations of cities and parks around the world (available from OpenStreetMap), we can find the parks within 50 km of each city. This operation, called DistanceJoin in Sedona, is one of the most expensive and resource-consuming operations, but thanks to our highly efficient spatial join algorithm, Sedona easily handles it on huge datasets. This algorithm has long been available for the ST_Transform-based approach; SEDONA-286 extends it to support distance joins over ST_DistanceSphere and ST_DistanceSpheroid. A distance join query can be written as follows:

Sedona distance join in SQL

The value 50000 denotes 50 km, since the distance functions return meters.

# ST_DistanceSphere join
SELECT *
FROM cities c, parks p
WHERE ST_DistanceSphere(c.geom, p.geom) < 50000

# ST_DistanceSpheroid join
SELECT *
FROM cities c, parks p
WHERE ST_DistanceSpheroid(c.geom, p.geom) < 50000

# ST_Transform + ST_Distance
SELECT *
FROM cities c, parks p
WHERE ST_DISTANCE(ST_TRANSFORM(c.geom, 'epsg:4326', 'epsg:3857'),
 ST_TRANSFORM(p.geom, 'epsg:4326', 'epsg:3857')) < 50000

Redesigned end-to-end raster ETL pipeline (v1.4.1)

Sedona has been implementing geospatial vector data functions for several years, and that support is now largely mature. In recent releases, the Sedona community has invested more in geospatial raster data support, including raster readers, writers, and raster manipulation functions.

To enable comprehensive raster data support, Sedona has made the following efforts:

Raster data type

SEDONA-251 introduces a native raster data type. Similar to the geometry type, the raster type indicates that the column in a schema contains raster data. Each record stored in this type contains two parts: metadata and raster band information. With this in place, we can create a table that has both a geometry column and a raster column.

Raster reader

To begin processing raster data, the first step is to load the raster data. Additionally, we must be able to load raster data in parallel given the vast amount of available raster data. SEDONA-251 introduces a scalable raster reader and native raster data constructors. For example, we can read raster data files in Sedona Spark as follows:

sedona.read.format("binaryFile")
.load("raster/*.tiff").createOrReplaceTempView("binaryTable")

Next, construct a raster column from the binary content. We have added constructors such as RS_FromGeoTiff and RS_FromAscGrid, which convert various raster binary formats into a unified raster format.

CREATE OR REPLACE TEMP VIEW rasterTable AS
SELECT RS_FromGeoTiff(binaryTable.content) AS raster
FROM binaryTable

We can print the schema of this table. As we can see, the raster column now has the raster type.

sedona.table("rasterTable").printSchema()

root
 |-- raster: raster (nullable = true)
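
The same pattern applies to other input formats. For example, here is a sketch for Arc/Info ASCII grids using the RS_FromAscGrid constructor mentioned above (the paths and view names are illustrative):

# Load .asc files as binaries, then construct rasters with RS_FromAscGrid.
sedona.read.format("binaryFile") \
    .load("raster/*.asc").createOrReplaceTempView("ascBinaryTable")

sedona.sql("""
CREATE OR REPLACE TEMP VIEW ascRasterTable AS
SELECT RS_FromAscGrid(content) AS raster
FROM ascBinaryTable
""")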

Raster writer

After processing raster data, it is necessary to store the data in an external storage. With SEDONA-269, this can be achieved in Sedona Spark as follows:

First, convert the raster data to a specific binary format, such as GeoTiff. Note that a raster can be imported from one image format but exported to another.

CREATE OR REPLACE TEMP VIEW imageTable AS
SELECT RS_AsGeoTiff(raster) AS image
FROM rasterTable

After processing the data, we can save it to an external location.

sedona.table("rasterTable").write.format("raster").mode(SaveMode.Overwrite).save("my_raster_file")

Raster functions

SEDONA-251, SEDONA-269, and SEDONA-292 together introduce several functions to transform and analyze raster data. Let’s examine a few examples.

First, let’s assume that we have a GeoTiff image as follows:

Photo: The example GeoTiff image used in the following examples.

Extract bounding box, SRID and number of bands

We can extract the bounding box, SRID, and number of bands of this raster.

SELECT raster, RS_Envelope(raster) AS geom, RS_SRID(raster) AS srid, RS_NumBands(raster) AS numBands
FROM rasterTable

+--------------------+--------------------+-----+--------+
|              raster|                geom| srid|numBands|
+--------------------+--------------------+-----+--------+
|GridCoverage2D["g...|POLYGON ((590520 ...|32631|       3|
+--------------------+--------------------+-----+--------+

Now the resulting table has a schema that contains both raster and geometry types (see below):

root
 |-- raster: raster (nullable = true)
 |-- geom: geometry (nullable = true)
 |-- srid: integer (nullable = true)
 |-- numBands: integer (nullable = true)

This allows for more complex geometric operations on the table. For instance, we can save the table in GeoParquet format and benefit from filter pushdown: when executing a Sedona spatial range query on the GeoParquet table, Sedona retrieves only the data that intersects the query window.

sedona.table("rasterTable").write.format("geoparquet").save("rasterTable.parquet")

Extract individual band

We can extract any band in a raster.

SELECT RS_BandAsArray(raster, 1) as band1, RS_BandAsArray(raster, 2) as band2,
 RS_BandAsArray(raster, 3) as band3
FROM rasterTable

+--------------------+--------------------+--------------------+
|               band1|               band2|               band3|
+--------------------+--------------------+--------------------+
|[0.0, 799.0, 788....|[0.0, 555.0, 546....|[0.0, 330.0, 322....|
+--------------------+--------------------+--------------------+

Modify a band value

We can modify a band value and save it back to the raster:

SELECT RS_AddBandFromArray(raster, RS_GreaterThan(band1, 0), 1) AS raster
FROM rasterTable

The resulting GeoTiff looks like this:

Photo: The resulting GeoTiff after replacing band 1 with the RS_GreaterThan mask.

Google S2-based spatial join (v1.4.0)

If you’re experiencing suboptimal performance with Sedona’s optimized join, it may be due to complicated and overlapping geometries. Fortunately, Sedona offers a built-in alternative: the Google S2-based approximate equi-join. This equi-join utilizes Spark’s internal equi-join algorithm and may perform better than the optimized join, especially if you choose to skip the refinement step. However, be aware that skipping the refinement step will result in a loss of query accuracy.

Sedona’s approximate equi-join is a useful tool to have in your arsenal when working with complex geometries. Here is how to use it, step by step.

1. Generate S2 ids for both tables

Use ST_S2CellIds to generate cell IDs. Each geometry may produce one or more IDs.

SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM lefts

SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM rights

2. Perform equi-join

Join the two tables by their S2 cellId

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId

3. Optional: Refine the result

Due to the nature of S2 cell IDs, the equi-join results might contain a few false positives, depending on the S2 level you choose. A smaller level means bigger cells and fewer exploded rows, but more false positives.

To ensure correctness, you can use one of the spatial predicates to filter them out. Use this query instead of the query in Step 2.

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs, rcs
WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)

As you can see, compared to the query in Step 2, we added one more filter, ST_Contains, to remove false positives. You can also use ST_Intersects and other predicates.

Tip: you can skip this step if you don’t need 100% accuracy and want faster query speed.

4. Optional: De-duplicate

Due to the explode function used when generating the S2 cell IDs, the resulting DataFrame may contain several duplicate <lcs_geom, rcs_geom> matches. You can remove them with a GROUP BY query.

SELECT lcs_id, rcs_id , first(lcs_geom), first(lcs_name), first(rcs_geom), first(rcs_name)
FROM joinresult
GROUP BY (lcs_id, rcs_id)

The first function takes the first value from each group of duplicates.

If you don’t have a unique id for each geometry, you can also group by geometry itself. See below:

SELECT lcs_geom, rcs_geom, first(lcs_name), first(rcs_name)
FROM joinresult
GROUP BY (lcs_geom, rcs_geom)
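
Putting the four steps together, here is a sketch of the whole workflow using the Python DataFrame API, with the same hypothetical lefts/rights tables and column names used above:

# Step 1: generate S2 cell IDs for both sides.
sedona.sql("SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) AS cellId FROM lefts") \
    .createOrReplaceTempView("lcs")
sedona.sql("SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) AS cellId FROM rights") \
    .createOrReplaceTempView("rcs")

# Steps 2 and 3: equi-join on cellId, refined with ST_Contains to drop false positives.
sedona.sql("""
SELECT lcs.id AS lcs_id, lcs.geom AS lcs_geom, lcs.name AS lcs_name,
       rcs.id AS rcs_id, rcs.geom AS rcs_geom, rcs.name AS rcs_name
FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
WHERE ST_Contains(lcs.geom, rcs.geom)
""").createOrReplaceTempView("joinresult")

# Step 4: de-duplicate the matches introduced by explode.
result = sedona.sql("""
SELECT lcs_id, rcs_id, first(lcs_geom) AS lcs_geom, first(rcs_geom) AS rcs_geom
FROM joinresult
GROUP BY lcs_id, rcs_id
""")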

Try it now

With these improvements and additions, Apache Sedona 1.4 is better equipped to meet the evolving needs of the geospatial data processing community.

We are immensely grateful to our fantastic community for their ongoing support and contributions to the development and growth of Apache Sedona. We are excited to continue our journey together, pushing the boundaries of geospatial data processing.

Stay tuned for future updates and enjoy your data processing with Apache Sedona 1.4!

Wherobots raises $5.5 Million in seed round to build the data platform for spatial analytics & AI

We are happy to announce that we have raised $5.5 million in a seed round led by Clear Ventures and Wing VC. Wherobots Inc. was founded in June 2022 by Mo Sarwat and Jia Yu, the original creators of the open-source software Apache Sedona, to enable every organization to drive value from data via space and time. The seed funding will accelerate the hiring of top-tier talent, advance R&D in spatial data+AI, and continue to grow adoption of the Apache Sedona-powered spatial data platform.

Message to the Apache Sedona community

Whether you are an Apache Sedona contributor, user, or champion, congratulations! Apache Sedona has managed to create a dent in the world, and this is just the start. As of June 13th, 2023, Sedona has 104 contributors and 1,500 stars on GitHub, and it has been downloaded more than 15 million times. One main reason we started Wherobots a year ago is to sustain the research and development of Apache Sedona in particular and spatial Data+AI platforms in general. We pledge to empower the Sedona community, respect the community standards and code of ethics, increase adoption among developers, and always be inclusive.

Message to the spatial community

We started Wherobots to democratize building and maintaining scalable geospatial data pipelines for all developers. De facto data platforms have always treated the spatial aspect of data as an afterthought, or a second-class citizen at best. Wherobots aims to change the status quo by building a data platform that treats spatial as a first-class citizen, enabling geospatial analytics and AI applications in the cloud. Users can build their geospatial data stack using familiar APIs such as Spatial SQL and Spatial Python, while Wherobots handles the hassle of cloud administration, compute resource management and provisioning, workload scalability, and geospatial data processing support and optimization. So, we ask everyone working in geospatial to join the movement.

Message to the data management community

The data management community has made great strides in making data more accessible, both on premises and in the cloud. Yet, per Gartner, 97% of enterprise data remains unused. Our approach to this problem is to let organizations view enterprise data through the space/time lens. That relates data to the real physical world, since everything happens somewhere, and makes it easier to extract real, tangible value from data. Wherobots does not aim to replace de facto data platforms. Instead, it empowers them by adding the spatial aspect to the data & AI pipeline and putting spatial functionality at the data developer’s disposal. To achieve that, Wherobots is building a highly scalable spatial computing & AI engine that is fully managed and provisioned in the cloud. Such an engine can seamlessly connect to and spatially empower existing database, data warehousing, and popular data storage platforms.

We are Hiring

We are hiring software engineers, cloud platform engineers, devrel engineers, and product managers. If you are interested, please apply here.