Wherobots and PostgreSQL + PostGIS: A Synergy in Spatial Analysis

Introduction

Spatial data has never been more abundant or more valued in decision making (everything happens somewhere!). The tools we use to process location-based data can significantly impact the outcomes of our projects. In the geospatial ecosystem, Apache Sedona-powered Wherobots and PostgreSQL with the PostGIS extension each offer robust capabilities. They share some functionality, but they are more powerful when used together rather than in isolation. This post explores how combining Wherobots and PostgreSQL + PostGIS can enhance spatial data processing, offering a synergy that leverages the unique strengths of each tool to improve data analysis and decision-making.

Exploring PostgreSQL + PostGIS

PostgreSQL is a powerful, open-source object-relational database system known for its robustness and reliability. When combined with the PostGIS extension, PostgreSQL transforms into a spatial database capable of executing location-based queries and spatial functions. This combination is ideally suited for projects where data integrity and detailed transaction management (ACID transactions) are crucial.

Key Features:

  • Data Integrity: Ensures high levels of data accuracy and consistency with ACID transaction compliance.
  • Complex Queries: Supports a wide array of SQL and spatial queries for detailed data management.
  • Integration: Seamlessly integrates with existing IT infrastructures and data systems.

Architecture:

PostgreSQL’s architecture, a traditional RDBMS, uses a process-based database server architecture. It is designed for detailed, transactional data management, where data integrity and complex query capabilities are paramount.

Characteristics:

  • Transactional Integrity: PostgreSQL excels in maintaining data consistency and integrity through ACID (Atomicity, Consistency, Isolation, Durability) compliance, making it ideal for applications requiring reliable data storage.
  • Structured Data Management: Data must be written into PostgreSQL tables, allowing for sophisticated querying and indexing capabilities provided by PostGIS. This ensures precise spatial data management.
  • Integration with Legacy Systems: PostgreSQL + PostGIS can easily integrate with existing IT infrastructure, providing a stable environment for long-term data storage and complex spatial queries.
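To make this concrete, here is a minimal PostGIS sketch: enabling the extension, creating a spatially indexed table, and running a proximity query. The table, column, and coordinate values are illustrative.

-- Enable PostGIS in the database (one-time, per database)
CREATE EXTENSION IF NOT EXISTS postgis;

-- A simple spatial table with a GiST index for fast spatial lookups
CREATE TABLE assets (
    id   bigserial PRIMARY KEY,
    name text,
    geom geometry(Point, 4326)
);
CREATE INDEX assets_geom_idx ON assets USING GIST (geom);

-- Find assets within 500 meters of a point (the geography cast makes the distance metric meters)
SELECT name
FROM assets
WHERE ST_DWithin(geom::geography,
                 ST_SetSRID(ST_MakePoint(-122.33, 47.61), 4326)::geography,
                 500);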

Understanding Wherobots & Apache Sedona

Wherobots, founded by the original creators of Apache Sedona, represents a significant advancement in modern cloud native spatial data processing. Apache Sedona is an open-source cluster computing system designed for the processing of large-scale spatial data, and Wherobots (continuously optimizing Sedona for performance) employs this technology with serverless deployment principles to analyze data distributed across clusters. This approach allows Wherobots to perform complex spatial queries and analytics efficiently, handling massive datasets that are beyond the scope of traditional geospatial data processing tools.

Key Features:

  • Scalability: Easily scales horizontally to handle petabytes of data through a cluster computing architecture.
  • Speed: Utilizes memory-centric architecture to speed up data processing tasks.
  • Complex Algorithms: Supports numerous spatial operations and algorithms essential for advanced geographic data analysis.
  • Spatial SQL API: Over 300 (and growing) spatial SQL functions covering both raster and vector analysis.
  • Pythonic Dataframe API: Extends the Spark Dataframe API for spatial data processing.
  • Python DB API: Python module conforming to the Python DB API 2.0 specification with familiar access patterns.

Architecture:

The backbone of Wherobots, Apache Sedona, is designed with a modern cloud-native, distributed computing framework in mind. Wherobots orchestrates and manages the cluster's distributed compute nodes to enable large-scale spatial data processing by bringing compute power directly to the data. This architecture is purpose-built for big data analytics and planetary-scale spatial data processing.

Characteristics:

  • Distributed Processing: Wherobots and Apache Sedona utilize a cluster of nodes, enabling parallel processing of massive datasets. This makes it ideal for tasks requiring high scalability and speed.
  • In-Memory Computation: By leveraging in-memory processing, Sedona significantly reduces the time needed for complex spatial operations, allowing for rapid data exploration and transformation.
  • Cloud Integration: Wherobots architecture is naturally aligned with cloud-based solutions, enabling seamless integration with cloud services like Amazon Web Services (AWS). This facilitates scalable and flexible spatial data analytics.
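To give a flavor of the developer experience these characteristics enable, here is a minimal Spatial SQL sketch of the kind of distributed query this architecture is built to run at scale; the countries and places table names are illustrative.

from sedona.spark import SedonaContext

# Attach to the cluster and get a Sedona-enabled Spark session
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Count points of interest per country with a distributed spatial join
result = sedona.sql("""
    SELECT c.name, COUNT(*) AS poi_count
    FROM countries c
    JOIN places p ON ST_Contains(c.geometry, p.geometry)
    GROUP BY c.name
    ORDER BY poi_count DESC
""")
result.show()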

Complementarity in Action

Sedona in Wherobots and PostgreSQL + PostGIS can both perform spatial queries, but their real power lies in how they complement each other. For example:

  • Data Handling and Performance: Apache Sedona in Wherobots Cloud is second to none at handling and analyzing large volumes of data, making it ideal for initial data processing, exploration, transformation, and analysis. In contrast, PostgreSQL + PostGIS excels at managing detailed transactions and operations requiring high precision and integrity.
    • Use Wherobots for: scalable data operations. Wherobots efficiently scales processing across multiple nodes, ideal for rapid analysis of spatial data at local to global levels.
    • Use PostgreSQL & PostGIS for: persistence. PostgreSQL + PostGIS provides the tools necessary for precise data management, crucial for applications like urban planning and asset management.
  • Use Case Flexibility: Wherobots can quickly process vast quantities of spatial data, making it suitable for environments and applications that require immediate insights from large volumes of data. PostgreSQL + PostGIS, being more transaction oriented, is perfect for applications where long-term data storage and management are needed.
    • Use Wherobots for: processing speed, when your scenario requires immediate results.
    • Use PostgreSQL & PostGIS for: data management, in scenarios requiring meticulous data integrity, such as environmental monitoring over years or detailed urban planning projects.
  • Analytical Depth: When initial processing (extractions, transformations, enrichment, etc.) of large-scale data is done with Wherobots, ETL run times can be greatly reduced. Once processed, the data can be permanently stored and hosted in PostgreSQL + PostGIS. This allows users to perform deep, granular analyses at scale and then persist and serve those insights; a sketch of this hand-off follows the list below.
    • Use Wherobots for: transforming and enriching data. Extract, transform, enrich, refine, and analyze data in Wherobots Cloud at scale and with unparalleled speed.
    • Use PostgreSQL & PostGIS for: a robust data ecosystem. Combining these tools ensures a robust data analysis and management ecosystem capable of supporting complex, multi-dimensional spatial queries.
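As a concrete example of that hand-off, here is a minimal sketch that writes a processed Sedona DataFrame to PostgreSQL over JDBC. It assumes a reachable PostgreSQL instance, the PostgreSQL JDBC driver on the classpath, and illustrative table, column, and credential names.

# Serialize geometries as EWKT so they survive the JDBC transfer as plain text
processed = sedona.sql("""
    SELECT id, category, ST_AsEWKT(geometry) AS geom_ewkt
    FROM enriched_places
""")

(processed.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/gis")  # hypothetical host/database
    .option("dbtable", "staging.enriched_places")
    .option("user", "etl_user")
    .option("password", "********")
    .mode("append")
    .save())

# Then, in PostGIS, materialize a native geometry column:
#   ALTER TABLE staging.enriched_places ADD COLUMN geom geometry;
#   UPDATE staging.enriched_places SET geom = ST_GeomFromEWKT(geom_ewkt);
#   CREATE INDEX ON staging.enriched_places USING GIST (geom);

Serializing geometries as text keeps the JDBC hop simple; the conversion to a native geometry column and spatial index then happens inside PostGIS.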

Integration Illustration

Let’s illustrate how these two systems can be integrated in a complementary manner.

We’ll assume the persona of a fictitious company named “Synergistic GeoFusion Technologies (SGT) holdings”.

SGT holdings handles a vast array of spatial data from diverse sources such as sensors, satellites, and user inputs. The volume and velocity of the data collection require a sophisticated approach to maximize efficiency. Wherobots steps in as the initial processing powerhouse, applying its Apache Sedona-based cluster computing to perform heavy-duty ETL (Extract, Transform, Load) tasks. This process involves cleansing, integrating, and transforming the raw data into a more structured format. Wherobots’ capability to handle massive datasets efficiently complements PostGIS’s robust data storage and querying capabilities by preparing the data for detailed spatial analysis and storage.

Once Wherobots processes the data, it can be seamlessly transferred to PostGIS, which can serve as the system of record. This PostgreSQL extension is well-suited for ongoing data management, thanks to its sophisticated spatial functions and ability to handle complex queries. PostGIS’s strong querying capabilities complement Wherobots’ data processing strength, providing a stable platform for further data manipulation and refinement.

PostGIS is not only a storage layer but also a platform for ongoing data edits and updates, ensuring data integrity and relevance. When complex, resource-intensive spatial analyses are required, Wherobots is re-engaged. This dual engagement allows SGT holdings to handle routine data management in PostGIS while delegating computationally demanding tasks back to Wherobots, thus utilizing each system’s strengths to full effect.

For visualization, SGT holdings again leverages Wherobots to generate vector tiles from the analyzed data. These tiles are crucial for dynamic, scalable visual representations in the company’s internal dashboards and tools. Wherobots’ ability to generate these tiles efficiently complements PostGIS’s role by providing a means to visualize the data stored and managed within PostGIS. This not only enhances user interactions with the data but also provides a seamless bridge from data analysis to actionable insights through visual exploration.

Using Wherobots and PostGIS in this complementary manner, SGT has established a highly efficient workflow that leverages the distinct capabilities of each system. They can now ingest all of their data, manage it effectively, and run their post hoc analysis tasks to serve internal and external clients efficiently and cost-effectively.

Performance Benchmark: Wherobots vs. PostGIS

When comparing WherobotsDB to PostgreSQL with PostGIS for spatial queries, WherobotsDB’s performance benefits become evident as data size increases. Initially, PostGIS’ precomputed GIST indexes give it an edge on smaller datasets due to faster query execution times. However, as datasets grow larger, the dynamic, on-the-fly indexing of WherobotsDB surpasses PostGIS. The overhead associated with WherobotsDB’s distributed system is outweighed by its ability to efficiently handle complex, large-scale queries, ultimately making it significantly faster and more scalable in high-demand scenarios.

In essence, WherobotsDB may start off slower with smaller datasets, but its performance dramatically improves with larger datasets, far exceeding PostGIS’ capabilities. This makes WherobotsDB the preferred choice when working with extensive spatial data that demands efficient, scalable processing.
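For reference, the PostGIS side of a comparison like this depends on precomputed GiST indexes. A typical setup, using the table names from the benchmark query below, might look like the following, though the exact benchmark configuration may have differed.

-- Spatial indexes of the kind PostGIS uses to accelerate ST_Intersects
CREATE INDEX ON overture_buildings_test USING GIST (geometry);
CREATE INDEX ON osm_nodes_test USING GIST (geometry);

-- Refresh planner statistics so the indexes are used effectively
VACUUM ANALYZE overture_buildings_test;
VACUUM ANALYZE osm_nodes_test;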

See the tables below for a detailed comparison of performance metrics.

Dataset              Size     Number of Rows   Types of Geometries
OSM Nodes            256GB    7,456,990,919    Point
Overture Buildings   103GB    706,967,095      Polygon/MultiPolygon
OSM Postal Codes     0.86GB   154,452          Polygon

The benchmark query joins buildings to nodes and counts the intersecting pairs:
WITH t AS 
( 
    SELECT 
        * 
    FROM 
        overture_buildings_test AS buildings 
    JOIN 
        osm_nodes_test AS nodes 
    ON ST_Intersects(
        nodes.geometry, 
        buildings.geometry) 
) 
SELECT 
    COUNT(*) 
FROM 
    t;
Dataset sizes                PostGIS          WherobotsDB (Cairo)   Relative Speed
1M Buildings x 1M Nodes      0.70s            2.24s                 0.31x
1M Buildings x 10M Nodes     32.30s           33.44s                0.97x
10M Buildings x 1M Nodes     28.30s           21.72s                1.30x
10M Buildings x 10M Nodes    49.30s           3m 36.29s             0.23x
All Buildings x 1M Nodes     8m 44s           36.59s                14.32x
All Buildings x 10M Nodes    38m 30s          49.77s                46.42x
10K Buildings x All Nodes    2h 29m 19s       46.42s                193.00x
100K Buildings x All Nodes   4h 22m 36s       51.75s                304.57x
1M Buildings x All Nodes     3h 29m 56s       1m 19.63s             158.17x
10M Buildings x All Nodes    8h 48m 13s       2m 13.65s             237.17x
All Buildings x All Nodes    +24h (Aborted)   4m 31.84s             +317.83x

Conclusion: Better Together for Spatial Brilliance

Individually, Wherobots and PostgreSQL + PostGIS are powerful tools. But when combined, they unlock new possibilities for spatial analysis, offering a balanced approach to handling both large-scale data processing and detailed, precise database management. By understanding and implementing their complementary capabilities, organizations can achieve more refined insights and greater operational efficiency in their spatial data projects.

By utilizing both tools strategically, companies can not only enhance their data analysis capabilities but also ensure that they are prepared for a variety of spatial data challenges, now and in the future.

To learn more about Wherobots, reach out or give it a try with a trial account.

Making Overture Maps Data More Efficient With GeoParquet And Apache Sedona

The latest release of Overture Maps data is now published in GeoParquet format, allowing for more efficient spatial operations when selecting subsets of the dataset and enabling interoperability with a growing ecosystem of data tooling that supports the cloud-native GeoParquet format. In this post we explore some of the motivations and advantages of publishing the Overture Maps data as GeoParquet as well as the process used by Overture to generate and publish a large scale geospatial dataset as GeoParquet using Apache Sedona.

Understanding GeoParquet

Built upon Apache Parquet, GeoParquet is an open-source file format specifically designed for storing geospatial data efficiently. Apache Parquet is a column-oriented data file format optimized for storing data efficiently in cloud data object stores like Amazon’s S3 service. However, Parquet lacks native support for geospatial data. GeoParquet extends Parquet by defining a standard for storing geometry data and associated metadata, such as the spatial bounding box and coordinate reference system for each file.

In a nutshell, GeoParquet adds the following information to the file metadata:

  • Version: the GeoParquet specification version of this file
  • Primary column: the main geometry column to be considered when multiple geometry columns are stored in this file.
  • Column metadata per Geometry Type column:

    • Encoding of the geometry data. Currently only the WKB format is supported.
    • Geometry types: all geometry types that occur in this column, such as POINT, POLYGON, LINESTRING
    • Coordinate reference system information, bounding box, geometry orientation, and so on
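For illustration, the "geo" key stored in a file's footer looks roughly like this, shown here as a Python dict with hypothetical values:

geo_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Point", "Polygon"],
            "bbox": [-74.25909, 40.477399, -73.700181, 40.917577],
            "crs": None,  # per the spec, null means the default OGC:CRS84
        }
    },
}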

In GeoParquet, all geometric data is required to be housed in Parquet’s Binary Type. This data must adhere to the encoding specifications detailed in the column metadata, which currently employs Well-Known Binary (WKB) encoding.

The storage of geometries in GeoParquet using Parquet’s intrinsic binary format ensures that standard Parquet readers can seamlessly access GeoParquet files. This feature significantly enhances GeoParquet’s compatibility with existing data systems. Nevertheless, a dedicated GeoParquet reader, designed to interpret GeoParquet metadata, can utilize this information for further optimizations and to execute more advanced geospatial operations.

Benefits of GeoParquet

GeoParquet integrates all the advantages of Parquet, such as its compact data storage and rapid data retrieval capabilities. Additionally, GeoParquet introduces a standardized approach for storing and accessing geospatial data. This standardization significantly simplifies the process of data sharing and enhances interoperability across various systems and tools within the geospatial data processing ecosystem.

Crucially, GeoParquet enables specialized GeoParquet readers to optimize geospatial queries, significantly enhancing the efficiency of data retrieval. This optimization is particularly vital for applications demanding real-time data processing and analysis.

A significant contributor to this enhanced efficiency is the bounding box information (abbreviated as BBox) embedded in the GeoParquet file metadata. The BBox effectively serves as a spatial index, streamlining the process of data filtering. This is particularly useful when executing geospatially driven queries. When a spatial query is executed — for instance, searching for all points of interest within a certain area — the GeoParquet reader first checks the BBox information. If the query’s geographic area doesn’t intersect with the BBox of the data, the reader can immediately exclude that file from further consideration. This step vastly reduces the amount of data that needs to be retrieved and processed.
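Conceptually, the pruning step is just a bounding-box intersection test. The sketch below is not any particular library's API; geoparquet_files, bbox_of, and query_bbox are hypothetical names.

def boxes_intersect(file_bbox, query_bbox):
    """Each bbox is (minx, miny, maxx, maxy); True unless the boxes are disjoint."""
    fminx, fminy, fmaxx, fmaxy = file_bbox
    qminx, qminy, qmaxx, qmaxy = query_bbox
    return not (fmaxx < qminx or qmaxx < fminx or fmaxy < qminy or qmaxy < fminy)

# Only files whose bbox intersects the query window need to be read at all
files_to_scan = [f for f in geoparquet_files if boxes_intersect(bbox_of(f), query_bbox)]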

As a testament to this, with the help of Apache Sedona, Wherobots engineers created a GeoParquet version of the Overture 2023-07-26-alpha.0 data release, which was not originally in GeoParquet format. This transformation led to a remarkable improvement in query performance. When conducting a spatial range query on the Building dataset using Apache Sedona, the process on the GeoParquet version took approximately 3 minutes, a significant reduction from the 1 hour and 30 minutes required for the same query on the original dataset. Further details and insights into this transformation are available in our blog post titled Harnessing Overture Maps Data: Apache Sedona’s Journey from Parquet to GeoParquet.

The Role Of Apache Sedona

The GIS software landscape offers a variety of tools capable of handling GeoParquet data, such as GeoPandas and QGIS. However, Apache Sedona stands apart in this field. As a renowned open-source cluster computing system, it is specifically tailored for processing large-scale spatial data. Apache Sedona offers a rich suite of spatial operations and analytics functions, all accessible through an intuitive Spatial SQL programming language. This unique blend of features establishes Apache Sedona as a highly effective system for geospatial data analysis.

With the adoption of GeoParquet, Apache Sedona has expanded its capabilities, now able to create GeoParquet files. This integration not only enhances Apache Sedona’s data storage efficiency but also opens up broader prospects for advanced data analysis. This positions Apache Sedona as a distinct and powerful tool in the realm of geospatial data processing, differentiating it from other GIS tools by its ability to handle complex, large-scale spatial datasets efficiently. Further details can be found on Sedona’s website: https://sedona.apache.org/

Generating GeoParquet Files With Apache Sedona

Leveraging Sedona’s programming guides, the Overture Maps Foundation (OMF) crafts GeoParquet datasets by incorporating a geometry column sourced from its WKB geospatial data. The process uses Sedona functions, as illustrated schematically in the Python snippet below, where ST_* stands in for the specific constructor and selection expressions used:

myDataFrame.withColumn("geometry", expr("ST_*")).selectExpr("ST_*")

In its commitment to improving the spatial query experience for its customers, OMF has implemented a strategic pre-defined indexing method. This method organizes data effectively based on theme and type combinations. To further refine spatial filter pushdown performance, OMF capitalizes on Sedona’s ST_GeoHash function to generate dual GeoHash IDs for each geometry in the dataset; a sketch of this layout follows the numbered list below.

  1. Coarse-Level GeoHashing (Precision Level 3): The first GeoHash, with a precision level of 3, creates cells approximately 156.5km x 156km in size. This level is ideal for grouping data based on geographic proximity. Using this GeoHash key, a large Sedona DataFrame is partitioned into numerous smaller GeoParquet files. Each file correlates to a unique 3-character GeoHash key, facilitating efficient data pruning at the Parquet file level. Importantly, this approach enables users to manually download individual GeoParquet files by referencing the GeoHash key in the file path, bypassing the need for a Parquet file reader.
  2. Fine-Level GeoHashing (Precision Level 8): Concurrently, OMF utilizes a finer GeoHash at level 8, with cells of roughly 38.2m x 19m, and sorts the data within each GeoParquet file by this key. By doing so, OMF ensures that the data within each Parquet row-group or data page is organized by spatial proximity. This organization is crucial for enabling efficient data pruning, particularly on the bbox column.
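Here is a sketch of how such a dual-GeoHash layout can be produced with Sedona; the DataFrame, column names, and output path are illustrative.

from pyspark.sql.functions import expr

# Two GeoHash keys per geometry: a coarse partitioning key and a fine sorting key
df = (df
    .withColumn("geohash3", expr("ST_GeoHash(geometry, 3)"))
    .withColumn("geohash8", expr("ST_GeoHash(geometry, 8)")))

# One file per coarse key, with rows inside each file ordered by the fine key
(df.repartition("geohash3")
   .sortWithinPartitions("geohash8")
   .write.partitionBy("geohash3")
   .format("geoparquet")
   .save("s3a://bucket/overture/theme=places/"))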

The bbox column, stored in a native Parquet Struct Type, consists of four Double Type sub-columns: minx, maxx, miny, and maxy. It allows even standard Parquet readers to conduct pruning operations effectively. This capability is particularly beneficial as it complements the BBox information found in the GeoParquet file metadata. While the BBox metadata provides an overall spatial reference for the file, the bbox column offers a more granular level of spatial data, enabling granular data pruning at the Parquet row-group level.
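Because the bbox column is plain Parquet data, a window filter reduces to ordinary comparisons that any engine can push down. For example, a filter over roughly the New York City area might look like this (column names per the description above, query values illustrative):

SELECT *
FROM buildings
WHERE bbox.minx <= -73.700181 AND bbox.maxx >= -74.25909
  AND bbox.miny <= 40.917577  AND bbox.maxy >= 40.477399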

This dual GeoHash approach serves three core purposes:

  • It allows users to download individual GeoParquet files from OMF without requiring a GeoParquet or Parquet reader, simply by checking the GeoHash key.
  • GeoParquet readers can efficiently filter files using the BBox information in each file’s metadata, enhancing query speed.
  • The strategy also enhances the functionality of standard Parquet readers which can effectively filter files using the statistical information of the bbox column, found in each file’s column and row-group metadata, without needing to recognize the file as a GeoParquet file.

Analyze OMF GeoParquet Files with Sedona on Wherobots Cloud

Let’s see some of the benefits of querying a large-scale GeoParquet dataset in action by analyzing the Overture Places theme using Apache Sedona. The easiest way to get started with Apache Sedona is by using the official Apache Sedona Docker image, or by creating a free hosted notebook in Wherobots Cloud.

The following command will start a local Docker container running Apache Sedona and expose a Jupyter Lab environment on localhost port 8888:

docker run -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 \
apache/sedona:latest

This Jupyter Lab environment can now be accessed via a web browser at http://localhost:8888

Next, the following Python code will load the latest Overture Maps GeoParquet release (at the time of writing) in Sedona:

from sedona.spark import *

config = SedonaContext.builder(). \
    config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet"). \
    load("s3a://overturemaps-us-west-2/release/2024-01-17-alpha.0/theme=places/type=place")

We can now apply a spatial predicate to filter for points of interest within the bounds of New York City, using the following Spatial SQL functions:

  • ST_PolygonFromEnvelope – create a polygon geometry that will serve as the area roughly representing the boundaries of New York City from a specified bounding box
  • ST_Within – filter rows for only points of interest that fall within the boundary of that polygon geometry

spatial_filter = "ST_Within(geometry, ST_PolygonFromEnvelope(-74.25909, 40.477399, -73.700181, 40.917577))"
df = df.where(spatial_filter)
df.createOrReplaceTempView("places")

When executing this spatial filter, Sedona can leverage the BBox GeoParquet metadata in each partitioned GeoParquet file to exclude any files that do not contain points of interest within the polygon defined in the spatial predicate. This results in less data being examined and scanned by Sedona and less data retrieved over the network, improving query performance. Also, note that the geometry column is interpreted as a geometry type without explicit type casting, thanks to the WKB serialization defined by the GeoParquet specification.

To further demonstrate the functionality of Spatial SQL with Apache Sedona, let’s query for all stadiums within the area of New York City, then find all points of interest within a short walking distance of each stadium using the following Spatial SQL functions:

  • ST_Buffer – create a new geometry that represents a buffer around each stadium that represents a short walking distance from the stadium
  • ST_Intersects – filter for points of interest that lie within the buffer geometry, identifying points of interest within walking distance of each stadium
stadium_places = sedona.sql("""
WITH stadiums AS (SELECT * FROM places WHERE categories.main = "stadium_arena")
SELECT * FROM places, stadiums
WHERE ST_Intersects(places.geometry, ST_Buffer(stadiums.geometry, 0.002))
""")
[Image: points of interest within walking distance of each stadium, queried from Overture Maps GeoParquet data using Apache Sedona]

You can see more detailed examples of analyzing the Overture Places dataset using Apache Sedona in this blog post.

Conclusion

The Overture Maps Foundation’s adoption of GeoParquet, in collaboration with Apache Sedona, marks a significant milestone in geospatial data management. The combination of GeoParquet’s efficient storage format and Apache Sedona’s spatial analytics capabilities brings unprecedented performance and scalability to geospatial dataset publishing and analysis. This integration opens up new avenues for researchers, developers, and geospatial enthusiasts, empowering them to explore and derive insights from vast geospatial datasets more efficiently than ever before.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter.


Introducing Sedona 1.5: Making Sedona the most comprehensive & scalable spatial data processing and ETL engine for both Raster and Vector data

We’re excited to introduce Apache Sedona’s 1.5 series, a leap forward in geospatial processing that brings essential features and enhancements. Since our last blog post on the Sedona 1.4 series, Apache Sedona has evolved significantly. The 1.5 series, launched with versions 1.5.0 in October 2023 and 1.5.1 in January 2024, marks a period of substantial advancement. These developments aim to transform Sedona into a comprehensive, all-in-one cluster computing engine for geospatial vector and raster data handling across both batch and streaming workloads. You can learn more about the release from our release notes, website, and GitHub repository.

XYZM coordinates and SRID

The concepts of X, Y, Z, and M in geospatial data, along with vector data, are foundational to understanding how geographic information is represented and analyzed in Geographic Information Systems (GIS).

  • X typically refers to the longitude
  • Y typically refers to the latitude
  • Z adds depth or elevation to the spatial representation, which is critical for conducting 3D modeling for urban planning, architecture, and environmental studies.
  • M stands for a measure, which can be time, distance, or another quantitative attribute that enhances the analysis, such as speed, weight, or cost.

SRID stands for Spatial Reference System Identifier. It’s a unique identifier which typically references a registry or a catalog of spatial reference systems, such as the EPSG (European Petroleum Survey Group) Geodetic Parameter Registry.

As of today, PostGIS is the most popular open-source database engine that offers full support for XYZM and SRID. It defines the EWKB (Extended WKB) and EWKT (Extended WKT) standards to encode and decode the XYZM and SRID information of geometries. In contrast, Sedona versions 1.4 and earlier, along with several cloud database platforms, have shown limitations in handling XYZM coordinates and SRID comprehensively. These limitations typically manifest in several ways:

  • The geometry constructors do not understand EWKT and EWKB
  • No geometry editors or accessors for Z, M dimensions and SRID information
  • No geometry outputs for Z, M dimensions and SRID information
  • Z and M values silently go missing even if they are explicitly set by users

In the Sedona 1.5 series, the community has revised several Sedona components, such as the serializer and geometry functions, to provide full support for XYZM and SRID. Now let’s look at an example in Sedona 1.5.1.

Spatial SQL query

SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT ZM(1 2 3 4)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT Z(1 2 3)'))
SELECT ST_AsEWKT(ST_GeomFromEWKT('SRID=4269;POINT M(1 2 3)'))

Output

SRID=4269;POINT ZM(1 2 3 4)
SRID=4269;POINT Z(1 2 3)
SRID=4269;POINT M(1 2 3)

As shown in the example, Sedona can read and write EWKT geometries. The Z and M values and SRID information are always preserved.

Users can also perform operations on the Z dimension. For example:

Spatial SQL query

SELECT ST_3DDistance(ST_GeomFromEWKT('POINT Z(1 1 1)'), ST_GeomFromEWKT('POINT Z(2 2 2)'))

Output

1.7320508075688772

Vector and raster join

In GIS, spatial data is primarily categorized into two types: vector data and raster data.

  • Vector data: represents geographical features using points, lines, and polygons. These elements are defined by X and Y coordinates that describe their shape and position in space. Common vector formats include WKT, WKB, Shapefile (.shp), and GeoJSON (.geojson)
  • Raster data: represents geographic information as a grid of cells (similar to images), with each cell holding a specific value. Raster data is well-suited for representing continuous phenomena, such as temperature gradients, elevation surfaces, or satellite imagery. Raster data often comes with georeference information and a bounding box that aligns the image with the spatially correct geographic region. Common raster formats include GeoTiff (.tiff) and NetCDF (.nc).

Finding a scalable spatial database system that effectively handles both vector and raster data is challenging. However, the integration of these data types has significant real-world applications, such as overlaying zoning boundaries on satellite images for urban planning or combining vector-based river networks with raster-based elevation models for environmental analysis. This capability is essential for assessing land use compliance, planning development, and studying environmental phenomena like watershed dynamics and flood risks.

With the Sedona 1.5 series, joining massive vector and raster datasets has become feasible, opening the door to discovering more profound insights.

Let’s use the LandSAT LST (Land Surface Temperature) dataset as an example. Each raster image in this dataset describes the temperature of a specific region. We can join the images with zip code zone polygons to calculate the average temperature per zone.

Let’s first load this dataset to Sedona and visualize the raster DataFrame as follows:

from sedona.spark import *
imageDf = sedona.sql("SELECT RS_AsImage(rast, 100), datetime FROM landsat_lst")
SedonaUtils.display_image(imageDf)

The output looks like this:

[Image: rendered previews of the LST raster images]

Let’s read the zip code zones into Sedona and print the content:

SELECT ZCTA5CE10, geometry FROM zipcode

The output is as follows:

+---------+--------------------+
|ZCTA5CE10|            geometry|
+---------+--------------------+
|    36522|POLYGON ((-88.432...|
|    36521|POLYGON ((-88.342...|
|    36548|POLYGON ((-88.037...|
|    36585|POLYGON ((-88.169...|
|    36583|POLYGON ((-88.263...|
+---------+--------------------+

Now let’s join the two datasets using RS_Intersects; the join condition is that a zip code zone intersects a raster image.

SELECT ZCTA5CE10, geometry, rast, datetime
FROM wherobots_open_data.us_census.zipcode z, landsat_lst l
WHERE RS_Intersects(z.geometry, l.rast)

The output is as follows:

+---------+--------------------+--------------------+--------------------+
|ZCTA5CE10|            geometry|                rast|            datetime|
+---------+--------------------+--------------------+--------------------+
|    85356|POLYGON ((-114.34...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85364|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    92250|POLYGON ((-115.45...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85365|MULTIPOLYGON (((-...|GridCoverage2D[""...|2023-09-01 18:10:...|
|    85347|POLYGON ((-114.16...|GridCoverage2D[""...|2023-09-01 18:10:...|
+---------+--------------------+--------------------+--------------------+

Now let’s calculate the average temperature of each zone using RS_ZonalStats (here zone_temp is the joined result above, registered as a view):

SELECT ZCTA5CE10, RS_ZonalStats(rast, geometry, 'avg') as avg
FROM zone_temp

The output is as follows:

+---------+------------------+
|ZCTA5CE10|               avg|
+---------+------------------+
|    85350| 85.66846905011799|
|    85350|111.90322619887085|
|    85365| 43.69896389654749|
|    85350|117.59570757880628|
|    85365|120.40936393442645|
+---------+------------------+

Flexible raster data manipulation

The Sedona 1.5 series introduces additional functions to address the challenge of raster data from remote sensing devices not being in the preferred format for analysis. These new features are designed to streamline the process of transforming and analyzing such data. Let’s explore a few interesting examples.

RS_Clip

It returns a raster that is clipped by the given geometry.

Original raster

[Image: original raster]

SELECT RS_Clip(
        RS_FromGeoTiff(content), 1,
        ST_GeomFromWKT('POLYGON ((236722 4204770, 243900 4204770, 243900 4197590, 221170 4197590, 236722 4204770))'),
        200, false
    )

Output

[Image: clipped raster]

RS_Resample

It resamples a raster using a given resampling algorithm, new dimensions (width and height), a new grid corner to pivot the raster at (gridX and gridY), and a set of georeferencing attributes (scaleX and scaleY).

Original raster

[Image: original raster]

SELECT RS_Resample(raster, 1000, 1000, false, 'NearestNeighbor') as resampled_raster
FROM raster_table

Output

[Image: resampled raster]

RS_AsRaster

Converts a Geometry to a Raster object

SELECT RS_AsRaster(ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'), r.raster, 'b', 230) as rasterized_geom
FROM raster_table r

Output

[Image: rasterized geometry]

RS_MapAlgebra

Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.

Let’s use the rasterized geometry from above and convert it to a white background.

SELECT RS_MapAlgebra(rasterized_geom, NULL, 'out[0] = rast[0] == 0 ? 230 : 0;') AS raster
FROM rasterized_table

Output

[Image: raster converted to a white background]

SedonaKepler and SedonaPyDeck

Sedona introduced SedonaKepler and SedonaPyDeck to integrate with KeplerGL and DeckGL, aiming to enhance geospatial data visualization. These integrations allow users to utilize advanced mapping and visualization features directly from Sedona’s spatial data processing results.

One can simply run the following commands to visualize a Sedona DataFrame using SedonaKepler and SedonaPyDeck:

from sedona.spark import *
SedonaKepler.create_map(df=groupedresult, name="AirportCount")
SedonaPyDeck.create_choropleth_map(df=groupedresult, plot_col='AirportCount')
SedonaPyDeck.create_geometry_map(df_building, elevation_col='height')

[Images: SedonaKepler map, SedonaPyDeck choropleth map, and SedonaPyDeck geometry map]

GeoParquet reader

The Sedona GeoParquet reader now fully conforms to GeoParquet spec 1.0.0. It can read any GeoParquet files written by third-party systems such as GeoPandas. Most importantly, Sedona will use the bbox properties in the metadata (https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox) to determine whether all data in a file can be discarded by the spatial predicate. This optimization can reduce the number of files scanned when the queried GeoParquet dataset is partitioned by spatial proximity.

df = sedona.read.format("geoparquet").load("s3a://my/path/*.parquet")
    .filter("ST_Contains(ST_GeomFromWKT('POLYGON (XXX)'))")

A user can also inspect the schema of a GeoParquet file.

df.printSchema()

Output

root
 |-- pop_est: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- name: string (nullable = true)
 |-- iso_a3: string (nullable = true)
 |-- gdp_md_est: double (nullable = true)
 |-- geometry: geometry (nullable = true)

A user can also inspect the metadata of a GeoParquet file.

df = sedona.read.format("geoparquet.metadata").load("s3a://my/path/*.parquet")
df.show()
df.printSchema()

Output of the GeoParquet metadata

+------------+--------------+------------------------------------------------------------------------------------------+-------+
|version     |primary_column|columns                                                                                   |geohash|
+------------+--------------+------------------------------------------------------------------------------------------+-------+
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 45.0038534, -33.75027, 50.5347176], null}}            |g0     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-45.0, 33.8141064, -33.7678528, 39.3735873], null}}          |en     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-44.8187256, -56.1774766, -34.1630816, -50.8584471], null}}  |5n     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-112.2099566, -33.7316387, -101.2729317, -28.2008956], null}}|3d     |
|1.0.0-beta.1|geometry      |{geometry -> {WKB, [Point], [-33.2666016, -72.9196355, -22.6803555, -67.5007358], null}}  |57     |
+------------+--------------+------------------------------------------------------------------------------------------+-------+

Output of the GeoParquet metadata schema:

root
 |-- version: string (nullable = true)
 |-- primary_column: string (nullable = true)
 |-- columns: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- encoding: string (nullable = true)
 |    |    |-- geometry_types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- bbox: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |    |    |-- crs: string (nullable = true)
 |-- geohash: string (nullable = true)

GeoParquet reader legacy mode

Due to a breaking change in Apache Sedona 1.4.0 to the SQL type of GeometryUDT (SEDONA-205) as well as the serialization format of geometry values (SEDONA-207), Parquet files containing geometry columns written by Apache Sedona 1.3.1 or earlier cannot be read by Apache Sedona 1.4.0 or later.

For Parquet files containing geometry columns written with the "parquet" format by Apache Sedona 1.3.1-incubating or earlier:

df.write.format("parquet").save("path/to/parquet/files")

Since v1.5.1, the GeoParquet reader supports reading these legacy Parquet files: use the "geoparquet" format with the .option("legacyMode", "true") option.

df = sedona.read.format("geoparquet").option("legacyMode", "true").load("path/to/legacy-parquet-files")

GeoParquet writer

The Sedona GeoParquet writer now fully conforms to GeoParquet spec 1.0.0. Moreover, a user can control the metadata written to the GeoParquet files.

df.write.format("geoparquet")
        .option("geoparquet.version", "1.0.0")
        .option("geoparquet.crs", projjson)
        .save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet")

To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see ST_GeoHash) and then save as a GeoParquet file. An example is as follows:

SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash
FROM spatialDf
ORDER BY geohash

The following figure is the visualization of the bbox of Sedona GeoParquet files ordered by GeoHash.

[Image: bbox visualization of GeoParquet files ordered by GeoHash]

Uber H3 IDs

Uber H3 is a technique for modeling and indexing geospatial data. H3 grids the Earth’s surface by projecting it onto an icosahedron, and tiles it with hexagons plus a small number of pentagons.

Sedona 1.5 now provides the following functions related to Uber H3:

  • ST_H3CellIDs(geom: geometry, level: Int, fullCover: Boolean)
  • ST_H3CellDistance(cell1: Long, cell2: Long)
  • ST_H3KRing(cell: Long, k: Int, exactRing: Boolean)
  • ST_H3ToGeom(cells: Array[Long])
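For instance, the distance helper can be combined with ST_H3CellIDs directly in SQL; the points and precision below are illustrative.

-- Grid distance (in cells) between the H3 cells containing two points
SELECT ST_H3CellDistance(
         ST_H3CellIDs(ST_GeomFromWKT('POINT (-122.33 47.61)'), 10, true)[0],
         ST_H3CellIDs(ST_GeomFromWKT('POINT (-122.32 47.62)'), 10, true)[0]
       ) AS cell_distance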

Let’s use the Seattle road network dataset (from OSM) as an example.

[Image: Seattle road network]

You can create H3 cell IDs using ST_H3CellIDs as follows.

roadDf = sedona.read.format("csv").option("header", "true").load(PATH_PREFIX + "data/OSM2015_roads.csv")
roadDf = roadDf.selectExpr(
    "monotonically_increasing_id() as id",
    "ST_GeomFromWkt(geometry) as road", "`attr#1` as attr",
    "ST_H3ToGeom(ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false)) as roadH3Geom",
    "ST_H3CellIDs(ST_GeomFromWkt(geometry), 10, false) as roadH3"
)

Output

+---+--------------------+--------------------+--------------------+--------------------+
| id|                road|                attr|          roadH3Geom|              roadH3|
+---+--------------------+--------------------+--------------------+--------------------+
|  0|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966219243...|
|  1|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150966281830...|
|  2|LINESTRING (-122....|    [surface|asphalt|MULTIPOLYGON (((-...|[6222150966517759...|
|  3|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966243164...|
|  4|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966286090...|
|  5|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966439772...|
|  6|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966722559...|
|  7|LINESTRING (-122....|   [highway|service]|MULTIPOLYGON (((-...|[622215096642338815]|
|  8|LINESTRING (-122....|[tiger:county|Kin...|MULTIPOLYGON (((-...|[6222150966830039...|
|  9|LINESTRING (-122....|      [service|alley|MULTIPOLYGON (((-...|[6222150959224913...|
+---+--------------------+--------------------+--------------------+--------------------+

We can visualize the H3 cells using a GeoPandas plot:

import geopandas as gpd

pandasDf = roadDf.limit(100).toPandas()
roadGpd = gpd.GeoDataFrame(pandasDf, geometry="road")
h3Gpd = gpd.GeoDataFrame(pandasDf, geometry="roadH3Geom")
ax = h3Gpd.plot(edgecolor='gray', linewidth=1.0)
ax.set_xlabel('Longitude (degrees)')
ax.set_ylabel('Latitude (degrees)')
roadGpd.plot(ax=ax, edgecolor='black', linewidth=1.0)

[Image: H3 cells overlaid on the road network]

You can also use Sedona H3 functions together with SedonaPyDeck for better visualization. Please read our blog post: Exploring Global Fishing Watch Public Data With SedonaDB & GeoParquet

[Image: unmatched industrial vessels]

New cluster compute engine support

Sedona 1.5 introduces support for Snowflake and compatibility with Spark versions up to 3.5, enhancing geospatial analysis. These updates facilitate advanced spatial data processing and analysis in distributed environments, leveraging the latest Spark capabilities and integrating with Snowflake for cloud-based data warehousing.

Community contributions and acknowledgements

The Sedona 1.5.0 and 1.5.1 releases would not have been possible without the help of the Apache Sedona community. We thank the following community members for their contributions:

jiayuasu, jbampton, hongbo-miao, Kontinuation, furqaankhan, iGN5117, prantogg, duhaode520, gregleleu, MyEnthusiastic, wendrickje, yyy1000

Future directions

Apache Sedona has seen significant growth in the past few years. We are excited for an electrifying future as Sedona gears up to revolutionize geospatial processing. Here are some future directions for Apache Sedona.

  • Adding more cloud platform integrations in addition to the current Spark, Flink and Snowflake support
  • Developing advanced raster and vector data processing features.
  • Improving performance and scalability for large datasets.
  • Adding more comprehensive geospatial functions.
  • Increasing compatibility with the broader spatial data ecosystem such as QGIS, Lonboard, Rasterio, NumPy, Ibis
  • Strengthening documentation and community engagement for wider adoption.

Analyzing Real Estate Data With Apache Sedona

In a recent tutorial on the Wherobots YouTube channel we explored how to analyze real estate data from Zillow using Apache Sedona and Wherobots Cloud. Specifically, we calculated the change in median home value over the last 5 years per county using Zillow’s Zillow Home Value Index (ZHVI) and visualized the results.

This tutorial highlights a few features of Apache Sedona and Wherobots Cloud, including:

  • An overview of the Wherobots Cloud notebook environment
  • How to upload data to Wherobots Cloud
  • How to import datasets using Apache Sedona (CSV and Shapefile)
  • Using SQL to query data using Apache Sedona
  • Creating interactive visualizations of the results of our geospatial analysis in the Wherobots notebook environment
  • Creating choropleth maps with GeoPandas and matplotlib

Feel free to follow along with the video tutorial below:

We used two data sources for this map: Zillow’s ZHVI data as a CSV file, and Natural Earth’s Admin Level 2 boundaries as a Shapefile for US county geometries.

To follow along, create a free account in Wherobots Cloud and create a new blank notebook. We’ll be using Python and SQL to interact with Sedona.

Many geospatial Python packages are installed by default in Wherobots Cloud, but we can also add packages, either when launching the notebook environment in the Wherobots Cloud UI or by installing with pip in the notebook. For this tutorial the only package we’ll need to install is mapclassify, which we’ll use to help create the data distribution bins for our choropleth map.

# All other packages are installed by default in Wherobots Cloud

pip install mapclassify

Next, we’ll import some dependencies for working with Sedona and importing our datasets.

from sedona.spark import *
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter
import geopandas
import matplotlib.pyplot as plt  # used later to save the choropleth figures

One of the first steps to using Sedona in our notebook is configuring the SedonaContext and connecting to our cluster. The config object here is typically where we can specify authentication providers for accessing cloud storage buckets. We won’t need this functionality in this tutorial, but this is the default configuration I typically use.

config = SedonaContext.builder().appName('zillowdata')\
    .config("spark.hadoop.fs.s3a.bucket.wherobots-examples-prod.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")\
    .getOrCreate()
sedona = SedonaContext.create(config)

Natural Earth Shapefile Import

We’ll be analyzing real estate values at the county level in the US. Since the dataset from Zillow doesn’t include county geometries we’ll use the Natural Earth Admin Level 2 dataset for US county geometries. I uploaded the Shapefile files to Wherobots Cloud using this structure:

[Image: the Wherobots file management interface]

Note that since a Shapefile is composed of multiple files, the S3 URL we want to use is the location of the directory, not one of the individual files:

[Image: working with Shapefiles in Wherobots Cloud]

Then in our notebook we can save the S3 URL as a variable.

S3_URL_NE = "s3://<YOUR_S3_URL_HERE>"

To import Shapefile datasets into Sedona we can use Sedona’s ShapefileReader to create a Spatial RDD, then convert that Spatial RDD to a Spatial DataFrame, which we can query using Spatial SQL after creating a view (or temporary table) from the DataFrame. Spatial RDDs are one of Sedona’s core data structures, extending the Spark RDD with spatial geometries and querying functionality. Spatial DataFrames likewise extend Spark DataFrames, offering spatial indexing, partitioning, and querying functionality.

spatialRDD = ShapefileReader.readToGeometryRDD(sedona, S3_URL_NE)
counties_df = Adapter.toDf(spatialRDD, sedona)
counties_df.createOrReplaceTempView("counties")
counties_df.printSchema()

We can inspect the schema of the county data we imported. The relevant columns for our use case are geometry (the polygon that represents the boundary of each county), FIPS (a unique identifier for each county), NAME_EN (the English name of the county), and REGION (the state).

root
 |-- geometry: geometry (nullable = true)
 |-- FEATURECLA: string (nullable = true)
 |-- SCALERANK: string (nullable = true)
 |-- ADM2_CODE: string (nullable = true)
 |-- ISO_3166_2: string (nullable = true)
 |-- ISO_A2: string (nullable = true)
 |-- ADM0_SR: string (nullable = true)
 |-- NAME: string (nullable = true)
 |-- NAME_ALT: string (nullable = true)
 |-- NAME_LOCAL: string (nullable = true)
 |-- TYPE: string (nullable = true)
 |-- TYPE_EN: string (nullable = true)
 |-- CODE_LOCAL: string (nullable = true)
 |-- REGION: string (nullable = true)
 |-- REGION_COD: string (nullable = true)
 |-- ABBREV: string (nullable = true)
 |-- AREA_SQKM: string (nullable = true)
 |-- SAMEASCITY: string (nullable = true)
 |-- LABELRANK: string (nullable = true)
 |-- NAME_LEN: string (nullable = true)
 |-- MAPCOLOR9: string (nullable = true)
 |-- MAPCOLOR13: string (nullable = true)
 |-- FIPS: string (nullable = true)
 |-- SOV_A3: string (nullable = true)
 |-- ADM0_A3: string (nullable = true)
 |-- ADM0_LABEL: string (nullable = true)
 |-- ADMIN: string (nullable = true)
 |-- GEONUNIT: string (nullable = true)
 |-- GU_A3: string (nullable = true)
 |-- MIN_LABEL: string (nullable = true)
 |-- MAX_LABEL: string (nullable = true)
 |-- MIN_ZOOM: string (nullable = true)
 |-- WIKIDATAID: string (nullable = true)
 |-- NE_ID: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- NAME_AR: string (nullable = true)
 |-- NAME_BN: string (nullable = true)
 |-- NAME_DE: string (nullable = true)
 |-- NAME_EL: string (nullable = true)
 |-- NAME_EN: string (nullable = true)
 |-- NAME_ES: string (nullable = true)
 |-- NAME_FA: string (nullable = true)
 |-- NAME_FR: string (nullable = true)
 |-- NAME_HE: string (nullable = true)
 |-- NAME_HI: string (nullable = true)
 |-- NAME_HU: string (nullable = true)
 |-- NAME_ID: string (nullable = true)
 |-- NAME_IT: string (nullable = true)
 |-- NAME_JA: string (nullable = true)
 |-- NAME_KO: string (nullable = true)
 |-- NAME_NL: string (nullable = true)
 |-- NAME_PL: string (nullable = true)
 |-- NAME_PT: string (nullable = true)
 |-- NAME_RU: string (nullable = true)
 |-- NAME_SV: string (nullable = true)
 |-- NAME_TR: string (nullable = true)
 |-- NAME_UK: string (nullable = true)
 |-- NAME_UR: string (nullable = true)
 |-- NAME_VI: string (nullable = true)
 |-- NAME_ZH: string (nullable = true)
 |-- NAME_ZHT: string (nullable = true)

Now we can query this data using SQL:

sedona.sql("""
SELECT geometry, FIPS AS fips, NAME_EN AS name, REGION AS state FROM counties LIMIT 10
""").show()
+--------------------+-------+------------+-----+
|            geometry|   fips|        name|state|
+--------------------+-------+------------+-----+
|MULTIPOLYGON (((-...|US53073|     Whatcom|   WA|
|POLYGON ((-120.85...|US53047|    Okanogan|   WA|
|POLYGON ((-118.83...|US53019|       Ferry|   WA|
|POLYGON ((-118.21...|US53065|     Stevens|   WA|
|POLYGON ((-117.42...|US53051|Pend Oreille|   WA|
|POLYGON ((-117.03...|US16021|    Boundary|   ID|
|POLYGON ((-116.04...|US30053|     Lincoln|   MT|
|POLYGON ((-114.72...|US30029|    Flathead|   MT|
|POLYGON ((-114.06...|US30035|     Glacier|   MT|
|POLYGON ((-112.19...|US30101|       Toole|   MT|
+--------------------+-------+------------+-----+

Load Zillow ZHVI CSV File

Now that we have the geometries of each county we want to import the CSV file with the Zillow home value index data. I uploaded this file into Wherobots Cloud as well which we can access in our notebook environment using the S3 URL copied from the file browser UI.

S3_URL_ZHVI = "s3://<YOUR S3 URL HERE>"

After importing the CSV file into a new DataFrame we create a view and print the schema to inspect the data.

zhvi_df = sedona.read.format('csv').option('header','true').option('delimiter', ',').load(S3_URL_ZHVI)
zhvi_df.createOrReplaceTempView("zhvi")
zhvi_df.printSchema()
root
 |-- RegionID: string (nullable = true)
 |-- SizeRank: string (nullable = true)
 |-- RegionName: string (nullable = true)
 |-- RegionType: string (nullable = true)
 |-- StateName: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Metro: string (nullable = true)
 |-- StateCodeFIPS: string (nullable = true)
 |-- MunicipalCodeFIPS: string (nullable = true)
 |-- 2000-01-31: string (nullable = true)
 |-- 2000-02-29: string (nullable = true)
 |-- 2000-03-31: string (nullable = true)
...
 |-- 2023-10-31: string (nullable = true)

So our Zillow data has information about the county and a column for each monthly home value index going back to the year 2000 (I’ve omitted several column names from the output above).

Let’s inspect the first few rows of our Zillow data:

sedona.sql("""
SELECT RegionName,StateName, State, Metro, StateCodeFIPS, MunicipalCodeFIPS, `2000-01-31` FROM zhvi LIMIT 10
""").show()
+------------------+---------+-----+----------+-------------+-----------------+-------------+
|        RegionName|StateName|State|     Metro|StateCodeFIPS|MunicipalCodeFIPS|   2000-01-31|
+------------------+---------+-----+----------+-------------+-----------------+-------------+
|Los Angeles County|       CA|   CA|Los Ang...|           06|              037| 205974.69363|
|       Cook County|       IL|   IL|Chicago...|           17|              031| 145499.51956|
|     Harris County|       TX|   TX|Houston...|           48|              201|108350.203359|
|   Maricopa County|       AZ|   AZ|Phoenix...|           04|              013|143352.154097|
|  San Diego County|       CA|   CA|San Die...|           06|              073| 214883.41007|
|     Orange County|       CA|   CA|Los Ang...|           06|              059|249870.608658|
|      Kings County|       NY|   NY|New Yor...|           36|              047|200000.033597|
| Miami-Dade County|       FL|   FL|Miami-F...|           12|              086| 119546.52618|
|     Dallas County|       TX|   TX|Dallas-...|           48|              113| 94256.230613|
|  Riverside County|       CA|   CA|Riversi...|           06|              065| 149474.98732|
+------------------+---------+-----+----------+-------------+-----------------+-------------+

Joining ZHVI And County Geometries

So we now have two tables, zhvi and counties – let’s join them together so we can plot the current home value per county. To do that we’ll use the FIPS code, which uniquely identifies each county. The zhvi table breaks the FIPS code out into its state and municipal components, so we’ll need to concatenate them to join against the counties table.

zhvi_county_df = sedona.sql("""
SELECT CAST(zhvi.`2023-10-31` AS float) AS value, zhvi.RegionName AS name, counties.geometry
FROM zhvi
JOIN counties
ON CONCAT('US', zhvi.StateCodeFIPS, zhvi.MunicipalCodeFIPS) = counties.FIPS
WHERE NOT zhvi.State IN ('HI', 'AK')
""")

zhvi_county_df.show(5)
+---------+-------------------+--------------------+
|    value|               name|            geometry|
+---------+-------------------+--------------------+
|582171.56|     Whatcom County|MULTIPOLYGON (((-...|
|300434.03|    Okanogan County|POLYGON ((-120.85...|
| 279648.9|       Ferry County|POLYGON ((-118.83...|
|365007.16|     Stevens County|POLYGON ((-118.21...|
|364851.38|Pend Oreille County|POLYGON ((-117.42...|
+---------+-------------------+--------------------+
only showing top 5 rows

Now that we have our home value index and the county geometry in a single DataFrame we can use a choropleth map to visualize the data geographically. The SedonaPyDeck package supports creating choropleths with the create_choropleth_map method.

SedonaPyDeck.create_choropleth_map(zhvi_county_df,plot_col="value")

[Image: SedonaPyDeck choropleth map]

We can also use matplotlib to create a choropleth by converting our zhvi_county_df Sedona DataFrame to a GeoDataFrame. A best practice when creating choropleth maps is to use an equal-area projection to minimize distortion of the areas being visualized; here we declare the source CRS of the GeoDataFrame (the Natural Earth geometries are in EPSG:4326) and note how it could be reprojected to an equal-area CRS such as Albers (EPSG:5070).

zhvi_gdf = geopandas.GeoDataFrame(zhvi_county_df.toPandas(), geometry="geometry", crs="EPSG:4326")
# To reproject for a final map, note that to_crs returns a new GeoDataFrame:
# zhvi_gdf = zhvi_gdf.to_crs(epsg=5070)

Now we can plot our data as a choropleth using matplotlib. We’ll specify a yellow-green color ramp and experiment with different binning schemes that best fit the distribution of the data we’re visualizing. Here we use the Jenks-Caspall method.

ax = zhvi_gdf.plot(
    column="value",
    scheme="JenksCaspall",
    cmap="YlGn",
    legend=True,
    legend_kwds={"title": "Median Home Value($)", "fmt": "{:.0f}", "loc": "lower right"},
    missing_kwds={
        "color": "lightgrey",
        "edgecolor": "red",
        "hatch": "///",
        "label": "Missing values"
    },
    figsize=(12,9)
)

ax.set_axis_off()
ax.set_title("Median Home Value By County, Oct 2023")
ax.annotate("Data from Zillow, 2023 ZHVI",(-125,25))

plt.savefig("zhvi.png",dpi=300)

[Image: relative property values throughout the US, by county]

Visualizing Change In Home Value

So far we’ve plotted the current home value, now let’s calculate the change in home value over the last 5 years and visualize the spatial distribution.

In our SQL query let’s calculate the percent change in home value from 2018 to 2023. We’ll also compute the centroid of each county so we can annotate the map.

delta_county_df = sedona.sql("""
SELECT 
  ((CAST(zhvi.`2023-10-31` AS float) - CAST(zhvi.`2018-10-31` AS float)) / (CAST(zhvi.`2018-10-31` AS float)) * 100 )  AS delta,
  zhvi.RegionName As name, zhvi.Statename AS state,counties.geometry, ST_Centroid(counties.geometry) AS centroid
FROM zhvi
JOIN counties
ON CONCAT('US', zhvi.StateCodeFIPS, zhvi.MunicipalCodeFIPS) = counties.FIPS
WHERE NOT zhvi.State IN ('HI', 'AK')
""")
delta_county_df.show(5)
+-----------------+-------------------+-----+--------------------+--------------------+
|            delta|               name|state|            geometry|            centroid|
+-----------------+-------------------+-----+--------------------+--------------------+
|52.68140223225829|     Whatcom County|   WA|MULTIPOLYGON (((-...|POINT (-121.71238...|
|55.31723313387478|    Okanogan County|   WA|POLYGON ((-120.85...|POINT (-119.73730...|
|32.94520356645385|       Ferry County|   WA|POLYGON ((-118.83...|POINT (-118.51643...|
|66.86739887826734|     Stevens County|   WA|POLYGON ((-118.21...|POINT (-117.85278...|
|66.78068227229824|Pend Oreille County|   WA|POLYGON ((-117.42...|POINT (-117.27525...|
+-----------------+-------------------+-----+--------------------+--------------------+

We follow similar steps as above to create a GeoDataFrame.

# Convert to a GeoDataFrame, declaring the source CRS as before

delta_gdf = geopandas.GeoDataFrame(delta_county_df.toPandas(), geometry="geometry", crs="EPSG:4326")

Let’s find the counties with the most extreme changes in home value so we can annotate them on the map.

To find the county with the greatest increase in property value:

delta_gdf.iloc[delta_gdf['delta'].idxmax()]
delta                                              169.532379
name                                          Petersburg City
state                                                      VA
geometry    POLYGON ((-77.40366881899992 37.23746578695217...
centroid         POINT (-77.39189306391864 37.20426989798888)
Name: 2937, dtype: object

And to find the county with the greatest decrease in property value:

delta_gdf.iloc[delta_gdf['delta'].idxmin()]
delta                                              -44.794506
name                                         Claiborne Parish
state                                                      LA
geometry    POLYGON ((-93.23859537841794 33.01169686180804...
centroid         POINT (-92.99522452214245 32.81998668151367)
Name: 944, dtype: object

The steps to create our choropleth map using matplotlib are similar to those for the previous map; however, this time, since our data diverges around zero, we’ll use a red-yellow-green color ramp and manually specify the bins.

ax = delta_gdf.plot(
    column="delta",
    scheme="User_Defined",
    cmap="RdYlGn",
    legend=True,
    legend_kwds={"title": "Median Home Value % Change", "loc": "lower right"},
    classification_kwds={"bins":[0,20,40,60,80,100,170]},
    missing_kwds={
        "color": "lightgrey",
        #"edgecolor": "red",
        #"hatch": "///",
        "label": "Missing values"
    },
    figsize=(12,9)
)

ax.set_axis_off()
ax.set_title("Median Home Value Percent Change By County, 2018-2023")
ax.annotate("Data from Zillow's ZHVI",(-125,25))

ax.annotate('Petersburg City, VA +169.5%',
            xy=(-77.39189306391864, 37.20426989798888), xycoords='data',
            xytext=(100, 10), textcoords='offset points',
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='center', verticalalignment='bottom')

ax.annotate('Claiborne Parish, LA -44.8%',
            xy=(-92.99522452214245, 32.81998668151367), xycoords='data',
            xytext=(50, -110), textcoords='offset points',
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='center', verticalalignment='bottom')

plt.savefig("delta.png", dpi=300)

[Image: change in real estate values in the US, 2018-2023]

Even though we didn’t leverage much advanced Spatial SQL functionality, this post demonstrated how to work with spatial data in various formats (CSV and Shapefile) in Wherobots Cloud, how to use SQL with Apache Sedona to join datasets, and how to leverage tools from the PyData ecosystem like GeoPandas and matplotlib in the Wherobots notebook environment.

Resources

In addition to the data sources linked at the beginning of this post, I found the following resources helpful when creating this map:

Create your free Wherobots Cloud account today to get started working with geospatial data and Apache Sedona.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter.