Introducing RasterFlow: a planetary scale inference engine for Earth Intelligence LEARN MORE

Analyzing Real Estate Data With Apache Sedona

Change In Real Estate Values In The US 2018-2023

In a recent tutorial on the Wherobots YouTube channel we explored how to analyze real estate data using Apache Sedona and Wherobots Cloud using data from Zillow. Specifically, we calculated the change in median home value over the last 5 years per county using Zillow’s Zillow Home Value Index (ZHVI) and visualized the results.

This tutorial highlights a few features of Apache Sedona and Wherobots Cloud, including:

  • An overview of the Wherobots Cloud notebook environment
  • How to upload data to Wherobots Cloud
  • How to import datasets using Apache Sedona (CSV and Shapefile)
  • Using SQL to query data using Apache Sedona
  • Creating interactive visualizations of the results of our geospatial analysis in the Wherobots notebook environment
  • Creating choropleth maps with GeoPandas and matplotlib

Feel free to follow along with the video tutorial below:

We used two data sources for this map:

To follow along create a free account in Wherobots Cloud and create a new blank notebook. We’ll be using Python and SQL to interact with Sedona.

Many geospatial Python packages are installed by default in Wherobots Cloud, but we can also add packages either when launching the notebook environment in the Wherobots Cloud UI or installing with pip in the notebook. For this tutorial the only package we’ll need to install is mapclassify which we’ll use to help with creating the data distribution bins when creating our choropleth map.

# All other packages are installed by default in Wherobots Cloud

pip install mapclassify

Next, we’ll import some dependencies for working with Sedona and importing our datasets.

from sedona.spark import *
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter
import geopandas

One of the first steps to using Sedona in our notebook is configuring the SedonaContext and connecting to our cluster. The config object here is typically where we can specify authentication providers for accessing cloud storage buckets. We won’t need this functionality in this tutorial, but this is the default configuration I typically use.

config = SedonaContext.builder().appName('zillowdata')\
    .config("spark.hadoop.fs.s3a.bucket.wherobots-examples-prod.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")\
    .getOrCreate()
sedona = SedonaContext.create(config)

Natural Earth Shapefile Import

We’ll be analyzing real estate values at the county level in the US. Since the dataset from Zillow doesn’t include county geometries we’ll use the Natural Earth Admin Level 2 dataset for US county geometries. I uploaded the Shapefile files to Wherobots Cloud using this structure:

The Wherobots File Management Interface

Note that since a Shapefile is composed of multiple files, the S3 URL we want to use is the location of the directory, not one of the individual files:

working with shapefiles in Wherobots Cloud

Then in our notebook we can save the S3 URL as a variable.

S3_URL_NE = "s3://<YOUR_S3_URL_HERE>"

To import Shapefile datasets into Sedona we can use Sedona’s ShapefileReader to create a Spatial RDD data structure, then convert that Spatial RDD to a Spatial DataFrame, which we’ll then be able to query using Spatial SQL, after creating a view (or temporary table) from the DataFrame. Spatial RDDs are one of Sedona’s core data structures that extends the Spark RDD by enabling spatial geometries and querying functionality. Spatial DataFrames extend Spark DataFrames as well, offering spatial indexing, partitioning, and querying functionality.

spatialRDD = ShapefileReader.readToGeometryRDD(sedona, S3_URL_NE)
counties_df = Adapter.toDf(spatialRDD, sedona)
counties_df.createOrReplaceTempView("counties")
counties_df.printSchema()

We can inspect the schema of the county data we imported. The relevant columns for our use case are geometry (the polygon that represents the boundary of each county), FIPS (a unique identifier for each county), NAME_EN (the English name of the county), and REGION (the state).

root
 |-- geometry: geometry (nullable = true)
 |-- FEATURECLA: string (nullable = true)
 |-- SCALERANK: string (nullable = true)
 |-- ADM2_CODE: string (nullable = true)
 |-- ISO_3166_2: string (nullable = true)
 |-- ISO_A2: string (nullable = true)
 |-- ADM0_SR: string (nullable = true)
 |-- NAME: string (nullable = true)
 |-- NAME_ALT: string (nullable = true)
 |-- NAME_LOCAL: string (nullable = true)
 |-- TYPE: string (nullable = true)
 |-- TYPE_EN: string (nullable = true)
 |-- CODE_LOCAL: string (nullable = true)
 |-- REGION: string (nullable = true)
 |-- REGION_COD: string (nullable = true)
 |-- ABBREV: string (nullable = true)
 |-- AREA_SQKM: string (nullable = true)
 |-- SAMEASCITY: string (nullable = true)
 |-- LABELRANK: string (nullable = true)
 |-- NAME_LEN: string (nullable = true)
 |-- MAPCOLOR9: string (nullable = true)
 |-- MAPCOLOR13: string (nullable = true)
 |-- FIPS: string (nullable = true)
 |-- SOV_A3: string (nullable = true)
 |-- ADM0_A3: string (nullable = true)
 |-- ADM0_LABEL: string (nullable = true)
 |-- ADMIN: string (nullable = true)
 |-- GEONUNIT: string (nullable = true)
 |-- GU_A3: string (nullable = true)
 |-- MIN_LABEL: string (nullable = true)
 |-- MAX_LABEL: string (nullable = true)
 |-- MIN_ZOOM: string (nullable = true)
 |-- WIKIDATAID: string (nullable = true)
 |-- NE_ID: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- NAME_AR: string (nullable = true)
 |-- NAME_BN: string (nullable = true)
 |-- NAME_DE: string (nullable = true)
 |-- NAME_EL: string (nullable = true)
 |-- NAME_EN: string (nullable = true)
 |-- NAME_ES: string (nullable = true)
 |-- NAME_FA: string (nullable = true)
 |-- NAME_FR: string (nullable = true)
 |-- NAME_HE: string (nullable = true)
 |-- NAME_HI: string (nullable = true)
 |-- NAME_HU: string (nullable = true)
 |-- NAME_ID: string (nullable = true)
 |-- NAME_IT: string (nullable = true)
 |-- NAME_JA: string (nullable = true)
 |-- NAME_KO: string (nullable = true)
 |-- NAME_NL: string (nullable = true)
 |-- NAME_PL: string (nullable = true)
 |-- NAME_PT: string (nullable = true)
 |-- NAME_RU: string (nullable = true)
 |-- NAME_SV: string (nullable = true)
 |-- NAME_TR: string (nullable = true)
 |-- NAME_UK: string (nullable = true)
 |-- NAME_UR: string (nullable = true)
 |-- NAME_VI: string (nullable = true)
 |-- NAME_ZH: string (nullable = true)
 |-- NAME_ZHT: string (nullable = true)

Now we can query this data using SQL:

sedona.sql("""
SELECT geometry, FIPS AS fips, NAME_EN AS name, REGION AS state FROM counties LIMIT 10
""").show()
+--------------------+-------+------------+-----+
|            geometry|   fips|        name|state|
+--------------------+-------+------------+-----+
|MULTIPOLYGON (((-...|US53073|     Whatcom|   WA|
|POLYGON ((-120.85...|US53047|    Okanogan|   WA|
|POLYGON ((-118.83...|US53019|       Ferry|   WA|
|POLYGON ((-118.21...|US53065|     Stevens|   WA|
|POLYGON ((-117.42...|US53051|Pend Oreille|   WA|
|POLYGON ((-117.03...|US16021|    Boundary|   ID|
|POLYGON ((-116.04...|US30053|     Lincoln|   MT|
|POLYGON ((-114.72...|US30029|    Flathead|   MT|
|POLYGON ((-114.06...|US30035|     Glacier|   MT|
|POLYGON ((-112.19...|US30101|       Toole|   MT|
+--------------------+-------+------------+-----+

Load Zillow ZHVI CSV File

Now that we have the geometries of each county we want to import the CSV file with the Zillow home value index data. I uploaded this file into Wherobots Cloud as well which we can access in our notebook environment using the S3 URL copied from the file browser UI.

S3_URL_ZHVI = "s3://<YOUR S3 URL HERE>"

After importing the CSV file into a new DataFrame we create a view and print the schema to inspect the data.

zhvi_df = sedona.read.format('csv').option('header','true').option('delimiter', ',').load(S3_URL_ZHVI)
zhvi_df.createOrReplaceTempView("zhvi")
zhvi_df.printSchema()
root
 |-- RegionID: string (nullable = true)
 |-- SizeRank: string (nullable = true)
 |-- RegionName: string (nullable = true)
 |-- RegionType: string (nullable = true)
 |-- StateName: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Metro: string (nullable = true)
 |-- StateCodeFIPS: string (nullable = true)
 |-- MunicipalCodeFIPS: string (nullable = true)
 |-- 2000-01-31: string (nullable = true)
 |-- 2000-02-29: string (nullable = true)
 |-- 2000-03-31: string (nullable = true)
...
 |-- 2023-10-31: string (nullable = true)

So our Zillow data has information about the county and a column for each monthly home value index going back to the year 2000 (I’ve omitted several column names from the output above).

Let’s inspect the first few rows of our Zillow data:

sedona.sql("""
SELECT RegionName,StateName, State, Metro, StateCodeFIPS, MunicipalCodeFIPS, `2000-01-31` FROM zhvi LIMIT 10
""").show()
+------------------+---------+-----+----------+-------------+-----------------+-------------+
|        RegionName|StateName|State|     Metro|StateCodeFIPS|MunicipalCodeFIPS|   2000-01-31|
+------------------+---------+-----+----------+-------------+-----------------+-------------+
|Los Angeles County|       CA|   CA|Los Ang...|           06|              037| 205974.69363|
|       Cook County|       IL|   IL|Chicago...|           17|              031| 145499.51956|
|     Harris County|       TX|   TX|Houston...|           48|              201|108350.203359|
|   Maricopa County|       AZ|   AZ|Phoenix...|           04|              013|143352.154097|
|  San Diego County|       CA|   CA|San Die...|           06|              073| 214883.41007|
|     Orange County|       CA|   CA|Los Ang...|           06|              059|249870.608658|
|      Kings County|       NY|   NY|New Yor...|           36|              047|200000.033597|
| Miami-Dade County|       FL|   FL|Miami-F...|           12|              086| 119546.52618|
|     Dallas County|       TX|   TX|Dallas-...|           48|              113| 94256.230613|
|  Riverside County|       CA|   CA|Riversi...|           06|              065| 149474.98732|
+------------------+---------+-----+----------+-------------+-----------------+-------------+

Joining ZHVI And County Geometries

So we now have two tables zhvi and counties – let’s join them together so we can plot current home value per county. To do that we’ll use the FIPS code which uniquely identifies each county. The zhvi table breaks the FIPS code out into the state and municipal components so we’ll need to combine them so we can join against the counties table.

zhvi_county_df = sedona.sql("""
SELECT CAST(zhvi.`2023-10-31` AS float) AS value, zhvi.RegionName As name, counties.geometry
FROM zhvi
JOIN counties
ON CONCAT('US', zhvi.StateCodeFIPS, zhvi.MunicipalCodeFIPS) = counties.FIPS
WHERE NOT zhvi.State IN ('HI', 'AK')

zhvi_county_df.show(5)
""")
+---------+-------------------+--------------------+
|    value|               name|            geometry|
+---------+-------------------+--------------------+
|582171.56|     Whatcom County|MULTIPOLYGON (((-...|
|300434.03|    Okanogan County|POLYGON ((-120.85...|
| 279648.9|       Ferry County|POLYGON ((-118.83...|
|365007.16|     Stevens County|POLYGON ((-118.21...|
|364851.38|Pend Oreille County|POLYGON ((-117.42...|
+---------+-------------------+--------------------+
only showing top 5 rows

Now that we have our home value index and the county geometry in a single DataFrame we can use a choropleth map to visualize the data geographically. The SedonaPyDeck package supports creating choropleths with the create_choropleth_map method.

SedonaPyDeck.create_choropleth_map(zhvi_county_df,plot_col="value")
SedonaPyDeck choropleth map

We can also use matplotlib to create a choropleth by converting our zhvi_county_df Sedona DataFrame to a GeoDataFrame. A best practice when creating choropleth maps is to use an equal area projection to minimize distortion of the areas being visualized so we set the CRS of our GeoDataFrame to Albers Equal Area.

zhvi_gdf = geopandas.GeoDataFrame(zhvi_county_df.toPandas(), geometry="geometry")
zhvi_gdf.to_crs(epsg=3857)

Now we can plot our data as a choropleth using matplotlib. We’ll specify a yellow-green color ramp and experiment with different binning schemes that best fit the distribution of the data we’re visualizing. Here we use the Jenks-Caspall method.

ax = zhvi_gdf.plot(
    column="value",
    scheme="JenksCaspall",
    cmap="YlGn",
    legend=True,
    legend_kwds={"title": "Median Home Value($)", "fmt": "{:.0f}", "loc": "lower right"},
    missing_kwds={
        "color": "lightgrey",
        "edgecolor": "red",
        "hatch": "///",
        "label": "Missing values"
    },
    figsize=(12,9)
)

ax.set_axis_off()
ax.set_title("Median Home Value By County, Oct 2023")
ax.annotate("Data from Zillow, 2023 ZHVI",(-125,25))

plt.savefig("zhvi.png",dpi=300)
Relative property values throughout the US - by county

Visualizing Change In Home Value

So far we’ve plotted the current home value, now let’s calculate the change in home value over the last 5 years and visualize the spatial distribution.

In our SQL query let’s calculate the percent change in home value from 2018 to 2023. We’ll also compute the centroid of each county so we can annotate the map.

delta_county_df = sedona.sql("""
SELECT 
  ((CAST(zhvi.`2023-10-31` AS float) - CAST(zhvi.`2018-10-31` AS float)) / (CAST(zhvi.`2018-10-31` AS float)) * 100 )  AS delta,
  zhvi.RegionName As name, zhvi.Statename AS state,counties.geometry, ST_Centroid(counties.geometry) AS centroid
FROM zhvi
JOIN counties
ON CONCAT('US', zhvi.StateCodeFIPS, zhvi.MunicipalCodeFIPS) = counties.FIPS
WHERE NOT zhvi.State IN ('HI', 'AK')
""")
delta_county_df.show(5)
+-----------------+-------------------+-----+--------------------+--------------------+
|            delta|               name|state|            geometry|            centroid|
+-----------------+-------------------+-----+--------------------+--------------------+
|52.68140223225829|     Whatcom County|   WA|MULTIPOLYGON (((-...|POINT (-121.71238...|
|55.31723313387478|    Okanogan County|   WA|POLYGON ((-120.85...|POINT (-119.73730...|
|32.94520356645385|       Ferry County|   WA|POLYGON ((-118.83...|POINT (-118.51643...|
|66.86739887826734|     Stevens County|   WA|POLYGON ((-118.21...|POINT (-117.85278...|
|66.78068227229824|Pend Oreille County|   WA|POLYGON ((-117.42...|POINT (-117.27525...|
+-----------------+-------------------+-----+--------------------+--------------------+

We follow similar steps as above to create a GeoDataFrame.

# We'll convert to a GeoDataFrame

delta_gdf = geopandas.GeoDataFrame(delta_county_df.toPandas(), geometry="geometry")
delta_gdf.to_crs(epsg=3857)

Let’s find the counties with the most extreme changes in home value so we can annotate them on the map.

To find the county with the greatest increase in property value:

delta_gdf.iloc[delta_gdf['delta'].idxmax()]
delta                                              169.532379
name                                          Petersburg City
state                                                      VA
geometry    POLYGON ((-77.40366881899992 37.23746578695217...
centroid         POINT (-77.39189306391864 37.20426989798888)
Name: 2937, dtype: object

And to find the county with the greatest decrease in property value:

delta_gdf.iloc[delta_gdf['delta'].idxmin()]
delta                                              -44.794506
name                                         Claiborne Parish
state                                                      LA
geometry    POLYGON ((-93.23859537841794 33.01169686180804...
centroid         POINT (-92.99522452214245 32.81998668151367)
Name: 944, dtype: object

The steps to create our choropleth map using matplotlib are similar to the previous map however this time since our data diverges around zero we’ll use a red-yellow-green color ramp and manually specify the bins.

ax = delta_gdf.plot(
    column="delta",
    scheme="User_Defined",
    cmap="RdYlGn",
    legend=True,
    legend_kwds={"title": "Median Home Value % Change", "loc": "lower right"},
    classification_kwds={"bins":[0,20,40,60,80,100,170]},
    missing_kwds={
        "color": "lightgrey",
        #"edgecolor": "red",
        #"hatch": "///",
        "label": "Missing values"
    },
    figsize=(12,9)
)

ax.set_axis_off()
ax.set_title("Median Home Value Percent Change By County, 2018-2023")
ax.annotate("Data from Zillow's ZHVI",(-125,25))

ax.annotate('Petersburg City, VA +169.5%',
            xy=(-77.39189306391864, 37.20426989798888), xycoords='data',
            xytext=(100, 10), textcoords='offset points',
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='center', verticalalignment='bottom')

ax.annotate('Claiborne Parish, LA -44.8%',
            xy=(-92.99522452214245, 32.81998668151367), xycoords='data',
            xytext=(50, -110), textcoords='offset points',
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='center', verticalalignment='bottom')

plt.savefig("delta.png", dpi=300)
Change In Real Estate Values In The US 2018-2023

Even though we didn’t leverage much advanced Spatial SQL functionality this post demonstrated how to work with spatial data in various formats (CSV and Shapefile) in Wherobots Cloud, how to use SQL with Apache Sedona to join datasets, and how to leverage tools from the PyData ecosystem like GeoPandas and matplotlib in the Wherobots Notebook environment.

Resources

In addition to the data sources linked at the beginning of this post, I found the following resources helpful when creating this map:

Create your free Wherobots Cloud account today to get started working with geospatial data and Apache Sedona

O'Reilly Apache Sedona Cloud Native Geospatial with Apache Sedona book cover

Interested in learning more about Apache Sedona? Get access to this free hands-on guide on practical solutions for the challenges of working with various types of geospatial data.

Get access to the guide