Introducing GeoStats for WherobotsAI and Apache Sedona

We are excited to introduce GeoStats, a machine learning (ML) and statistical toolbox for WherobotsAI and Apache Sedona users. With GeoStats, you can easily identify critical patterns in geospatial datasets, such as hotspots and anomalies, and quickly get critical insights from large-scale data. While these algorithms are supported in other packages, we’ve optimized each algorithm to be highly performant for small to planetary-scale geospatial workloads. That means you can get results from these algorithms significantly faster, at a lower cost, and do it all more productively, through a unified development experience purpose-built for geospatial data science and ETL.

The Wherobots toolbox supports the DBSCAN, Local Outlier Factor (LOF), and Getis-Ord (Gi*) algorithms. Apache Sedona users can utilize DBSCAN starting with Apache Sedona version 1.7.0, and like all other features of Apache Sedona, it’s fully compatible with Wherobots.

Use Cases for GeoStats

DBSCAN

DBSCAN is the most popular algorithm we see in geospatial use cases. It identifies clusters (areas of your data that are closely packed together) and outliers (points that are set apart from the rest).

Typical use cases for DBSCAN are found in:

  • Retail: Decision makers use DBSCAN with location data to understand areas of high and low pedestrian activity and decide where to set up retail establishments.
  • City Planning: City planners use DBSCAN with GPS data to optimize transit support by identifying high usage routes, areas in need of additional transit options, or areas that have too much support.
  • Air Traffic Control: Traffic controllers use DBSCAN to identify areas with increasing weather activity to optimize flight routing.
  • Risk computation: Insurers and others can use DBSCAN to make policy decisions and calculate risk where risk is correlated to the proximity of two or more features of interest.
Local Outlier Factor (LOF)

LOF is an anomaly detection algorithm that identifies outliers present in a dataset.

Typical use cases for LOF include:

  • Data analysis and cleansing: Data teams can use LOF to identify and remove anomalies within a dataset, like removing erroneous GPS data points from a trace dataset
Getis-Ord (Gi*)

Getis-Ord is also a popular algorithm for identifying local hot and cold spots.

Typical use cases for Gi* include:

  • Public health: Officials can use disease data with Gi* to identify areas of abnormal disease outbreak
  • Telecommunications: Network administrators can use Gi* to identify areas of high demand and optimize network deployment
  • Insurance: Insurers can identify areas prone to specific claims to better manage risk

Traditional challenges with using these algorithms on geospatial data

Before GeoStats, teams leveraging any of the algorithms in the toolbox in a data analysis or ML pipeline would:

  1. Struggle to get performance or scale from underlying solutions, which also tend to perform poorly when joining geospatial data.
  2. Determine how to host and scale open source versions of popular ML and statistical algorithms, like DBSCAN in PostGIS or scikit-learn, Gi* in PySAL, or LOF in scikit-learn, to work with geospatial data types and formats.
  3. Replicate this overhead each time they want to deploy a new algorithm for geospatial data.

Benefits of WherobotsAI GeoStats

With GeoStats in WherobotsAI, you can now:

  1. Easily run native algorithms on a cloud-based engine, optimized for producing spatial data products and insights at scale.
  2. Use these algorithms without the operational overhead associated with setup and maintenance.
  3. Leverage optimized, hosted algorithms within a single platform to easily experiment and get critical insights faster.

We’ll walk through a brief overview of each algorithm, how to use them, and show how they perform at various scales.

Diving Deeper into the GeoStats Toolbox

DBSCAN Overview

DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups points with many nearby neighbors and marks points that lie alone in low-density regions as outliers.

How to use DBSCAN in Wherobots

The following examples assume you have already set up an organization and have an active runtime and notebook, with a DataFrame of interest to run the algorithms on.

WherobotsAI GeoStats DBSCAN Python API Overview
For a full walk through see the Python API reference: dbscan(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: max distance to neighbors (epsilon), min neighbors (min_points)
  • Output: dataframe with cluster id

DBSCAN Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint("_1", "_2").alias("geometry"))
  2. Choose values for your hyperparameters, max distance to neighbors (epsilon) and minimum neighbors (min_points). These values will determine how DBSCAN identifies clusters.
epsilon=0.3
min_points=10
  3. Run DBSCAN on your DataFrame with your chosen hyperparameter values.
clusters_df = dbscan(df, epsilon=0.3, min_points=10, include_outliers=True)
  4. Analyze the results. For each datapoint, DBSCAN returns the cluster it’s associated with, or flags it as an outlier.
+--------------------+------+-------+
|            geometry|isCore|cluster|
+--------------------+------+-------+
|POINT (1.22185277...| false|      1|
|POINT (0.77885034...| false|      1|
|POINT (-2.2744742...| false|      2|
+--------------------+------+-------+

only showing top 3 rows
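
If you want to go a step further, standard DataFrame operations work on the output. The following is a minimal sketch (assuming the clusters_df DataFrame from the walkthrough above) that counts how many points fall into each cluster:

from pyspark.sql import functions as f

cluster_sizes = (
    clusters_df
        .groupBy("cluster")                        # cluster id assigned by DBSCAN
        .agg(f.count("*").alias("num_points"))     # size of each cluster
        .orderBy(f.desc("num_points"))
)
cluster_sizes.show()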

There’s a complete example of how to use DBSCAN in the Wherobots user documentation.

DBSCAN Performance Overview

To show DBSCAN performance in Wherobots, we created a European sample of the Overture buildings dataset and ran DBSCAN to identify clusters of buildings near each other, starting from the geographic center of Europe and working outwards. For each subsampled dataset, we ran DBSCAN with an epsilon of 0.005 degrees and a min_points value of 4 on a Large runtime in Wherobots Cloud. As seen below, DBSCAN effectively processes an increasing number of records, with 100M records taking 1.6 hours to process.

Local Outlier Factor (LOF)

LOF is an anomaly detection algorithm that identifies outliers present in a dataset. It does this by measuring how close a given data point is to its k nearest neighbors (with k being a user-chosen hyperparameter) compared to how close those neighbors are to their own nearest neighbors. LOF provides a score that represents the degree to which a record is an inlier or outlier.
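
To build intuition for the score, here is a toy, single-machine illustration using scikit-learn’s LocalOutlierFactor (named explicitly because it is not the Wherobots API): scores near 1 indicate inliers, while noticeably larger scores indicate outliers.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
dense_cluster = rng.normal(0, 1, size=(200, 2))    # tightly packed points
far_point = np.array([[8.0, 8.0]])                 # one isolated point
points = np.vstack([dense_cluster, far_point])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(points)
scores = -lof.negative_outlier_factor_             # higher = more anomalous
print(scores[:3])    # roughly 1.0 for points inside the dense cluster
print(scores[-1])    # much larger than 1.0 for the isolated point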

How to use LOF in Wherobots

For the full example, please see this docs page.

WherobotsAI GeoStats LOF Python API Overview
For a full walk through see the Python API reference: local_outlier_factor(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: number of nearest neighbors to use
  • Output: score representing degree of inlier or outlier

LOF Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
df = sedona.createDataFrame(X).select(ST_MakePoint(f.col("_1"), f.col("_2")).alias("geometry"))
  2. Choose your k value for how many nearest neighbors you want to use to measure density near a given datapoint.
k=20
  3. Run LOF on your DataFrame with your chosen k value.
outliers_df = local_outlier_factor(df, k=20)
  4. Analyze your results. LOF returns a score for each datapoint representing the degree of inlier or outlier.
+--------------------+------------------+
|            geometry|               lof|
+--------------------+------------------+
|POINT (-1.9169927...|0.9991534865548664|
|POINT (-1.7562422...|1.1370318880088373|
|POINT (-2.0107478...|1.1533763384772193|
+--------------------+------------------+
only showing top 3 rows
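
As a follow-up, you can threshold the score to keep only the strongest anomalies. A minimal sketch (assuming the outliers_df DataFrame from the walkthrough above; the 1.5 cutoff is an illustrative choice, not a recommended default):

from pyspark.sql import functions as f

strong_outliers = outliers_df.filter(f.col("lof") > 1.5)    # keep high LOF scores only
strong_outliers.orderBy(f.desc("lof")).show(10)             # most anomalous points first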

There’s a complete example of how to use LOF in the Wherobots user documentation.

LOF Performance Overview

We followed the same procedure as with DBSCAN, but ran LOF to score how anomalous each building is relative to its neighbors. For each set of buildings we ran LOF with k=20 on a Large runtime in Wherobots Cloud. As seen below, GeoStats LOF scales effectively with growing data size, with 100M records taking 10 minutes to process.

Getis-Ord (Gi*) Overview

Getis-Ord is an algorithm for identifying statistically significant local hot and cold spots.

How to use GeoStats Gi*

WherobotsAI GeoStats Gi* Python API Overview
For a full walk through see the Python API reference: g_local(...).

  • Supported Geometries: points, linestrings, polygons
  • Hyperparameters: star, neighbor weighting
  • Output: Set of statistics that indicate the degree of local hot or cold spot for a given record

Gi* Walk Through

  1. Choose your dataset and create a Sedona DataFrame.
places_df = (
    sedona.table("wherobots_open_data.overture_2024_07_22.places_place")
        .select(f.col("geometry"), f.col("categories"))
        .withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)
  2. Choose how you’d like to weight datapoints (for example, datapoints in a specific geographic area are weighted higher, or datapoints close to a given datapoint are weighted higher) and set star (a boolean indicating whether a record counts as a neighbor of itself).

star = True
neighbor_search_radius_degrees = 1.0
variable_column = "myNumericColumnName"

weighted_dataframe = add_binary_distance_band_column(
    places_df,
    neighbor_search_radius_degrees,
    include_self=star,
)
  3. Run Gi* on your DataFrame with your chosen hyperparameters.
gi_df = g_local(
    weighted_dataframe,
    variable_column,
    star=star,
)
  4. Analyze your results. For each datapoint, Gi* returns a set of statistics that indicate the degree of local hot or cold spot.
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|num_places|                  G|                  EG|                  VG|                 Z|                   P|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
|       871| 0.1397485091609774|0.013219284603421462|5.542296862370928E-5|16.995969941572465|                 0.0|
|       908|0.16097739240211956|0.013219284603421462|5.542296862370928E-5|19.847528249317246|                 0.0|
|       218|0.11812096144582315|0.013219284603421462|5.542296862370928E-5|14.090861243071908|                 0.0|
+----------+-------------------+--------------------+--------------------+------------------+--------------------+
only showing top 3 rows
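
From here, you can filter for records that look like statistically significant hot spots, i.e. a large positive Z score and a small p-value. A minimal sketch (assuming the gi_df DataFrame from the walkthrough above; the 1.96 and 0.05 thresholds correspond to a common 95% confidence level and are illustrative):

from pyspark.sql import functions as f

hot_spots = gi_df.filter((f.col("Z") > 1.96) & (f.col("P") < 0.05))   # significant hot spots
hot_spots.orderBy(f.desc("Z")).show(10)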

There’s a complete example of how to use Gi* in the Wherobots user documentation.

Getis-Ord Performance Overview

To showcase how Gi* performs in Wherobots, we again used the same example as DBSCAN, but ran Gi* on the area of the buildings. With each set of buildings we ran Gi* with a binary neighbor weight and a neighborhood radius of 0.007 degrees (~0.5 miles) on a Large runtime in Wherobots Cloud. As seen below, the algorithm scales mostly linearly with the number of records, with 100M records taking 1.6 hours to process.

Get started with WherobotsAI GeoStats

The way we’ve implemented these algorithms for large-scale geospatial workloads will help you make sense of your geospatial data faster. You can get started for free today.

  • If you haven’t already, create a free Wherobots Organization subscribed to the Community Edition of Wherobots.
  • Start a Wherobots Notebook
  • In the Notebook environment, explore the notebook_example/python/wherobots-ai/ folder for examples that you can use to get started.
  • Need additional help? Check out our user documentation, and send us a note if needed at support@wherobots.com.

Apache Sedona Users

Apache Sedona users will have access to GeoStats DBSCAN with the Apache Sedona 1.7.0 release. Subscribe to the Sedona newsletter and join the Sedona community to get notified of the release and get started!

What’s next

We’re excited to hear what ML and statistical algorithms you’d like us to support. We can’t wait for your feedback and to see what you’ll create!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

The WherobotsAI Team

Announcing SAML Single Sign-On (SSO) Support

We’re excited to announce that Wherobots Cloud now supports SAML SSO for Professional Edition customers. This enhancement underscores our commitment to providing robust security features tailored to the needs of businesses that handle sensitive data.

What is SAML Single Sign-On

SAML Single Sign on Graph

Many companies already use SAML SSO. It is the mechanism by which you log in to third party tools with your company’s centralized login (Google, Okta, etc.)

SAML (Security Assertion Markup Language) and SSO (Single Sign-On) are essential components of modern security protocols. SAML is an open standard for exchanging authentication and authorization data between parties, and SSO allows users to log in once and gain access to multiple systems without re-entering credentials.

How to enable SAML SSO

Setting up SAML SSO is straightforward. Any Professional Edition Wherobots administrator can enable this feature by following these steps:

  1. Verify Your Domain:

    Go to your organization’s settings, copy the provided TXT DNS record, and add it to your domain provider’s DNS settings. Once done, click “Verify” in Wherobots.

  2. Configure Your Identity Provider (IdP):

    Log in to your IdP (e.g., Google Workspace, Okta, OneLogin, Azure AD) and configure it using the SAML details provided in the Wherobots settings.
    SAML Configuration

  3. Enter IdP Details into Wherobots:

    Input the Identity Provider Issuer URL, SSO URL, and certificate details from your IdP into the SAML section in Wherobots.
    SAML IdP Details view

  4. Enable SAML SSO:

    Make sure there’s an admin user with an email from the verified domain. Then, toggle the “Enable” switch in Wherobots to activate SSO.

  5. Test the Integration:

    Log in using your verified domain email to ensure everything is set up correctly.

For more detailed step-by-step instructions, check out our comprehensive getting started guide.

Why Your Organization Should Use SAML Single Sign-On

Benefits to Users

For users, SAML SSO simplifies the login process. With fewer passwords to remember and less frequent logins, users save time and reduce frustration. This streamlined access means employees can focus more on their tasks and less on managing multiple credentials.

Benefits to Organizations

Organizations benefit from SAML SSO by enhancing security and reducing administrative overhead. Centralized authentication means fewer password-related support tickets and a lower risk of phishing attacks, as users are less likely to fall for credential-stealing schemes. Moreover, it ensures compliance with security policies and regulatory requirements by enforcing strong, consistent access controls. It also allows organizations to grant access to certain third party services only for certain users/groups of users.

Location data is particularly sensitive, as it can reveal personal habits, routines, and even confidential business operations. For example, in healthcare, location data of patients visiting specialized clinics could inadvertently disclose medical information. By implementing SAML SSO, organizations can better control access, ensuring that only authorized personnel can view and interact with this information within the Wherobots platform.

Elevate Your Security with SAML SSO

Take this opportunity to simplify authentication, reduce security risks, and improve productivity. If you are not already a Professional Edition customer, upgrade to gain access to SAML SSO. By upgrading, you’ll not only bolster your security measures and simplify logins, but you’ll gain access to more powerful workloads, service principals, and more. Contact us to schedule a demo or get started!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

Wherobots and PostgreSQL + PostGIS: A Synergy in Spatial Analysis

Introduction

Spatial data has never been more abundant and valued in decision making (everything happens somewhere!). The tools we use to process location based data can significantly impact the outcomes of our projects. In the geospatial ecosystem, Apache Sedona-powered Wherobots and PostgreSQL with the PostGIS extension each offer robust capabilities. They share some functionalities, but they are more powerful when used together, rather than in isolation. This post explores how combining Wherobots and PostgreSQL + PostGIS can enhance spatial data processing, offering a synergy that leverages the unique strengths of each tool to enhance data analysis and decision-making processes.

Exploring PostgreSQL + PostGIS

PostgreSQL is a powerful, open-source object-relational database system known for its robustness and reliability. When combined with the PostGIS extension, PostgreSQL transforms into a spatial database capable of executing location-based queries and spatial functions. This combination is ideally suited for projects where data integrity and detailed transaction management (ACID transactions) are crucial.

Key Features:

  • Data Integrity: Ensures high levels of data accuracy and consistency with ACID transaction compliance.
  • Complex Queries: Supports a wide array of SQL and spatial queries for detailed data management.
  • Integration: Seamlessly integrates with existing IT infrastructures and data systems.

Architecture:

PostgreSQL is a traditional RDBMS that uses a process-based database server architecture. It is designed for detailed, transactional data management, where data integrity and complex query capabilities are paramount.

Characteristics:

  • Transactional Integrity: PostgreSQL excels in maintaining data consistency and integrity through ACID (Atomicity, Consistency, Isolation, Durability) compliance, making it ideal for applications requiring reliable data storage.
  • Structured Data Management: Data must be written into PostgreSQL tables, allowing for sophisticated querying and indexing capabilities provided by PostGIS. This ensures precise spatial data management.
  • Integration with Legacy Systems: PostgreSQL + PostGIS can easily integrate with existing IT infrastructure, providing a stable environment for long-term data storage and complex spatial queries.

Understanding Wherobots & Apache Sedona

Wherobots, founded by the original creators of Apache Sedona, represents a significant advancement in modern cloud native spatial data processing. Apache Sedona is an open-source cluster computing system designed for the processing of large-scale spatial data, and Wherobots (continuously optimizing Sedona for performance) employs this technology with serverless deployment principles to analyze data distributed across clusters. This approach allows Wherobots to perform complex spatial queries and analytics efficiently, handling massive datasets that are beyond the scope of traditional geospatial data processing tools.

Key Features:

  • Scalability: Easily scales horizontally to handle petabytes of data through a cluster computing architecture.
  • Speed: Utilizes memory-centric architecture to speed up data processing tasks.
  • Complex Algorithms: Supports numerous spatial operations and algorithms essential for advanced geographic data analysis.
  • Spatial SQL API: Over 300 (and growing) spatial type SQL functions covering both raster and vector analysis.
  • Pythonic Dataframe API: Extends the Spark Dataframe API for spatial data processing.
  • Python DB API: Python module conforming to the Python DB API 2.0 specification with familiar access patterns.

Architecture:

The backbone of Wherobots, Apache Sedona, is designed with a modern cloud native and distributed computing framework in mind. Wherobots orchestrates and manages the cluster’s distributed compute nodes to enable large-scale spatial data processing by bringing compute power directly to the data. This architecture is purpose built for handling big data analytics and planetary-scale spatial data processing.

Characteristics:

  • Distributed Processing: Wherobots and Apache Sedona utilize a cluster of nodes, enabling parallel processing of massive datasets. This makes it ideal for tasks requiring high scalability and speed.
  • In-Memory Computation: By leveraging in-memory processing, Sedona significantly reduces the time needed for complex spatial operations, allowing for rapid data exploration and transformation.
  • Cloud Integration: Wherobots architecture is naturally aligned with cloud-based solutions, enabling seamless integration with cloud services like Amazon Web Services (AWS). This facilitates scalable and flexible spatial data analytics.

Complementarity in Action

Sedona in Wherobots and PostgreSQL + PostGIS can both perform spatial queries, but their real power lies in how they complement each other. For example:

  • Data Handling and Performance: Apache Sedona in Wherobots Cloud is second to none at handling and analyzing large volumes of data, making it ideal for initial data processing, exploration, transformation, and analysis. In contrast, PostgreSQL + PostGIS excels in managing more detailed transactions and operations requiring high precision and integrity.
    • Use Wherobots for: scalable data operations. Wherobots efficiently scales processing across multiple nodes, ideal for rapid analysis of spatial data at local to global levels.
    • Use PostgreSQL & PostGIS for: persistence. PostgreSQL + PostGIS provides the tools necessary for precise data management, crucial for applications like urban planning and asset management.
  • Use Case Flexibility: Wherobots can quickly process vast quantities of spatial data, making it suitable for environments and applications that require immediate insights on large volumes of data. PostgreSQL + PostGIS, being more transaction oriented, is perfect for applications where long-term data storage and management are needed.
    • Use Wherobots for: processing speed, when your scenario requires immediate response.
    • Use PostgreSQL & PostGIS for: data management, in scenarios requiring meticulous data integrity, such as environmental monitoring over years or detailed urban planning projects.
  • Analytical Depth: When initial processing (extractions, transformations, enrichment, etc.) on large scale data is done with Wherobots, ETL run times can be greatly reduced. Once processed, the data can be permanently stored and hosted in PostgreSQL + PostGIS. This allows users to perform deep, granular analyses at scale and then persist and serve those insights.
    • Use Wherobots for: transforming and enriching data. Extract, transform, enrich, refine, and analyze data in Wherobots Cloud at scale and with unparalleled speed.
    • Use PostgreSQL & PostGIS for: a robust data ecosystem. Combining these tools ensures a robust data, analysis, and management ecosystem capable of supporting complex, multi-dimensional spatial queries.

Integration Illustration

Let’s illustrate how these two systems can be integrated in a complementary manner.

We’ll assume the persona of a fictitious company named “Synergistic GeoFusion Technologies (SGT) holdings”.

SGT holdings handles a vast array of spatial data from diverse sources such as sensors, satellites, and user inputs. The volume and velocity of the data collection require a sophisticated approach to maximize efficiency. Wherobots steps in as the initial processing powerhouse, applying its Apache Sedona-based cluster computing to perform heavy-duty ETL (Extract, Transform, Load) tasks. This process involves cleansing, integrating, and transforming the raw data into a more structured format. Wherobots’ capability to handle massive datasets efficiently complements PostGIS’s robust data storage and querying capabilities by preparing the data for detailed spatial analysis and storage.

Once Wherobots processes the data, it can be seamlessly transferred to PostGIS, which can serve as the system of record. This PostgreSQL extension is well-suited for ongoing data management, thanks to its sophisticated spatial functions and ability to handle complex queries. PostGIS’s strong querying capabilities complement Wherobots’ data processing strength, providing a stable platform for further data manipulation and refinement.
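
One common way to hand results off is a JDBC write from the Sedona DataFrame into PostGIS. The sketch below is illustrative only: the processed_df DataFrame, connection details, table name, and WKT-based geometry handoff are assumptions, not the only (or necessarily the best) integration pattern.

from pyspark.sql import functions as f

(processed_df                                                     # hypothetical Wherobots output
    .withColumn("geometry_wkt", f.expr("ST_AsText(geometry)"))    # serialize geometry as WKT
    .drop("geometry")
    .write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/spatial")
    .option("dbtable", "analytics.processed_features")
    .option("user", "analyst")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())

On the PostGIS side, the WKT column can then be converted back into a native geometry with ST_GeomFromText.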

PostGIS is not only a storage layer but also a platform for ongoing data edits and updates, ensuring data integrity and relevance. When complex, resource-intensive spatial analyses are required, Wherobots is re-engaged. This dual engagement allows SGT holdings to handle routine data management in PostGIS while delegating computationally demanding tasks back to Wherobots, thus utilizing each system’s strengths to full effect.

For visualization, SGT holdings again leverages Wherobots to generate vector tiles from the analyzed data. These tiles are crucial for dynamic, scalable visual representations in the company’s internal dashboards and tools. Wherobots’ ability to generate these tiles efficiently complements PostGIS’s role by providing a means to visualize the data stored and managed within PostGIS. This not only enhances user interactions with the data but also provides a seamless bridge from data analysis to actionable insights through visual exploration.

Using Wherobots and PostGIS in this complementary manner, SGT has established a highly efficient workflow that leverages the distinct capabilities of each system. They now have the capability to ingest all their data, manage it effectively, and run their post hoc analysis tasks to serve internal and external clients efficiently and cost effectively.

Performance Benchmark: Wherobots vs. PostGIS

When comparing WherobotsDB to PostgreSQL with PostGIS for spatial queries, WherobotsDB’s performance benefits become evident as data size increases. Initially, PostGIS’ precomputed GIST indexes give it an edge on smaller datasets due to faster query execution times. However, as datasets grow larger, the dynamic, on-the-fly indexing of WherobotsDB surpasses PostGIS. The overhead associated with WherobotsDB’s distributed system is outweighed by its ability to efficiently handle complex, large-scale queries, ultimately making it significantly faster and more scalable in high-demand scenarios.

In essence, WherobotsDB may start off slower with smaller datasets, but its performance dramatically improves with larger datasets, far exceeding PostGIS’ capabilities. This makes WherobotsDB the preferred choice when working with extensive spatial data that demands efficient, scalable processing.

See the tables below for a detailed comparison of performance metrics.

Dataset             Size     Number of Rows   Types of Geometries
OSM Nodes           256GB    7,456,990,919    Point
Overture Buildings  103GB    706,967,095      Polygon/Multipolygon
OSM Postal Codes    0.86GB   154,452          Polygon

The benchmark ran the following spatial join query, which joins buildings to nodes with ST_Intersects and counts the matching pairs:
WITH t AS 
( 
    SELECT 
        * 
    FROM 
        overture_buildings_test AS buildings 
    JOIN 
        osm_nodes_test AS nodes 
    ON ST_Intersects(
        nodes.geometry, 
        buildings.geometry) 
) 
SELECT 
    COUNT(*) 
FROM 
    t;
Dataset sizes               PostGIS          WherobotsDB (Cairo)   Relative Speed
1M Buildings x 1M Nodes     0.70s            2.24s                 0.31x
1M Buildings x 10M Nodes    32.30s           33.44s                0.97x
10M Buildings x 1M Nodes    28.30s           21.72s                1.30x
10M Buildings x 10M Nodes   49.30s           3m 36.29s             0.23x
All Buildings x 1M Nodes    8m 44s           36.59s                14.32x
All Buildings x 10M Nodes   38m 30s          49.77s                46.42x
10K Buildings x All Nodes   2h 29m 19s       46.42s                193.00x
100K Buildings x All Nodes  4h 22m 36s       51.75s                304.57x
1M Buildings x All Nodes    3h 29m 56s       1m 19.63s             158.17x
10M Buildings x All Nodes   8h 48m 13s       2m 13.65s             237.17x
All Buildings x All Nodes   +24h (Aborted)   4m 31.84s             +317.83x

Conclusion: Better Together for Spatial Brilliance

Individually, Wherobots and PostgreSQL + PostGIS are powerful tools. But when combined, they unlock new possibilities for spatial analysis, offering a balanced approach to handling both large-scale data processing and detailed, precise database management. By understanding and implementing their complementary capabilities, organizations can achieve more refined insights and greater operational efficiency in their spatial data projects.

By utilizing both tools strategically, companies can not only enhance their data analysis capabilities but also ensure that they are prepared for a variety of spatial data challenges, now and in the future.

To learn more about Wherobots, reach out or give it a try with a trial account

Easily create trip insights at scale by snapping millions of GPS points to road segments using WherobotsAI Map Matching

What is Map Matching?

GPS data is inherently noisy and often lacks precision, which can make it challenging to extract accurate insights. This imprecision means that the GPS points logged may not accurately represent the actual locations where a device was. For example, GPS data from a drive around a lake may incorrectly include points that are over the water!

To address these inaccuracies, teams commonly use two approaches:

  1. Identifying and Dropping Erroneous Points: This method involves manually or algorithmically filtering out GPS points that are clearly incorrect. However, this approach can reduce analytical accuracy, be costly, and is time-intensive.
  2. Map Matching Techniques: A smarter and more effective approach involves using map matching techniques. These techniques take the noisy GPS data points and compute the most likely path taken based on known transportation segments such as roadways or trails.

WherobotsAI Map Matching offers an advanced solution to this problem. It performs map matching at scale, on millions or even billions of trips, with ease and performance, ensuring that the GPS data aligns accurately with the actual paths most likely taken.

map matching telematics

An illustration of map matching. Blue dots: GPS samples, Green line: matched trajectory.

Map matching is a common solution for preparing GPS data for use in a wide range of applications including:

  • Satellite & GPS-based navigation
  • GPS tracking of freight
  • Assessing risk of driving behavior for improved insurance pricing
  • Post hoc analysis of self driving car trips for telematics teams
  • Transportation engineering and urban planning

The objective of map matching is to accurately determine which road or path in the digital map corresponds to the observed geographic coordinates, considering factors such as the accuracy of the location data, the density and layout of the road network, and the speed and direction of travel.

Existing Solutions for Map Matching

Most map matching implementations are variants of the Hidden Markov Model (HMM)-based algorithm described by Newson and Krumm in their seminal paper, "Hidden Markov Map Matching through Noise and Sparseness." This foundational research has influenced a variety of map matching solutions available today.

However, traditional HMM-based approaches have notable downsides when working with large-scale GPS datasets:

  1. Significant Costs: Many commercially available map matching APIs charge substantial fees for large-scale usage.
  2. Performance Issues: Traditional map matching algorithms, while accurate, are often not optimized for large-scale computation. They can be prohibitively slow, especially when dealing with extensive GPS data, as the underlying computation struggles to handle the data scale efficiently.

These challenges highlight the need for more efficient and cost-effective solutions capable of handling large-scale GPS datasets without compromising on performance.

RESTful API Map Matching Options

The Mapbox Map Matching API, HERE Maps Route Matching API, and Google Roads API are powerful RESTful APIs for performing map matching. These solutions are particularly effective for small-scale applications. However, for large-scale applications, such as population-level analysis involving millions of trajectories, the costs can become prohibitively high.

For example, as of July 2024, the approximate costs for matching 1 million trips are:

  • Mapbox: $1,600
  • HERE Maps: $4,400
  • Google Maps Platform: $8,000

These prices are based on public pricing pages and do not consider any potential volume-based discounts that may be available.

While these APIs provide robust and accurate map matching capabilities, organizations seeking to perform extensive analyses often must explore more cost-effective alternatives.

Open-Source Map Matching Solutions

Open-source software such as Valhalla or GraphHopper can also be used for map matching. However, these solutions are designed to run on a single machine. If your map matching workload exceeds the capacity of that machine, it will suffer from extended processing times, and you will eventually run out of headroom as you scale up the ladder of VM sizes.

Meet WherobotsAI Map Matching

WherobotsAI Map Matching is a high performance, low cost, and planetary scale map matching capability for your telematics pipelines.

WherobotsAI provides a scalable map matching feature designed for small to very large scale trajectory datasets. It works seamlessly with other Wherobots capabilities, which means you can implement data cleaning, data transformations, and map matching in one single (serverless) data processing pipeline. We’ll see how it works in the following sections.

How it works

WherobotsAI Map Matching takes a DataFrame containing trajectories and another DataFrame containing road segments, and produces a DataFrame containing map matched results. Here is a walk-through of using WherobotsAI Map Matching to match trajectories in the VED dataset to the OpenStreetMap (OSM) road network.

1. Preparing the Trajectory Data

First, we load the trajectory data. We’ll use the preprocessed VED dataset stored as GeoParquet files for demonstration.

dfPath = sedona.read.format("geoparquet").load("s3://wherobots-benchmark-prod/data/mm/ved/VED_traj/")

The trajectory dataset should contain the following attributes:

  • A unique ID for trips. In this example the ids attribute is the unique ID of each trip.
  • A geometry attribute containing LineStrings; in this case, the geometry attribute holds the trip geometry.

The rows in the trajectory DataFrame look like this:

+---+-----+----+--------------------+--------------------+
|ids|VehId|Trip|              coords|            geometry|
+---+-----+----+--------------------+--------------------+
|  0|    8| 706|[{0, 42.277558333...|LINESTRING (-83.6...|
|  1|    8| 707|[{0, 42.277681388...|LINESTRING (-83.6...|
|  2|    8| 708|[{0, 42.261997222...|LINESTRING (-83.7...|
|  3|   10|1558|[{0, 42.277065833...|LINESTRING (-83.7...|
|  4|   10|1561|[{0, 42.286599722...|LINESTRING (-83.7...|
+---+-----+----+--------------------+--------------------+
only showing top 5 rows
2. Preparing the Road Network Data

We’ll use the OpenStreetMap (OSM) data specific to the Ann Arbor, Michigan region to map match our trip data with. Wherobots provides an API for loading road network data from OSM XML files.

from wherobots import matcher
dfEdge = matcher.load_osm("s3://wherobots-examples/data/osm_AnnArbor_large.xml", "[car]")
dfEdge.show(5)

The loaded road network DataFrame looks like this:

+--------------------+----------+--------+----------+-----------+----------+-----------+
|            geometry|       src|     dst|   src_lat|    src_lon|   dst_lat|    dst_lon|
+--------------------+----------+--------+----------+-----------+----------+-----------+
|LINESTRING (-83.7...|  68133325|27254523| 42.238819|-83.7390142|42.2386159|-83.7390153|
|LINESTRING (-83.7...|9405840276|27254523|42.2386058|-83.7388915|42.2386159|-83.7390153|
|LINESTRING (-83.7...|  68133353|27254523|42.2385675|-83.7390856|42.2386159|-83.7390153|
|LINESTRING (-83.7...|2262917109|27254523|42.2384552|-83.7390313|42.2386159|-83.7390153|
|LINESTRING (-83.7...|9979197063|27489080|42.3200426|-83.7272283|42.3200887|-83.7273003|
+--------------------+----------+--------+----------+-----------+----------+-----------+
only showing top 5 rows

Users can also prepare the road network data from any data source using any data processing procedure, as long as the schema of the road network DataFrame conforms to the requirements of the Map Matching API, as shown in the sketch below.
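
As a hedged sketch of that idea, the snippet below builds a road network DataFrame from a hypothetical CSV of edges so that it matches the schema shown above (geometry, src, dst, src_lat, src_lon, dst_lat, dst_lon); the bucket path and CSV column names are assumptions for illustration.

from pyspark.sql import functions as f

dfEdgeCustom = (
    sedona.read.option("header", "true").csv("s3://my-bucket/road_edges.csv")
        .withColumn("geometry", f.expr("ST_GeomFromWKT(wkt)"))   # edge geometry from a WKT column
        .select(
            "geometry",
            f.col("from_node").cast("long").alias("src"),
            f.col("to_node").cast("long").alias("dst"),
            f.col("from_lat").cast("double").alias("src_lat"),
            f.col("from_lon").cast("double").alias("src_lon"),
            f.col("to_lat").cast("double").alias("dst_lat"),
            f.col("to_lon").cast("double").alias("dst_lon"),
        )
)
dfEdgeCustom.show(5)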

3. Run Map Matching

Once the trajectories and road network data are ready, we can run matcher.match to match trajectories to the road network.

dfMmResult = matcher.match(dfEdge, dfPath, "geometry", "geometry")

The dfMmResult DataFrame contains the trajectories snapped to the roads in the matched_points attribute:

+---+--------------------+--------------------+--------------------+
|ids|     observed_points|      matched_points|       matched_nodes|
+---+--------------------+--------------------+--------------------+
|275|LINESTRING (-83.6...|LINESTRING (-83.6...|[62574078, 773611...|
|253|LINESTRING (-83.6...|LINESTRING (-83.6...|[5930199197, 6252...|
| 88|LINESTRING (-83.7...|LINESTRING (-83.7...|[4931645364, 6249...|
|561|LINESTRING (-83.6...|LINESTRING (-83.6...|[29314519, 773612...|
|154|LINESTRING (-83.7...|LINESTRING (-83.7...|[5284529433, 6252...|
+---+--------------------+--------------------+--------------------+
only showing top 5 rows

We can visualize the map matching result using SedonaKepler to see what the matched trajectories look like:

mapAll = SedonaKepler.create_map()
SedonaKepler.add_df(mapAll, dfEdge, name="Road Network")
SedonaKepler.add_df(mapAll, dfMmResult.selectExpr("observed_points AS geometry"), name="Observed Points")
SedonaKepler.add_df(mapAll, dfMmResult.selectExpr("matched_points AS geometry"), name="Matched Points")
mapAll

The following figure shows the map matching results. The red lines are original trajectories, and the green lines are matched trajectories. We can see that the noisy original trajectories are all snapped to the road network.

map matching results example 2

Performance

We used WherobotsAI Map Matching to match 90 million trips across the entire US in just 1.5 hours on the Wherobots Tokyo runtime, which equates to approximately 1 million trips per minute. The average cost of matching 1 million trips is an order of magnitude less costly and more efficient than the options outlined above.

The “optimization magic” behind WherobotsAI Map Matching lies in how Wherobots intelligently and automatically co-partitions trajectory and road network datasets based on the spatial proximity of their elements, ensuring a balanced distribution of work. This partitioning strategy spreads the computational load evenly, making map matching with Wherobots highly efficient, scalable, and affordable compared to alternatives.
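
To make that concrete, here is a toy illustration of the general idea behind spatial co-partitioning (not Wherobots’ actual implementation): both trajectories and road edges are keyed by a coarse grid cell, so records that are near each other in space land in the same partition and can be matched locally. The coordinates and cell size are assumed values.

import math

def grid_key(lon, lat, cell_size_deg=0.5):
    # Bucket a coordinate into a coarse grid cell.
    return (math.floor(lon / cell_size_deg), math.floor(lat / cell_size_deg))

trajectory_point = (-83.74, 42.28)      # assumed sample GPS coordinate
road_edge_midpoint = (-83.73, 42.27)    # assumed nearby road edge midpoint

print(grid_key(*trajectory_point))      # (-168, 84)
print(grid_key(*road_edge_midpoint))    # (-168, 84) -- same cell, same partition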

Try It Out!

You can try out WherobotsAI Map Matching by starting a notebook environment in Wherobots Cloud and running our example notebook:

notebook_example/python/wherobots-ai/mapmatching_example.ipynb

You can also check out the WherobotsAI Map Matching tutorial and reference documentation for more information!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

Unlock Satellite Imagery Insights with WherobotsAI Raster Inference

Recently we introduced WherobotsAI Raster Inference to unlock analytics on satellite and aerial imagery using SQL or Python. Raster Inference simplifies extracting insights from satellite and aerial imagery and is powered by open-source machine learning models. This feature is currently in preview, and we are expanding its capabilities to support more models. Below we’ll dig into the popular computer vision tasks that Raster Inference supports, describe how it works, and show how you can use it to run batch inference to find and map electricity infrastructure.

Watch the live demo of these capabilities here.

The Power of Machine Learning with Satellite Imagery

Petabytes of satellite imagery are generated each day all over the world in a dizzying number of sensor types and image resolutions. The applications for satellite imagery and other remote sensing data sources are broad and diverse. For example, satellites with consistent, continuous orbits are ideal for monitoring forest carbon stocks to validate carbon credits or estimating agricultural yields.

However, this data has been inaccessible for most analysts and even seasoned ML practitioners because insight extraction required specialized skills. We’ve done the work to make insight extraction simple and accessible to more people. Raster Inference abstracts the complexity and scales to support planetary-scale imagery datasets, so you don’t need ML expertise to derive insights. In this blog, we explore the key features that make Raster Inference effective for land cover classification, solar farm mapping, and marine infrastructure detection. And, in the near future, you will be able to use Raster Inference with your own models!

Introduction to Popular and Supported Machine Learning Tasks

Raster Inference supports the three most common kinds of computer vision models that are applied to imagery: classification, object detection, and semantic segmentation. Instance segmentation (which combines object localization and semantic segmentation) is another common type of model that is not currently supported; let us know if you need it by contacting us and we can add it to the roadmap.

Computer Vision Detection Types
Computer Vision Detection Categories from Lin et al. Microsoft COCO: Common Objects in Context

The figure above illustrates these tasks. Image classification is when an image is assigned one or more text labels. In image (a), the scene is assigned the labels “person”, “sheep”, and “dog”. Image (b) is an example of object localization (or object detection). Object localization creates bounding boxes around objects of interest and assigns labels. In this image, five sheep are localized separately along with one human and one dog. Finally, semantic segmentation is when each pixel is given a category label, as shown in image (c). Here we can see all the pixels belonging to sheep are labeled blue, the dog is labeled red, and the person is labeled teal.

While these examples highlight detection tasks on regular imagery, these computer vision models can also be applied to raster formatted imagery. Raster data formats are the most common formats for satellite and aerial imagery. When objects of interest in raster imagery are localized, their bounding boxes can be georeferenced, which means that each pixel is mapped to spatial coordinates, such as latitude and longitude. This georeferencing is what makes object localization useful for spatial analytics.
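
As a minimal sketch of what georeferencing a detection involves, the function below maps a pixel index to coordinates using a simple affine transform; the origin and pixel size values are assumed for illustration, and real transforms come from the raster’s metadata.

def pixel_to_coords(row, col, origin_x, origin_y, pixel_width, pixel_height):
    # Map a (row, col) pixel index to (longitude, latitude) with an affine transform.
    lon = origin_x + col * pixel_width
    lat = origin_y + row * pixel_height   # pixel_height is typically negative for north-up rasters
    return lon, lat

# Assumed example: top-left corner at (-112.5, 34.0), pixels ~0.0001 degrees on a side.
print(pixel_to_coords(100, 250, -112.5, 34.0, 0.0001, -0.0001))   # (-112.475, 33.99)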

https://wherobots.com/wp-content/uploads/2024/06/remotesensing-11-00339-g005.png

The example above shows various applications of object detection for localizing and classifying features in high resolution satellite and aerial imagery. This example comes from DOTA, a 15-class dataset of different objects in RGB and grayscale satellite imagery. Public datasets like DOTA are used to develop and benchmark machine learning models.

Not only are there many publicly available object detection models, but also there are many semantic segmentation models.

Semantic Segmentation
Sourced from “A Scale-Aware Masked Autoencoder for Multi-scale Geospatial Representation Learning”.

Not every machine learning model should be treated equally, and each will have their own tradeoffs. You can see the difference between the ground truth image (human annotated buildings representing the real world) and segmentation results across two models (Scale-MAE and Vanilla MAE). These results are derived from the same image at two different resolutions (referred to as GSD, or Ground Sampling Distance).

  • Scale-MAE is a model developed to handle detection tasks at various resolutions with different sensor inputs. It uses a similar MAE model architecture as the Vanilla MAE, but is trained specifically for detection tasks on overhead imagery that span different resolutions.
  • The Vanilla MAE is not trained to handle varying resolutions in overhead imagery. Its performance suffers in the top row and especially the bottom row, where resolution is coarser, as seen by the mismatch between Vanilla MAE and the ground truth image where many pixels are incorrectly classified.

Satellite Analytics Before Raster Inference

Without Raster Inference, typically a team who is looking to extract insights from overhead imagery using ML would need to:

  1. Deploy a distributed runtime to scale out workloads such as data loading, preprocessing, and inference.
  2. Develop functionality to operate on raster metadata to easily filter it by location to run inference workloads on specific areas of interest.
  3. Optimize models to run performantly on GPUs, which can involve complex rewrites of the underlying model prediction logic.
  4. Create and manage data preprocessing pipelines to normalize, resize, and collate raster imagery into the correct data type and size required by the model.
  5. Develop the logic to run data loading, preprocessing, and model inference efficiently at scale.

Raster Inference and its SQL and Python APIs abstract this complexity so you and your team can easily perform inference on massive raster datasets.

Raster Inference APIs for SQL and Python

Raster Inference offers APIs in both SQL and Python to run inference tasks. These APIs are designed to be easy to use, even if you’re not a machine learning expert. RS_CLASSIFY can be used for scene classification, RS_BBOXES_DETECT for object detection, and RS_SEGMENT for semantic segmentation. Each function produces tabular results which can be georeferenced either for the scene, object, or segmentation depending on the function. The records can be joined or visualized with other data (geospatial or traditional) to curate enriched datasets and insights. Here are SQL and Python examples for RS_Segment.

RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
df = df_raster_input.withColumn("segment_result", rs_segment(model_id, col("outdb_raster")))

Example: Mapping Electricity Infrastructure

Imagine you want to optimize the location of new EV charging stations, but you want to target locations based on the availability of green energy sources, such as local solar farms. You can use Raster Inference to detect and locate solar farms and cross-reference these locations with internal data or other vector geometries that captures demand for EV charging. This use case will be demonstrated in our upcoming release webinar on July 10th.

Let’s walk through how to use Raster Inference for this use case.

First, we run predictions on rasters to find solar farms. The following code block that calls RS_SEGMENT shows how easy this is.

CREATE OR REPLACE TEMP VIEW segment_fields AS (
    SELECT
        outdb_raster,
        RS_SEGMENT('{model_id}', outdb_raster) AS segment_result
    FROM
    az_high_demand_with_scene
)

The confidence_array column produced from RS_SEGMENT can be assigned the same geospatial coordinates as the raster input and converted to a vector that can be spatially joined and processed with WherobotsDB using RS_SEGMENT_TO_GEOMS. We select a confidence threshold of .65 so that we only georeference high confidence detections.

WITH t AS (
        SELECT RS_SEGMENT_TO_GEOMS(outdb_raster, confidence_array, array(1), class_map, 0.65) result
        FROM predictions_df
    )
    SELECT result.* FROM t
+----------+--------------------+--------------------+
|     class|avg_confidence_score|            geometry|
+----------+--------------------+--------------------+
|Solar Farm|  0.7205783606825462|MULTIPOLYGON (((-...|
|Solar Farm|  0.7273308333550763|MULTIPOLYGON (((-...|
|Solar Farm|  0.7301468510823231|MULTIPOLYGON (((-...|
|Solar Farm|  0.7180177244988899|MULTIPOLYGON (((-...|
|Solar Farm|   0.728077805771141|MULTIPOLYGON (((-...|
|Solar Farm|     0.7264981572898|MULTIPOLYGON (((-...|
|Solar Farm|  0.7044100126912517|MULTIPOLYGON (((-...|
|Solar Farm|  0.7137283466756343|MULTIPOLYGON (((-...|
+----------+--------------------+--------------------+

This allows us to integrate the vectorized model predictions with other spatial datasets and easily visualize the results with SedonaKepler.
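
A minimal sketch of that visualization step (assuming a predictions_polys DataFrame of vectorized detections like the one shown above; the import path shown is the one from Apache Sedona’s Python package):

from sedona.maps.SedonaKepler import SedonaKepler

map_detections = SedonaKepler.create_map()
SedonaKepler.add_df(map_detections, predictions_polys, name="Solar Farm Detections")
map_detections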

https://wherobots.com/wp-content/uploads/2024/06/solar_farm_detection-1-1024x398.png

Here Raster Inference runs on an 85 GiB dataset with 2,200 raster scenes for Arizona. Using a Sedona (tiny) runtime, Raster Inference completed in 430 seconds, predicting solar farms for all low cloud cover satellite images for the state of Arizona for the month of October. If we scale up to a San Francisco (small) runtime, the inference speed nearly doubles. In general, average bytes processed per second by Wherobots increases as datasets scale in size because startup costs are amortized over time. Processing speed also increases as runtimes scale in size.

Inference time (seconds)   Runtime Size
430                        Sedona
246                        San Francisco

We use predictions from the output of Raster Inference to derive insights about which zip codes have the most solar farms, as shown below. This statement joins predicted solar farms with zip codes by location, then ranks zip codes by the pre-computed solar farm area within each zip code. We skipped this step for brevity but you can see it and others in the notebook example.

az_solar_zip_codes = sedona.sql("""
SELECT solar_area, any_value(az_zta5.geometry) AS geometry, ZCTA5CE10
FROM predictions_polys JOIN az_zta5
WHERE ST_Intersects(az_zta5.geometry, predictions_polys.geometry)
GROUP BY ZCTA5CE10
ORDER BY solar_area DESC
""")

https://wherobots.com/wp-content/uploads/2024/06/final_analysis.png

These predictions are made possible by SATLAS, a family of machine learning models released with Apache 2.0 licensing from Allen AI. The solar model demonstrated above was derived from the SATLAS foundational model. This foundational model can be used as a building block to create models to address specific detection challenges like solar farm detection. Additionally, there are many other open source machine learning models available for deriving insights from satellite imagery, many of which are provided by the TorchGeo project. We are just beginning to explore what these models can achieve for planetary-scale monitoring.

If you have a specific model you would like to see made available, please contact us to let us know.

For detailed instructions on using Raster Inference, please refer to our example Jupyter notebooks in the documentation.

https://wherobots.com/wp-content/uploads/2024/06/Screenshot_2024-06-08_at_2.11.07_PM-1024x683.png

Here are some links to get you started:
https://docs.wherobots.com/latest/tutorials/wherobotsai/wherobots-inference/segmentation/

https://docs.wherobots.com/latest/api/wherobots-inference/pythondoc/inference/sql_functions/

Getting Started

Getting started with WherobotsAI Raster Inference is easy. We’ve provided three models in Wherobots Cloud that can be used with our GPU optimized runtimes. Sign up for an account on Wherobots Cloud, send us a note to access the professional tier, start a GPU runtime, and you can run our example Jupyter notebooks to analyze satellite imagery in SQL or Python.

Stay tuned for updates on improvements to Raster Inference that will make it possible to run more models, including your own custom models. We’re excited to hear what models you’d like us to support, or the integrations you need to make running your own models even easier with Raster Inference. We can’t wait for your feedback and to see what you’ll create!

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:

Introducing WherobotsAI for planetary inference, and capabilities that modernize spatial intelligence at scale

We are excited to announce a preview of WherobotsAI, our new suite of AI and ML powered capabilities that unlock spatial intelligence in satellite imagery and GPS location data. Additionally, we are bringing the high-performance of WherobotsDB to your favorite data applications with a Spatial SQL API that integrates WherobotsDB with more interfaces including Apache Airflow for Spatial ETL. Finally, we’re introducing the most scalable vector tile generator on earth to make it easier for teams to produce engaging and interactive map applications. All of these new features are capable of operating on planetary-scale data.

Watch the walkthrough of this release here.

Wherobots Mission and Vision

Before we dive into this release, we think it’s important to understand how these capabilities fit into our mission, our product principles, and vision for the Spatial Intelligence Cloud so you can see where we are headed.

Our Mission
These new capabilities are core to Wherobots’ mission, which is to unlock spatial intelligence of earth, society, and business, at a planetary scale. We will do this by making it extremely easy to utilize data and AI technology purpose-built for creating spatial intelligence that’s cloud-native and compatible with modern open data architectures.

Our Product Principles

  • We’re building the spatial intelligence platform for modern organizations. Every organization with a mission directly linked to the performance of tangible assets, goods and services, or data products about what’s happening in the physical world, will need a spatial intelligence platform to be competitive, sustainable, and climate adaptive.
  • It delivers intelligence for the greater good. Teams and their organizations want to analyze their worlds to create a net positive impact for business, society, and the earth.
  • It’s purpose-built yet simple. Spatial intelligence won’t scale through in-house ‘spatial experts’, or through general purpose architectures that are not optimized for spatial workloads or development experiences.
  • It’s efficient at any scale. Maximal performance, scale, and cost efficiency can only be achieved through a cloud-native, serverless solution.
  • It creates intelligence with AI. Every organization will need AI alongside modern analytics to create spatial intelligence.
  • It’s open by default. Pace of innovation depends on choice. Organizations that adopt cloud-native, open source compatible, and modern open data architectures will innovate faster because they have more choices in the solutions they can use.

Our Vision
We exist because creating spatial intelligence at-scale is hard. Our contributions to Apache Sedona, leadership in the open geospatial domain, and investments in Wherobots Cloud have, and will make it easier. Users of Apache Sedona, Wherobots customers, and ultimately any AI application will be enabled to support better decisions about our physical and virtual worlds. They will be able to create solutions to improve these worlds that were otherwise infeasible or too costly to build. And the solutions developed will have a positive impact on society, business, and earth — at a planetary scale.

Introducing WherobotsAI

There are petabytes of satellite or aerial imagery produced every day. Yet for most analysts, scientists, and developers, these datasets are analytically inaccessible outside of the naked eye. As a result most organizations still rely on humans and their eyes, to analyze satellite or other forms of aerial imagery. Wherobots can already perform analytics of overhead imagery (also known as raster data) and geospatial objects (known as vector data) simultaneously at scale. But organizations also want to use modern AI and ML technologies to streamline and scale otherwise visual, single threaded tasks like object detection, classification, and segmentation from overhead imagery.

Like satellite imagery that is generally hard to analyze, businesses also find it hard to analyze GPS data in their applications because it’s too noisy; points don’t always correspond to the actual path taken. Teams need an easy solution for snapping noisy GPS data to road or other segment types, at any scale.

Today we are announcing WherobotsAI which offers fully managed AI and machine learning capabilities that accelerate the development of spatial insights, for anyone familiar with SQL or Python. WherobotsAI capabilities include:

[new] Raster Inference (preview): A first of its kind, Raster Inference unlocks the analytical potential of satellite or aerial imagery at a planetary scale, by integrating AI models with WherobotsDB to make it extremely easy to detect, classify, and segment features of interest in satellite and aerial images. You can see how easy it is to detect and georeference solar farms here, with just a few lines of SQL:

SELECT
  outdb_raster,
  RS_SEGMENT('solar-satlas-sentinel2', outdb_raster) AS solar_farm_result
FROM df_raster_input

These georeferenced predictions can be queried with WherobotsDB and can be interactively explored in a Wherobots notebook. Below is an example of detection of solar panels in SedonaKepler.

AI Inference Solar Farm

The models and AI infrastructure powering Raster Inference are fully managed, which means there’s nothing to set up or configure. Today, you can use Raster Inference to detect, segment, and classify solar farms, land cover, and marine infrastructure from terabyte-scale Sentinel-2 true color and multispectral imagery datasets in under half an hour, on the GPU runtimes available in the Wherobots Professional Edition. Soon we will make the inference metadata for these models public, so that your own models, if they conform to the same metadata standard, can also be used with Raster Inference.

These models and datasets are just the starting point for WherobotsAI. We are looking forward to hearing from you to help us define the roadmap for what we should build support for next.

Map Matching: If you need to analyze trips at scale but struggle to wrangle noisy GPS data, Map Matching can turn billions of noisy GPS pings into signal by snapping shared points to road or other vector segments. Teams are using Map Matching to process hundreds of millions of vehicle trips per hour, a pace that surpasses current commercial solutions, at a cost of just a few hundred dollars.

Here’s an example of what WherobotsAI Map Matching does to improve the quality of your trip segments.

  • Red and yellow line segments were created from raw, noisy GPS data.
  • Green represents Map Matched segments.

map matching algorithm

Visit the user documentation to learn more and get started with WherobotsAI.

A Spatial SQL API for WherobotsDB

WherobotsDB, our serverless, highly efficient compute engine compatible with Apache Sedona, is up to 60x more performant for spatial joins than popular general purpose big data engines and warehouses, and up to 20x faster than Apache Sedona on its own. It will remain the most performant, earth-friendly solution for your spatial workloads at any scale.

Until today, teams had two options for harnessing WherobotsDB: they could write and run queries in Wherobots managed notebooks, or run spatial ETL pipelines using the Wherobots jobs interface.

Today, we’re enabling you to bring the utility of WherobotsDB to more interfaces with the new Spatial SQL API. Using this API, teams can remotely execute Spatial SQL queries using a remote SQL editor, build first-party applications using our client SDKs in Python (WherobotsDB API driver) and Java (Wherobots JDBC driver), or orchestrate spatial ETL pipelines using a Wherobots Apache Airflow provider.

Run spatial queries with popular SQL IDEs

The following is an example of how to integrate Harlequin, a popular SQL IDE, with WherobotsDB. You’ll need a Wherobots API key to get started with Harlequin (or any remote client). API keys allow you to authenticate with Wherobots Cloud for programmatic access to Wherobots APIs and services. API keys can be created following a few steps in our user documentation.

We will query WherobotsDB using Harlequin in the Airflow example later in this blog.

$ pip install harlequin-wherobots
$ harlequin -a wherobots --api-key $(< api.key)

harlequin api key connection

You can find more information on how to use Harlequin in its documentation, and on the WherobotsDB adapter on its GitHub repository.

The Wherobots Python driver enables integration with many other tools as well. Here’s an example of using the Wherobots Python driver in the QGIS Python console to fetch points of interest from the Overture Maps dataset using Spatial SQL API.

import os

from wherobots.db import connect
from wherobots.db.region import Region
from wherobots.db.runtime import Runtime
import geopandas
from shapely import wkt

with connect(
        token=os.environ.get("WBC_TOKEN"),
        runtime=Runtime.SEDONA,
        region=Region.AWS_US_WEST_2,
        host="api.cloud.wherobots.com"
) as conn:
    curr = conn.cursor()
    curr.execute("""
    SELECT names.common[0].value AS name, categories.main AS category, geometry 
    FROM wherobots_open_data.overture.places_place 
    WHERE ST_DistanceSphere(ST_GeomFromWKT("POINT (-122.46552 37.77196)"), geometry) < 10000
    AND categories.main = "hiking_trail"
    """)
    results = curr.fetchall()
    print(results)

results["geometry"] = results.geometry.apply(wkt.loads)
gdf = geopandas.GeoDataFrame(results, crs="EPSG:4326",geometry="geometry")

def add_geodataframe_to_layer(geodataframe, layer_name):
    # Create a new memory layer
    layer = QgsVectorLayer(geodataframe.to_json(), layer_name, "ogr")

    # Add the layer to the QGIS project
    QgsProject.instance().addMapLayer(layer)

add_geodataframe_to_layer(gdf, "POI Layer")

Using the Wherobots Python driver with QGIS

Visit the Wherobots user documentation to get started with the Spatial SQL API, or see our latest blog post that goes deeper into how to use our database drivers with the Spatial SQL API.

Automating Spatial ETL workflows with the Apache Airflow provider for Wherobots

ETL (extract, transform, load) workflows are oftentimes required to prepare spatial data for interactive analytics, or to refresh datasets automatically as new data arrives. Apache Airflow is a powerful and popular open source orchestrator of data workflows. With the Wherobots Apache Airflow provider, you can now use Apache Airflow to convert your spatial SQL queries into automated workflows running on Wherobots Cloud.

Here’s an example of the Wherobots Airflow provider in use. In this example we identify the top 100 buildings in the state of New York with the most places (facilities, services, businesses, etc.) registered within them using the Overture Maps dataset, and we’ll eventually auto-refresh the result daily. The initial view can be generated with the following SQL query:

CREATE TABLE wherobots.test_db.top_100_hot_buildings_daily AS
SELECT
  buildings.id AS building,
  first(buildings.names),
  count(places.geometry) AS places_count,
  '2023-07-24' AS ts
FROM wherobots_open_data.overture.places_place places
JOIN wherobots_open_data.overture.buildings_building buildings
  ON ST_CONTAINS(buildings.geometry, places.geometry)
WHERE places.updatetime >= '2023-07-24'
  AND places.updatetime < '2023-07-25'
  AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), places.geometry)
  AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), buildings.geometry)
GROUP BY building
ORDER BY places_count DESC
LIMIT 100
  • A place in Overture is defined as real-world facilities, services, businesses or amenities.
  • We used an arbitrary date of 2023-07-24.
  • New York is defined by a simple bounding box polygon (-79.762152, 40.496103, -71.856214, 45.01585) (we could alternatively join with its appropriate administrative boundary polygon)
  • We use two WHERE clauses on places.updatetime to filter one day’s worth of data.
  • The query creates a new table wherobots.test_db.top_100_hot_buildings_daily to store the query result. Note that it will not directly return any records because we are loading directly into a table.

Now, let’s use Harlequin as described earlier to inspect the outcome of creating this table with the above query:

SELECT * FROM wherobots.test_db.top_100_hot_buildings_daily

Harlequin query test 2

Apache Airflow and the Airflow Provider for Wherobots allow you to schedule and execute this query each day, injecting the appropriate date filters into your templatized query.

  • In your Apache Airflow instance, install the airflow-providers-wherobots library. You can either execute pip install airflow-providers-wherobots, or add the library to the dependency list of your Apache Airflow runtime.
  • Create a new “generic” connection for Wherobots called wherobots_default, using api.cloud.wherobots.com as the “Host” and your Wherobots API key as the “Password” (or create it from the command line, as sketched below).
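If you prefer the command line to the Airflow UI, the connection can also be created with the Airflow CLI. This is a sketch only; match the connection type to what your provider version expects:

$ airflow connections add "wherobots_default" \
    --conn-type "generic" \
    --conn-host "api.cloud.wherobots.com" \
    --conn-password "$(< api.key)"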

The next step is to create an Airflow DAG. The Wherobots Provider exposes the WherobotsSqlOperator for executing SQL queries. Replace the hardcoded “2023-07-24” in your query with the Airflow template macros {{ ds }} and {{ next_ds }}, which will be rendered as the DAG schedule date on the fly:

import datetime

from airflow import DAG
from airflow_providers_wherobots.operators.sql import WherobotsSqlOperator

with DAG(
    dag_id="example_wherobots_sql_dag",
    start_date=datetime.datetime.strptime("2023-07-24", "%Y-%m-%d"),
    schedule="@daily",
    catchup=True,
    max_active_runs=1,
):
    operator = WherobotsSqlOperator(
        task_id="execute_query",
        wait_for_downstream=True,
        sql="""
        INSERT INTO wherobots.test_db.top_100_hot_buildings_daily
        SELECT
          buildings.id AS building,
          first(buildings.names),
          count(places.geometry) AS places_count,
          '{{ ds }}' AS ts
        FROM wherobots_open_data.overture.places_place places
        JOIN wherobots_open_data.overture.buildings_building buildings
          ON ST_CONTAINS(buildings.geometry, places.geometry)
        WHERE places.updatetime >= '{{ ds }}'
          AND places.updatetime < '{{ next_ds }}'
          AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), places.geometry)
          AND ST_CONTAINS(ST_PolygonFromEnvelope(-79.762152, 40.496103, -71.856214, 45.01585), buildings.geometry)
        GROUP BY building
        ORDER BY places_count DESC
        LIMIT 100
        """,
        return_last=False,
    )

You can check the status and logs of the DAG’s execution in the Apache Airflow UI. As shown below, the operator prints out the exact query rendered and executed when you run your DAG.

apache airflow spatial sql api
Please visit the Wherobots user documentation for more details on how to set up your Apache Airflow instance with the Wherobots Provider.

Generate Vector Tiles — formatted as PMTiles — at Global Scale

Vector tiles are high-resolution representations of features optimized for visualization, computed offline and displayed in map applications. This decouples dataset preparation from the client-side rendering driven by zooming and panning, which lets map developers significantly improve the utility, clarity, and responsiveness of feature-rich interactive map applications.

Traditional vector tile generators like Tippecanoe are limited to the processing capability of a single VM and support only a limited set of input formats. These solutions are great for small-scale tile generation workloads when data is already in the right file format. But if you’re like the teams we’ve worked with, you may start small and need to scale past the limits of a single VM, or have data in a variety of file formats. You just want to generate vector tiles with the data you have, at any scale, without having to worry about format conversion steps, configuring infrastructure, partitioning your workload around the capability of a VM, or waiting for workloads to complete.

Vector Tile Generation, or VTiles, for WherobotsDB generates vector tiles in PMTiles format across common data lake formats, incredibly quickly and at a planetary scale, so you can start small and know you have the capability to scale without having to look for another solution. VTiles is incredibly fast because serverless computation is parallelized and the WherobotsDB engine is optimized for vector tile generation. This means your development teams can spend less time on tile generation and more time building the map applications that matter to your customers.

Using a Tokyo runtime, we generated vector tiles with VTiles for all buildings in the Overture dataset, from zoom levels 4-15 across the entire planet, in 23 minutes. That’s fast and efficient for a planetary scale operation. You can run the tile-generation-example notebook in the Wherobots Pro tier to experience the speed and simplicity of VTiles yourself. Here’s what this looks like:

Visit our user documentation to start generating vector tiles at-scale.

Try Wherobots now

We look forward to hearing how you put these new capabilities to work, along with your feedback to increase the usefulness of the Wherobots Cloud platform. You can try these new features today by creating a Wherobots Cloud account. WherobotsAI is a professional tier feature.

Please reach out on LinkedIn or email us at info@wherobots.com.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


The Spatial SQL API brings the performance of WherobotsDB to your favorite data applications

Since its launch last fall, Wherobots has raised the bar for cloud-native geospatial data analytics, offering the first and only platform for working with vector and raster geospatial data together at a planetary scale. Wherobots delivers a significant breadth of geospatial analytics capabilities, built around a cloud-native data lakehouse architecture and query engine that delivers up to 60x better performance than incumbent solutions. Accessible through the powerful notebook experience data scientists and data engineers know and love, Wherobots Cloud is the most comprehensive, approachable, and fully-managed serverless offering for enabling spatial intelligence at scale.

Today, we’re announcing the Wherobots Spatial SQL API, powered by Apache Sedona, to bring the performance of WherobotsDB to your favorite data applications. This opens the door to a world of direct-SQL integrations with Wherobots Cloud, bringing a serverless cloud engine that’s optimized for spatial workloads at any scale into your spatial ETL pipelines and applications, and taking your users and engineers closer to your data and spatial insights.

Register for our release webinar on July 10th here: https://bit.ly/3yFlFYk

Developers love Wherobots because compute is abstracted and managed by Wherobots Cloud. Because it can run at a planetary scale, Wherobots streamlines development and reduces time to insight. It runs on a data lake architecture, so data doesn’t need to be copied into a proprietary storage system, and integrates into familiar development tools and interfaces for exploratory analytics and orchestrating production spatial ETL pipelines.

Utilize Apache Airflow or SQL IDEs with WherobotsDB via the Spatial SQL API

Wherobots Cloud and the Wherobots Spatial SQL API are powered by WherobotsDB, with Apache Sedona at its core: a distributed computation engine that can horizontally scale to handle computation and analytics on any dataset. Wherobots Cloud automatically manages the infrastructure and compute resources of WherobotsDB to serve your use case based on how much computation power you need.

Behind the scenes, your Wherobots Cloud “runtime” defines the amount of compute resources allocated and the configuration of the software environment that executes your workload (in particular for AI/ML use cases, or if your ETL or analytics workflow depends on 1st or 3rd party libraries).

Our always-free Community Edition gives access to a modest “Sedona” runtime for working with small-scale datasets. Our Professional Edition unlocks access to much larger runtimes, up to our “Tokyo” runtime capable of working on planetary-scale datasets, and GPU-accelerated options for your WherobotsAI workloads.

With the release of the Wherobots Spatial SQL API and its client SDKs, you can bring WherobotsDB, the ease-of-use, and the expressiveness of SQL to your Apache Airflow spatial ETL pipelines, your applications, and soon to tools like Tableau, Superset, and other 3rd party systems and applications that support JDBC.

Our customers love applying the performance and scalability of WherobotsDB to their data preparation workflows and their compute-intensive data processing applications.

Use cases include

  • Preparation of nationwide and planetary-scale datasets for their users and customers
  • Processing hundreds of millions of mobility data records every day
  • Creating and analyzing spatial datasets in support of their real estate strategy and decision-making.

Now customers have the option to integrate new tools with Wherobots for orchestration and development of spatial insights using the Spatial SQL API.

How to get started with the Spatial SQL API

When you establish a connection to the Wherobots Spatial SQL API, a SQL session is started, backed by your selected WherobotsDB runtime (a “Sedona” runtime by default; you can specify larger runtimes if you need more horsepower). Queries submitted through this connection are securely executed against your runtime, with compute fully managed by Wherobots.

We provide client SDKs in Java and in Python to easily connect and interact with WherobotsDB through the Spatial SQL API, as well as an Airflow Provider to build your spatial ETL DAGs; all of which are open-source and available on package registries, as well as on Wherobots’ GitHub page.

Using the Wherobots SQL Driver in Python

Wherobots provides an open-source Python library that exposes a DB-API 2.0 compatible interface for connecting to WherobotsDB. To build a Python application around the Wherobots DB-API driver, add the wherobots-python-dbapi library to your project’s dependencies:

$ poetry add wherobots-python-dbapi

Or directly install the package on your system with pip:

$ pip install wherobots-python-dbapi

From your Python application, establish a connection with wherobots.db.connect() and use cursors to execute your SQL queries and use their results:

import logging
import sys

from wherobots.db import connect
from wherobots.db.region import Region
from wherobots.db.runtime import Runtime

# Optionally, set up logging to get information about the driver's activity.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(name)20s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

# Get your API key, or securely read it from a local file.
api_key = '...'

with connect(
    host="api.cloud.wherobots.com",
    api_key=api_key,
    runtime=Runtime.SEDONA,
    region=Region.AWS_US_WEST_2,
) as conn:
    cur = conn.cursor()
    sql = """
        SELECT
            id,
            names['primary'] AS name,
            geometry,
            population
        FROM
            wherobots_open_data.overture_2024_02_15.admins_locality
        WHERE localityType = 'country'
        SORT BY population DESC
        LIMIT 10
    """
    cur.execute(sql)
    results = cur.fetchall()
    print(results)

For more information and future releases, see https://github.com/wherobots/wherobots-python-dbapi-driver on GitHub.

Using the Apache Airflow provider

Wherobots provides an open-source provider for Apache Airflow, defining an Airflow operator for executing SQL queries directly on WherobotsDB. With this new capability, you can integrate your spatial analytics queries, data preparation or data processing steps into new or existing Airflow workflow DAGs.

To build or extend your Airflow DAG using the WherobotsSqlOperator, add the airflow-providers-wherobots dependency to your project:

$ poetry add airflow-providers-wherobots

Define your connection to Wherobots; by default the Wherobots operators use the wherobots_default connection ID:

$ airflow connections add "wherobots_default" \
    --conn-type "wherobots" \
    --conn-host "api.cloud.wherobots.com" \
    --conn-password "$(< api.key)"

Instantiate the WherobotsSqlOperator with your choice of runtime and your SQL query, and integrate it into your Airflow DAG definition:

from wherobots.db.runtime import Runtime
from airflow_providers_wherobots.operators.sql import WherobotsSqlOperator

...

select = WherobotsSqlOperator(
  runtime=Runtime.SEDONA,
  sql="""
          SELECT
              id,
              names['primary'] AS name,
              geometry,
              population
          FROM
              wherobots_open_data.overture_2024_02_15.admins_locality
          WHERE localityType = 'country'
          SORT BY population DESC
          LIMIT 10
      """
)
# select.execute() or integrate into your Airflow DAG definition

apache airflow spatial sql api
For more information and future releases, see https://github.com/wherobots/airflow-providers-wherobots on GitHub.

Using the Wherobots SQL Driver in Java

Wherobots provides an open-source Java library that implements a JDBC (Type 4) driver for connecting to WherobotsDB. To start building Java applications around the Wherobots JDBC driver, add the following line to your build.gradle file’s dependency section:

implementation "com.wherobots:wherobots-jdbc-driver"

In your application, you only need to work with Java’s JDBC APIs from the java.sql package:

import com.wherobots.db.Region;
import com.wherobots.db.Runtime;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Get your API key, or securely read it from a local file.
String apiKey = "...";

Properties props = new Properties();
props.setProperty("apiKey", apiKey);
props.setProperty("runtime", Runtime.SEDONA);
props.setProperty("region", Region.AWS_US_WEST_2);

try (Connection conn = DriverManager.getConnection("jdbc:wherobots://api.cloud.wherobots.com", props)) {
    String sql = """
        SELECT
            id,
            names['primary'] AS name,
            geometry,
            population
        FROM
            wherobots_open_data.overture_2024_02_15.admins_locality
        WHERE localityType = 'country'
        SORT BY population DESC
        LIMIT 10
    """;
  Statement stmt = conn.createStatement();
  try (ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
      System.out.printf("%s: %s %f %s\n",
        rs.getString("id"),
        rs.getString("name"),
        rs.getDouble("population"),
        rs.getString("geometry"));
    }
  }
}

For more information and future releases, see https://github.com/wherobots/wherobots-jdbc-driver on GitHub.

Conclusion

The Wherobots Spatial SQL API takes Wherobots’ vision of hassle-free, scalable geospatial data analytics & AI one step further by making it the easiest way to run your Spatial SQL queries in the cloud. Paired with Wherobots and Apache Sedona’s comprehensive support for working with all geospatial data at any scale and in any format, and with Wherobots AI’s inference features available directly from SQL, the Wherobots Spatial SQL API is also the most flexible and the most capable platform for getting the most out of your data.

Wherobots vision

We exist because creating spatial intelligence at scale is hard. Our contributions to Apache Sedona, leadership in the open geospatial domain, and investments in Wherobots Cloud have made it easier, and will continue to do so. Users of Apache Sedona, Wherobots customers, and ultimately any AI application will be able to make better decisions about our physical and virtual worlds. They will be able to create solutions to improve these worlds that were otherwise infeasible or too costly to build. And the solutions developed will have a positive impact on society, business, and earth — at a planetary scale.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


Working With Files – Getting Started With Wherobots Cloud Part 3

This is the third post in a series that will introduce Wherobots Cloud and WherobotsDB, covering how to get started with cloud-native geospatial analytics at scale.

In the previous post in this series we saw how to access and query data using Spatial SQL in Wherobots Cloud via the Wherobots Open Data Catalog. In this post we’re going to look at loading and working with our own data in Wherobots Cloud, as well as creating and saving data as the result of our analysis, such as the end result of a data pipeline. We will cover importing files in various formats (CSV, GeoJSON, Shapefile, and GeoTIFF) into WherobotsDB, working with AWS S3 cloud object storage, and creating GeoParquet files using Apache Sedona.

If you’d like to follow along you can create a free Wherobots Cloud account at cloud.wherobots.com.

Loading A CSV File From A Public S3 Bucket

First, let’s explore loading a CSV file from a public AWS S3 bucket. In our SedonaContext object we’ll configure the anonymous S3 authentication provider for the bucket to ensure we can access this specific S3 bucket’s contents. Most access configuration happens in the SedonaContext configuration object in the notebook; however, we can also apply these settings when creating the notebook runtime by specifying additional Spark configuration in the runtime configuration UI. See this page in the documentation for more examples of configuring cloud object storage access using the SedonaContext configuration object.

from sedona.spark import *

config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
           "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

Access to private S3 buckets can be configured by either specifying access keys or for a more secure option using an IAM role trust policy.
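For the access-key route, a minimal sketch might look like the following, using the standard Hadoop S3A per-bucket configuration properties. The bucket name my-private-bucket and the credential placeholders are illustrative; an IAM role trust policy avoids embedding credentials and is the more secure option.

from sedona.spark import *

# Sketch only: grant access to a hypothetical private bucket via static access keys.
config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.my-private-bucket.access.key", "<AWS_ACCESS_KEY_ID>"). \
    config("spark.hadoop.fs.s3a.bucket.my-private-bucket.secret.key", "<AWS_SECRET_ACCESS_KEY>"). \
    getOrCreate()

sedona = SedonaContext.create(config)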

Now that we’ve configured the anonymous S3 credentials provider we can use the S3 URI to access objects within the bucket. In this case we’ll load a CSV file of bird species observations. We’ll use the ST_Point function to convert the individual longitude and latitude columns into a single Point geometry column.

S3_CSV_URL = "s3://wherobots-examples/data/examples/birdbuddy_oct23.csv"
bb_df = sedona.read.format('csv'). \
    option('header', 'true'). \
    option('delimiter', ','). \
    option('inferSchema', 'true'). \
    load(S3_CSV_URL)

bb_df = bb_df.selectExpr(
    'ST_Point(anonymized_longitude, anonymized_latitude) AS location', 
    'timestamp', 
    'common_name', 
    'scientific_name')

bb_df.createOrReplaceTempView('bb')
bb_df.show(truncate=False)
+----------------------------+-----------------------+-------------------------+----------------------+
|location                    |timestamp              |common_name              |scientific_name       |
+----------------------------+-----------------------+-------------------------+----------------------+
|POINT (-118.59075 34.393112)|2023-10-01 00:00:02.415|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:02.415|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:04.544|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:04.544|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.474|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.474|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.487|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:05.487|california scrub jay     |aphelocoma californica|
|POINT (-120.5542 43.804134) |2023-10-01 00:00:05.931|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:05.931|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:06.522|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:06.522|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:09.113|lesser goldfinch         |spinus psaltria       |
|POINT (-120.5542 43.804134) |2023-10-01 00:00:09.113|lesser goldfinch         |spinus psaltria       |
|POINT (-118.59075 34.393112)|2023-10-01 00:00:09.434|california scrub jay     |aphelocoma californica|
|POINT (-118.59075 34.393112)|2023-10-01 00:00:09.434|california scrub jay     |aphelocoma californica|
|POINT (-122.8521 46.864)    |2023-10-01 00:00:17.488|red winged blackbird     |agelaius phoeniceus   |
|POINT (-122.8521 46.864)    |2023-10-01 00:00:17.488|red winged blackbird     |agelaius phoeniceus   |
|POINT (-122.2438 47.8534)   |2023-10-01 00:00:18.046|chestnut backed chickadee|poecile rufescens     |
|POINT (-122.2438 47.8534)   |2023-10-01 00:00:18.046|chestnut backed chickadee|poecile rufescens     |
+----------------------------+-----------------------+-------------------------+----------------------+

We can visualize a sample of this data using SedonaKepler, the Kepler GL integration for Apache Sedona.

SedonaKepler.create_map(df=bb_df.sample(0.001), name="Bird Species")

Visualizing point data with Sedona Kepler

Uploading A GeoJSON File Using The Wherobots Cloud File Browser

Next, let’s see how we can upload our own data in Wherobots Cloud using the Wherobots Cloud File Browser UI. Wherobots Cloud accounts include secure file storage in private S3 buckets specific to our user or shared with other users of our organization.

There are two options for uploading our own data into Wherobots Cloud:

  1. Via the file browser UI in the Wherobots Cloud web application
  2. Using the AWS CLI by generating temporary ingest credentials

We’ll explore both options, first using the file browser UI. We will upload a GeoJSON file of the US Watershed Boundary Dataset. In the Wherobots Cloud web application, navigate to the “Files” tab then select the “data” directory. Here you’ll see two folders, one with the name customer-XXXXXXXXX and the other shared. The “customer” folder is private to each Wherobots Cloud user while the “shared” folder is private to the users within your Wherobots Cloud organization.

We can upload files using the “Upload” button. Once the file is uploaded we can click on the clipboard icon to the right of the filename to copy the S3 URL for this file to access it in the notebook environment.

The Wherobots Cloud file browser

We’ll save the S3 URL for this GeoJSON file as a variable to refer to later. Note that this URL is private to my Wherobots user and not publicly accessible, so if you’re following along you’ll have a different URL.

S3_URL_JSON = "s3://wbts-wbc-m97rcg45xi/qjnq6fcbf1/data/customer-hd1rff9kg390pk/getting_started/watershed_boundaries.geojson"

WherobotsDB supports native readers for many file types, including GeoJSON, so we’ll specify the "geojson" format to import our watersheds data into a Spatial DataFrame. This is a multiline GeoJSON file, where all features are contained in one large single object. WherobotsDB can also handle GeoJSON files with one feature per line. Refer to the documentation here for more information about working with GeoJSON files in WherobotsDB.

watershed_df = sedona.read.format("geojson"). \
    option("multiLine", "true"). \
    load(S3_URL_JSON). \
    selectExpr("explode(features) as features"). \
    select("features.*")
watershed_df.createOrReplaceTempView("watersheds")
watershed_df.printSchema()
root
 |-- geometry: geometry (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- areaacres: double (nullable = true)
 |    |-- areasqkm: double (nullable = true)
 |    |-- globalid: string (nullable = true)
 |    |-- huc6: string (nullable = true)
 |    |-- loaddate: string (nullable = true)
 |    |-- metasourceid: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- objectid: long (nullable = true)
 |    |-- referencegnis_ids: string (nullable = true)
 |    |-- shape_Area: double (nullable = true)
 |    |-- shape_Length: double (nullable = true)
 |    |-- sourcedatadesc: string (nullable = true)
 |    |-- sourcefeatureid: string (nullable = true)
 |    |-- sourceoriginator: string (nullable = true)
 |    |-- states: string (nullable = true)
 |    |-- tnmid: string (nullable = true)
 |-- type: string (nullable = true)

The WherobotsDB GeoJSON loader parses GeoJSON exactly as it is stored – as a single object – so we explode the features column, which gives us one row per feature containing the feature’s geometry and a struct of its associated properties.

The geometries stored in GeoJSON are loaded as geometry types, so we can operate on the DataFrame without explicitly creating geometries (as we had to do when loading the CSV file above).

Here we filter for all watersheds that intersect with California and visualize them using SedonaKepler.

california_df = sedona.sql("""
SELECT geometry, properties.name, properties.huc6
FROM watersheds
WHERE properties.states LIKE "%CA%"
""")

SedonaKepler.create_map(df=california_df, name="California Watersheds")

California watersheds

Uploading Files To Wherobots Cloud Using The AWS CLI

If we have large files or many files to upload to Wherobots Cloud, instead of uploading files through the file browser web application we can use the AWS CLI directly by generating temporary ingest credentials. After clicking “Upload” select “Create Ingest Credentials” to create temporary credentials that can be used with the AWS CLI to upload data into your private Wherobots Cloud file storage.

Uploading files to Wherobots Cloud via the AWS CLI

Once we generate our credentials we’ll need to configure the AWS CLI to use these credentials by adding them to a profile in the ~/.aws/credentials file (on Mac/Linux systems) or by running the aws configure command. See this page for more information on working with the AWS CLI.

[default]
aws_access_key_id = ASIAUZT33PSSUFF73PQA
aws_secret_access_key = O4zTVJTLNURqcFJof6F21Rh7gIQOGuqAUJNREUBa
aws_session_token = IQoJb3JpZ2luX2VjEOj//////////wEaCXVzLXdlc3QtMiJIMEYCIQCcn17jQory/9dbWjoq47cnxU4lENEE6S1akq1dEQx+4AIhALuFkR/XZtCOiw/AwEKtbCpj0IjDTR24MzSPxbSVoFOkKrICCMH//////////wEQABoMMzI5ODk4NDkxMDQ1IgwK3vI/VJyUkHJMltAqhgL6dhz0ikL2kpB7fCIKE52sw5NHlmG1LfuQkmxlhWHEJHnvFd1PrYEneDBTyXbt3Mxx8HQ86/k23zePAbm3mdOyVrrd7r9nA+cPYu5Jv93aGf+3brgGd/3fRMJy6y2Lwydsfuj/3u2/c8Ox7pTJKtcJYN14C8f3BNzTqtpR3bDjpTWG2+JMGFjgOx4lf9GuuXhW39tH8qOONA/y2lRiM00/j8cVOu1AZ/R5gRqL2/fCTFdxp9oBKJHXO8RZJ2u7/H67dmgdDFcw4T/ZuIvhEOtZ9TG2Vo9Vqb4jk+tP5E0ZhAnPyjWfhAdXD8at9/4i6S2WGhsywl5fnwBLYjRFRas5nnX7yzEEMLuQvq4GOpwBpyB0/qrzwPeRhHwb/K/ipspU2DMGSL0BFg6DoEpAOct/flMMmRYTaEhWV/Igexr746Hwox7ZOdN1gCLED1+iy8R/xcASNJJ9Yt194ItTvVtT4I6NTF13Oi50t0KqLURP43t1A67YwuZiWm+V7npUyiyHezBYLzTwf/rRi17lgpYO6/NSUSjgFeOqLDob11HGywzW6ifik+y269rq
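If you’d rather not overwrite your [default] profile, the AWS CLI’s aws configure set subcommand can write the temporary credentials into a named profile instead. This is a sketch; the profile name wherobots-ingest is arbitrary and the angle-bracket values are placeholders for the generated credentials:

$ aws configure set aws_access_key_id <ACCESS_KEY_ID> --profile wherobots-ingest
$ aws configure set aws_secret_access_key <SECRET_ACCESS_KEY> --profile wherobots-ingest
$ aws configure set aws_session_token <SESSION_TOKEN> --profile wherobots-ingest

You would then add --profile wherobots-ingest to the aws s3 cp command shown below.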

Let’s upload a Shapefile of US Forest Service roads to our Wherobots Cloud file storage and then import it using WherobotsDB. The command to upload our Shapefile data will look like this (your S3 URL will be different). Note that we use the --recursive flag since we want to upload the multiple files that make up a Shapefile.

aws s3 cp --recursive roads_shapefile s3://wbts-wbc-m97rcg45xi/qjnq6fcbf1/data/customer-hd1rff9kg390pk/getting_started/roads_shapefile

Back in our notebook, we can use WherobotsDB’s Shapefile reader, which first generates a Spatial RDD from the Shapefile that we can then convert to a Spatial DataFrame. See this page in the documentation for more information about working with Shapefile data in Wherobots.

S3_URL_SHAPEFILE = "s3://wbts-wbc-m97rcg45xi/qjnq6fcbf1/data/customer-hd1rff9kg390pk/getting_started/roads_shapefile"

spatialRDD = ShapefileReader.readToGeometryRDD(sedona, S3_URL_SHAPEFILE)
roads_df = Adapter.toDf(spatialRDD, sedona)
roads_df.printSchema()

We can visualize a sample of our forest service roads using SedonaKepler.

SedonaKepler.create_map(df=roads_df.sample(0.1), name="Forest Service Roads")

Visualizing forest service roads

Working With Raster Data – Loading GeoTiff Raster Files

So far we’ve been working with vector data: geometries and their associated properties. We can also work with raster data in Wherobots Cloud. Let’s load a GeoTiff as an out-db raster using the RS_FromPath Spatial SQL function. Once we’ve loaded the raster image we can use Spatial SQL to manipulate and work with the band data of our raster.

ortho_url = "s3://wherobots-examples/data/examples/NEON_ortho.tif"
ortho_df = sedona.sql(f"SELECT RS_FromPath('{ortho_url}') AS raster")
ortho_df.createOrReplaceTempView("ortho")
ortho_df.show(truncate=False)

For example, we can use the RS_AsImage function to visualize the GeoTiff, in this case an aerial image of a forest and road scene.

htmlDf = sedona.sql("SELECT RS_AsImage(raster) FROM ortho")
SedonaUtils.display_image(htmlDf)

Visualizing raster with RS_AsImage function

This image has three bands of data: red, green, and blue pixel values.

sedona.sql("SELECT RS_NumBands(raster) FROM ortho").show()

+-------------------+
|rs_numbands(raster)|
+-------------------+
|                  3|
+-------------------+

We can use the RS_MapAlgebra function to calculate the normalized difference greenness index (NDGI), a metric similar to the normalized difference vegetation index (NDVI) used for quantifying the health and density of vegetation and land cover. The RS_MapAlgebra function allows us to execute complex computations using values from one or more bands, and also across multiple rasters if, for example, we had a sequence of images over time and were interested in change detection.

ndgi_df = sedona.sql("""
SELECT RS_MapAlgebra(raster, 'D', 'out = (rast[1] - rast[0]) / (rast[1] + rast[0]);')
AS ndgi 
FROM ortho
""")

Writing Files With Wherobots Cloud

A common workflow with Wherobots Cloud is to load several files, perform some geospatial analysis and save the results as part of a larger data pipeline, often using GeoParquet. GeoParquet is a cloud-native file format that enables efficient data storage and retrieval of geospatial data.

Let’s perform some geospatial analysis using the data we loaded above and save the results as GeoParquet to our Wherobots Cloud S3 bucket. We’ll perform a spatial join of our bird observations with the watershed boundaries, then use a GROUP BY to calculate the number of bird observations in each watershed.

birdshed_df = sedona.sql("""
SELECT 
    COUNT(*) AS num, 
    any_value(watersheds.geometry) AS geometry, 
    any_value(watersheds.properties.name) AS name, 
    any_value(watersheds.properties.huc6) AS huc6
FROM bb, watersheds
WHERE ST_Contains(watersheds.geometry, bb.location) 
    AND watersheds.properties.states LIKE "%CA%"
GROUP BY watersheds.properties.huc6
ORDER BY num DESC
""")

birdshed_df.show()

Here we can see the watersheds with the highest number of bird observations in our dataset.

+------+--------------------+--------------------+------+
|   num|            geometry|                name|  huc6|
+------+--------------------+--------------------+------+
|138414|MULTIPOLYGON (((-...|   San Francisco Bay|180500|
| 75006|MULTIPOLYGON (((-...|Ventura-San Gabri...|180701|
| 74254|MULTIPOLYGON (((-...|Laguna-San Diego ...|180703|
| 48452|MULTIPOLYGON (((-...|Central Californi...|180600|
| 33842|MULTIPOLYGON (((-...|    Lower Sacramento|180201|
| 20476|MULTIPOLYGON (((-...|           Santa Ana|180702|
| 17014|MULTIPOLYGON (((-...|         San Joaquin|180400|
| 15288|MULTIPOLYGON (((-...|Northern Californ...|180101|
| 13636|MULTIPOLYGON (((-...|             Truckee|160501|
|  9964|MULTIPOLYGON (((-...|Southern Oregon C...|171003|
|  6864|MULTIPOLYGON (((-...|Tulare-Buena Vist...|180300|
|  5120|MULTIPOLYGON (((-...|     Northern Mojave|180902|
|  3660|MULTIPOLYGON (((-...|          Salton Sea|181002|
|  2362|MULTIPOLYGON (((-...|              Carson|160502|
|  1040|MULTIPOLYGON (((-...|      Lower Colorado|150301|
|   814|MULTIPOLYGON (((-...|    Mono-Owens Lakes|180901|
|   584|MULTIPOLYGON (((-...|             Klamath|180102|
|   516|MULTIPOLYGON (((-...|    Upper Sacramento|180200|
|   456|MULTIPOLYGON (((-...|      North Lahontan|180800|
|   436|MULTIPOLYGON (((-...|Central Nevada De...|160600|
+------+--------------------+--------------------+------+
only showing top 20 rows

We can also visualize this data as a choropleth using SedonaKepler.

Bird observations per watershed

The GeoParquet data we’ll save will include the geometry of each watershed boundary, the count of bird observations, and the name and id of each watershed. We’ll save this GeoParquet file to our Wherobots private S3 bucket. Previously we accessed our S3 URL via the Wherobots File Browser UI, but we can also access this URI as an environment variable in the notebook environment.

import os

USER_S3_PATH = os.environ.get("USER_S3_PATH")

Since this file isn’t very large we’ll save it as a single partition, but typically we would want to save GeoParquet files partitioned on a geospatial index or administrative boundary (a partitioned write is sketched after the example below). The WherobotsDB GeoParquet writer will add the geospatial metadata when saving as GeoParquet. See this page in the documentation for more information about creating GeoParquet files with Sedona.

birdshed_df.repartition(1).write.mode("overwrite"). \
    format("geoparquet"). \
    save(USER_S3_PATH + "geoparquet/birdshed.parquet")
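For larger results, a partitioned write could look like the following sketch, which derives a partition column from an S2 cell id of each geometry using the ST_S2CellIds Spatial SQL function. The column name, resolution level, and output path are illustrative.

from pyspark.sql.functions import expr

# Sketch only: add an S2 cell id column derived from each watershed geometry so the
# output is split into one GeoParquet file per S2 cell, enabling spatial file pruning.
partitioned_df = birdshed_df.withColumn("s2", expr("array_max(ST_S2CellIds(geometry, 2))"))

partitioned_df.repartition("s2").write.mode("overwrite"). \
    partitionBy("s2"). \
    format("geoparquet"). \
    save(USER_S3_PATH + "geoparquet/birdshed_partitioned.parquet")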

That was a look at working with files in Wherobots Cloud. You can get started with large-scale geospatial analytics by creating a free account at cloud.wherobots.com. Please join the Wherobots & Apache Sedona Community site and let us know what you’re working on with Wherobots, what type of examples you’d like to see next, or if you have any feedback!

Sign in to Wherobots Cloud to get started today.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


Exploring Global Fishing Watch Public Data With Apache Sedona, Wherobots Cloud & GeoParquet

This post is a hands-on look at offshore ocean infrastructure and industrial vessel activity with Apache Sedona in Wherobots Cloud using data from Global Fishing Watch. We also see how GeoParquet can be used with this data to improve the efficiency of data retrieval and enable large-scale geospatial visualization.

Global Fishing Watch Overview

Earlier this month researchers affiliated with Global Fishing Watch published a study in the journal Nature that used machine learning and satellite imagery to reveal that 75 percent of the world’s industrial fishing vessels are not publicly tracked while 25 percent of transport and energy vessels are not publicly tracked. The researchers analyzed 2 million gigabytes of satellite radar and optical imagery to detect vessels and offshore infrastructure and then attempted to match these identified vessels with data from public ship tracking systems. You can find their study here along with links to all code and data used in the project.

Global Fishing Watch is a global non-profit organization focused on creating and sharing knowledge about human activity in the oceans to ensure fair and sustainable usage of the world’s oceans. Their focus is on using big data solutions to analyze this activity (much of it from satellite imagery) and make it publicly available through a variety of open data and data products. You can find much of their open data freely available for download here.

Exploring Global Fishing Watch Public Data With Apache Sedona

In this post we will focus on working with the data published by Global Fishing Watch as a result of the study mentioned above which you can find in the dataset listing for "Paolo et al. (2024). Satellite mapping reveals extensive industrial activity at sea". To follow along, first create a free account in Wherobots Cloud then download the Global Fishing Watch data linked above. The dataset for this project includes two CSV files:

  • offshore_infrastructure_v20231106.csv 85MB – contains offshore infrastructure identified by satellite imagery and a machine learning algorithm, including the type of infrastructure (oil, wind, etc) and location
  • industrial_vessels_v20231013.csv 6.09GB – contains vessels detected in satellite imagery and classified as fishing or non-fishing, the location of the vessel, and if the vessel was matched to public AIS records (a system for reporting and tracking ship movements)

We can download these data files and upload to Wherobots Cloud via the file browser, as shown in this image (note that I’ve uploaded a number of other files from Global Fishing Watch projects):

Wherobots Cloud file browser

By uploading data to Wherobots Cloud this way it will be private to our user in Wherobots Cloud and hosted in AWS S3. We can click the clipboard icon to the right of each file to retrieve the S3 URL for each file.

Offshore Infrastructure

Let’s start by loading the offshore infrastructure data (offshore_infrastructure_v20231106.csv). First, we’ll load this into a DataFrame.

offshore_infra_df = sedona.read.format('csv'). \
    option('header','true').option('delimiter', ','). \
    load(S3_URL_INFRASTRUCTURE)

offshore_infra_df.show(5)

We can view the first few rows to get a sense of what data is included.

+------------+--------------+--------------+----------------+-----------------+
|structure_id|composite_date|         label|             lat|              lon|
+------------+--------------+--------------+----------------+-----------------+
|      115958|    2019-09-01|lake_maracaibo|9.71025137100954|-71.0518565233426|
|      115958|    2018-12-01|lake_maracaibo|9.71025137100954|-71.0518565233426|
|      115958|    2019-07-01|lake_maracaibo|9.71025137100954|-71.0518565233426|
|      115958|    2018-06-01|lake_maracaibo|9.71025137100954|-71.0518565233426|
|      115958|    2020-08-01|lake_maracaibo|9.71025137100954|-71.0518565233426|
+------------+--------------+--------------+----------------+-----------------+
only showing top 5 rows

This dataset (the smaller of the two) has 1.4 million observations.

offshore_infra_df.count()
-------------------------------------
1441242

Next, we convert the string values of lat and lon to a Point geometry using the ST_POINT Spatial SQL function. Note that since the CSV format does not include a schema or types we must explicitly cast the types of any non-string values.

offshore_infra_df = offshore_infra_df.selectExpr(
    'ST_POINT(CAST(lon AS float), CAST(lat AS float)) AS location',
    'structure_id',
    'label',
    'CAST(composite_date AS date) AS composite_date'
)
offshore_infra_df.createOrReplaceTempView("infrastructure")
offshore_infra_df.show(5)

We also cast composite_date to a date type.

+--------------------+------------+--------------+--------------+
|            location|structure_id|         label|composite_date|
+--------------------+------------+--------------+--------------+
|POINT (-71.051856...|      115958|lake_maracaibo|    2019-09-01|
|POINT (-71.051856...|      115958|lake_maracaibo|    2018-12-01|
|POINT (-71.051856...|      115958|lake_maracaibo|    2019-07-01|
|POINT (-71.051856...|      115958|lake_maracaibo|    2018-06-01|
|POINT (-71.051856...|      115958|lake_maracaibo|    2020-08-01|
+--------------------+------------+--------------+--------------+
only showing top 5 rows

We can count the number of observations for each classification label using a GROUP BY operation.

sedona.sql("SELECT COUNT(*) AS num, label FROM infrastructure GROUP BY label").show()

We can see that "oil" is the most common classification for observed offshore infrastructure in this dataset.

+------+--------------+
|   num|         label|
+------+--------------+
|518138|           oil|
| 42358|  possible_oil|
|  8169| probable_wind|
|  1082| possible_wind|
| 23101|  probable_oil|
|288390|lake_maracaibo|
|436659|          wind|
|123345|       unknown|
+------+--------------+

This dataset represents a time series of observations from 2017-2021, so individual pieces of infrastructure may be included multiple times. Let’s see how many distinct structures are included in the data.

sedona.sql("""
    WITH distinct_ids AS (SELECT DISTINCT(structure_id) FROM infrastructure) 
    SELECT COUNT(*) FROM distinct_ids
""").show()
--------------------
39979 

With just under 40,000 structures we should be able to visualize them using SedonaKepler. Let’s first combine some of the classification labels to simplify the visualization.

distinct_infra_df = sedona.sql("""
    WITH grouped AS (
        SELECT structure_id, location, 
            CASE WHEN label IN ("possible_oil", "probable_oil", "oil") THEN 'oil' 
                WHEN label IN ("possible_wind", "probably_wind", "wind") THEN "wind" 
                WHEN label IN ("unknown") THEN "unknown" 
                ELSE 'other' END AS label FROM infrastructure)
    SELECT any_value(location) AS location, any_value(label) AS label, structure_id FROM grouped
    GROUP BY structure_id
""")

To create visualizations using SedonaKepler we can pass the DataFrame to the create_map method:

SedonaKepler.create_map(distinct_infra_df, name="Offshore Infrastructure")

Here we can see the distribution of offshore infrastructure structures observed in the study.

Offshore infrastructure

Now that we have a sense of the scale of the offshore structures identified in the study, let’s explore the industrial vessel traffic dataset.

Industrial Vessels

First, we’ll load the industrial vessel CSV that we uploaded to Wherobots Cloud and count the number of rows.

industrial_vessel_df = sedona.read.format('csv'). \
    option('header','true').option('delimiter', ','). \
    load(S3_URL_INDUSTRIAL)

industrial_vessel_df.count()
-----------------------------------
23088625

We have just over 23 million observations. We can print the schema of the DataFrame to get a sense of the data included.

industrial_vessel_df.printSchema()
root
 |-- timestamp: string (nullable = true)
 |-- detect_id: string (nullable = true)
 |-- scene_id: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- lon: string (nullable = true)
 |-- mmsi: string (nullable = true)
 |-- length_m: string (nullable = true)
 |-- matching_score: string (nullable = true)
 |-- fishing_score: string (nullable = true)
 |-- matched_category: string (nullable = true)
 |-- overpasses_2017_2021: string (nullable = true)

Similarly to the previous file we loaded, all types are strings so we’ll need to cast numeric and date fields explicitly, as well as convert the individual latitude and longitude values to a point geometry.

industrial_vessel_df = industrial_vessel_df.selectExpr(
    'ST_Point(CAST(lon AS float), CAST(lat AS float)) AS location', 
    'CAST(timestamp AS Timestamp) AS timestamp', 
    'detect_id', 
    'scene_id', 
    'mmsi', 
    'CAST(length_m AS float) AS length_m', 
    'CAST(matching_score AS float) AS matching_score', 
    'CAST(fishing_score AS float) AS fishing_score', 
    'matched_category', 
    'CAST(overpasses_2017_2021 AS integer) AS overpasses_2017_2021')
industrial_vessel_df.createOrReplaceTempView('iv')

The data includes a matched_category field which indicates if the vessel identified was matched to the public vessel tracking system (AIS) and also if the vessel was classified as fishing or non-fishing.

sedona.sql("""
SELECT COUNT(*) AS num, matched_category 
FROM iv
GROUP BY matched_category
""").show()

Over 9 million vessel observations were not matched to the public vessel tracking system, indicating that many of these vessels may not have been properly reporting their location information.

+--------+------------------+
|num|  matched_category|
+--------+------------------+
| 2864709|   matched_fishing|
|  803624|   matched_unknown|
| 9187828|         unmatched|
|10232464|matched_nonfishing|
+--------+------------------+

Let’s visualize an aggregation of the dataset to get a sense of the spatial distribution of the data. To do this we’ll use H3 hexagons overlaid over the area covered by the vessel observations. We will count the number of vessel observations in each hexagon’s area, and then visualize the results.

To generate the H3 hexagons we’ll make use of two Spatial SQL functions:

  • ST_H3CellIDs – return an array of H3 cell ids that cover the given geometry at the specified resolution level
  • ST_H3ToGeom – return the H3 hexagon geometry for a given array of H3 cell ids

You can read more about Sedona’s support for H3 in this article.

h3_industrial_df = sedona.sql("""
SELECT 
    COUNT(*) AS num, 
    ST_H3ToGeom(ST_H3CellIDs(location, 2, false)) as geometry, 
    ST_H3CellIDs(location, 2, false) as h3 
FROM iv
GROUP BY h3
""")

We can now visualize this aggregated data as a choropleth map using SedonaKepler.

SedonaKepler.create_map(h3_industrial_df, name="Industrial Vessels" )

Industrial vessel activity

Note the distorted size of the hexagons at more extreme latitudes. This is an artifact of the Web Mercator map projection. When creating choropleth maps like this we typically want to choose a map projection that preserves the area of each hexagon, so we can more easily interpret the spatial distribution of the data.

To do this we can convert our DataFrame to a GeoDataFrame, specifying an Equal Earth projection, and then use matplotlib to visualize the GeoDataFrame.

import geopandas

h3_gdf = geopandas.GeoDataFrame(h3_industrial_df.toPandas(), geometry='geometry', crs="4326")
h3_gdf = h3_gdf.to_crs(epsg=8857)

ax = h3_gdf.plot(
    column="num",
    scheme="JenksCaspall",
    cmap="YlOrRd",
    legend=False,
    legend_kwds={"title": "Number of Observed Industrial Activities", "fontsize": 6, "title_fontsize": 6},
    figsize=(24,18)
)

ax.set_axis_off()

Observed industrial activities

Another option for visualization is to use the contextily package to include a basemap layer, but the downside is we must revert to the Web Mercator projection.

Here we visualize not just the count of overall vessel observations, but specifically those unmatched to the public tracking system. This gives us an indication of areas where illicit fishing might be concentrated.
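The h3_unmatched_df DataFrame used below isn’t shown above; a plausible construction, assuming the same H3 aggregation as before but restricted to unmatched detections, would be:

h3_unmatched_df = sedona.sql("""
SELECT
    COUNT(*) AS num,
    ST_H3ToGeom(ST_H3CellIDs(location, 2, false)) AS geometry,
    ST_H3CellIDs(location, 2, false) AS h3
FROM iv
WHERE matched_category = 'unmatched'
GROUP BY h3
""")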

import contextily as cx

h3_unmatched_gdf = geopandas.GeoDataFrame(h3_unmatched_df.toPandas(), geometry='geometry', crs="4326")
h3_unmatched_gdf = h3_unmatched_gdf.to_crs(epsg=3857)

ax = h3_unmatched_gdf.plot(
    column="num",
    scheme="JenksCaspall",
    cmap="YlOrRd",
    legend=True,
    legend_kwds={"title": "Number of Unmatched Industrial Activities", "fontsize": 6, "title_fontsize": 6},
    figsize=(12,8)
)

cx.add_basemap(ax,source=cx.providers.OpenStreetMap.Mapnik)
ax.set_axis_off()

Unmatched industrial vessels

Now that we’ve explored the data, let’s see how we can improve the performance and efficiency of working with this data as GeoParquet.

GeoParquet

GeoParquet is a standard that specifies how to store geospatial data in the Apache Parquet file format. The goals of GeoParquet are largely to extend the benefits of Parquet to the geospatial ecosystem and enable efficient workflows for working with geospatial attributes in columnar data.

Parquet is a popular columnar file format that enables efficient data storage and retrieval. Efficient storage is achieved through various compression and encoding methods. Efficient retrieval is achieved by chunking the data into row groups and storing metadata about each row group, which allows a query engine to read only the subsets of data that fall within the scope of a predicate (known as predicate pushdown), and by storing column values next to each other, which allows it to retrieve only the columns necessary to complete the query. Additionally, Parquet files can be partitioned across many individual files, further allowing query engines to ignore Parquet files outside the range of the query. These features make Parquet a natural choice for efficiently storing data in cloud object stores like AWS S3.

By specifying how to store geospatial data in the Parquet format, GeoParquet takes advantage of all the benefits of Parquet, with the added bonus of supporting interoperability within the geospatial data ecosystem.

Let’s look at how we can take advantage of these two benefits of GeoParquet: efficient data storage and efficient retrieval.

Efficient Data Storage With GeoParquet

First, let’s compare the file sizes of the original dataset (CSV) with the same data in the GeoJSON and GeoParquet formats. We can save the data as GeoJSON using Sedona:

industrial_vessel_df.repartition(1).write. \
    mode("overwrite"). \
    format("geojson"). \
    save(S3_URL_DATA_ROOT + "globalfishingwatch_industrial_vessels.json")

Note that we repartition the data to save as a single file.

And now to create a single un-partitioned GeoParquet file for comparison:

industrial_vessel_df.repartition(1).write. \
    mode("overwrite"). \
    format("geoparquet"). \
    save(S3_URL_DATA_ROOT + "globalfishingwatch_industrial_vessels.parquet")

The relative file sizes are:

  • CSV: 6.1 GB
  • GeoJSON: 9 GB
  • GeoParquet: 2.4 GB

The same data stored as GeoParquet takes up 60% less space than the CSV equivalent and 73% less storage space than the same data stored as GeoJSON. This translates to faster query times, since less data is transferred over the network.

Much of this efficiency is due to the GeoParquet specification’s requirement that geometries be serialized as WKB (well-known binary), which is then encoded and compressed using Parquet’s built-in compression functionality.
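To get a feel for what that serialization looks like, here is a quick way to inspect the WKB bytes of a simple geometry with Spatial SQL (a throwaway point, just for illustration):

# Inspect the WKB representation of a geometry; this is the same binary
# encoding that GeoParquet stores in the geometry column.
sedona.sql("SELECT ST_AsBinary(ST_GeomFromWKT('POINT (1 2)')) AS wkb").show(truncate=False)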

Efficient Retrieval With Partitioned GeoParquet

A common technique when working with Parquet files is to partition the Parquet files based on the value of some attribute. When working with GeoParquet this can be done using an administrative boundary (such as by state or country). In cases where it doesn’t make sense to partition by administrative boundary, we can instead partition using a spatial index such as S2 or H3.

By partitioning our data using a spatial index we can attain efficient retrieval by leveraging another feature of the GeoParquet specification: each GeoParquet file stores the bounding box of the geometries in the file in the column metadata. This means that at query time when executing a spatial filter query, the query engine can check the bounding box for each partitioned GeoParquet file and quickly exclude those falling outside the spatial filter bounds. This is known as spatial predicate pushdown. Let’s see it in action.

First, we will partition our data using the S2 spatial index and save as partitioned GeoParquet across multiple files (one file for each S2 cell):

# Compute a level-2 S2 cell id for each geometry, then write the data
# partitioned by that cell so each output file covers a single S2 cell.
industrial_vessel_df_pq = industrial_vessel_df_pq.withColumn("s2", expr("array_max(ST_S2CellIds(location, 2))"))
industrial_vessel_df_pq.repartition("s2").write. \
    mode("overwrite"). \
    partitionBy("s2"). \
    format("geoparquet"). \
    save(S3_URL_DATA_ROOT + "globalfishingwatch_industrial_vessels_s2part.parquet")

industrial_vessel_df_pq_part_s2 = sedona.read.format("geoparquet").load(S3_URL_DATA_ROOT + "globalfishingwatch_industrial_vessels_s2part.parquet")
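Each of these GeoParquet files carries its own “geo” metadata, including the bounding box described above. You can inspect it with pyarrow, for example; a minimal sketch, where the local file path is a placeholder and this assumes the writer recorded a bounding box:

import json
import pyarrow.parquet as pq

# GeoParquet stores a JSON document under the "geo" key of the Parquet file
# metadata, including a bounding box for each geometry column (when written).
# The path below is a hypothetical local copy of one GeoParquet file.
file_metadata = pq.read_metadata("industrial_vessels_part.parquet").metadata
geo = json.loads(file_metadata[b"geo"])
primary = geo["primary_column"]
print(geo["columns"][primary]["bbox"])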

Now let’s compare the performance of these various file formats by executing a spatial filter query.

Spatial Filter Performance Comparison

Let’s query for the number of observations in the area around the Gulf of Mexico. I drew a simple polygon to capture this area, which can be represented in WKT format as:

POLYGON ((-84.294662 29.840644, -88.952866 30.297018, -89.831772 28.767659, -94.050522 29.61167, -97.038803 27.839076, -97.917709 21.453069, -94.489975 18.479609, -86.843491 21.616579, -80.779037 24.926295, -84.294662 29.840644))

Gulf of Mexico

Next, we’ll define a spatial predicate using the ST_Within Spatial SQL function.

gulf_filter = 'ST_Within(location, ST_GeomFromWKT("POLYGON ((-84.294662 29.840644, -88.952866 30.297018, -89.831772 28.767659, -94.050522 29.61167, -97.038803 27.839076, -97.917709 21.453069, -94.489975 18.479609, -86.843491 21.616579, -80.779037 24.926295, -84.294662 29.840644))"))'

Now we’ll test the performance using the CSV version of the data:

%%time
industrial_vessel_df.where(gulf_filter).count()
----------------------------------------------------------
217508 (16.2 s)

Note that this time includes loading the entire CSV file, converting the values from string to geometry types, then performing the spatial filter without using an index.

Next, we compare the performance of executing the same spatial filter, but this time using our single GeoParquet file.

industrial_vessel_df_pq = sedona.read. \
    format('geoparquet'). \
    load(S3_URL_DATA_ROOT + "globalfishingwatch_industrial_vessels.parquet")

%%time
industrial_vessel_df_pq.where(gulf_filter).count()
---------------------------------------------------------------
217508 (4.91 s)

The significant improvement here was due to loading less data over the network and more efficient deserialization of the geometry values from WKB.

Next, let’s compare the performance of using the partitioned GeoParquet. Sedona will be able to exclude most GeoParquet files based on their bounding box metadata, significantly reducing both the data loaded over the network and the number of rows to filter using the spatial predicate.

industrial_vessel_df_pq_part_s2 = sedona.read. \
    format("geoparquet"). \
    load(S3_URL_DATA_ROOT + "globalfishingwatch_industrial_vessels_s2part.parquet")

%%time
industrial_vessel_df_pq_part_s2.where(gulf_filter).count()
------------------------------------------------------------------------
217508 (1.87 s)

To summarize, the time it takes to load the data and execute the spatial filter for each file type:

  • CSV: 16.2s
  • Single GeoParquet file: 4.91s
  • Partitioned GeoParquet: 1.87s

Thanks to the spatial predicate pushdown enabled by GeoParquet and Apache Sedona, the partitioned GeoParquet query takes 88% less time than the CSV query.

Another exciting benefit of GeoParquet is the types of integrations throughout the geospatial data ecosystem that it enables, such as with visualization tools.

Scalable Geospatial Visualization With Lonboard

Matched vs unmatched vessels

The GeoParquet specification is being developed alongside the GeoArrow specification, which defines an efficient in-memory columnar representation of geometries and enables efficient transport of in-memory geospatial data between tools.

One thing this type of zero-copy, in-process geospatial data transport can enable is efficient geospatial visualization, by moving data to the GPU for fast rendering.

A current proposal to the GeoParquet specification includes GeoArrow as a serialization format option (in addition to WKB mentioned earlier).

Lonboard is a relatively new Python package from the folks at Development Seed that takes advantage of GeoParquet and GeoArrow to render large-scale visualizations containing millions of geometries in seconds. Since GeoArrow data is not compressed, it doesn’t need to be parsed before being used, and the geometries can be efficiently moved to the GPU for rendering.

Note that since visualization moves data to the web browser and the GPU of the local machine, I downloaded the GeoParquet file locally and ran this example in a local Python notebook. If running in a cloud notebook environment, the data will first need to be fetched over the network, so the performance impact won’t be as significant.
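If you want to reproduce this step, one option is to download the file from S3 with boto3; a minimal sketch, where the bucket and key are placeholders and a single-file GeoParquet copy is assumed:

import boto3

# Download the GeoParquet file to a local path so the notebook (and GPU) can
# read it without network round trips. Bucket and key are placeholders, and a
# single-file GeoParquet copy is assumed.
s3 = boto3.client("s3")
s3.download_file(
    "example-bucket",
    "globalfishingwatch_industrial_vessels.parquet",
    "../data/industrial_vessels.parquet",
)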

Let’s see how we can use Lonboard to visualize the entirety of our industrial vessel dataset (20+ million points). First, we’ll create a GeoDataFrame loading our GeoParquet data.

url = "../data/industrial_vessels.parquet"
gdf = gpd.read_parquet(url)

We’ll visualize matched and unmatched vessels distinctly, so we’ll split them into two different GeoDataFrames and two layers in Lonboard. We’ll color matched vessels in green and unmatched in red.

from lonboard import Map, ScatterplotLayer

# Split the observations into unmatched (red) and matched (green) layers
unmatched_gdf = gdf.loc[gdf['matched_category']=='unmatched']
matched_gdf = gdf.loc[gdf['matched_category']=='matched_fishing']

unmatched_layer = ScatterplotLayer.from_geopandas(unmatched_gdf, get_fill_color = [200, 0, 0, 200])
matched_layer = ScatterplotLayer.from_geopandas(matched_gdf, get_fill_color = [0, 200, 0, 200])
map_ = Map(layers=[unmatched_layer, matched_layer])
map_

Lonboard visualization

This visualization contains millions of points and renders in seconds in our Jupyter notebook, while offering smooth panning and scrolling (since all points are rendered immediately). If you’ve ever tried to render visualizations of this scale using other tools in Jupyter, this is pretty impressive!

Resources

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter:


The Wherobots Notebook Environment – Getting Started With Wherobots Cloud Part 2

The Wherobots Notebook Environment is the main development interface for developers and data scientists working with WherobotsDB in Wherobots Cloud. In this post we’ll take a look at configuring and starting a notebook environment then see how to work with WherobotsDB via Python and Spatial SQL.

This is the second post in a series that will introduce Wherobots Cloud, covering how to get started with cloud-native geospatial analytics at scale.

Starting The Notebook Environment

After signing in to Wherobots Cloud you’ll be prompted to configure and start a notebook runtime. Anyone can create a Wherobots Cloud account and it’s free to get started.

Configuring the Wherobots Notebook runtime

Free tier users have access to the “Tiny” runtime with a default resource configuration suitable for development and testing. Professional tier users are able to select preconfigured runtimes with more resources or create custom runtimes.

We can also add configuration such as AWS S3 bucket credentials or extra Spark settings before starting the runtime, and select additional Python libraries to be installed into our environment. Many PyData and geospatial Python packages are installed by default, so we can typically get started with just the default runtime configuration by clicking the “Start” button.
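Such settings can also be applied in code when creating the SedonaContext in a notebook; here is a minimal sketch (the credential values are placeholders, and the exact configuration keys you need may differ):

from sedona.spark import SedonaContext

# A minimal sketch of passing extra Spark configuration when creating the
# SedonaContext. The credential values below are placeholders, not real keys.
config = (
    SedonaContext.builder()
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)
sedona = SedonaContext.create(config)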

Starting the runtime creates a Jupyter notebook environment specific to our Wherobots Cloud user. For free tier users this notebook environment will run for 2 hours and then shut down, after which we’ll need to restart the runtime; professional tier users don’t have this restriction.

Example Notebooks

Once our notebook runtime is available we can enter the Jupyter environment by clicking “Open Notebook”. By default we’ll land on a sample notebook which introduces some basic features of SedonaDB. If you’re not familiar with Jupyter, it’s an interactive development environment organized around notebooks. Notebooks are made up of cells, which can contain code, text, images, or other interactive widgets. We can run cells individually, or run all cells in the notebook sequentially by selecting Run >> Run All Cells from the menu.

Wherobots Initial Notebook

This initial notebook is a simple example meant to introduce some concepts and familiarize new users with working with SedonaDB. It covers:

  • Configuring the SedonaContext to access the Wherobots Open Data Catalog
  • Exploring the data available in the Wherobots Open Data Catalog
  • Using spatial SQL to query for points of interest by category
  • Visualizing the results using SedonaKepler

Querying the Wherobots Open Data Catalog
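As a rough sketch of what those steps look like in code (the catalog, table, and column names here are assumptions based on the sample notebook and may differ in your environment):

from sedona.spark import SedonaKepler

# Query points of interest by category from the open data catalog and
# visualize the results. Table and column names are assumed examples.
pois = sedona.sql("""
SELECT geometry, names, categories
FROM wherobots_open_data.overture.places_place
WHERE categories.main = 'stadium_arena'
LIMIT 1000
""")

map_pois = SedonaKepler.create_map(df=pois, name="Points of Interest")
map_pois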

This initial notebook is just a starting point and you can also find a number of additional example notebooks in the notebook_example directory in Jupyter. Specifically,

  • The sedona directory includes further examples for working with SedonaDB with Python, Spatial SQL and the Overture Maps dataset
    • sedona-example-python.ipynb – loading data from Shapefiles, performing spatial joins, and writing as GeoParquet
    • sedona-overture-maps.ipynb – explore the Overture Maps dataset including points of interest, administrative boundaries, and road networks
  • The havasu directory contains examples on working with the Havasu spatial table to perform ETL and data analysis using vector and raster data
    • havasu-iceberg-geometry-etl.ipynb – creating Havasu tables, performing spatial operations, working with spatial indexes to optimize performance
    • havasu-iceberg-raster-etl.ipynb – working with the EuroSAT raster dataset as Havasu tables, raster operations, handling CRS transforms, and benchmarking raster geometry operations
    • havasu-iceberg-outdb-raster-etl.ipynb – demonstrates the out-db method of working with large rasters in SedonaDB, loading a large GeoTiff and splitting into tiles, joining vector data with rasters
  • The notebook in the sedonamaps directory shows how to make use of SedonaMaps for map matching and visualizing routes
    • sedonamaps_example.ipynb – matching noisy GPS trajectory data to OpenStreetMap road segments and visualizing the results

Note: Only one notebook can be run at a time. If you want to run another notebook, please shut down the kernel of the current notebook first (see instructions here).

Creating Your Own Notebooks & Working With Version Control

Now you can of course create your own notebooks to work with spatial data in Wherobots Cloud. Once you’ve created a notebook there are a few ways to export it; for example, you can download the notebook locally, but a common workflow is to check notebook changes into version control and push them to a system like GitHub or GitLab.

Our Jupyter environment has git installed so we can check our notebooks into version control from the terminal. This is also a good way to import new notebooks. For example, let’s bring in some notebooks from this repository by opening the terminal in Jupyter and running the following command:

git clone https://github.com/johnymontana/30-day-map-challenge

Using version control with the Wherobots Notebook Environment

And now these notebooks are available in our Wherobots notebook environment, so as we make changes to them we can check them into version control and push those changes back to our GitHub repository.

Online Resources

As you’re working with Wherobots Cloud be sure to join the Wherobots Community site where you can ask questions if you get stuck and also share your projects with the community.

Here are some other resources you might find useful as you explore SedonaDB and Wherobots Cloud:

  • Wherobots Online Community – Ask questions, share your projects, explore what others are working on in the community, and connect with other members of the community
  • Wherobots YouTube Channel – Find technical tutorials, example videos, and presentations from spatial data experts on the Wherobots YouTube Channel
  • Wherobots Documentation – The documentation includes information about how to manage your Wherobots Cloud account, how to work with data using SedonaDB, as well as reference documentation
  • Wherobots Blog – Keep up to date with the Wherobots and Apache Sedona community including new product announcements, technical tutorials, and highlighting spatial analytics projects

This was a quick introduction to the Wherobots Notebook environment. I hope you’ll enjoy working with Wherobots Cloud and I hope to see you around the Wherobots Community Site!

Sign in to Wherobots Cloud to get started today.

Want to keep up with the latest developer news from the Wherobots and Apache Sedona community? Sign up for the This Month In Wherobots Newsletter: