How to shift Apache Sedona on Spark workloads to WherobotsDB

Wherobots customers are realizing up to a 20x performance increase and significant cost savings by shifting their Apache Sedona workloads into Wherobots. This guide shows you how easy it is to migrate Apache Sedona workloads into WherobotsDB, and focuses on best practices for Apache Sedona migrations from Amazon EMR, AWS Glue, and Databricks.

What You’ll Achieve:

By following this guide, you’ll be able to:

  • Lift and shift existing Apache Sedona workloads: Whether your Sedona jobs are currently running in AWS Glue, Databricks, or EMR, we’ll show you how to move them to WherobotsDB.
  • Orchestrate job runs with Airflow: We’ve made it easy to redirect your existing code and execute it seamlessly on WherobotsDB using Airflow for job orchestration.

Getting Started:

This guide assumes you’ve already decided to migrate to Wherobots. We’ll focus on the technical steps in moving your workloads, empowering you to get up and running quickly.

Storage Integration

Wherobots makes it easy to run the models and scripts you already have in your public or private Amazon S3 buckets using Wherobots’ secure S3 storage integration.

For step-by-step configuration instructions, see S3 Storage integration and SAML Single Sign On (SSO) setup in the official Wherobots documentation.

Initializing Sedona Context within WherobotsDB

When you’re ready to dive into spatial data analysis within WherobotsDB, your first order of business is creating a Sedona context object. This object acts as your gateway to the capabilities of the Wherobots Cloud ecosystem, enabling you to leverage its extensive spatial functions and tools.

Before running any workload, confirm that the Sedona environment in your Wherobots notebook is configured correctly. In particular, check that the sedona and spark variables used to initialize the environment match your existing setup, so that code written against your current environment runs unchanged.

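from sedona.spark import *

# Parentheses let the builder chain span several lines and carry comments;
# a comment cannot follow a backslash line continuation.
config = (
    SedonaContext.builder()
    # add your Sedona/Spark configurations here in this format
    .config("<sedona-spark-config-key>", "<sedona-spark-config-value>")
    .getOrCreate()
)
sedona = SedonaContext.create(config)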

This configuration provides the foundation for utilizing Sedona’s spatial functions within WherobotsDB, empowering you to perform advanced geospatial analysis with ease and efficiency.
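
If your current EMR, Glue, or Databricks job already sets Sedona or Spark options, one way to carry them over is to keep them in a dictionary and apply them to the builder in a loop. The sketch below is illustrative only: `apply_configs` is a hypothetical helper, not part of Sedona, and the example keys and values are placeholders. It assumes nothing more than a builder exposing a chainable `.config(key, value)` method, as in the snippet above.

```python
from typing import Any, Mapping


def apply_configs(builder: Any, configs: Mapping[str, str]) -> Any:
    """Apply each key/value pair to a builder with a chainable .config() method.

    Hypothetical helper: works with any object whose .config(key, value)
    returns a builder, such as the one returned by SedonaContext.builder().
    """
    for key, value in configs.items():
        builder = builder.config(key, value)
    return builder


# Settings copied from an existing job (placeholder values; use your own).
existing_configs = {
    "spark.sql.shuffle.partitions": "200",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

# In a Wherobots notebook, usage would look like this (left commented out so
# the sketch runs without Sedona installed):
#   from sedona.spark import SedonaContext
#   config = apply_configs(SedonaContext.builder(), existing_configs).getOrCreate()
#   sedona = SedonaContext.create(config)
```

Keeping the options in one dictionary makes it easy to compare them against the settings of the job you are migrating.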

Move your business logic

Workload migration can be daunting and disruptive. Fortunately, WherobotsDB is built on Apache Sedona and is 100% code-compatible, so you can migrate your workloads without rewriting them. You’ll find all the familiar functions, joins, and features of Sedona, with performance enhanced by WherobotsDB.

Follow the steps below to seamlessly transfer your business logic to WherobotsDB:

  1. Identify Your Logic
  2. Validate Functionality
  3. Create a Python Script
  4. Redeploy code in your S3 bucket

This diagram illustrates how your Sedona workloads will be integrated within the Wherobots ecosystem:

[Figure: Sedona lift-and-shift workflow within the Wherobots ecosystem]

Identify Your Logic

Start by identifying an obvious component of your spatial workflow to migrate to WherobotsDB. Ensure you have all the supporting elements required for its functionality. This approach will streamline your transition to the WherobotsDB ecosystem.

Validate Functionality

After identifying the business logic you intend to shift, it’s important to validate its functionality within the WherobotsDB environment to ensure it performs as expected. This validation process ensures that your spatial operations, data transformations, and analytical processes produce the same accurate results you rely on.

Test your code in a WherobotsDB notebook. Select a runtime that matches the demands of your workload, copy your business logic into the notebook, and execute it. Then validate the outputs, paying close attention to data counts and consistency with your expected results.
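
One lightweight way to check data counts and consistency is to record a few summary metrics in each environment and compare them. The helper below is a hypothetical sketch in plain Python: `compare_metrics` and the metric names are illustrative assumptions, and in practice the values would come from calls such as `df.count()` on your Sedona DataFrames.

```python
import math
from typing import Dict, List


def compare_metrics(
    expected: Dict[str, float],
    actual: Dict[str, float],
    rel_tol: float = 1e-9,
) -> List[str]:
    """Return human-readable mismatches between two metric dictionaries.

    `expected` holds metrics from the existing environment (the baseline);
    `actual` holds the same metrics collected in the WherobotsDB notebook.
    """
    problems = []
    for name in sorted(set(expected) | set(actual)):
        if name not in actual:
            problems.append(f"{name}: missing from WherobotsDB run")
        elif name not in expected:
            problems.append(f"{name}: missing from baseline run")
        elif not math.isclose(expected[name], actual[name], rel_tol=rel_tol):
            problems.append(
                f"{name}: expected {expected[name]}, got {actual[name]}"
            )
    return problems


# Illustrative values only; collect real ones from each environment.
baseline = {"row_count": 1000, "total_area": 52.25}
wherobots = {"row_count": 1000, "total_area": 52.25}
mismatches = compare_metrics(baseline, wherobots)
print(mismatches if mismatches else "all metrics match")
```

The relative tolerance is there because floating-point aggregates can differ slightly between engines even when the logic is identical; exact counts should still match.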

Create a Python Script

With your validated code ready, it’s time to package it into a Python script. This involves simply creating a .py file and organizing your code. This step ensures your logic is portable and easily executed within the WherobotsDB environment.
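
A minimal script layout might look like the sketch below. It assumes arguments reach the script on the command line as `key=value` strings, mirroring the `test_run=True` argument used in the Airflow example in this guide; how Wherobots actually delivers arguments to your script is an assumption here. The names `parse_kv_args` and `run`, and the placeholder query, are hypothetical.

```python
import sys
from typing import Dict, List


def parse_kv_args(argv: List[str]) -> Dict[str, str]:
    """Turn ["test_run=True", "limit=10"] into {"test_run": "True", "limit": "10"}."""
    parsed = {}
    for item in argv:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key] = value
    return parsed


def run(options: Dict[str, str]) -> None:
    """Business logic goes here.

    Sedona is imported lazily so the parsing helper above can be used and
    tested without a Spark environment.
    """
    from sedona.spark import SedonaContext

    config = SedonaContext.builder().getOrCreate()
    sedona = SedonaContext.create(config)

    test_run = options.get("test_run", "False").lower() == "true"
    # Placeholder query; replace with the logic you validated in the notebook.
    df = sedona.sql("SELECT ST_Point(1.0, 2.0) AS geom")
    df.show(1 if test_run else 20)


# When saving this as a .py file for a job run, enable the entry point:
# if __name__ == "__main__":
#     run(parse_kv_args(sys.argv[1:]))
```

Separating argument parsing from the Sedona logic keeps the script easy to test locally before you upload it.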

Redeploy code in your S3 bucket

Now that your business logic is packaged in a Python script, you need to make it accessible to the Wherobots Airflow operator. To do this, upload it to an S3 bucket that’s integrated with your Wherobots environment.
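
If you take the S3 route and upload with the AWS SDK for Python, the step could be scripted roughly as follows. This sketch assumes `boto3` is installed and AWS credentials are already configured; the bucket, key, and file names are placeholders, and `s3_uri` and `upload_script` are hypothetical helpers. The returned URI is the value you will need when configuring the job run.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI that identifies an object in a bucket."""
    return f"s3://{bucket}/{key}"


def upload_script(local_path: str, bucket: str, key: str) -> str:
    """Upload a local file to S3 and return its URI.

    Requires boto3 and valid AWS credentials; boto3 is imported lazily so
    s3_uri() can be used without the AWS SDK installed.
    """
    import boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Example call (placeholder names; commented out because it needs credentials):
#   uri = upload_script("sedona_job.py", "my-bucket", "jobs/sedona_job.py")
```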

Alternatively, you can upload it to Wherobots Managed Storage. This secure, integrated storage solution makes your code readily available for execution within the Wherobots ecosystem. The Wherobots documentation describes how to upload files to Managed Storage.

Execute Sedona Code with Wherobots Airflow operator

Wherobots provides an Apache Airflow operator, the WherobotsRunOperator, that simplifies integrating your code with the Job Runs API. It lets you trigger Wherobots runs from within your Airflow workflows. Before running your script, you need to establish a connection to Wherobots on your Airflow server and retrieve the S3 URI of your uploaded Python file. This URI tells the operator where your code is located so that it can access and execute it.

Here’s an example of how to use the WherobotsRunOperator to execute your Sedona code on WherobotsDB:

import pendulum

from airflow import DAG
from airflow_providers_wherobots.operators.run import WherobotsRunOperator

from wherobots.db.runtime import Runtime

# Define a DAG that runs once; change the schedule to suit your workflow.
with DAG(
    dag_id="test_run_operator",
    schedule="@once",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
) as test_run_dag:

    # Submit the uploaded Python file as a Wherobots job run.
    operator = WherobotsRunOperator(
        task_id="analysis_task",
        name="airflow_run_operator",
        # Runtime size used to execute the script.
        runtime=Runtime.TINY,
        run_python={
            # Replace with the S3 URI of your uploaded script.
            "uri": "S3-URI-PATH-TO-YOUR-FILE",
            "args": "test_run=True"
        },
        dag=test_run_dag,
        # Poll the run's logs while the task executes.
        poll_logs=True,
    )

In this example, the WherobotsRunOperator takes the S3 URI of your Python file and executes it on the specified runtime, Runtime.TINY. You can configure Airflow to run your code on a schedule, pass arguments to your code, and monitor the execution logs.

By utilizing the WherobotsRunOperator and the Job Runs API, you can seamlessly integrate your existing Sedona code into WherobotsDB and take advantage of its powerful geospatial capabilities. This approach ensures a smooth transition and allows you to focus on your spatial data analysis without worrying about infrastructure management or complex configurations.

To learn more about the WherobotsRunOperator and its capabilities, refer to the Wherobots documentation.

Alternative to Airflow operator

If you don’t use Airflow, the Wherobots Job Runs API provides a convenient way to execute your code directly on WherobotsDB.

Conclusion

Migrating your spatial data workflows doesn’t have to be a complex endeavor. With this guide, you can transition from Apache Sedona on Spark, EMR, or Databricks to WherobotsDB on Wherobots Cloud, a cloud-native data processing solution designed to make you more productive and to accelerate your spatial workloads.
