Our role at Wherobots and as leaders in the Apache Sedona community is to help more developers, organizations, and AI systems positively transform the physical world using spatial data. To make transformation at that scale possible, we’ve had to address significant bottlenecks in how data is stored and queried. We’re excited to celebrate the availability of SedonaDB and SpatialBench for Apache Sedona. Together they represent the next phase in our plan to accelerate innovation with spatial data and bridge the intelligence gap between AI and the physical world.
SedonaDB is the first open-source, single-node analytical database engine that treats spatial data as a first-class citizen.
Most analytical query engines already support general-purpose operations: filtering, joins, aggregations, and APIs for SQL or Python. But when it comes to operating on spatial data those same engines fall short: support for geometry and geography types, coordinate reference systems (CRS), spatial joins, and raster or vector operations is missing. The workaround is to bolt on an extension like PostGIS (PostgreSQL), DuckDB Spatial (DuckDB), or SedonaSpark (Spark). While powerful, extensions inherit the limits, costs, and complexities of their host systems, require extra setup and tuning, and can force builders to develop around performance and usability gaps instead of developing their ideas.
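To make "spatial join" concrete, here is a minimal pure-Python sketch that matches points to the polygons containing them. It is an illustration only, not SedonaDB's implementation or API: the `Polygon` class, the `spatial_join` helper, and the sample zones and points are all hypothetical names invented for this example, and the approach (a bounding-box check followed by a ray-casting test) is a simple textbook technique rather than what a vectorized engine would do.

```python
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass
class Polygon:
    """Hypothetical simple polygon: one outer ring, no holes."""
    name: str
    ring: list[Point]  # vertices in order; the closing edge is implied

    def bbox(self) -> tuple[float, float, float, float]:
        xs = [x for x, _ in self.ring]
        ys = [y for _, y in self.ring]
        return min(xs), min(ys), max(xs), max(ys)

    def contains(self, p: Point) -> bool:
        x, y = p
        # Cheap rejection first: skip the exact test if p is outside the bounding box.
        min_x, min_y, max_x, max_y = self.bbox()
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            return False
        # Ray casting: count how many edges a ray going right from p crosses.
        inside = False
        n = len(self.ring)
        for i in range(n):
            x1, y1 = self.ring[i]
            x2, y2 = self.ring[(i + 1) % n]
            if (y1 > y) != (y2 > y):  # edge straddles the ray's y value
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside


def spatial_join(points: dict[str, Point], polygons: list[Polygon]) -> list[tuple[str, str]]:
    """Naive nested-loop join: pair each point id with every polygon containing it."""
    return [
        (point_id, poly.name)
        for point_id, p in points.items()
        for poly in polygons
        if poly.contains(p)
    ]


# Made-up sample data: a square zone and a triangular zone.
zones = [
    Polygon("zone_a", [(0, 0), (4, 0), (4, 4), (0, 4)]),
    Polygon("zone_b", [(5, 0), (9, 0), (7, 4)]),
]
points = {
    "p1": (1, 1),     # inside zone_a
    "p2": (7, 1),     # inside zone_b
    "p3": (10, 10),   # outside both zones
    "p4": (5.5, 3),   # inside zone_b's bounding box, outside the triangle
}

pairs = spatial_join(points, zones)
print(pairs)
```

Even this toy version needs a geometry type, a containment predicate, and a pruning step. The nested loop also compares every point with every polygon, which is why engines that handle spatial joins at scale rely on indexing and pruning instead.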
SedonaDB is different. It’s for builders solving problems with physical world data.
Written in Rust, it’s lightweight, blazing fast, and spatial-native. Out of the box, it provides:
SedonaDB is built on Apache Arrow and Apache DataFusion and provides everything you need from a modern vectorized query engine. What sets it apart is that it also runs high-performance spatial workloads easily, without requiring extensions. Read the announcement on the Apache Sedona blog to dive in and roll up your sleeves.
In 2020, Apache Sedona entered incubation to address a significant gap in distributed geospatial data processing. Since then, Sedona has enabled companies like Uber, Amazon Last Mile Delivery, JB Hunt, and thousands of others with geographically distributed operations or interests to build and run more efficient and effective physical operations at scale. It is widely used today to bring geospatial processing support to Apache Spark, Apache Flink, and Snowflake. But distributed systems aren’t the right fit for every team or every use case, and we could do more to drive innovation in smaller-scale scenarios.
Many ideas are bootstrapped in no-to-low-cost environments where iteration cycles are fast and low risk. There’s a lot that a developer can do today using a laptop or a single virtual machine, with modern software and LLMs, without taking on a dependency that brings unwanted cost and complexity to the innovation cycle. Once their ideas are viable, they may not even require a distributed compute environment like Spark in production, or one that is “fully managed” by a vendor.
So the next step was pretty clear. We had to make it easier for builders to use spatial data in no-to-low-cost environments so they can iterate and positively transform the physical world, faster. We also decided to address these challenges through open-source software to maximize accessibility.
If you look around the ecosystem, you’ll notice a pattern: to get the analytical support you need for geospatial data, you deploy an analytics engine without the spatial analytics support you want, and then you bolt on what you need via an extension.
Extensions are great and they serve their purpose well. After all, SedonaSpark is an extension! But that doesn’t mean the combination of engine + extension is ideal. It requires additional setup and management, can require tuning to achieve reasonable performance, and the underlying engine may end up becoming a bottleneck. Additionally, the development experience around the engine may be overly complex or lack support for the language you prefer, and the engine itself might introduce compute, cost, and other overhead.
Working from the root causes of these challenges, along with the desire to drive more innovation, our next step became obvious. We needed to create a query engine that supports building spatial data solutions out of the box, offers popular Pythonic and SQL interfaces, and is optimized for single-machine environments.
But was there enough value created by a spatial-first query engine compared to general purpose query engines with spatial extensions?
Spatial data is no longer a minor class of data. It’s everywhere, the rate at which it’s being generated is growing every day, and its use cases span numerous industries. It streams from devices, vehicles, satellites, and drones, and derivatives from this data inform automation and decision-making across business, government, and research. The solutions being developed with it are transforming how organizations operate in the physical world.
Innovation is happening today with this data despite the friction above, but the pace of this innovation can be accelerated by a query engine with internals intentionally designed to help developers realize the full potential of this data.
This engine is SedonaDB, and it’s backed by an open-source community (Apache Sedona) that is committed to solving physical-world challenges through data and technology.
“Without standards, there can be no improvement” – Taiichi Ohno. This statement from the founder of the Toyota Production System captures why we built SpatialBench. There was no standard way of measuring spatial query performance, so progress couldn’t be easily quantified and query engines couldn’t be objectively compared on this dimension.
We built SpatialBench to establish a first standard. The initial release supports 12 representative queries, ranging from simple to complex workloads, and includes a data generator for scale factors 1, 10, 100, and 1000. We hope this framework and its future versions will guide innovation that leads to a greater understanding of the physical world.
We also used SpatialBench to benchmark SedonaDB, DuckDB (with its spatial extension), and GeoPandas at scale factors 1 and 10. Those results are published here.