
Mobility Data Processing at Scale: Why Traditional Spatial Systems Break Down

Mobility data is the continuous stream of GPS location records capturing how people, vehicles, and assets move through the world. Processing it at scale is fundamentally different from standard spatial data work because it carries temporal dependencies: the order and timing of observations define movement, not just position. Most organizations have discovered, often painfully, that the tools they already use were not designed with that in mind. In Part 2, we walk through the technical implementation: a three-notebook medallion architecture built on Wherobots and Apache Sedona that takes raw GPS pings and transforms them into analysis-ready, GeoParquet-backed analytical views.

Why Mobility Data Is Harder to Process Than It Looks

Every second, billions of GPS-equipped devices generate spatiotemporal data capturing how people, goods, and vehicles move through the physical world. The market reflects how seriously organizations are treating this: the global fleet management market is valued at approximately $27 billion in 2025 and projected to exceed $122 billion by 2035. Mobility data analytics platforms are on a similar trajectory, from $2.5 billion to over $11 billion by 2034. But collecting this data and actually extracting reliable intelligence from it are two very different things.

Most organizations have discovered, often painfully, that the tools they already use were not designed with GPS mobility data in mind. The result is brittle pipelines, inconsistent methodologies, and analyses that quietly produce misleading conclusions.

What Makes Mobility Data Different From Standard Spatial Data

Mobility data is not just spatial data with timestamps attached. It is a distinct category of data that violates assumptions built into most data processing systems. Researchers at ACM have characterized the field as requiring its own dedicated science—Mobility Data Science—because general-purpose data science pipelines consistently produce suboptimal results when applied to movement data. Understanding why requires examining the specific properties that make trajectory data uniquely challenging.

Why GPS Data Volume Overwhelms Traditional Databases

The volume and velocity problem is the recognition that GPS data generation at fleet scale is not a batch analytics challenge; it is a continuous, high-throughput data engineering problem that demands distributed processing from the start. A single connected vehicle generating GPS pings at one-second intervals produces roughly 86,400 records per day. A fleet of 10,000 vehicles generates over 860 million data points daily. Multiply this across the millions of connected vehicles, delivery drones, rideshare fleets, and maritime vessels operating globally, and the scale becomes staggering.
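
The arithmetic behind those figures is worth making explicit:

```python
# Back-of-envelope GPS data volume at fleet scale, using the figures from the text.
SECONDS_PER_DAY = 24 * 60 * 60  # one ping per second per vehicle

pings_per_vehicle_per_day = SECONDS_PER_DAY          # 86,400 records/day
fleet_size = 10_000                                  # vehicles
fleet_pings_per_day = pings_per_vehicle_per_day * fleet_size

print(pings_per_vehicle_per_day)  # 86400
print(fleet_pings_per_day)        # 864000000 -> "over 860 million" daily
```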

Traditional spatial databases like PostGIS, which excel at transactional workloads and moderate-scale analytics, were not designed for this volume. Loading hundreds of millions of GPS points into PostgreSQL, constructing geometries, and running spatial joins or trajectory reconstruction queries can take hours or days on a single node. Adding more hardware does not solve the fundamental problem: PostGIS was not built for distributed, parallel spatial computation.

How GPS Signal Noise Corrupts Downstream Analysis

GPS noise is error in raw location readings caused by signal reflection off buildings, satellite loss in tunnels and urban canyons, and atmospheric interference, and it cascades through every downstream analysis built on top of it. Studies have documented median GPS errors of 7 meters with standard deviations exceeding 23 meters in urban environments. Points can appear on the wrong side of a street, inside buildings, or kilometers from the actual position when signal quality degrades.

Speed calculations between consecutive points can spike to physically impossible values. Distance measurements accumulate systematic errors. Clustering algorithms identify phantom stop locations. Without rigorous cleaning and validation at the earliest stages of the pipeline, every subsequent insight is built on a compromised foundation.
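
One common defense against these spikes is a physical-plausibility filter on consecutive points. A minimal pure-Python sketch, where the 70 m/s (~250 km/h) threshold and the haversine helper are illustrative assumptions rather than values from the pipeline described here:

```python
import math
from datetime import datetime, timedelta

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def drop_speed_spikes(points, max_speed_ms=70.0):
    """Drop points implying a physically impossible speed from the last kept point.

    points: list of (timestamp, lat, lon) sorted by timestamp.
    max_speed_ms is an assumed threshold, tuned per vehicle type in practice.
    """
    kept = []
    for t, lat, lon in points:
        if kept:
            t0, lat0, lon0 = kept[-1]
            dt = (t - t0).total_seconds()
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_m(lat0, lon0, lat, lon) / dt > max_speed_ms:
                continue  # impossible jump: treat as GPS noise
        kept.append((t, lat, lon))
    return kept

t0 = datetime(2008, 1, 1, 12, 0, 0)
pings = [
    (t0, 39.9000, 116.4),
    (t0 + timedelta(seconds=1), 39.9500, 116.4),  # ~5.5 km jump in 1 s: noise
    (t0 + timedelta(seconds=2), 39.9001, 116.4),  # ~11 m from last kept point
]
cleaned = drop_speed_spikes(pings)  # the middle point is discarded
```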

Researchers at the University of Pennsylvania’s Computational Social Science Lab studied exactly this problem in the context of COVID-19 epidemic modeling. Using the same GPS mobility dataset, they found that different but individually reasonable preprocessing choices led to substantially different conclusions: a methodological “garden of forking paths” where reproducibility became nearly impossible. The root causes were data sparsity, sampling bias, and inconsistent algorithmic choices at the preprocessing stage.

Why Temporal Ordering Is Critical in Mobility Data Processing

Trip segmentation is the process of splitting a continuous GPS stream into discrete trips by detecting temporal gaps, periods where no data was recorded or the device was stationary. Mobility data is not simply geospatial—it is spatiotemporal. Every GPS point has a position and a timestamp, and the relationship between consecutive observations is what defines movement. This trip segmentation step alone introduces significant methodological complexity, because the threshold you choose (5 minutes? 20 minutes? 60 minutes?) fundamentally changes the structure of your resulting trajectories and all metrics derived from them.

Beyond segmentation, ordering matters. Trajectories are sequences, not sets. Every analytical operation—speed calculation, direction changes, stop detection, map matching—depends on correct chronological ordering within each trip. Systems that do not preserve or guarantee ordering (a common challenge in distributed frameworks) can produce geometries that appear valid but contain scrambled temporal information.
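
The gap-based segmentation described above can be sketched in a few lines of plain Python. The 20-minute default mirrors one of the thresholds mentioned in the text and is just as tunable here:

```python
from datetime import datetime, timedelta

def segment_trips(points, gap=timedelta(minutes=20)):
    """Split a chronologically ordered GPS stream into discrete trips.

    points: list of (timestamp, lat, lon) sorted by timestamp.
    A new trip starts whenever the time since the previous point exceeds `gap`;
    the chosen threshold changes every downstream trajectory and metric.
    """
    trips, current = [], []
    for p in points:
        if current and p[0] - current[-1][0] > gap:
            trips.append(current)  # gap exceeded: close the current trip
            current = []
        current.append(p)
    if current:
        trips.append(current)
    return trips

t0 = datetime(2008, 1, 1, 8, 0, 0)
stream = [
    (t0, 39.90, 116.40),
    (t0 + timedelta(minutes=1), 39.91, 116.41),
    (t0 + timedelta(minutes=40), 39.95, 116.45),  # 39-minute gap: new trip
]
trips = segment_trips(stream)  # two trips: [p0, p1] and [p2]
```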

Why 2D Spatial Systems Fail for Mobility Analytics

The dimensionality problem is the gap between how most spatial systems model location (latitude and longitude only) and what mobility analysis increasingly requires: 3D and 4D geometry that encodes elevation and time directly. Elevation matters for fuel consumption modeling, route optimization in mountainous terrain, aviation and drone trajectories, and any analysis where the difference between 2D and 3D distance is materially significant. Adding a temporal measure dimension (the “M” in XYZM geometries) encodes timestamps directly into the geometry, supporting trajectory validation and interpolation operations that are impossible with 2D points.

Yet 4D geometry support—constructing XYZM points, building trajectories from them, and performing analytical operations that respect all four dimensions—is rare. Most spatial SQL implementations either lack these functions entirely or implement them inconsistently. This forces practitioners to maintain separate columns for elevation and time, losing the computational advantages of integrated 4D geometry processing.
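
To make the 2D-versus-4D distinction concrete, here is a small pure-Python stand-in for XYZM points. This is not Sedona's geometry API, just an illustration of what the extra dimensions buy, assuming projected X/Y coordinates in meters and M as epoch seconds:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class PointZM:
    """A 4D point: x/y in meters (projected), z elevation in meters, m a timestamp."""
    x: float
    y: float
    z: float
    m: float  # e.g. Unix epoch seconds

def distance_2d(a: PointZM, b: PointZM) -> float:
    return math.hypot(b.x - a.x, b.y - a.y)

def distance_3d(a: PointZM, b: PointZM) -> float:
    """3D distance; diverges from the 2D figure as elevation change grows."""
    return math.sqrt((b.x - a.x) ** 2 + (b.y - a.y) ** 2 + (b.z - a.z) ** 2)

def speed_ms(a: PointZM, b: PointZM) -> float:
    """Speed derived from the M (time) dimension carried in the geometry itself."""
    return distance_3d(a, b) / (b.m - a.m)

# 100 m horizontal travel with a 30 m climb over 10 seconds:
a = PointZM(0.0, 0.0, 0.0, 0.0)
b = PointZM(100.0, 0.0, 30.0, 10.0)
flat = distance_2d(a, b)      # 100.0 m
true = distance_3d(a, b)      # ~104.4 m: the 2D figure understates distance
v = speed_ms(a, b)            # ~10.44 m/s
```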

Where Traditional Spatial Systems Fail for Mobility Data

| System | Primary Limitation for Mobility Data |
| --- | --- |
| Desktop GIS (QGIS, ArcGIS Pro) | Single-machine ceiling, no distributed processing |
| PostGIS | Single-node, not built for hundreds of millions of GPS points |
| Cloud data warehouses (Snowflake, BigQuery) | Shallow spatial support, cannot handle XYZM or map matching |
| Vanilla Apache Spark | No native spatial types, no spatial indexing |
| External map matching APIs | Rate limits and per-request pricing make batch processing prohibitive |

Desktop GIS: The Single-Machine Ceiling

Tools like QGIS and ArcGIS Pro are extraordinarily capable for visualization, manual analysis, and working with datasets that fit in memory. But they hit a hard wall with mobility data at scale. Loading millions of GPS trajectories, performing trip segmentation with window functions, running DBSCAN clustering on stop points, and executing map matching against a road network are not operations that desktop GIS was designed to handle. Analysts working with fleet-scale data routinely encounter out-of-memory errors, multi-hour processing times, and the inability to iterate quickly on analytical parameters.

Spatial Databases: Scale Without Spatial Intelligence

PostGIS remains the gold standard for spatial SQL and is an excellent choice for many use cases. However, it is fundamentally a single-node system. Scaling PostGIS to handle hundreds of millions of GPS points requires expensive vertical scaling, and even then, operations like trajectory construction across thousands of users, spatial indexing with H3 or GeoHash, and DBSCAN clustering at urban scale can exhaust available resources.

Cloud data warehouses like Snowflake, BigQuery, and Redshift have added spatial capabilities, but these tend to be shallow implementations optimized for simple point-in-polygon or distance queries. Constructing XYZM trajectories from ordered GPS points, performing spatial clustering, computing 3D distances, or running map matching against a road network are either unsupported or require convoluted workarounds that sacrifice performance and maintainability.

General-Purpose Distributed Frameworks: Power Without Spatial Awareness

Apache Spark provides the distributed computing muscle needed for mobility-scale data, but vanilla Spark has no concept of spatial data types, spatial indexing, or geometric operations. Running a spatial join in pure Spark requires broadcasting datasets or implementing custom partitioning strategies—both of which are error-prone and perform poorly at scale compared to purpose-built spatial engines.

This is precisely the gap that Apache Sedona and Wherobots were designed to fill. Sedona extends Spark (and other distributed frameworks) with native spatial data types, over 290 spatial SQL functions, spatial indexing, and optimized query planning that understands geometric predicates. Wherobots builds on Sedona to provide a fully managed, cloud-native spatial intelligence platform where teams can process GPS-scale data without managing infrastructure, configuring clusters, or bolting together fragmented toolchains.

Map Matching: The Unsolved Infrastructure Problem

Map matching—the process of snapping noisy GPS traces to the actual road network—is one of the most computationally demanding and methodologically complex steps in any mobility pipeline. It requires loading a complete road network graph, computing probabilistic alignments between GPS observations and candidate road segments, and resolving ambiguities at intersections, parallel roads, and complex interchanges.

Most map matching solutions are either commercial APIs with strict rate limits and per-request pricing that make batch processing of historical data prohibitively expensive, or open-source tools that require significant infrastructure setup and do not scale to city-wide or fleet-wide datasets. Researchers have consistently identified scalability as the primary bottleneck: algorithms that produce accurate matches on small datasets fail to perform when confronted with millions of trajectories.
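
As a sense of the geometry involved, here is the simplest possible matcher: snap each point to the nearest road segment by perpendicular projection. Real systems use probabilistic (HMM-style) matching over candidate segments and road topology; this brute-force sketch, with planar coordinates and names of our own invention, mainly illustrates the per-point cost that explodes without spatial indexing:

```python
import math

def project_to_segment(p, a, b):
    """Project point p onto segment a-b (planar coords); return (snapped, distance)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    # Clamp the projection parameter t to [0, 1] so we stay on the segment.
    t = 0.0 if seg_len2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    sx, sy = ax + t * dx, ay + t * dy
    return (sx, sy), math.hypot(px - sx, py - sy)

def snap_point(p, road_segments):
    """Snap one GPS point to the nearest segment: O(segments) per point, no index."""
    best = min((project_to_segment(p, a, b) for a, b in road_segments),
               key=lambda r: r[1])
    return best[0]

roads = [((0, 0), (10, 0)),   # a horizontal street
         ((0, 5), (10, 5))]   # a parallel street 5 units away
snapped = snap_point((3, 1), roads)  # lands on the nearer street at (3.0, 0.0)
```

Without a spatial index, every point scans every segment; that quadratic blowup is exactly why city-wide matching needs distributed, indexed execution.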

Having map matching available as an integrated capability within the same distributed environment where you are already processing and analyzing your trajectory data—rather than as an external API call or a separate system—eliminates an entire category of infrastructure complexity and data movement overhead.

The Real Cost of a Fragmented Mobility Data Pipeline

In practice, most organizations processing mobility data have assembled a patchwork of tools: Python scripts for data cleaning, PostGIS for spatial operations, custom code for trip segmentation, an external API for map matching, a separate clustering library, and a visualization tool at the end. Each transition between tools introduces data serialization overhead, potential for schema drift, and opportunities for subtle bugs.

This fragmentation carries real costs beyond engineering time. When a researcher needs to change the trip segmentation threshold from 20 minutes to 30 minutes, the entire pipeline must be re-executed across multiple systems. When a new data source arrives with a slightly different schema, adapters must be updated at each integration point. When results need to be reproduced for regulatory or academic review, reconstructing the exact sequence of operations across disparate tools is often impractical.

The ideal mobility data pipeline processes GPS pings through ingestion, cleaning, enrichment, trajectory construction, map matching, spatial indexing, clustering, and analytical aggregation—all within a single, distributed, spatially-aware environment where every step is expressed in SQL or Python, every intermediate result is inspectable, and the entire pipeline can be reproduced with a single execution.

What a Modern Mobility Data Architecture Looks Like

The medallion architecture—Bronze, Silver, Gold—has become the standard pattern for progressive data refinement in the data lakehouse world. But applying it to mobility data requires rethinking what each layer does, because spatial data introduces transformations and enrichment steps that have no analog in conventional data engineering.

Bronze is not just ingestion—it is spatial profiling. You are not only loading CSV or Parquet files; you are constructing point geometries, validating coordinate bounds, assessing data quality metrics like altitude validity and temporal coverage, and establishing the spatial extent of your dataset.
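
A Bronze-layer quality check of the kind described can be sketched per record. The bounds below (WGS84 coordinate limits, a plausible altitude window, a null-island check) are illustrative assumptions, not thresholds from the accelerator:

```python
def profile_ping(lat, lon, alt_m):
    """Return Bronze-layer quality flags for a single raw GPS record.

    All thresholds are assumptions for illustration: WGS84 coordinate bounds,
    a -500 m to 10,000 m altitude window, and a (0, 0) "null island" sentinel
    check for devices that report zeroed coordinates on signal loss.
    """
    return {
        "valid_coords": -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0,
        "nonzero_coords": not (lat == 0.0 and lon == 0.0),
        "valid_altitude": alt_m is not None and -500.0 <= alt_m <= 10_000.0,
    }

good = profile_ping(39.9042, 116.4074, 52.0)   # all flags True
null_island = profile_ping(0.0, 0.0, 52.0)     # nonzero_coords is False
bad_lat = profile_ping(91.0, 116.4, 52.0)      # valid_coords is False
```

Aggregating these flags across a partition gives the data quality metrics (altitude validity, coordinate bounds) that the Bronze layer reports.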

Silver is where the heavy lifting happens. This is trip segmentation, 4D geometry construction, trajectory building, movement metric derivation, spatial indexing, and map matching. Each of these operations is computationally intensive, order-dependent, and requires spatial functions that most data platforms simply do not have.

Gold produces the analytical views that power downstream consumption: H3 hexbin density heatmaps, temporal activity patterns, stop detection via spatial clustering, trajectory anomaly flagging, and road segment speed analysis. These views are written as GeoParquet files—compatible with Kepler.gl, QGIS, Felt, Foursquare Studio, and DuckDB Spatial—ensuring that the output of the pipeline is immediately consumable by any modern geospatial visualization or analytics tool.
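
As a dependency-free illustration of the density aggregation idea, here is a square-grid stand-in. A real Gold layer would use H3 cell IDs (via Sedona's H3 functions or the h3 library) rather than this hypothetical grid_density helper; hexagons avoid the distance distortions of square cells:

```python
from collections import Counter

def grid_density(points, cell_deg=0.01):
    """Count points per square lat/lon grid cell, roughly ~1 km at mid-latitudes.

    A stand-in for H3 hexbinning: each point is assigned an integer cell ID
    derived from its coordinates, then counts are aggregated per cell.
    """
    return Counter(
        (int(lat // cell_deg), int(lon // cell_deg)) for lat, lon in points
    )

pings = [
    (39.9010, 116.4010),  # same cell as the next point
    (39.9015, 116.4012),
    (39.9520, 116.4530),  # a different cell
]
density = grid_density(pings)  # two cells, one with count 2
```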

How Wherobots Handles Mobility Data Processing: What We Built

To demonstrate this architecture in practice, we built a three-notebook Wherobots Mobility Solution Accelerator using the Microsoft Research GeoLife GPS Trajectories dataset—one of the few open mobility datasets that includes elevation data, enabling full 4D geometry processing. The dataset contains 17,621 trajectories from 182 users in Beijing, with latitude, longitude, altitude, and timestamps spanning 2007–2012.

In Part 2, we walk through every notebook in detail: the Bronze layer’s ingestion and profiling pipeline, the Silver layer’s 4D trajectory construction and map matching workflow, and the Gold layer’s analytical and exploratory views. We cover the specific Apache Sedona spatial SQL functions used at each step, the PySpark window function patterns for trip segmentation and movement metrics, and the real-world challenges we encountered and solved—from Spark’s schema inference corrupting timestamp values, to COLLECT_LIST not preserving order in trajectory construction, to DBSCAN requiring physical column references.

If you are building mobility analytics pipelines and hitting the limitations of your current toolchain, Part 2 will give you a concrete, reproducible blueprint for how to do it on Wherobots.

Get Started with Wherobots

Frequently Asked Questions

What is mobility data?

Mobility data is the continuous stream of GPS location records capturing how people, vehicles, and assets move through the world. Unlike standard spatial data, it carries temporal dependencies: the order and timing of observations define movement, making position and time equally important for analysis. Fleet management platforms, rideshare services, logistics companies, and urban planners all work with mobility data to understand movement patterns, optimize routes, and model demand.

Why does PostGIS fail for large-scale GPS data processing?

PostGIS runs on a single node. At fleet scale, a single vehicle generating pings every second produces roughly 86,400 records per day. A fleet of 10,000 vehicles generates over 860 million points daily. Loading, indexing, and running trajectory reconstruction or spatial join queries against that volume on a single PostgreSQL node takes hours or fails entirely. The core issue is architectural: PostGIS was built for transactional workloads and moderate-scale analytics, not distributed parallel computation at GPS scale.

What is the medallion architecture for mobility data?

The medallion architecture organizes data processing into three layers: Bronze, Silver, and Gold. For mobility data, Bronze handles ingestion and spatial profiling, including constructing point geometries, validating coordinate bounds, and assessing data quality. Silver handles the computationally intensive work: trip segmentation, 4D geometry construction, trajectory building, spatial indexing, and map matching. Gold produces analysis-ready views used downstream, including H3 hexbin density maps, stop detection clusters, and road segment speed analysis. Each layer builds on the previous one, and the full pipeline is reproducible from a single execution.

What is map matching and why is it hard to scale?

Map matching snaps noisy GPS traces to the actual road network so that scattered coordinates become paths following real streets. It requires loading a complete road network graph, computing probabilistic alignments between GPS observations and candidate road segments, and resolving ambiguities at intersections and parallel roads. Commercial API solutions impose rate limits and per-request pricing that make batch processing of historical data expensive. Open-source tools require significant infrastructure setup and consistently underperform when processing millions of trajectories.
