Real-time planning under multi-modal uncertainty is one of the hardest problems in autonomous driving. Classical modular stacks (perception → prediction → planning) are brittle outside the situations they were hand-engineered for, while modern diffusion-based planners produce rich, multi-modal trajectories at the cost of many denoising steps — too slow for safety-critical closed-loop replanning.
We take a different route: a conditional flow-matching planner that directly outputs actionable control sequences (acceleration and curvature) from a bird's-eye-view (BEV) raster of the scene. By parameterising a smooth ODE vector field, the model reaches diffusion-level quality in a handful of integration steps, comfortably within a real-time budget on a single GPU.
The central question we ask is about generalisation: can a generative planner trained on a narrow distribution still drive well on situations it has never seen? To find out, we train the model exclusively on urban streets, intersections and roundabouts of the city of Parma, Italy, and then evaluate it in closed-loop on both held-out urban scenarios and multi-lane highways — geometries, speeds and dynamics that are entirely absent from the training data. The model generalises reliably to these conditions without any fine-tuning, and we credit this to two design choices: the BEV representation, which abstracts the scene into geometry, and the flow-matching formulation, whose smooth vector field degrades gracefully under distribution shift.
We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real city streets, intersections and roundabouts of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed-loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalises reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide extensive qualitative analysis and video demonstrations of the closed-loop behaviour.
All training and evaluation is carried out in an in-house 2D traffic simulator that operates at object level on an HD map of the city of Parma, derived from OpenStreetMap and manually refined to encode lane geometry and connectivity, drivable area boundaries, speed limits, stop and yield lines, and traffic-light positions. From this map we crop 50 distinct scenarios: 40 in-distribution urban configurations (single- and dual-lane streets, T- and four-way intersections, roundabouts) and 10 reserved for out-of-distribution evaluation (highway segments and held-out urban configurations in geographically disjoint map regions, to avoid any leakage between train and test).
The simulator supports heterogeneous reactive traffic: vehicles driven by an Intelligent Driver Model for longitudinal control and a Pure Pursuit controller for lateral tracking — each parameterised with a configurable aggressiveness profile — alongside pedestrians, cyclists, slow-moving vehicle queues, and static obstacles. Crucially, in closed-loop evaluation all non-ego agents re-plan at every timestep in response to the ego's actual trajectory: collisions are emergent, not scripted.
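For intuition, here is a minimal sketch of the Intelligent Driver Model's longitudinal law. The parameter values are illustrative textbook defaults, not the simulator's actual aggressiveness profiles:

```python
import math

def idm_acceleration(v, v_lead, gap, v0=13.9, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4.0):
    """Intelligent Driver Model: longitudinal acceleration for a following vehicle.

    v      -- own speed (m/s); v_lead -- leader speed (m/s); gap -- distance to leader (m)
    v0     -- desired free-flow speed; T -- desired time headway (s)
    a_max  -- comfortable acceleration; b -- comfortable braking; s0 -- minimum gap (m)
    """
    dv = v - v_lead                                   # closing speed
    # desired dynamic gap: grows with speed and with how fast we are closing in
    s_star = s0 + v * T + (v * dv) / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (max(s_star, 0.0) / max(gap, 1e-3)) ** 2)
```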
Only successful ego episodes are kept, and frames are actively curated for diversity by binning the acceleration–curvature space and oversampling under-represented regions. The end result is a dataset that is dense in informative driving situations rather than dominated by trivial straight-line cruising.
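A minimal sketch of one way to implement this curation; the bin count and the inverse-frequency weighting are our assumptions, not the pipeline's exact settings:

```python
import numpy as np

def curation_weights(accels, curvatures, n_bins=20):
    """Sampling weights that oversample rare regions of the acceleration-curvature space.

    Frames in densely populated bins (straight-line cruising near zero curvature)
    receive low weight; frames in sparse bins are oversampled.
    """
    hist, a_edges, k_edges = np.histogram2d(
        accels, curvatures, bins=n_bins,
        range=[[-3.0, 2.0], [-0.2, 0.2]],             # control bounds used in the post
    )
    a_idx = np.clip(np.digitize(accels, a_edges) - 1, 0, n_bins - 1)
    k_idx = np.clip(np.digitize(curvatures, k_edges) - 1, 0, n_bins - 1)
    weights = 1.0 / np.maximum(hist[a_idx, k_idx], 1.0)   # inverse bin frequency
    return weights / weights.sum()                    # normalised sampling distribution
```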
Each frame is rasterised into a four-channel 192 × 192 m BEV at 0.25 m/pixel (768 × 768 pixels), centred on the ego vehicle. Pixel intensities carry semantic information: non-ego agents are rendered with brightness proportional to their speed, the drivable-area channel encodes the local speed limit, the route channel carries the ego's current speed, and a regulations channel marks stop lines, yield lines and traffic lights with distinct values.
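As a rough illustration of the encoding, a stripped-down rasteriser for the agents channel alone; the real renderer draws oriented footprints and fills all four channels, and the `v_max` normalisation constant is an assumption:

```python
import numpy as np

SIZE_M, RES = 192.0, 0.25                  # 192 x 192 m window at 0.25 m/pixel
PX = int(SIZE_M / RES)                     # 768 x 768 raster

def rasterise_agents(agents, v_max=30.0):
    """Agents channel only: each non-ego agent is drawn as an axis-aligned
    footprint whose intensity is proportional to its speed.

    `agents` is an iterable of (x, y, length, width, speed) tuples in the
    ego frame (m, m/s); the ego vehicle sits at the raster centre.
    """
    channel = np.zeros((PX, PX), dtype=np.float32)
    for x, y, length, width, speed in agents:
        cx, cy = int(x / RES + PX / 2), int(y / RES + PX / 2)   # metres -> pixels
        hl, hw = int(length / RES / 2), int(width / RES / 2)
        intensity = min(speed / v_max, 1.0)                     # brightness encodes speed
        channel[max(cy - hw, 0):cy + hw + 1, max(cx - hl, 0):cx + hl + 1] = intensity
    return channel
```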
During training we randomly perturb the poses and speeds of non-ego agents in the BEV raster, mimicking the kind of imperfect perception a real-world stack would deliver and encouraging the planner to be robust to small inaccuracies.
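A sketch of what this augmentation can look like; the noise scales below are illustrative assumptions:

```python
import numpy as np

def perturb_agents(agents, rng, pos_std=0.3, speed_std=0.5):
    """Jitter non-ego poses and speeds before rasterisation to emulate
    imperfect perception. `pos_std` (m) and `speed_std` (m/s) are guesses."""
    perturbed = []
    for x, y, length, width, speed in agents:
        perturbed.append((
            x + rng.normal(0.0, pos_std),
            y + rng.normal(0.0, pos_std),
            length, width,
            max(speed + rng.normal(0.0, speed_std), 0.0),   # speed stays non-negative
        ))
    return perturbed

# usage: perturb_agents(agents, np.random.default_rng(0))
```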
The supervision signal is a sequence of 64 future controls at 20 Hz (a 3.2 s horizon), each a pair (acceleration, curvature) with a ∈ [−3, 2] m/s² and κ ∈ [−0.2, 0.2] m⁻¹. Both are normalised before being fed to the model.
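For concreteness, a min-max normalisation over the stated control bounds; the target range [−1, 1] is our assumption, since the post only states that the controls are normalised:

```python
A_MIN, A_MAX = -3.0, 2.0       # acceleration bounds (m/s^2)
K_MIN, K_MAX = -0.2, 0.2       # curvature bounds (m^-1)

def normalise(a, k):
    """Min-max scale (acceleration, curvature) to [-1, 1] (assumed target range)."""
    return (2.0 * (a - A_MIN) / (A_MAX - A_MIN) - 1.0,
            2.0 * (k - K_MIN) / (K_MAX - K_MIN) - 1.0)

def denormalise(a_n, k_n):
    """Invert `normalise` to recover physical controls before execution."""
    return ((a_n + 1.0) / 2.0 * (A_MAX - A_MIN) + A_MIN,
            (k_n + 1.0) / 2.0 * (K_MAX - K_MIN) + K_MIN)
```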
Our architecture is built around one hard constraint: real-time closed-loop replanning at 20 Hz. We split the workload into a heavy one-time scene encoder and a lightweight iterative generator, so the expensive part of the computation runs once per planning cycle and the cheap part runs once per ODE integration step.
Flow matching learns a time-dependent vector field u_t that transports a simple prior p_init into the data distribution p_data via an ODE. Following the rectified-flow recipe, we use the linear interpolation x_t = t·z + (1 − t)·ε with z ∼ p_data and ε ∼ 𝒩(0, I), so the regression target z − ε is independent of t and interpolation trajectories are straight lines between noise and data. This is what makes a handful of Euler steps sufficient for high-quality sampling. At inference we start deterministically from x_0 = 0 so that, given a BEV, the model produces a reproducible control sequence.
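In code, the training objective reduces to a few lines; the `model(x_t, t, c)` signature below is a placeholder for our conditional U-Net:

```python
import torch

def flow_matching_loss(model, z, c):
    """Conditional flow-matching loss with the rectified-flow interpolation.

    z -- ground-truth control sequences, shape (B, 64, 2); c -- BEV embedding.
    Under x_t = t*z + (1 - t)*eps the regression target is z - eps, so the
    model learns the constant straight-line velocity between noise and data.
    """
    eps = torch.randn_like(z)                              # sample from the prior
    t = torch.rand(z.shape[0], 1, 1, device=z.device)      # uniform interpolation time
    x_t = t * z + (1.0 - t) * eps                          # straight-line interpolant
    pred = model(x_t, t.view(-1), c)                       # predicted vector field u_t
    return torch.nn.functional.mse_loss(pred, z - eps)
```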
A CNN encodes the four-channel BEV raster into a compact spatial embedding c. This embedding is computed once per planning cycle and reused across every ODE integration step. A lightweight U-Net then parameterises the vector field u_t(· | c): it ingests the current noisy control sequence and the interpolation time t, and is conditioned on c via cross-attention at the U-Net's skip connections and mid-block. The full model has about 15M parameters, of which ~85% live inside the BEV CNN — the part we only run once.
We train both the BEV encoder and the U-Net jointly, end-to-end, from scratch, using the standard conditional flow-matching loss with AdamW (β₁ = 0.9, β₂ = 0.999, weight decay 0.1), peak learning rate 5 × 10⁻⁴, linear warmup over the first 5% of steps and cosine annealing afterwards. Training runs for 300k steps at batch size 32 and takes ~3 days on a single NVIDIA RTX A6000.
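In PyTorch the schedule can be reproduced roughly as follows, assuming `model` is the planner network (BEV encoder plus U-Net):

```python
import math
import torch

TOTAL_STEPS = 300_000
WARMUP = int(0.05 * TOTAL_STEPS)           # linear warmup over the first 5% of steps

def lr_lambda(step):
    """Linear warmup to the peak LR, then cosine annealing towards zero."""
    if step < WARMUP:
        return step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.999), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```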
Every 50 ms the evaluation loop renders a fresh BEV from the current ego state and the surrounding scene, integrates the learned ODE to produce 64 future controls, and applies only the first (a₁, κ₁) through the dynamic model. All non-ego agents respond reactively to the trajectory the ego actually drives. We evaluate in two regimes — in-distribution (urban / intersections / roundabouts) and out-of-distribution (highways and held-out urban scenarios) — over 400 ID and 100 OOD episodes with fixed initial conditions.
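The control-to-motion step can be sketched with a simple kinematic model; this is one plausible reading of the dynamic model, not necessarily its exact implementation:

```python
import math

def step_ego(state, a, kappa, dt=0.05):
    """Advance the ego one tick (50 ms) under the first planned control pair.

    `state` = (x, y, heading, speed); yaw rate follows from speed * curvature.
    """
    x, y, theta, v = state
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += v * kappa * dt                # curvature kappa sets the turn rate
    v = max(v + a * dt, 0.0)               # integrate acceleration, no reversing
    return (x, y, theta, v)
```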
| Metric | NFE = 1, ID | NFE = 1, OOD | NFE = 10, ID | NFE = 10, OOD |
|---|---|---|---|---|
| CR (collision rate, %) ↓ | 9.25 | 7.00 | 3.75 | 2.75 |
| DAC (drivable-area compliance, %) ↑ | 82.00 | 96.00 | 93.75 | 98.00 |
| RP (route progress, %) ↑ | 92.37 | 93.51 | 98.12 | 98.32 |
| J_seq (planned jerk, m/s³) ↓ | 0.12 | 0.09 | 0.07 | 0.06 |
| J_exec (executed jerk, m/s³) ↓ | 2.13 | 1.65 | 1.32 | 1.19 |
With 10 ODE integration steps the planner is safe and accurate almost everywhere: collisions are rare, drivable-area compliance is high, and the ego completes nearly the full route on average. Strikingly, OOD performance is on par with — and on several metrics slightly better than — in-distribution performance, even though the model has never seen highways at training time. Our highway scenarios simply offer fewer hard decision points (no merges, no tight turns, no intersection negotiation) and a lower density of close-range interactions.
The one place the numbers reveal a real weakness is the gap between J_seq and J_exec: planned control profiles are extremely smooth, but the executed trajectory is jerkier, which means the model is somewhat sensitive to small frame-to-frame changes in the BEV. Adding temporal context (a stack of past BEV rasters) is a natural next step.
We benchmarked Euler, midpoint and RK4 solvers across a wide range of NFE (number of function evaluations) values and observed no meaningful difference in closed-loop behaviour between Euler and the higher-order alternatives, so we adopt Euler for its simplicity. At NFE = 10 the full forward pass takes ~25 ms on an NVIDIA RTX A6000 — comfortably inside the 50 ms budget required to re-plan at 20 Hz.
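Putting the pieces together, the planning call at NFE = 10 looks roughly like this; `bev_encoder` and `vector_field` are placeholder names for the two halves of the architecture:

```python
import torch

@torch.no_grad()
def plan(bev_encoder, vector_field, bev, nfe=10):
    """Deterministic Euler sampler: encode the BEV once, then integrate the
    learned ODE from x_0 = 0 in `nfe` fixed steps."""
    c = bev_encoder(bev)                          # heavy encoder: once per planning cycle
    x = torch.zeros(1, 64, 2, device=bev.device)  # deterministic start, x_0 = 0
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((1,), i * dt, device=bev.device)
        x = x + dt * vector_field(x, t, c)        # lightweight U-Net: once per step
    return x                                      # 64 normalised (a, kappa) controls
```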
A selection of in-distribution and out-of-distribution episodes driven by the model in closed-loop.
Two design choices, we believe, do most of the work. The BEV representation strips away scenario-specific visual details and presents the model with a structured, geometry-centric view of the scene; spatial relationships such as "slow down near obstacles" or "stay inside the drivable area" are encoded in a way that transfers naturally from urban streets to highways. The flow-matching objective, in turn, shapes a smooth vector field over control sequences, and smooth fields tend to degrade gracefully under distribution shift rather than fail catastrophically.
The dominant failure mode in both ID and OOD episodes is aggressive non-ego behaviour: agents that spawn right behind the ego with a large speed differential and don't brake hard enough, or that cut in during merges and turns with an insufficient gap. Because our planner is purely frame-based — it sees a single BEV snapshot with no temporal history — it cannot anticipate these manoeuvres far enough in advance to react in time. A second, smaller weakness is sharp-turn handling: the curvature distribution in our training data has a thin tail, so the model tends to cut corners slightly and then overshoot in tight manoeuvres.
The most natural extension is adding temporal context: feeding a short stack of past BEV rasters should help the model infer agent intent and close the J_seq–J_exec gap. Other promising directions include a more defensive deterministic planner for data collection (so the model sees demonstrations of how to handle near-misses), sim-to-real validation on the Lexus testing platform, and integration into established public benchmarks for direct head-to-head comparison with other planners.
A conditional flow-matching planner conditioned on a BEV raster can produce smooth, real-time control trajectories for autonomous driving — and, perhaps surprisingly, it generalises well beyond the narrow distribution it was trained on. We hope the recipe and the accompanying release of training data, scenarios and the HD map will make it easier for the community to investigate generalisation in learned planners.
For questions or feedback about this work, feel free to get in touch via email.