Real-time planning under multi-modal uncertainty is one of the hardest problems in autonomous driving. Classical modular stacks (perception → prediction → planning) are brittle outside the situations they were hand-engineered for, while modern diffusion-based planners produce rich, multi-modal trajectories at the cost of many denoising steps — too slow for safety-critical closed-loop replanning.
We take a different route: a conditional flow-matching planner that directly outputs actionable control sequences (acceleration and curvature) from a bird's-eye-view (BEV) raster of the scene. By parameterising a smooth ODE vector field, the model reaches diffusion-level quality in a handful of integration steps, comfortably within a real-time budget on a single GPU.
The central question we ask is about generalisation: can a generative planner trained on a narrow distribution still drive well on situations it has never seen? To find out, we train the model exclusively on urban streets, intersections and roundabouts of the city of Parma, Italy, and then evaluate it in closed-loop on both held-out urban scenarios and multi-lane highways — geometries, speeds and dynamics that are entirely absent from the training data. The model generalises reliably to these conditions without any fine-tuning, and we credit this to two design choices: the BEV representation, which abstracts the scene into geometry, and the flow-matching formulation, whose smooth vector field degrades gracefully under distribution shift.
We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real city streets, intersections and roundabouts of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed-loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalises reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide extensive qualitative analysis and video demonstrations of the closed-loop behaviour.
All training and evaluation is carried out in an in-house 2D traffic simulator that operates at object level on an HD map of the city of Parma, derived from OpenStreetMap and manually refined to encode lane geometry and connectivity, drivable area boundaries, speed limits, stop and yield lines, and traffic-light positions. From this map we crop 50 distinct scenarios: 40 in-distribution urban configurations (single- and dual-lane streets, T- and four-way intersections, roundabouts) and 10 reserved for out-of-distribution evaluation (highway segments and held-out urban configurations in geographically disjoint map regions, to avoid any leakage between train and test).
The simulator supports heterogeneous reactive traffic: vehicles driven by an Intelligent Driver Model for longitudinal control and a Pure Pursuit controller for lateral tracking — each parameterised with a configurable aggressiveness profile — alongside pedestrians, cyclists, slow-moving vehicle queues, and static obstacles. Crucially, in closed-loop evaluation all non-ego agents re-plan at every timestep in response to the ego's actual trajectory: collisions are emergent, not scripted.
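For intuition, here is a minimal sketch of the Intelligent Driver Model's longitudinal law. The parameter values are illustrative textbook defaults, not the simulator's actual aggressiveness profiles:

```python
import math

def idm_acceleration(v, v_lead, gap, v0=13.9, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4.0):
    """Intelligent Driver Model: longitudinal acceleration for a following vehicle.

    v      -- own speed (m/s); v_lead -- leader speed (m/s); gap -- distance to leader (m)
    v0     -- desired free-flow speed; T -- desired time headway (s)
    a_max  -- comfortable acceleration; b -- comfortable braking; s0 -- minimum gap (m)
    """
    dv = v - v_lead                                   # closing speed
    # desired dynamic gap: grows with speed and with how fast we are closing in
    s_star = s0 + v * T + (v * dv) / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (max(s_star, 0.0) / max(gap, 1e-3)) ** 2)
```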
Only successful ego episodes are kept, and frames are actively curated for diversity by binning the acceleration–curvature space and oversampling under-represented regions. The end result is a dataset that is dense in informative driving situations rather than dominated by trivial straight-line cruising.
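A minimal sketch of one way to implement this curation; the bin count and the inverse-frequency weighting are our assumptions, not the pipeline's exact settings:

```python
import numpy as np

def curation_weights(accels, curvatures, n_bins=20):
    """Sampling weights that oversample rare regions of the acceleration-curvature space.

    Frames in densely populated bins (straight-line cruising near zero curvature)
    receive low weight; frames in sparse bins are oversampled.
    """
    hist, a_edges, k_edges = np.histogram2d(
        accels, curvatures, bins=n_bins,
        range=[[-3.0, 2.0], [-0.2, 0.2]],             # control bounds used in the post
    )
    a_idx = np.clip(np.digitize(accels, a_edges) - 1, 0, n_bins - 1)
    k_idx = np.clip(np.digitize(curvatures, k_edges) - 1, 0, n_bins - 1)
    weights = 1.0 / np.maximum(hist[a_idx, k_idx], 1.0)   # inverse bin frequency
    return weights / weights.sum()                    # normalised sampling distribution
```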
Each frame is rasterised into a four-channel 192 × 192 m BEV at 0.25 m/pixel (768 × 768 pixels), centred on the ego vehicle. Pixel intensities carry semantic information: non-ego agents are rendered with brightness proportional to their speed, the drivable-area channel encodes the local speed limit, the route channel carries the ego's current speed, and a regulations channel marks stop lines, yield lines and traffic lights with distinct values.
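As a rough illustration of the encoding, a stripped-down rasteriser for the agents channel alone; the real renderer draws oriented footprints and fills all four channels, and the `v_max` normalisation constant is an assumption:

```python
import numpy as np

SIZE_M, RES = 192.0, 0.25                  # 192 x 192 m window at 0.25 m/pixel
PX = int(SIZE_M / RES)                     # 768 x 768 raster

def rasterise_agents(agents, v_max=30.0):
    """Agents channel only: each non-ego agent is drawn as an axis-aligned
    footprint whose intensity is proportional to its speed.

    `agents` is an iterable of (x, y, length, width, speed) tuples in the
    ego frame (m, m/s); the ego vehicle sits at the raster centre.
    """
    channel = np.zeros((PX, PX), dtype=np.float32)
    for x, y, length, width, speed in agents:
        cx, cy = int(x / RES + PX / 2), int(y / RES + PX / 2)   # metres -> pixels
        hl, hw = int(length / RES / 2), int(width / RES / 2)
        intensity = min(speed / v_max, 1.0)                     # brightness encodes speed
        channel[max(cy - hw, 0):cy + hw + 1, max(cx - hl, 0):cx + hl + 1] = intensity
    return channel
```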
During training we randomly perturb the poses and speeds of non-ego agents in the BEV raster, mimicking the kind of imperfect perception a real-world stack would deliver and encouraging the planner to be robust to small inaccuracies.
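A sketch of what this augmentation can look like; the noise scales below are illustrative assumptions:

```python
import numpy as np

def perturb_agents(agents, rng, pos_std=0.3, speed_std=0.5):
    """Jitter non-ego poses and speeds before rasterisation to emulate
    imperfect perception. `pos_std` (m) and `speed_std` (m/s) are guesses."""
    perturbed = []
    for x, y, length, width, speed in agents:
        perturbed.append((
            x + rng.normal(0.0, pos_std),
            y + rng.normal(0.0, pos_std),
            length, width,
            max(speed + rng.normal(0.0, speed_std), 0.0),   # speed stays non-negative
        ))
    return perturbed

# usage: perturb_agents(agents, np.random.default_rng(0))
```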
The supervision signal is a sequence of 64 future controls at 20 Hz (a 3.2 s horizon), each a pair (acceleration, curvature) with a ∈ [−3, 2] m/s² and κ ∈ [−0.2, 0.2] m⁻¹. Both are normalised before being fed to the model.
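For concreteness, a min-max normalisation over the stated control bounds; the target range [−1, 1] is our assumption, since the post only states that the controls are normalised:

```python
A_MIN, A_MAX = -3.0, 2.0       # acceleration bounds (m/s^2)
K_MIN, K_MAX = -0.2, 0.2       # curvature bounds (m^-1)

def normalise(a, k):
    """Min-max scale (acceleration, curvature) to [-1, 1] (assumed target range)."""
    return (2.0 * (a - A_MIN) / (A_MAX - A_MIN) - 1.0,
            2.0 * (k - K_MIN) / (K_MAX - K_MIN) - 1.0)

def denormalise(a_n, k_n):
    """Invert `normalise` to recover physical controls before execution."""
    return ((a_n + 1.0) / 2.0 * (A_MAX - A_MIN) + A_MIN,
            (k_n + 1.0) / 2.0 * (K_MAX - K_MIN) + K_MIN)
```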
Our architecture is built around one hard constraint: real-time closed-loop replanning at 20 Hz. We split the workload into a heavy one-time scene encoder and a lightweight iterative generator, so the expensive part of the computation runs once per planning cycle and the cheap part runs once per ODE integration step.
Flow matching learns a time-dependent vector field u_t that transports a simple prior p_init into the data distribution p_data via an ODE. Following the rectified-flow recipe, we use the linear interpolation x_t = t·z + (1 − t)·ε with z ∼ p_data and ε ∼ 𝒩(0, I), so the regression target z − ε is independent of t and interpolation trajectories are straight lines between noise and data. This is what makes a handful of Euler steps sufficient for high-quality sampling. At inference we start deterministically from x_0 = 0 so that, given a BEV, the model produces a reproducible control sequence.
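In code, the training objective reduces to a few lines; the `model(x_t, t, c)` signature below is a placeholder for our conditional U-Net:

```python
import torch

def flow_matching_loss(model, z, c):
    """Conditional flow-matching loss with the rectified-flow interpolation.

    z -- ground-truth control sequences, shape (B, 64, 2); c -- BEV embedding.
    Under x_t = t*z + (1 - t)*eps the regression target is z - eps, so the
    model learns the constant straight-line velocity between noise and data.
    """
    eps = torch.randn_like(z)                              # sample from the prior
    t = torch.rand(z.shape[0], 1, 1, device=z.device)      # uniform interpolation time
    x_t = t * z + (1.0 - t) * eps                          # straight-line interpolant
    pred = model(x_t, t.view(-1), c)                       # predicted vector field u_t
    return torch.nn.functional.mse_loss(pred, z - eps)
```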
A CNN encodes the four-channel BEV raster into a compact spatial embedding c. This embedding is computed once per planning cycle and reused across every ODE integration step. A lightweight U-Net then parameterises the vector field u_t(· | c): it ingests the current noisy control sequence and the interpolation time t, and is conditioned on c via cross-attention at the U-Net's skip connections and mid-block. The full model has about 15M parameters, of which ~85% live inside the BEV CNN — the part we only run once.
We train both the BEV encoder and the U-Net jointly, end-to-end, from scratch, using the standard conditional flow-matching loss with AdamW (β₁ = 0.9, β₂ = 0.999, weight decay 0.1), peak learning rate 5 × 10⁻⁴, linear warmup over the first 5% of steps and cosine annealing afterwards. Training runs for 300k steps at batch size 32 and takes ~3 days on a single NVIDIA RTX A6000.
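In PyTorch the schedule can be reproduced roughly as follows, assuming `model` is the planner network (BEV encoder plus U-Net):

```python
import math
import torch

TOTAL_STEPS = 300_000
WARMUP = int(0.05 * TOTAL_STEPS)           # linear warmup over the first 5% of steps

def lr_lambda(step):
    """Linear warmup to the peak LR, then cosine annealing towards zero."""
    if step < WARMUP:
        return step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.999), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```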
Every 50 ms the evaluation loop renders a fresh BEV from the current ego state and the surrounding scene, integrates the learned ODE to produce 64 future controls, and applies only the first (a₁, κ₁) through the dynamic model. All non-ego agents respond reactively to the trajectory the ego actually drives. We evaluate in two regimes — in-distribution (urban / intersections / roundabouts) and out-of-distribution (highways and held-out urban scenarios) — over 400 ID and 100 OOD episodes with fixed initial conditions.
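The control-to-motion step can be sketched with a simple kinematic model; this is one plausible reading of the dynamic model, not necessarily its exact implementation:

```python
import math

def step_ego(state, a, kappa, dt=0.05):
    """Advance the ego one tick (50 ms) under the first planned control pair.

    `state` = (x, y, heading, speed); yaw rate follows from speed * curvature.
    """
    x, y, theta, v = state
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += v * kappa * dt                # curvature kappa sets the turn rate
    v = max(v + a * dt, 0.0)               # integrate acceleration, no reversing
    return (x, y, theta, v)
```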
| Metric | NFE = 1, ID | NFE = 1, OOD | NFE = 10, ID | NFE = 10, OOD |
|---|---|---|---|---|
| CR (collision rate, %) ↓ | 9.25 | 7.00 | 3.75 | 2.75 |
| DAC (drivable-area compliance, %) ↑ | 82.00 | 96.00 | 93.75 | 98.00 |
| RP (route progress, %) ↑ | 92.37 | 93.51 | 98.12 | 98.32 |
| J_seq (planned jerk, m/s³) ↓ | 0.12 | 0.09 | 0.07 | 0.06 |
| J_exec (executed jerk, m/s³) ↓ | 2.13 | 1.65 | 1.32 | 1.19 |
With 10 ODE integration steps the planner is safe and accurate almost everywhere: collisions are rare, drivable-area compliance is high, and the ego completes nearly the full route on average. Strikingly, OOD performance is on par with — and on several metrics slightly better than — in-distribution performance, even though the model has never seen highways at training time. Our highway scenarios simply offer fewer hard decision points (no merges, no tight turns, no intersection negotiation) and a lower density of close-range interactions.
The one place the numbers reveal a real weakness is the gap between J_seq and J_exec: planned control profiles are extremely smooth, but the executed trajectory is jerkier, which means the model is somewhat sensitive to small frame-to-frame changes in the BEV. Adding temporal context (a stack of past BEV rasters) is a natural next step.
We benchmarked Euler, midpoint and RK4 solvers across a wide range of NFE (number of function evaluations) values and observed no meaningful difference in closed-loop behaviour between Euler and the higher-order alternatives, so we adopt Euler for its simplicity. At NFE = 10 the full forward pass takes ~25 ms on an NVIDIA RTX A6000 — comfortably inside the 50 ms budget required to re-plan at 20 Hz.
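Putting the pieces together, the planning call at NFE = 10 looks roughly like this; `bev_encoder` and `vector_field` are placeholder names for the two halves of the architecture:

```python
import torch

@torch.no_grad()
def plan(bev_encoder, vector_field, bev, nfe=10):
    """Deterministic Euler sampler: encode the BEV once, then integrate the
    learned ODE from x_0 = 0 in `nfe` fixed steps."""
    c = bev_encoder(bev)                          # heavy encoder: once per planning cycle
    x = torch.zeros(1, 64, 2, device=bev.device)  # deterministic start, x_0 = 0
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((1,), i * dt, device=bev.device)
        x = x + dt * vector_field(x, t, c)        # lightweight U-Net: once per step
    return x                                      # 64 normalised (a, kappa) controls
```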
A selection of in-distribution and out-of-distribution episodes driven by the model in closed-loop.
Two design choices, we believe, do most of the work. The BEV representation strips away scenario-specific visual details and presents the model with a structured, geometry-centric view of the scene; spatial relationships such as "slow down near obstacles" or "stay inside the drivable area" are encoded in a way that transfers naturally from urban streets to highways. The flow-matching objective, in turn, shapes a smooth vector field over control sequences, and smooth fields tend to degrade gracefully under distribution shift rather than fail catastrophically.
The dominant failure mode in both ID and OOD episodes is aggressive non-ego behaviour: agents that spawn right behind the ego with a large speed differential and don't brake hard enough, or that cut in during merges and turns with an insufficient gap. Because our planner is purely frame-based — it sees a single BEV snapshot with no temporal history — it cannot anticipate these manoeuvres far enough in advance to react in time. A second, smaller weakness is sharp-turn handling: the curvature distribution in our training data has a thin tail, so the model tends to cut corners slightly and then overshoot in tight manoeuvres.
The most natural extension is adding temporal context: feeding a short stack of past BEV rasters should help the model infer agent intent and close the J_seq–J_exec gap. Other promising directions include a more defensive deterministic planner for data collection (so the model sees demonstrations of how to handle near-misses), sim-to-real validation on the Lexus testing platform, and integration into established public benchmarks for direct head-to-head comparison with other planners.
A conditional flow-matching planner conditioned on a BEV raster can produce smooth, real-time control trajectories for autonomous driving — and, perhaps surprisingly, it generalises well beyond the narrow distribution it was trained on. We hope the recipe and the accompanying release of training data, scenarios and the HD map will make it easier for the community to investigate generalisation in learned planners.
For questions or feedback about this work, feel free to get in touch via email.