In robotic manipulation, capturing the multi-modal distribution of action sequences is essential for learning robust and reliable imitation policies. Offline expert trajectories often admit multiple valid actions for the same or similar observations, which complicates learning from offline data. Existing methods either collapse multiple valid actions into a single mean, introduce temporal discontinuities, or switch randomly among modes. To address these issues, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage imitation framework that explicitly separates primary mode selection from continuous action generation. PF-DAG first learns a discrete vocabulary of primary modes together with a lightweight policy that greedily selects a coherent mode. It then introduces a mode-conditioned MeanFlow policy, a one-step continuous decoder that generates high-fidelity actions conditioned on the selected mode and the current observation. This explicit two-stage decomposition preserves intra-mode variation while reducing mode bouncing by enforcing stable primary choices. We validate PF-DAG with theoretical and empirical evidence, showing consistent improvements in accuracy, stability, and sample efficiency over diffusion- and flow-based baselines across 56 simulated manipulation tasks and real-world tactile dexterous manipulation.
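As a small illustration of the first stage, the sketch below selects a primary mode as the nearest entry of a discrete codebook of coarse action prototypes. The codebook contents, shapes, and the `select_primary_mode` helper are hypothetical stand-ins chosen for the example, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook of K primary modes, each a coarse action
# prototype of length T (shapes are illustrative, not from the paper).
K, T = 8, 16
codebook = rng.normal(size=(K, T))

def select_primary_mode(action_chunk, codebook):
    """Greedy primary-mode selection: nearest codebook entry in L2 distance."""
    dists = np.linalg.norm(codebook - action_chunk, axis=1)
    return int(np.argmin(dists))

# A noisy copy of prototype 3 maps back to mode 3.
chunk = codebook[3] + 0.05 * rng.normal(size=T)
print(select_primary_mode(chunk, codebook))  # → 3
```

In a learned VQ-VAE the codebook entries would be trained embeddings and the selection would happen in latent space, but the nearest-neighbor lookup shown here is the core quantization step.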
This figure compares different imitation policies. Behavioral cloning collapses its predictions into a single mean; a discrete policy succeeds but introduces temporal discontinuities; a generative policy bounces between modes 1 and 2; our PF-DAG predicts a consistent, fine-grained trajectory.
This figure shows the architecture of our PF-DAG framework. Observation features are extracted by the Observation Feature Extraction module and fed to the Primary Mode Policy. Ground-truth action chunks are compressed into discrete primary modes by a VQ-VAE, which supervise the Primary Mode Policy; this supervision is used only during training. The Mode-Conditioned MeanFlow Policy takes the selected primary mode and the observation features as input and generates high-fidelity continuous actions.
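To make the second stage concrete, here is a minimal sketch of MeanFlow-style one-step decoding, assuming a trained average-velocity field u so that an action is produced from a single noise sample as z - u(z, 0, 1). The `mean_velocity` and `one_step_decode` names, the linear stand-in network, and all dimensions are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
ACTION_DIM = 7  # illustrative action dimension, not from the paper

def mean_velocity(z, mode_embed, obs_feat):
    """Stand-in for a trained average-velocity network u(z, r=0, t=1 | mode, obs).
    Here a fixed linear map; in PF-DAG this would be a learned network
    conditioned on the selected primary mode and observation features."""
    return z - (mode_embed + obs_feat)

def one_step_decode(mode_embed, obs_feat):
    """MeanFlow-style one-step sampling: action = z - u(z, 0, 1)."""
    z = rng.normal(size=ACTION_DIM)  # single noise draw, no iterative ODE solve
    return z - mean_velocity(z, mode_embed, obs_feat)

mode_embed = np.ones(ACTION_DIM)
obs_feat = 0.5 * np.ones(ACTION_DIM)
action = one_step_decode(mode_embed, obs_feat)  # equals mode_embed + obs_feat here
```

The point of the one-step decoder is that, unlike multi-step diffusion or flow sampling, a single evaluation of the average-velocity field yields the final action, which keeps inference latency low.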
This figure visualizes the primary modes learned by our PF-DAG framework, showing how different modes capture distinct coarse action prototypes.
This figure shows the real-world experimental setup for tactile dexterous manipulation, including the robotic hand and the objects used in the experiments.
This figure compares the performance of different methods in real-world tactile dexterous manipulation tasks, demonstrating the superiority of our PF-DAG framework.
@inproceedings{lei2026primaryfine,
  title={Primary-Fine Decoupling for Action Generation in Robotic Imitation},
  author={Xiaohan Lei and Min Wang and Wengang Zhou and Xingyu Lu and Houqiang Li},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=wySMuWHmt4}
}