In robotic manipulation, capturing the multi-modal distribution of action sequences is essential for learning robust and reliable imitation policies. Offline expert trajectories often admit multiple valid actions for the same or similar observations, which complicates learning from offline data. Existing methods either collapse multiple valid actions into a single mean, introduce temporal discontinuities, or switch randomly among modes. To address these issues, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage imitation framework that explicitly separates primary mode selection from continuous action generation. PF-DAG first learns a discrete vocabulary of primary modes together with a lightweight policy that greedily selects a coherent mode. It then introduces a mode-conditioned MeanFlow policy, a one-step continuous decoder that generates high-fidelity actions conditioned on the selected mode and the current observation. This explicit two-stage decomposition preserves intra-mode variation while reducing mode bouncing by enforcing stable primary choices. We validate PF-DAG with theoretical and empirical evidence, showing consistent improvements in accuracy, stability, and sample efficiency over diffusion- and flow-based baselines across 56 simulated manipulation tasks and real-world tactile dexterous manipulation.
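As a small illustration of the first stage, the sketch below selects a primary mode as the nearest entry of a discrete codebook of coarse action prototypes. The codebook contents, shapes, and the `select_primary_mode` helper are hypothetical stand-ins chosen for the example, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook of K primary modes, each a coarse action
# prototype of length T (shapes are illustrative, not from the paper).
K, T = 8, 16
codebook = rng.normal(size=(K, T))

def select_primary_mode(action_chunk, codebook):
    """Greedy primary-mode selection: nearest codebook entry in L2 distance."""
    dists = np.linalg.norm(codebook - action_chunk, axis=1)
    return int(np.argmin(dists))

# A noisy copy of prototype 3 maps back to mode 3.
chunk = codebook[3] + 0.05 * rng.normal(size=T)
print(select_primary_mode(chunk, codebook))  # → 3
```

In a learned VQ-VAE the codebook entries would be trained embeddings and the selection would happen in latent space, but the nearest-neighbor lookup shown here is the core quantization step.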
This figure compares different imitation policies. Behavioral cloning collapses its predictions into a single mean; a discrete policy succeeds but introduces temporal discontinuities; a generative policy bounces between modes 1 and 2; our PF-DAG predicts a consistent, fine-grained trajectory.
This figure shows the architecture of our PF-DAG framework. Observation features are extracted by the Observation Feature Extraction module and fed to the Primary Mode Policy. Ground-truth action chunks are compressed into discrete primary modes by a VQ-VAE, which supervise the Primary Mode Policy; this supervision is used only during training. The Mode-Conditioned MeanFlow Policy takes the selected primary mode and the observation features as input and generates high-fidelity continuous actions.
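To make the second stage concrete, here is a minimal sketch of MeanFlow-style one-step decoding, assuming a trained average-velocity field u so that an action is produced from a single noise sample as z - u(z, 0, 1). The `mean_velocity` and `one_step_decode` names, the linear stand-in network, and all dimensions are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
ACTION_DIM = 7  # illustrative action dimension, not from the paper

def mean_velocity(z, mode_embed, obs_feat):
    """Stand-in for a trained average-velocity network u(z, r=0, t=1 | mode, obs).
    Here a fixed linear map; in PF-DAG this would be a learned network
    conditioned on the selected primary mode and observation features."""
    return z - (mode_embed + obs_feat)

def one_step_decode(mode_embed, obs_feat):
    """MeanFlow-style one-step sampling: action = z - u(z, 0, 1)."""
    z = rng.normal(size=ACTION_DIM)  # single noise draw, no iterative ODE solve
    return z - mean_velocity(z, mode_embed, obs_feat)

mode_embed = np.ones(ACTION_DIM)
obs_feat = 0.5 * np.ones(ACTION_DIM)
action = one_step_decode(mode_embed, obs_feat)  # equals mode_embed + obs_feat here
```

The point of the one-step decoder is that, unlike multi-step diffusion or flow sampling, a single evaluation of the average-velocity field yields the final action, which keeps inference latency low.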
This figure visualizes the primary modes learned by our PF-DAG framework, showing how different modes capture distinct coarse action prototypes.
This figure shows the real-world experimental setup for tactile dexterous manipulation, including the robotic hand and the objects used in the experiments.
This figure compares the performance of different methods in real-world tactile dexterous manipulation tasks, demonstrating the superiority of our PF-DAG framework.
@inproceedings{lei2026primaryfine,
  title={Primary-Fine Decoupling for Action Generation in Robotic Imitation},
  author={Xiaohan Lei and Min Wang and Wengang Zhou and Xingyu Lu and Houqiang Li},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=wySMuWHmt4}
}