In robotic manipulation, capturing structural patterns in action sequences is essential for learning robust and reliable imitation policies. Offline expert trajectories often contain complex structural information that is difficult to model with existing methods. Current approaches face challenges such as losing structural consistency, introducing temporal discontinuities, or failing to capture long-range dependencies in action sequences. To address these issues, we propose the Structural Action Transformer (SAT), a novel framework that explicitly models structural patterns in action sequences using a transformer-based architecture. SAT first learns an embodied joint codebook that captures structural patterns from expert demonstrations. It then uses a structural policy to predict appropriate structural patterns over time, ensuring temporal consistency. Finally, a pattern-conditioned action policy generates continuous, fine-grained actions based on the predicted structural pattern and the current observation. This explicit structural modeling enables SAT to capture long-range dependencies and maintain temporal consistency in action sequences. We validate SAT with theoretical and empirical evidence, showing consistent improvements in accuracy, stability, and sample efficiency over state-of-the-art baselines across 56 simulated manipulation tasks and real-world 3D dexterous manipulation.
This figure illustrates the conceptual comparison of action chunk tokenization. (a) The conventional temporal-centric perspective structures actions as a sequence of T timesteps (the chunk length), where each token has dimension Da (the action dimension). (b) Our proposed structural-centric perspective reframes the action chunk as a sequence of Da joints, where each token's feature is that joint's temporal trajectory over T. This (Da, T) view naturally handles heterogeneous embodiments as a variable-length, unordered token sequence, which is a key feature of our approach.
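The two perspectives amount to a transpose of the same action chunk. A minimal NumPy sketch (the values of T and Da below are illustrative, not the paper's settings):

```python
import numpy as np

# Hypothetical action chunk: T = 8 timesteps, Da = 24 joint dimensions.
T, Da = 8, 24
rng = np.random.default_rng(0)
chunk = rng.standard_normal((T, Da))

# (a) Temporal-centric tokenization: T tokens, each of dimension Da.
temporal_tokens = chunk        # shape (T, Da)

# (b) Structural-centric tokenization: Da tokens, each a trajectory over T.
structural_tokens = chunk.T    # shape (Da, T)

assert temporal_tokens.shape == (T, Da)
assert structural_tokens.shape == (Da, T)
# A different embodiment with a different Da simply yields a different number
# of tokens; the per-token feature length T stays fixed.
```

Because the token count in view (b) equals the number of joints, an embodiment with more or fewer joints changes only the sequence length, not the token dimensionality.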
This figure shows the architecture of our SAT framework. The policy takes a history of 3D point clouds and a language instruction as input. Observation Tokenizer: each point cloud in the history is downsampled with Farthest Point Sampling (FPS) and encoded by PointNets into local geometric tokens and a global scene context; the language instruction is encoded by a T5 tokenizer. Structural Action Tokenizer: guided by the manipulator's morphology, the Embodied Joint Codebook produces structural-centric embeddings aligned with the action dimension, which are added to the noisy action tokens at each denoising timestep. Structural Action Transformer: a DiT with causal masking predicts the action velocity field, which is then integrated via an ODE solver to produce the final action chunk.
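The final integration step can be sketched with a plain Euler ODE solver: starting from Gaussian noise, the learned velocity field is applied over a unit time interval to produce the action chunk. This is a minimal stand-in, assuming a generic flow-matching setup; the velocity function, step count, and toy field below are illustrative, not the paper's actual model or solver.

```python
import numpy as np

def sample_action_chunk(velocity_fn, shape, n_steps=50, seed=0):
    """Euler-integrate a velocity field v(x, t) from noise (t=0) to t=1.

    velocity_fn is a stand-in for the DiT's predicted velocity field;
    in the real system it would be conditioned on observations and language.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # Euler step: x <- x + dt * v(x, t)
    return x

# Toy velocity field that contracts samples toward zero (illustrative only).
chunk = sample_action_chunk(lambda x, t: -x, shape=(8, 24))
assert chunk.shape == (8, 24)
```

A higher-order solver (e.g., midpoint or RK4) could replace the Euler loop without changing the interface.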
This figure visualizes the Embodied Joint Codebook learned by our SAT framework. The codebook is derived from the manipulator's morphology, defining each joint by a triplet: Embodiment ID (a unique identifier for the manipulator), Functional Category (the joint's functional role, e.g., CMC, MCP, PIP, DIP), and Rotation Axis (the joint's primary axis of motion). Each element indexes a separate learnable embedding table, and the final codebook embedding for a joint is the sum of its three component embeddings.
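The lookup-and-sum structure described above can be sketched with three embedding tables. The table sizes and embedding dimension here are placeholders, and the tables are random rather than learned; this only illustrates how the triplet indexes into the codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative)

# One learnable table per component of the joint triplet (sizes are placeholders).
embodiment_table = rng.standard_normal((4, D))   # e.g., 4 known embodiments
category_table   = rng.standard_normal((8, D))   # e.g., CMC, MCP, PIP, DIP, ...
axis_table       = rng.standard_normal((3, D))   # e.g., three rotation axes

def joint_embedding(embodiment_id, category_id, axis_id):
    """Codebook entry for one joint: the sum of its three component embeddings."""
    return (embodiment_table[embodiment_id]
            + category_table[category_id]
            + axis_table[axis_id])

e = joint_embedding(0, 2, 1)
assert e.shape == (D,)
```

Because the embedding is a sum of factored components, joints that share a functional category or rotation axis share parameters across embodiments, which is what lets the codebook generalize over heterogeneous morphologies.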
This figure shows the real-world experimental setup for 3D dexterous manipulation using the ShadowHand robotic hand. The setup includes the robotic hand, a camera system for capturing 3D point clouds, and various objects used in the manipulation tasks.
This figure compares the trajectories generated by different methods in real-world 3D dexterous manipulation tasks. Our SAT framework generates smooth, consistent trajectories that closely follow the expert demonstrations, while other methods produce trajectories with more errors and inconsistencies.
@inproceedings{lei2026structural,
  title     = {Structural Action Transformer for 3D Dexterous Manipulation},
  author    = {Xiaohan Lei and Min Wang and Bohong Weng and Wengang Zhou and Houqiang Li},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}