In robotic manipulation, capturing structural patterns in action sequences is essential for learning robust and reliable imitation policies. Offline expert trajectories often contain complex structural information that is difficult to model with existing methods. Current approaches face challenges such as losing structural consistency, introducing temporal discontinuities, or failing to capture long-range dependencies in action sequences. To address these issues, we propose the Structural Action Transformer (SAT), a novel framework that explicitly models structural patterns in action sequences using a transformer-based architecture. SAT first learns an embodied joint codebook that captures structural patterns from expert demonstrations. It then uses a structural policy to predict appropriate structural patterns over time, ensuring temporal consistency. Finally, a pattern-conditioned action policy generates continuous, fine-grained actions based on the predicted structural pattern and the current observation. This explicit structural modeling enables SAT to capture long-range dependencies and maintain temporal consistency in action sequences. We validate SAT with theoretical and empirical evidence, showing consistent improvements in accuracy, stability, and sample efficiency over state-of-the-art baselines across 56 simulated manipulation tasks and real-world 3D dexterous manipulation.
This figure illustrates the conceptual comparison of action chunk tokenization. (a) The conventional temporal-centric perspective structures actions as a sequence of T timesteps (the chunk length), where each token has dimension Da (the action dimension). (b) Our proposed structural-centric perspective reframes the action chunk as a sequence of Da joints, where each token's feature is that joint's temporal trajectory over T. This (Da, T) view naturally handles heterogeneous embodiments as a variable-length, unordered token sequence, which is a key feature of our approach.
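The two perspectives amount to a transpose of the same action chunk. A minimal NumPy sketch (the values of T and Da below are illustrative, not the paper's settings):

```python
import numpy as np

# Hypothetical action chunk: T = 8 timesteps, Da = 24 joint dimensions.
T, Da = 8, 24
rng = np.random.default_rng(0)
chunk = rng.standard_normal((T, Da))

# (a) Temporal-centric tokenization: T tokens, each of dimension Da.
temporal_tokens = chunk        # shape (T, Da)

# (b) Structural-centric tokenization: Da tokens, each a trajectory over T.
structural_tokens = chunk.T    # shape (Da, T)

assert temporal_tokens.shape == (T, Da)
assert structural_tokens.shape == (Da, T)
# A different embodiment with a different Da simply yields a different number
# of tokens; the per-token feature length T stays fixed.
```

Because the token count in view (b) equals the number of joints, an embodiment with more or fewer joints changes only the sequence length, not the token dimensionality.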
This figure shows the architecture of our SAT framework. The policy takes a history of 3D point clouds and a language instruction as input. Observation Tokenizer: each point cloud in the history is downsampled with Farthest Point Sampling (FPS) and encoded by PointNets into local geometric tokens and a global scene context; the language instruction is encoded by a T5 tokenizer. Structural Action Tokenizer: guided by the manipulator's morphology, the Embodied Joint Codebook produces structural-centric embeddings aligned with the action dimension, which are added to the noisy action tokens at each denoising timestep. Structural Action Transformer: a DiT with causal masking predicts the action velocity field, which is then integrated via an ODE solver to produce the final action chunk.
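The final integration step can be sketched with a plain Euler ODE solver: starting from Gaussian noise, the learned velocity field is applied over a unit time interval to produce the action chunk. This is a minimal stand-in, assuming a generic flow-matching setup; the velocity function, step count, and toy field below are illustrative, not the paper's actual model or solver.

```python
import numpy as np

def sample_action_chunk(velocity_fn, shape, n_steps=50, seed=0):
    """Euler-integrate a velocity field v(x, t) from noise (t=0) to t=1.

    velocity_fn is a stand-in for the DiT's predicted velocity field;
    in the real system it would be conditioned on observations and language.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # Euler step: x <- x + dt * v(x, t)
    return x

# Toy velocity field that contracts samples toward zero (illustrative only).
chunk = sample_action_chunk(lambda x, t: -x, shape=(8, 24))
assert chunk.shape == (8, 24)
```

A higher-order solver (e.g., midpoint or RK4) could replace the Euler loop without changing the interface.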
This figure visualizes the Embodied Joint Codebook learned by our SAT framework. The codebook is derived from the manipulator's morphology, defining each joint by a triplet: Embodiment ID (a unique identifier for the manipulator), Functional Category (the joint's functional role, e.g., CMC, MCP, PIP, DIP), and Rotation Axis (the joint's primary axis of motion). Each element indexes a separate learnable embedding table, and the final codebook embedding for a joint is the sum of its three component embeddings.
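The lookup-and-sum structure described above can be sketched with three embedding tables. The table sizes and embedding dimension here are placeholders, and the tables are random rather than learned; this only illustrates how the triplet indexes into the codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative)

# One learnable table per component of the joint triplet (sizes are placeholders).
embodiment_table = rng.standard_normal((4, D))   # e.g., 4 known embodiments
category_table   = rng.standard_normal((8, D))   # e.g., CMC, MCP, PIP, DIP, ...
axis_table       = rng.standard_normal((3, D))   # e.g., three rotation axes

def joint_embedding(embodiment_id, category_id, axis_id):
    """Codebook entry for one joint: the sum of its three component embeddings."""
    return (embodiment_table[embodiment_id]
            + category_table[category_id]
            + axis_table[axis_id])

e = joint_embedding(0, 2, 1)
assert e.shape == (D,)
```

Because the embedding is a sum of factored components, joints that share a functional category or rotation axis share parameters across embodiments, which is what lets the codebook generalize over heterogeneous morphologies.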
This figure shows the real-world experimental setup for 3D dexterous manipulation using the ShadowHand robotic hand. The setup includes the robotic hand, a camera system for capturing 3D point clouds, and various objects used in the manipulation tasks.
This figure compares the trajectories generated by different methods in real-world 3D dexterous manipulation tasks. Our SAT framework generates smooth, consistent trajectories that closely follow the expert demonstrations, while other methods produce trajectories with more errors and inconsistencies.
@inproceedings{lei2026structural,
  title     = {Structural Action Transformer for 3D Dexterous Manipulation},
  author    = {Xiaohan Lei and Min Wang and Bohong Weng and Wengang Zhou and Houqiang Li},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
}