Action Chunking Transformer (ACT) with a VAE latent over action chunks, trained on expert demonstrations; outperforms behavior cloning baselines by roughly 30% on constrained peg placement tasks.
This project implements Action Chunking Transformers (ACT) for learning peg placement manipulation from expert demonstrations. Rather than predicting one action per step, ACT predicts chunks of future actions conditioned on a latent variable from a Variational Autoencoder (VAE), yielding smoother and more consistent policies than step-wise behavior cloning.
Expert Demonstration Collection: Demonstrations were collected via scripted oracle policies in simulation, capturing diverse approach trajectories and insertion attempts for the peg-in-hole task.
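A scripted oracle for peg-in-hole can be sketched as a two-phase controller: align above the hole, then descend and insert, with action noise producing diverse trajectories. The environment interface, observation layout, and function names below are illustrative assumptions, not the project's actual API:

```python
import numpy as np

def scripted_oracle(obs, hole_pos, rng, noise_scale=0.01):
    """Hypothetical oracle policy for peg-in-hole.

    Assumes obs[:3] is the end-effector xyz position; Gaussian action
    noise yields the diverse approach trajectories mentioned above."""
    ee_pos = obs[:3]
    above_hole = hole_pos + np.array([0.0, 0.0, 0.05])
    if np.linalg.norm(ee_pos[:2] - hole_pos[:2]) > 0.01:
        target = above_hole          # phase 1: align above the hole
    else:
        target = hole_pos            # phase 2: descend and insert
    action = np.clip(target - ee_pos, -0.05, 0.05)  # bounded delta-position action
    return action + rng.normal(0.0, noise_scale, size=3)

def collect_demo(env, hole_pos, rng, max_steps=200):
    """Roll out the oracle in a gym-like env (assumed interface) and record (obs, action) pairs."""
    obs_list, act_list = [], []
    obs = env.reset()
    for _ in range(max_steps):
        act = scripted_oracle(obs, hole_pos, rng)
        obs_list.append(obs)
        act_list.append(act)
        obs, done = env.step(act)
        if done:
            break
    return np.stack(obs_list), np.stack(act_list)
```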
Action Chunking Transformer: The ACT architecture encodes sequences of actions into a latent space using a VAE. At inference time, the policy predicts chunks of future actions jointly, reducing compounding errors that plague step-by-step cloning.
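At deployment, overlapping chunks can be combined by temporal ensembling: every chunk that covers the current timestep contributes its prediction, weighted exponentially by prediction age, which is a key ingredient in how ACT smooths its open-loop chunks. A minimal sketch, assuming a buffer of `(start_time, chunk)` pairs (the buffer format and parameter `m` are illustrative):

```python
import numpy as np

def temporal_ensemble(chunk_buffer, t, m=0.1):
    """Blend all chunk predictions that cover timestep t.

    chunk_buffer: list of (start_time, chunk) pairs, chunk of shape (k, action_dim).
    Weight exp(-m * i) with i = 0 for the oldest prediction, following the
    exponential scheme used in ACT-style temporal ensembling."""
    preds, weights = [], []
    for i, (start, chunk) in enumerate(chunk_buffer):
        offset = t - start
        if 0 <= offset < len(chunk):          # does this chunk cover timestep t?
            preds.append(chunk[offset])
            weights.append(np.exp(-m * i))
    w = np.array(weights) / np.sum(weights)   # normalize to a convex combination
    return (np.stack(preds) * w[:, None]).sum(axis=0)
```

Because each executed action averages several independently predicted chunks, a single bad prediction is damped rather than compounding into the errors that plague step-by-step cloning.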
Training Improvements: Latent-space smoothing was applied to handle multi-modal demonstration distributions. A teacher-forcing curriculum gradually reduced ground-truth conditioning during training to improve closed-loop performance.
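A teacher-forcing curriculum of this kind is usually just a decaying probability of conditioning on ground truth. The linear schedule and function names below are a hypothetical sketch of the idea, not the project's exact schedule:

```python
import numpy as np

def teacher_forcing_prob(epoch, total_epochs, p_start=1.0, p_end=0.0):
    """Linearly decay the probability of conditioning on ground-truth actions
    from p_start at epoch 0 to p_end at the final epoch (assumed schedule)."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return p_start + (p_end - p_start) * frac

def maybe_teacher_force(gt_action, pred_action, p, rng):
    """With probability p feed back the ground-truth action; otherwise the
    model's own prediction, so training gradually matches closed-loop use."""
    return gt_action if rng.random() < p else pred_action
```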
ACT outperformed the behavior cloning baseline by approximately 30% in success rate on constrained peg placement. Latent-space smoothing also enabled transfer to unseen initial configurations, indicating that the learned representations generalized beyond the training distribution.