In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenization methods in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates smoother, more stable actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, and real-world experiments confirm its ability to produce smoother, more reliable trajectories.
Smooth, stable trajectories from LipVQ-VAE in high-fidelity simulators.
We examine the impact of the action tokenizer on in-context imitation learning. Our findings indicate that smoother action representations correlate with higher robotic manipulation success.
As shown in the figure below, a lower smoothness score reflects a smoother action representation, which in turn translates to better task performance across benchmarks.
Figure: Relationship between smoothness of the action representation and robotic success rate.
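The exact smoothness score is defined in the paper; as an illustrative sketch (names and metric choice are assumptions, not the paper's definition), one common way to quantify trajectory smoothness is the mean norm of second-order finite differences, where lower values mean smoother motion:

```python
import numpy as np

def smoothness_score(actions: np.ndarray) -> float:
    """Mean L2 norm of second-order finite differences of a trajectory.

    `actions` has shape (T, D): T timesteps, D action dimensions.
    Lower values indicate a smoother (less jerky) trajectory.
    """
    second_diff = np.diff(actions, n=2, axis=0)  # shape (T-2, D)
    return float(np.linalg.norm(second_diff, axis=1).mean())

# A straight-line ramp has zero curvature, so it scores (near) zero;
# adding noise to the same trajectory raises the score.
t = np.linspace(0.0, 1.0, 50)[:, None]
smooth_traj = np.hstack([t, 2.0 * t])  # linear motion in a 2-D action space
rng = np.random.default_rng(0)
noisy_traj = smooth_traj + 0.1 * rng.standard_normal(smooth_traj.shape)
print(smoothness_score(smooth_traj) < smoothness_score(noisy_traj))  # True
```

Under this kind of metric, "lower score = smoother actions" matches the trend shown in the figure.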
LipVQ-VAE is our proposed action tokenizer, designed to address temporal smoothness in in-context imitation learning.
We adopt an autoencoder framework that maps continuous robot actions to a latent space using a quantized codebook lookup.
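The codebook lookup itself is standard vector quantization: each encoder output is replaced by its nearest codebook entry. A minimal sketch (function and variable names are ours, not the paper's implementation):

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Nearest-neighbor codebook lookup used in VQ-style autoencoders.

    latents:  (N, D) continuous encoder outputs.
    codebook: (K, D) learned code vectors.
    Returns the quantized vectors (N, D) and the chosen indices (N,).
    """
    # Pairwise L2 distances between each latent and each code: (N, K)
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
latents = np.array([[0.1, 0.2], [0.9, 0.8]])
quantized, idx = quantize(latents, codebook)
print(idx)  # [0 1]
```

In training, the non-differentiable argmin is typically bypassed with a straight-through gradient estimator; the sketch above only shows the forward lookup.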
To ensure a smooth latent representation, we apply Lipschitz regularization by row-wise normalizing the weight matrix of each encoder layer.
This enforces stability and continuity in the latent space, which translates to smoother decoded actions during execution.
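One way to realize this row-wise normalization, sketched under our own assumptions (the paper's exact scheme may differ): if the bound is taken with respect to the infinity norm, a linear layer's Lipschitz constant is its maximum absolute row sum, so rescaling each row to keep that sum at most `c` caps the layer's Lipschitz constant at `c` (given 1-Lipschitz activations):

```python
import numpy as np

def lipschitz_normalize_rows(W: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Rescale each row of W so its absolute sum is at most c.

    The infinity-norm operator norm of a matrix equals its maximum
    absolute row sum, so this bounds the Lipschitz constant of the
    layer x -> W @ x + b (w.r.t. the infinity norm) by c.
    Rows already within the budget are left unchanged.
    """
    row_sums = np.abs(W).sum(axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(row_sums, 1e-12))
    return W * scale

W = np.array([[2.0, 2.0],    # row sum 4.0 -> rescaled down to 1.0
              [0.1, 0.1]])   # row sum 0.2 -> unchanged
W_norm = lipschitz_normalize_rows(W, c=1.0)
print(np.abs(W_norm).sum(axis=1))  # [1.  0.2]
```

Composing such bounded layers bounds the Lipschitz constant of the whole encoder, which is what keeps nearby input actions mapped to nearby latents.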
Figure: LipVQ-VAE architecture with Lipschitz-regularized encoder and codebook lookup.
This website template is adapted from HyperNeRF.