Action Tokenizer Matters in In-Context Imitation Learning

IROS 2025

An Dinh Vuong1     Minh Nhat Vu2     Dong An1     Ian Reid1

1MBZUAI   2TU Wien

Abstract

In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely underexplored in ICIL. In this work, we first systematically evaluate existing action tokenizers in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates smoother, more stable actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, and real-world experiments confirm its ability to produce smoother, more reliable trajectories.

Smooth, stable trajectories from LipVQ-VAE in high-fidelity simulators.

Smoothness vs. Success

We examine the impact of the action tokenizer on in-context imitation learning. Our findings indicate that a smoother action representation correlates with a higher robotic manipulation success rate.

As shown in the figure below, a lower smoothness score reflects a smoother action representation — which in turn translates to better task performance across benchmarks.

Smoothness vs. Success

Figure: Relationship between smoothness of the action representation and robotic success rate.
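The smoothness score can be defined in several ways; as a concrete illustration, the sketch below scores a trajectory by its mean squared second difference, a discrete acceleration penalty where lower is smoother. The metric and the name `smoothness_score` are illustrative assumptions, not necessarily the exact score plotted in the figure above.

```python
# A minimal sketch of one plausible smoothness score: the mean squared
# second difference (discrete acceleration) of an action trajectory.
import numpy as np

def smoothness_score(actions: np.ndarray) -> float:
    """actions: (T, D) array of T consecutive D-dimensional actions.

    Lower is smoother: a constant-velocity trajectory scores ~0.
    """
    accel = np.diff(actions, n=2, axis=0)  # second-order differences, (T-2, D)
    return float(np.mean(np.sum(accel ** 2, axis=-1)))

# Example: a jerky trajectory scores higher than a linear ramp.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)[:, None]
smooth_traj = np.hstack([t, t])  # straight line in a 2-D action space
jerky_traj = smooth_traj + 0.05 * rng.standard_normal(smooth_traj.shape)
assert smoothness_score(smooth_traj) < smoothness_score(jerky_traj)
```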

LipVQ-VAE Action Tokenizer

LipVQ-VAE is our proposed action tokenizer, designed to address temporal smoothness in in-context imitation learning. We adopt an autoencoder framework in which an encoder maps continuous robot actions to a latent space, and the latents are then discretized through a quantized codebook lookup.
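For reference, the sketch below shows a standard VQ-VAE-style codebook lookup with a straight-through gradient estimator, assuming a PyTorch implementation; the codebook size, latent dimension, and the omitted commitment loss are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch of a quantized codebook lookup (standard VQ-VAE style).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z_e: (B, code_dim) continuous encoder outputs.
        # Nearest-neighbor lookup in the codebook (Euclidean distance).
        dists = torch.cdist(z_e, self.codebook.weight)  # (B, num_codes)
        indices = dists.argmin(dim=-1)                  # (B,)
        z_q = self.codebook(indices)                    # quantized latents
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```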

To ensure a smooth latent representation, we apply Lipschitz regularization by row-wise normalizing the weight matrix of each encoder layer. This enforces stability and continuity in the latent space, which translates to smoother decoded actions during execution.
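A minimal sketch of such a Lipschitz-constrained linear layer is shown below, following one common formulation of row-wise weight normalization for Lipschitz MLPs; the learnable per-layer bound `c` and the absolute-row-sum (infinity-norm) normalization are assumptions about the specific variant, not details confirmed above.

```python
# A hedged sketch of a Lipschitz-constrained linear layer via row-wise
# weight normalization; each row is rescaled so its absolute sum stays
# below a learnable bound, capping the layer's Lipschitz constant.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipschitzLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) / in_dim ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        # Learnable per-layer Lipschitz bound, kept positive via softplus.
        self.c = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bound = F.softplus(self.c)
        # Row-wise normalization: rescale only rows whose absolute sum
        # exceeds the bound, leaving already-compliant rows untouched.
        row_sums = self.weight.abs().sum(dim=1, keepdim=True) + 1e-8
        scale = torch.clamp(bound / row_sums, max=1.0)
        return F.linear(x, self.weight * scale, self.bias)
```

Stacking such layers bounds the encoder's overall Lipschitz constant by the product of the per-layer bounds, so nearby input actions map to nearby latents, which is the smoothness property the figure below depicts.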

LipVQ-VAE architecture

Figure: LipVQ-VAE architecture with Lipschitz-regularized encoder and codebook lookup.

Acknowledgements

This website template is adapted from HyperNeRF.