AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

Under Review

Currently, the paper is under review. We will add the links as soon as the paper is published.
For now, our code and an additional report are provided as supplementary materials.






Fig. 1: Overview of AnoleVLA and real-world performance. (Left) Our deep SSM backbone processes language instructions and robot observations to generate trajectories, leveraging a two-stage training strategy with the acceleration loss for smooth control. (Right) Physical experiment results. The x and y-axes represent inference speed and the average success rate, respectively. AnoleVLA achieved the highest overall success rate. Notably, compared to $\pi_{0.5}$, AnoleVLA not only yields superior task performance but also demonstrates an inference speed approximately three times faster.

Abstract

In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions.

This task is essential for service robots operating in human environments, requiring safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have shown strong performance on this task, their deployment in resource-constrained environments remains challenging due to the computational cost of standard transformer backbones.

To tackle this limitation, we propose AnoleVLA, a lightweight VLA that employs a deep state space model to efficiently handle multimodal sequences. The model leverages its lightweight and fast sequential state modeling to process visual and textual inputs, allowing the robot to efficiently generate trajectories.

We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA model by 21 points in task success rate while achieving an inference speed approximately three times faster.

Physical Experiments (played at 2× speed)

Overview

AnoleVLA is closely related to Mamba-based multimodal large language models such as Cobra, VL-Mamba, and EMMA, whose architectural frameworks can be adapted to many existing Vision-Language-Action models. Because both Mamba and transformers operate as causal sequence models over a shared token space, Mamba can seamlessly replace the transformer backbone while preserving the upstream multimodal tokenization and the downstream action decoding pipeline.
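The backbone-swap idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (all names such as `DummyMamba`, `VLAPolicy`, and the toy cumulative-average recurrence are ours, not the paper's implementation): because both backbones map a token sequence of shape (T, D) to (T, D) causally, the backbone is a pluggable component while tokenization and action decoding stay unchanged.

```python
import numpy as np

class DummyMamba:
    """Stand-in for an SSM backbone: a causal cumulative average over tokens.
    Any causal sequence model with the same (T, D) -> (T, D) signature,
    e.g. a transformer with a causal mask, could be dropped in here."""
    def __call__(self, tokens):                  # tokens: (T, D)
        csum = np.cumsum(tokens, axis=0)
        steps = np.arange(1, tokens.shape[0] + 1)[:, None]
        return csum / steps                      # step t only sees tokens <= t

class VLAPolicy:
    def __init__(self, backbone, horizon, action_dim, token_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.backbone = backbone                 # interchangeable backbone
        self.head = rng.normal(size=(token_dim, horizon * action_dim))
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, state_tok, vision_toks, lang_toks):
        # Upstream: multimodal tokens concatenated into one shared sequence.
        seq = np.concatenate([state_tok, vision_toks, lang_toks], axis=0)
        h = self.backbone(seq)
        # Downstream: the final token predicts an H-step action chunk.
        return (h[-1] @ self.head).reshape(self.horizon, self.action_dim)

policy = VLAPolicy(DummyMamba(), horizon=8, action_dim=7, token_dim=16)
chunk = policy.forward(np.zeros((1, 16)), np.ones((4, 16)), np.ones((3, 16)))
print(chunk.shape)  # (8, 7)
```

Swapping `DummyMamba()` for any other causal sequence module leaves `VLAPolicy` untouched, which is the sense in which the replacement is seamless.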



Fig. 2: Model architecture of AnoleVLA. Multimodal tokens (proprioception, state delta, vision, language) are concatenated and processed by a Mamba backbone, and the final token predicts an $H$-step action chunk. The two-stage training supervises both velocities and their temporal differences to improve execution smoothness. In this figure, $\bm{s}^{(t)}$, $\Delta \bm{s}^{(t)}$, $\bm{x}^{(t)}_v$, and $\bm{x}_l$ represent the state, state delta, visual observation, and natural language instruction at time step $t$, respectively. On the right side, "pred." and "GT" denote the predicted outputs and the corresponding ground truth. Specifically, $\hat{\bm{y}}$ and $\bm{y}$ represent the predicted future actions and their ground truth, respectively. Furthermore, $\Delta \hat{\bm{y}}$ and $\Delta \bm{y}$ denote the temporal differences of $\hat{\bm{y}}$ and $\bm{y}$, respectively.

  • CORE NOVELTIES:
     1. We propose a lightweight, high-speed VLA model designed to operate efficiently even in resource-constrained environments.
     2. We employ Mamba as the backbone architecture, leveraging its computationally efficient sequence processing for VLA modeling.
     3. We introduce a two-stage training strategy that adds an acceleration loss in the second phase, complementing the velocity loss used in the initial phase.
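The two-stage objective above can be illustrated with a small sketch. This is an assumption-laden toy version (the L2 form, the weight `lam`, and all function names are ours, not the paper's exact formulation): the first phase supervises the predicted action chunk $\hat{\bm{y}}$ against $\bm{y}$, and the second phase adds a loss on their temporal differences $\Delta \hat{\bm{y}}$ and $\Delta \bm{y}$ to encourage smooth execution.

```python
import numpy as np

def velocity_loss(y_hat, y):
    # y_hat, y: (H, action_dim) predicted / ground-truth action chunk
    return np.mean((y_hat - y) ** 2)

def acceleration_loss(y_hat, y):
    # Temporal differences: dy[t] = y[t+1] - y[t]; penalizing their mismatch
    # discourages jerky predictions even when per-step errors are small.
    return np.mean((np.diff(y_hat, axis=0) - np.diff(y, axis=0)) ** 2)

def two_stage_loss(y_hat, y, stage, lam=1.0):
    # Stage 1: velocity loss only. Stage 2: add the acceleration term.
    # The L2 form and the weight `lam` are illustrative assumptions.
    loss = velocity_loss(y_hat, y)
    if stage == 2:
        loss = loss + lam * acceleration_loss(y_hat, y)
    return loss

H, A = 8, 7
y = np.linspace(0.0, 1.0, H)[:, None] * np.ones((H, A))  # smooth ramp
y_hat = y + 0.1                                           # constant offset
l1 = two_stage_loss(y_hat, y, stage=1)
l2 = two_stage_loss(y_hat, y, stage=2)
print(l1, l2)
```

Note that a constant offset leaves the temporal differences unchanged, so here the acceleration term adds nothing; it only penalizes predictions whose step-to-step changes deviate from the ground truth.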



Results

Qualitative Results


Quantitative Results

Table 6: Quantitative comparison between AnoleVLA and baseline methods on the Meta-World benchmark and physical experiments. In the table, "Med." and "V.Hard" refer to the Medium and Very Hard task suites in the Meta-World benchmark, respectively. The "Inf. speed" column specifies the inference speed of each method in the physical experiments. Best results are highlighted in bold, and second-best results are underlined.



BibTeX


    To appear.