Current work on robot failure detection and correction typically operates in a post hoc manner, analyzing errors and applying corrections only after failures occur. This work introduces CycleVLA, a system that equips Vision-Language-Action models (VLAs) with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution. CycleVLA achieves this by integrating a progress-aware VLA that flags the critical subtask transition points where failures most frequently occur, a VLM-based failure predictor and planner that triggers subtask backtracking upon a predicted failure, and a test-time scaling strategy based on Minimum Bayes Risk (MBR) decoding that improves retry success after backtracking. Extensive experiments show that CycleVLA improves performance for both well-trained and under-trained VLAs, and that MBR decoding serves as an effective zero-shot test-time scaling strategy for VLAs.
Motivation: Current work on robot failure detection and correction typically operates in a post hoc manner, analyzing errors and applying corrections only after failures occur. Our goal is to equip VLAs with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution.
Key Insight: Many robot task failures occur at subtask transitions, and progress near subtask completion provides strong cues for anticipating such failures (e.g., one can tell a peg is misaligned before it jams during insertion).
Approach: CycleVLA (a) introduces a finetuning pipeline that equips a VLA with subtask-level stop and progress prediction via an extended action-expert output dimension and augmented, subtask-decomposed training data. (b) At inference, the predicted progress triggers a VLM-based failure predictor and planner, which decides whether to transition to the next subtask or backtrack, and selects the subtask to backtrack to. (c) After backtracking, the VLA retries execution using test-time scaling via MBR decoding to improve retry success, as sketched below.
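To make step (c) concrete, below is a minimal sketch of MBR decoding over sampled action chunks. The consensus (mean pairwise L2) risk, the sample count, and the vla.sample_actions interface are illustrative assumptions for this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def mbr_select(action_samples: np.ndarray) -> np.ndarray:
    """Return the minimum-Bayes-risk candidate among N sampled action chunks.

    action_samples: (N, T, D) array -- N candidate chunks of T timesteps
    with D-dimensional actions, drawn stochastically from the VLA policy.

    The empirical risk of candidate i is its mean distance to the other
    candidates, so the selected chunk is the "consensus" sample.
    """
    n = action_samples.shape[0]
    flat = action_samples.reshape(n, -1)                       # (N, T*D)
    # Pairwise L2 distances between flattened candidate chunks.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    risk = dists.sum(axis=1) / (n - 1)                         # mean over others
    return action_samples[int(np.argmin(risk))]

# Illustrative retry after backtracking (vla.sample_actions is hypothetical):
# candidates = np.stack([vla.sample_actions(obs, subtask) for _ in range(8)])
# action_chunk = mbr_select(candidates)
```

Selecting the consensus sample trades extra inference calls for robustness: a single stochastic rollout that happens to be an outlier gets filtered out, which is exactly when a retry after backtracking most needs reliability.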
To finetune a progress-aware VLA, we decompose demonstrations into aligned subtasks by computing gripper-state segments and movement primitives from the trajectories, while leveraging LLMs to propose subtask timestamps.
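As a minimal sketch of the gripper-state part of this decomposition, assuming demonstrations store a per-timestep gripper openness signal (the threshold and function names here are illustrative):

```python
import numpy as np

def gripper_state_segments(openness: np.ndarray, thresh: float = 0.5):
    """Split a demonstration wherever the binary gripper state flips.

    openness: (T,) array of gripper openness in [0, 1], one per timestep.
    Returns a list of (start, end) index pairs; each segment is a candidate
    subtask span (e.g., reach -> grasp -> transport -> place).
    """
    closed = (openness < thresh).astype(int)        # binarize gripper state
    change_points = np.flatnonzero(np.diff(closed)) + 1
    bounds = [0, *change_points.tolist(), len(openness)]
    return list(zip(bounds[:-1], bounds[1:]))

# Example: a demo that closes the gripper at t=40 and reopens it at t=90
# yields [(0, 40), (40, 90), (90, T)].
```

In this sketch, the resulting boundaries would then be cross-checked against movement-primitive boundaries and the LLM-proposed subtask timestamps to produce the aligned subtask labels used for finetuning.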
@article{ma2026cyclevla,
title={CycleVLA: Proactive Self-Correcting Vision-Language-Action Models via Subtask Backtracking and Minimum Bayes Risk Decoding},
author={Ma, Chenyang and Yang, Guangyu and Lu, Kai and Xu, Shitong and Byrne, Bill and Trigoni, Niki and Markham, Andrew},
journal={arXiv preprint arXiv:2601.02295},
year={2026}
}