Current work on robot failure detection and correction typically operates in a post hoc manner, analyzing errors and applying corrections only after failures occur. This work introduces CycleVLA, a system that equips Vision-Language-Action models (VLAs) with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution. CycleVLA achieves this by integrating a progress-aware VLA that flags the critical subtask transition points where failures most frequently occur, a VLM-based failure predictor and planner that triggers subtask backtracking upon a predicted failure, and a test-time scaling strategy based on Minimum Bayes Risk (MBR) decoding that improves retry success after backtracking. Extensive experiments show that CycleVLA improves performance for both well-trained and under-trained VLAs, and that MBR decoding serves as an effective zero-shot test-time scaling strategy for VLAs.
Motivation: Current work on robot failure detection and correction typically operates in a post hoc manner, analyzing errors and applying corrections only after failures occur. Our goal is to equip VLAs with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution.
Key Insight: Many robot task failures occur at subtask transitions, and progress near subtask completion provides strong cues for anticipating such failures (e.g., one can tell a peg is misaligned before it jams during insertion).
Approach: CycleVLA (a) introduces a finetuning pipeline that equips a VLA with subtask-level stop and progress prediction via an extended action-expert output dimension and augmented, subtask-decomposed training data. (b) At inference, the predicted progress triggers a VLM-based failure predictor and planner, which decides whether to transition to the next subtask or backtrack, and selects the subtask to backtrack to. (c) After backtracking, the VLA retries execution using test-time scaling via MBR decoding to improve retry success, as sketched below.
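To make step (c) concrete, below is a minimal sketch of MBR decoding over sampled action chunks. The consensus (mean pairwise L2) risk, the sample count, and the vla.sample_actions interface are illustrative assumptions for this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def mbr_select(action_samples: np.ndarray) -> np.ndarray:
    """Return the minimum-Bayes-risk candidate among N sampled action chunks.

    action_samples: (N, T, D) array -- N candidate chunks of T timesteps
    with D-dimensional actions, drawn stochastically from the VLA policy.

    The empirical risk of candidate i is its mean distance to the other
    candidates, so the selected chunk is the "consensus" sample.
    """
    n = action_samples.shape[0]
    flat = action_samples.reshape(n, -1)                       # (N, T*D)
    # Pairwise L2 distances between flattened candidate chunks.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    risk = dists.sum(axis=1) / (n - 1)                         # mean over others
    return action_samples[int(np.argmin(risk))]

# Illustrative retry after backtracking (vla.sample_actions is hypothetical):
# candidates = np.stack([vla.sample_actions(obs, subtask) for _ in range(8)])
# action_chunk = mbr_select(candidates)
```

Selecting the consensus sample trades extra inference calls for robustness: a single stochastic rollout that happens to be an outlier gets filtered out, which is exactly when a retry after backtracking most needs reliability.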
To finetune a progress-aware VLA, we decompose demonstrations into aligned subtasks by computing gripper-state segments and movement primitives from the trajectories, while leveraging LLMs to propose subtask timestamps.
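As a minimal sketch of the gripper-state part of this decomposition, assuming demonstrations store a per-timestep gripper openness signal (the threshold and function names here are illustrative):

```python
import numpy as np

def gripper_state_segments(openness: np.ndarray, thresh: float = 0.5):
    """Split a demonstration wherever the binary gripper state flips.

    openness: (T,) array of gripper openness in [0, 1], one per timestep.
    Returns a list of (start, end) index pairs; each segment is a candidate
    subtask span (e.g., reach -> grasp -> transport -> place).
    """
    closed = (openness < thresh).astype(int)        # binarize gripper state
    change_points = np.flatnonzero(np.diff(closed)) + 1
    bounds = [0, *change_points.tolist(), len(openness)]
    return list(zip(bounds[:-1], bounds[1:]))

# Example: a demo that closes the gripper at t=40 and reopens it at t=90
# yields [(0, 40), (40, 90), (90, T)].
```

In this sketch, the resulting boundaries would then be cross-checked against movement-primitive boundaries and the LLM-proposed subtask timestamps to produce the aligned subtask labels used for finetuning.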
@article{ma2026cyclevla,
title={CycleVLA: Proactive Self-Correcting Vision-Language-Action Models via Subtask Backtracking and Minimum Bayes Risk Decoding},
author={Ma, Chenyang and Yang, Guangyu and Lu, Kai and Xu, Shitong and Byrne, Bill and Trigoni, Niki and Markham, Andrew},
journal={arXiv preprint arXiv:2601.02295},
year={2026}
}