CycleVLA: Proactive Self-Correcting Vision-Language-Action Models via Subtask Backtracking and Minimum Bayes Risk Decoding

Preprint


1 University of Oxford    2 University of Cambridge

arXiv · Code

TL;DR: We introduce CycleVLA, a system that enables VLAs to anticipate incipient failures and recover before execution collapses. (a) CycleVLA first augments a VLA to estimate subtask-level progress and flag critical subtask transition points, where failures most frequently occur. (b) At these points during inference, a VLM is queried to predict whether the current execution will fail and to decide whether to backtrack. (c) Upon backtracking, the VLA retries using test-time scaling via Minimum Bayes Risk decoding to improve success. This cycle repeats until the task succeeds or execution terminates.
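For intuition, the following is a minimal Python sketch of this cycle. Every interface in it (execute_subtask, predict_failure, the enable_mbr flag) is a hypothetical stand-in for exposition, not the released API.

def run_episode(vla, vlm, env, subtasks, max_retries=3):
    """One CycleVLA episode: execute, predict failure, backtrack, retry."""
    obs = env.reset()
    k, retries = 0, 0  # index of the current subtask
    while k < len(subtasks):
        # (a) Run the progress-aware VLA until it flags a transition point.
        obs, at_transition = vla.execute_subtask(env, obs, subtasks[k])
        if not at_transition:
            return False  # execution terminated before the transition
        # (b) Ask the VLM whether the current execution is about to fail,
        # and if so, which earlier subtask to backtrack to.
        will_fail, back_to = vlm.predict_failure(obs, subtasks, k)
        if will_fail and retries < max_retries:
            k = back_to
            retries += 1
            vla.enable_mbr = True   # (c) retry with MBR test-time scaling
        else:
            k += 1                  # transition to the next subtask
            vla.enable_mbr = False
    return True  # all subtasks completed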

Video

Abstract

Current work on robot failure detection and correction typically operates in a post hoc manner, analyzing errors and applying corrections only after failures occur. This work introduces CycleVLA, a system that equips Vision-Language-Action models (VLAs) with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution. CycleVLA achieves this by integrating a progress-aware VLA that flags critical subtask transition points where failures most frequently occur, a VLM-based failure predictor and planner that triggers subtask backtracking upon predicted failure, and a test-time scaling strategy based on Minimum Bayes Risk (MBR) decoding to improve retry success after backtracking. Extensive experiments show that CycleVLA improves performance for both well-trained and under-trained VLAs, and that MBR serves as an effective zero-shot test-time scaling strategy for VLAs.

Method Overview


Motivation: Current work on robot failure detection and correction typically operates in a post hoc manner, analyzing errors and applying corrections only after failures occur. Our goal is to equip VLAs with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution.

Key Insight: Many robot task failures occur at subtask transitions, and progress near subtask completion provides strong cues for anticipating such failures (e.g., one can tell a peg is misaligned before it jams during insertion).

Approach: CycleVLA first introduces (a) a finetuning pipeline that equips a VLA with subtask-level stop and progress prediction, via an extended action-expert dimension and augmented, subtask-decomposed training data. (b) At inference, predicted progress triggers a VLM-based failure predictor and planner, which decides whether to transition to the next subtask or to backtrack, and selects the subtask to backtrack to. (c) After backtracking, the VLA retries execution using test-time scaling via MBR decoding to improve success.
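To make the MBR retry concrete, here is a minimal, self-contained sketch of MBR decoding over sampled action chunks. The function name mbr_decode, the sample count, and the use of mean pairwise L2 distance as the risk are illustrative assumptions; the idea is simply to return the sampled chunk closest to the sample consensus.

import numpy as np

def mbr_decode(sample_actions, n_samples=16):
    """Minimum Bayes Risk decoding over sampled action chunks.

    sample_actions: callable drawing one action chunk of shape (H, D)
    from the VLA (horizon H, action dimension D). Risk is approximated
    by the mean pairwise L2 distance to the other samples, so the
    returned chunk is the one closest to the sample consensus.
    """
    candidates = np.stack([sample_actions() for _ in range(n_samples)])   # (N, H, D)
    flat = candidates.reshape(n_samples, -1)                              # (N, H*D)
    # Pairwise L2 distances between all candidate chunks.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)  # (N, N)
    risk = dists.mean(axis=1)            # Monte Carlo estimate of expected loss
    return candidates[np.argmin(risk)]   # minimum-risk candidate

Because it only resamples and reranks the VLA's own outputs, MBR of this form requires no additional training, which is what makes it usable as a zero-shot test-time scaling strategy.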

Constructing a Subtask-Decomposed Dataset


To finetune a progress-aware VLA, we decompose demonstrations into aligned subtasks by computing gripper state segments and movement primitives from trajectories, while leveraging LLMs to propose subtask timestamps.
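As one deliberately simplified illustration of the gripper-based part of this decomposition, the sketch below splits a trajectory at gripper open/close toggles. The name segment_by_gripper, the binary gripper encoding, and the min_len heuristic are our assumptions; the full pipeline additionally uses movement primitives and LLM-proposed subtask timestamps.

import numpy as np

def segment_by_gripper(gripper_states, min_len=10):
    """Split a trajectory into segments at gripper open/close toggles.

    gripper_states: length-T array of binary gripper commands.
    Returns a list of (start, end) frame-index pairs; segments shorter
    than min_len are merged forward to suppress spurious toggles.
    """
    gripper_states = np.asarray(gripper_states).astype(int)
    toggles = np.flatnonzero(np.diff(gripper_states)) + 1  # change points
    bounds = [0, *toggles.tolist(), len(gripper_states)]
    segments, start = [], 0
    for end in bounds[1:]:
        # Close a segment at each sufficiently long boundary
        # (and always at the final frame).
        if end - start >= min_len or end == len(gripper_states):
            segments.append((start, end))
            start = end
    return segments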

Qualitative Results

LIBERO Simulation Benchmark

Demo videos

Subtask-Decomposed Dataset


BibTeX

@article{ma2026cyclevla,
  title={CycleVLA: Proactive Self-Correcting Vision-Language-Action Models via Subtask Backtracking and Minimum Bayes Risk Decoding},
  author={Ma, Chenyang and Yang, Guangyu and Lu, Kai and Xu, Shitong and Byrne, Bill and Trigoni, Niki and Markham, Andrew},
  journal={arXiv preprint arXiv:2601.02295},
  year={2026}
}