FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Preprint


1 Mitsubishi Electric Research Laboratories  2 University of Oxford  3 UNC Chapel Hill
Corresponding author


TL;DR: We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly with Vision-Language-Action models. It comprises (a) a scalable simulation pipeline for data generation and evaluation, (b) a tailored VR teleoperation system for high-quality real-world demonstration collection, and (c) a progress-enhanced VLA to tackle long-horizon assembly.


Abstract

Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only a 16% drop on the hardest task.

Simulation Playground


We build a scalable simulation pipeline for data generation and evaluation. These bimanual furniture assembly tasks require executing diverse manipulation skills over extremely long horizons. A simple item (IKEA LACK side table) requires 12 skill executions (650 steps), while a complex assembly (IKEA IVAR chair) requires 25 skill executions (1550 steps).
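To put these horizons in perspective, the short calculation below divides the reported step counts by the reported skill counts. The dictionary layout and the function name `steps_per_skill` are illustrative only; the numbers are the ones quoted above.

```python
# Reported horizons from the text: (skill executions, control steps).
TASKS = {
    "IKEA LACK side table": (12, 650),
    "IKEA IVAR chair": (25, 1550),
}


def steps_per_skill(num_skills: int, num_steps: int) -> float:
    """Average number of control steps spent on one skill execution."""
    return num_steps / num_skills


for name, (skills, steps) in TASKS.items():
    print(f"{name}: {steps_per_skill(skills, steps):.1f} steps per skill")
```

Under this simple average, each skill execution spans roughly 54 steps for the LACK side table and 62 steps for the IVAR chair, so the longer horizon of the chair comes mainly from the larger number of skills rather than from much longer individual skills.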

Real-World System

VR Teleoperation

To support real-world deployment, we develop a VR teleoperation system with design principles tailored for real-scale bimanual assembly, enabling a single operator to coordinate dual-arm Kinova Gen3 control for high-quality demonstration collection.

(Speed 10×)

FurnitureVLA: Progress-Enhanced VLA


We propose a progress-enhanced VLA finetuned on semantically grounded subtasks to mitigate distribution drift in long-horizon tasks, jointly predicting actions and a progress signal to trigger subtask transitions.
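The idea of using a predicted progress signal to trigger subtask transitions can be sketched as a simple inference loop. This is a minimal illustration and not the authors' implementation: the policy signature, the `SubtaskRunner` class, the 0.95 threshold, and the rule "advance once progress crosses the threshold" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical policy interface (an assumption, not the paper's API):
# (observation, subtask instruction) -> (action, progress in [0, 1]).
Policy = Callable[[object, str], Tuple[Sequence[float], float]]


@dataclass
class SubtaskRunner:
    policy: Policy
    subtasks: List[str]        # ordered subtask instructions
    threshold: float = 0.95    # assumed transition threshold
    max_steps: int = 1550      # longest horizon quoted in the text

    def run(self, get_obs: Callable[[], object],
            apply_action: Callable[[Sequence[float]], None]):
        """Execute subtasks in order, advancing when predicted progress is high."""
        idx = 0
        log = []  # (step, subtask index, predicted progress)
        for step in range(self.max_steps):
            if idx >= len(self.subtasks):
                break  # every subtask has been completed
            action, progress = self.policy(get_obs(), self.subtasks[idx])
            apply_action(action)
            log.append((step, idx, progress))
            if progress >= self.threshold:
                idx += 1  # switch to the next subtask instruction
        return idx, log
```

In this sketch the policy is always conditioned on the current subtask instruction only, so each subtask starts from a short, well-covered context instead of one long episode, which is one plausible way a progress signal could limit drift over long horizons.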

Qualitative Results

VLA Policy Rollouts

We showcase VLA inference for long-horizon bimanual furniture assembly in simulation.

Assemble the LACK side table (Speed 10×)

Assemble the KALLAX shelf (Speed 10×)

Assemble the IVAR chair (Speed 10×)

We showcase real-world VLA policy inference.

Assemble the IVAR chair (Speed 20×)

VLA Emergent Corrective Behaviors

We also observe emergent corrective behaviors. In several rollouts, the robot self-corrects when parts are initially misaligned. For example, when grasping the seat panel with insufficient contact, the robot reopens the gripper, adjusts its pose, and regrasps for a more stable hold. During the attachment of the left chair frame, the robot performs small corrective motions to align the parts before insertion.

Seat panel regrasp (Speed 1.5×)

Left chair frame alignment (Speed 8×)

BibTeX

@article{ma2026furniturevla,
  title={FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model},
  author={Ma, Chenyang and Yang, Yue and Corcodel, Radu and Jain, Siddarth and Wu, Andrew and Hori, Chiori and Romeres, Diego},
  journal={arXiv preprint arXiv:2607.01212},
  year={2026}
}