SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

NeurIPS 2024


University of Oxford

arXiv | Code

TL;DR: We propose SpatialPIN, a modular plug-and-play framework that progressively enhances VLMs' 3D reasoning capabilities by prompting and interacting with 3D foundation models. SpatialPIN unlocks 3D-aware applications including (a) various forms of spatial VQA, (b) robotics pick-and-stack, and (c) discovering and planning robotics task trajectories from a single image.

Abstract

Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and extends to downstream robotics tasks such as pick-and-stack and trajectory planning.

Method Overview

[Figure: method overview]

Motivation: Many works enhance VLMs' spatial reasoning by training or fine-tuning them on standard spatial VQA datasets, which tends to produce surface-level associations within image-text data triplets. Given the scarcity of spatially rich embodied data and high-quality 3D annotations, we hypothesize that such VLMs may struggle to generalize beyond their training datasets or adapt to more complex spatial tasks.

Key Insight: Recent studies in image-space understanding show that VLMs, which carry internet-scale language knowledge, and multimodal foundation models capture complementary information, enabling new tasks across modalities without further training. With recent advances in 3D foundation models, we explore using their 3D priors to enhance VLMs' higher-level spatial awareness.

Approach: Our modular pipeline enhances a VLM's spatial understanding of an image through progressive prompting and interaction with 2D/3D foundation models as a "free lunch", proceeding through scene decomposition, comprehension, and reconstruction.
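
As a rough illustration of what such a modular, training-free pipeline can look like, the Python sketch below wires hypothetical wrappers around off-the-shelf segmentation, depth, and VLM models into a decompose-comprehend-reconstruct loop. All function and callable names here are placeholders for illustration, not SpatialPIN's actual interface.

# Minimal sketch of a modular, training-free "prompt-and-interact" pipeline.
# The foundation-model calls (segmenter, depth_model, vlm, lift_to_3d) are
# hypothetical callables standing in for off-the-shelf models; this is an
# illustration of the idea, not SpatialPIN's actual API.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SceneObject:
    label: str
    center_xyz: Tuple[float, float, float]  # metric 3D position
    size_xyz: Tuple[float, float, float]    # bounding-box extents

def reconstruct_scene(image,
                      segmenter: Callable,    # image -> instance masks (decomposition)
                      depth_model: Callable,  # image -> per-pixel depth (3D prior)
                      vlm: Callable,          # (image, region, prompt) -> text (comprehension)
                      lift_to_3d: Callable    # (mask, depth) -> (center_xyz, size_xyz)
                      ) -> List[SceneObject]:
    """Decompose the image, name each object, and lift it into metric 3D."""
    depth = depth_model(image)
    objects = []
    for mask in segmenter(image):
        label = vlm(image, mask, "What object is in this region?")
        center, size = lift_to_3d(mask, depth)
        objects.append(SceneObject(label, center, size))
    return objects

def spatial_vqa(image, question: str, scene: List[SceneObject], vlm: Callable) -> str:
    # Serialize the explicit 3D context into the prompt so the VLM reasons over
    # metric positions and sizes rather than raw pixels alone.
    context = "\n".join(f"{o.label}: center={o.center_xyz}, size={o.size_xyz}"
                        for o in scene)
    return vlm(image, None, f"Known 3D layout:\n{context}\n\nQuestion: {question}")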

Experiments

We provide an extensive empirical study combining multiple off-the-shelf and handcrafted datasets, ranging from fundamental spatial questions about relative positions and orientations, to fine-grained 3D questions about objects' locations, sizes, inclinations, and dynamic changes, to planning robotics tasks with full 3D trajectories.

Various Forms of Spatial VQA

We experiment with the basic form of spatial VQA introduced by SpatialVLM, Intra-Image Object Relations VQA (IaOR-VQA), as well as two new forms that we introduce: Intra-Image Angular Discrepancies VQA (IaAD-VQA) and Inter-Image Spatial Dynamics VQA (IrSD-VQA).

In the figure below, we list some sample question and answer pairs generated by our pipeline.

[Figure: sample spatial VQA pairs]

Robotics Pick-and-Stack

By partially reconstructing the 3D scene with visual alignments, our framework enables VLMs to use tools such as Rapidly-exploring Random Tree Star (RRT*) to generate accurate, collision-free paths based on task specifications. Given a robot's egocentric observation of a scene with multiple objects, our pipeline uses traditional planning to solve robotics pick-and-stack.

[Figure: robotics pick-and-stack]
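
To make the planning step concrete, the sketch below implements a basic goal-biased RRT (the simpler cousin of the RRT* mentioned above) over axis-aligned box obstacles. The obstacle representation, workspace bounds, and parameter values are assumptions for illustration, not the exact setup used in the paper.

# Minimal 3D RRT sketch (the basic variant rather than RRT*), assuming
# axis-aligned box obstacles recovered from the reconstructed scene, and
# checking only sampled points (a real planner also checks connecting segments).
# Illustrative only; not SpatialPIN's actual planner.

import math
import random

def collides(p, boxes):
    """True if point p lies inside any (min_xyz, max_xyz) box."""
    return any(all(lo <= c <= hi for c, lo, hi in zip(p, lo_xyz, hi_xyz))
               for lo_xyz, hi_xyz in boxes)

def steer(a, b, step):
    """Move from a toward b by at most `step`."""
    d = math.dist(a, b)
    if d <= step:
        return b
    return tuple(ai + (bi - ai) * step / d for ai, bi in zip(a, b))

def rrt(start, goal, boxes, bounds, iters=5000, step=0.05, goal_tol=0.05):
    """Grow a random tree from start; return a collision-free waypoint path or None."""
    nodes, parent = [start], {start: None}
    for _ in range(iters):
        sample = goal if random.random() < 0.1 else tuple(
            random.uniform(lo, hi) for lo, hi in bounds)   # goal-biased sampling
        nearest = min(nodes, key=lambda n: math.dist(n, sample))
        new = steer(nearest, sample, step)
        if new in parent or collides(new, boxes):
            continue
        nodes.append(new)
        parent[new] = nearest
        if math.dist(new, goal) < goal_tol:                # reached the goal region
            path, node = [], new
            while node is not None:                        # backtrack to start
                path.append(node)
                node = parent[node]
            return path[::-1]
    return None

# Example: move above a hypothetical 10 cm cube sitting near the workspace center.
obstacles = [((0.45, 0.45, 0.0), (0.55, 0.55, 0.10))]
path = rrt(start=(0.2, 0.2, 0.3), goal=(0.8, 0.8, 0.3),
           boxes=obstacles, bounds=[(0.0, 1.0), (0.0, 1.0), (0.0, 0.5)])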

Discovering and Planning Robotics Task Trajectories from a Single Image

We present a novel task that requires advanced spatial reasoning capabilities from VLMs. Given a single RGB image of any scene comprising unknown environments and objects, the VLM discovers potential tasks and plans their execution with full 3D trajectories; our motivation is that such plans can be used for robot learning in future research. To solve this complex task and visualize the execution with our framework, we introduce: 1) a task proposal approach using the VLM, and 2) a novel axes-constrained 3D planning approach that enables the spatial reasoning-imbued VLM to plan object motion for the proposed tasks by specifying waypoints, as sketched below.
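
As a minimal sketch of the second component, the snippet below shows one way axes-constrained planning could turn VLM-specified waypoints into a dense 3D trajectory by decomposing each segment into single-axis moves. The axis ordering, helper names, and example waypoints are illustrative assumptions, not the paper's exact formulation.

# Hypothetical illustration of axes-constrained waypoint planning: a VLM
# proposes sparse 3D waypoints for a task, and each segment is decomposed into
# single-axis moves (x, then y, then z) before dense interpolation.
# Names and the axis ordering are assumptions, not SpatialPIN's exact method.

import numpy as np

def axis_decompose(p, q, order=(0, 1, 2)):
    """Split the straight move p -> q into per-axis sub-moves."""
    current = np.array(p, dtype=float)
    subgoals = []
    for axis in order:
        nxt = current.copy()
        nxt[axis] = q[axis]
        if not np.allclose(nxt, current):
            subgoals.append(nxt)
            current = nxt
    return subgoals

def densify(waypoints, step=0.01):
    """Linearly interpolate an axis-constrained trajectory at `step` resolution."""
    trajectory = [np.array(waypoints[0], dtype=float)]
    for p, q in zip(waypoints[:-1], waypoints[1:]):
        for sub in axis_decompose(p, q):
            start = trajectory[-1]
            n = max(1, int(np.linalg.norm(sub - start) / step))
            for t in np.linspace(0, 1, n + 1)[1:]:
                trajectory.append(start * (1 - t) + sub * t)
    return np.stack(trajectory)

# Example: waypoints a VLM might return for "slide the mug right, then lift it".
vlm_waypoints = [(0.40, 0.20, 0.05), (0.60, 0.20, 0.05), (0.60, 0.20, 0.25)]
traj = densify(vlm_waypoints)   # (N, 3) dense 3D trajectory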

BibTeX

@inproceedings{ma2024spatialpin,
  title={SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors},
  author={Ma, Chenyang and Lu, Kai and Cheng, Ta-Ying and Trigoni, Niki and Markham, Andrew},
  booktitle={Proceedings of the Conference on Neural Information Processing Systems (NeurIPS)},
  year={2024}
}