See, Imagine, Plan: Discovering and Hallucinating Tasks from a Single Image

University of Oxford

Zero-Shot Task Hallucination

TL;DR: We present a model for zero-shot task hallucination. Given a single RGB image of any scene, comprising unknown environments and objects, it identifies potential tasks (task discovery) and imagines their execution (manipulation) as a vivid narrative, realized as a video. We also provide the trajectories of the executed tasks.

Abstract

Humans can not only recognize and understand the world in its current state but also envision future scenarios that extend beyond immediate perception. To emulate this profound human capacity, we introduce zero-shot task hallucination: given a single RGB image of any scene comprising unknown environments and objects, our model can identify potential tasks and imagine their execution in a vivid narrative, realized as a video. We develop a modular pipeline that progressively enhances scene decomposition, comprehension, and reconstruction, incorporating a vision-language model (VLM) for dynamic interaction and 3D motion planning for object trajectories. Our model can discover diverse tasks, and the generated task videos demonstrate realistic and compelling visual outcomes that are understandable by both machines and humans.

Method Overview


We develop a modular pipeline that progressively enhances scene decomposition, comprehension, and reconstruction, incorporating a Vision-Language Model (VLM) for dynamic interaction and 3D motion planning for object trajectories, producing geometry-aware task videos. To understand the image scene, we use the VLM to identify interactive objects and propose context-dependent tasks in a role-play manner, complemented by language-guided segmentation and repainting models to obtain occlusion-free object masks. Elevating the 2D understanding to 3D, we use depth estimation and single-view 3D reconstruction models to generate a semi-reconstructed 3D scene, with full 3D representations of the foreground objects and the background modeled as a plane. With the reconstructed 3D scene, we introduce a novel axes-constrained 3D planning approach that enables the VLM to plan the motion of objects for a given task by specifying waypoints. Combined with traditional path-planning algorithms, our model generates complete, feasible, and natural trajectories from merely a single image observation. Since the entire framework is fully modular, each component can easily be replaced with the latest advances in its specific domain.
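As an illustration (not our actual implementation), the minimal numpy sketch below shows how the planner's sparse, axes-constrained waypoints could be densified into a complete trajectory by a simple classical interpolation step; the `densify_waypoints` helper and the example waypoint values are hypothetical.

```python
import numpy as np

def densify_waypoints(waypoints, step=0.02):
    """Interpolate sparse 3D waypoints into a dense, evenly spaced path.

    `waypoints`: (N, 3) array of planner-specified keypoints.
    Returns an (M, 3) array sampled roughly every `step` metres along the
    piecewise-linear path through the waypoints.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    dense = [waypoints[0]]
    for start, end in zip(waypoints[:-1], waypoints[1:]):
        n = max(int(np.ceil(np.linalg.norm(end - start) / step)), 1)
        ts = np.linspace(0.0, 1.0, n + 1)[1:, None]  # skip the segment start to avoid duplicates
        dense.append(start + ts * (end - start))
    return np.vstack(dense)

# Example: a "lift, move sideways, set down" motion expressed as
# axis-aligned segments (one axis changes per waypoint-to-waypoint move).
waypoints = [
    [0.00, 0.00, 0.00],   # resting pose on the table
    [0.00, 0.00, 0.10],   # +z: lift
    [0.20, 0.00, 0.10],   # +x: translate
    [0.20, 0.00, 0.02],   # -z: set down
]
trajectory = densify_waypoints(waypoints)
print(trajectory.shape)   # one row per interpolated 3D position
```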

More Results

Dataset


As we are the first to propose zero-shot task hallucination, no pre-existing dataset is available for evaluation. We therefore craft a diverse evaluation dataset by combining self-captured photos with scenes from the NOCS dataset. We capture 38 photos using an iPhone 12 Pro Max. From NOCS, we use the real-world portions of both the training and test sets, which encompass 13 distinct scenes. Our dataset covers diverse scenes (e.g., office, kitchen, bathroom) and features a rich diversity of object categories (116) and object instances (185), with each image containing 1-7 objects and 1-3 tasks proposed for each object (278 tasks/task videos/planned trajectories in total). The dataset's diversity is further enhanced by the variety of viewpoints from which the images are captured or selected (e.g., frontal, top-down, side views).

Qualitative Comparison


We compare against state-of-the-art 3D-aware image editing and video generation models, 3DIT and Runway Gen-2. We limit the comparison to rotation and translation tasks to avoid biasing it against 3DIT and Runway on more nuanced, context-dependent tasks (e.g., slicing an apple), where their performance falls short. For 3DIT, we generate frames along our planned trajectories and connect them into videos. For Runway, we prompt it with task descriptions.
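To make the frame-connecting step concrete, the sketch below shows one way per-pose edited frames could be assembled into a comparison video. The `edit_frame` placeholder stands in for a call to a 3D-aware editing model such as 3DIT (its real interface differs), and the inputs are dummy values; `imageio.mimsave` is a standard way to write a stack of frames to a video file (the MP4 output requires the imageio-ffmpeg backend).

```python
import imageio.v2 as imageio
import numpy as np

def edit_frame(image, pose):
    """Placeholder for a 3D-aware editing model applied at one trajectory
    pose; here it simply returns the input so the sketch runs."""
    return image

# Hypothetical inputs: the source image and a densified object trajectory.
image = np.zeros((256, 256, 3), dtype=np.uint8)
trajectory = np.linspace([0.0, 0.0, 0.0], [0.2, 0.0, 0.1], num=16)

# Edit one frame per trajectory pose, then connect the frames into a video.
frames = [edit_frame(image, pose) for pose in trajectory]
imageio.mimsave("comparison_3dit.mp4", frames, fps=8)
```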