Prepare Before You Act:
Learning From Humans to Rearrange Initial States

Anonymous
Under Review

We should design policies that restructure the environment, bringing diverse start states into a familiar and manageable distribution before executing the actual task.

Abstract

Imitation learning (IL) has proven effective across a wide range of manipulation tasks. However, IL policies often struggle when faced with out-of-distribution observations; for instance, when the target object is in a previously unseen position or occluded by other objects. In these cases, current IL methods require extensive demonstrations to reach robust and generalizable behaviors. But when humans face these sorts of atypical initial states, they often rearrange the environment for more favorable task execution. For example, a person might rotate a coffee cup so that it is easier to grasp the handle, or push a box out of the way so they can directly grasp their target object. In this work we seek to equip robot learners with the same capability: enabling robots to prepare the environment before executing their given policy. We propose ReSET, an algorithm that takes initial states -- which are outside the policy's distribution -- and autonomously modifies object poses so that the restructured scene is similar to the training data. Theoretically, we show that this two-step process (rearranging the environment before rolling out the given policy) reduces the generalization gap. Practically, our ReSET algorithm combines action-agnostic human videos with task-agnostic teleoperation data to i) decide when to modify the scene, ii) predict what simplifying actions a human would take, and iii) map those predictions into robot action primitives. Comparisons with diffusion policies, VLAs, and other baselines show that using ReSET to prepare the environment enables more robust task execution given equal amounts of total training data.
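As a point of reference for the generalization-gap claim, one generic way to see why rearranging initial states helps is the following bound; this is an illustrative, textbook-style inequality under an assumed bounded failure loss and assumed notation ($\mu_{\text{train}}$, $\mu_{\text{test}}$, $\ell_\pi$), not the paper's actual theorem:

$$
\Bigl|\,\mathbb{E}_{S_0 \sim \mu_{\text{test}}}\bigl[\ell_{\pi}(S_0)\bigr] \;-\; \mathbb{E}_{S_0 \sim \mu_{\text{train}}}\bigl[\ell_{\pi}(S_0)\bigr]\,\Bigr| \;\le\; \mathrm{TV}\bigl(\mu_{\text{test}},\, \mu_{\text{train}}\bigr),
$$

where $\ell_{\pi} \in [0,1]$ is a failure indicator for the base policy and $\mathrm{TV}$ is the total variation distance between the test-time and training initial-state distributions. Executing a reduction policy first replaces $\mu_{\text{test}}$ with the distribution over rearranged states, which is constructed to lie close to $\mu_{\text{train}}$, so the right-hand side shrinks.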

Video

Method

A robot arm learning to grasp a cup. When encountering an out-of-distribution initial observation $S_0$ (e.g., the cup is obstructed by a box), conventional approaches train on large-scale datasets and attempt to directly roll out the robot policy $\pi$. This brute-force approach falls short when the robot encounters unexpected initial environment states. By contrast, ReSET first learns a reduction policy $\pi^\prime$ based on how humans intuitively restructure the scene. This reduction policy rearranges objects (e.g., moving the box out of the way) so that the task is easier to perform and the resulting anchor state $S_a$ comes from a lower-variance distribution. Our approach then executes the default task policy $\pi$ from this simplified state distribution to reach the goal state $\hat S_t$.
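In the caption's notation, the overall rollout is simply a composition of the two policies; the schematic below is an informal summary, not a formal statement from the paper:

$$
S_0 \;\xrightarrow{\;\pi^\prime\;}\; S_a \;\xrightarrow{\;\pi\;}\; \hat{S}_t,
$$

so the base policy $\pi$ is only ever rolled out from anchor states $S_a$, which lie in a lower-variance distribution close to its training data, rather than from the raw initial states $S_0$.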

Algorithm and Implementation

$\textit{Left}$: The model consists of three key components: (a) a scoring network $f$, trained on human videos, which estimates the likelihood that the base policy will succeed under a given initial configuration; (b) a flow generation network $g$, which predicts flows encoding human intuition about how the scene should be restructured into anchor states; and (c) a reduction policy $\pi^{\prime}$ that translates the predicted flows $\mathcal{T}$ into executable robot action primitives $\mathcal{A}$.
$\textit{Right}$: At rollout, the scoring network evaluates the current observation to determine whether the scene is ready for the base policy. If not, the flow generation network produces a flow plan, and the reduction policy executes that plan. The scoring network then re-evaluates the updated scene before deciding whether to proceed with the base policy.
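The rollout procedure on the right can be read as a short control loop. The sketch below is only a schematic reading of that description, not the authors' implementation: scoring_net, flow_net, reduction_policy, base_policy, execute_primitives, ready_threshold, and max_reduction_steps are all placeholder names and assumed parameters (the text does not specify a success threshold or a cap on reduction attempts).

```python
# Minimal sketch of the ReSET rollout loop described above.
# All names and numeric defaults here are illustrative placeholders.

def reset_rollout(obs, scoring_net, flow_net, reduction_policy, base_policy,
                  execute_primitives, ready_threshold=0.5, max_reduction_steps=5):
    """Restructure the scene until the scoring network deems it familiar,
    then hand control to the base task policy."""
    for _ in range(max_reduction_steps):
        # (a) Scoring network f: estimated likelihood that the base policy
        #     succeeds from the current observation.
        if scoring_net(obs) >= ready_threshold:
            break  # scene looks in-distribution; stop rearranging

        # (b) Flow generation network g: flows T encoding how a human would
        #     restructure the scene toward an anchor state.
        flows = flow_net(obs)

        # (c) Reduction policy pi': translate flows into robot action
        #     primitives A, execute them, and re-observe the scene.
        primitives = reduction_policy(obs, flows)
        obs = execute_primitives(primitives)

    # Execute the default task policy pi from the simplified state.
    return base_policy(obs)
```

The re-evaluation step mirrors the description above; the cap on reduction attempts is our own addition, included only so the illustrative loop always terminates.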

Real-World Experiments

Coming soon ...

Tasks

Coming soon...