Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning holds the potential to improve sample efficiency over model-free methods by learning behaviors inside imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment's state. In contrast, humans reason about objects and their interactions, forecasting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3, a state-of-the-art model-based RL algorithm, across a range of benchmark robotic environments that evaluate both relational reasoning and low-level manipulation capabilities.
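As the name suggests, SOLD builds on Slot Attention to obtain its object-centric representation. Below is a dependency-free sketch of the core competition mechanism, in which the softmax is taken over the slot axis so that slots compete to explain inputs. The learned projections, layer normalization, and GRU update of the full method are omitted; this is an illustration of the attention step, not SOLD's actual implementation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, inputs):
    """One simplified iteration of Slot Attention.
    slots: (K, D) slot vectors; inputs: (N, D) encoded image features.
    Returns updated slot vectors of shape (K, D)."""
    D = slots.shape[1]
    # Dot-product attention, normalized over the SLOT axis:
    # slots compete for each input feature.
    attn = softmax(inputs @ slots.T / np.sqrt(D), axis=1)  # (N, K)
    # Renormalize per slot, then take the weighted mean of inputs.
    attn = attn / attn.sum(axis=0, keepdims=True)
    return attn.T @ inputs                                 # (K, D)
```

In the full method, iterating this update (with a GRU in place of the plain weighted mean) yields the slot decomposition that the dynamics and behavior models operate on.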
Relational Reasoning on Multi-object Tasks
Our policies demonstrate robust relational reasoning alongside precise low-level manipulation, enabling SOLD to surpass state-of-the-art methods in a diverse range of multi-object reaching and manipulation tasks.
Learned Object-Centric Dynamics
The learned object-centric latent dynamics model is showcased through open-loop predictions from a single context frame in the videos below. These predictions remain accurate over long horizons, consistently maintaining the object-centric decomposition of the environment.
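The open-loop protocol described above can be summarized in a few lines: only the context frame is encoded, and every later slot state comes from the dynamics model alone, with no new observations. Here `dynamics` and `decode` are placeholders standing in for SOLD's learned predictor and slot decoder, not their actual interfaces.

```python
import numpy as np

def open_loop_rollout(slots0, dynamics, actions, decode):
    """Open-loop prediction from a single context frame.
    slots0: initial slot state from the encoder.
    dynamics(slots, action) -> next slot state (learned model, hypothetical API).
    decode(slots) -> rendered frame (learned decoder, hypothetical API)."""
    slots, frames = slots0, []
    for a in actions:
        slots = dynamics(slots, a)   # predict forward; no re-encoding
        frames.append(decode(slots))
    return frames
```

Because each step feeds the model's own prediction back in, errors can compound; the videos above show that the learned model nevertheless stays accurate over long horizons.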
True | Model | Slots | Reconstructions of individual slots
Generalization to Non-Object-Centric Environments
SOLD generalizes to tasks not designed for visual control or object-centric methods.
Learned Object-Centric Dynamics
SOLD learns accurate object-centric dynamics models even in tasks that were not designed for visual control or object-centric methods.
True | Model | Slots
Discovering Task-relevant Objects
SOLD autonomously identifies task-relevant objects over long horizons while ignoring irrelevant ones, resulting in interpretable attention patterns in the behavior models. In the Push-Specific task shown below, the actor model consistently focuses on the robot and the green target cube while disregarding distractors. Additionally, it attends to the occluded red target in distant past frames to infer whether the green cube is placed correctly.
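The attention patterns referenced above can be inspected directly: a read-out token attends over the slot tokens of the current and past frames, and its weights reveal which objects the actor relies on. The sketch below assumes plain dot-product attention over a flattened set of slot tokens; the actual behavior models may structure this differently.

```python
import numpy as np

def actor_attention(query, slot_tokens):
    """Attention weights of a hypothetical actor read-out token.
    query: (D,) read-out vector; slot_tokens: (T*K, D) slot tokens
    across T frames with K slots each. Returns weights summing to 1."""
    d = query.shape[-1]
    logits = slot_tokens @ query / np.sqrt(d)
    w = np.exp(logits - logits.max())  # numerically stable softmax
    return w / w.sum()
```

Visualizing these weights per frame and per slot is what produces the interpretable focus on the robot and target cube described above.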
Finetuning the Object-centric Encoder-Decoder
A common limitation of prior methods is that they pretrain an object-centric encoder-decoder on random sequences and freeze it during RL training. This restricts them to tasks where the state distribution induced by the learned policy resembles the one encountered under random behavior. In PickAndPlace tasks this assumption is violated: random actions rarely result in grasping and lifting a block. We demonstrate that finetuning the object-centric encoder-decoder is essential in these cases and remains stable throughout RL training. The figure below shows full reconstructions and target cube reconstructions for both frozen and finetuned SAVi models.
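One minimal way to express the frozen-versus-finetuned comparison is through optimizer parameter groups: frozen SAVi parameters are simply excluded from the optimizer, while finetuning includes them. The learning-rate scale used here is an illustrative assumption, not the paper's actual setting, and the parameter naming is hypothetical.

```python
def configure_param_groups(params, finetune, base_lr=3e-4, ft_lr_scale=0.1):
    """Build optimizer parameter groups for the two training regimes.
    params: dict mapping parameter name -> parameter object.
    finetune=False freezes the pretrained SAVi encoder-decoder by
    excluding it; finetune=True trains it with a scaled learning rate
    (the scale factor is an assumption for illustration)."""
    groups = []
    for name, p in params.items():
        if name.startswith("savi."):      # hypothetical naming convention
            if not finetune:
                continue                  # frozen: never updated by the RL loss
            groups.append({"param": name, "lr": base_lr * ft_lr_scale})
        else:
            groups.append({"param": name, "lr": base_lr})
    return groups
```

In frameworks such as PyTorch, the same idea is typically realized by passing per-parameter groups to the optimizer (or by toggling `requires_grad` on the frozen module).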