Abstract
Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2 — state-of-the-art model-based RL algorithms — across a range of benchmark robotic environments that require relational reasoning and manipulation capabilities.
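To make the slot-based latent space concrete, here is a minimal PyTorch sketch of the Slot Attention grouping step (Locatello et al., 2020) that SAVi-style encoders like SOLD's build on. The dimensions, hyperparameters, and initialization below are illustrative assumptions, not SOLD's exact configuration.

```python
# Minimal Slot Attention sketch: slots iteratively compete to explain
# encoder features, yielding one latent vector per scene object.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots: int = 6, dim: int = 64, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Learned Gaussian used to sample the initial slots.
        self.slots_mu = nn.Parameter(torch.zeros(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, num_inputs, dim) feature vectors from a CNN encoder.
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        # Sample initial slots from the learned Gaussian.
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each input feature.
            attn = F.softmax(torch.einsum('bsd,bnd->bsn', q, k) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # weighted mean
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).reshape(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))         # residual MLP
        return slots  # (batch, num_slots, dim)
```

The softmax over the slot axis, rather than the input axis, is what makes slots compete to explain each feature; this competition is the source of the per-object decomposition that the behavior models later reason over.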
[Videos: benchmark rollouts on Reach-Specific, Reach-Specific-Relative, Push-Specific, Pick-Specific, Reach-Distinct, Reach-Distinct-Groups, Push-Distinct, and Pick-Distinct]
SOLD compares favorably to existing state-of-the-art model-based RL methods on the introduced benchmark, which extends visual robotic control tasks with problems that require relational reasoning.

The learned object-centric latent dynamics model is showcased through open-loop predictions in the videos below. These predictions remain accurate over long horizons, consistently maintaining the object-centric decomposition of the environment. We can see how SOLD predicts individual scene elements to evolve over time under a given action sequence. In each video, the initial ground-truth frames are marked to distinguish them from the subsequent open-loop predictions. A sketch of such a rollout follows the videos.
[Videos: open-loop predictions on Button-Press, Hammer, Finger-Spin, and Cartpole-Balance]
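To illustrate what an open-loop prediction entails, here is a hedged sketch of such a rollout. `dynamics` and `decoder` are hypothetical stand-ins, not SOLD's actual interface: we assume the dynamics model maps the slot history plus an action to next-step slots, and the decoder renders slots back to pixels.

```python
# Open-loop rollout sketch: after a ground-truth burn-in, the model predicts
# purely from its own previous predictions, conditioned on the action sequence.
import torch

@torch.no_grad()
def open_loop_rollout(dynamics, decoder, slots, actions):
    """slots: (batch, num_slots, slot_dim) from the ground-truth burn-in frames.
    actions: (batch, horizon, action_dim) action sequence to condition on."""
    history = [slots]
    frames = []
    for t in range(actions.shape[1]):
        # Predict the next slot set from all previously predicted slots;
        # no observations are consumed after the burn-in, hence "open loop".
        slots = dynamics(torch.stack(history, dim=1), actions[:, t])
        history.append(slots)
        frames.append(decoder(slots))  # per-slot reconstructions composited to a frame
    return torch.stack(frames, dim=1)  # (batch, horizon, C, H, W)
```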
SOLD autonomously identifies task-relevant objects over long horizons, producing interpretable attention patterns in the behavior models. In the Push-Specific task shown below, the actor model consistently focuses on the robot and the green target cube while disregarding distractors. Notably, it attends to the occluded red sphere in a distant past frame to retrieve the goal position and infer that the green cube is correctly placed.
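The attention pattern described above can be read directly off the behavior model's attention weights. Below is an illustrative sketch under the assumption of a transformer-style actor whose output token cross-attends to the flattened history of slots; `q_proj`, `k_proj`, and all shapes are hypothetical, not SOLD's actual interface.

```python
# Sketch of inspecting which slots (and which time steps) the actor attends to.
import torch
import torch.nn.functional as F

def actor_slot_attention(q_proj, k_proj, output_token, slot_history):
    """output_token: (batch, 1, dim) learned query of the actor head.
    slot_history: (batch, time * num_slots, dim) flattened slot tokens.
    Returns (batch, time * num_slots) attention weights."""
    q = q_proj(output_token)                  # (batch, 1, dim)
    k = k_proj(slot_history)                  # (batch, T*S, dim)
    scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1).squeeze(1)
    # A high weight on, e.g., the goal-sphere slot in a past frame indicates
    # the actor is retrieving the goal position, as in the Push-Specific example.
    return attn
```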

A common limitation of prior methods is that they pretrain an object-centric encoder-decoder on random sequences and freeze it during RL training. This restricts applicability to tasks where the state distribution induced by the learned policy resembles the one encountered under random behavior. In Pick tasks, this assumption is violated: random behavior rarely results in grasping and lifting a block. We demonstrate that finetuning the object-centric encoder-decoder is essential in these cases and remains stable during RL training. The figure below shows full reconstructions and target-cube reconstructions for both frozen and finetuned SAVi models.
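For concreteness, here is a minimal sketch of the two setups compared in the figure, assuming hypothetical module handles `savi`, `dynamics`, and `actor_critic`; the learning rates are illustrative, not the paper's values.

```python
# Frozen vs. finetuned SAVi encoder-decoder during RL training.
import torch

def build_optimizer(savi, dynamics, actor_critic, finetune_savi: bool = True):
    if not finetune_savi:
        # Frozen variant: SAVi is pretrained on random-play sequences and
        # never updated, so its slots must already cover on-policy states.
        for p in savi.parameters():
            p.requires_grad_(False)
        return torch.optim.Adam(
            list(dynamics.parameters()) + list(actor_critic.parameters()), lr=3e-4)
    # Finetuned variant: keep training SAVi during RL, here with a smaller
    # learning rate so the slot representation adapts without collapsing.
    return torch.optim.Adam([
        {"params": savi.parameters(), "lr": 1e-4},
        {"params": dynamics.parameters(), "lr": 3e-4},
        {"params": actor_critic.parameters(), "lr": 3e-4},
    ])
```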

[Figure: full and target-cube reconstructions for frozen vs. finetuned SAVi models]