Abstract
Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2 — state-of-the-art model-based RL algorithms — across a range of benchmark robotic environments that require relational reasoning and manipulation capabilities.
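At the core of SOLD's encoder is a slot-attention bottleneck that decomposes image features into a fixed set of object latents. Below is a minimal PyTorch sketch of Slot Attention in the spirit of Locatello et al. (2020); dimensions and hyperparameters are illustrative, not the values used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Iterative attention that routes image features into a fixed set of
    slot vectors (Locatello et al., 2020), yielding one latent per object."""

    def __init__(self, num_slots=7, dim=64, iters=3, hidden=128, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Slots are sampled from a learned Gaussian at every forward pass.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm_in, self.norm_slots, self.norm_mlp = (
            nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim))

    def forward(self, inputs):  # inputs: (B, N, dim) flattened CNN features
        B, _, D = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            B, self.num_slots, D, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each feature.
            attn = F.softmax(torch.einsum('bid,bjd->bij', q, k) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)
            updates = torch.einsum('bij,bjd->bid', attn, v)  # weighted mean of values
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).reshape(B, -1, D)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (B, num_slots, dim)
```

Because the softmax normalizes over slots rather than over image locations, slots compete for features, which encourages each slot to bind to a distinct scene element.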
Reach-Specific
Reach-Specific-Relative
Push-Specific
Pick-Specific
Reach-Distinct
Reach-Distinct-Groups
Push-Distinct
Pick-Distinct
SOLD compares favorably to existing state-of-the-art model-based RL methods on the introduced benchmark, which extends visual robotic control with tasks that require relational reasoning to solve.

The learned object-centric latent dynamics model is showcased through open-loop predictions in the videos below. These predictions remain accurate over long horizons and consistently maintain the object-centric decomposition of the environment, showing how SOLD expects individual scene elements to evolve over time under a given action sequence (a sketch of the rollout procedure follows the clips). In the videos, markers indicate which frames are ground truth and which are open-loop predictions.
Button-Press
Hammer
Finger-Spin
Cartpole-Balance
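Conceptually, an open-loop rollout encodes a few context frames into slots and then feeds the model's own predictions back in, conditioning only on the action sequence. A minimal sketch, where `savi`, `predictor`, and `decoder` are hypothetical handles for the video slot encoder, the slot dynamics model, and the slot decoder (names and signatures are illustrative, not SOLD's actual API):

```python
import torch

@torch.no_grad()
def open_loop_rollout(savi, predictor, decoder, context_frames, actions):
    """Predict future frames from context frames and an action sequence,
    without ever observing the future. All module handles are hypothetical.

    context_frames: (B, T_ctx, C, H, W) ground-truth context
    actions:        (B, T, A) actions for the full horizon T > T_ctx
    """
    slots = savi(context_frames)                      # (B, T_ctx, S, D)
    history = [slots[:, t] for t in range(slots.shape[1])]
    frames = []
    for t in range(slots.shape[1], actions.shape[1]):
        past = torch.stack(history, dim=1)            # all slots seen or predicted so far
        next_slots = predictor(past, actions[:, :t])  # (B, S, D) next slot set
        history.append(next_slots)                    # feed back own prediction
        frames.append(decoder(next_slots))            # render the predicted frame
    return torch.stack(frames, dim=1)                 # (B, T - T_ctx, C, H, W)
```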
SOLD autonomously identifies task-relevant objects over long horizons, producing interpretable attention patterns in the behavior models. In the Push-Specific task shown below, the actor model consistently focuses on the robot and the green target cube while disregarding distractors. Notably, it attends to the occluded red sphere in a distant past frame to retrieve the goal position and infer that the green cube is correctly placed.

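One way such interpretable patterns can be read out is by letting the policy attend over the full slot history with a learned query and inspecting the attention weights. The sketch below illustrates this idea under assumed layer sizes; it is not the exact actor architecture from the paper.

```python
import torch
import torch.nn as nn

class SlotActor(nn.Module):
    """Illustrative actor head: a learned query token cross-attends over the
    slots of all past frames, and the attention weights reveal which objects
    (and which time steps) the policy relies on."""

    def __init__(self, slot_dim=64, action_dim=4, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, slot_dim))
        self.attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(slot_dim, 128), nn.ELU(),
                                  nn.Linear(128, action_dim))

    def forward(self, slot_history):
        # slot_history: (B, T, S, D) slots per past frame; flatten time and slots
        B, T, S, D = slot_history.shape
        tokens = slot_history.reshape(B, T * S, D)
        q = self.query.expand(B, -1, -1)
        # attn_weights: (B, 1, T*S) -- one weight per slot per frame, which is
        # what the attention visualizations above display.
        out, attn_weights = self.attn(q, tokens, tokens)
        return self.head(out.squeeze(1)), attn_weights.reshape(B, T, S)
```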
A common limitation of prior methods is that they pretrain an object-centric encoder-decoder on random sequences and then freeze it during RL training. This restricts their applicability to tasks where the state distribution induced by the learned policy resembles the one encountered under random behavior. Pick tasks violate this assumption, since random behavior rarely results in grasping and lifting a block. We demonstrate that finetuning the object-centric encoder-decoder is essential in these cases and remains stable during RL training. The figure below shows full reconstructions and target-cube reconstructions for both frozen and finetuned SAVi models.

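In practice, the difference between the two variants comes down to whether the SAVi parameters receive gradients during RL training. A minimal sketch, assuming hypothetical module handles `savi`, `dynamics`, `actor`, and `critic`, and a replay batch with `frames` and `actions`:

```python
import itertools
import torch

# Hypothetical module handles; names and losses are illustrative.
# Frozen variant: SAVi is excluded from the optimizer (and its outputs
# would be computed under torch.no_grad()).
frozen_opt = torch.optim.Adam(
    itertools.chain(dynamics.parameters(), actor.parameters(), critic.parameters()),
    lr=3e-4)

# Finetuned variant: SAVi keeps training jointly with the world model, so the
# slots can adapt to states (e.g., a grasped, lifted block) that random
# behavior never produced.
finetuned_opt = torch.optim.Adam(
    itertools.chain(savi.parameters(), dynamics.parameters(),
                    actor.parameters(), critic.parameters()),
    lr=3e-4)

def world_model_loss(batch):
    slots = savi.encode(batch["frames"])      # gradients reach SAVi when finetuning
    recon = savi.decode(slots)                # reconstruction objective is retained
    pred = dynamics(slots[:, :-1], batch["actions"][:, :-1])
    recon_loss = ((recon - batch["frames"]) ** 2).mean()
    dyn_loss = ((pred - slots[:, 1:].detach()) ** 2).mean()
    return recon_loss + dyn_loss
```

Retaining the reconstruction objective during RL is what keeps finetuning stable: the slots remain anchored to the pixels even as the state distribution shifts.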
Beyond the studied RL environments, we demonstrate the ability of our object-centric dynamics model to handle realistic observations and complex action spaces. To this end, we train on the Sketchy dataset, which provides real-world visual observations and 7-dimensional robotic-arm actions. Our results show that SOLD successfully learns meaningful object decompositions and accurately models the underlying dynamics.
The Moving Shapes dataset below requires the model to associate each action with the object to which it applies. By learning these associations, our model accurately predicts the scene dynamics and generalizes to settings where a single action may simultaneously affect multiple objects within a scene.
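A common way to let a model learn such associations is to embed the action as an extra token alongside the slot tokens, so that self-attention can route its effect to the relevant objects; the same construction extends directly to the 7-dimensional arm actions of Sketchy. The following sketch illustrates this idea and is not the exact SOLD predictor:

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Sketch of action conditioning via an extra token: the action embedding
    joins the slot tokens, and self-attention learns which slots it affects.
    Names and sizes are illustrative."""

    def __init__(self, slot_dim=64, action_dim=7, num_layers=2, num_heads=4):
        super().__init__()
        self.action_embed = nn.Linear(action_dim, slot_dim)
        layer = nn.TransformerEncoderLayer(slot_dim, num_heads,
                                           dim_feedforward=256, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, slots, action):
        # slots: (B, S, D); action: (B, A) -> one extra token for this step
        a = self.action_embed(action).unsqueeze(1)
        tokens = torch.cat([slots, a], dim=1)
        out = self.transformer(tokens)
        return out[:, :-1]  # predicted next-step slots; the action token is dropped
```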