RePo learns a minimal yet task-relevant representation by optimizing an information bottleneck objective. Specifically, we learn a latent representation that is maximally informative of current and future rewards while sharing minimal information with the observations.
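In symbols (the notation here is illustrative, not taken from the paper), with z_t the latent state, o_t the observation, and r_t the reward, the objective can be sketched as the constrained problem

\max_{\phi}\; I\big(z_{1:T};\, r_{1:T}\big) \quad \text{subject to} \quad I\big(z_{1:T};\, o_{1:T}\big) \le C,

where C is a budget on how much information the latent may retain about the observations.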
This objective is intractable, so we derive a variational lower bound. The final form consists of two terms: the first encourages the latent to be reward-predictive, and the second enforces dynamics consistency. A closer look at the second term reveals that it also incentivizes the representation to discard anything that cannot be predicted from the current latent and action, which includes the spurious variations that are often present in the real world.
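One bound of this form, consistent with the two terms described above (symbols again illustrative: q is the observation-conditioned posterior, p the latent dynamics prior and reward predictor, and beta the bottleneck multiplier), is

\mathbb{E}_{q}\Big[\textstyle\sum_t \log p(r_t \mid z_t)\Big] \;-\; \beta\, \mathbb{E}_{q}\Big[\textstyle\sum_t \mathrm{KL}\big(q(z_t \mid o_{\le t}, a_{<t}) \,\big\|\, p(z_t \mid z_{t-1}, a_{t-1})\big)\Big].

The KL is the dynamics-consistency term: it is small only when the posterior retains nothing that the prior cannot predict from the previous latent and action.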
We parametrize the models using a recurrent state-space architecture and optimize the objective with dual gradient descent.
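As a rough illustration of the dual update (a minimal sketch with hypothetical names, not the authors' code), the multiplier beta on the consistency term can be adjusted so that the measured KL tracks a target information budget:

import torch

# Dual gradient step on beta = exp(log_beta) > 0 (sketch; kl_value is a scalar tensor).
log_beta = torch.zeros((), requires_grad=True)
beta_optimizer = torch.optim.Adam([log_beta], lr=1e-4)

def dual_step(kl_value, budget):
    # Increase beta when the KL exceeds the budget, decrease it otherwise.
    dual_loss = -(log_beta.exp() * (kl_value.detach() - budget))
    beta_optimizer.zero_grad()
    dual_loss.backward()
    beta_optimizer.step()
    return log_beta.exp().detach()  # weight on the KL in the model loss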
While the representation is resilient to spurious variations, it is not invariant to significant distribution shift, because the visual encoder itself does not generalize. We therefore turn to test-time adaptation, aligning the encoder while keeping the policy and model frozen. Standard unsupervised alignment methods employ distribution-matching objectives, which are ill-suited to online adaptation due to exploration inefficiency. We propose to instead match the support of the test-time distribution of visual features with that of the training-time distribution.
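One simple way to instantiate support matching (a sketch under our own assumptions, not necessarily RePo's exact objective) is a one-sided nearest-neighbor loss: each test-time feature is pulled toward its closest training-time feature, without requiring the test features to cover the full training distribution.

import torch

# test_feats:  [B, D] features from the encoder being adapted on test observations.
# train_feats: [N, D] frozen features saved from training time.
def support_matching_loss(test_feats, train_feats):
    dists = torch.cdist(test_feats, train_feats.detach())  # pairwise distances, [B, N]
    # Penalize only the test -> train direction: test features must land inside the
    # training support but need not cover all of it.
    return dists.min(dim=1).values.mean()

Only the encoder receives gradients from this loss; the dynamics model, reward predictor, and policy stay frozen.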
We evaluate RePo's ability to learn in dynamic environments on the Distracted DeepMind Control (DMC) suite, where the static background is replaced with natural videos. RePo outperforms the baselines across all six environments.
To see how RePo deals with spurious variations in the real world, we construct an egocentric navigation task in which a TurtleBot must reach a goal location in a furnished room from egocentric vision. To introduce distraction, we place two TVs playing random videos along the critical paths to the goal. RePo achieves a 62.5% success rate within 15K environment steps, whereas Dreamer fails to reach the goal.
We evaluate RePo's ability to learn across diverse environments on three ManiSkill tasks with realistic backgrounds from the Matterport3D dataset. The background is randomized at every reset. RePo learns to ignore the background and focus only on the task-relevant components of the scene.
We evaluate our test-time adaptation method by training a RePo agent on the original DMC tasks and adapting its encoder to the distracted DMC domains. Because RePo's representations are minimal yet task-relevant, we recover near-optimal performance by simply aligning the visual encoder.
To further understand the representations learned by RePo, we collect the same ManiSkill trajectory across different backgrounds and visualize the top two principal components of the final recurrent states. We observe that RePo learns a more compact latent space than Dreamer, mapping similar trajectories closer to each other. This allows data sharing across backgrounds and explains RePo's sample efficiency on these tasks.
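The projection itself is standard PCA; a minimal sketch (array shapes assumed for illustration) looks like:

from sklearn.decomposition import PCA

# states: [num_rollouts, state_dim] final recurrent states, one row per
# (trajectory, background) pair.
def project_to_2d(states):
    return PCA(n_components=2).fit_transform(states)  # [num_rollouts, 2] for a scatter plot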
@inproceedings{
zhu2023repo,
title={RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability},
author={Chuning Zhu and Max Simchowitz and Siri Gadipudi and Abhishek Gupta},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=OIJ3VXDy6s}
}