RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability

Chuning Zhu, Max Simchowitz, Siri Gadipudi, Abhishek Gupta
University of Washington, MIT

Abstract

Visual model-based RL methods typically encode image observations into low-dimensional representations in a manner that does not eliminate redundant information. This leaves them susceptible to spurious variations -- changes in task-irrelevant components such as background distractors or lighting conditions. In this paper, we propose a visual model-based RL method that learns a latent representation resilient to such spurious variations. Our training objective encourages the representation to be maximally predictive of dynamics and reward, while constraining the information flow from the observation to the latent representation. We demonstrate that this objective significantly bolsters the resilience of visual model-based RL methods to visual distractors, allowing them to operate in dynamic environments. We then show that while the learned encoder is resilient to spurious variations, it is not invariant under significant distribution shift. To address this, we propose a simple reward-free alignment procedure that enables test-time adaptation of the encoder. This allows for quick adaptation to widely differing environments without having to relearn the dynamics and policy. Our effort is a step towards making model-based RL a practical and useful tool for dynamic, diverse domains. We show its effectiveness in simulation benchmarks with significant spurious variations as well as a real-world egocentric navigation task with noisy TVs in the background.

Method Overview

Representation Learning

RePo learns a minimally task-relevant representation by optimizing an information bottleneck objective. Specifically, we learn a latent representation that is maximally informative of the current and future rewards, while sharing minimal information with the observations.

Objective.
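In symbols, a schematic form of this bottleneck objective (the notation below is our own shorthand, not copied from the paper) uses latent states z_t, observations o_t, rewards r_t, actions a_t, and an information budget C:

    \max_{\phi}\; I\big(z_{1:T};\, r_{1:T} \mid a_{1:T}\big)
    \quad \text{subject to} \quad
    I\big(o_{1:T};\, z_{1:T} \mid a_{1:T}\big) \le C

where I denotes mutual information: the latent should carry as much information about rewards as possible while absorbing at most C bits from the raw observations.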

This objective is intractable, so we derive a variational lower bound. The final form consists of two terms: the first encourages the latent to be reward-predictive, and the second enforces dynamics consistency. A closer look at the second term reveals that it also incentivizes the representation to discard anything that cannot be predicted from the current latent and action -- including the spurious variations that are often present in the real world.

Variational objective.
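A schematic form of the resulting bound (again with assumed notation), using a posterior q(z_t | o_{\le t}, a_{<t}), a latent dynamics prior p(z_t | z_{t-1}, a_{t-1}), and a multiplier \beta, is:

    \mathbb{E}_{q}\!\left[\sum_{t}\log p(r_t \mid z_t)\right]
    \;-\; \beta\, \mathbb{E}_{q}\!\left[\sum_{t}\mathrm{KL}\!\left(q(z_t \mid o_{\le t}, a_{<t})\,\big\|\,p(z_t \mid z_{t-1}, a_{t-1})\right)\right]

The first term drives reward prediction; the KL term keeps the posterior close to what the latent dynamics alone can predict, which is what pushes unpredictable, task-irrelevant detail out of the representation.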

We parametrize the models using a recurrent state-space architecture and optimize the objective with dual gradient descent.

Architecture.
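As a concrete illustration of the dual gradient descent step mentioned above, here is a minimal PyTorch-style sketch. The function and variable names, learning rate, and KL budget are placeholders of ours, not the released implementation.

import torch

log_beta = torch.zeros((), requires_grad=True)   # dual variable, beta = exp(log_beta)
beta_opt = torch.optim.Adam([log_beta], lr=1e-3)

def dual_gradient_step(reward_loss, kl_value, model_opt, kl_budget=3.0):
    """One alternating primal/dual step (schematic).

    reward_loss: negative log-likelihood of rewards under the latent model
    kl_value:    KL(posterior || prior), averaged over the batch
    kl_budget:   the information budget C (placeholder value)
    """
    # Primal step: update model/encoder parameters on reward loss + beta * KL.
    beta = log_beta.exp().detach()
    model_loss = reward_loss + beta * kl_value
    model_opt.zero_grad()
    model_loss.backward()
    model_opt.step()

    # Dual step: gradient ascent on beta -- grow beta when the KL exceeds the
    # budget, shrink it when the KL sits comfortably below it.
    beta_loss = -log_beta.exp() * (kl_value.detach() - kl_budget)
    beta_opt.zero_grad()
    beta_loss.backward()
    beta_opt.step()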

Test-time adaptation via support constraint

While the representation is resilient to spurious variations, it is not invariant to significant distribution shift, because the visual encoder itself does not generalize. We therefore turn to test-time adaptation to align the encoder while keeping the policy and model frozen. Standard unsupervised alignment methods employ distribution-matching objectives, which are ill-suited to online adaptation because of exploration inefficiency. We instead propose to match the support of the test-time distribution of visual features with that of the training-time distribution.

Alignment method.
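One simple way to instantiate such a support constraint -- an illustrative sketch of ours, not necessarily the estimator used in the paper -- is a one-sided nearest-neighbor distance from test-time features to a buffer of training-time features. Training features that no test feature lands near incur no penalty, so only the support, not the full distribution, is matched.

import torch

def support_matching_loss(test_feats, train_feats):
    """test_feats:  [N, D] encoder features from test-time observations
    train_feats: [M, D] buffered encoder features from training (held fixed)"""
    dists = torch.cdist(test_feats, train_feats.detach())   # pairwise distances [N, M]
    return dists.min(dim=1).values.mean()                   # pull each test feature into the training support

During adaptation, only the encoder parameters would receive gradients from a loss like this; the latent dynamics model and policy stay frozen.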

Experiments

Distracted DeepMind Control

We evaluate RePo's ability to learn in dynamic environments on Distracted DeepMind Control, where the static background is replaced with natural videos. RePo outperforms the baselines across all six environments.

DMC experiments.

Lazy Turtlebot

To see how RePo deals with spurious variations in the real world, we construct an egocentric navigation task in which a TurtleBot has to reach a goal location in a furnished room from egocentric vision. To introduce distraction, we place two TVs playing random videos along the critical paths towards the goal. RePo achieves a 62.5% success rate within 15K environment steps, whereas Dreamer fails to reach the goal.

Realistic ManiSkill

We evaluate RePo's ability to learn across diverse environments on three ManiSkill tasks with realistic backgrounds from the Matterport3D dataset. The background is randomized at each reset. RePo learns to ignore the background and focus only on the task-relevant components of the scene.

ManiSkill experiments.

Test-Time Adaptation

We evaluate our test-time adaptation method by training a RePo agent on the original DMC tasks and adapting the encoder to the distracted DMC domains. Because RePo's representations are minimally task-relevant, we are able to recover near-optimal performance by simply aligning the visual encoder.

Adaptation experiments.

Visualization

To further understand the representations learned by RePo, we collect the same ManiSkill trajectory across different backgrounds and visualize the top two principal components of the final recurrent states. We observe that RePo learns a more compact latent space than Dreamer, mapping similar trajectories closer to each other. This allows data sharing across backgrounds and explains RePo's sample efficiency in these tasks.
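The visualization itself is straightforward to reproduce; below is a minimal scikit-learn sketch under an assumed data layout (a dict of per-background recurrent-state arrays for the same trajectory).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_recurrent_states(states_by_background):
    """states_by_background: dict mapping background id -> [T, D] array of
    recurrent states collected along the same trajectory."""
    pca = PCA(n_components=2).fit(np.concatenate(list(states_by_background.values())))
    for name, states in states_by_background.items():
        xy = pca.transform(states)                 # project onto the top two PCs
        plt.plot(xy[:, 0], xy[:, 1], label=str(name))
    plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend()
    plt.show()

A compact, background-invariant representation shows the per-background curves nearly overlapping, as we observe for RePo; an entangled one spreads them apart.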

BibTeX

@inproceedings{zhu2023repo,
  title={RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability},
  author={Chuning Zhu and Max Simchowitz and Siri Gadipudi and Abhishek Gupta},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=OIJ3VXDy6s}
}