Learning Latent Dynamics for Planning from Pixels

Abstract

Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this problem using a latent dynamics model with both deterministic and stochastic transition components and a multi-step variational inference objective that we call latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, which exceed the difficulty of tasks that were previously solved by planning with learned models. PlaNet uses substantially fewer episodes and reaches final performance close to and sometimes higher than strong model-free algorithms. The source code is available as open source for the research community to build upon.


Introduction

Planning is a natural and powerful approach to decision making problems with known dynamics, such as game playing and simulated robot control. To plan in unknown environments, the agent needs to learn the dynamics from experience. Learning dynamics models that are accurate enough for planning has been a long-standing challenge. Key difficulties include model inaccuracies, accumulating errors of multi-step predictions, failure to capture multiple possible futures, and overconfident predictions outside of the training distribution.

Figure 1: PlaNet learns a world model from image inputs only and successfully leverages it for planning in latent space. The agent solves a variety of image-based control tasks, competing with advanced model-free agents in terms of final performance while being 5000% more data efficient on average.

Planning using learned models offers several benefits over model-free reinforcement learning. First, model-based planning can be more data efficient because it leverages a richer training signal and does not require propagating rewards through Bellman backups. Moreover, planning carries the promise of increasing performance just by increasing the computational budget for searching for actions, as shown by Silver et al. Finally, learned dynamics can be independent of any specific task and thus have the potential to transfer well to other tasks in the environment.

Recent work has shown promise in learning the dynamics of simple low-dimensional environments. However, these approaches typically assume access to the underlying state of the world and the reward function, which may not be available in practice. In high-dimensional environments, we would like to learn the dynamics in a compact latent space to enable fast planning. The success of such latent models has been limited to simple tasks such as balancing cartpoles and controlling 2-link arms from dense rewards.

In this paper, we propose the Deep Planning Network (PlaNet), a model-based agent that learns the environment dynamics from pixels and chooses actions through online planning in a compact latent space. To learn the dynamics, we use a transition model with both stochastic and deterministic components and train it using a generalized variational objective that encourages multi-step predictions. PlaNet solves continuous control tasks from pixels that are more difficult than those previously solved by planning with learned models.

Key contributions of this work are summarized as follows:

Planning in latent space: PlaNet solves a variety of image-based continuous control tasks by searching for actions via online planning in the compact latent space of a learned model, without a policy network.
Recurrent state-space model: a latent dynamics model with both deterministic and stochastic transition components, both of which our experiments show to be crucial for successful planning.
Latent overshooting: a generalization of the standard variational objective that trains the model on multi-step predictions of all distances in latent space.

Latent Space Planning

To solve unknown environments via planning, we need to model the environment dynamics from experience. PlaNet does so by iteratively collecting data using planning and training the dynamics model on the gathered data. In this section, we introduce notation for the environment and describe the general implementation of our model-based agent, assuming access to a learned dynamics model. Our design and training objective for this model are detailed later on in the Recurrent State Space Model and Latent Overshooting sections, respectively.

Problem setup   Since individual image observations generally do not reveal the full state of the environment, we consider a partially observable Markov decision process (POMDP). We define a discrete time step t, hidden states s_t, image observations o_t, continuous action vectors a_t, and scalar rewards r_t that follow the stochastic dynamics

s_t \sim \mathrm{p}(s_t \mid s_{t-1}, a_{t-1}), \qquad o_t \sim \mathrm{p}(o_t \mid s_t), \qquad r_t \sim \mathrm{p}(r_t \mid s_t),

where we assume a fixed initial state s_0 without loss of generality. The goal is to implement a policy \mathrm{p}(a_t \mid o_{\leq t}, a_{<t}) that maximizes the expected sum of rewards E_{\mathrm{p}}\big[\sum_{\tau=t+1}^{T} r_\tau\big], where the expectation is over the distributions of the environment and the policy.

Figure 2: In a latent dynamics model, the information of the input images is integrated into the hidden states (green) using the encoder network (grey trapezoids). The hidden state is then projected forward in time to predict future images (blue trapezoids) and rewards (blue rectangle).

Model-based planning   PlaNet learns a transition model p(s_t \mid s_{t-1}, a_{t-1}), observation model p(o_t \mid s_t), and reward model p(r_t \mid s_t) from previously experienced episodes (note italic letters for the model compared to upright letters for the true dynamics). The observation model provides a training signal but is not used for planning. We also learn an encoder q(s_t \mid o_{\leq t}, a_{<t}) to infer an approximate belief over the current hidden state from the history using filtering. Given these components, we implement the policy as a planning algorithm that searches for the best sequence of future actions. We use model-predictive control (MPC) to allow the agent to adapt its plan based on new observations, meaning we replan at each step. In contrast to model-free and hybrid reinforcement learning algorithms, we do not use a policy network.
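As a structural sketch (not the actual implementation), one decision step of such an agent might look as follows in Python. The `encoder` and `planner` callables are hypothetical placeholders; a sketch of the planner itself follows in the planning algorithm section below.

```python
# Minimal sketch of one model-predictive control step, assuming hypothetical
# `encoder(belief, action, obs)` and `planner(model, belief)` interfaces.

def agent_step(model, encoder, planner, belief, prev_action, obs):
    # Filtering: fold the newest observation into the belief over the latent state.
    belief = encoder(belief, prev_action, obs)
    # Search for the action sequence with the highest predicted return under the model.
    action_sequence = planner(model, belief)
    # MPC: execute only the first action and replan after the next observation.
    return action_sequence[0], belief
```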

Figure 3: For planning, we encode past images (gray trapezoid) into the current hidden state (green). From there, we efficiently predict future rewards for multiple action sequences. Note how the expensive image decoder (blue trapezoid) from the previous figure is gone. We then execute the first action of the best sequence found (red box).

Experience collection   Since the agent may not initially visit all parts of the environment, we need to iteratively collect new experience and refine the dynamics model. We do so by planning with the partially trained model, as shown in Algorithm 1. Starting from a small amount of S seed episodes collected under random actions, we train the model and add one additional episode to the data set every C update steps. When collecting episodes for the data set, we add small Gaussian exploration noise to the action. To reduce the planning horizon and provide a clearer learning signal to the model, we repeat each action R times, as is common in reinforcement learning.
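The following Python sketch illustrates this collection loop under assumed interfaces: `env`, `model`, and `planner` are placeholders with the obvious methods, and the default hyperparameter values are arbitrary choices for illustration rather than the ones used in the paper.

```python
import numpy as np

def collect_episode(env, policy, action_repeat):
    # Roll out one episode, repeating each chosen action R times (assumed env interface).
    obs, done, episode = env.reset(), False, []
    while not done:
        action = policy(obs)
        for _ in range(action_repeat):
            next_obs, reward, done = env.step(action)
            episode.append((obs, action, reward))
            obs = next_obs
            if done:
                break
    return episode

def train_planet(env, model, planner, seed_episodes=5, update_steps=100,
                 action_repeat=4, exploration_noise=0.3, total_episodes=1000):
    # S seed episodes collected under random actions.
    dataset = [collect_episode(env, lambda obs: env.sample_random_action(), action_repeat)
               for _ in range(seed_episodes)]
    for _ in range(total_episodes):
        # Train the dynamics model for C update steps on the data gathered so far.
        for _ in range(update_steps):
            model.update(dataset[np.random.randint(len(dataset))])
        # Collect one more episode by planning with the partially trained model,
        # adding small Gaussian exploration noise to the executed actions.
        def noisy_policy(obs):
            action = planner(model, obs)
            return action + exploration_noise * np.random.randn(*np.shape(action))
        dataset.append(collect_episode(env, noisy_policy, action_repeat))
    return model, dataset
```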

Planning algorithm   We use the cross entropy method (CEM) to search for the best action sequence under the model, as outlined in Algorithm 2 in the appendix section of our paper. We decided on this algorithm because of its robustness and because it solved all considered tasks when given the true dynamics for planning. CEM is a population-based optimization algorithm that infers a distribution over action sequences that maximize the objective. As detailed in Algorithm 2, we initialize a time-dependent diagonal Gaussian belief over optimal action sequences a_{t:t+H} \sim N(\mu_{t:t+H}, \sigma^2_{t:t+H} I), where t is the current time step of the agent and H is the length of the planning horizon. Starting from zero mean and unit variance, we repeatedly sample J candidate action sequences, evaluate them under the model, and re-fit the belief to the top K action sequences. After I iterations, the planner returns the mean of the belief for the current time step, \mu_t. Importantly, after receiving the next observation, the belief over action sequences starts from zero mean and unit variance again to avoid local optima.

To evaluate a candidate action sequence under the learned model, we sample a state trajectory starting from the current state belief, and sum the mean rewards predicted along the sequence. Since we use a population-based optimizer, we found it sufficient to consider a single trajectory per action sequence and thus focus the computational budget on evaluating a larger number of different sequences. Because the reward is modeled as a function of the latent state, the planner can operate purely in latent space without generating images, which allows for fast evaluation of large batches of action sequences. The next section introduces the latent dynamics model that the planner uses.
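A compact Python sketch of this planner is given below. The `transition` and `reward_fn` callables stand in for the learned latent dynamics and reward model; their signatures are assumptions for illustration.

```python
import numpy as np

def plan_with_cem(init_state, transition, reward_fn, action_dim,
                  horizon=12, iterations=10, candidates=1000, top_k=100):
    # Time-dependent diagonal Gaussian belief over action sequences,
    # starting from zero mean and unit variance.
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample J candidate action sequences from the current belief.
        actions = mean + std * np.random.randn(candidates, horizon, action_dim)
        returns = np.zeros(candidates)
        for j in range(candidates):
            state = init_state
            # A single latent rollout per candidate; predicted rewards are summed.
            for t in range(horizon):
                state = transition(state, actions[j, t])
                returns[j] += reward_fn(state)
        # Re-fit the belief to the top K action sequences.
        elite = actions[np.argsort(returns)[-top_k:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0)
    # Return the mean action for the current time step; MPC replans at the next step.
    return mean[0]
```

The default values mirror the planner settings listed in Table 1 (H=12, I=10, J=1000, K=100). In the actual agent the rollouts are batched across all candidates, which is what makes evaluation in latent space fast; the explicit loops here are only for readability.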

Recurrent State Space Model

For planning, we need to evaluate thousands of action sequences at every time step of the agent. Therefore, we use a recurrent state-space model (RSSM) that can predict forward purely in latent space, similar to recently proposed models. This model can be thought of as a non-linear Kalman filter or sequential VAE. Instead of an extensive comparison to prior architectures, we highlight two findings that can guide future designs of dynamics models: our experiments show that both stochastic and deterministic paths in the transition model are crucial for successful planning. In this section, we remind the reader of latent state-space models and then describe our dynamics model.

Figure 4: Latent dynamics model designs. In this example, the model observes the first two time steps and predicts the third. Circles represent stochastic variables and squares deterministic variables. Solid lines denote the generative process and dashed lines the inference model.
(a) Transitions in a recurrent neural network are purely deterministic. This prevents the model from capturing multiple futures and makes it easy for the planner to exploit inaccuracies.
(b) Transitions in a state-space model are purely stochastic. This makes it difficult to remember information over multiple time steps.
(c) We split the state into stochastic and deterministic parts, allowing the model to robustly learn to predict multiple futures.

Latent dynamics   We consider sequences \{o_t, a_t, r_t\}_{t=1}^{T} with discrete time step t, high-dimensional image observations o_t, continuous action vectors a_t, and scalar rewards r_t. A typical latent state-space model is shown in Figure 4b and resembles the structure of a partially observable Markov decision process. It defines the generative process of the images and rewards using a hidden state sequence \{s_t\}_{t=1}^{T},

s_t \sim p(s_t \mid s_{t-1}, a_{t-1}), \qquad o_t \sim p(o_t \mid s_t), \qquad r_t \sim p(r_t \mid s_t),

where we assume a fixed initial state s_0 without loss of generality. The transition model is Gaussian with mean and variance parameterized by a feed-forward neural network, the observation model is Gaussian with mean parameterized by a deconvolutional neural network and identity covariance, and the reward model is a scalar Gaussian with mean parameterized by a feed-forward neural network and unit variance. Note that the log-likelihood under a Gaussian distribution with unit variance equals the mean squared error up to a constant.

Variational encoder   Since the model is non-linear, we cannot directly compute the state posteriors that are needed for parameter learning. Instead, we use an encoder q(s_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_{t=1}^{T} q(s_t \mid s_{t-1}, a_{t-1}, o_t) to infer approximate state posteriors from past observations and actions, where q(s_t \mid s_{t-1}, a_{t-1}, o_t) is a diagonal Gaussian with mean and variance parameterized by a convolutional neural network followed by a feed-forward neural network. We use the filtering posterior that conditions on past observations since we are ultimately interested in using the model for planning, but one may also use the full smoothing posterior during training.

Training objective   Using the encoder, we construct a variational bound on the data log-likelihood. For simplicity, we write losses for predicting only the observations -- the reward losses follow by analogy. The variational bound obtained using Jensen's inequality is

\ln p(o_{1:T} \mid a_{1:T}) \;\geq\; \sum_{t=1}^{T} \Big( E_{q(s_t \mid o_{\leq t}, a_{<t})}\big[\ln p(o_t \mid s_t)\big] \;-\; E_{q(s_{t-1} \mid o_{\leq t-1}, a_{<t-1})}\big[\mathrm{KL}\big[\, q(s_t \mid o_{\leq t}, a_{<t}) \,\big\|\, p(s_t \mid s_{t-1}, a_{t-1}) \,\big]\big] \Big). \qquad (3)

For the derivation, please see the appendix in the PDF. Estimating the outer expectations using a single reparameterized sample yields an efficient objective for inference and learning in non-linear latent variable models that can be optimized using gradient ascent.
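To make the bound concrete, here is a small NumPy sketch of the corresponding per-sequence loss (the negative bound), assuming the encoder and prior produce diagonal Gaussian parameters and the decoder produces image means; the names and array shapes are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def gaussian_kl(mean_q, std_q, mean_p, std_p):
    # KL divergence between diagonal Gaussians q and p, summed over the state dimension.
    return np.sum(np.log(std_p / std_q)
                  + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
                  - 0.5, axis=-1)

def sequence_loss(images, recon_means, post_means, post_stds, prior_means, prior_stds):
    # Reconstruction under a unit-variance Gaussian equals mean squared error up to a
    # constant; arrays are assumed to have a leading time dimension of length T.
    reconstruction = 0.5 * np.sum((images - recon_means) ** 2,
                                  axis=tuple(range(1, images.ndim)))
    # One KL regularizer per time step between the posterior and the one-step prior.
    kl = gaussian_kl(post_means, post_stds, prior_means, prior_stds)
    return np.sum(reconstruction + kl)
```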

Deterministic path   Despite its generality, the purely stochastic transitions make it difficult for the transition model to reliably remember information for multiple time steps. In theory, this model could learn to set the variance to zero for some state components, but the optimization procedure may not find this solution. This motivates including a deterministic sequence of activation vectors h_t, t \in 1 \ldots T, that allow the model to access not just the last state but all previous states deterministically. We use such a model, shown in Figure 4c, that we name the recurrent state-space model (RSSM),

h_t = f(h_{t-1}, s_{t-1}, a_{t-1}), \qquad s_t \sim p(s_t \mid h_t), \qquad o_t \sim p(o_t \mid h_t, s_t), \qquad r_t \sim p(r_t \mid h_t, s_t),

where f(h_{t-1}, s_{t-1}, a_{t-1}) is implemented as a recurrent neural network (RNN). Intuitively, we can understand this model as splitting the state into a stochastic part s_t and a deterministic part h_t, which depend on the stochastic and deterministic parts at the previous time step through the RNN. We use the encoder q(s_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_{t=1}^{T} q(s_t \mid h_t, o_t) to parameterize the approximate state posteriors. Importantly, all information about the observations must pass through the sampling step of the encoder to avoid a deterministic shortcut from inputs to reconstructions.
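The sketch below shows one RSSM step in Python. The `rnn_cell`, `prior_net`, and `posterior_net` callables are placeholders for the learned networks (in practice a recurrent cell and small feed-forward networks); their names and signatures are assumptions for illustration.

```python
import numpy as np

def rssm_step(prev_h, prev_s, prev_action, obs_embedding,
              rnn_cell, prior_net, posterior_net):
    # Deterministic path: h_t = f(h_{t-1}, s_{t-1}, a_{t-1}), carried by the RNN.
    h = rnn_cell(prev_h, np.concatenate([prev_s, prev_action]))
    # Prior p(s_t | h_t): used when predicting forward without observations (planning).
    prior_mean, prior_std = prior_net(h)
    # Posterior q(s_t | h_t, o_t): all observation information passes through this
    # sample, avoiding a deterministic shortcut from inputs to reconstructions.
    post_mean, post_std = posterior_net(np.concatenate([h, obs_embedding]))
    s = post_mean + post_std * np.random.randn(*np.shape(post_mean))
    return h, s, (prior_mean, prior_std), (post_mean, post_std)
```

During planning, the posterior branch is skipped and the next state is sampled from the prior instead, so no images need to be encoded or decoded along the rollout.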

Global prior   The model can be trained using the same loss function (Equation 3). In addition, we add a fixed global prior to prevent the posteriors from collapsing in near-deterministic environments. This alleviates overfitting to the initially small training data set and grounds the state beliefs (since posteriors and temporal priors are both learned, they could drift in latent space). The global prior adds additional KL-divergence loss terms from each posterior to a standard Gaussian. Another interpretation of this is to define the prior at each time step as the product of the learned temporal prior and the global fixed prior. In the next section, we identify a limitation of the standard objective for latent sequence models and propose a generalization of it that improves long-term predictions.
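Written out, our reading of this extra regularizer is a KL term from each posterior to a standard Gaussian, added at every time step (the weighting of this term relative to the rest of the objective is not specified here):

\sum_{t=1}^{T} \mathrm{KL}\big[\, q(s_t \mid o_{\leq t}, a_{<t}) \,\big\|\, N(0, I) \,\big].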

Latent Overshooting

In the previous section, we derived the typical variational bound for learning and inference in latent sequence models (Equation 3). As shown in Equation 3, this objective function contains reconstruction terms for the observations and KL-divergence regularizers for the approximate posteriors. A limitation of this objective is that the transition function p(s_t \mid s_{t-1}, a_{t-1}) is only trained via the KL-divergence regularizers for one-step predictions: the gradient flows through p(s_t \mid s_{t-1}, a_{t-1}) directly into q(s_{t-1}) but never traverses a chain of multiple transition factors. In this section, we generalize this variational bound to latent overshooting, which trains all multi-step predictions in latent space.

Limited capacity   If we could train our model to make perfect one-step predictions, it would also make perfect multi-step predictions, so this would not be a problem. However, for a model with limited capacity and a restricted distributional family, the model obtained by training on one-step predictions until convergence is in general not the model that is best at multi-step predictions. For successful planning, we need accurate multi-step predictions. Therefore, we take inspiration from Amos et al. and earlier related ideas, and train the model on multi-step predictions of all distances. We develop this idea for latent sequence models, showing that multi-step predictions can be improved by a loss in latent space, without having to generate additional images.

Figure 5: Unrolling schemes. The labels s_{i|j} are short for the state at time i conditioned on observations up to time j. Arrows pointing at shaded circles indicate log-likelihood loss terms. Wavy arrows indicate KL-divergence loss terms.
(a) The standard variational objective decodes the posterior at every step to compute the reconstruction loss. It also places a KL between the prior and posterior at every step, which trains the transition function for one-step predictions.
(b) Observation overshooting decodes all multi-step predictions to apply additional reconstruction losses. This is typically too expensive in image domains.
(c) Latent overshooting predicts all multi-step priors. These state beliefs are trained towards their corresponding posteriors in latent space to encourage accurate multi-step predictions.

Multi-step prediction   We start by generalizing the standard variational bound (Equation 3) from training one-step predictions to training multi-step predictions of a fixed distance d. For ease of notation, we omit actions in the conditioning set here; every distribution over s_t is conditioned upon a_{<t}. We first define multi-step predictions, which are computed by repeatedly applying the transition model and integrating out the intermediate states,

p_d(s_t \mid s_{t-d}) \triangleq E_{p_{d-1}(s_{t-1} \mid s_{t-d})}\big[\, p(s_t \mid s_{t-1}) \,\big], \qquad p_1(s_t \mid s_{t-1}) \triangleq p(s_t \mid s_{t-1}).

The case d=1 recovers the one-step transitions used in the original model. Given this definition of a multi-step prediction, we generalize Equation 3 to a variational bound on the multi-step predictive distribution p_d (Equation 6 in the paper).

For the derivation, please see the appendix in the PDF. Maximizing this objective trains the multi-step predictive distribution. This reflects the fact that during planning, the model makes predictions without having access to all the preceding observations.

We conjecture that Equation 6 is also a lower bound on \ln p(o_{1:T}) based on the data processing inequality. Since the latent state sequence is Markovian, for d \geq 1 we have I(s_t; s_{t-d}) \leq I(s_t; s_{t-1}) and thus E[\ln p_d(o_{1:T})] \leq E[\ln p(o_{1:T})]. Hence, every bound on the multi-step predictive distribution is also a bound on the one-step predictive distribution in expectation over the data set. For details, please see the appendix in the PDF. In the next paragraph, we alleviate the limitation that a particular p_d only trains predictions of one distance and arrive at our final objective.

Latent overshooting   We introduced a bound on predictions of a given distance d. However, for planning we need accurate predictions not just for a fixed distance but for all distances up to the planning horizon. We introduce latent overshooting for this, an objective function for latent sequence models that generalizes the standard variational bound (Equation 3) to train the model on multi-step predictions of all distances 1 \leq d \leq D (Equation 7 in the paper).

Latent overshooting can be interpreted as a regularizer in latent space that encourages consistency between one-step and multi-step predictions, which we know should be equivalent in expectation over the data set. We include weighting factors \beta_d, d \in 1 \ldots D, analogously to the \beta-VAE. While we set all \beta_{>1} to the same value for simplicity, they could be chosen to let the model focus more on long-term or short-term predictions. In practice, we stop gradients of the posterior distributions for overshooting distances d > 1, so that the multi-step predictions are trained towards the informed posteriors, but not the other way around. Equation 7 is the final objective function that we use to train the dynamics model of our agent.
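The sketch below illustrates how the extra KL terms could be computed, reusing the `gaussian_kl` helper from the training-objective sketch above. The `transition_prior(state, action)` callable stands in for the learned latent transition and returns Gaussian parameters; stop-gradients are only indicated in a comment since plain NumPy has no autodiff, so this is an illustrative approximation rather than the actual implementation.

```python
import numpy as np

def latent_overshooting_kl(post_means, post_stds, actions, transition_prior,
                           max_distance=3, betas=None):
    # post_means, post_stds: posterior parameters of shape (T, state_dim);
    # actions: array of shape (T, action_dim); betas: weights for distances 1..D.
    T = post_means.shape[0]
    betas = np.ones(max_distance) if betas is None else betas
    total = 0.0
    for t in range(1, T):
        for d in range(1, min(max_distance, t) + 1):
            # Start from a sample of the posterior at time t - d ...
            s = post_means[t - d] + post_stds[t - d] * np.random.randn(*post_stds[t - d].shape)
            # ... and apply the transition prior d times without seeing observations.
            for k in range(t - d, t):
                prior_mean, prior_std = transition_prior(s, actions[k])
                s = prior_mean + prior_std * np.random.randn(*prior_std.shape)
            # Train the d-step prior towards the posterior at time t. In practice the
            # posterior parameters are treated as constants here (stop-gradient).
            total += betas[d - 1] * gaussian_kl(post_means[t], post_stds[t],
                                                prior_mean, prior_std)
    return total
```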

Experiments

We evaluate PlaNet on six continuous control tasks from pixels. We explore multiple design axes of the agent: the stochastic and deterministic paths in the dynamics model, the latent overshooting objective, and online experience collection. We refer to the appendix for hyperparameters. Besides the action repeat, we use the same hyperparameters for all tasks. Within one fiftieth of the episodes, PlaNet outperforms A3C and achieves similar performance to the top model-free algorithm D4PG. The training time of 1 day on a single Nvidia V100 GPU is comparable to that of D4PG. Our implementation uses TensorFlow Probability and is available as open source. Please see the following video of the trained agents:

Figure 6: Video of the PlaNet agent learning to solve a variety of continuous control tasks from images in 2000 attempts. Previous agents that do not learn a model of the environment often require 50 times as many attempts to reach comparable performance.

For our evaluation, we consider six image-based continuous control tasks of the DeepMind Control Suite (Tassa et al.), shown in Figure 7. These environments provide qualitatively different challenges. The cartpole swingup task requires a long planning horizon and memorizing the cart when it is out of view, the finger spinning task includes contact dynamics between the finger and the object, the cheetah tasks exhibit larger state and action spaces, the cup task only has a sparse reward for when the ball is caught, and the walker is challenging because the robot first has to stand up and then walk, resulting in collisions with the ground that are difficult to predict. In all tasks, the only observations are third-person camera images of size 64×64×3 pixels.

Figure 7: Image-based control domains used in our experiments. The animation shows the image inputs as the agent is solving each task. The tasks test a variety of properties of our agent.
(a) For cartpole the camera is fixed, so the cart can move out of sight. The agent thus must absorb and remember information over multiple frames.
(b) The finger spin task requires predicting two separate objects, as well as the interactions between them.
(c) The cheetah running task includes contacts with the ground that are difficult to predict precisely, calling for a model that can predict multiple possible futures.
(d) The cup task only provides a sparse reward signal once a ball is caught. This demands accurate predictions far into the future to plan a precise sequence of actions.
(e) The simulated walker robot starts off by lying on the ground, so the agent must first learn to stand up and then walk.

Comparison to model-free methods   Figure 8 compares the performance of PlaNet to the model-free algorithms reported by Tassa et al. Within 500 episodes, PlaNet outperforms the policy-gradient method A3C, trained from proprioceptive states for 100,000 episodes, on all tasks. After 2,000 episodes, it achieves similar performance to D4PG, trained from images for 100,000 episodes, except for the finger task. On the cheetah running task, PlaNet surpasses the final performance of D4PG with a relative improvement of 19%. We refer to Table 1 for numerical results, which also includes the performance of CEM planning with the true dynamics of the simulator.

Table 1: Comparison of PlaNet to the model-free algorithms A3C and D4PG reported by Tassa et al. The training curves for these are shown as orange lines in Figure 4 and as solid green lines in Figure 6 of their paper. From these, we estimate the number of episodes that D4PG takes to achieve the final performance of PlaNet in order to estimate the data efficiency gain. We further include CEM planning (H=12, I=10, J=1000, K=100) with the true simulator instead of learned dynamics as an estimated upper bound on performance. Numbers indicate mean final performance over 4 seeds.

Model designs   Figure 8 additionally compares design choices of the dynamics model. We train PlaNet using our recurrent state-space model (RSSM), as well as versions with a purely deterministic GRU and a purely stochastic state-space model (SSM). We observe the importance of both stochastic and deterministic elements in the transition function on all tasks. The stochastic component might help because the tasks are stochastic from the agent's perspective due to partial observability of the initial states. The noise might also add a safety margin to the planning objective that results in more robust action sequences. The deterministic part allows the model to remember information over many time steps and is even more important -- the agent does not learn without it.

Figure 8: Comparison of PlaNet to model-free algorithms and other model designs. Plots show test performance for the number of collected episodes. We compare PlaNet using our RSSM to purely deterministic (RNN) and purely stochastic models (SSM). The RNN does not use latent overshooting, as it does not have stochastic latents. The lines show medians and the areas show percentiles 5 to 95 over 4 seeds and 10 rollouts.

Agent designs   Figure 9 compares PlaNet with latent overshooting to a version with the standard variational objective, and to a version trained on a fixed random data set rather than collecting experience online. We observe that online data collection helps on all tasks and is necessary for the finger and walker tasks. Latent overshooting is necessary for successful planning on the walker and cup tasks; the sparse reward in the cup task demands accurate predictions for many time steps. It also slows down initial learning for the finger task, but increases final performance on the cartpole balance and cheetah tasks.

Figure 9: Comparison of agent designs. Plots show test performance for the number of collected episodes. We compare PlaNet using latent overshooting (Equation 7), a version with standard variational objective (Equation 3), and a version that trains from a random data set of 1000 episodes rather than collecting experience during training. The lines show medians and the areas show percentiles 5 to 95 over 4 seeds and 10 rollouts.

One agent all tasks   Additionally, we train a single PlaNet agent to solve all six tasks. The agent is placed into different environments without knowing the task, so it needs to infer the task from its image observations. Without changes to the hyperparameters, the multi-task agent achieves the same mean performance as individual agents. While it learns more slowly on the cartpole tasks, it learns substantially faster and reaches a higher final performance on the challenging walker task that requires exploration.

Figure 10: Video predictions of the PlaNet agent trained on multiple tasks. Holdout episodes are shown above with agent video predictions below. The agent observes the first 5 frames as context to infer the task and state and accurately predicts ahead for 50 steps given a sequence of actions.

For this, we pad the action spaces with unused elements to make them compatible and adapt Algorithm 1 to collect one episode of each task every 6C update steps. We use the same hyperparameters as for the main experiments above. The agent reaches the same average performance over tasks as individually trained agents. While learning is slowed down for the cup task and the easier cartpole tasks, it is substantially improved for the difficult walker task. This indicates that positive transfer between these tasks might be possible using model-based reinforcement learning, despite the conceptually different visuals. Full results are available in the appendix section of our paper.

Related Work

Previous work in model-based reinforcement learning has focused on planning in low-dimensional state spaces, combining the benefits of model-based and model-free approaches, and pure video prediction without planning.

Planning in state space   When low-dimensional states of the environment are available to the agent, it is possible to learn the dynamics directly in state space. In the regime of control tasks with only a few state variables, such as the cart pole and mountain car tasks, PILCO achieves remarkable sample efficiency using Gaussian processes to model the dynamics. Similar approaches using neural network dynamics models can solve two-link balancing problems and implement planning via gradients. Chua et al. use ensembles of neural networks, scaling up to the cheetah running task. The limitation of these methods is that they access the low-dimensional Markovian state of the underlying system and sometimes the reward function. Amos et al. train a deterministic model using overshooting in observation space for active exploration with a robotic hand. We move beyond low-dimensional state representations and use a latent dynamics model to solve control tasks from images.

Hybrid agents   The challenges of model-based RL have motivated the research community to develop hybrid agents that accelerate policy learning by training on imagined experience, improving feature representations, or leveraging the information content of the model directly. Srinivas et al. learn a policy network with integrated planning computation using reinforcement learning and without prediction loss, yet require expert demonstrations for training.

Multi-step predictions   Training sequence models on multi-step predictions has been explored for several years. Scheduled sampling changes the rollout distance of the sequence model over the course of training. Hallucinated replay mixes predictions into the data set to indirectly train multi-step predictions. Venkatraman et al. take an imitation learning approach. Recently, Amos et al. train a dynamics model on all multi-step predictions at once. We generalize this idea to latent sequence models trained via variational inference.

Latent sequence models   Classic work has explored models for non-Markovian observation sequences, including recurrent neural networks (RNNs) with deterministic hidden state and probabilistic state-space models (SSMs). The ideas behind variational autoencoders have enabled non-linear SSMs that are trained via variational inference. The VRNN combines RNNs and SSMs and is trained via variational inference. In contrast to our RSSM, it feeds generated observations back into the model, which makes forward predictions expensive. Karl et al. address mode collapse to a single future by restricting the transition function, other work focuses on multi-modal transitions, and Doerr et al. stabilize training of purely stochastic models. Buesing et al. propose a model similar to ours but use it in a hybrid agent instead of for explicit planning.

Video prediction   Video prediction is an active area of research in deep learning. Oh et al. and Chiappa et al. achieve visually plausible predictions on Atari games using deterministic models. Kalchbrenner et al. introduce an autoregressive video prediction model using gated CNNs and LSTMs. Recent approaches introduce stochasticity to the model to capture multiple futures. To obtain realistic predictions, Mathieu and Vondrick use adversarial losses. In simulated environments, Gemici et al. augment dynamics models with an external memory to remember long-time contexts. Van et al. propose a variational model that avoids sampling using a nearest neighbor look-up, yielding high fidelity image predictions. These models are complementary to our approach.

Relatively few works have demonstrated successful planning from pixels using learned dynamics models. The robotics community focuses on video prediction models for planning that deal with the visual complexity of the real world and solve tasks with a simple gripper, such as grasping or pushing objects. In comparison, we focus on simulated environments, where we leverage latent planning to scale to larger state and action spaces, longer planning horizons, as well as sparse reward tasks. E2C and RCE embed images into a latent space, where they learn local-linear latent transitions and plan for actions using LQR. These methods balance simulated cartpoles and control 2-link arms from images, but have been difficult to scale up. We lift the Markov assumption of these models, making our method applicable under partial observability, and present results on more challenging environments that include longer planning horizons, contact dynamics, and sparse rewards.

Discussion

In this work, we present PlaNet, a model-based agent that learns a latent dynamics model from image observations and chooses actions by fast planning in latent space. To enable accurate long-term predictions, we design a model with both stochastic and deterministic paths and train it using our proposed latent overshooting objective. We show that our agent is successful at several continuous control tasks from image observations, reaching performance that is comparable to the best model-free algorithms while using 50× fewer episodes and similar training time. The results show that learning latent dynamics models for planning in image domains is a promising approach.

Directions for future work include learning temporal abstraction instead of using a fixed action repeat, possibly through hierarchical models. To further improve final performance, one could learn a value function to approximate the sum of rewards beyond the planning horizon. Moreover, exploring gradient-based planners could increase computational efficiency of the agent. Our work provides a starting point for multi-task control by sharing the dynamics model.

If you would like to discuss any issues or give feedback regarding this work, please visit the GitHub repository of this article.

Acknowledgments

We thank Jacob Buckman, Nicolas Heess, John Schulman, Rishabh Agarwal, Silviu Pitis, Mohammad Norouzi, George Tucker, David Duvenaud, Shane Gu, Chelsea Finn, Steven Bohez, Jimmy Ba, Stephanie Chan, and Jenny Liu for helpful discussions.

This article was prepared using the Distill template.

Citation

For attribution in academic contexts, please cite this work as

Hafner et al., "Learning Latent Dynamics for Planning from Pixels", 2018.

BibTeX citation

@article{hafner2018planet,
  title={Learning Latent Dynamics for Planning from Pixels},
  author={Hafner, Danijar and Lillicrap, Timothy and Fischer, Ian and Villegas, Ruben and Ha, David and Lee, Honglak and Davidson, James},
  journal={arXiv preprint arXiv:1811.04551},
  year={2018}
}

Open Source Code

We have released our source code for reproducing this paper and for future research to build upon. Please see this GitHub repo for instructions.