Render, Encode, Plan: a simple pipeline for hybrid RL-DL learning inside Unreal Engine

1University of Trento, 2CNIT
Teaser image.

We present REP - Render, Encode, Plan, a novel framework to train different kinds of intelligent agents inside Unreal Engine 5. a) Different kinds of agents can be spawned into arbitrarily large 3D environments, while b) capturing diverse kinds of observations (including visual), which c) are then encoded by a hybrid RL+DL procedure. The encoded observations give rise to d) a set of actions for each agent, which receives rewards to improve over time.

Demonstration videos

Urban environment

Drone navigation

Agent exploration

Abstract

Learning is an iterative process that requires multiple forms of interaction with the environment. During learning, we experience the world through repeated observations and actions, gaining insight into which combinations of these lead to the best results according to our goals. The same paradigm has been applied to traditional reinforcement learning (RL) over the years, with impressive results in 3D navigation and planning. On the other hand, the computer vision community has focused mostly on vision-related tasks (e.g. classification, segmentation, depth estimation) using deep learning (DL).

We present REP: Render, Encode, Plan, a unified framework to train embodied agents of different kinds (humanoids, vehicles, and drones) inside Unreal Engine, showing how a combination of RL and DL can help shape intelligent agents that better sense the surrounding environment. The main advantage of our method is the combination of different sensory modalities, including game-state observations and vision features, which allows the agents to share a similar observation structure while defining separate rewards based on their goals. We demonstrate strong generalization capabilities on large-scale realistic 3D environments and on multiple dynamically changing scenarios, with different goals and rewards.

Architecture

Architecture.

Simplified schematic of our REP architecture. The goal is to map physical and visual observations from multiple kinds of agents in a large open world into a meaningful set of actions. We first encode the visual observations using a DNN encoder. We then concatenate the normalized physical observations (i.e. the game state) with the resulting visual features, and train the agents with the PPO algorithm to optimize a policy that produces the best set of actions for a given reward function. At inference time the same diagram applies, but the visual features are computed in-engine using NNE for optimal performance.
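The observation-building step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the encoder here is a stand-in linear projection (the actual framework uses a DNN encoder), and all shapes, names, and normalization bounds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_visual(frame, proj):
    """Stand-in for the DNN visual encoder: flatten the rendered
    frame and project it to a compact feature vector (hypothetical)."""
    return np.tanh(frame.ravel() @ proj)

def normalize(phys, lo, hi):
    """Scale raw game-state observations to [0, 1] per dimension."""
    return (phys - lo) / (hi - lo)

# Hypothetical inputs: a 16x16 grayscale render and 6 physical values
frame = rng.random((16, 16))
proj = rng.standard_normal((256, 32)) * 0.1   # frozen encoder weights
phys = np.array([120.0, 35.0, -4.0, 0.8, 0.1, 0.0])
lo = np.array([0.0, 0.0, -10.0, -1.0, -1.0, -1.0])
hi = np.array([500.0, 100.0, 10.0, 1.0, 1.0, 1.0])

# REP-style joint observation: visual features ++ normalized game state,
# which would then be fed to the PPO policy network.
obs = np.concatenate([encode_visual(frame, proj), normalize(phys, lo, hi)])
print(obs.shape)  # (38,)
```

The concatenated vector is what the shared policy consumes; agent types differ only in which physical observations and rewards they plug into this common structure.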

State-of-the-art comparison

Comparison table.

World

World.

Top and perspective views of a 2.5 km spline for a drone mission scenario. In REP, similar splines can be defined for the humanoid and car agents as well.

Results: drone navigation

Results.

Horizontal and vertical extent of the Spline Navigation scenarios. Vertical bars indicate the local inverse of the dot product between the drone's flying direction and the spline direction.
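A minimal sketch of such an alignment metric, assuming "inverse of the dot product" means 1 - cos(theta) between the two unit-normalized direction vectors; the paper's exact definition may differ, and the function name is hypothetical.

```python
import numpy as np

def misalignment(flight_dir, spline_dir):
    """1 - cos(theta) between the drone's flying direction and the
    local spline tangent: 0 when perfectly aligned, 2 when flying
    exactly backwards along the spline. Hypothetical helper."""
    f = flight_dir / np.linalg.norm(flight_dir)
    s = spline_dir / np.linalg.norm(spline_dir)
    return 1.0 - float(np.dot(f, s))

# Aligned, perpendicular, and opposite flight directions
print(misalignment(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])))   # 0.0
print(misalignment(np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])))   # 1.0
print(misalignment(np.array([-1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])))  # 2.0
```

A quantity of this shape is a natural candidate for a shaping penalty in the drone's reward, since it grows smoothly as the agent deviates from the spline's direction.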

Results: crowd simulation

Results.

Related Links

Here you can find some related links.

Unreal Engine's Learning Agents plugin allows us to set up the training environment and render all the scenes.

BibTeX

@article{dellapietra2025render,
  title={Render, Encode, Plan: a simple pipeline for hybrid RL-DL learning inside Unreal Engine},
  author={Della Pietra, Daniele and Garau, Nicola},
  journal={Computers \& Graphics},
  year={2025}
}