Peek-a-bot: learning through vision in Unreal Engine

University of Trento, CNIT
Teaser image.

We present Peek-a-bot, a hybrid DL-RL framework that trains reinforcement learning agents purely from visual observations, fully in-engine. We (a) build arbitrarily complex, photorealistic environments for agents to navigate. Each agent (b) perceives the world only through its own vision: no other information is ever given to any agent, including goal positions or distances. We (c) extract visual features in-engine with a pre-trained ONNX backbone and (d) feed them to the PPO algorithm, which optimizes a set of given rewards (e). Every step is designed to run in real time inside Unreal Engine 5, making it possible to train agents specifically for video games and complex simulations.
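
To make the (a)-(e) loop concrete, here is a minimal, engine-agnostic Python sketch of the vision-only training cycle. The camera, backbone, policy, and reward functions below are hypothetical stand-ins chosen for illustration, not the in-engine implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def render_camera():
        # hypothetical stand-in for an in-engine RGB capture (3 x H x W)
        return rng.random((3, 128, 128), dtype=np.float32)

    def backbone_features(frame):
        # hypothetical stand-in for the pre-trained ONNX backbone
        return frame.mean(axis=(1, 2))

    def policy(features):
        # hypothetical stand-in for the PPO policy network (4 discrete moves)
        return int(rng.integers(0, 4))

    def step_environment(action):
        # hypothetical reward: the engine scores the agent's behaviour
        return float(action == 2)

    rollout = []
    for t in range(8):
        obs = backbone_features(render_camera())   # (a)-(c): vision is the only observation
        act = policy(obs)                          # (d): the policy acts on visual features
        rew = step_environment(act)                # (e): reward collected for the PPO update
        rollout.append((obs, act, rew))
    print(len(rollout), "transitions collected")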

Abstract

Humans learn to navigate and interact with their surroundings through their senses, particularly vision. Ego-vision has recently become a significant focus in computer vision, enabling neural networks to learn effectively from first-person data, much as humans do. Supervised and self-supervised learning of depth, object locations, and segmentation maps with deep networks has shown considerable success in recent years.

Reinforcement learning (RL), on the other hand, has traditionally focused on learning from other kinds of sensing data, such as rays, collisions, distances, and similar observations. In this paper, we merge the two approaches, providing a complete pipeline to train reinforcement learning agents inside virtual environments relying only on vision, eliminating the need for traditional RL observations.

We demonstrate that visual stimuli, when encoded by a carefully designed vision encoder, provide informative observations, replacing ray-based approaches and drastically simplifying the reward shaping typical of classical RL. Our method is fully implemented inside Unreal Engine 5, from the real-time inference of visual features to the online training of the agents' behavior with the Proximal Policy Optimization (PPO) algorithm.
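
For reference, the clipped surrogate objective that PPO optimizes can be written in a few lines of NumPy. This is the standard textbook form (Schulman et al., 2017), shown purely for illustration; it is not code extracted from our in-engine implementation.

    import numpy as np

    def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
        # ratio = pi_theta(a|s) / pi_theta_old(a|s), computed in log space for stability
        ratio = np.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # negate: minimising this loss maximises the clipped surrogate objective
        return -np.mean(np.minimum(unclipped, clipped))

    # toy batch of two transitions
    print(ppo_clip_loss(np.array([-0.9, -1.2]), np.array([-1.0, -1.0]), np.array([1.0, -0.5])))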

To the best of our knowledge, this is the first in-engine solution targeting video games and simulation, enabling game developers to easily train vision-based RL agents without writing a single line of code.

Architecture


Peek-a-bot architecture. Left to right: the environment is sensed through visual observations (RGB images), which are fed to a backbone neural network chosen by the user. If the selected backbone is pre-trained for object detection (optional), the game developer can use the resulting bounding boxes as observations instead of the visual features. A small encoder processes the observations into inputs for the Policy and Critic networks, which are trained with the PPO algorithm. A small decoder transforms the Policy network's output into actions, which can be seen as the movement inputs of a typical video game controller.
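
A rough PyTorch sketch of this pipeline is shown below. The 512-dimensional feature vector, the latent size, and the four continuous movement axes are assumptions made for illustration, not the sizes used in the paper.

    import torch
    import torch.nn as nn

    FEATURE_DIM, LATENT_DIM, NUM_ACTIONS = 512, 64, 4   # illustrative sizes

    encoder = nn.Sequential(nn.Linear(FEATURE_DIM, LATENT_DIM), nn.ReLU())  # small encoder
    policy_head = nn.Linear(LATENT_DIM, NUM_ACTIONS)                        # Policy network
    critic_head = nn.Linear(LATENT_DIM, 1)                                  # Critic network

    def act(backbone_features):
        # encode the visual features shared by both heads
        z = encoder(backbone_features)
        value = critic_head(z)                    # value estimate used by PPO
        # small decoder: squash the policy output into [-1, 1] controller-like axes
        action = torch.tanh(policy_head(z))
        return action, value

    features = torch.randn(1, FEATURE_DIM)        # stand-in for the ONNX backbone output
    action, value = act(features)
    print(action.shape, value.shape)              # torch.Size([1, 4]) torch.Size([1, 1])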

Blueprints


Our method is designed with the game developer in mind. Visual inputs, backbone and RL network parameters, outputs, and every other setting are easily accessible through a GUI or Blueprint nodes, without writing a single line of code.

Left: a set of Blueprint nodes managing the visual observations.

Right: a snippet of our GUI for selecting the number of agents to spawn, the desired backbone network architecture, and the visual input configuration (resolution, FOV, etc.).
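
As a rough analogy for what these settings cover, the snippet below mirrors them in a plain Python dataclass. All field names here are our own invention for illustration and do not correspond to the plugin's actual parameters.

    from dataclasses import dataclass

    @dataclass
    class PeekABotConfig:                         # hypothetical name
        num_agents: int = 8                       # how many agents to spawn
        backbone_onnx_path: str = "backbone.onnx" # which pre-trained backbone to load
        use_bounding_boxes: bool = False          # detections instead of features as observations
        resolution: tuple = (128, 128)            # visual input resolution (width, height)
        fov_degrees: float = 90.0                 # camera field of view

    config = PeekABotConfig(num_agents=4, resolution=(256, 256))
    print(config)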

Visual Results


Peek-a-bot's visual observation system applied to multiple dynamic and unbounded scenes.

Top row: different types of training scenarios (hide and seek, car race, urban navigation, drone coverage) and visual observation cones (blue).

Bottom row: the corresponding visual observations to be encoded by the neural network into observation vectors. Our method handles a variety of rich, dynamic scenes while dropping the need for scenario-specific sensors.
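
As a hedged sketch of this encoding step, the Python snippet below runs a single RGB frame through a pre-trained ONNX backbone with onnxruntime, outside the engine. The file name "backbone.onnx", the 224x224 input size, and the single-output assumption are placeholders, not the paper's actual assets.

    import numpy as np
    import onnxruntime as ort

    # load any pre-trained backbone exported to ONNX (file name is a placeholder)
    session = ort.InferenceSession("backbone.onnx", providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name           # query the model's real input name

    frame = np.zeros((1, 3, 224, 224), dtype=np.float32) # placeholder RGB frame, NCHW
    outputs = session.run(None, {input_name: frame})      # forward pass through the backbone
    observation = np.asarray(outputs[0]).reshape(-1)      # flatten into an observation vector
    print(observation.shape)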

Related Links

Here you can find some related links.

Unreal Engine's Learning Agents plugin allows us to set up the training environment and render all the scenes.

BibTeX

@inproceedings{10.2312:stag.20241330,
  booktitle = {Smart Tools and Applications in Graphics - Eurographics Italian Chapter Conference},
  editor = {Caputo, Ariel and Garro, Valeria and Giachetti, Andrea and Castellani, Umberto and Dulecha, Tinsae Gebrechristos},
  title = {{Peek-a-bot: learning through vision in Unreal Engine}},
  author = {Pietra, Daniele Della and Garau, Nicola and Conci, Nicola and Granelli, Fabrizio},
  year = {2024},
  publisher = {The Eurographics Association},
  issn = {2617-4855},
  isbn = {978-3-03868-265-3},
  doi = {10.2312/stag.20241330}
}