In recent years, DeepMind, OpenAI and others have showcased enormous advances in Reinforcement Learning (RL). Their agents achieve superhuman or human-level performance in various games, including Atari games, the Chinese board game Go, Dota 2 and StarCraft II.
At the same time, developing Deep Learning solutions for supervised learning has become easier and easier thanks to popular frameworks like TensorFlow. With the new TF-Agents framework, developing Reinforcement Learning solutions with TensorFlow now becomes easier and more straightforward as well.
In this post, we use TF-Agents to train a neural network agent to play a simple scenario of Doom. We will present the most relevant parts of the code here. Have a look at our GitHub repository to find the full implementation and additional required files.
Since this guide focuses on the usage of TF-Agents' high-level APIs, we will not go deep into the details of reinforcement learning and the algorithms used. If you are new to RL, please check out this awesome series by Arthur Juliani, who gives a great introduction from Q-learning up to A3C. For more insights into Proximal Policy Optimization (PPO), the OpenAI webpage is a good starting point.
What is TF-Agents?
TF-Agents is TensorFlow's new framework for developing RL use cases. For RL to work properly, often every fine detail matters, which can make RL hard to implement on your own. With TF-Agents, all the required building blocks are already implemented for you, so you can concentrate on your use case and on tuning hyperparameters.
At the moment, TF-Agents is still in an early phase of development but already implements several state-of-the-art RL algorithms.
What is Proximal Policy Optimization (PPO)?
PPO is a class of RL algorithms that has become the default choice for many use cases. According to OpenAI, PPO algorithms perform “comparably or better than state-of-the-art approaches while being much simpler to implement and tune”. Since PPO is a model-free, on-policy algorithm that can be used for both continuous and discrete action spaces, a great many use cases can be tackled with it.
For details on PPO and how it works, have a look at this OpenAI introduction.
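The core idea of PPO is its clipped surrogate objective, which keeps each policy update close to the previous policy. The following minimal sketch illustrates it for a single sample; the function name and epsilon value are illustrative, and TF-Agents computes this internally (configured via `importance_ratio_clipping` later in this post).

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one sample.

    ratio: pi_new(a|s) / pi_old(a|s), the importance ratio.
    advantage: estimated advantage of the taken action.
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Take the minimum so that large policy changes gain no extra reward.
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio far above 1 + epsilon gains nothing extra for a positive advantage:
print(ppo_clipped_objective(1.5, advantage=1.0))  # clipped to 1.2 * 1.0 = 1.2
```

Note that the minimum makes the objective pessimistic: for a negative advantage, the unclipped (worse) value is used, so harmful updates are never hidden by clipping.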
Scenario: Doom Basic
To be able to let our agent play Doom, we utilize the ViZDoom project, which aims to allow “developing AI bots that play Doom using only the visual information”. ViZDoom itself is based on the ZDoom project, which makes Doom playable on modern PCs.
ViZDoom offers several preconfigured scenarios, ranging from corridors with several monsters shooting at the player to labyrinths where the player must find its way to a target room.
To start off easy, we want our agent to learn to handle the “basic” scenario. The map consists of a rectangle where the player is on one side and a monster is spawned at a random position on the other side. The player can only maneuver left and right and fire its weapon. When the monster is hit, the episode (one match) is finished. The player gets a reward of +101 points for killing the monster and -5 points if it doesn’t kill it within 300 time-steps. Additionally, a reward of -1 is received for every time-step the monster is alive.
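Ignoring off-by-one details of the final step, the reward scheme above boils down to a simple formula. The helper below is a hypothetical illustration of how an episode's total return depends on when (or whether) the monster dies; the scenario itself computes this internally.

```python
def basic_scenario_return(kill_step, max_steps=300):
    """Approximate total return of one 'basic' scenario episode.

    kill_step: time-step at which the monster dies, or None on a timeout.
    Rules: +101 for the kill, -5 on timeout, -1 per time-step alive.
    """
    if kill_step is None:
        return -5 - max_steps   # timeout penalty plus living cost for all steps
    return 101 - kill_step      # kill bonus minus living cost so far

print(basic_scenario_return(20))    # fast kill: 81
print(basic_scenario_return(None))  # timeout: -305
```

So the faster the agent kills the monster, the closer the return gets to +100.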
The following video shows a fully trained agent performing 10 episodes of this scenario. (Sometimes the muzzle flash is not visible due to the reduced frame rate of the gif.)
This tutorial assumes you have Python 3.x installed. While most of the code will probably also work with Python 2, it might require some adaptations.
Please note: At the time of writing, TensorFlow 2.0 is not yet released and is only available as version 2.0.0b1. Furthermore, since TF-Agents has not been released yet, we install it from source. (TF-Agents also provides nightly pip packages, but they are currently not updated regularly.)
Please install the following packages:

- TensorFlow for CPU: pip3 install -U tensorflow==2.0.0b1. To install TensorFlow with GPU support, have a look at the requirements and install pip3 install -U tensorflow-gpu==2.0.0b1.
- TF-Agents: Clone the TF-Agents repository from GitHub and install it via pip install -e <directory where you clone tf-agents to>
- ViZDoom: pip3 install -U vizdoom
- OpenCV: apt install python3-opencv
- imageio: pip3 install -U imageio
- ffmpeg: apt install ffmpeg
To train an RL agent with TF-Agents, we need an environment for the game and a configuration of the RL algorithm. For popular environment collections like Gym, Atari or MuJoCo, TF-Agents offers predefined suites that can be used to load those environments. For Doom, we have to write our own environment.
The DoomEnvironment will be a small wrapper around ViZDoom’s DoomGame class. This wrapper configures the game with the desired scenario and maps the TF-Agents environment API to the ViZDoom API.
```python
def configure_doom(config_name="basic.cfg"):
    game = DoomGame()
    game.load_config(config_name)
    game.init()
    return game
```
In the constructor of the DoomEnvironment class, the game is loaded and the number of available actions is saved (other scenarios might have a different number of actions). Furthermore, an action_spec and an observation_spec are declared. These define the format of the actions and observations provided by this environment.
The remainder of the code maps the TF-Agents API to the ViZDoom API:
- ViZDoom expects a one-hot encoded action, whereas TF-Agents provides the index of the action.
- With our configuration, ViZDoom provides a 120x160 pixel RGB image. To have a structure similar to the Atari games, this image is cropped and resized to 84x84 pixels.
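The cropping and scaling step can be sketched with plain NumPy. Here a nearest-neighbour index lookup stands in for the cv2.resize call used in the actual environment, so the sketch has no OpenCV dependency; the crop offsets match those in the environment code.

```python
import numpy as np

def preprocess_frame(frame):
    """Crop a 120x160 RGB frame to a square, downsample it to 84x84
    (nearest neighbour as a stand-in for cv2.resize) and scale to [0, 1]."""
    cutout = frame[10:-10, 30:-30]                  # 100x100 center crop
    idx = np.arange(84) * cutout.shape[0] // 84     # nearest-neighbour indices
    resized = cutout[idx][:, idx]                   # pick rows, then columns
    return resized.astype(np.float32) / 255.0

frame = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
out = preprocess_frame(frame)
print(out.shape, out.dtype)  # (84, 84, 3) float32
```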
```python
class DoomEnvironment(py_environment.PyEnvironment):

    def __init__(self):
        super().__init__()
        self._game = configure_doom()
        self._num_actions = self._game.get_available_buttons_size()
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0,
            maximum=self._num_actions - 1, name='action')
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(84, 84, 3), dtype=np.float32, minimum=0, maximum=1,
            name='observation')

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        self._game.new_episode()
        return time_step.restart(self.get_screen_buffer_preprocessed())

    def _step(self, action):
        if self._game.is_episode_finished():
            # The last action ended the episode.
            # Ignore the current action and start a new episode.
            return self.reset()

        # construct the one-hot encoded action as required by ViZDoom
        one_hot = [0] * self._num_actions
        one_hot[action] = 1

        # execute the action and receive the reward
        reward = self._game.make_action(one_hot)

        # return a transition depending on the game state
        if self._game.is_episode_finished():
            return time_step.termination(self.get_screen_buffer_preprocessed(), reward)
        else:
            return time_step.transition(self.get_screen_buffer_preprocessed(), reward)

    def render(self, mode='rgb_array'):
        """ Return image for rendering. """
        return self.get_screen_buffer_frame()

    def get_screen_buffer_preprocessed(self):
        """ Preprocess the frame for the agent by:
        - cutting out an interesting square part of the screen
        - downsampling the cutout to 84x84 (same as used for the Atari games)
        - normalizing the image to the interval [0, 1]
        """
        frame = self.get_screen_buffer_frame()
        cutout = frame[10:-10, 30:-30]
        resized = cv2.resize(cutout, (84, 84))
        return np.divide(resized, 255, dtype=np.float32)

    def get_screen_buffer_frame(self):
        """ Get the current screen buffer, or an empty one if the episode is finished. """
        if self._game.is_episode_finished():
            return np.zeros((120, 160, 3), dtype=np.float32)
        else:
            return self._game.get_state().screen_buffer
```
Defining the Agent's Neural Networks
To train an agent with PPO, an actor network and a value network are required. TF-Agents offers classes that allow an easy configuration of the neural networks we want to use. The following code shows how they are configured.
```python
def create_networks(observation_spec, action_spec):
    actor_net = ActorDistributionRnnNetwork(
        observation_spec,
        action_spec,
        conv_layer_params=[(16, 8, 4), (32, 4, 2)],
        input_fc_layer_params=(256,),
        lstm_size=(256,),
        output_fc_layer_params=(128,),
        activation_fn=tf.nn.elu)
    value_net = ValueRnnNetwork(
        observation_spec,
        conv_layer_params=[(16, 8, 4), (32, 4, 2)],
        input_fc_layer_params=(256,),
        lstm_size=(256,),
        output_fc_layer_params=(128,),
        activation_fn=tf.nn.elu)
    return actor_net, value_net
```
Our two networks have mostly the same structure:
- two convolutional neural network (CNN) layers
- a fully connected (FC) layer
- a long short-term memory (LSTM) layer
- another FC layer
On top of these layers, the actor distribution network adds an FC layer with one unit per action, and the value network adds an FC layer with a single unit to calculate the value of the observation.
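Each conv_layer_params tuple is (filters, kernel_size, stride). As a quick sanity check, we can compute the resulting feature-map sizes for the 84x84 input, assuming unpadded ('valid') convolutions, which is what these encoder networks use by default to our knowledge:

```python
def conv_out(size, kernel, stride):
    # Output size of a 'valid' (no padding) convolution along one dimension.
    return (size - kernel) // stride + 1

size = 84
for filters, kernel, stride in [(16, 8, 4), (32, 4, 2)]:
    size = conv_out(size, kernel, stride)
    print(f"{filters} filters -> {size}x{size} feature maps")
# 84x84 -> 20x20 after the first layer, 20x20 -> 9x9 after the second
```

The 9x9x32 output is then flattened and fed into the 256-unit FC layer.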
Training with PPO
To train the agent with TF-Agents' PPO implementation, we have to create an object of the class PPOAgent and provide it with the specifications of the time steps (observations, rewards, ...) and the allowed actions. Furthermore, we provide the optimizer (e.g. AdamOptimizer) that should be used during training. To reduce variance during training, we combine multiple DoomEnvironments with TF-Agents' ParallelPyEnvironment. If you run out of GPU memory, you might need to decrease the number of parallel environments. PPOAgent takes additional hyperparameters that can be tuned to further improve training performance.
```python
eval_tf_env = tf_py_environment.TFPyEnvironment(DoomEnvironment())
tf_env = tf_py_environment.TFPyEnvironment(
    parallel_py_environment.ParallelPyEnvironment(
        [DoomEnvironment] * num_parallel_environments))

actor_net, value_net = create_networks(tf_env.observation_spec(), tf_env.action_spec())

global_step = tf.compat.v1.train.get_or_create_global_step()
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate, epsilon=1e-5)

tf_agent = ppo_agent.PPOAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    optimizer,
    actor_net,
    value_net,
    num_epochs=num_epochs,
    train_step_counter=global_step,
    discount_factor=0.995,
    gradient_clipping=0.5,
    entropy_regularization=1e-2,
    importance_ratio_clipping=0.2,
    use_gae=True,
    use_td_lambda_return=True
)
tf_agent.initialize()
```
To collect the training samples, a TFUniformReplayBuffer is created. To run the environment and fill up the replay buffer, an instance of DynamicEpisodeDriver is created. It uses the agent's collect_policy to act in the environment.
A training iteration then consists of the following steps:
- Collect a number of full episodes with the episode driver.
- Train the agent on the collected trajectories.
- Clear the replay buffer for the next iteration.
```python
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    tf_agent.collect_data_spec,
    batch_size=num_parallel_environments,
    max_length=replay_buffer_capacity)

collect_driver = dynamic_episode_driver.DynamicEpisodeDriver(
    tf_env,
    tf_agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_episodes=collect_episodes_per_iteration)

# num_iterations: the desired number of training iterations
for _ in range(num_iterations):
    collect_driver.run()
    trajectories = replay_buffer.gather_all()
    tf_agent.train(experience=trajectories)
    replay_buffer.clear()
```
The full sample code can be found in our repository.
Logging the Training's Progress
To get better insights into the training's progress, we can use TensorBoard summaries and create videos of episodes performed by our agent. For the full working code with TensorBoard logging and video creation, please refer to the sample in our repository.
To create videos of the playing agent, we can use the render method of our DoomEnvironment. However, to be able to let the agent process a full episode with the LSTM, we need to utilize the wrapped TFEnvironment. Fortunately, since the TFEnvironment is only a wrapper, we can use both in combination to render our videos with the following snippet.
```python
def create_video(py_environment: PyEnvironment, tf_environment: TFEnvironment,
                 policy: tf_policy.Base, num_episodes=10, video_filename='imageio.mp4'):
    with imageio.get_writer(video_filename, fps=60) as video:
        for episode in range(num_episodes):
            time_step = tf_environment.reset()
            state = policy.get_initial_state(tf_environment.batch_size)
            video.append_data(py_environment.render())
            while not time_step.is_last():
                policy_step: PolicyStep = policy.action(time_step, state)
                state = policy_step.state
                time_step = tf_environment.step(policy_step.action)
                video.append_data(py_environment.render())
```
In the create_video() function, we first reset the environment and get the initial state for the LSTM. Afterwards, the policy is used to calculate the policy_step based on the current time_step (the observation). From the policy_step, we get the new internal LSTM state and the next action to take. The latter is then executed in the environment to retrieve the next time_step. For every step in the environment, we call the render() method of the PyEnvironment and append the image to the video.
The following results were created with the extended training script available in our repository. Training finished after about 15 hours on a GeForce RTX 2080 Ti with 11GB of VRAM.
To view the results in TensorBoard, start it from the command line via tensorboard --logdir=<root_dir>, where <root_dir> is the directory specified when training the agent.
In the TensorBoard dashboard, several graphs give insight into the training progress. We can get a good overview of the agent's performance from the graphs in the Metrics section, as shown in the following figure. In our case, the blue line visualizes the performance during training and the orange line the performance during evaluation.
In general, the average return should increase towards 100 and the average episode length should decrease towards 0. However, since the agent usually has to move before it can kill the monster, the episode length will stay above 0 and the reward will be correspondingly lower.
Learning Progress in Videos
The generated videos can give a good impression of the agent's current behaviour and strategy. The following four clips have been taken from a training run and show the progression and improvement of the agent over time. Please note that every training run behaves a bit differently due to the random initialization and the random environment behavior.
In this tutorial, we showed how to create a custom environment for TF-Agents and how to use it to train a neural network agent with Proximal Policy Optimization in just about 250 lines of code. While TF-Agents is still in an early phase of development, it already makes powerful reinforcement learning algorithms simple to use.
Feel free to get in touch if you have any comments or feedback. We hope the tutorial was helpful to get an easy start into reinforcement learning with TF-Agents.