Policy Proximal Optimization ― A New Frontier in Reinforcement Learning

Gladys wairimu
Oct 4, 2023
6 min read

Welcome to the world of reinforcement learning. One of the key algorithms used in this field of artificial intelligence is Proximal Policy Optimization (PPO). PPO is an algorithm used in reinforcement learning to train agents, but before we dive into Proximal Policy Optimization, let’s first understand a basic term used throughout this article: agent. An agent is an autonomous entity that interacts with its environment using actuators and sensors with an aim of achieving a certain goal. Examples of agents can be an automatic vacuum cleaner, a robotic arm, or a self-driving car. Reinforcement learning is a field in AI that can be used to train agents to learn how to perform certain tasks efficiently through the structure of penalties and rewards. So, at the heart of a RL program is an algorithm that rewards the agent when it performs well by awarding it certain points and penalizes it when it strays away from the intended goal. Hence, the autonomous agent strives to take actions that maximizes its chances of being rewarded. Think of it as training a dog. Usually, a pet owner gives a dog treats when it performs a certain trick correctly and gets denied the treat when it doesn’t. Eventually, the dog learns the set of actions that cause its owner to give it treats, and before you know it the dog has been fully trained after a number of repeated sessions. PPO is one of many algorithms used in RL to train agents to perform certain tasks.

Several articles allude to PPO as a ‘’state of the art’’ algorithm used in RL due to its recent creation in 2017 by OpenAI. But, in order to fully understand PPO, we must first understand the fundamental principles that make it. PPO is a mix of two algorithms namely, Policy Gradient and Trust Region Policy Optimization.

Policy Gradient

Policy Gradient is an algorithm used in RL that drives an agent to make decisions within an environment that maximizes reward. To understand this further, let’s break down the two-worded phrase into its individual parts. The ‘’policy’’ in this algorithm is a strategy used to map states or observations of an agent to actions. Policy gradient algorithms are typically formulated as optimization problems whose goal is to find the policy parameters (strategies) that lead to the highest return. In the field of Mathematics, Physics or Science, the term ‘’gradient’’ fundamentally refers to the rate of change of a function with respect to its input variables. This is no different in AI. In policy gradient algorithms, a gradient is used to find the minimum or maximum of a function, and update the model’s parameters towards the direction that will give the agent the highest rewards.

Policy Gradient is most suitable for use in continuous actions spaces. A continuous action space is a scenario where an agent can choose from an infinite set of actions rather than a discrete set of values. Now, let’s look at the main characteristics of policy gradient.

Continuous in nature – policy gradient algorithms are used in continuous action spaces where there are infinite possible action combinations and continuously varying parameters.
Real-numbered actions – actions in policy gradient are represented as real numbers. This contributes to its continuous nature and allows for a smooth transition between actions.
Stochastic and adaptable in nature – generally speaking, stochasticity has to do with randomness or unpredictability. In this case, we will also see an aspect of adaptability. A deterministic agent or algorithm is the opposite of a stochastic one. A deterministic agent follows a rigid pre-defined plan for every situation it encounters. So, when placed in the same situation over and over again, it will make the same decision and take the same actions. This makes it predictable. However, a stochastic agent takes a different, non-predefined and unpredictable plan. This agent is curious and wants to experiment with various actions, including the ones it hasn’t tried before. So, if placed in the same situation over and over again, it might take different actions and make different decisions each time. This is how the policy gradient algorithm is – stochastic. After taking a series of actions, the agent evaluates the rewards or penalties it has received and tries to adjust its parameters in order to take actions with the highest reward, hence making them adaptable.

Trust Region Policy

Trust Region Policy Optimization is an RL algorithm that was introduced in 2015 by John Schulman. While TRPO also works by finding a policy update that maximizes the cumulative rewards, it differs from Policy Gradient by avoiding parameter changes that lead to a large change between the previous policy and the next policy. Basically, policy variations are kept within the ‘’trust region.’’ Here are some main characteristics of TRPO.

Policy Optimization – the policy in TRPO is to map states or observations to actions that lead to a maximization of rewards.
Stability – TRPO is done in iterations where each iteration involves optimizing a policy (i.e mapping states to actions). The policy is then updated in the next iteration in order to get closer to the optimal reward. However, these changes to the policy after each iteration are done in small changes. This ensures policy updates are stable. Large, drastic changes can lead to catastrophic failures in learning.
Trust Region – a trust region in TRPO is a constraint that determines how far the new policy is allowed to deviate from the previous policy. A common method to create a trust region in TRPO is using the Kullback-Leibler divergence. Additionally, TRPO uses a line search to find an optimal step size from one policy to the next. In basic terms, a line search works by exploring different step sizes along a predefined search direction to determine the step size that leads to the greatest improvement in the objective function.
Monotonic improvement – in TRPO, the policy continuously improves from one iteration and policy to the next, moving closer and closer to the optimal policy that gives the greatest reward. Basically, the policy consistently improves without ever receding.

So, as you recall from the beginning of this article, PPO is a mixture of both Policy Gradient and Trust Region Policy Optimization. It leverages the advantages of both policies thereby being stronger and more efficient than the individual policies to create a RL algorithm that brings a model closer and closer to the reward each time it trains/learns. Hence, we shall notice that its characteristics come from both policies. Here are some.

Sample Efficiency – in PPO an agent learns and improves its policy using relatively few attempts. It, however, makes efficient use of the few attempts it uses to collect data that is used during training. This quality makes it suitable for resource-intensive and time-consuming environments.
Two-step Optimization – PPO works in two iterations as it moves closer to the maximum reward.

Step 1 – PPO collects data by running the current policy

Step 2 – it then uses the collected data to optimize the previous policy using a surrogate objective function. More about the surrogate objective function is explained below.

Parallelism – PPO runs multiple instances of an agent in different environments simultaneously.
Surrogate Objective Function – a surrogate objective function is a ‘tool’ used in PPO to help it to learn ‘safely’ by telling the agent how to adjust its strategy/policy to get better results in a way that is neither too risky nor too conservative, but rather just enough. It keeps changes made in policy within a safe range (i.e., within the trust region). This helps to maintain stability.
Policy Optimization – PPO aims to choose parameters that create a policy that will move agent towards attaining the maximum rewards and reducing its chances of being penalized due to taking actions that stray away from the agent’s goal.
Trust Region – PPO uses a trust region, which is a constraint that determines how far the new policy can deviate from the previous policy. The goal is to create small updates between policies to ensure stability of the algorithm.

Pseudocode

The following is a simple pseudocode illustrating how PPO algorithms works programmatically:

Initialize policy network with random weights
Initialize value function network with random weights
Set hyperparameters (learning rates, discount factor, clip epsilon, etc.)

for each episode do:
    Initialize an empty buffer to store trajectory data

    for each timestep in episode do:
        Collect data by executing the current policy in the environment
        Store (state, action, reward, next_state, done) in the buffer

    Compute advantages for each timestep in the buffer using the value function network

    for a fixed number of optimization iterations do:
        Compute the surrogate objective for each data point in the buffer:
            Compute the ratio of the new policy to the old policy
            Compute the clipped surrogate objective
        Compute the value function loss (e.g., mean squared error)
        Compute the total loss as a combination of the policy loss and the value function loss
        Update the policy and value function networks using gradient descent within a trust region

    end for

end for

In conclusion, PPO is a powerful and popular algorithm used in reinforcement learning to train an agent to accomplish a specific task by teaching it to make decisions that will lead to an increase in rewards and a decrease in penalties, just like training a dog to perform a trick, for example. It is made by combining the traits of Policy Gradient and Trust Region Policy Optimization. It works by comparing the likelihood of an agent to take certain actions under its current policy to the likelihood of the same agent to have taken those same actions in the previous policy. Then, it aims to increase the probability of actions that lead to better rewards. This way it learns to make improvements in its policy, and hence its actions.

Policy Gradient

Trust Region Policy

Pseudocode

Comments