Enhancing Exploration via Off-Reward Dynamic Reference Reinforcement Learning

Yamen Habib, Dmytro Grytskyy, Rubén Moreno-Bote

Abstract

In reinforcement learning (RL), balancing exploration and exploitation is essential for maximizing a target reward function. Traditional methods often employ regularizers, such as a penalty on deviations from a reference policy, to prevent the policy from becoming deterministic too early. This paper introduces a novel approach in which an off-reward dynamic reference policy (ORDRP) is trained with a different reward function alongside the target policy to guide exploration.

We use Kullback–Leibler divergence as a regularization technique and train the ORDRP either with the maximum occupancy principle or Laplacian intrinsic off-rewards.

We prove the convergence of the ORDRP iteration method and validate our theory within an actor-critic framework. Our experiments in challenging environments reveal that incorporating an ORDRP enhances exploration, resulting in superior performance and higher sample efficiency on benchmarks compared to state-of-the-art baselines. Our findings suggest that dynamically training a reference policy with an off-reward function alongside the main policy can significantly improve learning outcomes.

Intuition

We assume that the target policy \(\pi\) is improved at each iteration to maximize reward, subject to a Kullback–Leibler (KL) penalty for deviating too far from the current reference policy \(\mu\).

When using a static uniform reference policy (Fig. A, bottom panel), the KL penalty tends to greatly reduce the likelihood of an action that has been discovered to be good by the target policy (peak of the Gaussian), while it increases the likelihood of harmful or useless actions due to the uninformative regularization term (left tail of the Gaussian).

When using the current target policy as the reference (Fig. B, bottom), the new target policy differs only minimally from the reference policy, so exploration is not significantly enhanced beyond what has already been learned (Fig. B, top).

In contrast, a dynamic reference policy that lies between purely uninformative and purely target-aligned references (Fig. C, bottom) can inject relevant information about historically useful actions while remaining flexible (Fig. C, top).

This suggests that for the reference policy to be effective, it should follow a different reward function than the target one (i.e., being off-reward), such that it retains knowledge about generally good and safe actions while dynamically adapting (i.e., being dynamic) as new regions of action-state space are explored during learning.
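To make this intuition concrete, the following minimal sketch uses the standard closed-form solution of a KL-regularized improvement step, \(\pi_{\text{new}}(a) \propto \mu(a) \exp(Q(a)/\alpha)\), on a single state with five actions. The Q-values, the temperature, and both reference distributions are hypothetical values chosen for illustration and are not taken from our experiments; the sketch also assumes a strictly positive reference.

```python
import numpy as np

def kl_regularized_update(q_values, reference, alpha):
    """Maximizer of E_pi[Q] - alpha * KL(pi || reference) over distributions pi.

    Assumes every entry of `reference` is strictly positive.
    """
    logits = np.log(reference) + q_values / alpha
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical single-state problem: action 3 is good, action 0 is harmful.
q_values = np.array([-1.0, 0.0, 0.5, 2.0, 0.2])
alpha = 1.0

# Static, uninformative reference.
uniform_ref = np.full(5, 0.2)
# Hypothetical informed reference that already down-weights the harmful action.
informed_ref = np.array([0.02, 0.18, 0.30, 0.30, 0.20])

pi_uniform = kl_regularized_update(q_values, uniform_ref, alpha)
pi_informed = kl_regularized_update(q_values, informed_ref, alpha)

print("uniform reference :", np.round(pi_uniform, 3))
print("informed reference:", np.round(pi_informed, 3))
```

Under these assumed numbers, both updates keep action 3 as the most likely action, but the informed reference leaves far less probability on the harmful action 0 than the uniform reference does.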

Off-Reward Dynamic Reference Policy (ORDRP)

In our approach, the dynamic reference policy \(\mu\) has its own distinct off-reward function, defined on the same Markov decision process (MDP). We assume that the sequence of reference policies \(\{\mu^n\}_{n=0}^{\infty}\) obtained during training converges to an optimal policy \(\mu^{*}\) defined as \(\mu^{*} = \arg \max_{\mu} \sum_{t=0}^{\infty} \gamma^t \mathbb{E}_{(s_{t},a_{t})\sim p_{\pi}} [f(\mathcal{M})]\). Here, \( f(\mathcal{M}) \) represents a function defined over the MDP \(\mathcal{M}\).

Off-Reward Dynamic Reference Value Iteration

\begin{equation}\label{eq:state_value_function_sequence_def} \tilde{V}^{n+1}(s_{t}) = (B_{\mu^n} \tilde{V}^n)(s_t) = \alpha \log \sum_{a_{t} \in A} \mu^{n}(a_{t}|s_{t}) \exp \left( \frac{1}{\alpha} (r(s_{t}, a_{t}) + \gamma \mathbb{E}_{s_{t+1} \sim p} [\tilde{V}^{n}(s_{t+1})]) \right) \, . \end{equation}
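As an illustration, the backup above can be sketched for a small tabular MDP as follows. The array layout, the function names `reference_soft_backup` and `ordr_value_iteration`, and the randomly generated MDP and reference sequence are assumptions made for this example; in particular, the references here are random placeholders rather than policies trained on an off-reward.

```python
import numpy as np

def reference_soft_backup(v, mu, rewards, transitions, alpha, gamma):
    """One application of B_mu to v.

    v: (S,), mu: (S, A) strictly positive, rewards: (S, A),
    transitions: (S, A, S) with transitions[s, a] a distribution over s'.
    """
    q = rewards + gamma * (transitions @ v)          # r + gamma * E[V(s')]
    z = np.log(mu) + q / alpha                       # log of mu * exp(q / alpha)
    z_max = z.max(axis=1)
    # alpha * log sum_a mu(a|s) exp(q(s, a) / alpha), computed stably
    return alpha * (z_max + np.log(np.exp(z - z_max[:, None]).sum(axis=1)))

def ordr_value_iteration(references, rewards, transitions, alpha, gamma):
    """Apply the backup once per reference policy mu^0, mu^1, ..."""
    v = np.zeros(rewards.shape[0])
    for mu in references:
        v = reference_soft_backup(v, mu, rewards, transitions, alpha, gamma)
    return v

# Hypothetical MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
rewards = np.ones((n_states, n_actions))  # constant reward, for a sanity check
# Placeholder sequence of reference policies (random, not trained).
references = [rng.dirichlet(np.ones(n_actions), size=n_states) for _ in range(500)]

v_final = ordr_value_iteration(references, rewards, transitions, alpha=0.5, gamma=0.9)
print(v_final)
```

With a constant reward of 1, the backup reduces to \(V \leftarrow 1 + \gamma V\) for any reference, so the iterates approach \(1/(1-\gamma) = 10\). This is only a convenient sanity check of the sketch, not a result from the paper.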

Off-Reward Dynamic Reference Policy Iteration

\begin{equation} \pi_0 \xrightarrow[\mu_1]{\text{E}} Q_{\pi_0/\mu_1} \xrightarrow[\mu_1]{\text{I}} \pi_1 \xrightarrow[\mu_2]{\text{E}} Q_{\pi_1/\mu_2} \xrightarrow[\mu_2]{\text{I}} \pi_2 \xrightarrow[\mu_3]{\text{E}} \dots \xrightarrow[\mu_\infty]{\text{I}} \pi_\infty \xrightarrow[\mu_\infty]{\text{E}} Q_{\pi_\infty/\mu_\infty}. \end{equation}
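Reading E as a policy evaluation step and I as a policy improvement step, the alternation above can be sketched in tabular form. The specific evaluation rule used here, \(Q(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[\sum_{a'} \pi(a'|s')(Q(s',a') - \alpha \log \frac{\pi(a'|s')}{\mu(a'|s')})]\), the function names, and the random MDP and reference sequence are assumptions for illustration only, not a statement of the exact procedure used in our experiments.

```python
import numpy as np

def evaluate(pi, mu, rewards, transitions, alpha, gamma, sweeps=300):
    """E step (assumed form): KL-regularized Q of pi relative to reference mu."""
    q = np.zeros_like(rewards)
    for _ in range(sweeps):
        # State value: expected Q minus alpha * KL(pi || mu) at each state.
        v = (pi * (q - alpha * np.log(pi / mu))).sum(axis=1)
        q = rewards + gamma * (transitions @ v)
    return q

def improve(q, mu, alpha):
    """I step: pi(a|s) proportional to mu(a|s) * exp(Q(s, a) / alpha)."""
    z = np.log(mu) + q / alpha
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical MDP with 5 states and 3 actions.
rng = np.random.default_rng(1)
n_states, n_actions, alpha, gamma = 5, 3, 0.5, 0.9
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
rewards = rng.normal(size=(n_states, n_actions))
# Placeholder reference sequence mu_1, mu_2, ... (random, not trained).
references = [rng.dirichlet(5 * np.ones(n_actions), size=n_states) for _ in range(4)]

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # pi_0: uniform
for mu in references:
    q = evaluate(pi, mu, rewards, transitions, alpha, gamma)  # E under mu_k
    pi = improve(q, mu, alpha)                                # I under mu_k

print(np.round(pi, 3))
```

In this sketch each reference is used for exactly one evaluation and one improvement step, mirroring the diagram; how often the reference is refreshed in practice is a design choice that the sketch does not settle.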

MOP reference

The Maximum Occupancy Principle (MOP) is a novel approach to modeling an agent's behavior that diverges from traditional reward-maximization frameworks. The goal of MOP is to maximize the occupancy of future action-state paths rather than to seek extrinsic rewards. This principle posits that agents are intrinsically motivated to explore and visit rare or unoccupied action-states, thus ensuring a broad and diverse range of behaviors over time.

The off-reward function in MOP is the entropy of the paths taken by the agent, \[ R(\tau) = -\sum_{t=0}^{\infty} \gamma^t \ln \left( \mu_{\text{MOP}}^{\alpha}(a_t | s_t) p^{\beta}(s_{t+1} | s_t, a_t) \right) \;, \] where \(\alpha > 0\) and \(\beta \geq 0\) are weights for actions and states, respectively, and \(\gamma\) is the discount factor.
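For a single sampled path, the return above can be evaluated directly, since \(\ln(\mu^{\alpha} p^{\beta}) = \alpha \ln \mu + \beta \ln p\). The sketch below assumes a finite, truncated path and takes the per-step action and transition probabilities as plain lists; the function name `mop_path_return` and the example probabilities are hypothetical.

```python
import math

def mop_path_return(action_probs, transition_probs, alpha=1.0, beta=0.0, gamma=0.99):
    """Discounted path return R(tau) for a finite (truncated) path.

    action_probs[t]     = mu_MOP(a_t | s_t)
    transition_probs[t] = p(s_{t+1} | s_t, a_t)
    """
    total = 0.0
    for t, (p_a, p_s) in enumerate(zip(action_probs, transition_probs)):
        total -= gamma ** t * (alpha * math.log(p_a) + beta * math.log(p_s))
    return total

# Hypothetical 3-step paths in a deterministic environment (p = 1 at every step).
uniform_path = mop_path_return([0.25] * 3, [1.0] * 3, alpha=1.0, beta=0.0, gamma=1.0)
peaked_path = mop_path_return([0.9] * 3, [1.0] * 3, alpha=1.0, beta=0.0, gamma=1.0)

print(uniform_path, peaked_path)
```

In this example the path generated by a uniform policy over four actions earns \(3 \ln 4\), while the path generated by a nearly deterministic policy earns much less, which is the sense in which low-probability actions are preferred.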

The agent maximizes this intrinsic reward by preferring low-probability actions and transitions, which encourages exploration and the occupancy of a wide range of action-states. This intrinsic motivation leads to behaviors that appear goal-directed and complex without the necessity of explicitly defined extrinsic rewards.

Lap reference

We can think of learning a low-dimensional state representation as building a mapping between the original space and a low-dimensional representation such that points that are near (or distant) in the original space remain near (or distant) in the representation. This problem is referred to as the graph drawing problem. In the approach taken here, which is valid for large and continuous spaces, we want to build a set of features \(\phi(s) = (f_1(s),...,f_d(s)) \in \mathbb{R}^d\) that faithfully represent the input space \(s \in S\). Once the embedding function \(\phi\) has been learned, we ask our off-reward dynamic reference policy to maximize the expected cumulative reward \(\sum_{t=0}^{\infty} \gamma^t R(s_{t}, a_{t})\), where the instantaneous off-reward \(R(s_{t}, a_{t})\) is defined as \[ R(s_{t}, a_{t}) = \frac{||\phi(s_{t+1})-\phi(s_{t})||_{2}^2}{||\phi(s_{t+1})||_{2}^2 + ||\phi(s_{t})||_{2}^2} \;. \] This novel off-reward function consistently encourages the agent to transition to states that are furthest from the current one in the embedding, which leads to improved exploration. It prevents the agent from wasting time on inefficient transitions that produce only minimal changes in the embedded state space, leading to a more sample-efficient algorithm.
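Given embeddings of two consecutive states, the instantaneous off-reward can be computed as in the sketch below. The function name `laplacian_off_reward` and the small constant `eps`, added to avoid division by zero when both embeddings vanish, are assumptions of this example; the embedding vectors are hypothetical and would in practice come from the learned \(\phi\).

```python
import numpy as np

def laplacian_off_reward(phi_s, phi_next, eps=1e-8):
    """Normalized squared distance between consecutive state embeddings.

    `eps` is an assumed safeguard and is not part of the definition.
    """
    phi_s = np.asarray(phi_s, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    numerator = np.sum((phi_next - phi_s) ** 2)
    denominator = np.sum(phi_next ** 2) + np.sum(phi_s ** 2)
    return numerator / (denominator + eps)

# Hypothetical 2-dimensional embeddings.
phi_current = np.array([1.0, 0.0])
r_stay = laplacian_off_reward(phi_current, np.array([1.0, 0.0]))       # no movement
r_orthogonal = laplacian_off_reward(phi_current, np.array([0.0, 1.0])) # orthogonal
r_opposite = laplacian_off_reward(phi_current, np.array([-1.0, 0.0]))  # opposite

print(r_stay, r_orthogonal, r_opposite)
```

Because \(||a-b||_2^2 \le 2(||a||_2^2 + ||b||_2^2)\), this reward lies in \([0, 2]\): it is 0 when the embedding does not change and reaches 2 when the next embedding is the negative of the current one.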

Solving Escape Room

Soft Actor Critic

Incorporates entropy regularization into its objective function to balance exploration and exploitation.

MOP Reference

Maximizes the occupancy of action-state space by maximizing cumulative action entropy.

Lap Reference

Leverages spectral information from the Laplacian of the transition matrix to guide exploration by encouraging the agent to visit spectrally distinct states.

Mujoco Benchmark

Environment (steps)     SAC    PPO    TD3   DDPG  MOP reference  Lap reference
Ant-v4 (500k)          5754   1431   3857   -521           6166           6016
Ant-v4 (1M)            6427   1929   4378    413           6647           6695
Hum-v4 (500k)          4761    546     93    130           4972           5294
Hum-v4 (1M)            5316    661     93    127           5075           5489
Walker2d-v4 (500k)     3347   2781   3587    396           3656           3919
Walker2d-v4 (1M)       4297   3444   4172   1449           4173           4558
Hopper-v4 (500k)       2292    961   3204   1450           2209           1691
Hopper-v4 (1M)         2766    935   3294   1721           2784           2394
Cheetah-v4 (500k)      8048   4840   8777   9968           8348           9098
Cheetah-v4 (1M)        9378   5463  10304  11607           9702          10516
Mean                   5238   2299   4175   2674           5373           5567

Conclusion

The findings from our experiments highlight that even state-of-the-art methods often struggle with directionless exploration or overcommitment to suboptimal policies. Our ORDRP approach addresses these challenges by providing a reference policy that evolves in response to the learning environment. This adaptive nature allows for more structured exploration, reducing the risk of inefficient learning trajectories.

Our study explored a novel approach to RL by introducing an off-reward dynamic reference policy to guide the main policy through complex learning environments. This reference policy evolves alongside the target policy, allowing for more targeted exploration and reducing the inefficiencies of traditional exploration methods. Our approach demonstrates that incorporating an off-reward reference policy into RL improves convergence rates and enhances performance in various tasks. The off-reward functions that we used, based on MOP and the graph Laplacian, proved to be much more efficient at exploring, learning, and optimizing target rewards in environments that are somewhat more involved than those typically considered.

The experiments conducted in this project, including the Escape Room scenario and MuJoCo Benchmark Evaluation, demonstrate the success of our method.