Enhancing Exploration via Off-Reward Dynamic Reference Reinforcement Learning

Anonymous Author

Abstract

In reinforcement learning (RL), balancing exploration and exploitation is essential for maximizing a target reward function. Traditional methods often employ regularizers, such as penalties on deviations from a reference policy, to prevent the policy from becoming deterministic too early. This paper introduces a novel approach: an off-reward dynamic reference policy (ORDRP), trained with a different reward function alongside the target policy, to guide exploration.

We use Kullback–Leibler divergence as a regularization technique and train the ORDRP with either maximum occupancy principle or Laplacian intrinsic off-rewards.

We prove the convergence of the ORDRP iteration method and validate our theory within an actor-critic framework. Our experiments in challenging environments reveal that incorporating an ORDRP enhances exploration, resulting in superior performance and higher sample efficiency on benchmarks compared to state-of-the-art baselines. Our findings suggest that dynamically training a reference policy with an off-reward function alongside the main policy can significantly improve learning outcomes.

Intuition

We assume that the target policy \(\pi\) is improved at each iteration to maximize reward, subject to a Kullback–Leibler (KL) penalty for deviating too much from the current reference policy \(\mu\).

When using a static uniform reference policy (Fig. A, bottom panel), the KL penalty tends to greatly reduce the likelihood of an action that has been discovered to be good by the target policy (peak of the Gaussian), while it increases the likelihood of harmful or useless actions due to the uninformative regularization term (left tail of the Gaussian).

When using the current target policy as the reference (Fig. B, bottom), there are only minimal differences between the new target policy and the reference policy, so the KL penalty does little to enhance exploration beyond what has already been learned (Fig. B, top).

In contrast, a dynamic reference policy that lies between purely uninformative and purely target-aligned references (Fig. C, bottom) can inject relevant information about historically useful actions while remaining flexible (Fig. C, top).

This suggests that for the reference policy to be effective, it should follow a different reward function than the target one (i.e., being off-reward), such that it retains knowledge about generally good and safe actions while dynamically adapting (i.e., being dynamic) as new regions of action-state space are explored during learning.
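Concretely, assuming the standard KL-regularized improvement step (our notation, consistent with the soft backups introduced below), the target policy update takes the form \[ \pi^{n+1}(\cdot|s) = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q^{n}(s, a) \right] - \alpha \, D_{\mathrm{KL}}\left( \pi(\cdot|s) \,\|\, \mu^{n}(\cdot|s) \right) \;, \] whose closed-form solution \(\pi^{n+1}(a|s) \propto \mu^{n}(a|s) \exp\left( Q^{n}(s, a)/\alpha \right)\) makes explicit how the choice of reference \(\mu^{n}\) reshapes which actions the new target policy explores.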

Off-Reward Dynamic Reference Policy (ORDRP)

In our approach, the dynamic reference policy \(\mu\) has its own distinct off-reward function defined on the same Markov decision process (MDP). We assume that the sequence of reference policies \(\{\mu^n\}_{n=0}^{\infty}\) obtained during training converges to an optimal policy \(\mu^{*}\) defined as \(\mu^{*} = \arg \max_{\mu} \sum_{t=0}^{\infty} \gamma^t \mathbb{E}_{(s_{t},a_{t})\sim p_{\pi}} [f(\mathcal{M})]\), where \( f(\mathcal{M}) \) is a function defined over the MDP \(\mathcal{M}\).

Off-Reward Dynamic Reference Value Iteration

\begin{equation}\label{eq:state_value_function_sequence_def} \tilde{V}^{n+1}(s_{t}) = (B_{\mu^n} \tilde{V}^n)(s_t) = \alpha \log \sum_{a_{t} \in A} \mu^{n}(a_{t}|s_{t}) \exp \left( \frac{1}{\alpha} (r(s_{t}, a_{t}) + \gamma \mathbb{E}_{s_{t+1} \sim p} [\tilde{V}^{n}(s_{t+1})]) \right) \, . \end{equation}
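As a minimal tabular sketch of this backup (assuming discrete states and actions, a known transition tensor, and an externally supplied schedule of reference policies; all names below are illustrative rather than part of our implementation):

import numpy as np
from scipy.special import logsumexp

def ordrp_value_iteration(P, R, ref_policies, alpha=1.0, gamma=0.99, n_iters=200):
    """Tabular sketch of the off-reward dynamic reference soft backup above.

    P            -- transition tensor, shape (S, A, S): P[s, a, s2] = p(s2 | s, a)
    R            -- reward matrix, shape (S, A):        R[s, a]    = r(s, a)
    ref_policies -- callable n -> array of shape (S, A) giving mu^n(a | s)
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for n in range(n_iters):
        mu = ref_policies(n)               # current reference policy mu^n
        Q = R + gamma * P @ V              # r(s, a) + gamma * E_{s'}[V^n(s')]
        # V^{n+1}(s) = alpha * log sum_a mu^n(a|s) exp(Q(s, a) / alpha)
        V = alpha * logsumexp(Q / alpha, b=mu, axis=1)
    return V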

Off-Reward Dynamic Reference Policy Iteration

\begin{equation} \pi_0 \xrightarrow[\mu_1]{\text{E}} Q_{\pi_0/\mu_1} \xrightarrow[\mu_1]{\text{I}} \pi_1 \xrightarrow[\mu_2]{\text{E}} Q_{\pi_1/\mu_2} \xrightarrow[\mu_2]{\text{I}} \pi_2 \xrightarrow[\mu_3]{\text{E}} \dots \xrightarrow[\mu_\infty]{\text{I}} \pi_\infty \xrightarrow[\mu_\infty]{\text{E}} Q_{\pi_\infty/\mu_\infty}. \end{equation}
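The alternation above can be read as a loop in which the reference is advanced first and the evaluation (E) and improvement (I) steps are then performed against it; a schematic sketch, where the three callables are placeholders for whatever evaluation, improvement, and off-reward reference training procedures are used:

def ordrp_policy_iteration(evaluate, improve, update_reference, pi0, mu0, n_iters=100):
    """Schematic E/I loop with a dynamically updated reference (placeholder callables).

    evaluate(pi, mu)      -> Q_{pi/mu}, soft evaluation of pi regularized toward mu
    improve(Q, mu)        -> improved target policy, KL-regularized toward mu
    update_reference(mu)  -> next reference policy, trained on its own off-reward
    """
    pi, mu = pi0, mu0
    for _ in range(n_iters):
        mu = update_reference(mu)   # mu^{n+1}: the reference evolves on its off-reward
        Q = evaluate(pi, mu)        # E-step:   Q_{pi^n / mu^{n+1}}
        pi = improve(Q, mu)         # I-step:   pi^{n+1}
    return pi, mu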

MOP reference

The Maximum Occupancy Principle (MOP) is a novel approach to modeling an agent's behavior that diverges from traditional reward-maximization frameworks. The goal of MOP is to maximize the occupancy of future action-state paths rather than to seek extrinsic rewards. This principle posits that agents are intrinsically motivated to explore and visit rare or unoccupied action-states, thus ensuring a broad and diverse range of behaviors over time.

The off-reward function in MOP is the entropy of the paths taken by the agent, \[ R(\tau) = -\sum_{t=0}^{\infty} \gamma^t \ln \left( \mu_{\text{MOP}}^{\alpha}(a_t | s_t) p^{\beta}(s_{t+1} | s_t, a_t) \right) \;, \] where \(\alpha > 0\) and \(\beta \geq 0\) are weights for actions and states, respectively, and \(\gamma\) is the discount factor.

The agent maximizes this intrinsic reward by preferring low-probability actions and transitions, which encourages exploration and the occupancy of a wide range of action-states. This intrinsic motivation leads to behaviors that appear goal-directed and complex without the necessity of explicitly defined extrinsic rewards.
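For concreteness, the per-step off-reward implied by this path entropy is \(-\alpha \ln \mu_{\text{MOP}}(a_t|s_t) - \beta \ln p(s_{t+1}|s_t, a_t)\); a minimal sketch follows (argument names are illustrative, and in model-free settings the transition term can be dropped by setting \(\beta = 0\), which the definition \(\beta \geq 0\) allows):

import numpy as np

def mop_off_reward(log_mu_action, log_p_transition=0.0, alpha=1.0, beta=0.0):
    """Per-step MOP intrinsic off-reward (sketch; names are illustrative).

    log_mu_action    -- log mu_MOP(a_t | s_t) of the action actually taken
    log_p_transition -- log p(s_{t+1} | s_t, a_t), if a transition model is available
    """
    # -ln( mu^alpha * p^beta ) = -alpha * log mu - beta * log p
    return -alpha * np.asarray(log_mu_action) - beta * np.asarray(log_p_transition)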

Lap reference

We can think of learning a low-dimensional state representation as building a mapping between the original space and a low-dimensional representation such that points that are near (or distant) in the original space remain near (or distant) in the representation. This problem is referred to as the graph drawing problem. In the approach taken here, which is valid for large and continuous spaces, we build an embedding \(\phi(s) = (f_1(s), \ldots, f_d(s)) \in \mathbb{R}^d\) that faithfully represents the input space \(s \in S\).

Once the embedding \(\phi\) has been learned, the off-reward dynamic reference policy maximizes the expected cumulative reward \(\sum_{t=0}^{\infty} \gamma^t R(s_{t}, a_{t})\), where the instantaneous off-reward \(R(s_{t}, a_{t})\) is defined as \[ R(s_{t}, a_{t}) = \frac{||\phi(s_{t+1})-\phi(s_{t})||_{2}^2}{||\phi(s_{t+1})||_{2}^2 + ||\phi(s_{t})||_{2}^2} \;. \] This novel off-reward function consistently encourages the agent to transition to states that lie far from the current one in the embedded space, which improves exploration. It also prevents the agent from wasting time on inefficient transitions that barely change the embedded state, leading to a more sample-efficient algorithm.
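A direct transcription of this off-reward, given embedding vectors produced by a learned \(\phi\) (the small epsilon guarding against a zero denominator is our addition, not part of the definition above):

import numpy as np

def lap_off_reward(phi_s, phi_next, eps=1e-8):
    """Instantaneous Lap off-reward computed from embedded states (sketch).

    phi_s, phi_next -- embedding vectors phi(s_t) and phi(s_{t+1}) in R^d
    eps             -- small constant to avoid division by zero (our addition)
    """
    phi_s, phi_next = np.asarray(phi_s), np.asarray(phi_next)
    num = np.sum((phi_next - phi_s) ** 2)              # ||phi(s_{t+1}) - phi(s_t)||_2^2
    den = np.sum(phi_next ** 2) + np.sum(phi_s ** 2) + eps
    return num / den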

Solving Escape Room

Soft Actor Critic

Incorporates entropy regularization into its objective function to balance exploration and exploitation.

MOP Reference

Maximizes the occupancy of action-state space by maximizing cumulative action entropy.

Lap Reference

Leverages spectral information from the Laplacian of the transition matrix to guide exploration by encouraging the agent to visit spectrally distinct states.

MuJoCo Benchmark

Environment (steps)    SAC    PPO    TD3    DDPG   MOP reference   Lap reference
Ant-v4 (500k)          5754   1431   3857   -521   6166            6016
Ant-v4 (1M)            6427   1929   4378    413   6647            6695
Hum-v4 (500k)          4761    546     93    130   4972            5294
Hum-v4 (1M)            5316    661     93    127   5075            5489
Walker2d-v4 (500k)     3347   2781   3587    396   3656            3919
Walker2d-v4 (1M)       4297   3444   4172   1449   4173            4558
Hopper-v4 (500k)       2292    961   3204   1450   2209            1691
Hopper-v4 (1M)         2766    935   3294   1721   2784            2394
Cheetah-v4 (500k)      8048   4840   8777   9968   8348            9098
Cheetah-v4 (1M)        9378   5463  10304  11607   9702           10516
Mean                   5238   2299   4175   2674   5373            5567

Conclusion

The findings from our experiments highlight that even state-of-the-art methods often struggle with directionless exploration or overcommitment to suboptimal policies. Our ORDRP approach addresses these challenges by providing a reference policy that evolves in response to the learning environment. This adaptive nature allows for more structured exploration, reducing the risk of inefficient learning trajectories.

Our study explored a novel approach to RL by introducing an off-reward dynamic reference policy to guide the main policy through complex learning environments. This reference policy evolves alongside the target policy, allowing for more targeted exploration and reducing the inefficiencies of traditional exploration methods. Our approach demonstrates that incorporating an off-reward reference policy into RL improves convergence rates and enhances performance in various tasks. The off-reward functions we used, based on MOP and the graph Laplacian, proved substantially more efficient at exploring, learning, and optimizing target rewards in environments somewhat more involved than those typically considered.

The experiments conducted in this work, including the Escape Room scenario and the MuJoCo benchmark evaluation, demonstrate the effectiveness of our method.