In reinforcement learning (RL), balancing exploration and exploitation is essential for maximizing a target reward function. Traditional methods often employ regularizers to prevent the policy from becoming deterministic too early, such as penalizing deviations from a reference policy. This paper introduces a novel approach by training an off-reward dynamic reference policy (ORDRP) with a different reward function alongside the target policy to guide exploration.
We use the Kullback–Leibler divergence as the regularizer and train the ORDRP with either maximum-occupancy or Laplacian intrinsic off-rewards.
We prove the convergence of the ORDRP iteration method and validate our theory within an actor-critic framework. Our experiments in challenging environments reveal that incorporating an ORDRP enhances exploration, resulting in superior performance and higher sampling efficiency on benchmarks compared to state-of-the-art baselines. Our findings suggest that dynamically training a reference policy with an off-reward function alongside the main policy can significantly improve learning outcomes.
We assume that the target policy \(\pi\) is improved at each iteration to maximize reward, subject to a Kullback-Leibler (KL) penalty for deviating too far from the current reference policy \(\mu\).
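For concreteness, a minimal worked form of this improvement step, assuming the standard KL-regularized formulation (the notation here is illustrative rather than quoted from the original), is
\[
\pi^{n+1}(\cdot|s_t) \;=\; \arg\max_{\pi} \; \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}\!\left[ Q^{\pi^{n}}(s_t, a_t) \right] \;-\; \alpha \, D_{\mathrm{KL}}\!\left( \pi(\cdot|s_t) \,\|\, \mu(\cdot|s_t) \right),
\]
whose optimal soft state value takes the log-sum-exp form used in the backup introduced later.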
When using a static uniform reference policy (Fig. A, bottom panel), the KL penalty tends to greatly reduce the likelihood of an action that has been discovered to be good by the target policy (peak of the Gaussian), while it increases the likelihood of harmful or useless actions due to the uninformative regularization term (left tail of the Gaussian).
When using the current target policy as the reference (Fig. B, bottom), the new target policy differs only minimally from the reference, which does not enhance exploration beyond what has already been learned (Fig. B, top).
In contrast, a dynamic reference policy that lies between purely uninformative and purely target-aligned references (Fig. C, bottom) can inject relevant information about historically useful actions while remaining flexible (Fig. C, top).
This suggests that for the reference policy to be effective, it should follow a different reward function than the target one (i.e., be off-reward), such that it retains knowledge about generally good and safe actions while dynamically adapting (i.e., be dynamic) as new regions of action-state space are explored during learning.

In our approach, the dynamic reference policy \(\mu\) has its own distinct off-reward function defined on the same MDP. We assume that the sequence of trained reference policies \(\{\mu^n\}_{n=0}^{\infty}\) converges to an optimal policy \(\mu^{*}\) defined as \(\mu^{*} = \arg \max_{\mu} \sum_{t=0}^{\infty} \gamma^t \mathbb{E}_{(s_{t},a_{t})\sim p_{\pi}} [f(\mathcal{M})]\). Here, \( f(\mathcal{M}) \) represents a function defined over the Markov Decision Process (MDP).
\begin{equation}\label{eq:state_value_function_sequence_def} \tilde{V}^{n+1}(s_{t}) = (B_{\mu^n} \tilde{V}^n)(s_t) = \alpha \log \sum_{a_{t} \in A} \mu^{n}(a_{t}|s_{t}) \exp \left( \frac{1}{\alpha} (r(s_{t}, a_{t}) + \gamma \mathbb{E}_{s_{t+1} \sim p} [\tilde{V}^{n}(s_{t+1})]) \right) \, . \end{equation}
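As a sanity check, the backup operator \(B_{\mu^n}\) above can be implemented directly in a small tabular setting. The sketch below is illustrative only: the toy MDP, the function name, and the hyperparameters are assumptions, not the paper's implementation.

```python
import numpy as np

def soft_backup(V, mu, r, P, alpha=0.1, gamma=0.99):
    """One application of B_mu to a tabular state-value estimate.

    V  : (S,)      current value estimate (V^n in the equation above)
    mu : (S, A)    reference policy mu^n(a|s)
    r  : (S, A)    reward r(s, a)
    P  : (S, A, S) transition probabilities p(s'|s, a)
    """
    # Soft state-action value: r(s, a) + gamma * E_{s'~p}[V(s')]
    Q = r + gamma * P @ V                        # shape (S, A)
    # alpha * log sum_a mu(a|s) * exp(Q(s, a) / alpha), computed stably
    z = Q / alpha + np.log(mu + 1e-12)
    m = z.max(axis=1, keepdims=True)
    return alpha * (m.squeeze(1) + np.log(np.exp(z - m).sum(axis=1)))

# Toy usage: a random 4-state, 2-action MDP with a uniform reference policy.
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
mu = np.full((S, A), 1.0 / A)
V = np.zeros(S)
for _ in range(200):                             # iterate toward the fixed point
    V = soft_backup(V, mu, r, P)
```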
\begin{equation} \pi_0 \xrightarrow[\mu_1]{\text{E}} Q_{\pi_0/\mu_1} \xrightarrow[\mu_1]{\text{I}} \pi_1 \xrightarrow[\mu_2]{\text{E}} Q_{\pi_1/\mu_2} \xrightarrow[\mu_2]{\text{I}} \pi_2 \xrightarrow[\mu_3]{\text{E}} \dots \xrightarrow[\mu_\infty]{\text{I}} \pi_\infty \xrightarrow[\mu_\infty]{\text{E}} Q_{\pi_\infty/\mu_\infty}. \end{equation}
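A compact tabular sketch of this alternating scheme follows, assuming KL-regularized evaluation (E) and Boltzmann-over-\(\mu\) improvement (I); the reference update shown is a simple placeholder that keeps \(\mu\) near uniform, standing in for the MOP or Laplacian off-reward training described next.

```python
import numpy as np

# Illustrative toy MDP; sizes, temperatures, and the reference update are assumptions.
rng = np.random.default_rng(1)
S, A, alpha, gamma = 5, 3, 0.1, 0.95
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
pi = np.full((S, A), 1.0 / A)      # target policy
mu = np.full((S, A), 1.0 / A)      # reference policy

def evaluate(pi, mu, n_sweeps=200):
    """E-step: KL-regularized evaluation of pi against the reference mu."""
    Q = np.zeros((S, A))
    for _ in range(n_sweeps):
        # V(s) = E_{a~pi}[Q(s, a)] - alpha * KL(pi(.|s) || mu(.|s))
        V = (pi * (Q - alpha * (np.log(pi) - np.log(mu)))).sum(axis=1)
        Q = r + gamma * P @ V
    return Q

def improve(Q, mu):
    """I-step: pi(a|s) proportional to mu(a|s) * exp(Q(s, a) / alpha)."""
    logits = Q / alpha + np.log(mu)
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

for n in range(20):
    mu = 0.9 * mu + 0.1 / A        # placeholder reference update (not the paper's)
    Q = evaluate(pi, mu)
    pi = improve(Q, mu)
```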
The Maximum Occupancy Principle (MOP) is a novel approach to modeling an agent's behavior that diverges from traditional reward-maximization frameworks. The goal of MOP is to maximize the occupancy of future action-state paths rather than to seek extrinsic rewards. This principle posits that agents are intrinsically motivated to explore and visit rare or unoccupied action-states, thus ensuring a broad and diverse range of behaviors over time.
The off-reward function in MOP is the entropy of the paths taken by the agent, \[ R(\tau) = -\sum_{t=0}^{\infty} \gamma^t \ln \left( \mu_{\text{MOP}}^{\alpha}(a_t | s_t) p^{\beta}(s_{t+1} | s_t, a_t) \right) \;, \] where \(\alpha > 0\) and \(\beta \geq 0\) are weights for actions and states, respectively, and \(\gamma\) is the discount factor.
The agent maximizes this intrinsic reward by preferring low-probability actions and transitions, which encourages exploration and the occupancy of a wide range of action-states. This intrinsic motivation leads to behaviors that appear goal-directed and complex without the necessity of explicitly defined extrinsic rewards.
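As an illustration of the per-step quantity inside the sum above, the following minimal sketch computes the MOP off-reward for a single observed transition (the function name and the example probabilities are assumptions):

```python
import numpy as np

def mop_off_reward(log_mu_a, log_p_next, alpha=1.0, beta=0.1):
    """Per-step MOP off-reward: -(alpha * ln mu(a_t|s_t) + beta * ln p(s_{t+1}|s_t, a_t)).

    log_mu_a   : log-probability of the action taken under the MOP reference policy
    log_p_next : log-probability of the observed state transition
    """
    return -(alpha * log_mu_a + beta * log_p_next)

# Example: an unlikely action and transition yield a large intrinsic bonus.
print(mop_off_reward(np.log(0.2), np.log(0.5)))   # ~1.68
```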
Soft Actor Critic (SAC): Incorporates entropy regularization into its objective function to balance exploration and exploitation.
Maximum Occupancy Principle (MOP) reference: Maximizes the occupancy of action-state space by maximizing cumulative action entropy.
Laplacian (Lap) reference: Leverages spectral information from the Laplacian of the transition matrix to guide exploration by encouraging the agent to visit spectrally distinct states (see the sketch below).
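A hedged sketch of the spectral machinery this relies on: the code below builds a graph Laplacian from a state-transition matrix and extracts its low-order eigenvectors as state features, one common way to identify spectrally distinct states (the exact Laplacian off-reward used to train the reference is not reproduced here).

```python
import numpy as np

def laplacian_embedding(P_pi, k=3):
    """Spectral state features from an (S, S) state-to-state transition matrix.

    Returns the k smallest non-trivial eigenvectors of the graph Laplacian;
    states far apart in this embedding are 'spectrally distinct'.
    """
    W = 0.5 * (P_pi + P_pi.T)                  # symmetrize into an affinity graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # combinatorial graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                 # drop the constant eigenvector

# Toy usage on a random 6-state transition matrix (illustrative only).
rng = np.random.default_rng(0)
P = rng.random((6, 6)); P /= P.sum(axis=1, keepdims=True)
phi = laplacian_embedding(P)                   # (6, 3) spectral state features
```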
| Environment (steps) | SAC | PPO | TD3 | DDPG | MOP reference | Lap reference |
|---|---|---|---|---|---|---|
| Ant-v4 (500k) | 5754 | 1431 | 3857 | -521 | 6166 | 6016 |
| Ant-v4 (1M) | 6427 | 1929 | 4378 | 413 | 6647 | 6695 |
| Humanoid-v4 (500k) | 4761 | 546 | 93 | 130 | 4972 | 5294 |
| Humanoid-v4 (1M) | 5316 | 661 | 93 | 127 | 5075 | 5489 |
| Walker2d-v4 (500k) | 3347 | 2781 | 3587 | 396 | 3656 | 3919 |
| Walker2d-v4 (1M) | 4297 | 3444 | 4172 | 1449 | 4173 | 4558 |
| Hopper-v4 (500k) | 2292 | 961 | 3204 | 1450 | 2209 | 1691 |
| Hopper-v4 (1M) | 2766 | 935 | 3294 | 1721 | 2784 | 2394 |
| HalfCheetah-v4 (500k) | 8048 | 4840 | 8777 | 9968 | 8348 | 9098 |
| HalfCheetah-v4 (1M) | 9378 | 5463 | 10304 | 11607 | 9702 | 10516 |
| Mean | 5238 | 2299 | 4175 | 2674 | 5373 | 5567 |
The findings from our experiments highlight that even state-of-the-art methods often struggle with directionless exploration or overcommitment to suboptimal policies. Our ORDRP approach addresses these challenges by providing a reference policy that evolves in response to the learning environment. This adaptive nature allows for more structured exploration, reducing the risk of inefficient learning trajectories.
Our study explored a novel approach to RL by introducing an off-reward dynamic reference policy to guide the main policy through complex learning environments. This reference policy evolves alongside the target policy, allowing for more targeted exploration and reducing the inefficiencies of traditional exploration methods. Our approach demonstrates that incorporating an off-reward reference policy into RL improves convergence rates and enhances performance across a variety of tasks. The off-reward functions we used, based on MOP and the graph Laplacian, proved substantially more efficient at exploring, learning, and optimizing target rewards in somewhat more involved environments than those typically considered.
The experiments conducted in this work, including the Escape Room scenario and the MuJoCo benchmark evaluation, demonstrate the effectiveness of our method.