Most reinforcement learning approaches optimize behavior to maximize a task-specific reward. From such rewards, however, it is difficult to learn transferable, token-like behaviors that can be reused and composed to solve arbitrary downstream tasks. We introduce an online unsupervised reinforcement learning framework that autonomously quantizes the agent's action space into component policies via a joint entropy objective: maximizing the cumulative entropy of the overall mixture policy to ensure diverse, exploratory behavior under the maximum-occupancy principle, while minimizing the entropy of each component to enforce diverse, highly specialized components. Unlike existing approaches, our framework tackles action quantization into state-dependent component policies in a fully principled, unsupervised, online manner. We prove convergence in the tabular setting through a novel policy iteration algorithm, then extend to continuous control by fixing the discovered components and deploying them deterministically within an online optimizer to maximize cumulative reward. Empirical results demonstrate that our maxi-mix-mini-com entropy-based action-policy quantization provides interpretable, reusable, token-like behavioral patterns, yielding a fully online, task-agnostic, scalable architecture that requires no task-specific offline data and transfers readily across tasks.
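As a toy illustration of the joint entropy objective (a minimal single-state sketch, not the paper's cumulative estimator): for a mixture of diagonal Gaussians, each component's entropy is closed-form, while the mixture entropy can be estimated by Monte Carlo. The trade-off coefficient `beta` is a hypothetical knob introduced here for illustration.

```python
import numpy as np

def component_entropy(stds):
    # Closed-form differential entropy of a diagonal Gaussian:
    # H = 0.5 * sum_d log(2*pi*e*sigma_d^2)
    return 0.5 * np.sum(np.log(2 * np.pi * np.e * stds**2), axis=-1)

def mixture_entropy_mc(weights, means, stds, n_samples=10000, rng=None):
    # Monte Carlo estimate of the mixture entropy: sample k ~ w,
    # then a ~ N(mu_k, diag(sigma_k^2)), and average -log p_mix(a).
    rng = np.random.default_rng(0) if rng is None else rng
    K, D = means.shape
    ks = rng.choice(K, size=n_samples, p=weights)
    a = means[ks] + stds[ks] * rng.standard_normal((n_samples, D))
    diff = (a[:, None, :] - means[None]) / stds[None]
    log_comp = -0.5 * np.sum(diff**2 + np.log(2 * np.pi * stds[None]**2), axis=-1)
    log_mix = np.log(np.sum(weights[None] * np.exp(log_comp), axis=-1))
    return -np.mean(log_mix)

def maxi_mix_mini_com(weights, means, stds, beta=1.0):
    # Maximize mixture entropy, penalize (weighted) component entropies.
    return (mixture_entropy_mc(weights, means, stds)
            - beta * np.sum(weights * component_entropy(stds)))

# Two well-separated, low-variance components: high mixture entropy,
# low component entropy, so the objective approaches log(2).
w = np.array([0.5, 0.5])
mu = np.array([[-1.0], [1.0]])
sd = np.array([[0.1], [0.1]])
print(round(maxi_mix_mini_com(w, mu, sd), 3))
```

With well-separated, narrow components the mixture-entropy term gains roughly \(\log K\) over the average component entropy, which is exactly the regime the objective rewards.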
This section evaluates the behavior of the Swimmer-v5 agent when consistently utilizing a single, fixed component policy \(\pi_k\) from the learned mixture policy. By isolating and deploying one component at a time, we can observe the distinct behaviors and strategies that each component has specialized in during training. This analysis provides insights into how different components contribute to the overall performance and adaptability of the agent in navigating its environment.
The agent consistently uses the mode (most probable action) of a selected component policy \(\pi_k\) at each time step, resulting in deterministic behavior.
The agent samples actions from the component policy \(\pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\), where the diagonal elements of the covariance matrix \(\Sigma_k(s)\) are capped at \(0.1\).
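The two deployment modes above can be sketched as follows; `mu_k` and `sigma_k` are hypothetical stand-ins for the state-dependent mean and standard-deviation networks, and a variance cap of \(0.1\) corresponds to a standard-deviation cap of \(\sqrt{0.1}\).

```python
import numpy as np

STD_CAP = np.sqrt(0.1)  # diagonal covariance entries capped at 0.1

def mode_action(mu_k, s):
    # Deterministic deployment: the mode of a Gaussian is its mean.
    return mu_k(s)

def sampled_action(mu_k, sigma_k, s, rng):
    # Stochastic deployment: sample from N(mu_k(s), Sigma_k(s)) with the
    # diagonal of Sigma_k(s) clipped from above at 0.1.
    std = np.minimum(sigma_k(s), STD_CAP)
    return mu_k(s) + std * rng.standard_normal(mu_k(s).shape)

# Toy stand-ins: a tanh mean network and an (over-large) constant std
# that gets clipped down to sqrt(0.1) at sampling time.
rng = np.random.default_rng(0)
mu = lambda s: np.tanh(s)
sig = lambda s: np.full_like(s, 0.5)
s = np.zeros(4)
print(mode_action(mu, s))
print(sampled_action(mu, sig, s, rng))
```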
This section evaluates the behavior of the Ant-v5 agent when consistently utilizing a single, fixed component policy \(\pi_k\) from the learned mixture policy. By isolating and deploying one component at a time, we can observe the distinct behaviors and strategies that each component has specialized in during training. This analysis provides insights into how different components contribute to the overall performance and adaptability of the agent in navigating its environment.
The agent consistently uses the mode (most probable action) of a selected component policy \(\pi_k\) at each time step, resulting in deterministic behavior.
The agent samples actions from the component policy \(\pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\), where the diagonal elements of the covariance matrix \(\Sigma_k(s)\) are capped at \(0.1\).
The agent samples a component \(k\) from the state-dependent categorical distribution \(w(s)\) at each time step, then samples an action from the component policy \(\pi_k\).
The agent samples a component \(k\) uniformly at random from \(\{1, \dots, K\}\) at each time step, then uses the mode \(\mu_k\) of the component policy \(\pi_k\) as the quantized action.
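These two mixture-level deployment schemes can be sketched as follows; `w`, `mus`, and `stds` are hypothetical callables standing in for the learned weight and component networks.

```python
import numpy as np

def act_weighted(w, mus, stds, s, rng):
    # k ~ Categorical(w(s)), then a ~ N(mu_k(s), diag(sigma_k(s)^2)).
    probs = w(s)
    k = rng.choice(len(probs), p=probs)
    return mus[k](s) + stds[k](s) * rng.standard_normal(mus[k](s).shape)

def act_uniform_mode(mus, s, rng):
    # k ~ Uniform over components, then the mode mu_k(s) as the action.
    k = rng.integers(len(mus))
    return mus[k](s)

# Toy stand-ins: two 1-D components centered at -1 and +1.
rng = np.random.default_rng(1)
mus = [lambda s: np.array([-1.0]), lambda s: np.array([1.0])]
stds = [lambda s: np.array([0.05]), lambda s: np.array([0.05])]
w = lambda s: np.array([0.5, 0.5])
print(act_weighted(w, mus, stds, None, rng))
print(act_uniform_mode(mus, None, rng))
```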
This section evaluates our method’s ability to control the complex dynamics of the MuJoCo Humanoid and Walker2d tasks, where randomly generated actions invariably fail.
For example, in the default Humanoid-v5 environment, episodes terminate if the torso’s z‑coordinate (height) falls outside the healthy interval \((1.0,\ 2.0)\).
We compare our approach against two random-action baselines: (1) discretized sampling, in which \(K\) actions are generated by drawing each action dimension independently from \(\{-\text{magnitude},\, 0,\, +\text{magnitude}\}\), with magnitudes of 0.05, 0.1, 0.2, 0.3, and 0.4 and \(K \in \{4, 8, 16, 32, 64, 128\}\); and (2) uniform sampling over the continuous action space. Across every tested magnitude and number of generated actions, neither baseline produces sustained or meaningful Humanoid behaviors under this termination criterion.
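A minimal sketch of the two baseline generators and the height-based termination check; the action-space bounds and dimensionality used below are illustrative placeholders, not values taken from the environments.

```python
import numpy as np

def discretized_actions(action_dim, magnitude, k, rng):
    # Draw K candidate actions, each dimension independently from
    # {-magnitude, 0, +magnitude}.
    return rng.choice([-magnitude, 0.0, magnitude], size=(k, action_dim))

def uniform_actions(action_dim, k, rng, low=-1.0, high=1.0):
    # Uniform sampling over a continuous box action space.
    return rng.uniform(low, high, size=(k, action_dim))

def humanoid_healthy(torso_z, z_range=(1.0, 2.0)):
    # Humanoid-v5 default termination: unhealthy when the torso height
    # leaves the interval (1.0, 2.0).
    return z_range[0] < torso_z < z_range[1]

rng = np.random.default_rng(0)
acts = discretized_actions(action_dim=17, magnitude=0.2, k=16, rng=rng)
print(acts.shape)
print(humanoid_healthy(1.4), humanoid_healthy(0.9))
```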
This section compares the behaviors discovered by the DADS and METRA skill discovery methods and our maxi-mix-mini-com approach in the Ant-v5 environment, using both top and side camera views. We show 16 components of our mixture against DADS and METRA models trained with 16-dimensional discrete latent skill variables.
This section evaluates the behavior of the FetchReach-v5 agent when consistently utilizing a single, fixed component policy \(\pi_k\) from the learned mixture policy. By isolating and deploying one component at a time, we can observe the distinct behaviors and strategies that each component has specialized in during training. This analysis provides insights into how different components contribute to the overall performance and adaptability of the agent in navigating its environment.
The agent consistently uses the mode (most probable action) of a selected component policy \(\pi_k\) at each time step, resulting in deterministic behavior.
The agent samples actions from the component policy \(\pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\), where the diagonal elements of the covariance matrix \(\Sigma_k(s)\) are limited from above by \(0.1\).
This section visualizes the state-to-component mapping learned by the mixture policy in the Humanoid-v5 environment. We can see how our mixture policy combines high-entropy mixing weights with low-entropy components, each specialized in a different behavior.
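The visualized mapping can be computed, under a toy stand-in for the learned weight network `w`, by recording the most probable component at each state, alongside the entropy of the mixing weights:

```python
import numpy as np

def component_map(w, states):
    # For each state, the most probable component under the
    # state-dependent mixing weights w(s).
    return np.array([int(np.argmax(w(s))) for s in states])

def weight_entropy(probs):
    # Entropy of the categorical mixing weights; high values indicate
    # that many components remain in play at that state.
    probs = np.asarray(probs)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

# Toy stand-in: states with negative sign favor component 0, positive
# sign favor component 1.
w = lambda s: np.array([0.9, 0.1]) if s < 0 else np.array([0.1, 0.9])
states = [-1.0, -0.5, 0.5, 1.0]
print(component_map(w, states))
print(round(weight_entropy([0.5, 0.5]), 3))
```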