Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components

Yamen Habib, Dmytro Grytskyy, Rubén Moreno-Bote

Abstract

Most reinforcement learning approaches optimize behavior to maximize a task-specific reward. However, this makes it difficult to learn transferable, token-like behaviors that can be reused and composed to solve arbitrary downstream tasks. We introduce an online unsupervised reinforcement learning framework that autonomously quantizes the agent’s action space into component policies via a joint entropy objective: maximizing the cumulative entropy of the overall mixture policy to ensure diverse, exploratory behavior under the maximum-occupancy principle, while minimizing the entropy of each component to enforce diversity and high specialization. Unlike existing approaches, our framework tackles action quantization into state-dependent component policies in a fully principled, unsupervised, online manner. We prove convergence in the tabular setting through a novel policy iteration algorithm, then extend to continuous control by fixing the discovered components and deploying them deterministically within an online optimizer to maximize cumulative reward. Empirical results demonstrate that our maxi-mix-mini-com entropy-based action-policy quantization provides interpretable, reusable, token-like behavioral patterns, yielding a fully online, task-agnostic, scalable architecture that requires no task-specific offline data and transfers readily across tasks.

Overview figure

Videos 1: Swimmer-v5 Agent Dynamics Using A Fixed Component

This section evaluates the behavior of the Swimmer-v5 agent when consistently utilizing a single, fixed component policy \(\pi_k\) from the learned mixture policy. By isolating and deploying one component at a time, we can observe the distinct behaviors and strategies that each component has specialized in during training. This analysis provides insights into how different components contribute to the overall performance and adaptability of the agent in navigating its environment.



\( a = \mu_k \)

The agent consistently uses the mode (most probable action) of a selected component policy \(\pi_k\) at each time step, resulting in deterministic behavior.

\( a \sim \pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\)

The agent samples actions from the component policy \(\pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\), where the diagonal elements of the covariance matrix \(\Sigma_k(s)\) are bounded above by \(0.1\).
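The two deployment modes above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the component's mean \(\mu_k(s)\) and the diagonal of \(\Sigma_k(s)\) are assumed to be given as arrays, and `sigma_cap` stands for the \(0.1\) upper bound on the covariance diagonal.

```python
import numpy as np

def act_with_fixed_component(mu_k, sigma_k, deterministic=True, sigma_cap=0.1):
    """Select an action from a single fixed Gaussian component pi_k.

    mu_k    : mean action mu_k(s) of the chosen component (1-D array)
    sigma_k : diagonal of the covariance Sigma_k(s) (1-D array of variances)
    """
    if deterministic:
        # Mode of a Gaussian N(mu, Sigma) is its mean: a = mu_k.
        return np.asarray(mu_k)
    # Cap the diagonal variances at sigma_cap, then sample a ~ N(mu_k, Sigma_k).
    var = np.minimum(np.asarray(sigma_k), sigma_cap)
    return np.random.normal(mu_k, np.sqrt(var))
```

In the deterministic mode the behavior is fully repeatable; in the sampling mode the variance cap keeps the component's actions tightly concentrated around its mean.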

Videos 2: Ant-v5 Agent Dynamics Using A Fixed Component

This section evaluates the behavior of the Ant-v5 agent when consistently utilizing a single, fixed component policy \(\pi_k\) from the learned mixture policy. By isolating and deploying one component at a time, we can observe the distinct behaviors and strategies that each component has specialized in during training. This analysis provides insights into how different components contribute to the overall performance and adaptability of the agent in navigating its environment.



\( a = \mu_k \)

The agent consistently uses the mode (most probable action) of a selected component policy \(\pi_k\) at each time step, resulting in deterministic behavior.

\( a \sim \pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\)

The agent samples actions from the component policy \(\pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\), where the diagonal elements of the covariance matrix \(\Sigma_k(s)\) are bounded above by \(0.1\).

Videos 3: Behavior under maxi-mix-mini-com

\(k \sim w(s), \quad a \sim \pi_k(\cdot|s)\)

The agent samples a component \(k\) from the state-dependent categorical distribution \(w(s)\) at each time step, then samples an action from the component policy \(\pi_k\).

\(k \sim \mathrm{Uniform}\{1, \dots, K\}, \quad a = \mu_k\)

The agent samples a component \(k\) uniformly at random from \(\{1, \dots, K\}\) at each time step, then uses the mode \(\mu_k\) of component policy \(\pi_k\) as the quantized action.
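Both sampling protocols above can be summarized in a short sketch. This is an illustrative reimplementation under assumed array shapes, not the paper's code: `weights`, `mus`, and `sigmas` stand for \(w(s)\), the component means \(\mu_k(s)\), and the diagonal covariances \(\Sigma_k(s)\), and component indices are 0-based.

```python
import numpy as np

def sample_mixture_action(weights, mus, sigmas, rng=None):
    """Hierarchical sampling: k ~ w(s), then a ~ N(mu_k(s), Sigma_k(s)).

    weights : state-dependent mixing weights w(s), shape (K,), sums to 1
    mus     : component means mu_k(s), shape (K, action_dim)
    sigmas  : diagonal covariances Sigma_k(s), shape (K, action_dim)
    """
    rng = rng or np.random.default_rng()
    k = rng.choice(len(weights), p=weights)      # k ~ w(s)
    a = rng.normal(mus[k], np.sqrt(sigmas[k]))   # a ~ pi_k(.|s)
    return k, a

def sample_quantized_action(mus, rng=None):
    """Uniform component choice with the mode mu_k as the quantized action."""
    rng = rng or np.random.default_rng()
    k = rng.integers(len(mus))                   # k ~ Uniform over K components
    return k, mus[k]                             # a = mu_k
```

The second function discards the within-component noise entirely, so the agent's action set collapses to the K component modes, i.e., a learned quantization of the continuous action space.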

Videos 4: Comparison between our method and using random set of actions

This section evaluates our method’s ability to control the complex dynamics of the MuJoCo Humanoid and Walker2d tasks, where randomly generated actions invariably fail.
For example, in the default Humanoid-v5 environment, episodes terminate if the torso’s z-coordinate (height) falls outside the healthy interval \([1.0, 2.0]\). We compare our approach against two random-action baselines: (1) discretized sampling, in which \(K\) actions are generated by drawing each action dimension independently from \(\{-m, 0, +m\}\), with magnitudes \(m \in \{0.05, 0.1, 0.2, 0.3, 0.4\}\) and \(K \in \{4, 8, 16, 32, 64, 128\}\); and (2) uniform sampling over the continuous action space. Across every tested magnitude and number of generated actions, neither baseline produces sustained or meaningful Humanoid behaviors under this termination criterion.
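The two baselines can be generated as follows. This is a minimal sketch under stated assumptions: `action_dim` and the \([-1, 1]\) action bounds are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

def discretized_random_actions(action_dim, K, magnitude, rng=None):
    """Baseline (1): K candidate actions, each dimension drawn
    i.i.d. from {-magnitude, 0, +magnitude}."""
    rng = rng or np.random.default_rng()
    return rng.choice(np.array([-magnitude, 0.0, magnitude]),
                      size=(K, action_dim))

def uniform_random_actions(action_dim, K, low=-1.0, high=1.0, rng=None):
    """Baseline (2): K candidate actions drawn uniformly over the
    continuous action box [low, high]^action_dim."""
    rng = rng or np.random.default_rng()
    return rng.uniform(low, high, size=(K, action_dim))
```

Either function yields a fixed candidate set that an online optimizer can select from at each step, mirroring how the learned component modes are used in our method.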

Maxi-Mix-Mini-Com

Random Actions

Maxi-Mix-Mini-Com

Random Actions

Videos 5: Comparison: DADS, METRA, and Our Method (Ant-v5, Fixed Component)

This section compares the behaviors discovered by the DADS and METRA skill discovery methods with our maxi-mix-mini-com approach in the Ant-v5 environment, using both top and side camera views. We show 16 components of our mixture against DADS and METRA models trained with 16-dimensional discrete latent skill variables.


DADS

METRA

Ours


DADS

METRA

Ours

Videos 6: FetchReach-v5 Agent Dynamics Using A Fixed Component

This section evaluates the behavior of the FetchReach-v5 agent when consistently utilizing a single, fixed component policy \(\pi_k\) from the learned mixture policy. By isolating and deploying one component at a time, we can observe the distinct behaviors and strategies that each component has specialized in during training. This analysis provides insights into how different components contribute to the overall performance and adaptability of the agent in navigating its environment.



\( a = \mu_k \)

The agent consistently uses the mode (most probable action) of a selected component policy \(\pi_k\) at each time step, resulting in deterministic behavior.

\( a \sim \pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\)

The agent samples actions from the component policy \(\pi_k = \mathcal{N}(\mu_k(s), \Sigma_k(s))\), where the diagonal elements of the covariance matrix \(\Sigma_k(s)\) are bounded above by \(0.1\).

Videos 7: Mapping from State to Components

This section visualizes the state-to-component mapping learned by the mixture policy in the Humanoid-v5 environment. It illustrates how the mixture policy combines high-entropy mixing weights with low-entropy components, each specialized in a distinct behavior.