Unsupervised Action-Policy Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components

Anonymous Author

Abstract

We introduce an online unsupervised reinforcement learning (RL) framework that autonomously quantizes the agent’s action space into component policies via a joint entropy objective: maximizing the cumulative entropy of the overall mixture policy to ensure diverse, exploratory behavior under the maximum-occupancy principle, while minimizing the entropy of each component to enforce high specialization and mutual distinctiveness among components. Unlike existing approaches, our framework tackles action quantization into component policies in a fully principled, unsupervised, online manner. We prove convergence in the tabular setting through a novel policy-iteration algorithm, then extend to continuous control by fixing the discovered components and deploying them deterministically within an online optimizer that maximizes cumulative reward. Empirical results demonstrate that our {\em maxi-mix-mini-com} entropy-based action-policy quantization provides interpretable, reusable, token-like behavioral patterns, yielding a fully online, task-agnostic, scalable architecture that requires no task-specific offline data and transfers readily across tasks.
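As a concrete illustration of the objective described above, consider the following minimal sketch; the notation here (mixture weights \(w\), components \(\pi_k\), number of components \(K\), trade-off coefficient \(\beta\), discount \(\gamma\)) is our own illustrative choice and is not taken from the paper's formal development. Writing the mixture policy as \(\pi(a \mid s) = \sum_{k=1}^{K} w(k \mid s)\,\pi_k(a \mid s)\), one plausible form of the maxi-mix-mini-com objective is
\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} \Big( \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \;-\; \beta \sum_{k=1}^{K} w(k \mid s_t)\, \mathcal{H}\big(\pi_k(\cdot \mid s_t)\big) \Big)\right],
\]
where \(\mathcal{H}\) denotes Shannon entropy: the first term rewards cumulative entropy of the mixture, and the second penalizes the entropy of the individual components.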

Behavior under maxi-mix-mini-com

[Figure: rollouts with components sampled stochastically, \(k \sim w(s)\), versus deterministically, \(k \sim \mathrm{Uniform}\{0, \ldots, K\}\) with \(\sigma_k = 0\).]

[Figure: Ant-v5, Swimmer-v5, and Fetch Robot agent dynamics using a fixed component, shown for \(\sigma = 0\) (deterministic) and \(\sigma = 0.1\).]

Comparison between our method and random action sets

This section evaluates our method’s ability to control the complex dynamics of the MuJoCo Humanoid task, where randomly generated actions invariably fail. In the default environment, an episode terminates if the torso’s z-coordinate (height) falls outside the healthy interval [1.0, 2.0]. We compare our approach against two random-action baselines: (1) discretized sampling, in which K actions are generated by drawing each action dimension independently from {-magnitude, 0, +magnitude}, with magnitudes of 0.05, 0.1, 0.2, 0.3, and 0.4 and K equal to 4, 8, 16, 32, 64, or 128; and (2) uniform sampling over the continuous action space. Across every tested magnitude and number of generated actions, neither baseline produces sustained or meaningful Humanoid behavior under this termination criterion.
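To make the baseline construction concrete, the following minimal Python sketch generates both random-action sets. The function names, the seed, and the choice K = 32 are illustrative assumptions rather than the paper's actual code; the 17-dimensional action space bounded in [-0.4, 0.4] matches Gymnasium's Humanoid environment.

import numpy as np

# Illustrative sketch of the two random-action baselines; names and
# parameter choices are assumptions, not the paper's implementation.

def random_discrete_actions(action_dim, K, magnitude, rng):
    # Each of the K candidate actions has every dimension drawn i.i.d.
    # from {-magnitude, 0, +magnitude}.
    return rng.choice([-magnitude, 0.0, magnitude], size=(K, action_dim))

def random_uniform_actions(low, high, K, rng):
    # K candidate actions drawn uniformly over the continuous action box.
    return rng.uniform(low=low, high=high, size=(K, len(low)))

rng = np.random.default_rng(0)
discrete_set = random_discrete_actions(action_dim=17, K=32, magnitude=0.1, rng=rng)
uniform_set = random_uniform_actions(np.full(17, -0.4), np.full(17, 0.4), K=32, rng=rng)

At each step the agent would pick one action from the generated set (e.g., uniformly at random), which is the sense in which the baselines replace our learned components with fixed random actions.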

[Figure: Humanoid rollouts under Maxi-Mix-Mini-Com versus random actions.]

Mapping from State to Components