We introduce an online unsupervised reinforcement learning (RL) framework that autonomously quantizes the agent's action space into component policies via a joint entropy objective: it maximizes the cumulative entropy of the overall mixture policy to ensure diverse, exploratory behavior under the maximum-occupancy principle, while minimizing the entropy of each component so that the components are highly specialized and mutually distinct. Unlike existing approaches, our framework tackles the quantization of actions into component policies in a fully principled, unsupervised, and online manner. We prove convergence in the tabular setting through a novel policy-iteration algorithm, and then extend to continuous control by fixing the discovered components and deploying them deterministically within an online optimizer that maximizes cumulative reward. Empirical results demonstrate that our {\em maxi-mix-mini-com} entropy-based action-policy quantization provides interpretable, reusable, token-like behavioral patterns, yielding a fully online, task-agnostic, and scalable architecture that requires no task-specific offline data and transfers readily across tasks.
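To make the objective concrete, one schematic way to write the {\em maxi-mix-mini-com} criterion is sketched below; the finite mixture with state-dependent weights $w_k$, the temperature $\beta$, and the discount $\gamma$ are our illustrative notation, not necessarily the paper's exact formulation.
\[
\max_{\{\pi_k\},\, w} \;
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
\left( \mathcal{H}\!\big(\pi(\cdot \mid s_t)\big)
\;-\; \beta \sum_{k=1}^{K} w_k(s_t)\,\mathcal{H}\!\big(\pi_k(\cdot \mid s_t)\big) \right)\right],
\qquad
\pi(a \mid s) \;=\; \sum_{k=1}^{K} w_k(s)\,\pi_k(a \mid s),
\]
where the first term rewards high entropy of the mixture policy $\pi$ (exploration and coverage), and the second penalizes the entropy of each component $\pi_k$, pushing every component toward a narrow, specialized behavior.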
This section evaluates our method's ability to control the complex dynamics of the MuJoCo Humanoid task, where randomly generated actions invariably fail. In the default environment, an episode terminates if the torso's $z$-coordinate (height) falls outside the healthy interval $[1.0, 2.0]$. We compare our approach against two random-action baselines: (1) discretized sampling, in which $K$ actions are generated by drawing each action dimension independently from $\{-\text{magnitude},\, 0,\, +\text{magnitude}\}$, with magnitudes of 0.05, 0.1, 0.2, 0.3, and 0.4 and $K \in \{4, 8, 16, 32, 64, 128\}$; and (2) uniform sampling over the continuous action space. Across every tested magnitude and number of generated actions, neither baseline produces sustained or meaningful Humanoid behavior under this termination criterion.
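For concreteness, the sketch below shows one way the two random-action baselines can be instantiated. The 17-dimensional action space, the $[-0.4, 0.4]$ action bounds, and all function names are illustrative assumptions, not the exact evaluation code.
\begin{verbatim}
import numpy as np

# Minimal sketch of the two random-action baselines (assumptions: a
# Humanoid-style action box with 17 dimensions and bounds [-0.4, 0.4]).

def discretized_random_actions(K, action_dim, magnitude, rng):
    """Baseline (1): each action dimension is drawn independently and
    uniformly from {-magnitude, 0, +magnitude}."""
    return rng.choice([-magnitude, 0.0, magnitude], size=(K, action_dim))

def uniform_random_actions(K, action_dim, low, high, rng):
    """Baseline (2): actions sampled uniformly over the continuous box."""
    return rng.uniform(low, high, size=(K, action_dim))

rng = np.random.default_rng(seed=0)
discrete_set = discretized_random_actions(K=16, action_dim=17,
                                          magnitude=0.2, rng=rng)
uniform_set = uniform_random_actions(K=16, action_dim=17,
                                     low=-0.4, high=0.4, rng=rng)
print(discrete_set.shape, uniform_set.shape)  # (16, 17) (16, 17)
\end{verbatim}
Each row of the returned arrays is one candidate action; the baselines replay such candidates in the environment and, as reported above, neither keeps the torso within the healthy height interval for any tested magnitude or $K$.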