Kseniase posted an update Nov 2
11 Fascinating new Policy Optimization techniques

Policy optimization (PO) algorithms are central to training AI models with preference-based feedback. In recent weeks, a wave of new PO methods has emerged that build on or replace the popular PPO and GRPO to address their weaknesses. Here are 11 of them:

1. BAlanced Policy Optimization (BAPO) → BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping (2510.18927)
Dynamically adjusts the clipping bounds in PPO-style updates to balance positive and negative gradient contributions and prevent entropy collapse
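
A minimal PyTorch sketch of the adaptive-clipping idea, for intuition only: the rule used here (widen the clip bound of whichever side contributes too little of the objective mass) is a simplification I'm assuming, not the paper's exact update.

```python
import torch

def bapo_loss(logp_new, logp_old, adv, c_pos=0.2, c_neg=0.2,
              target_pos_frac=0.5, step_size=0.01):
    # PPO-style clipped objective with separate clip widths for
    # positive- and negative-advantage tokens.
    ratio = torch.exp(logp_new - logp_old)
    pos = (adv >= 0).float()
    upper = 1 + pos * c_pos + (1 - pos) * c_neg
    lower = 1 - pos * c_pos - (1 - pos) * c_neg
    clipped = torch.maximum(torch.minimum(ratio, upper), lower)
    loss = -torch.min(ratio * adv, clipped * adv).mean()

    # Adapt the bounds for the next update: if positive tokens contribute
    # too little of the objective mass, give them more room (and vice versa).
    with torch.no_grad():
        pos_mass = (ratio * adv.clamp(min=0)).sum()
        neg_mass = (ratio * (-adv).clamp(min=0)).sum()
        pos_frac = pos_mass / (pos_mass + neg_mass + 1e-8)
        if pos_frac < target_pos_frac:
            c_pos += step_size
        else:
            c_neg += step_size
    return loss, c_pos, c_neg
```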

2. Training-Free GRPO → Training-Free Group Relative Policy Optimization (2510.08191)
Instead of using numeric rewards, it compares rollouts semantically to distill useful knowledge as a token prior, which is then applied during inference to guide the model’s behavior
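
Here's a rough sketch of what such a training-free loop could look like; `llm_generate` and the prompt templates are hypothetical placeholders, not the paper's implementation.

```python
# Sketch only: `llm_generate(prompt)` is a hypothetical helper wrapping
# whatever chat/completion API you use.

def distill_token_prior(llm_generate, tasks, group_size=4, lessons=None):
    lessons = list(lessons or [])
    for task in tasks:
        prior = "\n".join(lessons)
        # Sample a group of rollouts for the same task (no weight updates).
        rollouts = [llm_generate(f"{prior}\n\nTask: {task}\nAnswer:")
                    for _ in range(group_size)]
        # Compare the rollouts semantically and distill a reusable lesson;
        # this plays the role that numeric group advantages play in GRPO.
        critique = (
            "Here are several attempts at the same task:\n"
            + "\n---\n".join(rollouts)
            + "\n\nState one short, reusable lesson that distinguishes the "
              "better attempts from the worse ones."
        )
        lessons.append(llm_generate(critique))
    return lessons  # prepended to prompts at inference as a token prior
```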

3. Asymmetric Importance Sampling Policy Optimization (ASPO) → ASPO: Asymmetric Importance Sampling Policy Optimization (2510.06062)
Fixes imbalanced token weighting in LLM training. It flips the importance sampling ratios for positive tokens to correct over- and under-updates, and adds a soft dual-clipping step to keep gradients stable
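
A sketch of the asymmetric weighting idea; the reciprocal flip for positive tokens and the tanh-based soft clip are my assumptions standing in for the paper's exact transform.

```python
import torch

def aspo_style_loss(logp_new, logp_old, adv, clip_eps=0.2):
    # Token-level importance ratios, detached and used as weights.
    r = torch.exp(logp_new - logp_old).detach()
    # Flip the weighting for positive-advantage tokens: low-probability
    # correct tokens get boosted, already-confident ones get damped.
    w = torch.where(adv > 0, 1.0 / r.clamp(min=1e-4), r)
    # Soft clipping: smoothly squash weights toward [1 - eps, 1 + eps]
    # instead of a hard cutoff, standing in for soft dual-clipping.
    w = 1 + clip_eps * torch.tanh((w - 1) / clip_eps)
    return -(w * adv * logp_new).mean()
```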

4. In-Context Steered Policy Optimization (ICPO) → https://arxiv.org/abs/2510.26519
Uses a model’s own in-context learning ability to guide training with existing data. It combines Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration and adds Expert Region Reject Sampling and Annealed Expert-Bonus Reward Shaping to ensure stability and balanced expert influence
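
A toy sketch of the mixed-group part, assuming expert rollouts (generated with in-context guidance) are simply pooled with on-policy ones and the expert bonus decays linearly; the paper's reject sampling and annealing schedules are more involved.

```python
import numpy as np

def icpo_group_advantages(policy_rewards, expert_rewards, step, total_steps,
                          bonus_init=0.5):
    # Annealed expert bonus: strong guidance early, fading to zero.
    bonus = bonus_init * (1 - step / total_steps)
    policy_rewards = np.asarray(policy_rewards, dtype=float)
    expert_rewards = np.asarray(expert_rewards, dtype=float) + bonus
    # Pool expert and on-policy rollouts into one GRPO-style group.
    rewards = np.concatenate([policy_rewards, expert_rewards])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    n = len(policy_rewards)
    return adv[:n], adv[n:]  # advantages for on-policy vs. expert rollouts
```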

5. Graph-Enhanced Policy Optimization (GEPO) → https://arxiv.org/abs/2510.26270
Builds a graph of an agent’s experiences to capture how different states connect, then uses it to guide exploration and assign rewards more effectively
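
An illustrative sketch of a graph memory over states (not GEPO's exact construction): visit counts on the graph give a dense exploration bonus, and dead-end nodes suggest where to explore next.

```python
import networkx as nx

class ExperienceGraph:
    def __init__(self, bonus=0.1):
        self.g = nx.DiGraph()   # nodes: hashed states, edges: transitions
        self.bonus = bonus

    def update(self, state, action, next_state):
        self.g.add_edge(state, next_state, action=action)
        visits = self.g.nodes[next_state].get("visits", 0) + 1
        self.g.nodes[next_state]["visits"] = visits
        # Dense shaping reward: larger for states we have rarely reached.
        return self.bonus / visits

    def frontier(self):
        # States with no outgoing edges yet are natural exploration targets.
        return [s for s in self.g.nodes if self.g.out_degree(s) == 0]
```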

6. Information Gain-based Policy Optimization (IGPO) → Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents (2510.14967)
Uses the model’s own belief updates to create dense, informative feedback for smoother multi-turn learning
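
The core idea in a few lines, assuming the "belief" is scored as the probability the model assigns to the ground-truth answer after each turn (the paper's exact scoring may differ).

```python
import torch

def information_gain_rewards(answer_logprobs_per_turn):
    # answer_logprobs_per_turn[t] = log-prob of the ground-truth answer
    # after turn t (index 0 = before any interaction).
    p = torch.exp(torch.as_tensor(answer_logprobs_per_turn, dtype=torch.float32))
    # One dense reward per turn: how much that turn increased the belief.
    return p[1:] - p[:-1]

# e.g. information_gain_rewards([-2.3, -1.6, -0.4, -0.3])
#      -> rewards for three turns, largest for the most informative turn
```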

Read further below ⬇️
If you like this, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
7. Agentic Entropy-Balanced Policy Optimization (AEPO) → https://huggingface.co/papers/2510.14545
Keeps web agents from collapsing during training by balancing entropy in data collection and policy updates, and adjusting gradients on high-uncertainty steps
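
A sketch of the rollout side of this idea: spend more sampling budget on high-entropy (high-uncertainty) steps, capped per step. The proportional rule and the cap are assumptions, not the paper's allocation; the update side similarly rescales rather than discards gradients on high-entropy tokens.

```python
import numpy as np

def entropy_balanced_branching(step_entropies, total_budget, max_per_step=4):
    # Allocate extra rollout branches in proportion to each step's entropy,
    # so uncertain steps (e.g. tool-call decisions) get explored more.
    e = np.asarray(step_entropies, dtype=float)
    weights = e / (e.sum() + 1e-8)
    branches = np.minimum(np.round(weights * total_budget), max_per_step)
    return branches.astype(int)  # number of extra rollouts per step
```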

8. Agent- and Turn-wise Grouped Reinforcement Policy Optimization (AT-GRPO) → https://huggingface.co/papers/2510.11062
PO for multi-agent LLM systems. It groups training by agent roles and dialogue turns, allowing each agent to learn more effectively within its context
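
A sketch of the grouping itself, with illustrative field names: advantages are normalized within each (agent role, turn) group instead of per prompt.

```python
from collections import defaultdict
import numpy as np

def at_grpo_advantages(samples):
    # Each sample: {"agent": role_id, "turn": turn_id, "reward": float}
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        groups[(s["agent"], s["turn"])].append(i)
    adv = np.zeros(len(samples))
    for idx in groups.values():
        r = np.array([samples[i]["reward"] for i in idx], dtype=float)
        adv[idx] = (r - r.mean()) / (r.std() + 1e-8)  # normalize per group
    return adv
```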

9. Direct Group Preference Optimization (DGPO) → https://huggingface.co/papers/2510.08425
An RL method designed for diffusion models. It learns directly from group-level preferences between samples, allowing it to use fast deterministic ODE samplers instead of noisy stochastic ones
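
A very rough sketch of the group-preference idea, and only an interpretation: above-group-mean samples are treated as preferred and compared to below-mean ones in a DPO-style loss over per-sample log-likelihood surrogates (e.g. denoising scores along the same ODE trajectory). The paper's actual objective differs in the details.

```python
import torch
import torch.nn.functional as F

def group_preference_loss(scores, ref_scores, rewards, beta=0.1):
    # Split the group into above-mean ("preferred") and below-mean samples.
    adv = rewards - rewards.mean()
    win, lose = adv > 0, adv <= 0
    if win.sum() == 0 or lose.sum() == 0:
        return scores.sum() * 0.0  # degenerate group, no learning signal
    # DPO-style margin between preferred and dispreferred samples, measured
    # as improvement of the log-likelihood surrogate over the reference model.
    d_win = (scores[win] - ref_scores[win]).mean()
    d_lose = (scores[lose] - ref_scores[lose]).mean()
    return -F.logsigmoid(beta * (d_win - d_lose))
```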

10. Entropy-regularized Policy Optimization (EPO) → https://huggingface.co/papers/2509.22576
Controls entropy and adapts it across training phases, encouraging exploration early on and steady convergence later
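
For example, a simple phase-dependent coefficient schedule; the cosine decay here is an assumption, and the paper adapts the coefficient in its own way.

```python
import math

def entropy_coef(step, total_steps, start=1e-2, end=1e-4):
    # High entropy bonus early (exploration), decaying smoothly toward a
    # small value late in training (steady convergence).
    t = step / max(total_steps, 1)
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * t))

# Typical use inside a PPO/GRPO-style update:
#   loss = policy_loss - entropy_coef(step, total_steps) * entropy.mean()
```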

11. Multiplayer Nash Preference Optimization (MNPO) → https://huggingface.co/papers/2509.23102
Extends human-feedback alignment to a multiplayer game setup. Each policy competes with a population of others, capturing more complex and realistic human preference patterns while maintaining stable Nash equilibria