Kseniase posted an update Nov 2
11 Fascinating new Policy Optimization techniques

Policy optimization (PO) algorithms are central to training AI models with preference-based feedback. In recent weeks, a wave of new PO methods has emerged that build on or replace the popular PPO and GRPO to address their weaknesses. Here are 11 of them:

1. BAlanced Policy Optimization (BAPO) → BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping (2510.18927)
Dynamically adjusts the clipping bounds in PPO-style updates to balance positive and negative gradient contributions and prevent entropy collapse
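
A minimal PyTorch sketch of the adaptive-clipping idea, for intuition only: the rule used here (widen the clip bound of whichever side contributes too little of the objective mass) is a simplification I'm assuming, not the paper's exact update.

```python
import torch

def bapo_loss(logp_new, logp_old, adv, c_pos=0.2, c_neg=0.2,
              target_pos_frac=0.5, step_size=0.01):
    # PPO-style clipped objective with separate clip widths for
    # positive- and negative-advantage tokens.
    ratio = torch.exp(logp_new - logp_old)
    pos = (adv >= 0).float()
    upper = 1 + pos * c_pos + (1 - pos) * c_neg
    lower = 1 - pos * c_pos - (1 - pos) * c_neg
    clipped = torch.maximum(torch.minimum(ratio, upper), lower)
    loss = -torch.min(ratio * adv, clipped * adv).mean()

    # Adapt the bounds for the next update: if positive tokens contribute
    # too little of the objective mass, give them more room (and vice versa).
    with torch.no_grad():
        pos_mass = (ratio * adv.clamp(min=0)).sum()
        neg_mass = (ratio * (-adv).clamp(min=0)).sum()
        pos_frac = pos_mass / (pos_mass + neg_mass + 1e-8)
        if pos_frac < target_pos_frac:
            c_pos += step_size
        else:
            c_neg += step_size
    return loss, c_pos, c_neg
```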

2. Training-Free GRPO → Training-Free Group Relative Policy Optimization (2510.08191)
Instead of using numeric rewards, it compares rollouts semantically to distill useful knowledge as a token prior, which is then applied during inference to guide the model’s behavior
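
Here's a rough sketch of what such a training-free loop could look like; `llm_generate` and the prompt templates are hypothetical placeholders, not the paper's implementation.

```python
# Sketch only: `llm_generate(prompt)` is a hypothetical helper wrapping
# whatever chat/completion API you use.

def distill_token_prior(llm_generate, tasks, group_size=4, lessons=None):
    lessons = list(lessons or [])
    for task in tasks:
        prior = "\n".join(lessons)
        # Sample a group of rollouts for the same task (no weight updates).
        rollouts = [llm_generate(f"{prior}\n\nTask: {task}\nAnswer:")
                    for _ in range(group_size)]
        # Compare the rollouts semantically and distill a reusable lesson;
        # this plays the role that numeric group advantages play in GRPO.
        critique = (
            "Here are several attempts at the same task:\n"
            + "\n---\n".join(rollouts)
            + "\n\nState one short, reusable lesson that distinguishes the "
              "better attempts from the worse ones."
        )
        lessons.append(llm_generate(critique))
    return lessons  # prepended to prompts at inference as a token prior
```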

3. Asymmetric Importance Sampling Policy Optimization (ASPO) → ASPO: Asymmetric Importance Sampling Policy Optimization (2510.06062)
Fixes imbalanced token weighting in LLM training. It flips the importance sampling ratios for positive tokens to correct over- and under-updates, and adds a soft dual-clipping step to keep gradients stable
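
A sketch of the asymmetric weighting idea; the reciprocal flip for positive tokens and the tanh-based soft clip are my assumptions standing in for the paper's exact transform.

```python
import torch

def aspo_style_loss(logp_new, logp_old, adv, clip_eps=0.2):
    # Token-level importance ratios, detached and used as weights.
    r = torch.exp(logp_new - logp_old).detach()
    # Flip the weighting for positive-advantage tokens: low-probability
    # correct tokens get boosted, already-confident ones get damped.
    w = torch.where(adv > 0, 1.0 / r.clamp(min=1e-4), r)
    # Soft clipping: smoothly squash weights toward [1 - eps, 1 + eps]
    # instead of a hard cutoff, standing in for soft dual-clipping.
    w = 1 + clip_eps * torch.tanh((w - 1) / clip_eps)
    return -(w * adv * logp_new).mean()
```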

4. In-Context Steered Policy Optimization (ICPO) → https://arxiv.org/abs/2510.26519
Uses a model’s own in-context learning ability to guide training with existing data. It combines Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration and adds Expert Region Reject Sampling and Annealed Expert-Bonus Reward Shaping to ensure stability and balanced expert influence
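
A toy sketch of the mixed-group part, assuming expert rollouts (generated with in-context guidance) are simply pooled with on-policy ones and the expert bonus decays linearly; the paper's reject sampling and annealing schedules are more involved.

```python
import numpy as np

def icpo_group_advantages(policy_rewards, expert_rewards, step, total_steps,
                          bonus_init=0.5):
    # Annealed expert bonus: strong guidance early, fading to zero.
    bonus = bonus_init * (1 - step / total_steps)
    policy_rewards = np.asarray(policy_rewards, dtype=float)
    expert_rewards = np.asarray(expert_rewards, dtype=float) + bonus
    # Pool expert and on-policy rollouts into one GRPO-style group.
    rewards = np.concatenate([policy_rewards, expert_rewards])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    n = len(policy_rewards)
    return adv[:n], adv[n:]  # advantages for on-policy vs. expert rollouts
```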

5. Graph-Enhanced Policy Optimization (GEPO) → https://arxiv.org/abs/2510.26270
Builds a graph of an agent’s experiences to capture how different states connect, then uses it to guide exploration and assign rewards more effectively
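
An illustrative sketch of a graph memory over states (not GEPO's exact construction): visit counts on the graph give a dense exploration bonus, and dead-end nodes suggest where to explore next.

```python
import networkx as nx

class ExperienceGraph:
    def __init__(self, bonus=0.1):
        self.g = nx.DiGraph()   # nodes: hashed states, edges: transitions
        self.bonus = bonus

    def update(self, state, action, next_state):
        self.g.add_edge(state, next_state, action=action)
        visits = self.g.nodes[next_state].get("visits", 0) + 1
        self.g.nodes[next_state]["visits"] = visits
        # Dense shaping reward: larger for states we have rarely reached.
        return self.bonus / visits

    def frontier(self):
        # States with no outgoing edges yet are natural exploration targets.
        return [s for s in self.g.nodes if self.g.out_degree(s) == 0]
```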

6. Information Gain-based Policy Optimization (IGPO) → Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents (2510.14967)
Uses the model’s own belief updates to create dense, informative feedback for smoother multi-turn learning
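
The core idea in a few lines, assuming the "belief" is scored as the probability the model assigns to the ground-truth answer after each turn (the paper's exact scoring may differ).

```python
import torch

def information_gain_rewards(answer_logprobs_per_turn):
    # answer_logprobs_per_turn[t] = log-prob of the ground-truth answer
    # after turn t (index 0 = before any interaction).
    p = torch.exp(torch.as_tensor(answer_logprobs_per_turn, dtype=torch.float32))
    # One dense reward per turn: how much that turn increased the belief.
    return p[1:] - p[:-1]

# e.g. information_gain_rewards([-2.3, -1.6, -0.4, -0.3])
#      -> rewards for three turns, largest for the most informative turn
```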

Read further below ⬇️
If you like this, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
7. Agentic Entropy-Balanced Policy Optimization (AEPO) → https://huggingface.co/papers/2510.14545
Keeps web agents from collapsing during training by balancing entropy in data collection and policy updates, and adjusting gradients on high-uncertainty steps
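
A sketch of the rollout side of this idea: spend more sampling budget on high-entropy (high-uncertainty) steps, capped per step. The proportional rule and the cap are assumptions, not the paper's allocation; the update side similarly rescales rather than discards gradients on high-entropy tokens.

```python
import numpy as np

def entropy_balanced_branching(step_entropies, total_budget, max_per_step=4):
    # Allocate extra rollout branches in proportion to each step's entropy,
    # so uncertain steps (e.g. tool-call decisions) get explored more.
    e = np.asarray(step_entropies, dtype=float)
    weights = e / (e.sum() + 1e-8)
    branches = np.minimum(np.round(weights * total_budget), max_per_step)
    return branches.astype(int)  # number of extra rollouts per step
```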

8. Agent- and Turn-wise Grouped Reinforcement Policy Optimization (AT-GRPO) → https://huggingface.co/papers/2510.11062
PO for multi-agent LLM systems. It groups training by agent roles and dialogue turns, allowing each agent to learn more effectively within its context
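
A sketch of the grouping itself, with illustrative field names: advantages are normalized within each (agent role, turn) group instead of per prompt.

```python
from collections import defaultdict
import numpy as np

def at_grpo_advantages(samples):
    # Each sample: {"agent": role_id, "turn": turn_id, "reward": float}
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        groups[(s["agent"], s["turn"])].append(i)
    adv = np.zeros(len(samples))
    for idx in groups.values():
        r = np.array([samples[i]["reward"] for i in idx], dtype=float)
        adv[idx] = (r - r.mean()) / (r.std() + 1e-8)  # normalize per group
    return adv
```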

9. Direct Group Preference Optimization (DGPO) → https://huggingface.co/papers/2510.08425
An RL method designed for diffusion models. It learns directly from group-level preferences between samples, allowing it to use fast deterministic ODE samplers instead of noisy stochastic ones
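
A very rough sketch of the group-preference idea, and only an interpretation: above-group-mean samples are treated as preferred and compared to below-mean ones in a DPO-style loss over per-sample log-likelihood surrogates (e.g. denoising scores along the same ODE trajectory). The paper's actual objective differs in the details.

```python
import torch
import torch.nn.functional as F

def group_preference_loss(scores, ref_scores, rewards, beta=0.1):
    # Split the group into above-mean ("preferred") and below-mean samples.
    adv = rewards - rewards.mean()
    win, lose = adv > 0, adv <= 0
    if win.sum() == 0 or lose.sum() == 0:
        return scores.sum() * 0.0  # degenerate group, no learning signal
    # DPO-style margin between preferred and dispreferred samples, measured
    # as improvement of the log-likelihood surrogate over the reference model.
    d_win = (scores[win] - ref_scores[win]).mean()
    d_lose = (scores[lose] - ref_scores[lose]).mean()
    return -F.logsigmoid(beta * (d_win - d_lose))
```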

10. Entropy-regularized Policy Optimization (EPO) → https://huggingface.co/papers/2509.22576
Controls entropy and adapts it across training phases, encouraging exploration early on and steady convergence later
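
For example, a simple phase-dependent coefficient schedule; the cosine decay here is an assumption, and the paper adapts the coefficient in its own way.

```python
import math

def entropy_coef(step, total_steps, start=1e-2, end=1e-4):
    # High entropy bonus early (exploration), decaying smoothly toward a
    # small value late in training (steady convergence).
    t = step / max(total_steps, 1)
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * t))

# Typical use inside a PPO/GRPO-style update:
#   loss = policy_loss - entropy_coef(step, total_steps) * entropy.mean()
```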

11. Multiplayer Nash Preference Optimization (MNPO) → https://huggingface.co/papers/2509.23102
Extends human-feedback alignment to a multiplayer game setup. Each policy competes with a population of others, capturing more complex and realistic human preference patterns while maintaining stable Nash equilibria