Implementation of the paper "Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning" (arXiv:2606.10968).
CPPO is an LLM RL algorithm. It replays the sampled tokens, compares train-side
log-probs against the rollout log-probs, expands sample-level advantages to tokens, and
applies a position-weighted, cumulative-prefix-budget Binary-TV mask on top of the
DPPO ratio-advantage surrogate (this repo ships the Binary-TV variant).
- Loss: unirl/algorithms/cppo.py (CPPO, _cppo_loss, _cppo_mask)
- Recipe (SGLang): examples/ar/qwen3_cppo_30b_a3b_base_dapo_sglang.yaml: Qwen3-30B-A3B-Base on DAPO-Math
- Config extract: config.yaml
Lineage: PPO → GRPO → DPPO → CPPO. CPPO is the hard-mask sibling of DPPO Binary-TV;
DRPO is the smooth-mask sibling of the same trust region.
LLM RL is off-policy: the rollout engine (SGLang) and the training engine differ, and one
batch of rollouts is split into several gradient steps, so the updated policy π drifts
from the behavior policy µ that sampled the tokens. DPPO replaces PPO's ratio clipping
with a Binary-TV mask — the trust region is |π(y_t|s_t) − µ(y_t|s_t)| ≤ δ, constraining
the absolute probability shift on the sampled token, which is better behaved than the ratio
for a long-tailed vocabulary.
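
To see why the two trust regions disagree, here is a minimal sketch in plain Python. The helper names, the PPO clip width eps = 0.2, and the threshold delta = 0.15 are illustrative assumptions, not code or settings taken from this repo.

```python
def in_ratio_clip(pi: float, mu: float, eps: float = 0.2) -> bool:
    """PPO-style region: the probability ratio stays within [1 - eps, 1 + eps]."""
    ratio = pi / mu
    return 1.0 - eps <= ratio <= 1.0 + eps


def in_binary_tv(pi: float, mu: float, delta: float = 0.15) -> bool:
    """Binary-TV region: the absolute shift on the sampled token is at most delta."""
    return abs(pi - mu) <= delta


# (mu, pi) for the sampled token: a rare token, a common token, a large shift.
cases = [(0.001, 0.003), (0.90, 0.74), (0.50, 0.70)]
for mu, pi in cases:
    print(f"mu={mu:.3f} pi={pi:.3f} "
          f"ratio_ok={in_ratio_clip(pi, mu)} tv_ok={in_binary_tv(pi, mu)}")
```

The rare token triples its ratio while moving only 0.002 in probability, so the ratio clip rejects it and Binary-TV accepts it. The common token stays inside the ratio clip (ratio about 0.82) yet shifts by 0.16, which exceeds delta. The third case falls outside both regions.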
But DPPO applies the same threshold δ to every token. CPPO's observation (paper §3-4)
is that not all positions deserve the same budget: errors compound along a sequence, and a
fixed per-token threshold lets the cumulative drift over a long response grow unchecked.
CPPO therefore makes the trust region position-aware and cumulative:
- Position weight w_t decreasing from 1 (first token) to w_min (last token), so a given shift D_t counts at full weight early in the response, where errors compound over everything that follows, and is discounted (down to w_min) toward the end.
- Cumulative prefix budget: the threshold at token t tightens by how much the prefix y_{<t} has already spent, so a response that drifts early is held to a tighter bound later.
Only the mask changes — the loss term is DPPO's −A_t · r_t on kept tokens.
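
As a small illustration of the position weight, the sketch below evaluates the linear schedule w_min + (1 − w_min)·(T − t)/max(T − 1, 1) that the repo's mask code uses. The function name `position_weights` is hypothetical, and w_min = 0.8 is the recipe's cppo_w_min.

```python
def position_weights(T: int, w_min: float = 0.8) -> list[float]:
    """Linearly decreasing weights: 1.0 at t = 1 down to w_min at t = T."""
    denom = max(T - 1, 1)  # avoids division by zero for a single-token response
    return [w_min + (1.0 - w_min) * (T - t) / denom for t in range(1, T + 1)]


weights = position_weights(5)
print([round(w, 3) for w in weights])  # prints [1.0, 0.95, 0.9, 0.85, 0.8]
```

Under this formula a single-token response (T = 1) gets weight w_min, because T − t is zero for its only position.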
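For a response of length T (t is the 1-based token position), the probability ratio, Binary-TV divergence, position weight, and weighted divergence are:

$$
r_t = \frac{\pi_\theta(y_t \mid s_t)}{\mu(y_t \mid s_t)}, \qquad
D_t = \left|\pi_\theta(y_t \mid s_t) - \mu(y_t \mid s_t)\right|, \qquad
w_t = w_{\min} + (1 - w_{\min})\,\frac{T - t}{\max(T - 1,\, 1)}, \qquad
Z_t = w_t\, D_t
$$
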
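With prefix sums S_{t-1} = Σ_{j<t} Z_j and W_{t-1} = Σ_{j<t} w_j (S_0 = W_0 = 0), the
effective threshold (Eq. 8) and keep rule (Eq. 10) are:

$$
c_t = \min\!\left(\delta,\; \delta + \delta_b^{\text{seq}}\, W_{t-1} - S_{t-1}\right), \qquad
M_t = \mathbb{1}\!\left[\, A_t\,(r_t - 1) \le 0 \;\lor\; Z_t \le c_t \,\right]
$$

Here δ_b^seq is the per-sequence prefix budget. The first clause always keeps updates that move π back toward µ (i.e. A_t(r_t−1) ≤ 0);
the budget only restricts updates that push π farther from µ. The per-token loss is then:

$$
\ell_t = -\, M_t\, A_t\, r_t
$$

with M_t treated as a constant (no gradient) and the per-token terms reduced according to loss_agg_mode.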
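Dynamic prefix budget (paper Eq. 22, Base-model warm-up calibration): each sequence sets its own prefix budget δ_b^seq from its divergence statistics,

$$
\delta_b^{\text{seq}} = \operatorname{clamp}\!\left(P_{90}(D_1, \dots, D_T),\; \delta_b^{\min},\; 2\,\delta_b^{\min}\right)
$$

where P_90 is the 90th percentile of the sequence's per-token divergences and δ_b^min is the config cppo_delta_b.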
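
The following worked example traces the threshold and keep rule on a four-token response. It is a plain-Python sketch that mirrors the formulas in this section rather than the repo's `_cppo_mask`. The helper names and all input numbers are made up for illustration, and the quantile helper assumes linear interpolation (the default of torch.quantile).

```python
from itertools import accumulate


def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def cppo_keep_mask(D, adv, ratio, delta=0.20, w_min=0.8, delta_b_min=0.02):
    """Return (c, keep) for one response; D, adv, ratio are per-token lists."""
    T = len(D)
    denom = max(T - 1, 1)
    w = [w_min + (1.0 - w_min) * (T - t) / denom for t in range(1, T + 1)]
    Z = [w_t * d_t for w_t, d_t in zip(w, D)]
    # Exclusive prefix sums S_{t-1} and W_{t-1}, with S_0 = W_0 = 0.
    S_prev = list(accumulate(Z, initial=0.0))[:-1]
    W_prev = list(accumulate(w, initial=0.0))[:-1]
    # Dynamic budget: 90th percentile of D clamped to [delta_b_min, 2 * delta_b_min].
    delta_b_seq = min(max(quantile(D, 0.9), delta_b_min), 2.0 * delta_b_min)
    c = [min(delta, delta + delta_b_seq * wp - sp) for wp, sp in zip(W_prev, S_prev)]
    # Keep if the update moves back toward mu, or the weighted divergence fits the budget.
    keep = [a * (r - 1.0) <= 0.0 or z <= c_t
            for a, r, z, c_t in zip(adv, ratio, Z, c)]
    return c, keep


c, keep = cppo_keep_mask(
    D=[0.05, 0.30, 0.10, 0.01],      # illustrative per-token |pi - mu|
    adv=[1.0, 1.0, 1.0, 1.0],        # positive advantage on every token
    ratio=[1.05, 1.40, 1.10, 0.97],  # illustrative probability ratios
)
print([round(x, 4) for x in c], keep)
```

With these inputs the mask is [True, False, False, True]. Token 2 exceeds its threshold of 0.19. Its weighted divergence is still added to the prefix sum, so token 3 is rejected even though its own shift is small. Token 4 is kept by the first clause because its ratio is below 1 with a positive advantage.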
The reward here is rule-based: MathVerifyRewardScorer
(unirl/reward/local/mathverify.py) checks the parsed
answer against the DAPO-Math ground truth — the verifiable reward the paper trains on.
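
To show the shape of a rule-based verifiable reward, here is a deliberately naive stand-in. It is not the repo's MathVerifyRewardScorer: the function names are hypothetical, the `\boxed{...}` answer convention is an assumption, and it compares exact strings where a real math verifier would check equivalence.

```python
import re
from typing import Optional


def extract_last_boxed(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} (no nested braces), if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def toy_math_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted answer string equals the ground truth, else 0.0."""
    answer = extract_last_boxed(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


print(toy_math_reward(r"... so the answer is \boxed{42}.", "42"))
print(toy_math_reward("I am not sure.", "42"))
```

The reward is binary and needs no learned reward model, which is what makes it verifiable.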
_cppo_mask builds the keep-mask under torch.no_grad (it is a trust-region gate, not part
of the differentiable loss). UniRL packs an AR batch as a single varlen [total_tokens] tensor,
so the mask is computed per sequence via torch.split(..., segment.lengths) — the prefix
sums must not bleed across packed boundaries, and the position weight keys on each sequence's
own length T (there is no padding, unlike a 2D right-padded layout):

```python
pos = torch.arange(1, T + 1)
w_t = w_min + (1.0 - w_min) * (T - pos) / max(T - 1, 1)  # decreasing position weight
Z_t = w_t * D_t
S_prev = torch.cat([Z_t.new_zeros(1), torch.cumsum(Z_t, 0)[:-1]])  # right-shifted prefix sums
W_prev = torch.cat([w_t.new_zeros(1), torch.cumsum(w_t, 0)[:-1]])
delta_b_seq = torch.quantile(D_t, 0.9).clamp(min=delta_b, max=2.0 * delta_b)  # Eq. 22
c_t = torch.minimum(torch.full_like(Z_t, delta), delta + delta_b_seq * W_prev - S_prev)  # Eq. 8
keep = (adv * (ratio - 1.0) <= 0.0) | (Z_t <= c_t)  # Eq. 10
```

_cppo_loss then forms −A_t · r_t · keep with r_t kept differentiable (no .detach()),
matching GRPO / DRPO.
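
The packed-layout point can be checked without torch. The sketch below uses plain lists in place of the varlen tensor and hypothetical helper names to compare per-sequence prefix sums with a naive cumulative sum over the whole packed buffer.

```python
from itertools import accumulate


def split_by_lengths(flat, lengths):
    """Split a packed [total_tokens] list into per-sequence chunks."""
    assert sum(lengths) == len(flat), "lengths must cover the packed buffer exactly"
    chunks, start = [], 0
    for n in lengths:
        chunks.append(flat[start:start + n])
        start += n
    return chunks


def exclusive_prefix_sums(xs):
    """S_{t-1} for t = 1..T, with S_0 = 0."""
    return list(accumulate(xs, initial=0.0))[:-1]


packed_Z = [0.1, 0.2, 0.3, 0.4, 0.5]  # two responses packed back to back
lengths = [3, 2]
per_seq = [exclusive_prefix_sums(chunk) for chunk in split_by_lengths(packed_Z, lengths)]
naive = exclusive_prefix_sums(packed_Z)  # wrong: ignores the sequence boundary

print(per_seq)  # the second response starts again from 0.0
print(naive)    # the second response would inherit the first one's spend
```

In the naive version the first token of the second response starts with about 0.6 of budget already spent by a different response, which is the bleed the per-sequence split prevents.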
| Math object | Repo object |
|---|---|
| State s_t = (x, y_{<t}) | track.conditions + the packed prefix in TextSegment |
| Sampled token y_t | segment.tokens |
| Behavior log-prob log µ(y_t \| s_t) | segment.log_probs (emitted by SGLang) → old_logp |
| New log-prob log π_θ(y_t \| s_t) | stage.replay(segment, temperature=sampling_temperature) → new_logp |
| Binary-TV divergence D_t | (exp(new_logp) − exp(old_logp)).abs() |
| Position weight w_t | w_min + (1-w_min)*(T-pos)/max(T-1,1), per sequence |
| Effective threshold c_t (Eq. 8) | torch.minimum(delta, delta + delta_b_seq*W_prev - S_prev) |
| Token-level threshold δ | cppo_delta (0.20 for 30B-A3B; paper Table 3) |
| Position-weight floor w_min | cppo_w_min (0.8) |
| Prefix-budget floor δ_b^min | cppo_delta_b (0.02) |
| Sample-level advantage | track.advantages |
| Token-level advantage | GRPO._expand_advantages_to_tokens(advantages, segment.lengths, ...) |
| Padding/eos mask | segment.loss_mask |

1. unirl.train_ar builds ARTrainer for the text-only Qwen3 recipe.
2. SGLangRolloutEngine samples completions and returns an "ar" track with packed TextSegment.tokens, log_probs, lengths, and masks.
3. MathVerifyRewardScorer scores each completion correct/incorrect.
4. RolloutTrack.compute_advantages(normalize=False, scope="group") mean-centers rewards within each prompt group; the recipe sets normalize_adv_by_std: false, so there is no std division.
5. TrainStack.train_track calls CPPO.compute_loss_and_backward, which replays the sampled tokens at temperature=sampling_temperature, reads old_logp = segment.log_probs, expands advantages to tokens, builds the CPPO mask, applies segment.loss_mask, reduces, and calls backward().

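
The advantage step in this pipeline can be sketched in a few lines. The function names below are illustrative stand-ins, not the repo's RolloutTrack.compute_advantages or GRPO._expand_advantages_to_tokens, and the rewards, group ids, and lengths are made up.

```python
from collections import defaultdict


def group_centered_advantages(rewards, group_ids):
    """Subtract each prompt group's mean reward; no division by the std."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r, g in zip(rewards, group_ids):
        totals[g] += r
        counts[g] += 1
    return [r - totals[g] / counts[g] for r, g in zip(rewards, group_ids)]


def expand_to_tokens(advantages, lengths):
    """Repeat each sample-level advantage over that sample's tokens (packed layout)."""
    return [a for a, n in zip(advantages, lengths) for _ in range(n)]


rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]  # binary correct/incorrect scores
group_ids = [0, 0, 0, 0, 1, 1]            # four samples for prompt 0, two for prompt 1
adv = group_centered_advantages(rewards, group_ids)
token_adv = expand_to_tokens(adv, lengths=[2, 1, 1, 1, 3, 2])
print(adv)  # prints [0.75, -0.25, -0.25, -0.25, 0.0, 0.0]
```

A group whose samples all receive the same reward (prompt 1 here) gets zero advantage on every token and so contributes no gradient.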
Like GRPO / DRPO, CPPO reuses the rollout log-prob as old_logp (the behavior policy
µ) — it does not freeze a train-side anchor by default, so old_logp_source: rollout is
the canonical mode. (old_logp_source: replay is available for ablations.)
Key knobs (config.yaml)
| Knob | Meaning |
|---|---|
| cppo_delta | Token-level Binary-TV threshold δ. Paper Table 3: 0.15 dense, 0.20 for 30B-A3B. |
| cppo_w_min | Position-weight floor w_min (0.8). Earlier tokens get weight 1, late tokens w_min. |
| cppo_delta_b | Dynamic prefix-budget floor δ_b^min (0.02); δ_b^seq = clamp(P90(D), δ_b, 2·δ_b). |
| sampling_temperature | MUST equal sampling.temperature (and the rollout engine's). |
| loss_agg_mode | token-mean, or the recipe's seq-mean-token-sum-norm. |
| horizon | Fixed normalizer for seq-mean-token-sum-norm; recipe 16384. |
| old_logp_source | rollout (canonical: µ = the SGLang sampler's logp) or replay (ablation). |

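
The two loss_agg_mode options weight sequences differently. The sketch below assumes that seq-mean-token-sum-norm sums the masked token losses of each sequence, divides by the fixed horizon, and averages over sequences. That reading follows the knob descriptions and is not checked against the repo's reduction code. The function names and toy numbers (including horizon=4) are illustrative.

```python
def token_mean(losses, mask):
    """Average over all unmasked tokens in the packed batch."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(losses, mask)) / max(kept, 1)


def seq_mean_token_sum_norm(losses, mask, lengths, horizon):
    """Per-sequence masked token sum, divided by a fixed horizon, averaged over sequences."""
    seq_losses, start = [], 0
    for n in lengths:
        seq_sum = sum(l * m for l, m in zip(losses[start:start + n], mask[start:start + n]))
        seq_losses.append(seq_sum / horizon)
        start += n
    return sum(seq_losses) / max(len(seq_losses), 1)


losses = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]  # per-token loss terms, packed
mask = [1, 1, 1, 1, 1, 1]                # loss_mask: 1 means the token counts
lengths = [4, 2]

tm = token_mean(losses, mask)                                    # 8 / 6
sm = seq_mean_token_sum_norm(losses, mask, lengths, horizon=4)   # (4/4 + 4/4) / 2
print(tm, sm)
```

Under this assumption the fixed horizon keeps the normalizer independent of response length, so a long response is not down-weighted per token the way it would be under a per-sequence mean.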
Metric source: ratio_mean, ratio_max, approx_kl (k3), masked_fraction (the budget-mask
share — paper Fig. 7), and the AR-only rollout_replay_logp_absdiff_mean are emitted by
_cppo_loss / compute_loss_and_backward.
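
For readers who want to reproduce these diagnostics offline, here is a hedged sketch of how such metrics are commonly computed from per-token log-probs. The function name is hypothetical. The k3 estimator is the standard (r − 1) − log r form. Defining masked_fraction as the share of loss tokens rejected by the mask is an assumption about the repo's definition.

```python
import math


def ratio_metrics(new_logp, old_logp, keep, loss_mask):
    """Diagnostics over tokens with loss_mask == 1."""
    idx = [i for i, m in enumerate(loss_mask) if m]
    log_ratio = [new_logp[i] - old_logp[i] for i in idx]
    ratio = [math.exp(x) for x in log_ratio]
    n = max(len(idx), 1)
    return {
        "ratio_mean": sum(ratio) / n,
        "ratio_max": max(ratio, default=float("nan")),
        # k3 estimator: (r - 1) - log r, non-negative for every token.
        "approx_kl": sum((r - 1.0) - lr for r, lr in zip(ratio, log_ratio)) / n,
        # Share of loss tokens rejected by the trust-region mask.
        "masked_fraction": sum(1 for i in idx if not keep[i]) / n,
    }


metrics = ratio_metrics(
    new_logp=[math.log(0.5), math.log(0.2), math.log(0.1)],
    old_logp=[math.log(0.5), math.log(0.1), math.log(0.1)],
    keep=[True, False, True],
    loss_mask=[1, 1, 0],  # the last token is excluded from the loss
)
print(metrics)
```

In this example the two counted tokens have ratios 1 and 2, so ratio_mean is about 1.5, approx_kl is about (1 − ln 2)/2, and half of the counted tokens are masked.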

```bash
# one-time: build the local jsonl from the raw DAPO-Math + AIME datasets
python -m unirl.utils.prepare_dapo_math --out-dir data/dapo_math

DATA_PATH=data/dapo_math/train.jsonl EVAL_DATA_PATH=data/dapo_math/aime_eval.jsonl \
python -m unirl.train_ar --config-name=ar/qwen3_cppo_30b_a3b_base_dapo_sglang num_devices=128
```

The model defaults to Qwen/Qwen3-30B-A3B-Base; set QWEN3_PATH to a local checkpoint dir to
avoid downloading at runtime. The MoE + cluster knobs in the recipe (num_devices, tp_size,
mem_fraction_static) are starting points to tune for your hardware.

- DRPO is the closest sibling: same DPPO Binary-TV trust region, but DRPO smooths the hard mask into an advantage-weighted quadratic regularizer, whereas CPPO keeps a hard keep/reject mask and instead makes the threshold position-weighted and cumulative.
- CPPO: "Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning", arXiv:2606.10968: mask Eq. 8-11, Algorithm 1, dynamic δ_b Eq. 22, experiments §4.
- DPPO: Qi et al., "Rethinking the Trust Region in LLM Reinforcement Learning", arXiv:2602.04879.
