Spent about $3.50 on a single RTX 4090 to figure out why recent papers on hybrid AR + diffusion language models keep contradicting each other. Some report that adding an autoregressive planner improves diffusion-model reasoning. Others report that it degrades it. Both are right, in different projections.

Starting observation: on LLaDA-8B, prepending the literal string "Plan: " to a GSM8K question costs 8pp of accuracy. No plan content. Just the word.

That single number forced a decomposition. Hybrid AR/DDLM reasoning fails along at least three orthogonal axes:

1. Interface-format brittleness: how much accuracy drops from any plan-shaped scaffold, content-free or otherwise.
2. Planner-content trust: how much the model uses upstream plan content once the prefix shape is absorbed.
3. Sampling-diversity preservation: whether fine-tuning collapses or expands the stochastic branches that consensus mechanisms rely on.

A small (r=8) prefix-robustness LoRA flattens axis 1 from 8pp damage to within 1pp. Axis 2 turns out to be capacity-dependent in opposite directions across planner sizes (previously unmeasured). Axis 3 unexpectedly expanded under format-augmented training rather than collapsing, the inverse of the standard encoder-collapse story.

The consensus-distillation track was the most instructive part. A late-block LoRA designed to distill majority-vote into a single forward pass plateaued at 70.5% across a 3.25x capacity bump. Looked like architectural impossibility. It wasn't: two design errors were masking each other. Fixing both recovered accuracy to 79%, within sampling error of target.

Generalizable lesson: parameter-efficient distillation of sampling-based inference mechanisms requires the surgery to match the temporal structure of the original mechanism. A plateau across capacity is not, by itself, evidence the distillation is impossible.

Workshop-scope, not main-conference. Single seed, N=200, GSM8K only. Limitations flagged honestly in the appendix.
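
For readers calibrating "within sampling error" at N=200, here is a minimal Python sketch, not the code used in these experiments. It shows a plain majority vote over sampled answers (the kind of consensus mechanism being distilled) and a normal-approximation confidence half-width for a single accuracy estimate. The function names and the sample answers are hypothetical.

```python
import math
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def ci_halfwidth(acc, n, z=1.96):
    """Normal-approximation CI half-width for an accuracy measured on n items."""
    return z * math.sqrt(acc * (1.0 - acc) / n)


# Hypothetical final answers from five stochastic samples of one question.
samples = ["18", "18", "20", "18", "17"]
print(majority_vote(samples))

# Half-width of a 95% interval at 79% accuracy on N=200, in percentage points.
print(f"{100 * ci_halfwidth(0.79, 200):.1f} pp")
```

At N=200 and 79% accuracy the 95% half-width is roughly 5.6pp for a single, unpaired estimate. Comparisons run on the same 200 questions are paired, so treat this only as a rough guide to how much single-seed noise to expect.
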
Total compute under $4. If you work on hybrid AR/diffusion or parameter-efficient distillation, especially if you've seen similar prefix-shape damage on other DDLMs, I'd be interested to compare notes. #MachineLearning #LLM #DiffusionModels
