Interesting... looked into Apple's DiffuCoder and the masked diffusion approach is actually hitting SOTA parity... basically proof that global MDLM can work for code https://arxiv.org/pdf/2506.20639
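for intuition, here's the kind of training step MDLM implies... this is my own toy sketch, not DiffuCoder's actual code (the sizes, mask id, and the missing time-dependent loss weighting are all simplifications): mask a random fraction of tokens, run a fully bidirectional model over the noisy sequence, and score only the masked positions

```python
# toy sketch of an MDLM-style training step... not DiffuCoder's code,
# sizes and the MASK id are made up, and the real objective also has a
# time-dependent loss weighting that i'm skipping here
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, D = 1000, 999, 128           # hypothetical vocab / mask token / width

embed = nn.Embedding(VOCAB, D)
denoiser = nn.TransformerEncoder(            # bidirectional: no causal mask anywhere
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(D, VOCAB)

def mdlm_step(tokens):                       # tokens: (batch, seq_len) int64
    # sample a masking rate per sequence, then mask tokens independently
    rate = torch.rand(tokens.size(0), 1) * 0.8 + 0.2
    masked = torch.rand(tokens.shape) < rate
    noisy = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)

    # full attention over the noisy sequence -> every position sees every other
    logits = head(denoiser(embed(noisy)))

    # loss only on the positions we masked out
    return F.cross_entropy(logits[masked], tokens[masked])

batch = torch.randint(0, VOCAB - 1, (8, 64))  # stand-in for a batch of code tokens
print(mdlm_step(batch))
```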
but then you look at Tiny-A2D results and it’s the complete opposite... BD3LM (block diffusion) totally outperforms MDLM, and both of them still struggle hard compared to the AR baselines... https://github.com/ZHZisZZ/dllm/tree/main/examples/a2d
digging into the why and i think it comes down to the adaptation method... Tiny-A2D just SFT’d an AR model to force it into diffusion... asking a model wired for left-to-right causal attention to suddenly think bidirectionally is a massive shock... it struggles to unlearn that strong AR inductive bias
...that explains why BD3LM worked better in their case... since it generates in chunks it preserves some sequential order... acts like a bridge or crutch that feels more natural to the original Qwen weights
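rough sketch of what that "bridge" looks like as an attention mask (my own toy code, the block size is arbitrary, not BD3LM's actual setting)... inside a block everything attends to everything, across blocks it's still causal... push block_size to 1 and you recover plain AR attention, push it to the full sequence and you're at pure MDLM

```python
# toy attention mask showing why block diffusion is a softer ask for AR weights...
# bidirectional inside a block, causal across blocks
import torch

def block_attn_mask(seq_len, block_size):
    """True = attention allowed; entry [q, k] is query q attending to key k."""
    block = torch.arange(seq_len) // block_size            # block index per position
    same_block = block.unsqueeze(1) == block.unsqueeze(0)  # full attn within a block
    past_block = block.unsqueeze(1) > block.unsqueeze(0)   # causal across blocks
    return same_block | past_block

print(block_attn_mask(seq_len=8, block_size=4).int())
# block_size = 1        -> plain causal AR attention
# block_size = seq_len  -> full bidirectional attention (pure MDLM)
```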
contrast that with Apple... they didn't just SFT... they pre-trained/adapted on 130B tokens... fundamentally rewiring the model to understand global dependencies from the ground up
my theory: if we want MDLM to actually work, we can’t just SFT... we need that heavy adaptation or a full pre-training phase to break the causal priors... otherwise the model just gets confused