🧠 Why does DeepSeek-OCR not use Multi-Head Latent Attention (MLA)?
Hi DeepSeek team 👋,
First of all, thank you for releasing DeepSeek-OCR — it’s an impressive and elegant vision-to-text model.
While exploring the model architecture and configuration file (config.json), I noticed that Multi-Head Latent Attention (MLA) is not enabled in this OCR model.
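For reference, this is roughly the check I did, a minimal sketch assuming the decoder config uses DeepSeek-V2-style field names (kv_lora_rank, q_lora_rank, etc.); the exact keys shipped with DeepSeek-OCR may differ:

```python
import json

# Load the decoder config shipped with the checkpoint.
# The path is a placeholder; point it at the actual DeepSeek-OCR config.json.
with open("config.json") as f:
    cfg = json.load(f)

# In DeepSeek-V2/V3-style configs, MLA is signalled by latent-projection
# fields such as kv_lora_rank and q_lora_rank; plain MHA configs lack them.
mla_keys = ["kv_lora_rank", "q_lora_rank", "qk_rope_head_dim", "qk_nope_head_dim"]
present = {k: cfg[k] for k in mla_keys if k in cfg}

if present:
    print("MLA-style fields found:", present)
else:
    print("No MLA-specific fields found; attention appears to be standard MHA.")
```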
Questions
Could you please share some insights into why MLA was not used in DeepSeek-OCR?
- Was it due to compatibility issues between MLA and the vision encoder–decoder pipeline?
- Or did MLA not provide practical benefits in the OCR setting (e.g., shorter sequence lengths or the main bottleneck lying elsewhere)?
- Is there any plan to integrate MLA into future versions of DeepSeek-OCR to improve inference efficiency?
I’m asking because MLA has demonstrated significant efficiency gains in your other models (e.g., DeepSeek-V2/V3), and I’m curious about the reasoning behind excluding it here.
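To make the efficiency question concrete, here is a rough back-of-envelope comparison of per-token KV-cache size under standard MHA versus MLA. The layer count, head count, and latent rank below are illustrative assumptions, not DeepSeek-OCR's actual values:

```python
# Back-of-envelope KV-cache comparison (illustrative numbers only).
n_layers = 24          # assumed decoder depth
n_heads = 16           # assumed attention heads
head_dim = 128         # assumed per-head dimension
kv_lora_rank = 512     # assumed MLA latent dimension (c_KV)
qk_rope_head_dim = 64  # assumed decoupled RoPE key dimension
bytes_per_elem = 2     # bf16/fp16

# MHA caches full K and V for every head in every layer.
mha_per_token = n_layers * 2 * n_heads * head_dim * bytes_per_elem

# MLA caches only the compressed KV latent plus the shared RoPE key per layer.
mla_per_token = n_layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem

print(f"MHA KV cache per token: {mha_per_token / 1024:.1f} KiB")
print(f"MLA KV cache per token: {mla_per_token / 1024:.1f} KiB")
print(f"Reduction: {mha_per_token / mla_per_token:.1f}x")
```

Of course, if OCR decoding sequences are short, this cache saving may simply matter less than in long-context chat models, which is part of what I am asking about.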
Thanks again for your excellent work and for open-sourcing this project! 🙏
Hello,
We actually have an internal MLA-enabled version of DeepSeek-OCR.
The only reason it hasn’t been open-sourced yet is simply that I haven’t had the bandwidth to implement the code needed to convert the internal weights into the Hugging Face format.
Best regards
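For anyone wanting to experiment in the meantime, that kind of conversion is usually just a key-remapping pass over the checkpoint. A minimal, generic sketch follows; the file paths and name mapping are purely hypothetical and not DeepSeek's internal format:

```python
import torch
from safetensors.torch import save_file

# Hypothetical internal checkpoint path; replace with the real file.
state = torch.load("internal_checkpoint.pt", map_location="cpu")

# Hypothetical prefix remapping from internal names to HF-style names.
RENAME = {
    "decoder.": "model.layers.",
    "tok_emb.": "model.embed_tokens.",
}

def to_hf_name(name: str) -> str:
    # Rewrite the first matching prefix; leave unknown names untouched.
    for old, new in RENAME.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name

hf_state = {to_hf_name(k): v.contiguous() for k, v in state.items()}

# Hugging Face checkpoints are commonly stored as safetensors.
save_file(hf_state, "model.safetensors")
```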
Hi
Hello sir, I am currently in school (8th grade, from Pakistan) and learning basic ML. Any suggestions for me?