Feature Summary
An omni foundational image-generation model for seamless multimodal generation and understanding.
Detailed Description
The models linked below are much more compelling than the massive, multi-billion-parameter models that only a tiny fraction of users can run.
https://github.com/Alpha-VLLM/Lumina-DiMOO
"We introduce Lumina-DiMOO, an open-source foundational model for seamless multimodal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-diffusion paradigms and adeptly support a broad spectrum of multimodal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding."
https://github.com/tyfeld/MMaDA-Parallel
"While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory."
Alternatives you considered
No response
Additional context
No response