Exciting New Research on Multimodal Retrieval-Augmented Generation (MRAG)!

I just finished reading a fascinating survey paper on Multimodal Retrieval-Augmented Generation (MRAG) from researchers at Huawei Cloud. This cutting-edge technology represents a significant advancement in enhancing large language models by integrating multimodal data like text, images, and videos into both retrieval and generation processes.

Traditional Retrieval-Augmented Generation (RAG) systems primarily rely on textual data, which limits their ability to leverage rich contextual information available in multimodal sources. MRAG addresses this limitation by extending the RAG framework to include multimodal retrieval and generation, enabling more comprehensive and contextually relevant responses.

The paper outlines the evolution of MRAG through three distinct stages:

>> MRAG1.0 ("Pseudo-MRAG")
This initial stage extended RAG by converting multimodal data into textual representations. The architecture consisted of three key components:
- Document Parsing and Indexing: Processing multimodal documents using OCR and specialized models to generate captions for images and videos
- Retrieval: Using vector embeddings to find relevant information
- Generation: Synthesizing responses using LLMs
While effective, this approach suffered from information loss during modality conversion and retrieval bottlenecks.

>> MRAG2.0 ("True Multimodal")
This stage preserved original multimodal data within the knowledge base and leveraged Multimodal Large Language Models (MLLMs) for direct processing. Key improvements included:
- Using unified MLLMs for captioning instead of separate models
- Supporting cross-modal retrieval to minimize data loss
- Employing MLLMs for generation to directly process multimodal inputs

>> MRAG3.0 (Advanced Integration)
The latest evolution introduces:
- Enhanced document parsing that retains document screenshots to minimize information loss
- Multimodal Search Planning that optimizes retrieval strategies through retrieval classification and query reformulation
- Multimodal output capabilities that combine text with images, videos, or other modalities in responses

The technical architecture includes sophisticated components like multimodal retrievers (using single/dual-stream and generative structures), rerankers (fine-tuning or prompting-based), and refiners (hard or soft prompt methods) to optimize the information flow.

What's particularly impressive is how MRAG outperforms traditional text-modal RAG in scenarios where both visual and textual information are critical for understanding and responding to queries. The researchers have systematically analyzed essential components, datasets, evaluation methods, and current limitations to provide a comprehensive understanding of this promising paradigm.
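To make the MRAG1.0 ("pseudo-MRAG") flow concrete, here is a minimal Python sketch of parse, index, retrieve, and prompt assembly. Everything in it is an illustrative assumption rather than the survey's code: `Doc`, `embed`, and `retrieve` are hypothetical names, the file names are made up, the captions are hand-written stand-ins for OCR or captioning-model output, and the bag-of-words "embedding" stands in for a real vector encoder.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Doc:
    source: str    # hypothetical file name
    modality: str  # "text", "image", or "video"
    text: str      # raw text, or a caption standing in for the media


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[Doc], k: int = 2) -> list[Doc]:
    """Rank every indexed item by similarity between the query and its text."""
    q = embed(query)
    return sorted(index, key=lambda d: cosine(q, embed(d.text)), reverse=True)[:k]


# Images and videos enter the index only as captions (the MRAG1.0 idea).
index = [
    Doc("manual.pdf", "text", "replace the filter every six months"),
    Doc("fig3.png", "image", "diagram of the filter housing with a red latch"),
    Doc("demo.mp4", "video", "a person opens the lid and pours water"),
]

hits = retrieve("where is the latch on the filter housing", index, k=2)

# The generation step would send this prompt to an LLM.
prompt = "Answer using only this context:\n" + "\n".join(
    f"[{d.modality}] {d.text}" for d in hits
)
```

The image is reachable only through its caption, so any detail the caption leaves out is invisible to retrieval. That is the information loss during modality conversion that the post attributes to this stage.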
Improving Multimodal Model Performance
Summary
Improving multimodal model performance means making AI systems better at understanding and generating information across multiple types of data—like text, images, audio, and video—at the same time. This involves refining how models process, retrieve, and combine these different modalities so their responses are more accurate and relevant.
- Prioritize cross-modal alignment: Continually check and reinforce consistency between the model's intermediate reasoning and its final output across both text and images.
- Refine retrieval strategies: Focus on smarter ways to gather and organize information, including building connections between data types instead of treating them separately.
- Customize training: Adapt models with your own domain-specific data to improve their ability to handle unique queries and tasks without needing oversized, general-purpose systems.
-
Most multimodal QA systems fail in the same place. Not at perception. Not at language. But at how evidence is retrieved and constrained before generation. This paper on Pythia-RAG makes that failure mode very clear. Even strong vision-language models still rely on captions and flat retrieval. In dense scenes, that silently drops relations. The model may see the objects, but it never commits to who is doing what to whom, so those relations disappear before reasoning even starts. What’s different here is that relations are treated as first-class structure rather than something the model is expected to infer implicitly. Textual relations are extracted as explicit triplets, visual relations are extracted directly from images using relation-aware detection, and both are unified into a single multimodal knowledge graph that is further grounded with external common-sense knowledge. Retrieval is also structural, not similarity-based. Instead of pulling isolated facts, the system retrieves a query-guided subgraph using a graph algorithm that explicitly optimizes for relevance and cohesion. That subgraph is then encoded in two complementary ways: structurally through a graph encoder and textually through an LLM, while the associated image is encoded in parallel. These representations are fused with attention and only then passed to generation. Hallucinations don’t drop here because the language model is more careful or better prompted. They drop because generation is no longer free-form. The model is forced to operate over a constrained, relationally coherent slice of the multimodal graph. The larger takeaway is subtle but important. Multimodal reasoning doesn’t fail because models lack modalities. It fails because relations are implicit, retrieval is flat, and structure is introduced too late. Once relations are explicit and retrieval preserves topology, generation becomes a consequence of reasoning rather than a guess.
That feels like a meaningful shift for multimodal RAG, especially in settings where confident answers without grounding are the real failure mode. #MultimodalAI #RetrievalAugmentedGeneration #KnowledgeGraphs #GraphReasoning #MultimodalQA #AIResearch #LLMs #VisionLanguage #StructuredReasoning #GraphNeuralNetworks #TrustworthyAI
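As a rough illustration of what "retrieve a connected subgraph instead of isolated facts" can look like, here is a small Python sketch. The post does not name the graph algorithm the paper uses, so this swaps in a much simpler stand-in: seed on entities mentioned in the query, then keep only the triplets on a shortest path between consecutive seeds. The triplets and the function names (`neighbors`, `shortest_path_triplets`, `retrieve_subgraph`) are hypothetical.

```python
from collections import deque

# (subject, relation, object) triplets; made-up scene for illustration.
TRIPLETS = [
    ("man", "holding", "umbrella"),
    ("umbrella", "covering", "child"),
    ("child", "standing on", "sidewalk"),
    ("dog", "sitting on", "bench"),
    ("bench", "next to", "sidewalk"),
]


def neighbors(triplets):
    """Undirected adjacency map: node -> list of (neighbor, triplet)."""
    adj = {}
    for t in triplets:
        s, _, o = t
        adj.setdefault(s, []).append((o, t))
        adj.setdefault(o, []).append((s, t))
    return adj


def shortest_path_triplets(adj, start, goal):
    """Breadth-first search; returns the triplets along one shortest path."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt, t in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = (node, t)
                queue.append(nxt)
    if goal not in prev:
        return []
    path = []
    node = goal
    while prev[node] is not None:
        node, t = prev[node]
        path.append(t)
    return path[::-1]


def retrieve_subgraph(query, triplets):
    """Seed on entities named in the query, then connect consecutive seeds."""
    adj = neighbors(triplets)
    words = set(query.lower().replace("?", "").split())
    seeds = [n for n in adj if n in words]
    kept = []
    for a, b in zip(seeds, seeds[1:]):
        for t in shortest_path_triplets(adj, a, b):
            if t not in kept:
                kept.append(t)
    return kept


subgraph = retrieve_subgraph("Is the man protecting the child?", TRIPLETS)
```

Here the result keeps the man-umbrella-child chain and drops the unrelated dog and bench facts, so the relation linking the two query entities survives retrieval. A similarity-only retriever could return "man" and "child" facts separately and lose the umbrella that connects them.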
-
ByteDance, in collaboration with researchers from Peking University, Princeton, and other institutions, has unveiled a significant advancement in multimodal AI on Hugging Face! They've introduced MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation. This work addresses a critical issue where current "thinking-aware" models can ironically degrade performance due to errors propagating between generated reasoning and the final image. MMaDA-Parallel tackles this by proposing a novel parallel multimodal diffusion framework. Instead of sequential processing, it enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. This new paradigm ensures a much stronger alignment between the model's internal reasoning and the visual output. The approach is further optimized by Parallel Reinforcement Learning (ParaRL), applying semantic rewards at each step to enforce cross-modal consistency. To thoroughly evaluate these improvements, the team also developed ParaBench, a new benchmark specifically designed to assess the alignment between generated reasoning and image outputs. MMaDA-Parallel demonstrates impressive results, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel. This sets a more robust standard for thinking-aware image synthesis. Explore the paper, try out the models, and delve into the new benchmark to see how parallel multimodal diffusion is enhancing image generation and editing! --- Paper: https://lnkd.in/epbZ7pVD Models (8B): MMaDA-Parallel-A: https://lnkd.in/eJf4wKaQ MMaDA-Parallel-M: https://lnkd.in/eUp4KSMm ParaBench Dataset: https://lnkd.in/egAfh2fx Project Page: https://lnkd.in/ey5wKT4x
-
📈 I've written a new blog post on training and finetuning multimodal embedding & reranker models with Sentence Transformers. As a practical example, I finetuned Qwen3-VL-Embedding-2B for Visual Document Retrieval (matching text queries to document screenshots with charts, tables, and layouts intact). Details: After 1 epoch on 10k query-image pairs, the model went from NDCG@10 of 0.888 to 0.947 on my eval set. That's ahead of every other VDR model I tested against, including ones up to 4x larger. Finetuning on your own domain is very often worth it over reaching for a bigger general-purpose model. I also wrapped the loss in MatryoshkaLoss, so you can truncate the 2048-dim embeddings at deployment time. The finetuned model stays within 0.3% of peak down to 512 dims (4x smaller), and retains 92.4% of peak even at 64 dims (32x smaller). For context, the base model already fell to 76.5% of its (lower) peak at 64 dims. The best part: the training script is nearly identical to a text-only one. The data collator automatically calls model.preprocess(), which detects the modality of each input (text, image, audio, video, or mixed) and applies the right preprocessing. No manual tokenization or image processing needed. The blog also walks through training multimodal Cross Encoder (reranker) models, with a few architectural options like Any-to-Any + LogitScore or Feature Extraction + Pooling + Dense. Read the full blog, or just point your Agent at the URL: https://lnkd.in/ewDQJME6
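For readers who have not used Matryoshka-style embeddings, the deployment-time step is just "keep the first d dimensions and renormalize". The sketch below shows that step in plain Python with made-up 8-dimensional vectors standing in for the 2048-dimensional model outputs. It is not the Sentence Transformers API, and `truncate_and_normalize` is a hypothetical helper name.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(a, b):
    # Inputs are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors; a Matryoshka-trained model packs the most useful
# information into the leading dimensions.
query = [0.9, 0.1, 0.3, 0.0, 0.05, 0.02, 0.01, 0.03]
doc_a = [0.8, 0.2, 0.4, 0.1, 0.01, 0.04, 0.02, 0.01]
doc_b = [0.1, 0.9, 0.0, 0.3, 0.02, 0.01, 0.05, 0.02]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    scores = {
        name: cosine(q, truncate_and_normalize(d, dim))
        for name, d in (("doc_a", doc_a), ("doc_b", doc_b))
    }
    print(dim, {k: round(v, 3) for k, v in scores.items()})
```

Truncation only preserves ranking quality when the model was trained for it, which is what wrapping the loss in MatryoshkaLoss is for. Storage and dot-product cost shrink in proportion to the kept dimensions: 2048 to 512 is 4x smaller, and 2048 to 64 is 32x smaller.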
-
What if we've been overcomplicating the "modality gap" in CLIP? 🤯 For years, this gap has been a known bottleneck: images and texts live in separate, disconnected neighborhoods of the embedding space. A new paper on AlignCLIP suggests a simpler path. Instead of complex loss functions, they use two elegant architectural tricks to bridge the divide: 1️⃣ Share Parameters: Instead of two separate encoders, it uses a single Transformer encoder for both vision and language. This forces the model to learn one unified set of parameters that can process both modalities, preventing it from developing the divergent, specialized logics that cause the gap. 2️⃣ Separate & Mix: It introduces an intra-modal contrastive loss. This new objective actively pushes vectors of the same modality (e.g., image-vs-image) away from each other. This acts as an anti-clustering force, preventing dense, uni-modal groups from forming and forcing the two sets of embeddings to spread out and intermingle. The benefits are immediate and impressive: ▪️ The modality gap is significantly reduced, with alignment scores jumping up to 0.25 on MSCOCO. ▪️ Zero-shot performance is maintained or even improved, showing there's no trade-off. ▪️ CLIP's famous robustness is preserved, a critical feature for real-world applications. The breakthrough is the insight: elegant architectural tweaks can solve bottlenecks that complex losses can't. ⚡ This has huge implications for how we build the next generation of multi-modal models. It proves that sometimes, the simplest solutions are the most powerful.
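To make "modality gap" and "alignment" measurable, here is a small Python sketch of two commonly used diagnostics: the distance between the image and text centroids, and the mean cosine similarity of matched image-text pairs. These are assumptions for illustration and may not match the exact metrics reported in the AlignCLIP paper. The 2-dimensional vectors are made up.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the two modalities."""
    ci = centroid([normalize(v) for v in image_embs])
    ct = centroid([normalize(v) for v in text_embs])
    return math.dist(ci, ct)


def mean_pair_alignment(image_embs, text_embs):
    """Average cosine similarity of matched (image_i, text_i) pairs."""
    sims = [
        sum(a * b for a, b in zip(normalize(i), normalize(t)))
        for i, t in zip(image_embs, text_embs)
    ]
    return sum(sims) / len(sims)


images = [[1.0, 0.1], [0.9, 0.2]]
texts_far = [[0.1, 1.0], [0.2, 0.9]]    # texts in their own neighborhood
texts_near = [[1.0, 0.2], [0.9, 0.1]]   # texts intermingled with images

gap_far = modality_gap(images, texts_far)
gap_near = modality_gap(images, texts_near)
align_far = mean_pair_alignment(images, texts_far)
align_near = mean_pair_alignment(images, texts_near)
```

With the "far" texts the centroids sit apart and pair similarity is low. With the "near" texts the gap shrinks and pair similarity rises, which is the direction of change the post describes.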
-
Excited to share our latest work on multimodal pre-training: MoMa! When working with early fusion mixed-modal models (natively multimodal in/out), Chameleon showed the text LLM architecture (e.g. Llama) doesn’t scale to early fusion. While Chameleon proved the significant quality and efficiency benefits of early fusion models, a core issue remains: text and image tokens carry inherently different information, yet they are treated uniformly. We propose MoMa (Mixture of Modality-aware experts), a novel adaptive architecture that explores adaptivity across modality, width, and depth, delivering up to 4x FLOPs savings (4x smaller or 4x cheaper for the same quality)! Building on the success of MoE in text LLMs (e.g. Mixtral, Deepseek-MoE, Grok), MoMa uses a MoE (mixture-of-experts) block for each modality, and we show that for early fusion LLMs this is much more efficient than dense or MoE alternatives. We further show empirical scaling of MoMa, extensions with MoD (mixture-of-depths), and upcycling, which together form our core training recipe. MoMa is our first step in rethinking the core architecture primitives for multimodal early fusion LLMs, and we believe there’s a lot more potential in further exploring adaptive compute. This is joint work with amazing co-first authors Victoria Lin and Armen Aghajanyan, and co-authors Liang Luo, Srini Iyer, Mike Lewis, Gargi Ghosh and Luke Zettlemoyer. Paper: https://lnkd.in/gvKBj8SQ Twitter: https://lnkd.in/gj8TG7iR
-
Oncology data is inherently multimodal—combining radiology scans, pathology slides, genomics, and clinical notes. Yet, most AI models are trapped in single-modality silos, missing the complete biological picture and underutilizing complementary information. Integrating these heterogeneous data types into a unified patient representation is complex due to fragmented tools and rigid code dependencies. 𝘼𝙖𝙠𝙖𝙨𝙝 𝙏𝙧𝙞𝙥𝙖𝙩𝙝𝙞 𝙚𝙩 𝙖𝙡. published a comprehensive solution, 𝙃𝙊𝙉𝙚𝙔𝘽𝙀𝙀 (Harmonized ONcologY Biomedical Embedding Encoder). This open-source framework generates and integrates patient-level embeddings using domain-specific foundation models. Here are the key innovations from their evaluation of over 11,400 patients across 33 cancer types: • 𝙐𝙣𝙞𝙛𝙞𝙚𝙙 𝙈𝙪𝙡𝙩𝙞𝙢𝙤𝙙𝙖𝙡 𝙋𝙞𝙥𝙚𝙡𝙞𝙣𝙚: 𝙃𝙊𝙉𝙚𝙔𝘽𝙀𝙀 processes five distinct data types—clinical text, pathology reports, radiologic images, whole slide images (WSIs), and molecular profiles—through specialized preprocessing pipelines. Crucially, its modular design accommodates patients with missing data modalities without requiring complete-case cohorts. • 𝙏𝙝𝙚 𝙋𝙤𝙬𝙚𝙧 𝙊𝙛 𝘾𝙡𝙞𝙣𝙞𝙘𝙖𝙡 𝙏𝙚𝙭𝙩: In an interesting reality check, clinical embeddings derived from structured and unstructured data actually showed the strongest single-modality performance, achieving 98.5% classification accuracy and the highest overall survival prediction concordance indices. The authors note this reflects the expert-curated nature of clinical documentation in datasets like TCGA, which effectively summarizes information dispersed across other raw modalities. • 𝙁𝙪𝙨𝙞𝙤𝙣 𝙁𝙤𝙧 𝙎𝙪𝙧𝙫𝙞𝙫𝙖𝙡: While clinical data dominated, multimodal fusion strategies (such as concatenation and Kronecker product) provided critical complementary benefits. For specific cancers, fusing information from molecular, pathology, and imaging modalities significantly improved overall survival predictions beyond what clinical features could capture alone. 
• 𝙇𝙇𝙈𝙨 𝙋𝙪𝙩 𝙏𝙤 𝙏𝙝𝙚 𝙏𝙚𝙨𝙩: The team compared four large language models to evaluate text embeddings. They found that general-purpose models (like Qwen3) actually outperformed specialized medical models (like GatorTron) on standard clinical text. However, task-specific fine-tuning proved essential across all models to achieve high performance on messy, heterogeneous data like pathology reports. 𝙏𝙝𝙚 𝙏𝙖𝙠𝙚𝙖𝙬𝙖𝙮: The future of precision oncology relies not just on building individual foundation models, but on creating scalable, open-source infrastructure that can standardize and unify these distinct representations into a cohesive clinical picture. https://lnkd.in/gs_CHduM --- Keeping up with the literature is increasingly a team sport. This analysis was supported by NotebookLM and grounded in my own review and experience. If you found this useful, let me know in the comments. If it missed the mark, I want that feedback too. Weekly briefings on making vision AI work in the real world → https://lnkd.in/guekaSPf
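The two fusion strategies named in the post, concatenation and Kronecker product, are easy to show on toy vectors. The Python sketch below is a generic illustration and not the HONeYBEE implementation. The function names and the 2- and 3-dimensional "embeddings" are made up.

```python
def concat_fusion(embeddings):
    """Concatenate per-modality vectors into one patient-level vector."""
    fused = []
    for vec in embeddings:
        fused.extend(vec)
    return fused


def kronecker_fusion(a, b):
    """Kronecker product of two vectors, flattened row by row.

    Every pairwise product a[i] * b[j] becomes one feature, so the
    output has len(a) * len(b) entries.
    """
    return [x * y for x in a for y in b]


clinical = [0.5, 2.0]          # made-up 2-dim embedding
pathology = [1.0, 0.0, -1.0]   # made-up 3-dim embedding

concat = concat_fusion([clinical, pathology])
kron = kronecker_fusion(clinical, pathology)
```

Concatenation grows linearly (2 + 3 = 5 features) and leaves any interaction between modalities for the downstream model to learn. The Kronecker product grows multiplicatively (2 x 3 = 6 features here, far more at real embedding sizes) but encodes every pairwise interaction explicitly.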
-
🍲 Create your LLM soup Model merging is a cheap way to boost performance by combining the weights of multiple checkpoints without retraining. The most straightforward technique is a linear average, also called "model soup". Meta released a new paper that describes how to pick the weights for your soup! → Most models excel at different things, even within the same benchmark. The paper exploits these "category experts" by finding benchmark categories whose scores are only weakly correlated, picking the expert model for each, and combining them with optimized weights (instead of 50-50 blends). → The method achieves SOTA on Berkeley Function Calling Leaderboard with 80.68% accuracy (70B models) by averaging just 3-4 models. This is 2.7% higher than the previous best and significantly higher than uniform averaging. → Model soups are also more consistent across tasks: performance correlations between categories jump significantly, meaning you're less likely to get a model that's great at tool calling but terrible at multi-turn conversations. → The approach scales beyond function calling: gains appear on multilingual math (MGSM) and long-context tasks (∞-Bench), though benefits are not as clear when benchmarks lack clear "expert" candidates or show high correlation between categories. The core idea is nothing new (already done many times in the open-source community), but the results are pretty compelling. It's also interesting to see that it doesn't work for FLORES-36 translation, where all models perform similarly across languages. Have fun merging models!
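The difference between a uniform soup and a weighted one fits in a few lines. This Python sketch averages toy "checkpoints" stored as plain dicts of float lists. It assumes all checkpoints share the same architecture and parameter names, and it takes the mixing weights as given. How the paper searches for those weights is not shown here, and `weighted_soup` is a hypothetical helper name.

```python
def weighted_soup(checkpoints, weights):
    """Average parameters across checkpoints with the given weights.

    checkpoints: list of dicts mapping parameter name -> list of floats.
    weights: one non-negative number per checkpoint; normalized to sum to 1.
    """
    if len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    soup = {}
    for name in checkpoints[0]:
        # Walk the same parameter position across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(w * x for w, x in zip(norm, col)) for col in columns]
    return soup


ckpt_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_b = {"layer.weight": [3.0, 6.0], "layer.bias": [4.0]}

uniform = weighted_soup([ckpt_a, ckpt_b], [1, 1])  # the classic 50-50 soup
tilted = weighted_soup([ckpt_a, ckpt_b], [3, 1])   # 75% ckpt_a, 25% ckpt_b
```

With real models the same loop runs over tensors in each state dict instead of Python lists. No retraining is involved; the only thing that changes between soups is the weights.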
-
A new paper introduces Uni-MoE, a large multimodal language model that utilizes a Mixture of Experts (#MoE) architecture to process multiple data modalities like images, speech, video, and text efficiently. Key aspects include: - Modality-specific encoders and connectors map different input modalities into a unified language representation space. - A sparse MoE layer activates only a subset of expert components for each input, enabling efficient scaling. - A three-stage progressive training approach: 1) Cross-modality alignment 2) Training modality-specific experts 3) Tuning the unified multimodal model. Evaluations on multimodal benchmarks for speech recognition, video question-answering, and audio captioning tasks showed Uni-MoE outperforming dense multimodal models like InstructBLIP and Macaw-LLM. The paper demonstrates the potential of using MoE architectures for powerful multimodal AI systems that can understand and process different data modalities efficiently. Learn more about this paper: https://lnkd.in/gFtNSCHg
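The phrase "activates only a subset of expert components for each input" usually means top-k routing. The Python sketch below shows that generic mechanism on toy data. It is not Uni-MoE's code: the router logits are hard-coded here (a real model computes them with a learned layer), the "experts" are simple lambdas instead of feed-forward networks, and the function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gates."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    chosen = ranked[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Four toy "experts"; real ones would be feed-forward networks.
experts = [
    lambda v: [2 * a for a in v],
    lambda v: [a + 1 for a in v],
    lambda v: [-a for a in v],
    lambda v: [0.0 for _ in v],
]

logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token
routes = top_k_route(logits, k=2)
y = moe_forward([1.0, 3.0], experts, logits, k=2)
```

Only two of the four experts run for this token, so compute per token stays roughly constant as more experts are added. That is the efficiency argument behind sparse MoE layers.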