
Improving Medical Multi-modal Contrastive Learning with Expert Annotations

Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15078)


Abstract

We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the “modality gap” – a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model’s learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP’s capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
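
The mixup augmentation and the alignment/uniformity measures mentioned in the abstract are standard building blocks that can be sketched concretely. The snippet below is a minimal, illustrative sketch under stated assumptions, not the authors' implementation: the encoder outputs, the heatmap-derived embeddings, and the Beta mixing parameter `alpha` are hypothetical placeholders, while the alignment and uniformity definitions follow the commonly used formulation of Wang and Isola (2020).

```python
# Illustrative sketch only: how expert eye-gaze heatmaps could be folded into a
# CLIP-style batch via embedding-level mixup, and how the alignment/uniformity
# metrics named in the abstract are typically computed. Placeholder names, not
# the authors' API.
import torch
import torch.nn.functional as F

def mixup_embeddings(z_img, z_heat, alpha=0.4):
    """Interpolate plain image embeddings with embeddings of heatmap-masked
    images, so a few expert-annotated samples can augment a whole batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample((z_img.size(0), 1)).to(z_img.device)
    z_mix = lam * z_img + (1.0 - lam) * z_heat
    return F.normalize(z_mix, dim=-1)

def alignment(z_a, z_b):
    """Alignment: mean squared distance between paired (image, text) embeddings."""
    return (z_a - z_b).norm(dim=1).pow(2).mean()

def uniformity(z, t=2.0):
    """Uniformity: log of the mean Gaussian potential over all embedding pairs."""
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

# Random unit-norm embeddings standing in for encoder outputs.
z_img = F.normalize(torch.randn(32, 512), dim=-1)
z_heat = F.normalize(torch.randn(32, 512), dim=-1)
z_txt = F.normalize(torch.randn(32, 512), dim=-1)
z_aug = mixup_embeddings(z_img, z_heat)
print(alignment(z_aug, z_txt), uniformity(z_aug))
```

Interpolating a handful of heatmap-conditioned embeddings into every batch is one plausible way for scarce expert annotations to influence all training pairs, which matches the intuition the abstract describes; the paper's actual heatmap processor and training procedure are detailed in the full text.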



Acknowledgements

This work was supported by the Research Council of Finland (Flagship programme: Finnish Center for Artificial Intelligence FCAI, and grants 352986, 358246) and the EU (H2020 grant 101016775 and NextGenerationEU). We acknowledge the computational resources provided by the Aalto Science-IT project. We acknowledge CSC for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through Finland.

Author information


Corresponding author

Correspondence to Yogesh Kumar.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 1089 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kumar, Y., Marttinen, P. (2025). Improving Medical Multi-modal Contrastive Learning with Expert Annotations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_27


  • DOI: https://doi.org/10.1007/978-3-031-72661-3_27


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72660-6

  • Online ISBN: 978-3-031-72661-3

  • eBook Packages: Computer Science, Computer Science (R0)
