LMOD

A Large Multimodal Ophthalmology Dataset and Benchmark for Vision-Language Models

Zhenyue Qin * 1, Yu Yin * 2, Dylan Campbell 3, Xuansheng Wu 4,
Ke Zou 5, Yih-Chung Tham 5, Ninghao Liu 4, Xiuzhen Zhang 6, Qingyu Chen 1
1 Yale University, 2 Imperial College London, 3 Australian National University,
4 University of Georgia, 5 National University of Singapore,
6 RMIT University

*Equal contribution
Correspondence: qingyu.chen@yale.edu
NAACL 2025

Overview of our data processing and evaluation pipeline for assessing the performance of large vision–language models (LVLMs) on box-annotated ophthalmology images.

👋 Overview

This study proposes a systematic and reproducible data and evaluation pipeline that repurposes existing datasets to curate LMOD (Large Multimodal Ophthalmology Dataset) for the development and evaluation of LVLMs in ophthalmology.

  1. We introduce LMOD, a large-scale ophthalmology dataset that includes over 21K images across diverse imaging modalities. LMOD is richly annotated with disease labels and bounding boxes, supporting comprehensive evaluation from macro-level diagnosis analysis down to fine-grained anatomical recognition.
  2. We systematically benchmark 13 state-of-the-art (SoTA) LVLMs, including models with diverse visual backbones and LLMs. The evaluation uses a wide range of metrics, assessing the strengths and weaknesses of LVLMs from multiple perspectives.
  3. Through fine-tuning and supervised classification, we demonstrate that while the challenges posed by ophthalmic image analysis are intricate for LVLMs, they are not insurmountable. Our comprehensive evaluations and error analysis provide both a high-level overview and detailed insights, presented through various result formats, including weighted averages, bar charts, radar charts, and visual illustrations, to highlight the key strengths and weaknesses (see the metric-aggregation sketch below).
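As a concrete illustration of the weighted-average reporting mentioned in point 3, the sketch below computes per-category precision, recall, and F1 and aggregates them into sample-weighted averages. The record format and the helper name `per_category_metrics` are illustrative assumptions for this page, not the benchmark's actual evaluation code.

```python
# Minimal sketch of per-category metric aggregation with sample-weighted averages.
# The (category, y_true, y_pred) record format is an assumption for illustration.
from collections import defaultdict

def per_category_metrics(records):
    """records: iterable of (category, y_true, y_pred) with binary labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "n": 0})
    for category, y_true, y_pred in records:
        c = counts[category]
        c["n"] += 1
        if y_pred and y_true:
            c["tp"] += 1
        elif y_pred and not y_true:
            c["fp"] += 1
        elif not y_pred and y_true:
            c["fn"] += 1

    metrics = {}
    for category, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        metrics[category] = {"precision": precision, "recall": recall, "f1": f1, "n": c["n"]}

    # Weighted average: each category's score is weighted by its number of samples.
    total = sum(m["n"] for m in metrics.values())
    weighted = {
        key: sum(m[key] * m["n"] for m in metrics.values()) / max(total, 1)
        for key in ("precision", "recall", "f1")
    }
    return metrics, weighted
```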

LMOD Benchmark

🔍Related Work

Comparison of existing general-domain and ophthalmology-specific benchmarks for evaluating large vision-language models, highlighting their supported modalities, coverage of image types, and evaluation perspectives.


Experimental Results

🎉Main Results

Performance comparison of state-of-the-art large vision-language models on the LMOD benchmark, evaluating their capabilities in anatomical recognition and diagnosis analysis. The best-performing model for each metric is highlighted in bold. Fine-tuned results were obtained by fine-tuning a LLaVA-Med model.
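For the anatomical-recognition results above, a common way to score predicted regions against LMOD's bounding-box annotations is IoU-based matching. The sketch below is a minimal illustration under that assumption; the (x1, y1, x2, y2) box format, the 0.5 threshold, and the greedy matching rule are our simplifications, not necessarily the benchmark's exact protocol.

```python
# Minimal sketch: match LVLM-predicted boxes to ground-truth box annotations
# via intersection-over-union (IoU), then count true/false positives and misses.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_predictions(pred_boxes, gt_boxes, threshold=0.5):
    """Greedily match predictions to ground truth; returns (tp, fp, fn)."""
    unmatched_gt = list(gt_boxes)
    tp = 0
    for pred in pred_boxes:
        best = max(unmatched_gt, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched_gt.remove(best)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(unmatched_gt)
    return tp, fp, fn
```

The resulting counts can be fed into the per-category aggregation sketched earlier to produce precision, recall, and F1 per image category.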

🏅 Best-performing Model Comparison

Comparison of the top five best-performing models for each evaluation metric (precision, recall, F1, and PPV) across different medical image categories.

Cite Our Work

BibTeX

@inproceedings{2025_lmod,
  title={LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models},
  author={Qin, Zhenyue and Yin, Yu and Campbell, Dylan and Wu, Xuansheng and Zou, Ke and Tham, Yih-Chung and Liu, Ninghao and Zhang, Xiuzhen and Chen, Qingyu},
  booktitle={NAACL: Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics},
  year={2025}
}