LMOD+

A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology

Zhenyue Qin*1, Yang Liu*2, Yu Yin3, Jinyu Ding1, Haoran Zhang1, Anran Li1, Dylan Campbell2, Xuansheng Wu4, Ke Zou5, Tiarnan D. L. Keenan6, Emily Y. Chew6, Zhiyong Lu6, Yih-Chung Tham5, Ninghao Liu4, Xiuzhen Zhang7, Qingyu Chen†1
1Yale University, 2Australian National University, 3Imperial College London, 4University of Georgia, 5National University of Singapore, 6National Institutes of Health, 7RMIT University
*Equal contribution
†Correspondence to: qingyu.chen@yale.edu

Vision-threatening eye diseases pose a major global health burden, affecting more than 2.2 billion people worldwide. In the United States alone, over 90 million people are at high risk for vision loss, yet many remain undiagnosed or are diagnosed too late for effective treatment. Up to 50% of patients with diabetic retinopathy do not receive timely eye examinations, highlighting critical gaps in screening and management.

While artificial intelligence offers promising solutions through multimodal large language models (MLLMs), a major challenge is the lack of unified, comprehensive benchmarks for ophthalmology. Most existing benchmarks were designed for earlier CNN-based models or focus on text-only tasks, failing to reflect real-world ophthalmic practice where medical imaging is indispensable.

This work presents LMOD+, a significantly enhanced version of our large-scale multimodal ophthalmology benchmark, comprising 32,633 images with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. Our key contributions include:

  • Comprehensive Dataset: Our dataset encompasses 32,633 high-quality images, featuring an extensive collection of color fundus photographs that covers diverse pathological conditions
  • Diverse Tasks: Comprehensive evaluation across 12 binary eye condition diagnosis tasks, multi-class disease diagnosis, severity classification, and demographic prediction to assess potential bias
  • Extensive Evaluation: Systematic assessment of 24 state-of-the-art MLLMs, including recent models from the InternVL, Qwen, and DeepSeek series
  • Public Resources: Full dataset release with dynamic leaderboard and evaluation pipeline to support ongoing benchmarking and model development

LMOD+ Dataset

Comparison of existing general-domain and ophthalmology-specific benchmarks for evaluating large vision-language models, highlighting their supported modalities, coverage of image types, and evaluation perspectives.

⚙️ Data Curation Pipeline

Our comprehensive data curation pipeline systematically processes ophthalmology datasets by extracting key clinical information, anatomical annotations, and diagnostic metadata. We leverage multimodal large language models to automatically generate diverse question-answer pairs spanning anatomical recognition, disease diagnosis, staging assessment, and patient demographic analysis.
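Below is a minimal sketch of the question-answer generation step described above. It assumes a per-image metadata record with modality, condition, and label fields; the file name `annotations.jsonl`, the field names, and the question templates are illustrative assumptions, not the released LMOD+ pipeline.

```python
import json

# Illustrative templates for the four task families mentioned above.
QUESTION_TEMPLATES = {
    "anatomy": "Which anatomical structures are visible in this {modality} image?",
    "diagnosis": "Does this {modality} image show signs of {condition}? Answer yes or no.",
    "staging": "What stage of {condition} does this {modality} image show?",
    "demographics": "What is the patient's {attribute} based on this {modality} image?",
}

def build_qa_pairs(record):
    """Expand one annotated image record into several question-answer pairs."""
    pairs = []
    for task, template in QUESTION_TEMPLATES.items():
        answer = record.get("labels", {}).get(task)  # ground-truth label, if the source dataset provides it
        if answer is None:
            continue  # skip tasks this source dataset does not annotate
        question = template.format(
            modality=record["modality"],
            condition=record.get("condition", "the condition"),
            attribute=record.get("attribute", "age"),
        )
        pairs.append({
            "image": record["image_path"],
            "task": task,
            "question": question,
            "answer": answer,
        })
    return pairs

if __name__ == "__main__":
    with open("annotations.jsonl") as f:  # hypothetical per-image metadata file
        records = [json.loads(line) for line in f]
    qa_pairs = [p for r in records for p in build_qa_pairs(r)]
    print(f"Generated {len(qa_pairs)} QA pairs from {len(records)} images")
```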

📊 Statistics

| Dataset | Imaging Modality | Images | Disease Focus |
|---|---|---|---|
| Cataract-1K | Surgical Scene | 2,256 | Cataract |
| Harvard FairSeg | Scanning Laser Ophthalmoscopy | 10,000 | Retinal |
| OIMHS | Optical Coherence Tomography | 3,859 | Macular |
| CAU001 | Lens Photography | 1,417 | N/A |
| Cataract Detection 2 | Lens Photography | 1,015 | Cataract |
| REFUGE | Color Fundus Photography | 1,200 | Glaucoma |
| IDRiD | Color Fundus Photography | 516 | Retinopathy |
| ORIGA | Color Fundus Photography | 650 | Glaucoma |
| G1020 | Color Fundus Photography | 1,020 | Glaucoma |
| BRSET | Color Fundus Photography | 16,249 | Multiple |

🖼️ Data Samples

This figure shows sample annotations from eight of the ophthalmology datasets used in our experiments, demonstrating the variety and complexity of regional markings: (a) Cataract-1K, (b) Harvard FairSeg, (c) OIMHS, (d) CAU001, (e) Cataract Detection 2, (f) REFUGE, (g) IDRiD, and (h) ORIGA.

Benchmark Results

📊 Ocular Anatomical Structure Recognition Performance

Comprehensive performance evaluation of leading multimodal large language models across diverse ophthalmic imaging techniques. The radar visualization illustrates how top-performing models achieve varying levels of precision, recall, F1-score, and hit rate across five critical imaging modalities in ophthalmology: surgical scenes, optical coherence tomography, color fundus photography, scanning laser ophthalmoscopy, and lens photography. This analysis reveals modality-specific strengths and limitations of current state-of-the-art models in medical image understanding.
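The exact metric definitions live in the benchmark code; the sketch below assumes each model answer is reduced to a set of predicted structure names and treats "hit rate" as the fraction of answers containing at least one correct structure, which is our reading rather than a stated definition.

```python
def set_metrics(predicted, gold):
    """Precision, recall, F1, and a hit indicator for one answer,
    treating the answer as a set of predicted structure names."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    hit = 1.0 if tp > 0 else 0.0  # "hit" if any predicted structure is correct (assumption)
    return {"precision": precision, "recall": recall, "f1": f1, "hit_rate": hit}

def aggregate(samples):
    """Average per-sample metrics over a list of (predicted, gold) set pairs."""
    scores = [set_metrics(p, g) for p, g in samples]
    return {k: sum(s[k] for s in scores) / len(scores) for k in scores[0]}

# Example: two OCT answers scored against ground-truth structure sets.
print(aggregate([
    ({"macula", "retina"}, {"macula", "choroid"}),
    ({"optic disc"}, {"optic disc"}),
]))
```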

🎯 Binary Eye Condition Diagnosis Performance

Comprehensive diagnostic accuracy assessment of 23 multimodal large language models across 12 distinct eye conditions. The heatmap visualization employs a color-coded accuracy scale from 0 to 1, where darker intensities represent superior diagnostic capabilities. This systematic evaluation demonstrates the varying proficiency of different models in accurately identifying specific ophthalmic conditions, highlighting both model-specific strengths and disease-specific diagnostic challenges in automated eye care diagnosis.
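For readers who want to reproduce this kind of view from their own runs, the sketch below assembles a model-by-condition accuracy matrix and renders it as a heatmap. The model names, conditions, and values are random placeholders, not LMOD+ results.

```python
import numpy as np
import matplotlib.pyplot as plt

models = ["Model A", "Model B", "Model C"]            # placeholder model names
conditions = ["glaucoma", "cataract", "retinopathy"]  # placeholder subset of the 12 conditions
rng = np.random.default_rng(0)
accuracy = rng.uniform(0.3, 0.8, size=(len(models), len(conditions)))  # random placeholder values

fig, ax = plt.subplots()
im = ax.imshow(accuracy, vmin=0.0, vmax=1.0, cmap="Blues")  # darker cells = higher accuracy
ax.set_xticks(range(len(conditions)), labels=conditions, rotation=45, ha="right")
ax.set_yticks(range(len(models)), labels=models)
fig.colorbar(im, ax=ax, label="Binary diagnosis accuracy")
fig.tight_layout()
plt.show()
```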

🔬 Multi-class Eye Disease Classification Analysis

Relationship analysis between model complexity and diagnostic performance in multi-class eye disease classification using color fundus photography. The scatter plot reveals how model parameter scale (measured in billions) correlates with diagnostic accuracy on a four-class classification task. Connected trajectories within model families demonstrate performance scaling patterns across different architectural configurations. The baseline random performance threshold (25% for four-class classification) is indicated by the gray dashed line, with selected LLaVA variants specifically labeled to highlight architectural distinctions and their impact on diagnostic capabilities.

📈 Ophthalmologic Stage Diagnosis Capabilities

Comparative analysis of multimodal large language model performance on stage-based ophthalmologic diagnosis tasks. The bar chart covers 10 selected models tested across three datasets requiring precise stage classification: OIMHS macular hole staging, ICDR severity assessment, and SDRG grading. Baseline thresholds at 20% and 25% provide random-guess reference points. The models span InternVL variants (from 1.5-2B to 2.5-8B-MPO), LLaVA family models, the specialized LLaVA-Med-7B, Qwen-7B, YI-VL-6B, and DeepSeek VL2-Tiny. Results show that ICDR tasks achieve the highest diagnostic accuracies (approaching 40%), while OIMHS macular hole staging and SDRG exhibit more consistent performance within the 15-25% range, with InternVL 2.5-8B performing particularly well on ICDR assessment.
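The 20% and 25% reference lines follow directly from uniform random guessing over the number of stage classes. The sketch below assumes the corresponding tasks have five and four classes respectively (e.g. the standard five ICDR grades 0-4); those class counts are our assumption, not a statement from the benchmark.

```python
def random_baseline(num_classes):
    """Expected accuracy of uniform random guessing over num_classes labels."""
    return 1.0 / num_classes

# Assumed class counts behind the reference lines in the bar chart.
print(f"5-class staging (e.g. ICDR grades 0-4): {random_baseline(5):.0%}")  # -> 20%
print(f"4-class staging:                        {random_baseline(4):.0%}")  # -> 25%
```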

🏆 LMOD+ Subset Leaderboard

We conduct a comprehensive evaluation of state-of-the-art multimodal large language models (MLLMs) using our proposed benchmark, encompassing both proprietary and open-source models across multiple performance dimensions. Our evaluation framework focuses on three critical ophthalmological tasks: anatomical recognition, disease diagnosis, and disease stage assessment. The LMOD+ Standard Leaderboard is computed on a carefully curated LMOD+ subset of 1,076 images selected for efficient and reliable model evaluation.
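A minimal sketch of a zero-shot evaluation loop over such a subset follows. The prompt template, sample schema, and `model.generate` API are assumptions made for illustration, not the released evaluation pipeline.

```python
# Consistent zero-shot prompt applied to every sample (illustrative wording).
PROMPT_TEMPLATE = (
    "You are an ophthalmology assistant. Look at the image and answer the question "
    "by choosing exactly one of the listed options.\n"
    "Question: {question}\nOptions: {options}\nAnswer:"
)

def evaluate(model, samples):
    """Zero-shot accuracy of `model` on samples shaped like
    {"image": path, "question": str, "options": [str], "answer": str}."""
    correct = 0
    for sample in samples:
        prompt = PROMPT_TEMPLATE.format(
            question=sample["question"],
            options=", ".join(sample["options"]),
        )
        prediction = model.generate(image=sample["image"], prompt=prompt)  # hypothetical MLLM API
        correct += int(prediction.strip().lower() == sample["answer"].strip().lower())
    return correct / len(samples)
```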

| Model | Release Date | Anat. Prec. | Anat. Rec. | Anat. F1 | Anat. HC | Diag. Binary Acc. | Diag. Multi-class Acc. | Staging Acc. |
|---|---|---|---|---|---|---|---|---|
| Random | N/A | N/A | N/A | N/A | N/A | 0.5000 | 0.2500 | 0.2500 |
| GPT-4o | 2024-05-13 | 0.3116 | 0.2424 | 0.2577 | 0.9786 | N/A | N/A | 0.1053 |
| LLaVA-Med-v1.5-mistral-7B | 2023-06-01 | 0.2090 | 0.2558 | 0.2098 | 0.6240 | 0.5000 | 0.2500 | 0.2368 |
| YI-VL-6B | 2024-05-11 | 0.1509 | 0.0379 | 0.0451 | 0.8135 | 0.4117 | 0.2525 | 0.4000 |
| Med-Flamingo | N/A | INVALID | INVALID | INVALID | INVALID | INVALID | INVALID | INVALID |
| InternVL 1.5-2B | 2024-05-25 | 0.2377 | 0.2386 | 0.2204 | 0.7722 | 0.5000 | 0.2525 | 0.2632 |
| InternVL 1.5-4B | 2024-05-25 | 0.2499 | 0.2541 | 0.2414 | 0.8096 | 0.5000 | 0.2500 | 0.2500 |
| InternVL 2.0-2B | 2024-07-04 | 0.3718 | 0.3912 | 0.3722 | 0.9069 | 0.5000 | 0.2175 | 0.2500 |
| InternVL 2.0-4B | 2024-07-04 | 0.4200 | 0.4294 | 0.4096 | 0.8951 | 0.5000 | 0.2525 | 0.2237 |
| InternVL 2.0-8B | 2024-07-04 | 0.2865 | 0.2872 | 0.2770 | 0.9533 | 0.5017 | 0.3500 | 0.2500 |
| InternVL 2.5-2B | 2024-12-05 | 0.3751 | 0.3681 | 0.3524 | 0.9744 | 0.5017 | 0.3000 | 0.2308 |
| InternVL 2.5-4B | 2024-12-05 | 0.2685 | 0.1893 | 0.1890 | 0.9854 | 0.5000 | 0.3225 | 0.2500 |
| InternVL 2.5-8B | 2024-12-05 | 0.2851 | 0.2519 | 0.2574 | 0.9864 | 0.5000 | 0.3275 | 0.2500 |
| InternVL 2.5-2B-MPO | 2025-04-13 | 0.3123 | 0.2992 | 0.2835 | 0.9646 | 0.5000 | 0.2375 | 0.3077 |
| InternVL 2.5-4B-MPO | 2025-04-13 | 0.3126 | 0.2261 | 0.2320 | 0.9918 | 0.5000 | 0.3200 | 0.2500 |
| InternVL 2.5-8B-MPO | 2025-04-13 | 0.2986 | 0.2536 | 0.2624 | 0.9838 | 0.5000 | 0.2650 | 0.1053 |
| LLaVA-1.5-7B | 2023-10-05 | 0.0965 | 0.0530 | 0.0557 | 0.4629 | 0.4937 | 0.2475 | 0.2500 |
| LLaVA-Mistral-7B | 2024-01-30 | 0.0806 | 0.0773 | 0.0696 | 0.5788 | 0.5000 | 0.2500 | 0.2500 |
| LLaVA-Vicuna-7B | 2024-01-30 | 0.0437 | 0.0353 | 0.0342 | 0.1929 | 0.5000 | 0.3693 | N/A |
| LLaVA-Vicuna-13B | 2024-01-30 | 0.1123 | 0.0071 | 0.0105 | 0.6612 | 0.5000 | 0.2725 | N/A |
| Qwen-VL-Chat | 2023-08-22 | 0.1040 | 0.0120 | 0.0178 | 0.7790 | 0.5000 | 0.2675 | 0.2763 |
| Qwen-3B | N/A | 0.2611 | 0.1489 | 0.1509 | 0.7468 | 0.5000 | 0.2500 | 0.2368 |
| Qwen-7B | 2023-08-03 | 0.2506 | 0.2261 | 0.2251 | 0.7556 | 0.5017 | 0.2500 | 0.2368 |
| DeepSeek VL2-Tiny | 2024-12-13 | 0.2228 | 0.0583 | 0.0688 | 0.9518 | 0.5000 | 0.2500 | 0.2237 |
| DeepSeek VL2-Small | 2024-12-13 | 0.0805 | 0.0130 | 0.0218 | 0.5917 | INVALID | 0.1700 | 0.0667 |

Comprehensive evaluation results of multimodal large language models across anatomical recognition, diagnosis, and stage classification tasks. The best-performing model in each category is shown in bold, and the second best is underlined. Models are evaluated under zero-shot settings with consistent prompt templates.

To submit to our leaderboard, please email your scores and a brief description of the multimodal large language model to Zhenyue Qin at zhenyue.qin@yale.edu.

📋 Citation

If you find our work useful, please cite our paper:

@article{qin2024lmod,
  title={LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology},
  author={Qin, Zhenyue and Liu, Yang and Yin, Yu and Ding, Jinyu and Zhang, Haoran and Li, Anran and Campbell, Dylan and Wu, Xuansheng and Zou, Ke and Keenan, Tiarnan D. L. and Chew, Emily Y. and Lu, Zhiyong and Tham, Yih-Chung and Liu, Ninghao and Zhang, Xiuzhen and Chen, Qingyu},
  journal={arXiv preprint arXiv:2410.01620},
  year={2024}
}