VLM3D Challenge – Task 1: Radiology Report Generation

Welcome to Task 1 of the Vision‑Language Modeling in 3D Medical Imaging (VLM3D) Challenge. The goal of this task is to train models that convert 3‑D chest CT volumes into clinically accurate radiology reports.


Contents

  1. Overview
  2. Dataset
  3. Task Objective
  4. Participation Rules
  5. Evaluation & Ranking
  6. Prizes & Publication
  7. Citation
  8. Contact

Overview

Radiologists spend considerable time dictating comprehensive reports for chest CT scans. Automating this step can:

  • Speed up diagnostic workflows
  • Reduce variability between readers
  • Improve patient care by enabling rapid triage

Task 1 leverages CT‑RATE, the largest open dataset pairing 3‑D chest CT volumes with expert reports, to benchmark the next generation of Vision‑Language Models (VLMs) for volumetric data.


Dataset

Split | Patients | CT Volumes | Reports | Source
Train (public) | 20,000 | ~47k | 20,000 | Istanbul Medipol University
Validation (public) | 1,304 | ~3k | 1,564 | Istanbul Medipol University
Internal Test (private) | 2,000 | 2,000 | hidden | Istanbul Medipol University
External Test (private) | 1,024 | 1,024 | hidden | Boston University Hospital

Raw NIfTI volumes are provided as‑is (no preprocessing). All CT metadata (voxel spacing, rescale slope/intercept, etc.) is preserved.
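
As a starting point, here is a minimal Python sketch for reading one volume, assuming the files ship as .nii.gz readable with nibabel. The file name and rescale values are placeholders, and you should verify against the actual release whether the slope/intercept live in the NIfTI header or in the accompanying metadata table.

import nibabel as nib
import numpy as np

# Hypothetical file name; the real CT-RATE naming scheme may differ.
img = nib.load("train_1_a_1.nii.gz")

# Raw voxel array exactly as stored on disk (no preprocessing applied).
raw = np.asanyarray(img.dataobj).astype(np.float32)

# If the rescale slope/intercept are not already baked into the NIfTI header,
# apply them manually to obtain Hounsfield units (placeholder values shown).
slope, intercept = 1.0, -1024.0
hu = raw * slope + intercept

spacing = img.header.get_zooms()  # voxel spacing in mm, from the header
print(hu.shape, spacing)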


Task Objective

Given a 3‑D chest CT volume, generate one free‑text radiology report that:

  • Correctly describes normal findings and pathologies
  • Uses standard chest CT terminology
  • Covers both the findings and the impression sections (a hypothetical example follows this list)
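
For orientation only, a purely hypothetical example of the expected two‑part structure (not drawn from the dataset; real reports are typically longer and more detailed):

Findings: The lungs are clear without consolidation, ground-glass opacity, or suspicious nodules. No pleural effusion or pneumothorax. Heart size and mediastinal structures are within normal limits.
Impression: No acute cardiopulmonary abnormality.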

Participation Rules

  • Method type: Fully automatic – no human interaction at inference time.
  • Training data: Use any publicly available data or models in addition to CT‑RATE.
  • Team limits: Max 1 submission per day; the last valid entry before the deadline counts.
  • Organizer teams: May submit for leaderboard visibility but are ineligible for prizes.

Evaluation & Ranking

Natural‑Language Metrics

Metric | Purpose
BLEU‑1/2/3/4 | n‑gram fluency and relevance
METEOR | synonym and stem matching
ROUGE‑L | longest‑common‑subsequence recall
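
The organizers compute these metrics with their own evaluation pipeline; as a rough local sanity check, the following Python sketch uses the Hugging Face evaluate library (tokenization and exact settings may differ from the official scripts):

# pip install evaluate nltk rouge_score
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

# Toy generated and reference reports (illustrative only).
preds = ["No focal consolidation. Mild bilateral pleural effusions."]
refs = ["Mild pleural effusions are noted bilaterally. No consolidation is seen."]

scores = {}
for n in range(1, 5):  # BLEU-1 .. BLEU-4
    scores[f"bleu-{n}"] = bleu.compute(predictions=preds, references=refs, max_order=n)["bleu"]
scores["meteor"] = meteor.compute(predictions=preds, references=refs)["meteor"]
scores["rougeL"] = rouge.compute(predictions=preds, references=refs)["rougeL"]
print(scores)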

Clinical Accuracy Metrics

Metric | Purpose
Precision | penalizes false positives
Recall (Sensitivity) | penalizes false negatives
F1 Score | balances precision and recall
CRG Score | distribution-aware clinical accuracy metric
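
These label-level metrics are typically computed on structured abnormality labels extracted from the generated and reference reports; the extraction model, label set, and averaging scheme are defined by the organizers. A minimal sketch with scikit-learn, assuming micro-averaged multi-label scores:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary abnormality matrices (cases x labels), extracted from
# reference reports (y_true) and generated reports (y_pred).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1]])

precision = precision_score(y_true, y_pred, average="micro", zero_division=0)
recall = recall_score(y_true, y_pred, average="micro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
print(precision, recall, f1)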

Final Ranking

A point‑based scheme (VerSe/BraTS style):

  1. For every metric, run a two-sided permutation test (10,000 permutations) between every pair of teams (a sketch of this test appears below).
  2. Assign one point per significant win.
  3. Rank teams by total points (higher = better). Ties share the same place.

Missing reports receive the minimum score for that case.
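
For step 1 above, a paired sign-flip permutation test on per-case scores is one plausible implementation; the exact test statistic and significance threshold used by the organizers may differ. A minimal Python sketch:

import numpy as np

def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10_000, seed=0):
    # Two-sided paired sign-flip permutation test comparing two teams'
    # per-case scores on a single metric.
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diff.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diff.size))
    permuted = np.abs((signs * diff).mean(axis=1))
    return float((permuted >= observed).mean())

# Team A scores a point over team B on this metric if A's mean score is
# higher and the test is significant (alpha = 0.05 assumed here).
p = paired_permutation_pvalue([0.41, 0.38, 0.45], [0.35, 0.36, 0.40])
print(p)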


Prizes & Publication

  • Awards – details TBA.
  • Every team with a valid submission will be invited to co‑author the joint challenge paper (MedIA / IEEE TMI).
  • An overview manuscript describing baseline results will appear on arXiv before the test phase closes.

Citation

If you use this dataset or participate in the challenge, please cite:

@article{hamamci2024developing,
  title={Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography},
  author={Hamamci, Ibrahim Ethem and Er, Sezgin and Almas, Furkan and Simsek, Ayse Gulnihan and Esirgun, Sevval Nil and Dogan, Irem and Dasdelen, Muhammed Furkan and Durugol, Omer Faruk and Wittmann, Bastian and Amiranashvili, Tamaz and others},
  journal={arXiv preprint arXiv:2403.17834},
  year={2024}
}

CRG Score:

@inproceedings{hamamci2025crg,
  title={CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation},
  author={Hamamci, Ibrahim Ethem and Er, Sezgin and Shit, Suprosanna and Reynaud, Hadrien and Kainz, Bernhard and Menze, Bjoern},
  booktitle={Medical Imaging with Deep Learning-Short Papers},
  year={2025}
}

Contact

For technical issues, open an issue or post on the challenge forum. For all other inquiries, please email the organizers through the challenge platform (Help -> Email organizers).