
MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

Paper [PDF] · Dataset [Google Drive] · Code [GitHub]

We propose MM-Skin, a large-scale multimodal dermatology dataset that encompasses three imaging modalities (clinical, dermoscopic, and pathological), with nearly 10k high-quality image-text pairs collected from professional textbooks and over 27k vision question answering (VQA) samples.

In addition, we developed SkinVL, a dermatology-specific VLM, and conducted comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT), and zero-shot classification tasks.

Code and model weights are coming soon.

Quick Start

1. Environment

First, clone the repo and cd into the directory:

git clone https://github.com/ZwQ803/MM-Skin.git
cd MM-Skin

Then create a conda env and install the dependencies:

conda create -n mmskin python=3.10 -y
conda activate mmskin
pip install -r requirements.txt

2. Download SkinVL Pre-trained Weights

| Model Name | Link |
| --- | --- |
| SkinVL-MM | Link |
| SkinVL-Pub | Link |
| SkinVL-PubMM | Link |

3. Download Pre-training Datasets

| Dataset | Modality | Link |
| --- | --- | --- |
| SCIN | Clinical | Link |
| DDI | Clinical | Link |
| Fitzpatrick17k | Clinical | Link |
| PAD | Clinical | Link |
| Dermnet | Clinical | Link |
| HAM10000 | Dermoscopy | Link |
| ISIC2019 | Dermoscopy | Link |
| BCN20000 | Dermoscopy | Link |
| HIBA | Dermoscopy | Link |
| MSKCC | Dermoscopy | Link |
| Patch16 | Pathology | Link |
| MM-Skin | Clinical, Dermoscopy, Pathology | Link |

Training

To train the model using LoRA, run finetune_lora.sh with pre-trained LLaVA-Med weights (available here).
Update LLAVA_MED_WEIGHT_PATH in the script to your local path, and replace PRETRAIN_DATAFRAME with the processed JSON training file.
We provide training JSONs for SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM at: /Dataframe/Pretrain.
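The training JSONs are expected to follow the LLaVA-style conversation format used by the LLaVA-Med base model (this format, and the field names below, are assumptions for illustration, not confirmed by this repository). A minimal sketch that sanity-checks a record before training:

```python
# A minimal LLaVA-style training record (field names follow the common
# LLaVA/LLaVA-Med convention; the content is an illustrative placeholder).
sample = {
    "id": "mmskin_0001",
    "image": "images/clinical/0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the skin lesion in this image."},
        {"from": "gpt", "value": "A well-demarcated erythematous plaque with silvery scale."},
    ],
}

def validate_record(rec: dict) -> bool:
    """Check that a record has the fields a LLaVA-style trainer typically expects."""
    if not {"id", "image", "conversations"} <= rec.keys():
        return False
    turns = rec["conversations"]
    # Conversations should start with a human turn and only use human/gpt roles.
    return (
        len(turns) >= 2
        and all(t["from"] in ("human", "gpt") for t in turns)
        and turns[0]["from"] == "human"
    )

print(validate_record(sample))  # → True
```

Running a check like this over the whole JSON before launching finetune_lora.sh catches malformed entries early, before they fail mid-training.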

After training, merge the LoRA weights with the base model:

python merge_lora_weights.py \
    --model-path /path/to/lora_model \
    --model-base /path/to/base_model/llava-med-v1.5-mistral-7b \
    --save-model-path /path/to/merge_model
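Conceptually, the merge folds each low-rank adapter update back into the corresponding base weight, W' = W + (alpha / r) * B @ A, per the standard LoRA formulation. A tiny numeric sketch of that operation (illustrative values, plain Python):

```python
# Minimal LoRA-merge sketch: fold a rank-r update back into a base weight
# matrix. Shapes and scaling follow the standard LoRA formulation; the
# actual merge script operates on full model checkpoints instead.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A as plain nested lists."""
    scale = alpha / r
    BA = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, BA)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, 2x2
A = [[1.0, 2.0]]              # r x in_features (r = 1)
B = [[0.5], [0.5]]            # out_features x r
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # → [[2.0, 2.0], [1.0, 3.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be used as a drop-in replacement for the base model.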

You can also directly use our provided merged models by placing them in the /merge directory.

Evaluation

1. VQA Evaluation

To evaluate SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM, run:

python VQA_test.py --model-path MERGED_SKINVL_MODEL

Replace the caption file and image folder paths in the script with your dataset paths. We provide preprocessed MM-Skin test data in /Dataframe/test/VQA, which can be used directly for evaluation.
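Open-ended VQA answers are commonly scored with token-level metrics such as exact match or F1 against the reference answer. The sketch below shows a generic token-level F1; it is an illustration of this class of metric, not necessarily what VQA_test.py implements:

```python
import re
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, strip punctuation, and tokenize an answer string."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("an erythematous plaque", "erythematous plaque with scale"), 3))  # → 0.571
```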

2. Supervised Fine-Tuning (SFT) Classification

Run SFT_classify_test.sh for supervised classification. Replace all paths with your local files. Preprocessed data for reproducing our results can be found in /Dataframe/test/classification.

3. Zero-Shot Classification

Run ZS_classify_test.sh to perform zero-shot classification.
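With a generative VLM, zero-shot classification is typically done by prompting the model with the candidate labels and mapping its free-text reply back to a class. The helper below is an illustrative sketch of that mapping step only, not the actual logic of ZS_classify_test.sh:

```python
def map_reply_to_label(reply: str, labels: list) -> str:
    """Map a free-text model reply to the first candidate label it mentions."""
    reply_l = reply.lower()
    for label in labels:
        if label.lower() in reply_l:
            return label
    # No candidate label appears in the reply; flag it for manual review.
    return "unknown"

labels = ["melanoma", "nevus", "basal cell carcinoma"]
print(map_reply_to_label("This lesion is most consistent with a nevus.", labels))  # → nevus
```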

Data Collection and Statistics

The dataset is sourced from 15 professional dermatology textbooks.

MM-Skin contains 11,039 dermatology images with expert descriptions across three modalities. It provides three subsets:

- MM-Skin-C (Captions)

- MM-Skin-O (Open-ended VQA)

- MM-Skin-D (Demographics)

Data Collection Process

  1. Image-Text Extraction: Extract images and captions from the 15 dermatology textbooks using OCR and the Adobe API.
  2. Alignment: Match images with captions.
  3. Modality Classification: Feature-based classification (color, texture) with manual verification.
  4. Text Cleaning: Extract age and gender info.
  5. Filtering: Remove sensitive or annotated images.
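Step 3 can be pictured as a simple color-statistics heuristic over each image, followed by manual verification. The rules and thresholds below are illustrative placeholders, not the actual features used in the paper:

```python
def guess_modality(mean_rgb, saturation):
    """
    Toy rule-based modality guess from global color statistics.
    Thresholds are illustrative, not the paper's actual values.
    """
    r, g, b = mean_rgb
    # H&E-stained pathology slides skew pink/purple: red and blue dominate green.
    if r > g and b > g and saturation > 0.2:
        return "pathology"
    # Dermoscopic images tend toward a fairly uniform, skin-toned field.
    if saturation < 0.35 and r > b:
        return "dermoscopy"
    # Everything else defaults to an ordinary clinical photograph.
    return "clinical"

print(guess_modality((200, 120, 180), 0.4))  # → pathology
```

In practice each automatic guess would still be manually verified, as the pipeline above describes.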

Citation

If you find our work helpful, please consider citing it:

@article{zeng2025mm,
  title={MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks},
  author={Zeng, Wenqi and Sun, Yuqi and Ma, Chenxi and Tan, Weimin and Yan, Bo},
  journal={arXiv preprint arXiv:2505.06152},
  year={2025}
}