MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks
Paper [PDF] · Dataset [Google Drive] · Code [GitHub]
We propose MM-Skin, a large-scale multimodal dermatology dataset that encompasses three imaging modalities (clinical, dermoscopic, and pathological), with nearly 10k high-quality image-text pairs collected from professional textbooks and over 27k vision question answering (VQA) samples.
In addition, we developed SkinVL, a dermatology-specific VLM, and conducted comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT), and zero-shot classification tasks.
Code and model weights are coming soon.
Quick Start
1. Environment
First, clone the repo and cd into the directory:
git clone https://github.com/ZwQ803/MM-Skin.git
cd MM-Skin
Then create a conda env and install the dependencies:
conda create -n mmskin python=3.10 -y
conda activate mmskin
pip install -r requirements.txt
2. Download SkinVL Pre-trained Weights
| Model Name | Link |
|---|---|
| SkinVL-MM | Link |
| SkinVL-Pub | Link |
| SkinVL-PubMM | Link |
3. Download Pre-training Datasets
| Dataset | Modality | Link |
|---|---|---|
| SCIN | Clinical | Link |
| DDI | Clinical | Link |
| Fitzpatrick17k | Clinical | Link |
| PAD | Clinical | Link |
| Dermnet | Clinical | Link |
| HAM10000 | Dermoscopy | Link |
| ISIC2019 | Dermoscopy | Link |
| BCN20000 | Dermoscopy | Link |
| HIBA | Dermoscopy | Link |
| MSKCC | Dermoscopy | Link |
| Patch16 | Pathology | Link |
| MM-Skin | Clinical, Dermoscopy, Pathology | Link |
Training
To train the model using LoRA, run finetune_lora.sh with pre-trained LLaVA-Med weights (available here).
Update LLAVA_MED_WEIGHT_PATH in the script to your local path, and replace PRETRAIN_DATAFRAME with the processed JSON training file.
We provide training JSONs for SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM at: /Dataframe/Pretrain.
After training, merge the LoRA weights with the base model:
python merge_lora_weights.py \
--model-path /path/to/lora_model \
--model-base /path/to/base_model/llava-med-v1.5-mistral-7b \
--save-model-path /path/to/merge_model
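Conceptually, merging folds each low-rank adapter back into the corresponding base weight matrix as W' = W + (α/r)·BA. A minimal pure-Python sketch with toy dimensions (illustrative only, not the script's actual implementation):

```python
def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha=16, r=4):
    """Fold a LoRA update into a base weight matrix: W' = W + (alpha / r) * B @ A."""
    BA = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

# Toy example: rank-1 update on a 2x2 identity weight.
W = [[1, 0], [0, 1]]
A = [[0, 1]]     # r x d_in
B = [[1], [0]]   # d_out x r
print(merge_lora(W, A, B, alpha=1, r=1))  # -> [[1.0, 1.0], [0.0, 1.0]]
```

The merged matrix has the same shape as the base weight, which is why the merged model can be loaded like an ordinary checkpoint.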
You can also directly use our provided merged models by placing them in the /merge directory.
Evaluation
1. VQA Evaluation
To evaluate SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM, run:
python VQA_test.py --model-path MERGED_SKINVL_MODEL
Replace the caption file and image folder paths in the script with your dataset paths. We provide preprocessed MM-Skin test data in /Dataframe/test/VQA, which can be used directly for evaluation.
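Open-ended VQA answers can be scored in several ways; a hedged sketch of two common choices, exact match and token recall (illustrative only, not necessarily the metrics VQA_test.py reports):

```python
def exact_match(pred: str, gold: str) -> bool:
    """Case-insensitive exact match after trimming trailing punctuation."""
    norm = lambda s: " ".join(s.lower().strip().rstrip(".").split())
    return norm(pred) == norm(gold)

def token_recall(pred: str, gold: str) -> float:
    """Fraction of reference tokens that appear in the prediction."""
    gold_toks = set(gold.lower().split())
    pred_toks = set(pred.lower().split())
    return len(gold_toks & pred_toks) / len(gold_toks) if gold_toks else 0.0

print(exact_match("Melanoma.", "melanoma"))                    # -> True
print(token_recall("nodular melanoma", "melanoma"))            # -> 1.0
```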
2. Supervised Fine-Tuning (SFT) Classification
Run SFT_classify_test.sh for supervised classification. Replace all paths with your local files. Preprocessed data for reproducing our results can be found in /Dataframe/test/classification.
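A sketch of the kind of bookkeeping a classification evaluation does, reporting overall and per-class accuracy (illustrative, not the contents of SFT_classify_test.sh):

```python
from collections import Counter

def per_class_accuracy(preds, golds):
    """Overall accuracy plus accuracy broken down by gold class."""
    total, correct = Counter(), Counter()
    for p, g in zip(preds, golds):
        total[g] += 1
        if p == g:
            correct[g] += 1
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / len(golds)
    return overall, per_class

acc, per = per_class_accuracy(["mel", "nev", "mel"], ["mel", "nev", "nev"])
print(per)  # -> {'mel': 1.0, 'nev': 0.5}
```

Per-class numbers matter here because dermatology datasets are typically class-imbalanced, so overall accuracy alone can hide poor minority-class performance.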
3. Zero-Shot Classification
Run ZS_classify_test.sh to perform zero-shot classification.
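Zero-shot classification with a VLM typically turns each candidate label into a text prompt and picks the label whose prompt scores highest against the image. A minimal sketch with a placeholder scorer (the prompt template and score_fn are assumptions, not the repo's actual implementation):

```python
def zero_shot_classify(image, labels, score_fn):
    """Score each label prompt against the image; return the best label.
    score_fn stands in for the VLM's image-text scoring (assumption)."""
    prompts = [f"A dermatology image showing {lab}." for lab in labels]
    scores = [score_fn(image, p) for p in prompts]
    return labels[max(range(len(labels)), key=scores.__getitem__)]

# Toy scorer: keyword overlap between a fake image "description" and the prompt.
toy_score = lambda img, prompt: sum(w in prompt for w in img.split())
print(zero_shot_classify("pigmented melanoma lesion",
                         ["melanoma", "basal cell carcinoma"],
                         toy_score))  # -> melanoma
```

No labeled training data is needed: only the class names themselves enter the model, which is what makes the setup "zero-shot".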
Data Collection and Statistics
The 15 professional dermatology textbooks are:
- Diagnostic Dermoscopy: The Illustrated Guide, Second Edition
- Skin Lymphoma: The Illustrated Guide
- Skin Disease: Diagnosis and Treatment, Fourth Edition
- Imported Skin Diseases
- Shimizu's Dermatology
- Skin Cancer: Recognition and Management
- Clinical Dermatology
- Andrews' Diseases of the Skin, Fourteenth Edition
- Diseases of the Liver and Biliary System in Children
- McKee's Pathology of the Skin, Fifth Edition
- Harper's Textbook of Pediatric Dermatology
- Skin Infections
- Advances in Integrative Dermatology
- Cancer of the Skin, 2nd Edition
- Rook's Textbook of Dermatology
MM-Skin contains 11,039 dermatology images with expert descriptions across three modalities. It provides three subsets:
- MM-Skin-C (Captions)
- MM-Skin-O (Open-ended VQA)
- MM-Skin-D (Demographics)
Data Collection Process
- Image-Text Extraction: From 15 dermatology textbooks using OCR and Adobe API.
- Alignment: Match images with captions.
- Modality Classification: Feature-based classification (color, texture) with manual verification.
- Text Cleaning: Extract age and gender info.
- Filtering: Remove sensitive or annotated images.
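The text-cleaning step can be illustrated with a small regex sketch; the patterns assume captions phrased like "A 45-year-old woman with ..." and are not the repo's actual code:

```python
import re

def extract_demographics(caption: str):
    """Pull age and gender from a textbook-style caption, if present."""
    age = re.search(r"(\d{1,3})[- ]year[- ]old", caption, re.I)
    gender = re.search(r"\b(man|woman|male|female|boy|girl)\b", caption, re.I)
    return (int(age.group(1)) if age else None,
            gender.group(1).lower() if gender else None)

print(extract_demographics("A 45-year-old woman with a pigmented lesion."))
# -> (45, 'woman')
```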
Citation
If you find our work helpful, please cite our paper:
@article{zeng2025mm,
  title={MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks},
  author={Zeng, Wenqi and Sun, Yuqi and Ma, Chenxi and Tan, Weimin and Yan, Bo},
  journal={arXiv preprint arXiv:2505.06152},
  year={2025}
}