GitHub - ptschandl/HAM10000_dataset: Tools for workup of the HAM10000 dataset

HAM 10000 Dataset Tools

Creative Commons License

This repository gives access to the tools created and used for assembling the training dataset for the proposed HAM-10000 (Human Against Machine with 10000 training images) study, which extends part 3 of the ISIC 2018 challenge. The dataset itself is available for download at the Harvard Dataverse or the ISIC Archive.


Extract

The following technique was used to extract image data from PowerPoint slides and order the images with unique identifiers:
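The notebook itself is not reproduced in this text. As a minimal sketch of the general idea only, and not the authors' code: a .pptx file is a ZIP archive whose embedded pictures are stored under ppt/media/, so they can be copied out and renamed with the Python standard library. The function name `extract_pptx_images` and the use of random UUIDs as identifiers are assumptions made for illustration.

```python
import uuid
import zipfile
from pathlib import Path


def extract_pptx_images(pptx_path, out_dir):
    """Copy every embedded media file out of a .pptx into out_dir.

    Returns a list of (new_file_name, original_archive_member) pairs so
    that each extracted image can be traced back to its source.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    mapping = []
    with zipfile.ZipFile(pptx_path) as archive:
        # Embedded pictures live under ppt/media/ inside the archive.
        # Sorting makes the processing order reproducible.
        members = sorted(
            name for name in archive.namelist()
            if name.startswith("ppt/media/") and not name.endswith("/")
        )
        for member in members:
            suffix = Path(member).suffix.lower()
            # Hypothetical identifier scheme: a random UUID per image.
            new_name = f"{uuid.uuid4().hex}{suffix}"
            (out_dir / new_name).write_bytes(archive.read(member))
            mapping.append((new_name, member))
    return mapping
```

Note that the file names under ppt/media/ do not necessarily follow slide order; recovering the slide each image came from would require reading the relationship files in the archive, which this sketch does not attempt.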


Filter

To order large image sets containing non-annotated overview (clinic), close-up (macro) and dermatoscopic (dsc) images more efficiently, we fine-tuned a neural network to distinguish between these types automatically.

1. Annotation

2. Training

Training was performed in Caffe / DIGITS, which abstracts away many training variables. We obtained 1501 annotated images with the tool above and proceeded to training: GoogLeNet pretrained on ImageNet (taken from the NVIDIA DIGITS 5 Model Store) was fine-tuned on three classes for 20 epochs, reaching a final top-1 accuracy of 98.68% on the test set (one dermatoscopic image classified as macro). The trained model files are provided in ./classify/caffe_model/*
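To make the reported figure concrete, the sketch below computes top-1 accuracy from two label lists. The test-set size is not stated here; a single error at 98.68% would be consistent with 76 test images (75/76), and that count, like the per-class split used below, is an assumption inferred for illustration rather than a reported number.

```python
def top1_accuracy(true_labels, predicted_labels):
    """Fraction of images whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test set of 76 images; the class split is invented.
true_labels = ["clinic"] * 25 + ["macro"] * 25 + ["dsc"] * 26
predicted_labels = list(true_labels)
# One dermatoscopic image misclassified as macro, as described above.
predicted_labels[-1] = "macro"

accuracy = top1_accuracy(true_labels, predicted_labels)
print(f"{accuracy:.2%}")  # 98.68%
```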

3. Inference


Unify

Pathologic diagnoses in clinical practice are often non-standardized and verbose. The notebook below shows the boilerplate we used on different datasets to merge raw string data into a clean set of classes.
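The notebook is not reproduced in this text. As a rough sketch of what such string unification can look like, and not the authors' actual rules, free-text diagnoses can be matched against an ordered list of regular expressions. The short class labels follow the abbreviations used in the published dataset's metadata; the patterns, the `RULES` table, the `unify_diagnosis` name and the `"unknown"` fallback are hypothetical.

```python
import re

# Hypothetical rules: (pattern, target class). The first match wins,
# so more specific patterns should come before more general ones.
RULES = [
    (re.compile(r"melanoma", re.IGNORECASE), "mel"),
    (re.compile(r"basal cell carcinoma|\bbcc\b", re.IGNORECASE), "bcc"),
    (re.compile(r"\bna?ev(us|i)\b", re.IGNORECASE), "nv"),
    (re.compile(r"seborrh?o?eic keratosis|solar lentigo", re.IGNORECASE), "bkl"),
    (re.compile(r"dermatofibroma", re.IGNORECASE), "df"),
]


def unify_diagnosis(raw):
    """Map one verbose diagnosis string to a short class label."""
    # Collapse runs of whitespace so multi-word patterns match reliably.
    text = " ".join(raw.split())
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    # Anything unmatched is flagged for manual review, not guessed.
    return "unknown"


print(unify_diagnosis("  Melanocytic  naevus, compound type "))  # nv
print(unify_diagnosis("Seborrhoeic keratosis, irritated"))       # bkl
```

Simple keyword rules like these cannot handle negations (for example "no evidence of melanoma"), so in practice the unmatched and ambiguous strings would still need manual checking.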


Standardise

To normalise the image format without squeezing (that is, without distorting the aspect ratio), the following Bash/ImageMagick command was applied to the final images before data submission to the archive:

find . -type f \( -iname \*.jpg -o -iname \*.jpeg -o -iname \*.tiff -o -iname \*.tif \) -print0 | xargs -0 -n1 mogrify -strip -rotate "90<" -resize "600x450^" -gravity center -crop 600x450+0+0 -density 72 -units PixelsPerInch -format jpg -quality 100
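The geometry options in this command can be read as three steps: `-rotate "90<"` rotates an image by 90 degrees only if its width is less than its height, `-resize "600x450^"` scales it uniformly until both sides are at least 600 and 450 pixels, and `-gravity center -crop 600x450+0+0` cuts the centred 600x450 region. The sketch below reproduces that arithmetic in plain Python as an illustration; the function name is hypothetical, and ImageMagick's own rounding may differ by a pixel.

```python
def standardise_geometry(width, height, target_w=600, target_h=450):
    """Predict what the rotate/resize/crop steps do to an image size."""
    # -rotate "90<": portrait images (width < height) become landscape.
    rotated = width < height
    if rotated:
        width, height = height, width

    # -resize "600x450^": one uniform scale factor, chosen so that the
    # image covers the target box (the aspect ratio is preserved).
    scale = max(target_w / width, target_h / height)
    resized_w = max(target_w, round(width * scale))
    resized_h = max(target_h, round(height * scale))

    # -gravity center -crop 600x450+0+0: trim the overflow evenly.
    offset_x = (resized_w - target_w) // 2
    offset_y = (resized_h - target_h) // 2

    return {
        "rotated": rotated,
        "resized": (resized_w, resized_h),
        "crop_offset": (offset_x, offset_y),
        "final": (target_w, target_h),
    }


print(standardise_geometry(3000, 4000))  # portrait 4:3, rotated, no trimming
print(standardise_geometry(1000, 1000))  # square, 75 px trimmed top and bottom
```

Because the scale factor is shared by both axes, lesions are never stretched; the cost is that content outside the central 4:3 region is cropped away.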


Segment


Cite

If the tools or data helped your research, please cite:

@article{Tschandl2018_HAM10000,
  author    = {Philipp Tschandl and
               Cliff Rosendahl and
               Harald Kittler},
  title     = {The {HAM10000} dataset, a large collection of multi-source dermatoscopic
               images of common pigmented skin lesions},
  journal   = {Sci. Data},
  volume    = {5},
  year      = {2018},
  pages     = {180161},
  doi       = {10.1038/sdata.2018.161}
}

If you used the segmentation macros or resulting segmentation masks from here, please cite:

@article{Tschandl2020_NatureMedicine,
  author = {Philipp Tschandl and Christoph Rinner and Zoe Apalla and Giuseppe Argenziano and Noel Codella and Allan Halpern and Monika Janda and Aimilios Lallas and Caterina Longo and Josep Malvehy and John Paoli and Susana Puig and Cliff Rosendahl and H. Peter Soyer and Iris Zalaudek and Harald Kittler},
  title = {Human{\textendash}computer collaboration for skin cancer recognition},
  journal = {Nature Medicine},
  volume = {26},
  number = {8},
  year = {2020},
  pages = {1229--1234},
  doi = {10.1038/s41591-020-0942-0},
  url = {https://doi.org/10.1038/s41591-020-0942-0}
}