From explanation to unsupervised segmentation: fusion of multiple explanation maps for vision transformers
References
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. Preprint at arXiv:2010.11929 (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Lai-Dang, Q.-V.: A survey of vision transformers in autonomous driving: current trends and future directions. Preprint at arXiv:2403.07542 (2024)
Hassan, O.F., Ibrahim, A.F., Gomaa, A., Makhlouf, M.A., Bahaa Eldin, H., et al.: Real-time driver drowsiness detection using transformer architectures: a novel deep learning approach. Sci. Rep. 15, 17493 (2025). https://doi.org/10.1038/s41598-025-02111-x
Silva, S.H., Bethany, M., Votto, A.M., Scarff, I.H., Beebe, N., Najafirad, P.: Deepfake forensics analysis: an explainable hierarchical ensemble of weakly supervised models. Forensic Sci. Int. Synergy 4, 100217 (2022). https://doi.org/10.1016/j.fsisyn.2022.100217
Forest, F., Porta, H., Tuia, D., Fink, O.: From classification to segmentation with explainable AI: a study on crack detection and growth monitoring. Autom. Constr. 165, 105497 (2024)
Samek, W., Montavon, G., Lapuschkin, S., Anders, C.J., Müller, K.-R.: Explaining deep neural networks and beyond: a review of methods and applications. Proc. IEEE 109(3), 247–278 (2021)
Kashefi, R., Barekatain, L., Sabokrou, M., Aghaeipoor, F.: Explainability of vision transformers: a comprehensive review and new perspectives. Preprint at arXiv:2311.06786 (2023)
Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at arXiv:1312.6034 (2013)
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833. Springer (2014)
Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. Preprint at arXiv:1412.6806 (2014)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: IEEE International Conference on Computer Vision (ICCV), pp. 618–626 (2017)
Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: International Conference on Machine Learning, pp. 3319–3328. PMLR (2017)
Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: Smoothgrad: removing noise by adding noise. Preprint at arXiv:1706.03825 (2017)
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One 10(7), e0130140 (2015)
Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: International Conference on Machine Learning, pp. 3145–3153. PMLR (2017)
Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.-R.: Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recogn. 65, 211–222 (2017)
Abnar, S., Zuidema, W.: Quantifying attention flow in transformers. Preprint at arXiv:2005.00928 (2020)
Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 782–791 (2021)
Leem, S., Seo, H.: Attention guided CAM: visual explanations of vision transformer guided by self-attention. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 2956–2964 (2024)
Mallick, R., Benois-Pineau, J., Zemmari, A.: I saw: a self-attention weighted method for explanation of visual transformers. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 3271–3275. IEEE (2022)
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim, B.: Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 31 (2018)
Jain, S., Wallace, B.C.: Attention is not explanation. Preprint at arXiv:1902.10186 (2019)
Serrano, S., Smith, N.A.: Is attention interpretable? Preprint at arXiv:1906.03731 (2019)
Fung, C., Zeng, E., Bauer, L.: Attributions for ML-based ICS anomaly detection: from theory to practice. In: Proceedings of the 31st Network and Distributed System Security Symposium (2024)
Joshi, G., Joshi, A., Shetty, M., Walambe, R., Kotecha, K., Scotti, F., Piuri, V.: Ensemble learning and EigenCAM-based feature analysis for improving the performance and explainability of object detection in drone imagery. Discover Appl. Sci. 7, 376 (2025). https://doi.org/10.1007/s42452-025-06879-5
Mendonça, T., Ferreira, P.M., Marques, J.S., Marcal, A.R.S., Rozeira, J.: PH² - a dermoscopic image database for research and benchmarking. In: 35th International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5437–5440. IEEE (2013). https://doi.org/10.1109/EMBC.2013.6610779
Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338 (2010)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)
Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: TransUNet: transformers make strong encoders for medical image segmentation. Preprint at arXiv:2102.04306 (2021)
Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: TransBTS: multimodal brain tumor segmentation using transformer. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 109–119. Springer (2021)
Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D.: UNETR: transformers for 3d medical image segmentation. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1748–1758 (2022). https://doi.org/10.1109/WACV51458.2022.00181
Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI BrainLesion Workshop. Lecture Notes in Computer Science, pp. 272–284 (2022)
Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-UNet: UNet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision, pp. 205–218. Springer (2022)
Kerssies, T., Cavagnero, N., Hermans, A., Norouzi, N., Averta, G., Leibe, B., Dubbelman, G., de Geus, D.: Your ViT is secretly an image segmentation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 25303–25313 (2025)
Bakkouri, I., Afdel, K.: MLCA2F: multi-level context attentional feature fusion for COVID-19 lesion segmentation from CT scans. Signal Image Video Process. (2022)
Bakkouri, I., Afdel, K., Benois-Pineau, J., Initiative, G.C.F.T.A.D.N.: BG-3DM2F: bidirectional gated 3d multi-scale feature fusion for Alzheimer’s disease diagnosis. Multimed. Tools Appl. 81(8), 10743–10776 (2022)
Bakkouri, I., Afdel, K.: Computer-aided diagnosis (CAD) system based on multi-layer feature fusion network for skin lesion recognition in dermoscopy images. Multimed. Tools Appl. 79(29), 20483–20518 (2020)
Bakkouri, I., Afdel, K.: Multi-scale CNN based on region proposals for efficient breast abnormality recognition. Multimed. Tools Appl. 78(10), 12939–12960 (2019)
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
Petsiuk, V., Das, A., Saenko, K.: RISE: randomized input sampling for explanation of black-box models. Preprint at arXiv:1806.07421 (2018)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Dollár, P., Girshick, R.: Segment anything. Preprint at arXiv:2304.02643 (2023)