TorchMetrics - Measuring Reproducibility in PyTorch

On Reproducibility of Deep Convolutional Neural Networks Approaches

Reproducible Research in Pattern Recognition, 2019

Machine Learning techniques are increasingly pervasive in many application fields. To perform an evaluation that is as reliable as possible, it is necessary to consider the reproducibility of these models both at training and at inference time. With the introduction of Deep Learning (DL), the assessment of reproducibility became a critical issue due to heuristic choices made at training time that, although improving the optimization performance of such complex models, can result in non-deterministic outcomes and, therefore, non-reproducible models. The aim of this paper is to quantitatively highlight the reproducibility problem of DL approaches and to propose overcoming it through statistical considerations. We show that, even though models trained repeatedly on the same data differ in the inference phase, the obtained results are not statistically different. In particular, this short paper analyzes, as a case study, our ICPR 2018 DL-based approach for breast segmentation in DCE-MRI, demonstrating the reproducibility of the reported results.
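Not the paper's exact protocol, but a minimal sketch of the statistical idea, assuming hypothetical per-case Dice scores from two independent training runs of the same segmentation pipeline: a paired non-parametric test checks whether the run-to-run differences are statistically significant.

```python
# Minimal sketch (not the paper's exact protocol): given per-case Dice scores
# from two training runs of the same pipeline, test whether the runs differ
# in a statistically meaningful way. The scores below are hypothetical.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
dice_run_a = np.clip(rng.normal(0.91, 0.03, size=40), 0, 1)               # hypothetical per-case scores, run A
dice_run_b = np.clip(dice_run_a + rng.normal(0.0, 0.01, size=40), 0, 1)   # run B: small run-to-run jitter

stat, p_value = wilcoxon(dice_run_a, dice_run_b)  # paired, non-parametric test
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.3f}")
if p_value > 0.05:
    print("No statistically significant difference: reproducible in the statistical sense.")
else:
    print("Runs differ significantly: investigate sources of non-determinism.")
```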

Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)

2020

One of the challenges in machine learning research is to ensure that presented and published results are sound and reliable. Reproducibility, that is, obtaining results similar to those presented in a paper or talk using the same code and data (when available), is a necessary step to verify the reliability of research findings. Reproducibility is also an important step to promote open and accessible research, thereby allowing the scientific community to quickly integrate new findings and convert ideas to practice. Reproducibility also promotes the use of robust experimental workflows, which potentially reduce unintentional errors. In 2019, the Neural Information Processing Systems (NeurIPS) conference, the premier international conference for research in machine learning, introduced a reproducibility program, designed to improve the standards across the community for how we conduct, communicate, and evaluate machine learning research. The program contained three components: a code submiss...

Description of the Sandia Validation Metrics Project

2001

Our increased dependence on computer models leads to the natural question: how do we know whether a computer model is valid? Models have traditionally been tested against experimental measurements through simple comparisons such as x-y plots, scatter plots, or two-dimensional contour plots. We are then faced with two questions: When is the agreement between experimental measurement and model prediction good enough, and how should we quantify this agreement?
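The question of quantifying agreement invites a small worked example; a minimal sketch with hypothetical measurements and predictions, reporting RMSE and errors normalized by an assumed measurement uncertainty:

```python
# A minimal sketch of quantifying model/experiment agreement beyond an x-y plot:
# root-mean-square error plus errors normalized by the measurement uncertainty.
# All values below are hypothetical.
import numpy as np

measured = np.array([10.2, 11.8, 13.1, 15.0, 16.7])   # hypothetical experimental measurements
predicted = np.array([10.0, 12.1, 13.0, 14.6, 17.2])  # hypothetical model predictions
sigma = 0.5                                            # assumed 1-sigma measurement uncertainty

rmse = np.sqrt(np.mean((predicted - measured) ** 2))
normalized = np.abs(predicted - measured) / sigma      # error in units of measurement uncertainty

print(f"RMSE = {rmse:.3f}")
print(f"max |error| / sigma = {normalized.max():.2f}")
# One possible acceptance rule: agreement is "good enough" if every prediction
# falls within two measurement standard deviations.
print("within 2-sigma everywhere:", bool(np.all(normalized <= 2.0)))
```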

How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models

ArXiv, 2021

Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, (α-Precision, β-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity. We introduce generalization as an additional, independent dimension (to the fidelity-diversity trade-off) that quantifies the extent to which a model copies training data—a crucial performance indicator when modeling sensitive data with requirements on privacy. The three ...
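The paper's exact α-Precision / β-Recall definitions are not reproduced here; below is a simplified k-nearest-neighbour precision/recall sketch in the same spirit, on synthetic embeddings, to illustrate sample-level fidelity and diversity checks.

```python
# Simplified sketch of sample-level fidelity/diversity diagnostics, in the spirit
# of (but not identical to) the paper's alpha-Precision / beta-Recall: k-NN based
# precision (do generated samples lie on the real manifold?) and recall (are real
# samples covered by the generated distribution?). All data below are synthetic.
import numpy as np

def knn_radii(points, k=3):
    # distance from each point to its k-th nearest neighbour (excluding itself)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero self-distance

def support_coverage(queries, references, radii):
    # fraction of query points falling inside at least one reference k-NN ball
    d = np.linalg.norm(queries[:, None, :] - references[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))   # synthetic "real" embeddings
fake = rng.normal(0.1, 1.1, size=(500, 8))   # synthetic "generated" embeddings

precision = support_coverage(fake, real, knn_radii(real))  # fidelity proxy
recall = support_coverage(real, fake, knn_radii(fake))     # diversity proxy
print(f"precision ~ {precision:.2f}, recall ~ {recall:.2f}")
```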

Exploring Effects of Computational Parameter Changes to Image Recognition Systems

arXiv (Cornell University), 2022

Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and FPGAs for fast, timely processing. Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment, as parameters like deep learning frameworks, compiler optimizations for code generation, and hardware devices are not regulated and have varying impact on model performance and correctness. In this paper we conduct robustness analysis of four popular image recognition models (MobileNetV2, ResNet101V2, DenseNet121 and InceptionV3) with the ImageNet dataset, assessing the impact of the following parameters in the model's computational environment: (1) deep learning frameworks; (2) compiler optimizations; and (3) hardware devices. We report sensitivity of model performance in terms of output label and inference time for changes in each of these environment parameters. We find that output label predictions for all four models are sensitive to choice of deep learning framework (by up to 57%) and insensitive to other parameters. On the other hand, model inference time was affected by all environment parameters, with changes in hardware device having the most effect. The extent of effect was not uniform across models.

Illustrative example from the paper (true label: Cyclist; reference inference time: 80 ms):

Framework    Model Source   Device   Classification   Inf. time
PyTorch      PyTorch        GPU 1    Cyclist          80 ms
PyTorch      TensorFlow     GPU 1    Cat              75 ms
PyTorch      PyTorch        GPU 2    Cyclist          150 ms
TensorFlow   TensorFlow     GPU 1    Cyclist          85 ms
TensorFlow   PyTorch        GPU 1    Cyclist          90 ms
TensorFlow   TensorFlow     GPU 2    Cyclist          130 ms
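A minimal sketch of this kind of robustness check, restricted to PyTorch and a CPU-versus-GPU device change (the paper also varies frameworks and compiler optimizations): compare the top-1 label and the inference time of the same model and input across devices. The torchvision model and random input are illustrative stand-ins for the paper's ImageNet setup.

```python
# Minimal sketch: does the top-1 label or the inference time of the same model
# change when only the execution device changes? Uses a pretrained torchvision
# model and a random input as stand-ins for the paper's ImageNet setup.
import time
import torch
import torchvision.models as models

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT).eval()
x = torch.randn(1, 3, 224, 224)  # stand-in input; real experiments would use ImageNet images

devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
results = {}
for dev in devices:
    m, inp = model.to(dev), x.to(dev)
    with torch.no_grad():
        m(inp)  # warm-up pass (lazy initialization, kernel selection)
        if dev == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        logits = m(inp)
        if dev == "cuda":
            torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1e3
    top1 = int(logits.argmax(dim=1))
    results[dev] = top1
    print(f"{dev}: top-1 class {top1}, inference time {elapsed_ms:.1f} ms")

print("top-1 label agreement across devices:", len(set(results.values())) == 1)
```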

OpenKiwi: An Open Source Framework for Quality Estimation

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce OpenKiwi, a PyTorch-based open source framework for translation quality estimation. OpenKiwi supports training and testing of word-level and sentence-level quality estimation systems, implementing the winning systems of the WMT 2015-18 quality estimation campaigns. We benchmark OpenKiwi on two datasets from WMT 2018 (English-German SMT and NMT), yielding state-of-the-art performance on the word-level tasks and near state-of-the-art in the sentence-level tasks.

Accounting for Variance in Machine Learning Benchmarks

2021

Strong empirical evidence that one machine-learning algorithm A outperforms another algorithm B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approximates the ideal estimator, at a 51× reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
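A minimal sketch of the underlying accounting, using synthetic scores rather than real benchmark runs: evaluate a grid of (data-sampling seed, initialization seed) runs and compare how much variance each source contributes before declaring one pipeline better than another.

```python
# Minimal sketch (synthetic numbers): estimate how much run-to-run variance in a
# benchmark score comes from data sampling versus weight initialization, by
# averaging over a grid of (data_seed, init_seed) runs.
import numpy as np

rng = np.random.default_rng(1)
n_data_seeds, n_init_seeds = 5, 5
data_effect = rng.normal(0.0, 1.0, size=(n_data_seeds, 1))   # per-data-split shift (accuracy points)
init_effect = rng.normal(0.0, 0.4, size=(1, n_init_seeds))   # per-initialization shift
noise = rng.normal(0.0, 0.2, size=(n_data_seeds, n_init_seeds))
scores = 85.0 + data_effect + init_effect + noise             # hypothetical accuracy grid

var_data = scores.mean(axis=1).var(ddof=1)   # variance across data seeds (init averaged out)
var_init = scores.mean(axis=0).var(ddof=1)   # variance across init seeds (data averaged out)
print(f"variance due to data sampling ~ {var_data:.2f}")
print(f"variance due to initialization ~ {var_init:.2f}")
# Claims like "A beats B" should be judged against this variability, e.g. by
# reporting mean +/- std over such grids rather than a single run.
```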

Challenges and Pitfalls of Machine Learning Evaluation and Benchmarking

arXiv: Learning, 2019

An increasingly complex and diverse collection of Machine Learning (ML) models as well as hardware/software stacks, collectively referred to as "ML artifacts", are being proposed, leading to a diverse ML landscape. The proposed ML innovations have outpaced researchers' ability to analyze, study and adapt them. This is exacerbated by the complicated and sometimes non-reproducible procedures for ML evaluation. A common practice of sharing ML artifacts is through repositories where artifact authors post ad-hoc code and some documentation, but often fail to reveal critical information for others to reproduce their results. This leaves users unable to compare with artifact authors' claims or to adapt the model to their own use. This paper discusses common challenges and pitfalls of ML evaluation and benchmarking, which can be used as a guideline for ML model authors when sharing ML artifacts, and for system developers when benchmarking or designing ML s...

Unbiased Evaluation of Deep Metric Learning Algorithms

2019

Deep metric learning (DML) is a popular approach for image retrieval, solving verification (same or not) problems and addressing open-set classification. Arguably, the most common DML approach is the triplet loss, despite significant advances in the area of DML. Triplet loss suffers from several issues such as collapse of the embeddings, high sensitivity to sampling schemes and, more importantly, a lack of performance when compared to more modern methods. We attribute this continued adoption to a lack of fair comparisons between various methods and the difficulty in adopting them for novel problem statements. In this paper, we perform an unbiased comparison of the most popular DML baseline methods under the same conditions and, more importantly, without obfuscating any hyperparameter tuning or adjustment needed to favor a particular method. We find that, under equal conditions, several older methods perform significantly better than previously believed. In fact, our unified implementation of 12 recent...
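For reference, the triplet-loss baseline the paper scrutinizes can be written in a few lines of PyTorch; the toy embedding network and random batches below are illustrative only.

```python
# Minimal sketch of the triplet-loss baseline, using PyTorch's built-in
# TripletMarginLoss on a toy embedding network with random batches.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # toy embedding network
criterion = nn.TripletMarginLoss(margin=0.2)

anchor = embed(torch.randn(16, 128))    # samples of some class
positive = embed(torch.randn(16, 128))  # other samples of the same class
negative = embed(torch.randn(16, 128))  # samples of a different class

loss = criterion(anchor, positive, negative)  # pulls anchor/positive together, pushes negative away
loss.backward()
print(f"triplet loss: {loss.item():.3f}")
```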

A new typology design of performance metrics to measure errors in machine learning regression algorithms

Interdisciplinary Journal of Information, Knowledge, and Management, 2019

Aim/Purpose: The aim of this study was to analyze various performance metrics and approaches to their classification. The main goal was to develop a new typology that will help to advance knowledge of metrics and facilitate their use in machine learning regression algorithms.

Background: Performance metrics (error measures) are vital components of the evaluation frameworks in various fields. A performance metric can be defined as a logical and mathematical construct designed to measure how close the actual results are to what has been expected or predicted. A vast variety of performance metrics have been described in the academic literature. The most commonly mentioned metrics in research studies are Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), etc. Knowledge about metrics properties needs to be systematized to simplify the design and use of the metrics.

Methodology: A qualitative study was conducted to achieve the objectives of identifying related peer-reviewed research studies, literature reviews, critical thinking and inductive reasoning.

Contribution: The main contribution of this paper is in ordering knowledge of performance metrics and enhancing understanding of their structure and properties by proposing a new typology, a generic mathematical formula for primary metrics, and a visualization chart.

Findings: Based on the analysis of the structure of numerous performance metrics, we proposed a framework of metrics which includes four (4) categories: primary metrics, extended metrics, composite metrics, and hybrid sets of metrics. The paper identified three (3) key components (dimensions) that determine the structure and properties of primary metrics: the method of determining point distance, the method of normalization, and the method of aggregation of point distances over a data set. For each component, implementation options have been identified. The suggested new typology has been shown to cover a total of over 40 commonly used primary metrics.

Recommendations for Practitioners: The presented findings can be used to facilitate teaching performance metrics to university students and to expedite metrics selection and implementation processes for practitioners.

Recommendations for Researchers: By using the proposed typology, researchers can streamline development of new metrics with predetermined properties.

Impact on Society: The outcomes of this study could be used for improving evaluation results in machine learning regression, forecasting and prognostics, with direct or indirect positive impacts on innovation and productivity in a societal sense.

Future Research: Future research is needed to examine the properties of the extended metrics, composite metrics, and hybrid sets of metrics. An empirical study of the metrics is needed, using R Studio or Azure Machine Learning Studio, to find associations between the properties of primary metrics and their "numerical" behavior in a wide spectrum of data characteristics and business or research requirements.

Keywords: performance metrics, error measures, accuracy measures, distance, similarity, dissimilarity, properties, typology, classification, machine learning, regression, forecasting, prognostics, prediction, evaluation, estimation, modeling
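A minimal sketch of the typology's central idea that a primary metric is composed of a point-distance, a normalization, and an aggregation step; the function names are illustrative, not taken from the paper.

```python
# Minimal sketch: composing MAE, RMSE and MAPE from the typology's three
# components (point distance, normalization, aggregation). Names are illustrative.
import numpy as np

def primary_metric(y_true, y_pred, distance, normalize, aggregate):
    d = distance(np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float))
    return aggregate(normalize(d, np.asarray(y_true, dtype=float)))

abs_err   = lambda t, p: np.abs(p - t)        # point distance: absolute error
sq_err    = lambda t, p: (p - t) ** 2         # point distance: squared error
identity  = lambda d, t: d                    # normalization: none
by_actual = lambda d, t: d / np.abs(t)        # normalization: relative to actual value

mae  = lambda t, p: primary_metric(t, p, abs_err, identity, np.mean)
rmse = lambda t, p: primary_metric(t, p, sq_err, identity, lambda d: np.sqrt(np.mean(d)))
mape = lambda t, p: primary_metric(t, p, abs_err, by_actual, lambda d: 100 * np.mean(d))

y_true, y_pred = [3.0, 5.0, 8.0], [2.5, 5.5, 9.0]
print(f"MAE={mae(y_true, y_pred):.3f}, RMSE={rmse(y_true, y_pred):.3f}, MAPE={mape(y_true, y_pred):.1f}%")
```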