irene: a software to evaluate model performance

IRENE_DLL: A class library for evaluating numerical estimates

Agronomy Journal

IRENE_DLL (Integrated Resources for Evaluating Numerical Estimates-Dynamic Link Library) is an MS COM class library providing a set of routines designed to facilitate the implementation of model evaluation techniques. Statistical procedures (difference-based analysis, correlation-regression analysis, probability distributions, pattern analysis, statistics aggregation, time mismatch analysis) are applied to compare estimates against measurements, either individually taken or replicated. The DLL can be easily interfaced with applications developed in an MS Windows programming language. An essential description of the program is given along with the basic concepts of usage, plus a real example application in the field of agrometeorology.

Statistics for the Evaluation and Comparison of Models

Journal of Geophysical Research, 1985

Procedures that may be used to evaluate the operational performance of a wide spectrum of geophysical models are introduced. Primarily using a complementary set of difference measures, both model accuracy and precision can be meaningfully estimated, regardless of whether the model predictions are manifested as scalars, directions, or vectors. It is additionally suggested that the reliability of the accuracy and precision measures can be determined from bootstrap estimates of confidence and significance. Recommended procedures are illustrated with a comparative evaluation of two models that estimate wind velocity over the South Atlantic Bight.
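The idea of attaching bootstrap confidence limits to a difference measure can be sketched as follows. This is a minimal illustration, not the procedure of the paper: it assumes scalar predictions, uses the root mean square error as the difference measure, and builds a simple percentile interval by resampling observation-prediction pairs. The function names `rmse` and `bootstrap_ci` are hypothetical.

```python
import math
import random


def rmse(obs, pred):
    """Root mean square error between paired observations and predictions."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def bootstrap_ci(obs, pred, stat=rmse, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a paired difference measure.

    Assumption: (obs, pred) pairs are resampled with replacement, which
    treats the pairs as independent draws.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(obs)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(stat([obs[i] for i in idx], [pred[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    predicted = [1.1, 1.9, 3.2, 3.8, 5.1]
    print("RMSE:", rmse(observed, predicted))
    print("95% interval:", bootstrap_ci(observed, predicted))
```

Resampling pairs rather than residuals keeps each prediction tied to its own observation; for serially correlated data a block bootstrap would be a more defensible choice.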

Concordance Correlation for Model Performance Assessment: An Example with Reference Evapotranspiration Observations

Agronomy Journal, 2009

Model performance assessment in agronomy and agroecology has employed many procedures in the past and continues to do so in the present. As recently reiterated by Tedeschi (2006), modelers in many fields, as well as statisticians, recommend the employment of multiple performance criteria tailored to each application rather than relying on a single performance measure. Relevantly, from this tailored application approach, Tedeschi (2006) provides formulae, discussion, and comparison of several available procedures. With models becoming modular and increasingly complex, Fila et al. (2003) developed the public domain software library called IRENE_DLL (Integrated Resources for Evaluating Numerical Estimates-Dynamic Link Library). This library has most of the commonly used univariate and bivariate analyses including procedures called difference based methods and association based methods. Univariate analyses allow the modeler to consider and compare empirical and ideal distributional features of the observations and predictions. Difference based measures include methods such as the root mean square error, the mean absolute deviation or error, and the mean bias error (MBE) with a t test (see e.g., Fox, 1981). The MBE is often just called the bias. Association based methods include Pearson (r) or Spearman (r_s) correlation coefficients and simple linear least-squares regression analysis. In addition, there are empirical indices of agreement or efficiency like d (Willmott, 1981) and e (Nash-Sutcliffe efficiency statistic; Nash and Sutcliffe, 1970). Legates and McCabe (1999) proposed robust versions of d and e based on absolute value distances rather than squared distance measures.
Other tools or updated versions of some of the IRENE tools are available and used in other disciplines, particularly medical research. Using either a t test alone or a correlation coefficient alone can result in a misleading or inadequate assessment (Lin, 1989). In addition, many measures or indices, like d and e, are arbitrary, being ad hoc and not associated with significance tests and probability distributions. Willmott (1981, 1982, 1984), Willmott et al. (1985), and others argue against using significance tests. While many statisticians would agree with Willmott that assessment of model performance should make sense in terms of the underlying knowledge of the subject being modeled, they would still use the probability measures (Lin et al., 2002). Some statisticians, however, would recommend that no absolute significance level or confidence interval be preset, instead letting the user decide whether or not to accept the results at the reported p levels. Moreover, in this spirit, know that an insightful omnibus procedure combining the essential

ABSTRACT: The assessment procedures for agronomic model performance are often arbitrary and unhelpful. An omnibus analysis, the concordance correlation coefficient (r_c), is widely used in many other sciences. This work illustrates model assessment with two r_c measures accompanied by a mean-difference (MD) plot and a distribution comparison. Each r_c is an adjusted value of the usual Pearson correlation coefficient, r, assuming the exact relationship observations = predictions. The adjustments use a location shift, u, and a scale shift, v. Both of these measures also can indicate the similarity of the two variables' distributions; however, a formal test, the Kolmogorov-Smirnov D statistic, is used to statistically compare the distributions. Daily evapotranspiration data (ET_0) from a published study are compared with estimates from two possible weather observation based models.
Although the first model has a slightly lower r than the second (0.980 vs. 0.982), its predictions reasonably agree with observations by having comparatively small location and scale shifts [u = 0.025, v = 1.10, D = 0.12 (p ≤ 0.86)] and, consequently, a higher r_c (0.975 vs. 0.946). Results for the second model are comparatively unacceptable, having larger location and scale shifts [u = 0.215, v = 1.19, D = 0.28 (p ≤ 0.04)] with the bias ≠ 0 (p ≤ 0.05), as clearly shown in the associated MD plot. Researchers should consider using r_c with an MD plot and distribution comparison in their model assessment toolkit because, together, they can provide a simple and sound probability based omnibus test as well as add useful insight.
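How r is adjusted into r_c can be made concrete with a short sketch. It assumes the convention of Lin (1989), in which u is the location shift (difference of means scaled by the geometric mean of the standard deviations) and v is the scale shift (ratio of standard deviations), so that r_c = r · C_b with C_b = 2 / (v + 1/v + u²). The function names are hypothetical, and population (1/n) variances are used.

```python
import math


def concordance(obs, pred):
    """Concordance correlation and its components (Lin's convention assumed)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    var_o = sum((o - mean_o) ** 2 for o in obs) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred)) / n
    r = cov / math.sqrt(var_o * var_p)           # Pearson correlation
    v = math.sqrt(var_o / var_p)                 # scale shift
    u = (mean_o - mean_p) / (var_o * var_p) ** 0.25  # location shift
    c_b = 2.0 / (v + 1.0 / v + u ** 2)           # bias correction factor
    return {"r": r, "u": u, "v": v, "c_b": c_b, "r_c": r * c_b}


def r_c_from_shifts(r, u, v):
    """Rebuild r_c from a reported r, location shift u and scale shift v."""
    return r * 2.0 / (v + 1.0 / v + u ** 2)


if __name__ == "__main__":
    # Values reported in the abstract for the two models.
    print(round(r_c_from_shifts(0.980, 0.025, 1.10), 3))  # 0.975
    print(round(r_c_from_shifts(0.982, 0.215, 1.19), 3))  # 0.946
```

Under this reading the reported shifts reproduce both published r_c values, which shows why the model with the marginally higher r still ends up with the lower concordance.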

SAVE: an R package for the Statistical Analysis of Computer Models

2015

This paper introduces the R package SAVE, which implements statistical methodology for the analysis of computer models. Namely, the package includes routines that perform emulation, calibration and validation of models of this type. The methodology is Bayesian and is essentially that of Bayarri, Berger, Paulo, Sacks, Cafeo, Cavendish, Lin, and Tu (2007). The package is available through the Comprehensive R Archive Network, CRAN. We illustrate its use with a real data example.

Comparison of model prediction with measurements of 644

2008

New Indices to Quantify Patterns of Residuals Produced by Model Estimates

Agronomy Journal, 2004

The evaluation of patterns in the residuals of model estimates vs. other variables can be useful in both model evaluation and parameter calibration. New indices that allow quantifying such patterns (pattern indices) are presented. Groups of residuals are created by dividing the range of the variable under evaluation into two, three, four, or five subranges. Two types of indices are proposed. The first type (PI-type) is based on the absolute value of the maximum difference between pairwise comparisons among average residuals of each group of residuals. A variant of this index is computed by using variance ratios (PI-F type). The subranges of the variable that determines the grouping of residuals may be of equal length (PI) or variable length (PI ). In the second case, they are generated by an algorithm that optimizes subranges to maximize patterns. The power of the diverse pattern indices at identifying patterns was investigated, and their effectiveness was compared against the runs test. Critical values for pattern indices were generated by Monte Carlo simulations. Monte Carlo probability tables, the results of power analysis, and the results of using pattern indices at two case studies (i.e., daily radiation and soil water content estimates) were presented. The analysis based on pattern indices provided insight in model structure and parameter .

Also, effective application of models suffers from the lack of knowledge of model parameters and their uncertainty. For both process-based and statistically based models, proper simulation techniques require the use of parameter values that, in the former case, properly represent the system under evaluation and, in the latter case, minimize the difference between estimates and measured values. The process of adjusting parameter values is known as calibration (Beck, 1986; Klepper and Rouse, 1991; Klepper et al., 1991), and it is based on different procedures. In general, the goal of calibration is to minimize the difference between measured and estimated values. The mathematical representation of this goal is called the cost (or loss, or objective) function or assessment criterion. Defining an appropriate cost function can be difficult as it may require considerations of different type such as the relative dominance of one or more parameters, the autocorrelation among different parameters, and the drift in time series (Janssen and
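The PI-type index with equal-length subranges can be sketched from the description above. This is a simplified reading, not the published algorithm: it handles a single, user-chosen number of groups, skips empty subranges, and omits the variance-ratio variant, the optimized variable-length subranges and the Monte Carlo critical values. The name `pattern_index` and its parameters are hypothetical.

```python
from itertools import combinations


def pattern_index(x, residuals, n_groups=3):
    """Largest absolute difference between mean residuals of x-subranges.

    The range of x is cut into n_groups equal-length subranges; residuals
    are grouped by the subrange their x value falls in.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_groups
    groups = [[] for _ in range(n_groups)]
    for xi, ri in zip(x, residuals):
        if width > 0:
            # Clamp so that x == hi lands in the last subrange.
            k = min(int((xi - lo) / width), n_groups - 1)
        else:
            k = 0  # all x values identical: a single group
        groups[k].append(ri)
    # Assumption: empty subranges are ignored rather than treated as zero.
    means = [sum(g) / len(g) for g in groups if g]
    if len(means) < 2:
        return 0.0
    return max(abs(a - b) for a, b in combinations(means, 2))


if __name__ == "__main__":
    x = [0, 1, 2, 3, 4, 5]
    trending = [-1, -1, 0, 0, 1, 1]      # residuals drift with x
    alternating = [1, -1, 1, -1, 1, -1]  # no pattern against x
    print(pattern_index(x, trending))     # 2.0
    print(pattern_index(x, alternating))  # 0.0
```

A large value signals that the mean residual depends on the grouping variable; judging whether it is significant would require critical values such as the Monte Carlo tables the abstract mentions.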

A new novel index for evaluating model performance

Journal of Natural Resources and Development

A vast array of scientific literature is concerned with simulation models. The aim of a model is to predict an unknown situation as closely as possible to the real one. To do this, models are validated and examined for their performance under known conditions. In this paper, commonly used model performance evaluation indices are reviewed and examined under different situations. Difference-based indices, efficiency-based indices (the Nash and Sutcliffe coefficient, the model efficiency of Loague and Green, Legates and McCabe's index) and composite indices (such as the index of agreement, d, and d_r) were found to be ambiguous, inconsistent and illogical in many cases. A new index, Percent Mean Relative Absolute Error (PMRAE), is proposed, which is found to be unambiguous, logical, straightforward, and interpretable, and thus can be used to evaluate model performance. Model performance ratings based on PMRAE are also suggested.
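For orientation, the commonly used indices named above can be computed as in the sketch below, using their standard textbook definitions: mean bias error, mean absolute error, root mean square error, the Nash-Sutcliffe efficiency and Willmott's (1981) index of agreement d. The PMRAE formula is not stated in this abstract, so it is deliberately not reproduced. The function name `evaluation_indices` is hypothetical.

```python
import math


def evaluation_indices(obs, pred):
    """Standard difference-based and efficiency indices for paired data."""
    n = len(obs)
    mean_o = sum(obs) / n
    errors = [p - o for o, p in zip(obs, pred)]
    sse = sum(e ** 2 for e in errors)  # sum of squared errors
    # Spread of the observations around their own mean.
    ss_obs = sum((o - mean_o) ** 2 for o in obs)
    # Willmott's "potential error": |P - mean(O)| + |O - mean(O)|, squared.
    potential = sum(
        (abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(obs, pred)
    )
    return {
        "mbe": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sse / n),
        "nse": 1.0 - sse / ss_obs,   # Nash-Sutcliffe efficiency
        "d": 1.0 - sse / potential,  # index of agreement
    }


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    biased = [2.0, 3.0, 4.0, 5.0, 6.0]  # every prediction is 1 unit too high
    for name, value in evaluation_indices(observed, biased).items():
        print(f"{name}: {value:.4f}")
```

For the constant one-unit bias in the usage example, the Nash-Sutcliffe efficiency is 0.5 while d is about 0.889, a small illustration of how these indices can rate the same predictions quite differently.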

SAM: a computer program for statistical analysis and modelling

Environmental Modelling and Software, 1998

A computer program, SAM, has been developed to undertake the sophisticated task of statistical analysis and modelling for PCs, UNIX stations and mainframes. This paper illustrates the various procedures and methodologies adopted within the computer package. SAM is a useful tool for practitioners in selecting an appropriate probability distribution from a set of alternatives to represent statistical, environmental, socioeconomic, and engineering data. It also estimates the parameters of the distribution, predicts the desired percentile values, and calculates the minimum errors associated with the prediction. The package also provides extreme value, reliability and life-testing analyses that are essential in many scientific fields.