Probabilistic Forecasts, Calibration and Sharpness
Related papers
Probabilistic forecasts, calibration and sharpness
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007
Probabilistic forecasts of a continuous variable take the form of predictive densities or predictive cumulative distribution functions. We propose a diagnostic approach to the evaluation of predictive performance that is based on the paradigm of maximizing the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between the distributional forecasts and the observations and is a joint property of the predictions and the events that materialize. Sharpness refers to the concentration of the predictive distributions and is a property of the forecasts only. A simple game-theoretic framework allows us to distinguish probabilistic calibration, exceedance calibration and marginal calibration. We propose and study tools for checking calibration and sharpness, among them the probability integral transform (PIT) histogram, marginal calibration plots, the sharpness diagram and proper scoring rules. The diagnostic approach is illustrated by an assessment and ranking of probabilistic forecasts of wind speed at the Stateline wind energy center in the US Pacific Northwest. In combination with cross-validation or in the time series context, our proposal provides very general, nonparametric alternatives to the use of information criteria for model diagnostics and model selection.
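As a concrete illustration of the PIT histogram and sharpness diagnostics this abstract describes, the following minimal sketch (synthetic data; all names and parameters are illustrative, not taken from the paper) contrasts a calibrated Gaussian forecaster with an overdispersed one:

```python
# Minimal sketch of the PIT histogram check, assuming Gaussian predictive
# distributions; mu, sigma and the data are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
truth = rng.normal(0.0, 1.0, n)          # observations

# A calibrated forecaster issues the true distribution N(0, 1);
# an overdispersed forecaster issues N(0, 2^2).
for label, sigma in [("calibrated", 1.0), ("overdispersed", 2.0)]:
    pit = stats.norm.cdf(truth, loc=0.0, scale=sigma)   # PIT values
    hist, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
    # A flat PIT histogram indicates probabilistic calibration;
    # a hump shape indicates overdispersion.
    print(label, hist)
    # Sharpness: average width of the central 90% prediction interval
    # (a property of the forecasts only, not of the observations).
    width = stats.norm.ppf(0.95, scale=sigma) - stats.norm.ppf(0.05, scale=sigma)
    print(label, "90% interval width:", round(width, 2))
```

The overdispersed forecaster produces a hump-shaped PIT histogram and wider intervals, illustrating the paradigm of maximizing sharpness subject to calibration.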
A Score Regression Approach to Assess Calibration of Continuous Probabilistic Predictions
Biometrics, 2010
Calibration, the statistical consistency between forecast distributions and the observations, is a central requirement for probabilistic predictions. Calibration of continuous forecasts is typically assessed using the probability integral transform histogram. In this article, we propose significance tests based on scoring rules to assess the calibration of continuous predictive distributions. For an ideal normal forecast we derive the first two moments of two commonly used scoring rules: the logarithmic score and the continuous ranked probability score. This naturally leads to the construction of two unconditional tests for normal predictions. More generally, we propose a novel score regression approach, where the individual scores are regressed on suitable functions of the predictive variance. When based on the Dawid-Sebastiani score, this conditional approach is applicable even to certain nonnormal predictions. Two case studies illustrate that the score regression approach typically has more power to detect miscalibrated forecasts than the other approaches considered, including a recently proposed technique based on conditional exceedance probability curves.
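A hedged sketch of the score regression idea follows: compute the CRPS of Gaussian forecasts in closed form, centre by its theoretical mean under calibration, and regress on the predictive standard deviation. The exact regression specification and tests in the paper differ; this only illustrates the mechanism.

```python
# Score regression sketch for Gaussian forecasts; data are synthetic and
# the regression specification is a simplified stand-in for the paper's.
import numpy as np
from scipy import stats

def crps_normal(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) predictive distribution at y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z) - 1 / np.sqrt(np.pi))

rng = np.random.default_rng(1)
n = 500
mu = rng.normal(0, 1, n)                 # forecast means
sigma = rng.uniform(0.5, 2.0, n)         # forecast std deviations
y = rng.normal(mu, 1.5 * sigma)          # observations: forecasts too sharp

scores = crps_normal(y, mu, sigma)
# For an ideal normal forecast, E[CRPS] = sigma / sqrt(pi). Centre the
# scores by this theoretical mean; under calibration a regression on
# sigma should then show no slope or intercept.
centred = scores - sigma / np.sqrt(np.pi)
slope, intercept, r, p, se = stats.linregress(sigma, centred)
print(f"slope={slope:.3f}, p-value={p:.3g}")   # significant: miscalibrated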
Evaluating Probability Forecasts: Calibration Isn't Everything
Using evaluation methodologies for rare events from meteorology and psychology, we examine the value of probability forecasts of real GDP declines during the current and each of the next four quarters using data from the Survey of Professional Forecasters. We study the quality of these probability forecasts in terms of calibration, resolution, the relative operating characteristic (ROC), and alternative variance decompositions. We find that even though the quadratic probability score (QPS) and the calibration tests suggest the forecasts for all five horizons to be useful, the other approaches clearly identify the longer-term forecasts (Q3-Q4) as having no skill relative to a naïve baseline forecast. For a given hit rate of (say) 90%, the associated high false alarm rates for the longer-term forecasts make them unusable in practice. We find conclusive evidence that the shorter-term forecasts (Q0-Q1) possess significant skill in terms of all measures considered, even though they are characterized by considerable excess...
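The following sketch illustrates two of the measures this abstract contrasts, the quadratic probability (Brier) score and the ROC hit-rate/false-alarm trade-off, on synthetic binary forecasts; it is not the paper's analysis or data.

```python
# Illustrative QPS and ROC computation for binary probability forecasts;
# all data here are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
p_true = rng.uniform(0, 1, n)            # underlying event probabilities
event = rng.random(n) < p_true           # event occurrences
forecast = np.clip(p_true + rng.normal(0, 0.2, n), 0, 1)  # noisy forecasts

qps = np.mean((forecast - event) ** 2)   # quadratic probability score
print("QPS:", round(qps, 4))

# ROC: sweep decision thresholds, record hit rate vs false alarm rate.
for t in (0.3, 0.5, 0.7, 0.9):
    warn = forecast >= t
    hit_rate = warn[event].mean()        # P(warn | event occurred)
    false_alarm = warn[~event].mean()    # P(warn | no event)
    print(f"threshold {t}: hit={hit_rate:.2f}, false alarm={false_alarm:.2f}")
```

As the abstract notes, a forecast can look acceptable on QPS while the ROC reveals that any threshold achieving a high hit rate also incurs an unacceptably high false alarm rate.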
Evaluating probability forecasts
The Annals of Statistics, 2011
Probability forecasts of events are routinely used in climate predictions, in forecasting default probabilities on bank loans or in estimating the probability of a patient's positive response to treatment. Scoring rules have long been used to assess the efficacy of the forecast probabilities after observing the occurrence, or nonoccurrence, of the predicted events. We develop herein a statistical theory for scoring rules and propose an alternative approach to the evaluation of probability forecasts. This approach uses loss functions relating the predicted to the actual probabilities of the events and applies martingale theory to exploit the temporal structure between the forecast and the subsequent occurrence or nonoccurrence of the event.
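A small numerical illustration of why scores against observed outcomes can stand in for losses against the unobserved true probabilities, in the spirit of the loss-function view above: for squared loss, E[(p̂ − Y)²] = E[(p̂ − p)²] + E[p(1 − p)], so the Brier score equals the loss against the truth plus an irreducible term. The construction is a simplified sketch, not the paper's martingale machinery.

```python
# Empirical check of the squared-loss decomposition; data are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
p = rng.uniform(0, 1, n)                  # true event probabilities
y = (rng.random(n) < p).astype(float)     # outcomes
p_hat = np.clip(p + rng.normal(0, 0.1, n), 0, 1)   # noisy forecasts

brier = np.mean((p_hat - y) ** 2)                  # loss against outcomes
loss_vs_truth = np.mean((p_hat - p) ** 2)          # loss against truth
irreducible = np.mean(p * (1 - p))                 # outcome noise term
print(round(brier, 4), round(loss_vs_truth + irreducible, 4))  # nearly equal
```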
Sensitivity to Distance and Baseline Distributions in Forecast Evaluation
Management Science, 2009
Scoring rules can provide incentives for truthful reporting of probabilities and evaluation measures for the probabilities after the events of interest are observed. Often the space of events is ordered and an evaluation relative to some baseline distribution is desired. Scoring rules typically studied in the literature and used in practice do not take account of any ordering of events, and they evaluate probabilities relative to a default baseline distribution. In this paper, we construct rich families of scoring rules that are strictly proper (thereby encouraging truthful reporting), are sensitive to distance (thereby taking into account ordering of events), and incorporate a baseline distribution relative to which the value of a forecast is measured. In particular, we extend the power and pseudospherical families of scoring rules to allow for sensitivity to distance, with or without a specified baseline distribution.
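The following sketch contrasts a distance-insensitive rule (the multicategory Brier score) with a distance-sensitive one (the ranked probability score) on an ordered outcome space; the specific forecasts are illustrative, and the paper's power and pseudospherical families are not implemented here.

```python
# Distance sensitivity: two forecasts with identical Brier scores but
# different RPS, because RPS respects the ordering of categories.
import numpy as np

def brier(p, k):
    e = np.zeros_like(p); e[k] = 1.0
    return np.sum((p - e) ** 2)

def rps(p, k):
    # RPS compares cumulative forecast and outcome distributions, so
    # probability mass near the observed category is penalized less.
    e = np.zeros_like(p); e[k] = 1.0
    return np.sum((np.cumsum(p) - np.cumsum(e)) ** 2)

outcome = 0                                  # lowest ordered category
near = np.array([0.5, 0.5, 0.0, 0.0])        # mass adjacent to the outcome
far  = np.array([0.5, 0.0, 0.0, 0.5])        # same mass, farthest category

print("Brier:", brier(near, outcome), brier(far, outcome))   # 0.5 vs 0.5
print("RPS:  ", rps(near, outcome), rps(far, outcome))       # 0.25 vs 0.75
```

Both forecasts receive the same Brier score, but the RPS penalizes the forecast that places mass far from the observed category, which is the sensitivity-to-distance property the paper builds into its families.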
Inferential, Nonparametric Statistics to Assess the Quality of Probabilistic Forecast Systems
Monthly Weather Review, 2007
Many statistical forecast systems are available to interested users. To be useful for decision making, these systems must be based on evidence of underlying mechanisms. Once causal connections between the mechanism and its statistical manifestation have been firmly established, the forecasts must also provide some quantitative evidence of "quality." However, the quality of statistical climate forecast systems (forecast quality) is an ill-defined and frequently misunderstood property. Often, providers and users of such forecast systems are unclear about what quality entails and how to measure it, leading to confusion and misinformation. A generic framework is presented that quantifies aspects of forecast quality using an inferential approach to calculate nominal significance levels (p values), which can be obtained either by directly applying nonparametric statistical tests such as Kruskal-Wallis (KW) or Kolmogorov-Smirnov (KS) or by using Monte Carlo methods (in the case of forecast skill scores). Once converted to p values, these forecast quality measures provide a means to objectively evaluate and compare temporal and spatial patterns of forecast quality across datasets and forecast systems. The analysis demonstrates the importance of providing p values rather than adopting some arbitrarily chosen significance levels such as 0.05 or 0.01, which is still common practice. This is illustrated by applying nonparametric tests (such as KW and KS) and skill scoring methods [linear error in the probability space (LEPS) and ranked probability skill score (RPSS)] to the five-phase Southern Oscillation index classification system using historical rainfall data from Australia, South Africa, and India. The selection of quality measures is solely based on their common use and does not constitute endorsement. It is found that nonparametric statistical tests can be adequate proxies for skill measures such as LEPS or RPSS. The framework can be implemented anywhere, regardless of dataset, forecast system, or quality measure. Eventually such inferential evidence should be complemented by descriptive statistical methods in order to fully assist in operational risk management.
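A hedged sketch of the framework's two routes to p values follows: a direct nonparametric test (Kruskal-Wallis across forecast categories) and a Monte Carlo p value for a score under a no-skill null. The grouping, test statistic, and data are illustrative, not those of the paper.

```python
# Two routes to an inferential p value for forecast quality; synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Rainfall observations grouped by a hypothetical 3-phase classification.
groups = [rng.gamma(2.0, 10.0 + 5.0 * k, size=40) for k in range(3)]

# Route 1: direct nonparametric test across the forecast categories.
h, p_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H={h:.2f}, p={p_kw:.4f}")

# Route 2: Monte Carlo p value. Does the observed between-group spread of
# means exceed what random relabelling of the same data would produce?
pooled = np.concatenate(groups)
obs_stat = np.var([g.mean() for g in groups])
null = []
for _ in range(2000):
    perm = rng.permutation(pooled)
    null.append(np.var([perm[i * 40:(i + 1) * 40].mean() for i in range(3)]))
p_mc = np.mean(np.array(null) >= obs_stat)
print(f"Monte Carlo p value: {p_mc:.4f}")
```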
Probabilistic recalibration of forecasts
International Journal of Forecasting, 2019
We present a scheme by which a probabilistic forecasting system whose predictions have poor probabilistic calibration may be recalibrated by incorporating past performance information to produce a new forecasting system that is demonstrably superior to the original, in that one may use it to consistently win wagers against someone using the original system. The scheme utilizes Gaussian process (GP) modeling to estimate a probability distribution over the Probability Integral Transform (PIT) of a scalar predictand. The GP density estimate gives closed-form access to information entropy measures associated with the estimated distribution, which allows prediction of winnings in wagers against the base forecasting system. A separate consequence of the procedure is that the recalibrated forecast has a uniform expected PIT distribution. A distinguishing feature of the procedure is that it is appropriate even if the PIT values are not i.i.d. The recalibration scheme is formulated in a framework that exploits the deep connections between information theory, forecasting, and betting. We demonstrate the effectiveness of the scheme in two case studies: a laboratory experiment with a nonlinear circuit and seasonal forecasts of the intensity of the El Niño-Southern Oscillation phenomenon.
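A much-simplified stand-in for the recalibration idea above: estimate the distribution of past PIT values and compose it with the base forecast CDF. The paper uses a Gaussian-process density estimate and handles non-i.i.d. PIT values; an empirical CDF of i.i.d. PITs is used here purely for illustration.

```python
# Recalibration by composing the base CDF with an estimate of the PIT
# distribution; empirical-CDF stand-in for the paper's GP estimate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(0, 1, 5000)                 # past observations
# Miscalibrated base forecast: N(0, 0.5^2), too sharp (U-shaped PIT).
pit_sorted = np.sort(stats.norm.cdf(y, scale=0.5))

def recalibrated_cdf(x):
    """Composite CDF: estimated CDF of past PITs applied to the base CDF."""
    base = stats.norm.cdf(x, scale=0.5)
    return np.searchsorted(pit_sorted, base) / len(pit_sorted)

# PIT values under the recalibrated forecast are close to uniform.
y_new = rng.normal(0, 1, 2000)
new_pit = np.array([recalibrated_cdf(v) for v in y_new])
print(np.histogram(new_pit, bins=5, range=(0, 1))[0])  # roughly flat
```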
Evaluation of Probabilistic Forecasts: Proper Scoring Rules and Moments
SSRN Electronic Journal, 2000
The paper provides an overview of probabilistic forecasting and discusses a theoretical framework for the evaluation of probabilistic forecasts which is based on proper scoring rules and moments. An artificial example of predicting a second-order autoregression and an example of predicting the RTSI stock index are used as illustrations.
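A toy version of the autoregression illustration mentioned above: fit an AR(2) model, issue one-step-ahead Gaussian predictive densities, and evaluate them by their average logarithmic score (a proper scoring rule). Model order, estimator, and data are illustrative assumptions.

```python
# Log-score evaluation of one-step-ahead AR(2) predictive densities;
# everything here is a simplified, synthetic illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 1000
x = np.zeros(n)
for t in range(2, n):                      # simulate an AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

# Fit AR(2) coefficients by least squares on the first half.
X = np.column_stack([x[1:n // 2 - 1], x[0:n // 2 - 2]])
y = x[2:n // 2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_sd = np.std(y - X @ coef)

# Average log score on the held-out second half (higher is better).
mu = coef[0] * x[n // 2:-1] + coef[1] * x[n // 2 - 1:-2]
logscore = stats.norm.logpdf(x[n // 2 + 1:], loc=mu, scale=resid_sd).mean()
print(f"average log score: {logscore:.3f}")
```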
Probabilistic forecast reconciliation: Properties, evaluation and score optimisation
European Journal of Operational Research
We develop a framework for prediction of multivariate data that follow some known linear constraints, such as the example where some variables are aggregates of others. This is particularly common when forecasting time series (predicting the future), but also arises in other types of prediction. For point prediction, an increasingly popular technique is reconciliation, whereby predictions are made for all series (so-called 'base' predictions) and subsequently adjusted to ensure coherence with the constraints. This paper extends reconciliation from the setting of point prediction to probabilistic prediction. A novel definition of reconciliation is developed and used to construct densities and draw samples from a reconciled probabilistic prediction. In the elliptical case, it is proven that the true predictive distribution can be recovered from reconciliation even when the location and scale matrix of the base prediction are chosen arbitrarily. To find reconciliation weights, an objective function based on scoring rules is optimised. The energy and variogram scores are considered since the log score is improper in the context of comparing unreconciled to reconciled predictions, a result also proved in this paper. To account for the stochastic nature of the energy and variogram scores, optimisation is achieved using stochastic gradient descent. This method is shown to improve base predictions in simulation studies and in an empirical application, particularly when the base prediction models are severely misspecified. When misspecification is not too severe, extending popular reconciliation methods for point prediction can result in a similar performance to score optimisation via stochastic gradient descent. The methods described here are implemented in the ProbReco package for R.
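A minimal sketch of probabilistic reconciliation by projection: base predictive samples for (total, A, B) with the constraint total = A + B are mapped onto the coherent subspace using the OLS projection S(S'S)⁻¹S'. The paper optimises reconciliation weights via the energy or variogram score with stochastic gradient descent; a fixed OLS projection is used here only to show the mechanics.

```python
# Projection-based reconciliation of predictive samples; a fixed OLS
# projection stands in for the paper's score-optimised weights.
import numpy as np

rng = np.random.default_rng(7)
S = np.array([[1.0, 1.0],    # total = A + B
              [1.0, 0.0],    # A
              [0.0, 1.0]])   # B
P = S @ np.linalg.inv(S.T @ S) @ S.T      # projection onto coherent space

# Incoherent base samples: each series forecast independently.
draws = np.column_stack([
    rng.normal(10.0, 2.0, 1000),          # total
    rng.normal(4.0, 1.0, 1000),           # A
    rng.normal(5.0, 1.0, 1000),           # B
])
reconciled = draws @ P.T                  # project every sample draw

gap = np.abs(reconciled[:, 0] - reconciled[:, 1] - reconciled[:, 2])
print("max coherence violation:", gap.max())   # ~0 after reconciliation
```

Each reconciled draw satisfies the aggregation constraint exactly, giving a sample from a coherent probabilistic prediction in the sense defined by the paper.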