Peter Grünwald | Universiteit Leiden

Papers by Peter Grünwald

Regret and Jeffreys Integrals in Exp. Families

Jeffreys versus Shtarkov distributions associated with some natural exponential families

Discovering outstanding subgroup lists for numeric targets using MDL

ECML PKDD 2020: Machine Learning and Knowledge Discovery in Databases, 2020

The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations, though, as they typically rely heavily on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows trading off subgroup quality against subgroup complexity. We then propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.
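
As a rough illustration of the kind of fit-versus-complexity trade-off such an MDL formulation makes, the sketch below scores a subgroup list by the code length of the numeric target under per-subgroup Gaussian models plus a crude per-subgroup penalty. The encoding, the penalty constant, and the function names are simplifications invented here; the actual SSD++ encoding is more refined.

```python
# Rough MDL-style score for a subgroup list over a numeric target: code length
# of the data under per-subgroup Gaussian models plus a complexity penalty.
# This is a toy illustration, not the encoding used by SSD++.
import numpy as np

def gaussian_code_length(values):
    """Approximate cost (in nats) of encoding `values` with their own Gaussian."""
    n = len(values)
    if n == 0:
        return 0.0
    var = np.var(values) + 1e-9                      # guard against zero variance
    return 0.5 * n * (np.log(2 * np.pi * var) + 1.0)

def mdl_score(target, subgroup_masks, penalty_per_subgroup=10.0):
    """Lower is better: rows are claimed by the first matching subgroup (it is
    a subgroup *list*), leftover rows fall back to the overall 'default' model."""
    target = np.asarray(target, dtype=float)
    covered = np.zeros(len(target), dtype=bool)
    total = 0.0
    for mask in subgroup_masks:                      # boolean row selectors
        rows = np.asarray(mask) & ~covered
        total += gaussian_code_length(target[rows])
        covered |= rows
    total += gaussian_code_length(target[~covered])  # default model for the rest
    return total + penalty_per_subgroup * len(subgroup_masks)
```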

Information Theoretic Methods for Bioinformatics

EURASIP Journal on Bioinformatics and Systems Biology, 2007

... seem to have a role in evolutionary and structural analysis of proteomes ... to discover long-range regulatory elements (LREs) that determine tissue-specific gene expression. ... Using MDL-compress, they analyze the relationship between miRNAs, single nucleotide polymorphisms ...

CHRISTOPHER S. WALLACE Statistical and Inductive Inference by Minimum Message Length. Springer (2005). ISBN 038723795X. 46.00. 432 pp. Hardbound

The Computer Journal, 2007

An empirical study of minimum description length model selection with infinite parametric complexity

Journal of Mathematical Psychology, 2006

Parametric complexity is a central concept in Minimum Description Length (MDL) model selection. In practice it often turns out to be infinite, even for quite simple models such as the Poisson and geometric families. In such cases, MDL model selection based on the normalized maximum likelihood (NML) distribution and Bayesian inference based on Jeffreys' prior cannot be used. Several ways to resolve this problem have been proposed. We conduct experiments to compare and evaluate their behaviour on small sample sizes.
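
For reference, the NML distribution and the parametric complexity the abstract refers to are, in standard notation for discrete data (the sum becomes an integral in the continuous case):

```latex
% NML distribution and parametric complexity of a model class M = {p_theta}
p_{\mathrm{NML}}(x^n) = \frac{p_{\hat\theta(x^n)}(x^n)}{\sum_{y^n} p_{\hat\theta(y^n)}(y^n)},
\qquad
\mathrm{COMP}_n(\mathcal{M}) = \log \sum_{y^n} p_{\hat\theta(y^n)}(y^n),
```

where $\hat\theta(x^n)$ is the maximum-likelihood estimator. For the Poisson and geometric families with unbounded parameter space the sum diverges, so $\mathrm{COMP}_n(\mathcal{M})$ is infinite and the NML distribution is undefined, which is exactly the situation the experiments address.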

Ondeugdelijke statistiek (Dutch: "unsound statistics")

Theoretical Computer Science, 2012

When Discriminative Learning of Bayesian Network Parameters Is Easy

Bayesian network models are widely used for discriminative prediction tasks such as classification. Usually their parameters are determined using 'unsupervised' methods such as maximization of the joint likelihood. The reason is often that it is unclear how to find the parameters maximizing the conditional (supervised) likelihood. We show how the discriminative learning problem can be solved efficiently for a large class of Bayesian network models, including the Naive Bayes (NB) and tree-augmented Naive Bayes (TAN) models. We do this by showing that, under a certain general condition on the network structure, the discriminative learning problem is exactly equivalent to logistic regression with unconstrained convex parameter spaces. Hitherto this was known only for Naive Bayes models. Since logistic regression models have a concave log-likelihood surface, the global maximum can be easily found by local optimization methods.
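
A minimal sketch of the optimization problem that results once the equivalence holds: the exact mapping from network parameters is in the paper, and the code below is just plain multinomial logistic regression fitted by gradient ascent (the function name and hyperparameters are invented for illustration). Because the conditional log-likelihood is concave in these parameters, any local optimizer reaches the global maximum.

```python
# Multinomial logistic regression by gradient ascent on the conditional
# log-likelihood, which is concave, so a local method finds the global optimum.
import numpy as np

def fit_conditional(X, y, n_classes, lr=0.1, iters=2000):
    """X: (n, d) feature matrix, y: (n,) class labels in {0, ..., n_classes-1}."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(iters):
        logits = X @ W.T + b                      # (n, n_classes)
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)         # softmax = model's P(y | x)
        G = Y - P                                 # gradient of the log-likelihood
        W += lr * (G.T @ X) / n
        b += lr * G.mean(axis=0)
    return W, b
```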

Supervised Learning of Bayesian Network Parameters Made Easy

Bayesian network models are widely used for supervised prediction tasks such as classification. Usually the parameters of such models are determined using 'unsupervised' methods such as maximization of the joint likelihood. In many cases, the reason is that it is not clear how to find the parameters maximizing the supervised (conditional) likelihood. We show how the supervised learning problem can be solved efficiently for a large class of Bayesian network models, including the Naive Bayes (NB) and tree-augmented NB (TAN) classifiers. We do this by showing that, under a certain general condition on the network structure, the supervised learning problem is exactly equivalent to logistic regression. Hitherto this was known only for Naive Bayes models. Since logistic regression models have a concave log-likelihood surface, the global maximum can be easily found by local optimization methods.

On Supervised Learning Of

Supervised Naive Bayes Parameters

Bayesian network models are widely used for supervised prediction tasks such as classification. The Naive Bayes (NB) classifier in particular has been successfully applied in many fields. Usually its parameters are determined using 'unsupervised' methods such as likelihood maximization. This can lead to seriously biased predictions, since the independence assumptions made by the NB model rarely hold. It has not been clear, though, how to find parameters that maximize the supervised likelihood or posterior globally. In this paper we show how this supervised learning problem can be solved efficiently. We introduce an alternative parametrization in which the supervised likelihood becomes concave. From this result it follows that there can be at most one maximum, easily found by local optimization methods. We present test results that show this is feasible and highly beneficial.
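
For concreteness, a standard way to see why such a reparametrization can work (this is the textbook Naive Bayes / logistic regression correspondence, not necessarily the exact parametrization used in the paper): writing the class prior and the per-feature conditionals in log-space turns the conditional class probability into a softmax, and the conditional log-likelihood of a softmax is concave in those log-space parameters.

```latex
% Naive Bayes conditional class probability and its log-linear (softmax) form
P(y = c \mid x)
 = \frac{\theta_c \prod_i \theta_{i,c}(x_i)}{\sum_{c'} \theta_{c'} \prod_i \theta_{i,c'}(x_i)}
 = \frac{\exp\bigl(w_c + \sum_i w_{i,c}(x_i)\bigr)}{\sum_{c'} \exp\bigl(w_{c'} + \sum_i w_{i,c'}(x_i)\bigr)},
\qquad w_c = \log \theta_c,\; w_{i,c}(x_i) = \log \theta_{i,c}(x_i).
```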

Algorithmic information theory

We introduce algorithmic information theory, also known as the theory of Kolmogorov complexity. We explain the main concepts of this quantitative approach to defining 'information'. We discuss the extent to which Kolmogorov's and Shannon's information theory have a common purpose, and where they are fundamentally different. We indicate how recent developments within the theory allow one to formally distinguish between 'structural' (meaningful) and 'random' information as measured by the Kolmogorov structure function, which leads to a mathematical formalization of Occam's razor in inductive inference. We end by discussing some of the philosophical implications of the theory.
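
For reference, the two central definitions mentioned here, in standard notation (U is a fixed universal prefix machine, and S ranges over finite sets of strings containing x):

```latex
% Kolmogorov complexity of a string x, and the Kolmogorov structure function
K(x) = \min \{\, |p| : U(p) = x \,\},
\qquad
h_x(\alpha) = \min \{\, \log |S| : x \in S,\; K(S) \le \alpha \,\}.
```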

A Game-Theoretic Analysis of Updating Sets of Probabilities

We consider how an agent should update her uncertainty when it is represented by a set P of probability distributions and the agent observes that a random variable X takes on value x, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly used criterion in the literature. We adopt a game-theoretic framework in which the agent plays against a bookie, who chooses some distribution from P. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between conditioning and calibration ...
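
As a toy illustration of the minimax criterion over a set P of distributions (the states, actions and losses below are invented purely for illustration), one evaluates each candidate decision by its worst-case expected loss over P and picks the decision with the smallest worst case.

```python
# Minimax decision making against a set P of distributions over two states:
# choose the action whose worst-case expected loss over P is smallest.
import numpy as np

P = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]   # credal set over 2 states
loss = np.array([[0.0, 1.0],                       # loss[action, state]
                 [1.0, 0.0],
                 [0.4, 0.4]])                      # a "hedging" action

def worst_case_loss(action):
    return max(float(p @ loss[action]) for p in P)

best = min(range(loss.shape[0]), key=worst_case_loss)
print(best, worst_case_loss(best))                 # the hedging action wins here
```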

Mini-Course on MDL

Follow the Leader If You Can, Hedge If You Must

Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has terrible performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As part of our construction, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour and Stoltz (2007), yielding slightly improved worst-case guarantees. By interleaving AdaHedge and FTL, the FlipFlop algorithm achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have...
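
A stripped-down sketch of the AdaHedge-style idea of tuning the Hedge learning rate from the cumulative mixability gap rather than via the doubling trick; initialisation and edge cases here are simplifications, and the interleaving with FTL that yields FlipFlop is not shown, so treat this as an illustration rather than a reference implementation.

```python
# Exponential-weights (Hedge) learner whose learning rate eta is set from the
# cumulative mixability gap, in the spirit of AdaHedge. Assumes >= 2 experts.
import numpy as np

class AdaHedgeSketch:
    def __init__(self, n_experts):
        self.L = np.zeros(n_experts)     # cumulative expert losses
        self.delta_sum = 1e-12           # cumulative mixability gap (avoid /0)

    def weights(self):
        eta = np.log(len(self.L)) / self.delta_sum
        w = np.exp(-eta * (self.L - self.L.min()))
        return w / w.sum(), eta

    def update(self, losses):
        losses = np.asarray(losses, dtype=float)
        w, eta = self.weights()
        h = float(w @ losses)                            # Hedge (mixture) loss
        shifted = losses - losses.min()
        m = losses.min() - np.log(w @ np.exp(-eta * shifted)) / eta   # mix loss
        self.delta_sum += max(h - m, 0.0)                # mixability gap >= 0
        self.L += losses
        return w
```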

Catching up Faster by Switching Sooner

A minimum description length approach to grammar inference

Lecture Notes in Computer Science, 1996

Safe Probability: Restricted Conditioning and Extended Marginalization

Lecture Notes in Computer Science, 2013

Updating probabilities by conditioning can lead to bad predictions, unless one explicitly takes into account the mechanisms that determine (1) what is observed and (2) what has to be predicted. Analogous to the observation-CAR (coarsening at random) condition used in existing analyses of (1), we propose a new prediction task-CAR condition to analyze (2). We redefine conditioning so that it remains valid even when these mechanisms are unknown. This will often update a singleton distribution to an imprecise set of probabilities, leading to dilation, but we show how to mitigate this problem by marginalization. We illustrate our notions using the Monty Hall puzzle.
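
The Monty Hall example from the abstract can be checked with a few lines of simulation; the point is that the correct answer (switching wins with probability 2/3) depends on the host's door-opening mechanism, which naive conditioning on the observed event alone does not capture. The simulation below assumes the standard mechanism: the host always opens an unchosen empty door, choosing uniformly at random when there is a choice.

```python
# Monte Carlo check of the Monty Hall puzzle under the standard host mechanism.
# Switching wins with probability about 2/3, whereas naive conditioning on
# "that door is open and empty" would suggest 1/2.
import random

def switch_wins():
    car = random.randrange(3)
    pick = 0                                               # contestant picks door 0
    host = random.choice([d for d in (1, 2) if d != car and d != pick])
    switch_to = 3 - host                                   # the other closed door
    return switch_to == car

n = 100_000
print(sum(switch_wins() for _ in range(n)) / n)            # ≈ 0.667
```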

A First Look at the Minimum Description Length Principle

Intelligent Algorithms in Ambient and Biomedical Computing, 2006

This is an informal overview of Rissanen's Minimum Description Length (MDL) Principle. We provide an entirely non-technical introduction to the subject, focussing on conceptual issues.

Safe learning — how to adjust Bayes and MDL when the model is wrong

IEEE Information Theory Workshop 2010 (ITW 2010), 2010

In a recent paper, Grünwald and Langford showed that MDL and Bayesian inference can be statistically inconsistent in a classification context when the model is wrong. They presented a countable family M = {P1, P2, ...} of probability distributions, a "true" distribution P* outside M, and a Bayesian prior distribution Π on M, such that M contains a distribution Q within a small KL divergence δ > 0 from P*, and with substantial prior, e.g. Π(Q) = 1/2. Nevertheless, when data are i.i.d. (independently and identically distributed) according to P*, then, no matter how many data are observed, the Bayesian posterior puts nearly all its mass on distributions that are at a distance from P* much larger than δ. As a result, classification based on the Bayesian posterior can perform substantially worse than random guessing, no matter how many data are observed, even though the classifier based on Q performs much better than random guessing. Similarly, with probability 1, the distribution inferred by two-part MDL has KL divergence to P* tending to infinity, and performs much worse than Q in classification. Intriguingly, though, in contrast to the full Bayesian predictor, for large n the two-part MDL estimator never performs worse than random guessing.
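
For reference, the two quantities the abstract keeps referring to, in their standard forms (the notation is generic, with the prior Π and model class M as in the abstract; the exact codelengths used in the paper may differ):

```latex
% KL divergence from P* to Q, and a prior-based two-part MDL estimator
D(P^* \,\|\, Q) = \mathbb{E}_{X \sim P^*}\!\left[\log \frac{p^*(X)}{q(X)}\right],
\qquad
\hat{P}_{\mathrm{2p}}(x^n) = \arg\min_{P \in \mathcal{M}} \bigl( -\log \Pi(P) - \log P(x^n) \bigr).
```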
