Peter Grünwald | Universiteit Leiden

Papers by Peter Grünwald

Regret and Jeffreys Integrals in Exp. Families

Jeffreys versus Shtarkov distributions associated with some natural exponential families

Discovering outstanding subgroup lists for numeric targets using MDL

ECML PKDD 2020: Machine Learning and Knowledge Discovery in Databases, 2020

The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations, though, as they typically rely heavily on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows one to trade off subgroup quality against the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.
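
The MDL-based scoring the abstract describes can be sketched in miniature. The sketch below is an illustrative assumption, not the SSD++ implementation: it scores a single candidate subgroup over a numeric target by the compression gain of giving the subgroup its own Gaussian code, with a placeholder `model_cost` standing in for the more refined encoding of the subgroup description used in the paper.

```python
import numpy as np

def gaussian_codelength(values):
    """Codelength (in nats) of encoding `values` with a Gaussian
    fit to those values: n/2 * log(2*pi*e*var)."""
    n = len(values)
    var = np.var(values) + 1e-12  # guard against zero variance
    return 0.5 * n * np.log(2 * np.pi * np.e * var)

def mdl_gain(target, mask, model_cost):
    """Compression gain of describing the rows in `mask` with their own
    Gaussian instead of the dataset-wide one, minus the cost `model_cost`
    (in nats, a hypothetical stand-in) of describing the subgroup itself."""
    baseline = gaussian_codelength(target)
    inside = gaussian_codelength(target[mask])
    outside = gaussian_codelength(target[~mask])
    return baseline - (inside + outside + model_cost)

# Toy data: a subgroup (first 200 rows) with deviating mean and small spread.
rng = np.random.default_rng(0)
target = np.concatenate([rng.normal(5.0, 0.5, 200), rng.normal(0.0, 2.0, 800)])
mask = np.zeros(1000, dtype=bool)
mask[:200] = True
print(mdl_gain(target, mask, model_cost=10.0))  # positive => subgroup compresses
```

A positive gain means the subgroup compresses the data better than the dataset-wide distribution, which is the sense in which the best subgroup list "best summarizes the data."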

Information Theoretic Methods for Bioinformatics

EURASIP Journal on Bioinformatics and Systems Biology, 2007

... seem to have a role in evolutionary and structural analysis of proteomes ... to discover long-range regulatory elements (LREs) that determine tissue-specific gene expression. ... Using MDL-compress, they analyze the relationship between miRNAs, single nucleotide polymorphisms ...

Christopher S. Wallace, Statistical and Inductive Inference by Minimum Message Length. Springer (2005). ISBN 038723795X. £46.00. 432 pp. Hardbound.

The Computer Journal, 2007

An empirical study of minimum description length model selection with infinite parametric complexity

Journal of Mathematical Psychology, 2006

Ondeugdelijke statistiek (Unsound Statistics)

Theoretical Computer Science, 2012

When Discriminative Learning of Bayesian Network Parameters Is Easy

Supervised Learning of Bayesian Network Parameters Made Easy

On Supervised Learning Of

Supervised Naive Bayes Parameters

Algorithmic information theory

We introduce algorithmic information theory, also known as the theory of Kolmogorov complexity. We explain the main concepts of this quantitative approach to defining 'information'. We discuss the extent to which Kolmogorov's and Shannon's information theory have a common purpose, and where they are fundamentally different. We indicate how recent developments within the theory allow one to formally distinguish between 'structural' (meaningful) and 'random' information as measured by the Kolmogorov structure function, which leads to a mathematical formalization of Occam's razor in inductive inference. We end by discussing some of the philosophical implications of the theory.
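
Kolmogorov complexity itself is uncomputable, but a standard classroom illustration (not from this paper) approximates it from above with a real compressor, which already separates 'structural' from 'random' information on simple inputs:

```python
import zlib, random

def compressed_size(s: bytes) -> int:
    """Length of a zlib encoding of s: a crude, computable upper bound
    on Kolmogorov complexity, up to additive constants."""
    return len(zlib.compress(s, 9))

structured = b"ab" * 5000  # highly regular: a short program generates it
random.seed(1)
noise = bytes(random.getrandbits(8) for _ in range(10000))  # incompressible

print(compressed_size(structured))  # small: the regularity is 'structural'
print(compressed_size(noise))       # close to 10000: 'random' information
```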

A Game-Theoretic Analysis of Updating Sets of Probabilities

We consider how an agent should update her uncertainty when it is represented by a set P of probability distributions and the agent observes that a random variable X takes on value x, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from P. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between conditioning and calibration ...
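
As a rough illustration of the setting, the hedged sketch below (toy numbers and names, not the paper's construction) implements the condition-then-minimax rule for a two-element set P: each distribution is conditioned on the observed X = x, and the agent picks the action with the smallest worst-case conditional expected loss. The paper's point is that whether this rule is actually optimal depends on which game is being played against the bookie.

```python
import numpy as np

# A toy credal set P: P[i, x, y] is the i-th joint distribution over
# binary (X, Y). Two distributions, each summing to 1.
P = np.array([
    [[0.3, 0.2], [0.1, 0.4]],
    [[0.1, 0.4], [0.3, 0.2]],
])

def minimax_action(P, x, loss):
    """After observing X = x, condition every distribution in the set on x,
    then pick the action whose worst-case expected loss over the
    (conditioned) set is smallest."""
    conditionals = P[:, x, :] / P[:, x, :].sum(axis=1, keepdims=True)
    # loss[a, y]: loss of action a when Y = y
    expected = conditionals @ loss.T    # shape: (distribution, action)
    worst_case = expected.max(axis=0)   # the bookie picks the worst distribution
    return int(worst_case.argmin())

loss = np.array([[0.0, 1.0],   # action 0: predict Y = 0
                 [1.0, 0.0]])  # action 1: predict Y = 1
print(minimax_action(P, x=0, loss=loss))
```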

Mini-Course on MDL

Follow the Leader If You Can, Hedge If You Must

Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has terrible performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As part of our construction, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour and Stoltz (2007), yielding slightly improved worst-case guarantees. By interleaving AdaHedge and FTL, the FlipFlop algorithm achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have ...
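
The AdaHedge tuning described above can be sketched compactly. The following is a simplified rendering of the published update (learning rate driven by the cumulative mixability gap), not the authors' reference code; FlipFlop would additionally alternate between this update and FTL (the eta = infinity limit) across regimes.

```python
import numpy as np

def adahedge(loss_matrix):
    """Simplified AdaHedge: loss_matrix[t, k] is the loss of expert k in
    round t. The learning rate eta is tuned from the cumulative mixability
    gap delta_sum instead of via the doubling trick."""
    T, K = loss_matrix.shape
    L = np.zeros(K)            # cumulative expert losses
    delta_sum = 1e-12          # cumulative mixability gap (avoids eta = inf)
    hedge_total = 0.0
    for t in range(T):
        eta = np.log(K) / delta_sum
        w = np.exp(-eta * (L - L.min()))       # exponential weights
        w /= w.sum()
        ell = loss_matrix[t]
        h = float(w @ ell)                     # Hedge's (dot) loss this round
        # Mix loss: -(1/eta) * log sum_k w_k exp(-eta * ell_k), stabilized.
        m = ell.min() - np.log(w @ np.exp(-eta * (ell - ell.min()))) / eta
        delta_sum += h - m                     # mixability gap is nonnegative
        L += ell
        hedge_total += h
    return hedge_total - L.min()               # regret against the best expert

rng = np.random.default_rng(0)
losses = rng.random((1000, 4))                 # 4 experts, losses in [0, 1]
print(adahedge(losses))
```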

Catching up Faster by Switching Sooner

A minimum description length approach to grammar inference

Lecture Notes in Computer Science, 1996

Safe Probability: Restricted Conditioning and Extended Marginalization

Lecture Notes in Computer Science, 2013

A First Look at the Minimum Description Length Principle

Intelligent Algorithms in Ambient and Biomedical Computing, 2006

This is an informal overview of Rissanen's Minimum Description Length (MDL) Principle. We provide an entirely non-technical introduction to the subject, focussing on conceptual issues.
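
To make the conceptual point concrete, here is a minimal two-part-code example in the spirit of such introductions (a standard textbook illustration, not taken from the overview): MDL prefers the hypothesis minimizing L(H) + L(D|H), the bits needed to describe the hypothesis plus the bits needed to describe the data with its help.

```python
import math

def codelength_fair(n_heads, n):
    """Code length (bits) of n coin flips under a fixed fair coin:
    no parameters to encode, one bit per outcome."""
    return float(n)

def codelength_biased(n_heads, n):
    """Two-part code: first encode the ML estimate of the bias at
    precision 1/sqrt(n) (~ 0.5*log2(n) bits), then the data with it."""
    p = n_heads / n
    if p in (0.0, 1.0):
        data_bits = 0.0
    else:
        data_bits = -(n_heads * math.log2(p) + (n - n_heads) * math.log2(1 - p))
    return 0.5 * math.log2(n) + data_bits

# MDL picks the hypothesis with the shorter total description:
# 50/100 heads favours the fair coin; 80/100 favours the biased one.
for n_heads in (50, 80):
    print(n_heads, codelength_fair(n_heads, 100),
          round(codelength_biased(n_heads, 100), 1))
```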

Safe learning — how to adjust Bayes and MDL when the model is wrong

IEEE Information Theory Workshop 2010 (ITW 2010), 2010

In a recent paper, Grünwald and Langford showed that MDL and Bayesian inference can be statistically inconsistent in a classification context, when the model is wrong. They presented a countable family M = {P1, P2, ...} of probability distributions, a "true" distribution P* outside M, and a Bayesian prior distribution Π on M, such that M contains a distribution Q within a small KL divergence δ > 0 from P*, and with substantial prior, e.g. Π(Q) = 1/2. Nevertheless, when data are i.i.d. (independently identically distributed) according to P*, then, no matter how many data are observed, the Bayesian posterior puts nearly all its mass on distributions that are at a distance from P* that is much larger than δ. As a result, classification based on the Bayesian posterior can perform substantially worse than random guessing, no matter how many data are observed, even though the classifier based on Q performs much better than random guessing. Similarly, with probability 1, the distribution inferred by two-part MDL has KL divergence to P* tending to infinity, and performs much worse than Q in classification. Intriguingly, in contrast to the full Bayesian predictor, for large n the two-part MDL estimator never performs worse than random guessing.
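
For readers who want to see the quantities involved, the toy sketch below (an illustrative assumption, deliberately benign, and not the Grünwald-Langford construction) updates a Bayesian posterior over a misspecified two-element family and reports KL divergences to P*. In this benign case the posterior does concentrate on the KL-closest element; the paper exhibits a classification setting in which such good behaviour fails.

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.array([0.2, 0.9])   # misspecified family M = {Bern(0.2), Bern(0.9)}
prior = np.array([0.5, 0.5])    # prior Pi on M
p_star = 0.5                    # true distribution P* = Bern(0.5), outside M

x = rng.random(1000) < p_star   # i.i.d. data from P*
log_post = np.log(prior)
for xi in x:                    # multiply in each likelihood, in log space
    log_post += np.log(np.where(xi, thetas, 1 - thetas))
post = np.exp(log_post - log_post.max())
post /= post.sum()

kl = (p_star * np.log(p_star / thetas)
      + (1 - p_star) * np.log((1 - p_star) / (1 - thetas)))
print(post)  # mass concentrates on the KL-closest element (theta = 0.2 here)
print(kl)    # KL divergences from P* to each element of M
```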
