Peter Grünwald | Universiteit Leiden
Papers by Peter Grünwald
ECML PKDD 2020: Machine Learning and Knowledge Discovery in Databases , 2020
The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations, though, as they typically rely heavily on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows trading off subgroup quality against the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.
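To make the MDL trade-off in this abstract concrete, here is a minimal sketch of a two-part code length for a single subgroup with a numeric target. This is not the paper's SSD++ algorithm; the Gaussian encoding, the fixed cost per description condition, and all constants are illustrative assumptions. For continuous data the data part is a differential code length, so it can be negative; only comparisons between candidate models are meaningful.

```python
import math

def gaussian_nll_bits(values):
    """Code length (in bits, up to a fixed precision) of values under
    their own maximum-likelihood Gaussian; differential, may be negative."""
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / n
    var = max(var, 1e-12)  # guard against zero spread
    return n / 2 * math.log2(2 * math.pi * math.e * var)

def two_part_length(subgroup, rest, bits_per_condition, n_conditions):
    """L(model) + L(data | model): the model cost grows with the
    complexity of the subgroup description, the data cost shrinks
    when the split yields tight, well-separated groups."""
    model_bits = bits_per_condition * n_conditions
    data_bits = gaussian_nll_bits(subgroup) + gaussian_nll_bits(rest)
    return model_bits + data_bits
```

Under this sketch, a subgroup with a strongly deviating mean and small spread compresses the data enough to pay for its description cost, which is exactly the quality-versus-complexity trade-off the formulation describes.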
EURASIP Journal on Bioinformatics and Systems Biology, 2007
... seem to have a role in evolutionary and structural analysis of proteomes ... to discover long-range regulatory elements (LREs) that determine tissue-specific gene expression. ... Using MDL-compress, they analyze the relationship between miRNAs, single nucleotide polymorphisms ...
The Computer Journal, 2007
Journal of Mathematical Psychology, 2006
Theoretical Computer Science, 2012
We introduce algorithmic information theory, also known as the theory of Kolmogorov complexity. We explain the main concepts of this quantitative approach to defining 'information'. We discuss the extent to which Kolmogorov's and Shannon's information theory have a common purpose, and where they are fundamentally different. We indicate how recent developments within the theory allow one to formally distinguish between 'structural' (meaningful) and 'random' information as measured by the Kolmogorov structure function, which leads to a mathematical formalization of Occam's razor in inductive inference. We end by discussing some of the philosophical implications of the theory.
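The structural-versus-random distinction this abstract describes can be illustrated with a standard workaround: Kolmogorov complexity is uncomputable, but any real compressor gives an upper bound on it. The sketch below, which is my own illustration and not from the paper, uses zlib to show that a regular string admits a much shorter description than incompressible random bytes.

```python
import os
import zlib

def compressed_bits(data: bytes) -> int:
    """Upper bound on the Kolmogorov complexity of data, in bits,
    via a real general-purpose compressor."""
    return 8 * len(zlib.compress(data, 9))

# Highly regular: a short program ("repeat 'ab' 500 times") exists.
structured = b"ab" * 500
# Random bytes: expected to be incompressible, i.e. no short description.
random_like = os.urandom(1000)
```

A compressor only ever gives an upper bound; it can miss structure that a Turing machine could exploit, so equality with the true Kolmogorov complexity is never certified.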
We consider how an agent should update her uncertainty when it is represented by a set P of probability distributions and the agent observes that a random variable X takes on value x, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly-used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from P. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between conditioning and calibration ...
Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has terrible performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As part of our construction, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour and Stoltz (2007), yielding slightly improved worst-case guarantees. By interleaving AdaHedge and FTL, the FlipFlop algorithm achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have...
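The two ingredients contrasted in this abstract can be sketched in a few lines. This is a minimal illustration of plain Hedge (exponential weights) and FTL, not the AdaHedge learning-rate tuning or the FlipFlop interleaving scheme of the paper; the learning rate eta is left as a fixed parameter here precisely because choosing it adaptively is the paper's contribution.

```python
import math

def hedge_weights(cum_losses, eta):
    """Hedge: play a distribution over experts with weights
    proportional to exp(-eta * cumulative loss)."""
    w = [math.exp(-eta * loss) for loss in cum_losses]
    total = sum(w)
    return [x / total for x in w]

def ftl_pick(cum_losses):
    """Follow-the-Leader: put all mass on the expert with the
    smallest cumulative loss so far."""
    return min(range(len(cum_losses)), key=lambda i: cum_losses[i])
```

Note that as eta grows, Hedge concentrates its weight on the leader and behaves like FTL, while small eta hedges across experts; this is the dial that FlipFlop effectively flips between the two regimes.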
Lecture Notes in Computer Science, 1996
Lecture Notes in Computer Science, 2013
Intelligent Algorithms in Ambient and Biomedical Computing, 2006
This is an informal overview of Rissanen's Minimum Description Length (MDL) Principle. We provide an entirely non-technical introduction to the subject, focusing on conceptual issues.
IEEE Information Theory Workshop 2010 (ITW 2010), 2010
In a recent paper, Grunwald and Langford showed that MDL and Bayesian inference can be statistically inconsistent in a classification context, when the model is wrong. They presented a countable family M = {P1, P2, ...} of probability distributions, a "true" distribution P* outside M and a Bayesian prior distribution Π on M, such that M contains a distribution Q within a small KL divergence δ > 0 from P*, and with substantial prior, e.g. Π(Q) = 1/2. Nevertheless, when data are i.i.d. (independently and identically distributed) according to P*, then, no matter how many data are observed, the Bayesian posterior puts nearly all its mass on distributions that are at a distance from P* much larger than δ. As a result, classification based on the Bayesian posterior can perform substantially worse than random guessing, no matter how many data are observed, even though the classifier based on Q performs much better than random guessing. Similarly, with probability 1, the distribution inferred by two-part MDL has KL divergence to P* tending to infinity, and performs much worse than Q in classification; intriguingly, though, in contrast to the full Bayesian predictor, for large n the two-part MDL estimator never performs worse than random guessing.