Benjamin Rubinstein | University of California, Berkeley
Papers by Benjamin Rubinstein
Proceedings of the AAAI Conference on Artificial Intelligence
We study how to communicate findings of Bayesian inference to third parties, while preserving the strong guarantee of differential privacy. Our main contributions are four different algorithms for private Bayesian inference on probabilistic graphical models. These include two mechanisms for adding noise to the Bayesian updates, either directly to the posterior parameters, or to their Fourier transform so as to preserve update consistency. We also utilise a recently introduced posterior sampling mechanism, for which we prove bounds for the specific but general case of discrete Bayesian networks; and we introduce a maximum-a-posteriori private mechanism. Our analysis includes utility and privacy bounds, with a novel focus on the influence of graph structure on privacy. Worked examples and experiments with Bayesian naive Bayes and Bayesian linear regression illustrate the application of our mechanisms.
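As an illustration of the first of these ideas, the sketch below adds Laplace noise directly to the parameters of a Beta posterior in a Bernoulli model. This is a minimal example of the general noise-on-posterior-parameters idea only, not the paper's algorithms; the sensitivity reasoning in the comments is the standard Laplace-mechanism argument, and the function name and clamping are choices made for this sketch.

```python
import numpy as np

def private_beta_posterior(data, prior_a=1.0, prior_b=1.0, epsilon=1.0, rng=None):
    """Release a differentially private Beta posterior for Bernoulli data.

    Noise is added directly to the posterior parameters (the success/failure
    counts). Changing one record changes each count by at most 1, so Laplace
    noise with scale 1/epsilon per count gives roughly (2*epsilon)-DP here;
    a tighter accounting is possible but omitted in this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    successes = data.sum()
    failures = len(data) - successes
    noisy_a = prior_a + successes + rng.laplace(scale=1.0 / epsilon)
    noisy_b = prior_b + failures + rng.laplace(scale=1.0 / epsilon)
    # Clamp so the released parameters still describe a valid Beta distribution.
    return max(noisy_a, 1e-3), max(noisy_b, 1e-3)

if __name__ == "__main__":
    data = np.random.binomial(1, 0.7, size=200)
    a, b = private_beta_posterior(data, epsilon=0.5)
    print(f"Private posterior Beta({a:.2f}, {b:.2f}), mean {a / (a + b):.3f}")
```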
2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)
Parallel coverage-guided greybox fuzzing is the most common setup for vulnerability discovery at scale. However, so far it has received little attention from the research community compared to single-mode fuzzing, leaving open several problems, particularly in its task allocation strategies. Current approaches focus on managing micro tasks, at the seed input level, and their task division algorithms are either ad hoc or static. In this paper, we leverage research on graph partitioning and search algorithms to propose a systematic and dynamic task allocation solution that works at the macro-task level. First, we design an attributed graph to capture both the program structures (e.g., program call graph) and fuzzing information (e.g., branch hit counts, bug discovery probability). Second, our graph partitioning algorithm divides the global program search space into sub-search-spaces. Finally, our search algorithm prioritizes these sub-search-spaces (i.e., tasks) and explores them to maximize code coverage and number of bugs found. The results are collected to update the graph and guide further iterations of partitioning and exploration. We implemented a prototype tool called AFLTeam. In our preliminary experiments on well-tested benchmarks, AFLTeam achieved higher code coverage (up to 16.4% branch coverage improvement) compared to the default parallel mode of AFL and discovered 2 zero-day bugs in the FFmpeg and JasPer toolkits.
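The following sketch shows the overall shape of such a macro-task pipeline: build an attributed call graph, partition it, and rank partitions by a score that combines coverage and bug-probability attributes. It is a toy illustration under assumed attribute names (hits, bug_prob) and uses a simple greedy modularity partitioner from networkx rather than AFLTeam's actual algorithms.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Attributed program graph: nodes are functions, edges are call relations.
# The attribute names (hits, bug_prob) are assumptions made for this sketch.
G = nx.Graph()
G.add_nodes_from([
    ("parse_header", {"hits": 120, "bug_prob": 0.02}),
    ("decode_frame", {"hits": 15,  "bug_prob": 0.30}),
    ("alloc_buffer", {"hits": 40,  "bug_prob": 0.10}),
    ("write_output", {"hits": 300, "bug_prob": 0.01}),
])
G.add_edges_from([
    ("parse_header", "decode_frame"),
    ("decode_frame", "alloc_buffer"),
    ("alloc_buffer", "write_output"),
])

# 1) Partition the global search space into sub-search-spaces (macro tasks).
partitions = list(greedy_modularity_communities(G))

# 2) Prioritise partitions: favour rarely hit code with high bug probability.
def priority(part):
    return sum(G.nodes[f]["bug_prob"] / (1 + G.nodes[f]["hits"]) for f in part)

tasks = sorted(partitions, key=priority, reverse=True)

# 3) Assign one task per fuzzer instance (round-robin over available workers).
n_fuzzers = 2
assignment = {i: sorted(tasks[i % len(tasks)]) for i in range(n_fuzzers)}
print(assignment)
```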
Advances in Neural Information Processing Systems 19, 2007
Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube, the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this paper is a density bound of $n\binom{n-1}{\le d-1} / \binom{n}{\le d} < d$, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.
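For reference, the density bound and the expected-risk consequence stated above can be typeset as follows; this is only a restatement of the abstract's claims, using the convention that a binomial coefficient with a "≤ d" index denotes the partial sum of binomial coefficients.

```latex
% One-inclusion graph density bound (restated from the abstract), where
% \binom{n}{\le d} denotes \sum_{i=0}^{d}\binom{n}{i} and d = VC(F):
\[
  \mathrm{dens}\bigl(G_{\text{one-inc}}(F)\bigr)
    \;\le\; \frac{n \binom{n-1}{\le d-1}}{\binom{n}{\le d}} \;<\; d ,
\]
% which underlies the VC(F)/n expected-risk bound of Haussler et al. for the
% one-inclusion prediction strategy on an i.i.d. sample of size n:
\[
  \mathbb{E}\bigl[\text{risk of the one-inclusion strategy}\bigr]
    \;\le\; \frac{\mathrm{VC}(F)}{n}.
\]
```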
ArXiv, 2020
Achieving statistically significant evaluation with passive sampling of test data is challenging in settings such as extreme classification and record linkage, where significant class imbalance is prevalent. Adaptive importance sampling focuses labeling on informative regions of the instance space; however, it breaks the data independence assumptions commonly required for asymptotic guarantees that assure estimates approximate population performance and provide practical confidence intervals. In this paper we develop an adaptive importance sampling framework for supervised evaluation that defines a sequence of proposal distributions given a user-defined discriminative model of p(y|x) and a generalized performance measure to evaluate. Under verifiable conditions on the model and performance measure, we establish strong consistency and a (martingale) central limit theorem for resulting performance estimates. We instantiate our framework with worked examples given stochastic or determinis...
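A minimal illustration of the importance-weighting idea behind such evaluation is sketched below: labels are drawn under a proposal distribution that over-samples instances the model p(y|x) is unsure about, and the accuracy estimate re-weights them back to the population. This is only a static, single-round version of the idea with a synthetic oracle, not the paper's adaptive framework or its consistency conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pool of unlabelled test instances with model scores p(y=1|x).
scores = rng.uniform(size=10_000)
true_labels = (rng.uniform(size=scores.size) < scores).astype(int)  # demo oracle

# Proposal: over-sample uncertain instances (score near 0.5), never zero weight.
proposal = 0.1 + scores * (1 - scores)
proposal /= proposal.sum()

# Spend a labelling budget under the proposal and importance-weight accuracy.
budget = 500
idx = rng.choice(scores.size, size=budget, replace=True, p=proposal)
preds = (scores[idx] >= 0.5).astype(int)
correct = (preds == true_labels[idx]).astype(float)
weights = (1.0 / scores.size) / proposal[idx]   # population prob. / proposal prob.

accuracy_estimate = np.sum(weights * correct) / np.sum(weights)  # self-normalised IS
print(f"Importance-weighted accuracy estimate: {accuracy_estimate:.3f}")
```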
In this work, we study how to use sampling to speed up mechanisms for answering adaptive queries into datasets without reducing the accuracy of those mechanisms. This is important to do when both the datasets and the number of queries asked are very large. In particular, we describe a mechanism that provides a polynomial speed-up per query over previous mechanisms, without needing to increase the total amount of data required to maintain the same generalization error as before. We prove that this speed-up holds for arbitrary statistical queries. We also provide an even faster method for achieving statistically-meaningful responses wherein the mechanism is only allowed to see a constant number of samples from the data per query. Finally, we show that our general results yield a simple, fast, and unified approach for adaptively optimizing convex and strongly convex functions over a dataset.
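The sketch below shows the basic shape of answering a statistical query from a small random subsample rather than the full dataset, which is where a per-query speed-up comes from. The paper's actual mechanisms, their calibration, and the generalization analysis for adaptively chosen queries are not reproduced here.

```python
import numpy as np

def answer_statistical_query(data, query, sample_size, rng):
    """Approximate the mean of `query` over `data` using a random subsample.

    Subsampling makes the per-query cost O(sample_size) instead of O(len(data));
    how large the subsample must be to preserve accuracy under adaptive querying
    is exactly what the paper's analysis addresses (not reproduced here).
    """
    idx = rng.integers(0, len(data), size=sample_size)
    return float(np.mean([query(data[i]) for i in idx]))

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=1_000_000)

# An analyst asks adaptive queries; each is answered from a fresh subsample.
q1 = answer_statistical_query(data, lambda x: x, sample_size=2_000, rng=rng)
q2 = answer_statistical_query(data, lambda x: float(x > q1), sample_size=2_000, rng=rng)
print(f"estimated mean ~ {q1:.3f}, fraction above it ~ {q2:.3f}")
```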
In recent years, complex systems techniques for modelling biological systems (plants, cells, genes and others) have become invaluable in helping researchers develop insights into these fields. Entomologists too have been looking to apply computational modelling approaches to the field of insect behaviour. In this paper, a first-cut design of a system for creating virtual insects as agents is presented, in which behaviour is specified through an insect behaviour definition language based on Teleo-Reactive principles. These virtual insects are then simulated using a combination of a plant simulation package and an agent system that implements the insect behaviour. This system will allow entomologists to consider the validity of insect behaviour hypotheses by modelling and visualising insect behaviour.
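The Teleo-Reactive idea mentioned above can be illustrated with a very small rule interpreter: an ordered list of condition-action rules is scanned on every tick and the first rule whose condition holds fires. This is a generic sketch of Teleo-Reactive control with invented conditions and actions for a foraging insect, not the paper's behaviour definition language.

```python
# Minimal Teleo-Reactive interpreter: rules are (condition, action) pairs,
# ordered from most specific goal to most general fallback; on each tick the
# first rule whose condition is true fires. Conditions and actions here are
# invented purely for illustration.
rules = [
    (lambda s: s["at_food"],   lambda s: s.update(energy=s["energy"] + 5)),
    (lambda s: s["sees_food"], lambda s: s.update(at_food=True)),
    (lambda s: True,           lambda s: s.update(sees_food=s["steps"] % 3 == 0)),
]

def tick(state):
    for condition, action in rules:
        if condition(state):
            action(state)
            break
    state["steps"] += 1

state = {"at_food": False, "sees_food": False, "energy": 0, "steps": 0}
for _ in range(10):
    tick(state)
print(state)
```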
Mistranslated numbers have the potential to cause serious effects, such as financial loss or medical misinformation. In this work we develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing. We explore a variety of numerical translation capabilities a system is expected to exhibit and design effective test examples to expose system underperformance. We find that numerical mistranslation is a general issue: major commercial systems and state-of-the-art research models fail on many of our test examples, for high- and low-resource languages. Our tests reveal novel errors that have not previously been reported in NMT systems, to the best of our knowledge. Lastly, we discuss strategies to mitigate numerical mistranslation.
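A behavioural test of this kind can be as simple as checking that every number in the source sentence survives translation. The sketch below does this around a hypothetical translate(text, src, tgt) function standing in for any NMT backend; it illustrates the testing idea only and is not the paper's test suite.

```python
import re

def translate(text, src="en", tgt="de"):
    """Placeholder for a real NMT system call; the interface is an assumption."""
    raise NotImplementedError("plug in an actual translation backend here")

def numbers_in(text):
    # Normalise thousands separators so "1,234.5" and "1234.5" compare equal.
    return {n.replace(",", "") for n in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}

def check_numeric_fidelity(source, src="en", tgt="de"):
    """Return the set of source numbers missing from the translated output."""
    output = translate(source, src, tgt)
    return numbers_in(source) - numbers_in(output)

# Example test case (passes or fails depending on the backend plugged in):
# missing = check_numeric_fidelity("The invoice total is 12,345.67 dollars.")
# assert not missing, f"numbers dropped or altered: {missing}"
```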
IEEE Access, 2021
Anomalies can correspond to network threats that have been seen before or have never been observed. Detecting and protecting networks against malicious access remains challenging even though it has been studied for a long time. Due to the evolution of networks, in both new technologies and the fast growth of connected devices, network attacks are becoming more versatile as well. Compared with traditional detection approaches, machine learning is a novel and flexible method for detecting intrusions in the network, and it is applicable to any network structure. In this paper, we introduce the challenges of anomaly detection in the traditional network as well as the next-generation network, and review the implementation of machine learning in anomaly detection under different network contexts. The procedure of each machine learning type is explained, and the methodology and advantages are presented. A comparison of using different machine learning models is also summarised. Index Terms: Machine learning, anomaly detection, network security, software defined network, Internet of Things, cloud network.
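As a concrete flavour of the machine-learning approaches surveyed, the sketch below trains an unsupervised anomaly detector (scikit-learn's IsolationForest) on simple flow-level features. Real intrusion detection pipelines differ substantially, and the feature layout and synthetic traffic here are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Toy flow-level features: [bytes sent, packets, distinct destination ports].
normal = rng.normal(loc=[500, 20, 3], scale=[100, 5, 1], size=(2000, 3))
scan = rng.normal(loc=[80, 200, 150], scale=[20, 30, 20], size=(20, 3))  # scan-like
traffic = np.vstack([normal, scan])

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)
flags = detector.predict(traffic)          # -1 = anomaly, 1 = normal
print(f"flagged {np.sum(flags == -1)} of {len(traffic)} flows as anomalous")
```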
Proceedings of the 2001 Congress on Evolutionary Computation (IEEE Cat. No.01TH8546)
This paper presents a new representation and corresponding set of genetic operators for a scheme to evolve quantum circuits with various properties. The scheme is a variant on the techniques of genetic programming and genetic algorithms, having components borrowed from each. By recognising the foundation of a quantum circuit as being a collection of gates, each operating on various categories of qubits and each taking parameters, the scheme can successfully search for most circuits. The algorithm is applied to the problem of entanglement production. It makes perfect sense to investigate the use of known automatic techniques, such as genetic programming and genetic algorithms, which have proven to exhibit many desirable properties, such as requiring no auxiliary information about the search space except access to some kind of raw fitness function, and being highly robust. Section 2 of this paper is a brief overview of the basics of quantum computing, and of the work done in the application of GP to quantum computing. Section 3 outlines the GP scheme for this paper, detailing its representation scheme and operators. Section 4 illustrates the aforementioned scheme applied to quantum entanglement production. Section 5 presents the results of Section 4. Section 6 then discusses these results and the scheme in more detail.
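The representation described, a circuit as an ordered collection of parameterised gates acting on particular qubits, lends itself to simple genetic operators. The sketch below shows one plausible encoding plus mutation and crossover over it; the gate set, parameter ranges and the placeholder fitness are assumptions of this sketch, and no quantum simulation or entanglement measure is implemented.

```python
import random

GATES = ["H", "X", "RZ", "CNOT"]   # assumed gate set for this sketch
N_QUBITS = 3

def random_gate():
    name = random.choice(GATES)
    qubits = random.sample(range(N_QUBITS), 2 if name == "CNOT" else 1)
    theta = random.uniform(0, 6.283) if name == "RZ" else None
    return (name, tuple(qubits), theta)

def random_circuit(length=8):
    return [random_gate() for _ in range(length)]

def mutate(circuit, rate=0.2):
    return [random_gate() if random.random() < rate else g for g in circuit]

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def fitness(circuit):
    # Placeholder: a real fitness would simulate the circuit and score the
    # entanglement of the output state; here we merely reward using CNOTs.
    return sum(1 for name, _, _ in circuit if name == "CNOT")

population = [random_circuit() for _ in range(20)]
for _ in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(10)
    ]
print("best circuit:", max(population, key=fitness))
```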
ACM SIGKDD Explorations Newsletter, 2003
Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis, often associated with the pre-processing stage within the microarray life-cycle, has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.
Lecture Notes in Computer Science
Whenever machine learning is applied to security problems, it is important to measure vulnerabilities to adversaries who poison the training data. We demonstrate the impact of variance injection schemes on PCA-based network-wide volume anomaly detectors, when a single compromised PoP injects chaff into the network. These schemes can increase the chance of evading detection by sixfold, for DoS attacks.
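To make the attack surface concrete, the sketch below implements a bare-bones PCA residual-subspace detector over link traffic (project onto the top principal components and flag when the residual norm exceeds a threshold) and shows how adding high-variance chaff to the training data inflates the "normal" subspace. The link counts, threshold choice and chaff model are invented for illustration; this is not the paper's detector or attack.

```python
import numpy as np

rng = np.random.default_rng(0)
n_links, n_obs = 20, 500

def fit_pca_detector(X, k=3):
    """Fit a residual-subspace detector: the top-k PCs span the 'normal' subspace."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:k].T @ Vt[:k]                        # projector onto normal subspace
    residuals = np.linalg.norm((X - mean) @ (np.eye(X.shape[1]) - P), axis=1)
    threshold = np.quantile(residuals, 0.99)
    return mean, P, threshold

def is_anomalous(x, mean, P, threshold):
    residual = np.linalg.norm((x - mean) @ (np.eye(len(x)) - P))
    return residual > threshold

# Clean training traffic vs. training poisoned with chaff on one ingress link.
clean = rng.normal(100, 10, size=(n_obs, n_links))
chaff = clean.copy()
chaff[:, 0] += rng.normal(0, 80, size=n_obs)     # variance injection on link 0

dos = clean[0].copy()
dos[0] += 400                                    # later DoS-scale spike on link 0

for name, X in [("clean-trained", clean), ("poisoned-trained", chaff)]:
    mean, P, thr = fit_pca_detector(X)
    print(name, "flags DoS:", is_anomalous(dos, mean, P, thr))
```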
Proceedings of the 1st ACM workshop on Workshop on AISec - AISec '08, 2008
Machine learning has become a valuable tool for detecting and preventing malicious activity. However, as more applications employ machine learning techniques in adversarial decision-making situations, increasingly powerful attacks become possible against machine learning systems. In this paper, we present three broad research directions towards the end of developing truly secure learning. First, we suggest that finding bounds on adversarial influence is important to understand the limits of what an attacker can and cannot do to a learning system. Second, we investigate the value of adversarial capabilities: the success of an attack depends largely on what types of information and influence the attacker has. Finally, we propose directions in technologies for secure learning and suggest lines of investigation into secure techniques for learning in adversarial environments. We intend this paper to foster discussion about the security of machine learning, and we believe that the research directions we propose represent the most important directions to pursue in the quest for secure learning.
Journal of Computer and System Sciences, 2009
We present new expected risk bounds for binary and multiclass prediction, and resolve several recent conjectures on sample compressibility due to Kuzmin and Warmuth. By exploiting the combinatorial structure of concept class F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy. The key step in their proof is a d = VC(F) bound on the graph density of a subgraph of the hypercube, the one-inclusion graph. The first main result of this report is a density bound of $n\binom{n-1}{\le d-1} / \binom{n}{\le d} < d$, which positively resolves a conjecture of Kuzmin and Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved one-inclusion mistake bound. The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is an algebraic topological property of maximum classes of VC-dimension d as being d-contractible simplicial complexes, extending the well-known characterization that d = 1 maximum classes are trees. We negatively resolve a minimum degree conjecture of Kuzmin and Warmuth (the second part of a conjectured proof of correctness for Peeling): that every class has one-inclusion minimum degree at most its VC-dimension. Our final main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This result improves on known PAC-based expected risk bounds by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.
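The multiclass result mentioned last can be summarised schematically as follows, with the Pollard pseudo-dimension playing the role of the VC-dimension; this is a typeset paraphrase of the abstract's claim, not the paper's precise theorem statement.

```latex
% Schematic form of the k-class result stated above: for an i.i.d. sample of
% size n from a k-class concept class F with Pollard pseudo-dimension d_P(F),
\[
  \mathbb{E}\bigl[\text{risk of the one-inclusion hypergraph strategy}\bigr]
    \;=\; O\!\left(\frac{d_P(F)}{n}\right),
\]
% improving known PAC-based bounds by a factor of O(\log n) and optimal up to
% a factor of O(\log k).
```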
IEEE Transactions on Dependable and Secure Computing, 2012
Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender's strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker's incentives and knowledge.
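The "inspired by online learning theory" aspect can be illustrated with a multiplicative-weights-style budget allocator: after each round, defensive budget shifts toward the attack surfaces where loss was actually incurred. The loss model, learning rate and attack sequence below are invented for illustration and do not reproduce the paper's algorithm or its competitive-ratio analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_surfaces = 4          # e.g. four attack surfaces the defender can harden
eta = 0.5               # learning rate (a tunable assumption)
weights = np.ones(n_surfaces)

for round_ in range(50):
    budget = weights / weights.sum()            # current defensive allocation
    attacked = rng.choice(n_surfaces, p=[0.6, 0.2, 0.1, 0.1])  # unknown attacker
    # Loss on the attacked surface shrinks with the budget devoted to it.
    loss = np.zeros(n_surfaces)
    loss[attacked] = max(0.0, 1.0 - 3.0 * budget[attacked])
    # Multiplicative update: move budget toward surfaces that incurred loss.
    weights *= np.exp(eta * loss)

print("final allocation:", np.round(weights / weights.sum(), 2))
```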
IEEE Transactions on Information Theory, 2012
Recently Kutin and Niyogi investigated several notions of algorithmic stability (a property of a learning map conceptually similar to continuity), showing that training stability is sufficient for consistency of Empirical Risk Minimization while distribution-free CV-stability is necessary and sufficient for having finite VC-dimension. This paper concerns a phase transition in the training stability of ERM, conjectured by the same authors. Kutin and Niyogi proved that ERM on finite hypothesis spaces containing a unique risk minimizer has training stability that scales exponentially with sample size, and conjectured that the existence of multiple risk minimizers prevents even super-quadratic convergence. We prove this result for the strictly weaker notion of CV-stability, positively resolving the conjecture.
Findings of the Association for Computational Linguistics: EMNLP 2021
IEEE Transactions on Information Theory
Blowfish privacy is a recent generalisation of differential privacy that enables improved utility while maintaining privacy policies with semantic guarantees, a factor that has driven the popularity of differential privacy in computer science. This paper relates Blowfish privacy to an important measure of privacy loss of information channels from the communications theory community: min-entropy leakage. Symmetry in an input data neighbouring relation is central to known connections between differential privacy and min-entropy leakage. But while differential privacy exhibits strong symmetry, Blowfish neighbouring relations correspond to arbitrary simple graphs owing to the framework's flexible privacy policies. To bound the min-entropy leakage of Blowfish-private mechanisms we organise our analysis over symmetrical partitions corresponding to orbits of graph automorphism groups. A construction meeting our bound with asymptotic equality demonstrates tightness.
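For readers unfamiliar with the quantity involved, the standard definition of min-entropy leakage of a channel (mechanism) from secret inputs X to observations Y, under a uniform prior, is recalled below; this is the general definition from the quantitative information flow literature, not the paper's Blowfish-specific bound.

```latex
% Min-entropy leakage of a channel P(y|x) from secret X to observation Y,
% under a uniform prior on X (standard definition):
\[
  \mathcal{L}(X \to Y)
    \;=\; H_\infty(X) - H_\infty(X \mid Y)
    \;=\; \log_2 \sum_{y} \max_{x} P(y \mid x).
\]
```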
2017 46th International Conference on Parallel Processing (ICPP)
Apache Storm is a fault-tolerant, distributed in-memory computation system for processing large volumes of high-velocity data in real-time. As an integral part of the fault-tolerance mechanism, Storm's state management is achieved by a checkpointing framework, which commits states regularly and recovers lost states from the latest checkpoint. However, this method involves a remote data store for state preservation and access, resulting in significant overheads to the performance of error-free execution. In this paper, we propose E-Storm, a replication-based state management system that actively maintains multiple state backups on different worker nodes. We build a prototype on top of Storm by extending it with monitoring and recovery modules to support inter-task state transfer whenever needed. The experiments carried out on synthetic and real-world streaming applications confirm that E-Storm outperforms the existing checkpointing method in terms of the resulting application performance, obtaining as much as 9.44 times throughput improvement while reducing the application latency down to 9.8%.
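The sketch below captures the core replication idea in miniature: every state update is pushed to peer replicas, so a failed task can restore its state from a surviving peer instead of a remote checkpoint store. The class and method names are invented and no Storm APIs are used; this is an illustration of active replication, not E-Storm's implementation.

```python
import copy

class ReplicatedState:
    """Active replication of task state across peer workers (illustrative only)."""

    def __init__(self, peers):
        self.state = {}
        self.peers = peers          # other ReplicatedState instances on other workers

    def update(self, key, value):
        self.state[key] = value
        for peer in self.peers:     # synchronous backup; real systems batch/ack this
            peer.receive_backup(key, value)

    def receive_backup(self, key, value):
        self.state[key] = value

    def recover_from(self, peer):
        """After a failure, restore local state from any surviving replica."""
        self.state = copy.deepcopy(peer.state)

# Two replicas backing up a primary task's word counts.
r1, r2 = ReplicatedState([]), ReplicatedState([])
primary = ReplicatedState([r1, r2])
primary.update("storm", 3)
primary.update("state", 7)

restarted = ReplicatedState([r1, r2])   # the primary's replacement after a crash
restarted.recover_from(r1)
print(restarted.state)                  # {'storm': 3, 'state': 7}
```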
arXiv: Machine Learning, 2013
Differential privacy formalises privacy-preserving mechanisms that provide access to a database. We pose the question of whether Bayesian inference itself can be used directly to provide private access to data, with no modification. The answer is affirmative: under certain conditions on the prior, sampling from the posterior distribution can be used to achieve a desired level of privacy and utility. To do so, we generalise differential privacy to arbitrary dataset metrics, outcome spaces and distribution families. This allows us to also deal with non-i.i.d. or non-tabular datasets. We prove bounds on the sensitivity of the posterior to the data, which gives a measure of robustness. We also show how to use posterior sampling to provide differentially private responses to queries, within a decision-theoretic framework. Finally, we provide bounds on the utility and on the distinguishability of datasets. The latter are complemented by a novel use of Le Cam's method to obtain lower bo...
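The core "posterior sampling as a private response" idea can be illustrated in a conjugate Beta-Bernoulli model: instead of releasing the exact posterior, release a single draw from it. Whether this achieves a given privacy level depends on conditions on the prior and likelihood of the kind analysed in the paper; the sketch below only demonstrates the mechanism's shape and does not verify any privacy guarantee.

```python
import numpy as np

def posterior_sample_release(data, prior_a=2.0, prior_b=2.0, rng=None):
    """Release one draw from the Beta posterior of a Bernoulli parameter.

    The release is a single posterior sample rather than the posterior itself;
    the privacy level this affords depends on properties of the prior and
    likelihood (e.g. bounded log-likelihood ratios) analysed in the paper and
    not checked in this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    a = prior_a + data.sum()
    b = prior_b + len(data) - data.sum()
    return float(rng.beta(a, b))

rng = np.random.default_rng(42)
data = rng.binomial(1, 0.3, size=100)
print(f"released posterior draw: {posterior_sample_release(data, rng=rng):.3f}")
```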