Robust Stochastic Optimization via Gradient Quantile Clipping
Related papers
arXiv (Cornell University), 2024
In this paper, we consider non-smooth convex optimization with a zeroth-order oracle corrupted by symmetric stochastic noise. Unlike existing high-probability results, which require the noise to have a bounded κ-th moment with κ ∈ (1, 2], our results allow even heavier noise with any κ > 0; e.g., the noise distribution can have unbounded expectation. Our convergence rates match the best-known ones for the bounded-variance case. To achieve this, we build a median gradient estimate with bounded second moment as the mini-batched median of sampled gradient differences. We apply this technique to the stochastic multi-armed bandit problem with heavy-tailed reward distributions and achieve Õ(√(dT)) regret. We demonstrate the performance of our zeroth-order and MAB algorithms for different κ on synthetic and real-world data. Our methods are on par with SOTA approaches; moreover, they dramatically outperform SOTA for κ ≤ 1.
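A minimal sketch of the kind of mini-batched median gradient estimator described above, assuming a two-point zeroth-order oracle with smoothing radius tau and coordinate-wise median aggregation; the paper's exact estimator and batching may differ:

```python
import numpy as np

def median_zeroth_order_grad(f, x, batch_size=32, tau=1e-3, rng=None):
    """Estimate a gradient of f at x from noisy zeroth-order queries.

    Each sample is a two-point finite-difference gradient estimate along a
    random direction; the batch is aggregated with a coordinate-wise median,
    which keeps the second moment bounded even under heavy-tailed noise.
    (Sketch only: the paper builds the estimate from sampled gradient
    differences and may aggregate differently.)
    """
    rng = rng or np.random.default_rng()
    d = x.size
    samples = np.empty((batch_size, d))
    for i in range(batch_size):
        e = rng.standard_normal(d)
        e /= np.linalg.norm(e)
        # Two noisy oracle calls give one randomized gradient sample.
        samples[i] = d * (f(x + tau * e) - f(x - tau * e)) / (2 * tau) * e
    return np.median(samples, axis=0)  # robust aggregation step
```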
Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties
arXiv, 2020
Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has shown that it also causes divergence on at least one convex optimization problem. We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between the numerator and denominator. We prove that Expectigrad cannot diverge on any instance of the optimization problem known to cause Adam to diverge. We also establish a regret bound in the general stochastic nonconvex setting that suggests Expectigrad is less susceptible to gradient variance than existing methods. Testing Expectigrad on several high-dimensional machine learning tasks, we find it often performs favorably to state-of-the-art methods with little hyperparameter tuning.
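A minimal sketch reconstructed from the description above, assuming the denominator is the square root of an unweighted running mean of squared gradients (replacing Adam's EMA) and that momentum is applied to the already-normalized step; the published algorithm may differ in details:

```python
import numpy as np

def expectigrad_step(x, grad, state, lr=1e-3, beta=0.9, eps=1e-8):
    """One Expectigrad-style update (sketch reconstructed from the abstract).

    The denominator uses an *unweighted* historical mean of squared gradients
    (no EMA), and bias-corrected momentum is applied jointly to the
    numerator/denominator ratio rather than to the raw gradient.
    """
    state["t"] += 1
    state["sum_sq"] += grad ** 2                      # per-component sum of g^2
    mean_sq = state["sum_sq"] / state["t"]            # unweighted historical mean
    step = grad / (eps + np.sqrt(mean_sq))            # normalized gradient
    state["m"] = beta * state["m"] + (1.0 - beta) * step
    m_hat = state["m"] / (1.0 - beta ** state["t"])   # bias correction
    return x - lr * m_hat

# state = {"t": 0, "sum_sq": np.zeros_like(x), "m": np.zeros_like(x)}
```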
arXiv (Cornell University), 2023
We consider a distributionally robust stochastic optimization problem and formulate it as a stochastic two-level composition optimization problem using the mean-semideviation risk measure. In this setting, we consider a single-timescale algorithm involving two versions of inner function value tracking: linearized tracking of a continuously differentiable loss function, and SPIDER tracking of a weakly convex loss function. We adopt the norm of the gradient of the Moreau envelope as our measure of stationarity and show that a sample complexity of O(ε⁻³) is possible in both cases, with only a larger constant in the second case. Finally, we demonstrate the performance of our algorithm with a robust learning example and a weakly convex, non-smooth regression example. The problem takes the form

min_{x∈X} max_{Q∈M(P)} E_Q[ℓ(x, D)],

where ℓ : ℝⁿ × ℝᵈ → ℝ is the loss of the predictor x on the random data D, whose perturbed distribution has probability law Q; M(P) is a closed convex set of probability measures (the ambiguity set) that models perturbations of the reference law P; and X ⊂ ℝⁿ is the feasible set. Such formulations allow training predictive models from data that are robust to perturbations of the input data distribution P, by considering the worst case over distributions in M(P). Such a worst-case approach has a long history in stochastic optimization; recently, it has also become relevant to machine learning applications, including convex and non-convex formulations of logistic regression, deep learning, and, more generally, data-robust supervised learning of predictive models via risk minimization.
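For reference, the mean-semideviation risk measure that induces the two-level composition structure can be written in its standard first-order form (here κ is the semideviation weight, unrelated to the tail index κ used earlier in this list; the paper's exact parametrization may differ):

```latex
% Mean--upper-semideviation risk measure of order 1:
\rho[Z] \;=\; \mathbb{E}[Z] \;+\; \kappa\,\mathbb{E}\!\left[\bigl(Z - \mathbb{E}[Z]\bigr)_{+}\right],
\qquad \kappa \in [0,1].
```

The inner expectation E[Z] nested inside the positive-part term is precisely the quantity that the single-timescale algorithm must track, which is what makes the problem a two-level composition.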
2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
Online minimization of an unknown convex function over the interval [0, 1] is considered under first-order stochastic bandit feedback, which returns a random realization of the gradient of the function at each query point. Without knowing the distribution of the random gradients, a learning algorithm sequentially chooses query points with the objective of minimizing regret, defined as the expected cumulative loss of the function values at the query points in excess of the minimum value of the function: E[Σ_{t=1}^{T} (F(x_t, ξ_t) − f(x*))]. An approach based on devising a biased random walk on an infinite-depth binary tree, constructed through successive partitioning of the domain of the function, is developed. Each move of the random walk is guided by a sequential test based on confidence bounds on the empirical mean constructed using the law of the iterated logarithm. With no tuning parameters, this learning algorithm is robust to heavy-tailed noise with infinite variance and adaptive to unknown function characteristics (specifically, convex, strongly convex, and nonsmooth). It achieves the corresponding optimal regret orders (up to a √(log T) or a log log T factor) in each class of functions and offers better or matching regret orders than the classical stochastic gradient descent approach, which requires knowledge of the function characteristics for tuning the sequence of step sizes.
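A minimal sketch of the kind of LIL-based sequential test that could guide each move of the random walk. The `sample_gradient` callable and all constants are illustrative assumptions; the paper's test is tuning-free and handles heavy-tailed (infinite-variance) noise, which this simplified sub-Gaussian form does not:

```python
import math

def lil_radius(n, delta=0.05, sigma=1.0):
    """Law-of-the-iterated-logarithm confidence radius after n samples.

    A standard anytime-valid form: radius ~ sigma * sqrt(2 log(log(n)/delta) / n).
    Constants are illustrative, not the paper's.
    """
    n = max(n, 3)  # log log n needs n >= 3
    return sigma * math.sqrt(2.0 * math.log(math.log(n) / delta) / n)

def sequential_sign_test(sample_gradient, delta=0.05, max_samples=100_000):
    """Decide the sign of the mean gradient at a query point.

    Draw samples until 0 leaves the confidence interval around the empirical
    mean; the sign then tells the random walk which child of the current
    tree node (i.e., which half of the current subinterval) to move toward.
    """
    total, n = 0.0, 0
    while n < max_samples:
        total += sample_gradient()
        n += 1
        mean, rad = total / n, lil_radius(n, delta)
        if abs(mean) > rad:
            return 1 if mean > 0 else -1
    return 0  # undecided within budget
```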
Stochastic Gradient Descent for Risk Optimization
Lecture Notes in Mechanical Engineering, 2020
This paper presents an approach for using stochastic gradient descent methods to solve risk optimization problems. The first challenge is to avoid the high-cost evaluation of the failure probability and its gradient at each iteration of the optimization process. We propose to accomplish this by employing a stochastic gradient descent algorithm to minimize the Chernoff bound of the limit state function associated with the probabilistic constraint. The employed stochastic gradient descent algorithm, Adam, is a robust method widely used in machine learning training. A numerical example is presented to illustrate the advantages and potential drawbacks of the proposed approach.
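For context, a minimal sketch of the standard Adam update the paper employs, with bias-corrected first and second moments; the hyperparameters shown are the usual defaults, not necessarily the paper's:

```python
import numpy as np

def adam_step(x, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update with bias correction."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad        # 1st-moment EMA
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2   # 2nd-moment EMA
    m_hat = state["m"] / (1 - b1 ** state["t"])           # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return x - lr * m_hat / (np.sqrt(v_hat) + eps)

# state = {"t": 0, "m": np.zeros_like(x), "v": np.zeros_like(x)}
# Here grad would be a stochastic gradient of the Chernoff bound of the
# limit state function, sampled at each iteration.
```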
A Stochastic Subgradient Method for Distributionally Robust Non-Convex Learning
arXiv (Cornell University), 2020
We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to uncertainty in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses semi-deviation risk for quantifying uncertainty, allowing us to compute solutions that are robust against perturbations in the population data distribution. We consider a broad class of generalized differentiable loss functions that can be non-convex and non-smooth, involving upward and downward cusps, and we develop an efficient stochastic subgradient method for distributionally robust problems with such functions. We prove that it converges to a point satisfying the optimality conditions. To our knowledge, this is the first method with rigorous convergence guarantees in the context of generalized differentiable non-convex and non-smooth distributionally robust stochastic optimization. Our method allows for control of the desired level of robustness with little extra computational cost compared to population risk minimization with stochastic gradient methods. We also illustrate the performance of our algorithm on real datasets arising in convex and non-convex supervised learning problems.
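A rough sketch of a mini-batch plug-in subgradient step for the semi-deviation objective E[ℓ] + κ E[(ℓ − E[ℓ])₊]. This is illustrative only: the paper's method handles generalized differentiable (cusped) losses and tracks the inner expectation with its own estimates rather than re-estimating it from each batch:

```python
import numpy as np

def semideviation_subgrad(losses, grads, kappa=0.5):
    """Plug-in mini-batch subgradient of E[l] + kappa * E[(l - E[l])_+].

    losses: shape (B,)   loss values l(x, D_i) on the batch
    grads:  shape (B, n) subgradients of l(x, D_i) w.r.t. x
    """
    mu = losses.mean()
    g_mean = grads.mean(axis=0)
    above = (losses > mu).astype(float)          # indicator of upward deviation
    # Chain rule: d/dx (l_i - E[l])_+ = 1{l_i > mu} * (grad_i - g_mean)
    g_dev = (above[:, None] * (grads - g_mean)).mean(axis=0)
    return g_mean + kappa * g_dev

def sgd_step(x, losses, grads, lr=1e-2, kappa=0.5):
    return x - lr * semideviation_subgrad(losses, grads, kappa)
```

The weight κ controls the desired level of robustness; κ = 0 recovers plain population risk minimization, which is why the extra computational cost over standard SGD is small.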
Stability and Generalization of Stochastic Gradient Methods for Minimax Problems
2021
Many machine learning problems can be formulated as minimax problems; Generative Adversarial Networks (GANs), AUC maximization, and robust estimation are a few examples. A substantial number of studies are devoted to the convergence behavior of their stochastic gradient-type algorithms. In contrast, there is relatively little work on understanding their generalization, i.e., how learning models built from training examples behave on test examples. In this paper, we provide a comprehensive generalization analysis of stochastic gradient methods for minimax problems under both convex-concave and nonconvex-nonconcave cases through the lens of algorithmic stability. We establish a quantitative connection between stability and several generalization measures, both in expectation and with high probability. For the convex-concave setting, our stability analysis shows that stochastic gradient descent ascent attains optimal generalization bounds for both smooth and nonsmooth cases.
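For concreteness, one step of the stochastic gradient descent ascent (SGDA) scheme analyzed in this line of work, for min_x max_y f(x, y) with unbiased stochastic gradient oracles; this is the generic simultaneous-update form, not necessarily the paper's exact variant:

```python
def sgda_step(x, y, grad_x, grad_y, lr_x=1e-2, lr_y=1e-2):
    """Simultaneous stochastic gradient descent ascent for min_x max_y f(x, y).

    grad_x, grad_y: unbiased stochastic gradients of f at (x, y).
    x descends while y ascends; algorithmic stability (and hence the
    generalization bounds) depends on the step sizes and iteration count.
    """
    return x - lr_x * grad_x, y + lr_y * grad_y
```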
Robust Stochastic Approximation Approach to Stochastic Programming
SIAM Journal on Optimization, 2009
In this paper we consider optimization problems where the objective function is given in a form of the expectation. A basic difficulty of solving such stochastic optimization problems is that the involved multidimensional integrals (expectations) cannot be computed with high accuracy. The aim of this paper is to compare two computational approaches based on Monte Carlo sampling techniques, namely, the Stochastic Approximation (SA) and the Sample Average Approximation (SAA) methods. Both approaches, the SA and SAA methods, have a long history. Current opinion is that the SAA method can efficiently use a specific (say linear) structure of the considered problem, while the SA approach is a crude subgradient method which often performs poorly in practice. We intend to demonstrate that a properly modified SA approach can be competitive and even significantly outperform the SAA method for a certain class of convex stochastic problems. We extend the analysis to the case of convex-concave stochastic saddle point problems, and present (in our opinion highly encouraging) results of numerical experiments.
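At its core, the "properly modified" robust SA of this paper amounts to projected stochastic subgradient steps with carefully scaled step sizes and averaging of the iterates. A minimal Euclidean sketch, with illustrative step-size constants (the paper works in the more general mirror-descent setting):

```python
import numpy as np

def robust_sa(subgrad, project, x0, n_steps, step_scale=1.0):
    """Robust SA: projected stochastic subgradient steps + iterate averaging.

    subgrad(x) returns an unbiased stochastic subgradient; project(x) maps
    back onto the feasible set X. Steps gamma_t ~ 1/sqrt(t) together with
    the averaged iterate give the O(1/sqrt(N)) rate for convex problems.
    """
    x = x0.copy()
    x_avg, weight = np.zeros_like(x0), 0.0
    for t in range(1, n_steps + 1):
        gamma = step_scale / np.sqrt(t)
        x = project(x - gamma * subgrad(x))
        # Average of iterates, weighted proportionally to the step sizes.
        x_avg = (weight * x_avg + gamma * x) / (weight + gamma)
        weight += gamma
    return x_avg
```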
Convergence Properties of Stochastic Hypergradients
2021
Bilevel optimization problems are receiving increasing attention in machine learning, as they provide a natural framework for hyperparameter optimization and meta-learning. A key step in the design of optimization algorithms for bilevel problems is the efficient computation of the gradient of the upper-level objective (the hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. We provide iteration complexity bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping that is a contraction in expectation. Preliminary numerical experiments support our theoretical analysis.
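In this setting the lower-level solution w(λ) is the fixed point of a contraction Φ(·, λ), and the hypergradient of the upper-level objective f(λ) = E(w(λ), λ) follows from implicit differentiation. The standard form is below; the notation is ours, not necessarily the paper's:

```latex
% Fixed point: w(\lambda) = \Phi(w(\lambda), \lambda).
% Implicit differentiation of f(\lambda) = E(w(\lambda), \lambda):
\nabla f(\lambda)
  = \nabla_{\lambda} E\bigl(w(\lambda), \lambda\bigr)
  + \partial_{\lambda} \Phi\bigl(w(\lambda), \lambda\bigr)^{\top}
    \bigl(I - \partial_{w} \Phi\bigl(w(\lambda), \lambda\bigr)\bigr)^{-\top}
    \nabla_{w} E\bigl(w(\lambda), \lambda\bigr).
```

Stochastic approximation schemes of the kind studied here replace w(λ) by inner iterates and the matrix inverse by a truncated Neumann series, with each factor estimated from samples of the stochastic mapping.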
A Stochastic Subgradient Method for Distributionally Robust Non-convex and Non-smooth Learning
Journal of Optimization Theory and Applications
We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to uncertainty in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses semi-deviation risk for quantifying uncertainty, allowing us to compute solutions that are robust against perturbations in the population data distribution. We consider a large family of loss functions that can be non-convex and non-smooth and develop an efficient stochastic subgradient method. We prove that it converges to a point satisfying the optimality conditions. To our knowledge, this is the first method with rigorous convergence guarantees in the context of non-convex and non-smooth distributionally robust stochastic optimization. Our method can achieve any desired level of robustness with little extra computational cost compared to population risk minimization. We also illustrate the performance of our algorithm on real datasets arising in convex and non-convex supervised learning problems.