Pratik Chaudhari - Academia.edu
Papers by Pratik Chaudhari
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration: the target network plays the role of the variable of optimization and the value network computes the proximal operator. Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action-value estimate through bootstrapping, with a limited increase in computational resources. Further, we demonstrate significant performance improvements over state-of-the-art algorithms on standard continuous-control RL benchmarks.
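The two ideas above can be sketched in a hypothetical tabular setting (illustrative names and shapes only, not the paper's code): two value estimates are combined by an element-wise minimum before bootstrapping, and the target parameters are moved toward the online parameters as in a proximal/soft update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular setting: two independent Q-estimates over 4 states x 2 actions.
q1 = rng.normal(size=(4, 2))
q2 = rng.normal(size=(4, 2))

def bootstrap_target(reward, next_state, gamma=0.99):
    """Clipped double-Q bootstrap: take the element-wise minimum of the two
    value estimates at the next state to curb overestimation bias."""
    next_value = np.minimum(q1[next_state], q2[next_state]).max()
    return reward + gamma * next_value

def soft_update(theta_target, theta_online, tau=0.005):
    """Target-network update viewed as one step toward the online weights;
    in the proximal-point view the target is the variable being optimized."""
    return (1.0 - tau) * theta_target + tau * theta_online
```

The minimum of the two estimates can never exceed either individual estimate, which is the source of the reduced overestimation.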
arXiv: Learning, 2015
We introduce "AnnealSGD", a regularized stochastic gradient descent algorithm motivated by an analysis of the energy landscape of a particular class of deep networks with sparse random weights. The loss function of such networks can be approximated by the Hamiltonian of a spherical spin glass with Gaussian coupling. While different from currently popular architectures such as convolutional ones, spin glasses are amenable to analysis, which provides insights into the topology of the loss function and motivates algorithms to minimize it. Specifically, we show that a regularization term akin to a magnetic field can be modulated with a single scalar parameter to transition the loss function from a complex, non-convex landscape with exponentially many local minima, to a phase with a polynomial number of minima, all the way down to a trivial landscape with a unique minimum. AnnealSGD starts training in the relaxed polynomial regime and gradually tightens the regularization parameter…
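The annealing schedule can be sketched as follows. This is an illustration, not the paper's algorithm: a simple quadratic penalty stands in for the magnetic-field-like term, and `field`, `decay`, and the function name are assumed.

```python
import numpy as np

def anneal_sgd_step(w, grad, field, t, lr=0.1, decay=0.99):
    """One step of an AnnealSGD-style update (sketch): the extra regularizer
    has strength field * decay**t, so it is strong early in training (when it
    simplifies the landscape) and is annealed toward zero, recovering the
    original loss at the end."""
    reg_strength = field * decay ** t
    return w - lr * (grad + reg_strength * w)
```

Early steps shrink the weights more aggressively than late steps, which is the intended "start relaxed, gradually tighten" behavior.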
ArXiv, 2015
We study a theoretical model that connects deep learning to finding the ground state of the Hamiltonian of a spherical spin glass. Existing results motivated from statistical physics show that deep networks have a highly non-convex energy landscape with exponentially many local minima and energy barriers beyond which gradient descent algorithms cannot make progress. We leverage a technique known as topology trivialization where, upon perturbation by an external magnetic field, the energy landscape of the spin glass Hamiltonian changes dramatically from exponentially many local minima to "total trivialization", i.e., a constant number of local minima. There also exists a transitional regime with polynomially many local minima which interpolates between these extremes. We show that a number of regularization schemes in deep learning can benefit from this phenomenon. As a consequence, our analysis provides order heuristics to choose regularization parameters and motivates annealing…
2017 51st Asilomar Conference on Signals, Systems, and Computers, 2017
This paper establishes a connection between non-convex optimization and nonlinear partial differential equations (PDEs). We interpret empirically successful relaxation techniques motivated from statistical physics for training deep neural networks as solutions of a viscous Hamilton-Jacobi (HJ) PDE. The underlying stochastic control interpretation allows us to prove that these techniques perform better than stochastic gradient descent. Our analysis provides insight into the geometry of the energy landscape and suggests new algorithms based on the non-viscous Hamilton-Jacobi PDE that can effectively tackle the high dimensionality of modern neural networks.
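The effect of the viscous HJ relaxation can be illustrated on a 1-D grid via a Cole-Hopf-style computation (constants simplified; `hj_smoothed` and its parameters are illustrative, not the paper's code): the relaxed loss at time t is a log-sum-exp smoothing of the original, which flattens spurious wiggles while preserving the broad structure.

```python
import numpy as np

def hj_smoothed(f_vals, xs, t, beta=0.5):
    """Cole-Hopf-style solution of a viscous HJ relaxation on a 1-D grid:
    the relaxed loss is -beta * log of the heat-kernel convolution of
    exp(-f / beta). Larger t gives a smoother, flatter landscape."""
    u = np.empty_like(f_vals)
    for i, x in enumerate(xs):
        kernel = np.exp(-(xs - x) ** 2 / (2.0 * t))
        kernel /= kernel.sum()              # normalized Gaussian weights
        u[i] = -beta * np.log(np.sum(kernel * np.exp(-f_vals / beta)))
    return u
```

On a rapidly oscillating loss, the relaxed landscape has a much smaller range than the original, which is the geometric intuition behind the improvement over plain SGD.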
ArXiv, 2020
Fine-tuning a deep network trained with the standard cross-entropy loss is a strong baseline for few-shot learning. When fine-tuned transductively, this outperforms the current state-of-the-art on standard datasets such as Mini-ImageNet, Tiered-ImageNet, CIFAR-FS and FC-100 with the same hyper-parameters. The simplicity of this approach enables us to demonstrate the first few-shot learning results on the ImageNet-21k dataset. We find that using a large number of meta-training classes results in high few-shot accuracies even for a large number of few-shot classes. We do not advocate our approach as the solution for few-shot learning, but simply use the results to highlight limitations of current benchmarks and few-shot protocols. We perform extensive studies on benchmark datasets to propose a metric that quantifies the "hardness" of a few-shot episode. This metric can be used to report the performance of few-shot algorithms in a more systematic way.
2018 Information Theory and Applications Workshop (ITA), 2018
Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks, but the precise manner in which this occurs has thus far been elusive. We prove that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term. This potential is, however, not the original loss function in general. So SGD does perform variational inference, but for a different loss than the one used to compute the gradients. Even more surprisingly, SGD does not even converge in the classical sense: we show that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they resemble closed loops with deterministic components. We prove that such "out-of-equilibrium" behavior is a consequence of highly non-isotropic gradient noise in SGD; the covariance matrix of mini-batch gradients for deep networks has a rank as small as 1% of its dimension. We provide extensive empirical validation of these claims, proven in the appendix.
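The low-rank covariance claim can be checked numerically on synthetic data (an illustration of the diagnostic, not the paper's experiment; `d`, `k`, and `n` are assumed values): generate mini-batch gradients that live in a few latent directions and measure the effective rank of their covariance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mini-batch gradients: d-dimensional, but generated from only
# k << d latent directions, mimicking the highly anisotropic noise reported
# in the paper (covariance rank ~1% of the dimension).
d, k, n = 200, 2, 500
basis = rng.normal(size=(k, d))
grads = rng.normal(size=(n, k)) @ basis       # n gradients in a k-dim subspace

cov = np.cov(grads, rowvar=False)             # d x d sample covariance
eigvals = np.linalg.eigvalsh(cov)
effective_rank = int((eigvals > 1e-8 * eigvals.max()).sum())
```

For real networks one would stack per-batch gradients the same way; an effective rank far below the parameter count is the signature of non-isotropic noise.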
Engineering Applications of Artificial Intelligence, 2020
In many real-world applications of Machine Learning it is of paramount importance not only to provide accurate predictions, but also to ensure certain levels of robustness. Adversarial Training is a training procedure that aims to provide models robust to worst-case perturbations around predefined points. Unfortunately, one of the main issues in adversarial training is that robustness w.r.t. gradient-based attackers is always achieved at the cost of prediction accuracy. In this paper, a new algorithm for adversarial training, called Wasserstein Projected Gradient Descent (WPGD), is proposed. WPGD provides a simple way to obtain cost-sensitive robustness, resulting in finer control of the robustness-accuracy trade-off. Moreover, WPGD solves an optimal transport problem on the output space of the network and can efficiently discover directions where robustness is required, allowing control of the directional trade-off between accuracy and robustness. The proposed WPGD is validated in this work on image recognition tasks with different benchmark datasets and architectures. Moreover, real-world datasets are often unbalanced: this paper shows that when dealing with such datasets, the performance of adversarial training is mainly affected in terms of standard accuracy.
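For reference, the plain projected gradient descent (PGD) inner maximization that WPGD builds on can be sketched as follows (this is the standard baseline, not the paper's Wasserstein variant; names and hyper-parameters are illustrative):

```python
import numpy as np

def pgd_attack(x, grad_loss, eps=0.1, step=0.02, iters=10):
    """Plain L-infinity PGD: repeatedly ascend the loss via the sign of its
    gradient and project back into the eps-ball around the clean input x."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_loss(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection onto the ball
    return x_adv
```

WPGD replaces the uniform treatment of output classes in this loop with an optimal-transport cost on the output space, which is what enables cost-sensitive, directional robustness.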
Journal of Statistical Mechanics: Theory and Experiment, 2019
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying…
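One outer step of an Entropy-SGD-style update can be sketched as below (a simplified sketch of the published algorithm on a deterministic gradient; hyper-parameter values are assumptions): an inner Langevin loop estimates the gradient of the local-entropy objective, which points toward wide, flat valleys.

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy_sgd_step(x, grad_f, L=20, gamma=1.0, eta=0.1,
                     eps=1e-3, noise=1e-4, alpha=0.75):
    """One outer Entropy-SGD step (sketch): run L steps of Langevin dynamics
    on the loss plus a quadratic coupling to x, track the running mean mu,
    then move x along the local-entropy gradient gamma * (x - mu)."""
    xp, mu = x.copy(), x.copy()
    for _ in range(L):
        dxp = grad_f(xp) - gamma * (x - xp)
        xp = xp - eps * dxp + np.sqrt(eps) * noise * rng.normal(size=x.shape)
        mu = (1.0 - alpha) * mu + alpha * xp      # running estimate of the inner mean
    return x - eta * gamma * (x - mu)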
Medicine & Science in Sports & Exercise, 2019
Research in the Mathematical Sciences, 2018
In this paper we establish a connection between non-convex optimization methods for training deep neural networks and nonlinear partial differential equations (PDEs). Relaxation techniques arising in statistical physics, which have already been used successfully in this context, are reinterpreted as solutions of a viscous Hamilton-Jacobi PDE. A stochastic control interpretation allows us to prove that the modified algorithm performs better in expectation than stochastic gradient descent. Well-known PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. The PDE is derived from a stochastic homogenization problem, which arises in the implementation of the algorithm. The algorithms scale well in practice and can effectively tackle the high dimensionality of modern neural networks.
2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), 2016
This paper is concerned with the wheel slip measurement of the anti-lock braking system (ABS). The wheel slip must track the desired wheel slip; for this purpose a multiple sliding surface controller (MSSC) based on a disturbance observer (DO) is used. The DO is integrated with a sliding mode controller (SMC) to strengthen the overall performance of the system by estimating the lumped uncertainties present in the system. The performance of the suggested scheme is verified in MATLAB/Simulink with an experimental ABS setup, considering different cases for tracking the slip ratio.
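A generic sliding-mode control law of the kind the MSSC builds on can be sketched as follows (an illustration of the technique, not the paper's MSSC+DO scheme; gains and names are assumptions):

```python
import numpy as np

def smc_slip_control(slip, slip_dot, slip_ref, lam=5.0, k=2.0):
    """Generic sliding-mode control law: define the sliding surface
    s = e_dot + lam * e on the slip tracking error e, and apply a switching
    term that drives the state onto the surface. tanh() replaces the ideal
    sign() to reduce chattering."""
    e = slip - slip_ref
    s = slip_dot + lam * e
    return -k * np.tanh(10.0 * s)
```

The disturbance observer in the paper would add an estimate of the lumped uncertainty to this control signal; here only the switching term is shown.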
2014 IEEE International Conference on Robotics and Automation (ICRA), 2014
We consider a class of multi-robot motion planning problems where each robot is associated with multiple objectives and decoupled task specifications. The problems are formulated as an open-loop non-cooperative differential game. A distributed anytime algorithm is proposed to compute a Nash equilibrium of the game. The following properties are proven: (i) the algorithm asymptotically converges to the set of Nash equilibria; (ii) for scalar cost functionals, the price of stability equals one; (iii) in the worst case, the computational complexity and communication cost are linear in the number of robots.
We consider the problem of control strategy synthesis for robots given a set of complex mission specifications, such as "eventually visit region A and then return to a base", "periodically survey regions A and B" or "do not enter region D". We focus on problem instances where there does not exist a strategy that satisfies all the specifications, and we aim to find strategies that satisfy the most important specifications while violating the least important ones. We focus on two particular problem formulations, both of which take as input the mission specifications in the form of Linear Temporal Logic (LTL) formulae. In our first formulation we model the robot as a discrete transition system and each of the specifications has a reward associated with its satisfaction. We propose an algorithm for finding the strategy of maximum cumulative reward which has a significantly better computational complexity than that of a brute-force approach. In our second formulation we model the robot as a continuous dynamical system and the specifications are associated with priorities in such a way that a specification with priority i is infinitely more important than one with priority level j, for any i < j. For this purpose, we introduce a functional that quantifies the level of violation of a motion plan and we design an algorithm for asymptotically computing the control strategy of minimum level of violation among all strategies that guide the robot from an initial state to a goal set. For each of our two formulations we demonstrate the usefulness of our algorithms in possible applications through simulations, and in the case of our second formulation we also carry out experiments on a real-time autonomous test-bed.
2012 IEEE 51st IEEE Conference on Decision and Control (CDC), 2012
In this paper, the filtering problem for a large class of continuous-time, continuous-state stochastic dynamical systems is considered. Inspired by recent advances in asymptotically-optimal sampling-based motion planning algorithms, such as PRM* and RRT*, an incremental sampling-based algorithm is proposed. Using incremental sampling, this approach constructs a sequence of Markov chain approximations and solves the filtering problem, in an incremental manner, on these discrete approximations. It is shown that the trajectories of the Markov chain approximations converge in distribution to the trajectories of the original stochastic system; moreover, the optimal filter calculated on these Markov chains converges to the optimal continuous-time nonlinear filter. The convergence results are verified in a number of simulation examples.
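The filter on a fixed finite Markov chain approximation is the standard predict-update recursion, sketched below (the incremental refinement of the chain is omitted; names are illustrative):

```python
import numpy as np

def discrete_filter(belief, P, likelihood):
    """One predict-update step of the optimal filter on a finite Markov chain:
    propagate the belief through the transition matrix P, reweight by the
    observation likelihood, and renormalize to a probability vector."""
    predicted = belief @ P                 # prediction (Chapman-Kolmogorov)
    posterior = predicted * likelihood     # Bayes update
    return posterior / posterior.sum()
```

As the chain is refined by incremental sampling, this discrete filter converges to the continuous-time nonlinear filter, which is the paper's main result.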
2013 American Control Conference, 2013
This paper focuses on a continuous-time, continuous-space formulation of the stochastic optimal control problem with nonlinear dynamics and observation noise. We lay the mathematical foundations to construct, via incremental sampling, an approximating sequence of discrete-time finite-state partially observable Markov decision processes (POMDPs), such that the behavior of successive approximations converges to the behavior of the original continuous system in an appropriate sense. We also show that the optimal cost function and control policies for these POMDP approximations converge almost surely to their counterparts for the underlying continuous system in the limit. We demonstrate this approach on two popular continuous-time problems, viz., the Linear-Quadratic-Gaussian (LQG) control problem and the light-dark domain problem.
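Once a finite approximating model is in hand, it can be solved by standard dynamic programming; the fully observed case is sketched below (value iteration on a finite MDP, shown for intuition — the paper's POMDP approximations require a belief-space solver):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration on a finite approximating MDP: P[a] is the transition
    matrix and R[a] the reward vector for action a. Iterates the Bellman
    optimality operator to a fixed point."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V              # shape (num_actions, num_states)
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new
```

The convergence result in the paper says, roughly, that the values computed on finer and finer sampled approximations converge to the value of the original continuous problem.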
Journal of Sound and Vibration, 2014
In this paper, a novel scheme to reduce the acceleration of the sprung mass, used in combination with sliding mode control, is proposed. The proposed scheme estimates the effects of the uncertain, nonlinear spring and damper, load variation and the unknown road disturbance. The controller needs the states of the sprung mass only, obviating the need to measure the states of the unsprung mass. The ultimate boundedness of the overall suspension system is proved. The efficacy of the method is verified through simulations for three different types of road profiles and load variation, and the scheme is validated on an experimental setup. The results are compared with a passive suspension system.