Prateek Jain | Microsoft Research

Papers by Prateek Jain

Far-sighted active learning on a budget for image and video recognition

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Jun 13, 2010

Active learning methods aim to select the most informative unlabeled instances to label first, and can help to focus image or video annotations on the examples that will most improve a recognition system. However, most existing methods only make myopic queries for a ...

Inductive Regularized Learning of Kernel Functions

In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies.
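
The abstract's claim that the learned kernel "can be applied inductively to new data points" boils down to applying a learned linear map at test time. A minimal sketch, assuming a hypothetical learned transformation L standing in for the output of any such metric/kernel learning method:

```python
# Once a linear transformation L (with W = L^T L) has been learned, it applies
# inductively: new points are mapped by the same L and classified by k-NN in
# the transformed (here also lower-dimensional) space. L is a placeholder for
# the output of the learning procedure, not part of the paper's code.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_test = rng.normal(size=(50, 10))

L = rng.normal(size=(5, 10))               # hypothetical learned transformation
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train @ L.T, y_train)
pred = knn.predict(X_test @ L.T)           # inductive: works on unseen points
```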

On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions

In this paper, we study the generalization properties of online-learning-based stochastic methods for supervised learning problems where the loss function depends on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexity-based generalization error bounds. Our bounds are in general tighter than those obtained by Wang et al. (COLT 2012) for the same problem. Using our decoupling technique, we are further able to obtain fast convergence rates for strongly convex pairwise loss functions. We are also able to analyze a class of memory-efficient online learning algorithms for pairwise learning problems that use only a bounded subset of past training samples to update the hypothesis at each step. Finally, in order to complement our generalization bounds, we propose a novel memory-efficient online learning algorithm for higher-order learning problems with bounded regret guarantees.
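
The bounded-memory setting the abstract refers to can be made concrete with a small sketch: an online learner that pairs each incoming example only with a fixed-size buffer of past examples. This is an illustrative SGD-on-pairwise-hinge scheme under that memory constraint, not the paper's exact algorithm:

```python
# Memory-bounded online pairwise learning: each new example is paired only
# with a small FIFO buffer of past examples, and the hypothesis is updated by
# SGD on a pairwise hinge loss over opposite-label pairs.
import numpy as np
from collections import deque

def online_pairwise_sgd(stream, dim, buffer_size=25, eta=0.1):
    w = np.zeros(dim)
    buf = deque(maxlen=buffer_size)           # bounded subset of past samples
    for t, (x, y) in enumerate(stream, 1):
        for (xp, yp) in buf:
            if y == yp:
                continue                      # pairwise loss needs opposite labels
            pos, neg = (x, xp) if y > yp else (xp, x)
            if w @ (pos - neg) < 1.0:         # hinge on the ranking margin
                w += (eta / np.sqrt(t)) * (pos - neg)
        buf.append((x, y))
    return w
```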

Provable Submodular Minimization using Wolfe's Algorithm

Despite its good practical performance, very little is known about Wolfe's minimum norm algorithm theoretically. To our knowledge, the only result is an exponential time analysis due to Wolfe himself. In this paper we give a maiden convergence analysis of Wolfe's algorithm. We prove that in t iterations, Wolfe's algorithm returns an O(1/t)-approximate solution to the min-norm point on any polytope. We also prove a robust version of Fujishige's theorem which shows that an O(1/n^2)-approximate solution to the min-norm point on the base polytope implies exact submodular minimization. As a corollary, we get the first pseudo-polynomial time guarantee for the Fujishige-Wolfe minimum norm algorithm for unconstrained submodular function minimization.
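
For context, the optimization problem Wolfe's algorithm targets is easy to state: find the point of minimum norm in the convex hull of a polytope's vertices. A minimal sketch using a generic QP solver in place of Wolfe's combinatorial method, with Fujishige's thresholding noted at the end:

```python
# Min-norm point: minimize ||V @ lam||^2 over the probability simplex, i.e.
# the closest point to the origin in the convex hull of the columns of V.
# A generic solver stands in for Wolfe's algorithm here.
import numpy as np
from scipy.optimize import minimize

def min_norm_point(V):
    """V: (n, m) matrix whose columns are polytope vertices."""
    n, m = V.shape
    lam0 = np.full(m, 1.0 / m)
    res = minimize(lambda lam: np.sum((V @ lam) ** 2), lam0,
                   jac=lambda lam: 2.0 * V.T @ (V @ lam),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    return V @ res.x

# On the base polytope of a submodular f, Fujishige's theorem reads the
# minimizer off the min-norm point x*: {i : x*[i] < 0} is the minimal minimizer.
```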

To Drop or Not to Drop: Robustness, Consistency and Differential Privacy Properties of Dropout

Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

We consider the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the database. We propose two hashing-based solutions. Our first approach maps the data to two-bit binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reflects the desired distance between the original points and the hyperplane query. Both use hashing to retrieve near points in sub-linear time. Our first method's preprocessing stage is more efficient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classifier as a query, our algorithm identifies those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods' tradeoffs, and show that they make it practical to perform active selection with millions of unlabeled points.
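
The second (embedding) approach can be sketched compactly. Assuming the lifting vec(xx^T) for database points and -vec(ww^T) for the query normal, the squared Euclidean distance between embeddings is monotone in (w^T x)^2, so standard Euclidean hashing applies:

```python
# Embedding sketch: lift a database point x to vec(x x^T) and a hyperplane
# normal w to -vec(w w^T). The embedding distance then grows with cos^2 of the
# angle between x and w, so ordinary Euclidean LSH over the embeddings
# retrieves points near the hyperplane.
import numpy as np

def embed_point(x):
    x = x / np.linalg.norm(x)
    return np.outer(x, x).ravel()

def embed_hyperplane_query(w):
    w = w / np.linalg.norm(w)
    return -np.outer(w, w).ravel()

x, w = np.random.randn(16), np.random.randn(16)
d2 = np.sum((embed_point(x) - embed_hyperplane_query(w)) ** 2)
# d2 == 2 + 2*cos^2(angle(x, w)): smallest exactly when x lies on the hyperplane
```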

On Iterative Hard Thresholding Methods for High-dimensional M-Estimation

The use of M-estimators in generalized linear regression models in high-dimensional settings requires risk minimization with hard L_0 constraints. Of the known methods, the class of projected gradient descent (also known as iterative hard thresholding (IHT)) methods is known to offer the fastest and most scalable solutions. However, the current state of the art is only able to analyze these methods in extremely restrictive settings which do not hold in high-dimensional statistical models. In this work we bridge this gap by providing the first analysis for IHT-style methods in the high-dimensional statistical setting. Our bounds are tight and match known minimax lower bounds. Our results rely on a general analysis framework that enables us to analyze several popular hard thresholding style algorithms (such as HTP, CoSaMP, SP) in the high-dimensional regression setting. Finally, we extend our analysis to the problem of low-rank matrix recovery.
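
A minimal sketch of the IHT template the analysis covers, for sparse least squares: a gradient step on the empirical risk followed by projection onto k-sparse vectors (the step size here is an illustrative default based on the spectral norm):

```python
# Iterative hard thresholding for sparse least squares: gradient step, then
# keep the k largest-magnitude coordinates and zero out the rest.
import numpy as np

def hard_threshold(v, k):
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def iht(A, b, k, eta=None, iters=300):
    n, p = A.shape
    eta = eta or 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the spectral norm
    x = np.zeros(p)
    for _ in range(iters):
        x = hard_threshold(x - eta * A.T @ (A @ x - b), k)
    return x
```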

Towards Efficiently Solving Quantum Traveling Salesman Problem

arXiv preprint quant-ph/0411013, Nov 2, 2004

We present a framework for efficiently solving the Approximate Traveling Salesman Problem (Approximate TSP) for quantum computing models. Existing representations of TSP introduce extra states which do not correspond to any permutation. We present an efficient and intuitive encoding for TSP in the quantum computing paradigm. Using this representation and assuming a Gaussian distribution on tour lengths, we give an algorithm to solve Approximate TSP (Euclidean) within BQP resource bounds. Generalizing this strategy to any distribution, we present an oracle-based quantum algorithm to solve Approximate TSP. We present a realization of the oracle in the quantum counterpart of PP.

Ad impression forecasting for sponsored search

Proceedings of the 22nd International Conference on World Wide Web (WWW '13), 2013

A typical problem for a search engine hosting a sponsored search service is to provide advertisers with a forecast of the number of impressions their ads are likely to obtain for a given bid. Accurate forecasts have high business value, since they enable advertisers to select bids that lead to better returns on their investment. They also play an important role in services such as automatic campaign optimization. Despite its importance, the problem has remained relatively unexplored in the literature. Existing methods typically overfit to the training data, leading to inconsistent performance. Furthermore, some of the existing methods cannot provide predictions for new ads, i.e., for ads that are not present in the logs. In this paper, we develop a generative model based approach that addresses these drawbacks. We design a Bayes net to capture inter-dependencies between the query traffic features and the competitors in an auction. Furthermore, we account for variability in the volume of query traffic by using a dynamic linear model. Finally, we implement our approach on a production-grade MapReduce framework and conduct extensive large-scale experiments on substantial volumes of sponsored search data from Bing. Our experimental results demonstrate significant advantages over existing methods as measured using several accuracy/error criteria, improved ability to provide estimates for new ads, and more consistent performance with smaller variance in accuracies. Our method can also be adapted to several other related forecasting problems, such as predicting the average position of ads or the number of clicks under budget constraints.
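
The dynamic-linear-model component can be illustrated with the simplest member of that family, a local-level model with Kalman updates; the noise variances below are placeholders, not the paper's estimates:

```python
# Local-level dynamic linear model of the kind used for query-traffic volume:
# a latent level theta_t follows a random walk, and observed volume is
# y_t = theta_t + noise. Kalman recursions yield a one-step-ahead forecast.
import numpy as np

def local_level_forecast(y, obs_var=1.0, state_var=0.1):
    theta, P = y[0], 1.0                  # initial level estimate and variance
    for obs in y[1:]:
        P += state_var                    # predict: the level diffuses
        K = P / (P + obs_var)             # Kalman gain
        theta += K * (obs - theta)        # correct with the new observation
        P *= (1 - K)
    return theta                          # forecast of the next volume
```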

Similarity-based Learning via Data driven Embeddings

We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by [Balcan-Blum ICML 2006] and [Wang et al ICML 2007]. An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let the data dictate it. We show, by giving theoretical guarantees, that the goodness criterion best suited to a problem can itself be learned, which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity-based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets, on which our method consistently outperforms existing approaches by a significant margin.
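
A minimal sketch of the landmarking pipeline, with a greedy farthest-point rule standing in for the paper's task-driven diversity heuristic:

```python
# Landmarking: embed each point by its similarities to a small landmark set,
# then train a linear classifier in that space. The greedy rule below picks
# points least similar to the landmarks chosen so far (a simple stand-in for
# the paper's diversity-based heuristic).
import numpy as np
from sklearn.linear_model import LogisticRegression

def diverse_landmarks(X, sim, m):
    chosen = [0]
    for _ in range(m - 1):
        sims = np.max([[sim(x, X[j]) for j in chosen] for x in X], axis=1)
        chosen.append(int(np.argmin(sims)))   # least similar to current set
    return X[chosen]

sim = lambda a, b: np.exp(-np.sum((a - b) ** 2))    # any (dis)similarity works
X, y = np.random.randn(300, 5), np.random.randint(0, 2, 300)
landmarks = diverse_landmarks(X, sim, m=20)
Phi = np.array([[sim(x, l) for l in landmarks] for x in X])  # data-driven embedding
clf = LogisticRegression().fit(Phi, y)
```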

Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

We study the problem of learning a distribution from samples, when the underlying distribution is a mixture of product distributions over discrete domains. This problem is motivated by several practical applications such as crowd-sourcing, recommendation systems, and learning Boolean functions. The existing solutions either heavily rely on the fact that the number of components in the mixtures is finite or have sample/time complexity that is exponential in the number of components. In this paper, we introduce a polynomial time/sample complexity method for learning a mixture of r discrete product distributions over {1, 2, ..., ℓ}^n, for general ℓ and r. We show that our approach is statistically consistent and further provide finite sample guarantees.
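
The spectral handle on this problem comes from a rank observation: for a mixture of r product distributions, the cross-moment matrix between two disjoint coordinate blocks has rank at most r. A small simulation (sizes are illustrative) makes this visible:

```python
# For a mixture of r product distributions, the cross-moment matrix of one-hot
# encodings of two distinct coordinates is sum_c w_c p_c q_c^T, hence rank <= r.
import numpy as np

rng = np.random.default_rng(1)
r, n, ell, N = 3, 6, 4, 5000
weights = rng.dirichlet(np.ones(r))
P = rng.dirichlet(np.ones(ell), size=(r, n))   # P[c, i]: law of coordinate i in component c

z = rng.choice(r, size=N, p=weights)           # latent component per sample
samples = np.array([[rng.choice(ell, p=P[c, i]) for i in range(n)] for c in z])

M = np.zeros((ell, ell))                       # empirical cross-moment of coords 0 and 1
for s in samples:
    M[s[0], s[1]] += 1.0 / N
print(np.linalg.svd(M, compute_uv=False))      # roughly r singular values stand out
```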

Online Metric Learning and Fast Similarity Search

NIPS, 2008

Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
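
To illustrate the kind of update involved, here is a generic online Mahalanobis step: a gradient step on the squared distance error followed by projection to the PSD cone. The paper's algorithm instead uses an exact LogDet-regularized update with a closed form; this sketch only conveys the flavor:

```python
# Online Mahalanobis update: given a pair with target distance y, step along
# the gradient of (d - y)^2 / 2 with respect to A, then project back to PSD.
import numpy as np

def project_psd(A):
    w, V = np.linalg.eigh(A)
    return (V * np.clip(w, 0, None)) @ V.T

def online_metric_step(A, u, v, y, eta=0.05):
    z = u - v
    d = z @ A @ z                              # current Mahalanobis distance
    A = A - eta * (d - y) * np.outer(z, z)     # gradient step
    return project_psd(A)
```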

Multi-objective Optimization for Adaptive Web Site Generation

Lecture Notes in Computer Science, 2005

Designing web sites is a complex problem. Adaptive sites are those which improve themselves by learning from user access patterns. In this paper we consider the problem of index page synthesis for an adaptive web site and frame it as a new type of Multi-...

Rank Minimization via Online Learning

Minimum rank problems arise frequently in machine learning applications and are notoriously difficult to solve due to the non-convex nature of the rank objective. In this paper, we present the first online learning approach for the problem of rank minimization of matrices over polyhedral sets. In particular, we present two online learning algorithms for rank minimization: our first algorithm is a multiplicative update method based on a generalized experts framework, while our second algorithm is a novel application of the online convex programming framework (Zinkevich, 2003). In the latter, we flip the role of the decision maker by making the decision maker search over the constraint space instead of feasible points, as is usually the case in online convex programming. A salient feature of our online learning approach is that it allows us to give provable approximation guarantees for the rank minimization problem over polyhedral sets. We demonstrate the effectiveness of our methods on synthetic examples, and on the real-life application of low-rank kernel learning.

Multilabel Classification using Bayesian Compressed Sensing

In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over the compression as well as the learning tasks. We then derive an efficient variational inference scheme that provides a joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show a significant boost over prior methods in terms of prediction performance on benchmark datasets, both in the fully labeled and the missing-labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model.
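
The two-stage compressed-sensing pipeline that the joint Bayesian model replaces can be sketched directly: compress labels with a random projection, regress onto the projections, and decode with a sparse recovery solver. All sizes and models below are illustrative:

```python
# Baseline CS-for-multilabel pipeline: random label compression, per-dimension
# regression, and sparse decoding. The paper fuses these stages into a single
# probabilistic model that also yields posterior uncertainty.
import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

n, d, L, m, k = 500, 20, 100, 30, 5              # samples, features, labels, compressed dim, sparsity
X = np.random.randn(n, d)
Y = (np.random.rand(n, L) < 0.03).astype(float)  # sparse label matrix
Phi = np.random.randn(m, L) / np.sqrt(m)         # random compression matrix

Z = Y @ Phi.T                                    # compressed labels
reg = Ridge().fit(X, Z)                          # one regressor per projection

z_hat = reg.predict(X[:1])[0]                    # predict in compressed space
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(Phi, z_hat)
y_hat = omp.coef_                                # decoded sparse label scores
```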

Online and Stochastic Gradient Methods for Non-decomposable Loss Functions

Modern applications in sensitive domains such as biometrics and medicine frequently require the use of non-decomposable loss functions such as precision@k and F-measure. Compared to point loss functions such as hinge loss, these offer much more fine-grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online-to-batch conversion bounds. Our model is a provable extension of existing online learning models for point loss functions. We instantiate two popular losses, Prec@k and pAUC, in our model and prove sublinear regret bounds for both of them. Our proofs require a novel structural lemma over ranked lists which may be of independent interest. We then develop scalable stochastic gradient descent solvers for non-decomposable loss functions. We show that for a large family of loss functions satisfying a certain uniform convergence property (which includes Prec@k, pAUC, and F-measure), our methods provably converge to the empirical risk minimizer. Such uniform convergence results were not known for these losses and we establish them using novel proof techniques. We then use extensive experimentation on real-life and benchmark datasets to establish that our method can be orders of magnitude faster than a recently proposed cutting plane method.
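
What makes these losses non-decomposable is that they are functions of the entire ranked list of scores rather than sums over individual examples, as the two small definitions below show (the pAUC variant here is a plain empirical version, written for illustration):

```python
# Prec@k and partial AUC both depend on the whole ranked list of scores, not
# on any single example, which is why standard point-loss online learning
# does not directly apply.
import numpy as np

def prec_at_k(scores, labels, k):
    top = np.argsort(-scores)[:k]
    return labels[top].mean()                 # fraction of positives in the top k

def partial_auc(scores, labels, beta=0.1):
    pos, neg = scores[labels == 1], scores[labels == 0]
    worst = np.sort(neg)[-max(1, int(beta * len(neg))):]   # top beta fraction of negatives
    return (pos[:, None] > worst[None, :]).mean()          # pairwise ranking accuracy
```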

Phase Retrieval using Alternating Minimization

Empirically, we demonstrate that alternating minimization performs similarly to recently proposed convex techniques for this problem (which are based on "lifting" to a convex matrix problem) in sample complexity and robustness to noise. However, it is much more efficient and can scale to large problems. Analytically, for a resampling version of alternating minimization, we show geometric convergence to the solution, and sample complexity that is off by log factors from obvious lower bounds. We also establish close to optimal scaling for the case when the unknown vector is sparse. Our work represents the first theoretical guarantee for alternating minimization (albeit with resampling) for any variant of phase retrieval problems in the non-convex setting.
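
The alternating scheme itself is short: alternate between estimating the missing phases (signs, in the real case) from the current iterate and re-solving a least-squares problem. A minimal real-valued sketch with a spectral-style initialization (the paper's analyzed variant additionally resamples measurements across iterations):

```python
# Alternating minimization for real-valued phase retrieval from b = |A x*|:
# estimate the signs from the current iterate, then refit by least squares.
import numpy as np

def altmin_phase(A, b, iters=50):
    M = A.T @ (b[:, None] ** 2 * A)               # spectral-style initialization
    x = np.linalg.eigh(M)[1][:, -1]               # top eigenvector
    for _ in range(iters):
        signs = np.sign(A @ x)                    # current phase (sign) estimate
        x, *_ = np.linalg.lstsq(A, signs * b, rcond=None)  # least-squares refit
    return x

m, n = 200, 20
A, x_star = np.random.randn(m, n), np.random.randn(n)
x_hat = altmin_phase(A, np.abs(A @ x_star))
# success up to a global sign: min(||x_hat - x_star||, ||x_hat + x_star||) is small
```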

One-Bit Compressed Sensing: Provable Support and Vector Recovery

In this paper, we study the problem of one-bit compressed sensing (1-bit CS), where the goal is to design a measurement matrix A and a recovery algorithm such that a k-sparse unit vector x* can be efficiently recovered from the signs of its linear measurements, i.e., b = sign(Ax*). This is an important problem for signal acquisition and has several learning applications as well, e.g., multi-label classification. We study this problem in two settings: a) support recovery: recover the support of x*; b) approximate vector recovery: recover a unit vector x̂ such that ‖x̂ − x*‖_2 ≤ ε. For support recovery, we propose two novel and efficient solutions based on two combinatorial structures: union-free families of sets and expanders. In contrast to existing methods for support recovery, our methods are universal, i.e., a single measurement matrix A can recover all the signals. For approximate recovery, we propose the first method to recover a sparse vector using a near-optimal number of measurements. We also empirically validate our algorithms and demonstrate that our algorithms recover the true signal using fewer measurements than the existing methods.
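
The measurement model is easy to simulate, and a simple baseline decoder (hard-thresholding the back-projection A^T b, then renormalizing) illustrates the setup; the paper's own support-recovery constructions via union-free families and expanders are combinatorial and not sketched here:

```python
# 1-bit CS setup with a simple baseline decoder: for Gaussian A, the
# back-projection A^T sign(A x*) points (in expectation) along x*, so keeping
# its k largest entries and renormalizing gives a crude unit-norm estimate.
import numpy as np

m, n, k = 500, 100, 5
x_star = np.zeros(n); x_star[:k] = np.random.randn(k)
x_star /= np.linalg.norm(x_star)

A = np.random.randn(m, n)
b = np.sign(A @ x_star)                      # one-bit measurements

proxy = A.T @ b                              # back-projection
support = np.argsort(-np.abs(proxy))[:k]     # estimated support
x_hat = np.zeros(n); x_hat[support] = proxy[support]
x_hat /= np.linalg.norm(x_hat)               # unit-norm estimate of x_star
```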

Supervised Learning with Similarity Functions

We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multi-class classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a "goodness" criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using "good" similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking, where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs.
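
For the regression instantiation, the connection to sparse vector recovery can be sketched with an L1-regularized fit over similarity-to-landmark features, so that few landmarks end up active; the similarity function and sizes below are illustrative:

```python
# Landmark regression with an L1 penalty: embed points by their similarities
# to a landmark set, then fit a Lasso so that sparse recovery selects only a
# few landmarks, yielding a sparse predictor.
import numpy as np
from sklearn.linear_model import Lasso

sim = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # any (possibly indefinite) similarity

X = np.random.randn(400, 8)
y = X @ np.random.randn(8) + 0.1 * np.random.randn(400)
landmarks = X[np.random.choice(len(X), 50, replace=False)]

Phi = np.array([[sim(x, l) for l in landmarks] for x in X])
model = Lasso(alpha=0.01).fit(Phi, y)        # L1 -> few active landmarks
print((model.coef_ != 0).sum(), "landmarks selected")
```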

Metric and Kernel Learning using a Linear Transformation

Metric and kernel learning are important in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study metric learning as a problem of learning a linear transformation of the input data. We show that for high-dimensional data, a particular framework for learning a linear transformation of the data based on the LogDet divergence can be efficiently kernelized to learn a metric (or equivalently, a kernel function) over an arbitrarily high dimensional space. We further demonstrate that a wide class of convex loss functions for learning linear transformations can similarly be kernelized, thereby considerably expanding the potential applications of metric learning. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision and text mining.
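
A hedged sketch of how such a kernelized learned metric is evaluated, assuming the learned kernel takes the form k(x, y) + k_x^T S k_y for some coefficient matrix S over the training points (the kind of parameterization a kernelized linear-transformation method yields; S is taken as given here, not computed):

```python
# Evaluate a learned kernel of the form k(x, y) + k_x^T S k_y, where k_x
# collects base-kernel values between x and the training points. This makes
# the learned metric applicable to arbitrary new points.
import numpy as np

def learned_kernel(x, y, X_train, base_k, S):
    kx = np.array([base_k(x, xi) for xi in X_train])
    ky = np.array([base_k(y, xi) for xi in X_train])
    return base_k(x, y) + kx @ S @ ky

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X_train = np.random.randn(30, 4)
S = 0.01 * np.eye(30)                        # placeholder for learned coefficients
val = learned_kernel(np.random.randn(4), np.random.randn(4), X_train, rbf, S)
```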
