Anirban Dasgupta - Academia.edu
Papers by Anirban Dasgupta
Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, 2003
We introduce a simple network design game that models how independent selfish agents can build or maintain a large network. In our game every agent has a specific connectivity requirement, i.e., each agent has a set of terminals and wants to build a network in which his terminals are connected. Possible edges in the network have costs and each agent's goal is to pay as little as possible. Determining whether or not a Nash equilibrium exists in this game is NP-complete. However, when the goal of each player is to connect a terminal to a common source, we prove that there is a Nash equilibrium as cheap as the optimal network, and give a polynomial-time algorithm to find a (1 + ε)-approximate Nash equilibrium that does not cost much more. For the general connection game we prove that there is a 3-approximate Nash equilibrium that is as cheap as the optimal network, and give an algorithm to find a (4.65 + ε)-approximate Nash equilibrium that does not cost much more.
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2008
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, 2011
Web mail providers rely on users to "vote" to quickly and collaboratively identify spam messages. Unfortunately, spammers have begun to use large collections of compromised accounts not only to send spam, but also to vote "not spam" on many spam emails in an attempt to thwart collaborative filtering. We call this practice a vote gaming attack. This attack confuses spam filters, since it causes spam messages to be mislabeled as legitimate; thus, spammer IP addresses can continue sending spam for longer. In this paper, we introduce the vote gaming attack and study the extent of these attacks in practice, using four months of email voting data from a large Web mail provider. We develop a model for vote gaming attacks, explain why existing detection mechanisms cannot detect them, and develop new, efficient detection methods. Our empirical analysis reveals that the bots that perform fraudulent voting differ from those that send spam. We use this insight to develop a clustering technique that identifies bots that engage in vote-gaming attacks. Our method detects tens of thousands of previously undetected fraudulent voters with only a 0.17% false positive rate, significantly outperforming existing clustering methods used to detect bots who send spam from compromised Web mail accounts.
In this paper we quantify the effect of unsolicited emails (spam) on the behavior and engagement of email users. Since performing randomized experiments in this setting is rife with practical and moral issues, we seek to determine causal relationships using observational data, something that is difficult in many cases. Using a novel modification of a user matching method combined with a time series regression on matched user pairs, we develop a framework for such causal inference that is particularly suited to the spam exposure use case. Using our matching technique, we objectively quantify the effect that continued exposure to spam has on user engagement in Yahoo! Mail. We find that spam exposure indeed leads to significantly lower user engagement, both statistically and economically. The impact is non-linear; large changes impact users in a progressively more negative fashion. The impact is strongest on "voluntary" categories of engagement such as composed emails and lowest on "responsive" engagement metrics. Our estimation technique and results not only quantify the negative impact of abuse, but also allow decision makers to estimate potential engagement gains from proposed investments in abuse mitigation.
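As a rough illustration of the matched-pairs idea described above, the sketch below matches each spam-exposed user to an unexposed user with similar pre-period engagement and averages the pairwise difference-in-differences. It uses synthetic data and a deliberately simple one-to-one greedy match; the paper's actual matching procedure and time-series regression are more involved, and all variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic users: pre-period engagement, spam-exposure flag, post-period engagement.
n = 2000
pre = rng.gamma(shape=2.0, scale=10.0, size=n)      # baseline engagement
exposed = rng.random(n) < 0.3                        # treated group
true_effect = -3.0                                   # assumed negative effect of spam
post = pre + rng.normal(0, 2, n) + true_effect * exposed

# One-to-one matching: pair each exposed user with the unexposed user
# whose pre-period engagement is closest (greedy, without replacement).
treated = np.where(exposed)[0]
controls = list(np.where(~exposed)[0])
diffs = []
for t in treated:
    j = min(controls, key=lambda c: abs(pre[c] - pre[t]))
    controls.remove(j)
    # Difference-in-differences for the matched pair.
    diffs.append((post[t] - pre[t]) - (post[j] - pre[j]))

print(f"estimated effect of spam exposure: {np.mean(diffs):.2f}")
```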
A real-valued set function is (additively) approximately submodular if it satisfies the submodularity conditions with an additive error. Approximate submodularity arises in many settings, especially in machine learning, where the function evaluation might not be exact. In this paper we study how close such approximately submodular functions are to truly submodular functions. We show that an approximately submodular function defined on a ground set of n elements is O(n^2) pointwise-close to a submodular function. This result also provides an algorithmic tool that can be used to adapt existing submodular optimization algorithms to approximately submodular functions. To complement, we show an Ω(√n) lower bound on the distance to submodularity. These results stand in contrast to the case of approximate modularity, where the distance to modularity is a constant, and approximate convexity, where the distance to convexity is logarithmic.
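To make the definition concrete (this is an illustration of the notion, not the paper's algorithm), the brute-force check below computes, for a set function on a small ground set, the smallest additive ε for which the diminishing-returns condition f(S∪{x}) − f(S) ≥ f(T∪{x}) − f(T) − ε holds for all S ⊆ T and x ∉ T. The example function is made up.

```python
from itertools import combinations

def subsets(ground):
    for r in range(len(ground) + 1):
        for s in combinations(ground, r):
            yield frozenset(s)

def approx_submodularity_violation(f, ground):
    """Smallest additive eps for which f is eps-approximately submodular,
    i.e. f(S+x) - f(S) >= f(T+x) - f(T) - eps for all S ⊆ T and x not in T."""
    worst = 0.0
    all_sets = list(subsets(ground))
    for S in all_sets:
        for T in all_sets:
            if not S <= T:
                continue
            for x in ground - T:
                gain_S = f(S | {x}) - f(S)
                gain_T = f(T | {x}) - f(T)
                worst = max(worst, gain_T - gain_S)
    return worst

# Example: a coverage-like function perturbed by a small noise term (illustrative only).
ground = frozenset(range(5))
noise = {s: 0.05 * ((hash(s) % 7) - 3) for s in subsets(ground)}
f = lambda S: len(S) ** 0.5 + noise[frozenset(S)]

print("smallest eps making f eps-approximately submodular:",
      approx_submodularity_violation(f, ground))
```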
Proceedings of the 22nd international conference on World Wide Web, 2013
Crowdsourcing is now widely used to replace judgement or evaluation by an expert authority with an aggregate evaluation from a number of non-experts, in applications ranging from rating and categorizing online content all the way to evaluation of student assignments in massively open online courses (MOOCs) via peer grading. A key issue in these settings, where direct monitoring of both effort and accuracy is infeasible, is incentivizing agents in the 'crowd' to put in effort to make good evaluations, as well as to truthfully report their evaluations. We study the design of mechanisms for crowdsourced judgement elicitation when workers strategically choose both their reports and the effort they put into their evaluations. This leads to a new family of information elicitation problems with unobservable ground truth, where an agent's proficiency (the probability with which she correctly evaluates the underlying ground truth) is endogenously determined by her strategic choice of how much effort to put into the task. Our main contribution is a simple new mechanism for binary information elicitation for multiple tasks when agents have endogenous proficiencies, with the following properties: (i) Exerting maximum effort followed by truthful reporting of observations is a Nash equilibrium. (ii) This is the equilibrium with maximum payoff to all agents, even when agents have different maximum proficiencies, can use mixed strategies, and can choose a different strategy for each of their tasks. Our information elicitation mechanism requires only minimal bounds on the priors, asks agents to report only their own evaluations, and does not require a diverging number of agent reports per task to achieve its incentive properties. The main idea behind our mechanism is to use the presence of multiple tasks and ratings to estimate a reporting statistic that identifies and penalizes low-effort agreement: the mechanism rewards agents for agreeing with another 'reference' agent's report on the same task, but also penalizes blind agreement by subtracting out this statistic term, designed so that agents obtain rewards only when they put effort into their observations.
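The sketch below illustrates the "agreement minus a blind-agreement baseline" shape of such a reward: an agent is scored 1 for agreeing with a reference agent on a shared task, minus a statistic estimating how often the two would agree by chance, computed from their reports on other tasks. This is my reading of the idea, with made-up data and names; it is not the paper's exact payment rule.

```python
import random

def reward(reports, agent, ref, task):
    """Agreement-minus-baseline score for binary reports.

    reports[a][t] is agent a's 0/1 report on task t.
    The score is 1 if `agent` and `ref` agree on `task`, minus an estimate of
    chance agreement computed from their reports on the *other* tasks
    (a sketch of the 'penalize blind agreement' idea)."""
    agree = int(reports[agent][task] == reports[ref][task])
    own = [r for t, r in reports[agent].items() if t != task]
    other = [r for t, r in reports[ref].items() if t != task]
    p, q = sum(own) / len(own), sum(other) / len(other)
    baseline = p * q + (1 - p) * (1 - q)   # agreement rate of two blind reporters
    return agree - baseline

# Tiny example with three agents and four tasks (all data made up).
random.seed(1)
truth = {t: random.randint(0, 1) for t in range(4)}
reports = {a: {t: truth[t] if random.random() < 0.8 else 1 - truth[t] for t in range(4)}
           for a in range(3)}
print(reward(reports, agent=0, ref=1, task=2))
```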
Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, 2015
In this paper we consider the problem of learning a mixture of permutations, where each component of the mixture is generated by a stochastic process. Learning permutation mixtures arises in practical settings when a set of items is ranked by different sub-populations and the rankings of users in a sub-population tend to agree with each other. While there is some applied work on learning such mixtures, it has been mostly heuristic in nature. We study the problem where the permutations in a mixture component are generated by the classical Mallows process, in which each component is associated with a center and a scalar parameter. We show that even when the centers are arbitrarily separated, with exponentially many samples one can learn the mixture, provided the parameters are all the same and known; we also show that the latter two assumptions are information-theoretically inevitable. We then focus on polynomial-time learnability and show bounds on the performance of two simple algorithms for the case when the centers are well separated. Conceptually, our work suggests that while permutations may not enjoy as nice mathematical properties as Gaussians, certain structural aspects can still be exploited towards analyzing the corresponding mixture learning problem.
Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, 2013
We introduce a new model of Gaussian mixtures, motivated by the setting where the data points correspond to ratings on a set of items provided by users who have widely varying expertise, and each user can rate an item at most once. In this mixture model, each item i has a true quality μ_i, each user j has a variance (lack of expertise) σ_j^2, and the rating of user j on item i consists of a single sample drawn independently from the Normal distribution N(μ_i, σ_j^2). The aim is to learn the unknown item qualities μ_i as precisely as possible. We study the single-item case and obtain efficient algorithms for the problem, complemented by near-matching lower bounds; we also obtain preliminary results for the multiple-items case.
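A minimal simulation of the single-item setting, for intuition: each user rates the item once with her own noise level, and when the user variances are known the inverse-variance-weighted mean is the natural estimator of μ. The paper addresses the harder case where the variances are unknown; the numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single item with true quality mu; each user j rates it once with noise sigma_j.
mu = 4.2
sigmas = rng.uniform(0.2, 5.0, size=500)     # widely varying expertise
ratings = rng.normal(mu, sigmas)             # one sample per user

# If the sigma_j were known, inverse-variance weighting would be the natural estimator.
w = 1.0 / sigmas**2
mu_hat_weighted = np.sum(w * ratings) / np.sum(w)

# The unweighted mean is much noisier when expertise varies widely.
mu_hat_plain = ratings.mean()

print(f"weighted estimate {mu_hat_weighted:.3f}  plain mean {mu_hat_plain:.3f}  truth {mu}")
```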
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011
Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)-parameterized LSH from O(dkL) to O(d log d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.
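A sketch of the underlying distance-estimation primitive as I understand it: multiply the difference vector by a random-sign diagonal and a Walsh-Hadamard transform so that its energy is spread across coordinates, then estimate the Euclidean norm from a few sampled coordinates. This is not the paper's tuned implementation; the dimension, sample size, and sampling scheme below are illustrative.

```python
import numpy as np

def fwht(a):
    """In-place Walsh-Hadamard transform; len(a) must be a power of two."""
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def estimate_distance(x, y, signs, sample_idx):
    """Estimate ||x - y|| from a few coordinates of the rotated difference."""
    d = len(x)
    z = fwht((signs * (x - y)).astype(float)) / np.sqrt(d)   # orthogonal transform
    s = len(sample_idx)
    return np.sqrt(d / s) * np.linalg.norm(z[sample_idx])

rng = np.random.default_rng(0)
d = 1024                                   # power of two for simplicity
x, y = rng.normal(size=d), rng.normal(size=d)
signs = rng.choice([-1.0, 1.0], size=d)    # random sign diagonal, shared across queries
sample_idx = rng.choice(d, size=64, replace=False)

print("true    :", np.linalg.norm(x - y))
print("estimate:", estimate_distance(x, y, signs, sample_idx))
```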
We propose a new optimization framework for summarization by generalizing the submodular framework of (Lin and Bilmes, 2011). In our framework the summarization desideratum is expressed as the sum of a submodular function and a non-submodular function, which we call dispersion; the latter uses inter-sentence dissimilarities in different ways in order to ensure non-redundancy of the summary. We consider three natural dispersion functions and show that a greedy algorithm can obtain an approximately optimal summary in all three cases. We conduct experiments on two corpora, DUC 2004 and user comments on news articles, and show that our algorithm outperforms those that rely only on submodularity.
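A toy sketch of what greedily optimizing "coverage plus dispersion" looks like over sentence similarities: the coverage term is submodular, the dispersion term rewards mutually dissimilar picks. The particular coverage, dispersion, and weighting choices here are mine for illustration, not the paper's three dispersion functions.

```python
import numpy as np

def greedy_summary(sim, k, lam=1.0):
    """Pick k sentences greedily to maximize coverage + lam * dispersion.

    sim: n x n cosine-similarity matrix between sentences.
    coverage(S)   = sum_i max_{j in S} sim[i, j]          (submodular)
    dispersion(S) = sum of pairwise (1 - sim) within S    (non-redundancy)
    """
    n = len(sim)
    chosen = []

    def objective(S):
        cov = sim[:, S].max(axis=1).sum()
        disp = sum(1 - sim[a, b] for i, a in enumerate(S) for b in S[i + 1:])
        return cov + lam * disp

    for _ in range(k):
        best = max((j for j in range(n) if j not in chosen),
                   key=lambda j: objective(chosen + [j]))
        chosen.append(best)
    return chosen

rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 16))                    # made-up sentence embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = np.clip(emb @ emb.T, 0, 1)                   # cosine similarities in [0, 1]
print("summary sentence indices:", greedy_summary(sim, k=3))
```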
Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009
We study an online job scheduling problem motivated by storyboarding in web advertising, where an advertiser derives value from uninterrupted sequential access to a user surfing the web. The user ceases to browse with probability 1 − β at each step, independently. Stories (jobs) arrive online; job s has length ℓ_s and per-unit value v_s. A value v_s is obtained for every unit of the job that is scheduled consecutively without interruption, discounted for the time at which it is scheduled. Jobs can be preempted, but no further value can be derived from the residual unscheduled units of the job. We seek an online algorithm whose total reward is competitive against that of the offline scheduler that knows all jobs in advance. We consider two models based on the maximum delay that can be allowed between the arrival and scheduling of a job. In the first, a job can be scheduled anytime after its arrival; in the second, a job is lost unless scheduled immediately upon arrival, preempting a currently running job if needed. The two settings correspond to two natural models of how long an advertiser retains interest in a relevant user. We show that there is, in fact, a sharp separation between what an online scheduler can achieve in these two settings. In the first setting with no deadlines, we give a natural deterministic algorithm with a constant competitive ratio against the offline scheduler. In contrast, we show that in the sharp-deadline setting, no (deterministic or randomized) online algorithm can achieve better than a polylogarithmic ratio.
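The reward bookkeeping in this model is simple enough to state directly: a job with per-unit value v scheduled without interruption for ℓ consecutive steps starting at time t earns v·(β^t + … + β^(t+ℓ−1)) = v·β^t·(1 − β^ℓ)/(1 − β). The snippet below just checks this accounting numerically; it is a sketch of the reward model, not of the competitive algorithm.

```python
def discounted_value(v, length, start, beta):
    """Value of scheduling `length` consecutive units of per-unit value v
    starting at time `start`, with survival probability beta per step."""
    return sum(v * beta ** t for t in range(start, start + length))

v, length, start, beta = 2.0, 5, 3, 0.9
closed_form = v * beta ** start * (1 - beta ** length) / (1 - beta)
print(discounted_value(v, length, start, beta), closed_form)   # the two should match
```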
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005
Latent Semantic Indexing is a classical method to produce optimal low-rank approximations of a term-document matrix. However, in the context of a particular query distribution, the approximation thus produced need not be optimal. We propose VLSI, a new query-dependent (or "variable") low-rank approximation that minimizes approximation error for any specified query distribution. With this tool, it is possible to tailor the LSI technique to particular settings, often resulting in vastly improved approximations at much lower dimensionality. We validate this method via a series of experiments on classical corpora, showing that VLSI typically performs similarly to LSI with an order of magnitude fewer dimensions.
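A sketch of the query-weighted low-rank idea under a simplifying assumption: if the query second-moment matrix C = E[qq^T] is known and invertible, then minimizing the expected query-side error reduces to an ordinary truncated SVD of C^(1/2)A, mapped back by C^(-1/2). This is my paraphrase of the objective, not necessarily the paper's exact VLSI construction; the data is synthetic.

```python
import numpy as np

def query_weighted_low_rank(A, C, k):
    """Rank-k M minimizing E_q ||q^T A - q^T M||^2 = ||C^{1/2}(A - M)||_F^2,
    assuming the query second-moment matrix C = E[q q^T] is invertible."""
    # Symmetric square root of C (and its inverse) via eigendecomposition.
    w, V = np.linalg.eigh(C)
    C_half = V @ np.diag(np.sqrt(w)) @ V.T
    C_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    # Ordinary truncated SVD in the transformed space, then map back.
    U, s, Vt = np.linalg.svd(C_half @ A, full_matrices=False)
    B_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
    return C_half_inv @ B_k

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 80))                 # toy term-document matrix
Q = rng.normal(size=(200, 50))                # toy sample of queries
C = Q.T @ Q / len(Q) + 1e-6 * np.eye(50)      # empirical second moment, regularized
M = query_weighted_low_rank(A, C, k=10)
print("query-side RMS error:", np.linalg.norm(Q @ (A - M)) / np.sqrt(len(Q)))
```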
Proceedings of the 16th international conference on World Wide Web, 2007
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.
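To illustrate the maximum-cover view of the discovery problem: given (hypothetically perfect) knowledge of which known pages link to which new pages, greedily fetching the page that covers the most still-undiscovered content is the standard heuristic with the classic (1 − 1/e) guarantee. The instance below is made up; the paper's recommended algorithm works from historical statistics rather than perfect foreknowledge.

```python
def greedy_discovery(links_to_new, budget):
    """links_to_new: dict mapping each known page to the set of new pages it links to.
    Greedily pick `budget` pages to fetch, each time covering the most undiscovered content."""
    discovered = set()
    schedule = []
    for _ in range(budget):
        page = max(links_to_new, key=lambda p: len(links_to_new[p] - discovered))
        if not links_to_new[page] - discovered:
            break                                  # nothing new left to gain
        schedule.append(page)
        discovered |= links_to_new[page]
    return schedule, discovered

# Toy instance: known pages A-D each link to some new pages 1-8.
links = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {7, 8}}
schedule, found = greedy_discovery(links, budget=2)
print(schedule, sorted(found))    # e.g. ['C', 'A'] covering 7 of the 8 new pages
```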
Proceedings of the 26th Annual International Conference on Machine Learning, 2009
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case: multitask learning with hundreds of thousands of tasks.
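A minimal sketch of the feature-hashing primitive that the tail bounds concern: each feature is mapped to one of m buckets and given a random sign, and inner products are approximately preserved. The particular hash function and bit choices below are illustrative, not from the paper.

```python
import numpy as np
import zlib

def hashed_features(feats, m):
    """Map a sparse {feature_name: value} dict into an m-dimensional vector:
    bucket from the low bits of a hash, sign from a higher-order bit
    (the 'hashing trick')."""
    x = np.zeros(m)
    for name, value in feats.items():
        h = zlib.crc32(name.encode())
        bucket = h % m
        sign = 1.0 if (h >> 16) & 1 else -1.0
        x[bucket] += sign * value
    return x

doc_a = {"spam": 3.0, "viagra": 1.0, "meeting": 0.0}
doc_b = {"spam": 2.0, "lunch": 1.0, "meeting": 2.0}
m = 64
print(np.dot(hashed_features(doc_a, m), hashed_features(doc_b, m)))   # ~= 3*2 = 6
```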
The classic Mallows model is a widely-used tool to realize distributions on permutations. Motivated by common practical situations, in this paper we generalize the Mallows model to distributions on top-k lists by using a suitable distance measure between top-k lists. Unlike many earlier works, our model is both analytically tractable and computationally efficient. We demonstrate this by studying two basic problems in this model, namely, sampling and reconstruction, from both algorithmic and experimental points of view.
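For reference, sampling from the standard (full-permutation) Mallows model can be done with the repeated-insertion procedure sketched below, where each item of the center ranking is inserted at position j with probability proportional to φ^(displacement). The top-k generalization in the paper uses a distance between top-k lists, which this sketch does not attempt.

```python
import random

def sample_mallows(center, phi, rng=random):
    """Draw one permutation from a Mallows model with the given center and
    dispersion phi in (0, 1], via the repeated-insertion procedure."""
    ranking = []
    for i, item in enumerate(center, start=1):
        # Insert the i-th item at (1-based) position j with prob ∝ phi^(i - j).
        weights = [phi ** (i - j) for j in range(1, i + 1)]
        total = sum(weights)
        r, j = rng.random() * total, 0
        while r > weights[j]:
            r -= weights[j]
            j += 1
        ranking.insert(j, item)
    return ranking

random.seed(0)
center = ["a", "b", "c", "d", "e"]
for _ in range(3):
    print(sample_mallows(center, phi=0.5))
```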
Proceedings of the 22nd international conference on World Wide Web, 2013
In this paper we analyze a crowdsourcing system consisting of a set of users and a set of binary choice questions. Each user has an unknown, fixed reliability that determines the user's error rate in answering questions. The problem is to determine the truth values of the questions solely based on the user answers. Although this problem has been studied extensively, theoretical error bounds have been shown only for restricted settings: when the graph between users and questions is either random or complete. In this paper we consider a general setting of the problem where the user-question graph can be arbitrary. We obtain bounds on the error rate of our algorithm and show it is governed by the expansion of the graph. We demonstrate, using several synthetic and real datasets, that our algorithm outperforms the state of the art.
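For contrast with the expansion-based analysis described above, here is a standard EM-style baseline for the same setting: alternate between a reliability-weighted majority vote per question and re-estimating each user's reliability from agreement with the current answers. This is a common baseline over an arbitrary user-question graph, not the paper's algorithm; the data is made up.

```python
import math

def infer_truth(answers, iters=20):
    """answers: dict {(user, question): 0/1} over an arbitrary bipartite graph.
    Iteratively estimate question truths by reliability-weighted majority vote
    and user reliabilities by agreement with the current estimates."""
    users = sorted({u for u, _ in answers})
    questions = sorted({q for _, q in answers})
    reliability = {u: 0.7 for u in users}            # mild initial trust
    truth = {}
    for _ in range(iters):
        # Weighted vote per question: weight log(r / (1 - r)) for a reliability-r user.
        for q in questions:
            score = 0.0
            for u in users:
                if (u, q) in answers:
                    w = math.log(reliability[u] / (1 - reliability[u]))
                    score += w if answers[(u, q)] == 1 else -w
            truth[q] = 1 if score >= 0 else 0
        # Re-estimate each user's reliability as agreement with current truths (smoothed).
        for u in users:
            qs = [q for q in questions if (u, q) in answers]
            agree = sum(answers[(u, q)] == truth[q] for q in qs)
            reliability[u] = min(max((agree + 1) / (len(qs) + 2), 0.05), 0.95)
    return truth, reliability

answers = {("u1", "q1"): 1, ("u1", "q2"): 0, ("u2", "q1"): 1,
           ("u2", "q3"): 1, ("u3", "q2"): 1, ("u3", "q3"): 0}
print(infer_truth(answers))
```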
46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05)
We consider the problem of learning mixtures of arbitrary symmetric distributions. We formulate sufficient separation conditions and present a learning algorithm with provable guarantees for mixtures of distributions that satisfy these separation conditions. Our bounds are independent of the variances of the distributions; to the best of our knowledge, there were no previous algorithms known with provable learning guarantees for distributions having infinite variance and/or expectation. For Gaussians and log-concave distributions, our results match the best known sufficient separation conditions [1, 15]. Our algorithm requires a sample of size Õ(dk), where d is the number of dimensions and k is the number of distributions in the mixture. We also show that for isotropic power-law, exponential, and Gaussian distributions, our separation condition is optimal up to a constant factor.
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007
We consider feature selection for text classification both theoretically and empirically. Our main result is an unsupervised feature selection strategy for which we give worst-case theoretical guarantees on the generalization power of the resultant classification function f̃ with respect to the classification function f obtained when keeping all the features. To the best of our knowledge, this is the first feature selection method with such guarantees. In addition, the analysis leads to insights as to when and why this feature selection strategy will perform well in practice. We then use the TechTC-100, 20-Newsgroups, and Reuters-RCV2 data sets to evaluate empirically the performance of this and two simpler but related feature selection strategies against two commonly-used strategies. Our empirical evaluation shows that the strategy with provable performance guarantees performs well in comparison with other commonly-used feature selection strategies. In addition, it performs better on certain datasets under very aggressive feature selection.
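As a rough illustration of unsupervised feature selection by importance sampling (the general family this kind of strategy belongs to), the sketch below keeps features sampled with probabilities proportional to their squared column norms. The specific weights analyzed in the paper come from its theory and may well differ; the data and sizes here are toy values.

```python
import numpy as np

def sample_features(X, r, rng):
    """Pick r features by importance sampling with probabilities proportional
    to squared column norms (one simple unsupervised weighting; illustrative only)."""
    norms = (X ** 2).sum(axis=0)
    probs = norms / norms.sum()
    return rng.choice(X.shape[1], size=r, replace=False, p=probs)

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(100, 500)).astype(float)   # toy document-term counts
keep = sample_features(X, r=50, rng=rng)
X_small = X[:, keep]                                   # train any classifier on X_small
print(sorted(keep)[:10])
```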
Proceedings of the 23rd international conference on World wide web, 2014
Networks are characterized by nodes and edges. While there has been a spate of recent work on estimating the number of nodes in a network, the edge-estimation question appears to be largely unaddressed. In this work we consider the problem of estimating the average degree of a large network using efficient random sampling, where the number of nodes is not known to the algorithm. We propose a new estimator for this problem that relies on access to node samples under a prescribed distribution. Next, we show how to efficiently realize this ideal estimator in a random walk setting. Our estimator has a natural and simple implementation using random walks; we bound its performance in terms of the mixing time of the underlying graph. We then show that our estimators are both provably and practically better than many natural estimators for the problem. Our work contrasts with existing theoretical work on estimating average degree, which assumes that a uniform random sample of nodes is available and that the number of nodes is known.
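For intuition about why random walks help here: a long walk visits nodes roughly in proportion to degree, and under degree-proportional sampling E[1/deg] = n/(2m) = 1/(average degree), so the reciprocal of the empirical mean of 1/deg is a natural estimator. The paper's estimator and its mixing-time analysis are more refined; the snippet below is a sketch on a toy graph.

```python
import random
from collections import defaultdict

def random_walk_avg_degree(adj, steps, rng=random):
    """Estimate average degree from a random walk: under the walk's stationary
    distribution P(v) ∝ deg(v), we have E[1/deg] = 1/avg_degree."""
    v = rng.choice(list(adj))
    inv_degs = []
    for _ in range(steps):
        v = rng.choice(adj[v])
        inv_degs.append(1.0 / len(adj[v]))
    return len(inv_degs) / sum(inv_degs)        # harmonic-mean estimator

# Toy graph: a star joined to a cycle, built as an adjacency list.
adj = defaultdict(list)
def add_edge(a, b):
    adj[a].append(b)
    adj[b].append(a)
for i in range(1, 10):
    add_edge(0, i)                              # star around node 0
for i in range(10, 20):
    add_edge(i, 10 + (i - 9) % 10)              # 10-cycle on nodes 10..19
add_edge(0, 10)                                 # bridge so the toy graph is connected

random.seed(0)
true_avg = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
print("true:", true_avg, "estimate:", random_walk_avg_degree(adj, steps=20000))
```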
Internet Mathematics, 2006
While several analytic models aim to explain the existence of short paths in social networks such as the web, relatively few address the problem of efficiently finding them, especially in a decentralized manner. Since developing purely decentralized search algorithms in general social-network models appears hard, we relax the notion of decentralized search by allowing the option of storing a small amount of preprocessed information about the network. We show that one can identify a small set of vertices in an undirected social network so that connectivity information of the vertices in this set can be used in conjunction with the local connectivity properties to perform decentralized search and find short paths between vertices. Our results are for random graphs with power law degree distribution generated by a variant of the expected degree model.