Eric Bax - Academia.edu
Papers by Eric Bax
arXiv (Cornell University), Aug 13, 2022
A data sketch algorithm scans a big data set, collecting a small amount of data, the sketch, which can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of a big data set, and use that sample to infer frequencies of items that meet various criteria in the big data set. This paper shows how to statistically infer probably approximately correct (PAC) bounds for those frequencies, efficiently, and precisely enough that the frequency bounds are either sharp or off by only one, which is the best possible result without exact computation.
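As a minimal illustration of a PAC bound on a sampled frequency (a Hoeffding-style interval, not the paper's sharper near-exact bounds; the counts and confidence level are hypothetical):

```python
import math

def pac_frequency_bounds(k, n, delta=0.05):
    """Hoeffding-style PAC bounds on an item frequency.

    Given k items matching a criterion in a uniform random sample of
    size n, returns an interval that contains the true frequency in the
    big data set with probability at least 1 - delta.
    """
    p_hat = k / n
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# Hypothetical sketch: 120 matches in a sample of 1000 records.
lo, hi = pac_frequency_bounds(k=120, n=1000, delta=0.05)
print(f"frequency in [{lo:.3f}, {hi:.3f}] with 95% confidence")
```

The interval width shrinks as the sample grows, at rate O(1/sqrt(n)); the paper's bounds are tighter because they avoid this worst-case slack.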
2022 IEEE International Conference on Big Data (Big Data), Dec 17, 2022
IEEE Transactions on Information Theory, 2000
We develop a probabilistic bound on the error rate of the nearest neighbor classifier formed from a set of labelled examples. The bound is computed using only the examples in the set. A subset of the examples is used as a validation set to bound the error rate of the classifier formed from the remaining examples. Then a bound is computed for the difference in error rates between the original classifier and the reduced classifier. This bound is computed by partitioning the validation set and using each subset to compute bounds for the error rate difference due to the other subsets.
arXiv (Cornell University), Nov 3, 2014
Media publisher platforms often face an effectiveness-nuisance tradeoff: more annoying ads can be more effective for some advertisers because of their ability to attract attention, but after attracting viewers' attention, their nuisance to viewers can decrease engagement with the platform over time. With the rise of mobile technology and ad blockers, many platforms are becoming increasingly concerned about how to improve monetization through digital ads while improving viewer experience. We study an online ad auction mechanism that incorporates a charge for ad impact on user experience as a criterion for ad selection and pricing. Like a Pigovian tax, the charge causes advertisers to internalize the hidden cost of foregone future platform revenue due to ad impact on user experience. Over time, the mechanism provides an
Proceedings of the 28th annual Southeast regional conference on - ACM-SE 28, 1990
Over the summer of 1989 I worked as an undergraduate research student with Dr. Hayden Porter through a COSEN grant funded by the Pew Memorial Trust on a project entitled "Understanding Chaos." I studied some of the history of research in the area of nonlinear systems, and my studies so far have concentrated mainly on the logistic equation: its behavior on both the real line and the complex plane; its properties, including scaling factors among some of the structures that arise within its domain; and how it can serve as a simplified model for studies of chaos in general. This paper is a report on work still in progress, as well as a review of some of the literature on nonlinear dynamics and a documentation of some of our results.
In markets for online advertising, advertisers may post bids that they pay only when a user responds to an ad. Market-makers estimate response rates for each ad and multiply by the bid to estimate expected revenue for showing the ad. For each advertising opportunity, called an ad call, the market-maker selects an ad that maximizes estimated expected revenue. Actual revenue deviates from estimated expected revenue for two reasons: (a) uncertainty introduced by errors in estimation of response rates and (b) random fluctuations in response rates from their expected values. This paper outlines a method to allocate a set of ad calls over a set of ads. The method mediates a tradeoff between maximizing estimated expected revenue for publishers and minimizing estimated variance for that revenue. The method accounts for uncertainty as well as randomness as sources of variability. The paper also demonstrates the surprising result that using portfolio allocation to reduce variance can also inc...
2021 IEEE International Conference on Big Data (Big Data)
One way to estimate a statistic over a large data set is to draw a sample consisting of some records from the data set, and compute the statistic over the sample as an estimate of the statistic over the data set. This procedure may fail to produce an accurate estimate. Using one sample for multiple statistics reduces computation and latency, but it can increase the probability of multiple failures to produce accurate estimates, because estimates based on the same sample may not have independent failure probabilities. We show how to bound the probability of multiple failures for sequences of estimates over one or more samples.
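A union bound gives one simple way to control the probability of any failure among dependent estimates, in the spirit of the abstract (a sketch, not the paper's construction):

```python
def union_failure_bound(deltas):
    """Upper bound on P(at least one estimate fails), valid even when
    the estimates share a sample and their failures are dependent."""
    return min(1.0, sum(deltas))

def split_budget(overall_delta, m):
    """Give each of m estimates an equal share of the failure budget,
    so the union bound meets the overall target."""
    return [overall_delta / m] * m

# Ten statistics computed from the same sample, each allowed to fail
# with probability 0.01: any-failure probability is at most about 0.1.
print(union_failure_bound([0.01] * 10))
```

The union bound requires no independence, which is exactly what breaks when many statistics reuse one sample; the cost is that the per-estimate budget shrinks as the number of estimates grows.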
Proceedings of the Future Technologies Conference (FTC) 2021, Volume 1, 2021
We develop an algorithm for the traveling salesman problem by applying finite differences to a generating function. This algorithm requires polynomial space. In comparison, a dynamic programming algorithm requires exponential space. Also, the finite-difference algorithm requires less space than a similar inclusion and exclusion algorithm.
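For comparison, the inclusion and exclusion style of algorithm mentioned above can be sketched directly (this is the standard inclusion-exclusion count of directed Hamiltonian cycles, not the paper's finite-difference algorithm): it runs in exponential time but needs only polynomial space, since each vertex subset is processed with a single length-n vector.

```python
from itertools import combinations

def count_hamiltonian_cycles(adj):
    """Count directed Hamiltonian cycles by inclusion-exclusion.

    For each vertex subset containing vertex 0, counts closed walks of
    length n from vertex 0 using repeated matrix-vector products
    restricted to the subset, so only O(n) extra space is needed.
    """
    n = len(adj)
    total = 0
    others = list(range(1, n))
    for k in range(n):  # choose k additional vertices besides vertex 0
        for chosen in combinations(others, k):
            allowed = set(chosen) | {0}
            vec = [0] * n  # walk counts from vertex 0, within `allowed`
            vec[0] = 1
            for _ in range(n):
                nxt = [0] * n
                for u in allowed:
                    if vec[u]:
                        for v in allowed:
                            if adj[u][v]:
                                nxt[v] += vec[u]
                vec = nxt
            sign = (-1) ** (n - (k + 1))
            total += sign * vec[0]  # closed walks of length n at vertex 0
    return total

# Complete directed graph on 4 vertices (no self-loops):
K4 = [[int(i != j) for j in range(4)] for i in range(4)]
print(count_hamiltonian_cycles(K4))  # 6, i.e. (4-1)! directed cycles
```

The dynamic programming alternative (Held-Karp) is faster but stores a table over all vertex subsets, hence the exponential space the abstract contrasts against.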
We introduce a technique to compute probably approximately correct (PAC) bounds on precision and recall for matching algorithms. The bounds require some verified matches, but those matches may be used to develop the algorithms. The bounds can be applied to network reconciliation or entity resolution algorithms, which identify nodes in different networks or values in a data set that correspond to the same entity. For network reconciliation, the bounds do not require knowledge of the network generation process.
We derive distribution-free uniform test error bounds that improve on VC-type bounds for validation. We show how to use knowledge of test inputs to improve the bounds. The bounds are sharp, but they require intense computation. We introduce a method to trade sharpness for speed of computation. Also, we compute the bounds for several test cases.
2019 IEEE International Conference on Big Data (Big Data), 2019
A data sketch algorithm scans a big data set, collecting a small amount of data, the sketch, which can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of a big data set, and use that sample to infer frequencies of items that meet various criteria in the big data set. This paper shows how to statistically infer probably approximately correct (PAC) bounds for those frequencies, efficiently, and precisely enough that the frequency bounds are either sharp or off by only one, which is the best possible result without exact computation.
2020 IEEE International Conference on Big Data (Big Data), 2020
We propose a system for privacy-aware machine learning. The data provider encodes each record in a way that avoids revealing information about the record's field values or about the ordering of values from different records. A service provider stores the encoded records and uses them to perform classification on queries consisting of encoded input field values. The encoding provides privacy for the data provider from the service provider and from a third party issuing unauthorized queries. But the encoding makes regression-based and many tree-based classifiers impossible to implement. It does allow histogram-type classifiers that are based on category membership, and we present one such classification method that ensures data sufficiency on a per-classification basis.
arXiv: Computer Science and Game Theory, 2015
In quasi-proportional auctions, each bidder receives a fraction of the allocation equal to the weight of their bid divided by the sum of weights of all bids, where each bid's weight is determined by a weight function. We study the relationship between the weight function, bidders' private values, number of bidders, and the seller's revenue in equilibrium. It has been shown that if one bidder has a much higher private value than the others, then a nearly flat weight function maximizes revenue. Essentially, threatening the bidder who has the highest valuation with having to share the allocation maximizes the revenue. We show that as bidder private values approach parity, steeper weight functions maximize revenue by making the quasi-proportional auction more like a winner-take-all auction. We also show that steeper weight functions maximize revenue as the number of bidders increases. For flatter weight functions, there is known to be a unique pure-strategy Nash equilibrium....
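The allocation rule described above is simple to state; a sketch with a power weight function (the exponent, a choice made here for illustration, controls how flat or steep the weight function is) might look like:

```python
def quasi_proportional_shares(bids, exponent=0.5):
    """Each bidder i receives w(b_i) / sum_j w(b_j) of the allocation,
    with weight function w(b) = b ** exponent. A small exponent gives a
    flat weight function; a large exponent approaches winner-take-all."""
    weights = [b ** exponent for b in bids]
    total = sum(weights)
    return [w / total for w in weights]

# With bids 100 and 1, a flatter weight function forces the high bidder
# to share more of the allocation:
print(quasi_proportional_shares([100, 1], exponent=0.5))  # about [0.909, 0.091]
print(quasi_proportional_shares([100, 1], exponent=0.1))  # about [0.613, 0.387]
```

This makes the abstract's tradeoff concrete: flattening the weight function is exactly the "threat" of shared allocation against a dominant bidder, while steepening it pushes the auction toward winner-take-all.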
ArXiv, 2014
Network reconciliation is the problem of identifying nodes in separate networks that represent the same entity, for example matching nodes across social networks that correspond to the same user. We introduce a technique to compute probably approximately correct (PAC) bounds on precision and recall for network reconciliation algorithms. The bounds require some verified matches, but those matches may be used to develop the algorithms. The bounds do not require knowledge of the network generation process, and they can supply confidence levels for individual matches.
For an ensemble classifier that is composed of classifiers selected from a hypothesis set of classifiers, and that selects one of its constituent classifiers at random to use for each classification, we present ensemble error bounds consisting of the average of error bounds for the individual classifiers in the ensemble, a term that depends on the fraction of hypothesis classifiers selected for the ensemble, and a small constant term and multiplier. There is no penalty for using a richer hypothesis set, if the same fraction of the hypothesis classifiers are selected for the ensemble.
ArXiv, 2015
In markets for online advertising, some advertisers pay only when users respond to ads. So publishers estimate ad response rates and multiply by advertiser bids to estimate expected revenue for showing ads. Since these estimates may be inaccurate, the publisher risks not selecting the ad for each ad call that would maximize revenue. The variance of revenue can be decomposed into two components: variance due to 'uncertainty' because the true response rate is unknown, and variance due to 'randomness' because realized response statistics fluctuate around the true response rate. Over a sequence of many ad calls, the variance due to randomness nearly vanishes due to the law of large numbers. However, the variance due to uncertainty doesn't diminish. We introduce a technique for ad selection that augments existing estimation and explore-exploit methods. The technique uses methods from portfolio optimization to produce a distribution over ads rather than selecting the single ...
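A standard diagonal mean-variance allocation illustrates the kind of portfolio tradeoff described above (a textbook sketch, not the paper's technique; the returns, variances, and risk-aversion parameter are hypothetical):

```python
def mean_variance_allocation(returns, variances, risk_aversion=1.0, tol=1e-12):
    """Choose ad-call shares x (x_i >= 0, sum x_i = 1) maximizing
        sum_i x_i * r_i  -  lambda * sum_i x_i**2 * s_i**2,
    a mean-variance objective with diagonal covariance. The optimum has
    x_i = max(0, (r_i - mu) / (2 * lambda * s_i**2)) for a multiplier mu,
    found here by bisection so that the shares sum to one.
    """
    def shares(mu):
        return [max(0.0, (r - mu) / (2 * risk_aversion * s2))
                for r, s2 in zip(returns, variances)]

    # At mu = r_i - 2*lambda*s_i**2, ad i alone already has share 1.
    lo = min(r - 2 * risk_aversion * s2 for r, s2 in zip(returns, variances))
    hi = max(returns)  # at mu = max(r), every share is 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(shares(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    x = shares((lo + hi) / 2)
    total = sum(x)
    return [xi / total for xi in x]  # normalize away bisection slack

# Two ads with equal estimated revenue: the lower-variance ad gets the
# larger share, rather than all of the traffic.
print(mean_variance_allocation([1.0, 1.0], [1.0, 4.0]))  # about [0.8, 0.2]
```

The key behavior matches the abstract: instead of sending every ad call to a single revenue-maximizing ad, the allocation spreads calls to reduce revenue variance, at a small cost in estimated expected revenue.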