Akash Bharadwaj - Academia.edu
Papers by Akash Bharadwaj
arXiv (Cornell University), Jan 24, 2023
arXiv (Cornell University), Nov 18, 2022
We consider the federated frequency estimation problem, where each user holds a private item X_i from a size-d domain and a server aims to estimate the empirical frequency (i.e., histogram) of n items with n ≪ d. Without any security and privacy considerations, each user can communicate its item to the server using log d bits. A naive application of secure aggregation protocols would, however, require d log n bits per user. Can we reduce the communication needed for secure aggregation, and does security come with a fundamental cost in communication?
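The communication gap described in this abstract can be made concrete with a quick calculation. The numbers below are illustrative choices, not values from the paper:

```python
import math

# Illustrative parameters (assumptions, not from the paper):
d = 2 ** 20          # size-d domain, e.g. ~1M possible items
n = 10_000           # number of users/items, with n << d

# Without security: each user sends the index of its item.
bits_plain = math.ceil(math.log2(d))

# Naive secure aggregation: each user submits a length-d one-hot vector,
# with log n bits per coordinate so that summed counts (up to n) fit.
bits_secure_naive = d * math.ceil(math.log2(n))

print(bits_plain)         # 20 bits per user
print(bits_secure_naive)  # 14680064 bits per user
```

The roughly 700,000x blow-up for the naive secure protocol is what motivates the paper's question about the fundamental communication cost of security.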
arXiv (Cornell University), Sep 25, 2021
We introduce Opacus, a free, open-source PyTorch library for training deep learning models with differential privacy (hosted at opacus.ai). Opacus is designed for simplicity, flexibility, and speed. It provides a simple and user-friendly API, and enables machine learning practitioners to make a training pipeline private by adding as little as two lines to their code. It supports a wide variety of layers, including multi-head attention, convolution, LSTM, GRU (and generic RNN), and embedding, right out of the box, and provides the means for supporting other user-defined layers. Opacus computes batched per-sample gradients, providing higher efficiency compared to the traditional "micro batch" approach. In this paper we present Opacus, detail the principles that drove its implementation and its unique features, and benchmark it against other frameworks for training models with differential privacy, as well as against standard PyTorch. * Equal contribution. Jessica Zhao contributed benchmarking code and analysis (Section 3).
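The core mechanism Opacus implements, DP-SGD with per-sample gradient clipping, can be sketched without PyTorch. The toy below fits a 1-D least-squares model, clipping each example's gradient before noising the sum; all names, parameter values, and the model itself are illustrative assumptions, not Opacus's actual API:

```python
import random

def dp_sgd_step(examples, w, clip_norm=1.0, noise_multiplier=1.1, lr=0.1):
    """One DP-SGD step for a toy 1-D least-squares model y ~ w*x."""
    clipped = []
    for x, y in examples:
        g = 2 * (w * x - y) * x                        # per-sample gradient of (w*x - y)**2
        g *= min(1.0, clip_norm / max(abs(g), 1e-12))  # clip each sample's gradient
        clipped.append(g)
    # Gaussian noise calibrated to the clipping bound, added once to the sum.
    noisy_sum = sum(clipped) + random.gauss(0.0, noise_multiplier * clip_norm)
    return w - lr * noisy_sum / len(examples)

random.seed(0)
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points on the line y = 2x
w = 0.0
for _ in range(200):
    w = dp_sgd_step(data, w)
# w now hovers noisily around the true slope 2.0
```

Computing per-sample gradients one example at a time, as above, is the slow "micro batch" approach the abstract contrasts with; Opacus's contribution is computing them in a single batched pass.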
Companion Proceedings of the Web Conference 2022
This paper summarizes the content of the 20 tutorials that have been given at The Web Conference 2022: 85% of these tutorials are lecture style, and 15% are hands-on.
Proceedings of the 2022 International Conference on Management of Data
Federated Computation is an emerging area that seeks to provide stronger privacy for user data, by performing large scale, distributed computations where the data remains in the hands of users. Only the necessary summary information is shared, and additional security and privacy tools can be employed to provide strong guarantees of secrecy. The most prominent application of federated computation is in training machine learning models (federated learning), but many additional applications are emerging, more broadly relevant to data management and querying data. This tutorial gives an overview of federated computation models and algorithms. It includes an introduction to security and privacy techniques and guarantees, and shows how they can be applied to solve a variety of distributed computations providing statistics and insights to distributed data. It also discusses the issues that arise when implementing systems to support federated computation, and open problems for future research.
arXiv (Cornell University), Dec 10, 2021
Federated analytics seeks to compute accurate statistics from data distributed across users' devices while providing a suitable privacy guarantee and being practically feasible to implement and scale. In this paper, we show how a strong (ε, δ)-Differential Privacy (DP) guarantee can be achieved for the fundamental problem of histogram generation in a federated setting, via a highly practical sampling-based procedure that does not add noise to disclosed data. Given the ubiquity of sampling in practice, we thus obtain a DP guarantee almost for free, avoid overestimating histogram counts, and allow easy reasoning about how privacy guarantees may obscure minorities and outliers. Using such histograms, related problems such as heavy hitters and quantiles can be answered with provable error and privacy guarantees. Experimental results show that our sample-and-threshold approach is accurate and scalable.
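The sample-and-threshold idea can be sketched in a few lines. This is a toy illustration of the general mechanism class the abstract describes, not the paper's actual protocol; the parameter names and values are assumptions:

```python
import random
from collections import Counter

def sample_threshold_histogram(items, p=0.1, tau=5):
    """Each user reports its item independently with probability p;
    the server keeps only sampled counts that reach the threshold tau.
    No explicit noise is added to disclosed counts: the privacy argument
    rests on sampling uncertainty plus suppression of rare items."""
    sampled = [x for x in items if random.random() < p]
    counts = Counter(sampled)
    # Drop sub-threshold counts, then rescale to estimate true frequencies.
    return {x: c / p for x, c in counts.items() if c >= tau}

random.seed(1)
data = ["a"] * 1000 + ["b"] * 500 + ["rare"] * 3
est = sample_threshold_histogram(data, p=0.1, tau=5)
# Frequent items "a" and "b" survive with roughly correct estimates;
# the rare item cannot reach the threshold and is suppressed.
```

The thresholding step is also what makes it easy to reason about which minorities and outliers a given (p, tau) setting will obscure, as the abstract notes.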
National Conference on Artificial Intelligence, 2016
EDUCATION
Carnegie Mellon University, Pittsburgh, PA, USA
Master of Science in Language and Information Technologies, Aug 2015-Aug 2017 (Expected)
• Research Master's degree, with emphasis on the intersection between natural language processing and machine learning.
• Advisor: Professor Chris Dyer (also collaborates closely with Professors Noah Smith and Graham Neubig).
• Research area: Natural Language Processing for low-resource languages, including machine learning methods (especially neural networks) for structured prediction tasks, multi-lingual learning, and multi-modal machine learning.
• Coursework: Algorithms for NLP (A), Introduction to Machine Learning PhD Level (A), Advanced Multi-Modal Machine Learning (A+), Machine Translation (A), Deep Learning (A+), Language and Statistics (A), Directed Research (A+)
• CGPA: 4.07 out of 4.33
This paper contributes to a growing body of evidence that, when coupled with appropriate machine-learning techniques, linguistically motivated, information-rich representations can outperform one-hot encodings of linguistic data. In particular, we show that phonological features outperform character-based models. PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. We show that this database boosts performance in various NER-related tasks. Phonologically aware, neural CRF models built on PanPhon features perform better on monolingual Spanish and Turkish NER tasks than character-based models. They have also been shown to work well in transfer models (as between Uzbek and Turkish). PanPhon features also contribute measurably to orthography-to-IPA conversion tasks.
International Journal of Speech Technology, 2017
factors that need to be considered for the task of language identification.
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016
Named Entity Recognition is a well-established information extraction task, with many state-of-the-art systems existing for a variety of languages. Most systems rely on language-specific resources, large annotated corpora, gazetteers, and feature engineering to perform well monolingually. In this paper, we introduce an attentional neural model which uses only language-universal phonological character representations with word embeddings to achieve state-of-the-art performance in a supervised monolingual setting, and which can quickly adapt to a new language with minimal or no data. We demonstrate that phonological character representations facilitate cross-lingual transfer and outperform orthographic representations, and that incorporating both attention and phonological features improves the statistical efficiency of the model in zero-shot and low-data transfer settings, with no task-specific feature engineering in the source or target language.
Lecture Notes in Computer Science, 2014
In the work presented here, we apply textual and sequential methods to assess the outcomes of an unconstrained multiparty dialogue. In the context of chat transcripts from a collaborative learning scenario, we demonstrate that while low-level textual features can indeed predict student success, models derived from sequential discourse act labels are also predictive, both on their own and as a supplement to textual feature sets. Further, we find that evidence from the initial stages of a collaborative activity is just as effective as using the whole.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
Many tasks aim to measure MACHINE READING COMPREHENSION (MRC), often focusing on question types presumed to be difficult. Rarely, however, do task designers start by considering what systems should in fact comprehend. In this paper we make two key contributions. First, we argue that existing approaches do not adequately define comprehension; they are too unsystematic about what content is tested. Second, we present a detailed definition of comprehension, a TEMPLATE OF UNDERSTANDING, for a widely useful class of texts, namely short narratives. We then conduct an experiment that strongly suggests existing systems are not up to the task of narrative understanding as we define it.