Naman Goyal - Academia.edu
Papers by Naman Goyal
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full finetuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ∼4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
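The conditional computation described above is typically implemented with a learned router that sends each token to only a few expert feed-forward networks. Below is a minimal, illustrative top-2 gating sketch in PyTorch; the layer name, dimensions, and loop-based dispatch are assumptions for clarity, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-2 Mixture-of-Experts layer (not the paper's code)."""
    def __init__(self, d_model=64, d_hidden=128, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # learned gating scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, d_model]
        gate_logits = self.router(x)                   # [tokens, num_experts]
        weights, experts = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts (conditional compute).
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = experts[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoELayer()(tokens).shape)                    # torch.Size([10, 64])
```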
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into the low-resource language compared to other translation baselines.
Global Metabolomic Profiling of Host Red Blood Cells Infected with Babesia divergens Reveals Novel Antiparasitic Target Pathways
Microbiology Spectrum
Human babesiosis is caused by apicomplexan parasites of the Babesia genus and is associated with transfusion-transmitted illness and relapsing disease in immunosuppressed populations. Through its continuous cycles of invasion, proliferation, and egress, B. divergens radically changes the metabolic environment of the host red blood cell, providing opportunities to study potential chemical vulnerabilities that can be targeted by drugs.
arXiv (Cornell University), Dec 20, 2021
Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent multiple languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 counterparts on 171 out of 182 directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We conduct an in-depth analysis of different multilingual prompting approaches, showing in particular that strong in-context few-shot learning performance across languages can be achieved via cross-lingual transfer through both templates and demonstration examples.
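The cross-lingual transfer through templates and demonstrations mentioned above can be illustrated with a simple prompt-construction helper. The template wording and examples below are hypothetical, not the paper's exact prompts.

```python
# Minimal sketch of cross-lingual few-shot prompting: an English template is
# combined with English demonstrations and a query in another language.
# The template and examples are illustrative only.

TEMPLATE = "{premise} Question: {hypothesis} True or False? Answer: {label}"

def build_prompt(demonstrations, query):
    """Concatenate k labeled demonstrations and the unlabeled query into one prompt."""
    shots = [TEMPLATE.format(**d) for d in demonstrations]
    shots.append(TEMPLATE.format(premise=query["premise"],
                                 hypothesis=query["hypothesis"],
                                 label="").rstrip())
    return "\n\n".join(shots)

demos = [  # English demonstrations
    {"premise": "The cat sat on the mat.",
     "hypothesis": "An animal is on the mat.", "label": "True"},
    {"premise": "It is raining heavily.",
     "hypothesis": "The sky is clear.", "label": "False"},
]
query = {  # Query in another language (here Spanish) reuses the English template
    "premise": "El perro corre en el parque.",
    "hypothesis": "Un animal está al aire libre.",
}
print(build_prompt(demos, query))
```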
Interspeech 2022
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high- and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can perform as well as English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world. Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/wav2vec/xlsr.
arXiv (Cornell University), Aug 5, 2020
Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representation collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
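A minimal sketch of the noise-based regularizer described here, adding parametric noise to the input representations and penalizing divergence from the clean predictions, might look as follows in PyTorch. The toy model, noise scale, and loss weighting are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def noise_regularized_loss(model, embeddings, labels, eps=1e-5, reg_weight=1.0):
    """Task loss plus a symmetric KL penalty between predictions on clean and
    noise-perturbed embeddings (a sketch of trust-region-style fine-tuning)."""
    clean_logits = model(embeddings)
    noise = torch.empty_like(embeddings).uniform_(-eps, eps)   # or normal noise
    noisy_logits = model(embeddings + noise)

    task_loss = F.cross_entropy(clean_logits, labels)
    p = F.log_softmax(clean_logits, dim=-1)
    q = F.log_softmax(noisy_logits, dim=-1)
    sym_kl = (F.kl_div(p, q, log_target=True, reduction="batchmean")
              + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return task_loss + reg_weight * sym_kl

# Toy usage: a linear classification head over pre-computed sentence embeddings.
model = nn.Linear(32, 3)
emb = torch.randn(8, 32)
labels = torch.randint(0, 3, (8,))
print(noise_regularized_loss(model, emb, labels))
```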
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Recent work demonstrates the potential of training one model for multilingual machine translation. In parallel, denoising pretraining using unlabeled monolingual data as a starting point for finetuning bitext machine translation systems has demonstrated strong performance gains. However, little has been explored on the potential to combine denoising pretraining with multilingual machine translation in a single model. In this work, we fill this gap by studying how multilingual translation models can be created through multilingual finetuning. Finetuning a multilingual model from a denoising pretrained model incorporates the benefits of large quantities of unlabeled monolingual data, which is particularly important for low-resource languages where bitext is rare. Further, we create the ML50 benchmark to facilitate reproducible research by standardizing training and evaluation data. On ML50, we show that multilingual finetuning significantly improves over multilingual models trained from scratch and bilingual finetuning for translation into English. We also find that multilingual finetuning can significantly improve over multilingual models trained from scratch for zero-shot translation on non-English directions. Finally, we discuss that the pretraining and finetuning paradigm alone is not enough to address the challenges of multilingual models for to-Many translation directions.
arXiv (Cornell University), Mar 7, 2022
In this paper, we evaluate the performance of graph neural networks in two distinct domains: computer vision and reinforcement learning. In the computer vision section, we seek to learn whether a novel non-redundant representation for images as graphs can improve performance over a trivial pixel-to-node mapping on a graph-level prediction task, specifically image classification. For the reinforcement learning section, we seek to learn if explicitly modeling Rubik's cube solving as a graph problem can improve performance over a standard model-free technique with no inductive bias.
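As a hedged illustration of the "trivial pixel-to-node mapping" baseline mentioned above, the snippet below converts an image into a graph with one node per pixel and 4-neighborhood edges; the exact representations studied in the paper may differ.

```python
import numpy as np

def image_to_grid_graph(image):
    """Trivial pixel-to-node mapping: each pixel becomes a node, with edges to
    its 4-neighborhood. Returns node features [H*W, C] and an edge list [2, E]."""
    h, w = image.shape[:2]
    features = image.reshape(h * w, -1).astype(np.float32)
    edges = []
    for r in range(h):
        for c in range(w):
            node = r * w + c
            if c + 1 < w:                  # right neighbor
                edges.append((node, node + 1))
            if r + 1 < h:                  # bottom neighbor
                edges.append((node, node + w))
    edge_index = np.array(edges, dtype=np.int64).T   # [2, E], each pair stored once
    return features, edge_index

img = np.random.rand(4, 4, 3)              # toy 4x4 RGB image
feats, edge_index = image_to_grid_graph(img)
print(feats.shape, edge_index.shape)        # (16, 3) (2, 24)
```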
arXiv (Cornell University), Aug 2, 2020
Recent work demonstrates the potential of multilingual pretraining for creating one model that can be used for various tasks in different languages. Previous work in multilingual pretraining has demonstrated that machine translation systems can be created by finetuning on bitext. In this work, we show that multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one direction, a pretrained model is finetuned on many directions at the same time. Compared to multilingual models trained from scratch, starting from pretrained models incorporates the benefits of large quantities of unlabeled monolingual data, which is particularly important for low-resource languages where bitext is not available. We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance. We double the number of languages in mBART to support multilingual machine translation models of 50 languages. Finally, we create the ML50 benchmark, covering low-, mid-, and high-resource languages, to facilitate reproducible research by standardizing training and evaluation data. On ML50, we demonstrate that multilingual finetuning improves on average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while improving 9.3 BLEU on average over bilingual baselines from scratch.
MOESM2 of Tarsier Goggles: a virtual reality tool for experiencing the optics of a dark-adapted primate visual system
Additional file 2. Data containing user responses to questions listed in Table 1. Responses that neglected to answer the question at hand were omitted from the data.
MOESM1 of Tarsier Goggles: a virtual reality tool for experiencing the optics of a dark-adapted primate visual system
Additional file 1. 2D video progression of the learning environments in Tarsier Goggles.
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), 2021
Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models, dubbed XLM-R XL and XLM-R XXL, outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests pretrained models with larger capacity may obtain strong performance on high-resource languages while greatly improving low-resource languages. We make our code and models publicly available.
Thermal characteristics of sensible heat storage materials applicable for concentrated solar power systems
Materials Today: Proceedings, 2021
Concentrated Solar Power (CSP) is rapidly growing as a lucrative renewable energy source. CSP plants are integrated with Thermal Energy Storage (TES) systems to resolve the intermittent nature of solar energy and enhance its economic feasibility. TES systems also smooth out the fluctuations in energy demand throughout the day. The efficient design of a thermal storage system rests on three major aspects: selecting a heat storage material with high thermal conductivity, high energy storage density, and thermal stability. The paper presents an overview of all currently operational CSP plants and the technologies they use. The paper also reviews the thermal characteristics of potential Sensible Heat Storage (SHS) materials as energy storage media in these plants and provides a critical assessment of each material. This paper presents crucial data needed for the optimized selection of materials used for energy storage systems employing sensible heat. A quantitative study of the available data has been made to support these results.
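The energy storage density of a sensible heat storage medium follows directly from Q = m · cp · ΔT. The short calculation below illustrates this with assumed, representative values rather than figures from the paper.

```python
# Sensible heat stored: Q = m * cp * dT.
# The material properties below are assumed, representative values,
# not data taken from the paper.
def stored_energy_kwh(mass_kg, cp_kj_per_kg_k, delta_t_k):
    q_kj = mass_kg * cp_kj_per_kg_k * delta_t_k
    return q_kj / 3600.0          # 1 kWh = 3600 kJ

# Example: 1 tonne of a molten-salt-like medium (cp ~ 1.5 kJ/kg*K) heated by 250 K.
print(f"{stored_energy_kwh(1000, 1.5, 250):.1f} kWh")   # ~104.2 kWh
```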
J. Mach. Learn. Res., 2021
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric, training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while perform...
Content Credibility Check on Twitter
During large-scale events, a large volume of content is posted on Twitter, but not all of this content is trustworthy. The presence of spam, advertisements, rumours and fake images reduces the value of information collected from Twitter. In this research work, various facets of assessing the credibility of user-generated content on Twitter are described, and a novel real-time system to assess the integrity of tweets is proposed. The system achieves this by assigning a score or rating to content on Twitter to indicate its trustworthiness.
We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released.
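The linear assignment formulation described here can be illustrated with SciPy's Hungarian-algorithm solver: replicate each expert up to its capacity and solve for the score-maximizing one-to-one assignment, which is exactly balanced by construction. This is a small sketch under those assumptions, not the paper's (much larger-scale) implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_expert_assignment(token_emb, expert_emb):
    """Assign tokens to experts so every expert gets the same number of tokens,
    maximizing total token-expert affinity (a sketch of BASE-style routing)."""
    num_tokens, num_experts = token_emb.shape[0], expert_emb.shape[0]
    assert num_tokens % num_experts == 0, "sketch assumes tokens divide evenly"
    capacity = num_tokens // num_experts

    scores = token_emb @ expert_emb.T                 # [tokens, experts]
    # Replicate each expert 'capacity' times so the assignment is one-to-one.
    tiled = np.repeat(scores, capacity, axis=1)       # [tokens, tokens]
    rows, cols = linear_sum_assignment(tiled, maximize=True)
    return cols[np.argsort(rows)] // capacity         # expert id per token

rng = np.random.default_rng(0)
tokens, experts = rng.normal(size=(8, 16)), rng.normal(size=(4, 16))
assignment = balanced_expert_assignment(tokens, experts)
print(assignment, np.bincount(assignment))            # each expert gets 2 tokens
```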
Following the two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018, 2019), we again posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting the highest-quality data to be used to train machine translation systems. This year, the task tackled the low-resource conditions of Pashto–English and Khmer–English and also included the challenge of sentence alignment from document pairs.
ArXiv, 2021
In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign for the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training the two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascade-based approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech t...
ArXiv, 2020
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of...
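A retrieval-augmented generation pipeline of the kind described here retrieves passages from a dense index and conditions a seq2seq generator on the query plus the retrieved text. The sketch below only shows the retrieve-and-condition step with toy embeddings and a placeholder generator; every component is a stand-in, not the actual RAG models.

```python
import numpy as np

# Toy dense index: each document is paired with a (pretend) dense embedding.
documents = ["Paris is the capital of France.",
             "The Nile is a river in Africa.",
             "Mount Everest is the highest mountain."]
doc_embeddings = np.random.default_rng(0).normal(size=(len(documents), 16))

def embed(text):
    """Placeholder query encoder; a real system would use a trained dense retriever."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.normal(size=16)

def generate(conditioned_input):
    """Placeholder seq2seq generator; a real system would be a pretrained model."""
    return f"<generated answer conditioned on: {conditioned_input[:60]}...>"

def rag_answer(query, top_k=2):
    q = embed(query)
    scores = doc_embeddings @ q                      # inner-product retrieval
    top = np.argsort(-scores)[:top_k]                # top-k passages
    # Condition generation on the query concatenated with retrieved passages.
    context = " ".join(documents[i] for i in top)
    return generate(f"question: {query} context: {context}")

print(rag_answer("What is the capital of France?"))
```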
Investigating Packet Dropping Attacks in RPL-DODAG in IoT
2019 IEEE 5th International Conference for Convergence in Technology (I2CT), 2019
The Internet of Things (IoT) facilitates communication among a huge number of uniquely identifiable heterogeneous devices and services without human intervention. To efficiently leverage the benefits of IoT, it is important that IoT applications are secured. IoT also employs large-scale deployment of Low power and Lossy Networks (LLNs) comprising sensors and RFIDs which are resource constrained. These resource-constrained devices are connected to the untrustworthy Internet via IPv6 over Low power Wireless Personal Area Networks (6LoWPAN). RPL is the routing protocol used in 6LoWPAN networks and is susceptible to many security attacks. Packet dropping is one of the many RPL security attacks, in which a malicious node drops data packets. In this paper, we investigate packet dropping attacks in IoT-LLNs and compare their impact against the normal scenario. We also present a detection mechanism to identify packet dropping nodes. The mechanism is implemented at the edge of the 6LoWPAN network and hence does not place any computational overhead on the constrained nodes.
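An edge-side detector of the kind described above can be approximated by comparing, for each forwarding node, how many packets it was expected to forward against how many it actually forwarded, as observed at the 6LoWPAN border router. The sketch below is an illustrative heuristic with made-up counts and threshold, not the paper's exact mechanism.

```python
# Illustrative edge-side heuristic: flag nodes whose forwarding ratio falls
# below a threshold. Traffic counts would come from the border router's
# observations; the values and threshold here are made up for illustration.

def detect_packet_droppers(traffic, threshold=0.8):
    """traffic maps node id -> (packets_received_for_forwarding, packets_forwarded)."""
    suspects = {}
    for node, (received, forwarded) in traffic.items():
        if received == 0:
            continue                        # leaf or idle node, nothing to judge
        ratio = forwarded / received
        if ratio < threshold:
            suspects[node] = ratio
    return suspects

observed = {
    "node-2": (120, 118),   # behaves normally
    "node-5": (200, 90),    # drops roughly half of the packets it should forward
    "node-7": (0, 0),       # leaf node, forwards nothing by design
}
print(detect_packet_droppers(observed))     # {'node-5': 0.45}
```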