Exploring Neural Language Models via Analysis of Local and Global Self-Attention Spaces

AttViz: Online exploration of self-attention for transparent neural language modeling

ArXiv, 2020

Neural language models are becoming the prevailing methodology for the tasks of query answering, text classification, disambiguation, completion and translation. Commonly comprising hundreds of millions of parameters, these neural network models offer state-of-the-art performance at the cost of interpretability; humans are no longer capable of tracing and understanding how decisions are being made. The attention mechanism, introduced initially for the task of translation, has been successfully adopted for other language-related tasks. We propose AttViz, an online toolkit for the exploration of self-attention---real values associated with individual text tokens. We show how existing deep learning pipelines can produce outputs suitable for AttViz, offering novel visualizations of the attention heads and their aggregations online, with minimal effort. We show on examples of news segments how the proposed system can be used to inspect and potentially better understand what a model has learned.
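A minimal sketch of how an existing pipeline can expose the per-token self-attention values such a viewer works with, using the Hugging Face transformers API; the model name, the head-averaging aggregation and the printed layout are illustrative assumptions, not the AttViz export format:

```python
# Illustrative only: extract per-token self-attention from a pretrained model
# and aggregate it across heads into one score per token (assumed aggregation).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed model choice
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Markets rallied after the announcement.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each of shape (batch, heads, seq, seq)
last_layer = outputs.attentions[-1][0]     # (heads, seq, seq)
received = last_layer.mean(dim=-2)         # attention each token receives, per head
aggregated = received.mean(dim=0)          # average over heads -> one value per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, aggregated.tolist()):
    print(f"{token:>15s}  {score:.3f}")
```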

End-to-End Transformer-Based Models in Textual-Based NLP

AI

Transformer architectures are highly expressive because they use self-attention mechanisms to encode long-range dependencies in the input sequences. In this paper, we present a literature review on Transformer-based (TB) models, providing a detailed overview of each model in comparison to the Transformer’s standard architecture. This survey focuses on TB models used in the field of Natural Language Processing (NLP) for text-based tasks. We begin with an overview of the fundamental concepts at the heart of the success of these models. Then, we classify them based on their architecture and training mode. We compare the advantages and disadvantages of popular techniques in terms of architectural design and experimental value. Finally, we discuss open research directions and potential future work to help solve current TB application challenges in NLP.

Toward Practical Usage of the Attention Mechanism as a Tool for Interpretability

IEEE Access

Natural language processing (NLP) has been one of the subfields of artificial intelligence much affected by the recent neural revolution. Architectures such as recurrent neural networks (RNNs) and attention-based transformers helped propel the state of the art across various NLP tasks, such as sequence classification, machine translation, and natural language inference. However, if neural models are to be used in high-stakes decision-making scenarios, the explainability of their decisions becomes a paramount issue. The attention mechanism has offered some transparency into the workings of otherwise black-box RNN models: attention weights (scalar values assigned to input words) invite interpretation as the importance of each word, providing a simple method of interpretability. Recent work, however, has questioned the faithfulness of this practice. Subsequent experiments have shown that faithfulness of attention weights may still be achieved by incorporating word-level objectives into the training process of neural networks. In this article, we present a study that extends the techniques for improving the faithfulness of attention, based on regularization methods that promote retention of word-level information. We perform extensive experiments on a wide array of recurrent neural architectures and analyze to what extent the explanations provided by inspecting attention weights are correlated with the human notion of importance. We find that incorporating tying regularization consistently improves both the faithfulness (−0.14 F1, +0.07 Brier, on average) and plausibility (+53.6% attention mass on salient tokens) of explanations obtained through inspecting attention weights across the analyzed datasets and models.
Index Terms: natural language processing, explainable AI, interpretability, LSTM, GRU, recurrent neural network.
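As a concrete reference point for what those scalar values are, here is a minimal sketch of an attentive BiLSTM classifier (hypothetical dimensions and a single learned query vector, not the exact architectures studied in the article); the softmax weights alpha over the hidden states are the per-word attention scores that get inspected as importances:

```python
# Minimal sketch, assuming a BiLSTM classifier with dot-product attention.
import torch
import torch.nn as nn

class AttentiveLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.query = nn.Parameter(torch.randn(2 * hidden_dim))  # learned attention query
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))         # (batch, seq, 2*hidden)
        scores = h @ self.query                          # (batch, seq)
        alpha = torch.softmax(scores, dim=-1)            # one scalar weight per word
        context = (alpha.unsqueeze(-1) * h).sum(dim=1)   # attention-weighted sentence vector
        return self.out(context), alpha                  # alpha is what gets read as "importance"

model = AttentiveLSTMClassifier(vocab_size=10_000)
logits, alpha = model(torch.randint(0, 10_000, (1, 12)))
print(alpha)  # one attention weight per input token
```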

Rethinking Self-Attention: Towards Interpretability in Neural Parsing

Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

Attention mechanisms have improved the performance of NLP tasks while allowing models to remain explainable. Self-attention is currently widely used; however, interpretability is difficult due to the numerous attention distributions. Recent work has shown that model representations can benefit from label-specific information, while facilitating interpretation of predictions. We introduce the Label Attention Layer: a new form of self-attention where attention heads represent labels. We test our novel layer by running constituency and dependency parsing experiments and show that our new model obtains new state-of-the-art results for both tasks on both the Penn Treebank (PTB) and the Chinese Treebank. Additionally, our model requires fewer self-attention layers compared to existing work. Finally, we find that the Label Attention heads learn relations between syntactic categories and show pathways to analyze errors.
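A schematic sketch of the core idea, not the authors' exact layer: attention queries are learned label vectors rather than projections of the input, so each head corresponds to one label and its attention distribution over tokens can be read off directly. Dimensions and names below are assumptions for illustration:

```python
# Schematic sketch: one learned query per label attends over the token sequence.
import torch
import torch.nn as nn

class LabelAttentionSketch(nn.Module):
    def __init__(self, num_labels, model_dim):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, model_dim))
        self.key = nn.Linear(model_dim, model_dim)
        self.value = nn.Linear(model_dim, model_dim)

    def forward(self, x):                       # x: (batch, seq, model_dim)
        k, v = self.key(x), self.value(x)
        scores = torch.einsum("ld,bsd->bls", self.label_queries, k) / k.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)    # (batch, num_labels, seq)
        label_repr = torch.einsum("bls,bsd->bld", attn, v)
        return label_repr, attn                 # row l of attn: label l's distribution over tokens

layer = LabelAttentionSketch(num_labels=8, model_dim=64)
reprs, attn = layer(torch.randn(2, 10, 64))
print(attn.shape)  # torch.Size([2, 8, 10])
```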

BERTAC: Enhancing Transformer-based Language Models with Adversarially Pretrained Convolutional Neural Networks

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Transformer-based language models (TLMs), such as BERT, ALBERT and GPT-3, have shown strong performance in a wide range of NLP tasks and currently dominate the field of NLP. However, many researchers wonder whether these models can maintain their dominance forever. Of course, we do not have answers now, but, as an attempt to find better neural architectures and training schemes, we pretrain a simple CNN using a GAN-style learning scheme and Wikipedia data, and then integrate it with standard TLMs. We show that on the GLUE tasks, the combination of our pretrained CNN with ALBERT outperforms the original ALBERT and achieves performance similar to that of SOTA. Furthermore, on open-domain QA (Quasar-T and SearchQA), the combination of the CNN with ALBERT or RoBERTa achieves stronger performance than SOTA and the original TLMs. We hope that this work provides a hint toward developing a novel, strong network architecture along with its training scheme. Our source code and models are available at https://github.com/nict-wisdom/bertac.

Language Modeling with Deep Transformers

Interspeech 2019

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K-vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. The positional encoding is an essential augmentation for the self-attention mechanism, which is otherwise invariant to sequence ordering. However, in the autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal in its own right. The analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models.
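A toy numeric illustration of that argument: with a causal mask and content-free (all-equal) attention scores, position t still ends up with a distribution spread over exactly t + 1 visible tokens, so the attention pattern itself differs by position before any positional encoding is added:

```python
# Causal masking alone makes the attention distribution depend on position.
import numpy as np

seq_len = 5
scores = np.zeros((seq_len, seq_len))                       # content-free (all-equal) scores
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)   # forbid attending to the future
masked = scores + mask

attn = np.exp(masked - masked.max(axis=-1, keepdims=True))
attn = attn / attn.sum(axis=-1, keepdims=True)
print(np.round(attn, 2))
# Row t is uniform over t + 1 visible tokens: [1.0], [0.5, 0.5], [0.33, 0.33, 0.33], ...
# so the layer's output statistics already carry a positional signal.
```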

A Study of the Plausibility of Attention between RNN Encoders in Natural Language Inference

2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021

Attention maps in neural models for NLP are appealing as a way to explain the decision made by a model, ideally emphasizing the words that justify the decision. While many empirical studies hint that attention maps can provide such justification from the analysis of sound examples, only a few assess the plausibility of explanations based on attention maps, i.e., the usefulness of attention maps for humans to understand the decision. These studies furthermore focus on text classification. In this paper, we report on a preliminary assessment of attention maps in a sentence-comparison task, namely natural language inference. We compare the cross-attention weights between two RNN encoders with human-based and heuristic-based annotations on the e-SNLI corpus. We show that the heuristic reasonably correlates with human annotations and can thus facilitate the evaluation of plausible explanations in sentence-comparison tasks. Raw attention weights, however, remain only loosely related to a plausible explanation.
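A minimal sketch of the object being evaluated: the cross-attention matrix between hypothesis and premise encoder states. Encoder outputs are faked with random vectors here, and the scaled dot-product scoring is an assumption; in a study of this kind, rows whose mass falls on human-highlighted premise words would count as plausible:

```python
# Cross-attention matrix between two (stand-in) RNN encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
premise_states = rng.standard_normal((6, 64))      # stand-in for premise encoder outputs
hypothesis_states = rng.standard_normal((4, 64))   # stand-in for hypothesis encoder outputs

scores = hypothesis_states @ premise_states.T / np.sqrt(64)   # (hyp_len, prem_len)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = attn / attn.sum(axis=-1, keepdims=True)

# attn[i, j]: weight the i-th hypothesis word places on the j-th premise word.
print(np.round(attn, 2))
```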

Character-Level Language Modeling with Deeper Self-Attention

Proceedings of the AAAI Conference on Artificial Intelligence, 2019

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model (Vaswani et al. 2017) with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
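A small sketch of the auxiliary-loss idea under assumed names and weights (the paper's exact loss schedule and layer count are not reproduced here): every intermediate layer gets its own character-prediction loss, at every sequence position, in addition to the loss at the top of the stack:

```python
# Auxiliary losses at intermediate layers of a causal Transformer stack (sketch).
import torch
import torch.nn as nn

vocab_size, d_model, num_layers, seq_len = 256, 64, 4, 32
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(num_layers)
)
aux_heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(num_layers))
criterion = nn.CrossEntropyLoss()

x = torch.randn(2, seq_len, d_model)                   # embedded characters (stand-in)
targets = torch.randint(0, vocab_size, (2, seq_len))   # next-character targets (stand-in)
causal = nn.Transformer.generate_square_subsequent_mask(seq_len)

total_loss = 0.0
for depth, (layer, head) in enumerate(zip(layers, aux_heads)):
    x = layer(x, src_mask=causal)
    logits = head(x)                                    # predict at every layer, every position
    weight = 1.0 if depth == num_layers - 1 else 0.5    # assumed down-weighting of aux losses
    total_loss = total_loss + weight * criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(total_loss)
```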

Investigation of Transformer-based Latent Attention Models for Neural Machine Translation

2020

Current neural translation networks are based on an effective attention mechanism that can be considered an implicit probabilistic notion of alignment. Such architectures do not guarantee high-quality alignments, even though alignments can easily be used for explainable machine translation. This work describes a latent-variable attention model using the transformer architecture, where we carry out an approximate marginalization over alignments. We show that the alignment quality in transformer models can be improved by introducing a latent variable for the alignments. To study the effect of the latent model, we quantitatively and qualitatively analyze the alignments extracted from the multi-head attention. We demonstrate that this method slightly improves translation quality on four WMT 2018 shared translation tasks, while generating more focused alignments for better interpretability.
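The general form behind such a latent alignment model is an explicit marginalization over the alignment variable, p(y | x) = Σ_a p(a | x) · p(y | a, x), with the attention distribution supplying p(a | x). A toy numeric check with made-up probabilities:

```python
# Toy marginalization over a latent alignment variable (made-up numbers).
import numpy as np

p_align = np.array([0.7, 0.2, 0.1])            # p(a | x): alignment over 3 source words
p_y_given_align = np.array([0.9, 0.3, 0.05])   # p(y | a, x): target-word prob per alignment
p_y = float(p_align @ p_y_given_align)         # explicit sum over alignments
print(p_y)                                     # 0.695
```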

An Analysis of Deep Learning and Attention Models for Natural Language Processing Tasks

International Research Journal of Computer Science

A language is a means of communication among humans. It has a structure and is defined by the rules of grammar, which govern the constitution of sentences, clauses and words. Linguistics, the scientific study of language, concerns itself with a wide variety of topics such as syntax and semantics: how the words in a language are combined to form a sentence, and how meaning and information are derived from the sentence based on the context. Natural Language Processing, as a computational discipline, has the goal of getting computers to perform tasks that involve language, such as sentiment analysis, part-of-speech tagging, machine translation and conversational agents. Classical NLP consisted of the symbolic paradigm, which was based on modelling grammar using theoretical computer science and rules based on logic. Later on, statistical and probabilistic models became the standard, with noisy channel models and Bayesian inference methods being prevalent. In the last two decades, the increase in computing power and the large amounts of data available over the internet have led to an increased focus on Machine Learning. Machine Learning paradigms like Hidden Markov Models were successful. In the last few years, there has been a surge in the use of Deep Learning for various tasks, including NLP. Deep Learning is the subfield of Machine Learning that creates representations of data using artificial neural networks based on computational graphs. In this paper, we will review how deep learning architectures can be used for the task of machine translation and for building conversational agents.