Aliaksei Severyn | University of Trento

Papers by Aliaksei Severyn

Recurrent Dropout Without Memory Loss

This paper presents a novel approach to recurrent neural network (RNN) regularization. Differently from the widely adopted dropout method, which is applied to forward connections of feed-forward architectures or RNNs, we propose to drop neurons directly in recurrent connections in a way that does not cause loss of long-term memory. Our approach is as easy to implement and apply as the regular feed-forward dropout and we demonstrate its effectiveness for the most popular recurrent networks: vanilla RNNs, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Our experiments on three NLP benchmarks show consistent improvements even when combined with conventional feed-forward dropout.
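
The mechanism lends itself to a short illustration. Below is a minimal numpy sketch (variable names, shapes, and the single-step interface are my own assumptions, not the authors' code): dropout is applied only to the cell update candidate g, never to the previous cell state, so the long-term memory carried in c is not erased.

```python
# Minimal sketch: recurrent dropout applied to the LSTM update candidate only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_recurrent_dropout(x, h_prev, c_prev, W, U, b, p_drop, training=True):
    """One LSTM step; dropout hits the candidate g, not the cell state."""
    z = x @ W + h_prev @ U + b                  # joint pre-activation, shape (4 * hidden,)
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    if training:
        mask = (np.random.rand(*g.shape) >= p_drop) / (1.0 - p_drop)
        g = g * mask                            # drop the *increment*; old memory stays intact
    c = f * c_prev + i * g                      # c_prev flows through undropped
    h = o * np.tanh(c)
    return h, c

# toy usage
hidden, inp = 8, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(inp, 4 * hidden))
U = rng.normal(size=(hidden, 4 * hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(3, inp)):
    h, c = lstm_step_recurrent_dropout(x, h, c, W, U, b, p_drop=0.25)
```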

Learning adaptable patterns for passage reranking

This paper proposes passage reranking models that (i) do not require manual feature engineering and (ii) greatly preserve accuracy when changing application domain. Their main characteristic is the use of relational semantic structures representing questions and their answer passages. The relations are established using information from automatic classifiers,

Adversarial Neural Networks for Cross-lingual Sequence Tagging

ArXiv, 2018

We study cross-lingual sequence tagging with little or no labeled data in the target language. Adversarial training has previously been shown to be effective for training cross-lingual sentence classifiers. However, it is not clear if language-agnostic representations enforced by an adversarial language discriminator will also enable effective transfer for token-level prediction tasks. Therefore, we experiment with different types of adversarial training on two tasks: dependency parsing and sentence compression. We show that adversarial training consistently leads to improved cross-lingual performance on each task compared to a conventionally trained baseline.

Modeling Relational Information in Question-Answer Pairs with Convolutional Neural Networks

ArXiv, 2016

In this paper, we propose convolutional neural networks for learning an optimal representation of question and answer sentences. Their main aspect is the use of relational information given by the matches between words from the two members of the pair. The matches are encoded as embeddings with additional parameters (dimensions), which are tuned by the network. This allows for better capturing of interactions between questions and answers, resulting in a significant boost in accuracy. We test our models on two widely used answer sentence selection benchmarks. The results clearly show the effectiveness of our relational information, which allows our relatively simple network to approach the state of the art.
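
As a rough illustration of the relational feature, here is a small numpy sketch (the function name, dimensions, and lookup tables are assumptions for exposition, not the paper's code): each token of one sentence receives an extra learned embedding indicating whether it also occurs in the paired sentence, concatenated to its word vector before the convolutional layers.

```python
# Sketch: word-overlap indicator embedded and concatenated to word vectors.
import numpy as np

def add_overlap_features(tokens_a, tokens_b, word_emb, overlap_emb):
    """word_emb: dict token -> vector; overlap_emb: (2, d_rel) learned table."""
    overlap = set(tokens_a) & set(tokens_b)
    d_word = next(iter(word_emb.values())).shape
    rows = []
    for tok in tokens_a:
        w = word_emb.get(tok, np.zeros(d_word))
        rel = overlap_emb[1] if tok in overlap else overlap_emb[0]
        rows.append(np.concatenate([w, rel]))
    return np.stack(rows)                        # (len(tokens_a), d_word + d_rel)

# toy usage
rng = np.random.default_rng(0)
vocab = ["what", "year", "was", "rome", "founded", "in", "753", "bc"]
word_emb = {w: rng.normal(size=4) for w in vocab}
overlap_emb = rng.normal(size=(2, 2))            # row 0 = no match, row 1 = match
q = ["what", "year", "was", "rome", "founded"]
a = ["rome", "was", "founded", "in", "753", "bc"]
print(add_overlap_features(q, a, word_emb, overlap_emb).shape)   # (5, 6)
```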

Learning to Learn from Weak Supervision by Full Supervision

ArXiv, 2017

In this paper, we propose a method for training neural networks when we have a large set of data with weak labels and a small amount of data with true labels. In our proposed model, we train two neural networks: a target network (the learner) and a confidence network (the meta-learner). The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to control the magnitude of the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. Thus we prevent the weight updates computed from noisy labels from harming the quality of the target network model.
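
A hedged sketch of the training signal described above (the loss form and names are my own simplification, not the paper's implementation): the confidence score produced by the meta-learner scales each weak example's loss, so gradient updates from dubious labels are damped.

```python
# Sketch: per-example weak-label loss weighted by a confidence score.
import numpy as np

def weighted_weak_loss(logits, weak_labels, confidences):
    """Cross-entropy on weak labels, each example scaled by its confidence."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(weak_labels)), weak_labels] + 1e-12)
    return float(np.mean(confidences * nll))

# toy usage: a confident example contributes fully, a dubious one barely
logits = np.array([[2.0, 0.1], [0.3, 0.2]])
weak_labels = np.array([0, 1])
confidences = np.array([0.95, 0.10])             # produced by the confidence network
print(weighted_weak_loss(logits, weak_labels, confidences))
```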

Avoiding Your Teacher's Mistakes: Training Neural Networks with Controlled Weak Supervision

Training deep neural networks requires massive amounts of training data, but for many tasks only limited labeled data is available. This makes weak supervision attractive, using weak or noisy signals like the output of heuristic methods or user click-through data for training. In a semi-supervised setting, we can use a large set of data with weak labels to pretrain a neural network and then fine-tune the parameters with a small amount of data with true labels. This feels intuitively sub-optimal as these two independent stages leave the model unaware of the varying label quality. What if we could somehow inform the model about the label quality? In this paper, we propose a semi-supervised learning method where we train two neural networks in a multi-task fashion: a "target network" and a "confidence network". The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to weight...

On Accurate Evaluation of GANs for Language Generation

Generative Adversarial Networks (GANs) are a promising approach to language generation. The latest works introducing novel GAN models for language generation use n-gram based metrics for evaluation and only report single scores of the best run. In this paper, we argue that this often misrepresents the true picture and does not tell the full story, as GAN models can be extremely sensitive to the random initialization and small deviations from the best hyperparameter choice. In particular, we demonstrate that the previously used BLEU score is not sensitive to semantic deterioration of generated texts and propose alternative metrics that better capture the quality and diversity of the generated samples. We also conduct a set of experiments comparing a number of GAN models for text with a conventional Language Model (LM) and find that neither of the considered models performs convincingly better than the LM.

Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification

Proceedings of the 26th International Conference on World Wide Web, 2017

This paper presents a novel approach for multilingual sentiment classification in short texts. This is a challenging task as the amount of training data in languages other than English is very limited. Previously proposed multilingual approaches typically require establishing a correspondence to English, for which powerful classifiers are already available. In contrast, our method does not require such supervision. We leverage large amounts of weakly supervised data in various languages to train a multi-layer convolutional network and demonstrate the importance of pretraining such networks. We thoroughly evaluate our approach on various multilingual datasets, including the recent SemEval-2016 sentiment prediction benchmark (Task 4), where we achieved state-of-the-art performance. We also compare the performance of our model trained individually for each language to a variant trained for all languages at once. We show that the latter model reaches slightly worse, but still acceptable, performance when compared to the single-language model, while benefiting from better generalization properties across languages.

Neural Ranking Models with Weak Supervision

Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017

Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision and NLP tasks, such improvements have not yet been observed in ranking for information retrieval. The reason may be the complexity of the ranking problem, as it is not obvious how to learn from queries and documents when no supervised signal is available. Hence, in this paper, we propose to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources (e.g., click data). To this aim, we use the output of an unsupervised ranking model, such as BM25, as a weak supervision signal. We further train a set of simple yet effective ranking models based on feed-forward neural networks. We study their effectiveness under various learning scenarios (point-wise and pair-wise models) and using different input representations (i.e., from encoding query-document pairs into dense/sparse vectors to using word embedding representations). We train our networks using tens of millions of training instances and evaluate them on two standard collections: a homogeneous news collection (Robust) and a heterogeneous large-scale web collection (ClueWeb). Our experiments indicate that employing proper objective functions and letting the networks learn the input representation based on weakly supervised data leads to impressive performance, with over 13% and 35% MAP improvements over the BM25 model on the Robust and the ClueWeb collections. Our findings also suggest that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models. Keywords: ranking model; weak supervision; deep neural network; deep learning; ad-hoc retrieval.
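
One way to picture the weak-supervision setup is a pairwise objective driven by BM25, sketched below under assumed names and shapes (an illustration, not the paper's training code): the neural scorer is pushed to order each document pair the same way BM25 does.

```python
# Sketch: pairwise hinge loss where BM25 scores provide the weak preference.
import numpy as np

def pairwise_weak_hinge(model_scores, bm25_scores, margin=1.0):
    """model_scores, bm25_scores: (n_pairs, 2) scores for (doc1, doc2) per query."""
    preference = np.sign(bm25_scores[:, 0] - bm25_scores[:, 1])   # +1 if BM25 prefers doc1
    diff = model_scores[:, 0] - model_scores[:, 1]
    return float(np.mean(np.maximum(0.0, margin - preference * diff)))

# toy usage
model_scores = np.array([[0.2, 1.4], [2.0, 0.5]])
bm25_scores = np.array([[7.1, 3.3], [4.0, 9.2]])
print(pairwise_weak_hinge(model_scores, bm25_scores))
```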

Recurrent Context Window Networks for Italian Named Entity Recognizer

Italian Journal of Computational Linguistics, 2016

In this paper, we introduce a Deep Neural Network (DNN) for engineering Named Entity Recognizers (NERs) in Italian. Our network uses a sliding window of word contexts to predict tags. It relies on a simple word-level log-likelihood as a cost function and uses a new recurrent feedback mechanism to ensure that the dependencies between the output tags are properly modeled. These choices make our network simple and computationally efficient. Unlike previous best NERs for Italian, our model does not require manually designed features, external parsers or additional resources. The evaluation on the Evalita 2009 benchmark shows that our DNN performs on par with the best NERs, outperforming the state of the art when gazetteer features are used.
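
The sliding-window plus tag-feedback idea can be sketched compactly; the following toy code (all names, dimensions, and the dummy classifier are assumptions for illustration, not the paper's network) classifies each position from a window of word embeddings concatenated with an embedding of the previously predicted tag.

```python
# Sketch: context-window tagging with recurrent feedback of the previous tag.
import numpy as np

def tag_sentence(tokens, word_emb, tag_emb, clf, window=2):
    d = next(iter(word_emb.values())).shape[0]
    pad = np.zeros(d)
    prev_tag = 0                                     # index of the "O"/start tag
    tags = []
    for i in range(len(tokens)):
        feats = []
        for j in range(i - window, i + window + 1):
            feats.append(word_emb.get(tokens[j], pad) if 0 <= j < len(tokens) else pad)
        feats.append(tag_emb[prev_tag])              # feedback of the last decision
        prev_tag = clf(np.concatenate(feats))        # any scoring function -> tag index
        tags.append(prev_tag)
    return tags

# toy usage with a dummy classifier
rng = np.random.default_rng(0)
word_emb = {w: rng.normal(size=3) for w in ["maria", "vive", "a", "trento"]}
tag_emb = rng.normal(size=(3, 2))                    # O, PER, LOC
dummy_clf = lambda x: int(abs(x.sum())) % 3
print(tag_sentence(["maria", "vive", "a", "trento"], word_emb, tag_emb, dummy_clf))
```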

Unsupervised Text Style Transfer with Padded Masked Language Models

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

We propose MASKER, an unsupervised text-editing method for style transfer. To tackle cases when no parallel source-target pairs are available, we train masked language models (MLMs) for both the source and the target domain. Then we find the text spans where the two models disagree the most in terms of likelihood. This allows us to identify the source tokens to delete to transform the source text to match the style of the target domain. The deleted tokens are replaced with the target MLM, and by using a padded MLM variant, we avoid having to predetermine the number of inserted tokens. Our experiments on sentence fusion and sentiment transfer demonstrate that MASKER performs competitively in a fully unsupervised setting. Moreover, in low-resource settings, it improves supervised methods' accuracy by over 10 percentage points when pre-training them on silver training data generated by MASKER.
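
A rough sketch of the span-selection step, under assumed interfaces (the two log-likelihood callables stand in for the source and target MLMs; this is not the MASKER implementation): the span scored highest under the source model relative to the target model is the one to delete and later re-infill.

```python
# Sketch: pick the span where source and target models disagree the most.
def most_disagreeing_span(tokens, src_logprob, tgt_logprob, max_len=3):
    best, best_score = None, float("-inf")
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            span = tokens[start:end]
            # high under the source style, low under the target style (length-normalized)
            score = sum(src_logprob(t) - tgt_logprob(t) for t in span) / (end - start)
            if score > best_score:
                best, best_score = (start, end), score
    return best

# toy usage with fake per-token log-likelihoods
src = {"terrible": -1.0, "food": -2.0, "great": -6.0}
tgt = {"terrible": -7.0, "food": -2.0, "great": -1.0}
tokens = ["the", "food", "was", "terrible"]
span = most_disagreeing_span(tokens, lambda t: src.get(t, -3.0), lambda t: tgt.get(t, -3.0))
print(span, tokens[span[0]:span[1]])                 # the span to delete and re-infill
```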

Encode, Tag, Realize: High-Precision Text Editing

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019

We propose LASERTAGGER, a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LASERTAGGER achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
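
The realization step is simple enough to sketch; the code below uses an assumed tag encoding (KEEP, DELETE, ADD_<phrase>) for illustration and is not the LASERTAGGER codebase.

```python
# Sketch: applying KEEP / DELETE / ADD_<phrase> edit tags to source tokens.
def realize(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("ADD_"):
            out.append(tag[len("ADD_"):].replace("_", " "))  # insert the phrase...
            out.append(tok)                                  # ...then keep the token
        elif tag == "KEEP":
            out.append(tok)
        elif tag == "DELETE":
            continue
        else:
            raise ValueError(f"unknown tag: {tag}")
    return " ".join(out)

# toy sentence-fusion example
tokens = ["turing", "was", "born", "in", "1912", ".", "turing", "died", "in", "1954", "."]
tags   = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "DELETE",
          "DELETE", "ADD_and", "KEEP", "KEEP", "KEEP"]
print(realize(tokens, tags))      # "turing was born in 1912 and died in 1954 ."
```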

Globally Normalized Transition-Based Neural Networks

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016

We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. The key insight is based on a novel proof illustrating the label bias problem and showing that globally normalized models can be strictly more expressive than locally normalized models.
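
The contrast between local and global normalization can be made concrete with a tiny brute-force example (toy scores and labels of my own choosing, not the paper's model): the local model renormalizes the scores at every step, while the global model normalizes the summed score over whole sequences, which is what lets it avoid the label bias problem.

```python
# Sketch: probability of a decision sequence under local vs. global normalization.
import itertools
import numpy as np

def local_prob(score_fn, seq, labels):
    p = 1.0
    for t, y in enumerate(seq):
        prev = seq[t - 1] if t else None
        scores = np.array([score_fn(t, prev, l) for l in labels])
        p *= np.exp(scores[labels.index(y)]) / np.exp(scores).sum()   # per-step softmax
    return p

def global_prob(score_fn, seq, labels):
    def total(s):
        return sum(score_fn(t, s[t - 1] if t else None, s[t]) for t in range(len(s)))
    z = sum(np.exp(total(s)) for s in itertools.product(labels, repeat=len(seq)))
    return np.exp(total(seq)) / z                                     # whole-sequence softmax

def score(t, y_prev, y):
    # arbitrary toy scores: the first step slightly prefers "A", later steps prefer "B"
    if y_prev is None:
        return 1.0 if y == "A" else 0.9
    return 0.5 if y == "B" else 0.0

labels, seq = ["A", "B"], ("B", "B")
print(local_prob(score, seq, labels), global_prob(score, seq, labels))
```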

Editoriale

Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it 2015

UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification

Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015

This paper describes our deep learning system for sentiment analysis of tweets. The main contribution of this work is a process to initialize the parameter weights of the convolutional neural network, which is crucial to train an accurate model while avoiding the need to inject any additional features. Briefly, we use an unsupervised neural language model to initialize word embeddings that are further tuned by our deep learning model on a distant supervised corpus. At a final stage, the pre-trained parameters of the network are used to initialize the model, which is then trained on the supervised training data from SemEval-2015. According to results on the official test sets, our model ranks 1st in the phrase-level subtask A (among 11 teams) and 2nd in the message-level subtask B (among 40 teams). Interestingly, computing an average rank over all six test sets (official and five progress test sets) puts our system 1st in both subtasks A and B.

Distributional Neural Networks for Automatic Resolution of Crossword Puzzles

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015

Automatic resolution of Crossword Puzzles (CPs) heavily depends on the quality of the answer candidate lists produced by a retrieval system for each clue of the puzzle grid. Previous work has shown that such lists can be generated using Information Retrieval (IR) search algorithms applied to the databases containing previously solved CPs and reranked with tree kernels (TKs) applied to a syntactic tree representation of the clues. In this paper, we create a labelled dataset of 2 million clues on which we apply an innovative Distributional Neural Network (DNN) for reranking clue pairs. Our DNN is computationally efficient and can thus take advantage of such large datasets, showing a large improvement over the TK approach when the latter uses small training data. In contrast, when data is scarce, TKs outperform DNNs.
