Enhancing Unsupervised Sentence Similarity Methods with Deep Contextualised Word Representations
Related papers
A Word Embeddings Model for Sentence Similarity
Research in Computing Science, 2016
Word embeddings (Bengio et al., 2003; Mikolov et al., 2013) have recently seen a major boom due to their performance across different Natural Language Processing tasks, surpassing many conventional methods in the literature. From the obtained embedding vectors we can form good groupings of words and surface elements. It is common to represent higher-level units such as sentences using the idea of composition (Baroni et al., 2014), through vector sum, vector product, or by defining a linear operator representing the composition. Here, we propose representing a sentence as a matrix containing the embedding vectors of its words. However, this requires a distance between matrices; to obtain one, we use the Frobenius inner product. We show that this sentence representation outperforms traditional composition methods.
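A minimal sketch of the matrix representation and the Frobenius inner product used as a similarity, assuming a token-to-vector mapping `emb`; zero-padding to a fixed length is an illustrative assumption, not necessarily the paper's scheme:

```python
import numpy as np

def sentence_matrix(tokens, emb, dim=300, max_len=20):
    """Stack the word vectors of a sentence into a fixed-shape matrix.
    Zero-padding/truncation to max_len is an illustrative assumption."""
    M = np.zeros((max_len, dim))
    for i, tok in enumerate(tokens[:max_len]):
        M[i] = emb.get(tok, np.zeros(dim))  # emb: token -> vector mapping
    return M

def frobenius_similarity(A, B):
    """Frobenius inner product <A, B>_F = trace(A^T B), normalised by the
    Frobenius norms so it behaves like a cosine between matrices."""
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B) + 1e-12)
```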
Weiwei: A Simple Unsupervised Latent Semantics based Approach for Sentence Similarity
2012
The Semantic Textual Similarity (STS) shared task (Agirre et al., 2012) computes the degree of semantic equivalence between two sentences. We show that a simple unsupervised latent-semantics approach, Weighted Textual Matrix Factorization, which exploits only bag-of-words features, can outperform most systems for this task. The key to the approach is to carefully handle missing words that do not occur in a sentence, which renders it superior to Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Our system ranks 20th out of 89 systems according to the official evaluation metric for the task, Pearson correlation, and ranks 10th and 19th out of 89 in the other two evaluation metrics employed by the organizers.
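A compact alternating-least-squares sketch of weighted matrix factorization in the spirit of WTMF; down-weighting the missing (zero) cells instead of ignoring them is the key idea described above, while the hyperparameters and update schedule here are illustrative assumptions:

```python
import numpy as np

def wtmf(X, k=10, w_missing=0.01, lam=20.0, iters=10, seed=0):
    """Weighted-ALS sketch. X: term-by-sentence matrix (e.g. TF-IDF);
    zero cells are treated as missing words and down-weighted by
    w_missing rather than dropped from the objective."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    P = 0.1 * rng.standard_normal((k, n))  # word factors
    Q = 0.1 * rng.standard_normal((k, m))  # sentence factors
    W = np.where(X != 0, 1.0, w_missing)   # per-cell weights
    I = lam * np.eye(k)
    for _ in range(iters):
        for i in range(n):                 # update word vectors
            Wi = np.diag(W[i])
            P[:, i] = np.linalg.solve(Q @ Wi @ Q.T + I, Q @ Wi @ X[i])
        for j in range(m):                 # update sentence vectors
            Wj = np.diag(W[:, j])
            Q[:, j] = np.linalg.solve(P @ Wj @ P.T + I, P @ Wj @ X[:, j])
    return P, Q  # sentence similarity = cosine between columns of Q
```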
Sentence encoders for semantic textual similarity: A survey
The last decade has witnessed many accomplishments in the field of Natural Language Processing, especially in understanding language semantics. Well-established machine learning models for generating word representations are available and have proven useful. However, the existing techniques proposed for learning sentence-level representations do not adequately capture the complexity of compositional semantics. Finding semantic similarity between sentences is a fundamental language understanding problem. In this project, we compare various machine learning models on their ability to capture the semantics of a sentence using the Semantic Textual Similarity (STS) task. We focus on models that exhibit state-of-the-art performance in the SemEval-2017 STS shared task. We also analyse the impact of the models' internal architectures on STS task performance. Of all the models we compared, a Bi-LSTM RNN with a max-pooling layer achieves the best performance in extracting a generic semantic representation and supports better transfer learning than a hierarchical CNN.
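A minimal PyTorch sketch of the winning architecture type, a Bi-LSTM sentence encoder with max pooling over time; the vocabulary and layer sizes are illustrative, not those of the surveyed systems:

```python
import torch
import torch.nn as nn

class BiLSTMMaxPool(nn.Module):
    """Bi-LSTM encoder with element-wise max pooling over time steps."""
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):              # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))  # (batch, seq_len, 2*hidden)
        return h.max(dim=1).values             # fixed-size sentence vector
```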
Sentence Similarity Techniques for Short vs Variable Length Text using Word Embeddings
Computación y Sistemas, 2019
In goal-oriented conversational agents such as chatbots, finding the similarity between the user input and the representative text is a big challenge. Conversational agent developers generally provide a minimal number of utterances per intent, which makes the classification task difficult. The problem becomes more complex when the representative text per action is short and the user input is long. We propose a methodology that derives a sentence similarity score based on n-grams and a sliding window, and uses FastText word embeddings; it outperforms the current state-of-the-art sentence similarity results. We also publish a shopping-domain dataset for building conversational agents. Extensive experiments on this dataset improved accuracy, precision, and recall by 6%, 2%, and 80%, respectively. The results also show that our solution generalizes well on small corpora and requires no training.
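A hedged sketch of the sliding-window idea: score the short text against every window of its own length in the long input and keep the best match, so one off-topic span cannot drown the score. The `emb` lookup stands in for FastText vectors; the exact windowing and n-gram weighting in the paper may differ:

```python
import numpy as np

def avg_vec(tokens, emb, dim=300):
    """Average of word vectors; FastText-style subword handling is assumed
    to live inside `emb` (a token -> vector mapping here)."""
    vecs = [emb.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def sliding_window_sim(short_toks, long_toks, emb, dim=300):
    """Best cosine between the short text and any window of equal length."""
    n = len(short_toks)
    ref = avg_vec(short_toks, emb, dim)
    if len(long_toks) <= n:
        return cosine(ref, avg_vec(long_toks, emb, dim))
    return max(cosine(ref, avg_vec(long_toks[i:i + n], emb, dim))
               for i in range(len(long_toks) - n + 1))
```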
Learning Semantic Textual Similarity from Conversations
Proceedings of The Third Workshop on Representation Learning for NLP
We present a novel approach to learn representations for sentence-level semantic similarity using conversational data. Our method trains an unsupervised model to predict conversational input-response pairs. The resulting sentence embeddings perform well on the semantic textual similarity (STS) benchmark and SemEval 2017's Community Question Answering (CQA) question similarity subtask. Performance is further improved by introducing multitask training combining the conversational input-response prediction task and a natural language inference task. Extensive experiments show the proposed model achieves the best performance among all neural models on the STS benchmark and is competitive with the state-of-the-art feature engineered and mixed systems in both tasks.
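One common way to train such an input-response dual encoder is an in-batch softmax over dot-product scores, sketched below; the paper's precise objective and encoder towers may differ:

```python
import torch
import torch.nn.functional as F

def response_prediction_loss(input_vecs, response_vecs):
    """Each conversational input should score its own response above all
    other responses in the batch. Both arguments are (batch, dim)
    outputs of two encoder towers."""
    scores = input_vecs @ response_vecs.t()     # (batch, batch) score matrix
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)     # diagonal pairs are correct
```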
Short Texts Semantic Similarity Based on Word Embeddings
2019
Evaluating the semantic similarity of texts is a task of paramount importance in real-world applications. In this paper, we describe experiments carried out to evaluate the performance of different forms of word embeddings, and of their aggregations, in the task of measuring the similarity of short texts. In particular, we explore the results obtained with two publicly available pre-trained word embeddings (one based on word2vec trained on a specific dataset, and a second extending it with embeddings of word senses). We test five approaches for aggregating words into texts. Two approaches are based on centroids and summarize a text as a single word-embedding vector. The other approaches are variations of the Okapi BM25 function and directly provide a measure of the similarity of the two texts.
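A sketch of one centroid-style aggregation, an IDF-weighted average of word embeddings followed by cosine similarity; the specific weighting used in the paper may differ:

```python
import numpy as np

def weighted_centroid(tokens, emb, idf, dim=300):
    """IDF-weighted average of the word embeddings of a text."""
    num, den = np.zeros(dim), 0.0
    for t in tokens:
        if t in emb:
            w = idf.get(t, 1.0)   # fall back to weight 1 for unseen terms
            num += w * emb[t]
            den += w
    return num / den if den else num

def text_similarity(a_toks, b_toks, emb, idf):
    a = weighted_centroid(a_toks, emb, idf)
    b = weighted_centroid(b_toks, emb, idf)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```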
HHU at SemEval-2016 Task 1: Multiple Approaches to Measuring Semantic Textual Similarity
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016
This paper describes our participation in the SemEval-2016 Task 1: Semantic Textual Similarity (STS). We developed three methods for the English subtask (STS Core). The first method is unsupervised and uses WordNet and word2vec to measure a token-based overlap. In our second approach, we train a neural network on two features. The third method uses word2vec and LDA with regression splines.
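A sketch of the first (unsupervised) method's general shape: a soft token overlap in which each token is matched with its most similar counterpart by embedding cosine and the scores are averaged and symmetrised; the WordNet component of the actual system is omitted here:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def token_overlap_score(toks_a, toks_b, emb):
    """Soft, symmetric token overlap driven by word2vec-style vectors."""
    def directed(src, tgt):
        sims = [max((cosine(emb[s], emb[t]) for t in tgt if t in emb),
                    default=0.0)
                for s in src if s in emb]
        return sum(sims) / len(sims) if sims else 0.0
    return 0.5 * (directed(toks_a, toks_b) + directed(toks_b, toks_a))
```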
Deep Learning based Semantic Similarity Detection using Text Data
Information Technology And Control, 2020
Similarity detection in text is a core task for a number of Natural Language Processing (NLP) applications. Because textual data is far larger in quantity and volume than numeric data, measuring textual similarity is an important problem. Most similarity detection algorithms are based on word-to-word matching, sentence/paragraph matching, or matching of whole documents. In this research, a novel approach is proposed using deep learning models, combining a Long Short-Term Memory network (LSTM) with a Convolutional Neural Network (CNN) to measure the semantic similarity between two questions. The proposed model takes sentence pairs as input and measures the similarity between them. The model is tested on the publicly available Quora dataset. Compared to existing techniques, the model achieves 87.50% accuracy, improving on previous approaches.
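A PyTorch sketch of one plausible shared encoder tower combining a CNN with an LSTM; the layer sizes and the CNN-before-LSTM ordering are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CNNLSTMEncoder(nn.Module):
    """1-D convolution over word embeddings feeding an LSTM; the final
    hidden state serves as the question vector."""
    def __init__(self, vocab=30000, emb_dim=300, channels=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)

    def forward(self, ids):                            # (batch, seq)
        x = self.emb(ids).transpose(1, 2)              # (batch, emb_dim, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # (batch, seq, channels)
        _, (h, _) = self.lstm(x)
        return h[-1]                                   # (batch, hidden)

# The two question vectors are then compared (e.g. cosine or a small MLP)
# and trained against the duplicate/non-duplicate label.
```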
CA-RNN: Using Context-Aligned Recurrent Neural Networks for Modeling Sentence Similarity
Proceedings of the AAAI Conference on Artificial Intelligence
Recurrent neural networks (RNNs) have shown good performance for sentence similarity modeling in recent years. Most RNNs focus on modeling the hidden states based on the current sentence, while the contextual information from the other sentence is not well exploited during hidden state generation. In this paper, we propose a context-aligned RNN (CA-RNN) model, which incorporates the contextual information of the aligned words in a sentence pair into the generation of the inner hidden states. Specifically, we first perform word alignment detection to identify the aligned words in the two sentences. Then, we present a context alignment gating mechanism and embed it into our model so that it automatically absorbs the aligned words' context for the hidden state update. Experiments on three benchmark datasets, namely TREC-QA and WikiQA for answer selection and MSRP for paraphrase identification, demonstrate the advantages of our proposed model. In particular, we achieve the new state-of-the-art...
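A sketch of the gating idea: per dimension, blend the current hidden state with the context vector of the aligned word from the other sentence. The parameterisation below is one plausible reading of the mechanism, not the paper's exact equations:

```python
import torch
import torch.nn as nn

class ContextAlignmentGate(nn.Module):
    """Gate deciding how much aligned context from the other sentence
    to absorb into the current hidden state."""
    def __init__(self, hidden):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, h_t, aligned_ctx):
        # g in (0, 1) per dimension: 1 keeps the state, 0 takes the context
        g = torch.sigmoid(self.gate(torch.cat([h_t, aligned_ctx], dim=-1)))
        return g * h_t + (1.0 - g) * aligned_ctx
```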
Comparative Analysis of Word Embeddings for Capturing Word Similarities
Distributed language representation has become the most widely used technique for representing language in various natural language processing tasks. Most natural language processing models based on deep learning techniques use pre-trained distributed word representations, commonly called word embeddings. Determining which word embeddings are of the highest quality is of crucial importance for such models. However, selecting appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods: their performance in capturing word similarities is analysed against existing benchmark datasets of word-pair similarities. We conduct a correlation analysis between ground-truth word similarities and the similarities obtained by the different word embedding methods.
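A sketch of this kind of intrinsic evaluation: rank-correlate embedding cosine similarities with human word-pair judgments, skipping out-of-vocabulary pairs (a common convention; dataset and handling details vary):

```python
import numpy as np
from scipy.stats import spearmanr

def intrinsic_eval(word_pairs, human_scores, emb):
    """Spearman correlation between embedding similarities and human
    word-pair judgments (e.g. WordSim-353-style data)."""
    model, gold = [], []
    for (w1, w2), s in zip(word_pairs, human_scores):
        if w1 in emb and w2 in emb:      # skip OOV pairs
            a, b = emb[w1], emb[w2]
            model.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            gold.append(s)
    rho, _ = spearmanr(model, gold)
    return rho
```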