Exploring Deep Learning in Semantic Question Matching
Related papers
An Analysis of Pairwise Question Matching with Machine Learning
International Journal of Advanced Trends in Computer Science and Engineering, 2023
In Natural Language Processing (NLP) and machine learning, detecting duplicate question pairs with semantic precision remains a challenging problem. Our research aims to build a model that determines whether two questions, despite differences in phrasing, spelling, or grammar, share a common intent on digital forums or search engines. A central part of this study is the creation and training of a model on a curated dataset of labeled question pairs, each annotated as either duplicate or distinct. By leveraging state-of-the-art NLP techniques, we aim to build an accurate model that improves the user search experience by facilitating the identification of duplicate questions. This research paves the way for a more refined approach to the challenges of semantic similarity in question pairs.
Identifying Semantically Duplicate Questions Using Data Science Approach: A Quora Case Study
2020
Identifying semantically identical questions on question-and-answer social media platforms like Quora is important for ensuring that the quality and quantity of content presented to users match the intent of the question, thus enriching the overall user experience. Detecting duplicate questions is a challenging problem because natural language is very expressive, and a single intent can be conveyed using different words, phrases, and sentence structures. Machine learning and deep learning methods are known to have achieved superior results over traditional natural language processing techniques in identifying similar texts. In this paper, taking Quora as our case study, we explored and applied different machine learning and deep learning techniques to the task of identifying duplicate questions in Quora's dataset. By using feature engineering, feature importance techniques, and experimenting with seven selected machine learning classifiers, we d...
Deep Learning based Semantic Similarity Detection using Text Data
Information Technology And Control, 2020
Similarity detection in text is a core task for a number of Natural Language Processing (NLP) applications. Because textual data is far larger in quantity and volume than numeric data, measuring textual similarity is an important problem. Most similarity detection algorithms are based on word-to-word matching, sentence/paragraph matching, or matching of the whole document. In this research, a novel approach is proposed using deep learning models, combining a Long Short-Term Memory (LSTM) network with a Convolutional Neural Network (CNN) to measure the semantic similarity between two questions. The proposed model takes sentence pairs as input and measures the similarity between them. The model is tested on the publicly available Quora dataset. It achieved 87.50% accuracy, which is better than previous approaches.
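The word-to-word matching that this abstract contrasts with deep models can be illustrated with a minimal sketch. It assumes lowercase whitespace tokenization and Jaccard overlap; neither choice comes from the paper, and the function name is hypothetical.

```python
def jaccard_similarity(q1: str, q2: str) -> float:
    """Word-overlap score in [0, 1]; an illustrative baseline, not the paper's model."""
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a and not b:
        return 1.0  # two empty questions are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Near-identical wording scores high.
    print(jaccard_similarity("how do i learn python", "how can i learn python"))
    # A paraphrase with no shared words scores zero, although the intent matches.
    print(jaccard_similarity("what is your age", "how old are you"))
```

The second call shows why pure word overlap tends to miss paraphrases, which is the gap that learned sentence representations are meant to close.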
Contextualized Embeddings based Convolutional Neural Networks for Duplicate Question Identification
2021
Question Paraphrase Identification (QPI) is a critical task for large-scale Question-Answering forums. The purpose of QPI is to determine whether a given pair of questions are semantically identical or not. Previous approaches for this task have yielded promising results, but have often relied on complex recurrence mechanisms that are expensive and time-consuming in nature. In this paper, we propose a novel architecture combining a Bidirectional Transformer Encoder with Convolutional Neural Networks for the QPI task. We produce the predictions from the proposed architecture using two different inference setups: Siamese and Matched Aggregation. Experimental results demonstrate that our model achieves state-of-the-art performance on the Quora Question Pairs dataset. We empirically prove that the addition of convolution layers to the model architecture improves the results in both inference setups. We also investigate the impact of partial and complete fine-tuning and analyze the trade...
An Enhanced Question Pair Similarity Using Machine Learning Approaches
International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2022
Quora is a place to gain and share knowledge about anything. It is a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world. Currently, Quora uses a Random Forest model to identify duplicate questions. Tackling this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates will make it easier to find high-quality answers, resulting in an improved experience for Quora writers, seekers, and readers. By enhancing the features level by level across three systems, we found XGBoost to be the best model for this problem, not only for Quora but also for Stack Overflow, Medium, etc.
Duplicate Questions Pair Detection Using Siamese MaLSTM
IEEE Access, 2020
Quora is a growing platform comprising a user-generated collection of questions and answers. The questions and answers are created, edited, and organized by the users. The enormous number of users on the Quora website makes it unavoidable to have multiple questions from different users with similar intent, which raises the issue of duplicate questions. Effectively detecting duplicate questions would make it easier to find high-quality answers and help save time, which in turn would result in an improved user experience for writers and readers on Quora. In this paper, the Quora Question Pairs dataset is collected from Kaggle for detection of duplicate questions. First, three types of word embeddings (Google News vector embedding, FastText crawl embedding with 300 dimensions, and FastText crawl subword embedding with 300 dimensions) are implemented individually to vectorize all the questions and train the model. The final features used for prediction are a blend of these three types of word embeddings. Then, a Siamese MaLSTM ("Ma" for Manhattan distance) neural network model is applied to predict duplicate questions in the dataset. Finally, the model is tested on 100,000 pairs of questions. The experiments show that the proposed model achieves 91.14% accuracy, which is better than the state-of-the-art models. Index terms: duplicate question pair detection, text mining, deep learning, MaLSTM, word embedding.
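The "Ma" in MaLSTM refers to Manhattan distance. As a minimal sketch, assuming the commonly used MaLSTM formulation in which similarity is exp(-||h1 - h2||_1) over the two final LSTM hidden states (the abstract itself does not state the formula), the scoring step could look like this. The function name and the example vectors are hypothetical stand-ins for real encoder outputs.

```python
import math
from typing import Sequence


def manhattan_similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Map the L1 distance between two encodings to a score in (0, 1].

    Assumption: h1 and h2 stand in for the final hidden states that a
    shared (Siamese) LSTM would produce for each question.
    """
    if len(h1) != len(h2):
        raise ValueError("encodings must have the same dimension")
    l1_distance = sum(abs(a - b) for a, b in zip(h1, h2))
    return math.exp(-l1_distance)


if __name__ == "__main__":
    # Hypothetical 3-dimensional encodings; real ones would come from the LSTM.
    print(manhattan_similarity([0.2, 0.5, 0.1], [0.2, 0.5, 0.1]))  # identical inputs
    print(manhattan_similarity([0.2, 0.5, 0.1], [0.9, 0.0, 0.7]))  # distant inputs
```

Identical encodings score exactly 1, and the score decays toward 0 as the encodings move apart, so a threshold on it can serve as the duplicate decision.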
Semantic Question Matching in Data Constrained Environment
2018
Machine comprehension of various forms of semantically similar questions with the same or similar answers has been an ongoing challenge. Especially in many industrial domains with a limited set of questions, it is hard to identify the proper semantic match for a newly asked question that has the same answer but is presented in a different lexical form. This paper proposes a linguistically motivated taxonomy for English questions and an effective approach for question matching by combining deep learning models for question representations with general taxonomy-based features. Experiments performed on short datasets demonstrate the effectiveness of the proposed approach, as better matching classification was observed when coupling the standard distributional features with knowledge-based methods.
Can Taxonomy Help? Improving Semantic Question Matching using Question Taxonomy
2018
In this paper, we propose a hybrid technique for semantic question matching. It uses a proposed two-layered taxonomy for English questions by augmenting state-of-the-art deep learning models with question classes obtained from a deep learning based question classifier. Experiments performed on three open-domain datasets demonstrate the effectiveness of our proposed approach. We achieve state-of-the-art results on the partial ordering question ranking (POQR) benchmark dataset. Our empirical analysis shows that coupling standard distributional features (provided by the question encoder) with knowledge from the taxonomy is more effective than either deep learning or taxonomy-based knowledge alone.
Using deep learning models for learning semantic text similarity of Arabic questions
International Journal of Electrical and Computer Engineering (IJECE), 2021
Question-answering platforms serve millions of users seeking knowledge and solutions for their daily life problems. However, many knowledge seekers face the challenge of finding the right answer among similar answered questions, and writers responding to questions feel that they need to repeat answers many times for similar questions. This research aims to tackle the problem of learning the semantic text similarity among different questions by using deep learning. Three models are implemented to address this problem: i) a supervised machine learning model using XGBoost trained with pre-defined features, ii) an adapted Siamese-based deep learning recurrent architecture trained with pre-defined features, and iii) a pre-trained deep bidirectional transformer based on the BERT model. The proposed models were evaluated using a reference Arabic dataset from the mawdoo3.com company. Evaluation results show that the BERT-based model outperforms the other two models with F1=92.99%, whereas the Siamese-based model comes in second place with F1=89.048%, and the XGBoost baseline model achieved the lowest result of F1=86.086%.
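The F1 scores quoted in this abstract are the harmonic mean of precision and recall. A minimal sketch of that computation follows; the confusion counts in the usage lines are hypothetical and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0  # no correct positive predictions; also avoids dividing by zero
    precision = tp / (tp + fp)  # share of predicted duplicates that are correct
    recall = tp / (tp + fn)     # share of true duplicates that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for a duplicate-question classifier.
    print(f1_score(tp=90, fp=10, fn=5))
    print(f1_score(tp=0, fp=3, fn=4))
```

Because F1 ignores true negatives, it is a common choice when the duplicate class is the one of interest and the classes are imbalanced.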
Review on Exploring Similarity between Two Questions Using Machine Learning
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2021
Question duplication is a major problem that arises from allowing users to ask questions freely. It is faced by question-and-answer sites such as Quora, Reddit, Stack Overflow, and others. When the same question is asked repeatedly, its answers become fragmented across the various versions of that question. This is a poor experience for users, and it is bad for writers as well as seekers. The aim is to detect duplicate questions in order to reduce redundancy in the data. In this proposed work, a simple neural architecture for natural language inference will be used. The approach uses attention to decompose the problem into sub-problems that can be solved separately, thus making it trivially parallelizable. This work follows a new pattern for the solution; it may not provide a complete solution to the problem, but it may help increase the efficiency of the model in predicting duplicates among question pairs. Because duplication on these discussion boards fragments answers across variants of the same question, the consequences are poor search results, inconsistent solutions, scattered knowledge, and insufficient responses to questioners. This could be avoided by employing Natural Language Processing and Machine Learning, which will also help to improve performance.