Using deep learning models for learning semantic text similarity of Arabic questions

Deep Contextualized Pairwise Semantic Similarity for Arabic Language Questions

IEEE International Conference on Tools with Artificial Intelligence, 2019

Question semantic similarity is a challenging and active research problem that is very useful in many NLP applications, such as detecting duplicate questions on community question answering platforms like Quora. Arabic is considered an under-resourced language, has many dialects, and is rich in morphology. Combined, these challenges make identifying semantically similar questions in Arabic even more difficult. In this paper, we introduce a novel approach to tackle this problem and test it on two benchmarks: one for Modern Standard Arabic (MSA) and another for the 24 major Arabic dialects. We show that our new system outperforms state-of-the-art approaches, achieving a 93% F1-score on the MSA benchmark and 82% on the dialectal one. This is achieved by utilizing contextualized word representations (ELMo embeddings) trained on a text corpus containing MSA and dialectal sentences. This, in combination with a pairwise fine-grained similarity layer, helps our question-to-question similarity model generalize its predictions to different dialects while being trained only on question-to-question MSA data.
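The abstract does not spell out how the pairwise similarity layer works. As a rough illustration only, one common way to compare two questions at a fine-grained level is to compute the cosine similarity between every word vector of the first question and every word vector of the second. The sketch below assumes the contextual embeddings are already available as plain lists of floats; the function names are hypothetical and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(q1_vectors, q2_vectors):
    """Return a matrix whose entry [i][j] compares word i of question 1 with word j of question 2.

    Assumption: each question is given as a list of word vectors, for example
    contextual embeddings produced by some upstream encoder.
    """
    return [[cosine(u, v) for v in q2_vectors] for u in q1_vectors]


# Toy 2-dimensional "embeddings", purely for illustration.
question_1 = [[1.0, 0.0], [0.0, 1.0]]
question_2 = [[1.0, 0.0], [1.0, 1.0]]
matrix = pairwise_similarity(question_1, question_2)
```

A downstream classifier could then pool or convolve over such a matrix; how the paper's model does this is not stated in the abstract.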

ST NSURL 2019 Shared Task: Semantic Question Similarity in Arabic

2019

In this paper, we describe the solution that we propose for the shared task NSURL 2019 Semantic Question Similarity in Arabic. The proposed solution combines three approaches: lexical, statistical, and neural. The lexical approach is based on similarity measures. The statistical approach utilizes a set of binary classifiers. The neural approach uses a Siamese Deep Neural Network Model.
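The abstract does not say which similarity measures the lexical approach relies on. A minimal sketch of one widely used lexical measure, Jaccard similarity over whitespace tokens, is shown here as an assumption about what such a component might look like; the function name is hypothetical.

```python
def jaccard_similarity(question_a: str, question_b: str) -> float:
    """Jaccard similarity between the token sets of two questions.

    Tokens are obtained by whitespace splitting, which is a simplification:
    real Arabic text would usually need normalization and segmentation first.
    """
    tokens_a = set(question_a.split())
    tokens_b = set(question_b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty questions are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


score = jaccard_similarity("a b c", "b c d")  # 2 shared tokens out of 4 distinct
```

A score like this could serve as one input feature alongside the statistical and neural components the abstract mentions.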

Exploring Deep Learning in Semantic Question Matching

IEEE, 2018

Question duplication is a major problem encountered by Q&A forums such as Quora, Stack Overflow, and Reddit. Answers get fragmented across different versions of the same question because of redundant questions in these forums. Eventually, this results in the lack of a sensible search, answer fatigue, segregation of information, and a paucity of responses to questioners. Duplicate questions can be detected using machine learning and natural language processing. A dataset of more than 400,000 question pairs provided by Quora is pre-processed through tokenization, lemmatization, and removal of stop words. This pre-processed dataset is used for feature extraction. An artificial neural network is then designed, and the extracted features are fed into the model. This neural network gives an accuracy of 86.09%. In a nutshell, this research predicts the semantic overlap between question pairs by extracting highly dominant features and hence determines the probability of a question being a duplicate.
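The pre-processing and feature extraction steps described above can be sketched roughly as follows. The stop-word list and the three features are hypothetical choices made for illustration, and lemmatization is omitted to keep the example dependency-free; the abstract does not list the features actually used.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "to", "of", "in"}


def preprocess(question):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def pair_features(question_1, question_2):
    """Compute simple hand-crafted features for a question pair."""
    tokens_1, tokens_2 = preprocess(question_1), preprocess(question_2)
    set_1, set_2 = set(tokens_1), set(tokens_2)
    common = set_1 & set_2
    total = len(set_1) + len(set_2)
    return {
        "common_tokens": len(common),
        "length_difference": abs(len(tokens_1) - len(tokens_2)),
        "shared_ratio": 2 * len(common) / total if total else 0.0,
    }


features = pair_features(
    "How do I learn Python?",
    "What is the best way to learn Python?",
)
```

Feature dictionaries of this kind would then be converted to numeric vectors and passed to a classifier such as the neural network the abstract describes.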

A comparative analysis on question classification task based on deep learning approaches

PeerJ Computer Science

Question classification is one of the essential tasks for automatic question answering implementation in natural language processing (NLP). Recently, several text-mining problems, such as text classification, document categorization, web mining, sentiment analysis, and spam filtering, have been successfully addressed by deep learning approaches. In this study, we investigate several deep learning approaches for question classification in Turkish, a highly inflected language. We trained and tested the deep learning architectures on a dataset of questions in Turkish. We used three main deep learning approaches (Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Convolutional Neural Networks (CNN)) and also applied two combined architectures, CNN-GRU and CNN-LSTM. Furthermore, we applied the Word2vec technique with both skip-gram and CBOW methods for word em...

Deep Learning based Semantic Similarity Detection using Text Data

Information Technology And Control, 2020

Similarity detection in text is the main task for a number of Natural Language Processing (NLP) applications. Because textual data is larger in quantity and volume than numeric data, measuring textual similarity is an important problem. Most similarity detection algorithms are based on word-to-word matching, sentence/paragraph matching, or matching of the whole document. In this research, a novel approach is proposed using deep learning models, combining a Long Short-Term Memory (LSTM) network with a Convolutional Neural Network (CNN) to measure the semantic similarity between two questions. The proposed model takes sentence pairs as input and measures the similarity between them. The model is tested on the publicly available Quora dataset. It achieved 87.50% accuracy, which is better than the previous approaches.

NSURL-2019 Task 8: Semantic Question Similarity in Arabic

2019

Question semantic similarity (Q2Q) is a challenging task that is very useful in many NLP applications, such as detecting duplicate questions and question answering systems. In this paper, we present the results and findings of the shared task (Semantic Question Similarity in Arabic). The task was organized as part of the first workshop on NLP Solutions for Under Resourced Languages (NSURL 2019). The goal of the task is to predict whether two questions are semantically similar, even if they are phrased differently. A total of 9 teams participated in the task. The datasets created for this task are made publicly available to support further research on Arabic Q2Q.

So2al-wa-Gwab: A new Arabic Question-Answering Dataset Trained on Answer Extraction Models

ACM Transactions on Asian and Low-Resource Language Information Processing, 2023

Question answering (QA) is the task of automatically responding to questions posed by users. A question-answering system is divided into three main components: question analysis, information retrieval, and answer extraction; this paper focuses only on the answer extraction part. In the past couple of years, many QA systems have been developed and have become mature and ready for use in different languages. Nevertheless, the advancement of Arabic QA systems still faces various obstacles and a lack of relevant resources and tools for researchers. This paper presents the So2al-wa-Gwab dataset, since the publicly available datasets have various faults, such as the use of machine translation to build the data, a short context size, and a small number of question-answer pairs. The new dataset avoids these drawbacks. Furthermore, in this paper, we have trained three deep learning models, namely the Bi-Directional Flow network (BiDAF), QA Network (QANet), and BERT, and tested them on seven different datasets, thus providing a comprehensive comparison between existing Arabic QA datasets. The obtained results emphasize that machine-translated datasets fall behind when compared with human-annotated data. Also, the QA task becomes harder as the context from which the answer is extracted becomes larger. CCS Concepts: • Computing methodologies → Language resources.

UniMelb at SemEval-2016 Task 3: Identifying Similar Questions by combining a CNN with String Similarity Measures

This paper describes the results of the participation of The University of Melbourne in the community question-answering (CQA) task of SemEval 2016 (Task 3-B). We obtained a MAP score of 70.2% on the test set by combining three classifiers: a Naive Bayes classifier and a support vector machine (SVM), each trained over lexical similarity features, and a convolutional neural network (CNN). The CNN uses word embeddings and machine translation evaluation scores as features.

Apply deep learning to improve the question analysis model in the Vietnamese question answering system

International Journal of Electrical and Computer Engineering (IJECE), 2023

Question answering (QA) systems are now quite popular for automated answering purposes. Analyzing the meaning of the question plays an important role and directly affects the accuracy of the system. In this article, we propose an improvement for question-answering models by adding more specific question analysis steps, including contextual characteristic analysis, POS-tag analysis, and question-type analysis built on a deep learning network architecture. The weights of words extracted through the question analysis steps are combined with the best matching 25 (BM25) algorithm to find the most relevant paragraph of text, and are incorporated into the QA model to find the best and least noisy answer. The dataset for the question analysis step consists of 19,339 labeled questions covering a variety of topics. The results of the question analysis model are combined to train the question-answering model on a dataset related to the learning regulations of the Industrial University of Ho Chi Minh City. It includes 17,405 question-answer pairs for the training set and 1,600 pairs for the test set, on which the robustly optimized BERT pretraining approach (RoBERTa) model achieves an F1-score of 74%. The model has improved significantly. For long and complex questions, the model has extracted weights and correctly provided answers based on the question's contents.
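For readers unfamiliar with BM25, the ranking step mentioned above can be sketched as follows. This is a plain Okapi BM25 scorer using a common smoothed IDF variant and typical default parameters (k1 = 1.5, b = 0.75), which are assumptions here; it does not include the per-word weights from the question analysis steps, since the abstract does not say how they are combined with BM25. The function name and toy paragraphs are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b.
            denominator = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denominator
        scores.append(score)
    return scores


paragraphs = [
    ["exam", "retake", "rules"],
    ["library", "opening", "hours"],
    ["exam", "schedule"],
]
scores = bm25_scores(["exam", "retake"], paragraphs)
best_index = max(range(len(scores)), key=scores.__getitem__)
```

In this toy example the first paragraph ranks highest because it matches both query terms, while the second matches neither and scores zero.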

Question to Question Similarity Analysis Using Morphological, Syntactic, Semantic, and Lexical Features

JUCS - Journal of Universal Computer Science

In the digitally connected world we live in, people expect to get answers to their questions spontaneously. This expectation has increased the burden on question-answering platforms such as Stack Overflow and many others. A promising solution to this problem is to detect whether a question being asked is similar to a question already in the database and, if so, present the answer to the detected question to the user. To address this challenge, we propose a novel Natural Language Processing (NLP) approach that detects whether two Arabic questions are similar using their extracted morphological, syntactic, semantic, lexical, overlapping, and semantic lexical features. Our approach involves several phases, including Arabic text processing, novel feature extraction, and text classification. Moreover, we conducted a comparison between seven different machine learning classifiers. The included classifiers are: Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), Extreme Gradient...