Exploring Deep Learning in Semantic Question Matching

Question duplication is a major problem for Q&A forums such as Quora, Stack Overflow, and Reddit. Because the same question is posted in many redundant versions, answers become fragmented across those versions. This leads to poor search results, answer fatigue, scattered information, and fewer responses for the people asking. Duplicate questions can be detected using machine learning and natural language processing. A dataset of more than 400,000 question pairs provided by Quora is pre-processed through tokenization, lemmatization, and stop-word removal. Features are then extracted from the pre-processed dataset and used to train an artificial neural network, which achieves an accuracy of 86.09%. In short, this research extracts highly dominant features to estimate the semantic similarity between the questions in a pair, and from that determines the probability that the pair is a duplicate.
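The pre-processing and feature-extraction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the `naive_lemma` helper (a crude suffix-stripping stand-in for real lemmatization), and the three features chosen here are all assumptions made for illustration, since the abstract does not specify which features or tools were used.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption);
# a real pipeline would use a much fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does", "i",
              "to", "of", "in", "what", "how", "can"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(question):
    """Lowercase, tokenize on alphanumeric runs, drop stop words, 'lemmatize'."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


def pair_features(q1, q2):
    """Compute a few simple overlap features for one question pair."""
    t1, t2 = preprocess(q1), preprocess(q2)
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    return {
        # Share of distinct tokens that the two questions have in common.
        "jaccard": len(s1 & s2) / len(union) if union else 0.0,
        # Number of distinct tokens shared by both questions.
        "shared_tokens": len(s1 & s2),
        # Absolute difference in token counts after pre-processing.
        "len_diff": abs(len(t1) - len(t2)),
    }


if __name__ == "__main__":
    print(pair_features("How do I learn Python?",
                        "What is the best way to learn Python?"))
```

Feature dictionaries like these, computed for every question pair, would form the input vectors for a classifier such as the neural network mentioned in the abstract.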