How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models
Related papers
Context-aware Stand-alone Neural Spelling Correction
Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Existing natural language processing systems are vulnerable to noisy inputs resulting from misspellings. In contrast, humans can easily infer the corresponding correct words from their misspellings and surrounding context. Inspired by this, we address the standalone spelling correction problem, which only corrects the spelling of each token without additional token insertion or deletion, by utilizing both spelling information and global context representations. We present a simple yet powerful solution that jointly detects and corrects misspellings as a sequence labeling task by fine-tuning a pre-trained language model. Our solution outperforms the previous state-of-the-art result by 12.8% absolute F0.5 score.
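The standalone setting described above can be illustrated with a toy per-token labeling sketch, in which a dictionary lookup stands in for the fine-tuned language model. The vocabulary, candidate table, and function names below are illustrative assumptions, not artifacts from the paper:

```python
def correct_tokens(tokens, vocab, candidates):
    """Jointly detect and correct misspellings as per-token labeling:
    each token receives either 'KEEP' or a replacement word.
    In the paper, a pre-trained language model predicts these labels
    from context; here a toy dictionary lookup stands in for it."""
    labels = []
    for tok in tokens:
        if tok.lower() in vocab:
            labels.append("KEEP")
        else:
            # unknown token: propose a correction if one is available,
            # otherwise keep it unchanged
            labels.append(candidates.get(tok.lower(), "KEEP"))
    return labels

def apply_labels(tokens, labels):
    """Apply the per-token labels. The token count is preserved: no
    insertion or deletion, matching the standalone correction setting."""
    return [t if l == "KEEP" else l for t, l in zip(tokens, labels)]
```

Because every token maps to exactly one label, the output sequence always has the same length as the input, which is the defining constraint of the standalone formulation.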
Making Sentence Embeddings Robust to User-Generated Content
2024
NLP models have been known to perform poorly on user-generated content (UGC), mainly because it exhibits extensive lexical variation and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness via LASER's ability to represent non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous works extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We show that, with training only on standard and synthetic UGC-like data, RoLASER significantly improves LASER's robustness to both natural and artificial UGC data, achieving up to 2× and 11× better scores. We also perform a fine-grained analysis of artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena, such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
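The teacher-student objective described above can be sketched minimally: a frozen teacher embeds the standard sentence, the student embeds the UGC variant, and training minimizes the distance between the two vectors. The bag-of-hashed-words embedding below is a toy stand-in for LASER, not the paper's encoder, and the function names are assumptions for illustration:

```python
def embed(tokens, dim=8):
    """Toy bag-of-hashed-words sentence embedding, L2-normalized.
    A stand-in for the LASER encoder, used only to make the loss concrete."""
    v = [0.0] * dim
    for t in tokens:
        v[hash(t) % dim] += 1.0
    norm = sum(x * x for x in v) ** 0.5 or 1.0
    return [x / norm for x in v]

def distillation_loss(student_vec, teacher_vec):
    """Teacher-student objective: mean squared distance between the
    student's embedding of a (possibly noisy) sentence and the frozen
    teacher's embedding of its standard counterpart."""
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(teacher_vec)
```

Driving this loss toward zero is exactly what places non-standard sentences and their standard counterparts close together in the shared embedding space.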
Investigating neural architectures for short answer scoring
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 2017
Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models (n-grams and embeddings) are arguably well-suited to evaluating content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.
Misspelling Correction with Pre-trained Contextual Language Model
2020 IEEE 19th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)
Spelling irregularities, now treated as spelling mistakes, have existed for several centuries. As humans, we are able to understand most misspelled words based on their location in the sentence, perceived pronunciation, and context. Unlike humans, computer systems do not possess the convenient autocomplete functionality of which human brains are capable. While many programs provide spelling correction functionality, many systems do not take context into account. Moreover, Artificial Intelligence systems behave according to the data they are trained on. With many current Natural Language Processing (NLP) systems trained on grammatically correct text data, many are vulnerable to adversarial examples, yet correctly spelled text processing is crucial for learning. In this paper, we investigate how spelling errors can be corrected in context with a pretrained language model, BERT. We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections. The results of our experiments demonstrate that, when combined properly, contextual word embeddings of BERT and edit distance are capable of effectively correcting spelling errors.
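The edit-distance component used for candidate ranking is the classic Levenshtein dynamic program. A minimal sketch follows; note that the ranking here uses distance alone, whereas the paper combines it with BERT's contextual scores:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances between "" and prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def rank_candidates(misspelling: str, candidates: list[str]) -> list[str]:
    """Order candidate corrections by edit distance (stable sort keeps
    the input order on ties). The contextual BERT score the paper
    combines with this distance is omitted here."""
    return sorted(candidates, key=lambda c: edit_distance(misspelling, c))
```

For example, `rank_candidates("speling", ["spring", "spelling", "sapling"])` places `"spelling"` first, since it is only one insertion away from the misspelling.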
Towards Instance-Based Content Scoring with Pre-Trained Transformer Models
2020
Pretrained contextual word representations based on transformer models have recently achieved state-of-the-art performance on content scoring for educational data using a similarity-based scoring approach with reference answers. In this work, we demonstrate how similar models can be adapted for content scoring using an instance-based approach (Horbach and Zesch 2019), in which a model is learned only from student responses (not reference answers). Our approach yields state-of-the-art performance on the ASAP-SAS short answer scoring dataset. Content scoring is the task of scoring the content of answers to free-response questions in educational applications (also known as short answer grading or scoring when responses are short, e.g. sentence length) (Burrows, Gurevych, and Stein 2015). Unlike systems for essay scoring, which target writing quality (e.g., ideas and elaboration, organization, style, and writing conventions such as grammar and spelling (Burstein, Tetreault, and Madnani ...
Sequence to Sequence Convolutional Neural Network for Automatic Spelling Correction
2019
The paper proposes a system that compensates for most of the noise in natural language text caused by technical imperfections of the input device (such as a keyboard or a scanner with optical character recognition), quick typing, or writer incompetence. Correcting the spelling errors in the text improves the performance of subsequent natural language processing. The incorrect sequence of characters is transcribed into another sequence of correct characters by a neural network with an encoder-decoder architecture. Our approach to automatic spelling correction treats the characters of an erroneous sentence as words of the source language. The neural network searches for the best sequence of output characters for the given input. The proposed approach to spelling correction requires little or no training data. Instead, the error model is expressed by a simple component that distorts unannotated data and creates any necessary quantity of training examples for a neural net...
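The error-model component described above, which distorts unannotated text to manufacture (noisy, clean) training pairs, can be sketched as follows. The distortion operations, probabilities, and function names are illustrative assumptions, not the paper's actual component, which can encode richer keyboard- or OCR-specific confusions:

```python
import random

def distort(text: str, p: float = 0.1, seed: int = 0) -> str:
    """Create a noisy input by deleting, duplicating, or substituting
    each character with total probability p. A toy error model."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        r = rng.random()
        if r < p / 3:
            continue                          # deletion
        elif r < 2 * p / 3:
            out.append(ch + ch)               # duplication
        elif r < p:
            out.append(rng.choice(alphabet))  # substitution
        else:
            out.append(ch)                    # unchanged
    return "".join(out)

def make_pairs(sentences, p=0.1, seed=0):
    """Turn unannotated sentences into (noisy, clean) training examples
    for the encoder-decoder correction model."""
    return [(distort(s, p, seed + i), s) for i, s in enumerate(sentences)]
```

Because the clean side of each pair is the original sentence itself, any quantity of supervised training data can be generated from raw text alone.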
Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction
arXiv (Cornell University), 2023
The data-centric AI approach aims to enhance model performance without modifying the model and has been shown to impact model performance positively. While recent attention has been given to data-centric AI based on synthetic data, due to its potential for performance improvement, data-centric AI has long been validated exclusively using real-world data and publicly available benchmark datasets. Consequently, data-centric AI still depends heavily on real-world data, and the verification of models using synthetic data has not yet been thoroughly carried out. Given the challenges above, we ask the question: "Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data?" To address this question, we conducted comparative analyses between models trained on synthetic and real-world data on the grammatical error correction (GEC) task. Our experimental results reveal that the data quality control method has a positive impact on models trained with real-world data, as previously reported in existing studies, while a negative impact is observed in models trained solely on synthetic data.
Survey of Neural Text Representation Models
Information, 2020
In natural language processing, text needs to be transformed into a machine-readable representation before any processing. The quality of further natural language processing tasks greatly depends on the quality of those representations. In this survey, we systematize and analyze 50 neural models from the last decade. The models described are grouped by the architecture of neural networks as shallow, recurrent, recursive, convolutional, and attention models. Furthermore, we categorize these models by representation level, input level, model type, and model supervision. We focus on task-independent representation models, discuss their advantages and drawbacks, and subsequently identify the promising directions for future neural text representation models. We describe the evaluation datasets and tasks used in the papers that introduced the models and compare the models based on relevant evaluations. The quality of a representation model can be evaluated as its capability to generalize ...
NwQM: A neural quality assessment framework for Wikipedia
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Millions of people, irrespective of socioeconomic and demographic background, depend on Wikipedia articles every day for keeping themselves informed about popular as well as obscure topics. Articles have been categorized by editors into several quality classes, which indicate their reliability as encyclopedic content. This manual designation is an onerous task because it necessitates profound knowledge of encyclopedic language, as well as navigating a circuitous set of wiki guidelines. In this paper we propose the Neural wikipedia Quality Monitor (NwQM), a novel deep learning model which accumulates signals from several key information sources, such as article text, metadata, and images, to obtain an improved Wikipedia article representation. We present a comparison of our approach against a plethora of available solutions and show an 8% improvement over state-of-the-art approaches, with detailed ablation studies.
Automated text scoring, keeping it simple
2019
In traditional automated text scoring approaches, stop-words are either removed immediately or their handling is given no particular importance [1, 2, 3]. Recent studies have, however, found that removing stop-words may adversely affect certain models and should not be considered a standard component of the text pre-processing pipeline [4]. Given an essay, the task is to predict a numerical score or grade. To improve the accuracy of existing neural network approaches to essay scoring, recent attempts have focused on developing increasingly complex neural networks, with little to no consideration of the text pre-processing pipeline [5, 6, 7]. In this work we investigated the text pre-processing pipeline for automated text scoring (ATS). We investigated how stacked LSTMs, coupled with an adjustment to the text pre-processing pipeline and basic word embedding models, can achieve results on par with the state-of-the-art. We used the ASAP dataset to train a basic LSTM deep learn...
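The stop-word finding above amounts to treating removal as an evaluated choice rather than a default step. A minimal sketch of such a pipeline follows; the stop-word list and function name are illustrative assumptions:

```python
# Illustrative subset of a typical English stop-word list
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def preprocess(text: str, remove_stop_words: bool = False) -> list[str]:
    """Minimal text pre-processing pipeline: lowercase, whitespace
    tokenization, and optional stop-word removal. Per the cited finding,
    remove_stop_words is exposed as an evaluated hyperparameter rather
    than applied unconditionally."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```

Running the scoring model once per setting of `remove_stop_words` makes the effect of the pre-processing choice directly measurable instead of assumed.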