GradeAid: a framework for automatic short answers grading in educational contexts—design, implementation and evaluation

Exploring Distinct Features for Automatic Short Answer Grading

Anais do XV Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2018)

Automatic short answer grading is the field of study that addresses the assessment of students’ answers to questions posed in natural language. The grading of the answers is generally framed as a typical supervised classification problem. To stimulate research in the field, two datasets were publicly released in the SemEval 2013 competition task “Student Response Analysis”. Since then, several works have been developed to improve the results. In this context, the goal of this work is to tackle the task by implementing lessons learned from the literature in an effective way and to report results for both datasets and all of their scenarios. The proposed method obtained better results in most scenarios of the competition task and, therefore, higher overall scores when compared to recent works.

Automated Short Answer Grading: A Simple Solution for a Difficult Task

2019

The task of short answer grading is aimed at assessing the outcome of an exam by automatically analysing students’ answers in natural language and deciding whether they should pass or fail the exam. In this paper, we tackle this task by training an SVM classifier on real data taken from a university statistics exam, showing that simple concatenated sentence embeddings used as features yield results around 0.90 F1, and that adding more complex distance-based features leads only to a slight improvement. We also release the dataset, which to our knowledge is the first freely available dataset of this kind in Italian.
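A minimal sketch of the kind of setup this abstract describes: an SVM over concatenated sentence embeddings plus one distance-based feature. The embedding checkpoint, feature layout, and toy data are assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch: SVM grading on concatenated sentence embeddings plus a
# cosine-distance feature. Model name and toy data are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.svm import SVC

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def features(reference: str, answer: str) -> np.ndarray:
    """Concatenate the two sentence embeddings and append their cosine similarity."""
    ref_vec, ans_vec = encoder.encode([reference, answer])
    cosine = float(np.dot(ref_vec, ans_vec) /
                   (np.linalg.norm(ref_vec) * np.linalg.norm(ans_vec) + 1e-9))
    return np.concatenate([ref_vec, ans_vec, [cosine]])

# Toy pass/fail data: (reference answer, student answer, label)
data = [
    ("The mean is the sum divided by the count.",
     "You add all values and divide by how many there are.", 1),
    ("The mean is the sum divided by the count.",
     "It is the most frequent value.", 0),
    # ... real training data would go here
]
X = np.stack([features(r, a) for r, a, _ in data])
y = np.array([label for _, _, label in data])

clf = SVC(kernel="rbf").fit(X, y)
print("Predicted labels:", clf.predict(X))
```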

Automatic short answer grading and feedback using text mining methods

Procedia Computer Science, 2020

Automatic grading is not a new approach, but the need to adapt the latest technology to automatic grading has become very important. As the technology for scoring exams and essays has rapidly become more powerful, especially from the 1990s onwards, partially or wholly automated grading systems using computational methods have evolved and become a major area of research. In particular, the demand for scoring natural language responses has created a need for tools that can grade these responses automatically. In this paper, we focus on the concept of automatic grading of short answer questions, such as those typical in the UK GCSE system, and on providing students with useful feedback on their answers. We present experimental results on a dataset from an introductory computer science class at the University of North Texas. We first apply standard data mining techniques to the corpus of student answers to measure the similarity between the student answers and the model answer, based on the number of common words. We then evaluate the relation between these similarities and the marks awarded by scorers. We consider an approach that groups student answers into clusters; each cluster would be awarded the same mark, and the same feedback given to each answer in the cluster. In this manner, we demonstrate that clusters indicate groups of students who are awarded the same or similar scores. Words in each cluster are compared to show that clusters are constructed based on how many, and which, words of the model answer have been used. The main novelty of this paper is a model we design to predict marks based on the similarities between the student answers and the model answer. We argue that computational methods should be used to enhance the reliability of human scoring, not replace it. Humans are required to calibrate the system and to deal with challenging situations. Computational methods can provide insight into which student answers will be found challenging and thus where human judgement is required.
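A rough sketch of the pipeline this abstract outlines: common-word similarity between each student answer and the model answer, then clustering so each cluster can share one mark and one feedback message. Vectorizer settings, cluster count, and data are assumptions, not the paper's exact choices.

```python
# Hedged sketch: word-overlap similarity to a model answer, then clustering
# of student answers for shared marks and feedback. All settings are assumed.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

model_answer = "A stack is a last-in first-out data structure."
student_answers = [
    "A stack is last in first out.",
    "It stores data so the last item added is removed first.",
    "A stack is a type of queue that is first in first out.",
    "No idea.",
]

# Binary bag-of-words over the model answer plus the student answers.
vec = CountVectorizer(binary=True)
X = vec.fit_transform([model_answer] + student_answers).toarray()
model_vec, answer_vecs = X[0], X[1:]

# Similarity = number of model-answer words each student answer shares.
overlap = answer_vecs @ model_vec
print("Common-word counts:", overlap)

# Group answers into clusters; one mark / one feedback message per cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(answer_vecs)
for cluster, answer in zip(kmeans.labels_, student_answers):
    print(cluster, "|", answer)
```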

Scoring Free-Responses Automatically: A Case Study of a Large-Scale Assessment

2004

C-rater is an automated scoring engine that measures a student’s understanding of content material through the use of natural language processing techniques. We describe the process used for building c-rater models using Alchemist, c-rater’s model-building interface. Results are given for a large-scale assessment that used c-rater to score 19 reading comprehension and 5 algebra questions. In total, about 170,000 short-answer responses were scored with an average of 85% accuracy.

Exploring Automatic Short Answer Grading as a Tool to Assist in Human Rating

Lecture Notes in Computer Science, 2020

This project proposes using BERT (Bidirectional Encoder Representations from Transformers) as a tool to assist educators with automated short answer grading (ASAG), as opposed to replacing human judgement in high-stakes scenarios. Many educators are hesitant to give authority to an automated system, especially in assessment tasks such as grading constructed-response items. However, evaluating free-response text can be costly in time and labor for one rater, let alone multiple raters. In addition, some degree of inconsistency exists within and between raters for a given assessment task. Recent advances in Natural Language Processing have resulted in improvements for technologies that rely on artificial intelligence and human language. New, state-of-the-art models such as BERT, an open-source, pre-trained language model, have decreased the amount of training data needed for specific tasks and, in turn, reduced the amount of human annotation necessary for producing a high-quality classification model. After training BERT on expert ratings of constructed responses, we use its subsequent automated grading to calculate Cohen's Kappa as a measure of inter-rater reliability between the automated system and the human rater. For practical application, when the inter-rater reliability metric is unsatisfactory, we suggest that the human rater(s) use the automated model to call attention to ratings where a second opinion might be needed to confirm the rater's correctness and consistency of judgement.
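The agreement check at the heart of this workflow reduces to comparing the automated labels against a human rater with Cohen's kappa. A minimal sketch follows; the label arrays and the kappa threshold are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: Cohen's kappa between an automated grader's labels and a
# human rater's labels. Data and the 0.6 threshold are made up for illustration.
from sklearn.metrics import cohen_kappa_score

human_ratings = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]   # e.g. 0/1/2 rubric scores
model_ratings = [2, 1, 0, 1, 1, 1, 0, 2, 2, 0]   # predictions from the trained model

kappa = cohen_kappa_score(human_ratings, model_ratings)
print(f"Cohen's kappa: {kappa:.2f}")

# A low kappa triggers the suggested fallback: flag items for a second opinion
# rather than trusting the automated score outright.
if kappa < 0.6:  # threshold is an assumption
    print("Agreement unsatisfactory: route responses back to human raters.")
```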

Foundations for AI-Assisted Formative Assessment Feedback for Short-Answer Tasks in Large-Enrollment Classes

Bridging the Gap: Empowering and Educating Today’s Learners in Statistics. Proceedings of the Eleventh International Conference on Teaching Statistics, 2022

Research suggests "write-to-learn" tasks improve learning outcomes, yet constructed-response methods of formative assessment become unwieldy with large class sizes. This study evaluates natural language processing algorithms to assist this aim. Six short-answer tasks completed by 1,935 students were scored by several human raters using a detailed rubric and an algorithm. Results indicate substantial inter-rater agreement using quadratic weighted kappa for rater pairs (each QWK > 0.74) and group consensus (Fleiss' Kappa = 0.68). Additionally, intra-rater agreement was estimated for one rater who had scored 178 responses seven years prior (QWK = 0.88). With compelling rater agreement, the study then pilots cluster analysis of response text toward enabling instructors to ascribe meaning to clusters as a means for scalable formative assessment.
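For reference, the two agreement statistics reported above can be reproduced as follows. The score matrix is made up for illustration; only the metrics themselves (quadratic weighted kappa for a rater pair, Fleiss' kappa for group consensus) come from the abstract.

```python
# Hedged sketch: quadratic weighted kappa for one rater pair and Fleiss' kappa
# for the whole rater group. The rating matrix is illustrative only.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = responses, columns = raters, values = rubric scores 0..3
scores = np.array([
    [3, 3, 2],
    [1, 1, 1],
    [2, 3, 2],
    [0, 0, 1],
    [2, 2, 2],
])

qwk = cohen_kappa_score(scores[:, 0], scores[:, 1], weights="quadratic")
print(f"QWK, raters 1 vs 2: {qwk:.2f}")

table, _ = aggregate_raters(scores)  # per-response counts of each rating category
print(f"Fleiss' kappa, all raters: {fleiss_kappa(table):.2f}")
```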

Evaluation Dataset (DT-Grade) and Word Weighting Approach towards Constructed Short Answers Assessment in Tutorial Dialogue Context

Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, 2016

Evaluating student answers often requires contextual information, such as previous utterances in conversational tutoring systems. For example, students use coreferences and write elliptical responses, i.e., responses that are incomplete on their own but can be interpreted in context. The DT-Grade corpus, which we present in this paper, consists of short constructed answers extracted from tutorial dialogues between students and an Intelligent Tutoring System, annotated for their correctness in the given context and for whether the contextual information was useful. The dataset contains 900 answers, of which about 25% required contextual information to be properly interpreted. We also present a baseline system developed to predict the correctness label (such as correct, or correct but incomplete) in which weights for the words are assigned based on context.
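One plausible reading of "weights assigned based on context" is sketched below: reference-answer words that already appear in the tutoring context are down-weighted, so credit flows mainly to new content words. This is an illustrative assumption, not the paper's actual weighting scheme, and the weight values are arbitrary.

```python
# Hedged illustration only: a context-aware weighted-overlap score. The actual
# word-weighting scheme from the paper is not reproduced here.
def weighted_overlap(reference: str, answer: str, context: str) -> float:
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    ctx = set(context.lower().split())
    if not ref:
        return 0.0
    # weight 1.0 for reference words not given away by the context, 0.3 otherwise
    weights = {w: (0.3 if w in ctx else 1.0) for w in ref}
    covered = sum(weights[w] for w in ref & ans)
    return covered / sum(weights.values())

context = "Tutor: what happens to the velocity of the ball as it falls?"
reference = "the velocity increases because gravity accelerates the ball"
answer = "it increases due to gravity"
print(f"weighted overlap score: {weighted_overlap(reference, answer, context):.2f}")
```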

Automatic Grading of Portuguese Short Answers Using a Machine Learning Approach

Anais Estendidos do XVI Simpósio Brasileiro de Sistemas de Informação (Anais Estendidos do SBSI 2020), 2020

Short answers are routinely used in learning environments for student assessment. Despite its importance, teachers find the task of assessing discursive answers very time-consuming. Aiming to assist with this problem, this work explores the Automatic Short Answer Grading (ASAG) field using a machine learning approach. The literature was reviewed and 44 papers using different techniques were analyzed considering many aspects. A Portuguese dataset was built with more than 7,000 short answers. Different approaches were experimented with, and a final model was created from their combination. The model’s effectiveness proved satisfactory, with kappa scores indicating moderate to substantial agreement between the model and human grading.

A scoring rubric for automatic short answer grading system

TELKOMNIKA Telecommunication Computing Electronics and Control, 2019

During the past decades, research on automatic grading has become an interesting issue. These studies focus on how to enable machines to help humans assess students' learning outcomes. Automatic grading enables teachers to assess students' answers more objectively, consistently, and quickly. Essay questions, in particular, come in two types: long essays and short answers. Most previous research developed automatic essay grading (AEG) rather than automatic short answer grading (ASAG). This study aims to assess the similarity of short answers to the questions and reference answers in Indonesian, without any semantic language tool. The pre-processing steps consist of case folding, tokenization, stemming, and stopword removal. The proposed approach is a scoring rubric obtained by measuring sentence similarity using string-based similarity methods and a keyword matching process. The dataset used in this study consists of 7 questions, 34 alternative reference answers, and 224 student answers. The experimental results show that the proposed approach achieves Pearson correlation values between 0.65419 and 0.66383, with Mean Absolute Error (MAE) values between 0.94994 and 1.24295. The proposed approach also increases the correlation value and decreases the error value for each method.
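A toy version of a rubric score built from a string-similarity term plus a keyword-matching term, evaluated with Pearson correlation and MAE against teacher marks, is sketched below. The 50/50 weighting, keyword list, and data are assumptions; the paper's exact rubric is not reproduced.

```python
# Hedged sketch: rubric score = string similarity + keyword matching, then
# Pearson correlation and MAE against teacher marks. Weights and data assumed.
from difflib import SequenceMatcher
import numpy as np

def rubric_score(reference: str, answer: str, keywords: list[str], max_mark: float = 5.0) -> float:
    string_sim = SequenceMatcher(None, reference.lower(), answer.lower()).ratio()
    keyword_hit = sum(k.lower() in answer.lower() for k in keywords) / max(len(keywords), 1)
    return max_mark * (0.5 * string_sim + 0.5 * keyword_hit)

reference = "photosynthesis converts light energy into chemical energy in plants"
keywords = ["light", "chemical", "energy", "plants"]
answers = [
    "plants convert light energy into chemical energy",
    "it is how plants make food from sunlight",
    "animals breathe oxygen",
]
teacher_marks = np.array([5.0, 3.0, 0.0])

predicted = np.array([rubric_score(reference, a, keywords) for a in answers])
pearson = np.corrcoef(predicted, teacher_marks)[0, 1]
mae = np.mean(np.abs(predicted - teacher_marks))
print(f"Pearson r = {pearson:.3f}, MAE = {mae:.3f}")
```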

Automatic short answer grading with SBERT on out-of-sample questions

2021

We explore how different components of an Automatic Short Answer Grading (ASAG) model affect the model’s ability to generalize to questions outside of those used for training. For supervised automatic grading models, human ratings are primarily used as ground-truth labels. Producing such ratings can be resource-heavy, as subject matter experts spend vast amounts of time carefully rating a sample of responses. Further, it is often the case that multiple raters must come to a consensus before a final ground-truth rating is established. If ASAG models were developed that could generalize to out-of-sample questions, educators could quickly add new questions to an auto-graded assessment without a continued manual rating process. For this project we explore various methods for producing vector representations of student responses, including state-of-the-art representation methods such as Sentence-BERT as well as more traditional approaches including Word2Vec and bag-of-words. We exper...
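Two of the response representations named above can be produced as follows; the checkpoint name and sample responses are assumptions. The point is only that each method yields a fixed-size vector per response that a downstream grader can consume regardless of which question the response came from.

```python
# Hedged sketch: Sentence-BERT embeddings vs. a bag-of-words baseline as
# response representations. Checkpoint and data are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "The median is the middle value of the sorted data.",
    "Standard deviation measures spread around the mean.",
]

# Sentence-BERT: dense, pre-trained sentence embeddings.
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
dense_vectors = sbert.encode(responses)
print("SBERT shape:", dense_vectors.shape)

# Bag-of-words: sparse counts fitted on the training responses only.
bow = CountVectorizer().fit(responses)
sparse_vectors = bow.transform(responses)
print("BoW shape:", sparse_vectors.shape)
```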