An Annotated Bibliography of Writing Assessment: Machine Scoring and Evaluation of Essay-length Writing
2012, Journal of Writing Assessment
This installment of the JWA annotated bibliography focuses on the phenomenon of machine scoring of whole essays composed by students and others. This bibliography extends Rich Haswell's bibliography on the topic, "A Bibliography of Machine Scoring of Student Writing, 1962-2005," in Machine Scoring of Student Essays: Truth and Consequences (Logan, UT: Utah State University Press, pp. 234-243). Here we define "machine scoring" as the rating of extended or essay writing by means of automated, computerized technology. We exclude scoring of paragraph-sized free responses of the sort that occur in academic course examinations. We also exclude software that checks only grammar, style, and spelling. But we include software that provides other kinds of evaluative or diagnostic feedback along with a holistic score. While some entries in this bibliography describe, validate, and critique the ways computers "read" texts and generate scores and feedback, other sources critically examine how these results are used. The topic is timely, since the use of machine scoring of essays is rapidly growing in standardized testing, sorting of job and college applicants, admission to college, placement into and exit out of writing courses, content tests in academic courses, and value-added study of learning outcomes.

With this abridged collection of sources, our goal is to provide readers of JWA with a primer on the topic, or at least an introduction to the methods, jargon, and implications associated with computer-based writing evaluation. We have omitted pieces of historical interest, focusing mainly on developments in the last twenty years, although some pieces recount the forty-five years of machine scoring history. The scholarship should provide teachers of writing, writing program administrators, and writing assessment specialists a broad base of knowledge about how machine scoring is used and to what ends, as well as insights into some of the main theoretical and pedagogical issues current among those who advocate for or caution against machine scoring. At the least, we hope this collection of resources will provide readers with a sense of the conversation about machine scoring by experts in the enterprise of making and marketing the software and scholars in the field of writing studies and composition instruction.
We all need to be better prepared to articulate the benefits, limitations, and problems of using machine scoring as a method of writing assessment and response.

Attali, Yigal. (2004). Exploring the feedback and revision features of Criterion. Paper presented at the annual meeting of the National Council on Measurement in Education, April 12-16, San Diego, CA. http://www.ets.org/Media/Research/pdf/erater_NCME_2004_Attali_B.pdf

Reports on a large-scale, statistically based study of changes in essays by students in grades 6-12 from first to last submission to Educational Testing Service's Criterion. Evaluates "the effectiveness of the automated feedback" feature to decide whether "students understand the feedback provided to them and have the ability to attend to the comments" (p. 17). Concludes that statistical analysis demonstrates students were able to significantly lower the rate of specific errors and significantly increase the occurrence of certain desirable discourse elements (e.g., introduction, conclusion); therefore, students are able to revise and improve their essays using the automated feedback system. Although the study's research questions are framed in terms of how students understand the feedback and why they choose to revise, the comparison of first and last essay submissions is purely text based. The study does not present any data about how students used the feedback, whether there was any teacher intervention, or whether other factors external to the feedback could have influenced the students' revisions and final texts.

Baron, Dennis. (1998). When professors get A's and machines get F's. The Chronicle of Higher Education (November 29), p. A56.

Examines Intelligent Essay Assessor (IEA), arguing that while this assessment program claims to scan student essays consistently and objectively in seconds using what its developers call "latent semantic analysis," consistent, objective readings would not necessarily improve classroom assessments or teaching, even if such readings could be achieved. Baron posits that consistency, whether in the grading or reading of student essays, is not a human trait.

Breland, Hunter M. (1996). Computer-assisted writing assessment: The politics of science versus the humanities. In