Jill Burstein | Educational Testing Service
Papers by Jill Burstein
The journal of writing analytics, 2022
• Background: Researchers interested in quantitative measures of student "success" in writing cannot completely control for contextual factors that are local and site-based (i.e., in the context of a specific instructor's writing classroom at a specific institution). The (in)ability to control for curriculum in studies of student writing achievement complicates interpretation of features measured in student writing. This article demonstrates how identifying and analyzing features of writing curriculum can provide dimensions of local context not captured in analysis of student-generated texts alone. Using a dataset of 48 curricular texts collected from 21 instructors teaching in five disciplines across six four-year public universities in the United States, this article: 1) presents a set of curriculum scoring rubrics developed through qualitative analysis, 2) describes a protocol for training raters to use the rubrics to score curricular texts to achieve rater agreement and generate quantitative data, and 3) explores how this framework ...
¹ Jill Burstein completed her work on this paper while employed at ETS.
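The protocol's goal of achieving rater agreement can be checked quantitatively. Below is a minimal sketch, assuming two trained raters score the same curricular texts on an ordinal rubric scale; the scores are invented for illustration and are not data from the article.

```python
# A minimal sketch of checking inter-rater agreement on rubric scores,
# assuming two trained raters score the same curricular texts on a
# 1-4 ordinal scale. The scores below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = [3, 2, 4, 3, 1, 2, 3, 4]
rater_b = [3, 2, 3, 3, 1, 2, 4, 4]

# Quadratic weighting penalizes large disagreements more than near-misses,
# which suits ordinal rubric scales.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Quadratic weighted kappa: {kappa:.2f}")
```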
Routledge eBooks, Jan 30, 2003
Grantee Submission, 2013
We introduce a cognitive framework for measuring reading comprehension that includes the use of novel summary-writing tasks. We derive NLP features from the holistic rubric used to score the summaries written by students for such tasks and use them to design a preliminary automated scoring system. Our results show that the automated approach performs very well on summaries written by students for two different passages.
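As a rough illustration of rubric-derived NLP features, the sketch below computes three simple indicators of summary quality against a source passage: content-word overlap, length ratio, and a verbatim-copying rate. These features are illustrative stand-ins; the paper's actual feature set is not reproduced here.

```python
# A minimal sketch of rubric-inspired features for scoring a summary
# against its source passage. Feature set is an illustrative assumption.
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "it"}

def content_words(text):
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def summary_features(summary, passage):
    s, p = content_words(summary), content_words(passage)
    overlap = len(s & p) / len(s) if s else 0.0  # coverage of source content
    length_ratio = len(summary.split()) / max(len(passage.split()), 1)
    # Crude copy detector: fraction of summary 5-grams appearing verbatim in the passage.
    tokens = summary.lower().split()
    fivegrams = [" ".join(tokens[i:i + 5]) for i in range(len(tokens) - 4)]
    copied = sum(g in passage.lower() for g in fivegrams) / len(fivegrams) if fivegrams else 0.0
    return {"overlap": overlap, "length_ratio": length_ratio, "copy_rate": copied}

print(summary_features("The study found reading gains.",
                       "The study found large reading gains among students."))
```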
Workshop on Innovative Use of NLP for Building Educational Applications, 2011
Published by the Association for Computational Linguistics, Portland, Oregon, pp. 65–75. http://www.aclweb.org/anthology/W11-1408
Routledge eBooks, Feb 17, 2015
Artificial Intelligence in Education, Jun 8, 2007
K-12 teachers often need to develop text adaptations of authentic classroom texts as a way to provide reading-level-appropriate materials to English language learners in their classrooms. Adaptations are done by hand and include practices such as writing text ...
Educational Testing Service, Feb 1, 2004
ETS Research Report Series, Jun 1, 1995
The increased use of constructed-response items, like essays, creates a need for tools to score these responses automatically in part or as a whole. This study explores one approach to analyzing essay-length natural language constructed-responses. A decision model for scoring essays was developed and evaluated. The decision model uses off-the-shelf software for grammar and style checking of the English language. The best performing grammar checking programs from among several commercial programs were selected to construct a decision model for scoring the essays. Data produced from the selected grammar programs were used to make a decision about the score for an essay. Through statistical and linguistic methods, the performance of the decision model was analyzed in an effort to understand its usefulness and practicality in a production scoring setting. A sample of 80 essays was selected ...
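To make the idea concrete, here is a minimal sketch of a decision model over grammar-checker output: per-essay error counts (hypothetical inputs) are normalized by essay length and mapped to a score band with hand-set thresholds. The actual model and thresholds in the report are not reproduced here.

```python
# A minimal sketch of a decision model over grammar-checker output.
# Error counts are normalized per 100 words and mapped to a score band
# by hand-set thresholds; all values are illustrative assumptions.
def score_essay(word_count, grammar_errors, style_flags):
    errors_per_100 = 100.0 * (grammar_errors + style_flags) / max(word_count, 1)
    if errors_per_100 < 2:
        return 6   # top band: few mechanical problems
    if errors_per_100 < 5:
        return 4   # mid band
    return 2       # low band: dense mechanical problems

print(score_essay(word_count=320, grammar_errors=3, style_flags=4))  # -> 4
```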
In this paper, we investigate the relationship between argumentation structures and (a) argument content, and (b) the holistic quality of an argumentative essay. Our results suggest that structure-based approaches hold promise for automated evaluation of argumentative writing.
Elsevier eBooks, 2010
Latent Semantic Analysis and its variants are employed by some developers to provide estimates as to how close the vocabulary in the candidate answer is to a targeted vocabulary set (Landauer, Foltz, & Laham, 1998).
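The LSA approach mentioned above can be sketched with off-the-shelf tools: project TF-IDF vectors into a low-dimensional latent space and compare a candidate answer to a target in that space. The corpus, answers, and dimensionality below are invented for illustration and do not come from the chapter.

```python
# A minimal sketch of the LSA idea: compare a candidate answer's
# vocabulary to a target's in a low-dimensional latent space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "photosynthesis converts light energy into chemical energy",
    "plants use sunlight water and carbon dioxide to make glucose",
    "cellular respiration releases energy from glucose",
    "mitochondria are the site of cellular respiration",
]
target = "plants turn sunlight into chemical energy as glucose"
candidate = "plants make glucose using light from the sun"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus + [target, candidate])

# 2 latent dimensions is far too few for real use; it keeps the toy example small.
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)

print(cosine_similarity(Z[[-2]], Z[[-1]])[0, 0])  # target vs. candidate in LSA space
```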
Natural Language Engineering, May 22, 2006
Educational assessment applications, as well as other natural-language interfaces, need some mechanism for validating user responses. If the input provided to the system is infelicitous or uncooperative, the proper response may be to simply reject it, to route it to a bin for special processing, or to ask the user to modify the input. If problematic user input is instead handled as if it were the system's normal input, this may degrade users' confidence in the software, or suggest ways in which they might try to "game" the system. Our specific task in this domain is the identification of student essays which are "off-topic", or not written to the test question topic. Identification of off-topic essays is of great importance for the commercial essay evaluation system Criterion℠. The previous methods used for this task required 200–300 human-scored essays for training purposes. However, there are situations in which no essays are available for training, such as when users (teachers) wish to spontaneously write a new topic for their students. For these kinds of cases, we need a system that works reliably without training data. This paper describes an algorithm that detects when a student's essay is off-topic without requiring a set of topic-specific essays for training. This new system is comparable in performance to previous models which require topic-specific essays for training, and provides more detailed information about the way in which an essay diverges from the requested essay topic.
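A minimal sketch of the underlying idea, in the spirit of training-free off-topic detection: compare an essay's lexical similarity to its assigned prompt against its similarity to a pool of other prompts, and flag the essay when the assigned prompt does not rank highest. The pool, similarity measure, and decision rule here are illustrative assumptions, not the published algorithm.

```python
# A minimal sketch of prompt-based off-topic detection that needs no
# topic-specific training essays. Prompts, essay, and decision rule
# are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prompts = [
    "Discuss whether schools should require uniforms.",
    "Explain the causes of the French Revolution.",
    "Argue for or against year-round schooling.",
]
essay = "Year-round schooling keeps students engaged and reduces summer learning loss..."
assigned = 2  # index of the prompt the essay was written to

vec = TfidfVectorizer().fit(prompts + [essay])
sims = cosine_similarity(vec.transform([essay]), vec.transform(prompts))[0]

# Off-topic if the assigned prompt is not the essay's most similar prompt.
is_off_topic = sims.argmax() != assigned
print(dict(similarities=sims.round(3).tolist(), off_topic=is_off_topic))
```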
North American Chapter of the Association for Computational Linguistics, 2004
Criterion℠ Online Essay Evaluation Service includes a capability that labels sentences in student writing with essay-based discourse elements (e.g., thesis statements). We describe a new system that enhances Criterion's capability by evaluating multiple aspects of coherence in essays. This system identifies features of sentences based on semantic similarity measures and discourse structure. A support vector machine uses these features to capture breakdowns in coherence due to relatedness to the essay question and relatedness between discourse elements. Intra-sentential quality is evaluated with rule-based heuristics. Results indicate that the system yields higher performance than a baseline on all three aspects.
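To illustrate the modeling setup, the sketch below trains a support vector machine on per-sentence feature vectors (two invented similarity scores) to predict a coherence-breakdown label. Features and labels are fabricated for illustration; the system's real features come from semantic similarity measures and discourse structure.

```python
# A minimal sketch: an SVM over per-sentence similarity features
# predicting a coherence label. All data here is fabricated.
from sklearn.svm import SVC

# Each row: [similarity to essay prompt, similarity to thesis statement]
X_train = [[0.72, 0.81], [0.10, 0.15], [0.65, 0.70],
           [0.05, 0.40], [0.80, 0.75], [0.12, 0.08]]
y_train = [1, 0, 1, 0, 1, 0]  # 1 = coherent with its discourse context, 0 = breakdown

clf = SVC(kernel="rbf").fit(X_train, y_train)
# Expected [1 0] for these well-separated clusters.
print(clf.predict([[0.70, 0.68], [0.08, 0.12]]))
```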
IEEE Intelligent Systems, 2003
An essay-based discourse analysis system can help students improve their writing by identifying relevant essay-based discourse elements in their essays. The system presented here uses a voting algorithm based on decisions from three independent discourse analysis systems to automatically label elements in student essays.
* Because the original categories title and irrelevant occur infrequently, we collapsed them and any unlabeled text into the category other and used this for training and testing.
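The voting step can be sketched in a few lines: take the majority label across the three analyzers' decisions for each sentence, with a fallback when all three disagree. The tie-breaking rule below (defer to the first system) is an assumption; the paper's actual rule is not reproduced here.

```python
# A minimal sketch of majority voting over per-sentence labels from
# three independent discourse analyzers. Tie-breaking rule is assumed.
from collections import Counter

def vote(labels):
    top, count = Counter(labels).most_common(1)[0]
    return top if count >= 2 else labels[0]  # fall back to first system on 1-1-1 ties

sentence_labels = [("thesis", "thesis", "main_idea"),   # 2-1 majority -> thesis
                   ("support", "other", "conclusion")]  # no majority -> first system
print([vote(l) for l in sentence_labels])  # -> ['thesis', 'support']
```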
The Duolingo English Test is a groundbreaking, digital-first, computer-adaptive measure of English language proficiency for communication and use in English-medium settings. The test measures four key English language proficiency constructs: Speaking, Writing, Reading, and Listening (SWRL), and is aligned with the Common European Framework of Reference for Languages (CEFR) proficiency levels and descriptors. As a digital-first assessment, the test uses "human-in-the-loop AI" from end to end for test security, automated item generation, and scoring of test-taker responses. This paper presents a novel theoretical assessment ecosystem for the Duolingo English Test. It is a theoretical representation of language assessment design, measurement, and test security processes, as well as the test-taker experience factors that contribute to the test validity argument and test impact. The test validity argument is constructed with a digitally informed chain of inferences that addresses digital affordances applied to the test. The ecosystem is composed of an integrated set of complex frameworks: (1) the Language Assessment Design Framework, (2) the Expanded Evidence-Centered Design Framework, (3) the Computational Psychometrics Framework, and (4) the Test Security Framework. Test-taker experience (TTX) is a priority throughout the test-taking pipeline, reflected in features such as low cost, anytime/anywhere access, and shorter testing time. The test's expected impact is aligned with Duolingo's social mission to lower barriers to education access and to offer a secure and delightful test experience while providing a valid, fair, and reliable test score. The ecosystem leverages principles from assessment theory, computational psychometrics, design, data science, language assessment theory, NLP/AI, and test security. Note: This paper was updated on May 1, 2023 to reflect the addition of a new item type, Interactive Listening. The main change is an updated Table A2. Some additional editorial changes were introduced.