Steven Stemler - Academia.edu

Papers by Steven Stemler

A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability

Practical Assessment, Research and Evaluation, 2004

This article argues that the general practice of describing interrater reliability as a single, unified concept is at best imprecise, and at worst potentially misleading. Rather than representing a single concept, different statistical methods for computing interrater reliability can be more accurately classified into one of three categories based upon the underlying goals of analysis. The three general categories introduced and described in this paper are: 1) consensus estimates, 2) consistency estimates, and 3) measurement estimates. The assumptions, interpretation, advantages, and disadvantages of estimates from each of these three categories are discussed, along with several popular methods of computing interrater reliability coefficients that fall under the umbrella of consensus, consistency, and measurement estimates. Researchers and practitioners should be aware that different approaches to estimating interrater reliability carry with them different implications for how ratings across multiple judges should be summarized, which may impact the validity of subsequent study results. Many educational and psychological studies require the use of independent judges, or raters, in order to quantify some aspect of behavior. For example, judges may be used to score open-response items on a standardized test, to rate the performance of expert athletes at a sporting event, or to empirically test the viability of a new scoring rubric. Judges are most often used when behaviors of interest cannot be objectively scored in a simple right/wrong sense, but instead require some rating of the degree to which observed behaviors represent particular levels of a construct of interest (e.g., athletic excellence, history competence).
The task of judging behavior invites some degree of subjectivity in that the rating given will depend upon the judge's interpretation of the construct. One strategy for reducing subjectivity is to develop scoring rubrics (Mertler, 2001; Moskal & Leydens, 2000; Tierney & Simon, 2004). The purpose of training judges how to interpret a scoring rubric and consistently apply the levels of a rating scale associated with the rubric is to impose some level of objectivity onto the rating scale.
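The distinction the abstract draws between consensus and consistency estimates can be made concrete with a small sketch. The data below are hypothetical (two judges rating ten items on a 1-4 scale, invented for illustration); the statistics shown — percent exact agreement and Cohen's kappa as consensus estimates, and the Pearson correlation as a consistency estimate — are standard examples of each category:

```python
# Illustrative sketch (hypothetical data, not from the article): two judges
# rate ten items on a 1-4 scale; we summarize their agreement three ways.
from collections import Counter

judge_a = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
judge_b = [1, 2, 3, 3, 3, 4, 4, 4, 2, 1]
n = len(judge_a)

# Consensus estimate: percent exact agreement.
percent_agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / n

# Consensus estimate corrected for chance agreement: Cohen's kappa.
ca, cb = Counter(judge_a), Counter(judge_b)
p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n**2  # chance agreement
kappa = (percent_agreement - p_e) / (1 - p_e)

# Consistency estimate: Pearson correlation between the two judges.
# High r only requires the judges to rank items similarly, not to
# assign identical scores.
mean_a, mean_b = sum(judge_a) / n, sum(judge_b) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(judge_a, judge_b))
var_a = sum((a - mean_a) ** 2 for a in judge_a)
var_b = sum((b - mean_b) ** 2 for b in judge_b)
pearson_r = cov / (var_a * var_b) ** 0.5

print(f"percent agreement = {percent_agreement:.2f}")
print(f"Cohen's kappa     = {kappa:.2f}")
print(f"Pearson r         = {pearson_r:.2f}")
```

Note how judge B here tends to score slightly higher than judge A: a consistency estimate forgives that systematic offset, while a consensus estimate penalizes it — which is exactly why the two categories can lead to different conclusions about the same set of ratings.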

The Social Psychology of Conflict Resolution

Contemporary Psychology, Oct 1, 2003

Examining school effectiveness at the fourth grade: A hierarchical analysis of the Third International Mathematics and Science Study (TIMSS)

Native American/Tribal Schools

Routledge eBooks, Aug 6, 2013

Common and Unique Elements to School Mission Statements

There Is More to Teaching than Instruction: Seven Strategies for Dealing with the Social Side of Teaching. Publication Series No. 1

Public Elementary Schools

There’s more to teaching than instruction: seven strategies for dealing with the practical side of teaching

Educational Studies, Mar 1, 2006

The Man in the Mirror: Reflections on Identity Formation and Intergroup Conflict

Contemporary Psychology, Dec 1, 2004

Interrater Reliability

Encyclopedia of Measurement and Statistics, Aug 13, 2013

Are Creative People Better than Others at Recognizing Creative Work?

Thinking Skills and Creativity, Dec 1, 2020

It is often assumed that people with high ability in a domain will be excellent raters of quality within that same domain. This assumption is an underlying principle of using raters for creativity tasks, as in the Consensual Assessment Technique. While several prior studies have examined expert-novice differences in ratings, none have examined whether experts’ ability to identify the quality of a creative product is driven more by their ability to identify high-quality work, low-quality work, or both. To address this question, a sample of 142 participants completed individual difference measures and rated the quality of several sets of creative captions. Unbeknownst to the participants, the captions had been identified a priori by expert raters as being of particularly high or low quality. Hierarchical regression analyses revealed that after controlling for participants’ background and personality, those who scored significantly higher on any of three external measures of creativity also rated low-quality captions significantly lower than their peers; however, they did not rate the high-quality captions significantly higher. These findings support research in other domains suggesting that ratings of quality may be driven more by the lower end of the quality spectrum than the high end.
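The analytic logic described here — hierarchical regression, where control variables are entered first and a focal predictor is tested for the additional variance it explains — can be sketched in a few lines. The data below are simulated, not the study's; only the structure of the analysis (controls in step 1, creativity score in step 2, compare R² across the nested models) follows the abstract:

```python
# Hypothetical sketch of hierarchical regression: enter control variables
# first, then test whether a focal predictor adds explained variance (delta R^2).
# All data here are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 142  # sample size borrowed from the abstract; values are invented

background = rng.normal(size=(n, 2))   # stand-ins for background/personality controls
creativity = rng.normal(size=n)        # stand-in for an external creativity measure
# Simulated outcome: mean rating of low-quality captions, negatively
# related to creativity as the abstract reports.
rating = background @ np.array([0.2, 0.1]) - 0.5 * creativity + rng.normal(size=n)

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept column added."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Step 1: controls only.  Step 2: controls plus the creativity measure.
r2_step1 = r_squared(background, rating)
r2_step2 = r_squared(np.column_stack([background, creativity]), rating)
delta_r2 = r2_step2 - r2_step1  # variance uniquely attributable to creativity

print(f"R2 step 1 = {r2_step1:.3f}, R2 step 2 = {r2_step2:.3f}, delta R2 = {delta_r2:.3f}")
```

Because the models are nested and fit by least squares, the step-2 R² can never fall below step 1; the question the study asks is whether the increment is significant after the controls have done their work.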

Analyzing Gender Differences for High-achieving Students on TIMSS

Kluwer Academic Publishers eBooks, Dec 19, 2005

Content Analysis

Emerging Trends in the Social and Behavioral Sciences, May 15, 2015

Aligning Mission and Measurement

Plenum Series on Human Exceptionality, 2016


What Should University Admissions Tests Predict?

Educational Psychologist, 2012

Best Practices in Interrater Reliability: Three Common Approaches

SAGE Publications, Inc. eBooks, 2008

College Admissions, the MIA Model, and MOOCs: Commentary on Niessen and Meijer (2017)

Perspectives on Psychological Science, May 1, 2017

Measuring Creativity in the Classroom

The Undercurrent of American Education

Contemporary Psychology, Aug 1, 2003

A Closer Look at the Wesleyan Intercultural Competence Scale

Routledge eBooks, Jun 26, 2023
