Steven Stemler - Academia.edu
Papers by Steven Stemler
Practical Assessment, Research and Evaluation, 2004
This article argues that the general practice of describing interrater reliability as a single, unified concept is at best imprecise, and at worst potentially misleading. Rather than representing a single concept, different statistical methods for computing interrater reliability can be more accurately classified into one of three categories based upon the underlying goals of analysis. The three general categories introduced and described in this paper are: 1) consensus estimates, 2) consistency estimates, and 3) measurement estimates. The assumptions, interpretation, advantages, and disadvantages of estimates from each of these three categories are discussed, along with several popular methods of computing interrater reliability coefficients that fall under the umbrella of consensus, consistency, and measurement estimates. Researchers and practitioners should be aware that different approaches to estimating interrater reliability carry with them different implications for how ratings across multiple judges should be summarized, which may impact the validity of subsequent study results.

Many educational and psychological studies require the use of independent judges, or raters, in order to quantify some aspect of behavior. For example, judges may be used to score open-response items on a standardized test, to rate the performance of expert athletes at a sporting event, or to empirically test the viability of a new scoring rubric. Judges are most often used when behaviors of interest cannot be objectively scored in a simple right/wrong sense, but instead require some rating of the degree to which observed behaviors represent particular levels of a construct of interest (e.g., athletic excellence, history competence).
The task of judging behavior invites some degree of subjectivity in that the rating given will depend upon the judge's interpretation of the construct. One strategy for reducing subjectivity is to develop scoring rubrics (Mertler, 2001; Moskal & Leydens, 2000; Tierney & Simon, 2004). The purpose of training judges how to interpret a scoring rubric and consistently apply the levels of a rating scale associated with the rubric is to impose some level of objectivity onto the rating scale.

Consensus Estimates
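The distinction between consensus and consistency estimates can be illustrated with a minimal sketch. The ratings below are fabricated for illustration; percent agreement and Cohen's kappa stand in for consensus estimates (do judges assign the same score?), and a Pearson correlation stands in for a consistency estimate (do judges rank the items the same way?):

```python
# Hypothetical ratings from two judges scoring the same 10 essays on a 1-4 rubric.
judge_a = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
judge_b = [1, 2, 3, 3, 3, 4, 4, 4, 2, 1]

def percent_agreement(a, b):
    """Consensus estimate: share of items given identical ratings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Consensus estimate corrected for chance agreement."""
    n = len(a)
    categories = sorted(set(a) | set(b))
    p_obs = percent_agreement(a, b)
    # Expected agreement from each judge's marginal rating distribution.
    p_exp = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)

def pearson_r(a, b):
    """Consistency estimate: do judges order the essays the same way?"""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

print(round(percent_agreement(judge_a, judge_b), 2))  # 0.8
print(round(cohens_kappa(judge_a, judge_b), 2))       # 0.73
print(round(pearson_r(judge_a, judge_b), 2))          # 0.93
```

Note how the two families can diverge: here the judges disagree on exact scores for two essays (agreement 0.80), yet rank the essays almost identically (r = 0.93). Which number to report depends on whether exact score interchangeability or relative ordering matters for the study.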
Contemporary Psychology, Oct 1, 2003
Routledge eBooks, Aug 6, 2013
Educational Studies, Mar 1, 2006
Contemporary Psychology, Dec 1, 2004
Encyclopedia of Measurement and Statistics, Aug 13, 2013
Thinking Skills and Creativity, Dec 1, 2020
It is often assumed that people with high ability in a domain will be excellent raters of quality within that same domain. This assumption is an underlying principle of using raters for creativity tasks, as in the Consensual Assessment Technique. While several prior studies have examined expert-novice differences in ratings, none have examined whether experts' ability to identify the quality of a creative product is being driven more by their ability to identify high quality work, low quality work, or both. To address this question, a sample of 142 participants completed individual difference measures and rated the quality of several sets of creative captions. Unbeknownst to the participants, the captions had been identified a priori by expert raters as being of particularly high or low quality. Hierarchical regression analyses revealed that after controlling for participants' background and personality, those who scored significantly higher on any of three external measures of creativity also rated low-quality captions significantly lower than their peers; however, they did not rate the high-quality captions significantly higher. These findings support research in other domains suggesting that ratings of quality may be driven more by the lower end of the quality spectrum than the high end.
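The hierarchical regression strategy described in this abstract, entering control variables first and then testing the incremental variance explained by a focal predictor, can be sketched as follows. All data and variable names here are fabricated for illustration: `background` stands in for the control block, `creativity` for an external creativity measure, and `rating` for a caption quality rating.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A is small here)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def r_squared(y, X):
    """Fit OLS of y on the columns of X (plus an intercept); return R^2."""
    Xi = [[1.0] + row for row in X]
    k = len(Xi[0])
    # Normal equations: (X'X) beta = X'y.
    A = [[sum(r[i] * r[j] for r in Xi) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yv for r, yv in zip(Xi, y)) for i in range(k)]
    beta = solve(A, b)
    yhat = [sum(c * v for c, v in zip(beta, r)) for r in Xi]
    ybar = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Fabricated data for 10 hypothetical participants.
background = [3, 1, 4, 2, 5, 2, 3, 4, 1, 5]
creativity = [2, 1, 5, 2, 4, 3, 2, 5, 1, 4]
rating     = [2, 1, 5, 2, 5, 3, 2, 4, 1, 5]

# Step 1: controls only.  Step 2: controls plus the focal predictor.
r2_step1 = r_squared(rating, [[b] for b in background])
r2_step2 = r_squared(rating, [[b, c] for b, c in zip(background, creativity)])
print(round(r2_step2 - r2_step1, 3))  # incremental variance explained
```

The quantity of interest is the change in R² between the two steps: variance in ratings explained by creativity over and above the control block. In practice this would be fitted with a statistical package and the increment tested with an F-test; this sketch only shows the nested-model logic.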
Kluwer Academic Publishers eBooks, Dec 19, 2005
Emerging Trends in the Social and Behavioral Sciences, May 15, 2015
Plenum series on human exceptionality, 2016
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Educational Psychologist, 2012
SAGE Publications, Inc. eBooks, 2008
Perspectives on Psychological Science, May 1, 2017
Contemporary Psychology, Aug 1, 2003
Routledge eBooks, Jun 26, 2023