Estimating psychometric reliability with one observation per subject
Related papers
Test Reliability at the Individual Level
Reliability has a long history as one of the key psychometric properties of a test. However, a given test might not measure all people equally reliably: test scores from some individuals might carry considerably greater error than others'. This study proposed two approaches that use intraindividual variation to estimate test reliability for each person. A simulation study suggested that both the parallel-tests approach and the structural equation modeling approach recovered the simulated reliability coefficients. In an empirical study in which 45 females completed the Positive and Negative Affect Schedule (PANAS) daily for 45 consecutive days, separate reliability estimates were then generated for each person. Results showed that reliability estimates of the PANAS varied substantially from person to person. The methods provided in this article apply to tests measuring changeable attributes and require repeated measures across time on each individual. This article also provides a set of parallel forms of the PANAS.
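As a rough illustration of the parallel-forms idea (a sketch, not code from the paper; the data, noise levels, and function name below are made up), one person's reliability can be estimated by correlating scores from two parallel forms across that person's repeated measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

def person_reliability(form_a, form_b):
    """Correlate two parallel-form scores across one person's repeated
    measurements; under classical parallel-test assumptions this
    correlation estimates that person's reliability."""
    return np.corrcoef(form_a, form_b)[0, 1]

# Hypothetical data: 45 daily administrations of two parallel forms
days = 45
true_state = rng.normal(0, 1, size=days)             # daily true affect
form_a = true_state + rng.normal(0, 0.5, size=days)  # form A plus error
form_b = true_state + rng.normal(0, 0.5, size=days)  # form B plus error

# Expected value is var(T)/(var(T)+var(E)) = 1/(1+0.25) = 0.8
print(f"estimated reliability: {person_reliability(form_a, form_b):.2f}")
```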
A family of measures to evaluate scale reliability in a longitudinal setting
Journal of The Royal Statistical Society Series A-statistics in Society, 2009
The concept of reliability denotes one of the most important psychometric properties of a measurement scale. Reliability refers to the capacity of the scale to discriminate between subjects in a given population. In classical test theory, it is often estimated using the intraclass correlation coefficient based on two replicate measurements. However, the modelling framework used in this theory is often too narrow when applied in practical situations. Generalizability theory has extended reliability theory to a much broader framework, but is confronted with some limitations when applied in a longitudinal setting. In this paper, we explore how the definition of reliability can be generalized to a setting where subjects are measured repeatedly over time. Based on four defining properties for the concept of reliability, we propose a family of reliability measures that circumscribes the area in which reliability measures should be sought. It is shown how different members assess different aspects of the problem and that the reliability of the instrument can depend on the way it is used. The methodology is motivated by and illustrated on data from a clinical study on schizophrenia. Based on this study, we estimate and compare the reliabilities of two different rating scales for evaluating the severity of the disorder.
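For context, a minimal sketch of the classical intraclass correlation the abstract mentions, computed from two replicate measurements per subject using the one-way random-effects formula (the simulated data are hypothetical):

```python
import numpy as np

def icc_two_replicates(x1, x2):
    """One-way random-effects ICC(1) from two replicate measurements
    per subject: (MS_between - MS_within) / (MS_between + (k-1)*MS_within)."""
    x = np.stack([x1, x2], axis=1)            # shape (n_subjects, k=2)
    n, k = x.shape
    ms_between = k * np.var(x.mean(axis=1), ddof=1)
    ms_within = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(1)
subj = rng.normal(0, 1, 100)              # true subject levels
r1 = subj + rng.normal(0, 0.6, 100)       # replicate 1
r2 = subj + rng.normal(0, 0.6, 100)       # replicate 2
print(round(icc_two_replicates(r1, r2), 2))  # roughly 1/(1+0.36) ~ 0.74
```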
Measuring, estimating, and understanding the psychometric function: A commentary
Perception & Psychophysics, 2001
The psychometric function, relating the subject's response to the physical stimulus, is fundamental to psychophysics. This paper examines various psychometric function topics, many inspired by this special symposium issue of Perception & Psychophysics: What are the relative merits of objective yes/no versus forced choice tasks (including threshold variance)? What are the relative merits of adaptive versus constant stimuli methods? What are the relative merits of likelihood versus up-down staircase adaptive methods? Is 2AFC free of substantial bias? Is there no efficient adaptive method for objective yes/no tasks? Should adaptive methods aim for 90% correct? Can adding more responses to forced choice and objective yes/no tasks reduce the threshold variance? What is the best way to deal with lapses? How is the Weibull function intimately related to the d′ function? What causes bias in the likelihood goodness-of-fit? What causes bias in slope estimates from adaptive methods? How good are nonparametric methods for estimating psychometric function parameters? Of what value is the psychometric function slope? How are various psychometric functions related to each other? The resolution of many of these issues is surprising.
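One recurring object in these questions is the Weibull psychometric function. A minimal sketch of a common parameterization, with guess and lapse rates included (all parameter values here are illustrative, not from the commentary):

```python
import numpy as np

def weibull_psychometric(x, alpha, beta, gamma=0.5, lapse=0.0):
    """Weibull psychometric function: probability correct at stimulus
    intensity x, with threshold alpha, slope beta, guess rate gamma
    (0.5 for 2AFC), and an upper asymptote reduced by the lapse rate."""
    return gamma + (1 - gamma - lapse) * (1 - np.exp(-(x / alpha) ** beta))

intensities = np.array([0.5, 1.0, 2.0, 4.0])
print(weibull_psychometric(intensities, alpha=2.0, beta=3.0, lapse=0.02))
```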
On the analysis of psychometric functions: The Spearman-Kärber method
Perception & Psychophysics, 2001
With computer simulations, we examined the performance of the Spearman-Kärber method for analyzing psychometric functions and compared this method with the standard technique of probit analysis. The Spearman-Kärber method was found to be superior in most cases. It generally yielded less biased and less variable estimates of the location and dispersion of a psychometric function, and it provided more power to detect differences in these parameters across experimental conditions. Moreover, the Spearman-Kärber method provided information about the skewness and higher moments of psychometric functions that is beyond the scope of probit analysis. These advantages of the Spearman-Kärber method suggest that it should often be used in preference to probit analysis for the analysis of observed psychometric functions.
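A minimal sketch of the Spearman-Kärber idea: the observed response proportions are treated as a cumulative distribution, and its moments are estimated by differencing adjacent proportions (the levels and proportions below are invented, and forcing monotonicity is one simple convention among several):

```python
import numpy as np

def spearman_karber(levels, p):
    """Spearman-Karber estimates of the location (mean) and dispersion
    (SD) of a psychometric function from response proportions p observed
    at increasing stimulus levels."""
    levels = np.asarray(levels, dtype=float)
    p = np.maximum.accumulate(np.asarray(p, dtype=float))  # force monotone
    dp = np.diff(p)                        # probability mass between levels
    mid = (levels[:-1] + levels[1:]) / 2   # midpoints carry that mass
    mean = np.sum(dp * mid) / np.sum(dp)
    var = np.sum(dp * (mid - mean) ** 2) / np.sum(dp)
    return mean, np.sqrt(var)

levels = [1, 2, 3, 4, 5]
p = [0.02, 0.15, 0.55, 0.90, 0.99]
print(spearman_karber(levels, p))  # location near 2.9 for these data
```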
A Unified Approach to Multi-item Reliability
Biometrics, 2010
The reliability of multi-item scales has received a lot of attention in the psychometric literature, where a myriad of measures like Cronbach's α or the Spearman–Brown formula have been proposed. Most of these measures, however, are based on very restrictive models that apply only to unidimensional instruments. In this article, we introduce two measures to quantify the reliability of multi-item scales based on a more general model. We show that they capture two different aspects of the reliability problem and satisfy a minimum set of intuitive properties. The relevance and complementary value of the measures is studied and earlier approaches are placed in a broader theoretical framework. Finally, we apply them to investigate the reliability of the Positive and Negative Syndrome Scale, a rating scale for the assessment of the severity of schizophrenia.
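For reference, Cronbach's α, one of the restrictive classical measures this abstract contrasts with its more general proposal, reduces to a few lines (the helper below is an illustration, not the authors' method):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha from an (n_subjects, n_items) score matrix:
    k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)
```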
The Estimation of Reliability in Longitudinal Models
International Journal of Behavioral Development, 1998
Despite the increasing attention devoted to the study and analysis of longitudinal data, relatively little consideration has been directed toward understanding the issues of reliability and measurement error. Perhaps one reason for this neglect has been that traditional methods of estimation (e.g. generalisability theory) require assumptions that are often not tenable in longitudinal designs. This paper first examines applications of generalisability theory to the estimation of measurement error and reliability in longitudinal research, and notes how factors such as missing data, correlated errors, and true score instability prohibit traditional variance component estimation. Next, we discuss how estimation methods using restricted maximum likelihood can account for these factors, thereby providing many advantages over traditional estimation methods. Finally, we provide a substantive example illustrating these advantages, and include brief discussions of programming and software...
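A hedged sketch of the restricted-maximum-likelihood approach the abstract advocates, using statsmodels' mixed-model fitting on simulated long-format data (the variance values are invented, and defining reliability as between-subject variance over total variance is an illustrative assumption, not the paper's exact model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: repeated scores per subject
rng = np.random.default_rng(2)
n_subj, n_occ = 50, 5
subj_eff = rng.normal(0, 1.0, n_subj)
rows = [(s, subj_eff[s] + rng.normal(0, 0.8))
        for s in range(n_subj) for _ in range(n_occ)]
df = pd.DataFrame(rows, columns=["subject", "score"])

# Random-intercept model fit by REML (statsmodels' default)
fit = smf.mixedlm("score ~ 1", df, groups=df["subject"]).fit(reml=True)
var_subject = float(fit.cov_re.iloc[0, 0])   # between-subject variance
var_resid = fit.scale                        # residual (error) variance
reliability = var_subject / (var_subject + var_resid)
print(f"reliability of a single occasion: {reliability:.2f}")
```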
Educational and Psychological Measurement, 2002
The present article addresses reliability issues in light of recent studies and debates focused on psychometrics versus datametrics terminology and reliability generalization (RG) introduced by Vacha-Haase. The purpose here was not to moderate arguments presented in these debates but to discuss multiple perspectives on score reliability and how they may affect research practice, editorial policies, and RG across studies. Issues of classical error variance and reliability are discussed across models of classical test theory, generalizability theory, and item response theory. Potential problems with RG across studies are discussed in relation to different types of reliability, different test forms, different number of items, misspecifications, and confounding independent variables in a single RG analysis.
CASPER: A personal computer package for exploring psychometrics
Behavior Research Methods, Instruments, & Computers, 1991
CASPER is a psychometrics software package suitable for instructional and research applications with IBM-PC-compatible computers. CASPER lets the user simulate or directly enter psychometric data. Numerous statistical analyses, file handling procedures, and graphics are included. Analyses include factor analysis, multiple regression, correlation/partial correlation, moments analysis, reliability analysis, and item analysis.

Psychometric methods are an important part of every psychologist's undergraduate and graduate education. Tests and measurements courses are taught in most psychology and education departments. Although many computer programs have been developed to teach the statistics used in experimental design, few or none have been developed that specifically address psychometric statistics. The Construct Analysis and Simulation package for Education through Research (CASPER), an integrated, easy-to-use package that requires no programming skills, was written to fill this void. CASPER was designed to allow students and researchers to explore psychometrics through both Monte Carlo simulations and real data analysis. Users simulate true scores, observation scores, and measurement error, using random numbers or limiting case data, in order to model psychometric construct domains. These domains may be constituted from either interval or nominal scales, allowing the simulation of Likert scale, true-false, or multiple-choice tests. The logic of the simulation follows the rationale of the classical true score model (Nunnally, 1978). The user may vary the number, range, and distributions of true scores. Errorless item scores are then generated; these are the ratings applied to questionnaire items if no measurement error is involved. Total scores derived from errorless item scores are identical to true scores, and the ratings are perfectly valid and reliable. To make the data more realistic, users may add various types and amounts of error variance to the errorless item scores, creating observation scores.
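A minimal simulation in the spirit of CASPER's classical true-score logic (a sketch, not CASPER code; all distributional choices below are arbitrary): generate errorless item scores from true scores, add error variance to create observation scores, and recover reliability as true-score variance over observed-score variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_people, n_items = 200, 10

# Classical true-score model: observed = true + error, item by item
true_scores = rng.normal(50, 10, size=(n_people, 1))
errorless_items = np.repeat(true_scores / n_items, n_items, axis=1)
error = rng.normal(0, 1.5, size=(n_people, n_items))
observed_items = errorless_items + error

total = observed_items.sum(axis=1)
# Reliability = true-score variance / observed-score variance
rel = true_scores.var(ddof=1) / total.var(ddof=1)
print(f"simulated reliability: {rel:.2f}")  # about 100/122.5 ~ 0.82
```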
Psychology in The Schools, 2007
Direct observation of behaviors is a data collection method customarily used in clinical and educational settings. Repeated measures and small samples are inherent characteristics of observational studies that pose challenges to the numerical estimation of reliability for observational data. In this article, we review some debates about the use of Generalizability Theory in estimating the reliability of single-subject observational data. We propose that it can be used, but only under a clearly stated set of conditions. The conceptualization of facets and the object of measurement for a common design of observational research is elucidated in a new light. We provide two numerical examples to illustrate the ideas. Limitations of using Generalizability Theory to estimate the reliability of observational data are discussed.
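As a small illustration of the decision-study logic in Generalizability Theory (the variance components below are invented, and the function is a hypothetical helper, not from the article), a generalizability coefficient can be projected for designs that average over more measurement occasions:

```python
def projected_g_coefficient(var_person, var_error, n_occasions):
    """D-study projection: generalizability coefficient for a design
    averaging over n_occasions, analogous to the Spearman-Brown formula."""
    return var_person / (var_person + var_error / n_occasions)

# More occasions shrink the error term and raise the coefficient
for n in (1, 3, 5, 10):
    print(n, round(projected_g_coefficient(0.6, 1.2, n), 2))
```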