Does IRT Provide More Sensitive Measures of Latent Traits in Statistical Tests? An Empirical Examination
Related papers
Journal of Applied Measurement
Lord (1980) presented a purely conceptual equation to approximate the nonlinear functional relationship between classical test theory (CTT; aka true score theory) and item response theory (IRT) item discrimination indices. The current project proposes a modification to his equation that makes it useful in practice. The suggested modification acknowledges the more common contemporary CTT discrimination index of a corrected item-total correlation and incorporates item difficulty. We simulated slightly over 768 trillion individual item responses to uncover a best-fitting empirical function relating the IRT and CTT discrimination indices. To evaluate the effectiveness of the function, we applied it to real-world test data from 16 workforce and educational tests. Our modification results in shifted functional asymptotes, slopes, and points of inflection across item difficulties. Validation with the workforce and educational tests suggests good prediction under common assumption testing conditions (approximately normal distribution of abilities and moderate item difficulties) and greater precision than Lord's (1980) formula.
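For context, the Lord (1980) relationship this abstract builds on is usually written (notation assumed here, not taken from the paper) as an approximation linking the normal-ogive discrimination parameter $a_i$ to the item's biserial correlation $r_i$ with the trait:

$$ a_i \approx \frac{r_i}{\sqrt{1 - r_i^{2}}} $$

The modification described in the abstract replaces $r_i$ with the corrected item-total correlation and lets the asymptote, slope, and point of inflection of the fitted function vary with item difficulty.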
The " New Psychometrics " – Item Response Theory
Classical test theory is concerned with the reliability of a test and assumes that the items within the test are sampled at random from a domain of relevant items. Reliability is seen as a characteristic of the test and of the variance of the trait it measures. Items are treated as random replicates of each other, and their characteristics, if examined at all, are expressed as correlations with total test score or as factor loadings on the putative latent variable(s) of interest; their individual properties are not analyzed in detail. This led Mellenbergh (1996) to distinguish between theories of tests (Lord and Novick, 1968) and theories of items (Lord, 1952; Rasch, 1960). The so-called "New Psychometrics" (Embretson and Hershberger, 1999; Embretson and Reise, 2000; Van der Linden and Hambleton, 1997) is a theory of how people respond to items and is known as Item Response Theory, or IRT. Over the past twenty years there has been explosive growth in programs that can do IRT; within R there are at least four very powerful packages, among them eRm (Mair and Hatzinger, 2007).
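As a concrete illustration of this "theory of items" view, here is a minimal base-R sketch of the two-parameter logistic (2PL) item response function, the building block of the models discussed above. The parameter values and object names are illustrative assumptions, not taken from any of the papers.

```r
## 2PL item response function: probability of a correct response
## as a function of latent trait theta, discrimination a, difficulty b.
irf_2pl <- function(theta, a, b) {
  1 / (1 + exp(-a * (theta - b)))
}

theta  <- seq(-4, 4, by = 0.1)                 # latent trait grid
p_easy <- irf_2pl(theta, a = 1.5, b = -1)      # easy, discriminating item
p_hard <- irf_2pl(theta, a = 0.8, b = 1)       # harder, flatter item

plot(theta, p_easy, type = "l",
     xlab = "theta (latent trait)", ylab = "P(correct)")
lines(theta, p_hard, lty = 2)
```

Each item gets its own trace line; CTT, by contrast, would summarize both items with a single correlation against the total score.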
Item Response Theory: An Introduction to Latent Trait Models to Test and Item Development
Testing in educational systems performs a number of functions, and the results from a test can be used to make a number of decisions in education. It is therefore well accepted in the education literature that testing is an important element of education. To effectively utilize tests in educational policy and quality assurance, their validity and reliability estimates are necessary. There are two generally accepted frameworks used to evaluate the quality of tests in educational and psychological measurement: Classical Test Theory (CTT) and Item Response Theory (IRT). Estimates of test item validity and reliability depend on the particular measurement model used. It is vital for a test developer to be familiar with the different test development and item analysis methods in order to facilitate the development of a new test. CTT is a traditional approach that has been widely criticised in the measurement community for shortcomings such as the sample dependency of its coefficient measures and estimates of measurement error. IRT, by contrast, is a modern approach that provides solutions to most of CTT's identified shortcomings. This paper therefore provides a comprehensive overview of IRT and its procedures as applied to test item development and analysis. The paper concludes with some suggestions for test developers and test specialists at all levels to adopt IRT for its identified crucial theoretical and empirical gains over CTT: IRT-based parameter estimates should be superior to, and more reliable than, CTT-based estimates, and with these features IRT can help resolve the problems associated with test design based on CTT.
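To make the CTT side of this comparison concrete, here is a hedged base-R sketch of the two classical item statistics the criticism centers on: item difficulty (the proportion correct) and the corrected item-total correlation. The simulated data and all object names are illustrative assumptions; note how both statistics depend on the particular sample of examinees drawn.

```r
set.seed(1)
theta <- rnorm(500)                          # latent abilities (simulated)
b     <- seq(-1.5, 1.5, length.out = 10)     # item difficulties
p     <- plogis(outer(theta, b, "-"))        # Rasch-type response probabilities
resp  <- matrix(rbinom(length(p), 1, p), nrow = 500)  # 0/1 response matrix

difficulty  <- colMeans(resp)                # CTT item difficulty (p-values)
total       <- rowSums(resp)
corrected_r <- sapply(seq_len(ncol(resp)), function(j) {
  cor(resp[, j], total - resp[, j])          # item j excluded from the total
})

round(cbind(difficulty, corrected_r), 3)
```

Re-running this with a more or less able sample (e.g., `theta <- rnorm(500, mean = 1)`) shifts both statistics, which is exactly the sample dependency the abstract refers to.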
Effects of Local Item Dependence on the Validity of IRT Item, Test, and Ability Statistics
Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies, and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. In this study, we (a) review methods for detecting local item dependence (LID), (b) discuss the use of testlets to account for LID in context-dependent item sets, (c) apply LID detection methods and testlet-based item calibrations to data from a large-scale, high stakes admissions test, and (d) evaluate the results with respect to test score reliability and examinee proficiency estimation. The results suggest the presence of LID impacts estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.
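One widely used LID detection method of the kind reviewed here is Yen's Q3 statistic: correlate the residuals that remain after a fitted IRT model's predicted probabilities are subtracted from the observed responses. A minimal base-R sketch, assuming `resp` (observed 0/1 responses) and `prob` (model-implied probabilities from a separately fitted IRT model) are already available:

```r
## Yen's Q3: pairwise correlations of person-by-item IRT residuals.
## Large positive off-diagonal values flag candidate locally
## dependent item pairs (e.g., items sharing a reading passage).
q3_matrix <- function(resp, prob) {
  residuals <- resp - prob     # observed minus model-implied responses
  cor(residuals)               # pairwise Q3 values
}
```

Pairs whose Q3 value sits well above the average off-diagonal value are the natural candidates for grouping into testlets, as the study describes.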
Psicológica, 2007
A theoretical advantage of item response theory (IRT) models is that trait estimates based on these models provide more test information than any other type of test score. It is still unclear, however, whether using IRT trait estimates improves external validity results in comparison with the results that can be obtained by using simple raw scores. This paper discusses some methodological results based on the 2-parameter logistic model (2PLM) and is concerned with three issues: first, how validity coefficients based on IRT trait estimates must be interpreted; second, how inferences about these coefficients can be made; and third, which differences in external validity can be expected if the 2PLM is correct for the data and IRT scores are used in place of raw scores. Four empirical examples in the personality domain provided further evidence for the results that can be expected in real research in which the model is, at best, a good approximation to the data. A general result of thes...
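A hedged base-R sketch of the comparison this paper investigates: correlating an external criterion with (a) the raw sum score and (b) an EAP trait estimate under the 2PLM. Item parameters are treated as known for simplicity (in practice they would be estimated), and every value and name below is an illustrative assumption.

```r
set.seed(2)
n <- 1000; k <- 20
a     <- runif(k, 0.8, 2.0)                          # 2PL discriminations
b     <- rnorm(k)                                    # 2PL difficulties
theta <- rnorm(n)                                    # true latent traits
p     <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
resp  <- matrix(rbinom(length(p), 1, p), nrow = n)   # 0/1 response matrix

## EAP scoring by quadrature over a standard normal prior
nodes <- seq(-4, 4, length.out = 81)
prior <- dnorm(nodes)
pq    <- plogis(sweep(outer(nodes, b, "-"), 2, a, "*"))  # node-by-item probs
eap <- apply(resp, 1, function(u) {
  like <- apply(pq, 1, function(pr) prod(pr^u * (1 - pr)^(1 - u)))
  sum(nodes * like * prior) / sum(like * prior)
})

criterion <- 0.5 * theta + rnorm(n, sd = sqrt(0.75))  # external variable
c(raw = cor(rowSums(resp), criterion), eap = cor(eap, criterion))
```

Under conditions like these, the two validity coefficients tend to be close, which is consistent with the paper's point that the practical gain from IRT scores over raw scores in external validity needs to be demonstrated rather than assumed.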