Use of the Lagrange Multiplier Test for Assessing Measurement Invariance Under Model Misspecification
Related papers
Educational and Psychological Measurement, 2017
Lagrange multiplier (LM) or score tests have seen renewed interest for the purpose of diagnosing misspecification in item response theory (IRT) models. LM tests can also be used to test whether parameters differ from a fixed value. We argue that the utility of LM tests depends on both the method used to compute the test and the degree of misspecification in the initially fitted model. We demonstrate both of these points in the context of a multidimensional IRT framework. Through an extensive Monte Carlo simulation study, we examine the performance of LM tests under varying degrees of model misspecification, model size, and different information matrix approximations. A generalized LM test designed specifically for use under misspecification, which has apparently not been previously studied in an IRT framework, performed the best in our simulations. Finally, we reemphasize caution in using LM tests for model specification searches.
Educational and Psychological Measurement, 2016
A latent variable modeling method for studying measurement invariance when evaluating latent constructs with multiple binary or binary scored items with no guessing is outlined. The approach extends the continuous indicator procedure described by Raykov and colleagues, similarly uses the false discovery rate approach to multiple testing, and permits one to locate violations of measurement invariance in loading or threshold parameters. The discussed method does not require selection of a reference observed variable and is directly applicable for studying differential item functioning with one- or two-parameter item response models. The extended procedure is illustrated on an empirical data set.
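The false discovery rate idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure, which the abstract does not name, and the function name and p-values below are hypothetical. Each p-value stands for one test of invariance for a single loading or threshold parameter.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null is rejected at FDR level q.

    Assumption: the Benjamini-Hochberg step-up rule, used here only as an
    illustration of an FDR approach to multiple testing.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis whose p-value ranks at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per tested loading or threshold parameter.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
# → [True, True, False, False, False, False, False, False, False, False]
```

Only the parameters flagged True would be treated as candidate violations of invariance; the step-up rule can reject a small p-value that misses its own threshold when a larger-ranked p-value meets its threshold.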
The Asymptotic Power of the Lagrange Multiplier Tests for Misspecified IRT Models
Springer Proceedings in Mathematics & Statistics, 2021
This article studies the power of the Lagrange Multiplier Test and the Generalized Lagrange Multiplier Test to detect measurement non-invariance in Item Response Theory (IRT) models for binary data. We study the performance of these two tests under correct model specification and incorrect distribution of the latent variable. The asymptotic distribution of each test under the alternative hypothesis depends on a noncentrality parameter that is used to compute the power. We present two different procedures to compute the noncentrality parameter and consequently the power of the tests. The performance of the two procedures is evaluated through a simulation study. Both yield power values very similar to the classic empirical power while being less time consuming. Moreover, the results highlight that the Lagrange Multiplier Test is more powerful than the Generalized Lagrange Multiplier Test to detect measurement non-invariance under all simulation conditions.
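How a noncentrality parameter translates into power can be sketched as follows. The sketch assumes the usual asymptotic result that the test statistic follows a noncentral chi-square distribution under the alternative; the abstract does not state the distribution, and the function names are hypothetical. It does not reproduce either of the article's procedures for obtaining the noncentrality parameter itself, which is simply taken as an input here.

```python
import math


def chi2_cdf(x, df):
    """Central chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_ppf(p, df):
    """Central chi-square quantile found by bisection on chi2_cdf."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def lm_power(ncp, df, alpha=0.05):
    """Asymptotic power, assuming a noncentral chi-square with parameter ncp.

    The noncentral CDF is written as a Poisson(ncp / 2) mixture of central
    chi-square CDFs with df + 2j degrees of freedom.
    """
    crit = chi2_ppf(1.0 - alpha, df)
    half = ncp / 2.0
    weight = math.exp(-half)  # Poisson weight for j = 0
    cdf, mass, j = 0.0, 0.0, 0
    while mass < 1.0 - 1e-12 and j < 10_000:
        cdf += weight * chi2_cdf(crit, df + 2 * j)
        mass += weight
        j += 1
        weight *= half / j
    return 1.0 - cdf


# Hypothetical noncentrality values for a test with 1 degree of freedom.
for ncp in (0.0, 2.0, 5.0, 7.849):
    print(ncp, round(lm_power(ncp, df=1), 3))
```

With a noncentrality parameter of zero the function returns the nominal level, and power increases with the noncentrality parameter for fixed degrees of freedom.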
The Value Of Item Response Theory In Invariance Testing
Journal of Applied Business Research (JABR), 2016
The goal of the current study was to assess the Employee Engagement Instrument (EEI) from an item response theory (IRT) perspective, with a specific focus on measurement invariance for annual turnover. The sample comprised 4 099 respondents from all business sectors in South Africa. This article describes the logic and procedures used to test for factorial invariance across groups in the context of construct validation. The procedures included testing for configural and metric invariance in the framework of multiple-group confirmatory factor analysis (CFA). The results confirmed the factor analytic structure of the model fit for some of the individual scales of the EEI. The measurement invariance of the EEI as a function of annual turnover was confirmed. However, the results indicated that the EEI needs to be refined for future research.
Identifying the source of misfit in item response theory models
Multivariate Behavioral Research, 2014
When an item response theory model fails to fit adequately, the items for which the model provides a good fit and those for which it does not must be determined. To this end, we compare the performance of several fit statistics for item pairs with known asymptotic distributions under maximum likelihood estimation of the item parameters: (a) a mean and variance adjustment to bivariate Pearson’s X², (b) a bivariate subtable analog to Reiser’s (1996) overall goodness-of-fit test, (c) a z statistic for the bivariate residual cross product, and (d) Maydeu-Olivares and Joe’s (2006) M₂ statistic applied to bivariate subtables. The unadjusted Pearson’s X² with heuristically determined degrees of freedom is also included in the comparison. For binary and ordinal data, our simulation results suggest that the z statistic has the best Type I error and power behavior among all the statistics under investigation when the observed information matrix is used in its computation. However, if one has to use the cross-product information, the mean and variance adjusted X² is recommended. We illustrate the use of pairwise fit statistics in 2 real-data examples and discuss possible extensions of the current research in various directions.
Online Submission, 2006
One of the most important goals of international studies in educational research is the comparison of learning outcomes across participating countries. In order to compare results it is necessary to collect data using comparable measures. Studies like TIMSS, CIVED, PIRLS or PISA invest considerable effort in developing tests which are appropriately translated into the test languages, culturally unbiased and suitable for the diverse educational systems across participating countries. Typically, IRT (Item Response Theory) scaling methodology (see Hambleton, Swaminathan and Rogers, 1991) is used to review Differential Item Functioning (DIF) for countries and detect country-specific item misfit (see examples in Adams, 2002; Schulz and Sibberns, 2004). Likewise, it is of great importance to achieve similar levels of comparability for measures derived from contextual questionnaires. Data collected from contextual questionnaires are often used to explain variation in student performance. However, many constructs measured in student questionnaires (for example self-related cognitions regarding areas of learning, classroom climate etc.) can often also be regarded as important learning outcomes. In the OECD PISA study, for example, contextual data are collected through student and school questionnaires. Questionnaire items are treated in three different ways (see OECD, 2005, pp. 271-319):
• They are reported as single items (for example gender, grade).
• They are converted into "simple indices" through the arithmetical transformation or recoding of one or more items.
• They are scaled. Typically, Item Response Theory (IRT) is used as scaling methodology in order to obtain individual student scores (Weighted Likelihood Estimates).
Language differences can have a powerful effect on equivalence (or nonequivalence). Typically, source versions (in English or French) are translated into the language used in a country.
In most international studies, reviews of national adaptations and thorough translation verifications are implemented in order to ensure a maximum of "linguistic equivalence" (see Grisay, 2002; Chrostowski and Malak, 2004). However, it is well known that even slight deviations in wording (sometimes necessary due to linguistic differences between source and target language) may lead to differences in item responses (
Nonparametric item response theory in action: An overview of the special issue
Applied Psychological Measurement, 2001
Although most item response theory (IRT) applications and related methodologies involve model fitting within a single parametric IRT (PIRT) family [e.g., the Rasch (1960) model or the three-parameter logistic model (3PLM)], nonparametric IRT (NIRT) research has been growing in recent years. Three broad motivations for the development and continued interest in NIRT can be identified:
1. To identify a commonality among PIRT and IRT-like models, model features [e.g., local independence (LI), monotonicity of item response functions (IRFs), unidimensionality of the latent variable] should be characterized, and it should be discovered what happens when models satisfy only weakened versions of these features. Characterizing successful and unsuccessful inferences under these broad model features can be attempted in order to understand how IRT models aggregate information from data. All this can be done with NIRT.
2. Any model applied to data is likely to be incorrect. When a family of PIRT models has been shown (or is suspected) to fit poorly, a more flexible family of NIRT models often is desired. These NIRT models have been used to: (1) assess violations of LI due to nuisance traits (e.g., latent variable multidimensionality) or the testing context influencing test performance (e.g., speededness and question wording), (2) clarify questions about the sources and effects of differential item functioning, (3) provide a flexible context in which to develop methodology for establishing the most appropriate number of latent dimensions underlying a test, and (4) serve as alternatives for PIRT models in tests of fit.
3. In psychological and sociological research, when it is necessary to develop a new questionnaire or measurement instrument, there often are fewer examinees and items than are desired for fitting PIRT models in large-scale educational testing. NIRT provides tools that are easy to use in small samples. It can identify items that scale together well (follow a particular set of NIRT assumptions). NIRT also identifies several subscales with simple structure among the scales, if the items do not form a single unidimensional scale.