The Stability of Four Methods for Estimating Item Bias

Accounting for Statistical Artifacts in Item Bias Research

Journal of Educational Statistics, 1984

Theoretically preferred IRT bias detection procedures were applied to both a mathematics achievement test and a vocabulary test. The data were from black and white seniors on the High School and Beyond data files. To account for statistical artifacts, each analysis was repeated on randomly equivalent samples of blacks and whites (n's = 1,500). Furthermore, to establish a baseline for judging bias indices that might be attributable only to sampling fluctuations, bias analyses were conducted comparing randomly selected groups of whites. To assess the effect of mean group differences on the appearance of bias, pseudo-ethnic groups were created; that is, samples of whites were selected to simulate the average black-white difference. The validity and sensitivity of the IRT bias indices were supported by several findings. A relatively large number of items (10 of 29) on the math test were found to be consistently biased; they were replicated in parallel analyses. The bias indices were substantia...
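The two sampling devices described above — randomly equivalent samples drawn from one population, and pseudo-ethnic groups built to simulate a mean group difference — can be sketched in a few lines. The exponential-tilting selection rule and the `lam` parameter below are illustrative assumptions, not the study's actual procedure:

```python
import math
import random

def random_split(ids, n_per_group, rng):
    """Two non-overlapping, randomly equivalent samples from one population,
    used to baseline bias indices against pure sampling fluctuation."""
    picked = rng.sample(ids, 2 * n_per_group)
    return picked[:n_per_group], picked[n_per_group:]

def pseudo_group(scores, n, lam, rng):
    """Weighted sample without replacement that over-represents low scorers,
    simulating a mean group difference. `lam` controls the size of the shift;
    exponential tilting is an assumed, illustrative selection rule."""
    # Efraimidis-Spirakis keys: selection probability grows with weight
    # w = exp(-lam * score), so low scorers get the largest keys.
    keyed = [(rng.random() ** (1.0 / math.exp(-lam * s)), s) for s in scores]
    keyed.sort(reverse=True)
    return [s for _, s in keyed[:n]]
```

Splitting, say, 3,000 examinees of one group into two samples of 1,500 gives a no-bias baseline; `pseudo_group` with `lam > 0` yields a sample whose mean sits below the population mean.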

An Investigation of the Relationship between Reliability, Power, and the Type I Error Rate of the Mantel-Haenszel and Simultaneous Item Bias Detection Procedures

1992

This study examines the relationship between levels of reliability and the power of two bias and differential item functioning (DIF) detection methods. Both methods, the Mantel-Haenszel (MH) (Holland & Thayer, 1988) and the Simultaneous Item Bias (SIB) (Shealy & Stout, 1991), use examinees' raw scores as a conditioning variable in the computation of differential performance between two groups of interest. As a result, the extent to which examinees' observed scores accurately reflect their true abilities plays an important role. If examinees are misrepresented by their observed score (as for a test with low reliability), then the ability of bias detection methods to determine item bias may not be very accurate. Results suggest that for a fixed-length test, the power of both statistics increases moderately as reliability is increased and substantially as sample size is increased. However, the combination of small sample sizes and high reliability resulted in a decrease of power. For most of the simulated conditions the MH procedure and SIB had very similar rates of correctly rejecting the biased item.
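Both procedures condition on raw score; the Mantel-Haenszel statistic, for example, aggregates 2×2 tables (group × correct/incorrect) across score strata. A minimal self-contained sketch using the standard MH formulas (not code from the study):

```python
from collections import defaultdict

def mantel_haenszel_dif(items_ref, items_foc, scores_ref, scores_foc):
    """Mantel-Haenszel DIF statistic for one studied item.

    items_*  : 0/1 responses to the item for each examinee
    scores_* : matching raw total scores used as the stratifying variable
    Returns (alpha_MH, chi2_MH): the common odds ratio and the
    continuity-corrected MH chi-square.
    """
    strata = defaultdict(lambda: [0, 0, 0, 0])  # [A, B, C, D] per score level
    for u, s in zip(items_ref, scores_ref):
        strata[s][0 if u == 1 else 1] += 1      # A = ref correct, B = ref wrong
    for u, s in zip(items_foc, scores_foc):
        strata[s][2 if u == 1 else 3] += 1      # C = foc correct, D = foc wrong

    num = den = a_sum = e_sum = v_sum = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        if n < 2:
            continue                            # stratum too small to contribute
        num += a * d / n
        den += b * c / n
        m1, m0 = a + c, b + d                   # correct / incorrect margins
        nr, nf = a + b, c + d                   # reference / focal margins
        a_sum += a
        e_sum += nr * m1 / n                    # E(A_k) under no DIF
        v_sum += nr * nf * m1 * m0 / (n * n * (n - 1))

    alpha = num / den if den > 0 else float("inf")
    chi2 = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum if v_sum > 0 else 0.0
    return alpha, chi2
```

Under no DIF the odds ratio alpha is near 1 and the chi-square is small; large values flag the item for review.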

A simulation study of item bias using a two-parameter item response model

Applied psychological …, 1985

Possible underlying causes of item bias were examined using a simulation procedure. Data sets were generated to conform to specified factor structures and mean factor scores. Comparisons between the item parameters of various data sets were made with one data set representing the "majority" group and another data set representing the "minority" group. Results indicated that items that required a secondary ability, on which two groups differed in mean level, were generally more biased than those items that did not require a secondary ability. Items with different factor structures in two groups were not consistently identified as more biased than those having similar factor structures. A substantial amount of agreement was found among the bias indices used in the study.
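The data-generation machinery such a study relies on is a two-parameter logistic (2PL) model. A minimal sketch with hypothetical item parameters and group means (the D = 1.7 scaling constant is conventional; these are not the study's actual specifications):

```python
import math
import random

def p_2pl(theta, a, b):
    """Probability of a correct response under the two-parameter
    logistic model, with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def simulate(n, mean, items, rng):
    """0/1 responses for n examinees with ability ~ N(mean, 1)
    on a list of (a, b) item parameter pairs."""
    data = []
    for _ in range(n):
        theta = rng.gauss(mean, 1.0)
        data.append([1 if rng.random() < p_2pl(theta, a, b) else 0
                     for a, b in items])
    return data

rng = random.Random(0)
items = [(1.0, 0.0), (1.2, -0.5), (0.8, 1.0)]   # hypothetical (a, b) pairs
majority = simulate(1000, 0.0, items, rng)       # reference group, mean 0
minority = simulate(1000, -0.5, items, rng)      # focal group, mean -0.5
```

Item parameters estimated separately in the two data sets can then be compared, with the mean-ability gap (here 0.5 SD) standing in for a real group difference.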

Detecting Biased Items When Developing a Scale: a Quantitative Approach

2018

In survey research, it is well known that the quality of responses is significantly altered by apparently trivial variations in the linguistic or grammatical properties of survey items. Yet numerous seemingly minor changes are made to survey items in the course of the scale development process so that they comply with other requirements (e.g., content validity). As a result, researchers may inadvertently introduce systematic measurement error that is not accounted for in the final model. Remedies to this problem are widely known, but reliable methods to diagnose it do not readily exist. In an effort to address this shortcoming, we develop a quantitative method to detect biased items and reinforce the reliability of IS measurement instruments. In this paper, we provide step-by-step implementation guidelines and show how to apply the method and interpret the output results.

A Catalog of Biases in Questionnaires

Bias in questionnaires is an important issue in public health research. To collect the most accurate data from respondents, investigators must understand and be able to prevent or at least minimize bias in the design of their questionnaires. This paper identifies and categorizes 48 types of bias in questionnaires based on a review of the literature and offers an example of each type. The types are categorized according to three main sources of bias: the way a question is designed, the way the questionnaire as a whole is designed, and how the questionnaire is administered. This paper is intended to help investigators in public health understand the mechanism and dynamics of problems in questionnaire design and to provide a checklist for identifying potential bias in a questionnaire before it is administered.

Reversed item bias: An integrative model

Psychological Methods, 2013


Efforts Toward the Development of Unbiased Selection and Assessment Instruments

1977

Investigations into item bias provide an empirical basis for the identification and elimination of test items which appear to measure different traits across populations or cultural groups. The psychometric rationales for six approaches to the identification of biased test items are reviewed: (1) Transformed item difficulties: within-group p-values are standardized and compared between groups. (2) Analysis of variance: bias is operationally defined in terms of significant item-by-group interaction effects. (3) Chi-square: individual items are investigated in terms of between-group, score-level differences in expected and observed proportions of correct responses. (4) Item characteristic curve theory: differences in the probabilities of a correct response, given examinees of the same underlying ability and different culture groups, are evaluated. (5) Factor analysis: item bias is investigated in terms of culture-specific and culture-common sources of variance, or in terms of loadings on a biased test factor. …
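The first approach, transformed item difficulties, is commonly implemented as a delta plot: p-values are mapped through an inverse-normal transform, and items far from the principal axis of the between-group scatter are flagged. A sketch under that assumption (the transform and axis formulas are standard; the code is illustrative, not from the paper):

```python
import math
from statistics import NormalDist

def delta_values(p_values):
    """Inverse-normal (delta) transform of classical p-values onto a
    scale with mean 13 and SD 4: delta = 4 * z(1 - p) + 13."""
    nd = NormalDist()
    return [4.0 * nd.inv_cdf(1.0 - p) + 13.0 for p in p_values]

def delta_plot_distances(p_group1, p_group2):
    """Signed perpendicular distance of each item from the principal
    (major) axis of the between-group delta scatter; large |distance|
    suggests potential bias. Assumes the delta covariance is nonzero."""
    d1, d2 = delta_values(p_group1), delta_values(p_group2)
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    s1 = sum((x - m1) ** 2 for x in d1) / n
    s2 = sum((y - m2) ** 2 for y in d2) / n
    s12 = sum((x - m1) * (y - m2) for x, y in zip(d1, d2)) / n
    # slope and intercept of the principal axis y = a*x + b
    a = (s2 - s1 + math.sqrt((s2 - s1) ** 2 + 4 * s12 ** 2)) / (2 * s12)
    b = m2 - a * m1
    return [(a * x - y + b) / math.sqrt(a * a + 1) for x, y in zip(d1, d2)]
```

When the two groups have identical p-values the points fall exactly on the axis and every distance is zero; a group difference in overall ability shifts the axis, while a single item off the line stands out as potentially biased.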