Using Contingency Table Approaches in Differential Item Functioning Analysis: A Comparison
Related papers
2011
This study identified biased test items through differential item functioning analysis using four contingency table approaches: Chi-Square, Distractor Response Analysis, Logistic Regression, and the Mantel-Haenszel Statistic. The study made use of test scores of 200 junior high school students. One hundred students came from a public school, and the other 100 were private school examinees. One hundred students were male and 100 were female. Based on their English II grades, 95 students were classified as low ability and 105 as high ability. A researcher-constructed and validated Chemistry Achievement Test was used as the research instrument. The results from the four methods were compared, and it was found that school type, gender, and English ability biases exist. There was a high degree of agreement between the Logistic Regression and the Mantel-Haenszel Statistic in identifying biased test items.
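As an illustration of the contingency table machinery behind the Mantel-Haenszel Statistic, the sketch below (made-up counts, not the study's data) computes the continuity-corrected MH chi-square for one item from 2x2 tables of group by item response, stratified by total test score.

```python
# Minimal sketch of the Mantel-Haenszel chi-square for one dichotomous item.
# Counts are hypothetical; rows are reference/focal group, columns correct/incorrect.
import numpy as np

def mantel_haenszel_chi2(tables):
    """tables: list of 2x2 arrays [[A_k, B_k], [C_k, D_k]], one per score stratum."""
    a_sum, exp_sum, var_sum = 0.0, 0.0, 0.0
    for t in tables:
        a, b = t[0]
        c, d = t[1]
        n = a + b + c + d
        if n <= 1:
            continue
        row1, col1 = a + b, a + c
        exp_sum += row1 * col1 / n                                      # E(A_k)
        var_sum += row1 * (c + d) * col1 * (b + d) / (n**2 * (n - 1))   # Var(A_k)
        a_sum += a
    # Continuity-corrected MH chi-square with 1 degree of freedom
    return (abs(a_sum - exp_sum) - 0.5) ** 2 / var_sum

# Hypothetical counts for one item across three total-score strata
tables = [np.array([[30, 10], [20, 20]]),
          np.array([[25, 15], [22, 18]]),
          np.array([[35, 5],  [30, 10]])]
print(mantel_haenszel_chi2(tables))  # compare with the chi-square(1) cutoff, e.g. 3.84
```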
World Journal of Educational Research
This study looked into differentially functioning items in a Chemistry Achievement Test. It also examined the effect of eliminating differentially functioning items on the content and concurrent validity, and internal consistency reliability of the test. Test scores of two hundred junior high school students matched on school type were subjected to Differential Item Functioning (DIF) analysis. One hundred students came from a public school, while the other 100 were private school examinees. The descriptive-comparative research design utilizing differential item functioning analysis and validity and reliability analysis was employed. The Chi-Square, Distractor Response Analysis, Logistic Regression, and the Mantel-Haenszel Statistic were the methods used in the DIF analysis. A six-point scale ranging from inadequate to adequate was used to assess the content validity of the test. Pearson r was use...
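The abstract mentions internal consistency reliability; a common index for dichotomously scored tests is KR-20. The sketch below (simulated responses, not the study's data) shows the computation, which could be repeated before and after removing DIF items to observe the effect described above.

```python
# Minimal sketch of the KR-20 internal-consistency estimate for a 0/1-scored test.
import numpy as np

def kr20(responses):
    """responses: examinees x items matrix of 0/1 item scores."""
    X = np.asarray(responses, dtype=float)
    k = X.shape[1]
    p = X.mean(axis=0)                      # item difficulties (proportion correct)
    q = 1.0 - p
    total_var = X.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - p.dot(q) / total_var)

# Simulated responses driven by a single ability factor, so reliability is non-trivial
rng = np.random.default_rng(0)
theta = rng.normal(0, 1, 200)                    # 200 examinees
b = rng.normal(0, 1, 40)                         # 40 item difficulties
prob = 1 / (1 + np.exp(-(theta[:, None] - b)))   # Rasch-like response probabilities
demo = (rng.random((200, 40)) < prob).astype(int)
print(kr20(demo))
```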
Journal of Education and Practice
This study identified biased test items through DIF analysis using Logistic Regression and the Mantel-Haenszel Statistic. Biases were confirmed using LDA based on FGDs and interviews with teachers and students. The study made use of test scores from 99 male and 101 female grade 8 students, of whom 108 were classified as low ability and 92 as high ability based on their current English grades. A researcher-constructed and validated Statistics Achievement Test, based on Grades 7 and 8 Philippine K-12 competencies, was used as the research instrument. The results from the two methods were compared, and it was found that sex and English ability biases exist. LDA revealed that the bias favors females and the high-ability group in English, which is associated with their capacity for memorization and retention of topics taught procedurally. Recommendations include (1) the incorporation of DIF analysis in test development; (2) the use of at least two methods in item bias detection; (3) the conduct of LDA, or the qualitative component of DIF analysis, which is vital in understanding DIF and accounting for context specificity; and (4) for future research, the incorporation of classroom observation as a basis for LDA in DIF justification.
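For the Logistic Regression method, DIF is usually tested by comparing nested models. The sketch below (simulated data, not the study's scores) contrasts a matching-only model with models adding a group term (uniform DIF) and a group-by-score interaction (non-uniform DIF).

```python
# Minimal sketch of logistic-regression DIF for one item via nested model comparison.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total": rng.integers(10, 50, n),   # matching criterion (e.g., total test score)
    "group": rng.integers(0, 2, n),     # 0 = reference group, 1 = focal group
})
# Illustrative item responses driven only by the matching score (little true DIF)
p = 1 / (1 + np.exp(-(df["total"] - 30) / 5))
df["item"] = rng.binomial(1, p)

m1 = smf.logit("item ~ total", data=df).fit(disp=0)                        # matching only
m2 = smf.logit("item ~ total + group", data=df).fit(disp=0)                # + uniform DIF term
m3 = smf.logit("item ~ total + group + total:group", data=df).fit(disp=0)  # + non-uniform DIF term

# Likelihood-ratio tests: uniform DIF (m2 vs m1) and non-uniform DIF (m3 vs m2)
lr_uniform = 2 * (m2.llf - m1.llf)
lr_nonuniform = 2 * (m3.llf - m2.llf)
print(stats.chi2.sf(lr_uniform, df=1), stats.chi2.sf(lr_nonuniform, df=1))
```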
American Journal of Educational Research, 2019
Differential Item Functioning (DIF) is a statistical method that determines whether test measurements distinguish abilities by comparing the outcomes of two sub-populations on an item. The Mantel-Haenszel (MH) and Logistic Regression (LR) statistics provide effect size measures that quantify the magnitude of DIF. The purpose of the study was to investigate, through simulation, the effects of sample size, ability distribution and test length on the number of DIF detections using the MH and LR methods. A factorial research design was used in the study. The population of the study consisted of 2000 examinee responses. A stratified random sampling technique was used with the stratifying criteria as the reference (r) and focal (f) groups. Small sample sizes (20r/20f), (60r/60f) and a large sample size (1000r/1000f) were established. WinGen3 statistical software was used to generate dichotomous item response data. The average effect sizes were obtained for 1000 replications. The number of DIF items were use...
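As a concrete illustration of the MH effect size mentioned above, the sketch below (made-up counts) computes the common odds ratio, converts it to the ETS delta metric, and applies the usual A/B/C classification cutoffs; the thresholds shown ignore the significance-test component of the full ETS rule.

```python
# Minimal sketch of the Mantel-Haenszel effect size (common odds ratio and ETS delta).
import numpy as np

def mh_delta(tables):
    """tables: 2x2 counts per score stratum, rows = reference/focal, cols = correct/incorrect."""
    num, den = 0.0, 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha = num / den                  # common odds ratio
    delta = -2.35 * np.log(alpha)      # ETS delta metric
    return alpha, delta

def ets_class(delta):
    # Rule-of-thumb magnitude bands (A = negligible, B = moderate, C = large)
    if abs(delta) < 1.0:
        return "A (negligible)"
    if abs(delta) < 1.5:
        return "B (moderate)"
    return "C (large)"

tables = [((30, 10), (20, 20)), ((25, 15), (22, 18)), ((35, 5), (30, 10))]
alpha, delta = mh_delta(tables)
print(alpha, delta, ets_class(delta))
```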
Assessment of Gender-related Differential Item Functioning of a Teacher-Made Chemistry Test
2019
A good item that measures the intended domain is expected to be free of bias. However, several studies have confirmed that some items in a test reveal biases related to the group to which testees belong. A generally accepted analytical technique for discovering bias in test items is Differential Item Functioning (DIF), which Item Response Theory (IRT) offers to check for differences in psychometric properties across the groups to which testees belong. Thus, this study used the DIF technique to detect gender-biased items in a teacher-made Chemistry test. BILOG-MG was employed using 350 Senior Secondary Two (SS II) students (183 males and 167 females) randomly drawn from schools in Obio/Akpor Local Government Area of Port Harcourt, Rivers State, Nigeria. The study showed that out of one hundred items, fifty-three were biased: 26 (49.1%) of the 53 favoured females while 27 (50.9%) favoured males, which confirmed bias. DIF is effective in detecting group biases in test items. The study concluded that Differential Item Functioning should always be used by scale developers before collating the final items for a test.
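The abstract does not report the exact IRT-based DIF criterion; as one common illustration (not BILOG-MG output, and with hypothetical estimates), the sketch below tests whether an item's estimated difficulty differs between gender groups once the two calibrations are on a common scale.

```python
# Minimal sketch of a difficulty-parameter comparison between two groups for one item.
import math
from scipy import stats

def b_difference_test(b_ref, se_ref, b_foc, se_foc):
    """z-test on the difference of estimated item difficulties (common scale assumed)."""
    z = (b_ref - b_foc) / math.sqrt(se_ref**2 + se_foc**2)
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Hypothetical estimates for one item (male group vs female group)
z, p = b_difference_test(b_ref=0.40, se_ref=0.12, b_foc=0.05, se_foc=0.11)
print(z, p)  # a significant difference suggests the item is harder for one group
```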
Northwest Evaluation Association, 2004
Data sets. For each grade level, 10,000 test records were randomly selected from the pool of complete ISAT records. This represented approximately 57, 55, and 61 percent of the available complete test records in grades 4, 8, and 10, respectively. A 'complete' test record was defined as one in which the student had taken all three content area tests and contained recognized ethnicity and gender codes.

Two procedures were applied to this initial record set to ensure the integrity of the test scores. First, all records were scanned to determine whether the test had been terminated and then resumed at any point during the administration of the fixed portion of the test. If so, the record was eliminated. This was done to minimize the effects of test items or answers being shared with students between the time their test was terminated (temporarily paused) and the time it was resumed. Second, content area test records were also eliminated when their scores would not be considered valid under NWEA's standard scoring rules. Under these rules, a content area test record was eliminated when the proportion correct was less than chance (.25 for reading/language usage and .20 for mathematics) plus .05, or was equal to or greater than .92. This procedure eliminated between 3.6% (Grade 8, Language) and 10.9% (Grade 10, Mathematics) of the remaining test records in each content area set.

Ethnic group membership was used to define the groups for one set of DIF analyses. The minimum number of subgroup members for an analysis was set at 300. Under most circumstances, this is an adequate number of students in each grade level to allow stable estimates of item difficulties to be calculated. Only two ethnic groups (Caucasian and Hispanic) had an adequate number of members at each grade level when considering the entire state population. Thus, for the ethnic group analyses, only Hispanic and non-Hispanic groups were formed. Records with ethnic group shown as African American, Asian/Pacific Islander, Native American/Alaskan Native, or Caucasian were selected for the non-Hispanic group. Records with ethnic group indicated as 'unknown' in the database were not included. In all analyses involving ethnicity, Hispanic students were considered the Focal group and non-Hispanic students were considered the Reference group.

The second set of DIF analyses used gender to define the groups. In all analyses involving gender, female students were considered the Focal group and male students were considered the Reference group.

The resulting numbers of content area test records by student grade, ethnicity, and gender appear in Table 1. As the table shows, Language Usage tests had the greatest percentage of tests with valid scores after the data integrity procedures were applied. In mathematics, across all three grades, approximately 15 percent of the test records were excluded. Test records from Hispanic students closely followed their pattern of representation in the general population, with the highest percentage being in grade 4. Language usage was the only area where the percentage of Hispanic students with valid test records slightly exceeded the percentage of Hispanic students statewide. For Reading and Mathematics, Hispanic students were slightly underrepresented.
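A minimal sketch (assumed function shape and labels, not NWEA code) of the validity screen described above: a content-area record is kept only when the proportion correct is at least chance plus .05 and below .92.

```python
# Minimal sketch of the proportion-correct validity screen described in the excerpt.
def keep_record(prop_correct, content_area):
    """Return True when the record would be retained under the stated rule."""
    chance = 0.20 if content_area == "mathematics" else 0.25  # reading/language usage
    return (chance + 0.05) <= prop_correct < 0.92

print(keep_record(0.28, "mathematics"))      # True: above chance + .05 and below .92
print(keep_record(0.22, "reading"))          # False: below .25 + .05
print(keep_record(0.95, "language usage"))   # False: at or above .92
```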
The comparison of differential item functioning predicted through experts and statistical techniques
Cypriot Journal of Educational Sciences
Validity is one of the psychometric properties of achievement tests. One way to examine validity is through item bias studies, which are based on Differential Item Functioning (DIF) analyses and the opinions of field experts. In this study, field experts were asked to estimate the DIF levels of the items so that their estimations could be compared with those obtained from different statistical techniques. First, the experts examined the questions and estimated the DIF levels according to the gender variable, and the agreement among the experts was examined. Second, DIF levels were calculated using logistic regression and the Mantel-Haenszel test. Third, the experts' estimations and the statistical analysis results were compared. In conclusion, it was observed that the experts and the statistical techniques were in agreement among themselves, and that they differed partially from each other for the Sciences test and were equal for the Social Sciences test.
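The abstract does not specify how agreement was quantified; as one plausible illustration (made-up A/B/C labels), the sketch below computes Cohen's kappa between expert DIF-level judgments and a statistical classification of the same items.

```python
# Minimal sketch of Cohen's kappa between two sets of categorical DIF-level ratings.
import numpy as np

def cohens_kappa(r1, r2, labels):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    idx = {lab: i for i, lab in enumerate(labels)}
    table = np.zeros((len(labels), len(labels)))
    for a, b in zip(r1, r2):
        table[idx[a], idx[b]] += 1
    po = np.trace(table) / n                                   # observed agreement
    pe = np.sum(table.sum(axis=1) * table.sum(axis=0)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

experts     = ["A", "A", "B", "C", "A", "B", "C", "A"]   # hypothetical expert labels
statistical = ["A", "B", "B", "C", "A", "A", "C", "A"]   # hypothetical statistical labels
print(cohens_kappa(experts, statistical, labels=["A", "B", "C"]))
```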
Educational Measurement and Evaluation Studies, 2017
Measuring examinee performance while exerting statistical control over ability may reveal different results in different gender groups. In this case, gender-related differential item functioning (DIF) and differential test functioning (DTF), as well as item/test bias, are likely to occur. The main purpose of this research was to investigate gender-related DIF/DTF and gender-related bias in national entrance exams in Iran. Specific tests from a test booklet in five experimental groups participating in the national entrance exams from 2008 to 2011 were chosen through a one-stage cluster sampling method. Logistic regression and item response theory were then used to investigate DIF and DTF, respectively. The results showed that, on average, about 14% of the investigated test items had gender-related DIF with negligible effect size (EF < 0.0001), and about 2% of them were biased against females or males based on experts' viewpoints. Concerning the DTF analysis, it was shown that, except for the Dramatic Creativity Test of the Art Group, which was slightly biased against females, the other tests were not characterized by DTF.
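The abstract reports IRT-based DTF without detailing the index used; as a hedged illustration (hypothetical 2PL parameters), the sketch below compares the expected total score implied by reference- and focal-group parameter estimates across the ability range, the quantity that common DTF indices summarize.

```python
# Minimal sketch of inspecting DTF via the gap between group test characteristic curves.
import numpy as np

def tcc(theta, a, b):
    """Expected total score under a 2PL model for an array of ability values."""
    theta = np.asarray(theta)[:, None]
    p = 1 / (1 + np.exp(-1.7 * a * (theta - b)))   # 1.7 = usual scaling constant
    return p.sum(axis=1)

theta = np.linspace(-3, 3, 61)
a_ref, b_ref = np.array([1.0, 0.8, 1.2]), np.array([-0.5, 0.0, 0.6])
a_foc, b_foc = np.array([1.0, 0.8, 1.2]), np.array([-0.3, 0.1, 0.9])  # shifted difficulties

dtf_gap = tcc(theta, a_ref, b_ref) - tcc(theta, a_foc, b_foc)
print(dtf_gap.max())  # a large gap at some ability level signals DTF
```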
Hazrat-e Masoumeh University, 2024
The present mixed-method study aimed at investigating the presence of Differential Item Functioning (DIF) and Differential Skill Functioning (DSF) in a high-stakes language proficiency test in Iran, the English Proficiency Test (EPT), across different academic backgrounds (i.e., Humanities vs. Sciences & Engineering) using Item Response Theory (IRT) and Mantel-Haenszel (MH) approaches. It also aimed at detecting whether there is any correlation between the IRT and MH methods and which DIF items are biased. The English subtest consisted of a total of 100 items. The participants (N = 642) were selected by convenience sampling from universities in Tehran. The results displayed DIF between Sciences and Humanities students, but they did not show DSF in favor of a particular academic discipline group. Hence, on the basis of the findings, it was concluded that the EPT scores are not free of construct-irrelevant variance, but because the number of DIF-detected items (14 out of 100) was small, the overall fairness of the test was confirmed. In addition, it was found that some of the DIF-detected items were biased, while others functioned differently simply because the two groups differed in their abilities. A positive correlation between the IRT and MH methods was also confirmed.
Comparison of four differential item functioning procedures in the presence of multidimensionality
Educational Research and Reviews, 2016
Differential item functioning (DIF), or item bias, is a relatively new concept. It has been one of the most controversial and most studied subjects in measurement theory. DIF occurs when people who have the same ability level but come from different groups have a different probability of a correct response. According to Item Response Theory (IRT), DIF occurs when the item characteristic curves (ICCs) of two groups are not identical or do not have the same item parameters after rescaling. DIF might also occur when the latent ability space is misspecified: when the groups have different multidimensional ability distributions and the test items are chosen to discriminate among these abilities, unidimensional scoring might flag items as DIF items. The purpose of this study was to compare four DIF procedures, the Mantel-Haenszel (MH), the Simultaneous Item Bias Test (SIBTEST), the IRT-based area measure (AREA), and Logistic Regression (LR), when the underlying ability distribution is erroneously assumed to be homogeneous. To illustrate the effect on the DIF procedures of assuming a homogeneous ability distribution for groups that differ in their underlying multidimensional ability levels, two different data sets were generated using the 2PL model: one set in which DIF occurs and one set in which no DIF occurs. The UNIGEN program was used to generate the data. Each of the data sets contained 1000 examinees and 25 items. Item parameters were chosen to be capable of measuring a two-dimensional ability distribution of the two groups. The MH, SIBTEST, AREA, and LR procedures were applied to the data both with DIF and without DIF. The study showed that all four methods identified items as biased when the ability space was misspecified.
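The sketch below is not the UNIGEN code; it is a minimal illustration of generating dichotomous responses under a 2PL model for a data set of the size the study describes (1000 examinees, 25 items). DIF could then be induced by shifting difficulty parameters for the focal group on selected items.

```python
# Minimal sketch of simulating dichotomous item responses under a 2PL model.
import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_items = 1000, 25
theta = rng.normal(0.0, 1.0, n_examinees)              # unidimensional ability used here
a = rng.lognormal(mean=0.0, sigma=0.3, size=n_items)   # item discriminations
b = rng.normal(0.0, 1.0, n_items)                      # item difficulties

# 2PL response probabilities and simulated 0/1 responses
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
responses = (rng.random((n_examinees, n_items)) < p).astype(int)
print(responses.shape, responses.mean())
```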