Hossein Karami | University of Tehran

Books by Hossein Karami

Fairness Issues in Educational Assessment

Papers by Hossein Karami

The Development and Validation of a Bilingual Version of the Vocabulary Size Test

RELC Journal, 2012

This paper reports an attempt to develop and validate a bilingual Persian version of the Vocabulary Size Test (VST). Due to the particular educational system in Iran, there is a dire need for a test that can effectively estimate English learners’ vocabulary sizes. Previous research (Nguyen & Nation, 2011) has indicated that bilingual versions of the VST can be more efficient than the monolingual one. A calibration of the Persian version of the test with 190 English learners indicated that the test enjoys a high level of validity and reliability. The results of a factor analysis revealed that a single construct, presumably word knowledge, underlies the test. A one-way between-subjects ANOVA also indicated that the test can effectively distinguish between different proficiency levels. The hypothesized difficulty order was also realized in the data, though it was found that clusters of 1,000-word levels provide more meaningful difficulty levels as they are less susceptible to the idiosyncrasies at each 1,000 level. The results also ran against the common assumption in the literature that not all test takers should sit the entire test: administering the whole test leads to a more valid estimate of the examinees’ vocabulary sizes.
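The score-to-size mapping behind the VST can be sketched in a few lines. The sketch below assumes the standard 14-level design (10 items per 1,000-word-family frequency level, each correct answer standing for 100 word families); the example learner's level scores are hypothetical.

```python
def vst_size_estimate(correct_per_level):
    """Estimate vocabulary size (in word families) from VST level scores.

    correct_per_level: one count (0-10) per 1,000-word frequency level.
    Assumes each correct answer represents 100 word families.
    """
    return sum(correct_per_level) * 100

# A hypothetical learner: strong at the high-frequency levels,
# progressively weaker further down the frequency list.
scores = [10, 9, 9, 8, 7, 6, 4, 3, 2, 1, 1, 0, 0, 0]
print(vst_size_estimate(scores))  # 6000 word families
```

A ceiling score across all 14 levels would map to the full 14,000 word families, which is why administering the whole test, rather than stopping early, matters for the size estimate.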

Developing and Validating a New Version of an EFL Multiple-Choice Reading Comprehension Test Based on Fuzzy Logic

Multiple-choice tests do not assess examinees’ knowledge in accord with reality: examinees’ partial knowledge is not assessed. Offering a new approach to the assessment of reading comprehension within the framework of fuzzy logic, this study aims to measure this partial knowledge. In this approach, participants must select every option that is correct given the stem, so the correct answer to each question can range from one option to all of the options. For the first session, an expository and an argumentative text were used; for the second session, the reading section of a TOEFL test was used. The results showed that the approach is fairer, as it credits the partial knowledge of the examinees, which is ignored in common multiple-choice tests. Also, the use of idea units as the units comprising a text gives us clues regarding the degree of difficulty of different parts of a text, as well as clues about why misunderstanding may occur of the same tex...
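The abstract does not specify a scoring rule, but a partial-credit rule in the spirit it describes, crediting each correct option selected, might be sketched as follows. The penalty term and the option labels are assumptions for illustration only, not the study's actual rubric.

```python
def fuzzy_mc_score(selected, key):
    """Hypothetical partial-credit score in [0, 1] for a
    select-all-that-apply item: one credit per correct option
    chosen, one penalty per incorrect option chosen, floored at 0
    and normalized by the number of correct options."""
    selected, key = set(selected), set(key)
    hits = len(selected & key)
    false_alarms = len(selected - key)
    return max(hits - false_alarms, 0) / len(key)

# A learner with partial knowledge picks two of three correct options:
print(fuzzy_mc_score({"a", "c"}, {"a", "b", "c"}))  # 2/3 ≈ 0.667
```

Under conventional all-or-nothing scoring this response would earn zero, which is exactly the partial knowledge the fuzzy-logic approach sets out to capture.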

Fairness Issues in Educational Assessment

Examining the validity of the Achievement Emotions Questionnaire for measuring more emotions in the foreign language classroom

Journal of Multilingual and Multicultural Development, 2020

Examining the psychometric features of the Teacher's Sense of Efficacy Scale in the English-as-a-foreign-language teaching context

Though the Teacher's Sense of Efficacy Scale (TSES) is widely used with English-as-a-foreign-language (EFL) teachers, these teachers have been left out of the widespread project of validating the scale across different fields of education. In addition, previous validation studies have mostly applied factor analysis for this purpose, which leaves some aspects of validity untouched. Thus, through factor and Rasch analyses of data from 435 Iranian EFL teachers, the present study was undertaken to provide evidence on the validity of using the TSES with EFL teachers. Both confirmatory factor analysis and Rasch modeling confirmed the claimed three-factor structure of the TSES; however, Rasch analysis identified a misfitting item and frequent problems with the functioning of the response categories. In search of an alternative rating scale for the TSES, Rasch analysis supported the validity of a revised rating scale with five categories. Both traditional reliability and Rasch-specific indices showed that the revised rating scale was very similar to the original rating scale of the TSES in terms of measurement precision.

Validation of a bilingual version of the vocabulary size test: comparison with the monolingual version

This study set out to cross-validate a bilingual Persian-English version of the Vocabulary Size Test (VST) against the monolingual English version and to compare Iranian EFL learners' performance on the two versions. Various bilingual versions of the VST have been developed on the assumption that bilingual versions are not affected by the grammar and reading demands of the long options in the monolingual version. To serve the purposes of the study, the Persian-English and monolingual versions of the VST were administered to 116 Iranian EFL learners. Results indicated that a single dimension underlay both versions, and a combined factor analysis indicated that both versions assessed the same construct. Further, both versions were capable of distinguishing learners of varying English proficiency levels, and there was a rough order of difficulty across frequency levels in both of them. Separate paired-samples t-tests revealed that the low- and mid-proficiency groups performed differently on the two versions of the VST: their L2 vocabulary knowledge was significantly underestimated by the monolingual version, as shown by the mean differences between the two versions for these two groups. In contrast, the high-proficiency group showed no such differential performance, as the mean difference between the two versions did not reach statistical significance for this group. Hence, it is argued that advanced learners are competent enough that their performance is not affected by the grammar and reading demands of the long options in the monolingual version.
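A paired-samples t-test of the kind used here compares the same learners' scores on the two test versions via the distribution of within-person differences. The sketch below uses only the standard library; the score lists are hypothetical, not the study's data.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for two
    matched lists of scores (same persons, two conditions)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    # t = mean difference / standard error of the mean difference
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

# Hypothetical scores for eight learners on the two VST versions:
bilingual   = [82, 75, 68, 90, 71, 64, 77, 80]
monolingual = [78, 70, 65, 84, 66, 61, 74, 75]
t, df = paired_t(bilingual, monolingual)
print(round(t, 2), df)
```

A t value beyond the critical value for the given degrees of freedom (here, 2.365 for df = 7 at the two-tailed .05 level) would indicate that the monolingual version systematically underestimates these learners' vocabulary knowledge.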

The impact of background knowledge on test performance: A multivariate G-theory approach

Validity has been declared the single most important consideration in language testing and educational measurement. Messick (1989) famously identified construct-irrelevant variance as one of the major threats to test validity. Hence, test users ought to make every possible effort to ensure that test scores are not unduly affected by construct-irrelevant factors. This study applied multivariate Generalizability Theory to examine the effect of academic background on test scores obtained from the Iranian national university entrance exam. The results revealed that the relative contributions of the various sources of variance were not the same across the academic background groups. In addition, dependability indices were significantly different across the groupings. Furthermore, the decision studies revealed that the groups do not necessarily need to take the same number of items for a high reliability to be obtained. Overall, the results indicate that academic background exerts a remarkable influence on the dependability of the scores.
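The variance decomposition at the heart of a G study can be sketched for the simplest case, a univariate crossed person-by-item (p × i) design; the paper's multivariate design generalizes this. The simulated scores and variance magnitudes below are assumptions for illustration.

```python
import numpy as np

# Simulate a crossed p x i design: score = person + item + residual.
rng = np.random.default_rng(42)
n_p, n_i = 200, 30
person = rng.normal(0, 1.0, n_p)          # universe-score effects
item = rng.normal(0, 0.5, n_i)            # item difficulty effects
X = person[:, None] + item[None, :] + rng.normal(0, 0.7, (n_p, n_i))

# Two-way ANOVA mean squares.
grand = X.mean()
ms_p = n_i * ((X.mean(1) - grand) ** 2).sum() / (n_p - 1)
ms_i = n_p * ((X.mean(0) - grand) ** 2).sum() / (n_i - 1)
resid = X - X.mean(1, keepdims=True) - X.mean(0, keepdims=True) + grand
ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))

# Estimated variance components.
var_pi = ms_res                       # person-by-item interaction + error
var_p = (ms_p - ms_res) / n_i         # universe-score (person) variance
var_i = (ms_i - ms_res) / n_p         # item variance

# Generalizability (relative) vs. dependability (absolute) coefficient:
# the dependability index Phi also charges item variance against the score.
rho2 = var_p / (var_p + var_pi / n_i)
phi = var_p / (var_p + (var_i + var_pi) / n_i)
print(round(rho2, 3), round(phi, 3))
```

A decision study then simply re-evaluates these coefficients with different values of `n_i`, which is how one can show that groups need not answer the same number of items to reach a target dependability.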

Exploratory Factor Analysis as a construct validation tool: (mis)applications in applied linguistics research.

Factor analysis has frequently been exploited in applied research to provide evidence about the factors underlying various measurement instruments. A close inspection of a large number of studies published in leading applied linguistics journals shows that there is a misconception among applied linguists as to the relative merits of exploratory factor analysis (EFA) and principal components analysis (PCA) and the kinds of interpretations that can be drawn from each method. In addition, it is argued that the widespread application of orthogonal, rather than oblique, rotations, and the criteria used for factor selection, are not in keeping with the findings in psychometrics. It is further argued that the current situation is partly due to the fact that PCA and orthogonal rotation are the default options in mainstream statistical packages such as SPSS, and that the guidebooks on such software do not explain the issues discussed in this article.
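The EFA/PCA distinction at issue can be shown numerically: PCA eigendecomposes the full correlation matrix (unit diagonal, so unique variance is folded into the components), whereas principal-axis factoring analyzes a reduced matrix with communality estimates on the diagonal, yielding smaller, less inflated loadings. The toy correlation matrix below is hypothetical.

```python
import numpy as np

# Hypothetical correlations among four test items.
R = np.array([
    [1.00, 0.60, 0.55, 0.50],
    [0.60, 1.00, 0.58, 0.52],
    [0.55, 0.58, 1.00, 0.56],
    [0.50, 0.52, 0.56, 1.00],
])

def first_loadings(M):
    """Loadings on the first component/factor of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    return np.abs(vecs[:, -1]) * np.sqrt(vals[-1])

# PCA: analyze R as-is, unit diagonal included.
pca_loadings = first_loadings(R)

# Principal-axis factoring: put communality estimates on the diagonal
# and iterate until the communalities stabilize.
h2 = np.max(np.abs(R - np.eye(4)), axis=0)   # initial estimates
for _ in range(100):
    Rr = R.copy()
    np.fill_diagonal(Rr, h2)
    paf_loadings = first_loadings(Rr)
    new_h2 = paf_loadings ** 2
    if np.max(np.abs(new_h2 - h2)) < 1e-8:
        break
    h2 = new_h2

print(pca_loadings.round(3))
print(paf_loadings.round(3))
```

The PCA "loadings" come out uniformly larger, which is one reason interpreting PCA output as if it were EFA output, as the article documents, overstates how well items measure the latent construct.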

Review of “Rasch analysis in the human sciences” by Boone, W. J., Staver, J. R., & Yale, M. S. (2014)

The Quest for Fairness in Language Testing

The search for fairness in language testing is distinct from other areas of educational measurement because the object of measurement, language, is part of the identity of the test takers. A host of issues thus enter the scene when one starts to reflect on how to assess people's language abilities. As the quest for fairness in language testing is still in its infancy, even the need for such research has been controversial, with some (e.g., Davies, 2010) arguing that such research is entirely in vain. This paper provides an overview of some of the issues involved. Special attention is given to critical language testing (CLT), as it has had a large impact on language testing research. It is argued that although CLT has been very effective in revealing the ideological and value implications of the constructs of focus in language testing, extremism in this direction is not justified.

Guest Editorial: Fairness Issues in Educational Assessment

An investigation of the gender differential performance on a high-stakes language proficiency test in Iran

There is a growing consensus among educational measurement experts and psychometricians that test taker characteristics may unduly affect performance on tests. This may introduce construct-irrelevant variance into the scores and thus render the test biased. Hence, it is incumbent on test developers and users alike to provide evidence that their tests are free of such bias. The present study exploited Generalizability Theory to examine gender differential performance on a high-stakes language proficiency test, the University of Tehran English Proficiency Test (UTEPT). An analysis of the performance of 2,343 examinees who had taken the test in 2009 indicated that the relative contributions of different facets to score variance were almost uniform across the gender groups. Further, there was no significant interaction between items and persons, indicating that the relative standings of the persons were uniform across all items. The lambda reliability coefficients were also uniformly high. All in all, the study provides evidence that the test is free of gender bias and enjoys a high level of dependability.

The Relative Impact of Persons, Items, Subtests, and Academic Background on Performance on a Language Proficiency Test

This study exploited generalizability theory to explore the impact of persons, items, subtests, and academic background on the dependability of scores from a high-stakes language proficiency test, the University of Tehran English Proficiency Test (UTEPT). To this end, and following Brown (1999), three questions were posed: 1. What are the distributional characteristics and CTT reliability of UTEPT test scores? 2. What are the relative contributions of persons, items, and subtests to the dependability of scores for each group and for all the groups combined? 3. What are the relative contributions of persons, items, subtests, and academic background, as well as their various interactions, to the dependability of the scores when all groups are combined? To investigate these issues, 5,795 examinees from four different academic backgrounds were selected from among all the participants who had taken the test in 2004. The results indicated that the relative contributions of the facets were not stable across all groups, though highly similar. In addition, with academic background added as a facet, there was no significant interaction between items and fields, and the dependability of the scores did not decrease either. This result shows that background knowledge does not lead to bias in the UTEPT. This use of G-theory could profitably be extended to other measurement situations.

A Rasch calibration of the university entrance exam

An introduction to Differential Item Functioning

Differential Item Functioning (DIF) has been increasingly applied in fairness studies in psychometric circles. Judicious application of this methodology, however, requires an understanding of the technical complexities involved, which has become an impediment especially for non-mathematically oriented researchers. This paper is an attempt to bridge the gap. It provides a non-technical introduction to the fundamental concepts involved in DIF analysis, along with introductory-level explanations of a number of the most frequently applied DIF detection techniques: Logistic Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch model. For each method, relevant software packages are also introduced.
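Of the techniques listed, Mantel-Haenszel is the easiest to sketch: examinees are stratified by total score, a 2x2 table (group by correct/incorrect) is built per stratum, and a common odds ratio is pooled across strata, then rescaled onto the ETS delta metric. The stratum counts below are hypothetical.

```python
import math

def mantel_haenszel_dif(tables):
    """Mantel-Haenszel common odds ratio and ETS delta for one item.

    tables: list of (A, B, C, D) per total-score stratum, where
    A/B = reference group correct/incorrect and
    C/D = focal group correct/incorrect.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    alpha = num / den                 # pooled (common) odds ratio
    delta = -2.35 * math.log(alpha)   # ETS delta-metric rescaling
    return alpha, delta

# Two hypothetical score strata for one item:
tables = [(30, 10, 20, 20), (40, 10, 30, 20)]
alpha, delta = mantel_haenszel_dif(tables)
print(round(alpha, 3), round(delta, 3))
```

Under the usual ETS convention, |delta| below 1 is treated as negligible DIF and |delta| of 1.5 or more (when statistically significant) as large; the made-up counts here yield a large negative delta, i.e., the item favors the reference group after matching on total score.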

Examining the effects of proficiency, gender, and task type on the use of communication strategies.

This paper reports on a study of the frequency of communication strategies (CSs), their relationship to task types, and gender differences in the use of CSs. A CS questionnaire was administered to 227 students at elementary, pre-intermediate, and intermediate levels. The results indicated that a) language proficiency does not influence the frequency of CSs, b) task type has a significant impact on the type of CS employed, and c) gender differences in the use of CSs are significant only for circumlocution, asking for clarification, omission, comprehension checks, use of fillers, and over-explicitness.

Differential Item Functioning and ad hoc interpretations.

A plethora of research studies has focused on Differential Item Functioning (DIF). Despite the diversity of DIF detection techniques on offer, little research has been done on the interpretation of DIF results. This study investigated whether there is any order to the interpretations offered for the real cause of items flagged as displaying DIF. An analysis of expert opinion showed that there is no such order. It is argued that such “ad hoc” interpretations have rendered DIF analysis of little use. It is further suggested that research should focus on devising a mechanism for basing DIF interpretations on principled grounds.

Detecting gender bias in a language proficiency test

The present study makes use of the Rasch model to investigate the presence of DIF between male and female examinees taking the University of Tehran English Proficiency Test (UTEPT). The results indicated that 19 items function differentially for the two groups; only 3 items, however, displayed DIF of practical significance. A close inspection of these items indicated that the DIF may be interpreted as impact rather than bias. Therefore, it is concluded that the presence of differentially functioning items may not render the test unfair. On the other hand, it is argued that the fairness of the test may be questioned on other grounds.

Research paper thumbnail of Fairness Issues in Educational Assessment

Research paper thumbnail of The Development and Validation of a Bilingual Version of the Vocabulary Size Test

RELC Journal, 2012

This paper reports an attempt to develop and validate a bilingual Persian version of the Vocabula... more This paper reports an attempt to develop and validate a bilingual Persian version of the Vocabulary Size Test (VST). Due to the particular educational system in Iran, there is a dire need for a test that can effectively estimate English learners’ vocabulary sizes. Previous research ( Nguyen and Nation, 2011 ) has indicated that bilingual versions of the VST can be more efficient than the monolingual one. A calibration of the Persian version of the test with 190 English learners indicated that the test enjoys a high level of validity and reliability. The results of a factor analysis revealed a single construct, presumably word knowledge, is underlying the test. A one-way between-subjects ANOVA also indicated that the test can effectively distinguish between different proficiency levels. The hypothesized difficulty order was also realized in the data though it was found that clusters of 1,000 word levels provide more meaningful difficulty levels as they are less susceptible to the idi...

Research paper thumbnail of Developing and Validating a New Version of an EFL Multiple-Choice Reading Comprehension Test Based on Fuzzy Logic

Multiple-choice tests do not assess examinees’ knowledge in accord with reality. In fact, the par... more Multiple-choice tests do not assess examinees’ knowledge in accord with reality. In fact, the partial knowledge of examinees is not assessed. Providing a new approach to the assessment of reading comprehension in the framework of fuzzy logic, this study aims to measure this partial knowledge. In this approach, participants have to choose as many correct options as there are considering the stem. Therefore, the correct answer to each question can range from one option to all options. For the first session, an expository and an argumentative genre, and for the second session, the reading section of a TOEFL test was used. The results showed that the approach is fairer as it considers the partial knowledge of the examinees while in other common multiple-choice tests this is ignored. Also, the use of idea units as the units comprising a text gives us clues regarding the degree of difficulty of different parts of a text as well as clues about why misunderstanding may occur of the same tex...

Research paper thumbnail of Fairness Issues in Educational Assessment

Research paper thumbnail of Examining the validity of the Achievement Emotions Questionnaire for measuring more emotions in the foreign language classroom

Journal of Multilingual and Multicultural Development, 2020

Research paper thumbnail of Examining the psychometric features of the Teacher's Sense of Efficacy Scale in the English-as-a-foreign-language teaching context

Though the Teacher's Sense of Efficacy Scale (TSES) is widely used with English-as-a-foreign-lang... more Though the Teacher's Sense of Efficacy Scale (TSES) is widely used with English-as-a-foreign-language (EFL) teachers, these teachers constitute a cohort left out from the widespread project to validate the scale in different fields of education. In addition, the previous validation studies have mostly applied factor analysis for this purpose, which leaves some aspects of validity untouched. Thus, through factor and Rasch analysis of the data from 435 Iranian EFL teachers, the present study was undertaken to provide evidence on the validity of using the TSES with EFL teachers. Both confirmatory factor analysis and Rasch modeling confirmed the claimed three-factor structure of the TSES; however, Rasch analysis identified a misfitting item and frequent problems with the functioning of the response categories. In search of an alternative rating scale for the TSES, Rasch analysis supported the validity of a revised rating scale with five categories. Both traditional reliability and Rasch-specific indices showed that the revised rating scale was very similar to the original rating scale of the TSES in terms of measurement precision.

Research paper thumbnail of Validation of a bilingual version of the vocabulary size test: comparison with the monolingual version

This study was set to cross-validate a bilingual Persian-English version of the Vocabulary Size T... more This study was set to cross-validate a bilingual Persian-English version of the Vocabulary Size Test (VST) against the monolingual English version and compare Iranian EFL learners' performance on the two versions. Various bilingual versions of the VST have been developed based on the assumption that bilingual versions are not affected by the grammar and reading demands of the long options in the monolingual version. To serve the purposes of the study, the Persian-English version and monolingual version of the VST were administered to 116 Iranian EFL learners. Results indicated that a single dimension was underlying both versions. Combined factor analysis indicated that both versions assessed the same construct. Further, both versions were capable of distinguishing learners of varying English proficiency levels and there was also a rough order of difficulty across frequency levels in both of them. Separate paired-samples t-tests revealed that the low-and mid-proficiency groups had differential performance on the two versions of the VST. Their L2 vocabulary knowledge was significantly underestimated by the monolingual version as shown by the mean-differences between the two versions of the VST for these two groups. In contrast, the results also revealed that the high-proficiency group in the study did not show such differential performance on the two version of the VST as the mean-difference between the two versions of the VST did not reach statistical significance for this last group. Hence, it is argued that advanced learners are competent enough so that their performance is not affected by grammar and reading demands of the long options in the monolingual version.

Research paper thumbnail of The impact of background knowledge on test performance: A multivariate G-theory approach

Validity has been declared to be the single most important consideration in language testing and ... more Validity has been declared to be the single most important consideration in language testing and educational measurement. Messick (1989) famously identified construct-irrelevant variance as one of the major threats against test validity. Hence, test users ought to make every possible effort to make sure that test scores are not unduly affected by construct-irrelevant factors. This study applied multivariate Generalizability Theory to examine the effect of academic background on test scores obtained from the Iranian national university entrance exam. The results revealed that the relative contribution of the various sources of variance was not the same across the academic background groups. In addition, dependability indices were significantly different across the groupings Furthermore, the decision studies revealed that the groups do not necessarily need to take the same number of items so that a high reliability is obtained. Overall, the results indicate that academic background exerts a remarkable influence on the dependability of the scores.

Research paper thumbnail of Exploratory Factor Analysis as a construct validation tool: (mis)applications in applied linguistics research.

Factor analysis has been frequently exploited in applied research to provide evidence about the u... more Factor analysis has been frequently exploited in applied research to provide evidence about the underlying factors in various measurement instruments. A close inspection of a large number of studies published in leading applied linguistic journals shows that there is a misconception among applied linguists as to the relative merits of exploratory factor analysis and principal components analysis (PCA) and the kind of interpretations that can be drawn from each method. In addition, it is argued that the widespread application of orthogonal, rather than oblique, rotations and also the criteria used for factor selection are not in keeping with the findings in psychometrics. It is further argued that the current situation is partly due to the fact that PCA and orthogonal rotation are default options in mainstream statistical packages such as SPSS and the guidebooks on such software do not provide an explanation of the issues discussed in this article.

Research paper thumbnail of Review of “Rasch analysis in the human sciences” by Boone, W.J., Staver, J. R., & Yale, M. S. (2014)

Research paper thumbnail of The Quest for Fairness in Language Testing

The search for fairness in language testing is distinct from other areas of educational measureme... more The search for fairness in language testing is distinct from other areas of educational measurement as the object of measurement, that is, language, is part of the identity of the test takers. So, a host of issues enter the scene when one starts to reflect on how to assess people's language abilities. As the quest for fairness in language testing is still in its infancy, even the need for such a research has been controversial, with some (e.g., Davies, 2010) arguing that such research is entirely in vain. This paper will provide an overview of some of the issues involved. Special attention will be given to critical language testing (CLT) as it has had a large impact on language testing research. It will be argued that although CLT has been very effective in revealing the ideological and value implications of the constructs of focus in language testing, extremism in this direction is not justified.

Research paper thumbnail of Guest Editorial: Fairness Issues in Educational Assessment

Research paper thumbnail of An investigation of the gender differential performance on a high-stakes language proficiency test in Iran

There has been a growing consensus among the educational measurement experts and psychometricians... more There has been a growing consensus among the educational measurement experts and psychometricians that test taker characteristics may unduly affect the performance on tests. This may lead to construct irrelevant variance in the scores and thus render the test biased. Hence, it is incumbent on test developers and users alike to provide evidence that their tests are free of such bias. The present study exploited Generalizability Theory to examine the presence of gender differential performance on a high stakes language proficiency test, the University of Tehran English Proficiency Test (UTEPT). An analysis of the performance of 2343 examinees who had taken the test in 2009 indicated that the relative contributions of different facets to score variance were almost uniform across the gender groups. Further, there is no significant interaction between items and persons indicating that the relative standings of the persons were uniform across all items. The lambda reliability coefficients were also uniformly high. All in all, the study provides evidence that the test is free of gender bias and enjoys a high level of dependability.

Research paper thumbnail of The Relative Impact of Persons, Items, Subtests, and Academic Background on Performance on a Language Proficiency Test

This study exploited generalizability theory to explore the impact of persons, items, subtests, a... more This study exploited generalizability theory to explore the impact of persons, items, subtests, and academic background on the dependability of the scores from a high-stakes language proficiency test, the University of Tehran Language English Proficiency Test (UTEPT). To this end and following Brown (1999), three questions were posed: 1. What are the distributional characteristics and CTT reliability of UTEPT test scores? 2. What are the relative contributions of persons, items, and subtests to the dependability of scores for each group and for all the groups combined? 3. What are the relative contributions of persons, items, subtests, academic background as well as their various interactions to the dependability of the scores when all groups are combined? To investigate the issues, 5795 examinees from four different academic backgrounds were selected from among all the participants who had taken the test in 2004. The results of the study indicated that the relative contributions of the facets were not stable across all groups, though highly similar. In addition, with academic background added as a facet, there was no significant interaction between items and fields, and the dependability of the scores did not decrease either. This result shows that background knowledge does not lead to bias in the UTEPT. This use of G-theory could be extended profitably to other measuring situations.

Research paper thumbnail of The Development and Validation of a Bilingual Version of the Vocabulary Size Test

This paper reports an attempt to develop and validate a bilingual Persian version of the Vocabula... more This paper reports an attempt to develop and validate a bilingual Persian version of the Vocabulary Size Test (VST). Due to the particular educational system in Iran, there is a dire need for a test that can effectively estimate English learners’ vocabulary sizes. Previous research (Nguyen & Nation, 2011) has indicated that bilingual versions of the VST can be more efficient than the monolingual one. A calibration of the Persian version of the test with 190 English learners indicated that the test enjoys a high level of validity and reliability. The results of a factor analysis revealed a single construct, presumably word knowledge, is underlying the test. A one-way between-subjects ANOVA also indicated that the test can effectively distinguish between different proficiency levels. The hypothesized difficulty order was also realized in the data though it was found that clusters of 1000 word levels provide more meaningful difficulty levels as they are less susceptible to the idiosyncrasies at each 1000 level. The results were also against the common assumption in the literature that not all test takers should sit the entire test. The administration of the whole test leads to a more valid estimate of the examinees’ vocabulary sizes.

Research paper thumbnail of A Rasch calibration of the university entrance exam

Research paper thumbnail of An introduction to Differential Item Functioning

Differential Item Functioning (DIF) has been increasingly applied in fairness studies in psychometric circles. Judicious application of this methodology by researchers, however, requires an understanding of the technical complexities involved. This has become an impediment, especially for non-mathematically oriented researchers. This paper is an attempt to bridge the gap. It provides a non-technical introduction to the fundamental concepts involved in DIF analysis. In addition, an introductory-level explanation of a number of the most frequently applied DIF detection techniques is offered. These include Logistic Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch model. For each method, a number of relevant software packages are also introduced.
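Of the techniques listed, Mantel-Haenszel is perhaps the simplest to compute by hand: examinees are stratified on total score, and each stratum contributes a 2×2 table of group membership (reference vs. focal) by item response (correct vs. incorrect). The sketch below shows the common odds-ratio estimate and its ETS delta transformation; the `mantel_haenszel_dif` name and the tuple layout of the strata are this example's own conventions, not drawn from the paper.

```python
import math

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds ratio and ETS delta for one item.

    strata: list of (A, B, C, D) counts per total-score stratum, where
    A/B = reference group correct/incorrect and
    C/D = focal group correct/incorrect.
    """
    num = sum(a * d / (a + b + c + d)
              for a, b, c, d in strata if a + b + c + d > 0)
    den = sum(b * c / (a + b + c + d)
              for a, b, c, d in strata if a + b + c + d > 0)
    alpha = num / den                 # alpha > 1 favors the reference group
    delta = -2.35 * math.log(alpha)   # ETS delta scale; 0 means no DIF
    return alpha, delta
```

An odds ratio near 1 (delta near 0) indicates the item behaves comparably for both groups once overall ability is held constant, which is the core logic shared by most of the DIF methods the paper surveys.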

Research paper thumbnail of Examining the effects of proficiency, gender, and task type on the use of communication strategies.

This paper reports on a study of the frequency of communication strategies (CSs), their relationship to task types, and gender differences in the use of CSs. A CS questionnaire was administered to 227 students at elementary, pre-intermediate, and intermediate levels. The results indicated that a) language proficiency does not influence the frequency of CSs, b) the task type has a significant impact on the type of CS employed, and c) gender differences in the use of CSs are significant only for circumlocution, asking for clarification, omission, comprehension check, use of fillers, and over-explicitness.

Research paper thumbnail of Differential Item Functioning and ad hoc interpretations.

A plethora of research studies has focused on Differential Item Functioning. Despite the diversity of DIF detection techniques offered, little research has been done on the interpretation of DIF results. This study was undertaken to investigate whether there is any order to the interpretations offered for the real cause of items flagged as displaying DIF. An analysis of expert opinion showed that there is no such order. It is argued that such “ad hoc” interpretations have rendered DIF analysis of little use. It is further suggested that research should focus on devising a mechanism for basing DIF interpretations on principled grounds.

Research paper thumbnail of Detecting gender bias in a language proficiency test

The present study makes use of the Rasch model to investigate the presence of DIF between male and female examinees taking the University of Tehran English Proficiency Test (UTEPT). The results of the study indicated that 19 items function differentially for the two groups. Only 3 items, however, displayed DIF with practical significance. A close inspection of the items indicated that the presence of DIF may be interpreted as impact rather than bias. Therefore, it is concluded that the presence of the differentially functioning items may not render the test unfair. On the other hand, it is argued that the fairness of the test may be under question due to other factors.

Research paper thumbnail of Differential Item Functioning: Current problems and future directions

With the rising concerns over the fairness of language tests, Differential Item Functioning (DIF) has been increasingly applied in bias analysis. Despite its widespread use in psychometric circles, however, DIF analysis faces a number of serious problems. This paper is an attempt to shed some light on several of the issues involved. Specifically, the paper focuses on four problems: inter-method indeterminacy, intra-method indeterminacy, ad hoc interpretations, and the impact of DIF on validity. In order to orient the reader, the paper also provides a brief introduction to the fundamental concepts in DIF analysis.

Research paper thumbnail of Guest Editorial: Fairness Issues in Educational Assessment