Naomi Gafni - Academia.edu
Papers by Naomi Gafni
The objective of this study was to investigate differential tendencies to avoid guessing as a function of three variables: (1) lingual-cultural-group; (2) gender; and (3) examination year. The Psychometric Entrance Test (PET) for universities in Israel was used, which is administered in Hebrew, Arabic, English, French, Spanish, and Russian. The PET is a battery of five subtests, and encompasses about 200 test items. Three of the five subtests were used in this study: figural reasoning, mathematical reasoning, and English. Data for 12,440 male and 10,532 female examinees were analyzed. The tendency to avoid guessing was measured by the proportion of two types of unanswered items: unreached items, and omitted items. A factor analysis using VARIMAX rotation indicated a strong two-factor structure, in which all indices based on omitted items loaded on the first factor and all indices based on unreached items loaded on the second factor. An analysis of covariance with the corrected-for-gu...
Journal of Dental Education, 2011
Virtual reality force feedback simulators provide haptic (sense of touch) feedback through the device being held by the user. The simulator's goal is to provide a learning experience resembling reality. A newly developed haptic simulator (IDEA Dental, Las Vegas, NV, USA) was assessed in this study. Our objectives were to assess the simulator's ability to serve as a tool for dental instruction, self-practice, and student evaluation, as well as to evaluate the sensation it provides. A total of thirty-three evaluators were divided into two groups. The first group consisted of twenty-one experienced dental educators; the second consisted of twelve fifth-year dental students. Each participant performed drilling tasks using the simulator and filled out a questionnaire regarding the simulator and potential ways of using it in dental education. The results show that experienced dental faculty members as well as advanced dental students found that the simulator could provide significant potential benefits in the teaching and self-learning of manual dental skills. Development of the simulator's tactile sensation is needed to attune it to genuine sensation. Further studies relating to aspects of the simulator's structure and its predictive validity, its scoring system, and the nature of the performed tasks should be conducted.
Journal of Dental Education, 2003
This paper discusses the need for reliable and valid measures of personality and motivational factors in the prediction of success and attrition in a dental school. The admissions system currently used in most schools includes personality factors that are measured by an interview. Our study examined whether the interview could be replaced by a standardized, open-ended questionnaire, thus increasing standardization and objectivity and avoiding the possible biases of the interview. The relationship between the standardized questionnaire score and the interview score in a dental school in Israel was examined, as well as the relationship between the standardized questionnaire score and the admissions decisions. The results showed that the questionnaire and the interview probably measure a common construct, enabling us to tentatively recommend a two-stage admissions process: all candidates meeting certain academic criteria should be asked to answer the questionnaire; those candidates scoring above a certain percentile on the questionnaire should either be admitted outright or invited for an interview.
Assessment in Education: Principles, Policy & Practice, 2016
It is my privilege to comment on this series of papers, written by leading experts in the field of validity theory. I have devoted a substantial part of my career to the development of admissions tests and other educational tests, and to the investigation of their validity. As such, I am keenly aware of the complexities involved in this process. It was with great interest and pleasure that I read these papers, beginning with the one authored by Newton and Shaw. From their conclusions, I anticipated a heated debate among the contributors to this issue vis-à-vis the usage of the term ‘validity’. Yet surprisingly, the majority adopted the definition of validity set forth in the 2014 edition of The Standards for Educational and Psychological Testing, which describes validity in terms of both interpretations and uses: ‘Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’. However, there is some disagreement among the contributors with regard to making a distinction between supporting interpretations based on test scores and justifying test use. All the authors agree that both issues are important and require examination. Pamela Moss takes the integration of these two issues even further. According to the Standards, when test scores are interpreted in more than one way, each proposed interpretation must be validated. Moss’s suggestion is consistent with this stipulation. She operationalises the interpretations and uses of test scores, and categorises them either as ‘intended’ (systemic level) or ‘actual’ (local school level). She shifts the focus of validity from intended interpretations and uses of test scores to actual interpretations and uses by local professionals, including teachers, school leaders and policy-makers.
She correctly claims that ‘we need validity theory to help educators connect the test-based data to their own practice and to consider explanations and explore solutions. It is here that the primary potential of testing to improve schooling lies’. How true! I believe that this is what is meant by ‘validity of test use’ in the context of educational testing. The actual interpretations are part of the consequences of using the test (some of which might have been intended and some not).
International Journal of Testing, 2017
Many medical schools have adopted multiple mini-interviews (MMI) as an advanced selection tool. MMIs are expensive and used to test only a few dozen candidates per day, making it infeasible to develop a different test version for each test administration. Therefore, some items are reused both within and across years. This study investigated the influence of item reuse and familiarity with test structure on scores for a high-stakes non-cognitive test administered every year to hundreds of medical school applicants. The results show no evidence that reusing items makes the test easier or less differentiating. Data are consistent across years, test versions and item types. Two additional aspects of familiarity with test structure and test items were examined: whether retaking the test has an effect and how candidates prepare for the test. It was found that retaking the test in and of itself does not result in higher scores. Notwithstanding the results, in the age of the Internet, test developers must assume that all items included in any particular test version will be fully exposed. Hence, in order to preserve test validity, reliability and fairness, it is important to rethink and carefully preplan test versions, as well as decide what information to divulge to candidates.
Items in the verbal (Hebrew and English) sections of the Psychometric Entrance Test (PET) administered for university admission in Israel were studied for differential item functioning (DIF) between the sexes. Analyses were conducted for 4,354 males and 4,901 females taking Form 3 of the PET in April 1984, and 3,785 males and 3,615 females taking Form 17 of the PET in April 1987. Three subtests were examined: (1) verbal reasoning; (2) English; and (3) mathematical reasoning (a control non-verbal test). DIF was determined for the 1984 population through: the weighted sum of the differences between the two groups and across all ability groups; and the root of the mean squared differences as defined above. These two indices and a Mantel-Haenszel chi square test examined DIF for the 1987 group. About one-third of the items in the verbal and mathematics reasoning tests were found to have DIF, but few English subtest items did so. The content of some of the items exhibiting DIF was clearly related to stereotypical perceptions of feminine and masculine areas of interest. Implications for test content are discussed. (SLD)
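As a rough illustration of the two descriptive indices mentioned in this abstract, the Python sketch below computes a weighted signed difference and a root weighted mean squared difference in proportion correct across matched ability groups. The function name `dif_indices`, the use of group sizes as weights, and the sample numbers are assumptions made for illustration; the abstract does not specify the weighting scheme or report item-level data.

```python
import math


def dif_indices(p_male, p_female, n_per_group):
    """Return (weighted signed difference, root weighted mean squared difference).

    p_male, p_female: proportion correct on one item in each ability group.
    n_per_group: examinees per ability group, used here as weights (an assumption).
    """
    total = sum(n_per_group)
    weights = [n / total for n in n_per_group]
    diffs = [m - f for m, f in zip(p_male, p_female)]
    # Signed index: differences in opposite directions can cancel out.
    signed = sum(w * d for w, d in zip(weights, diffs))
    # Root mean squared index: sensitive to differences in either direction.
    rms = math.sqrt(sum(w * d * d for w, d in zip(weights, diffs)))
    return signed, rms


# Hypothetical item: males do better in the low group, females in the high group.
signed, rms = dif_indices([0.5, 0.7], [0.4, 0.8], [100, 100])
print(signed, rms)
```

In this made-up case the signed index is close to zero while the squared index is not, which is one reason a study might report both.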
... as part of a symposium entitled "Assessment of Non-Cognitive Factors in Student Selection for Medical Schools"
The Israel Medical Association journal : IMAJ, 2006
The Israeli Board of Anesthesiology Examination Committee added a simulation-based Objective Structured Clinical Evaluation component to the board examination process. This addition was made in order to evaluate medical competence and considers certain domains that contribute to professionalism. This unique and new process needed to be validated. To validate and evaluate the reliability and realism of incorporating simulation-based OSCE into the Israeli Board Examination in Anesthesia. Validation was performed before the exam regarding Content Validity using the modified Delphi technique by members of the Task Force of the Israeli Board Examination Committee in Anesthesiology. The examination has been administered six times in the past 3 years to a total of 145 examinees. The pass rate ranged from 62% (trauma) to 91% (regional anesthesia). The mean inter-rater correlations for the total score (all items), for the Critical checklist items score, and for the Global (General) rating we...
Studies in Educational Evaluation, 1994
The objective of this study was to investigate differential tendencies to avoid guessing as a function of three variables: (1) lingual-cultural-group; (2) gender; and (3) examination year. The Psychometric Entrance Test (PET) for universities in Israel was used, which is administered in Hebrew,
Medical Education, 2012
CONTEXT Assessment centres used in evaluating the non-cognitive attributes of medical school candidates must generate scores that reflect as accurate a measurement as possible of these attributes. Thus far, reliability coefficients for such centres have been based on limited samples and individual administrations, without reference to the error of variance that may result from retesting, or from the existence of multiple centres designed to measure the same attributes. METHODS The National Institute for Testing and Evaluation in Israel has developed and administered two assessment centres: MOR is used by two medical schools and one dental school, and MIRKAM by another medical school. Each centre comprises eight or nine behavioural stations, a standardised biographical questionnaire, and a judgement and decision-making questionnaire. We calculated generalisability coefficients for each centre's eight or nine stations by year, composite reliability coefficients for the overall assessment centres, test-retest correlation coefficients for repeaters, and a correlation coefficient between the centres.
Test transadaptation (translation and adaptation) is the process whereby a test constructed in one language and culture is prepared for use in a second language and culture. Test transadaptation involves both the translation and adaptation of items written originally in the source language and the replacement of items unsuitable for translation/adaptation with items written in the target language. In the process, the transadaptation team effects a series of changes and modifications before the test attains its final transadapted form. One of the International Test Commission (ITC) Guidelines for Test Translation and Adaptation is (Guideline D1., ITC, 2001; Hambleton, 2005): "Test developers/publishers should ensure that the adaptation process takes full account of linguistic and cultural differences in the intended populations." The rationale provided for this guideline is that "because a single translator cannot be expected to have all of the required qualities and brings a single perspective to the task of translation, in general, it seems clear that a team of specialists is needed to accomplish an accurate adaptation." Two principal questions must be asked with regard to the product of test transadaptation produced by a team of experts: one is whether the transadapted product is of a high quality and the other is whether another team of experts would have done a better job. The first question can be answered by investigating the equivalence of the source and the transadapted test. One way to approach the second question is to examine the variance between transadaptations of the same test produced by independent teams. Such an investigation can provide us with an "estimate" for a standard error of transadaptation. The smaller this error, the more confidence we have in the transadaptation process and the final product.
The purpose of this study was to systematically investigate the variance between tests transadapted from the same source test by two independent teams
The use of CAT in higher education admissions testing in Israel is described. This includes: (1) AMIRAM, a CAT of English as a foreign language that has been used by various institutions of higher education for placement purposes for the past 22 years, and (2) MIFAM, a CAT version of the Psychometric Entrance Test that has been in use for nine years as a higher education admissions tool for examinees with disabilities. Both applications run in parallel with paper-and-pencil test versions. This presentation focuses on the specific procedures used to produce equitable scores across the two media as well as examining the suitability of CAT for examinees with disabilities. Also discussed are a number of practical issues that were encountered during conversion of the Psychometric Entrance Test (PET) to a CAT format. Issues that pertain to the meeting of content specifications, item exposure, item banks, item bank dimensionality, and equating are identified and discussed in the context of evolutionary changes in the MIFAM program.
Applied Measurement in Education, 2002
In this study we examined whether the measures used in the admission of students to universities in Israel are gender biased. The criterion used to measure bias was performance in the first year of university study; the predictors consisted of an admission score, a high school matriculation score, and a standardized test score as well as its component subtest scores. Statistically, bias was defined according to the boundary conditions given in Linn (1984). No gender bias was detected when using the admission score (which is used for selection) as a predictor of first-year performance in the university. Bias in favor of women was found predominantly using school grades as predictor whereas bias against women was found predominantly in using the standardized test scores. It was concluded that the admission score is a valid and unbiased predictor of first-year university performance for the two genders.
Anesthesia & Analgesia, 2005
We prospectively assessed the feasibility of international sharing of simulation-based evaluation tools despite differences in language, education, and anesthesia practice, in an Israeli study, using validated scenarios from a multi-institutional United States (US) study. Thirty-one Israeli junior anesthesia residents performed four simulation scenarios. Training sessions were videotaped and performance was assessed using two validated scoring systems (Long and Short Forms) by two independent raters. Subjects scored from 37 to 95 (70 +/- 12) of 108 possible points with the "Long Form" and "Short Form" scores ranging from 18 to 35 (28.2 +/- 4.5) of 40 possible points. Scores >70% of the maximal score were achieved by 61% of participants in comparison to only 5% in the original US study. The scenarios were rated as very realistic by 80% of the participants (grade 4 on a 1-4 scale). Reliability of the original assessment tools was demonstrated by internal consistencies of 0.66 for the Long and 0.75 for the Short Form (Cronbach alpha statistic). Values in the original study were 0.72-0.76 for the Long and 0.71-0.75 for the Short Form. The reliability did not change when a revised Israeli version of the scoring was used. Interrater reliability measured by Pearson correlation was 0.91 for the Long and 0.96 for the Short Form (P < 0.01). The high scores for plausibility given to the scenarios and the similar reliability of the original assessment tool support the feasibility of using simulation-based evaluation tools, developed in the US, in Israel. The higher scores achieved by Israeli residents may be related to the fact that most Israeli residents are immigrants with previous training in anesthesia.
Simulation-based assessment tools developed in a multi-institutional study in the United States can be used in Israel despite the differences in language, education, and medical system.
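For readers unfamiliar with the internal-consistency statistic cited in this abstract, the following is a minimal Python sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the toy score matrix are hypothetical; they are not data from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-examinee rows, one score per item."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across examinees.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each examinee's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: three examinees scored on two checklist items.
toy_scores = [[1, 2], [2, 1], [3, 3]]
print(cronbach_alpha(toy_scores))
```

Sample variances are used for both the items and the totals; using population variances throughout would give the same ratio.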
Anesthesia & Analgesia, 2006
We describe the unique process whereby simulation-based, objective structured clinical evaluation (OSCE) has been incorporated into the Israeli board examination in anesthesiology. Development of the examination included three steps: a) definition of clinical conditions that residents are required to handle competently, b) definition of tasks pertaining to each of the conditions, and c) incorporation of the tasks into hands-on simulation-based examination stations in the OSCE format, including 1) trauma management, 2) resuscitation, 3) crisis management in the operating room, 4) regional anesthesia, and 5) mechanical ventilation. Members of the Israeli Board of Anesthesiology Examination Committee assisted by experts from the Israel Center for Medical Simulation and from Israel's National Institute for Testing and Evaluation were involved in this process and in the development of the assessment tools, orientation of examinees, and preparation of examiners. The examination has been administered 4 times in the past 2 yr to 104 examinees and has gradually progressed from being a minor part of the oral board examination to a prerequisite component of this test. The pass rate ranged from 70% in resuscitation to 91% in regional anesthesia. The mean inter-rater correlations for all the checklist items, for the score based on the critical checklist items only, and for the general rating were 0.89, 0.86, and 0.76, respectively. The overall Kappa coefficients (the inter-rater agreement coefficient) for the total score and the critical checklist items were 0.71 and 0.76, respectively. The correlation between the total score and the general score was 0.76.
According to a subjective feedback questionnaire, most (70%-90%) participants found the difficulty level of the examination stations reasonable to very easy and prefer this method of examination to a conventional oral examination. The incorporation of OSCE-driven modalities in the certification of anesthesiologists in Israel is a continuing process of evaluation and assessment.
Applied Psychological Measurement, 1990
Equating error was estimated using the same test by three linear equating methods in three paradigms: (1) single-link equating of a test to itself, in which a test was administered on two different dates and the later administration was equated to the earlier administration; (2) circular equating through a chain, starting and ending at the same test; and (3) pseudo-circular equating, in which a test was equated to itself as in the first approach through equating chains containing a different number of links as in the second approach. The mean difference between the actual scores and the equated scores, as well as the root mean square of this difference, were used as the criterion measures for equating error. The results suggested a superiority of the Tucker method for the conventional circular equating chain, and the Levine and vci methods yielded smaller errors in about half the equating chains for the pseudo-circular chain. Unexpectedly, there was not found to be a clear rela...
Higher Education Admissions Practices, 2020
The equivalence of paper-and-pencil (P&P) and computer-based tests (CBTs) has become an important focus of research in the past 20 years. However, few studies have specifically addressed the equivalence of Internet-based tests (IBTs) and P&P administrations of high-stakes admissions tests (Potosky & Bobko, 2004). Despite the fact that there is a shortage of evidence with regard to the equivalence of scores obtained in the IBT and P&P modalities, the number of tests administered via the Internet is constantly rising. The goal of the present study was to compare the scores of examinees who took the P&P version of a scholastic ability test with the scores of those who took it via the Internet. The study was conducted using the Psychometric Entrance Test used for admission to institutions of higher education in Israel. 370 examinees participated in the study. Half were given a Web-based format in a computer lab and the other half were given the same test in P&P format. The study confirmed the equivalence between IBT and traditional P&P versions of the test for the sample.
International Journal of Testing, 2014
Medical …, 2008
Results: In the years 2004-05, the 588 medical school candidates with the highest cognitive scores were tested; this resulted in a change of approximately 20% in the cohort of accepted students compared with previous admission criteria. Internal consistency ranged from 0.80 to 0.88; ...
The objective of this study was to investigate differential tendencies to avoid guessing as a fun... more The objective of this study was to investigate differential tendencies to avoid guessing as a function of three variables: (1) lingual-cultural-group; (2) gender; and (3) examination year. The Psychometric Entrance Test (PET) for universities in Israel was used, which is administered in Hebrew, Arabic, English, French, Spanish, and Russian. The PET is a battery of five subtests, and encompasses about 200 test items. Three of the five subtests were used in this study: figural reasoning, mathematical reasoning, and English. Data for 12,440 male and 10,532 female examinees were analyzed. The tendency to avoid guessing was measured by the proportion of two types of unanswered items: unreached items, and omitted items. A factor analysis using VARIMAX rotation indicated a strong two-factor structure, in which all indices based on omitted items loaded on the first factor and all indices based on unreached items loaded on he second factor. An analysis of covariance with the corrected-for-gu...
Journal of Dental Education, 2011
Virtual reality force feedback simulators provide a haptic (sense of touch) feedback through the ... more Virtual reality force feedback simulators provide a haptic (sense of touch) feedback through the device being held by the user. The simulator's goal is to provide a learning experience resembling reality. A newly developed haptic simulator (IDEA Dental, Las Vegas, NV, USA) was assessed in this study. Our objectives were to assess the simulator's ability to serve as a tool for dental instruction, self-practice, and student evaluation, as well as to evaluate the sensation it provides. A total of thirty-three evaluators were divided into two groups. The first group consisted of twenty-one experienced dental educators; the second consisted of twelve fifth-year dental students. Each participant performed drilling tasks using the simulator and filled out a questionnaire regarding the simulator and potential ways of using it in dental education. The results show that experienced dental faculty members as well as advanced dental students found that the simulator could provide significant potential benefits in the teaching and self-learning of manual dental skills. Development of the simulator's tactile sensation is needed to attune it to genuine sensation. Further studies relating to aspects of the simulator's structure and its predictive validity, its scoring system, and the nature of the performed tasks should be conducted.
Journal of Dental Education, 2003
This paper discusses the need for reliable and valid measures of personality and motivational fac... more This paper discusses the need for reliable and valid measures of personality and motivational factors in the prediction of success and attrition in a dental school. The admissions system currently used in most schools includes personality factors that are measured by an interview. Our study examined whether the interview could be replaced by a standardized, open-ended questionnaire, thus increasing standardization and objectivity and avoiding the possible biases of the interview. The relationship between the standardized questionnaire score and the interview score in a dental school in Israel was examined, as well as the relationship between the standardized questionnaire score and the admissions decisions. The results showed that the questionnaire and the interview probably measure a common construct, enabling us to tentatively recommend a two-stage admissions process: all candidates meeting certain academic criteria should be asked to answer the questionnaire; those candidates scoring above a certain percentile on the questionnaire should either be admitted outright or invited for an interview.
Assessment in Education: Principles, Policy & Practice, 2016
It is my privilege to comment on this series of papers, written by leading experts in the field o... more It is my privilege to comment on this series of papers, written by leading experts in the field of validity theory. I have devoted a substantial part of my career to the development of admissions tests and other educational tests, and to the investigation of their validity. As such, I am keenly aware of the complexities involved in this process. It was with great interest and pleasure that I read these papers, beginning with the one authored by Newton and Shaw. From their conclusions, I anticipated a heated debate among the contributors to this issue vis-à-vis the usage of the term ‘validity’. Yet surprisingly, the majority adopted the definition of validity set forth in the 2014 edition of The Standards for Educational and Psychological Testing, which describes validity in terms of both interpretations and uses: ‘Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’. However, there is some disagreement among the contributors with regard to making a distinction between supporting interpretations based on test scores and justifying test use. All the authors agree that both issues are important and require examination. Pamela Moss takes the integration of these two issues even further. According to the Standards, when test scores are interpreted in more than one way, each proposed interpretation must be validated. Moss’s suggestion is consistent with this stipulation. She operationalises the interpretations and uses of test scores, and categorises them either as ‘intended’ (systemic level) or ‘actual’ (local school level). She shifts the focus of validity from intended interpretations and uses of test scores to actual interpretations and uses by local professionals, including teachers, school leaders and policy-makers. 
She correctly claims that ‘we need validity theory to help educators connect the test-based data to their own practice and to consider explanations and explore solutions. It is here that the primary potential of testing to improve schooling lies’. How true! I believe that this is what is meant by ‘validity of test use’ in the context of educational testing. The actual interpretations are part of the consequences of using the test (some of which might have been intended and some not).
International Journal of Testing, 2017
Many medical schools have adopted multiple mini-interviews (MMI) as an advanced selection tool. M... more Many medical schools have adopted multiple mini-interviews (MMI) as an advanced selection tool. MMIs are expensive and used to test only a few dozen candidates per day, making it infeasible to develop a different test version for each test administration. Therefore, some items are reused both within and across years. This study investigated the influence of item reuse and familiarity with test structure on scores for a high-stakes non-cognitive test administered every year to hundreds of medical school applicants. The results show no evidence that reusing items makes the test easier or less differentiating. Data are consistent across years, test versions and item types. Two additional aspects of familiarity with test structure and test items were examined: whether retaking the test has an effect and how candidates prepare for the test. It was found that retaking the test in and of itself does not result in higher scores. Notwithstanding the results, in the age of the Internet, test developers must assume that all items included in any particular test version will be fully exposed. Hence, in order to preserve test validity, reliability and fairness, it is important to rethink and carefully preplan test versions, as well as decide what information to divulge to candidates.
Items in the verbal (Hebrew and English) sections of the Psychometric Entrance Test (PET) adminis... more Items in the verbal (Hebrew and English) sections of the Psychometric Entrance Test (PET) administered for university admission in Israel were studied for differential item functioning (DIF) between the sexes. Analyses were conducted for 4,354 males and 4,901 females taking Form 3 of the PET in April 1984, and 3,785 males and 3,615 females taking Form 17 of the PET in April 1987. Three subtests were examined: (1) veral reasoning; (2) English; and (3) mathematical reasoning (a control non-verbal test). DIF was determined for the 1984 population through: the weighted sum of the differences between the twc groups and across all ability groups; and the root of the mean squared differences as defined above. These two indices and a Mantel-Haenszel chi square test examined DIF for the 1987 group. About one-third of the items in the verbal and mathematics reasoning 1-.rts were found to have DIE', but few English subtest items did so. The content of some of the items exhibiting DIF was clearly related to stereotypical perceptions of feminine and masculine areas of interest. Implications for test content are discussed. (SLD)
... as part of a symposium entitled "Assessment of Non-Cognitive Factors in Student Selection for Medical Schools"
The Israel Medical Association journal : IMAJ, 2006
The Israeli Board of Anesthesiology Examination Committee added a simulation-based Objective Structured Clinical Evaluation component to the board examination process. This addition was made in order to evaluate medical competence and considers certain domains that contribute to professionalism. This unique and new process needed to be validated. To validate and evaluate the reliability and realism of incorporating simulation-based OSCE into the Israeli Board Examination in Anesthesia. Validation was performed before the exam regarding Content Validity using the modified Delphi technique by members of the Task Force of the Israeli Board Examination Committee in Anesthesiology. The examination has been administered six times in the past 3 years to a total of 145 examinees. The pass rate ranged from 62% (trauma) to 91% (regional anesthesia). The mean inter-rater correlations for the total score (all items), for the Critical checklist items score, and for the Global (General) rating we...
Studies in Educational Evaluation, 1994
The objective of this study was to investigate differential tendencies to avoid guessing as a function of three variables: (1) lingual-cultural-group; (2) gender; and (3) examination year. The Psychometric Entrance Test (PET) for universities in Israel was used, which is administered in Hebrew,
Medical Education, 2012
CONTEXT Assessment centres used in evaluating the non-cognitive attributes of medical school candidates must generate scores that reflect as accurate a measurement as possible of these attributes. Thus far, reliability coefficients for such centres have been based on limited samples and individual administrations, without reference to the error of variance that may result from retesting, or from the existence of multiple centres designed to measure the same attributes. METHODS The National Institute for Testing and Evaluation in Israel has developed and administered two assessment centres: MOR is used by two medical schools and one dental school, and MIRKAM by another medical school. Each centre comprises eight or nine behavioural stations, a standardised biographical questionnaire, and a judgement and decision-making questionnaire. We calculated generalisability coefficients for each centre's eight or nine stations by year, composite reliability coefficients for the overall assessment centres, test-retest correlation coefficients for repeaters, and a correlation coefficient between the centres.
ABSTRACT Test transadaptation (translation and adaptation) is the process whereby a test constructed in one language and culture is prepared for use in a second language and culture. Test transadaptation involves both the translation and adaptation of items written originally in the source language and the replacement of items unsuitable for translation/adaptation with items written in the target language. In the process, the transadaptation team effects a series of changes and modifications before the test attains its final transadapted form. One of the International Test Commission (ITC) Guidelines for Test Translation and Adaptation is (Guideline D1., ITC, 2001; Hambleton, 2005): "Test developers/publishers should ensure that the adaptation process takes full account of linguistic and cultural differences in the intended populations." The rationale provided for this guideline is that "because a single translator cannot be expected to have all of the required qualities and brings a single perspective to the task of translation, in general, it seems clear that a team of specialists is needed to accomplish an accurate adaptation." Two principal questions must be asked with regard to the product of test transadaptation produced by a team of experts: one is whether the transadapted product is of a high quality and the other is whether another team of experts would have done a better job. The first question can be answered by investigating the equivalence of the source and the transadapted test. One way to approach the second question is to examine the variance between transadaptations of the same test produced by independent teams. Such an investigation can provide us with an "estimate" for a standard error of transadaptation. The smaller this error, the more confidence we have in the transadaptation process and the final product.
The purpose of this study was to systematically investigate the variance between tests transadapted from the same source test by two independent teams
The use of CAT in higher education admissions testing in Israel is described. This includes: (1) AMIRAM, a CAT of English as a foreign language that has been used by various institutions of higher education for placement purposes for the past 22 years, and (2) MIFAM, a CAT version of the Psychometric Entrance Test that has been in use for nine years as a higher education admissions tool for examinees with disabilities. Both applications run in parallel with paper-and-pencil test versions. This presentation focuses on the specific procedures used to produce equitable scores across the two media as well as examining the suitability of CAT for examinees with disabilities. Also discussed are a number of practical issues that were encountered during conversion of the Psychometric Entrance Test (PET) to a CAT format. Issues that pertain to the meeting of content specifications, item exposure, item banks, item bank dimensionality, and equating, are identified and discussed in the context of evolutionary changes in the MIFAM program.
Applied Measurement in Education, 2002
ABSTRACT In this study we examined whether the measures used in the admission of students to universities in Israel are gender biased. The criterion used to measure bias was performance in the first year of university study; the predictors consisted of an admission score, a high school matriculation score, and a standardized test score as well as its component subtest scores. Statistically, bias was defined according to the boundary conditions given in Linn (1984). No gender bias was detected when using the admission score (which is used for selection) as a predictor of first-year performance in the university. Bias in favor of women was found predominantly using school grades as predictor whereas bias against women was found predominantly in using the standardized test scores. It was concluded that the admission score is a valid and unbiased predictor of first-year university performance for the two genders.
Anesthesia & Analgesia, 2005
We prospectively assessed the feasibility of international sharing of simulation-based evaluation tools despite differences in language, education, and anesthesia practice, in an Israeli study, using validated scenarios from a multi-institutional United States (US) study. Thirty-one Israeli junior anesthesia residents performed four simulation scenarios. Training sessions were videotaped and performance was assessed using two validated scoring systems (Long and Short Forms) by two independent raters. Subjects scored from 37 to 95 (70 +/- 12) of 108 possible points with the "Long Form" and "Short Form" scores ranging from 18 to 35 (28.2 +/- 4.5) of 40 possible points. Scores >70% of the maximal score were achieved by 61% of participants in comparison to only 5% in the original US study. The scenarios were rated as very realistic by 80% of the participants (grade 4 on a 1-4 scale). Reliability of the original assessment tools was demonstrated by internal consistencies of 0.66 for the Long and 0.75 for the Short Form (Cronbach alpha statistic). Values in the original study were 0.72-0.76 for the Long and 0.71-0.75 for the Short Form. The reliability did not change when a revised Israeli version of the scoring was used. Interrater reliability measured by Pearson correlation was 0.91 for the Long and 0.96 for the Short Form (P < 0.01). The high scores for plausibility given to the scenarios and the similar reliability of the original assessment tool support the feasibility of using simulation-based evaluation tools, developed in the US, in Israel. The higher scores achieved by Israeli residents may be related to the fact that most Israeli residents are immigrants with previous training in anesthesia.
Simulation-based assessment tools developed in a multi-institutional study in the United States can be used in Israel despite the differences in language, education, and medical system.
Anesthesia & Analgesia, 2006
We describe the unique process whereby simulation-based, objective structured clinical evaluation (OSCE) has been incorporated into the Israeli board examination in anesthesiology. Development of the examination included three steps: a) definition of clinical conditions that residents are required to handle competently, b) definition of tasks pertaining to each of the conditions, and c) incorporation of the tasks into hands-on simulation-based examination stations in the OSCE format, including 1) trauma management, 2) resuscitation, 3) crisis management in the operating room, 4) regional anesthesia, and 5) mechanical ventilation. Members of the Israeli Board of Anesthesiology Examination Committee assisted by experts from the Israel Center for Medical Simulation and from Israel's National Institute for Testing and Evaluation were involved in this process and in the development of the assessment tools, orientation of examinees, and preparation of examiners. The examination has been administered 4 times in the past 2 yr to 104 examinees and has gradually progressed from being a minor part of the oral board examination to a prerequisite component of this test. The pass rate ranged from 70% in resuscitation to 91% in regional anesthesia. The mean inter-rater correlations for all the checklist items, for the score based on the critical checklist items only, and for the general rating were 0.89, 0.86, and 0.76, respectively. The overall Kappa coefficients (the inter-rater agreement coefficient) for the total score and the critical checklist items were 0.71 and 0.76, respectively. The correlation between the total score and the general score was 0.76.
According to a subjective feedback questionnaire, most (70%-90%) participants found the difficulty level of the examination stations reasonable to very easy and prefer this method of examination to a conventional oral examination. The incorporation of OSCE-driven modalities in the certification of anesthesiologists in Israel is a continuing process of evaluation and assessment.
Applied Psychological Measurement, 1990
Equating error was estimated using the same test by three linear equating methods in three paradigms: (1) single-link equating of a test to itself, in which a test was administered on two different dates and the later administration was equated to the earlier administration; (2) circular equating through a chain, starting and ending at the same test; and (3) pseudo-circular equating, in which a test was equated to itself as in the first approach through equating chains containing a different number of links as in the second approach. The mean difference between the actual scores and the equated scores, as well as the root mean square of this difference, were used as the criterion measures for equating error. The results suggested a superiority of the Tucker method for the conventional circular equating chain, and the Levine and vci methods yielded smaller errors in about half the equating chains for the pseudo-circular chain. Unexpectedly, there was not found to be a clear rela...
Higher Education Admissions Practices, 2020
The equivalence of paper-and-pencil (P&P) and computer-based tests (CBTs) has become an important focus of research in the past 20 years. However, few studies have specifically addressed the equivalence of Internet-based tests (IBTs) and P&P administrations of high-stakes admissions tests (Potosky & Bobko, 2004). Despite the fact that there is a shortage of evidence with regard to the equivalence of scores obtained in the IBT and P&P modalities, the number of tests administered via the Internet is constantly rising. The goal of the present study was to compare the scores of examinees who took the P&P version of a scholastic ability test with the scores of those who took it via the Internet. The study was conducted using the Psychometric Entrance Test used for admission to institutions of higher education in Israel. 370 examinees participated in the study. Half were given a Web-based format in a computer lab and the other half were given the same test in P&P format. The study confirmed the equivalence between IBT and traditional P&P versions of the test for the sample.
International Journal of Testing, 2014
Medical …, 2008
Results In the years 2004-05, the 588 medical school candidates with the highest cognitive scores were tested; this resulted in a change of approximately 20% in the cohort of accepted students compared with previous admission criteria. Internal consistency ranged from 0.80 to 0.88; ...