Jeff Stewart | Tokyo University of Science

Papers by Jeff Stewart

What the research shows about written receptive vocabulary testing: A reply to Webb (accepted manuscript)

Studies in Second Language Acquisition, 2021

In response to our State-of-the-Scholarship critical commentary (Stoeckel et al., 2021), Stuart Webb (2021) asserts that there is no research supporting our suggestions for improving tests of written receptive vocabulary knowledge by (a) using meaning-recall items, (b) making fewer presumptions about learner knowledge of word families, and (c) using appropriate test lengths. As we will show, this is not the case.

Predicting L2 reading proficiency with modalities of vocabulary knowledge: A bootstrapping approach

McLean, S., Stewart, J., & Batty, A. O. (2020). Language Testing, 37(3), 389–411. doi.org/10.1177/0265532219898380 (open access)

Vocabulary’s relationship to reading proficiency is frequently cited as a justification for the assessment of L2 written receptive vocabulary knowledge. However, to date, there has been relatively little research regarding which modalities of vocabulary knowledge have the strongest correlations to reading proficiency, and observed differences have often been statistically non-significant. The present research employs a bootstrapping approach to reach a clearer understanding of the relationships of various modalities of vocabulary knowledge to reading proficiency. Test-takers (N = 103) answered 1,000 vocabulary test items spanning the third 1,000 most frequent English words in the New General Service List corpus (Browne, Culligan, & Phillips, 2013). Items were answered under four modalities: Yes/No checklists, form recall, meaning recall, and meaning recognition. These pools of test items were then sampled with replacement to create 1,000 simulated tests ranging in length from five to 200 items, and the results were correlated with Test of English for International Communication (TOEIC®) Reading scores. For all examined test lengths, meaning-recall vocabulary tests had the highest average correlations with reading proficiency, followed by form-recall vocabulary tests. The results indicate that tests of vocabulary recall are stronger predictors of reading proficiency than tests of vocabulary recognition, despite the theoretically closer relationship of vocabulary recognition to reading.
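The core of the bootstrapping procedure is straightforward to sketch. The Python snippet below is a minimal illustration with simulated stand-in data (the arrays, sample sizes beyond those named in the abstract, and parameter values are all hypothetical, not the study's data): items are repeatedly sampled with replacement to form simulated tests of a given length, and each simulated test's total score is correlated with the criterion measure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins: 103 learners, 1,000 items in one modality, and a
# TOEIC Reading score per learner, all simulated from a common latent ability
# so the correlation structure is non-trivial.
n_learners, n_items = 103, 1000
ability = rng.normal(0, 1, n_learners)
item_difficulty = rng.normal(0, 1, n_items)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - item_difficulty)))
meaning_recall = rng.binomial(1, p_correct)
toeic_reading = 300 + 80 * ability + rng.normal(0, 40, n_learners)

def bootstrap_correlations(responses, criterion, test_length, n_boot=1000):
    """Sample `test_length` items with replacement `n_boot` times and correlate
    each simulated test's total score with the criterion measure."""
    corrs = np.empty(n_boot)
    for i in range(n_boot):
        items = rng.integers(0, responses.shape[1], size=test_length)
        scores = responses[:, items].sum(axis=1)
        corrs[i] = np.corrcoef(scores, criterion)[0, 1]
    return corrs

for k in (5, 50, 200):
    r = bootstrap_correlations(meaning_recall, toeic_reading, k)
    print(f"{k:>3}-item tests: mean r = {r.mean():.3f}")
```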

Stewart, J., McLean, S., & Kramer, B. (2017). A Response to Holster and Lake Regarding Guessing and the Rasch Model. Language Assessment Quarterly. http://www.tandfonline.com/doi/figure/10.1080/15434303.2016.1262377?scroll=top&needAccess=true

Stewart (2014) questioned vocabulary size estimation methods proposed by Beglar and Nation for the VST, further arguing that Rasch mean square (MSQ) fit statistics cannot determine the proportion of random guesses contained in the average learner’s raw score, as the average value will be near 1 by design. He illustrated this by demonstrating that it is true even of entirely random data. Holster and Lake (2016) appear to misinterpret this as a claim that Rasch analyses cannot distinguish random data from real responses. To test this, they compare real data to random data and note that, predictably, the statistic easily distinguishes the two, and that reliability for random data is near zero. However, while certainly true, this fact is not relevant to Stewart’s arguments that multiple-choice options inflate the test’s size estimates and that MSQ statistics cannot be used to detect this. We further illustrate this by showing that real data retains average MSQ values near 1 even when unknown items skipped by learners are imputed with random guesses. Furthermore, the imputed data does not exhibit “problematic guessing” under Holster and Lake’s own criteria, despite size inflation under Beglar and Nation’s suggested scoring. We conclude by discussing uses of the 3PL model.
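Why average MSQ fit sits near 1 even for pure noise can be shown numerically. The sketch below uses a deliberate simplification (my assumption, not the paper's analysis): when every response is a uniform random guess, there is no true person or item variance, so the Rasch-expected probability of success collapses to the observed grand mean, and the mean-square statistic averages to roughly 1 regardless of the data being noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully random responses: every "learner" guesses among four options (p = .25).
n_persons, n_items = 500, 100
x = rng.binomial(1, 0.25, size=(n_persons, n_items))

# Simplification: with no true person or item variance, the model-expected
# probability of success collapses to the observed grand mean.
p = x.mean()

# Unweighted mean-square fit: average squared standardized residual.
msq = ((x - p) ** 2 / (p * (1 - p))).mean()
print(f"average MSQ: {msq:.3f}")  # ~1.0 despite the data being pure noise
```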

Estimating Learners’ Vocabulary Size under Item Response Theory

Vocabulary Learning and Instruction, Sep 2014

Perhaps the most qualitatively interpretable vocabulary test score is an estimate of the total number of words the learner knows in the tested domain, such as a frequency word list or vocabulary taught as part of a course curriculum. In cases where it is not possible to test the entire domain word-for-word, vocabulary tests such as the Vocabulary Levels Test (Nation, 1990) and the Vocabulary Size Test (Beglar, 2010; Nation & Beglar, 2007) typically employ a polling method, in which total vocabulary size is inferred from a sample of tested words. A drawback of this method is that it assumes the tested words are randomly sampled from, and therefore representative of, the tested domain, which can affect test reliability in cases where many words in the domain are far below or above learners’ ability. This paper outlines an alternate method for estimating vocabulary size from a test score using item response theory, which allows estimation of total vocabulary size from a nonrandom sample of words well matched to learners’ ability, resulting in tests of practical length and high reliability that can be used to estimate the total number of words a learner knows. Such a test scoring method, currently in use at a private university in southern Japan, is used as an example.
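A minimal sketch of the underlying logic, assuming a Rasch model and a hypothetical pre-calibrated item bank (the difficulty values and the simulated learner below are invented for illustration, not taken from the paper): ability is first estimated from a short, deliberately targeted sample of items, and the size estimate is then the sum of model-implied probabilities of knowing each word across the whole domain.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_known(theta, b):
    """Rasch probability that a learner at `theta` knows a word of difficulty `b`."""
    return 1 / (1 + np.exp(-(theta - b)))

rng = np.random.default_rng(1)

# Hypothetical calibrated domain: Rasch difficulties for every word in a
# 1,000-word list.
domain_b = rng.normal(0, 1, 1000)

# A short, deliberately nonrandom test: 30 items near the learner's level.
test_b = np.sort(domain_b)[480:510]
responses = rng.binomial(1, p_known(0.4, test_b))  # simulated learner, true theta = 0.4

def neg_log_lik(theta):
    p = np.clip(p_known(theta, test_b), 1e-9, 1 - 1e-9)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

theta_hat = minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Size estimate: expected number of words known across the whole domain.
size_hat = p_known(theta_hat, domain_b).sum()
print(f"theta = {theta_hat:.2f}, estimated words known = {size_hat:.0f} / 1000")
```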

Do Multiple-Choice Options Inflate Estimates of Vocabulary Size on the VST?

Language Assessment Quarterly

Validated under a Rasch framework (Beglar, 2010), the Vocabulary Size Test (VST; Nation & Beglar, 2007) is an increasingly popular measure of decontextualized written receptive vocabulary size in the field of second language acquisition. However, although the validation indicates that the test has high internal reliability, still unaddressed is the possibility that it overestimates learner vocabulary size due to guessing effects inherent in its multiple-choice format, as size estimates are made by multiplying its raw score by a constant (100 or 200). This paper argues that the VST’s multiple-choice format results in a test of passive recognition of words that does not approximate the experience of readers of authentic English texts. It details drawbacks of the Rasch framework and mean-square fit statistics in detecting the overall contribution of guessing effects to raw test scores, drawbacks that could have allowed such deficiencies to remain undetected during the test’s validation. It then overviews challenges that multiple-choice formats pose for vocabulary tests and concludes by proposing methods of testing and analysis that can address these concerns.
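The arithmetic behind the inflation concern can be made concrete with a small worked example. The learner's knowledge level below is hypothetical; the item count, option count, and words-per-item figures follow the 14,000-word version of the VST as I understand its design.

```python
# Illustrative arithmetic only; the learner's knowledge level is invented.
# 14,000-word VST: 140 items, four options, each item representing
# 100 word families.
items, words_per_item, options = 140, 100, 4

true_known = 60                                # items whose target word is known
expected_raw = true_known + (items - true_known) / options  # 60 + 20 = 80

true_size = true_known * words_per_item        # 6,000 word families
estimated_size = expected_raw * words_per_item  # 8,000 word families
print(f"inflation from blind guessing alone: "
      f"{estimated_size - true_size:.0f} word families")
```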

Optimizing scoring formulas for yes/no vocabulary tests with linear models

Shiken Research Bulletin, Nov 2012

"Yes/No tests offer an expedient method of testing learners’ vocabulary knowledge, although a dra... more "Yes/No tests offer an expedient method of testing learners’ vocabulary knowledge, although a drawback of this method is that since the method is self-report, actual knowledge cannot be confirmed. “Pseudowords” have been used within such lists to test if learners are reporting knowledge of words they cannot possibly know, but it is unclear how to use this information to adjust scores. Although a variety of scoring formulas have been proposed in the literature, empirical research (e.g., Mochida & Harrington, 2006) has found little evidence of their efficacy.

The authors propose that a standard least squares model (multiple regression), in which the counts of words reported known and counts of pseudowords reported known are added as separate predictor variables, can be used to generate scoring formulas that have substantially higher predictive power. This is demonstrated on pilot data, and limitations of the method and goals of future research are discussed.
"

A Multiple-Choice Test of Active Vocabulary Knowledge

Though “passive” multiple-choice tests of second language vocabulary knowledge such as the Vocabulary Levels Test (Nation, 1990) are widespread, tests of “active” knowledge are used less frequently, perhaps due to the inconvenience of hand-scoring written answers. This paper proposes a multiple-choice measure of active knowledge in which the test-taker selects the first letter of the target word. A test employing the format is shown to correlate highly with a conventional active test (r = .93) and to exhibit higher reliability than a comparable passive measure.

Estimating guessing effects on the Vocabulary Levels Test for differing degrees of word knowledge

Comparing Multidimensional and Continuum Models of Vocabulary Acquisition: An Empirical Examination of the Vocabulary Knowledge Scale

Second language vocabulary acquisition has been modeled both as multidimensional in nature and as a continuum wherein the learner's knowledge of a word develops along a cline from recognition through production. In order to empirically examine and compare these models, the authors assess the degree to which the Vocabulary Knowledge Scale (VKS; Paribakht & Wesche, 1993), which implicitly assumes a cline model of acquisition, conforms to a linear trait model under the Rasch Partial Credit Model, and determine the dimensionality of the individual tasks contained on the scale (self-report, first language [L1] equivalent, and sentence) using DETECT. The authors find that, although the VKS functions adequately overall as a measurement model, Stages 3 (can give an adequate L1 equivalent) and 4 (can use with semantic appropriateness) are psychometrically indistinct, suggesting they should be collapsed into a single category of definitional knowledge. Analysis under DIMTEST and DETECT indicates that other forms of vocabulary knowledge measured by the VKS are weakly multidimensional, which has implications for continuum models of vocabulary acquisition.

Examining the Reliability of a TOEIC Bridge Practice Test under 1 and 3 Parameter Item Response Models

Unlike classical test theory (CTT), where estimates of reliability are assumed to apply to all members of a population, item response theory (IRT) provides a theoretical framework under which reliability can vary by test score. However, different IRT models can result in very different interpretations of reliability, as models that account for item quality (slopes) and the probability of a correct guess significantly alter estimates. This is illustrated by fitting a TOEIC Bridge practice test to one-parameter (Rasch) and three-parameter logistic models and comparing the results. Under the Bayesian Information Criterion (BIC), the three-parameter model provided superior fit. The implications of this are discussed.
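A model comparison of this kind can be sketched as follows. The simulation below is hypothetical and deliberately simplified (person abilities are treated as known so that each model reduces to independent per-item likelihoods, which a full IRT calibration would not assume), but it shows the mechanics of computing and comparing BIC for 1PL and 3PL fits.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

def p3pl(theta, a, b, c):
    """Three-parameter logistic: c is the lower asymptote (guessing)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# Simulate guessing-prone data. Simplification: abilities are treated as known,
# so each model reduces to independent per-item likelihoods.
n_persons, n_items = 1000, 40
theta = rng.normal(0, 1, n_persons)
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0, 1, n_items)
c = np.full(n_items, 0.25)
x = rng.binomial(1, p3pl(theta[:, None], a, b, c))

def item_nll(xj, model):
    """Best negative log-likelihood for one item under the chosen model."""
    def nll(params):
        if model == "1pl":
            p = p3pl(theta, 1.0, params[0], 0.0)
        else:
            p = p3pl(theta, params[0], params[1], params[2])
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(xj * np.log(p) + (1 - xj) * np.log(1 - p))
    x0 = [0.0] if model == "1pl" else [1.0, 0.0, 0.2]
    bounds = [(-4, 4)] if model == "1pl" else [(0.2, 4), (-4, 4), (0.0, 0.5)]
    return minimize(nll, x0, bounds=bounds).fun

for model, k_per_item in (("1pl", 1), ("3pl", 3)):
    total_nll = sum(item_nll(x[:, j], model) for j in range(n_items))
    bic = k_per_item * n_items * np.log(n_persons) + 2 * total_nll
    print(f"{model}: BIC = {bic:.0f}")  # lower BIC indicates the preferred model
```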

The LERC Vocabulary Program: Score Gains for First and Second Year Students

Kyushu Sangyo University Language Education and Research Journal, Apr 1, 2012

This report covers the Language and Education Research Center (LERC)'s compulsory vocabulary program for first- and second-year students; 2011 marked the program's first year (Fryer, Stewart, Anderson, Bovee & Gibson, 2011). The goal of the report is to overview student gains in vocabulary over the course of a semester, as measured by the program's summative pre- and post-tests. Score gains of 1,537 first-year students and 1,165 second-year students enrolled in Low-, Mid-, and High-level English Conversation classes were analyzed. All first-year class levels saw gains beyond the center's goal of 100 scale score points (1 logit). Second-year High-level classes saw gains of 170 points, which greatly exceeded that goal, and Mid-level classes fell just below benchmarks with a gain of 89 points. However, though the change was statistically significant, second-year Low-level classes fell well short of expectations, with a mean gain of only 26 points. Possible reasons for this are discussed, and suggestions are made for curriculum adjustments for these classes in 2012.

Does IRT Provide More Sensitive Measures of Latent Traits in Statistical Tests? An Empirical Examination

It has frequently been stated that item response theory (IRT) produces interval-scale measures where raw scores can only provide ordinal measures, and that researchers should therefore choose IRT measures when selecting variables for common statistical tests, because raw scores may not meet their assumptions (Wright, 1992; Harwell & Gatti, 2001). In this study, this claim is empirically examined by conducting Pearson correlations and ANOVAs on two data sets using raw scores, Rasch person measures, and two-parameter (2PL) IRT ability estimates, in order to determine whether results differed as a consequence. Raw scores and Rasch person measures were very highly correlated and led to extremely similar results in all cases. For a well-constructed, reliable test, the same was true of 2PL ability estimates. However, in cases where the test has middling to poor reliability, 2PL ability estimates appear to produce a somewhat more sensitive measure of a latent trait than raw scores, which can result in meaningful differences in statistical tests.
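The comparison can be sketched in simulation. The example below is hypothetical (item parameters are treated as known when estimating ability, which sidesteps the calibration step a real analysis would include); it generates 2PL data for a deliberately unreliable short test and checks how well raw scores and 2PL ability estimates each recover the generating trait.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)

# Simulate a short, low-reliability test: few items with weak, variable slopes.
n_persons, n_items = 300, 20
theta = rng.normal(0, 1, n_persons)
a = rng.uniform(0.3, 1.2, n_items)
b = rng.normal(0, 1.5, n_items)
x = rng.binomial(1, 1 / (1 + np.exp(-a * (theta[:, None] - b))))

raw = x.sum(axis=1)

def theta_mle(xi):
    """2PL ML ability estimate; simplification: item parameters treated as known."""
    def nll(t):
        p = np.clip(1 / (1 + np.exp(-a * (t - b))), 1e-9, 1 - 1e-9)
        return -np.sum(xi * np.log(p) + (1 - xi) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(-4, 4), method="bounded").x

theta_hat = np.array([theta_mle(xi) for xi in x])

# How well does each scoring method recover the generating trait?
print("raw score vs. true theta:   ", round(np.corrcoef(raw, theta)[0, 1], 3))
print("2PL estimate vs. true theta:", round(np.corrcoef(theta_hat, theta)[0, 1], 3))
```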

Equating classroom pre- and post-tests under item response theory

The authors illustrate how classroom pre-tests can be used to gather information for an item bank from which to construct summative tests with appropriate measurement properties, and they detail methods for equating pre- and post-test forms under item response theory in such a manner that the resulting ability estimates are comparable between conditions.
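One common way to place two separately calibrated forms on the same scale is mean-mean equating through anchor items. The sketch below assumes a Rasch model, and the difficulty and ability values are invented for illustration; the paper's own equating procedure may differ.

```python
import numpy as np

# Hypothetical Rasch difficulties from separate pre- and post-test calibrations.
# The same five anchor items appear on both forms.
pre_anchor_b  = np.array([-0.8, -0.2, 0.1, 0.6, 1.1])
post_anchor_b = np.array([-1.1, -0.5, -0.2, 0.3, 0.8])

# Mean-mean equating: each calibration fixes its own scale origin, so the
# anchors' mean difference estimates the shift between the two scales.
shift = pre_anchor_b.mean() - post_anchor_b.mean()

# Place post-test person abilities onto the pre-test scale.
post_theta = np.array([0.2, 0.9, -0.4])
post_theta_on_pre_scale = post_theta + shift
print(f"shift = {shift:.2f} logits")
print("equated post-test abilities:", post_theta_on_pre_scale.round(2))
```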
