L1 loanword frequency and vocabulary test item facility

Rethinking Vocabulary Size Tests: Frequency Versus Item Difficulty

2016

Brett James Hashimoto, Department of Linguistics and English Language, BYU, Master of Arts

For decades, vocabulary size tests have been built on the idea that if a test-taker knows enough words at a given frequency level, based on a list derived from a corpus, they will also know other words of approximately that frequency as well as all words that are more frequent. However, many vocabulary size tests are based on corpora that are as much as 70 years old and that may be ill-suited to these tests. Given these potentially problematic areas, the following research questions were asked. First, to what degree would a vocabulary size test based on a large, contemporary corpus be reliable and valid? Second, would it be more reliable and valid than previously designed vocabulary size tests? Third, do words within 1,000-word frequency bands vary in their item difficulty? In order to answer these research questions, 4...
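
The band-sampling logic such tests rely on can be sketched in a few lines. This is a minimal illustration of the general design, not the instrument developed in the thesis; `ranked_words` and `knows_word` are hypothetical stand-ins for the corpus-derived frequency list and the test-taker's item responses.

```python
import random

def estimate_vocab_size(ranked_words, knows_word, band_size=1000,
                        items_per_band=10, seed=42):
    """Estimate vocabulary size by sampling items from each
    1,000-word frequency band and scaling hits up to the band.

    ranked_words: words sorted by descending corpus frequency.
    knows_word: hypothetical callable returning True if the
    test-taker answers the item for that word correctly.
    """
    rng = random.Random(seed)
    estimate = 0
    for start in range(0, len(ranked_words), band_size):
        band = ranked_words[start:start + band_size]
        sample = rng.sample(band, min(items_per_band, len(band)))
        hits = sum(knows_word(w) for w in sample)
        # Scale the sample hit rate up to the whole band, on the
        # assumption that sampled items represent their band.
        estimate += hits / len(sample) * len(band)
    return round(estimate)
```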

The Interrelationship Among Word Frequency, Learner Behavior in a Vocabulary Size Test, and Teachers' Perception of Difficult Words

2009

This study investigated the relationship among word frequency, learner performance, and teacher intuition about difficult words. It was expected that gaining deeper insight into this relationship would show that the frequency of words is reflected in their difficulty. First, using a vocabulary size test that takes learner confidence into account, the vocabulary size of 180 university learners was measured. The results of the test confirmed that the scores corrected by the degree of confidence were more sensitive to frequency levels than the raw scores. Second, when the distribution of difficult words as judged by teachers was investigated across frequency levels, the number of difficult words was in accordance with the frequency levels. Lastly, the relationship between learner performance on the vocabulary size test and teacher intuition about word difficulty was investigated, and a strong correlation was revealed. This result is interpreted to show that the teachers were in fact capable of predicting difficult words. Taken together, these results confirm a close relationship between the frequency of words and their difficulty.
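
The abstract does not give the exact correction formula, but one plausible confidence-weighting scheme, crediting correct answers by the stated confidence and penalizing confident errors, can be sketched as follows.

```python
def confidence_corrected_score(responses):
    """A plausible confidence-weighting scheme (the paper's exact
    formula is not stated in the abstract): credit correct answers
    by the stated confidence, penalize confident wrong answers.

    responses: list of (is_correct: bool, confidence: float in [0, 1]).
    """
    raw = sum(is_correct for is_correct, _ in responses)
    corrected = sum(conf if ok else -conf for ok, conf in responses)
    return raw, corrected

raw, corrected = confidence_corrected_score(
    [(True, 0.9), (True, 0.4), (False, 0.8), (False, 0.1)])
# raw = 2; corrected = 0.9 + 0.4 - 0.8 - 0.1 = 0.4
```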

Effects of Test Format in Assessing L2 Vocabulary Knowledge and Skills

2021

In this paper, we present preliminary results of a study in which we examined the relative contribution of English learners' vocabulary to predicting their reading and grammar knowledge, employing two formats of vocabulary test that require active and passive recognition, respectively. We administered a series of English tests to over 820 university students, including the TOEFL ITP, a reading test, a grammar test, and a vocabulary test with 80 items in the two formats. The 80 target vocabulary items were selected from Levels 2 to 6 of the JACET 8000. We analyzed the test data statistically to observe the relationships between the vocabulary levels and the language skills, and among test format, vocabulary level, and language skills. We also examined the relative contribution of vocabulary as estimated by the different item formats to predicting students' performance on the other skills tests. The findings suggest a very strong trait effect that dominated both method factors when they were modeled together with the vocabulary trait in a common factor structure. In addition, the contribution of each method to predicting skills performance was not consistent across the traits of grammar and reading.
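
The "relative contribution" analysis can be illustrated with a hedged sketch: regressing a skills score on standardized scores from the two vocabulary formats and comparing the coefficients. The file and column names here are hypothetical, and the study's actual modeling (a common factor structure) was more elaborate than this simple regression.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names standing in for the two
# vocabulary-format scores and one skills test score.
df = pd.read_csv("scores.csv")  # columns: active_vocab, passive_vocab, reading

# Standardize predictors so their coefficients are comparable.
X = sm.add_constant(df[["active_vocab", "passive_vocab"]].apply(
    lambda s: (s - s.mean()) / s.std()))
y = (df["reading"] - df["reading"].mean()) / df["reading"].std()

model = sm.OLS(y, X).fit()
print(model.summary())  # compare the two formats' standardized betas
```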

A reassessment of frequency and vocabulary size in L2 vocabulary teaching

Language Teaching, 2012

The high-frequency vocabulary of English has traditionally been taken to consist of the 2,000 most frequent word families, and low-frequency vocabulary has been defined as that beyond the 10,000 frequency level. This paper argues that these boundaries should be reassessed on pedagogic grounds. Based on a number of perspectives (including frequency and acquisition studies, the amount of vocabulary necessary for English usage, the range of graded readers, and dictionary defining vocabulary), we argue that high-frequency English vocabulary should include the most frequent 3,000 word families. We also propose that the low-frequency vocabulary boundary should be lowered to the 9,000 level, on the basis that 8–9,000 word families are sufficient to provide the lexical resources necessary to read a wide range of authentic texts (Nation 2006). We label the vocabulary between high-frequency (3,000) and low-frequency (9,000+) as mid-frequency vocabulary. We illustrate the necessity of mid-frequenc...
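
The proposed boundaries translate directly into a classification by word-family frequency rank, as in this small sketch.

```python
def frequency_band(family_rank):
    """Classify a word family by rank under the boundaries proposed
    in the paper: high-frequency = most frequent 3,000 families,
    mid-frequency = 3,001-9,000, low-frequency = beyond 9,000."""
    if family_rank <= 3000:
        return "high-frequency"
    if family_rank <= 9000:
        return "mid-frequency"
    return "low-frequency"

assert frequency_band(2500) == "high-frequency"
assert frequency_band(7000) == "mid-frequency"
assert frequency_band(12000) == "low-frequency"
```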

Dispersion and frequency: Is there any difference as regards their relation to L2 vocabulary gains?

Despite the importance given to L2 vocabulary acquisition over the last two decades, considerable deficiencies are still found in L2 students' vocabulary size. One of the aspects that may influence vocabulary learning is word frequency. However, scholars warn that frequency may lead to wrong conclusions if the way words are distributed is ignored. That is to say, it seems that not only the number of occurrences of a word (frequency) might affect L2 vocabulary acquisition, but also the way those occurrences are distributed across texts (distribution). The relationship between these two factors is captured by Gries's dispersion index (DP). The present study aims to find out whether dispersion is a more accurate and reliable predictor of L2 vocabulary learning than frequency alone. KEYWORDS: Distributed learning, second language vocabulary acquisition, word dispersion, word frequency.
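
Gries's DP (deviation of proportions) compares a word's observed distribution of occurrences across corpus parts with the distribution expected from the parts' sizes: DP = 0.5 * sum(|v_i - s_i|), where v_i is the proportion of the word's occurrences falling in part i and s_i is part i's share of the corpus. A minimal implementation:

```python
def gries_dp(part_freqs, part_sizes):
    """Gries's deviation of proportions (DP; Gries 2008).

    part_freqs: occurrences of the word in each corpus part.
    part_sizes: token counts of each corpus part.
    Returns a value in [0, 1): 0 = perfectly even dispersion,
    values near 1 = the word is concentrated in few parts.
    """
    total_freq = sum(part_freqs)
    total_size = sum(part_sizes)
    v = [f / total_freq for f in part_freqs]   # observed proportions
    s = [n / total_size for n in part_sizes]   # expected proportions
    return 0.5 * sum(abs(vi - si) for vi, si in zip(v, s))

# An evenly spread word scores near 0; a bursty word scores high.
print(gries_dp([10, 10, 10], [1000, 1000, 1000]))  # 0.0
print(gries_dp([30, 0, 0], [1000, 1000, 1000]))    # ~0.667
```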

Investigating Local Item Dependence in the Vocabulary Levels Test

Doctoral Dissertation, 2019

The Vocabulary Levels Test (VLT) has been used as a placement test, a diagnostic test, and a benchmark for learning in pre- and post-test studies. Compared to other vocabulary size tests such as the VST and the Yes/No test, the VLT has received the most attention in research publications over the last 35 years, despite widespread suspicion of its item cluster format. Since each item cluster is composed of three items (definitions) and six answer options (words), it is suspected that the answering of one item can unfairly influence, or depend on, the answering of another item in the cluster, since the three cluster items draw from the same set of answer options. This type of Local Item Dependence (LID) is called item chaining and appears to be a flagrant violation of the basic assumption of Local Item Independence (LII) in Classical Test Theory as well as Item Response Theory. And if item chaining is pervasive throughout the test, it also challenges another fundamental assumption in test theory: unidimensionality, the test's capacity to measure only one trait, such as vocabulary knowledge. If both of these assumptions are substantially violated by LID, the test's reliability and validity are necessarily called into question.

The purpose of this dissertation is to investigate the issue of LID in a shortened version of the VLT (three levels instead of five) using a wide variety of Rasch modelling approaches that were triangulated so as to identify the existence and extent of LID in the VLT. Specifically, data were collected from 302 Taiwanese university students or university graduates, and Winsteps was used to run two types of dimensionality tests (Principal Components Analysis of Residuals [PCAR], and Yen's Q3 statistic, which identifies pairs of locally dependent items) on 20 different data levels:
1. the combined VLT levels 2, 3, and 5 (1 data level);
2. each independent VLT level (3 data levels);
3. four ability groups versus the combined VLT levels (4 data levels);
4. four ability groups versus the three independent VLT levels (12 data levels).
Two further analyses were also conducted: simulated data with non-random residuals factored out were compared to the empirical data, and items were grouped into three-item clusters to perform a Rasch analysis of testlets.

In total, this study synthesized the results of 42 different analyses and qualitatively investigated the resulting problematic testlets using (1) response patterns of answer keys, distractors, and items left unanswered, and (2) word frequency and dispersion information from COCA, the largest and most up-to-date English language corpus currently available (Davies, 2008-). In line with previous research findings, the unidimensional Rasch analyses showed acceptable fit statistics, person and item reliability, and very little unexplained variance, especially when compared with the simulated data. The testlet analysis also did not uncover any obviously problematic testlets. However, across the 20 levels of analysis, more than a third of the testlets appeared either to (1) contain a pair of locally dependent (LD) items that were weakly to moderately dependent on each other (correlation of 0.3-0.7), and/or (2) contain items with substantive PCAR loadings (beyond +/- 0.3) on a dimension that was not the Rasch dimension of vocabulary knowledge. Additional qualitative investigations were conducted in an effort to better understand and explain the Rasch statistical results.
A subset of seven testlets that emerged from at least two of the above analyses was assumed to contain the most likely candidates for problematic LID, and these were more closely scrutinized using qualitative procedures of checking item wording and word frequency. Although the statistical and qualitative procedures cannot conclusively show that the cause of the LID is item chaining, the seven testlets share a number of characteristics that clearly create a problematic dynamic undermining the proper functioning of testlets. These characteristics include a pair of items that differ considerably in difficulty measures from the third item in the cluster, which I have called a "2-vs-1 difficulty bundle"; in fact, 19 out of 30 testlets shared this configuration. When the items in such a difficulty bundle are fairly close together but far apart from the outlying third item, the Q3 analysis identified them as either weakly or moderately locally dependent; this was the case for six testlets (20% of the total). And when this LID pair was the first two items, with the first item more difficult than the second and much more difficult than the third, outlying item (with a quarter to one third of test-takers leaving the pair unanswered), the first item in the testlet was identified by the PCAR as negatively correlating with the Rasch dimension of vocabulary knowledge; this was the case for four testlets (13% of the total) in VLT 3 and 5.

A key issue that emerged from this investigation is item difficulty in a vocabulary diagnostic test like the VLT, which has been variously ignored or treated as a "nuisance variable" by researchers (Culligan, 2015). Difficulty in this type of test has never, to the best of my knowledge, been overtly theorized, but has been tacitly operationalized as a function of word frequency from a corpus. Despite some unargued claims to the contrary (Schmitt et al. [2001] for the VLT, and Beglar [2007] for the VST), the assumption is that the less frequent (i.e., less common) the word, the more difficult the word-item on the VLT. This study shows that this assumption is problematic for at least two reasons. First, the Schmitt et al. (2001) VLT versions are based on outdated and small corpora that have inaccurate word frequency information for all the VLT levels, but especially for the lower-frequency VLT 3 and 5 levels; this is primarily because word frequency information will necessarily be inconsistent and skewed for less common words when using smaller corpora that contain a relatively small number of randomly sampled texts and do not account for dispersion (i.e., how many texts in the corpus contain the word). Second, and most importantly, difficulty measures, even when accounting for dispersion information, are often uncorrelated with frequency information, which shows that the learner's second language (L2) lexicon does not mirror authentic English corpora, especially beyond the first 2,000 words. Suggestions are given to help bridge the gap between frequency and difficulty.

Key words: Vocabulary testing, local item dependence, unidimensionality, Rasch model, latent trait, Vocabulary Levels Test
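
Of the statistics used above, Yen's Q3 is straightforward to sketch: it is the correlation between the Rasch residuals of item pairs, with values of roughly 0.3-0.7 treated in the dissertation as weak-to-moderate local dependence. The sketch below assumes person and item parameters have already been estimated (e.g., exported from Winsteps); the variable names are illustrative.

```python
import numpy as np

def yen_q3(responses, theta, b):
    """Yen's Q3: correlations between Rasch residuals of item pairs.

    responses: (persons x items) matrix of 0/1 scores.
    theta: person ability estimates; b: item difficulty estimates
    (assumed already estimated, e.g. exported from Winsteps).
    """
    # Rasch expected score for every person-item combination.
    expected = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    residuals = responses - expected
    q3 = np.corrcoef(residuals, rowvar=False)  # item-by-item matrix
    np.fill_diagonal(q3, np.nan)               # ignore self-correlations
    return q3

# Hypothetical usage: flag item pairs in the 0.3-0.7 range the
# dissertation treats as weak-to-moderate local dependence.
# q3 = yen_q3(X, theta_hat, b_hat)
# pairs = np.argwhere(np.triu(np.abs(q3) >= 0.3, k=1))
```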

How Can Cumulative Tests Be Applicable to Effective L2 Vocabulary Instruction

2022

Research has revealed that students who take cumulative tests, which target both recently and previously learned items, show superior retention of those items in comparison to students who take noncumulative tests. In this study, a modified version of cumulative tests, tentatively named Random-Selection Tests (RST), was designed, and the effects of the RST on L2 vocabulary learning were examined. In Week 1, first-year university students in Japan took a pretest comprising 50 target words. They were then given a word list of 50 English-Japanese word pairs and asked to memorize as many words as possible outside class time. In Week 2, they took a small test containing 10 words chosen randomly from among the 50 words, and they continued to take such small tests on the 50 words for five consecutive weeks (Weeks 2 to 6). In Week 7, the students took a posttest containing all 50 words. The results revealed that increases in total study time and in the number of times a given word appeared in the small tests were directly proportional to increases in posttest scores. The study also found that when students had high scores, they tended to decrease their study time; nevertheless, the RST benefited learning through spaced retrieval practice. How English teachers can apply these results to L2 vocabulary instruction is discussed.
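
The weekly random-selection procedure is easy to reproduce in outline. The sketch below draws each week's 10 items from the 50-word list and tallies each word's exposure, the variable the study relates to posttest scores; the names are illustrative, not the study's materials.

```python
import random

def weekly_rst_items(word_list, n_items=10, n_weeks=5, seed=1):
    """Draw each week's Random-Selection Test items: n_items sampled
    without replacement within a week (but independently across
    weeks) from the full study list, as in the design above."""
    rng = random.Random(seed)
    return [rng.sample(word_list, n_items) for _ in range(n_weeks)]

# Count how often each word was tested across Weeks 2-6.
words = [f"word{i:02d}" for i in range(50)]  # placeholder study list
schedule = weekly_rst_items(words)
exposure = {w: sum(w in week for week in schedule) for w in words}
```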

Testing Vocabulary Knowledge: Size, Strength, and Computer Adaptiveness

In this article, we describe the development and trial of a bilingual computerized test of vocabulary size, the number of words the learner knows, and strength, a combination of four aspects of knowledge of meaning that are assumed to constitute a hierarchy of difficulty: passive recognition (easiest), active recognition, passive recall, and active recall (hardest). The participants were 435 learners of English as a second language. We investigated whether the above hierarchy was valid and which strength modality correlated best with classroom language performance. Results showed that the hypothesized hierarchy was present at all word frequency levels, that passive recall was the best predictor of classroom language performance, and that growth in vocabulary knowledge was different for the different strength modalities.