Jonathan Weeks - Academia.edu
Papers by Jonathan Weeks
Psychological test and assessment modeling, 2016
Examinees may omit responses on a test for a variety of reasons, such as low ability, low motivation, lack of attention, or running out of time. Some decision must be made about how to treat these missing responses for the purpose of scoring and/or scaling the test, particularly if there is an indication that missingness is not skill related. The most common approaches are to treat the responses as either not reached/administered or incorrect. Depending on the total number of missing values, coding all omitted responses as incorrect is likely to introduce negative bias into estimates of item difficulty and examinee ability. On the other hand, if omitted responses are coded as not reached and excluded from the likelihood function, the precision of estimates of item and person parameters will be reduced. This study examines the use of response time information collected in many computer-based assessments to inform the coding of omitted responses. Empirical data from the Programme for the International Assessment of Adult Competencies (PIAAC) literacy and numeracy cognitive tests are used to identify item-specific timing thresholds via several logistic regression models that predict the propensity of responding rather than producing a missing data point. These thresholds can be used to inform the decision about whether an omitted response should be treated as not administered or as incorrect. The results suggest that for many items the timing thresholds (20 to 30 seconds on average) at a high expected probability level of observing a response are notably higher than thresholds used in the evaluation of rapid guessing (e.g., 5 seconds).
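As a rough illustration of the item-level threshold idea (a sketch under assumptions, not the paper's actual models, which compare several logistic regression specifications), one could regress the response/omit indicator on log response time and invert the fitted curve at a high target probability. The 0.9 target, the function name, and the synthetic data below are all illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def timing_threshold(times_sec, responded, target_prob=0.9):
    """Estimate the response time at which the predicted probability
    of producing a response (rather than omitting) reaches target_prob.

    times_sec : response times for one item, in seconds
    responded : 0/1 flags (1 = response produced, 0 = omitted)
    """
    x = np.log(np.asarray(times_sec)).reshape(-1, 1)  # log time as predictor
    y = np.asarray(responded)
    model = LogisticRegression().fit(x, y)
    b0, b1 = model.intercept_[0], model.coef_[0, 0]
    # Invert the logistic: logit(target_prob) = b0 + b1 * log(t)
    logit = np.log(target_prob / (1 - target_prob))
    return float(np.exp((logit - b0) / b1))

# Synthetic example: omissions concentrated at very short times
rng = np.random.default_rng(0)
t = rng.lognormal(mean=3.0, sigma=0.7, size=2000)    # times in seconds
p = 1 / (1 + np.exp(-(np.log(t) - 2.0) * 2.5))       # respond propensity
r = rng.binomial(1, p)
print(f"threshold: {timing_threshold(t, r):.1f} s")
```

Because the threshold sits on the propensity-of-responding scale, it can land well above the roughly 5-second cutoffs used in rapid-guessing screens, consistent with the abstract's finding.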
Applied Measurement in Education, 2018
Indicators of student academic growth are desired in state accountability systems in order to approximate student learning over time and attribute observed growth to schooling inputs. Through an extant analysis of five states' assessment data, this study offers evidence about whether longitudinal match rates and measures of growth differ at the state level for students with disabilities, relative to students without disabilities. There were three main findings: 1) In states in which a modified assessment was offered, students with disabilities were more likely to have missing prior year scores, and consequently missing growth scores; 2) Low scoring students, many of whom had a disability, were more likely to have missing prior scores on the state general assessment, and consequently missing growth scores; 3) Students with and without disabilities showed similar growth using transition and gain score definitions of growth, but students with disabilities had lower growth when estimated via a regression-based model. Measurement and policy considerations are discussed.
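The contrast between gain-score and regression-based growth definitions can be made concrete. The sketch below is not drawn from the paper; the data and the simple residualized-gain formulation are assumptions, but they show how the two definitions can rank the same students differently.

```python
import numpy as np

def gain_scores(prior, current):
    """Simple gain: current score minus prior score."""
    return np.asarray(current, dtype=float) - np.asarray(prior, dtype=float)

def regression_based_growth(prior, current):
    """Residualized gain: distance from the score predicted by a
    least-squares regression of current scores on prior scores."""
    x = np.asarray(prior, dtype=float)
    y = np.asarray(current, dtype=float)
    b1, b0 = np.polyfit(x, y, 1)   # slope, intercept
    return y - (b0 + b1 * x)       # positive = grew more than predicted

prior = np.array([310.0, 450.0, 520.0, 600.0])
current = np.array([360.0, 495.0, 560.0, 640.0])
print(gain_scores(prior, current))              # similar raw gains
print(regression_based_growth(prior, current))  # residuals re-rank students
```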
Journal of Educational Psychology, 2019
The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. We would like to thank Don Powers, Hyo Jeong Shin, and Laura Halderman for providing helpful feedback on earlier versions of this manuscript.
International Journal of Testing, 2019
The construct of reading comprehension has changed significantly in the twenty-first century; however, some test designs have not evolved sufficiently to capture these changes. Specifically, the nature of literacy sources and the skills required has changed, driven primarily by the widespread use of digital technologies. Modern theories of comprehension and discourse processes have been developed to accommodate these changes, and the learning sciences have followed suit. These influences have significant implications for how we think about the development of comprehension proficiency across grades. In this paper, we describe a theoretically driven, developmentally sensitive assessment system based on a scenario-based assessment paradigm, and present evidence for its feasibility and psychometric soundness.
Behavior Research Methods, 2018
The validity of studies investigating interventions to enhance fluid intelligence (Gf) depends on the adequacy of the Gf measures administered. Such studies have yielded mixed results, with a suggestion that Gf measurement issues may be partly responsible. The purpose of this study was to develop a Gf test battery comprising tests meeting the following criteria: (a) strong construct validity evidence, based on prior research; (b) reliable and sensitive to change; (c) varying in item types and content; (d) producing parallel tests, so that pretest-posttest comparisons could be made; (e) appropriate time limits; (f) unidimensional, to facilitate interpretation; and (g) appropriate in difficulty for a high-ability population, to detect change. A battery comprising letter, number, and figure series and figural matrix item types was developed and evaluated in three large-N studies (N = 3,067, 2,511, and 801, respectively). Items were generated algorithmically on the basis of proven item models from the literature, to achieve high reliability at the targeted difficulty levels. An item response theory approach was used to calibrate the items in the first two studies and to establish conditional reliability targets for the tests and the battery. On the basis of those calibrations, fixed parallel forms were assembled for the third study, using linear programming methods. Analyses showed that the tests and test battery achieved the proposed criteria. We suggest that the battery as constructed is a promising tool for measuring the effectiveness of cognitive enhancement interventions, and that its algorithmic item construction enables tailoring the battery to different difficulty targets, for even wider applications.
Keywords: Intelligence; Fluid ability; Gf; Working memory training; Reasoning; Item response theory; Test assembly
General fluid ability (Gf) is "at the core of what is normally meant by intelligence" (Carroll, 1993, p. 196), and has been shown empirically to be synonymous with general cognitive ability (g), at least within groups with roughly comparable opportunities to learn (Valentin Kvist & Gustafsson, 2008). Gf has been viewed as an essential determinant of one's ability to solve a wide range of novel real-world problems (Schneider & McGrew, 2012). Perhaps because of its association with diverse outcomes, there has been a longstanding interest in improving Gf (i.e., intelligence) through general schooling.
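The paper assembles its fixed parallel forms with linear programming. As a hedged illustration of that general technique (not the authors' actual constraint set), the toy integer program below uses the PuLP library to pick two non-overlapping forms from an invented item pool while holding each form's summed IRT difficulty near a shared target; the pool size, form length, and tolerance are all assumptions.

```python
import pulp
import random

random.seed(1)
pool = [{"id": i, "b": random.uniform(-2, 2)} for i in range(40)]  # item difficulties
n_forms, items_per_form, target_sum, tol = 2, 10, 0.0, 0.5

prob = pulp.LpProblem("parallel_forms", pulp.LpMinimize)
x = {(i, f): pulp.LpVariable(f"x_{i}_{f}", cat="Binary")
     for i in range(len(pool)) for f in range(n_forms)}

prob += 0  # feasibility problem: any assignment meeting the constraints
for i in range(len(pool)):                       # each item used at most once
    prob += pulp.lpSum(x[i, f] for f in range(n_forms)) <= 1
for f in range(n_forms):
    prob += pulp.lpSum(x[i, f] for i in range(len(pool))) == items_per_form
    b_sum = pulp.lpSum(pool[i]["b"] * x[i, f] for i in range(len(pool)))
    prob += b_sum <= target_sum + tol            # forms match a difficulty target
    prob += b_sum >= target_sum - tol

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for f in range(n_forms):
    ids = [pool[i]["id"] for i in range(len(pool)) if x[i, f].value() > 0.5]
    print(f"form {f}: items {ids}")
```

A production assembly model would add constraints for content coverage, item-type mix, and matching test information curves rather than a single difficulty sum.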
ETS Research Report Series, 2016
Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Report series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.
Topics in Language Disorders, 2016
Traditional measures of reading ability designed for younger students typically focus on componential skills (e.g., decoding, vocabulary), and the items are often presented in a discrete and decontextualized format. The current study was designed to explore whether it was feasible to develop a more integrated, scenario-based assessment of comprehension for younger students. A secondary goal was to examine developmental differences in item performance when administration was in listening versus reading modalities. Cross-sectional differences were examined across kindergarten to third grade on a scenario-based assessment comprising literal comprehension, inference, vocabulary, and background knowledge items. The assessment, originally targeted for third grade, was administered one-on-one to 141 third-grade and 485 second-grade students. It was adapted for and administered to kindergarten (n = 390) and first-grade (n = 419) students by reducing the number of items and switching to a listening comprehension method of administration. Each grade was significantly more accurate than the previous grade on overall performance and background knowledge. A regression analysis showed significant variance associated with background knowledge in predicting comprehension, even after controlling for grade. A deeper analysis of item performance across grades was conducted to examine what elements worked well and where improvements should be made in adapting comprehension assessments for use with young children.
Journal of Statistical Software, 2010
This introduction to the R package plink is a (slightly) modified version of Weeks (2010), published in the Journal of Statistical Software. The R package plink has been developed to facilitate the linking of mixed-format tests for multiple groups under a common item design using unidimensional and multidimensional IRT-based methods. This paper presents the capabilities of the package in the context of the unidimensional methods. The package supports nine unidimensional item response models (the Rasch model, 1PL, 2PL, 3PL, graded response model, partial credit and generalized partial credit model, nominal response model, and multiple-choice model) and four separate calibration linking methods (mean/sigma, mean/mean, Haebara, and Stocking-Lord). It also includes functions for importing item and/or ability parameters from common IRT software, conducting IRT true-score and observed-score equating, and plotting item response curves and parameter comparison plots.
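Of the four linking methods named, mean/sigma is the simplest to write down. The Python sketch below (not the plink package itself, which is in R) shows the standard mean/sigma transformation for 2PL common-item parameters; the example parameter values are invented.

```python
import numpy as np

def mean_sigma(b_new, b_base):
    """Mean/sigma linking constants mapping the new form's scale onto the
    base scale (theta* = A*theta + B), estimated from the difficulty
    parameters of the common items."""
    b_new, b_base = np.asarray(b_new), np.asarray(b_base)
    A = b_base.std(ddof=1) / b_new.std(ddof=1)
    B = b_base.mean() - A * b_new.mean()
    return A, B

def transform_2pl(a, b, A, B):
    """Rescale 2PL item parameters: b* = A*b + B, a* = a / A."""
    return np.asarray(a) / A, A * np.asarray(b) + B

# Common-item parameters estimated separately on each form
b_base = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
b_new = np.array([-0.9, -0.1, 0.4, 1.1, 1.9])
a_new = np.array([0.8, 1.1, 0.9, 1.3, 1.0])

A, B = mean_sigma(b_new, b_base)
a_star, b_star = transform_2pl(a_new, b_new, A, B)
print(f"A = {A:.3f}, B = {B:.3f}")
print("rescaled b:", np.round(b_star, 3))
```

The characteristic-curve methods (Haebara, Stocking-Lord) instead choose A and B to minimize differences between item or test response functions, which is more robust to outlying difficulty estimates.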
ETS Research Report Series, 2015
Educational Psychology Review, 2014
When designing a reading intervention, researchers and educators face a number of challenges related to the focus, intensity, and duration of the intervention. In this paper, we argue there is another fundamental challenge: the nature of the reading outcome measures used to evaluate the intervention. Many interventions fail to demonstrate significant improvements on standardized measures of reading comprehension. Although there are a number of reasons to explain this phenomenon, an important one to consider is misalignment between the nature of the outcome assessment and the targets of the intervention. In this study, we present data on three theoretically driven summative reading assessments that were developed in consultation with a research and evaluation team conducting an intervention study. The reading intervention, Reading Apprenticeship, involved instructing teachers to use disciplinary strategies in three domains: literature, history, and science. Factor analyses and other psychometric analyses on data from over 12,000 high school students revealed the assessments had adequate reliability, moderate correlations with state reading test scores and measures of background knowledge, a large general reading factor, and some preliminary evidence for separate, smaller factors specific to each form. In this paper, we describe the empirical work that motivated the assessments, the aims of the intervention, and the process used to develop the new assessments. Implications for intervention and assessment are discussed.
Journal of Educational and Behavioral Statistics, 2011
Using longitudinal data for an entire state from 2004 to 2008, this article describes the results from an empirical investigation of the persistence of value-added school effects on student achievement in reading and math. It shows that when schools are the principal units of analysis rather than teachers, the persistence of estimated school effects across grades can only be reasonably identified by placing strong constraints on the variable persistence model implemented by Lockwood, McCaffrey, Mariano, and Setodji. In general, there are relatively strong correlations between the school effects estimated using these constrained models and a reference model that assumes full persistence. These correlations vary somewhat by grade and the underlying test subject. The results from this study indicate cautious support for previous findings that the assumption of full persistence for cumulative value-added effects may be untenable, and evidence is also presented, which indicates a strong ...
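For readers unfamiliar with the variable persistence model the abstract refers to, one common way to write it is sketched below. The notation is a reconstruction from the general value-added literature, not the paper's own; the alpha parameters weight how much of each prior-grade school effect carries forward, and fixing all alpha equal to 1 yields the complete-persistence reference model.

```latex
% Achievement of student i in grade g: grade mean, the current school's
% effect, persistence-weighted effects of previously attended schools,
% and residual error. Setting all \alpha_{g g'} = 1 gives full persistence.
Y_{ig} = \mu_g + \theta_{s(i,g),\,g}
       + \sum_{g' < g} \alpha_{g g'}\, \theta_{s(i,g'),\,g'}
       + \varepsilon_{ig}
```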
Vertical scales are typically developed for the purpose of quantifying achievement growth. In practice, it is commonly assumed that all of the scaled tests measure a single construct; however, in many instances there are strong theoretical and empirical reasons ...