Using Classical Test Theory, Item Response Theory, and Rasch Measurement Theory to Evaluate Patient-Reported Outcome Measures: A Comparison of Worked Examples

State of the psychometric methods: patient-reported outcome measure development and refinement using item response theory

Journal of Patient-Reported Outcomes

Background: This paper is part of a series comparing different psychometric approaches to evaluating patient-reported outcome (PRO) measures using the same items and dataset. We provide an overview and an example application to demonstrate 1) using item response theory (IRT) to identify poorly and well-performing items; 2) testing whether items perform differently based on demographic characteristics (differential item functioning, DIF); and 3) balancing IRT and content-validity considerations when selecting items for short forms. Methods: Model fit, local dependence, and DIF were examined for 51 items initially considered for the Patient-Reported Outcomes Measurement Information System® (PROMIS®) Depression item bank. Samejima's graded response model was used to examine how well each item measured severity levels of depression and how well it distinguished between individuals with high and low levels of depression. Two short forms were constructed based on psychometric properties and consensus discussions with instrument developers, including psychometricians and content experts. Calibrations presented here are for didactic purposes and are not intended to replace official PROMIS parameters or to be used for research. Results: Of the 51 depression items, 14 exhibited local dependence, 3 exhibited DIF for gender, and 9 exhibited misfit; these items were removed from consideration for the short forms. Short form 1 prioritized content, so items were chosen to meet DSM-5 criteria rather than being discarded for lower discrimination parameters. Short form 2 prioritized well-performing items, so fewer DSM-5 criteria were satisfied. Short forms 1 and 2 performed similarly on model fit statistics, but short form 2 provided greater item precision. Conclusions: IRT is a family of flexible models providing item- and scale-level information, making it a powerful tool for scale construction and refinement.
Strengths of IRT models include placing respondents and items on the same metric, testing DIF across demographic or clinical subgroups, and facilitating the creation of targeted short forms. Limitations include the large sample sizes needed to obtain stable item parameters and the familiarity with measurement methods required to interpret results. Combining psychometric data with stakeholder input (including clinicians and people with lived experience of the health condition) is highly recommended for scale development and evaluation.
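
To make the graded response model mentioned above concrete, here is a minimal sketch of how Samejima's model assigns category probabilities to a polytomous item. The function name, discrimination value, and threshold values are illustrative only, not PROMIS calibrations:

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Samejima's graded response model: probability of each response
    category for a respondent at latent severity theta, given item
    discrimination a and ordered category thresholds b_k."""
    def p_star(b):
        # Probability of responding in category k or higher (boundary curve).
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    cumulative = [1.0] + [p_star(b) for b in thresholds] + [0.0]
    # Category probability = difference between adjacent boundary curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

# A hypothetical 4-category depression item (parameters are illustrative).
probs = grm_category_probs(theta=0.5, a=1.8, thresholds=[-1.0, 0.2, 1.4])
print([round(p, 3) for p in probs])  # four probabilities that sum to 1
```

A more discriminating item (larger a) concentrates these probabilities into fewer categories at any given theta, which is why discrimination parameters figure in short-form item selection.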

Current issues in psychometric assessment of outcome measures

Medicina Fluminensis, 2012, Vol. 48, No. 4, pp. 463-470

In recent years there has been increasing use of outcome measures in clinical practice, audit procedures, and quality control. The psychometric assessment of these measures is still largely based on classical test theory (CTT), including analysis of internal consistency, reproducibility, and criterion-related validity. But this approach neglects standard criteria and practical attributes that need to be considered when evaluating the fundamental properties of a measurement tool. Conversely, Rasch analysis (RA) is an item response theory analysis based on latent-trait modelling that provides a statistical model prescribing how data should behave in order to comply with the theoretical requirements of measurement. RA yields psychometric information not obtainable through CTT, namely: (i) the functioning of rating scale categories; (ii) the measure's validity, e.g. how well an item performs in terms of its releva...

Pre-validation methods for developing a patient reported outcome instrument

Background: Measures that reflect patients' assessment of their health are of increasing importance as outcome measures in randomised controlled trials. The methodological approach used in the pre-validation development of new instruments (item generation, item reduction, and question formatting) should be robust and transparent. The totality of the content of existing PRO instruments for a specific condition provides a valuable resource (pool of items) that can be utilised to develop new instruments. Such 'top-down' approaches are common, but the explicit pre-validation methods are often poorly reported. This paper presents a systematic and generalisable 5-step pre-validation PRO instrument methodology. Methods: The method is illustrated using the example of the Aberdeen Glaucoma Questionnaire (AGQ). The five steps are: 1) generation of a pool of items; 2) item de-duplication (three phases); 3) item reduction (two phases); 4) assessment of the remaining items' content coverage against a pre-existing theoretical framework appropriate to the objectives of the instrument and the target population (e.g. the ICF); and 5) qualitative exploration of the target population's views of the new instrument and the items it contains. Results: The AGQ 'item pool' contained 725 items. Three de-duplication phases removed 91, 225, and 48 items, respectively. The two item-reduction phases discarded 70 and 208 items, respectively. The draft AGQ contained 83 items with good content coverage. The qualitative exploration ('think aloud' study) resulted in the removal of a further 15 items and refinement of the wording of others. The resultant draft AGQ contained 68 items. Conclusions: This study presents a novel methodology for developing a PRO instrument, based on three sources: literature reporting what is important to patients; a theoretically coherent framework; and patients' experience of completing the instrument.
By systematically accounting for all items dropped after the item generation phase, our method ensures that the AGQ is developed in a transparent, replicable manner and is fit for validation. We recommend this method to enhance the likelihood that new PRO instruments will be appropriate to the research context in which they are used, acceptable to research participants and likely to generate valid data.
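The item counts reported in the abstract can be tallied step by step; a quick sketch (the variable names are ours, the numbers are the abstract's):

```python
# Sanity check of the AGQ item counts reported in the abstract.
initial_pool = 725
dedup_removed = [91, 225, 48]    # three de-duplication phases
reduction_removed = [70, 208]    # two item-reduction phases
think_aloud_removed = 15         # qualitative 'think aloud' study

after_dedup = initial_pool - sum(dedup_removed)
draft_agq = after_dedup - sum(reduction_removed)
final_agq = draft_agq - think_aloud_removed
print(after_dedup, draft_agq, final_agq)  # 361 83 68
```

The arithmetic confirms the reported 83-item draft and 68-item final instrument, which is exactly the kind of transparent accounting the authors advocate.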

Statistical considerations in the design, analysis and interpretation of clinical studies that use patient-reported outcomes

Statistical Methods in Medical Research, 2014

This special issue of Statistical Methods in Medical Research emphasizes statistical considerations that are unique to a clinical study whose endpoint is based on a patient-reported outcome (PRO), which is any report of the status of a patient's health condition that comes directly from the patient, without interpretation of the patient's response by a clinician or anyone else [1]. The design, analysis and interpretation of results from clinical studies that use PROs to support the efficacy of an investigational medical product are addressed by the papers in this issue. Many of the viewpoints in this issue also apply to clinician-reported outcomes and observer-reported outcomes, in addition to PROs. As Julious and Walters describe in this issue, these three types of outcomes can simply be called 'person-reported outcomes' [2]. Although the concerns and techniques discussed in the papers may pertain to the use of PROs as diagnostic tools in clinical or rehabilitation settings, these settings are not explicitly covered. The selection of a PRO instrument appropriate for the clinical study population and study objective is crucial to the successful outcome of a clinical study. In this issue, Izem et al. discuss statistical considerations regarding the choice of single-item PRO instruments and multi-item PRO instruments [3]. They also describe challenges that must be addressed when existing instruments are adapted for use in clinical studies. In addition to the use of existing instruments, a second approach is to develop a new PRO instrument for the study. Some statistical considerations important to the development of new PRO instruments are described next in this editorial. Increasingly, modern psychometric theory (e.g. item response theory and Rasch models) and classical test theory are being used to develop new PRO instruments. In this issue, Massof provides an overview of these methods and their implications for the validation and scoring of instruments [4]. This overview should be helpful to statisticians who need to understand these approaches and the roles of these approaches in the development of new, validated instruments for use in clinical studies. The potential for differential item functioning (DIF) and, more specifically, intervention-specific DIF needs to be considered when an instrument is being selected for a clinical study [4]. Massof alerts...

Literature review to assemble the evidence for response scales used in patient-reported outcome measures

Journal of Patient-Reported Outcomes

Background: In the development of patient-reported outcome (PRO) instruments, little documentation is provided to justify the selection of response scales. The selection of response scales is often based on the developers' preferences or therapeutic-area conventions. The purpose of this literature review was to assemble evidence on the selection of response scale types in PRO instruments. Methods: The literature search was conducted in the EMBASE, MEDLINE, and PsycINFO databases. A secondary search was conducted on supplementary sources, including reference lists of key articles, websites of major PRO-related working groups and consortia, and conference abstracts. Evidence on the selection of verbal rating scales (VRS), numeric rating scales (NRS), and visual analogue scales (VAS) was collated based on pre-determined categories pertinent to the development of PRO instruments: reliability, validity, and responsiveness of PRO instruments; select therapeutic areas; and the optimal number of response scale options. Results: A total of 6713 abstracts were reviewed, and 186 full-text references were included. There was a lack of consensus in the literature on the justification for response scale type based on the reliability, validity, and responsiveness of a PRO instrument. The type of response scale varied within the following therapeutic areas: asthma, cognition, depression, fatigue in rheumatoid arthritis, and oncology. The optimal number of response options depends on the construct, but quantitative evidence suggests that a 5-point or 6-point VRS was more informative and discriminative than fewer response options. Conclusions: The VRS, NRS, and VAS are acceptable response scale types in the development of PRO instruments. The empirical evidence on the selection of response scales was inconsistent and, therefore, more empirical evidence needs to be generated.
In the development of PRO instruments, it is important to consider the measurement properties and therapeutic area and provide justification for the selection of response scale type.

Patient-Reported Outcome Instrument Selection: Designing a Measurement Strategy

Value in Health, 2007

Objective: To discuss issues in the design of a measurement strategy related to the use of patient-reported outcomes (PROs) in support of a labeling claim. Methods: In association with the release by the US Food and Drug Administration of its draft guidance on the use of PROs to support labeling claims, the Mayo/FDA Patient-Reported Outcomes Consensus Writing Group was formed. This paper, part of a series of manuscripts produced by the Writing Group, focuses on designing a PRO measurement strategy. Results: Developing a PRO measurement strategy begins with a clear statement of the proposed label claim that will derive from the PRO data. Investigators should identify the relevant domains to measure, develop a conceptual framework, identify alternative approaches for measuring the domains, and synthesize the information to design the measurement strategy.

Patient-Reported Outcome Measures: Development and Psychometric Evaluation

2018

This chapter has been created to provide an accessible introduction to the development and psychometric evaluation of patient-reported outcome (PRO) measures specifically designed to assess key endpoints in clinical trials, with the ultimate goal of supporting approval and/or labeling claims for pharmaceutical products. While many of our recommendations are broadly applicable to the development of PRO measures for use in clinical trials in any country and in other types of patient-based research (such as observational studies), this chapter will primarily focus on assembling and documenting the types of evidence needed to facilitate reviews of key study endpoints by the United States (US) Food and Drug Administration (FDA).

An Introduction to Item Response Theory for Patient-Reported Outcome Measurement

The Patient - Patient-Centered Outcomes Research, 2014

The growing emphasis on patient-centered care has accelerated the demand for high-quality data from patient-reported outcome (PRO) measures. Traditionally, the development and validation of these measures has been guided by classical test theory. However, item response theory (IRT), an alternative measurement framework, offers promise for addressing practical measurement problems found in health-related research that have been difficult to solve through classical methods. This paper introduces foundational concepts in IRT, as well as commonly used models and their assumptions. Existing data on a combined sample (n = 636) of Korean American and Vietnamese American adults who responded to the High Blood Pressure Health Literacy Scale and the Patient Health Questionnaire-9 are used to exemplify typical applications of IRT. These examples illustrate how IRT can be used to improve the development, refinement, and evaluation of PRO measures. Greater use of methods based on this framework can increase the accuracy and efficiency with which PROs are measured. Patient-reported outcomes (PROs) have long been a staple of clinical research [1, 2]. For many years, funding agencies and regulatory bodies such as the US Food and Drug Administration, the Centers for Medicare & Medicaid Services, the British National Health Service, and, more recently, the Patient-Centered Outcomes Research Institute, have...
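
One of the foundational IRT models such introductions typically start with is the two-parameter logistic (2PL) model for dichotomous items. A minimal sketch (function name and parameter values are illustrative, not taken from the paper's data):

```python
import math

def two_pl(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: probability of endorsing an
    item, given latent trait theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A highly discriminating item (a = 2.0) separates respondents around its
# difficulty (b = 0) more sharply than a weakly discriminating one (a = 0.5).
print(round(two_pl(1.0, a=2.0, b=0.0), 3))  # 0.881
print(round(two_pl(1.0, a=0.5, b=0.0), 3))  # 0.622
```

Plotting this probability against theta gives the item characteristic curve; steeper curves (larger a) mark items that contribute more measurement information near their difficulty.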

Methodological issues regarding power of classical test theory (CTT) and item response theory (IRT)-based approaches for the comparison of patient-reported outcomes in two groups of patients - a simulation study

BMC Medical Research Methodology, 2010

Background: Patient-reported outcomes (PRO) are increasingly used in clinical and epidemiological research. Two main types of analytical strategies can be found for these data: classical test theory (CTT), based on observed scores, and models coming from item response theory (IRT). However, whether IRT or CTT is the more appropriate method for analysing PRO data remains unknown. The statistical properties of CTT and IRT, regarding power and corresponding effect sizes, were compared. Methods: Two-group cross-sectional studies were simulated for the comparison of PRO data using IRT- or CTT-based analysis. For IRT, different scenarios were investigated according to whether item or person parameters were assumed to be known, known only to a certain extent (for item parameters, from good to poor precision), or unknown and therefore estimated. The powers obtained with IRT or CTT were compared, and the parameters having the strongest impact on them were identified. Results: When person parameters were assumed to be unknown and item parameters to be either known or not, the power achieved using IRT or CTT was similar and always lower than the expected power using the well-known sample size formula for normally distributed endpoints. The number of items had a substantial impact on power for both methods. Conclusion: Without any missing data, IRT and CTT seem to provide comparable power. The classical sample size formula for CTT seems to be adequate under some conditions but is not appropriate for IRT. In IRT, it seems important to take the number of items into account to obtain an accurate formula.
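
The CTT arm of such a power simulation can be sketched in a few lines. This is an illustrative Monte Carlo setup under assumed Rasch item parameters and a large-sample z-test, not the authors' actual simulation design:

```python
import math
import random

def rasch_response(theta, b):
    """One simulated dichotomous item response under the Rasch model."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return 1 if random.random() < p else 0

def ctt_power(n_per_group=50, difficulties=(-1.5, -0.75, 0.0, 0.75, 1.5),
              effect=0.5, n_sims=500, z_crit=1.96):
    """Monte Carlo estimate of the power of a CTT-style analysis: sum scores
    compared between two groups, with latent traits N(0, 1) vs. N(effect, 1)
    and item responses generated from a Rasch model."""
    rejections = 0
    for _ in range(n_sims):
        g0 = [sum(rasch_response(random.gauss(0.0, 1.0), b) for b in difficulties)
              for _ in range(n_per_group)]
        g1 = [sum(rasch_response(random.gauss(effect, 1.0), b) for b in difficulties)
              for _ in range(n_per_group)]
        m0, m1 = sum(g0) / n_per_group, sum(g1) / n_per_group
        v0 = sum((x - m0) ** 2 for x in g0) / (n_per_group - 1)
        v1 = sum((x - m1) ** 2 for x in g1) / (n_per_group - 1)
        se = math.sqrt(v0 / n_per_group + v1 / n_per_group)
        if se > 0 and abs(m1 - m0) / se > z_crit:
            rejections += 1
    return rejections / n_sims

random.seed(12345)
print(ctt_power())  # estimated power for 5 items, n = 50 per group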

Clinimetric Criteria for Patient-Reported Outcome Measures

Psychotherapy and Psychosomatics, 2021

Patient-reported outcome measures (PROMs) are self-rated scales and indices developed to improve the detection of the patients’ subjective experience. Given that a considerable number of PROMs are available, it is important to evaluate their validity and usefulness in a specific research or clinical setting. Published guidelines, based on psychometric criteria, do not fit in with the complexity of clinical challenges, because of their quest for homogeneity of components and inadequate attention to sensitivity. Psychometric theory has stifled the field and led to the routine use of scales widely accepted yet with a history of poor performance. Clinimetrics, the science of clinical measurements, may provide a more suitable conceptual and methodological framework. The aims of this paper are to outline the major limitations of the psychometric model and to provide criteria for clinimetric patient-reported outcome measures (CLIPROMs). The characteristics related to reliability, sensitivi...