Novel Approaches to Measuring Neurocognitive Functions in Alzheimer’s Disease Clinical Trials (original) (raw)

. Author manuscript; available in PMC: 2023 Apr 1.

Abstract

Purpose of review:

We comprehensively examined recent advancements in developing novel cognitive measures that could significantly enhance detection of outcome changes in Alzheimer’s disease (AD) clinical trials. Previously established measures were largely limited in their ability to detect subtle cognitive declines in preclinical stages of AD, particularly due to weak psychometric properties (including practice effects and ceiling effects) and requirement of in-person visits that impacted ascertainment.

Recent findings:

We present novel cognitive measures that were designed to exhibit reduced practice effects and stronger correlations with AD biomarkers. Additionally, we summarized some recent efforts in developing remote testing measures protocols that are aimed to overcome the limitations and inconvenience of in-person testing, and digital phenotyping, which analyzes subtle forms of digital behavior indicative of cognitive phenotypes. We discuss each measure’s prognostic accuracy and potential utility in AD research while also commenting on their limitations. We also describe our study, the Development of Novel Measures for Alzheimer’s Disease Prevention Trials (NoMAD), that employed a parallel group design in which novel measures and established measures are compared in a clinical trials armature.

Summary:

Overall, we believe that these recent developments offer promising improvements in accurately detecting clinical and preclinical cognitive changes in the AD spectrum; however, further validation of their psychometric properties and diagnostic accuracies are warranted before reliably implementing these novel measures in AD clinical trials.

Keywords: Preclinical Alzheimer’s Disease, Cognition, Neuropsychology, Testing, Biomarkers

INTRODUCTION

Alzheimer’s disease (AD) is the most common cause of dementia, affecting over 5.8 million people in the US [1]. The neuropathology of AD – such as amyloid-beta (Aβ) plaques and tau tangles – can begin years to decades before clinical symptoms are apparent, although subtle cognitive decline can be present [2]. This preclinical phase has garnered much attention as a window for possible interventions, and yet, our ability to detect subtle cognitive decrements in preclinical AD is limited. Indeed, many of the gold-standard cognitive and functional tests in AD research were originally developed for the clinical population who already show significant cognitive changes. There is a pressing need for improved, validated testing in preclinical AD. This article will describe the issues with current gold-standard tests and various attempts at developing novel neuropsychological measures. We also illustrate methodological limitations of validating novel measures and present ways to address these issues.

LIMITATIONS OF ESTABLISHED TESTS

The need for novel measures of cognition comes in part from the significant practice effects observed in serial assessment with well-established tests. Practice effects are improvements in test performance due to repeated exposure to the test materials and paradigm. Practice effects can be problematic firstly because they can be mistaken as improvements in cognition. Moreover, such gains in performance for a specific test cannot be transferred to other tasks or real-world scenarios. Consequently, tests impacted by practice effects may begin to measure test-taking skill rather than the cognitive capacity they are intended to assess. Further, in a clinical trial setting, improvements due to practice effects may obscure subtle cognitive decline in placebo groups, making it more difficult to identify improvements in an active treatment group. It is even possible for a practice effect to be mistaken for a treatment signal, as has been seen in schizophrenia literature [3]. Goldberg et al.[4] analyzed data from Alzheimer’s Disease Neuroimaging Initiatives (ADNI) and found that cognitively normal older adults exhibited low to medium practice effects on widely-used tests over the course of 6–12 months (effect sizes (ES) of 0.20–0.30). Other studies reported practice effects of similar magnitude [5,6]. Though seemingly modest, practice effects of this magnitude could obscure years of subtle cognitive decline in preclinical populations [4]. Many established cognitive tests fail to follow the cognitive science literature on how to reduce practice effects, lacking approaches that reduce memorization of responses or strategies.

Another weakness of established neuropsychological testing is the pronounced ceiling effects in healthy participants. Ceiling effects occur when most participants reach the maximum score on a test due to low level of difficulty. Because ceiling effects reduce variability in data, they prevent meaningful analysis. Ceiling effects are common in orientation questions (date, year, location etc.) used in many cognitive tests, including the Mini Mental Status Examination (MMSE) and the Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog). In the ADNI data set, roughly 80% of healthy controls and 42% of MCI individuals score 10/10 on the MMSE orientation [7]. Functional measurements, designed primarily for individuals with late MCI or AD, are also susceptible to ceiling effects; for instance, a majority of individuals in preclinical stages score zero (i.e., no functional impairment) on the Functional Activities Questionnaire (FAQ), a commonly used informant-based questionnaire [8]. These ceiling effects in neuropsychological and functional testing limit the measures’ ability to identify subtle cognitive or functional decline.

Traditional tests also have relatively weak correlations with AD biomarkers. In cross-sectional studies, cognitive differences between Aβ+ and Aβ- individuals are typically small or undetectable using established tests [911]. In longitudinal studies, Aβ+ individuals decline faster than Aβ- individuals, but this deficit manifests only over the course of years. For instance, one study found that the difference in decline between Aβ+ and Aβ- groups was 0.07 to 0.15 z-scores per year [12]. Tau aggregation is thought to be more closely related to cognitive impairment than Aβ [1315], but even less is known about the association between established measures and tau. Given that both tau PET and plasma p-tau217 may yield high predictability of preclinical and prodromal AD [16], future validation studies of neuropsychological tests should include tau measures to provide convergent evidence for validity.

In sum, there is an acute need for new neuropsychological tests that are supported by cognitive science literature to reduce practice effects and ceiling effects and improve sensitivity to AD biomarkers. In addition, these tests must retain good validity and test-retest reliability and must demonstrate usefulness in a clinical trial setting.

NOVEL MEASURES

There has been recent advancement in neuropsychological test constructions to target problematic psychometric properties of the previously established measures.

Serial Assessments

Many new computerized assessments have been designed for serial administration. One such test is the Computerized Preclinical Alzheimer’s Cognitive Composite (cPACC). The cPACC is a web-based battery intended to be a digital equivalent of the original PACC (paper and pencil based), the current gold-standard in preclinical AD testing. The PACC consists of 4 subtests and assesses orientation, episodic memory, working memory, and processing speed. A digital version of the PACC is desirable because it could standardize administration, eliminate the need for in-person assessment, and allow for easier data transfer and analysis. In the cPACC, episodic memory is tested by the Face Name Hobby Recall (a matching task with a delay component) and Grid Locations (a visual learning task). Two nonverbal tasks assess working memory: a symbol line task, in which participants recall the ordering of a string of symbols, and a visual patterns task, in which participants are shown an array of boxes and asked to recall the order in which boxes turned black. Speeded matching, a computerized analog to the (WAIS-R) digit symbol coding, measures processing speed. A validation study [17] indicated that cPACC scores were moderately associated with the original PACC (r = .61). Participants who scored above and below a 25 on the MMSE had significantly different scores across all measures, implying some sensitivity to cognitive changes. Nonetheless, the cPACC had weak tests-retest reliability on some subtests (r = .36, r = .48, and r= .45 on symbol line, grid locations delayed recall, and visual patterns, respectively) , and there were significant practice effects on episodic memory tasks (cohen’s d = .49 and cohen’s d =.61 on Grid Locations delayed recall and Face Name Hobby delayed recall, respectively). No alternate forms were available to compensate for such practice effects. While the cPACC’s web-based administration may have some benefits, such as reducing the need for in-person testing, it also demonstrated problems with reliability and practices effects, and its correlation to the PACC was only moderate. cPACC also needs further studies for association with AD biomarkers.

The CogState Brief Battery (CBB) is a commercially available, computerized measure designed to minimize cultural and language biases by using playing cards as stimuli. It consists of a simple reaction time task (Detection task), a choice reaction time task (Identification task), a one-back working memory task (One-Back task), and a continuous recognition visual learning task (Learning Task). The CBB assesses attention/vigilance, processing speed, working memory, and visual learning, and uses alternate forms to reduce practice effects. The CBB demonstrated good validity across neuropsychiatric disorders, such as schizophrenia and dementia associated with HIV [18]. A more recent study [19] added the Face-Name Associative Memory Exam (FNAME) and Mnemonic Separation Task (MST), two tasks that recruit medial temporal regions [20,21], to create the Computerized Cognitive Composite (C3) specifically for use in AD and MCI. The C3 correlates with the PACC and Aβ status, though the PACC proved more sensitive to Aβ status than the C3. The C3 exhibits adequate test-retest reliability (ICC=.626) [22], but suffers from moderate practice effects (increases of .25 standard deviations between visits). The C3 has the advantage of standardized digital administration, a brief format (roughly 15 minutes), and reduced cultural bias [19,23], but it does not address other psychometric issues of established tests, such as practice effects, and relatively low correlations with biomarkers.

The NIH Toolbox Cognitive Battery (NIHTB-CB) is another brief computerized battery. It consists of 7 subtests that assess attention and executive functions, language, processing speed, working memory, and episodic memory. It has been well-validated across a wide age range in cognitively healthy people, showing expected age effects and good test-retest reliability [24,25]. The NIH-CB composite has a small practice effect (d = 0.29), comparable to Goldberg’s aforementioned finding for well-established tests (d =0.25) [4]. A delayed recall trial – sometimes added to the NIH-CB to further tailor the assessment to AD-related impairment – has similar accuracy in classifying CN, MCI, and AD, compared to the PACC. Studies evaluating the NIHTB-CB’s sensitivity to AD biomarkers are limited, and only a weak association was noted with tau pathology (and no association with Aβ) [26]. Similar to CogState’s CBB, the NIHTB-CB is valid and advantageous in some respects, with its brevity and its simple, standardized administration. However, psychometrically, the NIHTB-CB does not present significantly improved sensitivity or reduced practice effects compared with established paper and pencil tests, such as the PACC. Moreover, factor analyses indicate a fluid and a crystalized component which do not necessarily map onto known AD cognitive domain compromises. Further studies are needed to clarify its association with AD biomarkers.

Single time-point assessments

Other novel tests have been primarily used as single time-point assessments. Mormino et al. [27] have focused on tests that specifically engage the hippocampus, an area of the brain affected early in AD [28,29]. The study used an associative memory task [30], in which participants were asked to pair concrete nouns (e.g., violin) with famous people or places (e.g., Ronald Reagan, Golden Gate Bridge) and assessed learning with cued recall. To our knowledge, no data on test-retest reliability, practice effects or associations with other neuropsychological tests have been published. Mormino et al. also administered the widely-used and previously-validated Mnemonic Similarity Task (MST) [31] to test precision of pattern separation learning. In this task, participants learned objects with forced encoding and were subsequently asked to categorize new foils, similar lures, or old objects. The MST is thought to be particularly dependent on dentate function. This task has alternate forms and is robust to practice effects [32], making it also suitable for serial assessments. Worse performance on the associative memory task and MST correlated with higher levels AD biomarkers, such as CSF Aβ42/Aβ40 and Aβ+ status, though associations were small [27]. Both tasks were more sensitive to Aβ than the Logical Memory test from the Wechsler Memory Scale. While more biomarker and validation studies are needed, especially for the associative memory task, these tasks demonstrate the value of targeting the hippocampus in preclinical AD testing.

Another promising novel measure called the Loewenstein-Acevedo Scale for Semantic Interference and Learning (LASSI-L) uses a challenging word learning task to detect subtle cognitive impairment. The LASSI-L employs proactive semantic interference paradigm in which learning and recall of one list of words (List A) interferes with the ability to learn another set of words from the same semantic categories (List B). This task also assesses the degree of retroactive learning (List B’s interference with recall of List A). The performance score is based on presence of both proactive- and retroactive interference, as well as the ability to overcome proactive interference, a capacity unique to the LASSI-L. MCI individuals are more susceptible to both types of interference than cognitively normal individuals; indeed, the LASSI-L classified cognitively normal vs. MCI with 90% accuracy [33,34]. Worse performance also correlated with higher amyloid burden and atrophy of the medial temporal lobe [3436]. Test-retest reliability was high (r = 0.60 to r = 0.89) [35], and although no information is available regarding practice effects, creating alternate forms seem feasible. Overall, the LASSI-L is an improvement to established tests, with higher sensitivity and correlation with biomarkers. Its usefulness in clinical trials remains to be tested.

The Need to Assess the Novel Measures

Taken together, recent development of novel cognitive measures allows for diagnostic accuracy and enhanced ability to detect subtle cognitive changes over time. Nonetheless, it is unclear whether novel measures exhibit significant advantages over established measures in terms of psychometric properties and associations with AD biomarkers. Further studies are needed to validate their utility in AD clinical trials and should include more direct comparisons with established measures using parallel group design. While most of validation studies employed a design where each participant receives both novel and established tests, to correlate performance, such a design could lead to test interference, decreased power, and increased subject fatigue.

We aimed to address these limitations by conducting a randomized parallel group design that is consistent with the structure of a clinical trial (see Figure 1A). In an ongoing multisite study entitled, Development of Novel Measures for Alzheimer’s Disease Prevention Trials (NoMAD), 320 participants will be randomized to two clinical trial armatures, one group receiving novel cognitive measures and the other receiving established measures [33]. The novel measures include the No Practice Effect (NPE) battery, which is specifically designed to reduce practice effects according to the cognitive science literature, using three equivalent alternative forms and components that attenuate learning of study protocols (e.g., forced encoding on recognition memory, interference trials on the N-Back) [37]. Executive function, processing speed, and episodic memory are assessed (see Table 1). The Miami Computerized Functional Assessment Scale (CFAS) is another novel measure that is designed to measure functions related to instrumental activities of daily living (IADLs). The CFAS uses computerized simulations of IADLs, such as using an ATM machine or a ticket kiosk. (see Figure 1B). Accuracy and speed for completing the tasks are used to determine functional abilities. This method measures functioning based on participants’ behaviors, and therefore could eliminate subjectivity and bias that could arise from informant measures. The well-established battery consists of the PACC, ADAS-cog and FAQ. NoMAD is currently ongoing, and we aim to test whether the novel measures exhibit significant advantage over established measures in terms of psychometric properties, practice effects, and association with AD biomarkers (plasma phosphorylated tau and neurofilament light) and genetic markers (APOE status). Findings form this study will inform the utility of novel NPE and CFAS measures as reliable measures of cognitive changes in AD clinical trials. Additionally, findings from NOMAD will inform if this strategy is viable and if the tests themselves show promise in terms of validation parameters. If true, this methodology should be implemented in all studies of this type in the future.

Fig 1A.

Fig 1A.

Study Design. Participants are randomized into the novel or established group. The novel group completes the NPE battery and the CFAS at baseline, 3 months, and 12 months. The established group completes the PACC, ADAS-cog, and FAQ at baseline, 3 months, and 12 months. Both groups undergo MRI and have plasma biomarkers drawn at baseline and 12 months.

Table 1:

No Practice Effect (NPE) measures in NoMAD study

Domain Assessed Assessment
Executive function/working memory N-BackLetter-Number SpanBrown Peterson Paradigm
Processing speed Digit Symbol
Verbal Fluency Letter FluencySemantic Fluency
Episodic Memory Word Recognition Memory Immediate RecallWord Recognition Memory Delayed Recall

Fig. 1B.

Fig. 1B.

Miami Computerized Functional Assessment Scale (CFAS). Participants complete computerized simulations of four everyday activities: using an ATM machine, using an online banking system, operating a ticket kiosk, and managing medications.

REMOTE TESTING

Remote testing is an innovative approach to improve diagnostic accuracy and track cognitive changes over time. Smartphones and tablets have become ubiquitous, which led to the advancement of smartphone- and tablet-based remote testing for preclinical AD. Remote testing has multiple advantages. It is highly scalable and cost-effective, allowing access to much larger samples than in-clinic testing can provide. It also lends itself to more frequent testing, leading potentially to more reliable data (as it might eliminate “bad days” or skewed item pools). Further, the fact that participants may complete tests in an environment familiar to them, as opposed to a memory clinic, increases the ecological validity, and reduces anxiety associated with the “white-coat effect”. Remote testing is not without its challenges, however. Its feasibility must be demonstrated, as remote testing protocols can be difficult to monitor compliance and have high attrition [38]. A lack of researcher control over the testing environment with distraction could also lead to poor reliability. Finally, most published smartphone measures still lack rigorous validation. Nonetheless, development and validation of various remote measures offer opportunities for more effective and accurate data collection. Here we will discuss some noteworthy trends and studies in the field of remote testing.

Remote adaptation of in-person tests

Telehealth versions of many well-established tests have been validated and include the Telephone MMSE, Telephone Montreal Cognitive Assessment, and video-conferenced Rowland Universal Dementia Assessment Scale [3942]. In general, these tests correspond highly to in-person tests and demonstrate good feasibility. Some of the aforementioned novel digital tests have also been adapted for telehealth administration. The NIH toolbox has been administered using videoconferencing and showed no significant difference to in-person testing [43]. Cogstate has not published any data on remote protocols, but a study on their digital battery (in-clinic but unsupervised) demonstrated no significant differences between unsupervised and supervised scores [44]. Some measures from the previously-described No Practice Effect (NPE) battery were also adapted for smartphone use to allow for remote administration. NoMAD’s parallel group design allows for direct comparison of the smartphone measures to both established and novel in-clinic measures.

Burst Testing

A second category of remote testing uses a method called “burst” testing, wherein participants complete multiple brief assessments daily for a given span of time. These scores are then averaged to give a more representative measure of cognition. This design comes as a response to the inherent variability of humans’ everyday cognitive function [45], which single time point tests fail to capture. In a study supporting this design’s validity, Sliwinski et al. [46] used smartphones with participants aged 25–65 to deliver five daily tests of working memory and perceptual speed for a period of 2 weeks. The study found excellent reliability and feasibility. In Hassenstab et al.’s burst testing study [47], participants downloaded an app onto their own smartphones and completed four brief batteries a day for 7 days. The batteries included a Grids Tests (spatial memory), a Symbols Tests (processing speed), and Prices Test (associative memory). Preliminary results suggest excellent reliability (r = .90 to .98), and a wide range of correlations with well-established in-person measures (r = .24 to .74) [47]. Associations with biomarkers remain to be established.

Short-Term Serial Testing

Other researchers have used remote administration to effectively evaluate learning ability, leveraging high volume testing to measure improvement over time. Decrements in learning ability have been linked to preclinical AD, with some arguing that preclinical AD is primarily characterized by learning deficits. One study of 94 individuals used a Face-Name Associative Memory Exam (FNAME) to measure learning ability over the course of one year [48]. Tasks were completed every month using a mobile tablet. Aβ+ individuals performed significantly worse than Aβ - individuals after the 4th testing session. In contrast, no difference was observed in well-established measures (administered at baseline and again at 1-year). Although the task suffered from ceiling effects, this method shows promise in identifying learning differences at a much shorter timescale than in-person tests, important for early and accurate detection of cognitive change. Another, more rigorous, learning paradigm called Online Repeated Cognitive Assessment (ORCA) was developed by Lim et al. [49]. Participants were asked to match 50 Chinese characters to their meaning for 25 minutes a day for 6 days. This task, previously validated in a pilot study [50], was highly difficult and was free from ceiling effects. Aβ + individuals scored significantly worse than Aβ – individuals as soon as the second day of testing, with the gap continuing to widen over the course of 6 days. Alternate forms using different stimuli are in development, though currently unavailable. How this test would be used in a clinical trial remains to be seen, not least because of the rigor of its testing regimen. Nevertheless, this study speaks to the value of high volume, high difficulty testing to reveal subtle cognitive deficits. It is also possible that another rigorous verbal learning test administered for the same number of hours might yield similar results and have similar drawbacks.

Digital Phenotyping

The last trend in novel testing discussed here is digital phenotyping. Digital phenotyping uses data from devices to capture subtle behaviors that may be indicative of different cognitive phenotypes. The data can be collected actively, while participants engage in cognitive activities (e.g., assessments), or passively, by monitoring phone use, or other device data, in everyday life. The Digital Neuro Signature (DNS), developed by Altoida, is an augmented reality task in which participants hide objects and then are asked to retrieve them in specific orders. Using machine learning, the DNS assesses global cognitive function by measuring the speed and accuracy of performance, as well as over 100 subtle behaviors, such as grip strength, purposeless screen touch, and gait. A large cohort study (n= 548) showed the DNS to be highly prognostically accurate (.94), sensitive (0.91) and specific (0.82) [51]. The DCTClock, a digital adaptation of the common paper-and-pencil clock drawing test, also uses an active data collection approach. In the original clock drawing test, participants are asked to draw the face of a clock showing a specific time, and clinicians evaluate the end result. The DCTClock uses a digital pen to track behaviors throughout the test, such as pauses and unnecessary pen strokes, in addition to the final product. Scores have been shown to correlate well with well-established measures, such as the PACC, and AD biomarkers, such as Aβ and tau imaging [52]. Apple, Evidation, and Eli Lilly, in contrast, have collaborated to passively track elements of phone use in healthy, MCI, and AD individuals [53]. Phone usage patterns differed among cognitively healthy and cognitively impaired people, with impaired individuals showing slower typing, greater reliance on helper apps, and fewer total messages, to list a few metrics. Other passive phenotyping studies examine movement and activity patterns to identify cognitive decline. One study installed a GPS data logger into each participant’s vehicle, and captured driving performance (using speed, acceleration, and other measures) and the spatiotemporal characteristics of car trips over the course of a year [54]. Driving behavior on its own was highly accurate (F1 = .82) in identifying preclinical AD, as defined by the presence of CSF biomarkers. When age and APOE status were added to the model, the F1 score rose to .96. Another study used a wearable ankle device to measure gait and errors in motor planning, and found striking differences across cognitive groups with large effect sizes (Cohen’s d = 1.17–4.56) [55]. Still other studies use Smart Home technologies and other wearable devices to monitor cognitive status [56]. These studies show promise for passive digital phenotyping and signal a possible avenue for remote testing to pursue.

CONCLUSION

In pre-clinical AD, common paper-and pencil tests are limited by practice effects and ceiling effects, in addition to barriers associated with in-person testing. Many new cognitive measures, in the form of novel in-person testing, remote testing, and digital phenotyping, aim to address these weaknesses. Most of these new measures are digitally administered, signaling a new era in neuropsychological measurement. Overall, we believe that these novel tests and deployment methods appear promising. However, whether they are in fact “better” (i.e., more sensitive to biomarker and to subtle cognitive decline or improvement) than established tests remains unproven.

BULLET POINTS.

Financial support and sponsorship:

This work is supported by the National Institute on Aging grant award R01AG051346 to TG.

Conflicts of interest:

HK reports support from NIH 5T32MH020004. DPD reports scientific advisory to Acadia, Genentech, and Sunovion. DPD has served on the DSMB for Green Valley. TG receives royalties from VeraSci for use of the BACS in clinical trials. SC has nothing to report.

REFERENCES: