Annotated Bibliography, Introduction, and Summary Paragraph: Seeking the Truth

Evidence of Systematic Attenuation in the Measurement of Cognitive Deficits in Schizophrenia

Michael L. Thomas, University of California San Diego and VA San Diego Healthcare System, San Diego, California; Virginie M. Patt, San Diego State University/University of California San Diego; Andrew Bismark, University of California San Diego and VA San Diego Healthcare System, San Diego, California; Joyce Sprock, University of California San Diego; Melissa Tarasenko, Gregory A. Light, and Gregory G. Brown, University of California San Diego and VA San Diego Healthcare System, San Diego, California

Cognitive tasks that are too hard or too easy produce imprecise measurements of ability, which, in turn, attenuates group differences and can lead to inaccurate conclusions in clinical research. We aimed to illustrate this problem using a popular experimental measure of working memory—the N-back task—and to suggest corrective strategies for measuring working memory and other cognitive deficits in schizophrenia. Samples of undergraduates (n = 42), community controls (n = 25), outpatients with schizophrenia (n = 33), and inpatients with schizophrenia (n = 17) completed the N-back. Predictors of task difficulty—including load, number of word syllables, and presentation time—were experimentally manipulated. Using a methodology that combined techniques from signal detection theory and item response theory, we examined predictors of difficulty and precision on the N-back task. Load and item type were the 2 strongest predictors of difficulty. Measurement precision was associated with ability, and ability varied by group; as a result, patients were measured more precisely than controls. Although difficulty was well matched to the ability levels of impaired examinees, most task conditions were too easy for nonimpaired participants. In a simulation study, N-back tasks primarily consisting of 1- and 2-back load conditions were unreliable, and attenuated effect size (Cohen's d) by as much as 50%.
The results suggest that N-back tasks, as commonly designed, may underestimate patients’ cognitive deficits as a result of nonoptimized measurement properties. Overall, this cautionary study provides a template for identifying and correcting measurement problems in clinical studies of abnormal cognition.

General Scientific Summary Patients’ cognitive deficits can appear smaller than they truly are as a result of measurement artifacts.

This study suggests that a measure commonly used to assess working memory deficits in schizophrenia can produce unreliable and attenuated estimates of ability because most items are too easy.

The methodology presented is general, and can be used by investigators to determine whether cognitive tasks used in research are appropriately calibrated for the samples under investigation.

Keywords: effect size, N-back, reliability, schizophrenia, working memory deficits

This article was published Online First March 9, 2017.

Michael L. Thomas, Department of Psychiatry, University of California San Diego, and VISN-22 Mental Illness, Research, Education, and Clinical Center (MIRECC), VA San Diego Healthcare System, San Diego, California; Virginie M. Patt, Joint Doctoral Program in Clinical Psychology, San Diego State University/University of California San Diego; Andrew Bismark, Department of Psychiatry, University of California San Diego, and VISN-22 MIRECC, VA San Diego Healthcare System; Joyce Sprock, Department of Psychiatry, University of California San Diego; Melissa Tarasenko, Gregory A. Light, and Gregory G. Brown, Department of Psychiatry, University of California San Diego, and VISN-22 MIRECC, VA San Diego Healthcare System.

Research reported in this publication was supported, in part, by the National Institute of Mental Health of the National Institutes of Health under award numbers R01 MH065571, R01 MH042228, and K23 MH102420. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Correspondence concerning this article should be addressed to Michael L. Thomas, Department of Psychiatry, University of California San Diego, 9500 Gilman Drive MC: 0738, La Jolla, CA 92093-0738. E-mail: [email protected]

This document is copyrighted by the American Psychological Association or one of its allied publishers.

This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Journal of Abnormal Psychology, © 2017 American Psychological Association, 2017, Vol. 126, No. 3, 312–324. 0021-843X/17/$12.00

Cognitive tasks that are too hard or easy produce imprecise measurements (Lord, 1980), confound studies of differential deficit (Chapman & Chapman, 1973), and complicate translational research (Callicott et al., 2000; Manoach et al., 1999). Researchers have explicitly recommended that task difficulty be a main criterion used to select neurobehavioral probes (Gur, Erwin, & Gur, 1992), and problems associated with using tests with nonoptimized item properties have been known for many years (Lord & Novick, 1968). Despite this, the relative match, or mismatch, between ability and difficulty is rarely discussed in applied research, likely because there have been few demonstrations of its practical consequences. In this paper, we illustrate these problems using a popular experimental measure of working memory—the N-back task—and suggest strategies for precisely measuring working memory and other cognitive deficits in schizophrenia. The methodology applied is general, and can inform future studies of abnormal cognition in schizophrenia and other neurocognitive disorders.

Item Difficulty and Measurement Error

Ability estimates are most precise when item difficulty is closely matched to ability (Embretson, 1996; Lord, 1980). To understand why, it is important to distinguish between classic and modern conceptualizations of measurement error. Classical test theory defines measurement error as the observed score standard deviation multiplied by the square root of one minus the ratio of true score variance to observed score variance (i.e., reliability): the standard error of measurement. As such, measurement error in classical test theory is a constant.
Modern psychometrics—particularly item response theory (IRT; Lord, 1980)—on the other hand, defines measurement error as the standard deviation of the estimate of ability: the standard error of estimate. As such, estimates of measurement error in IRT may vary over scores within a population (Embretson, 1996); specifically, error is often a "U"-shaped function of ability. Although unequal precision is not a desirable property, it is, unfortunately, a real and ever-present one that may go unnoticed by researchers using classical methods (e.g., split-half reliability or coefficient alpha). This problem occurs because items that are too hard or too easy produce little systematic variation in observed test scores (Lord, 1980); in extreme cases, tests may show "floor" or "ceiling" effects (i.e., when all examinees within a particular range of the ability distribution receive the same score; Haynes, Smith, & Hunsley, 2011).
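The "U"-shaped error function can be made concrete with a small sketch. Assuming a simple logistic (2PL-style) item response model with unit discriminations (the function names and parameter values below are illustrative, not taken from the article), the standard error of estimate is the inverse square root of the test information, and for a test composed entirely of easy items that error grows as ability increases:

```python
import math

def p_correct(theta, b, a=1.0):
    """2PL logistic item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def standard_error(theta, difficulties):
    """IRT standard error of estimate: 1 / sqrt(test information), where
    information sums a^2 * P * (1 - P) over items (a = 1 here)."""
    info = sum(p_correct(theta, b) * (1.0 - p_correct(theta, b))
               for b in difficulties)
    return 1.0 / math.sqrt(info)

easy_test = [-2.0, -1.5, -1.0] * 10    # 30 items, all easy
for theta in (-2.0, 0.0, 2.0):
    # SE grows as ability moves above the easy items
    print(f"ability {theta:+.1f}: SE = {standard_error(theta, easy_test):.2f}")
```

For this easy test, a low-ability examinee is measured far more precisely than a high-ability one, which is exactly the asymmetry the article reports between patients and controls.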

There are practical consequences of administering tests with item difficulties that are poorly matched to ability. It is an axiom of psychometric theory that associations between variables are attenuated to the extent that measures of those variables are unreliable (Haynes et al., 2011; Spearman, 1904). Moreover, because reliability is a function of the standard errors associated with individual estimates of ability obtained within a sample (Embretson, 1996; Lord, 1955), and because, as noted above, error often varies with ability, samples with different mean abilities—such as patients and healthy controls—can be measured with unequal reliability. As a result, associations between ability and outcome, as well as changes in ability, can appear relatively smaller in one group when compared to the other purely due to a measurement confound.
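The size of this attenuation follows directly from classical theory: measurement error inflates observed score variance by a factor of one over reliability, so a standardized group difference shrinks by the square root of reliability. A minimal illustration (the reliability values are hypothetical):

```python
import math

def attenuated_cohens_d(true_d, reliability):
    """Measurement error inflates observed variance by 1/reliability,
    shrinking the standardized mean difference by sqrt(reliability)."""
    return true_d * math.sqrt(reliability)

print(round(attenuated_cohens_d(0.8, 0.9), 2))   # 0.76: mild attenuation
print(round(attenuated_cohens_d(0.8, 0.4), 2))   # 0.51: severe attenuation
```

A "large" true effect of 0.8 thus looks only "medium" when reliability falls to .40, in line with the up-to-50% attenuation the article reports for 1- and 2-back conditions.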

IRT can be used to identify and correct these problems (Thomas, 2011). Unfortunately, the approach is rarely used in neuropsychological test development, and the formal use of IRT in small-scale neurocognitive research is unprecedented. As Strauss (2001, p. 12) noted, IRT's large sample requirements—usually several hundred to several thousand participants—imply that the "method does not seem practical for testing specific, theoretically based hypotheses." However, with the use of alternative, less statistically demanding measurement models, it is possible to utilize certain applications from IRT in small-scale research (e.g., Thomas, Brown, Thompson, et al., 2013). We describe one such model next.

Measurement Approach

A limiting factor in the application of IRT to measures of abnormal cognition has been the disconnect between measurement models that are popular in item response theory and measurement models that are popular in cognitive assessment. In particular, most applications of item response theory rely on unidimensional measurement models (i.e., models in which a single person variable is thought to influence item responses), with only a small portion of studies using multidimensional approaches (i.e., models in which multiple person variables are thought to influence item responses; Thomas, 2011). Applications of the latter that have been published are generally exploratory (e.g., Thomas, Brown, Gur, et al., 2013). Measurement models used in cognitive assessment, in contrast, are often multidimensional, theory-based, and rely heavily on experimental cognitive research.

A prime example is the equal variance signal detection theory (SDT; Snodgrass & Corwin, 1988) model, which is commonly used to score data from recognition memory tasks (e.g., Kane, Conway, Miura, & Colflesh, 2007; Ragland et al., 2002). The SDT model, shown in Figure 1, distinguishes between two classes of items: targets and foils. Targets are repeated (or old) items that the examinee is expected to remember. Foils are nonrepeated (or new) items that the examinee is not expected to remember. The SDT model assumes that the presentation of target or foil items during testing invokes a sense of familiarity that can be represented as unimodal, symmetric probability distributions with identical variances but different means. The distance between distributions is a measure of discriminability (d′), and is often the primary outcome score of interest. d′ can reflect perceptual, memory, or other types of sensitivity to the detection of signal against a backdrop of noise (Witt, Taylor, Sugovic, & Wixted, 2015).

Figure 1. Equal variance, signal detection theory model. T = mean of the distribution of familiarity for targets; F = mean of the distribution of familiarity for foils; d′ = T minus F (discrimination); C = criterion; C_center = value of the criterion relative to the midpoint between T and F (bias).

However, because the

familiarity distributions of targets and foils often overlap, the SDT model assumes that examinees must establish a criterion, or level of familiarity, beyond which items will be classified as targets. It is useful to define a measure of bias as the value of the criterion relative to the midpoint between target and foil distributions (C_center). C_center can reflect both perceptual and response biases (Witt et al., 2015). The primary advantage of using the SDT measurement model in studies of abnormal cognition is the ability to disentangle sensitivity from bias.
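In practice, d′ and C_center are computed from hit and false-alarm rates. A sketch of the conventional scoring, using the log-linear correction associated with Snodgrass and Corwin (1988) so that z-scores stay finite at perfect performance (the function name and counts are illustrative):

```python
from statistics import NormalDist

def sdt_scores(hits, misses, false_alarms, correct_rejections):
    """Equal-variance SDT: d' and C_center from response counts, using
    the log-linear correction (add 0.5 to counts, 1 to totals) so that
    perfect hit rates or zero false-alarm rates remain finite."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = z(hit_rate) - z(fa_rate)
    c_center = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, c_center

# Example: 7/8 targets detected, 2 false alarms on 32 foils
d_prime, c_center = sdt_scores(hits=7, misses=1,
                               false_alarms=2, correct_rejections=30)
# d' ≈ 2.40, C_center ≈ 0.23 (positive = conservative responding)
```

Positive C_center values indicate a conservative criterion (a tendency to withhold responses), negative values a liberal one.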

Previous work has shown that the SDT model can be formulated as a generalized linear model with coefficients representing examinee ability and item difficulty (DeCarlo, 1998, 2011). In other work (Thomas et al., 2016), and in the Appendix, we show that this model is also equivalent to a multidimensional IRT model, thus linking a valuable body of psychometric research and technical literature to the measurement of a general class of cognitive constructs. Moreover, because this framework assumes certain item properties based on theory, and allows others to be estimated as a function of task properties, sample size demands are greatly reduced. Researchers can use the approach to investigate the standard error of ability estimates, even in relatively small samples, provided that the cognitive tasks used are scored using the SDT framework.

The application of modern psychometric ideas to SDT scoring of test data in experimental studies of abnormal cognition would provide tangible evidence of the problems associated with administering items and tests that poorly match difficulty to ability. Next, we describe one domain of assessment that is ripe for the application of these ideas: the assessment of working memory deficits in schizophrenia.

Working Memory Deficits in Schizophrenia

Decreased brain volume, altered morphology, and impaired functioning in brain regions associated with complex cognitive processes (e.g., prefrontal cortex, limbic and paralimbic structures, and temporal lobe) are common in patients diagnosed with schizophrenia (Brown & Thompson, 2010; Levitt, Bobrow, Lucia, & Srinivasan, 2010), and are linked to a host of cognitive deficits, including impaired attention, language, executive functioning, processing speed, and memory (Bilder et al., 2000; Kalkstein, Hurford, & Gur, 2010; Reichenberg & Harvey, 2007). Cognitive deficits are core, treatment-refractory, even endophenotypic traits that might prove useful in identifying targets for the next generation of psychological and pharmacological therapies (Brown et al., 2007; Gur et al., 2007; Hyman & Fenton, 2003; Insel, 2012; Lee et al., 2015).

Working memory is a core deficit in patients diagnosed with schizophrenia (Barch & Smith, 2008; Kalkstein et al., 2010; Lee & Park, 2005). Although the construct has been characterized by several evolving theories (Atkinson & Shiffrin, 1968; Baddeley & Hitch, 1974; Cowan, 1988), it can generally be defined as "those mechanisms or processes that are involved in the control, regulation and active maintenance of task-relevant information in the service of complex cognition" (Miyake & Shah, 1999, p. 450). The construct has been intensively studied in cognitive psychology (Baddeley, 1992; Cowan, 1988), neuroscience (Owen, McMillan, Laird, & Bullmore, 2005), and clinical neuropsychology (Lezak, Howieson, Bigler, & Tranel, 2012). Deficits in working memory also occur in several other neurological and psychiatric disorders, including attention-deficit/hyperactivity disorder (Engelhardt, Nigg, Carr, & Ferreira, 2008), autism (Williams, Goldstein, Carpenter, & Minshew, 2005), dementia (Salmon & Bondi, 2009), depression (Christopher & MacDonald, 2005), traumatic brain injury (Vallat-Azouvi, Weber, Legrand, & Azouvi, 2007), and posttraumatic stress disorder (Shaw et al., 2009).

The N-back task, which asks examinees to monitor a continuous stream of stimuli and respond each time an item is repeated from N items before, is one popular measure of working memory deficits in schizophrenia. N-back tasks were introduced to study serial learning and short-term retention of rapidly changing information (Kirchner, 1958; Mackworth, 1959; Welford, 1952). Figure 2 shows an example of a 2-back task (i.e., load, or N, = 2) using words as stimuli. Examinees are asked to respond to targets but not to foils or lures (i.e., items that have been repeated from some lag other than N [e.g., a 3-back item presented during a 2-back condition; see Figure 2] and thus should not be responded to). The N-back task gained popularity as an experimental working memory paradigm in the 1990s (Cohen & Servan-Schreiber, 1992; Gevins & Cutillo, 1993; Gevins et al., 1990; Jonides et al., 1997), and has since been widely adapted, using stimuli varying across modality, including letters, digits, words, shapes, pictures, faces, locations, auditory tones, and even odors (Owen et al., 2005).

These diverse versions of the N-back task have been shown both to require stimulus-specific processes and to recruit common brain regions (Nystrom et al., 2000; Owen et al., 2005; Ragland et al., 2002). Although experimental versions of the N-back task are popular in schizophrenia and neuroimaging research—to the point of being considered a "gold standard" paradigm (Glahn et al., 2005; Kane & Engle, 2002; Owen et al., 2005), and have even shown efficacy for use in cognitive remediation (Jaeggi, Buschkuehl, Jonides, & Perrig, 2008)—questions nevertheless remain about the psychometric properties of these tasks (e.g., Jaeggi, Buschkuehl, Perrig, & Meier, 2010).

Figure 2. Example of a 2-back run from the N-back task. Examinees are asked to respond whenever a word is repeated from 2 words before. Items repeated from 2-back are targets, items that are repeated, but not from 2-back, are referred to as lures, and nonrepeated items are referred to as foils.

Several investigators have reported only moderate, weak, and even nonsignificant associations between N-back performance and

performance on prototypical working memory paradigms such as measures of simple and complex span (e.g., Jacola et al., 2014; Jaeggi et al., 2010; Kane & Engle, 2002; Miller, Price, Okun, Montijo, & Bowers, 2009; Shamosh et al., 2008; Shelton, Elliott, Matthews, Hill, & Gouvier, 2010). One possible cause for the N-back's poor validity is poor reliability. Reliability estimates reported in the literature have ranged from poor to good (e.g., Jaeggi et al., 2010; Kane et al., 2007; Salthouse, Atkinson, & Berish, 2003; Shelton et al., 2010) and appear to depend on N-back load condition and stimulus modality (e.g., Jaeggi et al., 2010; Salthouse et al., 2003). Indeed, in a study examining the split-half reliability of the N-back task under different load manipulations, Jaeggi et al. (2010) concluded that "the N-back task does not seem to be a useful measure of individual differences in working memory [capacity], due to its low reliability." However, the N-back's poor, or at least inconsistent, reliability may be a function of poorly matched examinee ability and item difficulty.

Current Study

In this study our first aim was to determine how task manipulations influence difficulty and precision on the N-back. This was accomplished by using techniques from IRT to quantify measurement error for estimates of d′ and C_center produced by an SDT measurement model. As noted above, measurement error varies when item difficulty is not well matched to the full range of ability within a sample. Because the N-back appears to have a restricted range of difficulty (i.e., few load conditions), and because reliability estimates reported in the literature have varied substantially from sample to sample, we hypothesized that error in empirical estimates of d′ and C_center would vary as a function of ability.
Our second aim was to use this information to explore the potential impact of imprecision on observed group differences in clinical studies of working memory deficits in schizophrenia. We hypothesized that mismatched ability and difficulty would lead to attenuated precision and effect size. That is, if item difficulty on the N-back is well matched to the abilities of healthy controls or patients, but not both, this should result in unequal precision between groups. Furthermore, because mismatched ability and difficulty increases measurement error, and because measurement error attenuates effect size, we also assumed that a restricted range of item difficulty would result in lower effect size for one group when compared to the other.

Method

Participants

We sought to study a heterogeneous sample in order to maximize variance in working memory ability. The sample comprised two cognitively healthy groups—undergraduates (N = 42) and community controls (N = 25)—and two groups of patients diagnosed with either schizophrenia or schizoaffective disorder—outpatients (N = 33) and inpatients (N = 17). Undergraduates were recruited from an experimental subject pool, outpatients and community controls were recruited from the general community, and inpatients were recruited from a locked long-term care facility.

Demographic characteristics of the samples are reported in Table 1. Written consent was obtained from all participants. Patients were assessed on their capacity to provide informed consent. When relevant, consent was obtained from court-ordered conservators.

Table 1
Demographic and Clinical Characteristics

| Characteristic | Undergraduates | Community controls | Outpatients | Inpatients |
|---|---|---|---|---|
| N | 42 | 25 | 33 | 17 |
| Age (SD) | 21.07 (2.11) | 38.24 (12.39) | 44.94 (11.63) | 37.88 (11.13) |
| Male | 16 (38%) | 10 (40%) | 19 (58%) | 9 (53%) |
| Female | 26 (62%) | 15 (60%) | 14 (42%) | 8 (47%) |
| Hispanic | 12 (29%) | 2 (8%) | 9 (27%) | 4 (24%) |
| Race: White | 13 (32%) | 13 (52%) | 16 (48%) | 12 (71%) |
| Race: Black | 0 (0%) | 4 (16%) | 4 (12%) | 0 (0%) |
| Race: Asian | 18 (45%) | 4 (16%) | 0 (0%) | 2 (12%) |
| Race: American Indian | 0 (0%) | 0 (0%) | 0 (0%) | 1 (6%) |
| Race: Multiracial | 2 (5%) | 4 (16%) | 13 (39%) | 2 (12%) |
| Race: Other | 7 (18%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Education (SD) | 15 (1.18) | 15.12 (2.15) | 12.09 (2.26) | 11.47 (2.03) |
| Parental Education (SD)^a | — | 13.88 (1.81) | 12.46 (2.52) | 14.10 (2.16) |
| Std. WRAT | — | 106.38 (8.06) | 93.53 (12.38) | 93.5 (13.49) |
| Age of Onset | — | — | 22.06 (7.2) | 19.62 (5.32) |
| Hospitalizations^b | — | — | 9.62 (10.18) | 16.71 (9.52) |
| GAF | — | — | 41.34 (4.23) | 28.24 (4.98) |
| SAPS Total | — | — | 6.34 (3.73) | 6.44 (5.19) |
| SANS Total | — | — | 14.66 (4.12) | 5.88 (3.54)^c |

Note. Two undergraduates declined to report their race. SAPS = Scale for the Assessment of Positive Symptoms; SANS = Scale for the Assessment of Negative Symptoms; WRAT = Wide Range Achievement Test; GAF = General Assessment of Functioning. "—" indicates that data were not collected.

^a Based on average of mother and father. ^b Based on self-report. ^c Avolition-Apathy and Anhedonia-Asociality scores for inpatients were based on work, social, and recreational participation within the inpatient facility, and thus are likely smaller (better) than would be observed in the community.

Research procedures were reviewed and approved by the UC San Diego Human Subjects Protection Program (protocol numbers 071831, 080435, 101497, and 130874).

Diagnoses (or lack thereof) were verified using the patient and nonpatient editions of the Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders, fourth edition, text revision (DSM–IV–TR; First, Spitzer, Gibbon, & Williams, 2002a, 2002b) for the patient groups and community controls, respectively, and by using a self-report questionnaire for the undergraduates. Exclusion criteria included inability to understand consent, self-reported nonfluent English, previous significant head injury (i.e., loss of consciousness longer than 30 min, residual neurological symptoms, or abnormal neuroimaging finding), neurological illness, and severe systemic illness. Patients and community controls were excluded if they had a history of alcohol or substance abuse or dependence within the preceding month, or had a positive illicit drug toxicology screen at the time of testing. Patients were also excluded if they did not meet diagnostic criteria for schizophrenia or schizoaffective disorder, or if they reported current mania. Undergraduates and community controls were also excluded if they reported any history of psychosis, current Cluster A personality disorder, current Axis I mood disorder, history of psychosis in a first-degree family member, or current treatment with any antipsychotic or other psychoactive medication.

Cognitive Task

An N-back task using words as stimuli, designed specifically for the purposes of this study, was administered to all participants. We generated a list of words using an online word pool database (Wilson, 1988), saved each word's letter, syllable, and frequency count, and then removed any offensive words and personal names.

This left us with a stimulus pool of 32,236 English words taken from all parts of speech. Next, we generated one hundred 40-word lists containing 32 foil and 8 target or lure item types, so that 1 out of every 5 words presented, on average, was either a target or a lure. Words were randomly selected from the word pool.^1 To prevent examinees from guessing the order and rate at which targets and lures were presented, a script written in R (R Core Team, 2013) was used to pseudorandomize the order of stimulus presentation (although the order was held constant over examinees).

We experimentally manipulated three crossed factors: N-back load (3 levels: 1, 2, or 3), number of word syllables (3 levels: 1, 2, or 3), and presentation time (3 levels: 500 ms, 1,500 ms, or 2,500 ms, followed by a blank screen to attain a fixed presentation rate of one word every 3,000 ms).^2 We did not include a 0-back load condition (i.e., where examinees are asked to respond anytime a key word is shown) because we felt that the condition may differ not just quantitatively, but also qualitatively, from load conditions that require both active maintenance and continuous updating of newly encoded information. Although load manipulations are common, syllable length and presentation time are generally fixed over items on N-back tasks; however, we reasoned that—because these manipulations can increase pressure on encoding and maintenance processes—they might produce a wider range of item difficulty for the N-back task, which could benefit measurement precision overall. We generated unique 40-word lists for each combination of factors. In addition to the experimentally manipulated factors, word frequency and item count within runs were included in all analyses. At an administration time of two minutes per list, we could not administer all unique combinations of factor levels to each participant. Therefore, we used incomplete counterbalancing of conditions. Participants were administered nine lists each, with the requirement that they receive all levels of each factor.
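The list-construction logic described above can be sketched as follows. This is not the authors' actual R script; it is a simplified Python illustration in which the bin size (one repeat per 5 items), the lure lag (N + 1), and the spacing rule are assumptions chosen to keep planted repeats from interfering with one another:

```python
import random

def make_nback_run(word_pool, load=2, n_items=40, n_repeats=8, seed=0):
    """Sketch of pseudorandomized N-back list construction: start from
    unique foil words, then plant one repeat per bin of 5 items, each
    randomly assigned to be a target (lag = load) or a lure (lag =
    load + 1). Assumes load + 2 is at most the bin size - 1."""
    rng = random.Random(seed)
    items = rng.sample(word_pool, n_items)   # unique words: no accidental repeats
    types = ["foil"] * n_items

    kinds = ["target"] * (n_repeats // 2) + ["lure"] * (n_repeats // 2)
    rng.shuffle(kinds)

    bin_size = n_items // n_repeats
    prev = -(load + 2)
    for b, kind in enumerate(kinds):
        # jitter within the bin, but keep repeats far enough apart that
        # planting one never disturbs the source word of another
        lo = max(bin_size * b, load + 2, prev + load + 2)
        pos = rng.randint(lo, bin_size * b + bin_size - 1)
        lag = load if kind == "target" else load + 1
        items[pos] = items[pos - lag]
        types[pos] = kind
        prev = pos
    return list(zip(items, types))
```

A production script would additionally screen for near-miss spellings (the n-gram items mentioned in Footnote 1) and counterbalance word properties across lists.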

A short set of instructions followed by a practice trial with feedback preceded each new N-back load condition. Participants were encouraged to take short breaks after each run. The task was administered online using a web application designed and programmed for neurocognitive task administration, and lasted approximately 25 to 30 min per participant. Words were presented in large black font on a light gray background with minimal screen distraction. The protocol was the same for all participants except inpatients, who were administered only six N-back lists (three 1-back followed by three 2-back) due to time and fatigue constraints.

Analyses

Model. All analyses were conducted within the context of SDT. In equal variance SDT models, the probability of responding to stimuli can be expressed using the following generalized linear model (see DeCarlo, 1998, Appendix):

Φ^{-1}(P(U_ij = 1)) = -C_center,i + Z_j · d′_i / 2,   (1)

where Φ^{-1} is the inverse cumulative distribution function for the normal distribution; P(U_ij = 1) is the probability that individual i responds positively (presses the button) to item j; Z_j is a binary variable equal to +1 if item j is a target and -1 if it is a foil or lure; d′_i is the ability of individual i to discriminate between target and foil or lure items; and C_center,i represents individual i's bias. In order to be consistent with IRT, the SDT model can be modified to express the probability of correct answers (as opposed to the probability of responding) and to include the notion of item difficulty (see Appendix):

Φ^{-1}(P(X_ij = 1)) = β_j - Z_j · C_center,i + d′_i / 2,   (2)

where P(X_ij = 1) is the probability of a correct answer for individual i on item j, and β_j represents the easiness of item j.
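Equations 1 and 2 can be checked numerically: with β_j = 0, the probability of a correct answer to a target equals the probability of responding, a correct answer to a foil equals one minus the probability of responding, and the conventional d′ (z of hit rate minus z of false-alarm rate) recovers the model's d′ exactly. A Python sketch (parameter values are illustrative):

```python
from statistics import NormalDist

phi = NormalDist().cdf   # standard normal CDF, i.e., the inverse of the probit link

def p_respond_eq1(is_target, d_prime, c_center):
    """Equation 1: probability of a 'target' response under equal-variance SDT."""
    z = 1 if is_target else -1
    return phi(-c_center + z * d_prime / 2)

def p_correct_eq2(is_target, d_prime, c_center, easiness=0.0):
    """Equation 2: probability of a correct answer (respond to targets,
    withhold responses to foils), with item easiness beta_j."""
    z = 1 if is_target else -1
    return phi(easiness - z * c_center + d_prime / 2)

d, c = 1.5, 0.2                       # illustrative ability and bias
hit_rate = p_respond_eq1(True, d, c)
fa_rate = p_respond_eq1(False, d, c)
```

With these values, z(hit_rate) − z(fa_rate) returns 1.5 exactly, confirming that the probit formulation and the conventional SDT scoring are the same model.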

Task difficulty. β_j, d′_i, and C_center,i vary over items and examinees and can be specified as random effects in a mixed-effects model. Accordingly, we analyzed the item accuracy data using

^1 We also explored the effect of including words with a similar spelling (n-grams) as the items presented N words before (e.g., "DOG" - "CAT" - "DIG" in a 2-back condition). However, early analyses suggested that these items did not add difficulty to the N-back task over and above lures, and were highly variable in terms of difficulty level. For simplicity, these items were removed from all analyses.

^2 We also manipulated the number of word letters to determine whether syllable and letter effects were independent. Because syllables and letters are correlated, the word letter factor was only partially crossed with the word syllable factor (i.e., 3, 4, or 5 letters for 1-syllable words; 5, 6, or 7 letters for 2-syllable words; and 7, 8, or 9 letters for 3-syllable words).

Results suggested that number of word letters did not significantly improve model fit when number of word syllables had already been accounted for.

generalized linear mixed modeling (GLMM; see Hox, 2010, for a review of multilevel or mixed-effects models) and the lme4 package for R (Bates, Maechler, Bolker, & Walker, 2014). To investigate predictors of task difficulty within an SDT scoring framework, we added fixed effect predictors of item accuracy to Equation 2. The predictors of interest included N-back load, number of word syllables, presentation time, and item count within each run (all centered). The effect of item type was also explored, although the effects are complex to dissociate. In the SDT model, values of d′ and C_center determine the difficulty of targets and foils; item difficulty is negatively associated with d′ for both targets and foils, and negatively associated with C_center for foils but positively associated with C_center for targets (see Equation 2).

In the current approach, the means of the random effects determined the difficulty of targets and foils. Lure difficulty was determined the same as foil difficulty, except for a dummy-coded "Lure" variable that captured added difficulty due to the complexity of lures. Centered and log-transformed word frequency was included as a covariate. The combined model had the form:

Φ^{-1}(P(X_ij = 1)) = β_j - Z_j · C_center,i + d′_i / 2 + b_3 · NBack_j + b_4 · WordSyllables_j + b_5 · PresentationTime_j + b_6 · WordFrequency_j + b_7 · Lure_j,   (3)

where β_j, d′_i, and C_center,i were all treated as random effects, and all other terms were fixed effects with values varying depending on item j. The model does not have an intercept term, so as to allow the means of d′_i and C_center,i to be nonzero (as they should be).
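Because Equation 3 is a probit-linear predictor, its fixed-effects part is easy to prototype. In the sketch below the coefficient values for b3 through b7 are invented for illustration (the article estimates them from data); it simply shows how centered task covariates shift the predicted probability of a correct response:

```python
from statistics import NormalDist

phi = NormalDist().cdf

def p_correct_eq3(z, beta_j, d_prime_i, c_center_i,
                  nback=0.0, syllables=0.0, pres_time=0.0,
                  word_freq=0.0, lure=0,
                  b=(-0.5, -0.1, 0.2, 0.1, -0.8)):
    """Sketch of Equation 3: probit model for accuracy with centered task
    covariates. Coefficients b3..b7 are hypothetical illustrative values,
    NOT the article's estimates. z is +1 for targets, -1 for foils/lures."""
    b3, b4, b5, b6, b7 = b
    eta = (beta_j - z * c_center_i + d_prime_i / 2
           + b3 * nback + b4 * syllables + b5 * pres_time
           + b6 * word_freq + b7 * lure)
    return phi(eta)

# With a negative load coefficient, higher load lowers predicted accuracy:
p_hard = p_correct_eq3(z=1, beta_j=0.0, d_prime_i=1.5, c_center_i=0.2, nback=1.0)
p_easy = p_correct_eq3(z=1, beta_j=0.0, d_prime_i=1.5, c_center_i=0.2, nback=-1.0)
```

In the article the analogous coefficients are estimated jointly with the random effects by fitting this model in lme4's glmer with a probit link.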

Measurement precision. In the GLMM approach, d′ and C_center are modeled as random effects, which are equivalent to latent abilities in IRT (de Boeck et al., 2011). Individual values of d′ and C_center for all examinees were derived using maximum a posteriori (MAP) estimates. To quantify measurement error for these estimates, we extracted their posterior standard deviations (PSDs). Both MAPs and PSDs are produced by the lme4 R package. PSD, which is interpreted similarly to standard error of estimate, provides an index of measurement (im)precision based on the observed data. Measurement precision based on the model and fitted parameter estimates was quantified using information functions for multidimensional item response theory models (Reckase, 2009). Information, the inverse of squared standard error, is a statistic that reflects precision in ability estimates. We produced information functions for all combinations of item type by N-back factor levels, focusing only on d′ while holding C_center to the mean value in the sample.
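For binary responses under a probit model, the information an item carries about d′ has a closed form: (∂P/∂d′)² / (P(1 − P)). A sketch (item sets and values are illustrative) showing that a test of easy items is far less informative for a high-d′ examinee than for a low-d′ one:

```python
from statistics import NormalDist

nd = NormalDist()

def item_info_dprime(z, d_prime, c_center=0.0, easiness=0.0):
    """Fisher information about d' from one binary item under the probit
    SDT model (Equation 2): (dP/dd')^2 / (P * (1 - P)). Since the linear
    predictor contains d'/2, dP/dd' = pdf(eta) * 0.5."""
    eta = easiness - z * c_center + d_prime / 2
    p = nd.cdf(eta)
    dp_dd = nd.pdf(eta) * 0.5
    return dp_dd ** 2 / (p * (1 - p))

def se_dprime(d_prime, items):
    """Standard error of d' = 1 / sqrt(summed test information)."""
    info = sum(item_info_dprime(z, d_prime, easiness=e) for z, e in items)
    return info ** -0.5

# 40 easy items (easiness = 0.5): informative at low d', not at high d'
easy_items = [(1, 0.5)] * 20 + [(-1, 0.5)] * 20
```

Plotting se_dprime over a range of d′ values reproduces the "U"-shaped (here, rising) error curve discussed earlier: precision collapses exactly where the nonimpaired examinees sit.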

Effect size. Finally, we simulated data that would allow us to obtain estimates of the expected attenuation in group difference effect size (Cohen's d) given different combinations of N-back load conditions. This consisted of the following steps. Step 1: We simulated normally distributed d′ and C_center values hypothetically obtained from samples of nonimpaired and impaired individuals, with d′ means fixed to 0.0 and 0.8 SDs below the overall sample mean in the current study, respectively (i.e., corresponding to a Cohen's d value of 0.8 [a large effect]). Step 2: We created a pool of N-back items based on specific combinations of task difficulty factors (see below). Step 3: We calculated d′ for each participant in the simulated data using conventional formulas (Snodgrass & Corwin, 1988). Step 4: We calculated Spearman-Brown-corrected split-half reliability (Rel. S.B.) and Cohen's d statistics. Step 5:

We repeated Steps 2 through 4 for the following N-back load conditions: all 1-back, all 2-back, all 3-back, a mix of 1- and 2-back, a mix of 1- and 3-back, a mix of 2- and 3-back, and a mix of 1-, 2-, and 3-back. Importantly, the same total number of item responses (240) was assumed in each simulation, hypothetically corresponding to 12 min of testing. To improve efficiency, each run had a distribution of 60% foils, 20% targets, and 20% lures. The mean for the nonimpaired simulation group was fixed to the unweighted grand mean of the sample, as opposed to the sample mean of controls, to account for any demographic mismatch between outpatients and community controls in the current study (see below). We used the Spearman-Brown-corrected split-half reliability so that our results would be consistent with studies of N-back reliability reported in the literature (e.g., Jaeggi et al., 2010). The simulation was programmed in R.
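The core of the simulation logic can be sketched as follows. This is a simplified Python re-creation rather than the original R code: it uses only targets and foils, a single load condition, a logistic response model, and arbitrary sample sizes and seed, but it reproduces the key phenomenon of interest, attenuation of Cohen's d due to estimation error and ceiling effects:

```python
import random
from math import exp
from statistics import NormalDist, mean, stdev

nd = NormalDist()

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def estimated_dprime(true_d, c_center, n_targets, n_foils, rng):
    """Simulate one examinee's responses and score them with the
    corrected formulas of Snodgrass and Corwin (1988):
    H = (hits + .5) / (n + 1), F = (fa + .5) / (n + 1),
    estimated d' = z(H) - z(F)."""
    p_hit = sigmoid(true_d / 2.0 - c_center)
    p_fa = 1.0 - sigmoid(true_d / 2.0 + c_center)
    hits = sum(rng.random() < p_hit for _ in range(n_targets))
    fas = sum(rng.random() < p_fa for _ in range(n_foils))
    h = (hits + 0.5) / (n_targets + 1)
    f = (fas + 0.5) / (n_foils + 1)
    return nd.inv_cdf(h) - nd.inv_cdf(f)

def cohens_d(a, b):
    pooled = ((stdev(a) ** 2 + stdev(b) ** 2) / 2.0) ** 0.5
    return (mean(a) - mean(b)) / pooled

rng = random.Random(1)
# Hypothetical true abilities: impaired mean 0.8 SD below nonimpaired.
nonimpaired = [rng.gauss(4.2, 1.0) for _ in range(2000)]
impaired = [rng.gauss(3.4, 1.0) for _ in range(2000)]

est_n = [estimated_dprime(d, 0.8, 48, 72, rng) for d in nonimpaired]
est_i = [estimated_dprime(d, 0.8, 48, 72, rng) for d in impaired]

d_true = cohens_d(nonimpaired, impaired)  # ~0.8 by construction
d_obs = cohens_d(est_n, est_i)            # attenuated by estimation error
```

Split-half reliability could be added by scoring odd- and even-numbered trials separately and applying the Spearman-Brown formula, 2r / (1 + r).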

Results

Demographic Characteristics

We compared the samples on key characteristics to determine demographic similarity. Because undergraduates are not expected to be demographically similar to patients or community controls, comparisons were restricted to the latter groups. The samples did not significantly differ with respect to age, F(2, 72) = 3.12, ns, η² = .08, or gender, χ²(2, N = 75) = 1.80, ns, φ_c = .16.

Moreover, although the groups differed in terms of education, F(2, 72) = 19.01, p < .001, η² = .35, they did not significantly differ in terms of mean level of parent education, F(2, 46) = 2.79, ns, η² = .11.

Descriptive Accuracy Results

Figure 3 shows mean accuracy results (i.e., the proportion of correct answers) broken down by N-back load, item type, and group. It is notable that accuracies were generally well over 50%, and many were above 75%. Undergraduates generally performed better than community controls, followed by outpatients and then inpatients. Foils were the easiest item type, and lures were the most difficult. Items became consistently harder as N-back load increased.

Ability and Task Difficulty

GLMM parameter estimates are reported in Table 2. The mean empirical estimate of discriminability (d′) was 4.20 in the sample, suggesting that the N-back task was moderately easy overall. The mean empirical estimate of bias (C_center) was 0.78, suggesting that foils were much easier (more often responded to correctly) than targets. The lure effect was significantly negative, indicating that lures were much more difficult than foils. Increasing N-back load, word syllables, and item count, as well as decreasing presentation time, all predicted significantly worse accuracy. The effect of word frequency was not statistically significant. Interestingly, the standard deviation of empirically estimated item easiness (ε) was small, suggesting that N-back difficulty was dominated by task rather than individual item features.

This document is copyrighted by the American Psychological Association or one of its allied publishers.

Measurement Precision

Table 3 reports mean estimates (MAPs) of d′ and C_center as well as measurement errors (PSDs) within each sample. (Note that these results do not attempt to control for demographic covariates.) Ability and measurement precision varied over populations. Figure 4 shows estimates of d′ plotted against the errors of those estimates for undergraduates, community controls, and outpatients (inpatients were omitted from the figure because, as a result of being administered fewer items [see Methods], PSDs associated with inpatients' ability estimates are higher than those of the other groups). The figure also shows approximate values of reliability corresponding to each PSD level (see Footnote 3). Error appears to be a nonlinear function of ability level; PSDs were generally lower for examinees with low versus high values of d′. The PSDs generally suggest good or even excellent measurement precision in the sample; this is mainly due to the high number of N-back runs administered.
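The correspondence between PSD and reliability follows from the classical identity noted in Footnote 3: the standard error of measurement equals the score standard deviation times the square root of one minus reliability. A quick sketch in Python (the study itself used R), assuming a latent d′ standard deviation of about 1.0 (cf. the d′ variance of 1.03 in Table 2):

```python
def psd_for_reliability(sd, rel):
    """Largest PSD compatible with a target reliability, from the
    classical identity SEM = SD * sqrt(1 - reliability)."""
    return sd * (1.0 - rel) ** 0.5

sd_dprime = 1.0  # assumed latent SD of d' (Table 2: variance = 1.03)
thresholds = {rel: round(psd_for_reliability(sd_dprime, rel), 2)
              for rel in (0.7, 0.8, 0.9)}
# thresholds -> {0.7: 0.55, 0.8: 0.45, 0.9: 0.32}
```

By this yardstick, the PSDs in Table 3 (roughly .34 to .48) correspond to reliabilities of approximately .77 to .88.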

To further explore measurement precision, we created information functions for combinations of N-back load and item type, holding all other task factors at their median values. The results are shown in Figure 5. For interpretability, the information functions (represented by solid, dashed, and dotted lines corresponding to 1-, 2-, and 3-back loads, respectively) are superimposed over the implied distributions of d′ for undergraduates, community controls, outpatients, and inpatients. As can be seen, the information functions generally peak at d′ values that are lower than the mean of each distribution of ability; this is particularly true of foils and of all 1-back conditions. The results suggest that the N-back task was too easy to provide precise, or at least efficient, estimates of d′ for participants with average to above-average ability. Moreover, foils provided almost no useful information about ability. Targets at 3-back and lures at 2- and 3-back were the most informative across all groups.

Effect Size

Results of the effect size simulation are reported in Table 4.

Reliability was consistently worse for the nonimpaired group.

Reliability overall was closely tied to N-back difficulty. The simulations that used all 1-back conditions and a combination of 1- and 2-back conditions both produced unacceptably low reliabilities, and Cohen's d effect size values were severely attenuated for these simulations, dropping by 0.37 (46%) and 0.30 (37%), respectively (i.e., from a large effect to a small and a medium effect, respectively). The two best performing simulations were those that used all 3-back conditions and a mix of 2- and 3-back conditions. Both produced moderate reliabilities (0.75 and 0.70, respectively), and the simulated attenuations in Cohen's d were 0.18 (22%) and 0.21 (26%), respectively.

Footnote 3: It has been noted by several authors that, given the classical test theory definition that the standard error of measurement equals the standard deviation of scores times the square root of one minus reliability, the average standard error of estimate needed to achieve adequate, good, or excellent reliability can be calculated.

Figure 3. Item accuracy by group, item type, and N-back. n refers to the number of observed item responses.

Discussion

The results of this study demonstrate that reliability and measured group differences are both attenuated when cognitive tasks are not well matched to ability within the samples under investigation. These problems were demonstrated using a task commonly used to study working memory deficits in schizophrenia: the N-back task. We found that N-back load and item type were the two primary determinants of task difficulty. Difficulty increased along with N-back load, and lures and targets were both much harder than foils. Task conditions were maximally informative within the low-average to highly impaired range of ability. In a simulation study, we found that N-back tasks composed entirely of low load conditions (i.e., 1- and 2-back) were highly unreliable and may reduce the observed effect size by half.

Strengths and Limitations

Strengths of the study include its novel statistical methodology, the heterogeneous sample, and the use of an experimental design to study task features on the N-back. However, results should be interpreted in light of several limitations. First, our sample and design did not provide data sufficient to examine the dimensionality and construct validity of the N-back task. This topic is discussed in detail below. Second, it is common in psychometrics to examine detailed fit statistics to determine how well the theoretical model matches the observed data (Swaminathan, Hambleton, & Rogers, 2007). Although general markers of model fit were good (see supplemental material), we lacked appropriate data to examine item-level fit statistics (i.e., too few responses per item). Third, although it is common in the literature, we did not include a 0-back load condition, which is sometimes used to form contrast measures that, in theory, control for variance that is irrelevant to the target construct (e.g., attention and motivation). This was because we felt that the 0-back condition, where examinees are typically asked to respond anytime a key word is shown, may differ not just quantitatively but also qualitatively from load conditions that require both active maintenance and continuous updating of newly encoded information. Fourth, although patients and community controls did not significantly differ in terms of age and gender, controls reported higher education. The difference in education is a common finding that almost certainly reflects, at least in part, a consequence of mental illness. The groups were, however, matched on parental education, which may be a better indicator of premorbid demographic similarity. Nonetheless, to the extent that demographic factors exaggerated differences in working memory between groups, unequal reliability as well as attenuation in effect size between groups may have been overestimated.
Table 2
Generalized Linear Mixed-Effects Model Parameter Estimates

Random effects (MAP est.)      M       S²
Discriminability (d′)         4.20    1.03
Bias (C_center)                .78     .09
Item easiness (ε)              .01     .02

Fixed effects          b        SE     CI 95%             exp(b)   r_xy.z     p
N-back load          −.552     .028   [−.607, −.497]       .58     −.24    <.001
Word syllables       −.094     .027   [−.146, −.042]       .91     −.04    <.001
Presentation time    −.062     .025   [−.111, −.013]       .94     −.02     .014
Item count           −.009     .002   [−.013, −.006]       .99     −.06    <.001
Word frequency       −.023     .012   [−.048, .001]        .98     −.02     .059
Lure                −2.228     .068   [−2.362, −2.094]     .11     −.40    <.001

Note. MAP = maximum a posteriori; M = mean; S² = variance; b = estimate of regression coefficient (log-odds scale); SE = standard error of estimate; CI 95% = 95% confidence interval; exp(b) = coefficient expressed as an odds ratio; r_xy.z = partial correlation coefficient.

Table 3
Mean Estimates and Error for Discriminability (d′) and Bias (C_center) by Group

Variable                 Undergraduates   Community controls   Outpatients   Inpatients
Discriminability (d′)
  Estimate                    4.78              4.61              3.72          3.14
  Error (PSD)                  .44               .45               .34           .48
Bias (C_center)
  Estimate                     .71               .86               .72           .94
  Error (PSD)                  .20               .20               .16           .21

Note. PSD = posterior standard deviation.

Figure 4. Estimates of discriminability (d′) against the measurement error (PSD) of each estimate. PSD = posterior standard deviation. Data for inpatients were omitted because, as a result of being administered fewer items by design (see Methods), PSDs associated with inpatients' ability estimates are higher than those of the other groups.

Finally, although our goal was to illustrate a general measurement concern, some results may be specific to characteristics of the current study. However, we purposely collected data from four separate populations and chose a variety of task manipulations in order to increase the range of ability and difficulty under investigation. As a practical guide, researchers may wish to compare their samples' accuracy statistics to our results (see Figure 3).

Significance and Implications

An N-back task with an appropriate number of items that also includes 2- and 3-back conditions, as well as targets, lures, and foils, is expected to provide reliable, moderately efficient estimates of working memory ability in chronic patients with schizophrenia; the same task, however, is expected to provide less reliable estimates of ability in healthy controls. Because validity coefficients are attenuated by unreliability, associations between N-back scores and outcomes (or predictors) of cognitive impairment can appear weaker in healthy controls than in patients with schizophrenia simply because of this measurement artifact. Moreover, the dependence of reliability on ability has been shown to introduce potential confounds in studies of differential deficit (i.e., differences in cognitive abilities between groups; Chapman & Chapman, 1973; Strauss, 2001).
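The classical correction for attenuation makes the size of this artifact concrete. In the Python sketch below the numbers are hypothetical: a true correlation of .50, a perfectly reliable outcome measure, and N-back reliabilities of .75 versus .45, in the range of the impaired and nonimpaired values in Table 4:

```python
def attenuated_r(r_true, rel_x, rel_y):
    """Classical attenuation formula: the observed correlation equals
    the true correlation scaled by the root of the two reliabilities."""
    return r_true * (rel_x * rel_y) ** 0.5

# Hypothetical numbers: true r = .50 between working memory and an
# outcome, N-back reliability .75 (patients) vs. .45 (controls), and
# a perfectly reliable outcome measure.
r_patients = attenuated_r(0.50, 0.75, 1.0)  # ~0.43
r_controls = attenuated_r(0.50, 0.45, 1.0)  # ~0.34
```

The identical true association thus appears noticeably weaker in the group measured less reliably.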

Within the framework of IRT, precision is maximized when predictable variance is maximized. Item information is greatest when the probability of a correct response is 0.50 for dichotomous item responses with no guessing. The common use of SDT to score N-back data in the literature implicitly assumes that examinees do not guess, but rather that response behavior is driven entirely by discriminability (d′) and bias (C_center). Thus, the simple observation that the majority of examinees performed far better than 50% on most N-back items (see Figure 3) suggests that the test does not produce optimally precise or efficient estimates of ability.
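For a dichotomous item with no guessing and a logit link, information is proportional to P(1 − P), which is maximized at P = .50. A small numeric check (Python, illustrative only):

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def info(eta):
    """Information of a single logistic (no-guessing) item: P(1 - P)."""
    p = sigmoid(eta)
    return p * (1.0 - p)

# Information peaks where P(correct) = .50 (eta = 0) and shrinks as an
# item becomes much easier or harder than the examinee's ability.
values = [round(info(eta), 3) for eta in (0.0, 1.0, 2.0, 3.0)]
# values -> [0.25, 0.197, 0.105, 0.045]
```

An item answered correctly 95% of the time therefore yields only about a fifth of the information of a 50/50 item.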

The pattern of measurement error (see Figure 4) was consistent across samples, suggesting that measurement error was a function of ability but not population. It is reasonable to ask, then, how N-back task manipulations might be altered to improve the match between ability and difficulty across groups. Our results suggest that researchers should consider using more difficult versions of the N-back task in cognitive studies meant to precisely measure a wide range of individual differences in working memory ability. This is especially true in clinical studies that include healthy controls as a comparison group, or in studies meant to evaluate change over time. Considering the samples as a whole, our results (e.g., Figure 5) suggest that some examinees with below-average ability, most examinees with average ability, and nearly all examinees with above-average ability might be measured more efficiently and precisely with additional 4- and possibly even 5-back load conditions. Alternative possibilities for increasing item difficulty without increasing N-back load should also be considered. These might include the use of nonword stimuli, a greater proportion of lures, or dual N-back tasks (Jaeggi et al., 2003). The use of pseudowords (pronounceable word-like letter strings) has particular appeal given that pseudowords tend to have a more pronounced word syllable effect (Valdois et al., 2006) and produce higher false-alarm rates (Greene, 2004) relative to words.

Figure 5. Information functions for all combinations of N-back load by item type, holding all other task factors at their median values. For interpretability, the information functions (represented by solid, dashed, and dotted lines corresponding to 1-, 2-, and 3-back loads) are superimposed over the implied distributions of discriminability (d′) in undergraduates, community controls, outpatients, and inpatients.

Table 4
Simulation Results

N-back load               Rel. S.B. (All)   Rel. S.B. (Nonimpaired)   Rel. S.B. (Impaired)   Simulated Cohen's d   Measured Cohen's d   Attenuation in Cohen's d
All 1-back                     .41                 .30                      .42                     .80                  .43                   .37
All 2-back                     .61                 .50                      .62                     .80                  .54                   .26
All 3-back                     .75                 .68                      .75                     .80                  .62                   .18
Mix of 1- & 2-back             .53                 .42                      .54                     .80                  .50                   .30
Mix of 1- & 3-back             .64                 .55                      .64                     .80                  .58                   .22
Mix of 2- & 3-back             .70                 .61                      .70                     .80                  .59                   .21
Mix of 1-, 2-, & 3-back        .64                 .53                      .64                     .80                  .56                   .24

Note. Rel. S.B. = Spearman-Brown-corrected split-half reliability.

There are, however, two major cautions to consider when evaluating these recommendations. First, efficient measurement, as is expected to result from administering more difficult items, could come at the cost of tolerability. Parenthetically, we have observed that participants' reports of mental workload during the N-back task tend to be high even when performance is very good. Four- and especially 5-back runs may cause participants to prematurely discontinue testing, and thus tolerability must be weighed against the benefits of efficient measurement. Second, and perhaps more challenging, manipulating stimulus factors, especially factors other than N-back load, might fundamentally change the task in a way that threatens construct validity.

Indeed, there is a longstanding debate regarding the relative merits of manipulating task difficulty to improve the precision of cognitive measures (see Strauss, 2001). Changing task difficulty to improve reliability could come at the expense of validity. There are likely several overlapping cognitive processes engaged by the N-back: (a) processes meant to maintain goal- and task-relevant information without passive/external support (e.g., encoding, storage, and rehearsal); (b) processes meant to manipulate information so as to meet task demands (e.g., updating, ordering, and matching); and (c) processes involved in response execution (e.g., bias and inhibition; Cohen et al., 1994; Cohen et al., 1997; Jonides et al., 1997; Kane et al., 2007; Lezak et al., 2012; Oberauer, 2005; Wager & Smith, 2003). Because N-back scores likely reflect a weighted composite of these processes, and because manipulating task difficulty could upset this weighting, the dimensionality of observed scores might vary over conditions (but see Reise, Moore, & Haviland, 2010). From this perspective, it might be argued that task difficulty should be manipulated only if the dimensionality and construct validity of measures can be preserved across conditions.

Researchers interested in investigating, and dissociating, specific deficits using experimental cognitive approaches (see MacDonald & Carter, 2002) may prefer to compare performance scores produced by task conditions that are thought to isolate specific cognitive processes (e.g., Ragland et al., 2002). Unfortunately, under such circumstances, where difficulty is held constant within, but might differ between, experimental conditions, the amount of nonerror or informative variance in test scores that is directly related to impaired neurocognitive processes might vary over conditions, thus leading to the reliability and effect size confounds detailed here. As noted by MacDonald and Carter (2002, pp. 880–881), "The challenge for researchers from the experimental cognitive approach is to ensure that their measures of cognitive processes produce an adequate amount of variance so that they are sensitive to the presence of an impairment." There are two general solutions to this problem. First, researchers can explicitly seek to create process-pure or process-isolating tasks that nonetheless have a wide range of difficulty. Second, researchers can develop mathematical cognitive and psychometric measurement models that link manipulations of item difficulty to specific cognitive processes (e.g., Brown, Patt, Sawyer, & Thomas, 2016; Brown, Turner, Mano, Bolden, & Thomas, 2013; Embretson, 1984), thereby allowing for the optimization of measurement precision through difficulty manipulations while also accounting for the changing dimensionality of observed test scores. To this end, further work is needed to determine how best to manipulate task difficulty and model response processes on the N-back and other experimental cognitive measures being used in studies of abnormal cognition in schizophrenia (e.g., Barch & Smith, 2008).

Conclusion

This study has demonstrated how task difficulty affects both reliability and effect size measures of group differences.
Although concerns related to mismatched ability and difficulty have been known in the psychometric literature for many years, and have been acknowledged using classical psychometric methods in schizophrenia research (Chapman & Chapman, 1973), this study is among the first to show the practical, negative consequences of mismatched ability and difficulty using modern psychometric methods. The problems can be overcome, in part, by administering tasks that include a wide range of difficulty to avoid psychometric floor and ceiling effects. However, researchers must also consider how changes to task difficulty affect tolerability as well as both the dimensionality and the construct validity of measures.

References

Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In K. W. Spence & J. T. Spence (Eds.), Psychology of learning and motivation (Vol. 2, pp. 89–195). Oxford, England:

Academic Press.
Baddeley, A. (1992). Working memory. Science, 255, 556–559.
Baddeley, A. D., & Hitch, G. (1974). Working memory. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (pp. 47–89). New York, NY: Academic Press.

Barch, D. M., & Smith, E. (2008). The cognitive neuroscience of working memory: Relevance to CNTRICS and schizophrenia. Biological Psychiatry, 64, 11–17.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2014). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-7.

Bilder, R. M., Goldman, R. S., Robinson, D., Reiter, G., Bell, L., Bates, J. A., . . . Lieberman, J. A. (2000). Neuropsychology of first-episode schizophrenia: Initial characterization and clinical correlates. The American Journal of Psychiatry, 157, 549–559.
Brown, G. G., Lohr, J., Notestine, R., Turner, T., Gamst, A., & Eyler, L. T.

(2007). Performance of schizophrenia and bipolar patients on verbal and figural working memory tasks. Journal of Abnormal Psychology, 116, 741–753.
Brown, G. G., Patt, V. M., Sawyer, J., & Thomas, M. L. (2016). Double dissociation of a latent working memory process. Journal of Clinical and

Experimental Neuropsychology, 38, 59–75.
Brown, G. G., & Thompson, W. K. (2010). Functional brain imaging in schizophrenia: Selected results and methods. In N. R. Swerdlow (Ed.), Behavioral neurobiology of schizophrenia and its treatment (pp. 181–214).

New York, NY: Springer.
Brown, G. G., Turner, T. H., Mano, Q. R., Bolden, K., & Thomas, M. L.

(2013). Experimental manipulation of working memory model parameters: An exercise in construct validity. Psychological Assessment, 25, 844–858.
Callicott, J. H., Bertolino, A., Mattay, V. S., Langheim, F. J., Duyn, J., Coppola, R., . . . Weinberger, D. R. (2000). Physiological dysfunction of the dorsolateral prefrontal cortex in schizophrenia revisited. Cerebral Cortex, 10, 1078–1092.
Chapman, L. J., & Chapman, J. P. (1973). Problems in the measurement of cognitive deficit. Psychological Bulletin, 79, 380–385.
Christopher, G., & MacDonald, J. (2005). The impact of clinical depression on working memory. Cognitive Neuropsychiatry, 10, 379–399.
Cohen, J. D., Forman, S. D., Braver, T. S., Casey, B. J., Servan-Schreiber, D., & Noll, D. C. (1994). Activation of the prefrontal cortex in a nonspatial working memory task with functional MRI. Human Brain Mapping, 1, 293–304.
Cohen, J. D., Perlstein, W. M., Braver, T. S., Nystrom, L. E., Noll, D. C., Jonides, J., & Smith, E. E. (1997). Temporal dynamics of brain activation during a working memory task. Nature, 386, 604–608.
Cohen, J. D., & Servan-Schreiber, D. (1992). Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychological Review, 99, 45–77.
Cowan, N. (1988). Evolving conceptions of memory storage, selective attention, and their mutual constraints within the human information-processing system. Psychological Bulletin, 104, 163–191.
de Boeck, P., Bakkar, M., Zwitser, R., Nivard, M., Hofman, A., Tuerlinckx, F., & Partchev, I. (2011). The estimation of item response models with the lmer function from the lme4 package in R. Journal of Statistical Software, 39, 1–28.
DeCarlo, L. T. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3, 186–205.
DeCarlo, L. T. (2011). Signal detection theory with item effects. Journal of Mathematical Psychology, 55, 229–239.
Embretson, S. E. (1984). A general multicomponent latent trait model for response processes. Psychometrika, 49, 175–186.
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341–349.
Engelhardt, P. E., Nigg, J. T., Carr, L. A., & Ferreira, F. (2008). Cognitive inhibition and working memory in attention-deficit/hyperactivity disorder. Journal of Abnormal Psychology, 117, 591–605.
First, M. B., Spitzer, R. L., Gibbon, M., & Williams, J. B. W. (2002a).

Structured clinical interview for DSM–IV–TR Axis I disorders, research version, non-patient ed. (SCID-I/NP). New York, NY: Biometrics Research, New York State Psychiatric Institute.

First, M. B., Spitzer, R. L., Gibbon, M., & Williams, J. B. W. (2002b).

Structured clinical interview for DSM–IV–TR Axis I disorders, research version, patient ed. (SCID-I/P). New York, NY: Biometrics Research, New York State Psychiatric Institute.

Gevins, A. S., Bressler, S. L., Cutillo, B. A., Illes, J., Miller, J. C., Stern, J., & Jex, H. R. (1990). Effects of prolonged mental work on functional brain topography. Electroencephalography & Clinical Neurophysiology, 76, 339–350.
Gevins, A., & Cutillo, B. (1993). Spatiotemporal dynamics of component processes in human working memory. Electroencephalography & Clinical Neurophysiology, 87, 128–143.
Glahn, D. C., Ragland, J. D., Abramoff, A., Barrett, J., Laird, A. R., Bearden, C. E., & Velligan, D. I. (2005). Beyond hypofrontality: A quantitative meta-analysis of functional neuroimaging studies of working memory in schizophrenia. Human Brain Mapping, 25, 60–69.
Greene, R. L. (2004). Recognition memory for pseudowords. Journal of Memory and Language, 50, 259–267.
Gur, R. E., Calkins, M. E., Gur, R. C., Horan, W. P., Nuechterlein, K. H., Seidman, L. J., & Stone, W. S. (2007). The consortium on the genetics of schizophrenia: Neurocognitive endophenotypes. Schizophrenia Bulletin, 33, 49–68.
Gur, R. C., Erwin, R. J., & Gur, R. E. (1992). Neurobehavioral probes for physiologic neuroimaging studies. Archives of General Psychiatry, 49, 409–414.
Haynes, S. N., Smith, G., & Hunsley, J. D. (2011). Scientific foundations of clinical assessment. New York, NY: Routledge.

Hox, J. J. (2010). Multilevel analysis: Techniques and applications. New York, NY: Routledge/Taylor & Francis Group.

Hyman, S. E., & Fenton, W. S. (2003). Medicine. What are the right targets for psychopharmacology? Science, 299, 350–351.
Insel, T. R. (2012). Next-generation treatments for mental disorders. Science Translational Medicine, 4, 155ps19.
Jacola, L. M., Willard, V. W., Ashford, J. M., Ogg, R. J., Scoggins, M. A., Jones, M. M., . . . Conklin, H. M. (2014). Clinical utility of the N-back task in functional neuroimaging studies of working memory. Journal of Clinical and Experimental Neuropsychology, 36, 875–886.
Jaeggi, S. M., Buschkuehl, M., Jonides, J., & Perrig, W. J. (2008). Improving fluid intelligence with training on working memory. PNAS Proceedings of the National Academy of Sciences of the United States of America, 105, 6829–6833.
Jaeggi, S. M., Buschkuehl, M., Perrig, W. J., & Meier, B. (2010). The concurrent validity of the N-back task as a working memory measure.

Memory, 18, 394–412.
Jaeggi, S. M., Seewer, R., Nirkko, A. C., Eckstein, D., Schroth, G., Groner, R., & Gutbrod, K. (2003). Does excessive memory load attenuate activation in the prefrontal cortex? Load-dependent processing in single and dual tasks: Functional magnetic resonance imaging study. NeuroImage, 19, 210–225.
Jonides, J., Schumacher, E. H., Smith, E. E., Lauber, E. J., Awh, E., Minoshima, S., & Koeppe, R. A. (1997). Verbal working memory load affects regional brain activation as measured by PET. Journal of Cognitive Neuroscience, 9, 462–475.
Kalkstein, S., Hurford, I., & Gur, R. C. (2010). Neurocognition in schizophrenia. Current Topics in Behavioral Neurosciences, 4, 373–390.
Kane, M. J., Conway, A. R. A., Miura, T. K., & Colflesh, G. J. H. (2007). Working memory, attention control, and the N-back task: A question of construct validity.

Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 615–622.
Kane, M. J., & Engle, R. W. (2002). The role of prefrontal cortex in working-memory capacity, executive attention, and general fluid intelligence: An individual-differences perspective. Psychonomic Bulletin & Review, 9, 637–671.

Kirchner, W. K. (1958). Age differences in short-term retention of rapidly changing information. Journal of Experimental Psychology, 55, 352–358.
Lee, J., Green, M. F., Calkins, M. E., Greenwood, T. A., Gur, R. E., Gur, R. C., . . .

Braff, D. L. (2015). Verbal working memory in schizophrenia from the Consortium on the Genetics of Schizophrenia (COGS) study: The moderating role of smoking status and antipsychotic medications. Schizophrenia Research, 163, 24–31.
Lee, J., & Park, S. (2005). Working memory impairments in schizophrenia:

A meta-analysis. Journal of Abnormal Psychology, 114, 599–611.
Levitt, J. J., Bobrow, L., Lucia, D., & Srinivasan, P. (2010). A selective review of volumetric and morphometric imaging in schizophrenia. Current Topics in Behavioral Neurosciences, 4, 243–281.
Lezak, M. D., Howieson, D. B., Bigler, E. D., & Tranel, D. (2012). Neuropsychological assessment. New York, NY: Oxford University Press.

Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325–336.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores (with contributions by A. Birnbaum). Reading, MA: Addison-Wesley.

MacDonald, A. W., III, & Carter, C. S. (2002). Cognitive experimental approaches to investigating impaired cognition in schizophrenia: A paradigm shift. Journal of Clinical and Experimental Neuropsychology, 24, 873–882.
Mackworth, J. F. (1959). Paced memorizing in a continuous task. Journal of Experimental Psychology, 58, 206–211.
Manoach, D. S., Press, D. Z., Thangaraj, V., Searl, M. M., Goff, D. C., Halpern, E., . . . Warach, S. (1999). Schizophrenic subjects activate dorsolateral prefrontal cortex during a working memory task, as measured by fMRI. Biological Psychiatry, 45, 1128–1137.
Miller, K. M., Price, C. C., Okun, M. S., Montijo, H., & Bowers, D. (2009).

Is the n-back task a valid neuropsychological measure for assessing working memory?Archives of Clinical Neuropsychology, 24,711–717. Miyake, A., & Shah, P. (1999). Toward unified theories of working memory: Emerging general consensus, unresolved theoretical issues, and future research directions. In A. Miyake & P. Shah (Eds.),Models of working memory: Mechanisms of active maintenance and executive control(pp. 442– 482). Cambridge, United Kingdom: Cambridge Uni- versity Press. Nystrom, L. E., Braver, T. S., Sabb, F. W., Delgado, M. R., Noll, D. C., & Cohen, J. D. (2000). Working memory for letters, shapes, and locations:

FMRI evidence against stimulus-based regional organization in human prefrontal cortex.NeuroImage, 11,424 – 446. nimg.2000.0572 Oberauer, K. (2005). Binding and inhibition in working memory: Individ- ual and age differences in short-term recognition.Journal of Experimen- tal Psychology: General, 134,368 –387. 3445.134.3.368 Owen, A. M., McMillan, K. M., Laird, A. R., & Bullmore, E. (2005).

N-back working memory paradigm: A meta-analysis of normative func- tional neuroimaging studies.Human Brain Mapping, 25,46 –59.http:// Ragland, J. D., Turetsky, B. I., Gur, R. C., Gunning-Dixon, F., Turner, T., Schroeder, L.,...Gur, R. E. (2002). Working memory for complex figures: An fMRI comparison of letter and fractal n-back tasks.Neuro- psychology, 16,370 –379., Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Retrieved from Reckase, M. D. (2009).Multidimensional item response theory. New York, NY: Springer. Reichenberg, A., & Harvey, P. D. (2007). Neuropsychological impairments in schizophrenia: Integration of performance-based and brain imaging findings.Psychological Bulletin, 133,833– 858. .1037/0033-2909.133.5.833 Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores.Journal of Personality Assessment, 92,544 –559. Salmon, D. P., & Bondi, M. W. (2009). Neuropsychological assessment of dementia.Annual Review of Psychology, 60,257–282. 10.1146/annurev.psych.57.102904.190024 Salthouse, T. A., Atkinson, T. M., & Berish, D. E. (2003). Executive functioning as a potential mediator of age-related cognitive decline in normal adults.Journal of Experimental Psychology: General, 132,566 – 594. Shamosh, N. A., Deyoung, C. G., Green, A. E., Reis, D. L., Johnson, M. R., Conway, A. R. A.,...Gray, J. R. (2008). Individual differences in delay discounting: Relation to intelligence, working memory, and anterior prefrontal cortex.Psychological Science, 19,904 –911. 10.1111/j.1467-9280.2008.02175.x Shaw, M. E., Moores, K. A., Clark, R. C., McFarlane, A. C., Strother, S. C., Bryant, R. A.,...Taylor, J. D. (2009). Functional connectivity reveals inefficient working memory systems in post-traumatic stress disorder.Psychiatry Research: Neuroimaging, 172,235–241.http://dx Shelton, J. T., Elliott, E. M., Matthews, R. A., Hill, B. D., & Gouvier, W. D. (2010). The relationships of working memory, secondary mem- ory, and general fluid intelligence: Working memory is special.Journal of Experimental Psychology: Learning, Memory, and Cognition, 36, 813– 820. Snodgrass, J. G., & Corwin, J. (1988). Pragmatics of measuring recognition memory: Applications to dementia and amnesia.Journal of Experimental Psychology: General, 117,34 –50. .1.34 Spearman, C. (1904). 
The proof and measurement of association between two things.The American Journal of Psychology, 15,72–101.http://dx Strauss, M. E. (2001). Demonstrating specific cognitive deficits: A psy- chometric perspective.Journal of Abnormal Psychology, 110,6 –14. Swaminathan, H., Hambleton, R., & Rogers, H. J. (2007). Assessing the fit of item response theory models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics 26: Psychometrics(pp. 683–718). Boston, MA:

Elsevier North-Holland.

Thomas, M. L. (2011). The value of item response theory in clinical assessment: A review.Assessment, 18,291–307. .1177/1073191110374797 Thomas, M. L., Brown, G. G., Gur, R. C., Hansen, J. A., Nock, M. K., Heeringa, S.,...Stein, M. B. (2013). Parallel psychometric and cognitive modeling analyses of the Penn Face Memory Test in the Army Study to Assess Risk and Resilience in Servicemembers.Journal of Clinical and Experimental Neuropsychology, 35,225–245.http://dx.doi .org/10.1080/13803395.2012.762974 Thomas, M. L., Brown, G. G., Gur, R. C., Moore, T. M., Patt, V. M., Risbrough, V. M., & Bake, D. G. (2016). Psychometric applications of an item response-signal detection model. Manuscript submitted for pub- lication.

Thomas, M. L., Brown, G. G., Thompson, W. K., Voyvodic, J., Greve, D. N., Turner, J. A.,...theFBIRN. (2013). An application of item This document is copyrighted by the American Psychological Association or one of its allied publishers.

This article is intended solely for the personal use of the individual user and is not to be disseminated broadly. 323 ATTENUATION OF COGNITIVE DEFICITS IN SCHIZOPHRENIA response theory to fMRI data: Prospects and pitfalls.Psychiatry Re- search: Neuroimaging, 212,167–174. .pscychresns.2013.01.009 Valdois, S., Carbonnel, S., Juphard, A., Baciu, M., Ans, B., Peyrin, C., & Segebarth, C. (2006). Polysyllabic pseudo-word processing in reading and lexical decision: Converging evidence from behavioral data, con- nectionist simulations and functional MRI.Brain Research, 1085,149 – 162. Vallat-Azouvi, C., Weber, T., Legrand, L., & Azouvi, P. (2007). Working memory after severe traumatic brain injury.Journal of the International Neuropsycho- logical Society, 13,770 –780. Wager, T. D., & Smith, E. E. (2003). Neuroimaging studies of working memory: A meta-analysis.Cognitive, Affective & Behavioral Neurosci- ence, 3,255–274., A. T. (1952). An apparatus for use in studying serial performance.

The American Journal of Psychology, 65,91–97. .2307/1418834 Williams, D. L., Goldstein, G., Carpenter, P. A., & Minshew, N. J. (2005).

Verbal and spatial working memory in autism.Journal of Autism and Developmental Disorders, 35,747–756. s10803-005-0021-x Wilson, M. D. (1988). The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2.Behavioural Research Methods, Instru- ments and Computers, 20,6 –11.

Witt, J. K., Taylor, J. E. T., Sugovic, M., & Wixted, J. T. (2015). Signal detection measures cannot distinguish perceptual biases from response biases. Perception, 44, 289–300.

Appendix

Derivation of Model Used in Analyses

This appendix provides the derivation of the generalized linear model used in all analyses. Equal-variance SDT (DeCarlo, 1998; Snodgrass & Corwin, 1988) first assumes that the distributions of familiarity for targets and foils (or lures) can be modeled by two normal distributions with equal variance and means μ_T and μ_F, respectively (see Figure 2). The discrimination parameter d′ represents the distance between the two distributions:

d′ = μ_T − μ_F (A1)

The decision criterion, C, represents the threshold at which individuals may judge that an item looks familiar enough to respond. C can be centered with respect to the midpoint between the two distributions:

C_center = C − (μ_T + μ_F)/2 (A2)

The probability of responding given that a target was presented, P(U = 1 | Target), corresponds mathematically to the area under the target distribution that is to the right of the criterion:

Φ⁻¹(P(U = 1 | Target)) = μ_T − C (A3)

where Φ⁻¹ is the inverse cumulative distribution function for the normal distribution. Similarly, the probability of responding given that a foil (or lure) was presented, P(U = 1 | Foil), corresponds mathematically to the area under the foil distribution that is to the right of the criterion:

Φ⁻¹(P(U = 1 | Foil)) = μ_F − C (A4)

Using a binary variable Z, with Z = 1 if the test item is a target and Z = −1 if the test item is a foil (or lure), Equations A3 and A4 can be combined into the formula that appears in Appendix A of DeCarlo (1998):

Φ⁻¹(P(U = 1 | Z)) = (μ_F − C)(1 − Z)/2 + (μ_T − C)(Z + 1)/2 = −C_center + (d′/2)Z (A5)

To align the approach with IRT, the model can also be formulated to predict the probability of a correct response. A new binary variable X was thus introduced so that X = 1 for a correct response and X = 0 for an incorrect response. Using the property that Φ⁻¹(1 − P) = −Φ⁻¹(P), and knowing that responding is correct when a target is presented whereas not responding is correct when a foil is presented, Equation A5 yielded

Φ⁻¹(P(X = 1 | Target)) = Φ⁻¹(P(U = 1 | Z = 1)) = −C_center + d′/2
Φ⁻¹(P(X = 1 | Foil)) = Φ⁻¹(1 − P(U = 1 | Z = −1)) = C_center + d′/2 (A6)

These equations were combined, leading to:

Φ⁻¹(P(X = 1 | Z)) = −Z·C_center + d′/2 (A7)

To account for item differences in easiness (over j of J items) and person differences in ability (over i of N people), as in IRT, we added the term β_j as well as subscripts to each parameter to arrive at our final equation:

Φ⁻¹(P(X_ij = 1)) = β_j − Z_j·C_center,i + d′_i/2 (A8)

In this form, the model is functionally equivalent to a multidimensional IRT model, but appears superficially distinct because of the use of notation that is common in SDT but not in IRT (see Thomas et al., 2016).
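The SDT identities underlying this derivation can be checked numerically. The following is a minimal sketch, not the authors' analysis code: the hit and false-alarm rates are made-up illustrative values, and only the equal-variance model of Equations A1–A7 is exercised (the person and item subscripts of Equation A8 are omitted).

```python
# Numerical check of the equal-variance SDT identities in the Appendix.
# Hit and false-alarm rates below are hypothetical illustrative values.
from statistics import NormalDist

Phi = NormalDist().cdf          # standard normal CDF
Phi_inv = NormalDist().inv_cdf  # its inverse (quantile function)

hit_rate = 0.85   # hypothetical P(U = 1 | Target)
fa_rate = 0.20    # hypothetical P(U = 1 | Foil)

# Equations A3 and A4 give mu_T - C and mu_F - C, from which:
d_prime = Phi_inv(hit_rate) - Phi_inv(fa_rate)           # A1: mu_T - mu_F
c_center = -(Phi_inv(hit_rate) + Phi_inv(fa_rate)) / 2   # A2: C - (mu_T + mu_F)/2

# Equation A7: P(X = 1 | Z) = Phi(-Z * C_center + d'/2)
p_correct_target = Phi(-(+1) * c_center + d_prime / 2)
p_correct_foil = Phi(-(-1) * c_center + d_prime / 2)

# Accuracy on targets recovers the hit rate; accuracy on foils recovers
# the correct-rejection rate, as Equation A6 requires.
assert abs(p_correct_target - hit_rate) < 1e-6
assert abs(p_correct_foil - (1 - fa_rate)) < 1e-6
```

Extending the sketch to Equation A8 would amount to adding an item-easiness term inside Φ and indexing C_center and d′ by person, which is what makes the model estimable as a multidimensional IRT model.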

Received July 11, 2016
Revision received December 5, 2016
Accepted December 9, 2016
