A signal detection analysis of the relationship between confidence and accuracy in face recognition memory: Implications for eyewitness identification

 

 

Ebbe B. Ebbesen and John T. Wixted

University of California, San Diego

 

 

Working Draft Revision

 

Abstract

Signal detection analyses of two face recognition memory experiments and Monte-Carlo simulations indicated, contrary to the beliefs of most eyewitness memory experts, that confidence and accuracy are intimately related. The results showed that although correlations between confidence and accuracy tend to be low (rarely above .5), higher mean confidence is associated with higher d's, confidence for correct responses is higher than for incorrect ones, ROC curves based on confidence are linear, and confidence is well calibrated to accuracy. Maximum likelihood estimates of confidence cutpoints along with data from a third study in which subjects predicted their accuracy indicated that subjects' expectations of greater than actual memory loss result in reduced confidence in recognition responses.


Introduction

According to a survey by Kassin, Ellsworth, and Smith (1989) eyewitness memory experts seem to agree that the relationship between the accuracy of eyewitness identifications and the degree of confidence that eyewitnesses express in their identifications is, at best, a rather weak one.[1] This consensus seems to be based, in part, on the fact that in studies of recognition memory for human faces, the correlation between accuracy and confidence is generally small.[2] Although the size of these correlations varies widely over studies, many are not significant, and those that are significant rarely exceed .5 (Brigham, 1990; Bothwell, Deffenbacher, & Brigham, 1987; Deffenbacher, 1980; Fleet, Brigham, & Bothwell, 1987; Lindsay, 1986; Shapiro & Penrod, 1986; Smith, Kassin, & Ellsworth, 1989; Wells & Murray, 1984; Wells & Lindsay, 1980; Wells & Lindsay, 1985). In the majority of the correlations reported in the face-memory literature, individual differences in memory (and on occasion differences in item memorability, e.g., Read, Vokey, & Hammersley, 1990; Smith, Kassin, & Ellsworth, 1989) provide the source of variation in both accuracy and confidence (Deffenbacher, 1980; Shapiro & Penrod, 1986).

A second aspect of the consensus of opinion reported in Kassin, Ellsworth, and Smith (1989) is the belief that other factors are better predictors of accuracy than confidence. For example, a large majority of experts who agreed that the relationship between confidence and accuracy is weak, also agreed that both duration of exposure and retention interval were strongly related to eyewitness accuracy. Although never stated, presumably these beliefs are based on results from experimental studies in which manipulations of duration and/or retention interval affected mean accuracy across different conditions (Shapiro & Penrod, 1986) and in which the correlations between people’s confidence and their accuracy are small.

The fact that beliefs in the lack of relationship between confidence and accuracy are based on individual differences while virtually all of the theoretical claims about the nature of eyewitness memory are based on the effects that changes in the learning and memory context have on group-average accuracy measures raises an important issue that has been all but ignored in the literature (with one notable exception, Wells & Lindsay, 1985). Namely, how should we define and measure the "relationship" between confidence and accuracy? For example, little research has asked what the relationship between confidence and accuracy might be when measured not over individual (nor even item) differences in memory but rather over those conditions that reliably produce average differences in the accuracy of recognition, such as retention interval. Do conditions that tend to produce low accuracy also produce low confidence estimates? In addition, we know of no attempt to assess the strength of the relationship in terms of different item effects over individuals. That is, are faces that most people remember also those faces in which most people have high confidence? Finally, although reports of correlations between confidence and accuracy abound, researchers have not measured the extent to which the accuracy of identifications is calibrated to (not correlated with) confidence estimates, despite the fact, as we shall discuss later, that this issue may be the most important from an applied point of view. One reason eyewitness experts may have failed to consider how best to measure the strength of the confidence-accuracy relationship is that they have not offered a precise mechanism to explain how confidence and accuracy might be related.

The present research has several related purposes. At one level we investigate whether factors that are known to affect memory in specific and well-known ways might not also affect average confidence in the same manner as accuracy (even if the correlations over individuals and/or items are small). We also ask whether these same manipulations affect the extent to which people's confidence estimates are calibrated to their accuracy. Finally, we take the position that signal detection provides a satisfactory and empirically useful descriptive mechanism for all of the ways of measuring how confidence and accuracy might be related. After all, since the seminal work of Henmon (1911) and Peirce & Jastrow (1884), the idea that accuracy of judgment would be related to confidence in those judgments has been a major part of experimental psychology (Link, 1992), even of recognition memory experiments, but for stimuli other than faces (Donaldson & Murdock, 1968; Murdock, 1980; Ratcliff, Sheu, & Gronlund, 1992; Wickelgren & Norman, 1966).

In eyewitness memory research two different types of explanations have been offered for the low confidence-accuracy correlations. One focuses on factors that may affect one measure and not the other, for example, social pressure to appear confident or pressures to choose someone even when memory is weak (Lindsay & Johnson, 1989). The other explanation, originally proposed by Deffenbacher (1980) but since accepted by others (e.g., Bothwell, Deffenbacher, & Brigham, 1987; Luus and Wells, 1994), assumes that the correlation between confidence and accuracy is only high when the information-processing conditions present at encoding, during memory storage, and at the memory test are optimal. Deffenbacher's optimality hypothesis has the advantage of explaining both the generally low correlations (learning and memory conditions are sub-optimal in most experiments, not to mention actual crime situations) as well as their wide range (experiments differ in how optimal their conditions are). Unfortunately, neither of these explanations provides a model of the way confidence estimates are generated (nor do they explain why so many non-experts, including the Supreme Court of the United States in Neil v. Biggers, seem to believe that confidence and accuracy are strongly related despite their apparent weak empirical relationship).

Signal Detection and the Confidence-Accuracy Relationship in Face Memory

By placing the recognition memory task within a signal detection framework, something that is quite commonly done (Shapiro & Penrod, 1986), it is possible to derive the optimality hypothesis and to make some additional, but previously ignored, predictions about factors that may affect the size of the confidence-accuracy relationship. In addition, a signal detection analysis tends to emphasize the importance of measures of the strength and nature of the confidence-accuracy relationship other than correlation coefficients, measures that suggest that confidence and accuracy are intimately related even when correlations are low. Figure 1 shows a signal detection analysis of a face memory experiment in which subjects are initially exposed to faces for two different durations of exposure and then tested for memory in a “yes/no” recognition task. At the time of the recognition memory task, items are assumed to have different values on a dimension that reflects the strength of subjective evidence that the item was seen before. In addition, those items that actually have been seen before have, on average, stronger evidence values than those items that have not been seen before.[3]

Most applications of signal detection assume that item strengths within both the seen and unseen sets are distributed normally, although logistic distributions are also frequently assumed in other judgment domains (Macmillan & Creelman, 1991). As learning and test conditions improve, the mean strength of the seen items is assumed to increase and the entire distribution of strengths for the seen items shifts towards higher strength values.[4] The signal detection model also assumes that subjects decide whether to say that they recognize a given item by adding a decision mechanism to the underlying strength of evidence system. If a given item’s strength is above a criterion value, the subject identifies the item as having been seen before. Overall accuracy (measured by proportion of correct responses) depends both on the placement of the decision criterion and on the distance between the means of the seen and not seen item distributions (measured as d'). As the distance between the distributions decreases, more seen and unseen items will have similar, and therefore indistinguishable, strength of evidence values, causing an increase in the number of errors. In addition, subjects who place their “yes/no” decision criteria high up on the subjective strength dimension (all other things equal) will tend to correctly reject most of the previously unseen items but will also tend to miss many of the items they have seen before. Conversely, subjects who place their “yes/no” criteria low on the strength dimension will tend to make many false alarms as well as to correctly identify most of the previously seen items.
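The dependence of hits, false alarms, and overall accuracy on d' and on criterion placement can be illustrated with a minimal sketch. It assumes the equal-variance normal model (unseen items distributed N(0, 1), seen items N(d', 1)) and equal numbers of seen and unseen test items; the function name and the example values are illustrative assumptions, not quantities taken from the experiments reported here.

```python
from statistics import NormalDist


def yes_no_performance(d_prime, criterion, p_seen=0.5):
    """Hit rate, false alarm rate, and proportion correct for a yes/no task.

    Assumption: equal-variance normal model, unseen ~ N(0, 1), seen ~ N(d', 1).
    An item is called "seen" when its strength exceeds the criterion.
    """
    unseen = NormalDist(0.0, 1.0)
    seen = NormalDist(d_prime, 1.0)
    hit_rate = 1.0 - seen.cdf(criterion)    # seen items above the criterion
    fa_rate = 1.0 - unseen.cdf(criterion)   # unseen items above the criterion
    prop_correct = p_seen * hit_rate + (1.0 - p_seen) * (1.0 - fa_rate)
    return hit_rate, fa_rate, prop_correct


if __name__ == "__main__":
    # Illustrative values only: a low, a midway, and a high criterion for d' = 1.
    for criterion in (0.0, 0.5, 1.0):
        hits, fas, correct = yes_no_performance(1.0, criterion)
        print(f"criterion={criterion:.1f} hits={hits:.3f} "
              f"false alarms={fas:.3f} correct={correct:.3f}")
```

Raising the criterion lowers both the hit rate and the false alarm rate, whereas increasing d' at a fixed relative criterion raises proportion correct.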


Figure 1. The figure shows how signal detection represents the effect of increasing the duration of exposure on recognition memory performance. The strengths of evidence associated with not seen and seen items are assumed to be normally distributed over each item type, although it is likely that the variance of the seen items increases as d' increases. In each case confidence estimates are treated as additional cutpoints on the strength of evidence dimension, where 1 is "just guessing" and 5 is "absolutely confident" that an item was either seen or not seen before. The optimal placement of the "yes/no" decision criterion moves to higher strength of evidence values as d' increases and presumably the confidence criteria move as well. Exactly how the confidence criteria are likely to move is unknown and, like the "yes/no" criteria, they are probably under the control of motivational variables.

Confidence is easily added to this model by assuming that subjects apply a second set of decision criteria to the strength dimension for the confidence ratings (Egan, Schulman, & Greenberg, 1959; Macmillan & Creelman, 1991; Donaldson & Murdock, 1968; Wickelgren & Norman, 1966). In this view, each level of confidence that an item had or had not been seen before is a band on the strength dimension. Thus, very high strengths result in a "yes, I am absolutely confident (I have seen the face before)" response, while moderate values might produce "yes, I am just guessing" responses. Similarly, a moderately low strength item might result in a "no, I am moderately confident (I haven't seen the face before)" response.
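The band interpretation of confidence can be sketched as a simple lookup. The cutpoint distances below are hypothetical, and the sketch assumes, purely for illustration, that the confidence cutpoints sit symmetrically on either side of the "yes/no" criterion; the text does not commit to particular values or to symmetry.

```python
from bisect import bisect_right

CONFIDENCE_LABELS = (
    "just guessing",
    "slightly confident",
    "moderately confident",
    "highly confident",
    "absolutely confident",
)

# Hypothetical distances from the yes/no criterion at which confidence
# steps up from one level to the next (assumed symmetric for yes and no).
CUTPOINT_OFFSETS = (0.3, 0.6, 0.9, 1.2)


def rate_item(strength, criterion=0.5, offsets=CUTPOINT_OFFSETS):
    """Map a strength-of-evidence value to a (response, confidence 1-5) pair."""
    response = "yes" if strength > criterion else "no"
    distance = abs(strength - criterion)
    # Count how many cutpoints the item's distance has passed.
    level = bisect_right(offsets, distance) + 1
    return response, level


if __name__ == "__main__":
    for strength in (2.0, 0.6, -0.2):
        response, level = rate_item(strength)
        print(f"strength={strength:+.1f}: {response}, {CONFIDENCE_LABELS[level - 1]}")
```

Items far above the criterion receive confident "yes" responses, items far below it receive confident "no" responses, and items near the criterion receive low confidence in either direction.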

We can see, in an intuitive fashion, how the above model might explain the optimality effect by examining Figure 1. In sub-optimal (short duration) learning and memory conditions, i.e., ones in which d' is small, correct and incorrect responses will tend to be assigned to similar confidence estimates because so many seen items will have values on the subjective dimension that are close to those for the unseen items. Thus, a significant proportion of the highly confident “yes,” responses will be false alarms and, similarly, a significant proportion of the highly confident “no,” responses will be misses. That is, many highly confident responses will be errors. In contrast, when the learning conditions are optimal and the seen and not seen distributions are widely separated on the strength dimension, virtually all of the highly confident yes and no responses will be correct and only the lower confidence responses will tend to contain a high proportion of errors. Thus, as the learning conditions improve, we would expect more and more of the highly confident responses to be correct.[5]

One feature of the model depicted in Figure 1 about which there is theoretical uncertainty is how changes in learning and memory conditions will affect, if at all, the placement of the confidence criteria. Figure 1 describes a model in which the confidence criteria move in lock-step with changes in the "yes/no" decision criteria; however, there are other reasonable possibilities. For example, the confidence criteria might remain fixed on the evidence scale or they might expand and contract to fill the range of strength of evidence values. Still another possibility is that people might adjust their confidence so as to maintain fixed likelihoods that the responses at each level of confidence are correct (Wixted and Ebbesen, 1995). We shall postpone further discussion of the consequences of this important issue until we have examined some empirical results.

Regardless of the particulars of the decision model, in order to generalize signal-detection reasoning to the different measures of the confidence-accuracy relation one needs to consider the ways in which different sources of variability might be expressed in the signal detection paradigm. Because two measures, "yes/no" identification and confidence, are involved and because signal detection allows for independence between the decision criterion and the discrimination (d') components of the model, it is possible that different measures might be more influenced by one or another part of the system. For example, in the case of correlations between confidence and accuracy, were we to assume that individual differences in performance are due mostly to the difference between the seen and unseen distributions, i.e., d', it follows that individual difference correlations between accuracy and confidence based on averages over items should be high unless the learning and test conditions are so poor that no one performs much better than chance. This is because all of the individual difference variance in accuracy and in confidence would be the result of individual difference variance in d' and, as can be seen in Figure 1, as d' increases, so do both accuracy and mean confidence. On the other hand, differences between individuals may be more a function of the placement of their decision criteria, or even the location of their confidence cutpoints. In these cases, individual difference correlations might be considerably attenuated because subjects with identical d's might place their criteria in different locations.

In contrast, when an item-based correlation is computed for each subject from the accuracy of that subject's "yes" and "no" responses and that subject's confidence in those responses, individual differences in where subjects place their criteria would be largely irrelevant because each correlation would be for a single subject and that subject's decision criteria would be assumed to remain the same over all of the items used in computation of the correlation. Similarly, if differences between faces are the result of differences in the ease with which subjects can learn and remember them (e.g., some are more distinctive than others), signal detection analysis would again expect the confidence and accuracy correlation to be high when computed over faces because all of the variation in both confidence and accuracy would be d'-produced.

Even if neither individual nor face differences in recognition accuracy are due exclusively to differences in d' and correlations between confidence and accuracy are therefore low, signal detection analysis still makes the strong prediction that the mean confidence that people express (over items) should be controlled by the same variables that affect the learning and memory of those items, that is, accuracy of recognition. Stated differently, mean confidence (averaged over subjects and items) and total proportion correct should be strongly related across learning and test conditions because all of the condition variance should be captured by the difference in the seen and unseen distributions, i.e., d', and all of the individual differences in other parts of the model should average out (provided, of course, that the learning and memory variables do not also independently affect response bias and/or placement of the confidence criteria). Thus, even though individual difference correlations between confidence and accuracy might be relatively small, mean confidence and overall accuracy should be strongly related across different learning and memory conditions. In fact, as we shall see, signal detection places specific constraints on the form of that relationship.
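The contrast between attenuated individual-difference correlations and a strong condition-level relationship can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the simulation procedure used in this research: it assumes the equal-variance normal model, an unbiased criterion midway between the distributions, confidence cutpoints that move with the criterion, and hypothetical values for the cutpoint distances, the spread of d' within a condition, and the range over which subjects stretch or compress their cutpoints.

```python
import random
from statistics import NormalDist

BASE_OFFSETS = (0.3, 0.6, 0.9, 1.2)  # hypothetical cutpoint distances


def subject_scores(d_prime, spread=1.0):
    """Expected proportion correct and mean confidence (1-5) for one subject.

    Assumptions: unseen ~ N(0, 1), seen ~ N(d', 1), criterion at d'/2,
    confidence cutpoints at criterion +/- spread * BASE_OFFSETS.
    """
    unseen, seen = NormalDist(0.0, 1.0), NormalDist(d_prime, 1.0)
    criterion = d_prime / 2.0
    p_correct = 0.5 * (1.0 - seen.cdf(criterion)) + 0.5 * unseen.cdf(criterion)
    mean_conf = 1.0  # E[level] = 1 + sum over cutpoints of P(distance > cutpoint)
    for offset in BASE_OFFSETS:
        o = offset * spread
        for dist in (seen, unseen):
            beyond = (1.0 - dist.cdf(criterion + o)) + dist.cdf(criterion - o)
            mean_conf += 0.5 * beyond
    return p_correct, mean_conf


def pearson(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_subjects, condition_d_prime, vary_spread, rng):
    """Return (mean accuracy, mean confidence, correlation over subjects)."""
    acc, conf = [], []
    for _ in range(n_subjects):
        d = max(0.0, rng.gauss(condition_d_prime, 0.3))  # hypothetical spread in d'
        spread = rng.uniform(0.5, 2.0) if vary_spread else 1.0
        p, c = subject_scores(d, spread)
        acc.append(p)
        conf.append(c)
    return sum(acc) / n_subjects, sum(conf) / n_subjects, pearson(acc, conf)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (1.0, 1.5, 2.0):  # hypothetical condition means
        mean_acc, mean_conf, r = simulate(2000, d, True, rng)
        print(f"d'={d:.1f} accuracy={mean_acc:.3f} confidence={mean_conf:.2f} r={r:.2f}")
```

Under these assumptions, subjects who differ only in d' show a near-perfect confidence-accuracy correlation, whereas letting cutpoint placement vary across subjects lowers the within-condition correlation while leaving the ordering of condition means intact.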

The signal detection analysis also makes several strong predictions about the probability that responses will be correct at each level of confidence, or stated differently, how well calibrated confidence is to accuracy. For example, the model assumes that the "yes" and "no" responses at given levels of confidence tell us about different parts of the seen and not seen distributions. Thus, when we compute the relative probability of "yes" responses to seen compared to not seen faces (i.e., hits and false alarms) conditioned on particular values of confidence, signal detection assumes that these conditional probabilities reflect the relative areas under the seen and not seen distributions that are in each confidence bin. The same is true for "no" responses. With this in mind, examination of panel b in Figure 1 suggests that the probability that a "yes" response is correct should increase as the level of confidence increases because the area under the seen distribution increases relative to the area under the not seen curve as we move from lower to higher confidence bins. A similar pattern should be found for the "no" responses but with the area under the not seen distribution increasing faster, relative to the area under the seen distribution, as confidence increases. The model in Figure 1 also predicts that the conditional probability of yes and of no responses being correct at each confidence level should be higher in optimal than non-optimal learning conditions (because the difference between the relative areas under the curves in each confidence bin increases as d' increases) and that the rate at which the conditional probabilities increase over confidence bands should be higher in optimal than sub-optimal conditions. In other words, confidence should be better calibrated to accuracy the higher the d'.[6]
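The calibration prediction can be made concrete by computing, for each confidence band, the area under the seen distribution relative to the total area under both distributions. The sketch below assumes the equal-variance normal model, equal numbers of seen and unseen items, a criterion midway between the distributions, and hypothetical cutpoint distances; none of these values is estimated from data.

```python
import math
from statistics import NormalDist


def yes_calibration(d_prime, offsets=(0.3, 0.6, 0.9, 1.2)):
    """P(correct | "yes" at confidence level k) for k = 1..5.

    Assumptions: unseen ~ N(0, 1), seen ~ N(d', 1), criterion at d'/2,
    confidence bands bounded by criterion + offsets (hypothetical values).
    """
    unseen, seen = NormalDist(0.0, 1.0), NormalDist(d_prime, 1.0)
    criterion = d_prime / 2.0
    edges = [criterion] + [criterion + o for o in offsets] + [math.inf]
    probabilities = []
    for low, high in zip(edges, edges[1:]):
        hit_area = seen.cdf(high) - seen.cdf(low)     # hits in this band
        fa_area = unseen.cdf(high) - unseen.cdf(low)  # false alarms in this band
        probabilities.append(hit_area / (hit_area + fa_area))
    return probabilities


if __name__ == "__main__":
    for d in (1.0, 2.0):
        print(d, [round(p, 3) for p in yes_calibration(d)])
```

With these assumptions the conditional probability of a correct "yes" response rises across confidence bands and is higher at every band when d' is larger.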

We designed a simple face-memory experiment to test these various implications of applying signal detection to the confidence-accuracy issue. Both accuracy and confidence were measured as the learning and memory conditions were made more or less optimal by varying, in a between-subjects factorial design, the duration of exposure to each seen face and the length of the retention interval between exposure and testing.[7] Not only were we interested in the correlation between accuracy and confidence over subjects (the typical method of measuring the strength of the relationship between confidence and accuracy), the average correlation over items within subjects, and the correlation over items of average scores over subjects, but we were also interested in examining how memory conditions affected both average accuracy and average confidence and the extent to which confidence estimates were calibrated to the accuracy of individual identification responses.

Experiment 1

Method

Subjects. All 200 subjects were obtained from introductory psychology classes at UCSD and served in partial fulfillment of class requirements. Subjects volunteered to participate in an experiment that would require two sessions, the first of which was two hours long. Of the subjects, 111 were female and the rest were male.

Design. A 2 x 4 between-subjects factorial design was employed. Two levels of duration of exposure to the study faces (three seconds and eleven seconds) were crossed with four different retention intervals (one hour, one day, seven days, and two weeks). Subjects were randomly assigned to the different conditions after they arrived in small groups at the experimental sessions.

Procedure. The experiment consisted of two phases, a study phase and a test phase. During the study phase, subjects were exposed to slides of 40 different male faces, one at a time, at one of the two exposure durations. During the test phase subjects were exposed to 80 slides of male faces, 40 of which had been in the original study set. Half of the subjects in each condition were exposed to one of two different sets of 40 slides that were randomly selected from the entire set of 80 available faces. The slides consisted of 35 mm color pictures of college age males from the UCSD campus that had been taken in several different settings around campus. Slides were projected on a wall painted with white reflective paint. Because subjects were run in small groups (varying from one to four in size), the visual angles and image sizes varied over subjects within each small group. In addition, the size of each face stimulus varied somewhat from slide to slide. On average, subjects sat nine feet from images of faces that averaged 18 inches high on the projection surface.

When each small group of subjects arrived, they were told that we were interested in their reactions to people’s faces, that we were going to show them a series of faces of males, and that we would explain what we wanted them to tell us about the faces after they had seen all of them. They were also told to pay careful attention to each and every face. The room lights were dimmed and the subjects were shown all 40 slides at one of the two durations. Inter-slide intervals were a function of the speed with which the Kodak carousel projector was able to change slides.

After the study phase was complete, the experimenter explained that there was a second part to the study and that they would have to return for that second phase one hour, the next day, one week, or two weeks later in order to receive their class credit. Conversations with individual subjects about schedule conflicts and the session that each subject would be able to attend at the relevant time were completed next. By running multiple testing sessions each day, it was usually possible to schedule all subjects to a session consistent with the retention interval to which they had been assigned. Five of the 200 subjects who were scheduled for retention interval tests failed to attend (four in the two week condition and one in the week condition).

Subjects were run in small groups in the test phase. The experimenter explained that we were interested in how well they could remember the faces that they had seen before. A response form consisting of 80 numbered rows, ten per page, was used to collect yes/no decisions, confidence estimates, and “reasons.” Subjects were instructed to circle yes if they believed that they had seen a face in the previous study set before and no if not, to indicate their confidence in the yes/no decision on a labeled 5-point confidence scale: just guessing, slightly confident, moderately confident, highly confident, and absolutely confident, and to write down any reasons that they had for picking or not picking a particular face. Finally, the experimenter told the subjects that they would be seeing 80 faces and that they had seen half of them before.

The same viewing conditions were used in the test phase as in the study phase with the exception that each slide was projected for twenty seconds to give subjects time to answer all three questions. In addition, the experimenter told the subjects after every ten slides which row number they were to be filling out for that slide. Orders of slide presentation were re-randomized in both the study and the test phases for each session.

Results and Discussion

Duration and Retention Interval Effects. Of initial interest are the effects of duration of exposure and retention interval on overall accuracy (measured both as d' and total percent correct) and mean confidence (over all eighty test slides).[8] Separate analyses of variance of each measure yielded two main effects and no significant interactions for all three measures: for the duration effect on d', F(1, 187) = 43.75, p<.0001, on an arcsin transform of proportion correct, F(1, 187) = 40.90, p<.0001, and on confidence, F(1, 187) = 17.28, p<.001; for the retention interval effect on d', F(3, 187) = 6.18, p<.0005, on arcsin transformed proportion correct, F(3, 187) = 4.86, p<.01 and on confidence F(3, 187) = 5.98, p<.001; the mean square error for d' was .339, for arcsin proportion correct it was .014, and for confidence it was .362. Table 1 shows the means for confidence, d', and total percent correct as a function of condition. Not surprisingly, accuracy and confidence increased with greater exposure time and decreased with lengthening retention interval. Finally, an analysis of the mean proportion of "yes" responses (overall mean = .41) suggested that the subjects’ placements of their "yes/no" criteria, i.e., response bias, did not contribute to these accuracy results because learning and memory conditions had no effect on this measure (all Fs <1). Clearly, these results provide the necessary starting conditions to examine the effect of strength of memory on the relationship between confidence and accuracy.

 

Table 1

Effect of Duration of Exposure and Length of Retention Interval on Mean Accuracy (Measured as d' and Total Percent Correct) and on Mean Confidence^a

| Retention Interval | d' (2 sec.) | d' (12 sec.) | Percent Correct (2 sec.) | Percent Correct (12 sec.) | Mean Confidence (2 sec.) | Mean Confidence (12 sec.) |
|---|---|---|---|---|---|---|
| 1 hr | 1.320 | 2.006 | 72.0 | 80.5 | 3.25 | 3.65 |
| 24 hrs | 1.056 | 1.753 | 68.1 | 78.4 | 3.12 | 3.57 |
| 168 hrs | 1.062 | 1.459 | 68.0 | 73.7 | 3.08 | 3.33 |
| 336 hrs | .982 | 1.410 | 67.0 | 73.4 | 2.80 | 3.12 |

^a Six subjects made no false alarm responses. Because the standard score of a probability of 0 is not defined, we estimated these subjects' d' values by substituting a false alarm rate of .01.
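The d' computation, including the substitution of a false alarm rate of .01 for subjects with no false alarms, can be sketched as follows. The function name and the example counts are hypothetical; the sketch applies the substitution only to a false alarm rate of exactly 0 and does not handle a hit rate of 0 or 1, for which the standard score is likewise undefined.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections, fa_floor=0.01):
    """d' = z(hit rate) - z(false alarm rate) from one subject's response counts.

    A false alarm rate of 0 is replaced by fa_floor (.01, as in the table note),
    because the standard score of a probability of 0 is not defined.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    if fa_rate == 0.0:
        fa_rate = fa_floor
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Hypothetical counts for 40 seen and 40 unseen test faces.
    print(round(d_prime(30, 10, 10, 30), 3))
    print(round(d_prime(30, 10, 0, 40), 3))   # no false alarms: .01 substituted
```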

 

Signal Detection. Evidence that the accuracy and confidence data from this experiment were relatively consistent with underlying assumptions of signal detection theory comes from three different analyses of the confidence and accuracy data. The first attempts to fit normalized ROC curves to the “yes/no” and confidence data (summed over subjects) using procedures that have been described as the “rating method” by Macmillan & Creelman (1991). The signal detection model predicts that these normalized ROC curves should be linear. The results of these analyses for the retention interval effect are presented in Figure 2.[9] As expected from signal detection analysis, all four curves are well described by linear functions. In the worst case, r² for the linear fit was .996 and in the best case it was .998.[10] In addition, the fact that the slopes of the normalized curves were all significantly less than 1.0 (between .777 and .844) suggests that the variance in the strength of evidence values for the seen items was between 1.18 and 1.28 times larger than that for the unseen items.[11]


Figure 2. ROC curves resulting from applying the rating method to data from four different retention intervals. The rating method assumes that each level of confidence is a cutpoint on the evidence dimension. Therefore, the normalized cumulative proportion of responses made at each confidence level (starting at "yes, absolutely confident" and continuing to "no, absolutely confident") to the not seen items, z(Prob (FA|Not Seen)), is compared to the equivalent proportion of responses made to the seen items, z(Prob (H|Seen)). If confidence ratings are fixed criteria on the evidence dimension and the underlying distributions are normal, then these ROC curves should be linear.
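The rating method can be sketched as follows: response counts in the ten ordered categories (from "yes, absolutely confident" to "no, absolutely confident") are cumulated separately for seen and unseen items, converted to standard scores, and fit with a straight line. The function name is hypothetical and the least-squares fit is an illustrative choice, not necessarily the fitting procedure used for Figure 2. Under a normal model with unequal variances, the slope of the normalized curve equals the ratio of the unseen to the seen standard deviation.

```python
from statistics import NormalDist


def zroc_fit(seen_counts, unseen_counts):
    """Fit z(hit) = intercept + slope * z(false alarm) by least squares.

    Counts are ordered from "yes, absolutely confident" to
    "no, absolutely confident". The last category is dropped because its
    cumulative proportion is 1, whose standard score is undefined.
    Cumulative proportions of exactly 0 are not handled in this sketch.
    """
    z = NormalDist().inv_cdf
    n_seen, n_unseen = sum(seen_counts), sum(unseen_counts)
    z_hits, z_fas = [], []
    cum_seen = cum_unseen = 0.0
    for s, u in zip(seen_counts[:-1], unseen_counts[:-1]):
        cum_seen += s
        cum_unseen += u
        z_hits.append(z(cum_seen / n_seen))
        z_fas.append(z(cum_unseen / n_unseen))
    n = len(z_fas)
    mx, my = sum(z_fas) / n, sum(z_hits) / n
    sxx = sum((x - mx) ** 2 for x in z_fas)
    syy = sum((y - my) ** 2 for y in z_hits)
    sxy = sum((x - mx) * (y - my) for x, y in zip(z_fas, z_hits))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    seen = [310, 240, 200, 150, 100, 90, 110, 140, 150, 110]
    unseen = [40, 70, 110, 150, 130, 140, 200, 260, 280, 220]
    slope, intercept, r_squared = zroc_fit(seen, unseen)
    print(f"slope={slope:.3f} intercept={intercept:.3f} r^2={r_squared:.3f}")
```

A slope below 1.0 in such a fit indicates that the seen distribution is more variable than the unseen distribution.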

The second analysis was based on the expectation that subjects should express greater confidence in responses to stimuli whose evidence values are further away from the “yes/no” decision criterion. Examination of panel b in Figure 1 provides the intuition for this prediction. Hits and correct rejections (CRs) should, on average, be associated with more extreme strengths of evidence than should false alarms (FAs) and misses. Table 2 presents the mean confidence ratings for each response type as a function of both duration of exposure and length of retention interval. Examination of this table and a mixed 2 x 4 x (2 x 2 repeated measures) analysis of variance indicate that this prediction was confirmed. Mean confidence for correct responses was much higher than for incorrect responses (F(1, 181) = 648.61, p<.0001). It is also important to note that the size of this response-accuracy effect was significantly affected by both the duration of exposure (F(1, 181) = 20.45, p<.0001) and the length of the retention interval (F(3, 181) = 9.88, p<.0001), but the three-way interaction was not significant. In particular, as the optimality of the learning and memory conditions increased, the difference between confidence in correct compared to incorrect responses increased, exactly as a signal detection analysis predicts. There was also a significant main effect of whether the responses were to seen as opposed to unseen slides. Confidence in responses to seen slides was higher than confidence in responses to unseen slides (F(1, 181) = 34.54, p<.001). If this effect were due to the larger variance of the seen as opposed to the unseen distribution (variance differences that were suggested by the ROC curve analyses), we might not expect the size of this difference to change with the optimality of the learning and test conditions.
Somewhat unexpectedly, although the size of this difference was, indeed, not directly affected by duration of exposure nor by the length of the retention interval, the two manipulations did interact to produce an inexplicable but significant three-way interaction (F(3, 181) = 4.37, p<.01). Examination of the means in Table 2 suggests that this was a consequence of subjects in the most optimal condition, namely, long duration of exposure and one hour retention interval, and subjects in the short duration, one hour retention interval condition expressing atypically low confidence in their miss and false alarm responses, respectively. Finally, there were no significant differences in the amount of confidence that subjects expressed in their “yes” as opposed to “no” responses.

Table 2

Effect of Duration of Exposure and Length of Retention Interval on Mean Confidence for Hit, False Alarm (FA), Miss, and Correct Rejection (CR) Recognition Response Categories^a

Column labels give the duration of exposure followed by the retention interval in hours.

| Recognition Response | Three Seconds, 1 | Three Seconds, 24 | Three Seconds, 168 | Three Seconds, 336 | Eleven Seconds, 1 | Eleven Seconds, 24 | Eleven Seconds, 168 | Eleven Seconds, 336 |
|---|---|---|---|---|---|---|---|---|
| Hit | 3.502 | 3.350 | 3.153 | 2.947 | 3.913 | 3.873 | 3.618 | 3.213 |
| FA | 2.446 | 2.658 | 2.585 | 2.377 | 2.829 | 2.711 | 2.617 | 2.577 |
| Miss | 2.899 | 2.718 | 2.671 | 2.614 | 2.679 | 2.911 | 2.868 | 2.785 |
| CR | 3.340 | 3.229 | 3.151 | 2.869 | 3.671 | 3.589 | 3.379 | 3.267 |

^a These means were computed by averaging each subject’s confidence estimates for each of the four response types. These averages then served as the raw values from which the means in this table were computed.

The third source of evidence regarding signal detection comes from a relationship between confidence and accuracy that has been known for over 100 years. Peirce & Jastrow (1884) empirically determined that confidence in comparative judgments of weights was related to the accuracy of those judgments by the following formula:

m = c * ln[p/(1-p)]

where m is the mean signed confidence, c is a constant that depends on the confidence scale, and p is the probability of a correct response to a particular difference in weights. Thus, as the difference in weights between a comparison and standard increased, mean signed confidence, as well as the natural log of the relative probability of correct choices, increased according to the above linear equation. By assuming that the not seen and seen distributions of strength of evidence in a signal detection model of recognition memory play similar roles to the distribution of subjective weights in Peirce and Jastrow, we might expect to find a similar result for probability of correct recognition responses and mean confidence across different learning and memory conditions.

It is of no small interest that the signal detection model predicts virtually the same relationship when the "yes/no" response criterion is placed midway between the two distributions. In fact, when there is no response bias and the strength distributions are logistic rather than normal, d' is proportional to the quantity ln p/(1-p) (Noreen, 1981). Thus, the Peirce and Jastrow formula is equivalent to arguing that the subjects generate “yes” and “no” responses with little or no bias, that d' is not so large that the tails of the distributions play a large role, and that the seen and unseen evidence distributions have equal variance. However, unlike the previous analyses in which confidence levels were coded such that "just guessing" = 1, "slightly confident" = 2, and so on, Peirce & Jastrow (1884) coded confidence to take account of whether the "yes/no" response was correct by giving negative confidence values to incorrect responses. Applying their procedure, correct responses were coded 0 (just guessing) to 4 (absolutely confident) and incorrect responses were coded 0 (just guessing) to -4 (absolutely confident).
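The proportionality between the log odds of a correct response and the separation of the distributions in the logistic case can be made explicit with a short derivation. The following is a sketch under assumed notation that does not appear in the original text: Δ is the distance between the means of the unseen and seen distributions, s is their common logistic scale parameter, F is the standard logistic distribution function, and the criterion sits at Δ/2.

```latex
% Assumed notation (not from the original text): \Delta = separation of the means,
% s = common logistic scale, F = standard logistic CDF, unbiased criterion at \Delta/2.
\begin{align*}
F(x) &= \frac{1}{1 + e^{-x}} \\
P(\text{correct rejection}) &= F\!\left(\frac{\Delta/2 - 0}{s}\right) = F\!\left(\frac{\Delta}{2s}\right) \\
P(\text{hit}) &= 1 - F\!\left(\frac{\Delta/2 - \Delta}{s}\right) = 1 - F\!\left(-\frac{\Delta}{2s}\right) = F\!\left(\frac{\Delta}{2s}\right) \\
p = F\!\left(\frac{\Delta}{2s}\right) \quad &\Longrightarrow \quad \ln\frac{p}{1-p} = \frac{\Delta}{2s}
\end{align*}
```

Under these assumptions the log odds of a correct response grow linearly with the separation of the two distributions, which is the sense in which ln p/(1-p) tracks d' when there is no bias and the variances are equal.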

When we computed each subject's individual signed confidence mean in this fashion as well as each subject's ln p/(1-p) values, the results were remarkably, though not perfectly, consistent with the Peirce & Jastrow model. Figure 3 contains a scatterplot of these results. As can be seen, despite the fact that the variation in both the x and y values in this plot is a mixture of individual and condition differences, the data were well fit by a linear function (r2 = .724, F(1, 193) = 507.2, p<.0001). In addition, when the condition effects on both measures were removed by computing the residuals from 2 (duration) x 4 (retention interval) analyses of variance of each measure, there was only a small reduction in the quality of the linear fit to the residuals (r2 = .64, F(1, 193) = 344.28, p<.0001). Furthermore, when ln p/(1-p) was added as a covariate to a 2 x 4 analysis of variance of mean signed confidence, the main effects of learning and memory conditions did not disappear (F(1, 186) = 10.65, p<.001 for duration and F(3, 186) = 8.86, p<.001 for retention interval, after removing the effect of ln p/(1-p)). Taken together, these results suggest that the pattern of data in Figure 3 is the result of both condition-produced and individual difference variation in memory.
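The two per-subject quantities plotted against each other here can be computed as in the following minimal sketch. The function names and the input layout (one list of correct/incorrect flags and one list of 0 to 4 confidence ratings per subject) are assumptions made for illustration and are not taken from the original analysis.

```python
import math

def signed_confidence_mean(correct, confidence):
    """Mean signed confidence for one subject.

    correct:    list of booleans, True if the yes/no response was correct
    confidence: list of ratings coded 0 (just guessing) to 4 (absolutely confident)
    Ratings for incorrect responses are given a negative sign.
    """
    signed = [c if ok else -c for ok, c in zip(correct, confidence)]
    return sum(signed) / len(signed)

def log_odds_correct(correct):
    """Natural log of p/(1-p), where p is the subject's proportion correct."""
    p = sum(correct) / len(correct)
    if p in (0.0, 1.0):
        # The log odds are undefined at the extremes; how such subjects were
        # handled in the original analysis is not stated, so this sketch stops.
        raise ValueError("log odds undefined when p is 0 or 1")
    return math.log(p / (1 - p))

# Hypothetical subject with four responses (illustrative values only)
correct = [True, True, False, True]
confidence = [4, 2, 1, 0]
m = signed_confidence_mean(correct, confidence)   # (4 + 2 - 1 + 0) / 4
x = log_odds_correct(correct)                     # ln(0.75 / 0.25)
print(m, x)
```

Plotting m against x over subjects and fitting a straight line gives the kind of scatterplot described for Figure 3.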

The Peirce & Jastrow model also predicts that the intercept of the best fitting linear function should be equal to zero, and although close at .22, a t-test using the standard error of the estimate (.05) indicated that our obtained intercept was significantly different from zero (t = 4.21, p<.0001). On the other hand, it can be shown that this small deviation from the original model would be expected if, as previously discovered, both the standard deviation of the seen items was larger than that for the not seen items and the subjects were biased to say "no."


Figure 3. Scatter-plot of data for all 195 subjects from Experiment 1 showing the relationship between each subject's natural log of the ratio of total proportion correct (p) to (1 - p) and each subject's mean signed confidence. The relationship should be linear, with intercept 0, if the Peirce and Jastrow assumptions of equal variance and no response bias are correct.

Confidence and Accuracy Correlations. Given that both duration of exposure and retention interval produced highly reliable effects on the accuracy of face recognition memory and on confidence and given that these results were well within the boundaries of what might be expected from signal detection theory, we next analyzed the results from three correlation measures of the relationship between confidence and accuracy. Covariation over individuals of measures averaged over items (individual difference rs), covariation over items (faces) of measures averaged over individuals (face-based rs), and average covariation over items within each individual (response-based rs) were computed for accuracy and for confidence. Table 3 presents the results of the three differently computed correlations between confidence and accuracy for each learning and memory condition. Of initial interest is the fact that individual difference correlations between each subject’s mean confidence (coded in the more typical fashion as 1 through 5 and without regard to the accuracy of the "yes/no" response) and total proportion correct scores were within the range of results reported in previous studies. Although the overall correlation was highly significant (p<.0001), its absolute value was not particularly high (r (194) =.38). In addition, as shown in Table 3, only two out of the eight within condition individual difference correlations (with ns between 22 and 25) were significant.
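The three ways of computing a confidence-accuracy correlation differ only in what is averaged before correlating. The sketch below illustrates the distinction, assuming a subjects-by-faces layout in which acc[s][i] is 1 or 0 and conf[s][i] is the rating subject s gave to face i. The function names and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns NaN when either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def mean(values):
    return sum(values) / len(values)

def three_correlations(acc, conf):
    """Return (individual difference r, face-based r, mean response-based r)."""
    n_items = len(acc[0])
    # Individual difference: average over items first, correlate over subjects.
    individual = pearson([mean(row) for row in acc], [mean(row) for row in conf])
    # Face-based: average over subjects first, correlate over items.
    face = pearson([mean([row[i] for row in acc]) for i in range(n_items)],
                   [mean([row[i] for row in conf]) for i in range(n_items)])
    # Response-based: correlate over items within each subject, then average
    # (subjects with no variance in accuracy or confidence are skipped here).
    within = [pearson(a, c) for a, c in zip(acc, conf)]
    within = [r for r in within if not math.isnan(r)]
    response = mean(within) if within else float("nan")
    return individual, face, response

# Toy data: 3 subjects x 4 faces (illustrative values only)
acc = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0]]
conf = [[5, 4, 2, 4], [4, 2, 1, 5], [3, 3, 4, 1]]
print(three_correlations(acc, conf))
```

Because the same responses enter all three computations, any difference among the resulting values reflects the unit of analysis rather than a difference in the underlying data.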

These results are consistent with the claim that a person’s average confidence does not appear to be a particularly good predictor of his/her overall accuracy, except possibly in highly optimal conditions (the eleven second-one hour retention interval condition). Furthermore, when the effects of the different learning and memory conditions were removed by examining the correlation between the residuals from the 2 x 4 analyses of variance of mean confidence and of proportion correct that were reported earlier, the resulting r was reduced to .241 (F(1, 193) = 11.91, p<.001), a value in support of the usual claim by eyewitness experts that the relationship between accuracy and confidence is weak. In other words, at the level of individual difference mean confidence-accuracy correlations, the results from this experiment seem similar to those previously reported. People who are generally confident in their memories are just slightly more likely to have high accuracy scores.

 

Table 3

Effect of Duration of Exposure and Length of Retention Interval on the Size of Three Types of Confidence-Accuracy^a Correlations Based on Data from the Between Subjects Experiment

                      Type of Confidence-Accuracy Correlation
                      Individual Difference    Response-Based       Face-Based
                      Exposure Duration        Exposure Duration    Exposure Duration
Retention Interval    3 sec.    11 sec.        3 sec.    11 sec.    3 sec.    11 sec.
1 hr                  .20       .63*           .26*      .33*       .49**     .67**
24 hrs                .03       .28            .27*      .28*       .34**     .58**
168 hrs               .52*      .17            .22*      .28*       .50**     .52**
336 hrs               .04       .10            .16*      .18*       .36**     .39**

^a Accuracy was measured as total percent correct over 80 items for each subject in the individual difference correlations, as the percent of all subjects correctly responding to a particular face in the face-based correlations, and as one (correct) and zero (incorrect) for the response-based correlations. Mean confidence in all 80 faces was used for the individual difference correlations, mean confidence over all subjects for a given face was used in the face-based correlations, and the actual confidence rating for each item was used for the response-based correlations. When the one outlier face was removed from the face-based correlations, the lowest correlation increased to .42 and the highest to .69.

*p<.05

**p<.005

Though only sometimes reported in previous studies, we also computed confidence-accuracy correlations over stimulus items within each subject. That is, each subject's "yes" and "no" responses were coded 1 if correct and 0 if incorrect. These scores were then correlated with the associated confidence ratings for each response over the 80 test slides. Despite the fact that the overall mean correlation across all subjects and conditions was .24, a value even smaller than the equivalent individual difference correlation (but still different from zero, p<.0001), only 2 out of the 195 subjects had correlations that were negative! More importantly, when the condition effects on the means of these within-subject response-based correlations were tested with an analysis of variance of z-transformations of the correlations, both the main effect of exposure duration and of retention interval were significant (F(1, 187) = 7.183, p<.01 and F(3, 187) = 10.54, p<.0001) but the interaction between them was not (F<1). As can be seen in Table 3, the pattern of the response-based correlations was consistent with the general idea of the optimality hypothesis, namely, the correlations tended to decrease as the optimality of the learning and memory conditions, and therefore accuracy, decreased.
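The z-transformation applied to the within-subject correlations before the analysis of variance is presumably Fisher's r-to-z transformation, although the text does not name it. A minimal sketch under that assumption follows, with hypothetical function names.

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation: z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)

def inverse_fisher_z(z):
    """Back-transform a z value to the correlation scale."""
    return math.tanh(z)

def mean_correlation_via_z(rs):
    """Average correlations on the z scale, then back-transform the mean."""
    zs = [fisher_z(r) for r in rs]
    return inverse_fisher_z(sum(zs) / len(zs))

# Illustrative within-subject correlations (not data from the experiment)
print(mean_correlation_via_z([0.10, 0.24, 0.35]))
```

The transformation makes the sampling distribution of a correlation more nearly normal, which is why it is commonly applied before correlations are used as the dependent measure in an analysis of variance.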

To determine whether the size of confidence-accuracy correlations computed over individuals (and responses within individuals) were representative of those that might be obtained when variation in performance was produced by differences between faces, we computed confidence-accuracy correlations between the percent of subjects correctly identifying each face and the mean confidence they expressed in their responses to the same faces. When the results for all 195 subjects were used to compute accuracy and confidence estimates for each face, the correlation was larger than previous examples (r = .545) and highly significant (t(78) = 5.74, p < .00001).[12] In addition, when the correlations were computed within each learning and test condition, every correlation was highly significant suggesting that faces to which most people correctly respond are the same faces in which people tend to have high confidence. Finally, with one minor exception, just as the signal detection description of the optimality hypothesis predicts, the correlations decreased as the learning conditions worsened.[13]

At one level the correlation results in Table 3 suggest that one's conclusion about the strength of the confidence-accuracy relationship may depend on the exact method used to compute the correlation between these two measures. Individual differences in accuracy do not appear to be strongly associated with individual differences in confidence. But faces to which people tend to respond correctly do appear to be those faces in which they have more confidence in their recognition ability. On the other hand, the within-subject correlations between confidence in and accuracy of recognition responses tended, on average, to be small. Despite these apparent inconsistencies, the pattern of the correlations seemed to conform to the optimality hypothesis because the shorter retention interval and longer duration of exposure conditions tended to produce higher correlations, regardless of the measure.

Monte Carlo Signal Detection Simulations of Different Confidence-Accuracy Measures. If the signal detection model provides as reasonable a description of face-recognition memory as was suggested by the empirical results presented in an earlier section of this paper, then we might expect it to account for measures of the strength of the relationship between confidence and accuracy, as well. Although all of the empirical results presented thus far seem generally consistent with a signal detection interpretation of the optimality hypothesis, exact predictions of how different correlations between confidence and accuracy should change as parameters of the signal detection model, e.g., d', change are not readily available. To both examine this issue and to test the usefulness of the signal detection approach, we conducted a series of Monte-Carlo computer simulations of recognition and confidence responses based on the signal detection model. These simulations were used to examine the effects of a variety of signal detection parameters on different measures of the strength of the relationship between confidence and accuracy. In particular, the effects on the confidence-accuracy relationship of: 1) expanding and contracting the placement of the confidence cutpoints, 2) the size of the variance differences between the signal and the noise distributions, 3) variability over individuals in d' and in confidence cutpoint placements, and 4) different ways of adjusting the confidence criteria (lock-step v. stretching) as d' changed were all examined.

Simulation Methodology. Separate simulations were run for each of a series of programmed d' values: 0, .25, .5, 1, 1.5, 2, 2.5, and 3. These were designed to simulate the different mean d' levels that might occur as a result of differently optimal learning and memory conditions. The simulations took into account the facts that not all subjects have the same d' in a given learning and memory condition nor are they likely to have the same signal to noise variance ratio. Therefore, individual d' values were drawn from normal distributions of d' scores with a programmed mean set to simulate a given learning and memory condition and with a standard deviation (of d's over individuals) set between one fourth and one third of the mean. The standard deviation of each subject's signal distribution (sds) was drawn from a second normal distribution with a mean of 1.25 (a value close to that empirically determined from the ROC curves) and standard deviation between one fourth and one third of 1.25.
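The sampling of subject-level parameters described above can be sketched as follows. This is an illustration under stated assumptions, not the original simulation code: the function name and the sd_fraction argument are hypothetical, and redrawing a non-positive standard deviation is a safeguard added here that the text does not mention.

```python
import random

def draw_subject_parameters(mean_d_prime, sd_fraction=0.25, mean_sds=1.25, rng=None):
    """Draw one simulated subject's d' and signal-distribution SD (sds).

    mean_d_prime: programmed d' for the learning and memory condition
    sd_fraction:  SD over subjects as a fraction of the mean (1/4 to 1/3 in the text)
    mean_sds:     mean of the distribution of signal SDs (1.25 in the text)
    """
    if rng is None:
        rng = random.Random()
    # Individual d' values are left exactly as drawn, including any negative ones.
    d_prime = rng.gauss(mean_d_prime, sd_fraction * mean_d_prime)
    sds = rng.gauss(mean_sds, sd_fraction * mean_sds)
    while sds <= 0:  # assumption: a standard deviation must be positive, so redraw
        sds = rng.gauss(mean_sds, sd_fraction * mean_sds)
    return d_prime, sds

# 200 simulated subjects at a programmed d' of 1.0
rng = random.Random(0)
subjects = [draw_subject_parameters(1.0, rng=rng) for _ in range(200)]
print(subjects[:3])
```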

A given subject's signal and noise distributions were created for a particular programmed d' level by selecting a d' value and an sds value from their associated distributions. The mean of that subject's signal distribution was set d' units above the mean of the noise distribution, with the latter always set to zero with unit variance. Thus, at a programmed d' value of 1.0, the first subject might have a d' value of .85 and an sds of 1.20 and the next might have a d' of 1.23 and an sds of 1.18, and so on. The placement of confidence criteria was also controlled by the simulation; however, the degree of variability in the placement of these cuts was controlled by a parameter, v. When v was set to zero, the placement of confidence criteria did not vary over subjects. But, as v increased, the variability, over subjects, in the placement of the different confidence cuts increased. The details of criteria placement were as follows. First, the "yes/no" criterion was randomly placed between 0 and the subject's d' with the amount of random variation over subjects controlled by a proportion of v. When v was equal to 0, the yes/no criterion was placed either midway between 0 and d' or so as to produce a slight "no" response bias in a manner consistent with the empirical results. Next, the no-5 and yes-5 confidence criteria were set to fall 3 standard deviation units to the left and right, respectively, of the yes/no criterion, plus or minus a random proportion of v. The remaining criteria, no-4 through yes-4, were distributed between these two extreme values in order. When v was zero, the spacing between them was either equal or designed to produce more middle-valued confidence estimates (3s and 4s) in a manner similar to the empirical results.
When v was greater than zero, the placement of each of the remaining criteria independently varied plus or minus a random proportion of v while still preserving their initial serial order (n4<n3<....<y3<y4) and, if present, a degree of confidence bias.

For reasons that will become clear later, we ran all of the simulations under two different assumptions about the effect that changes in d' had on the pattern of changes in confidence criteria. One was modeled on Figure 1 in which all of the confidence criteria are assumed to move in lock-step with .5d' (i.e., the optimal placement of the yes/no decision criterion). The second was modeled on a pattern similar to that depicted in Figure 4. This stretch model assumed that the "yes-absolutely confident" criterion was placed 3 standard deviation units (based on a noise distribution of unit variance and zero mean) above the mean of the noise distribution and that the "no-absolutely confident" criterion was placed 3 standard deviation units below the mean of the signal distribution. A consequence of this model was that the distance between the most extreme criteria decreased as d' increased because the signal distribution tended to drag the no-5 confidence criterion along with it while the yes-5 criterion remained where it was. Stated differently, the confidence criteria "stretched" as d' became smaller. In both models, the remaining criteria were placed in between the extremes in the manner already described.
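The difference between the two models is easiest to see in the simplest case: no individual variation (v = 0), no confidence bias, and equal spacing. The sketch below covers only that case. The function name is hypothetical, and it assumes nine ordered cutpoints (no-5 through no-2, the yes/no criterion, yes-2 through yes-5) for a five-level confidence scale, which is one reading of the description above.

```python
def confidence_criteria(d_prime, model="lock-step", n_levels=5):
    """Equally spaced cutpoints for the v = 0, unbiased case.

    Returns 2 * n_levels - 1 ordered cutpoints; the middle one is the yes/no
    criterion. Units are standard deviations of the noise distribution, which
    has mean 0 and unit variance.
    """
    if model == "lock-step":
        center = 0.5 * d_prime                 # yes/no criterion at .5 d'
        low, high = center - 3.0, center + 3.0
    elif model == "stretch":
        low = d_prime - 3.0                    # no-5: 3 SDs below the signal mean
        high = 3.0                             # yes-5: 3 SDs above the noise mean
    else:
        raise ValueError("model must be 'lock-step' or 'stretch'")
    n_cuts = 2 * n_levels - 1
    step = (high - low) / (n_cuts - 1)
    return [low + k * step for k in range(n_cuts)]

for d in (0.5, 1.0, 2.0):
    lock = confidence_criteria(d, "lock-step")
    stretch = confidence_criteria(d, "stretch")
    # Lock-step span stays at 6; stretch span is 6 - d', so it widens as d' shrinks.
    print(d, lock[-1] - lock[0], stretch[-1] - stretch[0])
```

With equal spacing the middle cutpoint falls at .5d' in both models, so in this simplified case the two differ only in how far the outer confidence criteria spread.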


Figure 4. Signal detection model of how the confidence criteria might change if subjects compensate for less-optimal learning conditions by "stretching" their confidence criteria further out on the evidence dimension rather than moving them in "lock-step" with the "yes/no" criterion.

Once the noise distribution, the signal distribution, and all of the cutpoints were in place for a given subject (i.e., when a situation similar to one depicted in Figure 1 or Figure 4 was arranged), 40 seen and 40 not seen trials were simulated for that subject. For each seen trial, an evidence value was randomly selected from that subject's signal distribution. If the value fell above the yes/no criterion it was coded as a hit. If it fell below, it was coded as a miss. Similarly, for a not seen trial, items falling above the yes/no criterion were coded as false alarms and those below as correct rejections. The confidence of each response was also coded according to the particular confidence bin into which the evidence value fell. When all 80 trials were completed for a given subject, various statistics were computed, such as percent correct, mean confidence, and item-based correlations.
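The trial-level procedure just described can be sketched as follows. This is a minimal illustration rather than the original program: the function name, the dictionary keys, and the example cutpoints (nine equally spaced values centered on a yes/no criterion of 0.5) are assumptions made for the example.

```python
import bisect
import random

def simulate_subject(d_prime, sd_signal, cuts, n_seen=40, n_unseen=40, rng=None):
    """Simulate yes/no responses and 1-5 confidence ratings for one subject.

    cuts: ordered cutpoints; the middle one is the yes/no criterion and the
          others separate confidence bins on each side of it.
    """
    if rng is None:
        rng = random.Random()
    middle = len(cuts) // 2                      # index of the yes/no criterion
    trials = []
    for seen, n in ((True, n_seen), (False, n_unseen)):
        for _ in range(n):
            # Seen items come from the signal distribution, unseen from noise (0, 1).
            x = rng.gauss(d_prime, sd_signal) if seen else rng.gauss(0.0, 1.0)
            bin_index = bisect.bisect_right(cuts, x)   # number of cutpoints at or below x
            said_yes = bin_index > middle
            # Confidence grows with distance (in bins) from the yes/no criterion.
            confidence = bin_index - middle if said_yes else middle + 1 - bin_index
            if seen:
                kind = "hit" if said_yes else "miss"
            else:
                kind = "false alarm" if said_yes else "correct rejection"
            trials.append({"seen": seen, "said_yes": said_yes, "type": kind,
                           "correct": seen == said_yes, "confidence": confidence})
    return trials

cuts = [-2.5, -1.75, -1.0, -0.25, 0.5, 1.25, 2.0, 2.75, 3.5]   # illustrative values
trials = simulate_subject(1.0, 1.25, cuts, rng=random.Random(0))
percent_correct = 100 * sum(t["correct"] for t in trials) / len(trials)
mean_confidence = sum(t["confidence"] for t in trials) / len(trials)
print(percent_correct, mean_confidence)
```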

Within-Subject Response-Based Correlations. We first examine the effect of d' on the mean within-subject, response-based, confidence-accuracy correlations. A total of 200 simulated subjects were run at each programmed d', first with v set to zero and then set to 1.5. Half of the runs assumed that the confidence cutpoints moved in lock step with the "yes/no" decision criterion and half assumed that they stretched. The means of the response-based correlations were computed for each group of 200 subjects and are presented in Figure 5 as a function of the obtained mean d' for each simulation run. The error bars represent the obtained (+/- 1) standard deviation of the mean correlations obtained at each d' level. All of the reported simulation runs assumed a small amount of confidence bias, i.e., middle confidence bins were larger than extreme ones. As can be seen, the mean simulated confidence-accuracy correlation increased with increasing d' in a highly regular fashion. Individual difference variation in confidence cutpoints had little effect on the mean rs (compare the circles and squares). At d' values below 1.5, the model, lock step v. stretch, had little effect on the size of the response-based correlations. However, only the correlations produced by the stretch model seemed to benefit from further increases in d' although this benefit was accompanied by increasing individual difference variance in the size of the obtained rs (seen in the solid error bars). Still, for both models, even at very high d' values (above 3), the average response-based confidence-accuracy correlation was less than .5. Equally important, the actual mean within-subject correlations for the various memory conditions in our experiment, indicated by the dark "x" data points connected in groups of four (over retention interval) for each duration of exposure fell well within the range of values produced by the simulations.


Figure 5. Results of Monte-Carlo simulations of the effect of d' on the size of average within-subject response-based confidence-accuracy correlations. The light colored lines represent the simulated results for four different simulation runs, two with the lock-step model (green points) and two with the stretch model (blue points). Within each model, one run assumed no individual difference variance in criteria placement (square points), the other assumed large individual difference variance (circles). Standard deviations in the correlations for each simulation run are shown with horizontal bars. The empirically obtained mean correlations and mean d's reported in Table 2 for each experimental condition are indicated by black Xs.

If d' actually does drive the size of the within-subject confidence-accuracy correlations as these simulation results suggest, then one might expect that measures related to signal detection would account for the learning and memory produced effects on the size of the empirically obtained correlations that were already reported. With this in mind, when we re-examined the effect of learning and memory conditions on the actual within-subject response-based correlations from Experiment 1 in light of measures relevant to signal detection, e.g., each subject's mean signed confidence and d', we discovered that almost all of the effect of the conditions on the size of these correlations could be explained by these two measures. In particular, an analysis of covariance indicated that the two measures accounted (with one exception to which we shall return later) for virtually all of the explainable variance in the response-based correlations (F(1, 185) = 60.11, p<.0001 for signed confidence, F(1, 185) = 41.09, p<.0001 for d', F(1, 185) = .92 for the duration of exposure effect, F(3, 185) = 3.60, p=.01 for the retention interval effect, but with all of this effect being due to the fact that the two week retention interval produced significantly smaller correlations than the other three intervals, a contrast testing this pattern accounted for the entire effect, and F(3, 185) = 1.23 for the condition interaction). These results are consistent with the claim that signal detection may provide a more reasonable framework for understanding the relationship between confidence and accuracy than does the optimality hypothesis combined with the size of Pearson correlations.

Individual Difference Correlations. We next examined how confidence-accuracy correlations based on individual differences in average accuracy and confidence behave as d' changes. To examine this issue we compared the results of a number of different sets of simulation runs that differed in terms of v (0 and 1.5), the amount of individual difference variance in d' (set at either a fourth or a third of d'), degree of bias in confidence, "yes/no" decision bias, and the model of how confidence changes as d' changes (lock-step versus stretch). Each individual difference correlation was computed from data for 25 subjects at each programmed d' value. The mean individual difference r was computed from 50 such correlations, as was the standard deviation of the rs.

Although bias in both confidence and decision criteria and the amount of individual difference variance in d' all affected the size of individual difference rs, their effects were small compared to those of model type and individual difference variance in confidence. For this reason, Figure 6 shows the results of a set of these runs selected to highlight the effects of model type and individual difference variance in confidence (for square data points v=0 and for circles v=1.5). In an attempt to remain close to the empirically obtained results, all of the presented simulation data assumed a moderate amount of confidence bias, a standard deviation over subjects of d' equal to one-third the programmed d', and a slight decision bias favoring "no" responses. As can be seen, the results from these runs present quite a different pattern than those for the item-based correlations. Of interest is the interaction between model type and extent of individual difference variation in confidence (v). Although the size of individual difference variation in the placement of confidence criteria had virtually no effect on the size of response-based rs, it had a major effect on the size of the individual difference rs, especially for the lock-step model. When the simulation assumed no variation in confidence placements (the square data points), d' drove the size of the correlations in both models, although the lock-step model consistently produced lower individual difference rs than the stretch model. When the more empirically reasonable assumption of individual difference variance in placement of confidence cutpoints was programmed (the circle data points), mean rs at all d' values were reduced; however, the reduction was much greater for the lock-step than for the stretch model. In fact, the reduction was so big for the lock-step model that increases in d' had little effect on its mean rs, except at the highest d' values.
In addition, the variance in the size of these correlations was considerable in all cases, with many negative individual difference rs occurring when d' was below 1. Finally, the empirically obtained individual difference correlations generally fell within +/- 1 standard deviation of the mean rs for the stretch model that assumed v=1.5 (the filled circles), although clearly there was wide variation in these empirical correlations (probably due to the small ns in our simulated experiments).

Several features of these simulation results are of interest. The first is the generally small size of the correlations between confidence and accuracy. Although the results from these simulations are consistent with the idea that correlation measures of the confidence-accuracy relationship generally result in small values, even with large d's, the reasons for the small correlations need to be examined. Consider the within-subject response-based correlations between confidence and accuracy first. They have an obvious constraint virtually assuring that most correlations will tend to be small, but significant. Coding responses "0" if incorrect and "1" if correct and regressing these raw scores against confidence ensures that the best-fitting linear functions will never be able to provide a perfect fit of the data. Because all of the y-values in the regression will be either zero or one, the best a straight line will be able to do is "split the difference" at each confidence level adjusted according to the relative number of ones versus zeros. In short, the fact that within-subject response-based correlations between confidence and accuracy are small may tell us less about the nature and strength of the relationship between these two measures and more about the fact that predicted accuracy will always be somewhere between the actual values of 1 and 0.
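The ceiling that binary accuracy scores impose on a response-based correlation can be illustrated with a small constructed example. The numbers below are hypothetical and chosen only for illustration: 100 responses at each of five confidence levels, with the proportion correct rising in perfectly even steps from .5 to .9.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def calibrated_responses(prop_correct, n_per_level=100):
    """Build confidence ratings and 0/1 accuracy scores in which the proportion
    correct at confidence level k (1, 2, ...) equals prop_correct[k - 1]."""
    conf, acc = [], []
    for level, p in enumerate(prop_correct, start=1):
        n_correct = round(p * n_per_level)
        conf += [level] * n_per_level
        acc += [1] * n_correct + [0] * (n_per_level - n_correct)
    return conf, acc

conf, acc = calibrated_responses([0.5, 0.6, 0.7, 0.8, 0.9])
r = pearson(conf, acc)
print(round(r, 3))  # 0.309
```

Even though accuracy in this constructed data set is an exactly linear function of confidence, the correlation over individual responses is only about .31, because each response can take only the value 0 or 1 and therefore lies well away from the regression line.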

Next consider the size of individual difference correlations between average confidence and percent correct. Although the models depicted in Figures 1 and 4 both suggest that data produced by operators with higher d's will yield higher confidence and higher accuracy scores, as depicted both models assume that the placement of the confidence cuts is determined only by a subject's d'. It seems more likely that different subjects with identical d's are likely to place their confidence cuts in different locations (just as the simulations assumed when v>0). Subjects with widely spread confidence criteria will tend to produce lower mean confidence estimates while those with compressed criteria will have higher mean confidence scores even if their d's are identical. Unless the degree of spread in confidence cuts over subjects is related to d' (as the stretch model assumes is generally the case), the resulting individual difference correlations will tend to be small. In short, when researchers use the absolute size of within subject response-based or individual difference correlations to measure the "strength" of the relationship between confidence and accuracy, they virtually guarantee that they will conclude that the relationship between confidence and accuracy is weak. Fortunately, there are other measures of the strength of the relationship.


Figure 6. Results of Monte-Carlo simulations of the effect of d' on the size of average individual difference confidence-accuracy correlations. The light lines represent the simulated results for four different simulation runs, two with the lock-step model (green points) and two with the stretch model (blue points). Within each model, one run assumed no individual difference variance in criteria placement (square points), the other assumed large individual difference variance (circles). Standard deviations in the correlations for each simulation run are shown with horizontal bars. The empirically obtained mean correlations and mean d's reported in Table 2 for each experimental condition are indicated by black Xs.

Face-Based Correlations. One is the face-based correlation, namely, the correlation over items between the percent of subjects who get an item correct and the mean confidence subjects expressed in their responses to that item. These measures eliminate individual difference variability in criteria placements by computing an average over all subjects and by using the same subjects for each item average. Thus, whatever the average criteria placements are for one item, they will tend to be the same for every other item because the same subjects produced the results for each item. In this way, the only factors driving the size of these correlations will be d' and the size of the item differences (because with no item variability, the correlations would have to be zero no matter how big d' was). This reasoning suggests that results from simulations in which v=0 are most relevant to face-based correlations because when v=0, the placement of the confidence cuts depends only on d' and the model. Looking again at Figure 6, when v=0, the correlations between average confidence and percent correct were higher at almost all d' values than when v>0. This result is consistent with the data reported in Table 3 in which the face-based correlations were generally higher than the other types of confidence-accuracy correlations. In fact, as d' varied from .98 to 2 across conditions, the face-based correlations varied from .36 to .67. As the results in Figure 6 show, the range of simulation results when v=0 was remarkably similar over the same range of d' values: .36 to .67 for the stretch model and .23 to .44 for the lock-step model.

Mean Confidence and Proportion Correct Over Conditions. Another method of measuring the relationship between confidence and accuracy is to examine how mean (averaged over subjects and faces) accuracy and mean confidence (coded 1 through 5) covary over the different learning and memory conditions. To examine this issue we conducted several additional simulation runs and then compared them to the empirical data. Figure 7 presents the results of these analyses. Looking first at the simulation results (represented by gray lines and data points in the Figure), data from four different simulation runs selected to emphasize the effects of v and model type are presented. Two of the runs (open data points) were based on the lock-step model and the other two (filled data points) were based on the stretch model. In addition, the extent of individual differences in confidence criteria placement (v=0 or 1.5) was crossed with these manipulations. The data points in each simulation run represent the mean confidence and mean percent correct for 200 subjects at each average d' value used in the previous simulations. As can be seen, although these manipulations of signal detection parameters have a large effect on the mean confidence that a given d' will produce, within a given set of signal detection assumptions, the relationship between percent correct and mean confidence is strong. In every case, as mean percent correct went up because the learning and memory conditions produced a higher d', mean confidence also went up. However, as might be expected, the slope of the relationship was somewhat stronger in the stretch model because it assumes that the confidence bands expand, thereby producing more low confidence responses, as d' gets smaller. Other simulation runs with different parameter values produced results consistent with intuition. 
In particular, confidence bias affected the overall mean confidence but not the relationship between confidence and accuracy; decision bias affected percent correct but not the relationship between the measures.
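
The qualitative pattern described above can be illustrated with a small Monte-Carlo sketch. It is not the simulation code used for Figure 7. It assumes equal-variance normal evidence distributions, an unbiased yes/no criterion at d'/2, no individual differences in cutpoints (v=0), the same d' for every simulated subject, and a fixed lock-step bin width of .375 chosen so that the two models coincide at d'=3. The stretch rule used here (bins evenly spaced between the criterion and a top cut 3 standard deviations above the noise mean, mirrored on the "no" side) is likewise a simplifying assumption.

```python
import random

def simulate(d_prime, model, n_subjects=200, n_items=100, seed=0):
    """Return (proportion correct, mean confidence) for one simulated condition."""
    rng = random.Random(seed)
    c = d_prime / 2.0                      # unbiased yes/no criterion
    if model == "lock-step":
        width = 0.375                      # fixed bin width (assumed value)
    else:                                  # "stretch"
        width = (3.0 - c) / 4.0            # top cut pinned 3 SD above the noise mean
    n_trials = n_subjects * n_items
    n_correct = 0
    conf_sum = 0
    for _ in range(n_trials):
        seen = rng.random() < 0.5          # half of the test items were studied
        x = rng.gauss(d_prime if seen else 0.0, 1.0)
        said_yes = x > c
        n_correct += (said_yes == seen)
        # Confidence 1-5 grows with distance from the yes/no criterion.
        conf_sum += min(5, int(abs(x - c) / width) + 1)
    return n_correct / n_trials, conf_sum / n_trials

d_values = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
results = {m: [simulate(d, m) for d in d_values] for m in ("lock-step", "stretch")}
for m, rows in results.items():
    for d, (pc, conf) in zip(d_values, rows):
        print(f"{m:9s} d'={d:.1f}  P(correct)={pc:.3f}  mean confidence={conf:.2f}")
```

Under these assumptions both models show mean confidence rising with proportion correct, and the stretch rule yields lower confidence at small d' because its bins are wider there.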


Figure 7. Relationship between mean total proportion correct and mean confidence as a function of d' from the same Monte-Carlo simulation runs used to compute the correlations in Figure 5. The empirically obtained mean correlations and mean d's reported in Table 2 for each experimental condition are indicated by black Xs. The dotted lines represent the 95% confidence interval for a linear fit of the data without the two-week retention interval data.

Looking next at the empirical results (represented by black "X"s in Figure 7), better learning and memory conditions produced both more accuracy and higher confidence, exactly as predicted by a signal detection analysis. In addition, a linear fit of the experimental data accounted for 87% of the variance and produced a confidence-accuracy correlation considerably higher than any of the earlier correlations (r(6) = .93, p<.001). However, one feature of the experimental data in Figure 7 seems inconsistent with the signal detection models presented above, namely, the fact that at the longest (two-week) retention interval, confidence dropped at a faster rate relative to the drop in accuracy. In particular, if the data points for the two-week retention interval are left out, a linear function fits the remaining data almost perfectly (r² = .995). More importantly, the data from the two-week retention interval are well beyond the 95% confidence bands derived from that linear fit (the dotted lines in Figure 7). That is, the effect on accuracy of increasing the retention interval from one to two weeks was relatively small compared to the effect of waiting 24 hours before being tested, but the effect on mean confidence of increasing the retention interval from one to two weeks was relatively large compared to waiting 24 hours before testing.[14]

These results have two important implications. First, even though both between- and within-subject confidence-accuracy correlations may be small in a given experiment, the relationship between confidence and accuracy over conditions may be much stronger. Second, the possibility that mean confidence changed more rapidly than mean accuracy as the retention interval increased beyond one week is consistent with the idea that the confidence cuts are not locked to the decision criteria as d' changes (unlike the assumption that Donaldson and Murdock, 1968, seemed to make). This idea may help explain the fact that the within-subject correlations in the two-week retention interval were not completely predicted by signal detection measures.

Metatheories of Memory. The reason that we introduced the stretch model can now be presented. One feature of the signal detection analysis that has been well studied concerns the effect that changes in learning and test conditions have on the placement of the yes/no decision criterion. For example, although signal detection allows the yes/no criterion to be controlled by factors other than strength of memory, e.g., payoffs for saying yes, in a typical recognition experiment in which the subjects are explicitly told that they have seen half of the items before (or one in which subjects might guess this to be true), the yes/no criterion should be placed midway between the means of the two strength of evidence distributions to maximize accuracy (Macmillan & Creelman, 1991). This implies that the placement of the yes/no criterion will shift downward on the evidence dimension as learning and memory conditions deteriorate (see panel b in Figure 1). What will happen to the confidence criteria as the yes/no criterion moves? Will they stay fixed or be dragged along? Since explicit payoffs for confidence estimates are rarely given in most face memory experiments (let alone in real world criminal investigations), signal detection does not provide an optimal placement strategy for confidence criteria.
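
The claim that the accuracy-maximizing criterion lies midway between the two distribution means, and therefore moves downward as d' falls, can be checked numerically. The sketch below assumes equal-variance normal distributions with the not-seen mean at 0 and the seen mean at d', and equal numbers of seen and not-seen test items; the function names and the grid search are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def proportion_correct(c, d_prime):
    """Accuracy with half seen, half unseen items and criterion c on the evidence axis."""
    hit_rate = 1.0 - STD_NORMAL.cdf(c - d_prime)   # seen items ~ N(d', 1)
    cr_rate = STD_NORMAL.cdf(c)                    # unseen items ~ N(0, 1)
    return 0.5 * (hit_rate + cr_rate)

def best_criterion(d_prime, step=0.01):
    """Grid search for the criterion that maximizes accuracy."""
    grid = [i * step for i in range(-300, 601)]
    return max(grid, key=lambda c: proportion_correct(c, d_prime))

for d in (2.0, 1.0):
    print(f"d'={d}: best criterion near {best_criterion(d):.2f}")
```

With d' = 2 the search settles at 1.0 and with d' = 1 at 0.5, that is, at d'/2 in both cases.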

One simple view of how confidence might change with learning and memory conditions, which we called the lock-step model, was depicted in panel b in Figure 1. In this model, we assumed that the confidence criteria move in lock step with any change that occurs in the yes/no decision criterion. Note that the sizes of the confidence bins do not change. This lock-step model also predicts that as accuracy worsens, overall confidence should decrease because fewer seen and unseen items will fall into the higher confidence bins. This model also makes the very strong prediction that whatever effect learning and memory conditions have on confidence (via the placement of the decision criteria), those effects should be completely explained by changes in d', regardless of how those d' differences were produced (e.g., by changes at encoding or during the retention interval). In sum, this simple lock-step view of the relationship between confidence and yes/no decision criteria cannot explain the fact that the two-week retention interval seemed to cause confidence to drop more rapidly than accuracy.

Another model of the way confidence might change as learning and memory conditions change assumes that everyday experience provides people with information about the conditions that tend to improve memories and the conditions that tend to weaken memories. If people do have theories about how their memories are affected by such variables as the length of a retention interval, they might use those theories to adjust their confidence criteria, rather than moving them in lock step with the yes/no decision criterion. For example, most people believe that memory fades with time (Loftus, 1979; Yarmey, 1979), but suppose they believe that memory fades at a faster rate than it actually does. In an attempt to maintain a given likelihood of being correct, subjects might adjust their confidence criteria to reflect those beliefs. That is, if subjects know it has been a long time since they have seen the study items, they might require an even greater strength of evidence before saying they are absolutely sure they have seen an item. In signal detection terms, their confidence criteria shift in a more conservative direction, as was depicted in Figure 4, rather than move in lock step with the yes/no criterion. By comparing the effect on the confidence criteria of moving from optimal to sub-optimal memory conditions depicted in panel b in Figure 1 with those depicted in Figure 4, we can see that the latter would predict a much greater reduction in confidence as accuracy and d' decreased. This conservative-shift model predicts many fewer high confidence "yes" responses both because d' is decreasing and because the subjects require even stronger evidence, at both extremes, before giving very high confidence estimates. Which learning and memory variables would have this selective effect on confidence might depend on the nature of the theories that subjects hold. 
If, for example, subjects' beliefs about the effects of duration of exposure were relatively accurate while those for retention interval were not, confidence shifts that are unexplainable by d' might be observed for the latter but not the former.

This metamemory conservative-shift model is very similar to the stretch model that we used in the simulations. The only difference is that the stretch simulations assumed that the yes-5 confidence cut was fixed at 3 standard deviations above the noise distribution while the conservative-shift view implies that subjects might move their yes-5 confidence cut in an even more conservative direction (farther away from the noise distribution) the weaker they believe their memory to be.

Calibration of Confidence Ratings to Accuracy. Before exploring the role that metatheories might play in the confidence-accuracy relationship, we present the results of a fifth measure of the strength of this relationship. Not only can this measure help distinguish between the lock-step and stretch/conservative-shift confidence models, it is also virtually equivalent to the calibration curves that are common in decision-making and judgment of learning research (Lichtenstein & Fischhoff, 1977; Nelson & Dunlosky, 1991). In the present context we would argue that confidence is similar to "odds of being correct" estimates and that absolute confidence is similar to estimates that are close to 1.0 odds of being correct while "just guessing" is similar to chance odds estimates (.5 in the present experimental procedure).

According to the signal detection model, the areas under the seen and not seen curves in each confidence bin determine the proportion of responses that will be correct at each confidence level. With this in mind, Figures 1 and 4 show that higher confidence responses should always be associated with a higher probability of correct responses except when d' approaches zero (assuming the distributions are symmetrical around their means and the response criterion is placed optimally). On the other hand, the exact placement of the confidence cutpoints, along with the size of d', will determine how well calibrated confidence is to accuracy; although, in general, the more spaced out the confidence criteria are, the better calibrated confidence will be. One way of measuring the degree to which confidence judgments are calibrated to the accuracy of the yes and no responses that also has considerable applied significance is to treat all responses as independent events. That is, even though different subjects may have different d's, different placements of criteria, and different seen to not seen distribution variance ratios, we can ask what the odds are that "yes" or "no" responses are correct given they were accompanied by a particular confidence estimate, regardless of who generated the responses.
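
The area argument can be made concrete with a short calculation. The sketch below assumes equal-variance normal distributions, an unbiased yes/no criterion at d'/2, and five equally wide "yes" confidence bins of width .5 with the top bin open-ended; these parameter values are illustrative and are not taken from the simulations reported here. For each bin it computes hits/(hits + false alarms) from the areas under the seen and not-seen curves.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def yes_calibration(d_prime, width=0.5):
    """P(correct | "yes" at confidence k) for k = 1..5, i.e. hits / (hits + false alarms)."""
    c = d_prime / 2.0                                  # unbiased yes/no criterion
    cuts = [c + k * width for k in range(5)]           # lower edges of the five "yes" bins
    curve = []
    for k, lo in enumerate(cuts):
        hi = cuts[k + 1] if k < 4 else None            # top bin is open-ended

        def area(mean):
            upper = 1.0 if hi is None else STD_NORMAL.cdf(hi - mean)
            return upper - STD_NORMAL.cdf(lo - mean)

        hits = area(d_prime)                           # area under the seen curve
        false_alarms = area(0.0)                       # area under the not-seen curve
        curve.append(hits / (hits + false_alarms))
    return curve

for d in (0.0, 0.25, 1.0, 2.0):
    print(d, [round(p, 3) for p in yes_calibration(d)])
```

Under these assumptions the proportion correct rises with confidence for every d' above zero and is flat at .5 when d' is exactly zero.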


Figure 8. "Yes" response calibration curves produced from the same four sets of Monte-Carlo simulation runs used throughout. Each line within the four graphs shows how, at each d' level, the probability of producing a correct "yes" response, hits/(hits+false alarms), changes as confidence in the response increases. The d' values for the curves within each graph were 0, .25, .5, 1, 1.5, 2, 2.5, and 3. In each graph, the curve with the lowest proportion correct had d's of 0 and that with the highest had d's of 3. In the bottom two graphs, black lines represent d' values closest to those obtained in Experiment 1.

Figure 8 shows the "yes" response results for this type of analysis from the same four sets of simulation runs used to produce the confidence-accuracy correlations presented in Figure 5. Data points connected by lines are from 200 "subjects" whose d's were selected from the same distribution. In the bottom two graphs, the black lines and data points show the calibration curves for d' values close to the range obtained in the experiment, namely, d'=1, 1.5, and 2. We highlight these only in the bottom two graphs because these make the more realistic assumption that there are individual differences in how subjects place their confidence cutpoints while the top two graphs make the less realistic assumption of no individual differences in this response. Looking at all of the calibration curves supports the intuition from our signal detection analysis that the degree of calibration of confidence to accuracy depends mostly on d'. The model type and the amount of individual difference variance in confidence play small, but interesting, roles compared to d'. Nevertheless, for applied reasons, it is important to note that the signal detection approach predicts that even very small d's will yield confidence-accuracy relationships in which higher confidence estimates are associated with higher probabilities of correct "yes" responses, even in circumstances in which correlational measures may be near zero and nonsignificant, e.g., when d'=.25.

 

Table 4

Number of "Yes" Responses (N) and Percent of Them Correct (%Hits) as a Function of Exposure Duration, Retention Interval, and Confidence in the Between-Subjects Experiment

                                Length of Retention Interval (in Hours)
Exposure   Response          1             24            168            336
Duration   Confidence      N  % Hits     N  % Hits     N  % Hits     N  % Hits
Three      Absolute      189   97.88   151   97.35   132   93.18    72   91.67
           High          159   85.53   176    76.7   165   76.36   124   82.26
           Moderate      222   70.72   219   71.23   209   71.29   198   68.18
           Slight        173   60.12   180   56.67   192   60.94   214   63.08
           Guess          67   64.18    83   56.63    83   63.86   110   62.73
Eleven     Absolute      371   97.04   352   96.59   218   94.95   161   93.79
           High          154   88.31   157   81.53   185   81.08   148   81.76
           Moderate      143   77.62   154   70.78   185   72.97   192   74.48
           Slight        105   71.43   110   70.91   148   64.86   176   69.32
           Guess          49   67.35    75   61.33    67   53.73   112   70.54

 

Table 4 presents the empirical results for the same analysis of "yes" responses. Exactly as predicted from the signal detection analysis, the likelihood that "yes" responses were correct given that subjects said they were "just guessing," was much lower (.628) than the probability that "yes" responses were correct when they were accompanied by absolute confidence (.959). Examination of the black data points in the bottom two graphs in Figure 8 (when v was equal to 1.5) indicates that the lock-step and the stretch models produce a subtle difference in the calibration curves that is most noticeable at the highest confidence levels. Because the stretch model assumes that subjects widen their confidence cutpoints as d' gets lower, it predicts that the probability that the higher confidence "yes" responses are correct will be less affected by reductions in d' compared to the lock-step model. That is, the slope of the calibration curve should remain fairly stable as d' changes (within the d' ranges our experimental conditions produced) if the stretch model is correct. We tested these predictions by performing a 4 (retention interval) x 2 (duration of exposure) x 5 (confidence level) log-linear fit of the yes response frequencies coded as hits or false alarms in which we treated each response as an independent event. Although the fit of the overall model was highly significant (χ²(39) = 726.14, p < .0001), consistent with the stretch model, Wald Chi-Square effect tests revealed that this was due to only three main effects: retention interval (χ²(3) = 13.57, p = .0036), exposure duration (χ²(1) = 4.68, p = .031), and confidence (χ²(4) = 361.26, p < .0001). Most importantly, none of the interactions reached significance.[15] In addition, when we repeated this analysis assuming an ordinal constraint on confidence, the main effects of retention interval and exposure duration disappeared. 
Considering that the Wald Chi-Square effects of retention interval and duration of exposure were highly significant without confidence in the model (retention interval χ²(3) = 35.47, p < .0001 and exposure duration χ²(1) = 62.84, p < .0001), this latter result suggests that the effects of retention interval and duration on the accuracy of "yes" responses were almost completely mediated by confidence! In effect, if one wanted to predict the odds of a "yes" response being correct, knowing the confidence that was expressed in that response would provide virtually all of the relevant information. Knowing the optimality of the learning and memory conditions would provide almost no additional predictive information over and above the subject’s expressed confidence in that response.
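
The pooled proportions quoted above for "just guessing" (.628) and absolutely confident (.959) "yes" responses can be recovered from the cell entries of Table 4 by treating every response as an independent event. The sketch below does this; because the table reports percentages rounded to two decimals, the number of hits in each cell is reconstructed by rounding N times the percentage, which is an assumption about how the table was produced.

```python
# (N, % hits) for each cell of Table 4, ordered by retention interval
# (1, 24, 168, 336 hours) within exposure duration (three, then eleven seconds).
TABLE_4 = {
    "Guess":    [(67, 64.18), (83, 56.63), (83, 63.86), (110, 62.73),
                 (49, 67.35), (75, 61.33), (67, 53.73), (112, 70.54)],
    "Slight":   [(173, 60.12), (180, 56.67), (192, 60.94), (214, 63.08),
                 (105, 71.43), (110, 70.91), (148, 64.86), (176, 69.32)],
    "Moderate": [(222, 70.72), (219, 71.23), (209, 71.29), (198, 68.18),
                 (143, 77.62), (154, 70.78), (185, 72.97), (192, 74.48)],
    "High":     [(159, 85.53), (176, 76.7), (165, 76.36), (124, 82.26),
                 (154, 88.31), (157, 81.53), (185, 81.08), (148, 81.76)],
    "Absolute": [(189, 97.88), (151, 97.35), (132, 93.18), (72, 91.67),
                 (371, 97.04), (352, 96.59), (218, 94.95), (161, 93.79)],
}

def pooled_proportion_correct(cells):
    """Pool responses over all eight conditions: total hits / total "yes" responses."""
    hits = sum(round(n * pct / 100.0) for n, pct in cells)   # reconstructed hit counts
    total = sum(n for n, _ in cells)
    return hits / total

pooled = {level: pooled_proportion_correct(cells) for level, cells in TABLE_4.items()}
for level, p in pooled.items():
    print(f"{level:9s} {p:.3f}")
```

Pooled this way, the proportion of correct "yes" responses increases at every step from "Guess" to "Absolute".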

Even after controlling for the differential rates of the confidence estimates across experimental conditions (seen as Ns in Table 4), the relationship between confidence and accuracy remained remarkably strong and consistent, in a pattern predicted by the stretch model simulation data (when v=1.5). In the worst memory condition (three-second exposure duration and 336-hour retention interval with a mean d' of .98) 91.7% of the absolutely confident "yes" responses were hits while only 62.7% of the responses labeled as guesses were hits; and in the best memory condition (eleven-second exposure and one-hour retention with a mean d' of 2) 97% of the absolutely confident responses were hits while 67.4% of the guesses were hits. This result, along with the much greater ability of confidence than condition differences to predict the hit versus false alarm rate, presents a very different picture of the strength of the relationship between confidence and accuracy than that which is typically concluded from the correlation-based methods of measuring that relationship.


Figure 9. "No" response calibration curves produced from the same four sets of Monte-Carlo simulation runs used throughout. Each line within the four graphs shows how, at each d' level, the probability of producing a correct "no" response, correct rejections/(correct rejections+misses), changes as confidence in the response increases. The d' values for the curves within each graph were 0, .25, .5, 1, 1.5, 2, 2.5, and 3. In each graph, the steeper the slope, the greater the d'. In the bottom two graphs, black lines represent d' values closest to those obtained in Experiment 1.

Similar though not identical results are presented in Figure 9 for the simulated "no" responses. As can be seen in Figure 9, the pattern of calibration curves for the "no" responses (for the identical simulation runs that produced the curves in Figure 8) is somewhat different from that for the "yes" responses. In particular, note how at higher d's the probability of low confidence responses being correct drops well below chance. This difference is primarily the result of the biased placement of the "yes/no" decision criterion in the simulations. To see how this can happen, one need merely examine Figure 4 and imagine that the "yes/no" decision criterion were moved to the position occupied by the yes-1 criterion. Subjects would be saying "no-just guessing" at a point where the area under the seen distribution is greater than the area under the not seen distribution, that is, they would produce more incorrect misses than correct rejections. This effect would increase as d' increased (up to a point that would depend on the spacing of the cutpoints and the variance of the seen and not seen distributions).
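
The below-chance result for low-confidence "no" responses follows directly from the areas involved, as the following sketch illustrates. It assumes equal-variance normal distributions and a yes/no criterion placed .5 standard deviations above the midpoint between the two means, with a "no-just guessing" bin of width .5 immediately below the criterion; these values are illustrative and are not the parameters of the simulations behind Figure 9.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def no_guess_accuracy(d_prime, shift=0.5, width=0.5):
    """P(correct | "no, just guessing") when the yes/no criterion sits `shift`
    above the midpoint between the not-seen (mean 0) and seen (mean d') curves."""
    c = d_prime / 2.0 + shift                 # conservatively biased criterion
    lo, hi = c - width, c                     # evidence range of the lowest "no" bin
    correct_rejections = STD_NORMAL.cdf(hi) - STD_NORMAL.cdf(lo)
    misses = STD_NORMAL.cdf(hi - d_prime) - STD_NORMAL.cdf(lo - d_prime)
    return correct_rejections / (correct_rejections + misses)

for d in (0.0, 1.0, 2.0, 3.0):
    print(f"d'={d}: P(correct | no-guess) = {no_guess_accuracy(d):.3f}")
```

With the criterion displaced in this way, the lowest "no" bin lies where the seen curve is higher than the not-seen curve, so misses outnumber correct rejections and the gap widens as d' grows.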

Table 5

Number of "No" Responses (N) and Percent of Them Correct (%CR) as a Function of Exposure Duration, Retention Interval, and Confidence in the Between-Subjects Experiment

                                Length of Retention Interval (in Hours)
Exposure   Response          1             24            168            336
Duration   Confidence      N     %CR     N     %CR     N     %CR     N     %CR
Three      Absolute      223   81.17   143   80.42   150   83.33    62   70.97
           High          278   72.66   337   70.33   256   72.27   230   68.26
           Moderate      318   67.92   341   65.69   421    63.9   346   67.05
           Slight        255   59.22   222   54.95   271   56.09   242   61.98
           Guess         101   55.45   141   52.48   104   45.19   152   54.61
Eleven     Absolute      331   92.45   257   84.05   167   81.44   191   79.58
           High          300   81.67   339   77.58   324   77.78   271   75.28
           Moderate      278   71.22   270   74.44   340   68.53   263   70.34
           Slight        181   56.35   189   63.49   183   57.38   269   60.97
           Guess          83   45.78    74   48.65    93   52.69   118   58.47


Table 5 shows the empirically obtained calibration results for the no responses. When subjects indicated they were "just guessing," 51.67% of their no responses were correct but when they said that they were absolutely confident, 81.67% of their no responses were correct.[16] Clearly confidence was also well calibrated to the accuracy of the no responses. Comparison of the lock-step and stretch models for the no responses shows that both models predict that the degree of calibration should get worse as d' decreases. A log-linear fit for the "no" responses suggested, once again, that confidence was an excellent predictor of accuracy. Retention interval (χ²(3) = 6.56, p = .088), exposure duration (χ²(1) = 16.89, p < .0001), and confidence (χ²(5) = 327.47, p < .0001) were significant. However, unlike the "yes" responses, but still consistent with the signal detection simulations, this analysis also produced two significant interactions: retention interval by confidence (χ²(12) = 24.37, p = .018), and exposure duration by confidence (χ²(4) = 10.17, p = .038), with the remaining interaction effects not reaching significance. The nature of the two interaction effects was consistent with the idea that the no response confidence criteria shifted in a more conservative direction as d' decreased. Namely, the extent to which confidence was calibrated to the accuracy of the "no" responses decreased as the learning and memory conditions worsened. For example, in the least optimal memory conditions, the difference in accuracy between "just guessing" responses and "absolutely confident" responses was 15%, but the difference in the most optimal conditions was about 45%.
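
For readers who wish to relate the two summary percentages to Table 5, the sketch below computes them in two ways from the table entries: as the unweighted mean of the eight cell percentages, and as a pooled proportion over all responses. The unweighted means agree with the quoted 51.67% and 81.67% to within rounding of the table entries; that the original figures were obtained this way is an inference from the arithmetic, not something stated in the text, and the pooled hit counts are reconstructed by rounding.

```python
from statistics import mean

# (N, % correct rejections) from Table 5, ordered by retention interval
# (1, 24, 168, 336 hours) within exposure duration (three, then eleven seconds).
TABLE_5 = {
    "Guess":    [(101, 55.45), (141, 52.48), (104, 45.19), (152, 54.61),
                 (83, 45.78), (74, 48.65), (93, 52.69), (118, 58.47)],
    "Absolute": [(223, 81.17), (143, 80.42), (150, 83.33), (62, 70.97),
                 (331, 92.45), (257, 84.05), (167, 81.44), (191, 79.58)],
}

# Unweighted mean of the eight cell percentages.
cell_means = {lvl: mean(pct for _, pct in cells) for lvl, cells in TABLE_5.items()}

# Pooled percentage, treating each response as an independent event.
pooled = {
    lvl: 100.0 * sum(round(n * pct / 100.0) for n, pct in cells) / sum(n for n, _ in cells)
    for lvl, cells in TABLE_5.items()
}

for lvl in TABLE_5:
    print(f"{lvl:9s} mean of cells = {cell_means[lvl]:.2f}%   pooled = {pooled[lvl]:.2f}%")
```

Either way of summarizing the table leaves a gap of roughly 30 percentage points between guesses and absolutely confident "no" responses.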

Clearly both in absolute and relative terms knowing the level of confidence that subjects expressed in their yes and no responses was highly predictive of the accuracy of those responses (even without knowing which subject generated the responses). This relationship between confidence and accuracy was largely hidden by the correlational results presented earlier. Thus, even when both within and between subjects confidence-accuracy correlations appear to be low and not significant, confidence can be highly calibrated to accuracy. In fact, it is possible to show how the responses of two subjects can each be highly calibrated while their mean confidence and percent correct scores are inversely related. This can happen when the subject with the higher d' happens to have wider confidence bins than the subject with a lower d'. Although the former will be more accurate and less confident, for both subjects, higher confidence responses will still be associated with a greater likelihood of those responses being correct.
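
The two-subject case just described can be worked through numerically. The following sketch assumes an unbiased, equal-variance observer with five equally wide confidence bins on each side of the yes/no criterion; the particular d' values and bin widths for the two hypothetical subjects are chosen only to make the point and do not correspond to any subject in the experiments.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def subject_summary(d_prime, width):
    """Return (proportion correct, mean confidence, accuracy at confidence 1..5)
    for an unbiased equal-variance observer whose confidence bins are `width` wide."""
    half = d_prime / 2.0
    edges = [k * width for k in range(5)]       # distances from the yes/no criterion
    p_correct_bins, p_error_bins = [], []
    for k, lo in enumerate(edges):
        # Correct responses fall on the right side of the criterion: offset ~ N(+d'/2, 1).
        # Errors fall on the wrong side, equivalent to offset ~ N(-d'/2, 1).
        top_c = 1.0 if k == 4 else STD_NORMAL.cdf(edges[k + 1] - half)
        top_e = 1.0 if k == 4 else STD_NORMAL.cdf(edges[k + 1] + half)
        p_correct_bins.append(top_c - STD_NORMAL.cdf(lo - half))
        p_error_bins.append(top_e - STD_NORMAL.cdf(lo + half))
    accuracy = sum(p_correct_bins)
    mean_conf = sum((k + 1) * (pc + pe)
                    for k, (pc, pe) in enumerate(zip(p_correct_bins, p_error_bins)))
    calibration = [pc / (pc + pe) for pc, pe in zip(p_correct_bins, p_error_bins)]
    return accuracy, mean_conf, calibration

# Hypothetical subjects: A has the better memory but wide bins, B the reverse.
acc_a, conf_a, cal_a = subject_summary(d_prime=2.0, width=1.0)
acc_b, conf_b, cal_b = subject_summary(d_prime=1.0, width=0.3)
print(f"A: accuracy={acc_a:.3f} mean confidence={conf_a:.2f}")
print(f"B: accuracy={acc_b:.3f} mean confidence={conf_b:.2f}")
print("A calibration:", [round(p, 3) for p in cal_a])
print("B calibration:", [round(p, 3) for p in cal_b])
```

Subject A is more accurate yet less confident on average than subject B, while for each subject taken alone the proportion correct still rises with every step in confidence.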

Another implication of these results concerns the effect that different learning and memory conditions have on confidence criteria placement. The different pattern of calibration for the "yes" and "no" responses tends to support the stretch/conservative-shift models over the lock-step model. Apparently, subjects do not simply move their confidence criteria in lock-step with the "yes/no" decision criterion as d' changes.

One way to test the hypothesis that people might adjust their confidence criteria based on their metatheories of memory would be to assess the effect of duration of exposure and length of retention interval in such a way that subjects were deprived of information about the source of whatever item differences in subjective strength of evidence they might experience. That is, take away their knowledge of each item’s time since last seen (and its duration of exposure) at the time of testing. Experiment 2 was designed in an attempt to create these conditions. It also provided another opportunity to examine the confidence-accuracy relationship from a somewhat different face-recognition procedure than that used in Experiment 1.

To hide information about the length of a retention interval (as well as the duration of exposure) associated with seen items, we varied these factors within subjects. In a typical within-subjects retention interval design, items with different retention intervals are presented and tested in blocks. Subjects see some items and then are tested at one retention interval, then they see more items and are tested at a different retention interval. This procedure does not hide information about the retention interval, however, because the subjects know how long it has been since the last study session. Therefore, we required that subjects study items at different times during a three-week interval and return for one final test session in which all of the items were mixed together. In this way, subjects did not have, at the test session, non-memory based information about how long it had been since they had seen a particular item. Unfortunately, this procedure has the drawback of adding retroactive and proactive interference effects to the retention interval as well as confounding it with order effects. Nevertheless, since our primary concern was the relationship between d', confidence, and accuracy and not the pure effects of retention interval and duration of exposure, we felt this was a small price to pay to keep subjects blind, during the test phase, to the learning conditions that were associated with individual items.

Experiment 2

Method

Subjects. Thirty-five subjects were again obtained from introductory psychology classes at UCSD and served in partial fulfillment of class requirements. Subjects volunteered to participate in an experiment lasting about three weeks and requiring that they return for a total of six sessions over that three-week interval.

Design. A 2 x 5 within-subjects factorial design was employed. Two levels of duration of exposure to the study faces (two seconds and twelve seconds) were crossed with five different retention intervals (one hour, 24 hours, 168 hours, 336 hours, and 384 hours). This unusual pattern of retention intervals was selected because pilot testing indicated that there was an unexpectedly strong primacy (or “first in”) effect that interfered with the retention interval effect. In an attempt to provide a long retention interval in which this “first in” effect was minimized, we followed the first session with another, two days later, with the intention of discarding data for items from the very first study session.

Procedure. In a manner similar to the previous experiment, subjects studied a total of 50 color slides of faces and were tested for recognition with 100 slides. Unlike the prior experiment, however, the decision task was whether this person had been seen before, not whether the identical slide had been seen before. Thus, the 50 previously seen items were not identical pictures of the same person but consisted instead of the same person with some minor changes in appearance, e.g., changes in clothing, minor changes in hair style, changes in facial expression, and so on. All of the stimulus people were randomly divided into two sets of 50. Half of the subjects studied one set, the other half studied the other set.

Subjects were run in groups of one to four at each of five study sessions. During each study session, subjects were shown 10 slides of male and female college age individuals (using the same methods of display as in Experiment 1). Half of the slides in each session were seen for the short duration and half for the long duration. Subjects were instructed to look at each person carefully and then during a 15 second inter-item interval predict whether they thought that they would recognize this person later, indicate how confident they were in this prediction, and write down a reason, if they had one, from a list that we provided, why they thought they might or might not recognize that person later.

At the beginning of the first session, subjects were told that they would have to return on four additional days over the next three weeks for a brief time but that the last session would last for a much longer time (two hours) than the previous ones. After the introductory first study session, subjects returned two days later for the next session, then a week later for the next session, then five or six days later for the next session, and then the following day for the last session. At the end of the last study session of 10 slides, subjects were given an hour break after which they had to return to the laboratory for the test session.

During the test, the slides of the previously seen individuals were randomly mixed with slides of 50 new people. Subjects were shown each test slide for 20 seconds during which time they indicated whether they had seen this person before and their confidence in this response. If they indicated that they had seen the person before, they were asked whether they had seen the person for a short or a long time as well as when they had seen that person before (which session). Finally, subjects wrote down reasons from the same list provided during the study sessions why they thought they did or did not see the person before.

Results and Discussion

Duration and Retention Interval Effects. Because the exposure and retention interval conditions were varied within subjects for the seen items only, data for just one set of not seen items were available for each subject. Thus, the learning and test conditions affected the results for the seen items only. Nevertheless, we computed separate estimates of d' by reusing the results for the unseen items for each estimate. Table 6 presents the results (excluding those from the first study session) for d', percent correct, and mean confidence. As in the between subjects design, the main effects of retention interval and duration of exposure were significant for all three measures but the interactions were not (for an arcsin transformation of proportion correct: the retention interval Greenhouse-Geisser corrected F(2.79, 94.87) = 4.27, p<.01, the exposure duration F(1, 34) = 21.7, p<.0001[17], but the interaction was not significant; for confidence: the Greenhouse-Geisser corrected retention interval F(2.68, 91.3) = 3.00, p<.05, the exposure duration F(1, 34) = 14.31, p<.001, and again the interaction was not significant). Somewhat unexpectedly, however, the one hour retention interval yielded slightly lower memory scores than the one day condition, possibly due either to fatigue or to interference. Thus, the more typical retention interval memory losses were found by comparing the one day retention interval with the week and two week intervals. Regardless, these conditions did create significantly different performances on the key measures.

 

Table 6

Effect of Duration of Exposure and Length of Retention Interval on Mean Accuracy (Measured as d' and Percent Correct) and on Mean Confidence a

                                            Measure
                            d'          Percent Correct     Mean Confidence
                                     Duration of Exposure
Retention Interval    2 sec.  12 sec.   2 sec.  12 sec.     2 sec.  12 sec.
1 hr                   1.41    1.77      65.1    77.7       3.457   3.817
24 hrs                 1.48    1.92      66.9    81.1       3.623   3.920
168 hrs                1.34    1.53      64.0    69.7       3.457   3.720
336 hrs                1.19    1.55      56.6    68.0       3.480   3.589


a Computations for the mean d’s were hampered by the fact that each subject was only tested on five previously seen people at each retention interval and duration. The small n meant that some subjects were 100% correct in some cells, a value whose inverse normal deviate is not defined. In an attempt to correct for this, we recoded 100% to 95% correct. A similar correction was applied on the 0% correct side. For these reasons, the absolute values of the mean d’s in this table should not be taken at face value.
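
The recoding described in the table note can be sketched as follows. The code computes d' as the difference between the inverse normal deviates of the hit rate and the false-alarm rate, recoding a rate of 1.0 to .95 before the transform. Recoding 0.0 to .05 is an assumption about what the "similar correction" on the 0% side was, and the function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def recode_extreme(p, low=0.05, high=0.95):
    """Replace rates of exactly 0 or 1, whose inverse normal deviate is undefined."""
    if p >= 1.0:
        return high      # 100% recoded to 95%, as described in the note
    if p <= 0.0:
        return low       # assumed mirror-image recoding for 0%
    return p

def d_prime(hit_rate, false_alarm_rate):
    """d' = z(hit rate) - z(false-alarm rate), after recoding extreme rates."""
    z_hit = STD_NORMAL.inv_cdf(recode_extreme(hit_rate))
    z_fa = STD_NORMAL.inv_cdf(recode_extreme(false_alarm_rate))
    return z_hit - z_fa

# A subject who recognized all five seen faces in a cell, with a
# hypothetical false-alarm rate of .20 on the unseen faces.
print(round(d_prime(5 / 5, 0.20), 2))
```

With only five seen items per cell the hit rate can take just six values, so the recoded cells cap the attainable d', which is one reason the absolute values in the table are hard to interpret.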

 

Signal Detection. The first evidence that signal detection provided a reasonably good description of the data from this experiment can be seen in Figure 10. Linear functions fit the normalized ROC curves[18] about as well as they fit data from the first experiment (the smallest r² was .991 and the largest was .996). Interestingly enough, the slopes of these ROCs were somewhat less than those in the previous study (ranging from .634 in the one hour retention interval to .686 in the 336 hour retention interval), suggesting that the standard deviations of the seen item distributions were between 1.46 and 1.58 times larger than that of the distribution for the unseen items.
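
The 1.46 and 1.58 figures follow from the reported slopes. Under the usual unequal-variance Gaussian model, the slope of the normalized ROC equals the ratio of the not-seen to the seen standard deviation, so the reciprocal of the slope gives how much more spread out the seen distribution is. The short sketch below applies this to the two extreme slopes reported above and also prints the squared ratios, which are the corresponding variance ratios under the same model.

```python
# Slopes of the normalized ROCs at the two extreme retention intervals.
slopes = {"1 hour": 0.634, "336 hours": 0.686}

# Reciprocal of the slope = seen SD / not-seen SD under the unequal-variance model.
sd_ratios = {label: 1.0 / slope for label, slope in slopes.items()}
variance_ratios = {label: ratio ** 2 for label, ratio in sd_ratios.items()}

for label in slopes:
    print(f"{label:9s} SD ratio = {sd_ratios[label]:.2f}   "
          f"variance ratio = {variance_ratios[label]:.2f}")
```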

The second type of evidence comes from the mean confidence ratings obtained for each response type. Because each subject only generated five responses to seen slides at each duration and retention interval, most subjects did not produce all response types in all conditions. This made a complete analysis of variance of condition by response-type inappropriate. Nevertheless, we were able to examine the effects of response type on least square estimates of mean confidence: Hits = 3.860, FA = 2.809, Miss = 3.120, CR = 3.477 (Greenhouse-Geisser corrected F(1.59, 50.77) = 27.13, p<.0001). As expected by signal detection, and exactly as in Experiment 1, mean confidence was higher for correct than incorrect responses (F(1, 96) = 65.33, p<.0001). In addition, as in Experiment 1, subjects were more confident in their responses to the slides they had seen before, i.e., hit and miss responses, than to the slides they hadn’t seen before (F(1, 96) = 15.89, p<.0001). Fortunately, we were able to examine the effect that duration had on the size of the hit v. miss response-type effect by collapsing over different retention intervals. In particular, a 2 (duration) by 2 (hit/miss) within-subjects analysis of variance of mean confidence supported the signal detection prediction: the difference in mean confidence between hit and miss responses was larger in the long duration (4.037 v. 3.066) than short duration (3.667 v. 3.224) of exposure conditions (F(1, 34) = 16.86, p=.0002). Missing data and high correlations across conditions prevented us from performing a similar test for the different retention intervals, but when we computed the mean hit confidence minus mean miss confidence for each subject in each retention interval and compared these scores, we found that this difference was larger in the shorter, one and 24 hour, retention interval conditions (.66 and .77) than the longer, 168 and 336 hour conditions (.48 and .61). 
Finally, there were no differences in the confidence subjects expressed in their “yes” compared to “no” responses. Thus, like Experiment 1, although not as neat, these mean confidence results are consistent with the signal detection model. Responses that we would predict were based on evidence values closer to the "yes/no" criterion had lower confidence means than responses based on evidence further from c.


Figure 10. ROC curves from applying the rating method to data from four retention intervals in a within-subjects design in which the subjects did not know how long it had been since they had seen each slide. The same rating method was used here as in Figure 2; however, the normalized cumulative proportion of responses for the not seen items did not change as the length of the retention interval changed. Linear functions were fit with least squares regression.

When we applied the Peirce and Jastrow equation to each subject's overall mean signed confidence and ln p/(1-p) accuracy measure, the results supported a signal detection interpretation consistent with the previous analyses. In particular, as Figure 10 shows, individual differences in memory (collapsed across the different duration and retention intervals for the seen slides) were linearly related (r2 = .796, F(1, 33) = 128.53, p<.0001) to individual differences in signed confidence scores. Consistent with the ROC data and the fact that subjects were again slightly biased to say, "no," (the total mean proportion of "yes" responses was .446), the intercept of this function was slightly larger than zero (t = 2.27, p = .03). Furthermore, despite the use of rather different recognition and retention interval procedures, the parameter values of this fit were not significantly different from those obtained in Experiment 1 (slope = 1.03 and 1.12, intercept = 2.24 and 2.54 for Experiments 1 and 2, respectively).  In sum, the ROC curves, the response-type effects on mean confidence, and the Peirce and Jastrow analysis all seem to fit about as well within a signal detection framework as did the data from Experiment 1.

Confidence and Accuracy Correlations. As was done for Experiment 1, within-subject and between-subject confidence-accuracy correlations were computed to determine how the more typical methods of examining the relationship between confidence and accuracy behaved. Unlike the first experiment, but quite consistent with previous reports, the overall between-subject correlation (based on mean confidence, coded 1-5, and total proportion correct) was not only small, it was not significant (r(33) = .218, p>.1). The within condition confidence-accuracy correlations, for the seen items only, are in Table 7. With one exception (the one hour between-subjects case) all of the correlations for the long exposure slides were significantly different from zero, but none of those from the short duration condition were. Because all but six subjects had missing accuracy scores in one or another condition (e.g., all of their responses were correct or incorrect), a reasonable 2-way analysis of variance could not be computed on the within-subjects mean correlations. However, when two separate analyses of variance, one for duration and the other for retention interval, were computed, the duration effect was significant (F(1, 34) = 12.70, p<.005) but the retention interval effect was not (F>1). Unlike Experiment 1, only duration of exposure seemed to have a large effect on the confidence-accuracy correlations.[19] But then, the duration of exposure manipulation also had a much bigger effect on d' than did the retention interval manipulation.

 

Table 7

Effect of Duration of Exposure and Length of Retention Interval on the Size of Confidence-Accuracy Within and Between Subject Correlations for the Seen Items Only (a)

                      Type of Confidence-Accuracy Correlation
                      Within-Subjects        Between-Subjects
                      Exposure Duration      Exposure Duration
Retention Interval    2 sec.     12 sec.     2 sec.     12 sec.
1 hr                  .040       .395*       .244       .147
24 hrs                .196       .396*       .046       .680*
168 hrs               .124       .395*       .151       .509*
336 hrs               .060       .310*       .328       .395*

(a) Each of the mean within subject correlations is based on a different n depending on the number of subjects who correctly identified all of the seen items in a given condition. If a subject correctly identified all of the items, there was no variation in accuracy and therefore a correlation could not be computed for that subject.

*p<.05

 

Mean Confidence and Proportion Correct. If this experiment was successful in preventing subjects from using information about the length of the retention interval in setting their confidence criteria, then mean confidence should not drop at a relatively faster rate than proportion correct as the retention interval increased from one to two weeks. In addition, mean confidence and total proportion correct should be monotonically (but possibly linearly, given the range of d' values produced) related across all of the learning conditions. Figure 11 shows that these predictions were confirmed. Unlike the results from the first experiment, the two-week retention interval did not result in a relatively more rapid drop in confidence than in accuracy. Instead the data points were well within the 95% confidence intervals (shown as gray lines) of the best fitting linear function (r2 = .912) for the data that excluded the two-week retention interval.[20]

 


Figure 11. The relationship between mean confidence (coded 1 through 5) and mean proportion correct for each learning and memory condition in Experiment 2. The squares are data from the short duration and the dots are from the long duration of exposure. The gray lines represent the 95% confidence interval for the best fitting linear function leaving out the data from the two-week retention interval.

Calibration of Confidence to Accuracy. Figure 12 shows the proportion of all of the yes and no responses that were correct given the level of confidence expressed in the response. Clearly, confidence was highly calibrated to the accuracy of both the yes and the no responses (the main effect of confidence Wald X2 (4) = 163.29, p<.0001), although the calibration was better for the yes responses than for the no responses (the yes/no by confidence interaction Wald X2 (4) = 26.52, p<.0001). Because all of the false alarms were made to a standard set of not seen stimuli, it was impossible to test whether the degree of calibration varied as the duration and retention interval of the seen stimuli changed. Regardless, these results again show that confidence in and accuracy of face recognition can be highly related even when correlation measures on the same data suggest the relationship is weak or non-existent.
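A calibration curve of the kind described here is simply the proportion of correct responses within each confidence level, tabulated separately for "yes" and "no" responses. The sketch below shows one way to tabulate it. The function name, the tuple layout, and the toy responses are assumptions made for illustration and are not the data or code used in the experiment.

```python
from collections import defaultdict


def calibration_curve(responses):
    """Proportion correct at each confidence level, split by response type.

    responses: iterable of (said_yes, confidence, was_seen) tuples, where
    said_yes and was_seen are booleans and confidence is coded 1 through 5.
    Returns a dict mapping ("yes" or "no", confidence) to proportion correct.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for said_yes, confidence, was_seen in responses:
        key = ("yes" if said_yes else "no", confidence)
        counts[key][0] += int(said_yes == was_seen)  # hit or correct rejection
        counts[key][1] += 1
    return {key: correct / total
            for key, (correct, total) in sorted(counts.items())}


# Toy responses (made up): (said_yes, confidence, was_seen)
toy = [
    (True, 5, True), (True, 5, True), (True, 5, True), (True, 5, False),
    (True, 1, True), (True, 1, False),
    (False, 5, False), (False, 5, False), (False, 5, True),
    (False, 1, False), (False, 1, True),
]
curve = calibration_curve(toy)
for key, proportion in curve.items():
    print(key, round(proportion, 2))
```

Because the tabulation conditions on the confidence expressed rather than on the subject or the item, it ignores subject and stimulus differences by construction.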


 


Figure 12. Calibration curves for "yes" and "no" responses for the data from Experiment 2. The darker line and square data points represent the probability of all "yes" responses being correct given that they were accompanied by a particular confidence rating. The lighter line and circles represent the same results for all "no" responses.

Confidence Cutpoints. Taken together, the results from Experiments 1 and 2 are consistent with the idea that when subjects know how long it has been since they studied some faces, they adjust their confidence criteria to reflect their beliefs about the effect that a long retention interval has on their memories. Nevertheless, we attempted to test more directly the hypothesis that subjects in the first experiment set their confidence criteria more conservatively as the retention interval increased, whereas subjects in the second experiment kept relatively fixed confidence criteria as d' changed. To do so, we estimated the placement of the confidence criteria in the two experiments using the signal detection and maximum likelihood procedures detailed in Dorfman & Alf (1969), Ogilvie & Creelman (1968), and Swets & Pickett (1982).

In this procedure, the parameters of the signal detection model are adjusted to maximize the theoretical probability of obtaining the observed distribution of "yes/no" and confidence responses. The parameters were r (the ratio of the noise to signal distribution standard deviations), d', and five cutpoints: "yes-5," "yes-3," "c," "no-3," and "no-5." The r values were about .8 and roughly equal across the different learning and memory conditions. Although the estimated d' values were higher than those in Tables 1 and 3, the estimated values were linearly related (r2 = .985) to those in the Tables with a slope not different from 1.0.
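The quantity being maximized can be sketched as a multinomial log-likelihood over the six response bins that five cutpoints define. This is a minimal illustration and not the fitting code from the cited procedures. It assumes that the not-seen distribution is a standard normal, that the seen distribution has mean d' and standard deviation 1/r, and that cutpoints are listed in ascending order on the evidence axis ("no-5", "no-3", c, "yes-3", "yes-5"). The function names and the counts are hypothetical.

```python
from math import log
from statistics import NormalDist


def category_probs(mean, sd, cutpoints):
    """Probability of each response bin defined by ascending cutpoints."""
    dist = NormalDist(mean, sd)
    bounds = [0.0] + [dist.cdf(c) for c in cutpoints] + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def log_likelihood(d_prime, r, cutpoints, seen_counts, unseen_counts):
    """Log-likelihood of observed bin counts under the model.

    r is the ratio of the not-seen to the seen standard deviation, so with the
    not-seen standard deviation fixed at 1 the seen one is 1 / r (assumption).
    Counts are ordered from most confident "no" to most confident "yes".
    """
    p_unseen = category_probs(0.0, 1.0, cutpoints)
    p_seen = category_probs(d_prime, 1.0 / r, cutpoints)
    total = 0.0
    for counts, probs in ((seen_counts, p_seen), (unseen_counts, p_unseen)):
        total += sum(n * log(p) for n, p in zip(counts, probs) if n > 0)
    return total


# Made-up example: expected counts generated from known parameters.
cuts = [-0.5, 0.3, 1.0, 1.7, 2.5]
seen = [1000 * p for p in category_probs(1.5, 1 / 0.8, cuts)]
unseen = [1000 * p for p in category_probs(0.0, 1.0, cuts)]

at_truth = log_likelihood(1.5, 0.8, cuts, seen, unseen)
elsewhere = log_likelihood(1.0, 0.8, cuts, seen, unseen)
print(at_truth > elsewhere)
```

A general-purpose numerical optimizer would then search over d', r, and the five cutpoints for the values that make this quantity largest. The sketch only evaluates the likelihood at chosen parameter values.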

Of most interest in these analyses is whether the maximum likelihood estimates for the confidence cutpoints would support the conservative-shift model for the data from Experiment 1 but not for the data from Experiment 2. Figure 13 presents the results of these maximum-likelihood fits for both experiments. The graphs show the maximum likelihood estimates for the five estimated cutpoints plotted against the maximum likelihood d' estimates. The data are collapsed over the duration conditions for both fits because the pattern of effects of retention interval was virtually identical in the two duration conditions.[21] As can be seen in the top panel, when subjects in Experiment 1 were aware of the length of the retention interval for individual items, placement of the higher confidence cutpoints for the “yes” responses actually did become more conservative (after an initial tendency to be drawn along in lock-step fashion with the “yes/no” decision criterion) as the retention interval increased and d' decreased. In a fashion completely consistent with the calibration results, the increase in conservativeness was even more pronounced for the “no” responses because the conservative shift combined with, rather than worked against, the tendency for the “yes/no” criterion (c) to shift downward with decreasing d'. Using the standard errors of the maximum-likelihood estimates, t-tests confirmed that both the “yes-5” and “no-5” cutpoints were significantly more extreme in the two week retention interval condition than in the one hour condition. In short, the subjects seemed to expand the range of their most extreme confidence bins thereby requiring greater strength of evidence (at both ends of the scale) before indicating high confidence. 
In stark contrast, but not surprisingly, the lower panel shows that when subjects in Experiment 2 were unaware of the length of the retention interval for individual items, the placement of their confidence cutpoints did not change as the retention interval increased and d' decreased. Of course, it is difficult to imagine how they could have changed, given that the subjects had no information about which faces were from which retention interval.


Figure 13. The top and bottom parts of the figure show the maximum likelihood estimates for d', the “yes-absolutely confident,” “yes-moderately confident,” “yes-no,” “no-moderately confident,” and “no-absolutely confident,” response cutpoints as a function of the length of the retention interval for Experiments 1 and 2. Results for Experiment 1 are in the top figure.

Experiment 3

If people do have metatheories of the effects that retention interval has on their memories for faces and those theories predict that memory decays at a rate faster than it actually does, we might expect that subjects’ predictions of their memory performance over retention interval would drop more rapidly than their actual performance drops over equivalent retention intervals. To test this idea we designed an “observer simulation” of Experiment 1. We presented descriptions of the procedure and conditions that we used in Experiment 1 to naive subjects and asked them to predict their memory performance.

Method

Subjects. One hundred and seventeen students from a class that one of us was teaching served as subjects for no credit.

Procedure. All members of the class were told about the procedures used in Experiment 1. The class was told to imagine that they were asked to study 40 faces of individuals projected on a screen for 12 seconds each. (The instructor demonstrated 12 seconds by counting off the time while watching a second hand.) They were also told to imagine that after studying all 40 faces they would take a one-hour break and then be tested for their memories of the faces. The recognition procedure was described and they were further instructed that if they had no memory for any of the slides and just guessed, they would obtain a score of 50% correct and that if they had a perfect memory, they would correctly recognize each slide they had seen before and correctly reject each slide they hadn’t seen before thereby obtaining a score of 100% correct. At this point all of the class was asked to estimate how well they thought they would perform on the recognition test after a one hour delay by providing the percent correct they thought they would obtain. The class was then asked to provide additional estimates for their performance after a day, one week, and two-week delay. Finally, the class was asked to provide estimates for all four retention intervals, but this time to imagine that they were only able to study the slides for 3 seconds rather than 12 seconds each. The instructor then demonstrated 3 seconds by counting off the time while watching a second hand.

Results and Discussion

Clearly the design of this observer simulation is such that many differences exist between the actual experiment and the observer simulation. Nevertheless, it is still of interest to compare the shape of the forgetting function that subjects predicted for themselves with the functions that were actually obtained in Experiment 1. Figure 14 presents these results. As was the case for the actual performance data, analyses of variance indicated that both the main effect of duration and of retention interval on the predicted percent correct (arcsin transformed) scores were highly significant (p<.0001) but the interaction was not. More importantly, the pattern of the observers’ average predicted memory performance appeared to be somewhat different from the actual performance that equivalent conditions produced. In particular, the observers predicted that their recognition accuracy would drop off much more rapidly as the retention interval increased to two weeks than the rate at which their actual memory performance declined. Interestingly, the observers did a remarkably good job of predicting the size of the effect of duration of exposure.

 


Figure 14. Solid dark lines show the actual percent correct recognition responses in Experiment 1 and the dotted lines show the predicted percent corrects for another group of subjects who simply guessed how well they would do after hearing a verbal description of the same experiment.

The latter result may explain why we found evidence in support of the length of the retention interval producing a conservative shift in confidence cutpoints but no such effect for duration of exposure. If subjects’ metatheories about the effect of duration of exposure on recognition accuracy are nearly correct (at least within the range of values that we studied), then any effect those theories might have on the placement of confidence would be no different than the direct effect of d'.

General Discussion

The results from the present experiments are consistent with the view that confidence and accuracy in face memory are highly related to each other despite what many eyewitness memory experts seem to believe. However, the nature of this relationship is not well described by the size of simple Pearson correlation coefficients that are often computed in face and eyewitness memory research. Instead, the relationship is better described by a signal detection model in which confidence estimates are cutpoints located on the same underlying psychological dimension as the "yes/no" decision criterion. This model assumes that higher confidence identification responses are always associated with a higher likelihood of being correct than lower confidence responses (except when d' is near zero) even though certain correlation measures of the confidence-accuracy relationship may be small and non-significant.

We do not mean to argue that signal detection provides a perfect description of all of the accuracy and confidence data in face recognition research. But signal detection (Egan, Schulman, & Greenberg, 1959; Macmillan & Creelman, 1991) provides a useful description of the strong relationship that exists between confidence and accuracy, a relationship that is virtually hidden from view by Pearson correlations between confidence and accuracy but quite clear when other measures of the relationship are computed, namely, means across different learning and memory conditions, correlations between mean confidence and accuracy over different faces, and especially calibration curves.

Optimality Hypothesis Redefined

If our view is correct, it suggests that the optimality hypothesis explanation for the size of confidence-accuracy correlations needs to be slightly refined. According to the optimality hypothesis, correlations between confidence and accuracy will be higher in optimal, compared to sub-optimal, learning and test conditions (provided that what are defined as optimal conditions actually produce higher d's). Although this view is generally consistent with the signal detection analysis, the emphasis on optimal and sub-optimal implies a dichotomy of conditions that is highly artificial and directs attention away from the underlying process. As we demonstrated by the Monte-Carlo simulations, the relationship between d' and the size of both the response-based and individual difference-based confidence-accuracy correlation is a continuous one. Learning and test conditions are neither optimal nor sub-optimal; they simply control d', which in turn controls the proportion of highly confident responses that are likely to be correct, and thereby the strength of the confidence-accuracy relationship.

Although the preceding argument may seem like a small point, it can have important applied consequences. The belief that confidence is not a good predictor of accuracy has caused a number of researchers to argue that jurors need to be instructed in court by an expert to rely less on confidence and more on other predictors (e.g., Lindsay, Wells, & O'Connor, 1989; Penrod & Cutler, 1987). These experts suggest it is reasonable to tell jurors that confidence is not a good predictor of the accuracy of identifications of actual eyewitnesses to crimes because the “learning and test conditions” in most real crime situations are sub-optimal (Kassin, Ellsworth, & Smith, 1989). Although Deffenbacher (1980) provided a list of some situational factors that might enhance optimality (e.g., a warning that a memory test will occur, “moderate” situational stress, “ample” duration of exposure, “high” familiarity with the target, a “brief” retention interval, “similar” condition of the target at encoding and test, additional “consistent information” presented during the retention interval, a forced-choice testing procedure with “unbiased” instructions, and “low similarity” of the targets to the distracters), the application of these to any particular witness to a crime requires knowing how the witness’s experiences matched these conditions. Because these conditions are not well defined (Is a 1 hour or a 48 hour retention interval considered to be brief? Is 20 seconds or two minutes of exposure ample? If the target changes his hairstyle, is he still similar enough?), it is difficult to know, by looking at the situation, whether conditions are optimal or sub-optimal. Put differently, the optimality hypothesis fails to provide the expert with a procedure to measure optimality.

To be fair, Deffenbacher (1980) also suggested that optimal might be defined as those conditions that produced accuracy rates above 70% and/or d's above 2.0. But even this rule creates uncertainty about the definition of optimal, because as can be seen in our Table 1, accuracy rates around 70% are not, in general, associated with d's of 2.0. In fact, assuming that c is placed midway between the means of the seen and not seen distributions, d's of 2.0 will yield overall accuracy rates in the mid 80% range. More importantly, at our present stage of understanding, we have no idea how the above list of conditions might combine to determine d' (e.g., how many units of less than optimal “stress” are sufficient to overcome the optimal conditions of “ample” exposure and “low similarity” of targets and foils?). In addition, as the simulation results demonstrated, the effect of d' on the size of confidence-accuracy correlations depends on the level of other signal detection parameters and the specific method used to compute the correlation. Because our theories of learning and memory are not yet refined enough to predict exact d's, percent corrects, much less confidence-accuracy correlations for given combinations of specific learning and memory conditions, focusing on an outcome measure such as d' could have much greater applied utility.[22]
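The arithmetic behind the mismatch between the 70% and d' = 2.0 rules can be checked directly. The sketch below assumes the simplest equal-variance Gaussian case with c placed midway between the two means, so that the hit rate and the correct rejection rate both equal the normal cumulative probability at d'/2. The function names are hypothetical, and the unequal-variance fits reported earlier would shift these numbers somewhat.

```python
from statistics import NormalDist


def accuracy_at_midpoint(d_prime):
    """Proportion correct when c sits midway between the two means.

    Equal-variance assumption: hit rate = correct rejection rate = Phi(d'/2).
    """
    return NormalDist().cdf(d_prime / 2.0)


def d_prime_for_accuracy(proportion_correct):
    """Inverse of accuracy_at_midpoint under the same assumptions."""
    return 2.0 * NormalDist().inv_cdf(proportion_correct)


print(round(accuracy_at_midpoint(2.0), 3))   # about 0.841
print(round(d_prime_for_accuracy(0.70), 2))  # about 1.05
```

Under these assumptions a d' of 2.0 corresponds to roughly 84% correct, while 70% correct corresponds to a d' of only about 1.05, so the two proposed thresholds pick out quite different levels of memory strength.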

For example, if our claims are correct, we know how d' and within-subject response-based confidence-accuracy correlations are related (see Figure 5). Finding a way to assess d' from the confidence expressed by the witness has much greater utility than simply claiming that witnesses to crimes are obviously in sub-optimal learning and test conditions and therefore their confidence is diagnostically useless. Until eyewitness memory experts establish the exact functional relationships between different combinations of learning and test conditions and d' -- a daunting task -- the claim that crimes are sub-optimal learning situations without suitable measurement of the strength of each eyewitness's memory simply tells us that the experts believe all eyewitnesses to all crimes have poor memory.

Despite the difficulty of determining how learning and memory variables combine to affect d', one empirical fact that experts do not seem to know is that higher confidence estimates are more probable the higher d' is. We can see this in Tables 4 and 5. The probability of absolute confidence increased dramatically as d' increased. For example in the best memory condition in Experiment 1 (1hr, 11 seconds), subjects were absolutely confident in 45% of all of their yes responses, but in the worst memory condition (336hr, 3 seconds) subjects were absolutely confident in only 10% of all of their yes responses. This important result implies that confidence may be a good predictor of the optimality of the learning and memory conditions, as well as d' and the accuracy of individual identification responses. Recent work on the rate at which highly confident identifications occur in actual crime situations (Moore, Ebbesen, & Konecni, 1994) suggests that highly confident positive identifications occur in over 90% of the identifications in the real world. Even correcting for what are often assumed to be strong demands to appear confident in the real world (for which we know of no direct empirical evidence), the claim that real world witnessing conditions tend to produce d's lower than .25 seems a bit strained.

Deffenbacher’s conclusion that, “the judiciary should cease reliance on witness confidence as an index of witness accuracy,” seems premature. Eyewitness memory experts who have uncritically accepted this conclusion on the basis of low correlations and who have testified, as such, in court have been misinforming jurors about the nature of the confidence-accuracy relationship (with obvious real-world consequences).

Measuring d' from confidence-accuracy correlations

From a slightly different point of view, the various simulation results suggest that whether the size of a confidence-accuracy correlation will be a useful indicator of d' or percent correct depends on a number of previously unrecognized factors. In particular, we know from the simulation results that a narrow range of within subject response-based correlations is associated with a wide range of d' values. For example, as d' grows from 1 to 3 the mean response-based r grows from around .2 to .35. In fact, this correlation changes most rapidly as d' moves from 0 to 1, all situations that would be defined as sub-optimal according to Deffenbacher’s rules. On the other hand, the size of response-based correlations is relatively immune to factors other than d'.

Unlike response-based correlations, the degree of individual difference variance in placement of the confidence cutpoints (v in our simulations) plays a large role in the size of individual difference correlations. A high degree of variance will attenuate these correlations, even when d' is high. The way in which subjects adjust their confidence cuts as d' changes will also play a role in the size of individual difference correlations. In short, unless we take into account individual differences in confidence and know the effect that d' is having on confidence placement, using individual difference confidence-accuracy correlations to estimate the size of d' seems foolish.

Interestingly, because the face-based correlations average over individual differences and because their size does not appear to level off with d' values over 1, one might argue that they offer the "best" reflection of d' of the three different correlational measures of the relationship between confidence and accuracy. To the extent that this conclusion has any applied significance, it suggests that experts might be more concerned about whether multiple witnesses confidently picked the same suspect than whether a given witness is generally confident.

Generalizing to the real-world

Another implication of the signal detection approach is that it raises serious questions about the external validity of the claim that confidence and accuracy are only weakly related. One could argue from the current perspective that the frequently reported low correlations are simply the result of experimenters running conditions in which the average d' is less than 1. That is, when experimenters report low confidence-accuracy correlations they may be telling us little more than that the subjects in their memory task did not learn and/or remember very much. If so, the real issue is not whether confidence is related to accuracy but rather the rate at which experimental simulations of eyewitness recognition memory reproduce the d' values that real crimes produce in actual eyewitnesses. If experimental tasks are sampling from conditions that produce lower d's than are produced in actual crime situations, then estimates about the real world diagnosticity of confidence based on those experiments will be too low. Conversely, if witnesses tend to learn and remember very little in the real world and experiments are sampling from conditions that produce higher d's, the current conclusions about confidence and accuracy may be correct. Thus, before we can make a claim about the diagnosticity of confidence in the real world, we must know the distribution of d's that are produced in actual crime situations -- a task that no one has bothered to accomplish.

Still, it would be foolish to argue that people could not, under the right motivational conditions, construct confidence estimates that did not behave in a manner consistent with the signal detection approach. Clearly people might dissemble either or both identification and confidence responses. What we are arguing is that when properly motivated, people are quite capable of producing recognition memory data that conform to the assumptions of our signal detection analysis. We would argue further that the signal detection approach makes it very clear what the motivational conditions must do to destroy the confidence-accuracy link. For example, motivational conditions that encouraged people to express greater confidence than they otherwise might would cause the confidence cuts to move closer to c, but this would not eliminate the confidence-accuracy relation. Confidence will still tend to be calibrated to accuracy because higher confidence bins will tend to contain a greater proportion of correct responses. In fact, the relation would disappear only when reports of confidence are no longer ordinally associated with strength of subjective evidence, for example, when "absolutely confident" is placed closer to c than "slightly confident".

Measuring the strength of the confidence-accuracy relationship

What is the proper measure of the relationship between confidence and accuracy? If the signal detection model is correct, more confident responses will always be more likely to be correct than less confident responses, even when d' is close to, but not quite yet, zero and even though confidence-accuracy correlations are barely above zero. To say that confidence should not be used as a predictor of accuracy implies that a Pearson correlation is the correct measure of the strength of the relationship. In fact, we have tried to show that even very good memory (high d's) will tend to produce rather low, though significant, correlations. The reasons for low correlations depend on whether correlations are based on averages over items or on single responses (to different items within a subject or to one item per subject, as in event memory research). When correlations are based on average confidence and overall accuracy scores for individual subjects, individual differences in factors other than d' can affect the size of the correlations. Differences in the use of the confidence scale (confidence response biases) as well as differences in placement of the "yes/no" cutpoint will attenuate such correlations even though each subject's confidence and accuracy may be highly related, exactly as signal detection requires. As already noted, Pearson correlations based on single responses have the obvious statistical problems of being based on a dichotomous dependent variable (1 if correct and 0 if incorrect). In short, small confidence-accuracy correlations, even a large number of small correlations, are not sufficient evidence to claim that two response measures are weakly related, much less unrelated. As we saw in both a between- and a within-subjects experiment and in our simulation results, confidence can be remarkably well calibrated to accuracy even when confidence-accuracy correlations are low and not significant.
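The contrast between a modest response-based correlation and a steep calibration curve can be reproduced with a small simulation. This is a toy sketch under stated assumptions and not the Monte-Carlo procedure reported earlier: it uses equal-variance Gaussian evidence, c midway between the means, and confidence cutpoints spaced 0.5 evidence units apart on either side of c. All names and parameter values are hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def calibration(conf, correct):
    """Proportion correct at each confidence level."""
    totals, hits = {}, {}
    for level, ok in zip(conf, correct):
        totals[level] = totals.get(level, 0) + 1
        hits[level] = hits.get(level, 0) + ok
    return {level: hits[level] / totals[level] for level in sorted(totals)}


def simulate(d_prime=2.0, n_trials=20000, cut_width=0.5, seed=1):
    """Simulate yes/no recognition responses with 1-5 confidence ratings.

    Returns (response-based Pearson r, calibration dict).
    """
    rng = random.Random(seed)
    c = d_prime / 2.0  # assumed: criterion midway between the two means
    conf, correct = [], []
    for _ in range(n_trials):
        seen = rng.random() < 0.5
        evidence = rng.gauss(d_prime if seen else 0.0, 1.0)
        said_yes = evidence > c
        # Confidence rises with distance from c in steps of cut_width, capped at 5.
        level = min(5, 1 + int(abs(evidence - c) / cut_width))
        conf.append(level)
        correct.append(1 if said_yes == seen else 0)
    return pearson(conf, correct), calibration(conf, correct)


r, curve = simulate()
print("response-based r:", round(r, 2))
for level, proportion in curve.items():
    print("confidence", level, "proportion correct", round(proportion, 2))
```

With d' set to 2.0, a run of this sketch should yield a correlation well below .5 even though the proportion correct climbs steadily from the lowest to the highest confidence level.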

The occasional report of significant or nearly significant negative correlations between confidence and accuracy (e.g., Read et al., 1990) may seem inconsistent with the signal detection model. They are not. Negative confidence-accuracy correlations can be explained by assuming that items that the experimenter defines as seen actually produce less subjective evidence of having been seen before than items that were not seen before. Although this may seem unlikely, the use in face recognition memory of a decision task in which we ask the subject to identify the person rather than the exact slide of a face means that experimenters can alter the previously seen person to such an extent that subjects will be very confident that they have not seen the person before when they, in fact, did. Furthermore, the better the subjects learn what the person originally looked like, the greater the likelihood that they will say they are confident that they have not seen the dramatically altered individual before, even though they have. This is exactly what happened in Read et al. (1990) when they had subjects study faces of people taken at a young age and tested them with pictures taken at an older age. Negative confidence-accuracy item-based correlations were reported (for previously seen people only) because subjects tended to decide, with confidence, that slides that the experimenter claimed depicted people who were seen before were not the same individuals that were seen before. Thus, such negative correlations are more a function of what the experimenter chooses to call a previously seen slide than they are of a basic recognition memory process.

Although correlations between measures of accuracy and of confidence may not provide a reasonable picture of the relationship between confidence and accuracy, calibration curves always do. In the present instance, calibration curves tell us how much higher the odds of being correct are for responses in which people are absolutely confident, for whatever reason (such as felt pressure, need to appear confident, better learning, or metatheories of memory), than for responses in which other degrees of confidence are expressed. Clearly, from an applied point of view, this kind of information is far more informative to a jury or prosecutor than is a correlation coefficient. The latter provides no information about under- or overconfidence, whereas calibration curves do. Furthermore, we suggest that calibration curves that take no notice of subject and stimulus differences have the greatest applied utility, since we currently have little information about the distribution of types of people who serve as witnesses or about the types of faces that they are asked to identify.
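The point that a correlation coefficient is silent about under- or overconfidence can be made concrete with a small sketch. The ten responses below are hypothetical, and confidence is treated as a stated probability of being correct, which is an assumption made for illustration rather than the rating scale used in the experiments. The overconfidence score (mean stated probability minus proportion correct) is a common summary of calibration, not a statistic reported in this paper.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def overconfidence(stated_probs, outcomes):
    """Mean stated probability minus proportion correct (positive = overconfident)."""
    return mean(stated_probs) - mean(outcomes)

# Hypothetical data: ten responses, 1 = correct, 0 = incorrect.
outcomes = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
calibrated = [0.8] * 5 + [0.4] * 5            # 4 of 5 and 2 of 5 correct
inflated = [p + 0.2 for p in calibrated]      # same ordering, shifted upward

for label, ratings in (("calibrated", calibrated), ("inflated", inflated)):
    print(label,
          "r =", round(pearson(ratings, outcomes), 3),
          "overconfidence =", round(overconfidence(ratings, outcomes), 3))
```

Because a Pearson correlation is unchanged by adding a constant to every rating, the two sets of ratings yield the same correlation, yet only the first is calibrated. The calibration summary separates them.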

Metamemory

The most surprising (and perhaps most difficult to accept) finding from the present research was the evidence that people who knew how long it had been since they had seen the faces moved their most extreme confidence cutpoints further out on the evidence dimension as the retention interval increased. A number of results were consistent with this conclusion: the maximum likelihood estimates of the cutpoints, the unusual relationship between average confidence and overall accuracy in the between-subjects but not the within-subjects experiment, the large change in calibration of the "no" responses but not of the "yes" responses as d' changed, and people's predicted accuracy. Nevertheless, the relatively low number of absolutely confident false alarms in all conditions of both experiments forces us to be somewhat tentative about claiming that signal detection provides the best fit to the data. The stretch model (see Figure 4) requires that subjects generate fewer and fewer absolutely confident false alarms as the criteria expand. However, the overall rate of absolutely confident false alarms (relative to total number of responses) was generally less than 1%. For example, out of approximately 2000 responses to the not seen faces in the two week retention interval conditions in Experiment 1, only 16 were absolutely confident false alarms. Because of this, it was particularly difficult to develop a stable estimate of the exact position of the most extreme confidence cutpoints. Despite this technical difficulty, we believe, at the very least, that the evidence supports the conclusion that the "yes - absolutely confident" cutpoint is not automatically dragged down in "lock-step" fashion with the "yes/no" criterion as d' gets smaller and smaller.
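The difference between a "lock-step" and a stretched top cutpoint can be sketched numerically. The sketch assumes that evidence for not seen faces is unit normal with mean zero and that the "yes/no" criterion sits at d'/2. The two cutpoint rules and their parameters (`offset`, `base`, `d_ref`, `gain`) are hypothetical stand-ins chosen for illustration; they are not the fitted values from the maximum likelihood analysis.

```python
from statistics import NormalDist

# Evidence distribution for faces never studied (assumed unit normal).
NOT_SEEN = NormalDist(0.0, 1.0)

def confident_fa_rate(top_cutpoint):
    """P(a not-seen face exceeds the 'yes - absolutely confident' cutpoint)."""
    return 1.0 - NOT_SEEN.cdf(top_cutpoint)

def lock_step_cutpoint(d_prime, offset=1.5):
    """Top cutpoint stays a fixed distance above the yes/no criterion d'/2."""
    return d_prime / 2.0 + offset

def stretched_cutpoint(d_prime, base=2.5, d_ref=2.0, gain=0.5):
    """Top cutpoint moves outward as d' falls below a reference value."""
    return base + gain * (d_ref - d_prime)

for d_prime in (2.0, 1.0, 0.5):
    lock = confident_fa_rate(lock_step_cutpoint(d_prime))
    stretch = confident_fa_rate(stretched_cutpoint(d_prime))
    print(f"d' = {d_prime}: lock-step FA = {lock:.4f}, stretched FA = {stretch:.4f}")

# Cutpoint implied by 16 absolutely confident false alarms in about 2000
# responses to not-seen faces, under the same unit-normal assumption.
implied_cutpoint = NOT_SEEN.inv_cdf(1.0 - 16 / 2000)
print(f"implied top cutpoint = {implied_cutpoint:.2f} SD above the not-seen mean")
```

Under lock-step movement the rate of absolutely confident false alarms rises as d' falls, whereas under the stretched rule it falls. With rates this far out in the tail, a handful of responses shifts the implied cutpoint noticeably, which is why stable estimates were difficult to obtain.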

That the confidence criteria may not move in lock-step with c as d' changes suggests some degree of independence between confidence and accuracy. Such independence raises the possibility that variables that affect "yes/no" response bias may not be the same as those that control where subjects place their confidence cutpoints. For example, individual differences may express themselves differently in terms of d', placement of c, and confidence criteria. Self-confident individuals may be willing to express higher confidence with weaker evidence but be no different in d' values than less self-confident people. Alternatively, instructions to avoid false alarms may decrease the tendency of subjects to "guess" yes, while the social setting may increase the subject's desire to appear confident. Our findings suggest that people's beliefs about how their own memories are affected by various learning and test conditions may be another factor in the relationship between confidence and accuracy. Such forms of independence between confidence and accuracy have little to do with whether confidence is diagnostic of the accuracy of recognition responses, however. If the signal detection model is correct, as the calibration results from both the experimental and simulation data confirm, higher confidence recognition responses will always have a higher probability of being correct than less confident recognition responses (unless d' is zero). Furthermore, if subjects adjust their confidence criteria to correct for other beliefs about their memories, and those beliefs underestimate how good their memories are, the adjustments that they make in confidence will, in general, reduce the rate of absolutely confident false alarms, a tendency that runs counter to the claim that witnesses are positively identifying innocent suspects at a high rate.

If the metatheory explanation for the conservative shift in confidence is correct, it raises several interesting issues. First, what are appropriate methods of assessing the accuracy of such metatheories? For example, a number of researchers (Cutler, Penrod, & Dexter, 1990; Deffenbacher & Loftus, 1982; Loftus, 1983; Rahaim & Brodsky, 1982; Seltzer, Lopes, & Venuti, 1990; Wells & Leippe, 1981; Yarmey & Jones, 1983) argue that jurors have a poor understanding of how eyewitness memory actually works and will frequently draw incorrect conclusions about the accuracy of eyewitness testimony unless experts educate them. In particular, jurors supposedly rely too heavily on a witness's confidence in estimating the accuracy of the witness. Generally this kind of conclusion is supported by evidence showing that jurors "incorrectly" use a witness's confidence to predict accuracy in situations in which the correlation between confidence and accuracy is small. But, if the signal detection analysis is correct, it may not be the jurors who are overweighting the diagnosticity of confidence but the researchers who are underweighting it. The real issue, in this context, is the relative diagnosticity of the different types of information typically available to jurors. This is never measured. What is needed, if this kind of evidence is to be used to assess the accuracy of jurors' metatheories, is a comparison of the accuracy of expert predictions with those of jurors across a wide range of witnessing conditions and testimonies. Until such evidence is presented, it seems premature to argue that jurors need to be educated by experts just because the former make mistakes and the latter argue that their theories are better.

Although the metatheory idea (in a signal detection framework) provides a satisfactory explanation for the pattern of the data from these experiments, it would be a mistake to conclude from the current results that the signal detection plus meta-memory model provides the only, or even the best, theoretical description of the relationship between confidence and accuracy in face recognition. Despite the considerable history relating confidence to signal detection (e.g., Clarke, 1960), and despite both empirical and simulation results that were consistent with signal detection, other explanations are possible. One is to recast the recognition task into a form consistent with the wave theory of similarity (Link, 1992). In this view, increasingly conservative confidence estimates would not necessarily be the result of metatheories about one's own memory. Rather, they would reflect the amount of time that it takes to build sufficient information for a decision to be made, as well as the amount of evidence the subjects thought was necessary before deciding. The effect of retention interval on confidence could then result either from a weakening of the rate at which the evidence that items were seen before builds over time or from an adjustment in the decision criteria. The real advantage of wave theory is not that it provides a non-metatheory explanation for the confidence shifts, but rather that it directs attention to the time that subjects take to decide (e.g., Chance & Goldstein, 1987; Henmon, 1911; Sporer, 1993, 1994), as well as to the confidence that they express in those decisions.

Regardless of the relative merits of wave theory over signal detection, both would agree that confidence and accuracy are highly related, that the use of correlation coefficients to assess the strength and nature of that relationship is totally inadequate, that calibration curves provide a more useful picture of the relationship, and that, although correct in spirit, the optimality hypothesis directs attention away from important theoretical and applied issues that are made obvious by a signal detection analysis.


References

 

Bothwell, R. K., Deffenbacher, K. A., & Brigham, J. C. (1987). Correlation of eyewitness accuracy and confidence: Optimality hypothesis revisited. Journal of Applied Psychology, 72, 691-695.

Brigham, J. C. (1990). Target person distinctiveness and attractiveness as moderator variables in the confidence-accuracy relationship in eyewitness identifications. Basic and Applied Social Psychology, 11, 101-115.

Clarke, F. R. (1960). Confidence ratings, second-choice responses and confusion matrices in intelligibility tests. Journal of the Acoustical Society of America, 32, 35-46.

Chance, J. E., & Goldstein, A. G. (1987). Retention interval and face recognition: Response latency measures. Bulletin of the Psychonomic Society, 25(6), 415-418.

Cutler, B. L., Penrod, S. D., & Dexter, H. R. (1990). Juror sensitivity to eyewitness identification evidence. Law and Human Behavior, 14, 185-191.

Deffenbacher, K. A. (1980). Eyewitness accuracy and confidence: Can we infer anything about their relationship? Law and Human Behavior, 4, 243-260.

Deffenbacher, K. A., & Loftus, E. F. (1982). Do jurors share a common understanding concerning eyewitness behavior? Law and Human Behavior, 6, 15-30.

Donaldson, W., & Murdock, B. B. (1968). Criterion change in continuous recognition memory. Journal of Experimental Psychology, 76, 325-330.

Dorfman, D. D., & Alf, E., Jr. (1969). Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals-rating-method data. Journal of Mathematical Psychology, 6, 487-496.

Egan, J. P., Schulman, A. I., & Greenberg, G. Z. (1959). Operating characteristics determined by binary decisions and by ratings. Journal of the Acoustical Society of America, 31, 768-773.

Fleet, M. L., Brigham, J. C., & Bothwell, R. K. (1987). The confidence-accuracy relationship: The effects of confidence assessment and choosing. Journal of Applied Social Psychology, 17(2), 171-187.

Henmon, V. A. C. (1911). The relation of time of a judgment to its accuracy. Psychological Review, 18, 186-201.

Kassin, S. M., Ellsworth, P. C., & Smith, V. L. (1989). The "general acceptance" of psychological research on eyewitness testimony: A survey of the experts. American Psychologist, 44(8), 1089-1098.

Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? The calibration of probability judgments. Organizational Behavior and Human Performance, 20, 159-183.

Lindsay, D. S., & Johnson, M. K. (1989). The eyewitness suggestibility effect and memory for source. Memory and Cognition, 17, 349-358.

Lindsay, R. C. (1986). Confidence and accuracy of eyewitness identification from lineups. Law and Human Behavior, 10, 229-239.

Lindsay, R. C., Wells, G. L., & O'Connor, F. J. (1989). Mock-juror belief of accurate and inaccurate eyewitnesses: A replication and extension. Law and Human Behavior, 13(3), 333-339.

Link, S. W. (1992). The wave theory of difference and similarity. Hillsdale, NJ: Lawrence Erlbaum Associates.

Loftus, E. F. (1979). Eyewitness testimony. Cambridge, MA: Harvard University Press. 

Loftus, E. F. (1983). Silence is not golden. American Psychologist, 38, 564-572.

Luus, C. A. E., & Wells, G. L. (1994). Eyewitness identification confidence. Cambridge University Press, New York, NY, US. 

Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. New York: Cambridge University Press.

Murdock, B. B., Jr. (1980). Short-term recognition memory. In R. S. Nickerson (Ed.), Attention and performance VIII (pp. 497-519). Hillsdale, N.J.: Lawrence Erlbaum Associates, Inc.

Neil v. Biggers, 409 U.S. 188 (1972).

Nelson, T. O., & Dunlosky, J. (1991). When people's judgments of learning (JOLs) are extremely accurate at predicting subsequent recall: The "delayed-JOL effect." Psychological Science, 2, 267-270.

Noreen, D. L. (1981). Optimal decision rules for some common psychophysical paradigms. Mathematical Psychology and Psychophysiology, 13, 227-279.

Ogilvie, J. C., & Creelman, C. D. (1968). Maximum-likelihood estimation of receiver operating characteristic curve parameters. Journal of Mathematical Psychology, 5, 377-391.

Peirce, C. S., & Jastrow, J. (1884). On small differences in sensation. Memoirs of the National Academy of Sciences, 3, 73-83.

 

Penrod, S. D., & Cutler, B. L. (1987). Assessing the competency of juries. In I. Weiner & A. Hess (Eds.), The handbook of forensic psychology. New York: John Wiley & Sons.

Rahaim, G. L., & Brodsky, S. L. (1982). Empirical evidence versus common sense: Juror and lawyer knowledge of eyewitness accuracy. Law and Psychology Review, 7, 1-15.

Ratcliff, R., Sheu, C., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Read, J., Vokey, J., & Hammersley, R. (1990). Changing photos of faces: Effects of exposure duration and photo similarity on recognition and the accuracy-confidence relationship. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 870-882.

Seltzer, R., Lopes, G. M., & Venuti, M. (1990). Juror ability to recognize the limitations of eyewitness identifications. Forensic Reports, 3, 121-137.

Shapiro, P. N., & Penrod, S. (1986). Meta-analysis of facial identification studies. Psychological Bulletin, 100, 139-156.

Smith, V. L., Kassin, S. M., & Ellsworth, P. C. (1989). Eyewitness accuracy and confidence: Within- versus between-subjects correlations. Journal of Applied Psychology, 74, 356-359.

Sporer, S. L. (1993). Eyewitness identification accuracy, confidence, and decision times in simultaneous and sequential lineups. Journal of Applied Psychology, 78, 22-33.

Sporer, S. L. (1994). Decision times and eyewitness identification accuracy in simultaneous and sequential lineups. Cambridge University Press, New York, NY, US.

Swets, J. A., & Pickett, R. M. (1982). Evaluation of diagnostic systems: Methods from signal detection theory. New York: Academic Press. 

Wells, G. L., & Leippe, M. R. (1981). How do triers of fact infer the accuracy of eyewitness identifications? Using memory for peripheral detail can be misleading. Journal of Applied Psychology, 66, 682-687.

Wells, G. L., & Lindsay, R. C. L. (1980). On estimating the diagnosticity of eyewitness identifications. Psychological Bulletin, 88, 776-784.

Wells, G. L., & Lindsay, R. C. L. (1985). Methodological notes on the accuracy-confidence relation in eyewitness identifications. Journal of Applied Psychology, 70, 413-419.

Wells, G. L., & Murray, D. (1984). Eyewitness confidence. In G. Wells and E. F. Loftus (Eds.), Eyewitness testimony: Psychological perspectives. Cambridge: Cambridge University Press.

Wickelgren, W. A., & Norman, D. A. (1966). Strength models and serial position in short-term recognition memory. Journal of Mathematical Psychology, 3, 316-347.

Wixted, J., & Ebbesen, E. B. (1991). The mathematics of forgetting functions. Psychological Science, 2, 409-415.

Wixted, J., & Ebbesen, E. B. (1995). A detection analysis of face recognition memory. Paper presented at the 36th meeting of the Psychonomic Society, Los Angeles.

Yarmey, A. D. (1979). The psychology of eyewitness testimony. New York: The Free Press.

Yarmey, A. D., & Jones, H. (1983). Is the study of eyewitness identification a matter of common sense? In S. Lloyd-Bostock and B. Clifford (Eds.), Evaluating eyewitness evidence. New York: Wiley.

 


Author Notes

 

We would like to express our thanks to three very hard-working undergraduates, Roger Boucher, Claudia Mendias Canale, and Joanna Adler, for assistance in running the experiments reported here. Stephen W. Link provided some very insightful comments on an earlier draft that we believe substantially improved this paper. Some of the results from Experiment 1 were reported in a previous paper by us. This research was supported by the University of California, San Diego. Reprint requests may be sent to either author at the Department of Psychology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0109 or via email to eebbesen@ucsd.edu or jwixted@ucsd.edu.


Footnotes



[1]The personal experience of one of us as an expert witness in criminal trials is consistent with the claim that most eyewitness experts believe that confidence and accuracy are unrelated. Virtually every defense expert that he has heard testify claims that confidence in and accuracy of identifications are unrelated.

[2]Most of the evidence concerning the relationship between confidence and accuracy has come from two different procedures: face recognition memory experiments and event memory studies in which participants witness an event and then attempt to identify an individual in the event from a lineup. In this report we focus on the face recognition procedure, although a complete analysis of the relationship between confidence and accuracy in eyewitness memory will eventually have to include evidence from both procedures.

[3]Familiarity is frequently assumed to be the underlying subjective dimension on which the “yes/no” decision criterion is based. However, this is arbitrary because the signal detection model only requires that some single continuous subjective dimension that reflects the difference between seen and unseen faces be used and feelings-of-familiarity is only one such dimension. Other possibilities include average strength of features, or degree of match between a recalled face and the presented item, or even just “strength of evidence” (Link, 1992).

[4]Although it will prove to be an important issue later, the representation in Figure 1 also assumes that increasing the duration of exposure to faces does not affect the variance of the distribution of strength of evidence values for the seen items. The optimality of the conditions only affects the mean of the distribution.

[5]This prediction should be especially true for responses to the items that have been seen before.

[6]The exact form of calibration will depend, however, on where the decision criterion is placed.

[7]Parts of the accuracy data from this experiment were reported in a previous paper (Wixted & Ebbesen, 1991).

[8]d' was computed for each subject according to the following: d' = z(#hits/seen) - z(#false alarms/not seen). Because six subjects produced no false alarm responses, their d's could not be directly computed. To overcome this difficulty, we estimated their false alarm rate to be .01 rather than zero.
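The computation in this footnote can be sketched as follows. This is a minimal illustration, not the analysis code used for the experiments; the function name, argument names, and the example counts are hypothetical, and z is taken to be the inverse of the standard normal cumulative distribution function.

```python
from statistics import NormalDist

def d_prime(hits, n_seen, false_alarms, n_not_seen, fa_floor=0.01):
    """d' = z(hit rate) - z(false alarm rate).

    A false alarm rate of zero is replaced by fa_floor (.01), as in the
    footnote. A hit rate of exactly 0 or 1 is not handled here and would
    raise an error from inv_cdf, since the footnote describes no rule for it.
    """
    z = NormalDist().inv_cdf           # inverse standard normal CDF
    hit_rate = hits / n_seen
    fa_rate = false_alarms / n_not_seen
    if fa_rate == 0:
        fa_rate = fa_floor
    return z(hit_rate) - z(fa_rate)

# Hypothetical subjects: 30 hits out of 40 seen faces in both cases.
print(round(d_prime(30, 40, 10, 40), 3))   # some false alarms
print(round(d_prime(30, 40, 0, 40), 3))    # no false alarms, floor applied
```

The choice of .01 as the replacement value bounds d' for subjects with no false alarms; a smaller floor would give those subjects larger d' values.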

[9]Although we could have computed 8 different ROC curves, one for each condition in the experiment, we felt, for reasons that will become clear later, that showing the results for the retention intervals, collapsed over duration, was sufficient.

[10]It is important to note that these least square fits are based on dependent cumulative proportions and therefore should not be used as true goodness of fit indicators.

[11]This differential slope result is completely consistent with recent claims made by Ratcliff, Sheu, and Gronlund (1992) regarding recognition memory for items other than faces.

[12]When one unusual outlying face was removed from the analysis, the correlation increased to .617!

[13]It is important to note that this argument implies that signal detection reasoning can be generalized from item strength distributions assumed to be in the heads of single individuals to subject distributions for single items.

[14]When the mean signed confidence and mean ln p/(1-p) data were analyzed in the same manner as "raw" confidence and proportion correct, the identical pattern of results was obtained. The two longest retention interval conditions both produced data beyond the 95% confidence intervals resulting from a least-squares linear fit of the remaining six conditions (r^2 = .993, F(1, 4) = 609.98, p < .0001). In addition, in support of the Peirce and Jastrow (1884) equation, the intercept of this linear function was not significantly different from zero (i.e., -.02 with a standard error of .059, t = -.4).

[15]When we constructed individual calibration curves for each subject's yes responses, the mean slope over all 195 subjects was not only reliably different from zero (t(194) = 12.435, p < .0001), but its size, a change of .093 in proportion correct per unit of confidence, also suggested a high degree of average calibration. Setting "just guessing" at 50% correct, this average slope would put "absolutely confident" at 87.2% correct. These results indicate that the group calibration results were not driven by a few highly calibrated subjects existing in a sea of error.

[16]Like the yes responses, the average slope of the individual calibration curves for each subject's no responses was .112 (t(194) = 16.65, p < .0001).

[17]The F values were virtually the same for d' scores.

[18]Because each subject only experienced one set of not seen slides, the ROC curves for the different retention intervals were generated by using the same false alarm data for each curve. In addition, as in the between subjects experiment, these data were based on the cumulative sum of the frequencies of each response type over all subjects. For both of these reasons the goodness-of-fit tests are presented for information only without associated p-values.

[19]When the data from the seen slides from Experiment 1 were analyzed separately to compare to the results from Experiment 2, the pattern of results was unchanged.

[20]An identical pattern emerged when we computed mean signed confidence and ln p/(1-p) for the different conditions. In addition, although the confidence bands were somewhat larger in this experiment than in Experiment 1, visual inspection of the figure shows that the pattern of results in no way suggested that longer retention intervals were accompanied by a greater drop in confidence than in accuracy.

[21]Because our maximum likelihood fitting routine was limited in the number of free parameters that could be fit at one time, we were forced to collapse over several of the confidence categories. However, no matter what method we used to constrain the number of free parameters, the results of the fits were all consistent with the conclusions reached here. In addition, when the position of each of the cutpoints was directly estimated using the "root mean square" procedure outlined in Macmillan and Creelman (1991), the pattern of results was virtually identical to that produced by the maximum likelihood procedure.

[22]It should be clear, however, that were we to use d' to assess the diagnosticity of confidence, we would need to have some method of measuring a particular witness's d' or at least be able to narrow the range of expected d's to something that would improve the jurors' ability to infer accuracy beyond what they might do without our help.