Ebbe B. Ebbesen and
John T. Wixted
University of
California, San Diego
Working Draft
Revision
Signal
detection analyses of two face recognition memory experiments and Monte-Carlo
simulations indicated, contrary to the beliefs of most eyewitness memory
experts, that confidence and accuracy are intimately related. The results
showed that although correlations between confidence and accuracy tend to be
low (rarely above .5), higher mean confidence is associated with higher d's,
confidence for correct responses is higher than for incorrect ones, ROC curves
based on confidence are linear, and confidence is well calibrated to accuracy.
Maximum likelihood estimates of confidence cutpoints along with data from a
third study in which subjects predicted their accuracy indicated that subjects'
expectations of greater than actual memory loss result in reduced confidence in
recognition responses.
According to a
survey by Kassin, Ellsworth, and Smith (1989), eyewitness memory
experts seem to agree that the relationship between the accuracy of eyewitness
identifications and the degree of confidence that eyewitnesses express in their
identifications is, at best, a rather weak one.[1]
This consensus seems to be based, in part, on the fact that in studies of
recognition memory for human faces, the correlation between accuracy and
confidence is generally small.[2]
Although the size of these correlations varies widely over studies, many are not
significant, and those that are, rarely exceed .5 (Brigham, 1990; Bothwell, Deffenbacher, & Brigham,
1987; Deffenbacher, 1980; Fleet, Brigham, & Bothwell, 1987; Lindsay, 1986;
Shapiro & Penrod, 1986; Smith, Kassin, & Ellsworth, 1989; Wells &
Murray, 1984; Wells & Lindsay, 1980; Wells & Lindsay, 1985). In the
majority of the correlations reported in the face-memory literature, individual
differences in memory (and on occasion differences in item memorability, e.g., Read, Vokey, & Hammersley, 1990; Smith, Kassin,
& Ellsworth, 1989) provide the source of variation in both accuracy and
confidence (Deffenbacher, 1980; Shapiro & Penrod, 1986).
A second aspect
of the consensus of opinion reported in Kassin, Ellsworth, and Smith (1989) is
the belief that other factors are better predictors of accuracy than
confidence. For example, a large majority of experts who agreed that the
relationship between confidence and accuracy is weak, also agreed that both
duration of exposure and retention interval were strongly related to eyewitness
accuracy. Although never stated, presumably these beliefs are based on results
from experimental studies in which manipulations of duration and/or retention
interval affected mean accuracy across different conditions (Shapiro &
Penrod, 1986) and in which the correlations between people’s confidence and
their accuracy are small.
The fact that
beliefs in the lack of relationship between confidence and accuracy are based
on individual differences while virtually all of the theoretical claims about
the nature of eyewitness memory are based on the effects that changes in the
learning and memory context have on group-average accuracy measures raises an
important issue that has been all but ignored in the literature (with one
notable exception, Wells & Lindsay, 1985). Namely, how should we define and
measure the "relationship" between confidence and accuracy? For
example, little research has asked what the relationship between confidence and
accuracy might be when measured not over individual (nor even item) differences
in memory but rather over those conditions that reliably produce average
differences in the accuracy of recognition, such as retention interval. Do
conditions that tend to produce low accuracy also produce low confidence
estimates? In addition, we know of no attempt to assess the strength of the
relationship in terms of different item effects over individuals. That is, are
faces that most people remember also those faces in which most people have high
confidence? Finally, although reports of correlations between confidence and
accuracy abound, researchers have not measured the extent to which the accuracy
of identifications is calibrated to (not correlated with) confidence estimates,
despite the fact, as we shall discuss later, that this issue may be the most
important from an applied point of view. One reason eyewitness experts may have
failed to consider how best to measure the strength of the confidence-accuracy
relationship is that they have not offered a precise mechanism to explain
how confidence and accuracy might be related.
The present
research has several related purposes. At one level we investigate whether
factors that are known to affect memory in specific and well-known ways might
not also affect average confidence in the same manner as accuracy (even if the
correlations over individuals and/or items are small). We also ask whether
these same manipulations affect the extent to which people's confidence
estimates are calibrated to their accuracy. Finally, we take the position that
signal detection provides a satisfactory and empirically useful descriptive
mechanism for all of the ways of measuring how confidence and accuracy
might be related. After all, since the seminal work of Henmon (1911) and Peirce
& Jastrow (1884), the idea that accuracy of judgment would be related to
confidence in those judgments has been a major part of experimental psychology (Link, 1992), even of recognition memory experiments,
but for stimuli other than faces (Donaldson & Murdock, 1968; Murdock, 1980;
Ratcliff, Sheu, & Gronlund, 1992; Wickelgren & Norman, 1966).
In eyewitness
memory research two different types of explanations have been offered for the
low confidence-accuracy correlations. One focuses on factors that may affect
one measure and not the other, for example, social pressure to appear confident
or pressures to choose someone even when memory is weak (Lindsay & Johnson, 1989). The other
explanation, originally proposed by Deffenbacher (1980) but since accepted by others
(e.g., Bothwell, Deffenbacher, & Brigham, 1987; Luus and Wells, 1994),
assumes that the correlation between confidence and accuracy is only high when
the information-processing conditions present at encoding, during memory
storage, and at the memory test are optimal. Deffenbacher's optimality hypothesis has the advantage
of explaining both the generally low correlations (learning and memory
conditions are sub-optimal in most experiments, not to mention actual crime
situations) as well as their wide range (experiments differ in how optimal
their conditions are). Unfortunately, neither of these explanations provides a
model of the way confidence estimates are generated (nor do they explain why so
many non-experts, including the Supreme Court of the United States in Neil v.
Biggers, seem to believe that confidence and accuracy are strongly related
despite their apparent weak empirical relationship).
By placing the
recognition memory task within a signal detection framework, something that is
quite commonly done (Shapiro & Penrod, 1986), it is possible to derive the
optimality hypothesis and to make some additional, but previously ignored,
predictions about factors that may affect the size of the confidence-accuracy
relationship. In addition, a signal detection analysis tends to emphasize the
importance of measures of the strength and nature of the confidence-accuracy
relationship other than correlation coefficients; measures that suggest that
confidence and accuracy are intimately related even when correlations are low.
Figure 1 shows a signal detection analysis of a face memory experiment in which
subjects are initially exposed to faces for two different durations of exposure
and then tested for memory in a “yes/no” recognition task. At the time of the
recognition memory task, items are assumed to have different values on a
dimension that reflects the strength of subjective evidence that the item was
seen before. In addition, those items that actually have been seen before have,
on average, stronger evidence values than those items that have not been seen
before.[3]
Most
applications of signal detection assume that item strengths within both the
seen and unseen sets are distributed normally; although logistic distributions
are also frequently assumed in other judgment domains (Macmillan & Creelman, 1991). As learning and
test conditions improve, the mean strength of the seen items is assumed to
increase and the entire distribution of strengths for the seen items shifts
towards higher strength values[4].
The signal detection model also assumes that subjects decide whether to say
that they recognize a given item by adding a decision mechanism to the
underlying strength of evidence system. If a given item’s strength is above a
criterion value, the subject identifies the item as having been seen before.
Overall accuracy (measured by proportion of correct responses) depends both on
the placement of the decision criterion and on the distance between the means
of the seen and not seen item distributions (measured as d'). As the distance
between the distributions decreases, more seen and unseen items will have
similar, and therefore indistinguishable, strength of evidence values causing
an increase in the number of errors. In
addition, subjects who place their “yes/no” decision criteria high up on the
subjective strength dimension (all other things equal) will tend to correctly
reject most of the previously unseen items but will also tend to miss many of
the items they have seen before. Conversely, subjects who place their “yes/no”
criteria low on the strength dimension will tend to make many false alarms as
well as to correctly identify most of the previously seen items.
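The dependence of hits, false alarms, and overall accuracy on d' and on the placement of the "yes/no" criterion can be illustrated with a minimal sketch. It assumes an equal-variance normal model (unseen items centered at 0, seen items at d', both with unit variance) and a test list with equal numbers of seen and unseen items; the function name and parameter values are illustrative and are not taken from the experiments reported here.

```python
from statistics import NormalDist


def yes_no_rates(d_prime, criterion):
    """Hit rate, false alarm rate, and proportion correct for one criterion.

    Assumption (illustrative only): unseen strengths ~ N(0, 1) and seen
    strengths ~ N(d', 1), with half of the test items previously seen.
    """
    unseen = NormalDist(0.0, 1.0)
    seen = NormalDist(d_prime, 1.0)
    hit_rate = 1.0 - seen.cdf(criterion)            # seen items above the criterion
    false_alarm_rate = 1.0 - unseen.cdf(criterion)  # unseen items above the criterion
    proportion_correct = 0.5 * hit_rate + 0.5 * (1.0 - false_alarm_rate)
    return hit_rate, false_alarm_rate, proportion_correct


if __name__ == "__main__":
    # A low and a high criterion at the same d': the high criterion trades
    # false alarms for misses.
    for criterion in (0.0, 0.5, 1.5):
        print(criterion, yes_no_rates(1.0, criterion))
```

With d' held constant, raising the criterion lowers both the hit rate and the false alarm rate, which is the trade-off described in the preceding paragraph.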

Figure 1. The figure shows how signal
detection represents the effect of increasing the duration of exposure on
recognition memory performance. The strengths of evidence associated with not
seen and seen items are assumed to be normally distributed over each item type,
although it is likely that the variance of the seen items increases as d'
increases. In each case confidence estimates are treated as additional cutpoints
on the strength of evidence dimension where 1 is "just guessing" and 5
is "absolutely confident" that an item was either seen or not seen
before. The optimal placement of the "yes/no" decision criterion
moves to higher strength of evidence values as d' increases and presumably the
confidence criteria move as well. Exactly how the confidence criteria are
likely to move is unknown; like the "yes/no" criterion, they are probably
under the control of motivational variables.
Confidence is
easily added to this model by assuming that subjects apply a second set of
decision criteria to the strength dimension for the confidence ratings (Egan, Schulman, & Greenberg, 1959; Macmillan
& Creelman, 1991; Donaldson & Murdock, 1968; Wickelgren & Norman,
1966). In this view, each level of confidence that an item had or had not been
seen before is a band on the strength dimension. Thus, very high strengths
result in a “yes, I am absolutely confident (I have seen the face
before)” response, while moderate values might produce “yes, I am just
guessing” responses. Similarly, a moderately low strength item might result in
a “no, I am moderately confident (I haven't seen the face before)” response.
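The idea that each confidence level is a band on the strength dimension can be sketched as a simple lookup. The sketch below assumes, purely for illustration, that the bands are equally wide and placed symmetrically around the "yes/no" criterion; the band width `step` and the function name are hypothetical, and the model itself does not require either assumption.

```python
from bisect import bisect_right

# Labels of the 5-point confidence scale, from level 1 to level 5.
LABELS = [
    "just guessing",
    "slightly confident",
    "moderately confident",
    "highly confident",
    "absolutely confident",
]


def rate_item(strength, yes_no_criterion, step=0.5):
    """Return (response, confidence level 1-5) for one item's strength.

    Assumption (illustrative only): confidence bands are `step` units wide
    and symmetric around the yes/no criterion.
    """
    distance = abs(strength - yes_no_criterion)
    cutpoints = [step * k for k in range(1, 5)]  # boundaries between the 5 bands
    confidence = bisect_right(cutpoints, distance) + 1
    response = "yes" if strength > yes_no_criterion else "no"
    return response, confidence


if __name__ == "__main__":
    for strength in (3.0, 0.6, -0.7):
        response, level = rate_item(strength, yes_no_criterion=0.5)
        print(f"{strength:+.1f}: {response}, {LABELS[level - 1]}")
```

Under these assumed cutpoints a very strong item yields "yes, absolutely confident", an item just above the criterion yields "yes, just guessing", and a moderately weak item yields "no, moderately confident".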
We can see, in
an intuitive fashion, how the above model might explain the optimality effect
by examining Figure 1. In sub-optimal (short duration) learning and memory
conditions, i.e., ones in which d' is small, correct and incorrect responses
will tend to be assigned to similar confidence estimates because so many seen
items will have values on the subjective dimension that are close to those for
the unseen items. Thus, a significant proportion of the highly confident “yes”
responses will be false alarms and, similarly, a significant proportion of the
highly confident “no” responses will be misses. That is, many highly confident
responses will be errors. In contrast, when the learning conditions are optimal
and the seen and not seen distributions are widely separated on the strength
dimension, virtually all of the highly confident yes and no responses will be
correct and only the lower confidence responses will tend to contain a high
proportion of errors. Thus, as the learning conditions improve, we would expect
more and more of the highly confident responses to be correct.[5]
One feature of
the model depicted in Figure 1 about which there is theoretical uncertainty is
how changes in learning and memory conditions will affect, if at all, the
placement of the confidence criteria. Figure 1 describes a model in which the
confidence criteria move in lock-step
with changes in the "yes/no" decision criteria; however, there are
other reasonable possibilities. For example, the confidence criteria might
remain fixed on the evidence scale or they might expand and contract to fill
the range of strength of evidence values. Still another possibility is that
people might adjust their confidence so as to maintain fixed likelihoods that
the responses at each level of confidence are correct (Wixted and Ebbesen,
1995). We shall postpone further discussion of the consequences of this
important issue until we have examined some empirical results.
Regardless of
the particulars of the decision model, in order to generalize signal-detection
reasoning to the different measures of the confidence-accuracy relation one
needs to consider the ways in which different sources of variability might be
expressed in the signal detection paradigm. Because two measures,
"yes/no"-identification and confidence, are involved and because
signal detection allows for independence between the decision criterion and the
discrimination (d') components of the model, it is possible that different
measures might be more influenced by one or another part of the system. For
example, in the case of correlations between confidence and accuracy, were we
to assume that individual differences in performance are due mostly to the
difference between the seen and unseen distributions, i.e., d', it follows that
individual difference correlations between accuracy and confidence based on
averages over items should be high unless the learning and test conditions are
so poor that no one performs much better than chance. This is because all of
the individual difference variance in accuracy and in confidence would be the
result of individual difference variance in d' and, as can be seen in Figure 1,
as d' increases, so do both accuracy and mean confidence. On the other hand,
differences between individuals may be more a function of the placement of
their decision criteria, or even the location of their confidence cutpoints. In
these cases, individual difference correlations might be considerably
attenuated because subjects with identical d's might place their criteria in
different locations.
In contrast,
when an item-based correlation is computed for each subject from the accuracy
of that subject's "yes" and "no" responses and that
subject's confidence in those responses, individual differences in where
subjects place their criteria would be largely irrelevant because each
correlation would be for a single subject and that subject's decision criteria
would be assumed to remain the same over all of the items used in computation
of the correlation. Similarly, if differences between faces are the result of
differences in the ease with which subjects can learn and remember them (e.g.,
some are more distinctive than others), signal detection analysis would again
expect the confidence and accuracy correlation to be high when computed over
faces because all of the variation in both confidence and accuracy would be
d'-produced.
Even if neither
individual nor face differences in recognition accuracy are due exclusively to
differences in d' and correlations between confidence and accuracy are
therefore low, signal detection analysis still makes the strong prediction that
the mean confidence that people express (over items) should be controlled by
the same variables that affect the learning and memory of those items, that is,
accuracy of recognition. Stated differently, mean confidence (averaged over
subjects and items) and total proportion correct should be strongly related
across learning and test conditions because all of the condition variance
should be captured by the difference in the seen and unseen distributions,
i.e., d', and all of the individual differences in other parts of the model
should average out (provided, of course, that the learning and memory variables
do not also independently affect response bias and/or placement of the
confidence criteria). Thus, even though individual difference correlations
between confidence and accuracy might be relatively small, mean confidence and
overall accuracy should be strongly related across different learning and
memory conditions. In fact, as we shall see, signal detection places specific
constraints on the form of that relationship.
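The prediction that mean confidence and overall accuracy should rise together across conditions can be checked numerically under stated assumptions. The sketch below assumes equal-variance normal strength distributions, an unbiased "yes/no" criterion midway between them, and confidence bands of fixed width `step` placed symmetrically around that criterion. None of these values come from the data; they only illustrate the direction of the prediction.

```python
from statistics import NormalDist


def expected_accuracy_and_confidence(d_prime, step=0.5, n_levels=5):
    """Expected proportion correct and mean confidence (1..n_levels).

    Assumptions (illustrative only): equal-variance normal strengths, an
    unbiased criterion at d'/2, and symmetric confidence bands `step` wide.
    By symmetry the distance of an item from the criterion has the same
    distribution for seen and unseen items, so one calculation covers both.
    """
    phi = NormalDist().cdf
    mu = d_prime / 2.0  # distance of each distribution's mean from the criterion
    accuracy = phi(mu)
    mean_confidence = 0.0
    for level in range(1, n_levels + 1):
        low = step * (level - 1)
        if level < n_levels:
            high = step * level
            # P(low <= |y| < high) for y ~ N(mu, 1)
            prob = (phi(high - mu) - phi(low - mu)) + (phi(-low - mu) - phi(-high - mu))
        else:
            # The highest confidence band is open-ended.
            prob = (1.0 - phi(low - mu)) + phi(-low - mu)
        mean_confidence += level * prob
    return accuracy, mean_confidence


if __name__ == "__main__":
    for d_prime in (0.5, 1.0, 1.5, 2.0):
        print(d_prime, expected_accuracy_and_confidence(d_prime))
```

Because both quantities are driven by the same separation between the distributions, they increase together as d' increases, even though nothing in the calculation refers to individual differences.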
The signal
detection analysis also makes several strong predictions about the probability
that responses will be correct at each level of confidence, or stated
differently, how well calibrated confidence is to accuracy. For example, the
model assumes that the "yes" and "no" responses at given
levels of confidence tell us about different parts of the seen and not seen
distributions. Thus, when we compute the relative probability of
"yes" responses to seen compared to not seen faces (i.e., hits and
false alarms) conditioned on particular values of confidence, signal detection
assumes that these conditional probabilities reflect the relative areas under
the seen and not seen distributions that are in each confidence bin. The same
is true for "no" responses. With this in mind, examination of panel b
in Figure 1 suggests that the probability that a "yes" response is
correct should increase as the level of confidence increases because the area
under the seen distribution increases relative to the area under the not seen
curve as we move from lower to higher confidence bins. A similar pattern should
be found for the "no" responses but with the area under the not seen
distribution increasing faster, relative to the area under the seen
distribution, as confidence increases. The model in Figure 1 also predicts that
the conditional probability of yes and of no responses being correct at each
confidence level should be higher in optimal than non-optimal learning
conditions (because the difference between the relative areas under the curves
in each confidence bin increases as d' increases) and that the rate at which the
conditional probabilities increase over confidence bands should be higher in
optimal than sub-optimal conditions. In other words, confidence should be
better calibrated to accuracy the higher the d'.[6]
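The calibration prediction can be made concrete by computing, for each confidence band, the area under the seen distribution relative to the total area under both distributions. The sketch assumes equal-variance normal distributions, equal numbers of seen and unseen test items, a criterion midway between the distributions, and confidence bands of a hypothetical fixed width `step`; it covers "yes" responses only.

```python
from statistics import NormalDist


def yes_calibration(d_prime, step=0.5, n_levels=5):
    """P(correct | "yes" response at each confidence level), levels 1..n_levels.

    Assumptions (illustrative only): unseen ~ N(0, 1), seen ~ N(d', 1),
    equal numbers of seen and unseen items, criterion at d'/2, and
    confidence bands `step` wide above the criterion.
    """
    unseen = NormalDist(0.0, 1.0)
    seen = NormalDist(d_prime, 1.0)
    criterion = d_prime / 2.0
    accuracy_by_level = []
    for level in range(1, n_levels + 1):
        low = criterion + step * (level - 1)
        if level < n_levels:
            high = criterion + step * level
            hits = seen.cdf(high) - seen.cdf(low)
            false_alarms = unseen.cdf(high) - unseen.cdf(low)
        else:
            # The highest confidence band extends to the end of the scale.
            hits = 1.0 - seen.cdf(low)
            false_alarms = 1.0 - unseen.cdf(low)
        accuracy_by_level.append(hits / (hits + false_alarms))
    return accuracy_by_level


if __name__ == "__main__":
    print("d' = 1:", [round(p, 3) for p in yes_calibration(1.0)])
    print("d' = 2:", [round(p, 3) for p in yes_calibration(2.0)])
```

Under these assumptions the probability that a "yes" response is correct rises with confidence, and it is higher at every confidence level when d' is larger.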
We designed a
simple face-memory experiment to test these various implications of applying
signal detection to the confidence-accuracy issue. Both accuracy and confidence
were measured as the learning and memory conditions were made more or less
optimal by varying, in a between-subjects factorial design, the duration of
exposure to each seen face and the length of the retention interval between
exposure and testing.[7]
Not only were we interested in the correlation between accuracy and confidence
over subjects (the typical method of measuring the strength of the relationship
between confidence and accuracy), the average correlation over items within
subjects, and the correlation over items of average scores over subjects, but
we were also interested in examining how memory conditions affected both
average accuracy and average confidence and the extent to which confidence
estimates were calibrated to the accuracy of individual identification
responses.
Subjects.
All 200 subjects were obtained from introductory psychology classes at UCSD and
served in partial fulfillment of class requirements. Subjects volunteered to
participate in an experiment that would require two sessions, the first of
which was two hours long. Of these subjects, 111 were female and the rest were male.
Design.
A 2 x 4 between-subjects factorial design was employed. Two levels of duration
of exposure to the study faces (three seconds and eleven seconds) were crossed
with four different retention intervals (one hour, one day, seven days, and two
weeks). Subjects were randomly assigned to the different conditions after they
arrived in small groups at the experimental sessions.
Procedure.
The experiment consisted of two phases, a study phase and a test phase. During
the study phase, subjects were exposed to slides of 40 different male faces,
one at a time, at one of the two exposure durations. During the test phase
subjects were exposed to 80 slides of male faces, 40 of which had been in the
original study set. Half of the subjects in each condition were exposed to one
of two different sets of 40 slides that were randomly selected from the entire
set of 80 available faces. The slides consisted of 35 mm color pictures of
college age males from the UCSD campus that had been taken in several different
settings around campus. Slides were projected on a wall painted with white
reflective paint. Because subjects were run in small groups (varying from one
to four in size), the visual angles and image sizes varied over subjects within
each small group. In addition, the size of each face stimulus varied somewhat
from slide to slide. On average, subjects sat nine feet from images of faces
that averaged 18 inches high on the projection surface.
When each small
group of subjects arrived, they were told that we were interested in their
reactions to people’s faces, that we were going to show them a series of faces
of males, and that we would explain what we wanted them to tell us about the
faces after they had seen all of them. They were also told to pay careful
attention to each and every face. The room lights were dimmed and the subjects
were shown all 40 slides at one of the two durations. Inter-slide intervals
were a function of the speed with which the Kodak carousel projector was able
to change slides.
After the study
phase was complete, the experimenter explained that there was a second part to
the study and that they would have to return for that second phase one hour,
the next day, one week, or two weeks later in order to receive their class
credit. Conversations with individual subjects about schedule conflicts and the
session that each subject would be able to attend at the relevant time were
completed next. By running multiple testing sessions each day, it was usually
possible to schedule all subjects to a session consistent with the retention
interval to which they had been assigned. Five of the 200 subjects who were
scheduled for retention interval tests failed to attend (four in the two week
condition and one in the week condition).
Subjects were
run in small groups in the test phase. The experimenter explained that we were
interested in how well they could remember the faces that they had seen before.
A response form consisting of 80 numbered rows, ten per page, was used to
collect yes/no decisions, confidence estimates, and “reasons.” Subjects were instructed
to circle yes if they believed that they had seen a face in the previous study
set and no if not; to indicate their confidence in the yes/no decision
on a labeled 5-point confidence scale (just guessing, slightly confident,
moderately confident, highly confident, and absolutely confident); and to write
down any reasons that they had for picking or not picking a particular face.
Finally, the experimenter told the subjects that they would be seeing 80 faces
and that they had seen half of them before.
The same
viewing conditions were used in the test phase as in the study phase with the
exception that each slide was projected for twenty seconds to give subjects
time to answer all three questions. In addition, the experimenter told the
subjects after every ten slides which row number they were to be filling out
for that slide. Orders of slide presentation were re-randomized in both the
study and the test phases for each session.
Duration and
Retention Interval Effects. Of initial interest are the effects of duration
of exposure and retention interval on overall accuracy (measured both as d' and
total percent correct) and mean confidence (over all eighty test slides).[8]
Separate analyses of variance of each measure yielded two main effects and no
significant interactions for all three measures: for the duration effect on d',
F(1, 187) = 43.75, p<.0001, on an arcsin transform of
proportion correct, F(1, 187) = 40.90, p<.0001, and on
confidence, F(1, 187) = 17.28, p<.001; for the retention
interval effect on d', F(3, 187) = 6.18, p<.0005, on arcsin
transformed proportion correct, F(3, 187) = 4.86, p<.01 and on
confidence F(3, 187) = 5.98, p<.001; the mean square error for
d' was .339, for arcsin proportion correct it was .014, and for confidence it
was .362. Table 1 shows the means for confidence, d', and total percent correct
as a function of condition. Not surprisingly, accuracy and confidence increased
with greater exposure time and decreased with lengthening retention interval.
Finally, an analysis of the mean proportion of "yes" responses
(overall mean = .41) suggested that the subjects’ placements of their
"yes/no" criteria, i.e., response bias, did not contribute to these
accuracy results because learning and memory conditions had no effect on this
measure (all Fs <1). Clearly, these results provide the necessary
starting conditions to examine the effect of strength of memory on the
relationship between confidence and accuracy.
Table 1
Effect of Duration of Exposure and Length of Retention Interval on Mean Accuracy
(Measured as d' and Total Percent Correct) and on Mean Confidence [a]

Column headings give the measure and, in parentheses, the duration of exposure.

| Retention Interval | d' (2 sec.) | d' (12 sec.) | Percent Correct (2 sec.) | Percent Correct (12 sec.) | Mean Confidence (2 sec.) | Mean Confidence (12 sec.) |
|---|---|---|---|---|---|---|
| 1 hr | 1.320 | 2.006 | 72.0 | 80.5 | 3.25 | 3.65 |
| 24 hrs | 1.056 | 1.753 | 68.1 | 78.4 | 3.12 | 3.57 |
| 168 hrs | 1.062 | 1.459 | 68.0 | 73.7 | 3.08 | 3.33 |
| 336 hrs | .982 | 1.410 | 67.0 | 73.4 | 2.80 | 3.12 |
[a] Six subjects made no false alarm responses.
Because the standard score of a probability of 0 is not defined, we estimated
these subjects' d' values by substituting
a false alarm rate of .01 for them.
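The d' computation, including the substitution described in the note to Table 1, can be sketched as follows. The function name and the example counts are hypothetical. The sketch handles only the zero false alarm case mentioned in the note; a hit rate of exactly 0 or 1 would need a comparable adjustment, which is not described here and is therefore left unhandled.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections, floor=0.01):
    """d' = z(hit rate) - z(false alarm rate) for one subject.

    A false alarm rate of 0 is replaced by `floor` (.01, as in the table
    note) because the standard score of a probability of 0 is undefined.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    if fa_rate == 0.0:
        fa_rate = floor
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Hypothetical subject: 30 hits out of 40 seen faces, no false alarms.
    print(round(d_prime(30, 10, 0, 40), 3))
```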
Signal Detection. Evidence
that the accuracy and confidence data from this experiment were relatively
consistent with underlying assumptions of signal detection theory comes from
three different analyses of the confidence and accuracy data. The first
attempts to fit normalized ROC curves to the “yes/no” and confidence data
(summed over subjects) using procedures that have been described as the “rating
method” by Macmillan & Creelman (1991). The signal detection model
predicts that these normalized ROC curves should be linear. The results of
these analyses for the retention interval effect are presented in Figure 2.[9]
As expected from signal detection analysis, all four curves are well described
by linear functions. In the worst case, r² for the linear fit was .996
and in the best case it was .998.[10]
In addition, the fact that the slopes of the normalized curves were all
significantly less than 1.0 (between .777 and .844) suggests that the standard
deviation of the strength of evidence values for the seen items was between 1.18 and 1.28
times larger than that for the unseen items.[11]

Figure 2. ROC curves resulting from applying
the rating method to data from four different retention intervals. The rating
method assumes that each level of confidence is a cutpoint on the evidence
dimension. Therefore, the normalized cumulative proportion of responses made at
each confidence level (starting at "yes, absolutely confident" and
continuing to "no, absolutely confident") to the not seen items,
z(Prob (FA|Not Seen)), is compared to the equivalent proportion of responses
made to the seen items, z(Prob (H|Seen)). If confidence ratings are fixed
criteria on the evidence dimension and the underlying distributions are normal,
then these ROC curves should be linear.
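The rating method described in the caption can be sketched in a few lines. The sketch takes response counts for the ten rating categories, ordered from "yes, absolutely confident" to "no, absolutely confident", separately for seen and unseen items, and returns the normalized (z-transformed) cumulative proportions together with an ordinary least-squares line. The function names are hypothetical, and the sketch assumes that no cumulative proportion before the last category is exactly 0 or 1, since z is undefined there.

```python
from statistics import NormalDist


def z_roc(seen_counts, unseen_counts):
    """Return a list of (z(FA), z(H)) points, one per confidence cutpoint.

    Counts are ordered from "yes, absolutely confident" down to
    "no, absolutely confident". The final category is skipped because its
    cumulative proportion is 1.
    """
    z = NormalDist().inv_cdf
    n_seen, n_unseen = sum(seen_counts), sum(unseen_counts)
    points, cum_seen, cum_unseen = [], 0.0, 0.0
    for s, u in zip(seen_counts[:-1], unseen_counts[:-1]):
        cum_seen += s
        cum_unseen += u
        points.append((z(cum_unseen / n_unseen), z(cum_seen / n_seen)))
    return points


def fit_line(points):
    """Ordinary least-squares slope and intercept of z(H) on z(FA)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Under a normal model with unequal variances, the slope of this line equals the ratio of the unseen to the seen standard deviation, so a slope of .8, for example, corresponds to a seen distribution whose standard deviation is 1/.8 = 1.25 times that of the unseen distribution.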
The second
analysis was based on the expectation that subjects should express greater
confidence in responses to stimuli whose evidence values are further away from
the “yes/no” decision criterion. Examination of panel b in Figure 1 provides
the intuition for this prediction. Hits and correct rejections (CRs) should, on
average, be associated with more extreme strengths of evidence than should
false alarms (FAs) and misses. Table 2 presents the mean confidence ratings for
each response type as a function of both duration of exposure and length of
retention interval. Examination of this table and a mixed 2 x 4 x (2 x 2
repeated measures) analysis of variance indicates that this prediction was
confirmed. Mean confidence for correct responses was much higher than for
incorrect responses (F(1, 181) = 648.61, p<.0001). It is also
important to note that the size of this response-accuracy effect was
significantly affected by both the duration of exposure (F(1, 181) =
20.45, p<.0001) and the length of the retention interval (F(3,
181) = 9.88, p<.0001), but the three-way interaction was not
significant. In particular, as the optimality of the learning and memory
conditions increased, the difference between confidence in correct compared to
incorrect responses increased, exactly as a signal detection analysis predicts.
There was also a significant main effect of whether the responses were to seen
as opposed to unseen slides. Confidence in responses to seen slides was higher
than confidence in responses to unseen slides (F(1, 181) = 34.54, p<.001). If
this effect were due to the larger variance of the seen as opposed to the
unseen distribution (variance differences that were suggested by the ROC curve
analyses), we might not expect the size of this difference to change
with the optimality of the learning and test conditions. Somewhat unexpectedly,
although the size of this difference was, indeed, not directly affected by
duration of exposure nor by the length of the retention interval, the two
manipulations did interact to produce an inexplicable but significant three-way
interaction (F(3, 181) = 4.37, p<.01). Examination of the
means in Table 2 suggests that this was a consequence of subjects in the most
optimal condition, namely, long duration of exposure and one hour retention
interval, and subjects in the short duration, one hour retention interval,
condition expressing atypically low confidence in their miss and false alarm
responses, respectively. Finally, there were no significant differences in the
amount of confidence that subjects expressed in their “yes” as opposed to “no”
responses.
Table 2
Effect of Duration of Exposure and Length of Retention Interval on Mean Confidence for
Hit, False Alarm (FA), Miss, and Correct Rejection (CR) Recognition Response
Categories [a]

Column headings give the duration of exposure followed by the retention interval in hours.

| Recognition Response | Three Seconds, 1 | Three Seconds, 24 | Three Seconds, 168 | Three Seconds, 336 | Eleven Seconds, 1 | Eleven Seconds, 24 | Eleven Seconds, 168 | Eleven Seconds, 336 |
|---|---|---|---|---|---|---|---|---|
| Hit | 3.502 | 3.350 | 3.153 | 2.947 | 3.913 | 3.873 | 3.618 | 3.213 |
| FA | 2.446 | 2.658 | 2.585 | 2.377 | 2.829 | 2.711 | 2.617 | 2.577 |
| Miss | 2.899 | 2.718 | 2.671 | 2.614 | 2.679 | 2.911 | 2.868 | 2.785 |
| CR | 3.340 | 3.229 | 3.151 | 2.869 | 3.671 | 3.589 | 3.379 | 3.267 |
[a] These means were computed by averaging each
subject’s confidence estimates for each of the four response types. These averages
then served as the raw values from which the means in this table were computed.
The third
source of evidence regarding signal detection comes from a relationship between
confidence and accuracy that has been known for over 100 years. Peirce & Jastrow (1884) empirically determined
that confidence in comparative judgments of weights was related to the accuracy
of those judgments by the following formula:
m = c * ln[p/(1-p)]
where m is the
mean signed confidence, c is a constant
that depends on the confidence scale, and p is the probability of a correct
response to a particular difference in weights. Thus, as the difference in
weights between a comparison and standard increased, mean signed confidence, as
well as the natural log of the relative probability of correct choices,
increased according to the above linear equation. By assuming that the not seen
and seen distributions of strength of evidence in a signal detection model of
recognition memory play similar roles to the distribution of subjective weights
in Peirce and Jastrow, we might expect to find a similar result for
probability of correct recognition responses and mean confidence across
different learning and memory conditions.
It is of no
small interest that the signal detection model predicts virtually the same
relationship when the "yes/no" response criterion is placed midway
between the two distributions. In fact, when there is no response bias and the
strength distributions are logistic rather than normal, d' is proportional to
the quantity, ln p/(1-p) (Noreen, 1981). Thus, the Peirce and Jastrow formula is equivalent to arguing that the subjects generate
“yes” and “no” responses with little or no bias, that d' is not so large that
the tails of the distributions play a large role, and that the seen and unseen
evidence distributions have equal variance. However, unlike the previous
analyses in which confidence levels were coded such that "just
guessing" = 1, "slightly
confident" = 2, and so on, Peirce & Jastrow (1884) coded confidence to take
account of whether the "yes/no" response was correct by giving
negative confidence values to incorrect responses. Applying their procedure,
correct responses were coded 0 (just guessing) to 4 (absolutely confident)
and incorrect responses were coded 0 (just guessing) to -4 (absolutely
confident).
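The proportionality between d' and ln p/(1-p) mentioned above can be made explicit. The following is only a sketch under stated assumptions, not a reproduction of Noreen's (1981) argument: the seen and not seen distributions are taken to be logistic with a common scale parameter s (a symbol introduced here for illustration), the not seen mean is 0, the seen mean is \mu, and the unbiased "yes/no" criterion sits at \mu/2.

```latex
\begin{align*}
p &= P(\text{hit}) = P(\text{correct rejection})
   = \frac{1}{1 + e^{-\mu/(2s)}}, \\[4pt]
\ln\frac{p}{1-p} &= \frac{\mu}{2s}, \\[4pt]
d' &= \frac{\mu}{\sigma}, \qquad \sigma = \frac{\pi s}{\sqrt{3}}
\quad\Longrightarrow\quad
\ln\frac{p}{1-p} = \frac{\pi}{2\sqrt{3}}\, d' \approx 0.907\, d'.
\end{align*}
```

Under these assumptions the log odds of a correct response are a constant multiple of d', which is why a linear relation between mean signed confidence and ln p/(1-p) can be read as a linear relation between mean signed confidence and d'.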
When we
computed each subject's individual signed
confidence mean in this fashion as well as each subject's ln p/(1-p) values,
the results were remarkably, though not perfectly, consistent with the Peirce & Jastrow model. Figure 3 contains a scatterplot of these
results. As can be seen, despite the fact that the variation in both the x and
y values in this plot is a mixture of individual and condition differences,
the data were well fit by a linear function (r² = .724, F(1,
193) = 507.2, p<.0001). In addition, when the condition effects on
both measures were removed by computing the residuals from 2 (duration) x 4
(retention interval) analyses of variance of each measure, there was only a
small reduction in the quality of the linear fit (r² = .64, F(1,
193) = 344.28, p<.0001) to the residuals. Furthermore, when ln
p/(1-p) was added as a covariate to a 2 x 4 analysis of variance of mean
signed confidence, the main effects of learning and memory conditions did not
disappear (F(1, 186) = 10.65, p<.001 for duration and F(3,
186) = 8.86, p<.001 for retention interval, after removing the effect
of ln p/(1-p)). Taken together, these results suggest that the pattern of data
in Figure 3 is the result of both condition-produced and individual difference
variation in memory.
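The per-subject quantities plotted in Figure 3 can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for the experiment: the function names and the four-response example subject are hypothetical, and confidence is assumed to arrive coded 1 ("just guessing") through 5 ("absolutely confident") before being shifted to the signed 0 to 4 scale.

```python
import math

def signed_confidence_mean(responses):
    """responses: list of (correct, confidence) pairs, confidence coded 1..5."""
    # Shift the 1..5 scale to 0..4, then negate the value for incorrect responses.
    signed = [(conf - 1) if correct else -(conf - 1) for correct, conf in responses]
    return sum(signed) / len(signed)

def log_odds_correct(responses):
    """ln[p/(1-p)], where p is the subject's proportion of correct responses."""
    p = sum(1 for correct, _ in responses if correct) / len(responses)
    return math.log(p / (1 - p))  # undefined when p is exactly 0 or 1

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical subject with four responses (illustrative values only).
subject = [(True, 5), (True, 3), (False, 2), (True, 4)]
m = signed_confidence_mean(subject)   # (4 + 2 - 1 + 3) / 4
x = log_odds_correct(subject)         # ln(0.75 / 0.25)
```

Applying `fit_line` to the list of per-subject `x` values and the list of per-subject `m` values gives the slope (the constant c in the formula) and the intercept that the Peirce & Jastrow model expects to be zero.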
The Peirce & Jastrow model also predicts that the intercept of the best
fitting linear function should be equal to zero, and although close at .22, a
t-test using the standard error of the estimate (.05) indicated that our
obtained intercept was significantly different from zero (t = 4.21, p<.0001).
On the other hand, it can be shown that this small deviation from the original
model would be expected if, as previously discovered, both the standard
deviation of the seen items was larger than that for the not seen items and the
subjects were biased to say "no."

Figure 3. Scatter-plot of data for
all 195 subjects from Experiment 1 showing the relationship between each
subject's natural log of the ratio of total proportion correct (p) to (1 - p)
and each subject's mean signed confidence. The relationship should be linear,
with intercept 0, if the Peirce and Jastrow assumptions of equal variance and
no response bias are correct.
Confidence
and Accuracy Correlations. Given that both duration of exposure and
retention interval produced highly reliable effects on the accuracy of face
recognition memory and on confidence and given that these results were well
within the boundaries of what might be expected from signal detection theory,
we next analyzed the results from three correlation measures of the
relationship between confidence and accuracy. Covariation over individuals of
measures averaged over items (individual difference rs), covariation over items
(faces) of measures averaged over individuals (face-based rs), and average
covariation over items within each individual (response-based rs) were computed
for accuracy and for confidence. Table 3 presents the results of the three
differently computed correlations between confidence and accuracy for each
learning and memory condition. Of initial interest is the fact that individual
difference correlations between each subject’s mean confidence (coded in the
more typical fashion as 1 through 5 and without regard to the accuracy of the
"yes/no" response) and total proportion correct scores were within
the range of results reported in previous studies. Although the overall
correlation was highly significant (p<.0001), its absolute value was
not particularly high (r(194) = .38). In addition, as shown in Table 3,
only two out of the eight within-condition individual difference correlations
(with ns between 22 and 25) were significant.
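The three ways of computing a confidence-accuracy correlation described above differ only in what is averaged before correlating. The sketch below makes that explicit for a subjects-by-faces layout; it is an illustration under assumptions, not the original analysis code. The function names and the small 3 x 4 data set are hypothetical, and the code assumes every subject and every face shows some variation in both measures (otherwise the correlation is undefined).

```python
def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

def mean(values):
    return sum(values) / len(values)

def three_correlations(correct, conf):
    """correct[s][i] is 1 or 0 and conf[s][i] is 1..5 for subject s and face i."""
    n_subj, n_faces = len(correct), len(correct[0])
    # Individual difference r: correlate per-subject means over subjects.
    individual = pearson([mean(row) for row in correct], [mean(row) for row in conf])
    # Face-based r: correlate per-face means over faces.
    face_acc = [mean([correct[s][i] for s in range(n_subj)]) for i in range(n_faces)]
    face_conf = [mean([conf[s][i] for s in range(n_subj)]) for i in range(n_faces)]
    face_based = pearson(face_acc, face_conf)
    # Response-based r: correlate raw responses within each subject, then average.
    response_based = mean([pearson(correct[s], conf[s]) for s in range(n_subj)])
    return individual, face_based, response_based

# Hypothetical data: 3 subjects by 4 faces (illustrative values only).
correct = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 0]]
conf = [[5, 4, 2, 4], [4, 2, 1, 3], [2, 3, 1, 2]]
ind_r, face_r, resp_r = three_correlations(correct, conf)
```

The same response matrix therefore yields three different numbers, each reflecting a different source of variation (subjects, faces, or responses within a subject).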
These results
are consistent with the claim that a person’s average confidence does not
appear to be a particularly good predictor of his/her overall accuracy, except
possibly in highly optimal conditions (the eleven second-one hour retention
interval condition). Furthermore, when the effects of the different learning
and memory conditions were removed by examining the correlation between the
residuals from the 2 x 4 analyses of variance of mean confidence and of
proportion correct that were reported earlier, the resulting r was
reduced to .241 (F(1, 193) = 11.91, p<.001), a value in
support of the usual claim by eyewitness experts that the relationship between
accuracy and confidence is weak. In other words, at the level of individual
difference mean confidence-accuracy correlations, the results from this
experiment seem similar to those previously reported. People who are generally
confident in their memories are just slightly more likely to have high accuracy
scores.
Table 3
Effect of Duration of Exposure and Length of Retention Interval on the Size of Three Types of
Confidence-Accuracy[a] Correlations Based on Data from the
Between Subjects Experiment
| Type of Confidence-Accuracy Correlation | Individual Difference | | Response-Based | | Face-Based | |
| Retention Interval / Exposure Duration | 3 sec. | 11 sec. | 3 sec. | 11 sec. | 3 sec. | 11 sec. |
| 1 hr | .20 | .63* | .26* | .33* | .49** | .67** |
| 24 hrs | .03 | .28 | .27* | .28* | .34** | .58** |
| 168 hrs | .52* | .17 | .22* | .28* | .50** | .52** |
| 336 hrs | .04 | .10 | .16* | .18* | .36** | .39** |
[a] Accuracy was measured as total percent
correct over 80 items for each subject in the individual difference
correlations, as the percent of all subjects correctly responding to a
particular face in the face-based correlations, and as one (correct) and zero
(incorrect) for the response-based correlations. Mean confidence in all 80
faces was used for the individual difference correlations, mean confidence over
all subjects for a given face was used in the face-based correlations, and the
actual confidence rating for each item was used for the response-based
correlations. When the one outlier face was removed from the face-based
correlations, the lowest correlation increased to .42 and the highest to .69.
*p<.05
**p<.005
Though only
sometimes reported in previous studies, we also computed confidence-accuracy
correlations over stimulus items within each subject. That is, each subject's
"yes" and "no" responses were coded 1 if correct and 0 if
incorrect. These scores were then correlated with the associated confidence
ratings for each response over the 80 test slides. Despite the fact that the
overall mean correlation across all subjects and conditions was .24, a value
even smaller than the equivalent individual difference correlation (but still
different from zero, p<.0001), only 2 out of the 195 subjects had
correlations that were negative! More importantly, when the condition effects
on the means of these within-subject response-based correlations were tested
with an analysis of variance of z-transformations of the correlations, the
main effects of both exposure duration and retention interval were significant (F(1,
187) = 7.183, p<.01 and F(3, 187) = 10.54, p<.0001) but the
interaction between them was not (F<1). As can be seen in Table 3,
the pattern of the response-based correlations was consistent with the general
idea of the optimality hypothesis, namely, the correlations tended to decrease
as the optimality of the learning and memory conditions, and therefore
accuracy, decreased.
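The z-transformation applied to the within-subject correlations before the analysis of variance is presumably Fisher's r-to-z transformation; that reading is an assumption here, as is the helper name below. A minimal sketch of the transformation, and of averaging correlations on the z scale, follows.

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation (the inverse hyperbolic tangent of r)."""
    return math.atanh(r)  # undefined for r = 1 or r = -1

def mean_r_via_z(rs):
    """Average correlations on the z scale, then transform back to an r."""
    zs = [fisher_z(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical within-subject correlations for three subjects.
example_rs = [0.10, 0.25, 0.40]
example_mean = mean_r_via_z(example_rs)
```

Because the transformation stretches values near the ends of the scale, condition differences tested on the z values are less distorted by the bounded range of r than differences tested on the raw correlations.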
To determine
whether the sizes of confidence-accuracy correlations computed over individuals
(and responses within individuals) were representative of those that might be
obtained when variation in performance was produced by differences between
faces, we computed confidence-accuracy correlations between the percent of
subjects correctly identifying each face and the mean confidence they expressed
in their responses to the same faces. When the results for all 195 subjects
were used to compute accuracy and confidence estimates for each face, the
correlation was larger than previous examples (r = .545) and highly significant
(t(78) = 5.74, p < .00001).[12]
In addition, when the correlations were computed within each learning and test condition,
every correlation was highly significant, suggesting that faces to which most
people correctly respond are the same faces in which people tend to have high
confidence. Finally, with one minor exception, just as the signal detection
description of the optimality hypothesis predicts, the correlations decreased
as the learning conditions worsened.[13]
At one level
the correlation results in Table 3 suggest that one's conclusion about the
strength of the confidence-accuracy relationship may depend on the exact method
used to compute the correlation between these two measures. Individual
differences in accuracy do not appear to be strongly associated with individual
differences in confidence. But, faces to which people tend to respond correctly
do appear to be those faces in which they have more confidence in their
recognition ability. On the other hand, the within-subject correlations between
each subject's confidence in, and the accuracy of, his/her recognition
responses tended to be small. Despite these apparent inconsistencies, the
pattern of the correlations seemed to conform to the optimality hypothesis
because the shorter retention interval and longer duration of exposure
conditions tended to produce higher correlations, regardless of the measure.
Monte Carlo
Signal Detection Simulations of Different Confidence-Accuracy Measures. If
the signal detection model provides as reasonable a description of
face-recognition memory as was suggested by the empirical results presented in
an earlier section of this paper, then we might expect it to account for
measures of the strength of the relationship between confidence and accuracy,
as well. Although all of the empirical results presented thus far seem
generally consistent with a signal detection interpretation of the optimality
hypothesis, exact predictions of how different correlations between confidence
and accuracy should change as parameters of the signal detection model, e.g.,
d', change are not readily available. Both to examine this issue and to test the
usefulness of the signal detection approach, we conducted a series of
Monte-Carlo computer simulations of recognition and confidence responses based
on the signal detection model. These simulations were used to examine the
effects of a variety of signal detection parameters on different measures of
the strength of the relationship between confidence and accuracy. In
particular, the effects on the confidence-accuracy relationship of: 1)
expanding and contracting the placement of the confidence cutpoints, 2) the
size of the variance differences between the signal and the noise
distributions, 3) variability over individuals in d' and in confidence cutpoint
placements, and 4) different ways of adjusting the confidence criteria
(lock-step v. stretching) as d' changed were all examined.
Simulation
Methodology. Separate simulations were run for each of a series of programmed
d' values: 0, .25, .5, 1, 1.5, 2, 2.5, and 3. These were designed to simulate
the different mean d' levels that might occur as a result of differently
optimal learning and memory conditions. The simulations took into account the
facts that not all subjects have the same d' in a given learning and memory
condition nor are they likely to have the same signal to noise variance ratio.
Therefore, individual d' values were drawn from normal distributions of d'
scores with a programmed mean set to simulate a given learning and memory
condition and with a standard deviation (of d's over individuals) set between
one fourth and one third of the mean. The standard deviation of each subject's
signal distribution (sds) was drawn from a second normal
distribution with a mean of 1.25 (a value close to that empirically determined
from the ROC curves) and standard deviation between one fourth and one third of
1.25.
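The subject-level sampling described in this paragraph can be sketched as follows. This is an illustration under assumptions rather than the original simulation program: the function name, the default of one third for the standard deviation fraction, and the small positive floor on sds are all choices made here for the example.

```python
import random

def sample_subject(mean_d, sd_fraction=1 / 3, mean_sds=1.25, rng=random):
    """Draw one simulated subject's d' and signal-distribution SD (sds)."""
    # Individual d' values are normal around the programmed mean, with an SD
    # equal to a fixed fraction (one fourth to one third) of that mean.
    d_prime = rng.gauss(mean_d, sd_fraction * mean_d)
    # The signal SD is normal around 1.25 with the same proportional spread.
    sds = rng.gauss(mean_sds, sd_fraction * mean_sds)
    # Assumption: floor sds at a small positive value so it stays a valid SD.
    return d_prime, max(sds, 0.05)

# 200 hypothetical subjects at a programmed d' of 1.0, with a fixed seed.
rng = random.Random(1)
subjects = [sample_subject(1.0, rng=rng) for _ in range(200)]
```

Note that with a programmed d' of 0 the spread is also 0 under this rule, so every simulated subject in that condition has a d' of exactly 0.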
A given
subject's signal and noise distributions were created for a particular
programmed d' level by selecting a d' value and an sds value from
their associated distributions. The mean of that subject's signal distribution
was set d' units above the mean of the noise distribution, with the latter
always set to zero with unit variance. Thus, at a programmed d' value of 1.0,
the first subject might have a d' value of .85 and sds of 1.20 and
the next might have a d' of 1.23 and a sds of 1.18, and so on. The
placement of confidence criteria was also controlled by the simulation; however,
the degree of variability in the placement of these cuts was controlled by a
parameter, v. When v was set to zero, the placement of confidence criteria did
not vary over subjects. But, as v increased, the variability, over subjects, in
the placement of the different confidence cuts increased. The details of
criteria placement were as follows. First, the "yes/no" criterion was
randomly placed between 0 and the subject's d' with the amount of random
variation over subjects controlled by a proportion of v. When v was equal to 0,
the yes/no criterion was placed either midway between 0 and d' or so as to
produce a slight "no" response bias in a manner consistent with the
empirical results. Next, the no-5 and yes-5 confidence criteria were set to
fall 3 standard deviation units to the left and right, respectively, of the
yes/no criterion, plus or minus a random proportion of v. The remaining
criteria, no-4 through yes-4, were distributed between these two extreme values
in order. When v was zero, the spacing between them was either equal or
designed to produce more middle-valued confidence estimates (3s and 4s) in a
manner similar to the empirical results. When v was greater than zero, the
placement of each of the remaining criteria independently varied plus or minus
a random proportion of v while still preserving their initial serial order
(n4<n3<....<y3<y4) and, if present, a degree of confidence bias.
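The criteria placement just described can be sketched for the simplest case of equal spacing and no confidence bias. Several details below are assumptions made for illustration and are not specified in the text: the exact jitter rule (a uniform draw scaled by a hypothetical `scale` times v), the use of nine cutpoints (four on each side of the "yes/no" criterion), and the use of sorting to preserve serial order after jittering.

```python
import random

def place_criteria(d_prime, v=0.0, scale=0.25, rng=random):
    """Return nine ascending cutpoints: n5, n4, n3, n2, yes/no, y2, y3, y4, y5."""
    def jitter():
        # Hypothetical rule: a uniform draw within +/- (scale * v).
        return rng.uniform(-1.0, 1.0) * scale * v

    # The yes/no criterion starts midway between 0 and d' and stays in [0, d'].
    upper = max(d_prime, 0.0)
    yes_no = min(max(0.5 * d_prime + jitter(), 0.0), upper)
    # Extreme criteria sit 3 SD units either side of the yes/no criterion.
    lowest = yes_no - 3.0 + jitter()    # no-5 criterion
    highest = yes_no + 3.0 + jitter()   # yes-5 criterion
    # Remaining criteria are equally spaced between the extremes, then jittered.
    step_lo = (yes_no - lowest) / 4.0
    step_hi = (highest - yes_no) / 4.0
    inner_no = [lowest + k * step_lo + jitter() for k in (1, 2, 3)]
    inner_yes = [yes_no + k * step_hi + jitter() for k in (1, 2, 3)]
    # Assumption: sorting restores the serial order after jittering.
    return sorted([lowest] + inner_no + [yes_no] + inner_yes + [highest])

fixed = place_criteria(1.0, v=0.0)                          # identical for every subject
varied = place_criteria(1.0, v=1.5, rng=random.Random(7))   # varies over subjects
```

With v set to zero every simulated subject at a given d' receives the same cutpoints; with v greater than zero the cutpoints differ from subject to subject while remaining ordered.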
For reasons
that will become clear later, we ran all of the simulations under two different
assumptions about the effect that changes in d' had on the pattern of changes
in confidence criteria. One was modeled on Figure 1 in which all of the
confidence criteria are assumed to move in lock-step
with .5d' (i.e., the optimal placement of the yes/no decision criterion). The
second was modeled on a pattern similar to that depicted in Figure 4. This stretch model assumed that the
"yes-absolutely confident" criterion was placed 3 standard deviation
units (based on a noise distribution of unit variance and zero mean) above the
mean of the noise distribution and that the "no-absolutely
confident" criterion was placed 3 standard deviation units below the mean
of the signal distribution. A consequence of this model was that the
distance between the most extreme criteria decreased as d' increased because
the signal distribution tended to drag the no-5 confidence criterion along with
it while the yes-5 criterion remained where it was. Stated differently, the
confidence criteria "stretched" as d' became smaller. In both models,
the remaining criteria were placed in between the extremes in the manner
already described.
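The difference between the two models lies in where the two extreme criteria fall as d' changes. The sketch below is an illustration under assumptions, not the original program: the function name is hypothetical, the "yes/no" criterion is assumed to sit at .5d' with no bias, and all distances are expressed in units of the noise distribution's standard deviation.

```python
def extreme_criteria(d_prime, model):
    """Return (no-5 criterion, yes-5 criterion) on the evidence axis."""
    if model == "lock-step":
        # Both extremes travel with the yes/no criterion at .5d'.
        center = 0.5 * d_prime
        return center - 3.0, center + 3.0
    if model == "stretch":
        # yes-5 stays 3 units above the noise mean (0);
        # no-5 stays 3 units below the signal mean (d').
        return d_prime - 3.0, 3.0
    raise ValueError("model must be 'lock-step' or 'stretch'")

for d in (0.0, 1.0, 2.0, 3.0):
    lo_l, hi_l = extreme_criteria(d, "lock-step")
    lo_s, hi_s = extreme_criteria(d, "stretch")
    # Lock-step width is constant; stretch width shrinks as d' grows.
    print(d, hi_l - lo_l, hi_s - lo_s)
```

Under these assumptions the distance between the extreme criteria is always 6 in the lock-step model and 6 - d' in the stretch model, so the criteria are furthest apart when d' is smallest.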

Figure 4. Signal detection model of
how the confidence criteria might change if subjects compensate for
less-optimal learning conditions by "stretching" their confidence
criteria further out on the evidence dimension rather than moving them in
"lock-step" with the "yes/no" criterion.
Once the noise
distribution, the signal distribution, and all of the cutpoints were in place
for a given subject (i.e., when a situation similar to one depicted in Figure 1
or Figure 4 was arranged), 40 seen and 40 not seen trials were simulated for
that subject. For each seen trial, an evidence value was randomly selected from
that subject's signal distribution. If the value fell above the yes/no
criterion it was coded as a hit. If it fell below, it was coded as a miss.
Similarly, for a not seen trial, items falling above the yes/no criterion were
coded as false alarms and those below as correct rejections. The confidence of
each response was also coded according to the particular confidence bin into
which the evidence value fell. When all 80 trials were completed for a given
subject, various statistics were computed, such as percent correct, mean
confidence, and item-based correlations.
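The trial-level step for a single simulated subject can be sketched as follows. This is a minimal illustration under assumptions, not the original program: the function name is hypothetical, and the subject is assumed to have nine ascending cutpoints (n5, n4, n3, n2, yes/no, y2, y3, y4, y5) that define ten confidence bins.

```python
import bisect
import random

def simulate_subject(d_prime, sds, cuts, n_per_type=40, rng=random):
    """Simulate seen and not seen trials; return (outcome, confidence) pairs."""
    trials = []
    for seen in (True, False):
        for _ in range(n_per_type):
            # Evidence comes from the signal distribution on seen trials and
            # from the unit-variance, zero-mean noise distribution otherwise.
            x = rng.gauss(d_prime, sds) if seen else rng.gauss(0.0, 1.0)
            bin_index = bisect.bisect_right(cuts, x)   # 0 (no-5) .. 9 (yes-5)
            said_yes = bin_index >= 5                  # above the yes/no cut
            confidence = bin_index - 4 if said_yes else 5 - bin_index
            if seen:
                outcome = "hit" if said_yes else "miss"
            else:
                outcome = "false alarm" if said_yes else "correct rejection"
            trials.append((outcome, confidence))
    return trials

# Hypothetical subject: d' = 1.0, sds = 1.25, equally spaced cutpoints.
cuts = [0.5 + 0.75 * k for k in range(-4, 5)]
trials = simulate_subject(1.0, 1.25, cuts, rng=random.Random(3))
percent_correct = 100 * sum(o in ("hit", "correct rejection") for o, _ in trials) / len(trials)
```

Percent correct, mean confidence, and the within-subject correlation between correctness and confidence can all be computed from the returned list.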
Within-Subject
Response-Based Correlations. We first examine the effect of d' on the mean
within-subject, response-based, confidence-accuracy correlations. A total of
200 simulated subjects were run at each programmed d', first with v set to zero
and then set to 1.5. Half of the runs assumed that the confidence cutpoints
moved in lock step with the "yes/no" decision criterion and half assumed
that they stretched. The means of the response-based correlations were computed
for each group of 200 subjects and are presented in Figure 5 as a function of
the obtained mean d' for each simulation run. The error bars represent the
obtained (+/- 1) standard deviation of the mean correlations obtained at each
d' level. All of the reported simulation runs assumed a small amount of
confidence bias, i.e., middle confidence bins were larger than extreme ones. As
can be seen, the mean simulated confidence-accuracy correlation increased with
increasing d' in a highly regular fashion. Individual difference variation in
confidence cutpoints had little effect on the mean rs (compare the circles and
squares). At d' values below 1.5, the model, lock step v. stretch, had little
effect on the size of the response-based correlations. However, only the
correlations produced by the stretch model seemed to benefit from further
increases in d' although this benefit was accompanied by increasing individual
difference variance in the size of the obtained rs (seen in the solid error
bars). Still, for both models, even at very high d' values (above 3), the
average response-based confidence-accuracy correlation was less than .5.
Equally important, the actual mean within-subject correlations for the various
memory conditions in our experiment, indicated by the dark "x" data
points connected in groups of four (over retention interval) for each duration
of exposure fell well within the range of values produced by the simulations.

Figure 5. Results of Monte-Carlo
simulations of the effect of d' on the size of average within-subject
response-based confidence-accuracy correlations. The light colored lines
represent the simulated results for four different simulation runs, two with
the lock-step model (green points) and two with the stretch model (blue
points). Within each model, one run assumed no individual difference variance
in criteria placement (square points), the other assumed large individual
difference variance (circles). Standard deviations in the correlations for each
simulation run are shown with horizontal bars. The empirically obtained mean
correlations and mean d's reported in Table 2 for each experimental condition
are indicated by black Xs.
If d' actually
does drive the size of the within-subject confidence-accuracy correlations as
these simulation results suggest, then one might expect that measures related
to signal detection would account for the learning and memory produced effects
on the size of the empirically obtained correlations that were already
reported. With this in mind, when we re-examined the effect of learning and
memory conditions on the actual within-subject response-based correlations from
Experiment 1 in light of measures relevant to signal detection, e.g., each
subject's mean signed confidence and
d', we discovered that almost all of the effect of the conditions on the size
of these correlations could be explained by these two measures. In particular,
an analysis of covariance indicated that the two measures accounted (with one
exception to which we shall return later) for virtually all of the explainable
variance in the response-based correlations (F(1, 185) = 60.11, p<.0001
for signed confidence, F(1, 185) = 41.09, p<.0001 for d', F(1,
185) = .92 for the duration of exposure effect, F(3, 185) = 3.60, p=.01
for the retention interval effect, but with all of this effect being due to the
fact that the two week retention interval produced significantly smaller
correlations than the other three intervals, a contrast testing this pattern
accounted for the entire effect, and F(3, 185) = 1.23 for the condition
interaction). These results are consistent with the claim that signal detection
may provide a more reasonable framework for understanding the relationship
between confidence and accuracy than does the optimality hypothesis combined
with the size of Pearson correlations.
Individual
Difference Correlations. We next examined how confidence-accuracy correlations
based on individual differences in average accuracy and confidence behave as d'
changes. To examine this issue we compared the results of a number of different
sets of simulation runs that differed in terms of v (0 and 1.5), the amount of
individual difference variance in d' (set at either a fourth or a third of
d'), degree of bias in confidence, "yes/no" decision bias, and the
model of how confidence changes as d' changes (lock-step versus stretch). Each
individual difference correlation was computed from data for 25 subjects at
each programmed d' value. The mean individual difference r was computed from 50
such correlations, as was the standard deviation of the rs.
Although bias
in both confidence and decision criteria and in the amount of individual
difference variance in d' all affected the size of individual difference rs,
their effects were small compared to model type and individual difference
variance in confidence. For this reason, Figure 6 shows the results of a set of
these runs selected to highlight the effects of model type and individual
difference variance in confidence (for square data points v=0 and for circles
v=1.5). In an attempt to remain close to the empirically obtained results, all
of the presented simulation data assumed a moderate amount of confidence bias,
a standard deviation over subjects of d' equal to one-third the programmed d',
and a slight decision bias favoring "no" responses. As can be seen,
the results from these runs present quite a different pattern than those for
the item-based correlations. Of interest is the interaction between model type
and extent of individual difference variation in confidence (v). Although the
size of individual difference variation in the placement of confidence criteria
had virtually no effect on the size of response-based rs, it had a major effect
on the size of the individual difference rs, especially for the lock-step
model. When the simulation assumed no variation in confidence placements (the
square data points), d' drove the size of the correlations in both models, although
the lock-step model consistently produced lower individual difference rs than
the stretch model. When the more empirically reasonable assumption of
individual difference variance in placement of confidence cutpoints was
programmed (the circle data points), mean rs at all d' values were reduced;
however, the reduction was much greater for the lock-step than for the stretch
model. In fact, the reduction was so big for the lock-step model that increases
in d' had little effect on the mean rs for the lock-step model, except at the
highest d' values. In addition, the variance in the size of these correlations
was considerable in all cases, with many negative individual difference rs
occurring when d' was below 1. Finally, the empirically obtained individual difference
correlations generally fell within +/- 1 standard deviation of the mean rs for
the stretch model that assumed v=1.5 (the filled circles), although clearly
there was wide variation in these empirical correlations (probably due to the
small ns in our simulated experiments).
Several
features of these simulation results are of interest. The first is the
generally small size of the correlations between confidence and accuracy.
Although the results from these simulations are consistent with the idea that
correlation measures of the confidence-accuracy relationship generally result
in small values, even with large d's, the reasons for the small correlations
need to be examined. Consider the within-subject response-based correlations
between confidence and accuracy first. They have an obvious constraint
virtually assuring that most correlations will tend to be small, but
significant. Coding responses "0" if incorrect and "1" if
correct and regressing these raw scores against confidence ensures that the best-fitting
linear functions will never be able to provide a perfect fit of the data.
Because all of the y-values in the regression will be either zero or one, the
best a straight line will be able to do is "split the difference" at
each confidence level adjusted according to the relative number of ones versus
zeros. In short, the fact that within-subject response-based correlations
between confidence and accuracy are small may tell us less about the nature
and strength of the relationship between these two measures and more about the
fact that predicted accuracy will always be somewhere between the actual values
of 1 and 0.
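The constraint described above can be illustrated with made-up numbers. In the sketch below, accuracy rises in a perfectly linear fashion with confidence, from 50% correct at confidence 1 to 90% correct at confidence 5, with 100 responses at each level. These counts are hypothetical and chosen only for the illustration; they are not data from the experiment.

```python
def pearson(xs, ys):
    """Pearson correlation; undefined if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

# Hypothetical counts: confidence level -> (number correct, number incorrect).
counts = {1: (50, 50), 2: (60, 40), 3: (70, 30), 4: (80, 20), 5: (90, 10)}

# Response-level data: one (confidence, 0 or 1) pair per response.
conf, acc = [], []
for level, (n_right, n_wrong) in counts.items():
    conf += [level] * (n_right + n_wrong)
    acc += [1] * n_right + [0] * n_wrong

r_responses = pearson(conf, acc)   # correlation with 0/1 coded accuracy

# Level-wise data: proportion correct at each confidence level.
levels = sorted(counts)
prop_correct = [counts[k][0] / sum(counts[k]) for k in levels]
r_levels = pearson(levels, prop_correct)
```

In this example the correlation computed over the 0/1 coded responses is about .31, even though proportion correct is an exact linear function of confidence and the level-wise correlation is 1. The small response-based value reflects the binary coding rather than a weak relationship.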
Next consider
the size of individual difference correlations between average confidence and
percent correct. Although the models depicted in Figures 1 and 4 both suggest
that data produced by operators with higher d's will yield higher confidence
and higher accuracy scores, as depicted, both models assume that the placement
of the confidence cuts is determined only by a subject's d'. It seems more
likely that different subjects with identical d's place their
confidence cuts in different locations (just as the simulations assumed when
v>0). Subjects with widely spread confidence criteria will tend to produce
lower mean confidence estimates while those with compressed criteria will have
higher mean confidence scores even if their d's are identical. Unless the
degree of spread in confidence cuts over subjects is related to d' (as the
stretch model assumes is generally the case), the resulting individual
difference correlations will tend to be small. In short, when researchers use
the absolute size of within-subject response-based or individual difference
correlations to measure the "strength" of the relationship between confidence
and accuracy, they virtually guarantee that they will conclude that the
relationship between confidence and accuracy is weak. Fortunately, there are
other measures of the strength of the relationship.

Figure 6. Results of Monte-Carlo
simulations of the effect of d' on the size of average individual difference
confidence-accuracy correlations. The light lines represent the simulated
results for four different simulation runs, two with the lock-step model (green
points) and two with the stretch model (blue points). Within each model, one
run assumed no individual difference variance in criteria placement (square
points), the other assumed large individual difference variance (circles).
Standard deviations in the correlations for each simulation run are shown with
horizontal bars. The empirically obtained mean correlations and mean d's
reported in Table 2 for each experimental condition are indicated by black Xs.
Face-Based
Correlations. One is the face-based correlation, namely, the correlation over
items between the percent of subjects who get an item correct and the mean
confidence subjects expressed in their responses to that item. These measures
eliminate individual difference variability in criteria placements by computing
an average over all subjects and by using the same subjects for each item
average. Thus, whatever the average criteria placements are for one item, they
will tend to be the same for every other item because the same subjects
produced the results for each item. In this way, the only factors driving the
size of these correlations will be d' and the size of the item differences
(because with no item variability, the correlations would have to be zero no
matter how big d' was). This reasoning suggests that results from simulations
in which v=0 are most relevant to face-based correlations because when v=0, the
placement of the confidence cuts depends only on d' and the model. Looking again
at Figure 6, when v=0, the correlations between average confidence and percent
correct were higher at almost all d' values than when v>0. This result is
consistent with the data reported in Table 3 in which the face-based
correlations were generally higher than the other types of confidence-accuracy
correlations. In fact, as d' varied from .98 to 2 across conditions, the
face-based correlations varied from .36 to .67. As the results in Figure 6
show, the range of simulation results when v=0 was remarkably similar over the
same range of d' values: .36 to .67 for the stretch model and .23 to .44 for
the lock-step model.
Mean Confidence
and Proportion Correct Over Conditions. Another method of measuring the
relationship between confidence and accuracy is to examine how mean (averaged
over subjects and faces) accuracy and mean confidence (coded 1 through 5)
covary over the different learning and memory conditions. To examine this issue
we conducted several additional simulation runs and then compared them to the
empirical data. Figure 7 presents the results of these analyses. Looking first
at the simulation results (represented by gray lines and data points in the
Figure), data from four different simulation runs selected to emphasize the
effects of v and model type are presented. Two of the runs (open data points) were
based on the lock-step model and the other two (filled data points) were based
on the stretch model. In addition, the extent of individual differences in
confidence criteria placement (v=0 or 1.5) was crossed with these
manipulations. The data points in each simulation run represent the mean
confidence and mean percent correct for 200 subjects at each average d' value
used in the previous simulations. As can be seen, although these manipulations
of signal detection parameters have a large effect on the mean confidence that
a given d' will produce, within a given set of signal detection assumptions,
the relationship between percent correct and mean confidence is strong. In
every case, as mean percent correct went up because the learning and memory
conditions produced a higher d', mean confidence also went up. However, as
might be expected, the slope of the relationship was somewhat stronger in the
stretch model because it assumes that the confidence bands expand, thereby
producing more low confidence responses, as d' gets smaller. Other simulation
runs with different parameter values produced results consistent with
intuition. In particular, confidence bias affected the overall mean confidence
but not the relationship between confidence and accuracy; decision bias
affected percent correct but not the relationship between the measures.

Figure 7.
Relationship between mean total proportion correct and mean confidence as a
function of d' from the same Monte-Carlo simulation runs used to compute the
correlations in Figure 5. The empirically obtained mean correlations and mean
d's reported in Table 2 for each experimental condition are indicated by black
Xs. The dotted lines represent the 95% confidence interval for a linear fit of the
data without the two-week retention interval data.
Looking next at
the empirical results (represented by black "X"s in Figure 7), better
learning and memory conditions produced both more accuracy and higher
confidence, exactly as predicted by a signal detection analysis. In addition, a
linear fit of the experimental data accounted for 87% of the variance and
produced a confidence-accuracy correlation considerably higher than any of the
earlier correlations (r(6) = .93, p<.001). However, one
feature of the experimental data in Figure 7 seems inconsistent with the signal
detection models presented above, namely, the fact that at the longest
(two-week) retention interval, confidence dropped at a faster rate relative to
the drop in accuracy. In particular, if the data points for the two-week retention
interval are left out, a linear function fits the remaining data almost
perfectly (r² = .995). More importantly, the data from the two-week retention interval are
well beyond the 95% confidence bands derived from that linear fit (the dotted
lines in Figure 7). That is, the effect on accuracy of increasing the retention
interval from one to two weeks was relatively small compared to the effect of
waiting 24 hours before being tested, but the effect on mean confidence of
increasing the retention interval from one to two weeks was relatively large
compared to waiting 24 hours before testing.[14]
These results
have two important implications. First, even though both between- and
within-subject confidence-accuracy correlations may be small in a given experiment,
the relationship between confidence and accuracy over conditions may be much
stronger. Second, the possibility that mean confidence changed more rapidly
than mean accuracy as the retention interval increased beyond one week is
consistent with the idea that the confidence cuts are not locked to the
decision criteria as d' changes (unlike the assumption that Donaldson and
Murdock, 1968, seemed to make). This idea may help explain the fact that the
within-subject correlations in the two-week retention interval were not
completely predicted by signal detection measures.
Metatheories
of Memory. The reason that we introduced the stretch model can now be
presented. One feature of the signal detection analysis that has been well
studied concerns the effect that changes in learning and test conditions have
on the placement of the yes/no decision criterion. For example, although signal
detection allows the yes/no criterion to be controlled by factors other than
strength of memory, e.g., payoffs for saying yes, in a typical recognition
experiment in which the subjects are explicitly told that they have seen half
of the items before (or one in which subjects might guess this to be true), the
yes/no criterion should be placed midway between the means of the two strength
of evidence distributions to maximize accuracy (Macmillan & Creelman, 1991). This implies that
the placement of the yes/no criterion will shift downward on the evidence
dimension as learning and memory conditions deteriorate (see panel b in Figure
1). What will happen to the confidence criteria as the yes/no criterion moves?
Will they stay fixed or be dragged along? Since explicit payoffs for confidence
estimates are rarely given in most face memory experiments (let alone in real
world criminal investigations), signal detection does not provide an optimal
placement strategy for confidence criteria.
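The claim that the accuracy-maximizing yes/no criterion sits midway between the two distribution means can be checked numerically. The following is a minimal sketch, assuming an equal-variance Gaussian model with unit standard deviations and equal numbers of seen and unseen test items. The function names and the grid search are illustrative only and are not taken from the simulations reported here.

```python
from statistics import NormalDist


def proportion_correct(d_prime, criterion):
    """Expected accuracy when half of the test items were seen (assumed)."""
    unseen = NormalDist(0.0, 1.0)      # strength of evidence, unseen items
    seen = NormalDist(d_prime, 1.0)    # strength of evidence, seen items
    hit_rate = 1.0 - seen.cdf(criterion)
    correct_rejection_rate = unseen.cdf(criterion)
    return 0.5 * (hit_rate + correct_rejection_rate)


def best_criterion(d_prime, step=0.01):
    """Grid-search the yes/no criterion that maximizes proportion correct."""
    grid = [i * step for i in range(-300, 601)]
    return max(grid, key=lambda c: proportion_correct(d_prime, c))


# As d' shrinks, the best criterion moves down the evidence dimension.
for d_prime in (2.0, 1.0, 0.5):
    print(d_prime, round(best_criterion(d_prime), 2))
```

Under these assumptions the best criterion falls at d'/2, so it shifts downward as learning and memory conditions deteriorate.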
One simple view
of how confidence might change with learning and memory conditions, which we called the
lock-step model, was depicted in panel b in Figure 1. In this model, we
assumed that the confidence criteria move in lock step with any change that
occurs in the yes/no decision criterion. Note that the sizes of the confidence
bins do not change. This lock-step model also predicts that as accuracy
worsens, overall confidence should decrease because fewer seen and unseen items
will fall into the higher confidence bins. This model also makes the very
strong prediction that whatever effect learning and memory conditions have on
confidence (via the placement of the decision criteria), those effects should
be completely explained by changes in d', regardless of how those d'
differences were produced (e.g., by changes at encoding or during the retention
interval). In sum, this simple lock-step view of the relationship between
confidence and yes/no decision criteria cannot explain the fact that the
two-week retention interval seemed to cause confidence to drop more rapidly than
accuracy.
Another model
of the way confidence might change as learning and memory conditions change
assumes that everyday experience provides people with information about the
conditions that tend to improve memories and the conditions that tend to weaken
memories. If people do have theories about how their memories are affected by
such variables as the length of a retention interval, they might use those
theories to adjust their confidence criteria, rather than moving them in lock
step with the yes/no decision criterion. For example, most people believe that
memory fades with time (Loftus, 1979; Yarmey, 1979), but suppose they believe
that memory fades at a faster rate than it actually does. In an attempt to
maintain a given likelihood of being correct, subjects might adjust their
confidence criteria to reflect those beliefs. That is, if subjects know it has
been a long time since they have seen the study items, they might require an
even greater strength of evidence before saying they are absolutely sure they
have seen an item. In signal detection terms, their confidence criteria shift
in a more conservative direction, as was depicted in Figure 4, rather than move
in lock step with the yes/no criterion. By comparing the effect on the
confidence criteria of moving from optimal to sub-optimal memory conditions
depicted in panel b in Figure 1 with those depicted in Figure 4, we can see
that the latter would predict a much greater reduction in confidence as
accuracy and d' decreased. This conservative-shift model predicts many fewer high
confidence "yes" responses both because d' is decreasing and
because the subjects require even stronger evidence, at both extremes, before
giving very high confidence estimates. Which learning and memory variables
would have this selective effect on confidence might depend on the nature of
the theories that subjects hold. If for example, subjects' beliefs about the
effects of duration of exposure were relatively accurate while those for
retention interval were not, confidence shifts that are unexplainable by d'
might be observed for the latter but not the former.
This metamemory
conservative-shift model is very similar to the stretch model that we used in
the simulations. The only difference is that the stretch simulations assumed
that the yes-5 confidence cut was fixed at 3 standard deviations above the
noise distribution while the conservative-shift view implies that subjects
might move their yes-5 confidence cut in an even more conservative direction
(farther away from the noise distribution) the weaker they believe their memory
to be.
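The contrast between the lock-step and stretch views can be illustrated with a small analytic calculation. This is only a sketch under stated assumptions: an equal-variance Gaussian model, a yes/no criterion at d'/2, four confidence cuts on each side of the criterion, and "no" cuts that mirror the "yes" cuts. The fixed bin width of 0.5 and the even spacing of the stretch cuts below the yes-5 cut are hypothetical choices, not the parameter values used in the simulations.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def bin_width(d_prime, model, fixed_width=0.5, top_cut=3.0):
    """Width of each confidence bin on the evidence axis (hypothetical values)."""
    criterion = d_prime / 2.0          # yes/no criterion midway between the means
    if model == "lock-step":
        return fixed_width             # bins keep their size as the criterion moves
    if model == "stretch":
        # The yes-5 cut stays 3 SD above the noise mean; the remaining cuts
        # are spaced evenly between it and the criterion (an assumption).
        return (top_cut - criterion) / 4.0
    raise ValueError(f"unknown model: {model}")


def mean_confidence(d_prime, model):
    """Expected confidence (1 to 5) over all responses."""
    width = bin_width(d_prime, model)
    half = d_prime / 2.0

    def p_distance_below(t):
        # Distance from the criterion is |N(d'/2, 1)| for seen and unseen alike.
        return STD_NORMAL.cdf(t - half) - STD_NORMAL.cdf(-t - half)

    total, previous = 0.0, 0.0
    for level in range(1, 6):
        upper = 1.0 if level == 5 else p_distance_below(level * width)
        total += level * (upper - previous)
        previous = upper
    return total


for d_prime in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    print(d_prime,
          round(mean_confidence(d_prime, "lock-step"), 3),
          round(mean_confidence(d_prime, "stretch"), 3))
```

With these choices both models yield higher mean confidence at higher d', but mean confidence changes over a wider range in the stretch version because its bins widen as d' falls.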
Calibration
of Confidence Ratings to Accuracy. Before exploring the role that
metatheories might play in the confidence-accuracy relationship, we present the
results of a fifth measure of the strength of this relationship. Not only can
this measure help distinguish between the lock-step and
stretch/conservative-shift confidence models, it is also virtually equivalent
to the calibration curves that are common in decision-making and judgment of
learning research (Lichtenstein & Fischhoff, 1977; Nelson & Dunlosky,
1991). In the present context we would argue that confidence is similar to
"odds of being correct" estimates and that absolute confidence is
similar to estimates that are close to 1.0 odds of being correct while "just
guessing" is similar to chance odds estimates (.5 in the present experimental
procedure).
According to
the signal detection model, the areas under the seen and not seen curves in
each confidence bin determine the proportion of responses that will be correct
at each confidence level. With this in mind, Figures 1 and 4 show that higher
confidence responses should always be associated with a higher
probability of correct responses except when d' approaches zero (assuming the
distributions are symmetrical around their means and the response criterion is
placed optimally). On the other hand, the exact placement of the confidence
cutpoints, along with the size of d', will determine how well calibrated
confidence is to accuracy; although, in general, the more spaced out the
confidence criteria are, the better calibrated confidence will be. One way of
measuring the degree to which confidence judgments are calibrated to the
accuracy of the yes and no responses that also has considerable applied
significance is to treat all responses as independent events. That is, even
though different subjects may have different d's, different placements of
criteria, and different seen to not seen distribution variance ratios, we can
ask what the odds are that "yes" or "no" responses are correct
given they were accompanied by a particular confidence estimate, regardless of
who generated the responses.
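The relation between bin areas and calibration described above can be written out directly. The sketch below assumes an equal-variance Gaussian model and equal numbers of seen and unseen items; the cut placements in the usage loop are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def _area(dist, lower, upper):
    """Area under `dist` between lower and upper (upper=None means no limit)."""
    top = 1.0 if upper is None else dist.cdf(upper)
    return top - dist.cdf(lower)


def yes_calibration(d_prime, cuts):
    """P(correct | "yes" at each confidence level), lowest level first.

    cuts[0] is the yes/no criterion; the remaining entries are the lower
    edges of the successively higher confidence bins.
    """
    unseen = NormalDist(0.0, 1.0)
    seen = NormalDist(d_prime, 1.0)
    edges = list(cuts) + [None]        # the top bin is open-ended
    curve = []
    for lower, upper in zip(edges[:-1], edges[1:]):
        hits = _area(seen, lower, upper)
        false_alarms = _area(unseen, lower, upper)
        curve.append(hits / (hits + false_alarms))
    return curve


for d_prime in (0.25, 1.0, 2.0):
    criterion = d_prime / 2.0
    cuts = [criterion + 0.5 * k for k in range(5)]   # hypothetical spacing
    print(d_prime, [round(p, 3) for p in yes_calibration(d_prime, cuts)])
```

In this simplified setting the proportion of correct "yes" responses rises with confidence for any d' above zero and stays at .5 in every bin when d' is zero.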

Figure 8. "Yes" response
calibration curves produced from the same four sets of Monte-Carlo simulation
runs used throughout. Each line within the four graphs shows how, at each d'
level, the probability of producing a correct "yes" response,
hits/(hits+false alarms), changes as confidence in the response increases. The
d' values for the curves within each graph were 0, .25, .5, 1, 1.5, 2, 2.5, and
3. In each graph, the curve with the lowest proportion correct had d's of 0 and
that with the highest had d's of 3. In the bottom two graphs, black lines
represent d' values closest to those obtained in Experiment 1.
Figure 8 shows
the "yes" response results for this type of analysis from the same
four sets of simulation runs used to produce the confidence-accuracy
correlations presented in Figure 5. Data points connected by lines are from 200
"subjects" whose d's were selected from the same distribution. In the
bottom two graphs, the black lines and data points show the calibration curves
for d' values close to the range obtained in the experiment, namely, d'=1, 1.5,
and 2. We highlight these only in the bottom two graphs because these make the
more realistic assumption that there are individual differences in how subjects
place their confidence cutpoints while the top two graphs make the less
realistic assumption of no individual differences in this response. Looking at
all of the calibration curves supports the intuition from our signal detection
analysis that the degree of calibration of confidence to accuracy depends
mostly on d'. The model type and the amount of individual difference variance
in confidence play small, but interesting, roles compared to d'. Nevertheless,
for applied reasons, it is important to note that the signal detection approach
predicts that even very small d's will yield confidence-accuracy relationships
in which higher confidence estimates are associated with higher probabilities
of correct "yes" responses, even in circumstances in which
correlational measures may be near zero and insignificant, i.e., d'=.25!
Table 4
Number of "Yes" Responses (N) and Percent of Them Correct (% Hits) as a Function of
Exposure Duration, Retention Interval, and Confidence in the Between-Subjects Experiment
(column pairs give N and % Hits at each length of retention interval, in hours)

| Exposure Duration (Seconds) | Response Confidence | N (1) | % Hits (1) | N (24) | % Hits (24) | N (168) | % Hits (168) | N (336) | % Hits (336) |
|---|---|---|---|---|---|---|---|---|---|
| Three | Absolute | 189 | 97.88 | 151 | 97.35 | 132 | 93.18 | 72 | 91.67 |
| Three | High | 159 | 85.53 | 176 | 76.70 | 165 | 76.36 | 124 | 82.26 |
| Three | Moderate | 222 | 70.72 | 219 | 71.23 | 209 | 71.29 | 198 | 68.18 |
| Three | Slight | 173 | 60.12 | 180 | 56.67 | 192 | 60.94 | 214 | 63.08 |
| Three | Guess | 67 | 64.18 | 83 | 56.63 | 83 | 63.86 | 110 | 62.73 |
| Eleven | Absolute | 371 | 97.04 | 352 | 96.59 | 218 | 94.95 | 161 | 93.79 |
| Eleven | High | 154 | 88.31 | 157 | 81.53 | 185 | 81.08 | 148 | 81.76 |
| Eleven | Moderate | 143 | 77.62 | 154 | 70.78 | 185 | 72.97 | 192 | 74.48 |
| Eleven | Slight | 105 | 71.43 | 110 | 70.91 | 148 | 64.86 | 176 | 69.32 |
| Eleven | Guess | 49 | 67.35 | 75 | 61.33 | 67 | 53.73 | 112 | 70.54 |
Table 4 presents
the empirical results for the same analysis of "yes" responses.
Exactly as predicted from the signal detection analysis, the likelihood that
"yes" responses were correct given that subjects said they were
"just guessing," was much lower (.628) than the probability that
"yes" responses were correct when they were accompanied by absolute
confidence (.959). Examination of the black data points in the bottom two
graphs in Figure 8 (when v was equal to 1.5) indicates that the lock-step and the
stretch models produce a subtle difference in the calibration curves that is
most noticeable at the highest confidence levels. Because the stretch model
assumes that subjects widen their confidence cutpoints as d' gets lower, it
predicts that the probability of the higher confidence "yes"
responses will be less affected by reductions in d' compared to the lock-step
model. That is, the slope of the calibration curve should remain fairly stable
as d' changes (within the d' ranges our experimental conditions produced) if
the stretch model is correct. We tested these predictions by performing a 4
(retention interval) x 2 (duration of exposure) x 5 (confidence level)
log-linear fit of the yes response frequencies coded as hits or false alarms in
which we treated each response as an independent event. Although the fit of the
overall model was highly significant (χ²(39) = 726.14, p < .0001), Wald
Chi-Square effect tests revealed that, consistent with the stretch model,
this was due to only three main effects: retention interval (χ²(3) = 13.57, p = .0036),
exposure duration (χ²(1) = 4.68, p = .031), and confidence
(χ²(4) = 361.26, p < .0001). Most importantly, none of the interactions reached significance.[15]
In addition, when we repeated this analysis assuming an ordinal constraint on
confidence, the main effects of retention interval and exposure duration
disappeared. Considering that the Wald Chi-Square effects of retention interval
and duration of exposure were highly significant without confidence in the
model (retention interval χ²(3) = 35.47, p < .0001 and exposure duration
χ²(1) = 62.84, p < .0001), this latter result suggests that the effects of retention interval
and duration on the accuracy of "yes" responses were almost completely
mediated by confidence! In effect, if one wanted to predict the odds of a
"yes" response being correct, knowing the confidence that was
expressed in that response would provide virtually all of the relevant
information. Knowing the optimality of the learning and memory conditions would
provide almost no additional predictive information over and above the
subject’s expressed confidence in that response.
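The pooled calibration values quoted above (.628 for "just guessing" and .959 for absolute confidence) can be recovered from the cell counts in Table 4 by treating every "yes" response as an independent event. The sketch below is one way to do that arithmetic; the hit counts are reconstructed by rounding N times the reported percentage, which is an assumption about how the percentages were produced, and the variable and function names are illustrative.

```python
# (N, % hits) at retention intervals of 1, 24, 168, and 336 hours, from Table 4.
# The first four pairs in each list are the three-second exposure condition,
# the last four are the eleven-second exposure condition.
TABLE_4_YES = {
    "Absolute": [(189, 97.88), (151, 97.35), (132, 93.18), (72, 91.67),
                 (371, 97.04), (352, 96.59), (218, 94.95), (161, 93.79)],
    "High":     [(159, 85.53), (176, 76.70), (165, 76.36), (124, 82.26),
                 (154, 88.31), (157, 81.53), (185, 81.08), (148, 81.76)],
    "Moderate": [(222, 70.72), (219, 71.23), (209, 71.29), (198, 68.18),
                 (143, 77.62), (154, 70.78), (185, 72.97), (192, 74.48)],
    "Slight":   [(173, 60.12), (180, 56.67), (192, 60.94), (214, 63.08),
                 (105, 71.43), (110, 70.91), (148, 64.86), (176, 69.32)],
    "Guess":    [(67, 64.18), (83, 56.63), (83, 63.86), (110, 62.73),
                 (49, 67.35), (75, 61.33), (67, 53.73), (112, 70.54)],
}


def pooled_proportion_correct(cells):
    """Pool "yes" responses over cells, ignoring who produced them."""
    total_n = sum(n for n, _ in cells)
    # Recover whole-number hit counts from the reported percentages.
    total_hits = sum(round(n * pct / 100.0) for n, pct in cells)
    return total_hits / total_n


pooled = {level: pooled_proportion_correct(cells)
          for level, cells in TABLE_4_YES.items()}
for level, value in pooled.items():
    print(f"{level:9s} {value:.3f}")
```

Pooling in this way weights each cell by its number of responses, so conditions that produced more responses at a given confidence level contribute more to that level's estimate.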
Even after
controlling for the differential rates of the confidence estimates across experimental
conditions (seen as Ns in Table 4), the relationship between confidence and
accuracy remained remarkably strong and consistent, in a pattern predicted by
the stretch model simulation data (when v=1.5). In the worst memory condition
(three second exposure duration and 336 hour retention interval with a mean d'
of .98) 91.7% of the absolutely confident "yes" responses were hits
while only 62.7% of the responses labeled as guesses were hits; and in the best
memory condition (eleven seconds exposure and one hour retention with a mean d'
of 2) 97% of the absolutely confident responses were hits while 67.4% of the
guesses were hits. This result, along with the much greater ability of
confidence than condition differences to predict the hit versus false alarm
rate, presents a very different picture of the strength of the relationship
between confidence and accuracy than that which is typically concluded from the
correlation-based methods of measuring that relationship.

Figure 9. "No" response
calibration curves produced from the same four sets of Monte-Carlo simulation
runs used throughout. Each line within the four graphs shows how, at each d'
level, the probability of producing a correct "no" response, correct
rejections/(correct rejections+misses), changes as confidence in the response
increases. The d' values for the curves within each graph were 0, .25, .5, 1,
1.5, 2, 2.5, and 3. In each graph, the steeper the slope, the greater the d'.
In the bottom two graphs, black lines represent d' values closest to those
obtained in Experiment 1.
Similar though
not identical results are presented in Figure 9 for the simulated
"no" responses. As can be seen in Figure 9, the pattern of
calibration curves for the "no" responses (for the identical
simulation runs that produced the curves in Figure 8) is somewhat different
from that for the "yes" responses. In particular, note how at higher
d's the probability of low confidence responses being correct drops well below
chance. This difference is primarily the result of the biased placement of the
"yes/no" decision criterion in the simulations. To see how this can
happen, one need merely examine Figure 4 and imagine that the
"yes/no" decision criterion was moved to the position occupied by the
yes-1 criterion. Subjects would be saying "no-just guessing" at a
point where the area under the seen distribution is greater than the area under
the not seen distribution, that is, they would produce more incorrect misses
than correct rejections. This effect would increase as d' increased (up to a
point that would depend on the spacing of the cutpoints and the variance of the
seen and not seen distributions).
Table 5
Number of "No" Responses (N) and Percent of Them Correct (%CR) as a Function of
Exposure Duration, Retention Interval, and Confidence in the Between-Subjects Experiment
(column pairs give N and %CR at each length of retention interval, in hours)

| Exposure Duration (Seconds) | Response Confidence | N (1) | %CR (1) | N (24) | %CR (24) | N (168) | %CR (168) | N (336) | %CR (336) |
|---|---|---|---|---|---|---|---|---|---|
| Three | Absolute | 223 | 81.17 | 143 | 80.42 | 150 | 83.33 | 62 | 70.97 |
| Three | High | 278 | 72.66 | 337 | 70.33 | 256 | 72.27 | 230 | 68.26 |
| Three | Moderate | 318 | 67.92 | 341 | 65.69 | 421 | 63.90 | 346 | 67.05 |
| Three | Slight | 255 | 59.22 | 222 | 54.95 | 271 | 56.09 | 242 | 61.98 |
| Three | Guess | 101 | 55.45 | 141 | 52.48 | 104 | 45.19 | 152 | 54.61 |
| Eleven | Absolute | 331 | 92.45 | 257 | 84.05 | 167 | 81.44 | 191 | 79.58 |
| Eleven | High | 300 | 81.67 | 339 | 77.58 | 324 | 77.78 | 271 | 75.28 |
| Eleven | Moderate | 278 | 71.22 | 270 | 74.44 | 340 | 68.53 | 263 | 70.34 |
| Eleven | Slight | 181 | 56.35 | 189 | 63.49 | 183 | 57.38 | 269 | 60.97 |
| Eleven | Guess | 83 | 45.78 | 74 | 48.65 | 93 | 52.69 | 118 | 58.47 |
Table 5 shows the
empirically obtained calibration results for the no responses. When subjects
indicated they were "just guessing," 51.67% of their no responses
were correct but when they said that they were absolutely confident, 81.67% of
their no responses were correct.[16]
Clearly confidence was also well calibrated to the accuracy of the no
responses. Comparison of the lock-step and stretch models for the no responses
shows that both models predict that the degree of calibration should get worse
as d' decreases. A log-linear fit for the "no" responses suggested,
once again, that confidence was an excellent predictor of accuracy. Retention
interval (χ²(3) = 6.56, p = .088), exposure duration (χ²(1) = 16.89,
p < .0001), and confidence (χ²(5) = 327.47,
p < .0001) were significant. However, unlike the "yes"
responses, but still consistent with
the signal detection simulations, this analysis also produced two significant
interactions: retention interval by confidence (χ²(12) = 24.37, p = .018),
and exposure duration by confidence (χ²(4) = 10.17, p =
.038), with the remaining interaction effects not reaching significance. The
nature of the two interaction effects was consistent with the idea that the no
response confidence criteria shifted in a more conservative direction as d'
decreased. Namely, the extent to which confidence was calibrated to the
accuracy of the "no" responses decreased as the learning and memory
conditions worsened. For example, in the least optimal memory conditions, the
difference in accuracy between "just guessing" responses and
"absolutely confident" responses was 15%, but the difference in the
most optimal conditions was about 45%.
Clearly, both in
absolute and relative terms, knowing the level of confidence that subjects
expressed in their yes and no responses was highly predictive of the
accuracy of those responses (even without knowing which subject generated the
responses). This relationship between confidence and accuracy was largely
hidden by the correlational results presented earlier. Thus, even when both
within and between subjects confidence-accuracy correlations appear to be low
and not significant, confidence can be highly calibrated to accuracy. In fact,
it is possible to show how the responses of two subjects can each be highly
calibrated while their mean confidence and percent correct scores are inversely
related. This can happen when the subject with the higher d' happens to have
wider confidence bins than the subject with a lower d'. Although the former
will be more accurate and less confident, for both subjects, higher confidence
responses will still be associated with a greater likelihood of those responses
being correct.
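The two-subject case just described can be made concrete with a short calculation. The following is a sketch under assumed values: an equal-variance Gaussian model, a criterion at d'/2, four evenly spaced confidence cuts on each side of the criterion, and two hypothetical subjects (d' = 2 with bins 1.0 wide, and d' = 1 with bins 0.25 wide). None of these numbers come from the experiments.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def _mass(lower, upper, mean):
    """P(lower <= Y < upper) for Y ~ N(mean, 1); upper=None means no limit."""
    top = 1.0 if upper is None else STD_NORMAL.cdf(upper - mean)
    return top - STD_NORMAL.cdf(lower - mean)


def subject_summary(d_prime, width):
    """Return (accuracy, mean confidence, calibration by confidence level)."""
    half = d_prime / 2.0
    accuracy = STD_NORMAL.cdf(half)            # hits and correct rejections
    mean_confidence, calibration = 0.0, []
    for level in range(1, 6):
        lower = (level - 1) * width
        upper = None if level == 5 else level * width
        correct = _mass(lower, upper, half)    # evidence on the correct side
        wrong = _mass(lower, upper, -half)     # evidence on the wrong side
        mean_confidence += level * (correct + wrong)
        calibration.append(correct / (correct + wrong))
    return accuracy, mean_confidence, calibration


# Hypothetical subjects: higher d' with wide bins, lower d' with narrow bins.
for label, d_prime, width in (("A", 2.0, 1.0), ("B", 1.0, 0.25)):
    accuracy, confidence, calibration = subject_summary(d_prime, width)
    print(label, round(accuracy, 3), round(confidence, 2),
          [round(p, 3) for p in calibration])
```

Under these assumptions subject A is more accurate but less confident than subject B, while for each subject the proportion correct still rises with confidence.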
Another implication
of these results concerns the effect that different learning and memory
conditions have on confidence criteria placement. The different pattern of
calibration for the "yes" and "no" responses tends to
support the stretch/conservative-shift models over the lock-step model.
Apparently, subjects do not simply move their confidence criteria in lock-step
with the "yes/no" decision criterion as d' changes.
One way to test
the hypothesis that people might adjust their confidence criteria based on
their metatheories of memory would be to assess the effect of duration of
exposure and length of retention interval in such a way that subjects were
deprived of information about the source of whatever item differences in
subjective strength of evidence they might experience. That is, take away their
knowledge of each item’s time since last seen (and its duration of exposure) at
the time of testing. Experiment 2 was designed in an attempt to create these
conditions. It also provided another opportunity to examine the
confidence-accuracy relationship from a somewhat different face-recognition
procedure than that used in Experiment 1.
To hide
information about the length of a retention interval (as well as the duration
of exposure) associated with seen items, we varied these factors within
subject. In a typical within-subjects retention interval design, items with
different retention intervals are presented and tested in blocks. Subjects see
some items and then are tested at one retention interval, then they see more
items and are tested at a different retention interval. This procedure does not
hide information about the retention interval, however, because the subjects
know how long it has been since the last study session. Therefore, we required
that subjects study items at different times during a three-week interval and
return for one final test session in which all of the items were mixed
together. In this way, subjects did not have, at the test session, non-memory
based information about how long it had been since they had seen a particular
item. Unfortunately, this procedure has the drawback of adding retroactive and
proactive interference effects to the retention interval as well as confounding
it with order effects. Nevertheless, since our primary concern was the
relationship between d', confidence, and accuracy and not the pure effects of
retention interval and duration of exposure, we felt this was a small price to
pay to keep subjects blind, during the test phase, to the learning conditions
that were associated with individual
items.
Subjects.
Thirty-five subjects were again obtained from introductory psychology classes
at UCSD and served in partial fulfillment of class requirements. Subjects
volunteered to participate in an experiment lasting about three weeks and
requiring that they return for a total of six sessions over that three-week
interval.
Design.
A 2 x 5 within-subjects factorial design was employed. Two levels of duration
of exposure to the study faces (two seconds and twelve seconds) were crossed
with five different retention intervals (one hour, 24 hours, 168 hours, 336
hours, and 384 hours). This unusual pattern of retention intervals was selected
because pilot testing indicated that there was an unexpectedly strong primacy
(or “first in”) effect that interfered with the retention interval effect. In
an attempt to provide a long retention interval in which this “first in” effect
was minimized, we followed the first session with another, two days later, with
the intention of discarding data for items from the very first study session.
Procedure.
In a manner similar to the previous experiment, subjects studied a total of 50
color slides of faces and were tested for recognition with 100 slides. Unlike
the prior experiment, however, the decision task was whether this person had
been seen before, not whether the identical slide had been seen before. Thus, the 50
previously seen items were not identical pictures of the same person but
consisted instead of the same person with some minor changes in appearance,
e.g., changes in clothing, minor changes in hair style, changes in facial
expression, and so on. All of the stimulus people were randomly divided into
two sets of 50. Half of the subjects studied one set, the other half studied
the other set.
Subjects were
run in groups of one to four at each of five study sessions. During each study
session, subjects were shown 10 slides of male and female college age
individuals (using the same methods of display as in Experiment 1). Half of the
slides in each session were seen for the short duration and half for the long
duration. Subjects were instructed to look at each person carefully and then
during a 15 second inter-item interval predict whether they thought that they
would recognize this person later, indicate how confident they were in this
prediction, and write down a reason, if they had one, from a list that we
provided, why they thought they might or might not recognize that person later.
At the
beginning of the first session, subjects were told that they would have to
return on four additional days over the next three weeks for a brief time but
that the last session would last for a much longer time (two hours) than the
previous ones. After the introductory first study session, subjects returned two
days later for the next session, then a week later for the next session, then
five or six days later for the next session, and then the following day for the
last session. At the end of the last study session of 10 slides, subjects were
given an hour break after which they had to return to the laboratory for the
test session.
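The session schedule described above can be mapped onto the five nominal retention intervals in the design. The sketch below assumes that the "five or six days" gap was six days and that the one-hour break before the test is ignored for all but the last study session; both are simplifications, and the names are illustrative.

```python
# Days between consecutive study sessions, as described in the procedure.
# The text says "five or six days" for the third gap; six is assumed here.
GAPS_IN_DAYS = [2, 7, 6, 1]
HOURS_FROM_LAST_STUDY_TO_TEST = 1


def nominal_retention_hours(gaps_in_days, final_break_hours):
    """Retention interval in hours for each study session, first session first."""
    intervals = []
    for session in range(len(gaps_in_days) + 1):
        days_before_last_session = sum(gaps_in_days[session:])
        if days_before_last_session == 0:
            intervals.append(final_break_hours)   # last study session
        else:
            intervals.append(days_before_last_session * 24)
    return intervals


print(nominal_retention_hours(GAPS_IN_DAYS, HOURS_FROM_LAST_STUDY_TO_TEST))
# prints [384, 336, 168, 24, 1]
```

With a five-day gap in place of the six-day gap, the third value would be 144 hours rather than 168 and the two longest intervals would each shrink by 24 hours.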
During the
test, the slides of the previously seen individuals were randomly mixed with
slides of 50 new people. Subjects were shown each test slide for 20 seconds
during which time they indicated whether they had seen this person before and
their confidence in this response. If they indicated that they had seen the
person before, they were asked whether they had seen the person for a short or
a long time as well as when they had seen that person before (which session).
Finally, subjects wrote down reasons from the same list provided during the
study sessions why they thought they did or did not see the person before.
Duration and
Retention Interval Effects. Because the exposure and retention interval
conditions were varied within subjects for the seen items only, data for just
one set of not seen items was available for each subject. Thus, the learning
and test conditions affected the results for the seen items only. Nevertheless,
we computed separate estimates of d' by reusing the results for the unseen
items for each estimate. Table 6 presents the results (excluding those from the
first study session) for d', percent correct, and mean confidence. As in the
between subjects design, the main effects of retention interval and duration of
exposure were significant for all three measures but the interactions were not
(for an arcsin transformation of proportion correct: the retention interval
Greenhouse-Geisser corrected F(2.79, 94.87) = 4.27, p<.01, the
exposure duration F(1, 34) = 21.7, p<.0001[17],
but the interaction was not significant; for confidence: the Greenhouse-Geisser
corrected retention interval F(2.68, 91.3) = 3.00, p<.05, the
exposure duration F(1, 34) = 14.31, p<.001, and again the
interaction was not significant). Somewhat unexpectedly, however, the one hour
retention interval yielded slightly lower memory scores than the one day
condition, possibly due either to fatigue or to interference. Thus, the more typical
retention interval memory losses were found by comparing the one day retention
interval with the week and two week intervals. Regardless, these conditions did
create significantly different performances on the key measures.
Table 6
Effect of Duration of Exposure and Length of Retention Interval on Mean Accuracy
(Measured as d' and Percent Correct) and on Mean Confidence (see note a)

| Retention Interval | d' (2 sec.) | d' (12 sec.) | Percent Correct (2 sec.) | Percent Correct (12 sec.) | Mean Confidence (2 sec.) | Mean Confidence (12 sec.) |
|---|---|---|---|---|---|---|
| 1 hr | 1.41 | 1.77 | 65.1 | 77.7 | 3.457 | 3.817 |
| 24 hrs | 1.48 | 1.92 | 66.9 | 81.1 | 3.623 | 3.920 |
| 168 hrs | 1.34 | 1.53 | 64.0 | 69.7 | 3.457 | 3.720 |
| 336 hrs | 1.19 | 1.55 | 56.6 | 68.0 | 3.480 | 3.589 |
a Computations for the mean
d’s were hampered by the fact that each subject was only tested on five
previously seen people at each retention interval and duration. The small n meant
that some subjects were 100% correct in some cells, a value whose inverse
normal deviate is not defined. In an attempt to correct for this, we recoded
100% to 95% correct. A similar correction was applied on the 0% correct side.
For these reasons, the absolute values of the mean d’s in this table should
not be taken at face value.
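The recoding described in this note can be sketched as follows. This is a minimal illustration, not the analysis code behind the table: the function name `d_prime`, the `floor` and `ceiling` arguments, and the example rates are assumptions made for the sketch; only the recoding of 100% to 95% (and its mirror image at 0%) comes from the note.

```python
from statistics import NormalDist


def d_prime(hit_rate, fa_rate, floor=0.05, ceiling=0.95):
    """Return z(hit rate) - z(false alarm rate) after recoding extreme rates.

    A rate of exactly 1.0 is recoded to `ceiling` and a rate of exactly 0.0
    to `floor` (assumed here to be .95 and .05), because the inverse normal
    deviate is undefined at 0 and 1.
    """
    z = NormalDist().inv_cdf

    def recode(p):
        if p >= 1.0:
            return ceiling
        if p <= 0.0:
            return floor
        return p

    return z(recode(hit_rate)) - z(recode(fa_rate))


# Hypothetical subject: 5 of 5 seen faces recognized, 20% false alarms.
print(round(d_prime(1.0, 0.20), 2))
```

Because every perfect cell is assigned the same recoded value, cells with few items compress the upper end of the d' scale, which is one reason the absolute values should be read cautiously.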
Signal
Detection. The first evidence that signal detection provided a reasonably
good description of the data from this experiment can be seen in Figure 10.
Linear functions fit the normalized ROC curves[18]
about as well as they fit data from the first experiment (the smallest r² was .991 and
the largest was .996). Interestingly enough, the slopes of these ROCs were
somewhat less than those in the previous study (ranging from .634 in the one
hour retention interval to .686 in the 336 hour retention interval), suggesting
that the standard deviation of the seen-item distribution was between 1.46 and 1.58 times
larger than that of the unseen-item distribution.
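The multipliers quoted above follow from the slopes by simple arithmetic. The sketch below assumes the usual unequal-variance Gaussian model with the normalized ROC plotted as z(hit rate) against z(false alarm rate), so that the slope equals the ratio of the unseen-item to the seen-item standard deviation; the function name is hypothetical.

```python
def sd_ratio_from_slope(slope):
    """Seen-to-unseen standard deviation ratio implied by a normalized ROC slope."""
    return 1.0 / slope


for label, slope in [("1 hr", 0.634), ("336 hrs", 0.686)]:
    ratio = sd_ratio_from_slope(slope)
    print(f"{label}: slope = {slope:.3f}, SD ratio = {ratio:.2f}, "
          f"variance ratio = {ratio ** 2:.2f}")
```

Under this assumption the reciprocal of the slope is a ratio of standard deviations; the corresponding ratio of variances is its square.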
The second type
of evidence comes from the mean confidence ratings obtained for each response
type. Because each subject only generated five responses to seen slides at each
duration and retention interval, most subjects did not produce all response
types in all conditions. This made a complete analysis of variance of condition
by response-type inappropriate. Nevertheless, we were able to examine the
effects of response type on least squares estimates of mean confidence: Hits =
3.860, FA = 2.809, Miss = 3.120, CR = 3.477 (Greenhouse-Geisser corrected F(1.59,
50.77) = 27.13, p<.0001). As expected by signal detection, and
exactly as in Experiment 1, mean confidence was higher for correct than
incorrect responses (F(1, 96) = 65.33, p<.0001). In addition,
as in Experiment 1, subjects were more confident in their responses to the
slides they had seen before, i.e., hit and miss responses, than to the slides
they hadn’t seen before (F(1, 96) = 15.89, p<.0001).
Fortunately, we were able to examine the effect that duration had on the size
of the hit v. miss response-type effect by collapsing over different retention
intervals. In particular, a 2 (duration) by 2 (hit/miss) within-subjects
analysis of variance of mean confidence supported the signal detection
prediction: the difference in mean confidence between hit and miss responses
was larger in the long duration (4.037 v. 3.066) than short duration (3.667 v.
3.224) of exposure conditions (F(1, 34) = 16.86, p=.0002).
Missing data and high correlations across conditions prevented us from
performing a similar test for the different retention intervals, but when we
computed the mean hit confidence minus mean miss confidence for each subject in
each retention interval and compared these scores, we found that this
difference was larger in the shorter, one and 24 hour, retention interval
conditions (.66 and .77) than the longer, 168 and 336 hour conditions (.48 and
.61). Finally, there were no differences in the confidence subjects expressed
in their “yes” compared to “no” responses. Thus, like Experiment 1, although
not as neat, these mean confidence results are consistent with the signal
detection model. Responses that we would predict were based on evidence values
closer to the "yes/no" criterion had lower confidence means than
responses based on evidence further from c.

Figure 10. ROC
curves from applying the rating method to data from four retention intervals in
a within-subjects design in which the subjects did not know how long it had
been since they had seen each slide. The same rating method was used here as in
Figure 2; however, the normalized cumulative proportion of responses for the
not seen items did not change as the length of the retention interval changed.
Linear functions were fit with least squares regression.
When we applied
the Peirce and Jastrow equation to each subject's overall mean signed confidence and ln p/(1-p)
accuracy measure, the results supported a signal detection interpretation consistent
with the previous analyses. In particular, as Figure 10 shows, individual
differences in memory (collapsed across the different duration and retention
intervals for the seen slides) were linearly related (r² = .796, F(1,
33) = 128.53, p<.0001) to individual differences in signed confidence
scores. Consistent with the ROC data and the fact that subjects were again
slightly biased to say, "no," (the total mean proportion of
"yes" responses was .446), the intercept of this function was
slightly larger than zero (t = 2.27, p = .03). Furthermore, despite the
use of rather different recognition and retention interval procedures, the
parameter values of this fit were not significantly different from those
obtained in Experiment 1 (slope = 1.03 and 1.12, intercept = 2.24 and 2.54 for
Experiments 1 and 2, respectively). In
sum, the ROC curves, the response-type effects on mean confidence, and the
Peirce and Jastrow analysis all seem to fit about as well within a signal
detection framework as did the data from Experiment 1.
Confidence
and Accuracy Correlations. As was done for Experiment 1, within-subject and
between-subject confidence-accuracy correlations were computed to determine how
the more typical methods of examining the relationship between confidence and
accuracy behaved. Unlike the first experiment, but quite consistent with
previous reports, the overall between-subject correlation (based on mean
confidence, coded 1-5, and total proportion correct) was not only small, it was
not significant (r(33) = .218, p>.1). The within condition
confidence-accuracy correlations, for the seen items only, are in Table 7. With
one exception (the one hour between-subjects case) all of the correlations for
the long exposure slides were significantly different from zero, but none of
those from the short duration condition were. Because all but six subjects had
missing accuracy scores in one or another condition (e.g., all of their
responses were correct or incorrect), a reasonable 2-way analysis of variance
could not be computed on the within-subjects mean correlations; however, when
two separate analyses of variance, one for duration and the other for retention
interval, were computed, the duration effect was significant (F(1, 34) =
12.70, p<.005) but the retention interval effect was not (F<1).
Unlike Experiment 1, only duration of exposure seemed to have a large effect on
the confidence-accuracy correlations.[19]
But then, the duration of exposure manipulation also had a much bigger effect
on d' than did the retention interval manipulation.
Table 7
Effect of Duration of Exposure and Length of Retention Interval on the Size of Confidence-Accuracy Within and Between Subject Correlations for the Seen Items Only [a]
| Retention Interval | Within-Subjects (2 sec. exposure) | Within-Subjects (12 sec. exposure) | Between-Subjects (2 sec. exposure) | Between-Subjects (12 sec. exposure) |
| --- | --- | --- | --- | --- |
| 1 hr | .040 | .395* | .244 | .147 |
| 24 hrs | .196 | .396* | .046 | .680* |
| 168 hrs | .124 | .395* | .151 | .509* |
| 336 hrs | .060 | .310* | .328 | .395* |
a Each of the mean within
subject correlations is based on a different n depending on the number of subjects who correctly identified all of
the seen items in a given condition. If a subject correctly identified all of
the items, there was no variation in accuracy and therefore a correlation could
not be computed for that subject.
*p<.05
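The reason some subjects drop out of the within-subject means can be made concrete. The following is a minimal sketch with made-up ratings, not the original analysis: it computes a Pearson correlation between confidence and 0/1 accuracy for one subject in one condition and returns `None` when either measure has no variation. The function name and the example data are hypothetical.

```python
from math import sqrt


def confidence_accuracy_r(confidence, correct):
    """Pearson r between confidence ratings and 0/1 accuracy, or None if undefined."""
    n = len(confidence)
    mean_c = sum(confidence) / n
    mean_a = sum(correct) / n
    ss_c = sum((c - mean_c) ** 2 for c in confidence)
    ss_a = sum((a - mean_a) ** 2 for a in correct)
    if ss_c == 0 or ss_a == 0:
        return None  # no variation in one measure, so r cannot be computed
    cov = sum((c - mean_c) * (a - mean_a) for c, a in zip(confidence, correct))
    return cov / sqrt(ss_c * ss_a)


# Hypothetical subjects, five seen items each in one condition.
print(confidence_accuracy_r([5, 4, 2, 5, 1], [1, 1, 0, 1, 0]))  # defined
print(confidence_accuracy_r([5, 4, 3, 5, 4], [1, 1, 1, 1, 1]))  # None: all correct
```

With only five seen items per cell, perfect or zero accuracy is common, so the number of subjects contributing to each mean differs from cell to cell.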
Mean
Confidence and Proportion Correct. If this experiment was successful in
preventing subjects from using information about the length of the retention
interval in setting their confidence criteria, then mean confidence should not
drop at a relatively faster rate than proportion correct as the retention
interval increased from one to two weeks. In addition, mean confidence and
total proportion correct should be monotonically (but possibly linearly, given
the range of d' values produced) related across all of the learning conditions.
Figure 11 shows that these predictions were confirmed. Unlike the results from
the first experiment, the two-week retention interval did not result in
a relatively more rapid drop in confidence than in accuracy. Instead the data
points were well within the 95% confidence intervals (shown as gray lines) of
the best fitting linear function (r² = .912) for the data that excluded
the two-week retention interval.[20]

Figure 11. The relationship between
mean confidence (coded 1 through 5) and mean proportion correct for each
learning and memory condition in Experiment 2. The squares are data from the
short duration and the dots are from the long duration of exposure. The gray
lines represent the 95% confidence interval for the best fitting linear
function leaving out the data from the two-week retention interval.
Calibration
of Confidence to Accuracy. Figure 12 shows the proportion of all of the yes
and no responses that were correct given the level of confidence expressed in
the response. Clearly, confidence was highly calibrated to the accuracy of both
the yes and the no responses (the main effect of confidence Wald χ²(4) = 163.29, p<.0001), although
the calibration was better for the yes responses than for the no responses (the
yes/no by confidence interaction Wald χ²(4) = 26.52,
p<.0001). Because all of the false alarms were made to a standard set of not
seen stimuli, it was impossible to test whether the degree of calibration
varied as the duration and retention interval of the seen stimuli changed.
Regardless, these results again show that confidence in and accuracy of face
recognition can be highly related even when correlation measures on the same
data suggest the relationship is weak or non-existent.
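A calibration curve of the kind just described is simply the proportion of correct responses at each confidence level, computed separately for "yes" and "no" responses. The sketch below illustrates the computation on a handful of invented responses; the tuple layout, function name, and data are assumptions for illustration only.

```python
from collections import defaultdict


def calibration_curve(responses):
    """Proportion correct at each confidence level, separately for "yes" and "no".

    `responses` is an iterable of (said_yes, confidence, was_seen) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # (answer, confidence) -> [n correct, n total]
    for said_yes, confidence, was_seen in responses:
        key = ("yes" if said_yes else "no", confidence)
        counts[key][0] += int(said_yes == was_seen)
        counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in sorted(counts.items())}


# Hypothetical responses: (said "yes"?, confidence 1-5, item actually seen?)
data = [
    (True, 5, True), (True, 5, True), (True, 5, True), (True, 5, False),
    (True, 2, True), (True, 2, False),
    (False, 5, False), (False, 5, False), (False, 2, True), (False, 2, False),
]
for key, proportion in calibration_curve(data).items():
    print(key, round(proportion, 2))
```

Unlike a correlation, this summary conditions on the confidence level itself, which is why it can reveal a strong relationship in data whose confidence-accuracy correlations are small.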

Figure 12. Calibration curves for "yes" and
"no" responses for the data from Experiment 2. The darker line and
square data points represent the probability of all "yes" responses
being correct given that they were accompanied by a particular confidence rating.
The lighter line and circles represent the same results for all "no"
responses.
Confidence
Cutpoints. Taken together, the results from Experiments 1 and 2 are
consistent with the idea that when subjects know how long it has been since
they had studied some faces, they adjust their confidence criteria to reflect
their beliefs about the effect that the long retention interval has on their
memories. Nevertheless, we attempted to test more directly the hypothesis
that subjects in the first experiment set their confidence criteria
ever more conservatively as the retention interval increased, whereas subjects in
the second experiment held their confidence criteria relatively fixed as d' changed.
To do so, we estimated the placement of the confidence criteria in the
two experiments using the signal detection and maximum likelihood procedures
detailed in Dorfman & Alf (1969), Ogilvie & Creelman
(1968), and Swets & Pickett (1982).
In this
procedure, the parameters of the signal detection model are adjusted to
maximize the theoretical probability of obtaining the observed distribution of
"yes/no" and confidence responses. The parameters were r (the ratio
of the noise to signal distribution standard deviations), d', and five
cutpoints: "yes-5," "yes-3," "c,"
"no-3," and "no-5." The r values were about .8 and roughly
equal across the different learning and memory conditions. Although the
estimated d' values were higher than those in Tables 1 and 3, the estimated
values were linearly related (squared correlation r² = .985) to those in the tables with a
slope not different from 1.0.
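The quantity being maximized in such a fit can be written down compactly. The sketch below is not the Dorfman and Alf procedure itself: it only evaluates a multinomial log-likelihood for one set of parameter values and leaves out the search over parameters. It assumes unseen items are distributed N(0, 1), seen items N(d', 1/r), and cutpoints ordered from "no-5" up to "yes-5"; the function names and the rating counts are hypothetical.

```python
from math import log
from statistics import NormalDist


def category_probs(mean, sd, cutpoints):
    """Probability that evidence from N(mean, sd) falls in each bin between ordered cutpoints."""
    dist = NormalDist(mean, sd)
    cum = [0.0] + [dist.cdf(c) for c in cutpoints] + [1.0]
    return [hi - lo for lo, hi in zip(cum, cum[1:])]


def log_likelihood(d_prime, r, cutpoints, seen_counts, unseen_counts):
    """Multinomial log-likelihood of rating counts under an unequal-variance model.

    Unseen items are N(0, 1); seen items are N(d_prime, 1 / r), where r is
    the ratio of the noise to the signal standard deviation.
    """
    p_unseen = category_probs(0.0, 1.0, cutpoints)
    p_seen = category_probs(d_prime, 1.0 / r, cutpoints)
    total = 0.0
    for counts, probs in ((seen_counts, p_seen), (unseen_counts, p_unseen)):
        total += sum(n * log(p) for n, p in zip(counts, probs) if n > 0)
    return total


# Hypothetical counts in six response categories, lowest to highest evidence.
cuts = [-1.0, -0.2, 0.6, 1.4, 2.2]
seen = [10, 20, 30, 40, 50, 50]
unseen = [60, 60, 40, 25, 10, 5]
print(log_likelihood(1.5, 0.8, cuts, seen, unseen))
print(log_likelihood(0.0, 0.8, cuts, seen, unseen))  # lower: d' = 0 fits these counts worse
```

A full fit would adjust d', r, and the five cutpoints until this value is as large as possible, and would derive standard errors for the estimates from the curvature of the function at its maximum.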
Of most
interest in these analyses is whether the maximum likelihood estimates for the
confidence cutpoints would support the conservative-shift model for the data
from Experiment 1 but not for the data from Experiment 2. Figure 13 presents
the results of these maximum-likelihood fits for both experiments. The graphs
show the maximum likelihood estimates for the five estimated cutpoints plotted
against the maximum likelihood d' estimates. The data are collapsed over the
duration conditions for both fits because the pattern of effects of retention
interval was virtually identical in the two duration conditions.[21]
As can be seen in the top panel, when subjects in Experiment 1 were aware of
the length of the retention interval for individual items, placement of the
higher confidence cutpoints for the “yes” responses actually did become more
conservative (after an initial tendency to be drawn along in lock-step fashion
with the “yes/no” decision criterion) as the retention interval increased and
d' decreased. In a fashion completely consistent with the calibration results,
the increase in conservativeness was even more pronounced for the “no”
responses because the conservative shift combined with, rather than worked
against, the tendency for the “yes/no” criterion (c) to shift downward with
decreasing d'. Using the standard errors of the maximum-likelihood estimates,
t-tests confirmed that both the “yes-5” and “no-5” cutpoints were significantly
more extreme in the two week retention interval condition than in the one hour
condition. In short, the subjects seemed to expand the range of their most
extreme confidence bins thereby requiring greater strength of evidence (at both
ends of the scale) before indicating high confidence. In stark contrast, but
not surprisingly, the lower panel shows that when subjects in Experiment 2 were
unaware of the length of the retention interval for individual items, the
placement of their confidence cutpoints did not change as the retention
interval increased and d' decreased. Of course, it is difficult to imagine how
they could have changed, given that the subjects had no information about which
faces were from which retention interval.

Figure 13. The
top and bottom parts of the figure show the maximum likelihood estimates for
d', the “yes-absolutely confident,” “yes-moderately confident,” “yes-no,”
“no-moderately confident,” and “no-absolutely confident,” response cutpoints as
a function of the length of the retention interval for Experiments 1 and 2.
Results for Experiment 1 are in the top figure.
If people do
have metatheories of the effects that retention interval has on their memories
for faces and those theories predict that memory decays at a rate faster than
it actually does, we might expect that subjects’ predictions of their memory
performance over retention interval would drop more rapidly than their actual
performance drops over equivalent retention intervals. To test this idea we
designed an “observer simulation” of Experiment 1. We presented descriptions of
the procedure and conditions that we used in Experiment 1 to naive subjects and
asked them to predict their memory performance.
Subjects.
One hundred and seventeen students from a class that one of us was teaching
served as subjects for no credit.
Procedure.
All members of the class were told about the procedures used in Experiment 1.
The class was told to imagine that they were asked to study 40 faces of individuals
projected on a screen for 12 seconds each. (The instructor demonstrated 12
seconds by counting off the time while watching a second hand.) They were also
told to imagine that after studying all 40 faces they would take a one-hour
break and then be tested for their memories of the faces. The recognition
procedure was described and they were further instructed that if they had no
memory for any of the slides and just guessed, they would obtain a score of 50%
correct and that if they had a perfect memory, they would correctly recognize
each slide they had seen before and correctly reject each slide they hadn’t
seen before thereby obtaining a score of 100% correct. At this point all of the
class was asked to estimate how well they thought they would perform on the
recognition test after a one hour delay by providing the percent correct they
thought they would obtain. The class was then asked to provide additional
estimates for their performance after one-day, one-week, and two-week delays.
Finally, the class was asked to provide estimates for all four retention
intervals, but this time to imagine that they were only able to study the
slides for 3 seconds rather than 12 seconds each. The instructor then
demonstrated 3 seconds by counting off the time while watching a second hand.
Clearly the
design of this observer simulation is such that many differences exist between
the actual experiment and the observer simulation. Nevertheless, it is still of
interest to compare the shape of the forgetting function that subjects
predicted for themselves with the functions that were actually obtained in
Experiment 1. Figure 14 presents these results. As was the case for the actual
performance data, analyses of variance indicated that both the main effect of
duration and of retention interval on the predicted percent correct (arcsin
transformed) scores were highly significant (p<.0001) but the interaction
was not. More importantly, the pattern of the observers’ average predicted
memory performance appeared to be somewhat different from the actual
performance that equivalent conditions produced. In particular, the observers
predicted that their recognition accuracy would drop off much more rapidly as
the retention interval increased to two weeks than the rate at which their
actual memory performance declined. Interestingly, the observers did a
remarkably good job of predicting the size of the effect of duration of
exposure.

Figure 14.
Solid dark lines show the actual percent correct recognition responses in
Experiment 1 and the dotted lines show the predicted percent corrects for
another group of subjects who simply guessed how well they would do after
hearing a verbal description of the same experiment.
The latter
result may explain why we found evidence in support of the length of the
retention interval producing a conservative shift in confidence cutpoints but
no such effect for duration of exposure. If subjects’ metatheories about the
effect of duration of exposure on recognition accuracy are nearly correct (at
least within the range of values that we studied), then any effect those
theories might have on the placement of confidence would be no different than
the direct effect of d'.
The results
from the present experiments are consistent with the view that confidence and
accuracy in face memory are highly related to each other despite what many
eyewitness memory experts seem to believe. However, the nature of this
relationship is not well described by the size
of simple Pearson correlation coefficients that are often computed in face and
eyewitness memory research. Instead, the relationship is better described by a
signal detection model in which confidence estimates are cutpoints located on
the same underlying psychological dimension as the "yes/no" decision
criterion. This model assumes that higher confidence identification responses
are always associated with a higher likelihood of being correct than
lower confidence responses (except when d' is near zero) even though certain
correlation measures of the confidence-accuracy relationship may be small and
non-significant.
We do not mean
to argue that signal detection provides a perfect description of all of the accuracy
and confidence data in face recognition research. But signal detection (Egan, Schulman, & Greenberg, 1959; Macmillan
& Creelman, 1991) provides a useful description of the strong relationship
that exists between confidence and accuracy, a relationship that is virtually
hidden from view by Pearson correlations between confidence and accuracy but
quite clear when other measures of the relationship are computed, namely, means
across different learning and memory conditions, correlations between mean
confidence and accuracy over different faces, and especially calibration
curves.
Optimality
Hypothesis Redefined
If our view is
correct, it suggests that the optimality hypothesis explanation for the size of
confidence-accuracy correlations needs to be slightly refined. According to the
optimality hypothesis, correlations between confidence and accuracy will be
higher in optimal, compared to sub-optimal, learning and test conditions
(provided that what are defined as optimal conditions actually produce higher
d's). Although this view is generally consistent with the signal detection
analysis, the emphasis on optimal and sub-optimal implies a dichotomy of
conditions that is highly artificial and directs attention away from the
underlying process. As we demonstrated by the Monte-Carlo simulations, the
relationship between d' and the size of both the response-based and individual
difference-based confidence-accuracy correlation is a continuous one. Learning
and test conditions are neither optimal nor sub-optimal, they simply control d'
which in turn controls the proportion of highly confident responses that are
likely to be correct, and thereby the strength of the confidence-accuracy
relationship.
Although the
former argument may seem like a small point, it can have important applied
consequences. The belief that confidence is not a good predictor of accuracy
has caused a number of researchers to argue that jurors need to be instructed
in court by an expert to rely less on confidence and more on other predictors
(e.g., Lindsay, Wells, & O'Connor, 1989; Penrod & Cutler, 1987). These
experts suggest it is reasonable to tell jurors that confidence is not a good
predictor of the accuracy of identifications of actual eyewitnesses to crimes
because the “learning and test conditions” in most real crime situations are
sub-optimal (Kassin, Ellsworth, & Smith, 1989). Although
Deffenbacher (1980) provided a list of some situational factors that might
enhance optimality (e.g., a warning that a memory test will occur, “moderate”
situational stress, “ample” duration of exposure, “high” familiarity with the
target, a “brief” retention interval, “similar” condition of the target at
encoding and test, additional “consistent information” presented during the retention
interval, a forced-choice testing procedure with “unbiased” instructions, and
“low similarity” of the targets to the distracters), the application of these
to any particular witness to a crime requires knowing how the witness’s
experiences matched these conditions. Because these conditions are not well
defined (Is a 1 hour or a 48 hour retention interval considered to be brief? Is
20 seconds or two minutes of exposure ample? If the target changes his
hairstyle, is he still similar enough?), it is difficult to know, by looking at
the situation, whether conditions are optimal or sub-optimal. Put differently,
the optimality hypothesis fails to provide the expert with a procedure to
measure optimality.
To be fair, Deffenbacher (1980) also suggested that optimal might
be defined as those conditions that produced accuracy rates above 70% and/or
d's above 2.0. But, even this rule creates uncertainty about the definition of
optimal, because as can be seen in our Table 1, accuracy rates around 70% are
not, in general, associated with d's of 2.0. In fact, assuming that c is placed
midway between the means of the seen and not seen distributions, d's of 2.0
will yield overall accuracy rates in the mid 80% range. More importantly, at
our present stage of understanding, we have no idea how the above list of
conditions might combine to determine d' (e.g., how many units of less than
optimal “stress” are sufficient to overcome the optimal conditions of “ample”
exposure and “low similarity” of targets and foils?). In addition, as the
simulation results demonstrated, the effect of d' on the size of
confidence-accuracy correlations depends on the level of other signal detection
parameters and the specific method used to compute the correlation. Because our
theories of learning and memory are not yet refined enough to predict exact
d's or percent corrects, much less confidence-accuracy correlations, for given
combinations of specific learning and memory conditions, focusing on an outcome
measure such as d' could have much greater applied utility.[22]
For example, if
our claims are correct, we know how d' and within-subject response-based
confidence-accuracy correlations are related (see Figure 5). Finding a way to
assess d' from the confidence expressed by the witness has much greater utility
than simply claiming that witnesses to crimes are obviously in sub-optimal
learning and test conditions and therefore their confidence is diagnostically
useless. Until eyewitness memory experts establish the exact functional relationships
between different combinations of learning and test conditions and d' -- a
daunting task -- the claim that crimes are sub-optimal learning situations
without suitable measurement of the strength of each eyewitness's memory simply
tells us that the experts believe all eyewitnesses to all crimes have poor
memory.
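The mismatch noted above between the two halves of Deffenbacher's (1980) rule (accuracy above 70% versus d' above 2.0) can be checked with a short calculation. The sketch assumes equal-variance seen and unseen distributions with the criterion midway between their means, in which case overall proportion correct is the standard normal distribution function evaluated at d'/2. The function names are hypothetical, and the equal-variance assumption is a simplification relative to the ROC slopes reported in this paper.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def accuracy_from_d_prime(d_prime):
    """Overall proportion correct when c sits midway between equal-variance distributions."""
    return STANDARD_NORMAL.cdf(d_prime / 2.0)


def d_prime_from_accuracy(p_correct):
    """Inverse of accuracy_from_d_prime under the same assumptions."""
    return 2.0 * STANDARD_NORMAL.inv_cdf(p_correct)


print(round(accuracy_from_d_prime(2.0), 3))   # accuracy implied by d' = 2.0
print(round(d_prime_from_accuracy(0.70), 2))  # d' implied by 70% correct
```

Under these assumptions a d' of 2.0 corresponds to roughly 84% correct, whereas 70% correct corresponds to a d' only slightly above 1, so the two cutoffs do not pick out the same conditions.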
Despite the
difficulty of determining how learning and memory variables combine to affect
d', one empirical fact that experts do not seem to know is that higher
confidence estimates are more probable the higher d' is. We can see this in
Tables 4 and 5. The probability of absolute confidence increased dramatically
as d' increased. For example, in the best memory condition in Experiment 1 (1hr,
12 seconds), subjects were absolutely confident in 45% of all of their yes
responses, but in the worst memory condition (336hr, 3 seconds) subjects were
absolutely confident in only 10% of all of their yes responses. This important
result implies that confidence may be a good predictor of the optimality of the
learning and memory conditions, as well as d' and the accuracy of individual
identification responses. Recent work on the rate at which highly confident
identifications occur in actual crime situations (Moore, Ebbesen, &
Konecni, 1994) suggests that highly confident positive identifications occur in
over 90% of the identifications in the real world. Even correcting for what are
often assumed to be strong demands to appear confident in the real world (for
which we know of no direct empirical evidence), the claim that real world
witnessing conditions tend to produce d's lower than .25 seems a bit strained.
Deffenbacher’s
conclusion that, “the judiciary should cease reliance on witness confidence as
an index of witness accuracy,” seems premature. Eyewitness memory experts who
have uncritically accepted this conclusion on the basis of low correlations and
who have testified, as such, in court have been misinforming jurors about the
nature of the confidence-accuracy relationship (with obvious real-world consequences).
Measuring d'
from confidence-accuracy correlations
From a slightly
different point of view, the various simulation results suggest that whether
the size of a confidence-accuracy correlation will be a useful indicator of d'
or percent correct depends on a number of previously unrecognized factors. In
particular, we know from the simulation results that a narrow range of within
subject response-based correlations is associated with a wide range of d'
values. For example, as d' grows from 1 to 3 the mean response-based r grows
from around .2 to .35. In fact, this correlation changes most rapidly as d'
moves from 0 to 1, all situations that would be defined as sub-optimal
according to Deffenbacher’s rules. On the other hand, the size of
response-based correlations is relatively immune to factors other than d'.
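The general shape of the relation between d' and response-based correlations can be illustrated with a small simulation. This is a simplified sketch and not a reconstruction of the Monte-Carlo simulations reported earlier: it assumes equal-variance Gaussian evidence, an unbiased "yes/no" criterion, equal numbers of seen and unseen items, and arbitrary confidence cutpoints at fixed distances from the criterion. All names and parameter values are hypothetical.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def simulate_response_r(d_prime, n_trials=20000, seed=1):
    """Correlate 1-5 confidence with 0/1 accuracy over simulated single responses."""
    rng = random.Random(seed)
    criterion = d_prime / 2.0         # unbiased yes/no criterion (assumed)
    steps = (0.4, 0.8, 1.2, 1.6)      # hypothetical cutpoint distances from c
    confidence, correct = [], []
    for _ in range(n_trials):
        seen = rng.random() < 0.5
        evidence = rng.gauss(d_prime if seen else 0.0, 1.0)
        said_yes = evidence > criterion
        distance = abs(evidence - criterion)
        confidence.append(1 + sum(distance > s for s in steps))
        correct.append(int(said_yes == seen))
    return pearson_r(confidence, correct)


for d in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(d, round(simulate_response_r(d), 3))
```

In a sketch of this kind the correlation is near zero when d' is zero and rises quickly over the lowest d' values, while remaining modest even when memory is good, because accuracy is a dichotomous variable.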
Unlike
response-based correlations, the degree of individual difference variance in
placement of the confidence cutpoints (v in our simulations) plays a large role
in the size of individual difference correlations. A high degree of variance
will attenuate these correlations, even when d' is high. The way in which
subjects adjust their confidence cuts as d' changes will also play a role in
the size of individual difference correlations. In short, unless we take into
account individual differences in confidence and know the effect that d' is
having on confidence placement, using individual difference confidence-accuracy
correlations to estimate the size of d' seems foolish.
Interestingly,
because the face-based correlations average over individual differences and
because their size does not appear to level off with d' values over 1, one
might argue that they offer the "best" reflection of d' of the three
different correlational measures of the relationship between confidence and
accuracy. To the extent that this conclusion has any applied significance, it
suggests that experts might be more concerned about whether multiple witnesses
confidently picked the same suspect than whether a given witness is generally
confident.
Generalizing to
the real-world
Another
implication of the signal detection approach is that it raises serious
questions about the external validity of the claim that confidence and accuracy
are only weakly related. One could argue from the current perspective that the
frequently reported low correlations are simply the result of experimenters
running conditions in which the average d' is less than 1. That is, when
experimenters report low confidence-accuracy correlations they may be telling us
little more than that the subjects in their memory task did not learn and/or
remember very much. If so, the real issue is not whether confidence is related
to accuracy but rather the rate at which experimental simulations of eyewitness
recognition memory reproduce the d' values that real crimes produce in actual
eyewitnesses. If experimental tasks are sampling from conditions that produce
lower d's than are produced in actual crime situations, then estimates about
the real world diagnosticity of confidence based on those experiments will be
too low. Conversely, if witnesses tend to learn and remember very little in the
real world and experiments are sampling from conditions that produce higher
d's, the current conclusions about confidence and accuracy may be correct.
Thus, before we can make a claim about the diagnosticity of confidence in the
real world, we must know the distribution of d's that are produced in actual
crime situations -- a task that no one has bothered to accomplish.
Still, it would
be foolish to argue that people could not, under the right motivational
conditions, construct confidence estimates that did not behave in a manner
consistent with the signal detection approach. Clearly people might dissemble
either or both identification and confidence responses. What we are arguing is
that when properly motivated, people are quite capable of producing recognition
memory data that conform to the assumptions of our signal detection analysis. We
would argue further that the signal detection approach makes it very clear what
the motivational conditions must do to destroy the confidence-accuracy link.
For example, motivational conditions that encouraged people to express greater
confidence than they otherwise might would cause the confidence cuts to move
closer to c, but this would not eliminate the confidence-accuracy relation.
Confidence will still tend to be calibrated to accuracy because higher
confidence bins will tend to contain a greater proportion of correct responses.
In fact, the relation would disappear only when reports of confidence are no
longer ordinally associated with strength of subjective evidence, for example,
"absolutely confident" is placed closer to c than "slightly
confident".
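The claim that inflated confidence need not destroy calibration can be illustrated numerically. The sketch below assumes equal-variance Gaussian evidence and equal numbers of seen and unseen items, and compares the proportion of correct "yes" responses in each confidence bin for one set of cutpoints and for a second set pulled closer to c. The function name and all parameter values are hypothetical.

```python
from statistics import NormalDist


def yes_bin_accuracy(d_prime, criterion, cut_distances):
    """Proportion of "yes" responses that are correct in each confidence bin.

    Bins are bounded below by the yes/no criterion and by cutpoints placed at
    the given distances above it; the last bin is open-ended.
    """
    seen, unseen = NormalDist(d_prime, 1.0), NormalDist(0.0, 1.0)
    edges = [criterion] + [criterion + d for d in cut_distances]
    accuracies = []
    for i, lo in enumerate(edges):
        hi = edges[i + 1] if i + 1 < len(edges) else None
        p_seen = (1.0 if hi is None else seen.cdf(hi)) - seen.cdf(lo)
        p_unseen = (1.0 if hi is None else unseen.cdf(hi)) - unseen.cdf(lo)
        accuracies.append(p_seen / (p_seen + p_unseen))
    return accuracies


baseline = yes_bin_accuracy(1.0, 0.5, [0.5, 1.0, 1.5, 2.0])
inflated = yes_bin_accuracy(1.0, 0.5, [0.2, 0.4, 0.6, 0.8])  # cuts pulled toward c
print([round(p, 2) for p in baseline])
print([round(p, 2) for p in inflated])
```

In both cases accuracy rises from the lowest to the highest confidence bin. Moving the cutpoints toward c lowers the accuracy associated with a given rating but preserves the ordering, which is what the calibration argument requires.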
Measuring the
strength of the confidence-accuracy relationship
What is the
proper measure of the relationship between confidence and accuracy? If the
signal detection model is correct, more confident responses will always
be more likely to be correct than less confident responses, even when d' is
close to, but not quite yet, zero and even though confidence-accuracy
correlations are barely above zero. To say that confidence should not be used
as a predictor of accuracy implies that a Pearson correlation is the correct
measure of the strength of the relationship. In fact, we have tried to show
that even very good memory (high d's) will tend to produce rather low, though
significant, correlations. The reasons for low correlations depend on whether
correlations are based on averages over items or on single responses (to different
items within a subject or to one item per subject, as in event memory
research). When correlations are based on average confidence and overall
accuracy scores for individual subjects, individual differences in parameters
other than d' can affect the size of the correlations. Differences in the use of the
confidence scale (confidence response biases) as well as differences in
placement of the "yes/no" cutpoint will attenuate such correlations
even though each subject's confidence and accuracy may be highly related,
exactly as signal detection requires. As already noted, Pearson correlations
based on single responses have the obvious statistical problems of being based
on a dichotomous dependent variable (1 if correct and 0 if incorrect). In
short, small confidence-accuracy correlations, even a large number of small
correlations, are not sufficient evidence to claim that two response measures
are weakly related, much less unrelated. As we saw in both a between- and a
within-subjects experiment and in our simulation results, confidence can be
remarkably well calibrated to accuracy even when confidence-accuracy
correlations are low and not significant.
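The contrast between a low single-response correlation and good calibration can be illustrated with a small Monte-Carlo sketch. It is not the simulation reported in this paper: it assumes an equal-variance Gaussian model, an unbiased criterion at d'/2, symmetric confidence cuts, and hypothetical parameter values and function names.

```python
import math
import random


def simulate(d_prime=1.0, n_trials=20000, offsets=(0.5, 1.0, 1.5), seed=1):
    """Simulate single recognition responses with confidence ratings.

    Assumptions (illustrative only): evidence is N(d_prime, 1) for seen and
    N(0, 1) for not seen items, half the trials are seen items, the yes/no
    criterion c sits at d_prime / 2, and confidence is the number of offsets
    that the distance |x - c| exceeds (0 = lowest, len(offsets) = highest).
    """
    rng = random.Random(seed)
    c = d_prime / 2.0
    confidence, correct = [], []
    for i in range(n_trials):
        seen = (i % 2 == 0)
        x = rng.gauss(d_prime if seen else 0.0, 1.0)
        said_yes = x > c
        confidence.append(sum(abs(x - c) > o for o in offsets))
        correct.append(1 if said_yes == seen else 0)
    return confidence, correct


def pearson(xs, ys):
    """Pearson correlation (point-biserial when ys is coded 0/1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def calibration(confidence, correct):
    """Proportion correct at each confidence level, lowest level first."""
    levels = sorted(set(confidence))
    return [
        sum(k for lv2, k in zip(confidence, correct) if lv2 == lv)
        / confidence.count(lv)
        for lv in levels
    ]


if __name__ == "__main__":
    conf, corr = simulate()
    print("r =", round(pearson(conf, corr), 3))
    print("calibration =", [round(p, 3) for p in calibration(conf, corr)])
```

With these hypothetical settings the correlation between confidence and correctness stays modest, while the proportion correct climbs steadily across confidence levels. The correlation is held down largely by the dichotomous accuracy score rather than by any weakness in the underlying relation.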
The occasional
report of significant or nearly significant negative correlations between
confidence and accuracy (e.g., Read et al., 1990) may seem inconsistent with the signal detection
model. Such reports are not. Negative confidence-accuracy correlations can be explained
by assuming that items that the experimenter defines as seen actually produce
less subjective evidence of having been seen before than items that were not
seen before. Although this may seem unlikely, the use in face recognition
memory of a decision task in which we ask the subject to identify the person
rather than the exact slide of a face means that experimenters can alter the
previously seen person to such an extent that subjects will be very confident
that they have not seen the person before when they, in fact, did. Furthermore,
the better the subjects learn what the person originally looked like, the
greater the likelihood that they will say they are confident that they have not
seen the dramatically altered individual before, even though they have. This is
exactly what happened in Read et al. (1990) when they had subjects study faces
of people taken at a young age and tested them with pictures taken at an older
age. Negative confidence-accuracy item-based correlations were reported (for
previously seen people only) because subjects tended to decide, with
confidence, that slides that the experimenter claimed depicted people who were
seen before were not the same individuals that were seen before. Thus, such
negative correlations are more a function of what the experimenter chooses to call
a previously seen slide than they are of a basic recognition memory process.
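This account of negative correlations can be sketched for responses to previously seen items only. The code below is an illustration under stated assumptions, not a reanalysis of Read et al. (1990): evidence for seen items is taken to be Gaussian with unit variance, `delta` (a hypothetical name) is the mean evidence for seen items minus the criterion c, and confidence levels are defined by symmetric, hypothetical offsets around c.

```python
import math


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def seen_item_accuracy(delta, offsets=(0.0, 0.5, 1.0, 1.5)):
    """Proportion of correct ("yes") responses to seen items per confidence level.

    delta: mean evidence for seen items minus the yes/no criterion c
           (negative when altered faces yield little evidence of prior exposure).
    offsets: ascending distances from c that bound the confidence levels;
             the highest level is open-ended.
    """
    edges = list(offsets) + [math.inf]
    accuracy = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        yes = phi(hi - delta) - phi(lo - delta)  # evidence above c, in this band
        no = phi(hi + delta) - phi(lo + delta)   # evidence below c, in this band
        accuracy.append(yes / (yes + no))
    return accuracy


if __name__ == "__main__":
    print(seen_item_accuracy(1.0))   # intact faces: accuracy rises with confidence
    print(seen_item_accuracy(-1.0))  # heavily altered faces: accuracy falls
```

When `delta` is negative, the most confident responses to "seen" items are mostly confident rejections, so accuracy as scored by the experimenter falls as confidence rises, even though confidence remains ordinally tied to the strength of subjective evidence.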
Although
correlations between measures of accuracy and of confidence may not provide a
reasonable picture of the relationship between confidence and accuracy,
calibration curves always do. In the present instance, calibration curves tell
us how much higher the odds of being correct are for responses in which people
are absolutely confident, for whatever reason (such as felt pressure, need to
appear confident, better learning, or metatheories of memory), than for
responses in which other degrees of confidence are expressed. Clearly, from an
applied point of view, this kind of information is far more useful to a
jury or prosecutor than is a correlation coefficient. The latter provides no
information about under- or overconfidence, whereas calibration curves do.
Furthermore, we suggest that calibration curves that take no notice of subject
and stimulus differences have the greatest applied utility since we currently
have little information about the distribution of types of people who serve as
witnesses or about the types of faces that they are asked to identify.
Metamemory
The most
surprising (and perhaps most difficult to accept) finding from the present
research was the evidence that people who knew how long it had been since they
had seen the faces moved their most extreme confidence cutpoints further out on
the evidence dimension as the retention interval increased. Although a number
of results (the maximum likelihood estimates of the cutpoints, the unusual
relationship between average confidence and overall accuracy in the
between-subjects but not the within-subjects experiment, the large change in
calibration of the "no" responses but not of the "yes"
responses as d' changed, and people's predicted accuracy) were all consistent
with this conclusion, the relatively low number of absolutely confident false
alarms in all conditions in both experiments forces us to be somewhat
tentative about claiming that signal detection provides the best fit of the
data. The stretch model (see Figure 4) requires that subjects generate fewer
and fewer absolutely confident false alarms as the criteria expand. However,
the overall rate of absolutely confident false alarms (relative to total number
of responses) was generally less than 1%. For example, out of approximately
2000 responses to the not seen faces in the two week retention interval
conditions in Experiment 1, only 16 were absolutely confident false alarms.
Because of this, it was particularly difficult to develop a stable estimate of
the exact position of the most extreme confidence cutpoints. Despite this
technical difficulty, we believe, at the very least, that the evidence supports
the conclusion that the “yes - absolutely confident” cutpoint is not
automatically dragged down in "lock-step" fashion with the
"yes/no" criterion as d' gets smaller and smaller.
That the
confidence criteria may not move in lock-step with c as d' changes suggests some
degree of independence between confidence and accuracy. Such independence
raises the possibility that variables that affect "yes/no" response
bias may not be the same as those that control where subjects place their
confidence cutpoints. For example, individual differences may express themselves
differently in terms of d', placement of c, and confidence criteria.
Self-confident individuals may be willing to express higher confidence with
weaker evidence but be no different in d' values than less self-confident
people. Alternatively, instructions to avoid false alarms may decrease the
tendency of subjects to “guess” yes, while the social setting may increase the
subject’s desire to appear confident. Our findings suggest that even people’s
beliefs about how their own memories are affected by various learning and test
conditions may be another factor in the relationship between confidence and
accuracy. Such forms of independence between confidence and accuracy have
little to do with whether confidence is diagnostic of the accuracy of
recognition responses, however. If the signal detection model is correct, as
the calibration results from both the experimental and simulation data confirm,
higher confidence recognition responses will always have a higher probability
of being correct than less confident recognition responses (unless d' is zero).
Furthermore, if subjects adjust their confidence criteria to correct for other
beliefs about their memories, and those beliefs underestimate how good their
memories are, the adjustments that they make in confidence will, in general,
reduce the rate of absolutely confident false alarms, a tendency that runs
counter to the claim that witnesses are positively identifying innocent
suspects at a high rate.
If the
metatheory explanation for the conservative shift in confidence is correct, it
raises several interesting issues. First, what are appropriate methods of
assessing the accuracy of such metatheories? For example, a number of
researchers (Cutler, Penrod, & Dexter, 1990; Deffenbacher
& Loftus, 1982; Loftus, 1983; Rahaim & Brodsky, 1982; Seltzer, Lopes,
& Venuti, 1990; Wells & Leippe, 1981; Yarmey & Jones, 1983) argue
that jurors have a poor understanding of how eyewitness memory actually works
and will frequently draw incorrect conclusions about the accuracy of eyewitness
testimony unless experts educate them. In particular, jurors supposedly
over-rely on a witness’s confidence in estimating the accuracy of the witness.
Generally this kind of conclusion is supported by evidence showing that jurors
“incorrectly” use a witness’s confidence to predict accuracy in situations in which the correlation between
confidence and accuracy is small. But, if the signal detection analysis is
correct, it may not be the jurors who are overweighting the diagnosticity of
confidence but the researchers who are underweighting it. The real issue, in
this context, is the relative
diagnosticity of the different types of information typically available to
jurors. This is never measured. What is needed if this kind of evidence is to
be used to assess the accuracy of jurors’ metatheories is a comparison of the
accuracy of expert predictions with those of jurors across a wide range of
witnessing conditions and testimonies. Until such evidence is presented, it
seems premature to argue that jurors need to be educated by experts just
because the former make mistakes and the latter argue that their theories are
better.
Although the
metatheory idea (in a signal detection framework) provides a satisfactory
explanation for the pattern of the data from these experiments, it would be a
mistake to conclude from the current results that the signal detection plus
meta-memory model provides the only, or even the best, theoretical description
of the relationship between confidence and accuracy in face recognition.
Despite the considerable history relating confidence to signal detection, e.g.,
Clarke (1960), and despite both empirical and
simulation results that were consistent with signal detection, other
explanations are possible. One is to recast the recognition task into a form
consistent with the wave theory of similarity (Link, 1992). In this view, increasingly conservative
confidence estimates would be the result not necessarily of metatheories about
one’s own memory but rather would reflect the amount of time that it takes to
build sufficient information for a decision to be made as well as the amount of
evidence the subjects thought was necessary before deciding. In this view, the
effect of retention interval on confidence could result either from a weakening
of the rate at which the evidence that items were seen before builds over time
or an adjustment in the decision criteria. The real advantage of wave theory is
not that it provides a non-metatheory explanation for the confidence shifts,
but rather that it directs attention to the time that subjects take to decide
(e.g., Chance & Goldstein, 1987; Henmon, 1911; Sporer, 1993, 1994), as well
as to the confidence that they express in those decisions. Regardless of the
relative merits of wave theory over signal detection, both would agree that
confidence and accuracy are highly related, that the use of correlation
coefficients to assess the strength and nature of that relationship is totally
inadequate, that calibration curves provide a more useful picture of the
relationship, and that, although correct in spirit, the optimality hypothesis
directs attention away from important theoretical and applied issues that are
made obvious by a signal detection analysis.
References
Bothwell, R.
K., Deffenbacher, K. A., & Brigham, J. C. (1987). Correlation of eyewitness
accuracy and confidence: Optimality hypothesis revisited. Journal of Applied
Psychology, 72, 691-695.
Brigham, J. C. (1990).
Target person distinctiveness and attractiveness as moderator variables in the
confidence-accuracy relationship in eyewitness identifications. Basic and
Applied Social Psychology, 11, 101-115.
Clarke, F. R.
(1960). Confidence ratings, second-choice responses and confusion matrices in
intelligibility tests. Journal of the Acoustical Society of America, 32,
35-46.
Chance, J. E.,
& Goldstein, A. G. (1987). Retention interval and face recognition:
Response latency measures. Bulletin of the Psychonomic Society, 25(6),
415-418.
Cutler, B. L.,
Penrod, S. D., & Dexter, H. R. (1990). Juror sensitivity to eyewitness
identification evidence. Law and Human Behavior, 14, 185-191.
Deffenbacher,
K. A. (1980). Eyewitness accuracy and
confidence: Can we infer anything about their relationship? Law and Human
Behavior, 4, 243-260.
Deffenbacher,
K. A., & Loftus, E. F. (1982). Do jurors share a common understanding
concerning eyewitness behavior? Law and Human Behavior, 6, 15-30.
Donaldson, W.,
& Murdock, B. B. (1968). Criterion change in continuous recognition memory. Journal
of Experimental Psychology, 76, 325-330.
Dorfman, D. D.,
& Alf, E., Jr. (1969). Maximum likelihood estimation of parameters of
signal detection theory and determination of confidence intervals-rating-method
data. Journal of Mathematical Psychology, 6, 487-496.
Egan, J. P.,
Schulman, A. I., & Greenberg, G. Z. (1959). Operating characteristics
determined by binary decisions and by ratings. Journal of the Acoustical Society
of America, 31, 768-773.
Fleet, M. L.,
Brigham, J. C., & Bothwell, R. K. (1987). The confidence-accuracy
relationship: The effects of confidence assessment and choosing. Journal of
Applied Social Psychology, 17(2), 171-187.
Henmon, V. A.
C. (1911). The relation of time of a judgment to its accuracy. Psychological
Review, 18, 186-201.
Kassin, S. M.,
Ellsworth, P. C., & Smith, V. L. (1989). The "general acceptance"
of psychological research on eyewitness testimony: A survey of the experts. American
Psychologist, 44(8), 1089-1098.
Lichtenstein,
S., & Fischhoff, B. (1977). Do those who know more also know more about how
much they know? The calibration of probability judgments. Organizational
Behavior and Human Performance, 20, 159-183.
Lindsay, D. S.,
& Johnson, M. K. (1989). The eyewitness suggestibility effect and memory
for source. Memory and Cognition, 17, 349-358.
Lindsay, R. C. (1986). Confidence
and accuracy of eyewitness identification from lineups. Law and Human
Behavior, 10, 229-239.
Lindsay, R. C., Wells, G. L.,
& O'Connor, F. J. (1989). Mock-juror belief of accurate and inaccurate
eyewitnesses: A replication and extension. Law and Human Behavior, 13(3),
333-339.
Link, S. W.
(1992). The wave theory of difference and similarity. Hillsdale, NJ:
Lawrence Erlbaum Associates.
Loftus, E. F.
(1979). Eyewitness testimony. Cambridge, MA: Harvard University
Press.
Loftus, E. F.
(1983). Silence is not golden. American Psychologist, 38,
564-572.
Luus, C. A. E.,
& Wells, G. L. (1994). Eyewitness identification confidence.
New York: Cambridge University Press.
Macmillan, N.
A., & Creelman, C. D. (1991). Detection Theory: A user's guide. New
York: Cambridge University Press.
Murdock, B. B.,
Jr. (1980). Short-term recognition memory. In R. S. Nickerson (Ed.), Attention
and performance VIII (pp. 497-519). Hillsdale, NJ: Lawrence Erlbaum
Associates, Inc.
Neil v. Biggers, 409 U.S. 188 (1972).
Nelson, T. O.,
& Dunlosky, J. (1991). When people's judgments of learning (JOLs) are
extremely accurate at predicting subsequent recall: The "delayed-JOL
effect." Psychological Science, 2, 267-270.
Noreen, D. L.
(1981). Optimal decision rules for some common psychophysical paradigms. Mathematical
Psychology and Psychophysiology, 13, 227-279.
Ogilvie, J. C.,
& Creelman, C. D. (1968). Maximum-likelihood estimation of receiver
operating characteristic curve parameters. Journal of Mathematical
Psychology, 5, 377-391.
Peirce, C. S.,
& Jastrow, J. (1884). On small differences in sensation. Memoirs of the
National Academy of Sciences, 3, 73-83.
Penrod, S. D., & Cutler, B.
L. (1987). Assessing the competency of juries. In I. Weiner & A. Hess
(Eds.), The handbook of forensic psychology. New York: John Wiley
& Sons.
Rahaim, G. L.,
& Brodsky, S. L. (1982). Empirical evidence versus common sense: Juror and
lawyer knowledge of eyewitness accuracy. Law and Psychology Review, 7,
1-15.
Ratcliff, R.,
Sheu, C., & Gronlund, S. D. (1992). Testing global memory models using ROC
curves. Psychological Review, 99, 518-535.
Read, J., Vokey, J., & Hammersley, R. (1990).
Changing photos of faces: Effects of exposure duration and photo similarity on
recognition and the accuracy-confidence relationship. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 16,
870-882.
Seltzer, R.,
Lopes, G. M., & Venuti, M. (1990). Juror ability to recognize the
limitations of eyewitness identifications. Forensic Reports, 3,
121-137.
Shapiro, P. N.,
& Penrod, S. (1986). Meta-analysis of facial identification studies. Psychological
Bulletin, 100, 139-156.
Smith, V. L., Kassin,
S. M., & Ellsworth, P. C. (1989). Eyewitness accuracy and confidence:
Within- versus between-subjects correlations. Journal of Applied Psychology,
74, 356-359.
Sporer, S. L.
(1993). Eyewitness identification accuracy, confidence, and decision times in
simultaneous and sequential lineups. Journal of Applied Psychology, 78,
22-33.
Sporer, S. L.
(1994). Decision times and eyewitness identification accuracy in
simultaneous and sequential lineups.
New York: Cambridge University Press.
Swets, J. A.,
& Pickett, R. M. (1982). Evaluation of diagnostic systems: Methods from
signal detection theory. New York: Academic Press.
Wells, G. L.,
& Leippe, M. R. (1981). How do triers of fact infer the accuracy of
eyewitness identifications? Using memory for peripheral detail can be
misleading. Journal of Applied Psychology, 66, 682-687.
Wells, G. L.,
& Lindsay, R. C. L. (1980). On estimating the diagnosticity of eyewitness
identifications. Psychological Bulletin, 88, 776-784.
Wells, G. L.,
& Lindsay, R. C. L. (1985). Methodological notes on the accuracy-confidence
relation in eyewitness identifications. Journal of Applied Psychology, 70,
413-419.
Wells, G. L.,
& Murray, D. (1984). Eyewitness confidence. In G. Wells and E. F. Loftus
(Eds.), Eyewitness testimony: Psychological perspectives. Cambridge:
Cambridge University Press.
Wickelgren,
W. A., & Norman, D. A. (1966). Strength models and serial position in
short-term recognition memory. Journal of Mathematical Psychology, 3,
316-347.
Wixted, J.,
& Ebbesen, E. B. (1991). The mathematics of forgetting functions. Psychological
Science, 2, 409-415.
Wixted, J.,
& Ebbesen, E. B. (1995). A detection analysis of face recognition memory.
Paper presented at the 36th meeting of the Psychonomic Society, Los Angeles.
Yarmey, A. D. (1979).
The psychology of eyewitness testimony. New York: The Free Press.
Yarmey, A. D.,
& Jones, H. (1983). Is the study of eyewitness identification a matter of
common sense? In S. Lloyd-Bostock and B. Clifford (Eds.), Evaluating
eyewitness evidence. New York: Wiley.
Author Notes
We would like to express our thanks to three very hard-working undergraduates, Roger Boucher, Claudia Mendias Canale, and Joanna Adler, for assistance in running the experiments reported here. Stephen W. Link provided some very insightful comments on an earlier draft that we believe substantially improved this paper. Some of the results from Experiment 1 were reported in a previous paper by us. This research was supported by the University of California, San Diego. Reprint requests may be sent to either author at Department of Psychology, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0109 or via email to eebbesen@ucsd.edu or jwixted@ucsd.edu.
Footnotes
[1]Personal experience of one of us as an expert witness in criminal trials is consistent with the claim that most eyewitness experts believe that confidence and accuracy are unrelated. Virtually every defense expert that he has heard testify claims that confidence in and accuracy of identifications are unrelated.
[2]Most of the evidence concerning the relationship between confidence and accuracy has come from two different procedures: face recognition memory experiments and event memory studies in which participants witness an event and then attempt to identify an individual in the event from a lineup. In this report we focus on the face recognition procedure, although a complete analysis of the relationship between confidence and accuracy in eyewitness memory will eventually have to include evidence from both procedures.
[3]Familiarity is frequently assumed to be the underlying subjective dimension on which the “yes/no” decision criterion is based. However, this is arbitrary because the signal detection model only requires that some single continuous subjective dimension that reflects the difference between seen and unseen faces be used and feelings-of-familiarity is only one such dimension. Other possibilities include average strength of features, or degree of match between a recalled face and the presented item, or even just “strength of evidence” (Link, 1992).
[4]Although it will prove to be an important issue later, the representation in Figure 1 also assumes that increasing the duration of exposure to faces does not affect the variance of the distribution of strength of evidence values for the seen items. The optimality of the conditions only affects the mean of the distribution.
[5]This prediction should be especially true for responses to the items that have been seen before.
[6]The exact form of calibration will depend, however, on where the decision criterion is placed.
[7]Parts of the accuracy data from this experiment were reported in a previous paper (Wixted & Ebbesen, 1991).
[8] d' was computed for each subject according to the following: d' = z(#hits/seen) - z(#false alarms/not seen). Because six subjects produced no false alarm responses, their d's could not be directly computed. To overcome this difficulty, we estimated their false alarm rate to be .01 rather than zero.
[9]Although we could have computed 8 different ROC curves, one for each condition in the experiment, we felt, for reasons that will become clear later, that showing the results for the retention intervals, collapsed over duration, was sufficient.
[10]It is important to note that these least square fits are based on dependent cumulative proportions and therefore should not be used as true goodness of fit indicators.
[11]This differential slope result is completely consistent with recent claims made by Ratcliff, Sheu, and Gronlund (1992) regarding recognition memory for items other than faces.
[12]When one unusual outlying face was removed from the analysis, the correlation increased to .617!
[13]It is important to note that this argument implies that signal detection reasoning can be generalized from item strength distributions assumed to be in the heads of single individuals to subject distributions for single items.
[14]When the mean signed confidence and mean ln p/(1-p) data were analyzed in the same manner as "raw" confidence and proportion correct, the identical pattern of results was obtained. The two longest retention interval conditions both produced data beyond the 95% confidence intervals resulting from a least-squares linear fit of the remaining six conditions (r² = .993, F(1, 4) = 609.98, p < .0001). In addition, in support of the Peirce and Jastrow (1884) equation, the intercept of this linear function was not significantly different from zero (i.e., -.02 with a standard error of .059, t = -.4).
[15]When we constructed individual calibration curves for each subject's yes responses, the mean slope over all 195 subjects was not only reliably different from zero (t(194) = 12.435, p < .0001), but the .093 change in proportion correct per unit of confidence also suggested a high degree of average calibration. Setting "just guessing" at 50% correct, this average slope would put "absolutely confident" at 87.2% correct. These results indicate that the group calibration results were not driven by a few highly calibrated subjects existing in a sea of error.
[16]As with the yes responses, the average slope of the individual calibration curves for each subject's no responses was reliably different from zero (mean slope = .112, t(194) = 16.65, p < .0001).
[17]The F values were virtually the same for d' scores.
[18]Because each subject only experienced one set of not seen slides, the ROC curves for the different retention intervals were generated by using the same false alarm data for each curve. In addition, as in the between subjects experiment, these data were based on the cumulative sum of the frequencies of each response type over all subjects. For both of these reasons the goodness-of-fit tests are presented for information only without associated p-values.
[19]When the data from the seen slides from Experiment 1 were analyzed separately to compare to the results from Experiment 2, the pattern of results was unchanged.
[20]An identical pattern emerged when we computed mean signed confidence and ln p/(1-p) for the different conditions. In addition, although the confidence bands were somewhat larger in this experiment than in Experiment 1, visual inspection of the figure shows that the pattern of results in no way suggested that longer retention intervals were accompanied by a greater drop in confidence than accuracy.
[21]Because our maximum likelihood fitting routine was limited in the number of free parameters that it could fit at one time, we were forced to collapse over several of the confidence categories. However, no matter what method of constraining the number of free parameters we used, the results of the fits were all consistent with the conclusions reached here. In addition, when the position of each of the cutpoints was directly estimated using the "root mean square" procedure outlined in Macmillan and Creelman (1991), the pattern of results was virtually identical to that produced by the maximum likelihood procedure.
[22]It should be clear, however, that were we to use d’ to assess the diagnosticity of confidence, we would need to have some method of measuring a particular witness’s d’ or at least be able to narrow the range of expected d’s to something that would improve the jurors' ability to infer accuracy beyond what they might do without our help.